I was wondering whether any data sets exist covering all companies and their returns over the last 30 years or so. Further, has anyone run the numbers on what your returns and variance would be with a portfolio of 1, 2, 3, 10, 100... stocks, all the way up to the total number of stocks?

I briefly searched and found this for the Russell 3000:

http://business.nasdaq.com/media/The_Capitalism_Distribution_Blackstar_Funds_tcm5044-42315.pdf

It says that 36% of stocks outperformed the market, and it includes a histogram of the distribution of returns. This makes sense given a Pareto distribution.

So, for instance, with a 1-stock portfolio your expected return (EV) would be the market return, the range would be roughly -100% to +10,000%, and the likelihood of beating the market would be 36%.

Next, look at the 3000*2999/2 = 4,498,500 possible equally weighted 2-stock portfolios, and make a histogram of those. In that case you would expect the tails to contract (the best case is the top two companies together, which is less likely to be picked and has a lower total return than the top 1 alone), and the likelihood of beating the market to go up.
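To make the 2-stock step concrete, here is a rough sketch of what I have in mind. I don't have the real per-stock returns, so the `returns` array below is synthetic (a skewed lognormal draw, purely an assumption), the universe is cut to 500 stocks to keep it light, and "the market" is taken to be the equal-weighted average. `pair_portfolio_returns` is just a name I made up.

```python
import math
import numpy as np

def pair_portfolio_returns(returns):
    """Return of every unordered, equally weighted 2-stock portfolio."""
    i, j = np.triu_indices(len(returns), k=1)  # all index pairs with i < j
    return 0.5 * (returns[i] + returns[j])

# Number of distinct 2-stock portfolios out of 3000 stocks.
n_pairs_3000 = math.comb(3000, 2)

# ASSUMPTION: synthetic, right-skewed total returns; NOT real market data.
rng = np.random.default_rng(0)
returns = rng.lognormal(mean=0.0, sigma=1.5, size=500) - 1.0

pairs = pair_portfolio_returns(returns)
market = returns.mean()  # ASSUMPTION: equal-weighted benchmark

print("pairs enumerated:", len(pairs))
print("P(beat market), 1 stock :", (returns > market).mean())
print("P(beat market), 2 stocks:", (pairs > market).mean())
print("best 1 stock vs best pair:", returns.max(), pairs.max())
```

A histogram of `pairs` next to one of `returns` would then show how much the tails contract.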

Finally, continue this way with bigger and bigger portfolios, all the way up to an index fund, where you match the market's returns.

Towards the middle I assume exhaustive enumeration would be computationally impossible, because 3000 choose 1500 gives roughly 1.791967E+901 combinations, so the only practical approach would be a numerical (Monte Carlo) one, where you just sample a bunch of random portfolios until the results become stable.
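The random-sampling idea could look something like the sketch below. Again, this runs on made-up data: `returns` is a synthetic lognormal sample standing in for the real per-stock returns, the benchmark is assumed to be the equal-weighted mean, and `simulate_portfolios` and its parameters are hypothetical names of mine.

```python
import numpy as np

def simulate_portfolios(returns, k, n_samples=2000, seed=0):
    """Sample random equally weighted k-stock portfolios (no repeats
    within a portfolio) and return their returns plus the benchmark."""
    rng = np.random.default_rng(seed)
    n = len(returns)
    market = returns.mean()  # ASSUMPTION: equal-weighted benchmark
    port = np.empty(n_samples)
    for s in range(n_samples):
        picks = rng.choice(n, size=k, replace=False)
        port[s] = returns[picks].mean()
    return port, market

# ASSUMPTION: synthetic, right-skewed total returns; NOT real market data.
data_rng = np.random.default_rng(42)
returns = data_rng.lognormal(mean=0.0, sigma=1.5, size=3000) - 1.0

for k in (1, 2, 10, 100, 1000, 2999):
    port, market = simulate_portfolios(returns, k)
    print(f"k={k:5d}  mean={port.mean():7.3f}  std={port.std():7.3f}  "
          f"P(beat market)={(port > market).mean():.3f}")
```

To check for stability you would rerun with a larger `n_samples` or a different `seed` and see whether the numbers stop moving.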

What I was thinking, though, is that towards the top end a random 2999-stock portfolio would be *more* likely than not to beat the market, because the one random stock you didn't pick has a 64% chance of underperforming the market.
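For the 2999-stock case no sampling is needed, since there are only as many portfolios as there are stocks to leave out. A small sketch, assuming equal weights and an equal-weighted benchmark (a cap-weighted index could behave differently); `leave_one_out_beat_rate` is a hypothetical helper and the four returns are toy numbers, not real data:

```python
import numpy as np

def leave_one_out_beat_rate(returns):
    """Fraction of (n-1)-stock equally weighted portfolios that beat
    the equal-weighted average of all n stocks."""
    n = len(returns)
    market = returns.mean()
    # Return of the portfolio that drops stock i, for every i at once.
    loo = (returns.sum() - returns) / (n - 1)
    return float((loo > market).mean())

# Toy example: three laggards and one big winner.
toy = np.array([-0.5, -0.2, 0.0, 3.0])
print(leave_one_out_beat_rate(toy))      # 0.75
print(float((toy < toy.mean()).mean()))  # 0.75, share of stocks below the mean
```

Under those assumptions, dropping stock i beats the average exactly when stock i itself is below the average, so the beat rate equals the share of underperformers, although the size of the outperformance would be tiny with 2999 stocks.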

Is this reasoning sound? Do such data sets or such analyses exist?