In figure 2, the authors show that no one model predicts future performance very well. Shiller's PE10 is the best, with an R^2 of 0.43 (i.e., 43% of the observed variance in stock performance can be explained by that variable). They also list some other variables, like debt/GDP and dividend yield, which have weaker correlations. I would think that the obvious thing to do would be to combine the variables into a simple model to see whether the correlation became stronger when you accounted for multiple variables at once.
So I dug out the data and tried it myself. The best model I have come up with (for the years that the paper uses) has an R2 of about 0.7. Surely I can't be the first person to think of this. I realize that some variables will be spurious, but in a model with a lot of variables (mine has four) wouldn't the effect of the spurious variables be diluted out?
Also, the authors found that there was a positive correlation (R2 = 0.2) between government debt/GDP and future stock returns. This was the opposite of what was expected, and they just dismiss it -- "we would not expect such a correlation to persist." But when I look at the same correlation for the years 1900-1928 (the authors only went back to 1929) the correlation holds with R2=0.58. It seems plausible to me that borrowing money might stimulate the economy in the near future. Why not use this as a predictive factor?
They also include 10-year rainfall as a "reality check" and predict no correlation. When they find a small correlation, they interpret it as spurious. But wouldn't an extended drought be expected to have some effect on the economy?
I know you are a scientist so this should be in your wheelhouse.
(1) Recall that the coefficient of determination (r2) is not a model selection tool. Each time you add an independent factor, your r2 will go up (or at worst stay the same), no matter how unrelated that factor is to your dependent variable. By using r2 in this way, you easily run the risk of overfitting.
(2) If you are adding a lot of factors, you should first run a correlation plot to see how the independent factors co-vary. If they are strongly cross-correlated, the coefficients from your ordinary least squares regression become unstable (large standard errors, signs that flip from sample to sample), so you can't read much into any single one. You need to address the cross-correlation, for example by dropping or combining the redundant factors.
(3) If you are making OLS models, you should at least run an ANOVA afterwards to look at the significance of each factor.
(4) A more robust investigation of each ind. factor is to use the Lindeman, Merenda, and Gold bootstrapped sequential sums-of-squares.
(5) Consider using a fit statistic that adequately penalizes additional factors, such as AIC or BIC (both start from the log-likelihood and add a penalty for each extra parameter).
(6) Consider using an unbiased algorithm for model selection, such as All Subsets Regression.
(7) Returns are auto-correlated. When you have data taken over time or space you need to consider that observations closer in time or space tend to be more similar than observations taken over larger lag distances. To look at this, simply plot a correlogram of your dataset. For that reason I would scrap your entire approach and consider a time-series analysis.
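To make point (7) concrete, here is a minimal sketch of a correlogram calculation. It uses synthetic, independent annual returns (the mean and volatility are made-up numbers, not the paper's data) and a hypothetical helper named autocorr. The only thing it shows is that overlapping 10-year windows are heavily autocorrelated even when the underlying annual returns are not.

```python
import numpy as np

def autocorr(x, max_lag):
    """Sample autocorrelation of x at lags 0..max_lag (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic, independent annual returns; not real market data.
annual = rng.normal(0.06, 0.18, size=120)

# Rolling 10-year sums: neighbouring windows share 9 of their 10 years.
window = 10
rolling = np.convolve(annual, np.ones(window), mode="valid")

acf_annual = autocorr(annual, 5)
acf_rolling = autocorr(rolling, 5)

# Rough 95% band for "no autocorrelation" in a series of this length.
band = 1.96 / np.sqrt(len(rolling))

print("lag  annual  rolling-10y")
for lag in range(6):
    print(f"{lag:>3}  {acf_annual[lag]:6.2f}  {acf_rolling[lag]:6.2f}")
print(f"approx. 95% band: +/- {band:.2f}")
```

If the rolling column stays far outside the band for several lags, your observations are not independent, and the effective sample size is much smaller than the number of rows in your spreadsheet.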
No offense, but this is amateur technical analysis, and it leaves you without pants once the tide goes out.
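P.S. A rough sketch of points (1) and (5), on synthetic data only. The function fit_ols, the sample size, and the 0.6 coefficient are all assumptions made up for illustration. One real factor drives y, and then pure-noise factors are added one at a time. The r2 cannot go down as factors are added, which is exactly why it is useless for choosing between models, while AIC charges a price for each extra coefficient.

```python
import numpy as np

def fit_ols(X, y):
    """Return (r2, aic) for an OLS fit of y on X plus an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    resid = y - A @ beta
    rss = float(resid @ resid)                    # residual sum of squares
    tss = float(((y - y.mean()) ** 2).sum())      # total sum of squares
    k = A.shape[1]                                # number of fitted coefficients
    r2 = 1.0 - rss / tss
    aic = n * np.log(rss / n) + 2 * k             # Gaussian AIC, up to a constant
    return r2, aic

rng = np.random.default_rng(1)
n = 80

# ASSUMPTION: synthetic data with one real factor and six pure-noise factors.
signal = rng.normal(size=n)
y = 0.6 * signal + rng.normal(size=n)
noise = rng.normal(size=(n, 6))

results = []
for extra in range(7):
    X = np.column_stack([signal, noise[:, :extra]])
    results.append(fit_ols(X, y))

print("noise factors     r2      AIC")
for extra, (r2, aic) in enumerate(results):
    print(f"{extra:>13}  {r2:5.3f}  {aic:7.2f}")
```

Lower AIC is better. Compare how the two columns move as junk factors are added: r2 only ever creeps upward, so a higher r2 from a bigger model tells you nothing on its own.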