As researchers, we are often overly focussed on our samples and our data sets. It’s natural, as they are our only reference point.
It is easy to forget that even with the best designed samples, all we can hope to get is a reasonable estimate of “the truth”, for example “What percentage of people use Brand X?”
Of course we normally have only one sample, but if we had many thousands we would get a distribution of estimates. In multivariate problems such as Key Drivers analysis, where we estimate many slopes (the effect of each predictor on some variable of interest, e.g. likelihood to recommend Brand X, which we call the dependent variable), the problem of sample estimates can be even greater. Statistical textbook rules of thumb exist which tell us things like “you should have at least 25 cases per predictor” to get stable estimates.
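The idea of a distribution of estimates is easy to simulate. The sketch below (with an assumed true figure of 30% and a sample size of 400, both purely illustrative) draws 10,000 hypothetical surveys and shows how much the “What percentage use Brand X?” estimate wobbles from sample to sample:

```python
import numpy as np

rng = np.random.default_rng(0)
true_pct = 0.30          # assumed population truth: 30% use Brand X
n = 400                  # assumed sample size per survey

# Draw 10,000 hypothetical surveys and record each one's estimate
estimates = rng.binomial(n, true_pct, size=10_000) / n

print(f"mean of estimates: {estimates.mean():.3f}")
print(f"spread of estimates (sampling error): {estimates.std():.3f}")
```

The estimates centre on the truth, but any single survey can easily land a couple of percentage points away from it.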
However, these rules are designed for situations where we have a small handful of “known” predictors, a large sample, and only weak correlations between the predictors. In 12 years of being a full-time statistician, I have never encountered this idealised situation in real market research data!
In practice, we almost always have many, highly correlated potential predictors from which we have to screen the best set, and often sample sizes which are less than ideal for the number of variables screened. Classic stepwise selection procedures give us false reassurance that we are getting the best possible model. In truth, if we were to repeat the procedure on thousands of other samples, we would get a different model in each one. The differences are greatest for small samples, low case-to-predictor ratios and high correlations between predictors.
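You can see this instability directly by repeating the first step of a stepwise screen on fresh samples. In this illustrative sketch (all numbers are assumptions, not from any real study), six predictors are highly intercorrelated and only the first two actually drive the outcome, yet the variable that “wins” the first stepwise entry changes from sample to sample:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 6

# Assumed population: 6 predictors correlated at 0.8,
# y driven equally by the first two plus noise
cov = np.full((p, p), 0.8) + 0.2 * np.eye(p)

def first_stepwise_entry(rng):
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X[:, 0] + X[:, 1] + rng.normal(0, 2, size=n)
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)]
    return int(np.argmax(corrs))  # the variable a stepwise pass would enter first

winners = [first_stepwise_entry(rng) for _ in range(500)]
counts = np.bincount(winners, minlength=p)
print("times each predictor was picked as 'the key driver':", counts)
```

Across 500 hypothetical samples, the crown passes between several predictors, even though the population structure never changes.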
It is a mathematical certainty that as we approach the worst case scenario, where the number of predictors approaches the number of cases in the data, the sample R-squared for such models will approach 100%, whilst at the same time the model will become weaker and weaker at predicting new cases.
It is a little known fact that in a Key Drivers analysis, if you have as many completely randomly generated predictors as you have cases, then whatever the dependent variable you will always get an R-squared of 100%, despite the fact that in the population the relationship is 0%. I have met many statistical dabblers who squirm at this inconvenient fact!
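This inconvenient fact takes three lines to verify. Here pure random noise “predicts” an unrelated outcome perfectly, simply because there are as many parameters as cases (sample size of 30 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = rng.normal(size=(n, n - 1))   # pure noise: one predictor short of the case count
y = rng.normal(size=n)            # dependent variable, unrelated to X

X1 = np.column_stack([np.ones(n), X])        # intercept brings parameters up to n
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ beta
r2 = 1 - resid.var() / y.var()
print(f"in-sample R-squared: {r2:.4f}")
```

With n parameters and n cases the fitted model can pass through every data point exactly, so the residuals vanish and R-squared is 100% by construction.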
This is known by statisticians as the problem of overfitting your sample. To put it crudely, any estimate from a sample is part signal and part noise, the noise being the random element in that particular sample. In multivariate problems like the one described above, if you don’t take care to avoid this, your model ends up capturing mostly the noise rather than the signal.
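A quick way to see signal versus noise is to fit a model on one sample and then score it on fresh cases from the same population. In this sketch (the 20-predictor, 50-case design and the single weak true driver are assumptions for illustration), the in-sample R-squared looks flattering while performance on new cases collapses:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 20

def r_squared(X1, y, beta):
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

# One weak true driver buried among 19 pure-noise predictors
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 0] + rng.normal(size=n)
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
r2_in = r_squared(X1, y, beta)

# The same fitted model scored on a fresh sample from the same population
Xn = rng.normal(size=(n, p))
yn = 0.5 * Xn[:, 0] + rng.normal(size=n)
Xn1 = np.column_stack([np.ones(n), Xn])
r2_new = r_squared(Xn1, yn, beta)

print(f"in-sample R-squared: {r2_in:.2f}")
print(f"new-case  R-squared: {r2_new:.2f}")
```

The gap between the two numbers is the noise the model memorised: it was real in that one sample, but it does not travel.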
So be very wary of anyone who tries to impress you with an R-squared of 70% or more in a sample. This is very often a sign of overfitting and is highly misleading. In reality, models like these are likely to provide terrible estimates. Our usual guide is to look for an R-squared in the range 30%–70% and to be wary of straying outside of it.
Use statistical tools such as Correlated Component Regression Predictive modelling, which focus on “cross-validated” R-squared rather than in-sample R-squared. This measures performance on NEW CASES outside of the sample, and so avoids overfitting.
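Cross-validated R-squared is straightforward to compute by hand: split the sample into folds, and score each case using a model fitted without it. The sketch below is a generic 5-fold illustration (the data-generating assumptions are mine, and this is ordinary least squares, not Correlated Component Regression itself):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 0.7 * X[:, 0] + rng.normal(size=n)   # assumed: one real driver plus noise

def fit(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict(X, beta):
    return np.column_stack([np.ones(len(X)), X]) @ beta

# 5-fold cross-validation: every case is predicted by a model
# that never saw it during fitting
folds = np.array_split(rng.permutation(n), 5)
preds = np.empty(n)
for fold in folds:
    train = np.setdiff1d(np.arange(n), fold)
    preds[fold] = predict(X[fold], fit(X[train], y[train]))

cv_r2 = 1 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"cross-validated R-squared: {cv_r2:.2f}")
```

Because every prediction is made out-of-sample, this figure is an honest estimate of how the model will perform on genuinely new cases.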