We are often asked to comment on “the statistical significance of a finding”, usually at the 95% level. This can arise when looking at a single statistic, such as a percentage or mean score, in absolute terms, or when comparing statistics, such as the difference in percentages or mean scores across subgroups or question items.
The term “Sig Testing” is used by researchers as a catch-all phrase for measuring how confident we are that a statistic in our sample is a robust estimate of the population statistic. The most common manifestations are building a confidence interval around a statistic (46% ± 4 percentage points) or determining whether a statistic is significantly different from zero. For example, when comparing the percentage in Group A (41%) with Group B (48%), is there evidence that the difference is statistically significant, i.e. that it is not zero in the population? Although Confidence Intervals and Sig Testing are related, we will concentrate on the latter in this article.
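To make the two manifestations concrete, here is a minimal Python sketch using only the standard library. It applies the usual normal-approximation formulas to the figures quoted above. The sample sizes (600 for the confidence interval, 400 per group for the comparison) are our own illustrative assumptions, not figures from a real survey, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p, n, level=0.95):
    """Normal-approximation confidence interval for a single proportion."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 at the 95% level
    margin = z_crit * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


def two_proportion_z_test(p_a, n_a, p_b, n_b):
    """Two-sided z-test for the difference between two independent proportions."""
    # Pooled proportion under the "no difference" hypothesis
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# ASSUMPTION: n = 600 is illustrative only
low, high = proportion_ci(0.46, 600)
print(f"46% with n=600: {low:.1%} to {high:.1%}")

# ASSUMPTION: 400 respondents per group is illustrative only
z, p_value = two_proportion_z_test(0.41, 400, 0.48, 400)
print(f"Group A 41% vs Group B 48%: z = {z:.2f}, p = {p_value:.3f}")
```

With these assumed sample sizes the interval comes out at roughly ± 4 percentage points and the 7-point gap falls just under the conventional 0.05 threshold. Different sample sizes would give different answers, which is a theme we return to below.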
Sig Tests are also used by some common algorithms (e.g. Stepwise selection methods) when determining which predictors to retain in a model, such as a key drivers regression analysis. The p-value for each candidate variable is used to decide whether that variable should enter the model, and whether it should then be retained or dropped.
Sig tests are often used as a filter by clients to determine which findings to focus on (i.e. take seriously) and which to ignore. They are assumed by many users to differentiate real effects from “noise”.
However, there are a number of facts about sig testing which are often not well understood and yet have a bearing on how they should be used:
- Sig Testing tells you whether you have evidence that an “uninteresting” hypothesis is false; it does not tell you in absolute or probabilistic terms whether an “interesting” result is true. This nuance often confuses people: we start out with a “boring” assumption (in this case that there is “no difference” in the population) and look for evidence in our sample against it. A large p-value from the sig test (conventionally ≥ 0.05) tells us that the sample is consistent with the hypothesis of “no difference”, or more technically that we don’t have enough evidence to reject it. It does not prove that there is no difference. A small p-value (conventionally < 0.05) tells us the sample is hard to reconcile with the hypothesis of “no difference” and therefore we have enough evidence to reject it. Just to confuse you further, the low p-value is classed as a “significant” result and the high p-value as a “non-significant” result.
- Sig testing is heavily dependent on sample size. For a given difference and spread of data, the larger the sample size, the more likely you are to find that even small differences between statistics are statistically significant. In very large samples (say 10,000+) even tiny differences might be flagged as statistically significant; in very small samples (<100) even very large differences might be classed as “non-significant” (i.e. there is not enough evidence to reject the “no difference” theory). This can be very problematic in stepwise algorithms: if you have 100 possible predictors you may find in your very large sample that 50+ of them are retained, whereas in your very small sample only 1-2 are retained. Stepwise selection results often tell you more about the sample size than they do about which predictors are useful!
- Some sig tests have alternative formulae which can give different results. Sig Tests are a function of the standard error, which is the spread of the sampling distribution. This is the distribution of results we would obtain in theory were we to draw and compare lots of samples (1000s) from the same population. It can be estimated from our sample. However, for some sig tests, different formulae use different estimates depending on the assumptions you want to make about the spread of data (variance) in each subgroup, for example whether the subgroups are assumed to share a common variance. It is worth being aware of this, as there is often no definitive agreement on which version to use.
- Multiple sig testing can lead to spurious differences being flagged as significant. This nuance is often ignored, even in published articles. If you make multiple comparisons, for example by comparing across multiple subgroups, you will get an inflated “false positive” rate, which means an increasing probability that at least one difference is falsely identified as significant. This is possible even when comparing, say, all possible pairs of groups from as few as 6 subgroups (15 comparisons). In other words, if you make enough comparisons you are likely to get false readings of statistical significance. One adjustment which can be made for multiple comparisons is the Bonferroni Correction, which divides the significance threshold by the number of comparisons made.
- Computer tables often use the wrong tests and don’t adjust for multiple tests. Don’t just assume that the sig tests in your tables are the correct ones. In this case a little knowledge is helpful and can avoid egg on your face further down the road. The most common error is to use tests of difference between independent samples when comparing items measured in the same sample. There are separate (more sensitive) paired-sample tests which should be used in this case.
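The sample size and multiple comparison points above can be illustrated with a short Python sketch (standard library only). The percentages, sample sizes and subgroup labels are illustrative assumptions, and the function name is hypothetical. The false positive calculation also assumes the comparisons are independent, which is only an approximation for pairwise comparisons that share subgroups.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(p_a, p_b, n_per_group):
    """Two-sided p-value for two independent proportions with equal group sizes."""
    pooled = (p_a + p_b) / 2  # equal group sizes, so pooling is a simple average
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = abs(p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(z))


# Sample size: a 7-point gap in a small sample vs a 1-point gap in a huge one
# ASSUMPTION: all figures here are made up for illustration
p_big_gap_small_sample = two_proportion_p_value(0.41, 0.48, 50)
p_small_gap_large_sample = two_proportion_p_value(0.41, 0.42, 50_000)
print(f"41% vs 48%, n=50 per group:     p = {p_big_gap_small_sample:.3f}")
print(f"41% vs 42%, n=50,000 per group: p = {p_small_gap_large_sample:.4f}")

# Multiple comparisons: all pairs from 6 subgroups
subgroups = ["A", "B", "C", "D", "E", "F"]  # hypothetical labels
pairs = list(combinations(subgroups, 2))
alpha = 0.05

# Chance of at least one false positive if every "no difference" hypothesis
# is true and the tests are treated as independent (an approximation)
family_wise_error = 1 - (1 - alpha) ** len(pairs)

# Bonferroni: test each comparison at alpha divided by the number of tests
bonferroni_alpha = alpha / len(pairs)

print(f"{len(pairs)} comparisons: chance of a false positive = {family_wise_error:.0%}")
print(f"Bonferroni threshold per comparison = {bonferroni_alpha:.4f}")
```

In this sketch the large gap in the small sample is not flagged as significant while the tiny gap in the very large sample is, and the chance of at least one spurious flag across the 15 pairwise comparisons is a little over one in two.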
Key Take Outs
Sig Testing is largely a function of the adequacy of your sample but, if used sparingly, gives an indication of whether results are likely to be genuine rather than down to chance. It should never be used as an absolute threshold to determine what is interesting and what is not. What is more important is practical significance: whether a difference is large enough to justify interest, rather than whether it is statistically significant. Ideally a sample should be designed with the power to detect differences of that practical size as statistically significant, though practical constraints mean this is often not possible.
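As a rough illustration of designing a sample around practical significance, the sketch below uses the standard normal-approximation formula for the sample size needed to compare two independent proportions. The 80% power target is a common convention that we assume here rather than a recommendation, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate respondents needed per group to detect p_a vs p_b.

    Two-sided test of two independent proportions, normal approximation.
    ASSUMPTION: the default alpha and power are conventional choices only.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)


# Smallest difference we would care about in practice: 7 points vs 3 points
n_seven_points = n_per_group(0.41, 0.48)
n_three_points = n_per_group(0.41, 0.44)
print(f"To detect 41% vs 48%: about {n_seven_points} per group")
print(f"To detect 41% vs 44%: about {n_three_points} per group")
```

The smaller the difference you consider practically important, the larger the sample needed to detect it reliably, which is why this is often where practical constraints bite.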
Beware of the pitfalls of multiple sig tests and, if necessary, make the appropriate adjustments. Also, arm yourself with a good sig testing tool which clearly explains the different tests. We have developed our own user-friendly, interactive Excel tool, which can be downloaded below.
Significance Testing & Confidence Interval Tool