Keywords
null hypothesis significance testing, tutorial, p-value, reporting, confidence intervals
NHST is a method of statistical inference by which an experimental factor is tested against a hypothesis of no effect or no relationship based on a given observation. The method is a combination of the concepts of significance testing developed by Fisher in 1925 and of acceptance based on critical rejection regions developed by Neyman & Pearson in 1928. In the following I am first presenting each approach, highlighting the key differences and common misconceptions that result from their combination into the NHST framework (for a more mathematical comparison, along with the Bayesian method, see Christensen, 2005). I next present the related concept of confidence intervals. I finish by discussing practical aspects in using NHST and reporting practice.
The method developed by Fisher (1934; 1955; 1959) allows us to compute the probability of observing a result at least as extreme as a test statistic (e.g. a t value), assuming the null hypothesis of no effect is true. This probability, or p-value, (1) reflects the conditional probability of achieving the observed outcome or a larger one, p(Obs≥t|H0), and (2) is therefore a cumulative probability rather than a point estimate. It is equal to the area under the null probability distribution curve from the observed test statistic to the tail of the null distribution (Turkheimer et al., 2004 – Figure 1). The approach is one of ‘proof by contradiction’ (Christensen, 2005): we pose the null model and test whether the data conform to it.
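This tail-area definition is easy to make concrete. A minimal sketch assuming scipy, with an arbitrary (hypothetical) observed t value and degrees of freedom:

```python
# p-value as the area under the null distribution beyond the observed statistic
from scipy import stats

t_obs, df = 2.1, 30   # hypothetical observed t value and degrees of freedom

# p(Obs >= t | H0): a cumulative tail area, not a point probability
p_one_tailed = stats.t.sf(t_obs, df)           # survival function = 1 - CDF
p_two_tailed = 2 * stats.t.sf(abs(t_obs), df)  # both tails of the null

print(p_one_tailed, p_two_tailed)
```

The survival function `sf` is used rather than `1 - cdf` because it is numerically more accurate in the tail.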
In practice, it is recommended to set a level of significance (a theoretical p-value) that acts as a reference point to identify significant results, that is, to identify results that differ from the null hypothesis of no effect. Fisher recommended using p=0.05 to judge whether an effect is significant or not, as it is roughly two standard deviations away from the mean for the normal distribution (Fisher, 1934, page 45: ‘The value for which p=.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not’). It is important to appreciate that this threshold is partly subjective: two standard deviations seem reasonable in the biological, human and social sciences, but in particle physics the threshold is set at five standard deviations. The choice of two standard deviations is also contested: in psychology, for instance, calls for p<.001 (Colquhoun, 2014) or p<.005 (Benjamin et al., 2017) have been made. A key aspect of Fisher’s theory is that the significance threshold is only part of the process of accepting or rejecting a hypothesis. Since only the null hypothesis is tested, p-values are meant to be used in a graded manner to decide whether the evidence is worth additional investigation and/or replication (Fisher, 1971, page 13: ‘it is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require […]’ and ‘no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon’). How small the level of significance should be is thus left to researchers.
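Fisher’s ‘1.96 or nearly 2’ and the particle-physics five-sigma convention can be verified numerically; a quick sketch assuming scipy:

```python
# Tail areas of the standard normal distribution at common thresholds
from scipy import stats

z_crit = stats.norm.isf(0.05 / 2)   # two-tailed 5% threshold, ~1.96 SD
p_two = 2 * stats.norm.sf(1.96)     # tail area beyond 1.96 SD, ~0.05
p_5sigma = 2 * stats.norm.sf(5)     # five-sigma threshold, well below 1e-6

print(z_crit, p_two, p_5sigma)
```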
The p-value is not an indication of the strength or magnitude of an effect. Any interpretation of the p-value in relation to the effect under study (strength, reliability, probability) is wrong, since p-values are conditioned on H0. In addition, while p-values are uniformly distributed when there is no effect (if all the assumptions of the test are met), their distribution in the presence of an effect depends on both the population effect size and the number of participants, making it impossible to infer the strength of an effect from them.
Similarly, 1-p is not the probability of replicating an effect. A small value of p is often taken to mean a strong likelihood of getting the same results on another try, but again this cannot be the case, because the p-value is not informative about the effect itself (Miller, 2009). Because the p-value depends on the number of subjects, it can only be used to interpret results in highly powered studies. In low-powered studies (typically with small numbers of subjects), the p-value has a large variance across repeated samples, making it unreliable for estimating replication (Halsey et al., 2015).
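The variability of p-values under low power is easy to demonstrate by simulation. This is an illustrative sketch (not from the tutorial), assuming numpy and scipy, with an arbitrary medium population effect of d=0.5:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, sd = 0.5, 1.0            # assumed population effect (d = 0.5)

def p_values(n, n_sim=2000):
    # draw n_sim experiments of n subjects each and test against a mean of 0
    sims = rng.normal(true_effect, sd, size=(n_sim, n))
    _, p = stats.ttest_1samp(sims, 0.0, axis=1)
    return p

p_small = p_values(n=10)              # low powered
p_large = p_values(n=100)             # high powered
print(np.std(p_small) > np.std(p_large))  # → True: p varies far more when power is low
```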
A (small) p-value is not an indication favouring a given hypothesis. Because a low p-value only indicates a misfit of the null hypothesis to the data, it cannot be taken as evidence in favour of a specific alternative hypothesis any more than of other possible alternatives, such as measurement error or selection bias (Gelman, 2013). Some authors have even argued that the more (a priori) implausible the alternative hypothesis, the greater the chance that a finding is a false alarm (Krzywinski & Altman, 2013; Nuzzo, 2014).
The p-value is not the probability of the null hypothesis being true, p(H0) (Krzywinski & Altman, 2013). This common misconception arises from a confusion between the probability of an observation given the null, p(Obs≥t|H0), and the probability of the null given an observation, p(H0|Obs≥t), which is then taken as an indication of p(H0) (see Nickerson, 2000).
Neyman & Pearson (1933) proposed a framework of statistical inference for applied decision making and quality control. In this framework, two hypotheses are proposed: the null hypothesis of no effect and the alternative hypothesis of an effect, along with a control of the long-run probabilities of making errors. The first key concept in this approach is the establishment of an alternative hypothesis along with an a priori effect size. This differs markedly from Fisher, who proposed a general approach for scientific inference conditioned on the null hypothesis only. The second key concept is the control of error rates. Neyman & Pearson (1928) introduced the notion of critical intervals, thereby dichotomising the space of possible observations into correct vs. incorrect zones. This dichotomisation allows one to distinguish correct results (rejecting H0 when there is an effect and not rejecting H0 when there is no effect) from errors (rejecting H0 when there is no effect, the type I error, and not rejecting H0 when there is an effect, the type II error). In this context, alpha is the probability of committing a type I error in the long run and beta is the probability of committing a type II error in the long run.
The (theoretical) difference in terms of hypothesis testing between Fisher and Neyman-Pearson is illustrated in Figure 1. In the first case, we choose a level of significance for the observed data, say 5%, and compute the p-value. If the p-value is below the level of significance, it is used to reject H0. Importantly, the exact p-value is then taken as a measure of evidence. In the second case, we set a critical interval based on the a priori effect size and error rates (alpha and beta). If an observed statistic value falls beyond the critical values (the bounds of the acceptance region), it is deemed significantly different from H0. Note that in this framework, the p-value is irrelevant (Szucs & Ioannidis, 2017). In the NHST framework, the level of significance is (in practice) assimilated to the alpha level (the level of acceptance), which yields a simple decision rule: if the p-value is less than or equal to alpha, the null is rejected. It is however a mistake to conflate these two concepts. The level of significance set for a given sample is not the same as the frequency of acceptance alpha found on repeated sampling, because alpha (a point estimate) is meant to reflect the long-run probability whilst the p-value (a cumulative estimate) reflects the current probability (Fisher, 1955; Hubbard & Bayarri, 2003).
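The Neyman-Pearson decision rule can be sketched as a comparison against pre-set critical values, with the exact p-value playing no role; a minimal sketch assuming scipy, with an illustrative alpha and degrees of freedom:

```python
from scipy import stats

alpha, df = 0.05, 34                       # assumed error rate and degrees of freedom

t_crit = stats.t.ppf(1 - alpha / 2, df)    # two-tailed critical value, ~2.03

def reject_H0(t_obs):
    # reject only when the statistic falls into the rejection region
    return abs(t_obs) > t_crit

print(reject_H0(0.30), reject_H0(2.50))    # → False True
```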
Imagine you want to test whether median reaction times differ between two experimental conditions. We first compute the difference per participant, giving the mean difference and its associated standard deviation (Table 1). The null hypothesis is that the mean reaction time difference is 0, and a one-sample Student t-test gives t(34)=0.3037, p=0.7632. Following Fisher’s hypothesis testing, setting the level of significance at p=0.05, we cannot reject the null and thus continue to assume that conditions do not differ (see below, Acceptance or rejection of H0?). If we follow Neyman-Pearson, we must specify an alternative hypothesis along with our alpha and beta rates. The simplest alternative hypothesis is to state that conditions differ, i.e. that mean reaction time differences are not equal to 0, and we choose our acceptance level with alpha = 0.05. We are also compelled to define beta (which is not the case in Fisher’s hypothesis testing). To compute the prior probability of a type II error, we need to define an a priori effect size as well. Assuming reaction time differences cannot be less than 10 ms (± 20 ms), which corresponds to a medium effect size (d=0.5), 34 subjects are needed to achieve 80% power (1-beta). The results (t and p values) are the same, but we gain certainty regarding the type II error (here, less than a 20% chance of seeing this result if there were an effect). One typical issue in NHST is defining H0, defining H1 as a simple difference, taking a conventional alpha, say 0.05, and not defining beta (and thus the a priori power). By doing so, we cannot go into the acceptance/rejection inference mode (i.e. take a binary decision) because we have not dichotomised the decision space. That case is more akin to significance hypothesis testing (Fisher), and inference must thus be graded rather than binary.
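The a priori power in this example can be reproduced, at least approximately, from the noncentral t-distribution. A sketch assuming scipy (the function name `power_one_sample` is mine, not from the tutorial):

```python
import numpy as np
from scipy import stats

def power_one_sample(d, n, alpha=0.05):
    # power of a two-tailed one-sample t-test via the noncentral t-distribution
    df, nc = n - 1, d * np.sqrt(n)              # degrees of freedom, noncentrality
    t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-tailed critical value
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

print(power_one_sample(d=0.5, n=34))            # ~0.80 for d = 0.5 and n = 34
```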
The acceptance level α can also be viewed as the maximum probability that a test statistic falls into the rejection region when the null hypothesis is true (Johnson, 2013). Therefore, one can only reject the null hypothesis if the test statistic falls into the critical region(s), or fail to reject this hypothesis. In the latter case, all we can say is that no significant effect was observed; one cannot conclude that the null hypothesis is true. This is another common mistake in using NHST: there is a profound difference between accepting the null hypothesis and simply failing to reject it (Killeen, 2005). By failing to reject, we simply continue to assume that H0 is true, which implies that one cannot argue against a theory from a non-significant result (absence of evidence is not evidence of absence). To accept the null hypothesis, tests of equivalence (Walker & Nowacki, 2011) or Bayesian approaches (Dienes, 2014; Kruschke, 2011) must be used.
Taking again the data from Table 1, NHST tells us the 95% CI of the mean reaction time difference is [-8.11 10.97]. Since the CI includes 0, we cannot reject H0 and we continue to assume that conditions do not differ. A test of equivalence considers instead equivalence bounds. Here the bounds are taken to be -10 ms and +10 ms, i.e. we consider that any difference smaller than 10 ms is meaningless (the same as the effect size considered above). The computation consists simply of two one-sample t-tests at alpha = 10%, one against the lower bound (-10 ms) and one against the upper bound (+10 ms), rejecting for differences (inverting H0 and H1 – Lakens, 2017). In this case, because the results indicate that the difference is within the equivalence bounds, i.e. significantly above the lower bound (t(34)=2.43, p=.001) and significantly below the upper bound (t(34)=-1.82, p=.038), we can conclude that conditions do not differ, i.e. we can accept H0. Using a Bayesian t-test with normal priors (0 mean, 1 standard deviation; Dienes, 2014), the Bayes factor is 0.175 for H1 (BF10) and 5.73 for H0 (BF01), which gives moderate evidence for H0. Being able to confirm H0, we can argue against a theory that proposed differences – this is not possible using the NHST framework.
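The two one-sided tests (TOST) procedure can be sketched in a few lines. The data below are hypothetical (generated, not the Table 1 values), and the `tost` helper is my own illustration, assuming numpy and scipy:

```python
import numpy as np
from scipy import stats

def tost(x, low, high, alpha=0.05):
    """Two one-sided tests: declare equivalence if the mean is significantly
    above `low` AND significantly below `high` (each test at alpha)."""
    mean, se, df = np.mean(x), stats.sem(x), len(x) - 1
    p_lower = stats.t.sf((mean - low) / se, df)    # H0: mean <= low
    p_upper = stats.t.cdf((mean - high) / se, df)  # H0: mean >= high
    return max(p_lower, p_upper) < alpha

# hypothetical reaction-time differences, well inside the ±10 ms bounds
rng = np.random.default_rng(1)
diff = rng.normal(0.5, 5.0, size=35)
print(tost(diff, -10, 10))   # → True: conditions can be declared equivalent
```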
Confidence intervals (CI) are constructs that fail to cover the true value at a rate of alpha, the Type I error rate (Morey & Rouder, 2011), and therefore indicate whether observed values can be rejected by a (two-tailed) test with a given alpha. CI have been advocated as alternatives to p-values because (i) they allow one to judge statistical significance and (ii) they provide estimates of effect size. Assuming the CI (a)symmetry and width are correct (but see Wilcox, 2012), they also give some indication of the likelihood that a similar value will be observed in future studies. For future studies of the same sample size, 95% CI give about an 83% chance of replication success (Cumming & Maillardet, 2006). If sample sizes differ between studies, however, there is no guarantee that a CI from one study will cover the true value at the rate alpha in a different study, which implies that CI cannot be compared across studies, as these rarely use the same sample sizes.
Although CI provide more information, they are no less subject to interpretation errors (see Savalei & Dunn, 2015 for a review). The most common mistake is to interpret a CI as meaning that the parameter of interest (e.g. the population mean) falls in that interval X% of the time. The correct interpretation is that, for repeated measurements with the same sample size, taken from the same population, X% of the CI obtained will contain the true parameter value (Tan & Tan, 2010). The alpha value has the same interpretation as in testing against H0, e.g. a 95% CI is wrong 5% of the time in the long run (i.e. if we repeat the experiment many times). This implies that CI do not allow one to make strong statements about the parameter of interest (e.g. the mean difference) or about H1 (Hoekstra et al., 2014). To make a statement about the probability of a parameter of interest (e.g. the probability of the mean), Bayesian intervals must be used (Morey & Rouder, 2011).
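The long-run coverage interpretation can be demonstrated by simulation: repeat the experiment many times and count how often the interval contains the true parameter. A sketch assuming numpy and scipy, with an arbitrary population:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_mean, n, n_exp = 50.0, 30, 2000   # assumed population mean, sample size

covered = 0
for _ in range(n_exp):
    sample = rng.normal(true_mean, 10.0, size=n)
    lo, hi = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=stats.sem(sample))
    covered += (lo <= true_mean <= hi)

coverage = covered / n_exp
print(coverage)   # close to 0.95: about 95% of the intervals contain the true mean
```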
From Table 1, the 95% confidence interval is [-8.11 10.97], meaning that such an interval will fail to cover the true value 5% of the time if we repeat the experiment many times. The 95% Bayesian credible interval of the mean is in this case the same. Its meaning is, however, completely different: there is a 95% probability that the mean lies in this interval.
NHST has always been criticized, and yet it is still used every day in scientific reports (Nickerson, 2000). One question to ask oneself is: what is the goal of the scientific experiment at hand? If the goal is to establish a discrepancy with the null hypothesis and/or to establish a pattern of order (i.e. establish that A > B), because both require ruling out equivalence (i.e. ruling out A = B), then NHST is a good tool (Frick, 1996; Walker & Nowacki, 2011). If the goal is to test the presence of an effect (i.e. compute its probability) and/or establish some quantitative values related to an effect, then NHST is not the method of choice, since testing can only reject the null hypothesis.
While a Bayesian analysis is suited to estimating the probability that a hypothesis is correct, like NHST it does not prove a theory by itself, but adds to its plausibility (Lindley, 2000). It has, however, another advantage: it allows one to choose between competing hypotheses, while NHST cannot prove any specific hypothesis (Szucs & Ioannidis, 2017). No matter what testing procedure is used and how strong the results are, Fisher (1959, p. 13) reminds us that ‘[…] no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon’. Similarly, the recent statement of the American Statistical Association (Wasserstein & Lazar, 2016) makes it clear that conclusions should be based on the researcher's understanding of the problem in context, along with all summary data and tests, and that no single value (whether p-values, Bayes factors or something else) can be used to support or invalidate a theory.
Considering that quantitative reports will always have more information content than binary (significant or not) reports, we can always argue that raw and/or normalized effect sizes, confidence intervals, or Bayes factors should be reported. Reporting everything can, however, hinder the communication of the main result(s), and we should aim to give only the information needed, at least in the core of a manuscript. I recommend adopting ‘optimal reporting’ in the results section to keep the message clear, while providing detailed supplementary material. When the hypothesis is about the presence/absence (two-sided test) or order of an effect (one-sided test), and provided that a study has sufficient power, NHST is appropriate and it is sufficient to report the actual p-value in the text, since it conveys the information needed to rule out equivalence. When the hypothesis and/or the discussion involve some quantitative value, and because p-values do not inform us about the effect, it is essential to report effect sizes (Lakens, 2013), preferably accompanied by confidence or credible intervals. The reasoning is simply that one cannot predict and/or discuss quantities without accounting for variability.
Because scientific progress is obtained by accumulating evidence (Rosenthal, 1991), scientists should also anticipate the secondary use of the data. With today’s electronic articles, there is no reason not to include all derived data (means, standard deviations, effect sizes, CI, Bayes factors) as supplementary tables (or, even better, to also share the raw data). It is also essential to report the context in which tests were performed – that is, to report all tests performed (all t, F, p values) because of the increased type I error rate due to selective reporting (multiple comparisons and p-hacking problems – Ioannidis, 2005). Providing all of this information allows (i) other researchers to directly and effectively compare their results in quantitative terms (replication of effects beyond significance; Open Science Collaboration, 2015), (ii) the computation of power for future studies (Lakens & Evers, 2014), and (iii) the aggregation of results for meta-analyses whilst minimizing publication bias (van Assen et al., 2014).
Competing Interests: No competing interests were disclosed.
References
1. Colquhoun D: An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci. 2014; 1(3): 140216.
Version history:
- Version 5 (revision), 12 Oct 17
- Version 4 (revision), 26 Sep 17
- Version 3 (revision), 10 Oct 16
- Version 2 (revision), 13 Jul 16
- Version 1, 25 Aug 15