
I'm currently studying "Statistics 1" as part of my Computer Science degree, and I'm having trouble understanding the concept of "significance."

We were provided with the following definitions:

$H_0$ - The zero hypothesis

$H_1$ - The alternative hypothesis.

$R$ - $H_0$ Rejection Zone

$\bar{R}$ - $H_0$ No-Rejection Zone

$\begin{aligned} \alpha &= P(\text{type I error})\\&=P_{H_0}(\text{Rejecting } H_0) \\&= P_{H_0}(X\in R) \end{aligned}$
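To convince myself of that last line, I put together a quick simulation (my own sketch, not from the course; it assumes a one-sided z-test of $H_0:\mu=0$ with known variance) that just checks $P_{H_0}(X\in R)\approx\alpha$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

alpha = 0.05
n, sims = 20, 200_000
z_crit = norm.ppf(1 - alpha)                  # rejection zone R = {z : z > z_crit} (one-sided)

# Simulate data with H0 true (mean 0, sd 1) and count how often the statistic falls in R.
x = rng.normal(0.0, 1.0, size=(sims, n))
z = x.mean(axis=1) * np.sqrt(n)               # test statistic, since sigma = 1 and H0: mu = 0
print(f"P_H0(X in R) ~ {np.mean(z > z_crit):.4f}   (alpha = {alpha})")
```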

While I believe I have a good grasp of these definitions and how they relate to p-values, the term "significance" is still puzzling to me.

I understand the concept of something being "statistically significant," but it seems counterintuitive that a higher significance level would entail a greater risk of error. Intuitively, when something is highly significant, I would expect a lower risk.

Could someone shed some light on why the term "significance" was chosen?

For reference, these are some really good discussions of the statistical meaning of these terms; I am looking for a more intuitive explanation of why the term was chosen.

Comparing and contrasting, p-values, significance levels and type I error

What is the relation of the significance level alpha to the type 1 error alpha?

  • It is confusing, isn't it? See en.wikipedia.org/wiki/Statistical_significance#History for the history. – whuber, Apr 11, 2024 at 15:13
  • Have a look at this question and answers as well: stats.stackexchange.com/q/639628/32477. The p-value is a probability statement about the data conditional on a true null hypothesis. When the p-value is small or significant, then there is a greater "risk" that data are not plausible under the null hypothesis. – Stefan, Apr 11, 2024 at 15:13
  • Higher and lower risk of what? // @Stefan I would upvote your verbatim comment posted as an answer. – Dave, Apr 11, 2024 at 15:28
  • I suspect the answers posted so far might be missing the point: namely, if we call a numerical quantity "significance," isn't that a (strong) implication that as the value of that quantity increases, the "significance" (whatever that is intended to mean) also increases? But the reverse appears to happen here. In other words, I would suggest researching Stigler's Law of Eponymy (as potentially more instructive, even though not directly applicable) rather than reiterating the usual characterizations of $\alpha$ and p-values. – whuber, Apr 11, 2024 at 16:47
  • @whuber you are completely correct. I wonder if I should fix my original post. – Yup8, Apr 11, 2024 at 17:59

6 Answers


Why was the term "significance" ($\alpha$) chosen for the probability of Type I error?

A Type I error is the rejection of the null hypothesis when it is actually true.

The probability of making this error, when the null hypothesis is true, is the significance level $\alpha$; the p-value is likewise computed under the assumption that the null hypothesis is true. (If the null hypothesis is false, a rejection is not an error at all, so neither $\alpha$ nor the p-value is the probability of a type I error.)

The term significance relates to having a small p-value, and the 'significance level' $\alpha$ serves as a decision boundary for the p-value in hypothesis testing: if $p<\alpha$, the result is considered significant. The p-value itself is not exactly the same as 'significance'; a lower p-value means a more significant result.

but it seems counterintuitive that a higher significance level would entail a greater risk of error. Intuitively, when something is highly significant, I would expect a lower risk.

This relates to the relationship between

  • type I error: rejection of the null hypothesis when it is actually true.
  • type II error: not rejecting the null hypothesis when it is actually false.

If you demand high significance (i.e. a small $\alpha$, since higher significance corresponds to a lower p-value), then your observations must yield a low $p$-value before you reject the null hypothesis. When effect sizes are small and observations are noisy, you then risk failing to reject a false null hypothesis: the experiment has little power and does not reach the required level of significance.
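A minimal simulation sketch of this trade-off (my own illustration, assuming a one-sided z-test of $H_0:\mu=0$ with known variance and a modest true effect of $0.3$ standard deviations): a stricter, i.e. smaller, $\alpha$ lowers the type I error rate but raises the type II error rate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def rejection_rate(mu, alpha, n=25, sims=100_000):
    """Fraction of simulated samples whose one-sided z-test rejects H0: mu = 0."""
    x = rng.normal(mu, 1.0, size=(sims, n))          # data with true mean mu, sd 1
    z = x.mean(axis=1) * np.sqrt(n)                  # z statistic (sigma known to be 1)
    return np.mean(z > norm.ppf(1 - alpha))          # reject when z exceeds the critical value

for alpha in (0.10, 0.05, 0.01):
    type1 = rejection_rate(mu=0.0, alpha=alpha)      # H0 true: any rejection is a type I error
    type2 = 1 - rejection_rate(mu=0.3, alpha=alpha)  # H0 false: non-rejection is a type II error
    print(f"alpha={alpha:.2f}  type I ~ {type1:.3f}  type II ~ {type2:.3f}")
```

Running it shows the type I column tracking $\alpha$ while the type II column grows as $\alpha$ shrinks.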

  • Thank you for addressing my concerns regarding high significance levels. It's very surprising to me that a high significance level means a small $\alpha$, because I thought $\alpha$ itself was the significance level. – Yup8, Apr 12, 2024 at 20:01

To understand what is going on with 'significance' you first need to understand the two contrasting approaches to 'testing' and the ill-formed combinations of them. That difficulty is probably why @Stefan writes "Picking 100% bullet proof terminology is difficult (at least to me) when it comes to p-values and null-hypothesis testing etc."

The definitions you list are all parts of the Neyman–Pearsonian hypothesis testing framework. It is notable that there is no p-value among them. That framework was devised later than p-values and it does different things from the significance testing approach of Fisher which is where the p-values belong. I have attempted to explain those two systems in depth elsewhere, so I will give only an outline here.

The Neyman–Pearsonian framework dichotomises results into 'significant', where the observed test statistic lies in the rejection zone, and 'not significant' otherwise. A 'significant' result entails a decision to behave as if the null hypothesis (not "zero hypothesis"!) is false.

The benefits of such a framework are, in my opinion, that it leads to a simple to follow recipe that takes data as an input and gives a simple significant/not significant output, and the dichotomisation makes accounting of errors simple. However, that error accounting is only relevant to the long-run rates that belong to the method and does not directly tell you anything about the particular null hypothesis in question. They are 'global' error rates and are only tangentially related to the 'local' evidence in the data concerning the hypotheses of interest in the particular experiment. In my opinion those benefits are trivial compared to the harm that enforced decisions do to careful and reasoned scientific inferences. The Neyman–Pearsonian approach is well matched to relatively few experimental circumstances.

The Fisherian significance test approach (sometimes called neo-Fisherian) involves evaluation of the discordance between the data and the null hypothesis, according to the statistical model. The p-value expresses that discordance numerically (but non-linearly!) because data yielding a small p-value are uncommon when the null hypothesis is true. The meaning of a p-value is evidential: the smaller the p-value, the stronger the evidence in the data against the null hypothesis, according to the model. (A small p-value can also be interpreted as casting doubt on the applicability of the statistical model!) Notice that the p-value does not entail any decision.

The p-value in that previous paragraph is the precise p-value, not a dichotomised 'less than / greater than' version of it. The result is 'significant' at the level of the p-value. Some people point to statements by Fisher that make it seem that the p-value should be considered granular, with typical levels of significance of 0.05, 0.01, 0.001 or the like, but that is because Fisher was working from pre-tabulated 'critical' values of the test statistics and did not have the ability to conveniently calculate precise p-values in most cases. (You know, pre-computer, pre-calculator days!) Reporting p-values as 'less than's rather than exact numbers is discarding information for no gain.

Here is one of the confusing things: if the p-value is found to be less than any value of $\alpha$ that might have been selected for a Neyman–Pearsonian hypothesis test, then the observed test statistic will be inside the rejection region of that test.
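A tiny numerical sketch of that equivalence (my own, assuming a two-sided one-sample t-test computed with scipy; the particular data and $\alpha$ are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.4, 1.0, size=30)                     # one sample; H0: population mean is 0

t_obs, p = stats.ttest_1samp(x, popmean=0)            # observed t statistic and two-sided p-value

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=len(x) - 1)    # two-sided critical value for this alpha

# These two statements always agree: p < alpha exactly when |t| lies in the rejection region.
print(f"p = {p:.4f} < alpha: {p < alpha}")
print(f"|t| = {abs(t_obs):.3f} > t_crit = {t_crit:.3f}: {abs(t_obs) > t_crit}")
```

The two printed booleans agree whatever data or $\alpha$ you plug in, because $p<\alpha$ and $|t|>c_\alpha$ describe the same event by construction.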

Beyond that, I recommend strongly that you look at the linked papers.

Now, why 'significance'? Fisher used it before Neyman and Pearson devised their approach, and a significant result was mostly interpreted as one that is worthy of follow-up by further experimentation. Yep, worthy of more work, not an end. A decision to end a study and to make an inference about the real world was to Fisher (and should be to most scientists) a non-statistical one. Such a meaning marries well with Fisher's usage of the word significant. People have subsequently added their own meanings, perhaps with the aim of 'making things easy', but at the cost of obscuring the proper role of reasoning in inference making. It is a shame that any statistical meaning has been attached to the word, but there you go.

  • +1 Michael for distinguishing the two approaches, unlike the hybrid of the two that we conventionally follow. Fisher was the propounder of the p-value, whereas it was Neyman and Pearson who devised the significance level for their approach. Aris Spanos, in his work Statistical Foundations of Econometric Modelling, deals in extreme detail with the philosophy and how the two frameworks contrast with each other (and why the Fisherian approach is better for misspecification problems, etc.). – Apr 12, 2024 at 1:48
  • David Cox was a strong proponent of Fisherian significance testing over his long and productive life (e.g. from the Cox and Hinkley (1974) text through to his 2020 paper in the Annual Review of Statistics and Its Application). – Apr 12, 2024 at 3:07

It's called "significant" because it signifies that the null hypothesis looks wrong in the light of the data; evidence testifies against the null hypothesis.

The potentially confusing bit is that tests are not symmetric, i.e., you can have evidence against the null hypothesis but not in its favour. Not rejecting the null hypothesis (not having significance) does not indicate the null hypothesis to be true. At best (e.g., with large samples) it indicates that reality may not be so far away from the H0.

Another confusing bit is that the term "highly significant" does not line up with the significance level being higher. A high significance level (e.g., 10% rather than the lower 5% or 1%) means that results are already called significant even when they are not strongly ("highly") significant; "highly significant" instead means that the p-value is very small, i.e., significant even at a very small level.

Note also that significance is not the probability of type I error, as claimed in the question title. Significance means that we have observed a result that counts against the null hypothesis. Such a result would have a low probability under the null hypothesis, but if the null hypothesis is indeed wrong (which may very well be the case; actually we would normally expect this if we have a significant result), then it's not a type I error (rather not an error at all).


Complementing all the other answers, it's apt to trace a bit of history germane to the topic at hand.

Edgeworth, investigating the difference of two means and taking a cue from Laplace's empirical work on comparing the means of two sets of $400$ barometric observations taken at different times (cf. $[\rm I]$), used the notion of measuring the observed divergence to assess whether the excess was "accidental" (due to mere chance) or whether there was a "constant cause". He introduced a pre-specified constant ($2\sqrt 2$) for his rejection rule. As $[\rm II]$ writes:

... the difference between the two means could not be justified as "accidental" and it would appear to be significant.

Fast forward to Fisher, who formally introduced the concept of the null hypothesis, which is

a statement about the underlying statistical model.

He developed the concept of the probability value, or p-value, to evaluate the plausibility of the null hypothesis -- that is, with the p-value he measured to what extent

the sample realization lends credence to the null hypothesis.

Fisher's philosophy was clear: it was inferential. The implicit alternative hypothesis (retroactively speaking) would be any possible statistical model that goes beyond the boundaries of the postulated ones (more on that later).

As mentioned in the last paragraph of Michael Lew's answer, Fisher never wanted the calculation of a p-value to end the investigation by some mechanical rule; rather, the p-value would feed into a broader investigation.

This brings us to the decision-making framework of Neyman & Pearson, who didn't like the ad hoc choice of a test statistic and the subsequent use of the p-value. The solution they proposed was a "choice between rival hypotheses", thus focusing on whether to reject or accept the null hypothesis instead of simply measuring the extent of legitimacy (or lack thereof) that the observed data bestowed upon the null hypothesis.

This turns the testing problem into a problem of optimization: fix the probability of type I error ($\alpha$ -- the size) and minimize the probability of type II error. If the test statistic $\tau(\mathbf X),$ based on a distance function from the postulated parameter, yields a difference "significantly different from zero" (that is, $|\tau(\mathbf X) | > c_\alpha,$ where $c_\alpha$ is the constant that determines the rejection region of size $\alpha$), the null hypothesis is rejected, and this $\alpha$ is sometimes termed the significance level.
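As a small worked illustration (my own sketch, not from the cited references; it assumes a two-sided z-test for a normal mean with known variance), one can fix $\alpha$, obtain $c_\alpha$, and check that the rule $|\tau(\mathbf X)|>c_\alpha$ has size exactly $\alpha$ while its power, one minus the type II error probability, grows with the true deviation from the postulated mean:

```python
import numpy as np
from scipy.stats import norm

alpha, n, sigma, mu0 = 0.05, 25, 1.0, 0.0
c_alpha = norm.ppf(1 - alpha / 2)            # rejection region: |tau(X)| > c_alpha, which has size alpha

def power(mu):
    """P(reject H0) for the z-test tau(X) = sqrt(n) * (xbar - mu0) / sigma when the true mean is mu."""
    shift = (mu - mu0) * np.sqrt(n) / sigma  # tau(X) ~ N(shift, 1)
    return norm.sf(c_alpha - shift) + norm.cdf(-c_alpha - shift)

for mu in (0.0, 0.2, 0.5):
    print(f"true mean {mu:.1f}:  P(|tau(X)| > c_alpha) = {power(mu):.3f}")
```

At the postulated mean the printed probability is exactly $\alpha$; away from it, it is the power of the test.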

Thus the concepts of p-value and significance level emanated from two different philosophical frameworks, which can be summed up by viewing their respective alternative hypotheses. If $\boldsymbol\Phi$ represents a specification of the probability model, and we postulate that the true probability distribution $f(\mathbf X)$ belongs to a proper subset, that is, $f\in \boldsymbol\Phi_0\subset \boldsymbol\Phi,$ then in the N-P framework it would be

$$\begin{align}\mathrm H_0&:= f\in \boldsymbol\Phi_0\\\textrm{against}\\\mathrm H_1&:=f\in \boldsymbol\Phi\setminus \boldsymbol\Phi_0\end{align};$$

whereas in Fisher's framework, it would be

$$\begin{align}\mathrm H_0&:= f\in \boldsymbol\Phi_0\\\textrm{against}\\\mathrm H_1&:=f\in \boldsymbol{\mathcal P}\setminus \boldsymbol\Phi_0\end{align},$$

where $\boldsymbol{\mathcal P}$ represents the collection of all possible statistical models.

Much to the chagrin of both Fisher and Neyman-Pearson, modern-day statisticians justify the use of the p-value in an N-P test (a "monstrous hybrid"), calling it the observed significance level, since

... the critical value $c_\alpha$ depends on the significance level $\alpha$, which is often arbitrary, except for the requirement of being "small"...

As Gigerenzer lamented

... the significance level is a property of the test itself, irrespective of any observed data, but the p-value is a measure which is inextricably bound up with the specific data under consideration

which again reflects the difference between the above two approaches.

--

References:

$\rm[I]$ On the Determination of the Modulus of Errors, F. Y. Edgeworth, $1886,$ Phil. Mag. $21,$ pp. $500 - 507.$

$\rm[II]$ Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Aris Spanos, Cambridge University Press, $1999, $ ch. $14.$


In short, the p-value is a probability statement about the data under a true null hypothesis. When the p-value is small, or significant, there is a greater "risk" that the data are not plausible under the null hypothesis. In other words, the data show a significant deviation from what would be expected under the null hypothesis.

Have a look at this question and answers as well: Never understood the concept of the p-value, it should be higher than 0.05

To address @Dave's comment:

Picking 100% bulletproof terminology is difficult (at least for me) when it comes to p-values, null-hypothesis testing, etc. So instead of modifying my answer to address Dave's comment, I will cite some relevant sections from Wasserstein 2016, page 131:

  1. What is a p-Value?

Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

and,

  1. Principles
    1. P-values can indicate how incompatible the data are with a specified statistical model.

A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.
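As a concrete sketch of that informal definition (my own toy example, assuming the null model is a fair coin tossed $n=100$ times and the statistical summary is the number of heads observed):

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy null model: a fair coin, n = 100 tosses; the statistical summary is the head count.
n, p0, observed_heads = 100, 0.5, 61

# p-value: probability, under the null model, that the summary is at least as extreme
# (here: at least as large) as the value actually observed.
sims = rng.binomial(n, p0, size=200_000)
p_value = np.mean(sims >= observed_heads)
print(f"simulated one-sided p-value ~ {p_value:.4f}")   # the exact binomial tail is about 0.018
```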

  • +1 though I take issue with the conditional on a true null hypothesis comment, which makes it sound like a conditional probability that could be flipped with Bayes' theorem. – Dave, Apr 11, 2024 at 16:11
  • @Dave I edited my answer but wasn't 100% sure what you think is more correct to say? Perhaps given a true null hypothesis? Although that means the same, no? – Stefan, Apr 11, 2024 at 16:33
  • "Under the null hypothesis" or "assuming the null hypothesis is true" work for me. The important part is to keep from thinking $P(H_0\vert p<\alpha)$ makes sense because the p-value is a conditional probability $P(p<\alpha\vert H_0)$ that can be reversed with Bayes' theorem. – Dave, Apr 11, 2024 at 16:38
  • @Dave I updated my wording. Thanks for pointing this out! – Stefan, Apr 11, 2024 at 16:42

There is a post at our sister site that may be relevant: Did the meaning of "significant" change in the 20th century?. That post is specifically about the meaning of "significance" in general language, not as a technical term. If that changed, maybe its former meaning is what influenced its technical meaning?

Specifically, what did Fisher mean by "significance"? He reacted strongly against Pearson & Neyman interpreting it as a decision; maybe he thought of it more in practical terms as "worthy of notice", something to make a note of in the lab notebook.

  • Sir David Cox wrote (Section 3.4): "The contrast made here between the calculation of p-values as measures of evidence of consistency and the more decision-focused emphasis on accepting and rejecting hypotheses might be taken as one characteristic difference between the Fisherian and the Neyman-Pearson formulations of statistical theory. While this is in some respects the case, the actual practice in specific applications as between Fisher and Neyman was almost the reverse. Neyman often in effect reported p-values whereas some of Fisher's use of tests in applications was much more dichotomous." – Apr 14, 2024 at 17:12
  • See his 2006 book Principles of Statistical Inference. This suggests the conflict between the two approaches is more nuanced than modern authors describe. – Apr 14, 2024 at 17:18
  • Yet, at other times, "worthy of notice" seems a good description of how Fisher uses it. – Apr 14, 2024 at 17:27
