This HTML version of "Think Stats 2e" is provided for convenience, but it is not the best format for the book. In particular, some of the math symbols are not rendered correctly.
Chapter 9 Hypothesis testing
The code for this chapter is in
hypothesis.py. For information
about downloading and working with this code, see Section 0.2.
9.1 Classical hypothesis testing
Exploring the data from the NSFG, we saw several “apparent effects,” including differences between first babies and others. So far we have taken these effects at face value; in this chapter, we put them to the test.
The fundamental question we want to address is whether the effects we see in a sample are likely to appear in the larger population. For example, in the NSFG sample we see a difference in mean pregnancy length for first babies and others. We would like to know if that effect reflects a real difference for women in the U.S., or if it might appear in the sample by chance.
There are several ways we could formulate this question, including Fisher null hypothesis testing, Neyman-Pearson decision theory, and Bayesian inference1. What I present here is a subset of all three that makes up most of what people use in practice, which I will call classical hypothesis testing.
The goal of classical hypothesis testing is to answer the question, “Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?” Here’s how we answer that question:
- The first step is to quantify the size of the apparent effect by choosing a test statistic. In the NSFG example, the apparent effect is a difference in pregnancy length between first babies and others, so a natural choice for the test statistic is the difference in means between the two groups.
- The second step is to define a null hypothesis, which is a model of the system based on the assumption that the apparent effect is not real. In the NSFG example the null hypothesis is that there is no difference between first babies and others; that is, that pregnancy lengths for both groups have the same distribution.
- The third step is to compute a p-value, which is the probability of seeing the apparent effect if the null hypothesis is true. In the NSFG example, we would compute the actual difference in means, then compute the probability of seeing a difference as big, or bigger, under the null hypothesis.
- The last step is to interpret the result. If the p-value is low, the effect is said to be statistically significant, which means that it is unlikely to have occurred by chance. In that case we infer that the effect is more likely to appear in the larger population.
The logic of this process is similar to a proof by contradiction. To prove a mathematical statement, A, you assume temporarily that A is false. If that assumption leads to a contradiction, you conclude that A must actually be true.
Similarly, to test a hypothesis like, “This effect is real,” we assume, temporarily, that it is not. That’s the null hypothesis. Based on that assumption, we compute the probability of the apparent effect. That’s the p-value. If the p-value is low, we conclude that the null hypothesis is unlikely to be true.
class HypothesisTest(object): def __init__(self, data): self.data = data self.MakeModel() self.actual = self.TestStatistic(data) def PValue(self, iters=1000): self.test_stats = [self.TestStatistic(self.RunModel()) for _ in range(iters)] count = sum(1 for x in self.test_stats if x >= self.actual) return count / iters def TestStatistic(self, data): raise UnimplementedMethodException() def MakeModel(self): pass def RunModel(self): raise UnimplementedMethodException()
HypothesisTest is an abstract parent class that provides
complete definitions for some methods and place-keepers for others.
Child classes based on
PValue and provide
RunModel, and optionally
__init__ takes the data in whatever form is appropriate. It
MakeModel, which builds a representation of the null
hypothesis, then passes the data to
computes the size of the effect in the sample.
PValue computes the probability of the apparent effect under
the null hypothesis. It takes as a parameter
iters, which is
the number of simulations to run. The first line generates simulated
data, computes test statistics, and stores them in
The result is
the fraction of elements in
exceed or equal the observed test statistic,
As a simple example2, suppose we toss a coin 250 times and see 140 heads and 110 tails. Based on this result, we might suspect that the coin is biased; that is, more likely to land heads. To test this hypothesis, we compute the probability of seeing such a difference if the coin is actually fair:
class CoinTest(thinkstats2.HypothesisTest): def TestStatistic(self, data): heads, tails = data test_stat = abs(heads - tails) return test_stat def RunModel(self): heads, tails = self.data n = heads + tails sample = [random.choice('HT') for _ in range(n)] hist = thinkstats2.Hist(sample) data = hist['H'], hist['T'] return data
Now all we have to do is instantiate
CoinTest and call
ct = CoinTest((140, 110)) pvalue = ct.PValue()
The result is about 0.07, which means that if the coin is fair, we expect to see a difference as big as 30 about 7% of the time.
But the choice of 5% is arbitrary, and (as we will see later) the p-value depends on the choice of the test statistics and the model of the null hypothesis. So p-values should not be considered precise measurements.
I recommend interpreting p-values according to their order of magnitude: if the p-value is less than 1%, the effect is unlikely to be due to chance; if it is greater than 10%, the effect can plausibly be explained by chance. P-values between 1% and 10% should be considered borderline. So in this example I conclude that the data do not provide strong evidence that the coin is biased or not.
9.3 Testing a difference in means
One of the most common effects to test is a difference in mean between two groups. In the NSFG data, we saw that the mean pregnancy length for first babies is slightly longer, and the mean birth weight is slightly smaller. Now we will see if those effects are statistically significant.
For these examples, the null hypothesis is that the distributions for the two groups are the same. One way to model the null hypothesis is by permutation; that is, we can take values for first babies and others and shuffle them, treating the two groups as one big group:
class DiffMeansPermute(thinkstats2.HypothesisTest): def TestStatistic(self, data): group1, group2 = data test_stat = abs(group1.mean() - group2.mean()) return test_stat def MakeModel(self): group1, group2 = self.data self.n, self.m = len(group1), len(group2) self.pool = np.hstack((group1, group2)) def RunModel(self): np.random.shuffle(self.pool) data = self.pool[:self.n], self.pool[self.n:] return data
RunModel simulates the null hypothesis by shuffling the
pooled values and splitting them into two groups with sizes
m. As always, the return value from
the same format as the observed data.
To test the difference in pregnancy length, we run:
live, firsts, others = first.MakeFrames() data = firsts.prglngth.values, others.prglngth.values ht = DiffMeansPermute(data) pvalue = ht.PValue()
MakeFrames reads the NSFG data and returns DataFrames
representing all live births, first babies, and others.
We extract pregnancy lengths as NumPy arrays, pass them as
DiffMeansPermute, and compute the p-value. The
result is about 0.17, which means that we expect to see a difference
as big as the observed effect about 17% of the time. So
this effect is not statistically significant.
ht.PlotCdf() thinkplot.Show(xlabel='test statistic', ylabel='CDF')
Figure 9.1 shows the result. The CDF intersects the observed difference at 0.83, which is the complement of the p-value, 0.17.
If we run the same analysis with birth weight, the computed p-value is 0; after 1000 attempts, the simulation never yields an effect as big as the observed difference, 0.12 lbs. So we would report p < 0.001, and conclude that the difference in birth weight is statistically significant.
9.4 Other test statistics
Choosing the best test statistic depends on what question you are trying to address. For example, if the relevant question is whether pregnancy lengths are different for first babies, then it makes sense to test the absolute difference in means, as we did in the previous section.
If we had some reason to think that first babies are likely to be late, then we would not take the absolute value of the difference; instead we would use this test statistic:
class DiffMeansOneSided(DiffMeansPermute): def TestStatistic(self, data): group1, group2 = data test_stat = group1.mean() - group2.mean() return test_stat
DiffMeansPermute; the only difference is that
TestStatistic does not take the absolute value of the
difference. This kind of test is called one-sided because
it only counts one side of the distribution of differences. The
previous test, using both sides, is two-sided.
The one-sided hypothesis, that first babies are born late, is more specific than the two-sided hypothesis, so the p-value is smaller. But even for the stronger hypothesis, the difference is not statistically significant.
We can use the same framework to test for a difference in standard deviation. In Section 3.3, we saw some evidence that first babies are more likely to be early or late, and less likely to be on time. So we might hypothesize that the standard deviation is higher. Here’s how we can test that:
class DiffStdPermute(DiffMeansPermute): def TestStatistic(self, data): group1, group2 = data test_stat = group1.std() - group2.std() return test_stat
9.5 Testing a correlation
This framework can also test correlations. For example, in the NSFG data set, the correlation between birth weight and mother’s age is about 0.07. It seems like older mothers have heavier babies. But could this effect be due to chance?
For the test statistic, I use Pearson’s correlation, but Spearman’s would work as well. If we had reason to expect positive correlation, we would do a one-sided test. But since we have no such reason, I’ll do a two-sided test using the absolute value of correlation.
The null hypothesis is that there is no correlation between mother’s age and birth weight. By shuffling the observed values, we can simulate a world where the distributions of age and birth weight are the same, but where the variables are unrelated:
class CorrelationPermute(thinkstats2.HypothesisTest): def TestStatistic(self, data): xs, ys = data test_stat = abs(thinkstats2.Corr(xs, ys)) return test_stat def RunModel(self): xs, ys = self.data xs = np.random.permutation(xs) return xs, ys
Here’s the code that reads the data and runs the test:
live, firsts, others = first.MakeFrames() live = live.dropna(subset=['agepreg', 'totalwgt_lb']) data = live.agepreg.values, live.totalwgt_lb.values ht = CorrelationPermute(data) pvalue = ht.PValue()
The actual correlation is 0.07. The computed p-value is 0; after 1000 iterations the largest simulated correlation is 0.04. So although the observed correlation is small, it is statistically significant.
This example is a reminder that “statistically significant” does not always mean that an effect is important, or significant in practice. It only means that it is unlikely to have occurred by chance.
9.6 Testing proportions
Suppose you run a casino and you suspect that a customer is using a crooked die; that is, one that has been modified to make one of the faces more likely than the others. You apprehend the alleged cheater and confiscate the die, but now you have to prove that it is crooked. You roll the die 60 times and get the following results:
On average you expect each value to appear 10 times. In this dataset, the value 3 appears more often than expected, and the value 4 appears less often. But are these differences statistically significant?
To test this hypothesis, we can compute the expected frequency for each value, the difference between the expected and observed frequencies, and the total absolute difference. In this example, we expect each side to come up 10 times out of 60; the deviations from this expectation are -2, -1, 9, -5, -2, and 1; so the total absolute difference is 20. How often would we see such a difference by chance?
class DiceTest(thinkstats2.HypothesisTest): def TestStatistic(self, data): observed = data n = sum(observed) expected = np.ones(6) * n / 6 test_stat = sum(abs(observed - expected)) return test_stat def RunModel(self): n = sum(self.data) values = [1, 2, 3, 4, 5, 6] rolls = np.random.choice(values, n, replace=True) hist = thinkstats2.Hist(rolls) freqs = hist.Freqs(values) return freqs
The p-value for this data is 0.13, which means that if the die is fair we expect to see the observed total deviation, or more, about 13% of the time. So the apparent effect is not statistically significant.
9.7 Chi-squared tests
In the previous section we used total deviation as the test statistic. But for testing proportions it is more common to use the chi-squared statistic:
class DiceChiTest(DiceTest): def TestStatistic(self, data): observed = data n = sum(observed) expected = np.ones(6) * n / 6 test_stat = sum((observed - expected)**2 / expected) return test_stat
Squaring the deviations (rather than taking absolute values) gives
more weight to large deviations. Dividing through by
standardizes the deviations, although in this case it has no effect
because the expected frequencies are all equal.
The p-value using the chi-squared statistic is 0.04, substantially smaller than what we got using total deviation, 0.13. If we take the 5% threshold seriously, we would consider this effect statistically significant. But considering the two tests togther, I would say that the results are borderline. I would not rule out the possibility that the die is crooked, but I would not convict the accused cheater.
This example demonstrates an important point: the p-value depends on the choice of test statistic and the model of the null hypothesis, and sometimes these choices determine whether an effect is statistically significant or not.
9.8 First babies again
Earlier in this chapter we looked at pregnancy lengths for first babies and others, and concluded that the apparent differences in mean and standard deviation are not statistically significant. But in Section 3.3, we saw several apparent differences in the distribution of pregnancy length, especially in the range from 35 to 43 weeks. To see whether those differences are statistically significant, we can use a test based on a chi-squared statistic.
class PregLengthTest(thinkstats2.HypothesisTest): def MakeModel(self): firsts, others = self.data self.n = len(firsts) self.pool = np.hstack((firsts, others)) pmf = thinkstats2.Pmf(self.pool) self.values = range(35, 44) self.expected_probs = np.array(pmf.Probs(self.values)) def RunModel(self): np.random.shuffle(self.pool) data = self.pool[:self.n], self.pool[self.n:] return data
The data are represented as two lists of pregnancy lengths. The null
hypothesis is that both samples are drawn from the same distribution.
MakeModel models that distribution by pooling the two
simulated data by shuffling the pooled sample and splitting it
into two parts.
MakeModel also defines
values, which is the
range of weeks we’ll use, and
which is the probability of each value in the pooled distribution.
Here’s the code that computes the test statistic:
# class PregLengthTest: def TestStatistic(self, data): firsts, others = data stat = self.ChiSquared(firsts) + self.ChiSquared(others) return stat def ChiSquared(self, lengths): hist = thinkstats2.Hist(lengths) observed = np.array(hist.Freqs(self.values)) expected = self.expected_probs * len(lengths) stat = sum((observed - expected)**2 / expected) return stat
ChiSquared takes a sequence of pregnancy lengths, computes
its histogram, and computes
observed, which is a list of
frequencies corresponding to
To compute the list of expected frequencies, it multiplies the
expected_probs, by the sample
size. It returns the chi-squared statistic,
For the NSFG data the total chi-squared statistic is 102, which doesn’t mean much by itself. But after 1000 iterations, the largest test statistic generated under the null hypothesis is 32. We conclude that the observed chi-squared statistic is unlikely under the null hypothesis, so the apparent effect is statistically significant.
This example demonstrates a limitation of chi-squared tests: they indicate that there is a difference between the two groups, but they don’t say anything specific about what the difference is.
- If the effect is actually due to chance, what is the probability that we will wrongly consider it significant? This probability is the false positive rate.
- If the effect is real, what is the chance that the hypothesis test will fail? This probability is the false negative rate.
The false positive rate is relatively easy to compute: if the threshold is 5%, the false positive rate is 5%. Here’s why:
- If there is no real effect, the null hypothesis is true, so we can compute the distribution of the test statistic by simulating the null hypothesis. Call this distribution CDFT.
- Each time we run an experiment, we get a test statistic, t,
which is drawn from CDFT. Then we compute a p-value, which is
the probability that a random value from CDFT exceeds
t, so that’s 1 − CDFT(t).
- The p-value is less than 5% if CDFT(t) is greater than 95%; that is, if t exceeds the 95th percentile. And how often does a value chosen from CDFT exceed the 95th percentile? 5% of the time.
So if you perform one hypothesis test with a 5% threshold, you expect a false positive 1 time in 20.
def FalseNegRate(data, num_runs=100): group1, group2 = data count = 0 for i in range(num_runs): sample1 = thinkstats2.Resample(group1) sample2 = thinkstats2.Resample(group2) ht = DiffMeansPermute((sample1, sample2)) pvalue = ht.PValue(iters=101) if pvalue > 0.05: count += 1 return count / num_runs
FalseNegRate takes data in the form of two sequences, one for
each group. Each time through the loop, it simulates an experiment by
drawing a random sample from each group and running a hypothesis test.
Then it checks the result and counts the number of false negatives.
def Resample(xs): return np.random.choice(xs, len(xs), replace=True)
Here’s the code that tests pregnancy lengths:
live, firsts, others = first.MakeFrames() data = firsts.prglngth.values, others.prglngth.values neg_rate = FalseNegRate(data)
This result is often presented the other way around: if the actual difference is 0.078 weeks, we should expect a positive test only 30% of the time. This “correct positive rate” is called the power of the test, or sometimes “sensitivity”. It reflects the ability of the test to detect an effect of a given size.
In this example, the test had only a 30% chance of yielding a positive result (again, assuming that the difference is 0.078 weeks). As a rule of thumb, a power of 80% is considered acceptable, so we would say that this test was “underpowered.”
In general a negative hypothesis test does not imply that there is no difference between the groups; instead it suggests that if there is a difference, it is too small to detect with this sample size.
The hypothesis testing process I demonstrated in this chapter is not, strictly speaking, good practice.
First, I performed multiple tests. If you run one hypothesis test, the chance of a false positive is about 1 in 20, which might be acceptable. But if you run 20 tests, you should expect at least one false positive, most of the time.
Second, I used the same dataset for exploration and testing. If you explore a large dataset, find a surprising effect, and then test whether it is significant, you have a good chance of generating a false positive.
To compensate for multiple tests, you can adjust the p-value threshold (see https://en.wikipedia.org/wiki/Holm-Bonferroni_method). Or you can address both problems by partitioning the data, using one set for exploration and the other for testing.
In some fields these practices are required or at least encouraged. But it is also common to address these problems implicitly by replicating published results. Typically the first paper to report a new result is considered exploratory. Subsequent papers that replicate the result with new data are considered confirmatory.
As it happens, we have an opportunity to replicate the results in this
chapter. The first edition of this book is based on Cycle 6 of the
NSFG, which was released in 2002. In October 2011, the CDC released
additional data based on interviews conducted from 2006–2010.
contains code to read and clean this data. In the new
- The difference in mean pregnancy length is 0.16 weeks and statistically significant with p < 0.001 (compared to 0.078 weeks in the original dataset).
- The difference in birth weight is 0.17 pounds with p < 0.001 (compared to 0.12 lbs in the original dataset).
- The correlation between birth weight and mother’s age is 0.08 with p < 0.001 (compared to 0.07).
- The chi-squared test is statistically significant with p < 0.001 (as it was in the original).
In summary, all of the effects that were statistically significant in the original dataset were replicated in the new dataset, and the difference in pregnancy length, which was not significant in the original, is bigger in the new dataset and significant.
A solution to these exercises is in
In Section 9.3, we simulated the null hypothesis by permutation; that is, we treated the observed values as if they represented the entire population, and randomly assigned the members of the population to the two groups.
An alternative is to use the sample to estimate the distribution for the population, then draw a random sample from that distribution. This process is called resampling. There are several ways to implement resampling, but one of the simplest is to draw a sample with replacement from the observed values, as in Section 9.10.
- hypothesis testing: The process of determining whether an apparent effect is statistically significant.
- test statistic: A statistic used to quantify an effect size.
- null hypothesis: A model of a system based on the assumption that an apparent effect is due to chance.
- p-value: The probability that an effect could occur by chance.
- statistically significant: An effect is statistically significant if it is unlikely to occur by chance.
- permutation test: A way to compute p-values by generating permutations of an observed dataset.
- resampling test: A way to compute p-values by generating samples, with replacement, from an observed dataset.
- two-sided test: A test that asks, “What is the chance of an effect as big as the observed effect, positive or negative?”
- one-sided test: A test that asks, “What is the chance of an effect as big as the observed effect, and with the same sign?”
- chi-squared test: A test that uses the chi-squared statistic as the test statistic.
- false positive: The conclusion that an effect is real when it is not.
- false negative: The conclusion that an effect is due to chance when it is not.
- power: The probability of a positive test if the null hypothesis is false.