Chapter 2 Distributions

2.1 Histograms

One of the best ways to describe a variable is to report the values that appear in the dataset and how many times each value appears. This description is called the distribution of the variable.

The most common representation of a distribution is a histogram, which is a graph that shows the frequency of each value. In this context, “frequency” means the number of times the value appears.

In Python, an efficient way to compute frequencies is with a dictionary. Given a sequence of values, t:

hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1

The result is a dictionary that maps from values to frequencies. Alternatively, you could use the Counter class defined in the collections module:

from collections import Counter
counter = Counter(t)

The result is a Counter object, which is a subclass of dictionary. Another option is to use the pandas method value_counts.

2.2 Representing histograms

The Hist constructor can take a sequence, dictionary, pandas Series, or another Hist. You can instantiate a Hist object like this:

>>> import thinkstats2
>>> hist = thinkstats2.Hist([1, 2, 2, 3, 5])
>>> hist
Hist({1: 1, 2: 2, 3: 1, 5: 1})

Hist objects provide Freq, which takes a value and returns its frequency:

>>> hist.Freq(2)
2

The bracket operator does the same thing:

>>> hist[2]
2

If you look up a value that has never appeared, the frequency is 0.

>>> hist.Freq(4)
0

Values returns an unsorted list of the values in the Hist:

>>> hist.Values()
[1, 5, 3, 2]

To loop through the values in order, you can use the built-in function sorted:

for val in sorted(hist.Values()):
    print(val, hist.Freq(val))

Or you can use Items to iterate through value-frequency pairs:

for val, freq in hist.Items():
    print(val, freq)
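
To make the behavior of these methods concrete, here is a minimal sketch of a histogram class built on Counter. MiniHist is a hypothetical name used only for illustration; it is not the thinkstats2 implementation, and it covers only the three methods described above.

```python
from collections import Counter


class MiniHist:
    """Hypothetical stand-in for a histogram class (illustration only)."""

    def __init__(self, values):
        # Counter maps each distinct value to the number of times it appears.
        self.d = Counter(values)

    def Freq(self, x):
        # A value that never appeared has frequency 0.
        return self.d.get(x, 0)

    def Values(self):
        # The values come back in no particular sorted order.
        return list(self.d.keys())

    def Items(self):
        # Value-frequency pairs, also unsorted.
        return list(self.d.items())


mini = MiniHist([1, 2, 2, 3, 5])
print(mini.Freq(2))            # 2
print(mini.Freq(4))            # 0
print(sorted(mini.Values()))   # [1, 2, 3, 5]
```

Storing the frequencies in a dictionary keeps each lookup cheap, which is why Freq can return 0 for a missing value without scanning the data.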

2.3 Plotting histograms

For this book I wrote a module called thinkplot.py that provides functions for plotting Hists and other objects defined in thinkstats2.py. It is based on pyplot, which is part of the matplotlib package. See Section 0.2 for information about installing matplotlib.

To plot hist with thinkplot, try this:

>>> import thinkplot
>>> thinkplot.Hist(hist)
>>> thinkplot.Show(xlabel='value', ylabel='frequency')

You can read the documentation for thinkplot at http://greenteapress.com/thinkstats2/thinkplot.html.

2.4 NSFG variables

Now let’s get back to the data from the NSFG. The code in this chapter is in first.py. For information about downloading and working with this code, see Section 0.2.

When you start working with a new dataset, I suggest you explore the variables you are planning to use one at a time, and a good way to start is by looking at histograms.

In Section 1.6 we transformed agepreg from centiyears to years, and combined birthwgt_lb and birthwgt_oz into a single quantity, totalwgt_lb.

I’ll start by reading the data and selecting records for live births:

import nsfg

preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]

The expression in brackets is a boolean Series that
selects rows from the DataFrame and returns a new DataFrame.
Next I generate and plot the histogram of birthwgt_lb for live births:

hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb')
thinkplot.Hist(hist)
thinkplot.Show(xlabel='pounds', ylabel='frequency')

When the argument passed to Hist is a pandas Series, any nan values are dropped. label is a string that appears in the legend when the Hist is plotted.

Figure 2.1 shows the result. The most common value, called the mode, is 7 pounds. The distribution is approximately bell-shaped, which is the shape of the normal distribution, also called a Gaussian distribution. But unlike a true normal distribution, this distribution is asymmetric; it has a tail that extends farther to the left than to the right.

Figure 2.2 shows the histogram of birthwgt_oz. Figure 2.3 shows the histogram of agepreg. Figure 2.4 shows the histogram of prglngth.

2.5 Outliers

Looking at histograms, it is easy to identify the most common values and the shape of the distribution, but rare values are not always visible.

Before going on, it is a good idea to check for outliers, which are extreme values that might be errors in measurement and recording, or might be accurate reports of rare events.

Hist provides methods Largest and Smallest, which take an integer n and return the n largest or smallest values from the histogram:

for weeks, freq in hist.Smallest(10):
    print(weeks, freq)
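
One plausible way to get this behavior from a plain dictionary of frequencies is to sort the value-frequency pairs and slice. The sketch below is an assumption about how such methods could work, not the thinkstats2 source; the function names smallest and largest and the sample frequencies are made up for illustration.

```python
def smallest(freqs, n=10):
    """Return the n smallest values with their frequencies."""
    # Sorting (value, freq) pairs orders them by value first.
    return sorted(freqs.items())[:n]


def largest(freqs, n=10):
    """Return the n largest values with their frequencies."""
    return sorted(freqs.items(), reverse=True)[:n]


# Made-up frequencies: value -> number of times it appears.
freqs = {39: 5, 0: 1, 40: 3, 4: 1, 50: 2}

print(smallest(freqs, 2))   # [(0, 1), (4, 1)]
print(largest(freqs, 2))    # [(50, 2), (40, 3)]
```

Because the pairs are ordered by value rather than by frequency, rare extreme values surface even when they appear only once.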

In the list of pregnancy lengths for live births, the 10 lowest values are [0, 4, 9, 13, 17, 18, 19, 20, 21, 22]. Values below 10 weeks are certainly errors; the most likely explanation is that the outcome was not coded correctly. Values higher than 30 weeks are probably legitimate. Between 10 and 30 weeks, it is hard to be sure; some values are probably errors, but some represent premature babies.

On the other end of the range, the highest values are:

weeks  count
43     148
44     46
45     10
46     1
47     1
48     7
50     2

Most doctors recommend induced labor if a pregnancy exceeds 42 weeks, so some of the longer values are surprising. In particular, 50 weeks seems medically unlikely.

The best way to handle outliers depends on “domain knowledge”; that is, information about where the data come from and what they mean. And it depends on what analysis you are planning to perform.

In this example, the motivating question is whether first babies tend to be early (or late). When people ask this question, they are usually interested in full-term pregnancies, so for this analysis I will focus on pregnancies longer than 27 weeks.

2.6 First babies

Now we can compare the distribution of pregnancy lengths for first babies and others. I divided the DataFrame of live births using birthord, and computed their histograms:

firsts = live[live.birthord == 1]
others = live[live.birthord != 1]
first_hist = thinkstats2.Hist(firsts.prglngth)
other_hist = thinkstats2.Hist(others.prglngth)

Then I plotted their histograms on the same axis:

width = 0.45
thinkplot.PrePlot(2)
thinkplot.Hist(first_hist, align='right', width=width)
thinkplot.Hist(other_hist, align='left', width=width)
thinkplot.Show(xlabel='weeks', ylabel='frequency',
               xlim=[27, 46])

thinkplot.PrePlot takes the number of histograms we are planning to plot; it uses this information to choose an appropriate collection of colors.

thinkplot.Hist normally uses align='center' so that each bar is centered over its value. For this figure, I use align='right' and align='left' to place corresponding bars on either side of the value.

With width=0.45, the total width of the two bars is 0.9, leaving some space between each pair.

Finally, I adjust the axis to show only data between 27 and 46 weeks. Figure 2.5 shows the result.

Histograms are useful because they make the most frequent values immediately apparent. But they are not the best choice for comparing two distributions. In this example, there are fewer “first babies” than “others,” so some of the apparent differences in the histograms are due to sample sizes. In the next chapter we address this problem using probability mass functions.

2.7 Summarizing distributions

A histogram is a complete description of the distribution of a sample; that is, given a histogram, we could reconstruct the values in the sample (although not their order).

If the details of the distribution are important, it might be necessary to present a histogram. But often we want to summarize the distribution with a few descriptive statistics.

Some of the characteristics we might want to report are:
Statistics designed to answer these questions are called summary statistics. By far the most common summary statistic is the mean, which is meant to describe the central tendency of the distribution.

If you have a sample of n values, x_i, the mean, x̄, is the sum of the values divided by the number of values; in other words

\[ \bar{x} = \frac{1}{n} \sum_i x_i \]

The words “mean” and “average” are sometimes used interchangeably, but I make this distinction:
Sometimes the mean is a good description of a set of values. For example, apples are all pretty much the same size (at least the ones sold in supermarkets). So if I buy 6 apples and the total weight is 3 pounds, it would be a reasonable summary to say they are about a half pound each.

But pumpkins are more diverse. Suppose I grow several varieties in my garden, and one day I harvest three decorative pumpkins that are 1 pound each, two pie pumpkins that are 3 pounds each, and one Atlantic Giant® pumpkin that weighs 591 pounds. The mean of this sample is 100 pounds, but if I told you “The average pumpkin in my garden is 100 pounds,” that would be misleading. In this example, there is no meaningful average because there is no typical pumpkin.

2.8 Variance

If there is no single number that summarizes pumpkin weights, we can do a little better with two numbers: mean and variance.

Variance is a summary statistic intended to describe the variability or spread of a distribution. The variance of a set of values is

\[ S^2 = \frac{1}{n} \sum_i (x_i - \bar{x})^2 \]

The term x_i − x̄ is called the “deviation from the mean,” so variance is the mean squared deviation. The square root of variance, S, is the standard deviation.

If you have prior experience, you might have seen a formula for variance with n−1 in the denominator, rather than n. This statistic is used to estimate the variance in a population using a sample. We will come back to this in Chapter 8.

Pandas data structures provide methods to compute mean, variance and standard deviation:

mean = live.prglngth.mean()
var = live.prglngth.var()
std = live.prglngth.std()
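
As a check on these definitions, the sketch below applies them by hand to the pumpkin weights from Section 2.7, using only the standard library. The variable names are illustrative. It computes the variance with both n and n−1 in the denominator; note that the pandas var and std methods default to the n−1 version (ddof=1), and accept ddof=0 for the n version.

```python
import math

# Three decorative pumpkins, two pie pumpkins, one Atlantic Giant (pounds).
weights = [1, 1, 1, 3, 3, 591]

n = len(weights)
mean = sum(weights) / n

# Deviations from the mean, then the mean squared deviation.
deviations = [x - mean for x in weights]
var_n = sum(d ** 2 for d in deviations) / n          # n in the denominator
var_n1 = sum(d ** 2 for d in deviations) / (n - 1)   # n-1 in the denominator
std = math.sqrt(var_n)

print(mean)     # 100.0
print(var_n)    # 48217.0
print(var_n1)   # 57860.4
print(std)      # about 219.6
```

With only six values the two denominators give noticeably different answers; with thousands of records, as in the NSFG data, the difference is negligible.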

For all live births, the mean pregnancy length is 38.6 weeks and the standard deviation is 2.7 weeks, which means we should expect deviations of 2-3 weeks to be common.

Variance of pregnancy length is 7.3, which is hard to interpret, especially since the units are weeks^2, or “square weeks.” Variance is useful in some calculations, but it is not a good summary statistic.

2.9 Effect size

An effect size is a summary statistic intended to describe (wait for it) the size of an effect. For example, to describe the difference between two groups, one obvious choice is the difference in the means.

Mean pregnancy length for first babies is 38.601; for other babies it is 38.523. The difference is 0.078 weeks, which works out to 13 hours. As a fraction of the typical pregnancy length, this difference is about 0.2%.

If we assume this estimate is accurate, such a difference would have no practical consequences. In fact, without observing a large number of pregnancies, it is unlikely that anyone would notice this difference at all.

Another way to convey the size of the effect is to compare the difference between groups to the variability within groups. Cohen’s d is a statistic intended to do that; it is defined

\[ d = \frac{\bar{x}_1 - \bar{x}_2}{s} \]

where x̄_1 and x̄_2 are the means of the groups and s is the “pooled standard deviation”. Here’s the Python code that computes Cohen’s d:

import math

def CohenEffectSize(group1, group2):
    # Difference between the group means
    diff = group1.mean() - group2.mean()

    var1 = group1.var()
    var2 = group2.var()
    n1, n2 = len(group1), len(group2)

    # Pooled variance: the group variances weighted by group size
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    d = diff / math.sqrt(pooled_var)
    return d
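
To see the calculation on numbers small enough to check by hand, here is a sketch of the same computation on plain Python lists. The helper names and the two toy groups are made up for illustration, and the variance here uses n in the denominator, so results can differ slightly from a pandas-based version.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def var(xs):
    # Mean squared deviation, with n in the denominator.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def cohen_effect_size(group1, group2):
    diff = mean(group1) - mean(group2)
    n1, n2 = len(group1), len(group2)
    # Pooled variance: the group variances weighted by group size.
    pooled_var = (n1 * var(group1) + n2 * var(group2)) / (n1 + n2)
    return diff / math.sqrt(pooled_var)


# Toy data: means 4 and 3, each group with variance 8/3.
d = cohen_effect_size([2, 4, 6], [1, 3, 5])
print(round(d, 3))   # 0.612
```

Here the means differ by 1 and the pooled standard deviation is about 1.63, so the groups are about 0.6 standard deviations apart.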

In this example, the difference in means is 0.029 standard deviations, which is small. To put that in perspective, the difference in height between men and women is about 1.7 standard deviations (see https://en.wikipedia.org/wiki/Effect_size).

2.10 Reporting results

We have seen several ways to describe the difference in pregnancy length (if there is one) between first babies and others. How should we report these results?

The answer depends on who is asking the question. A scientist might be interested in any (real) effect, no matter how small. A doctor might only care about effects that are clinically significant; that is, differences that affect treatment decisions. A pregnant woman might be interested in results that are relevant to her, like the probability of delivering early or late.

How you report results also depends on your goals. If you are trying to demonstrate the importance of an effect, you might choose summary statistics that emphasize differences. If you are trying to reassure a patient, you might choose statistics that put the differences in context.

Of course your decisions should also be guided by professional ethics. It’s ok to be persuasive; you should design statistical reports and visualizations that tell a story clearly. But you should also do your best to make your reports honest, and to acknowledge uncertainty and limitations.

2.11 Exercises

Exercise 1
Based on the results in this chapter, suppose you were asked to summarize what you learned about whether first babies arrive late.

Which summary statistics would you use if you wanted to get a story on the evening news? Which ones would you use if you wanted to reassure an anxious patient?

Finally, imagine that you are Cecil Adams, author of The Straight Dope (http://straightdope.com), and your job is to answer the question, “Do first babies arrive late?” Write a paragraph that uses the results in this chapter to answer the question clearly, precisely, and honestly.

Exercise 2
In the repository you downloaded, you should find a file named
chap02ex.ipynb; open it. Some cells are already filled in, and
you should execute them. Other cells give you instructions for
exercises. Follow the instructions and fill in the answers. A solution to this exercise is in In the repository you downloaded, you should find a file named
Exercise 3
The mode of a distribution is the most frequent value; see
http://wikipedia.org/wiki/Mode_(statistics). Write a function
called Mode that takes a Hist and returns the most
frequent value.
As a more challenging exercise, write a function called AllModes that returns a list of value-frequency pairs in descending order of frequency.

Exercise 4
Using the variable totalwgt_lb, investigate whether first
babies are lighter or heavier than others. Compute Cohen’s d
to quantify the difference between the groups. How does it
compare to the difference in pregnancy length?
2.12 Glossary

