This HTML version of "Think Stats 2e" is provided for convenience, but it is not the best format for the book. In particular, some of the math symbols are not rendered correctly.
You might prefer to read the PDF version, or you can buy a hard copy from Amazon.
Chapter 10 Linear least squares
The code for this chapter is in linear.py
. For information
about downloading and working with this code, see Section ??.
10.1 Least squares fit
Correlation coefficients measure the strength and sign of a relationship, but not the slope. There are several ways to estimate the slope; the most common is a linear least squares fit. A “linear fit” is a line intended to model the relationship between variables. A “least squares” fit is one that minimizes the mean squared error (MSE) between the line and the data.
Suppose we have a sequence of points, ys
, that we want to
express as a function of another sequence xs
. If there is a
linear relationship between xs
and ys
with intercept inter
and slope slope
, we expect each y[i]
to be
inter + slope * x[i]
.
But unless the correlation is perfect, this prediction is only approximate. The vertical deviation from the line, or residual, is
res = ys - (inter + slope * xs)
The residuals might be due to random factors like measurement error, or non-random factors that are unknown. For example, if we are trying to predict weight as a function of height, unknown factors might include diet, exercise, and body type.
If we get the parameters inter
and slope
wrong, the residuals
get bigger, so it makes intuitive sense that the parameters we want
are the ones that minimize the residuals.
We might try to minimize the absolute value of the
residuals, or their squares, or their cubes; but the most common
choice is to minimize the sum of squared residuals,
sum(res**2)
.
Why? There are three good reasons and one less important one:
- Squaring has the feature of treating positive and negative residuals the same, which is usually what we want.
- Squaring gives more weight to large residuals, but not so much weight that the largest residual always dominates.
- If the residuals are uncorrelated and normally distributed with
mean 0 and constant (but unknown) variance, then the least squares
fit is also the maximum likelihood estimator of
inter
andslope
. See https://en.wikipedia.org/wiki/Linear_regression. - The values of
inter
andslope
that minimize the squared residuals can be computed efficiently.
The last reason made sense when computational efficiency was more important than choosing the method most appropriate to the problem at hand. That’s no longer the case, so it is worth considering whether squared residuals are the right thing to minimize.
For example, if you are using xs
to predict values of ys
,
guessing too high might be better (or worse) than guessing too low.
In that case you might want to compute some cost function for each
residual, and minimize total cost, sum(cost(res))
.
However, computing a least squares fit is quick, easy and often good
enough.
10.2 Implementation
thinkstats2
provides simple functions that demonstrate
linear least squares:
def LeastSquares(xs, ys):
meanx, varx = MeanVar(xs)
meany = Mean(ys)
slope = Cov(xs, ys, meanx, meany) / varx
inter = meany - slope * meanx
return inter, slope
LeastSquares
takes sequences
xs
and ys
and returns the estimated parameters inter
and slope
.
For details on how it works, see
http://wikipedia.org/wiki/Numerical_methods_for_linear_least_squares.
thinkstats2
also provides FitLine
, which takes inter
and slope
and returns the fitted line for a sequence
of xs
.
def FitLine(xs, inter, slope):
fit_xs = np.sort(xs)
fit_ys = inter + slope * fit_xs
return fit_xs, fit_ys
We can use these functions to compute the least squares fit for birth weight as a function of mother’s age.
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
ages = live.agepreg
weights = live.totalwgt_lb
inter, slope = thinkstats2.LeastSquares(ages, weights)
fit_xs, fit_ys = thinkstats2.FitLine(ages, inter, slope)
The estimated intercept and slope are 6.8 lbs and 0.017 lbs per year. These values are hard to interpret in this form: the intercept is the expected weight of a baby whose mother is 0 years old, which doesn’t make sense in context, and the slope is too small to grasp easily.
Instead of presenting the intercept at x=0, it is often helpful to present the intercept at the mean of x. In this case the mean age is about 25 years and the mean baby weight for a 25 year old mother is 7.3 pounds. The slope is 0.27 ounces per year, or 0.17 pounds per decade.
Figure ?? shows a scatter plot of birth weight and age along with the fitted line. It’s a good idea to look at a figure like this to assess whether the relationship is linear and whether the fitted line seems like a good model of the relationship.
10.3 Residuals
Another useful test is to plot the residuals.
thinkstats2
provides a function that computes residuals:
def Residuals(xs, ys, inter, slope):
xs = np.asarray(xs)
ys = np.asarray(ys)
res = ys - (inter + slope * xs)
return res
Residuals
takes sequences xs
and ys
and
estimated parameters inter
and slope
. It returns
the differences between the actual values and the fitted line.
To visualize the residuals, I group respondents by age and compute percentiles in each group, as we saw in Section ??. Figure ?? shows the 25th, 50th and 75th percentiles of the residuals for each age group. The median is near zero, as expected, and the interquartile range is about 2 pounds. So if we know the mother’s age, we can guess the baby’s weight within a pound, about 50% of the time.
Ideally these lines should be flat, indicating that the residuals are random, and parallel, indicating that the variance of the residuals is the same for all age groups. In fact, the lines are close to parallel, so that’s good; but they have some curvature, indicating that the relationship is nonlinear. Nevertheless, the linear fit is a simple model that is probably good enough for some purposes.
10.4 Estimation
The parameters slope
and inter
are estimates based on a
sample; like other estimates, they are vulnerable to sampling bias,
measurement error, and sampling error. As discussed in
Chapter ??, sampling bias is caused by non-representative
sampling, measurement error is caused by errors in collecting
and recording data, and sampling error is the result of measuring a
sample rather than the entire population.
To assess sampling error, we ask, “If we run this experiment again, how much variability do we expect in the estimates?” We can answer this question by running simulated experiments and computing sampling distributions of the estimates.
I simulate the experiments by resampling the data; that is, I treat the observed pregnancies as if they were the entire population and draw samples, with replacement, from the observed sample.
def SamplingDistributions(live, iters=101):
t = []
for _ in range(iters):
sample = thinkstats2.ResampleRows(live)
ages = sample.agepreg
weights = sample.totalwgt_lb
estimates = thinkstats2.LeastSquares(ages, weights)
t.append(estimates)
inters, slopes = zip(*t)
return inters, slopes
SamplingDistributions
takes a DataFrame with one row per live
birth, and iters
, the number of experiments to simulate. It
uses ResampleRows
to resample the observed pregnancies. We’ve
already seen SampleRows
, which chooses random rows from a
DataFrame. thinkstats2
also provides ResampleRows
, which
returns a sample the same size as the original:
def ResampleRows(df):
return SampleRows(df, len(df), replace=True)
After resampling, we use the simulated sample to estimate parameters. The result is two sequences: the estimated intercepts and estimated slopes.
I summarize the sampling distributions by printing the standard error and confidence interval:
def Summarize(estimates, actual=None):
mean = thinkstats2.Mean(estimates)
stderr = thinkstats2.Std(estimates, mu=actual)
cdf = thinkstats2.Cdf(estimates)
ci = cdf.ConfidenceInterval(90)
print('mean, SE, CI', mean, stderr, ci)
Summarize
takes a sequence of estimates and the actual value.
It prints the mean of the estimates, the standard error and
a 90% confidence interval.
For the intercept, the mean estimate is 6.83, with standard error 0.07 and 90% confidence interval (6.71, 6.94). The estimated slope, in more compact form, is 0.0174, SE 0.0028, CI (0.0126, 0.0220). There is almost a factor of two between the low and high ends of this CI, so it should be considered a rough estimate.
To visualize the sampling error of the estimate, we could plot all of the fitted lines, or for a less cluttered representation, plot a 90% confidence interval for each age. Here’s the code:
def PlotConfidenceIntervals(xs, inters, slopes,
percent=90, **options):
fys_seq = []
for inter, slope in zip(inters, slopes):
fxs, fys = thinkstats2.FitLine(xs, inter, slope)
fys_seq.append(fys)
p = (100 - percent) / 2
percents = p, 100 - p
low, high = thinkstats2.PercentileRows(fys_seq, percents)
thinkplot.FillBetween(fxs, low, high, **options)
xs
is the sequence of mother’s age. inters
and slopes
are the estimated parameters generated by SamplingDistributions
.
percent
indicates which confidence interval to plot.
PlotConfidenceIntervals
generates a fitted line for each pair
of inter
and slope
and stores the results in a sequence,
fys_seq
. Then it uses PercentileRows
to select the
upper and lower percentiles of y
for each value of x
.
For a 90% confidence interval, it selects the 5th and 95th percentiles.
FillBetween
draws a polygon that fills the space between two
lines.
Figure ?? shows the 50% and 90% confidence intervals for curves fitted to birth weight as a function of mother’s age. The vertical width of the region represents the effect of sampling error; the effect is smaller for values near the mean and larger for the extremes.
10.5 Goodness of fit
There are several ways to measure the quality of a linear model, or goodness of fit. One of the simplest is the standard deviation of the residuals.
If you use a linear model to make predictions, Std(res)
is the root mean squared error (RMSE) of your predictions. For
example, if you use mother’s age to guess birth weight, the RMSE of
your guess would be 1.40 lbs.
If you guess birth weight without knowing the mother’s age, the RMSE
of your guess is Std(ys)
, which is 1.41 lbs. So in this
example, knowing a mother’s age does not improve the predictions
substantially.
Another way to measure goodness of fit is the coefficient of determination, usually denoted R2 and called “R-squared”:
def CoefDetermination(ys, res):
return 1 - Var(res) / Var(ys)
Var(res)
is the MSE of your guesses using the model,
Var(ys)
is the MSE without it. So their ratio is the fraction
of MSE that remains if you use the model, and R2 is the fraction
of MSE the model eliminates.
For birth weight and mother’s age, R2 is 0.0047, which means that mother’s age predicts about half of 1% of variance in birth weight.
There is a simple relationship between the coefficient of determination and Pearson’s coefficient of correlation: R2 = ρ2. For example, if ρ is 0.8 or -0.8, R2 = 0.64.
Although ρ and R2 are often used to quantify the strength of a
relationship, they are not easy to interpret in terms of predictive
power. In my opinion, Std(res)
is the best representation
of the quality of prediction, especially if it is presented
in relation to Std(ys)
.
For example, when people talk about the validity of the SAT (a standardized test used for college admission in the U.S.) they often talk about correlations between SAT scores and other measures of intelligence.
According to one study, there is a Pearson correlation of ρ=0.72 between total SAT scores and IQ scores, which sounds like a strong correlation. But R2 = ρ2 = 0.52, so SAT scores account for only 52% of variance in IQ.
IQ scores are normalized with Std(ys) = 15
, so
>>> var_ys = 15**2
>>> rho = 0.72
>>> r2 = rho**2
>>> var_res = (1 - r2) * var_ys
>>> std_res = math.sqrt(var_res)
10.4096
So using SAT score to predict IQ reduces RMSE from 15 points to 10.4 points. A correlation of 0.72 yields a reduction in RMSE of only 31%.
If you see a correlation that looks impressive, remember that R2 is a better indicator of reduction in MSE, and reduction in RMSE is a better indicator of predictive power.
10.6 Testing a linear model
The effect of mother’s age on birth weight is small, and has little predictive power. So is it possible that the apparent relationship is due to chance? There are several ways we might test the results of a linear fit.
One option is to test whether the apparent reduction in MSE is due to chance. In that case, the test statistic is R2 and the null hypothesis is that there is no relationship between the variables. We can simulate the null hypothesis by permutation, as in Section ??, when we tested the correlation between mother’s age and birth weight. In fact, because R2 = ρ2, a one-sided test of R2 is equivalent to a two-sided test of ρ. We’ve already done that test, and found p < 0.001, so we conclude that the apparent relationship between mother’s age and birth weight is statistically significant.
Another approach is to test whether the apparent slope is due to chance. The null hypothesis is that the slope is actually zero; in that case we can model the birth weights as random variations around their mean. Here’s a HypothesisTest for this model:
class SlopeTest(thinkstats2.HypothesisTest):
def TestStatistic(self, data):
ages, weights = data
_, slope = thinkstats2.LeastSquares(ages, weights)
return slope
def MakeModel(self):
_, weights = self.data
self.ybar = weights.mean()
self.res = weights - self.ybar
def RunModel(self):
ages, _ = self.data
weights = self.ybar + np.random.permutation(self.res)
return ages, weights
The data are represented as sequences of ages and weights. The
test statistic is the slope estimated by LeastSquares
.
The model of the null hypothesis is represented by the mean weight
of all babies and the deviations from the mean. To
generate simulated data, we permute the deviations and add them to
the mean.
Here’s the code that runs the hypothesis test:
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
ht = SlopeTest((live.agepreg, live.totalwgt_lb))
pvalue = ht.PValue()
The p-value is less than 0.001, so although the estimated slope is small, it is unlikely to be due to chance.
Estimating the p-value by simulating the null hypothesis is strictly correct, but there is a simpler alternative. Remember that we already computed the sampling distribution of the slope, in Section ??. To do that, we assumed that the observed slope was correct and simulated experiments by resampling.
Figure ?? shows the sampling distribution of the slope, from Section ??, and the distribution of slopes generated under the null hypothesis. The sampling distribution is centered about the estimated slope, 0.017 lbs/year, and the slopes under the null hypothesis are centered around 0; but other than that, the distributions are identical. The distributions are also symmetric, for reasons we will see in Section ??.
So we could estimate the p-value two ways:
- Compute the probability that the slope under the null hypothesis exceeds the observed slope.
- Compute the probability that the slope in the sampling distribution falls below 0. (If the estimated slope were negative, we would compute the probability that the slope in the sampling distribution exceeds 0.)
The second option is easier because we normally want to compute the sampling distribution of the parameters anyway. And it is a good approximation unless the sample size is small and the distribution of residuals is skewed. Even then, it is usually good enough, because p-values don’t have to be precise.
Here’s the code that estimates the p-value of the slope using the sampling distribution:
inters, slopes = SamplingDistributions(live, iters=1001)
slope_cdf = thinkstats2.Cdf(slopes)
pvalue = slope_cdf[0]
Again, we find p < 0.001.
10.7 Weighted resampling
So far we have treated the NSFG data as if it were a representative sample, but as I mentioned in Section ??, it is not. The survey deliberately oversamples several groups in order to improve the chance of getting statistically significant results; that is, in order to improve the power of tests involving these groups.
This survey design is useful for many purposes, but it means that we cannot use the sample to estimate values for the general population without accounting for the sampling process.
For each respondent, the NSFG data includes a variable called finalwgt
,
which is the number of people in the general population the respondent
represents. This value is called a sampling weight, or just “weight.”
As an example, if you survey 100,000 people in a country of 300 million, each respondent represents 3,000 people. If you oversample one group by a factor of 2, each person in the oversampled group would have a lower weight, about 1500.
To correct for oversampling, we can use resampling; that is, we can draw samples from the survey using probabilities proportional to sampling weights. Then, for any quantity we want to estimate, we can generate sampling distributions, standard errors, and confidence intervals. As an example, I will estimate mean birth weight with and without sampling weights.
In Section ??, we saw ResampleRows
, which chooses
rows from a DataFrame, giving each row the same probability.
Now we need to do the same thing using probabilities
proportional to sampling weights.
ResampleRowsWeighted
takes a DataFrame, resamples rows according
to the weights in finalwgt
, and returns a DataFrame containing
the resampled rows:
def ResampleRowsWeighted(df, column='finalwgt'):
weights = df[column]
cdf = Cdf(dict(weights))
indices = cdf.Sample(len(weights))
sample = df.loc[indices]
return sample
weights
is a Series; converting it to a dictionary makes
a map from the indices to the weights. In cdf
the values
are indices and the probabilities are proportional to the
weights.
indices
is a sequence of row indices; sample
is a
DataFrame that contains the selected rows. Since we sample with
replacement, the same row might appear more than once.
Now we can compare the effect of resampling with and without weights. Without weights, we generate the sampling distribution like this:
estimates = [ResampleRows(live).totalwgt_lb.mean()
for _ in range(iters)]
With weights, it looks like this:
estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()
for _ in range(iters)]
The following table summarizes the results:
mean birth | standard | 90% CI | |
weight (lbs) | error | ||
Unweighted | 7.27 | 0.014 | (7.24, 7.29) |
Weighted | 7.35 | 0.014 | (7.32, 7.37) |
In this example, the effect of weighting is small but non-negligible. The difference in estimated means, with and without weighting, is about 0.08 pounds, or 1.3 ounces. This difference is substantially larger than the standard error of the estimate, 0.014 pounds, which implies that the difference is not due to chance.
10.8 Exercises
A solution to this exercise is in chap10soln.ipynb
Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?
Like the NSFG, the BRFSS oversamples some groups and provides
a sampling weight for each respondent. In the BRFSS data, the variable
name for these weights is finalwt
.
Use resampling, with and without weights, to estimate the mean height
of respondents in the BRFSS, the standard error of the mean, and a
90% confidence interval. How much does correct weighting affect the
estimates?
10.9 Glossary
- linear fit: a line intended to model the relationship between variables.
- least squares fit: A model of a dataset that minimizes the sum of squares of the residuals.
- residual: The deviation of an actual value from a model.
- goodness of fit: A measure of how well a model fits data.
- coefficient of determination: A statistic intended to quantify goodness of fit.
- sampling weight: A value associated with an observation in a sample that indicates what part of the population it represents.