This HTML version of "Think Stats 2e" is provided for convenience, but it is not the best format for the book. In particular, some of the math symbols are not rendered correctly.
Chapter 10 Linear least squares
The code for this chapter is in
linear.py. For information
about downloading and working with this code, see Section 0.2.
10.1 Least squares fit
Correlation coefficients measure the strength and sign of a relationship, but not the slope. There are several ways to estimate the slope; the most common is a linear least squares fit. A “linear fit” is a line intended to model the relationship between variables. A “least squares” fit is one that minimizes the mean squared error (MSE) between the line and the data.
Suppose we have a sequence of points,
ys, that we want to
express as a function of another sequence
xs. If there is a
linear relationship between
ys with intercept
slope, we expect each
y[i] to be
inter + slope * x[i].
res = ys - (inter + slope * xs)
The residuals might be due to random factors like measurement error, or non-random factors that are unknown. For example, if we are trying to predict weight as a function of height, unknown factors might include diet, exercise, and body type.
We might try to minimize the absolute value of the
residuals, or their squares, or their cubes; but the most common
choice is to minimize the sum of squared residuals,
Why? There are three good reasons and one less important one:
- Squaring has the feature of treating positive and negative residuals the same, which is usually what we want.
- Squaring gives more weight to large residuals, but not so much weight that the largest residual always dominates.
- If the residuals are uncorrelated and normally distributed with
mean 0 and constant (but unknown) variance, then the least squares
fit is also the maximum likelihood estimator of
slope. See https://en.wikipedia.org/wiki/Linear_regression.
- The values of
slopethat minimize the squared residuals can be computed efficiently.
The last reason made sense when computational efficiency was more important than choosing the method most appropriate to the problem at hand. That’s no longer the case, so it is worth considering whether squared residuals are the right thing to minimize.
For example, if you are using
xs to predict values of
guessing too high might be better (or worse) than guessing too low.
In that case you might want to compute some cost function for each
residual, and minimize total cost,
However, computing a least squares fit is quick, easy and often good
def LeastSquares(xs, ys): meanx, varx = MeanVar(xs) meany = Mean(ys) slope = Cov(xs, ys, meanx, meany) / varx inter = meany - slope * meanx return inter, slope
LeastSquares takes sequences
ys and returns the estimated parameters
For details on how it works, see
def FitLine(xs, inter, slope): fit_xs = np.sort(xs) fit_ys = inter + slope * fit_xs return fit_xs, fit_ys
live, firsts, others = first.MakeFrames() live = live.dropna(subset=['agepreg', 'totalwgt_lb']) ages = live.agepreg weights = live.totalwgt_lb inter, slope = thinkstats2.LeastSquares(ages, weights) fit_xs, fit_ys = thinkstats2.FitLine(ages, inter, slope)
The estimated intercept and slope are 6.8 lbs and 0.017 lbs per year. These values are hard to interpret in this form: the intercept is the expected weight of a baby whose mother is 0 years old, which doesn’t make sense in context, and the slope is too small to grasp easily.
Instead of presenting the intercept at x=0, it is often helpful to present the intercept at the mean of x. In this case the mean age is about 25 years and the mean baby weight for a 25 year old mother is 7.3 pounds. The slope is 0.27 ounces per year, or 0.17 pounds per decade.
Figure 10.1 shows a scatter plot of birth weight and age along with the fitted line. It’s a good idea to look at a figure like this to assess whether the relationship is linear and whether the fitted line seems like a good model of the relationship.
def Residuals(xs, ys, inter, slope): xs = np.asarray(xs) ys = np.asarray(ys) res = ys - (inter + slope * xs) return res
Residuals takes sequences
slope. It returns
the differences between the actual values and the fitted line.
To visualize the residuals, I group respondents by age and compute percentiles in each group, as we saw in Section 7.2. Figure 10.2 shows the 25th, 50th and 75th percentiles of the residuals for each age group. The median is near zero, as expected, and the interquartile range is about 2 pounds. So if we know the mother’s age, we can guess the baby’s weight within a pound, about 50% of the time.
Ideally these lines should be flat, indicating that the residuals are random, and parallel, indicating that the variance of the residuals is the same for all age groups. In fact, the lines are close to parallel, so that’s good; but they have some curvature, indicating that the relationship is nonlinear. Nevertheless, the linear fit is a simple model that is probably good enough for some purposes.
inter are estimates based on a
sample; like other estimates, they are vulnerable to sampling bias,
measurement error, and sampling error. As discussed in
Chapter 8, sampling bias is caused by non-representative
sampling, measurement error is caused by errors in collecting
and recording data, and sampling error is the result of measuring a
sample rather than the entire population.
To assess sampling error, we ask, “If we run this experiment again, how much variability do we expect in the estimates?” We can answer this question by running simulated experiments and computing sampling distributions of the estimates.
def SamplingDistributions(live, iters=101): t =  for _ in range(iters): sample = thinkstats2.ResampleRows(live) ages = sample.agepreg weights = sample.totalwgt_lb estimates = thinkstats2.LeastSquares(ages, weights) t.append(estimates) inters, slopes = zip(*t) return inters, slopes
SamplingDistributions takes a DataFrame with one row per live
iters, the number of experiments to simulate. It
ResampleRows to resample the observed pregnancies. We’ve
SampleRows, which chooses random rows from a
thinkstats2 also provides
returns a sample the same size as the original:
def ResampleRows(df): return SampleRows(df, len(df), replace=True)
def Summarize(estimates, actual=None): mean = thinkstats2.Mean(estimates) stderr = thinkstats2.Std(estimates, mu=actual) cdf = thinkstats2.Cdf(estimates) ci = cdf.ConfidenceInterval(90) print('mean, SE, CI', mean, stderr, ci)
For the intercept, the mean estimate is 6.83, with standard error 0.07 and 90% confidence interval (6.71, 6.94). The estimated slope, in more compact form, is 0.0174, SE 0.0028, CI (0.0126, 0.0220). There is almost a factor of two between the low and high ends of this CI, so it should be considered a rough estimate.
To visualize the sampling error of the estimate, we could plot all of the fitted lines, or for a less cluttered representation, plot a 90% confidence interval for each age. Here’s the code:
def PlotConfidenceIntervals(xs, inters, slopes, percent=90, **options): fys_seq =  for inter, slope in zip(inters, slopes): fxs, fys = thinkstats2.FitLine(xs, inter, slope) fys_seq.append(fys) p = (100 - percent) / 2 percents = p, 100 - p low, high = thinkstats2.PercentileRows(fys_seq, percents) thinkplot.FillBetween(fxs, low, high, **options)
xs is the sequence of mother’s age.
are the estimated parameters generated by
percent indicates which confidence interval to plot.
PlotConfidenceIntervals generates a fitted line for each pair
slope and stores the results in a sequence,
fys_seq. Then it uses
PercentileRows to select the
upper and lower percentiles of
y for each value of
For a 90% confidence interval, it selects the 5th and 95th percentiles.
FillBetween draws a polygon that fills the space between two
Figure 10.3 shows the 50% and 90% confidence intervals for curves fitted to birth weight as a function of mother’s age. The vertical width of the region represents the effect of sampling error; the effect is smaller for values near the mean and larger for the extremes.
10.5 Goodness of fit
If you use a linear model to make predictions,
is the root mean squared error (RMSE) of your predictions. For
example, if you use mother’s age to guess birth weight, the RMSE of
your guess would be 1.40 lbs.
If you guess birth weight without knowing the mother’s age, the RMSE
of your guess is
Std(ys), which is 1.41 lbs. So in this
example, knowing a mother’s age does not improve the predictions
def CoefDetermination(ys, res): return 1 - Var(res) / Var(ys)
Var(res) is the MSE of your guesses using the model,
Var(ys) is the MSE without it. So their ratio is the fraction
of MSE that remains if you use the model, and R2 is the fraction
of MSE the model eliminates.
For birth weight and mother’s age, R2 is 0.0047, which means that mother’s age predicts about half of 1% of variance in birth weight.
Although ρ and R2 are often used to quantify the strength of a
relationship, they are not easy to interpret in terms of predictive
power. In my opinion,
Std(res) is the best representation
of the quality of prediction, especially if it is presented
in relation to
For example, when people talk about the validity of the SAT (a standardized test used for college admission in the U.S.) they often talk about correlations between SAT scores and other measures of intelligence.
According to one study, there is a Pearson correlation of ρ=0.72 between total SAT scores and IQ scores, which sounds like a strong correlation. But R2 = ρ2 = 0.52, so SAT scores account for only 52% of variance in IQ.
IQ scores are normalized with
Std(ys) = 15, so
>>> var_ys = 15**2 >>> rho = 0.72 >>> r2 = rho**2 >>> var_res = (1 - r2) * var_ys >>> std_res = math.sqrt(var_res) 10.4096
So using SAT score to predict IQ reduces RMSE from 15 points to 10.4 points. A correlation of 0.72 yields a reduction in RMSE of only 31%.
10.6 Testing a linear model
The effect of mother’s age on birth weight is small, and has little predictive power. So is it possible that the apparent relationship is due to chance? There are several ways we might test the results of a linear fit.
One option is to test whether the apparent reduction in MSE is due to chance. In that case, the test statistic is R2 and the null hypothesis is that there is no relationship between the variables. We can simulate the null hypothesis by permutation, as in Section 9.5, when we tested the correlation between mother’s age and birth weight. In fact, because R2 = ρ2, a one-sided test of R2 is equivalent to a two-sided test of ρ. We’ve already done that test, and found p < 0.001, so we conclude that the apparent relationship between mother’s age and birth weight is statistically significant.
Another approach is to test whether the apparent slope is due to chance. The null hypothesis is that the slope is actually zero; in that case we can model the birth weights as random variations around their mean. Here’s a HypothesisTest for this model:
class SlopeTest(thinkstats2.HypothesisTest): def TestStatistic(self, data): ages, weights = data _, slope = thinkstats2.LeastSquares(ages, weights) return slope def MakeModel(self): _, weights = self.data self.ybar = weights.mean() self.res = weights - self.ybar def RunModel(self): ages, _ = self.data weights = self.ybar + np.random.permutation(self.res) return ages, weights
The data are represented as sequences of ages and weights. The
test statistic is the slope estimated by
The model of the null hypothesis is represented by the mean weight
of all babies and the deviations from the mean. To
generate simulated data, we permute the deviations and add them to
Here’s the code that runs the hypothesis test:
live, firsts, others = first.MakeFrames() live = live.dropna(subset=['agepreg', 'totalwgt_lb']) ht = SlopeTest((live.agepreg, live.totalwgt_lb)) pvalue = ht.PValue()
Estimating the p-value by simulating the null hypothesis is strictly correct, but there is a simpler alternative. Remember that we already computed the sampling distribution of the slope, in Section 10.4. To do that, we assumed that the observed slope was correct and simulated experiments by resampling.
Figure 10.4 shows the sampling distribution of the slope, from Section 10.4, and the distribution of slopes generated under the null hypothesis. The sampling distribution is centered about the estimated slope, 0.017 lbs/year, and the slopes under the null hypothesis are centered around 0; but other than that, the distributions are identical. The distributions are also symmetric, for reasons we will see in Section 14.4.
- Compute the probability that the slope under the null hypothesis exceeds the observed slope.
- Compute the probability that the slope in the sampling distribution falls below 0. (If the estimated slope were negative, we would compute the probability that the slope in the sampling distribution exceeds 0.)
The second option is easier because we normally want to compute the sampling distribution of the parameters anyway. And it is a good approximation unless the sample size is small and the distribution of residuals is skewed. Even then, it is usually good enough, because p-values don’t have to be precise.
inters, slopes = SamplingDistributions(live, iters=1001) slope_cdf = thinkstats2.Cdf(slopes) pvalue = slope_cdf
Again, we find p < 0.001.
10.7 Weighted resampling
So far we have treated the NSFG data as if it were a representative sample, but as I mentioned in Section 1.2, it is not. The survey deliberately oversamples several groups in order to improve the chance of getting statistically significant results; that is, in order to improve the power of tests involving these groups.
This survey design is useful for many purposes, but it means that we cannot use the sample to estimate values for the general population without accounting for the sampling process.
For each respondent, the NSFG data includes a variable called
which is the number of people in the general population the respondent
represents. This value is called a sampling weight, or just “weight.”
As an example, if you survey 100,000 people in a country of 300 million, each respondent represents 3,000 people. If you oversample one group by a factor of 2, each person in the oversampled group would have a lower weight, about 1500.
To correct for oversampling, we can use resampling; that is, we can draw samples from the survey using probabilities proportional to sampling weights. Then, for any quantity we want to estimate, we can generate sampling distributions, standard errors, and confidence intervals. As an example, I will estimate mean birth weight with and without sampling weights.
In Section 10.4, we saw
ResampleRows, which chooses
rows from a DataFrame, giving each row the same probability.
Now we need to do the same thing using probabilities
proportional to sampling weights.
ResampleRowsWeighted takes a DataFrame, resamples rows according
to the weights in
finalwgt, and returns a DataFrame containing
the resampled rows:
def ResampleRowsWeighted(df, column='finalwgt'): weights = df[column] cdf = Cdf(dict(weights)) indices = cdf.Sample(len(weights)) sample = df.loc[indices] return sample
weights is a Series; converting it to a dictionary makes
a map from the indices to the weights. In
cdf the values
are indices and the probabilities are proportional to the
estimates = [ResampleRows(live).totalwgt_lb.mean() for _ in range(iters)]
With weights, it looks like this:
estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean() for _ in range(iters)]
The following table summarizes the results:
|mean birth||standard||90% CI|
In this example, the effect of weighting is small but non-negligible. The difference in estimated means, with and without weighting, is about 0.08 pounds, or 1.3 ounces. This difference is substantially larger than the standard error of the estimate, 0.014 pounds, which implies that the difference is not due to chance.
A solution to this exercise is in
Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?
Like the NSFG, the BRFSS oversamples some groups and provides
a sampling weight for each respondent. In the BRFSS data, the variable
name for these weights is
Use resampling, with and without weights, to estimate the mean height
of respondents in the BRFSS, the standard error of the mean, and a
90% confidence interval. How much does correct weighting affect the
- linear fit: a line intended to model the relationship between variables.
- least squares fit: A model of a dataset that minimizes the sum of squares of the residuals.
- residual: The deviation of an actual value from a model.
- goodness of fit: A measure of how well a model fits data.
- coefficient of determination: A statistic intended to quantify goodness of fit.
- sampling weight: A value associated with an observation in a sample that indicates what part of the population it represents.