This HTML version of is provided for convenience, but it is not the best format for the book. In particular, some of the symbols are not rendered correctly.
Chapter 10 Linear least squares
The code for this chapter is in linear.py. For information about downloading and working with this code, see Section 0.2.
10.1 Least squares fit
Correlation coefficients measure the strength and sign of a relationship, but not the slope. There are several ways to estimate the slope; the most common is a linear least squares fit. A “linear fit” is a line intended to model the relationship between variables. A “least squares” fit is one that minimizes the mean squared error (MSE) between the line and the data.
Suppose we have a sequence of points, ys, that we want to express as a function of another sequence xs. If there is a linear relationship between xs and ys with intercept inter and slope slope, we expect each y[i] to be inter + slope * x[i].
res = ys - (inter + slope * xs)
The residuals might be due to random factors like measurement error, or non-random factors that are unknown. For example, if we are trying to predict weight as a function of height, unknown factors might include diet, exercise, and body type.
We might try to minimize the absolute value of the residuals, or their squares, or their cubes; but the most common choice is to minimize the sum of squared residuals, sum(res**2)).
Why? There are three good reasons and one less important one:
The last reason made sense when computational efficiency was more important than choosing the method most appropriate to the problem at hand. That’s no longer the case, so it is worth considering whether squared residuals are the right thing to minimize.
For example, if you are using xs to predict values of ys, guessing too high might be better (or worse) than guessing too low. In that case you might want to compute some cost function for each residual, and minimize total cost, sum(cost(res)). However, computing a least squares fit is quick, easy and often good enough.
def LeastSquares(xs, ys): meanx, varx = MeanVar(xs) meany = Mean(ys) slope = Cov(xs, ys, meanx, meany) / varx inter = meany - slope * meanx return inter, slope
LeastSquares takes sequences xs and ys and returns the estimated parameters inter and slope. For details on how it works, see http://wikipedia.org/wiki/Numerical_methods_for_linear_least_squares.
def FitLine(xs, inter, slope): fit_xs = np.sort(xs) fit_ys = inter + slope * fit_xs return fit_xs, fit_ys
live, firsts, others = first.MakeFrames() live = live.dropna(subset=['agepreg', 'totalwgt_lb']) ages = live.agepreg weights = live.totalwgt_lb inter, slope = thinkstats2.LeastSquares(ages, weights) fit_xs, fit_ys = thinkstats2.FitLine(ages, inter, slope)
The estimated intercept and slope are 6.8 lbs and 0.017 lbs per year. These values are hard to interpret in this form: the intercept is the expected weight of a baby whose mother is 0 years old, which doesn’t make sense in context, and the slope is too small to grasp easily.
Instead of presenting the intercept at x=0, it is often helpful to present the intercept at the mean of x. In this case the mean age is about 25 years and the mean baby weight for a 25 year old mother is 7.3 pounds. The slope is 0.27 ounces per year, or 0.17 pounds per decade.
Figure 10.1 shows a scatter plot of birth weight and age along with the fitted line. It’s a good idea to look at a figure like this to assess whether the relationship is linear and whether the fitted line seems like a good model of the relationship.
def Residuals(xs, ys, inter, slope): xs = np.asarray(xs) ys = np.asarray(ys) res = ys - (inter + slope * xs) return res
Residuals takes sequences xs and ys and estimated parameters inter and slope. It returns the differences between the actual values and the fitted line.
To visualize the residuals, I group respondents by age and compute percentiles in each group, as we saw in Section 7.2. Figure 10.2 shows the 25th, 50th and 75th percentiles of the residuals for each age group. The median is near zero, as expected, and the interquartile range is about 2 pounds. So if we know the mother’s age, we can guess the baby’s weight within a pound, about 50% of the time.
Ideally these lines should be flat, indicating that the residuals are random, and parallel, indicating that the variance of the residuals is the same for all age groups. In fact, the lines are close to parallel, so that’s good; but they have some curvature, indicating that the relationship is nonlinear. Nevertheless, the linear fit is a simple model that is probably good enough for some purposes.
The parameters slope and inter are estimates based on a sample; like other estimates, they are vulnerable to sampling bias, measurement error, and sampling error. As discussed in Chapter 8, sampling bias is caused by non-representative sampling, measurement error is caused by errors in collecting and recording data, and sampling error is the result of measuring a sample rather than the entire population.
To assess sampling error, we ask, “If we run this experiment again, how much variability do we expect in the estimates?” We can answer this question by running simulated experiments and computing sampling distributions of the estimates.
def SamplingDistributions(live, iters=101): t =  for _ in range(iters): sample = thinkstats2.ResampleRows(live) ages = sample.agepreg weights = sample.totalwgt_lb estimates = thinkstats2.LeastSquares(ages, weights) t.append(estimates) inters, slopes = zip(*t) return inters, slopes
SamplingDistributions takes a DataFrame with one row per live birth, and iters, the number of experiments to simulate. It uses ResampleRows to resample the observed pregnancies. We’ve already seen SampleRows, which chooses random rows from a DataFrame. thinkstats2 also provides ResampleRows, which returns a sample the same size as the original:
def ResampleRows(df): return SampleRows(df, len(df), replace=True)
def Summarize(estimates, actual=None): mean = thinkstats2.Mean(estimates) stderr = thinkstats2.Std(estimates, mu=actual) cdf = thinkstats2.Cdf(estimates) ci = cdf.ConfidenceInterval(90) print('mean, SE, CI', mean, stderr, ci)
For the intercept, the mean estimate is 6.83, with standard error 0.07 and 90% confidence interval (6.71, 6.94). The estimated slope, in more compact form, is 0.0174, SE 0.0028, CI (0.0126, 0.0220). There is almost a factor of two between the low and high ends of this CI, so it should be considered a rough estimate.
To visualize the sampling error of the estimate, we could plot all of the fitted lines, or for a less cluttered representation, plot a 90% confidence interval for each age. Here’s the code:
def PlotConfidenceIntervals(xs, inters, slopes, percent=90, **options): fys_seq =  for inter, slope in zip(inters, slopes): fxs, fys = thinkstats2.FitLine(xs, inter, slope) fys_seq.append(fys) p = (100 - percent) / 2 percents = p, 100 - p low, high = thinkstats2.PercentileRows(fys_seq, percents) thinkplot.FillBetween(fxs, low, high, **options)
xs is the sequence of mother’s age. inters and slopes are the estimated parameters generated by SamplingDistributions. percent indicates which confidence interval to plot.
PlotConfidenceIntervals generates a fitted line for each pair
of inter and slope and stores the results in a sequence,
Figure 10.3 shows the 50% and 90% confidence intervals for curves fitted to birth weight as a function of mother’s age. The vertical width of the region represents the effect of sampling error; the effect is smaller for values near the mean and larger for the extremes.
10.5 Goodness of fit
If you use a linear model to make predictions, Std(res) is the root mean squared error (RMSE) of your predictions. For example, if you use mother’s age to guess birth weight, the RMSE of your guess would be 1.40 lbs.
If you guess birth weight without knowing the mother’s age, the RMSE of your guess is Std(ys), which is 1.41 lbs. So in this example, knowing a mother’s age does not improve the predictions substantially.
def CoefDetermination(ys, res): return 1 - Var(res) / Var(ys)
Var(res) is the MSE of your guesses using the model, Var(ys) is the MSE without it. So their ratio is the fraction of MSE that remains if you use the model, and R2 is the fraction of MSE the model eliminates.
For birth weight and mother’s age, R2 is 0.0047, which means that mother’s age predicts about half of 1% of variance in birth weight.
Although ρ and R2 are often used to quantify the strength of a relationship, they are not easy to interpret in terms of predictive power. In my opinion, Std(res) is the best representation of the quality of prediction, especially if it is presented in relation to Std(ys).
For example, when people talk about the validity of the SAT (a standardized test used for college admission in the U.S.) they often talk about correlations between SAT scores and other measures of intelligence.
According to one study, there is a Pearson correlation of ρ=0.72 between total SAT scores and IQ scores, which sounds like a strong correlation. But R2 = ρ2 = 0.52, so SAT scores account for only 52% of variance in IQ.
IQ scores are normalized with Std(ys) = 15, so
>>> var_ys = 15**2 >>> rho = 0.72 >>> r2 = rho**2 >>> var_res = (1 - r2) * var_ys >>> std_res = math.sqrt(var_res) 10.4096
So using SAT score to predict IQ reduces RMSE from 15 points to 10.4 points. A correlation of 0.72 yields a reduction in RMSE of only 31%.
10.6 Testing a linear model
The effect of mother’s age on birth weight is small, and has little predictive power. So is it possible that the apparent relationship is due to chance? There are several ways we might test the results of a linear fit.
One option is to test whether the apparent reduction in MSE is due to chance. In that case, the test statistic is R2 and the null hypothesis is that there is no relationship between the variables. We can simulate the null hypothesis by permutation, as in Section 9.5, when we tested the correlation between mother’s age and birth weight. In fact, because R2 = ρ2, a one-sided test of R2 is equivalent to a two-sided test of ρ. We’ve already done that test, and found p < 0.001, so we conclude that the apparent relationship between mother’s age and birth weight is statistically significant.
Another approach is to test whether the apparent slope is due to chance. The null hypothesis is that the slope is actually zero; in that case we can model the birth weights as random variations around their mean. Here’s a HypothesisTest for this model:
class SlopeTest(thinkstats2.HypothesisTest): def TestStatistic(self, data): ages, weights = data _, slope = thinkstats2.LeastSquares(ages, weights) return slope def MakeModel(self): _, weights = self.data self.ybar = weights.mean() self.res = weights - self.ybar def RunModel(self): ages, _ = self.data weights = self.ybar + np.random.permutation(self.res) return ages, weights
The data are represented as sequences of ages and weights. The test statistic is the slope estimated by LeastSquares. The model of the null hypothesis is represented by the mean weight of all babies and the deviations from the mean. To generate simulated data, we permute the deviations and add them to the mean.
Here’s the code that runs the hypothesis test:
live, firsts, others = first.MakeFrames() live = live.dropna(subset=['agepreg', 'totalwgt_lb']) ht = SlopeTest((live.agepreg, live.totalwgt_lb)) pvalue = ht.PValue()
Estimating the p-value by simulating the null hypothesis is strictly correct, but there is a simpler alternative. Remember that we already computed the sampling distribution of the slope, in Section 10.4. To do that, we assumed that the observed slope was correct and simulated experiments by resampling.
Figure 10.4 shows the sampling distribution of the slope, from Section 10.4, and the distribution of slopes generated under the null hypothesis. The sampling distribution is centered about the estimated slope, 0.017 lbs/year, and the slopes under the null hypothesis are centered around 0; but other than that, the distributions are identical. The distributions are also symmetric, for reasons we will see in Section 14.4.
The second option is easier because we normally want to compute the sampling distribution of the parameters anyway. And it is a good approximation unless the sample size is small and the distribution of residuals is skewed. Even then, it is usually good enough, because p-values don’t have to be precise.
inters, slopes = SamplingDistributions(live, iters=1001) slope_cdf = thinkstats2.Cdf(slopes) pvalue = slope_cdf
Again, we find p < 0.001.
10.7 Weighted resampling
So far we have treated the NSFG data as if it were a representative sample, but as I mentioned in Section 1.2, it is not. The survey deliberately oversamples several groups in order to improve the chance of getting statistically significant results; that is, in order to improve the power of tests involving these groups.
This survey design is useful for many purposes, but it means that we cannot use the sample to estimate values for the general population without accounting for the sampling process.
For each respondent, the NSFG data includes a variable called finalwgt, which is the number of people in the general population the respondent represents. This value is called a sampling weight, or just “weight.”
As an example, if you survey 100,000 people in a country of 300 million, each respondent represents 3,000 people. If you oversample one group by a factor of 2, each person in the oversampled group would have a lower weight, about 1500.
To correct for oversampling, we can use resampling; that is, we can draw samples from the survey using probabilities proportional to sampling weights. Then, for any quantity we want to estimate, we can generate sampling distributions, standard errors, and confidence intervals. As an example, I will estimate mean birth weight with and without sampling weights.
In Section 10.4, we saw ResampleRows, which chooses rows from a DataFrame, giving each row the same probability. Now we need to do the same thing using probabilities proportional to sampling weights. ResampleRowsWeighted takes a DataFrame, resamples rows according to the weights in finalwgt, and returns a DataFrame containing the resampled rows:
def ResampleRowsWeighted(df, column='finalwgt'): weights = df[column] cdf = Cdf(dict(weights)) indices = cdf.Sample(len(weights)) sample = df.loc[indices] return sample
weights is a Series; converting it to a dictionary makes a map from the indices to the weights. In cdf the values are indices and the probabilities are proportional to the weights.
estimates = [ResampleRows(live).totalwgt_lb.mean() for _ in range(iters)]
With weights, it looks like this:
estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean() for _ in range(iters)]
The following table summarizes the results:
In this example, the effect of weighting is small but non-negligible. The difference in estimated means, with and without weighting, is about 0.08 pounds, or 1.3 ounces. This difference is substantially larger than the standard error of the estimate, 0.014 pounds, which implies that the difference is not due to chance.
A solution to this exercise is in
Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?
Like the NSFG, the BRFSS oversamples some groups and provides a sampling weight for each respondent. In the BRFSS data, the variable name for these weights is finalwt. Use resampling, with and without weights, to estimate the mean height of respondents in the BRFSS, the standard error of the mean, and a 90% confidence interval. How much does correct weighting affect the estimates?
Are you using one of our books in a class?We'd like to know about it. Please consider filling out this short survey.