This HTML version of "Think Stats 2e" is provided for convenience, but it is not the best format for the book. In particular, some of the math symbols are not rendered correctly.

You might prefer to read the PDF version, or you can buy a hard copy from Amazon.

Chapter 10 Linear least squares

The code for this chapter is in linear.py. For information about downloading and working with this code, see Section ??.

10.1 Least squares fit

Correlation coefficients measure the strength and sign of a relationship, but not the slope. There are several ways to estimate the slope; the most common is a linear least squares fit. A “linear fit” is a line intended to model the relationship between variables. A “least squares” fit is one that minimizes the mean squared error (MSE) between the line and the data.

Suppose we have a sequence of points, ys, that we want to express as a function of another sequence xs. If there is a linear relationship between xs and ys with intercept inter and slope slope, we expect each y[i] to be inter + slope * x[i].

But unless the correlation is perfect, this prediction is only approximate. The vertical deviation from the line, or residual, is

res = ys - (inter + slope * xs)

The residuals might be due to random factors like measurement error, or non-random factors that are unknown. For example, if we are trying to predict weight as a function of height, unknown factors might include diet, exercise, and body type.

If we get the parameters inter and slope wrong, the residuals get bigger, so it makes intuitive sense that the parameters we want are the ones that minimize the residuals.

We might try to minimize the absolute value of the residuals, or their squares, or their cubes; but the most common choice is to minimize the sum of squared residuals, sum(res**2).

Why? There are three good reasons and one less important one:

Squaring has the feature of treating positive and negative residuals the same, which is usually what we want.
Squaring gives more weight to large residuals, but not so much weight that the largest residual always dominates.
If the residuals are uncorrelated and normally distributed with mean 0 and constant (but unknown) variance, then the least squares fit is also the maximum likelihood estimator of inter and slope. See https://en.wikipedia.org/wiki/Linear_regression.
The values of inter and slope that minimize the squared residuals can be computed efficiently.

The last reason made sense when computational efficiency was more important than choosing the method most appropriate to the problem at hand. That’s no longer the case, so it is worth considering whether squared residuals are the right thing to minimize.

For example, if you are using xs to predict values of ys, guessing too high might be better (or worse) than guessing too low. In that case you might want to compute some cost function for each residual, and minimize total cost, sum(cost(res)). However, computing a least squares fit is quick, easy and often good enough.

10.2 Implementation

thinkstats2 provides simple functions that demonstrate linear least squares:

def LeastSquares(xs, ys):
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)

    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx

    return inter, slope

LeastSquares takes sequences xs and ys and returns the estimated parameters inter and slope. For details on how it works, see http://wikipedia.org/wiki/Numerical_methods_for_linear_least_squares.

thinkstats2 also provides FitLine, which takes inter and slope and returns the fitted line for a sequence of xs.

def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys

We can use these functions to compute the least squares fit for birth weight as a function of mother’s age.

    live, firsts, others = first.MakeFrames()
    live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
    ages = live.agepreg
    weights = live.totalwgt_lb

    inter, slope = thinkstats2.LeastSquares(ages, weights)
    fit_xs, fit_ys = thinkstats2.FitLine(ages, inter, slope)

The estimated intercept and slope are 6.8 lbs and 0.017 lbs per year. These values are hard to interpret in this form: the intercept is the expected weight of a baby whose mother is 0 years old, which doesn’t make sense in context, and the slope is too small to grasp easily.

Instead of presenting the intercept at x=0, it is often helpful to present the intercept at the mean of x. In this case the mean age is about 25 years and the mean baby weight for a 25 year old mother is 7.3 pounds. The slope is 0.27 ounces per year, or 0.17 pounds per decade.

Figure 10.1: Scatter plot of birth weight and mother’s age with a linear fit.

Figure ?? shows a scatter plot of birth weight and age along with the fitted line. It’s a good idea to look at a figure like this to assess whether the relationship is linear and whether the fitted line seems like a good model of the relationship.

10.3 Residuals

Another useful test is to plot the residuals. thinkstats2 provides a function that computes residuals:

def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res

Residuals takes sequences xs and ys and estimated parameters inter and slope. It returns the differences between the actual values and the fitted line.

Figure 10.2: Residuals of the linear fit.

To visualize the residuals, I group respondents by age and compute percentiles in each group, as we saw in Section ??. Figure ?? shows the 25th, 50th and 75th percentiles of the residuals for each age group. The median is near zero, as expected, and the interquartile range is about 2 pounds. So if we know the mother’s age, we can guess the baby’s weight within a pound, about 50% of the time.

Ideally these lines should be flat, indicating that the residuals are random, and parallel, indicating that the variance of the residuals is the same for all age groups. In fact, the lines are close to parallel, so that’s good; but they have some curvature, indicating that the relationship is nonlinear. Nevertheless, the linear fit is a simple model that is probably good enough for some purposes.

10.4 Estimation

The parameters slope and inter are estimates based on a sample; like other estimates, they are vulnerable to sampling bias, measurement error, and sampling error. As discussed in Chapter ??, sampling bias is caused by non-representative sampling, measurement error is caused by errors in collecting and recording data, and sampling error is the result of measuring a sample rather than the entire population.

To assess sampling error, we ask, “If we run this experiment again, how much variability do we expect in the estimates?” We can answer this question by running simulated experiments and computing sampling distributions of the estimates.

I simulate the experiments by resampling the data; that is, I treat the observed pregnancies as if they were the entire population and draw samples, with replacement, from the observed sample.

def SamplingDistributions(live, iters=101):
    t = []
    for _ in range(iters):
        sample = thinkstats2.ResampleRows(live)
        ages = sample.agepreg
        weights = sample.totalwgt_lb
        estimates = thinkstats2.LeastSquares(ages, weights)
        t.append(estimates)

    inters, slopes = zip(*t)
    return inters, slopes

SamplingDistributions takes a DataFrame with one row per live birth, and iters, the number of experiments to simulate. It uses ResampleRows to resample the observed pregnancies. We’ve already seen SampleRows, which chooses random rows from a DataFrame. thinkstats2 also provides ResampleRows, which returns a sample the same size as the original:

def ResampleRows(df):
    return SampleRows(df, len(df), replace=True)

After resampling, we use the simulated sample to estimate parameters. The result is two sequences: the estimated intercepts and estimated slopes.

I summarize the sampling distributions by printing the standard error and confidence interval:

def Summarize(estimates, actual=None):
    mean = thinkstats2.Mean(estimates)
    stderr = thinkstats2.Std(estimates, mu=actual)
    cdf = thinkstats2.Cdf(estimates)
    ci = cdf.ConfidenceInterval(90)
    print('mean, SE, CI', mean, stderr, ci)

Summarize takes a sequence of estimates and the actual value. It prints the mean of the estimates, the standard error and a 90% confidence interval.

For the intercept, the mean estimate is 6.83, with standard error 0.07 and 90% confidence interval (6.71, 6.94). The estimated slope, in more compact form, is 0.0174, SE 0.0028, CI (0.0126, 0.0220). There is almost a factor of two between the low and high ends of this CI, so it should be considered a rough estimate.

To visualize the sampling error of the estimate, we could plot all of the fitted lines, or for a less cluttered representation, plot a 90% confidence interval for each age. Here’s the code:

def PlotConfidenceIntervals(xs, inters, slopes,
                            percent=90, **options):
    fys_seq = []
    for inter, slope in zip(inters, slopes):
        fxs, fys = thinkstats2.FitLine(xs, inter, slope)
        fys_seq.append(fys)

    p = (100 - percent) / 2
    percents = p, 100 - p
    low, high = thinkstats2.PercentileRows(fys_seq, percents)
    thinkplot.FillBetween(fxs, low, high, **options)

xs is the sequence of mother’s age. inters and slopes are the estimated parameters generated by SamplingDistributions. percent indicates which confidence interval to plot.

PlotConfidenceIntervals generates a fitted line for each pair of inter and slope and stores the results in a sequence, fys_seq. Then it uses PercentileRows to select the upper and lower percentiles of y for each value of x. For a 90% confidence interval, it selects the 5th and 95th percentiles. FillBetween draws a polygon that fills the space between two lines.

Figure 10.3: 50% and 90% confidence intervals showing variability in the fitted line due to sampling error of inter and slope.

Figure ?? shows the 50% and 90% confidence intervals for curves fitted to birth weight as a function of mother’s age. The vertical width of the region represents the effect of sampling error; the effect is smaller for values near the mean and larger for the extremes.

10.5 Goodness of fit

There are several ways to measure the quality of a linear model, or goodness of fit. One of the simplest is the standard deviation of the residuals.

If you use a linear model to make predictions, Std(res) is the root mean squared error (RMSE) of your predictions. For example, if you use mother’s age to guess birth weight, the RMSE of your guess would be 1.40 lbs.

If you guess birth weight without knowing the mother’s age, the RMSE of your guess is Std(ys), which is 1.41 lbs. So in this example, knowing a mother’s age does not improve the predictions substantially.

Another way to measure goodness of fit is the coefficient of determination, usually denoted R² and called “R-squared”:

def CoefDetermination(ys, res):
    return 1 - Var(res) / Var(ys)

Var(res) is the MSE of your guesses using the model, Var(ys) is the MSE without it. So their ratio is the fraction of MSE that remains if you use the model, and R² is the fraction of MSE the model eliminates.

For birth weight and mother’s age, R² is 0.0047, which means that mother’s age predicts about half of 1% of variance in birth weight.

There is a simple relationship between the coefficient of determination and Pearson’s coefficient of correlation: R² = ρ². For example, if ρ is 0.8 or -0.8, R² = 0.64.

Although ρ and R² are often used to quantify the strength of a relationship, they are not easy to interpret in terms of predictive power. In my opinion, Std(res) is the best representation of the quality of prediction, especially if it is presented in relation to Std(ys).

For example, when people talk about the validity of the SAT (a standardized test used for college admission in the U.S.) they often talk about correlations between SAT scores and other measures of intelligence.

According to one study, there is a Pearson correlation of ρ=0.72 between total SAT scores and IQ scores, which sounds like a strong correlation. But R² = ρ² = 0.52, so SAT scores account for only 52% of variance in IQ.

IQ scores are normalized with Std(ys) = 15, so

>>> var_ys = 15**2
>>> rho = 0.72
>>> r2 = rho**2
>>> var_res = (1 - r2) * var_ys
>>> std_res = math.sqrt(var_res)
10.4096

So using SAT score to predict IQ reduces RMSE from 15 points to 10.4 points. A correlation of 0.72 yields a reduction in RMSE of only 31%.

If you see a correlation that looks impressive, remember that R² is a better indicator of reduction in MSE, and reduction in RMSE is a better indicator of predictive power.

10.6 Testing a linear model

The effect of mother’s age on birth weight is small, and has little predictive power. So is it possible that the apparent relationship is due to chance? There are several ways we might test the results of a linear fit.

One option is to test whether the apparent reduction in MSE is due to chance. In that case, the test statistic is R² and the null hypothesis is that there is no relationship between the variables. We can simulate the null hypothesis by permutation, as in Section ??, when we tested the correlation between mother’s age and birth weight. In fact, because R² = ρ², a one-sided test of R² is equivalent to a two-sided test of ρ. We’ve already done that test, and found p < 0.001, so we conclude that the apparent relationship between mother’s age and birth weight is statistically significant.

Another approach is to test whether the apparent slope is due to chance. The null hypothesis is that the slope is actually zero; in that case we can model the birth weights as random variations around their mean. Here’s a HypothesisTest for this model:

class SlopeTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        ages, weights = data
        _, slope = thinkstats2.LeastSquares(ages, weights)
        return slope

    def MakeModel(self):
        _, weights = self.data
        self.ybar = weights.mean()
        self.res = weights - self.ybar

    def RunModel(self):
        ages, _ = self.data
        weights = self.ybar + np.random.permutation(self.res)
        return ages, weights

The data are represented as sequences of ages and weights. The test statistic is the slope estimated by LeastSquares. The model of the null hypothesis is represented by the mean weight of all babies and the deviations from the mean. To generate simulated data, we permute the deviations and add them to the mean.

Here’s the code that runs the hypothesis test:

    live, firsts, others = first.MakeFrames()
    live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
    ht = SlopeTest((live.agepreg, live.totalwgt_lb))
    pvalue = ht.PValue()

The p-value is less than 0.001, so although the estimated slope is small, it is unlikely to be due to chance.

Estimating the p-value by simulating the null hypothesis is strictly correct, but there is a simpler alternative. Remember that we already computed the sampling distribution of the slope, in Section ??. To do that, we assumed that the observed slope was correct and simulated experiments by resampling.

Figure ?? shows the sampling distribution of the slope, from Section ??, and the distribution of slopes generated under the null hypothesis. The sampling distribution is centered about the estimated slope, 0.017 lbs/year, and the slopes under the null hypothesis are centered around 0; but other than that, the distributions are identical. The distributions are also symmetric, for reasons we will see in Section ??.

Figure 10.4: The sampling distribution of the estimated slope and the distribution of slopes generated under the null hypothesis. The vertical lines are at 0 and the observed slope, 0.017 lbs/year.

So we could estimate the p-value two ways:

Compute the probability that the slope under the null hypothesis exceeds the observed slope.
Compute the probability that the slope in the sampling distribution falls below 0. (If the estimated slope were negative, we would compute the probability that the slope in the sampling distribution exceeds 0.)

The second option is easier because we normally want to compute the sampling distribution of the parameters anyway. And it is a good approximation unless the sample size is small and the distribution of residuals is skewed. Even then, it is usually good enough, because p-values don’t have to be precise.

Here’s the code that estimates the p-value of the slope using the sampling distribution:

    inters, slopes = SamplingDistributions(live, iters=1001)
    slope_cdf = thinkstats2.Cdf(slopes)
    pvalue = slope_cdf[0]

Again, we find p < 0.001.

10.7 Weighted resampling

So far we have treated the NSFG data as if it were a representative sample, but as I mentioned in Section ??, it is not. The survey deliberately oversamples several groups in order to improve the chance of getting statistically significant results; that is, in order to improve the power of tests involving these groups.

This survey design is useful for many purposes, but it means that we cannot use the sample to estimate values for the general population without accounting for the sampling process.

For each respondent, the NSFG data includes a variable called finalwgt, which is the number of people in the general population the respondent represents. This value is called a sampling weight, or just “weight.”

As an example, if you survey 100,000 people in a country of 300 million, each respondent represents 3,000 people. If you oversample one group by a factor of 2, each person in the oversampled group would have a lower weight, about 1500.

To correct for oversampling, we can use resampling; that is, we can draw samples from the survey using probabilities proportional to sampling weights. Then, for any quantity we want to estimate, we can generate sampling distributions, standard errors, and confidence intervals. As an example, I will estimate mean birth weight with and without sampling weights.

In Section ??, we saw ResampleRows, which chooses rows from a DataFrame, giving each row the same probability. Now we need to do the same thing using probabilities proportional to sampling weights. ResampleRowsWeighted takes a DataFrame, resamples rows according to the weights in finalwgt, and returns a DataFrame containing the resampled rows:

def ResampleRowsWeighted(df, column='finalwgt'):
    weights = df[column]
    cdf = Cdf(dict(weights))
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample

weights is a Series; converting it to a dictionary makes a map from the indices to the weights. In cdf the values are indices and the probabilities are proportional to the weights.

indices is a sequence of row indices; sample is a DataFrame that contains the selected rows. Since we sample with replacement, the same row might appear more than once.

Now we can compare the effect of resampling with and without weights. Without weights, we generate the sampling distribution like this:

    estimates = [ResampleRows(live).totalwgt_lb.mean()
                 for _ in range(iters)]

With weights, it looks like this:

    estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()
                 for _ in range(iters)]

The following table summarizes the results:

	mean birth	standard	90% CI
	weight (lbs)	error
Unweighted	7.27	0.014	(7.24, 7.29)
Weighted	7.35	0.014	(7.32, 7.37)

In this example, the effect of weighting is small but non-negligible. The difference in estimated means, with and without weighting, is about 0.08 pounds, or 1.3 ounces. This difference is substantially larger than the standard error of the estimate, 0.014 pounds, which implies that the difference is not due to chance.

10.8 Exercises

A solution to this exercise is in chap10soln.ipynb

Exercise 1

Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?

Like the NSFG, the BRFSS oversamples some groups and provides a sampling weight for each respondent. In the BRFSS data, the variable name for these weights is finalwt. Use resampling, with and without weights, to estimate the mean height of respondents in the BRFSS, the standard error of the mean, and a 90% confidence interval. How much does correct weighting affect the estimates?

10.9 Glossary

linear fit: a line intended to model the relationship between variables.
least squares fit: A model of a dataset that minimizes the sum of squares of the residuals.
residual: The deviation of an actual value from a model.
goodness of fit: A measure of how well a model fits data.
coefficient of determination: A statistic intended to quantify goodness of fit.
sampling weight: A value associated with an observation in a sample that indicates what part of the population it represents.