This HTML version of is provided for convenience, but it is not the best format for the book. In particular, some of the symbols are not rendered correctly.
Chapter 11 Regression
The linear least squares fit in the previous chapter is an example of regression, which is the more general problem of fitting any kind of model to any kind of data. This use of the term “regression” is a historical accident; it is only indirectly related to the original meaning of the word.
The goal of regression analysis is to describe the relationship between one set of variables, called the dependent variables, and another set of variables, called independent or explanatory variables.
In the previous chapter we used mother’s age as an explanatory variable to predict birth weight as a dependent variable. When there is only one dependent and one explanatory variable, that’s simple regression. In this chapter, we move on to multiple regression, with more than one explanatory variable. If there is more than one dependent variable, that’s multivariate regression.
If the relationship between the dependent and explanatory variable is linear, that’s linear regression. For example, if the dependent variable is y and the explanatory variables are x1 and x2, we would write the following linear regression model:
Given a sequence of values for y and sequences for x1 and x2, we can find the parameters, β0, β1, and β2, that minimize the sum of ε2. This process is called ordinary least squares. The computation is similar to thinkstats2.LeastSquare, but generalized to deal with more than one explanatory variable. You can find the details at https://en.wikipedia.org/wiki/Ordinary_least_squares
The code for this chapter is in regression.py. For information about downloading and working with this code, see Section 0.2.
In the previous chapter I presented thinkstats2.LeastSquares, an implementation of simple linear regression intended to be easy to read. For multiple regression we’ll switch to StatsModels, a Python package that provides several forms of regression and other analyses. If you are using Anaconda, you already have StatsModels; otherwise you might have to install it.
import statsmodels.formula.api as smf live, firsts, others = first.MakeFrames() formula = 'totalwgt_lb ~ agepreg' model = smf.ols(formula, data=live) results = model.fit()
statsmodels provides two interfaces (APIs); the “formula”
API uses strings to identify the dependent and explanatory variables.
It uses a syntax called patsy; in this example, the
inter = results.params['Intercept'] slope = results.params['agepreg']
slope_pvalue = results.pvalues['agepreg']
The results object provides summary(), which represents the results in a readable format.
But it prints a lot of information that is not relevant (yet), so I use a simpler function called SummarizeResults. Here are the results of this model:
Intercept 6.83 (0) agepreg 0.0175 (5.72e-11) R^2 0.004738 Std(ys) 1.408 Std(res) 1.405
Std(ys) is the standard deviation of the dependent variable, which is the RMSE if you have to guess birth weights without the benefit of any explanatory variables. Std(res) is the standard deviation of the residuals, which is the RMSE if your guesses are informed by the mother’s age. As we have already seen, knowing the mother’s age provides no substantial improvement to the predictions.
11.2 Multiple regression
In Section 4.5 we saw that first babies tend to be lighter than others, and this effect is statistically significant. But it is a strange result because there is no obvious mechanism that would cause first babies to be lighter. So we might wonder whether this relationship is spurious.
With a few calculations we can check whether this explanation is plausible. Then we’ll use multiple regression to investigate more carefully. First, let’s see how big the difference in weight is:
diff_weight = firsts.totalwgt_lb.mean() - others.totalwgt_lb.mean()
First babies are 0.125 lbs lighter, or 2 ounces. And the difference in ages:
diff_age = firsts.agepreg.mean() - others.agepreg.mean()
results = smf.ols('totalwgt_lb ~ agepreg', data=live).fit() slope = results.params['agepreg']
The slope is 0.0175 pounds per year. If we multiply the slope by the difference in ages, we get the expected difference in birth weight for first babies and others, due to mother’s age:
slope * diff_age
The result is 0.063, just about half of the observed difference. So we conclude, tentatively, that the observed difference in birth weight can be partly explained by the difference in mother’s age.
live['isfirst'] = live.birthord == 1 formula = 'totalwgt_lb ~ isfirst' results = smf.ols(formula, data=live).fit()
Here are the results:
Intercept 7.33 (0) isfirst[T.True] -0.125 (2.55e-05) R^2 0.00196
Because isfirst is a boolean, ols treats it as a categorical variable, which means that the values fall into categories, like True and False, and should not be treated as numbers. The estimated parameter is the effect on birth weight when isfirst is true, so the result, -0.125 lbs, is the difference in birth weight between first babies and others.
The slope and the intercept are statistically significant, which means that they were unlikely to occur by chance, but the the R2 value for this model is small, which means that isfirst doesn’t account for a substantial part of the variation in birth weight.
The results are similar with agepreg:
Intercept 6.83 (0) agepreg 0.0175 (5.72e-11) R^2 0.004738
These models confirm results we have already seen. But now we
can fit a single model that includes both variables. With the
Intercept 6.91 (0) isfirst[T.True] -0.0698 (0.0253) agepreg 0.0154 (3.93e-08) R^2 0.005289
In the combined model, the parameter for isfirst is smaller by about half, which means that part of the apparent effect of isfirst is actually accounted for by agepreg. And the p-value for isfirst is about 2.5%, which is on the border of statistical significance.
11.3 Nonlinear relationships
Remembering that the contribution of agepreg might be nonlinear, we might consider adding a variable to capture more of this relationship. One option is to create a column, agepreg2, that contains the squares of the ages:
live['agepreg2'] = live.agepreg**2 formula = 'totalwgt_lb ~ isfirst + agepreg + agepreg2'
Now by estimating parameters for agepreg and agepreg2, we are effectively fitting a parabola:
Intercept 5.69 (1.38e-86) isfirst[T.True] -0.0504 (0.109) agepreg 0.112 (3.23e-07) agepreg2 -0.00185 (8.8e-06) R^2 0.007462
The parameter of agepreg2 is negative, so the parabola curves downward, which is consistent with the shape of the lines in Figure 10.2.
Using computed variables like agepreg2 is a common way to fit polynomials and other functions to data. This process is still considered linear regression, because the dependent variable is a linear function of the explanatory variables, regardless of whether some variables are nonlinear functions of others.
The following table summarizes the results of these regressions:
The columns in this table are the explanatory variables and the coefficient of determination, R2. Each entry is an estimated parameter and either a p-value in parentheses or an asterisk to indicate a p-value less that 0.001.
We conclude that the apparent difference in birth weight is explained, at least in part, by the difference in mother’s age. When we include mother’s age in the model, the effect of isfirst gets smaller, and the remaining effect might be due to chance.
In this example, mother’s age acts as a control variable; including agepreg in the model “controls for” the difference in age between first-time mothers and others, making it possible to isolate the effect (if any) of isfirst.
11.4 Data mining
So far we have used regression models for explanation; for example, in the previous section we discovered that an apparent difference in birth weight is actually due to a difference in mother’s age. But the R2 values of those models is very low, which means that they have little predictive power. In this section we’ll try to do better.
Suppose one of your co-workers is expecting a baby and there is an office pool to guess the baby’s birth weight (if you are not familiar with betting pools, see https://en.wikipedia.org/wiki/Betting_pool).
Now suppose that you really want to win the pool. What could you do to improve your chances? Well, the NSFG dataset includes 244 variables about each pregnancy and another 3087 variables about each respondent. Maybe some of those variables have predictive power. To find out which ones are most useful, why not try them all?
Testing the variables in the pregnancy table is easy, but in order to use the variables in the respondent table, we have to match up each pregnancy with a respondent. In theory we could iterate through the rows of the pregnancy table, use the caseid to find the corresponding respondent, and copy the values from the correspondent table into the pregnancy table. But that would be slow.
A better option is to recognize this process as a join operation as defined in SQL and other relational database languages (see https://en.wikipedia.org/wiki/Join_(SQL)). Join is implemented as a DataFrame method, so we can perform the operation like this:
live = live[live.prglngth>30] resp = chap01soln.ReadFemResp() resp.index = resp.caseid join = live.join(resp, on='caseid', rsuffix='_r')
The next line reads the respondent file. The result is a DataFrame with integer indices; in order to look up respondents efficiently, I replace resp.index with resp.caseid.
The join method is invoked on live, which is considered the “left” table, and passed resp, which is the “right” table. The keyword argument on indicates the variable used to match up rows from the two tables.
In this example some column names appear in both tables,
so we have to provide rsuffix, which is a string that will be
appended to the names of overlapping columns from the right table.
For example, both tables have a column named race that encodes
the race of the respondent. The result of the join contains two
columns named race and
t =  for name in join.columns: try: if join[name].var() < 1e-7: continue formula = 'totalwgt_lb ~ agepreg + ' + name model = smf.ols(formula, data=join) if model.nobs < len(join)/2: continue results = model.fit() except (ValueError, TypeError): continue t.append((results.rsquared, name))
I check that each explanatory variable has some variability; otherwise the results of the regression are unreliable. I also check the number of observations for each model. Variables that contain a large number of nans are not good candidates for prediction.
For most of these variables, we haven’t done any cleaning. Some of them are encoded in ways that don’t work very well for linear regression. As a result, we might overlook some variables that would be useful if they were cleaned properly. But maybe we will find some good candidates.
t.sort(reverse=True) for mse, name in t[:30]: print(name, mse)
The first useful predictive variable is babysex which indicates whether the baby is male or female. In the NSFG dataset, boys are about 0.3 lbs heavier. So, assuming that the sex of the baby is known, we can use it for prediction.
Next is race, which indicates whether the respondent is white, black, or other. As an explanatory variable, race can be problematic. In datasets like the NSFG, race is correlated with many other variables, including income and other socioeconomic factors. In a regression model, race acts as a proxy variable, so apparent correlations with race are often caused, at least in part, by other factors.
The next variable on the list is nbrnaliv, which indicates whether the pregnancy yielded multiple births. Twins and triplets tend to be smaller than other babies, so if we know whether our hypothetical co-worker is expecting twins, that would help.
Next on the list is paydu, which indicates whether the respondent owns her home. It is one of several income-related variables that turn out to be predictive. In datasets like the NSFG, income and wealth are correlated with just about everything. In this example, income is related to diet, health, health care, and other factors likely to affect birth weight.
Some of the other variables on the list are things that would not be known until later, like bfeedwks, the number of weeks the baby was breast fed. We can’t use these variables for prediction, but you might want to speculate on reasons bfeedwks might be correlated with birth weight.
Sometimes you start with a theory and use data to test it. Other times you start with data and go looking for possible theories. The second approach, which this section demonstrates, is called data mining. An advantage of data mining is that it can discover unexpected patterns. A hazard is that many of the patterns it discovers are either random or spurious.
formula = ('totalwgt_lb ~ agepreg + C(race) + babysex==1 + ' 'nbrnaliv>1 + paydu==1 + totincr') results = smf.ols(formula, data=join).fit()
Similarly nbrnaliv>1 is True for multiple births and paydu==1 is True for respondents who own their houses.
Here are the results of the model:
Intercept 6.63 (0) C(race)[T.2] 0.357 (5.43e-29) C(race)[T.3] 0.266 (2.33e-07) babysex == 1[T.True] 0.295 (5.39e-29) nbrnaliv > 1[T.True] -1.38 (5.1e-37) paydu == 1[T.True] 0.12 (0.000114) agepreg 0.00741 (0.0035) totincr 0.0122 (0.00188)
The estimated parameters for race are larger than I expected, especially since we control for income. The encoding is 1 for black, 2 for white, and 3 for other. Babies of black mothers are lighter than babies of other races by 0.27–0.36 lbs.
People who own their homes have heavier babies by about 0.12 lbs, even when we control for income. The parameter for mother’s age is smaller than what we saw in Section 11.2, which suggests that some of the other variables are correlated with age, probably including paydu and totincr.
All of these variables are statistically significant, some with very low p-values, but R2 is only 0.06, still quite small. RMSE without using the model is 1.27 lbs; with the model it drops to 1.23. So your chance of winning the pool is not substantially improved. Sorry!
11.6 Logistic regression
Linear regression can be generalized to handle other kinds of dependent variables. If the dependent variable is boolean, the generalized model is called logistic regression. If the dependent variable is an integer count, it’s called Poisson regression.
As an example of logistic regression, let’s consider a variation on the office pool scenario. Suppose a friend of yours is pregnant and you want to predict whether the baby is a boy or a girl. You could use data from the NSFG to find factors that affect the “sex ratio”, which is conventionally defined to be the probability of having a boy.
If you encode the dependent variable numerically, for example 0 for a girl and 1 for a boy, you could apply ordinary least squares, but there would be problems. The linear model might be something like this:
The problem with this approach is that it produces predictions that are hard to interpret. Given estimated parameters and values for x1 and x2, the model might predict y=0.5, but the only meaningful values of y are 0 and 1.
It is tempting to interpret a result like that as a probability; for example, we might say that a respondent with particular values of x1 and x2 has a 50% chance of having a boy. But it is also possible for this model to predict y=1.1 or y=−0.1, and those are not valid probabilities.
Logistic regression avoids this problem by expressing predictions in terms of odds rather than probabilities. If you are not familiar with odds, “odds in favor” of an event is the ratio of the probability it will occur to the probability that it will not.
So if I think my team has a 75% chance of winning, I would say that the odds in their favor are three to one, because the chance of winning is three times the chance of losing.
Odds and probabilities are different representations of the same information. Given a probability, you can compute the odds like this:
o = p / (1-p)
Given odds in favor, you can convert to probability like this:
p = o / (o+1)
Logistic regression is based on the following model:
Suppose we have estimated the parameters β0, β1, and β2 (I’ll explain how in a minute). And suppose we are given values for x1 and x2. We can compute the predicted value of logo, and then convert to a probability:
o = np.exp(log_o) p = o / (o+1)
11.7 Estimating parameters
>>> y = np.array([0, 1, 0, 1]) >>> x1 = np.array([0, 0, 0, 1]) >>> x2 = np.array([0, 1, 1, 1])
And we start with the initial guesses β0=−1.5, β1=2.8, and β2=1.1:
>>> beta = [-1.5, 2.8, 1.1]
Then for each row we can compute
>>> log_o = beta + beta * x1 + beta * x2 [-1.5 -0.4 -0.4 2.4]
>>> o = np.exp(log_o) [ 0.223 0.670 0.670 11.02 ] >>> p = o / (o+1) [ 0.182 0.401 0.401 0.916 ]
Notice that when
The likelihood of an outcome is p when y==1 and 1-p when y==0. For example, if we think the probability of a boy is 0.8 and the outcome is a boy, the likelihood is 0.8; if the outcome is a girl, the likelihood is 0.2. We can compute that like this:
>>> likes = y * p + (1-y) * (1-p) [ 0.817 0.401 0.598 0.916 ]
The overall likelihood of the data is the product of likes:
>>> like = np.prod(likes) 0.18
For these values of beta, the likelihood of the data is 0.18. The goal of logistic regression is to find parameters that maximize this likelihood. To do that, most statistics packages use an iterative solver like Newton’s method (see https://en.wikipedia.org/wiki/Logistic_regression#Model_fitting).
StatsModels provides an implementation of logistic regression called logit, named for the function that converts from probability to log odds. To demonstrate its use, I’ll look for variables that affect the sex ratio.
Again, I load the NSFG data and select pregnancies longer than 30 weeks:
live, firsts, others = first.MakeFrames() df = live[live.prglngth>30]
df['boy'] = (df.babysex==1).astype(int)
Factors that have been found to affect sex ratio include parents’ age, birth order, race, and social status. We can use logistic regression to see if these effects appear in the NSFG data. I’ll start with the mother’s age:
import statsmodels.formula.api as smf model = smf.logit('boy ~ agepreg', data=df) results = model.fit() SummarizeResults(results)
logit takes the same arguments as ols, a formula in Patsy syntax and a DataFrame. The result is a Logit object that represents the model. It contains attributes called endog and exog that contain the endogenous variable, another name for the dependent variable, and the exogenous variables, another name for the explanatory variables. Since they are NumPy arrays, it is sometimes convenient to convert them to DataFrames:
endog = pandas.DataFrame(model.endog, columns=[model.endog_names]) exog = pandas.DataFrame(model.exog, columns=model.exog_names)
The result of model.fit is a BinaryResults object, which is similar to the RegressionResults object we got from ols. Here is a summary of the results:
Intercept 0.00579 (0.953) agepreg 0.00105 (0.783) R^2 6.144e-06
The coefficient of determination, R2, does not apply to logistic regression, but there are several alternatives that are used as “pseudo R2 values.” These values can be useful for comparing models. For example, here’s a model that includes several factors believed to be associated with sex ratio:
formula = 'boy ~ agepreg + hpagelb + birthord + C(race)' model = smf.logit(formula, data=df) results = model.fit()
Intercept -0.0301 (0.772) C(race)[T.2] -0.0224 (0.66) C(race)[T.3] -0.000457 (0.996) agepreg -0.00267 (0.629) hpagelb 0.0047 (0.266) birthord 0.00501 (0.821) R^2 0.000144
In the NSFG data, there are more boys than girls, so the baseline strategy is to guess “boy” every time. The accuracy of this strategy is just the fraction of boys:
actual = endog['boy'] baseline = actual.mean()
Since actual is encoded in binary integers, the mean is the fraction of boys, which is 0.507.
Here’s how we compute the accuracy of the model:
predict = (results.predict() >= 0.5) true_pos = predict * actual true_neg = (1 - predict) * (1 - actual)
results.predict returns a NumPy array of probabilities, which we
round off to 0 or 1. Multiplying by actual
yields 1 if we predict a boy and get it right, 0 otherwise. So,
acc = (sum(true_pos) + sum(true_neg)) / len(actual)
The result is 0.512, slightly better than the baseline, 0.507. But, you should not take this result too seriously. We used the same data to build and test the model, so the model may not have predictive power on new data.
Nevertheless, let’s use the model to make a prediction for the office pool. Suppose your friend is 35 years old and white, her husband is 39, and they are expecting their third child:
columns = ['agepreg', 'hpagelb', 'birthord', 'race'] new = pandas.DataFrame([[35, 39, 3, 2]], columns=columns) y = results.predict(new)
To invoke results.predict for a new case, you have to construct a DataFrame with a column for each variable in the model. The result in this case is 0.52, so you should guess “boy.” But if the model improves your chances of winning, the difference is very small.
My solution to these exercises is in
Exercise 1 Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool.
Exercise 2 The Trivers-Willard hypothesis suggests that for many mammals the sex ratio depends on “maternal condition”; that is, factors like the mother’s age, size, health, and social status. See https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis
Some studies have shown this effect among humans, but results are mixed. In this chapter we tested some variables related to these factors, but didn’t find any with a statistically significant effect on sex ratio.
Exercise 3 If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called poisson. It works the same way as ols and logit. As an exercise, let’s use it to predict how many children a woman has born; in the NSFG dataset, this variable is called numbabes.
Suppose you meet a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000. How many children would you predict she has born?
Exercise 4 If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called mnlogit. As an exercise, let’s use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called rmarital.
Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000. What is the probability that she is married, cohabitating, etc?
Are you using one of our books in a class?We'd like to know about it. Please consider filling out this short survey.