Multiple Regression

Econ 358: Econometrics

Agenda pt. 1

1

Omitted Variable Bias

2

Multiple Regression Model

3

OLS Estimator in Multiple Regression

4

Measures of Fit

5

Causal Inference

Agenda pt. 2

1

Multicollinearity

2

Conditional Mean Independence

3

Wrap-Up

Reminder: Projects

Topic Memo

  • Present question, explain basic model, and provide data sources

    • The easiest path to a practical question and model is to meet with me
  • Memo must include three academic journal articles in addition to other sources

  • It should correctly use terminology from class

  • Description of model, and key variables with justification

  • Description of potential data sources

Summary of Single Regressor

  • Hypothesis testing for regression coefficients is analogous to hypothesis testing for the population mean: use the t-statistic to calculate the p-value and either reject or fail to reject the null hypothesis.

  • When X is binary, the regression model can be used to estimate and test hypotheses about the difference between the population means.

  • In general, the error is heteroskedastic; homoskedasticity-only standard errors do not produce valid statistical inferences when the errors are heteroskedastic, but heteroskedasticity-robust standard errors do.

  • If the three least squares assumptions hold and the regression errors are homoskedastic, then the OLS estimator is BLUE.

A Problem

  • We’ve said many times that districts with lower student-teacher ratios likely have other advantages that affect test scores

  • Perhaps this makes our regression with just the student-teacher ratio misleading (biased)

  • Luckily, if we have data on these other advantages, we can try to “control” for them by including them in the regression

Example: English as a second language learners

  • It is possible that the percentage of English learners is related to both test scores and the student-teacher ratio

  • If we run:

\[ TestScore_i = \beta_0 + \beta_1 STR_i + u_i \]

  • the percentage of second-language learners will then just get tucked into the error term

    • which will now mean that \(E(u|X) \neq 0\) (biased estimator)

Omitted variable bias

  • In general, when we omit a factor that would change our estimate of the ceteris paribus effect of class size on test score, we say we have “omitted variable bias”

  • Two conditions:

    • the omitted variable (OV) is a determinant of Y
    • the OV is correlated with our regressor X

Omitted variable bias

It is plausible that:

  • English language ability (whether the student has English as a second language) plausibly affects standardized test scores: Z is a determinant of Y.

  • Immigrant communities tend to be less affluent and thus have smaller school budgets and higher STR: Z is correlated with X.

In what direction will this bias the estimate of \(\beta_1\) ?

A formula

  • Let the correlation between \(X_i\) and \(u_i\) be \(corr(X_i, u_i) = \varrho_{Xu}\)

  • Then, as the sample size increases, \(\hat{\beta_1}\) will tend toward:

\[\beta_1 + \varrho_{Xu} \frac{\sigma_u}{\sigma_X} \]
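
To see the formula at work, here is a minimal simulation sketch (not from the lecture): Z plays the role of the percentage of English learners, lowering Y while being positively correlated with X, so \(corr(X_i, u_i) < 0\) and the regression that omits Z overstates the negative effect of X.

Code
# Minimal sketch: Z is an omitted determinant of Y that is correlated with X
set.seed(358)
n <- 10000
z <- rnorm(n)
x <- 20 + z + rnorm(n)                # X is positively correlated with Z
y <- 650 - 2 * x - 5 * z + rnorm(n)   # true effect of X is -2; Z also lowers Y

coef(lm(y ~ x))       # about -4.5: biased away from the true -2
coef(lm(y ~ x + z))   # about -2: including Z removes the bias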

Omitted Variable Bias

  • If an omitted variable is both:

    • a determinant of Y (that is, contained in u)
    • correlated with X
  • Then the OLS estimator is biased and not consistent

Finding out if this is a real problem

First, let’s divide districts into small and large classes and by the percentage of English learners:

# Flag districts with large classes (student-teacher ratio of 20 or more)
CASchools <- CASchools %>%
  mutate(size = ifelse(stratio >= 20, 1, 0))

# Bin districts by the percentage of English learners
english_group_levels <- c("< 1.9", "1.9 - 8.8", "8.8 - 23.0", ">= 23.0")

CASchools <- CASchools %>%
  mutate(english_group = case_when(
    english < 1.9 ~ "< 1.9",
    english >= 1.9 & english < 8.8 ~ "1.9 - 8.8",
    english >= 8.8 & english < 23.0 ~ "8.8 - 23.0",
    english >= 23.0 ~ ">= 23.0"
  ),
  # keep the bins in ascending order rather than alphabetical order
  english_group = factor(english_group, levels = english_group_levels))

Now Let’s Summarize

|                    | Average Test Score (STR < 20) | n   | Average Test Score (STR > 20) | n   | Difference | t-stat |
|--------------------|-------------------------------|-----|-------------------------------|-----|------------|--------|
| All Districts      | 657.4                         | 238 | 650                           | 182 | 7.4        | 4.04   |
| % of Engl Learners |                               |     |                               |     |            |        |
| < 1.9%             | 664.5                         | 76  | 665.4                         | 27  | -0.9       | -0.30  |
| 1.9 - 8.8%         | 665.2                         | 64  | 661.8                         | 44  | 3.3        | 1.13   |
| 8.8 - 23.0%        | 654.9                         | 54  | 649.7                         | 50  | 5.2        | 1.72   |
| > 23.0%            | 636.7                         | 44  | 634.8                         | 61  | 1.9        | 0.68   |


  • Districts with fewer English Learners have higher test scores

  • Districts with lower percent EL have smaller classes

  • Among districts with comparable percentages of English learners, the difference in test scores between small and large classes is much smaller than the overall “test score gap” of 7.4 (a sketch of the computation follows below)
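
As a sketch (not shown in the slides), the “All Districts” row of the table can be reproduced with the size and english_group variables created above:

Code
# Mean test score and sample size for small (size = 0) vs. large (size = 1) classes
CASchools %>%
  group_by(size) %>%
  summarize(avg_score = mean(score), n = n())

# t-statistic for the difference in means (small minus large)
t.test(score ~ factor(size), data = CASchools)

# The same comparison within each English-learner bin
CASchools %>%
  group_by(english_group, size) %>%
  summarize(avg_score = mean(score), n = n(), .groups = "drop")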

In summary

  • The test score/STR/fraction English Learners example shows that, if an omitted variable satisfies the two conditions for omitted variable bias, then the OLS estimator in the regression omitting that variable is biased and inconsistent. So, even if n is large, \(\widehat{\beta_1}\) will not be close to \(\beta_1\)

Overcoming OVB

Three ways to overcome omitted variable bias:

  • Run a randomized experiment

    • % Engl learner is still a determinant of test score, but will be uncorrelated with STR
  • Adopt the “cross tabulation” approach

    • finer and finer gradations of STR and % Engl learner
    • But soon you will run out of data, and what about other determinants like family income and parental education?
  • Use a regression in which the omitted variable (% Engl learner) is no longer omitted

Multiple Regression

A Population Regression Line

Assume: \[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + u_i \] Interpretation of coefficients is now:

  • \(\beta_1 = \frac{\Delta Y}{\Delta X_1}\) holding \(X_2\) constant

  • \(\beta_2 = \frac{\Delta Y}{\Delta X_2}\) holding \(X_1\) constant

  • \(\beta_0 =\) predicted value of Y when both \(X_1\) and \(X_2\) are zero

The OLS estimator

With two regressors, our minimization problem is now:

\[ \min_{b_0,b_1,b_2} \sum_{i=1}^{n} [Y_i -(b_0 +b_1X_{1i}+b_2X_{2i})]^2 \]

  • The OLS estimator minimizes the average squared difference between the actual values of \(Y_i\) and the prediction based on the estimated line.

  • This minimization problem is solved using calculus and yields our estimators \(\hat{\beta_0}, \hat{\beta_1}, \hat{\beta_2}\) (a matrix-form check follows below)
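
As a check (not part of the slides), the solution to this minimization problem can be written in matrix form as \(\hat{\beta} = (X'X)^{-1}X'Y\); the sketch below verifies that it matches lm() for the CASchools data used on the next slides.

Code
# Closed-form OLS with two regressors: beta_hat = (X'X)^{-1} X'Y
X <- cbind(1, CASchools$stratio, CASchools$english)  # intercept, STR, % English learners
y <- CASchools$score
solve(t(X) %*% X) %*% t(X) %*% y

# Same estimates from lm()
coef(lm(score ~ stratio + english, data = CASchools))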

Let’s Experiment

Code
model1 <- lm(score ~ stratio, data = CASchools)
model2 <- lm(score ~ stratio + english, data = CASchools)
coef_names <- c("Intercept" = "(Intercept)", "Class size" = "stratio", "Percent English learners" = "english")

# Side-by-side table with HC1 heteroskedasticity-robust standard errors
export_summs(model1, model2, robust = "HC1", coefs = coef_names,
             statistics = c(N = "nobs", R2 = "r.squared", AdjR2 = "adj.r.squared"))
|                          | Model 1    | Model 2    |
|--------------------------|------------|------------|
| Intercept                | 698.93 *** | 686.03 *** |
|                          | (10.36)    | (8.73)     |
| Class size               | -2.28 ***  | -1.10 *    |
|                          | (0.52)     | (0.43)     |
| Percent English learners |            | -0.65 ***  |
|                          |            | (0.03)     |
| N                        | 420        | 420        |
| R2                       | 0.05       | 0.43       |
| AdjR2                    | 0.05       | 0.42       |

Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05.
  • note: reporting regression output in this way is very common in published papers

Let’s Experiment

Code
model1 <- lm(score ~ stratio, data = CASchools)
model2 <- lm(score ~ stratio + english, data = CASchools)
coef_names <- c("Class size" = "stratio", "Percent English learners" = "english")

export_summs(model1, model2, robust = "HC1", coefs = coef_names,
             statistics = c(N = "nobs", R2 = "r.squared", AdjR2 = "adj.r.squared"))
|                          | Model 1   | Model 2   |
|--------------------------|-----------|-----------|
| Class size               | -2.28 *** | -1.10 *   |
|                          | (0.52)    | (0.43)    |
| Percent English learners |           | -0.65 *** |
|                          |           | (0.03)    |
| N                        | 420       | 420       |
| R2                       | 0.05      | 0.43      |
| AdjR2                    | 0.05      | 0.42      |

Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05.

What happens to the coefficient on Class Size?

Measures of Fit

  • Residuals have the same interpretation (actual = model prediction + residual)

  • SER is the same (standard deviation of the residuals with a d.f. correction)

  • RMSE is the same (standard deviation of the residuals with no d.f. correction)

  • \(R^2\) is the same (fraction of variance explained by model)

But now we introduce a new measure of fit:

Adjusted \(R^2\)

  • As we add variables to the model, it is likely that \(R^2\) increases

    • a bit of a problem for a measure of fit
  • So the adjusted \(R^2\) “penalizes” for adding another regressor

\[ \text{Adjusted } R^2 = 1- \frac{SSR}{TSS}(\frac{n-1}{n-k-1}) \]

  • Note that if \(n\) is large, then \(R^2\) and Adj. \(R^2\) get very close
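
As a sketch (not in the slides), the formula can be checked by hand for the two-regressor model estimated earlier:

Code
# model2 was estimated earlier: lm(score ~ stratio + english, data = CASchools)
n   <- nobs(model2)
k   <- 2                                    # regressors, excluding the intercept
ssr <- sum(residuals(model2)^2)
tss <- sum((CASchools$score - mean(CASchools$score))^2)

1 - ssr / tss                               # R^2, about 0.43
1 - (ssr / tss) * ((n - 1) / (n - k - 1))   # adjusted R^2, about 0.42

# Same values from summary()
summary(model2)$r.squared
summary(model2)$adj.r.squared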

Measures of Fit

Code
export_summs(model1, model2, robust = "HC1", coefs = coef_names,
             statistics = c(N = "nobs", R2 = "r.squared", AdjR2 = "adj.r.squared"))
|                          | Model 1   | Model 2   |
|--------------------------|-----------|-----------|
| Class size               | -2.28 *** | -1.10 *   |
|                          | (0.52)    | (0.43)    |
| Percent English learners |           | -0.65 *** |
|                          |           | (0.03)    |
| N                        | 420       | 420       |
| R2                       | 0.05      | 0.43      |
| AdjR2                    | 0.05      | 0.42      |

Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05.

Causal inference assumptions

  • In multiple regression, our first three causal inference assumptions remain the same

  • But we add a fourth:

  4. There is no perfect multicollinearity.

Assumption 1

\[E(u|X_1 = x_1, \ldots, X_k = x_k) = 0 \]

  • Same interpretation as before: the conditional mean of the error term is zero

  • Failure of this condition leads to omitted variable bias; specifically, if an omitted variable

    • belongs in the equation (is in \(u\)) and
    • is correlated with an included X
  • The best solution is to include the OV in the model

  • A second solution is to include a variable that controls for the OV (discussed shortly)

Assumption 2

  • Assumption 2 states that the data are collected by simple random sampling, so the observations are i.i.d.

  • Nothing about this changes in multiple regression

Assumption 3

  • Assumption 3 states that large outliers are unlikely

  • Nothing about this changes in multiple regression

  • Always check your data (scatterplots are useful!)
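
For example, a quick look at the variables used so far (a sketch, not part of the slides):

Code
# Quick scatterplots to screen for outliers in the CASchools data
plot(CASchools$stratio, CASchools$score,
     xlab = "Student-teacher ratio", ylab = "Test score")
plot(CASchools$english, CASchools$score,
     xlab = "Percent English learners", ylab = "Test score")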

Assumption 4

  • First, let’s describe perfect multicollinearity

  • This would exist if one regressor were an exact linear function of the other regressors

  • It makes R go nuts:

Code
# stratio1 is an exact linear function of stratio, so the regressors are perfectly collinear
CASchools <- CASchools %>%
  mutate(stratio1 = stratio + 2)

modelcol <- lm(score ~ stratio + stratio1, data = CASchools)

summary(modelcol)

Call:
lm(formula = score ~ stratio + stratio1, data = CASchools)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.727 -14.251   0.483  12.822  48.540 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
stratio      -2.2798     0.4798  -4.751 2.78e-06 ***
stratio1          NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared:  0.05124,   Adjusted R-squared:  0.04897 
F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06

Perfect Collinearity

  • Clearly it makes no sense to try to find the effect of STR holding STR + 2 constant

  • Other examples:

    • you include a variable twice

    • you include a dummy for small classes (\(D = 1\) if \(STR\leq20\), else 0) and a dummy for large classes (\(B=1\) if \(STR>20\), else 0), so that B = 1-D

The dummy variable trap

  • Suppose you have a set of multiple binary (dummy) variables, which are mutually exclusive and exhaustive

    • that is, there are multiple categories and every observation falls in one and only one category (Freshmen, Sophomores, Juniors, Seniors, Other)
  • If you include all of these dummy variables and a constant, you will have perfect multicollinearity

    • this is sometimes called the dummy variable trap (see the sketch below)
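
Here is a minimal sketch of the trap, using hypothetical class-year data (none of these variables come from the lecture):

Code
# Hypothetical data: one dummy per class year, plus an intercept
set.seed(358)
class_year <- sample(c("Freshman", "Sophomore", "Junior", "Senior"), 100, replace = TRUE)
gpa <- rnorm(100, mean = 3, sd = 0.4)

d <- model.matrix(~ class_year - 1)   # one dummy column per category
all(rowSums(d) == 1)                  # TRUE: the dummies sum to the constant term

trap <- lm(gpa ~ d)                   # intercept + all four dummies: one is dropped (NA)
coef(trap)

# The fix: omit one category (R does this automatically for a factor regressor)
ok <- lm(gpa ~ class_year)
coef(ok)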

Perfect multicollinearity

  • Perfect multicollinearity is easy to resolve

  • It usually reflects a mistake in the definitions of the regressors, or an oddity in the data

  • If you have perfect multicollinearity, your statistical software will let you know, either by crashing, giving an error message, or “dropping” one of the variables arbitrarily

  • The solution to perfect multicollinearity is to modify your list of regressors so that you no longer have perfect multicollinearity.

Imperfect multicollinearity

  • Imperfect multicollinearity occurs when two or more regressors are very highly correlated.

    • this is a harder problem
  • Why the term “multicollinearity”? If two regressors are very highly correlated, then their scatterplot will pretty much look like a straight line – they are “co-linear” – but unless the correlation is exactly \(\pm 1\), that collinearity is imperfect

Imperfect multicollinearity

  • Imperfect multicollinearity implies that one or more of the regression coefficients will be imprecisely estimated.

  • the coefficient on \(X_1\) is the effect of \(X_1\) holding \(X_2\) constant

  • but if \(X_1\) and \(X_2\) are highly correlated, there is very little variation in \(X_1\) once \(X_2\) is held constant, so the data don’t contain much information about what happens when \(X_1\) changes but \(X_2\) doesn’t

  • The variance of the OLS estimator of the coefficient on \(X_1\) will therefore be large, and the coefficient will have a large standard error (see the simulation below)
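
A small simulation sketch (not from the lecture) makes the point: \(X_2\) is \(X_1\) plus a little noise, and the standard errors in the two-regressor model blow up.

Code
# Near-perfect collinearity: x2 is x1 plus a small amount of noise
set.seed(358)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)           # corr(x1, x2) is close to 1
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

summary(lm(y ~ x1))        # the (combined) effect is precisely estimated
summary(lm(y ~ x1 + x2))   # both coefficients now have very large standard errors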

A control variable

  • With observational data we usually have a host of factors that affect our Y and get tucked into the error term

    • for instance, parental involvement, access in the community to learning opportunities, and so on… all will affect test scores
  • If we can observe these factors, we include them in the regression

  • But if we can’t observe them, we can try to use a “control variable”

    • a variable correlated with the omitted factor, but not itself causal

Control Variables

A control variable W is a regressor included to hold constant factors that, if neglected, could lead the estimated causal effect of interest to suffer from omitted variable bias.

Example

Code
model3 <- lm(score ~ stratio + english + lunch, data = CASchools)

coef_names <- c("Intercept" = "(Intercept)", "Class size" = "stratio", "Percent English learners" = "english", "Percent receiving free/reduced lunch" = "lunch")

export_summs(model1, model2, model3, robust = "HC1", coefs = coef_names,
             statistics = c(N = "nobs", R2 = "r.squared", AdjR2 = "adj.r.squared"))
|                                      | Model 1    | Model 2    | Model 3    |
|--------------------------------------|------------|------------|------------|
| Intercept                            | 698.93 *** | 686.03 *** | 700.15 *** |
|                                      | (10.36)    | (8.73)     | (5.57)     |
| Class size                           | -2.28 ***  | -1.10 *    | -1.00 ***  |
|                                      | (0.52)     | (0.43)     | (0.27)     |
| Percent English learners             |            | -0.65 ***  | -0.12 ***  |
|                                      |            | (0.03)     | (0.03)     |
| Percent receiving free/reduced lunch |            |            | -0.55 ***  |
|                                      |            |            | (0.02)     |
| N                                    | 420        | 420        | 420        |
| R2                                   | 0.05       | 0.43       | 0.77       |
| AdjR2                                | 0.05       | 0.42       | 0.77       |

Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05.

Example

\[ \widehat{TestScore} = 700.15 - 1.00\,(ClassSize) - 0.12\,(EnglishLearners) - 0.55\,(FreeLunch) \]

  • Which variable is the variable of interest?

  • Which variables are control variables? Might they have a causal effect themselves? What do they control for?

Example

  • class size is the variable of interest

  • EnglishLearners probably has a direct causal effect (school is tougher if you are learning English!). But it is also a control variable: immigrant communities tend to be less affluent and often have fewer outside learning opportunities, and EnglishLearners is correlated with those omitted causal variables.

    • EnglishLearners is both a possible causal variable and a control variable.
  • FreeLunch might have a causal effect (eating lunch helps learning); it also is correlated with and controls for income-related outside learning opportunities.

    • FreeLunch is both a possible causal variable and a control variable.

Control variables

What makes for an effective control variable:

  • An effective control variable is one which, when included in the regression, makes the error term uncorrelated with the variable of interest.

  • Holding constant the control variable(s), the variable of interest is “as if” randomly assigned.

  • Among individuals (entities) with the same value of the control variable(s), the variable of interest is uncorrelated with the omitted determinants of Y
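
A small simulation sketch (not from the lecture, all variable names hypothetical): “opportunity” is an unobserved determinant of Y, X is correlated with it, and W is an observable proxy, like percent free lunch, that is correlated with the omitted factor but is not the causal channel of interest.

Code
# Effective control variable: conditioning on w makes x "as if" randomly assigned
set.seed(358)
n <- 10000
opportunity <- rnorm(n)                    # unobserved determinant of y
w <- opportunity + rnorm(n, sd = 0.1)      # observable control, highly correlated with it
x <- -opportunity + rnorm(n)               # variable of interest, correlated with the OV
y <- 1 * x + 3 * opportunity + rnorm(n)    # true causal effect of x is 1

coef(lm(y ~ x))        # biased (about -0.5): x picks up the omitted opportunity effect
coef(lm(y ~ x + w))    # close to 1: w controls for the omitted factor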

Your projects

  • We’ve now seen the basics of a multiple regression model

  • Recall that your job is to read and write up a summary of the topic and 3 academic peer-reviewed papers

    • If you haven’t yet, meet with me ASAP!
  • After that you will collect the data and propose a model

  • Finally you will run the model and write up its problems