Multiple Regression

Econ 358: Econometrics

Agenda pt. 1

1

Omitted Variable Bias

2

Multiple Regression Model

3

OLS Estimator in Multiple Regression

4

Measures of Fit

5

Causal Inference

Agenda pt. 2

1

Multicollinearity

2

Conditional Mean Independence

3

Wrap-Up

Reminder: Projects

Topic Memo

  • Present question, explain basic model, and provide data sources

    • The easiest path to a practical question and model is to meet with me
  • Memo must include three academic journal articles in addition to other sources

  • It should correctly use terminology from class

  • Description of model, and key variables with justification

  • Description of potential data sources

Summary of Single Regressor

  • Hypothesis testing for regression coefficients is analogous to hypothesis testing for the population mean: use the t-statistic to calculate the p-value and either reject or fail to reject the null hypothesis.

  • When X is binary, the regression model can be used to estimate and test hypotheses about the difference between the population means.

  • In general, the error is heteroskedastic; homoskedasticity-only standard errors do not produce valid statistical inferences when the errors are heteroskedastic, but heteroskedasticity-robust standard errors do.

  • If the three least squares assumptions hold and the regression errors are homoskedastic, then the OLS estimator is BLUE.

A Problem

  • We’ve said many times that districts with lower student-teacher ratios likely have other advantages that affect test scores

  • Perhaps this makes our regression with just the student-teacher ratio misleading (biased)

  • Luckily, if we have data on these other advantages, we can try to “control” for them by including them in the regression

Example: English as a second language learners

  • It is possible that the percentage of English learners is related to both test scores and the student-teacher ratio

  • If we run:

\[ TestScore_i = \beta_0 + \beta_1 STR_i + u_i \]

  • the percentage of second-language learners will then just get tucked into the error term

    • which will now mean that \(E(u|X) \neq 0\) (biased estimator)

Omitted variable bias

  • In general, when we omit a factor that would change our estimate of the ceteris paribus effect of class size on test score, we say we have “omitted variable bias”

  • Two conditions:

    • the omitted variable (OV) is a determinant of Y
    • the OV is correlated with our regressor X

Omitted variable bias

It is plausible that:

  • English language ability (whether the student has English as a second language) plausibly affects standardized test scores: Z is a determinant of Y.

  • Immigrant communities tend to be less affluent and thus have smaller school budgets and higher STR: Z is correlated with X.

In what direction will this bias the estimate of \(\beta_1\) ?

A formula

  • Let the correlation between \(X_i\) and \(u_i\) be \(corr(X_i, u_i) = \varrho_{Xu}\)

  • Then, as the sample size increases, \(\hat{\beta_1}\) will tend toward:

\[\beta_1 + \varrho_{Xu} \frac{\sigma_u}{\sigma_X} \]
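
To see the formula at work, here is a minimal simulation sketch (not from the lecture): Z plays the role of the percentage of English learners, lowering Y while being positively correlated with X, so \(corr(X_i, u_i) < 0\) and the regression that omits Z overstates the negative effect of X.

Code
# Minimal sketch: Z is an omitted determinant of Y that is correlated with X
set.seed(358)
n <- 10000
z <- rnorm(n)
x <- 20 + z + rnorm(n)                # X is positively correlated with Z
y <- 650 - 2 * x - 5 * z + rnorm(n)   # true effect of X is -2; Z also lowers Y

coef(lm(y ~ x))       # about -4.5: biased away from the true -2
coef(lm(y ~ x + z))   # about -2: including Z removes the bias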

Omitted Variable Bias

  • If an omitted variable is both:

    • a determinant of Y (that is, contained in u)
    • correlated with X
  • Then the OLS estimator is biased and not consistent

Finding out if this is a real problem

First, let’s divide districts into small and large classes and by the percentage of English learners:

# Flag districts with large classes (student-teacher ratio of 20 or more)
CASchools <- CASchools %>%
  mutate(size = ifelse(stratio >= 20, 1, 0))

# Bin districts by the percentage of English learners
english_group_levels <- c("< 1.9", "1.9 - 8.8", "8.8 - 23.0", ">= 23.0")

CASchools <- CASchools %>%
  mutate(english_group = case_when(
    english < 1.9 ~ "< 1.9",
    english >= 1.9 & english < 8.8 ~ "1.9 - 8.8",
    english >= 8.8 & english < 23.0 ~ "8.8 - 23.0",
    english >= 23.0 ~ ">= 23.0"
  ),
  # keep the bins in ascending order rather than alphabetical order
  english_group = factor(english_group, levels = english_group_levels))

Now Let’s Summarize

|                    | Average Test Score (STR < 20) | n   | Average Test Score (STR > 20) | n   | Difference | t-stat |
|--------------------|-------------------------------|-----|-------------------------------|-----|------------|--------|
| All Districts      | 657.4                         | 238 | 650                           | 182 | 7.4        | 4.04   |
| % of Engl Learners |                               |     |                               |     |            |        |
| < 1.9%             | 664.5                         | 76  | 665.4                         | 27  | -0.9       | -0.30  |
| 1.9 - 8.8%         | 665.2                         | 64  | 661.8                         | 44  | 3.3        | 1.13   |
| 8.8 - 23.0%        | 654.9                         | 54  | 649.7                         | 50  | 5.2        | 1.72   |
| > 23.0%            | 636.7                         | 44  | 634.8                         | 61  | 1.9        | 0.68   |


  • Districts with fewer English Learners have higher test scores

  • Districts with lower percent EL have smaller classes

  • Among districts with comparable percentages of English learners, the difference in test scores between small and large classes is much smaller than the overall “test score gap” of 7.4 (a sketch of the computation follows below)
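
As a sketch (not shown in the slides), the “All Districts” row of the table can be reproduced with the size and english_group variables created above:

Code
# Mean test score and sample size for small (size = 0) vs. large (size = 1) classes
CASchools %>%
  group_by(size) %>%
  summarize(avg_score = mean(score), n = n())

# t-statistic for the difference in means (small minus large)
t.test(score ~ factor(size), data = CASchools)

# The same comparison within each English-learner bin
CASchools %>%
  group_by(english_group, size) %>%
  summarize(avg_score = mean(score), n = n(), .groups = "drop")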

In summary

  • The test score/STR/fraction English Learners example shows that, if an omitted variable satisfies the two conditions for omitted variable bias, then the OLS estimator in the regression omitting that variable is biased and inconsistent. So, even if n is large, \(\widehat{\beta_1}\) will not be close to \(\beta_1\)

Overcoming OVB

Three ways to overcome omitted variable bias:

  • Run a randomized experiment

    • % Engl learner is still a determinant of test score, but will be uncorrelated with STR
  • Adopt the “cross tabulation” approach

    • finer and finer gradations of STR and % Engl learner
    • But soon you will run out of data, and what about other determinants like family income and parental education?
  • Use a regression in which the omitted variable (% Engl learner) is no longer omitted

Multiple Regression

A Population Regression Line

Assume: \[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + u_i \] Interpretation of coefficients is now:

  • \(\beta_1 = \frac{\Delta Y}{\Delta X_1}\) holding \(X_2\) constant

  • \(\beta_2 = \frac{\Delta Y}{\Delta X_2}\) holding \(X_1\) constant

  • \(\beta_0 =\) predicted value of Y when both \(X_1\) and \(X_2\) are zero

The OLS estimator

With two regressors, our minimization problem is now:

\[ \min_{b_0,b_1,b_2} \sum_{i=1}^{n} [Y_i -(b_0 +b_1X_{1i}+b_2X_{2i})]^2 \]

  • The OLS estimator minimizes the average squared difference between the actual values of \(Y_i\) and the prediction based on the estimated line.

  • This minimization problem is solved using calculus and yields our estimators \(\hat{\beta_0}, \hat{\beta_1}, \hat{\beta_2}\) (a matrix-form check follows below)
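
As a check (not part of the slides), the solution to this minimization problem can be written in matrix form as \(\hat{\beta} = (X'X)^{-1}X'Y\); the sketch below verifies that it matches lm() for the CASchools data used on the next slides.

Code
# Closed-form OLS with two regressors: beta_hat = (X'X)^{-1} X'Y
X <- cbind(1, CASchools$stratio, CASchools$english)  # intercept, STR, % English learners
y <- CASchools$score
solve(t(X) %*% X) %*% t(X) %*% y

# Same estimates from lm()
coef(lm(score ~ stratio + english, data = CASchools))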

Let’s Experiment

Code
model1 <- lm(score ~ stratio, data = CASchools)
model2 <- lm(score ~ stratio + english, data = CASchools)
coef_names <- c("Intercept" = "(Intercept)", "Class size" = "stratio", "Percent English learners" = "english")

# Side-by-side table with HC1 heteroskedasticity-robust standard errors
export_summs(model1, model2, robust = "HC1", coefs = coef_names,
             statistics = c(N = "nobs", R2 = "r.squared", AdjR2 = "adj.r.squared"))
|                          | Model 1    | Model 2    |
|--------------------------|------------|------------|
| Intercept                | 698.93 *** | 686.03 *** |
|                          | (10.36)    | (8.73)     |
| Class size               | -2.28 ***  | -1.10 *    |
|                          | (0.52)     | (0.43)     |
| Percent English learners |            | -0.65 ***  |
|                          |            | (0.03)     |
| N                        | 420        | 420        |
| R2                       | 0.05       | 0.43       |
| AdjR2                    | 0.05       | 0.42       |

Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05.
  • note: reporting regression output in this way is very common in published papers

Let’s Experiment

Code
model1 <- lm(score ~ stratio, data = CASchools)
model2 <- lm(score ~ stratio + english, data = CASchools)
coef_names <- c("Class size" = "stratio", "Percent English learners" = "english")

export_summs(model1, model2, robust = "HC1", coefs = coef_names,
             statistics = c(N = "nobs", R2 = "r.squared", AdjR2 = "adj.r.squared"))
|                          | Model 1   | Model 2   |
|--------------------------|-----------|-----------|
| Class size               | -2.28 *** | -1.10 *   |
|                          | (0.52)    | (0.43)    |
| Percent English learners |           | -0.65 *** |
|                          |           | (0.03)    |
| N                        | 420       | 420       |
| R2                       | 0.05      | 0.43      |
| AdjR2                    | 0.05      | 0.42      |

Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05.

What happens to the coefficient on Class Size?

Measures of Fit

  • Residuals have the same interpretation (actual = model prediction + residual)

  • SER is the same (standard deviation of the residuals with a d.f. correction)

  • RMSE is the same (standard deviation of the residuals with no d.f. correction)

  • \(R^2\) is the same (fraction of variance explained by model)

But now we introduce a new measure of fit:

Adjusted \(R^2\)

  • As we add variables to the model, it is likely that \(R^2\) increases

    • a bit of a problem for a measure of fit
  • So the adjusted \(R^2\) “penalizes” for adding another regressor

\[ \text{Adjusted } R^2 = 1- \frac{SSR}{TSS}(\frac{n-1}{n-k-1}) \]

  • Note that if \(n\) is large, then \(R^2\) and Adj. \(R^2\) get very close
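
As a sketch (not in the slides), the formula can be checked by hand for the two-regressor model estimated earlier:

Code
# model2 was estimated earlier: lm(score ~ stratio + english, data = CASchools)
n   <- nobs(model2)
k   <- 2                                    # regressors, excluding the intercept
ssr <- sum(residuals(model2)^2)
tss <- sum((CASchools$score - mean(CASchools$score))^2)

1 - ssr / tss                               # R^2, about 0.43
1 - (ssr / tss) * ((n - 1) / (n - k - 1))   # adjusted R^2, about 0.42

# Same values from summary()
summary(model2)$r.squared
summary(model2)$adj.r.squared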

Measures of Fit

Code
export_summs(model1, model2, robust = "HC1", coefs = coef_names,
             statistics = c(N = "nobs", R2 = "r.squared", AdjR2 = "adj.r.squared"))
|                          | Model 1   | Model 2   |
|--------------------------|-----------|-----------|
| Class size               | -2.28 *** | -1.10 *   |
|                          | (0.52)    | (0.43)    |
| Percent English learners |           | -0.65 *** |
|                          |           | (0.03)    |
| N                        | 420       | 420       |
| R2                       | 0.05      | 0.43      |
| AdjR2                    | 0.05      | 0.42      |

Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05.

Causal inference assumptions

  • In multiple regression, our first three causal inference assumptions remain the same

  • But we add a fourth:

  4. There is no perfect multicollinearity.

Assumption 1

\[E(u|X_1 = x_1, \ldots, X_k = x_k) = 0 \]

  • Same interpretation as before: the conditional mean of the error term is zero

  • Failure of this condition leads to omitted variable bias; specifically, if an omitted variable

    • belongs in the equation (is in \(u\)) and
    • is correlated with an included X
  • The best solution is to include the OV in the model

  • A second solution is to include a variable that controls for the OV (discussed shortly)

Assumption 2

  • Assumption 2 states that the data are collected by simple random sampling, so the observations are i.i.d.

  • Nothing about this changes in multiple regression

Assumption 3

  • Assumption 3 states that large outliers are unlikely

  • Nothing about this changes in multiple regression

  • Always check your data (scatterplots are useful!)
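
For example, a quick look at the variables used so far (a sketch, not part of the slides):

Code
# Quick scatterplots to screen for outliers in the CASchools data
plot(CASchools$stratio, CASchools$score,
     xlab = "Student-teacher ratio", ylab = "Test score")
plot(CASchools$english, CASchools$score,
     xlab = "Percent English learners", ylab = "Test score")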

Assumption 4

  • First, let’s describe perfect multicollinearity

  • This would exist if one regressor were an exact linear function of the other regressors

  • It makes R go nuts:

Code
# stratio1 is an exact linear function of stratio, so the regressors are perfectly collinear
CASchools <- CASchools %>%
  mutate(stratio1 = stratio + 2)

modelcol <- lm(score ~ stratio + stratio1, data = CASchools)

summary(modelcol)

Call:
lm(formula = score ~ stratio + stratio1, data = CASchools)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.727 -14.251   0.483  12.822  48.540 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
stratio      -2.2798     0.4798  -4.751 2.78e-06 ***
stratio1          NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared:  0.05124,   Adjusted R-squared:  0.04897 
F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06

Perfect Collinearity

  • Clearly it makes no sense to try to find the effect of STR holding STR + 2 constant

  • Other examples:

    • you include a variable twice

    • you include a dummy for small classes (\(D = 1\) if \(STR\leq20\), else 0) and a dummy for large classes (\(B=1\) if \(STR>20\), else 0), so that B = 1-D

The dummy variable trap

  • Suppose you have a set of multiple binary (dummy) variables, which are mutually exclusive and exhaustive

    • that is, there are multiple categories and every observation falls in one and only one category (Freshmen, Sophomores, Juniors, Seniors, Other)
  • If you include all of these dummy variables and a constant, you will have perfect multicollinearity

    • this is sometimes called the dummy variable trap (see the sketch below)
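
Here is a minimal sketch of the trap, using hypothetical class-year data (none of these variables come from the lecture):

Code
# Hypothetical data: one dummy per class year, plus an intercept
set.seed(358)
class_year <- sample(c("Freshman", "Sophomore", "Junior", "Senior"), 100, replace = TRUE)
gpa <- rnorm(100, mean = 3, sd = 0.4)

d <- model.matrix(~ class_year - 1)   # one dummy column per category
all(rowSums(d) == 1)                  # TRUE: the dummies sum to the constant term

trap <- lm(gpa ~ d)                   # intercept + all four dummies: one is dropped (NA)
coef(trap)

# The fix: omit one category (R does this automatically for a factor regressor)
ok <- lm(gpa ~ class_year)
coef(ok)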

Perfect multicollinearity

  • Perfect multicollinearity is easy to resolve

  • It usually reflects a mistake in the definitions of the regressors, or an oddity in the data

  • If you have perfect multicollinearity, your statistical software will let you know, either by crashing, giving an error message, or “dropping” one of the variables arbitrarily

  • The solution to perfect multicollinearity is to modify your list of regressors so that you no longer have perfect multicollinearity.

Imperfect multicollinearity

  • Imperfect multicollinearity occurs when two or more regressors are very highly correlated.

    • this is a harder problem
  • Why the term “multicollinearity”? If two regressors are very highly correlated, then their scatterplot will pretty much look like a straight line – they are “co-linear” – but unless the correlation is exactly \(\pm 1\), that collinearity is imperfect

Imperfect multicollinearity

  • Imperfect multicollinearity implies that one or more of the regression coefficients will be imprecisely estimated.

  • the coefficient on \(X_1\) is the effect of \(X_1\) holding \(X_2\) constant

  • but if \(X_1\) and \(X_2\) are highly correlated, there is very little variation in \(X_1\) once \(X_2\) is held constant, so the data don’t contain much information about what happens when \(X_1\) changes but \(X_2\) doesn’t

  • The variance of the OLS estimator of the coefficient on \(X_1\) will therefore be large, and the coefficient will have a large standard error (see the simulation below)
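
A small simulation sketch (not from the lecture) makes the point: \(X_2\) is \(X_1\) plus a little noise, and the standard errors in the two-regressor model blow up.

Code
# Near-perfect collinearity: x2 is x1 plus a small amount of noise
set.seed(358)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)           # corr(x1, x2) is close to 1
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

summary(lm(y ~ x1))        # the (combined) effect is precisely estimated
summary(lm(y ~ x1 + x2))   # both coefficients now have very large standard errors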

A control variable

  • With observational data we usually have a host of factors that affect our Y and get tucked into the error term

    • for instance, parental involvement, access in the community to learning opportunities, and so on… all will affect test scores
  • If we can observe these factors, we include them in the regression

  • But if we can’t observe them, we can try to use a “control variable”

    • a variable correlated with the omitted factor, but not itself causal

Control Variables

A control variable W is a regressor included to hold constant factors that, if neglected, could lead the estimated causal effect of interest to suffer from omitted variable bias.

Example

Code
model3 <- lm(score ~ stratio + english + lunch, data = CASchools)

coef_names <- c("Intercept" = "(Intercept)", "Class size" = "stratio", "Percent English learners" = "english", "Percent receiving free/reduced lunch" = "lunch")

export_summs(model1, model2, model3, robust = "HC1", coefs = coef_names,
             statistics = c(N = "nobs", R2 = "r.squared", AdjR2 = "adj.r.squared"))
|                                      | Model 1    | Model 2    | Model 3    |
|--------------------------------------|------------|------------|------------|
| Intercept                            | 698.93 *** | 686.03 *** | 700.15 *** |
|                                      | (10.36)    | (8.73)     | (5.57)     |
| Class size                           | -2.28 ***  | -1.10 *    | -1.00 ***  |
|                                      | (0.52)     | (0.43)     | (0.27)     |
| Percent English learners             |            | -0.65 ***  | -0.12 ***  |
|                                      |            | (0.03)     | (0.03)     |
| Percent receiving free/reduced lunch |            |            | -0.55 ***  |
|                                      |            |            | (0.02)     |
| N                                    | 420        | 420        | 420        |
| R2                                   | 0.05       | 0.43       | 0.77       |
| AdjR2                                | 0.05       | 0.42       | 0.77       |

Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05.

Example

\[ \widehat{TestScore} = 700.15 - 1.00\,(ClassSize) - 0.12\,(EnglishLearners) - 0.55\,(FreeLunch) \]

  • Which variable is the variable of interest?

  • Which variables are control variables? Might they have a causal effect themselves? What do they control for?

Example

  • class size is the variable of interest

  • EnglishLearners probably has a direct causal effect (school is tougher if you are learning English!). But it is also a control variable: immigrant communities tend to be less affluent and often have fewer outside learning opportunities, and EnglishLearners is correlated with those omitted causal variables.

    • EnglishLearners is both a possible causal variable and a control variable.
  • FreeLunch might have a causal effect (eating lunch helps learning); it also is correlated with and controls for income-related outside learning opportunities.

    • FreeLunch is both a possible causal variable and a control variable.

Control variables

What makes for an effective control variable:

  • An effective control variable is one which, when included in the regression, makes the error term uncorrelated with the variable of interest.

  • Holding constant the control variable(s), the variable of interest is “as if” randomly assigned.

  • Among individuals (entities) with the same value of the control variable(s), the variable of interest is uncorrelated with the omitted determinants of Y
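
A small simulation sketch (not from the lecture, all variable names hypothetical): “opportunity” is an unobserved determinant of Y, X is correlated with it, and W is an observable proxy, like percent free lunch, that is correlated with the omitted factor but is not the causal channel of interest.

Code
# Effective control variable: conditioning on w makes x "as if" randomly assigned
set.seed(358)
n <- 10000
opportunity <- rnorm(n)                    # unobserved determinant of y
w <- opportunity + rnorm(n, sd = 0.1)      # observable control, highly correlated with it
x <- -opportunity + rnorm(n)               # variable of interest, correlated with the OV
y <- 1 * x + 3 * opportunity + rnorm(n)    # true causal effect of x is 1

coef(lm(y ~ x))        # biased (about -0.5): x picks up the omitted opportunity effect
coef(lm(y ~ x + w))    # close to 1: w controls for the omitted factor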

Your projects

  • We’ve now seen the basics of a multiple regression model

  • Recall that your job is to read and write up a summary of the topic and 3 academic peer-reviewed papers

    • If you haven’t yet, meet with me ASAP!
  • After that you will collect the data and propose a model

  • Finally you will run the model and write up its problems