Regression is about predicting the average value of Y given some value of X.
Ultimately we want to be able to examine the effect of class size on test scores while controlling for other variables that might also influence test scores.
Regression is going to help get us a bit closer to this ideal.
We are still going to have the problems of core assumptions and extrapolating from a sample to a population.
A scatter plot of test scores against class size shows a weak, negative, roughly linear relationship.
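A sketch of how such a scatter plot can be drawn in R with the CASchools data used later in this section. The stratio (student-teacher ratio) and score (average test score) variables are assumptions here; the construction shown is the usual one and can be skipped if the variables already exist in your data.

library(AER)                 # provides the CASchools data set
data("CASchools")
CASchools$stratio <- CASchools$students / CASchools$teachers   # student-teacher ratio (assumed definition)
CASchools$score   <- (CASchools$read + CASchools$math) / 2     # average test score (assumed definition)
plot(score ~ stratio, data = CASchools,
     xlab = "Student-teacher ratio", ylab = "Test score",
     main = "Test scores vs. class size")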
Regression
Regression attempts to make predictions about the average outcome (test scores) based on some information about other variables (class size, income, etc…)
“The change in average test score given a one unit change in Class Size”
Interpreting the “mistakes” in a regression
The regression error consists of omitted factors.
In general, these omitted factors are other factors that influence Y, other than the variable X.
The regression error also includes error in the measurement of Y.
Fitting a Line
Imagine for a minute
If we had a perfect relationship, what would be true?
How do we fit a line?
Let’s think a bit more about fitting a line to data.
Given a cloud of data, how do we choose a best fitting line?
Which line provides the best fit?
How do we choose a line?
Ok, aside from eyeballing, how do we choose a line?
We begin by looking at the errors
These are the leftovers from the model fit:
Data = Fit + error
Errors
An error is the difference between the observed \(Y_i\) and predicted \(\hat{Y_i}\)
Notation
The book uses \(\beta_0\) and \(\beta_1\) to represent the true population parameters.
In practice we don’t know the true population \(\beta_0\) and \(\beta_1\), so we estimate them, just as we did for the population mean.
In this case, we use the sample data to choose a line (regression coefficients) that is as close as possible to the observed sample data.
We use \(\hat{\beta_0}\) and \(\hat{\beta_1}\) to represent our estimates.
Closeness
Option 1: Minimize the sum of the magnitudes (absolute values) of the residuals: \[|u_1|+|u_2|+\ldots+|u_n|\]
Option 2: Minimize the sum of squared residuals (least squares): \[u_1^2 + u_2^2 +\ldots + u_n^2\]
The most common choice is the line that produces the “least squares” fit to the data: the ordinary least squares (OLS) estimator.
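As a small illustration, the two criteria can be written in R as hypothetical helper functions of a candidate line (b0, b1) and the sample data (x, y):

# Two ways to measure how "close" a candidate line (b0, b1) is to the data
sum_abs_resid <- function(b0, b1, x, y) sum(abs(y - (b0 + b1 * x)))   # Option 1
sum_sq_resid  <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)    # Option 2
# OLS chooses the (b0, b1) that minimizes sum_sq_resid, e.g. numerically:
# optim(c(0, 0), function(b) sum_sq_resid(b[1], b[2], x, y))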
Why OLS?
Most commonly used
Easier to compute, both by hand and with software
In many applications, a residual twice as large as another is usually more than twice as bad
The errors
Let \(\hat{\beta_0}\) and \(\hat{\beta_1}\) be some estimators of \(\beta_0\) and \(\beta_1\)
So that:
\[\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_i\]
The mistakes (residuals) are then:
\[\hat{u_i} = Y_i - \hat{Y_i} = Y_i - \hat{\beta_0} - \hat{\beta_1}X_i\]
We find the minimum by setting the partial derivatives to zero.
Essentially we are looking for the point where the function has zero slope in every direction: \[\frac{\partial}{\partial\hat{\beta_0}}\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2 = -2\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i) = 0\] and \[\frac{\partial}{\partial\hat{\beta_1}}\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2 = -2\sum_{i=1}^{n}X_i(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i) = 0\]
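Solving these two first-order conditions gives the standard closed-form OLS estimators:
\[\hat{\beta_1} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X}\]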
model_ols <- lm(score ~ stratio, data = CASchools)
summary(model_ols)
Call:
lm(formula = score ~ stratio, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-47.727 -14.251 0.483 12.822 48.540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9329 9.4675 73.825 < 2e-16 ***
stratio -2.2798 0.4798 -4.751 2.78e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
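As a check, the coefficients in the summary above can be reproduced directly from the closed-form formulas (a quick sketch reusing the same data):

# Reproduce the OLS coefficients by hand
b1_hat <- with(CASchools,
               sum((stratio - mean(stratio)) * (score - mean(score))) /
               sum((stratio - mean(stratio))^2))
b0_hat <- mean(CASchools$score) - b1_hat * mean(CASchools$stratio)
c(intercept = b0_hat, slope = b1_hat)   # should match 698.93 and -2.28 above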
Review 1
In the simple linear regression model, the regression slope:
A. indicates by how many units Y increases, given a one-unit increase in X.
B. represents the elasticity of Y on X.
C. indicates by how many percent Y increases, given a one percent increase in X.
D. when multiplied with the explanatory variable will give you the predicted Y.
Review 2
In the simple linear regression model \(Y_i = \beta_0 + \beta_1X_i + \mu_i\):
A. represents the population regression function.
B. the intercept is typically small and unimportant.
C. the absolute value of the slope is typically between 0 and 1.
D. represents the sample regression function.
Review 3
Assume that you have collected a sample of observations from over 100 households and their consumption and income patterns. Using these observations, you estimate the following regression \(C_i = \hat{\beta_0} + \hat{\beta_1} Y_i + \hat{u_i}\), where C is consumption and Y is disposable income. The estimate \(\hat{\beta_1}\) will tell you:
A. \(\frac{\Delta \text{Income}}{\Delta \text{Predicted Consumption}}\)
B. \(\frac{\Delta \text{Predicted Consumption}}{\Delta \text{Income}}\)
C. The amount you need to consume to survive
D. \(\frac{\text{Predicted Consumption}}{\text{Income}}\)
Review 4
The OLS residuals, \(\hat{u_i}\), are sample counterparts of the population:
A. regression function’s predicted values.
B. errors.
C. regression function intercept.
D. regression function slope.
Measures of fit
\(R^2\) and SER
Two regression statistics provide complementary measures of how well the regression line “fits” or explains the data:
The regression \(R^2\) measures the fraction of the variance of Y that is explained by X; it is unitless and ranges between zero (no fit) and one (perfect fit)
The standard error of the regression (SER) measures the magnitude of a typical regression residual in the units of Y.
Sums of Squares
Explained sum of squares (ESS):
\[\sum_{i=1}^{n}(\hat{Y_i} - \bar{Y})^2\]
Total sum of squares (TSS):
\[\sum_{i=1}^{n}(Y_i - \bar{Y})^2\]
Residual sum of squares (SSR):
\[\sum_{i=1}^{n}\hat{u_i}^2\]
\[TSS = ESS + SSR\]
\(R^2\)
\[R^2 = \frac{ESS}{TSS}\]
\(R^2\) is the ratio of the explained variation compared to the total variation
It is the fraction of the sample variation in Y that is explained by X
\(R^2\) in CASchools regression
summary(model_ols)$r.squared
[1] 0.05124009
\(R^2\) always lies between 0 (none of the variation in Y is explained by the model) and 1 (all of the variation is explained by the model). An \(R^2\) of 0.051 means that very little of the variation in test scores is explained by our model.
\(R^2\) in CASchools regression
Doing this manually:
ESS:
Code
# Calculate the mean of the dependent variable 'score'
mean_score <- mean(CASchools$score)
# Predicted values from the regression model
predicted_values <- predict(model_ols)
# Calculate the explained sum of squares (ESS)
ESS <- sum((predicted_values - mean_score)^2)
print(ESS)
[1] 7794.109
TSS
Code
# Calculate the total sum of squares (TSS)
TSS <- sum((CASchools$score - mean_score)^2)
print(TSS)
[1] 152109.6
Calculate \(R^2\)
Code
R2 <- ESS/TSS
print(R2)
[1] 0.05124009
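As an additional check, the residual sum of squares and the identity TSS = ESS + SSR can be verified with the same model object:

# Residual sum of squares (SSR) and the identity TSS = ESS + SSR
SSR <- sum(residuals(model_ols)^2)
print(SSR)
all.equal(TSS, ESS + SSR)     # TRUE (up to numerical rounding)
# Equivalently, R^2 = 1 - SSR/TSS
print(1 - SSR / TSS)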
Standard Errors
The SER measures the spread of the distribution of u. The SER is (almost) the sample standard deviation of the OLS residuals:
\[ SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} (\hat{u_i} - \bar{\hat{u}})^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\hat{u_i}^2} \]
Note
The above makes use of the fact that \(\bar{\hat{u}}\) is zero. Why would that be?
In other words, it is the square root of the average squared residual (corrected for degrees of freedom).
It has the units of u, which are the units of Y, and it measures the average “size” of the OLS residual (the average “mistake” made by the OLS regression line).
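A quick manual check, reusing the model_ols object fit earlier; the result should match the “Residual standard error: 18.58” reported by summary():

# Compute the SER directly from the OLS residuals
u_hat <- residuals(model_ols)
n <- length(u_hat)
SER <- sqrt(sum(u_hat^2) / (n - 2))
print(SER)
summary(model_ols)$sigma      # R's built-in residual standard error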
Review 5
The slope estimator, \(\beta_1\), has a smaller standard error, other things equal, if
A. there is a large variance of the error term, \(u\).
B. the intercept, \(\beta_0\) is small
C. there is more variation in the explanatory variable, \(X\).
D. the sample size is smaller
Practice with AgeEarnings in RCloud
Assumptions for Causal Inference
Causal inference
So far we have treated OLS as a way to draw a straight line through the data on Y and X. Under what conditions does the slope of this line have a causal interpretation?
When will the OLS estimator be unbiased for the causal effect on Y of X?
What is the variance of the OLS estimator over repeated samples?
Causal inference
To answer these questions, we need to make some assumptions about how Y and X are related to each other, and about how they are collected (the sampling scheme)
These assumptions – there are three – are known as the Least Squares Assumptions for Causal Inference.
Definition of Causal Effect
The causal effect on Y of a unit change in X is the expected difference in Y as measured in a randomized controlled experiment
For a binary treatment, the causal effect is the expected difference in means between the treatment and control groups, as discussed in Ch. 3
With a binary treatment, for the difference in means to measure a causal effect requires random assignment or as-if random assignment.
Random assignment ensures that the treatment (X) is uncorrelated with all other determinants of Y, so that there are no confounding variables
The least squares assumptions for causal inference generalize the binary treatment case to regression.
Assumptions for Causal Inference 1
Let \(\beta_1\) be the causal effect on Y of a change in X. The three least squares assumptions are:
(1) The conditional distribution of \(u_i\) given \(X_i\) has a mean of zero: \(E(u_i | X_i) = 0\)
(2) \((X_i, Y_i), i = 1, \ldots, n\) are i.i.d.
(3) Large outliers are unlikely: X and Y have nonzero finite fourth moments
Outliers, in particular, can result in meaningless values of \(\hat{\beta_1}\)
Importance
The three least squares assumptions are important because:
They help us organize the circumstances that pose difficulties for OLS
If the least squares assumptions hold, then mathematically, the sampling distributions of the OLS estimators are approximately normal in large samples.
Review 6
Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,
\[Wage = \beta_0 + \beta_1 Education + u\]
Suppose further that the only unobservable that can possibly affect both wage and education is intelligence of the individual.
OLS assumption (1): The conditional distribution of \(u_i\) given \(X_i\) has a mean of zero. (\(E(u_i | X_i)=0\))
Which of the following provides evidence in favor of OLS assumption #1?
\(E(Intelligence | Education = x) = E (Intelligence | Education = y)\) for all \(x \neq y\)
\(covariance(Intelligence, Education) \neq 0\)
\(corr(Intelligence, Education) \neq 0\)
\(corr(Intelligence, Education) = 0\)
Review 7
Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,
\[Wage = \beta_0 + \beta_1 Education + u\]
OLS assumption (2): \((X_i, Y_i), i = 1, \ldots, n\) are i.i.d.
Suppose you would like to draw a sample to study the effect of education on wage. Which of the following provides evidence in favor of OLS assumption #2?
A random sample is drawn from a population of college graduates.
A sample consisting of all honor students is drawn from a population of college graduates.
\(corr(Intelligence, Education) = 0\)
A sample consisting of a group of college students is drawn repeatedly each year over the course of their college careers.
Review 8
Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,
\[Wage = \beta_0 + \beta_1 Education + u\]
OLS assumption (3): Large outliers are unlikely. Mathematically, X and Y have nonzero finite fourth moments: (\(0<E(X^4_i)<\infty\) and \(0<E(Y^4_i)<\infty\))
Suppose you would like to draw a sample to study the effect of education on wage. Which of the following provides evidence in favor of OLS assumption #3?
For some individuals in the sample, years of education were recorded in days rather than years.
The maximum wage an individual can get is a finite number.
Half of the wages in the sample were incorrectly multiplied by 1 million when recorded.
The years of education an individual can get is bounded from above.
Sampling Distributions
Samples
The OLS estimator is computed from a sample of data. A different sample yields a different value of \(\hat{\beta_1}\). This is the source of the “sampling uncertainty” of \(\hat{\beta_1}\)
We want to:
quantify the sampling uncertainty associated with \(\hat{\beta_1}\)
use \(\hat{\beta_1}\) to test hypotheses such as \(\beta_1=0\)
construct a confidence interval for \(\beta_1\)
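For the class-size regression fit earlier, R already reports the building blocks for these tasks; shown here as a quick preview using standard functions:

# Estimates, standard errors, t-statistics, and p-values
coef(summary(model_ols))
# 95% confidence intervals for the intercept and slope
confint(model_ols, level = 0.95)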
Sampling distributions
Just like \(\bar{Y}\), \(\hat{\beta_1}\) has a sampling distribution
Recall that:
\(\bar{Y}\) is an estimator of the unknown population mean \(\mu_Y\)
We take random samples and compute their means (\(\bar{Y}\)) to try to estimate the population mean (\(\mu_Y\))
Because of the random sampling, \(\bar{Y}\) is a random variable
\(E(\bar{Y}) = \mu_Y\)
if \(n\) is large, then the distribution of sample means is approximately normal
Sampling distribution of \(\hat{\beta_1}\)
The OLS estimator is computed from a sample of data. A different sample yields a different value of \(\hat{\beta_1}\). This is the source of the “sampling uncertainty” of \(\hat{\beta_1}\)
Under the 3 least squares assumptions, we can say that \(\hat{\beta_1}\) is an unbiased estimator of \(\beta_1\)
if \(n\) is large (more than 100?)
distribution of \(\hat{\beta_1}\) is normal
In addition, the estimator is consistent
as \(n\) increases, the variance falls, and it becomes tightly concentrated around the true \(\beta_1\)
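A small simulation sketch (using a made-up data-generating process, not the CASchools data) illustrates these properties: across repeated samples the slope estimates center on the true slope and look roughly normal:

# Simulate the sampling distribution of the OLS slope estimator
set.seed(1)
true_b1 <- -2
one_draw <- function(n) {
  x <- runif(n, 14, 26)                        # hypothetical class sizes
  y <- 700 + true_b1 * x + rnorm(n, sd = 15)   # hypothetical data-generating process
  coef(lm(y ~ x))[2]                           # slope estimate from this sample
}
draws <- replicate(1000, one_draw(n = 100))
mean(draws)   # close to the true slope of -2 (unbiasedness)
sd(draws)     # sampling uncertainty; shrinks as n grows
hist(draws)   # approximately normal for large n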
Variance of \(X_i\)
The larger the variance of \(X_i\), the smaller the variance of \(\hat{\beta_1}\).
This makes intuitive sense, since an X variable that doesn’t vary will have a very hard time predicting a Y variable that does vary; the simulation sketch below illustrates the point.
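A simulation sketch with hypothetical data: holding everything else fixed, increasing the spread of X shrinks the standard error of the slope estimate.

# Standard error of the slope for a given amount of variation in X
se_slope <- function(x_sd, n = 200) {
  x <- rnorm(n, mean = 0, sd = x_sd)
  y <- 5 + 2 * x + rnorm(n, sd = 10)
  coef(summary(lm(y ~ x)))["x", "Std. Error"]
}
set.seed(2)
se_slope(x_sd = 1)   # little variation in X: larger standard error
se_slope(x_sd = 5)   # more variation in X: smaller standard error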
Which would you prefer?
Importance
The normal approximation to the sampling distribution of \(\hat{\beta_1}\) (and also \(\hat{\beta_0}\) !) is a powerful tool.
With this approximation in hand, we are able to develop methods for making inferences about the true population values of the regression coefficients using only a sample of data.
Review 9
The reason why estimators have a sampling distribution is that:
in real life you typically get to sample many times.
individuals respond differently to incentives.
the values of the explanatory variable and the error term differ across samples.
economics is not a precise science.