Introduction to Regression

Econ 358: Econometrics

Agenda

  1. Introduction to Regression
  2. Fitting a Line
  3. Measures of Fit
  4. Assumptions for Causal Inference
  5. Sample Distributions

Regression

Regression is about predicting the average value of Y given some value of X.

Ultimately we will want to be able to examine the effect of class size on test scores, controlling for other variables that might also influence test scores.

Regression is going to help get us a bit closer to this ideal.

We are still going to have the problems of core assumptions and extrapolating from a sample to a population.

A scatter plot shows a linear, negative, weak relationship

Regression

Regression attempts to make predictions about the average outcome (test scores) based on some information about other variables (class size, income, etc…)

We will start with the simplest case:

\[ \text{Test Score} = \beta_0 + \beta_1\text{Class Size} + \text{error}\]

This proposes a linear relationship between class size and test scores.

Regression

\[ \text{Test Score} = \beta_0 + \beta_1\text{Class Size} + \text{error}\]

This regression tells you what the test score will be, on average, for districts with a given class size (using X to predict Y)

It does not tell you what specifically the test score will be in any one district

…there is error!

Regression terms

“Explains variable y in terms of variable x”

Interpreting coefficients in a regression

\[ \text{Test Score} = \beta_0 + \beta_1\text{Class Size} + \text{error}\]

The coefficient \(\beta_1\) tells us:

\[\frac{\Delta \text{Test Score}}{\Delta \text{Class Size}}\]

Or in words:

“The change in average test score given a one unit change in Class Size”

Interpreting the “mistakes” in a regression

  • The regression error consists of omitted factors.

  • In general, these omitted factors are other factors that influence Y, other than the variable X.

  • The regression error also includes error in the measurement of Y.

Imagine for a minute

If we had a perfect relationship, what would be true?

How do we fit a line?


Let’s think a bit more about fitting a line to data.


Given a cloud of data, how do we choose a best fitting line?

Which line provides the best fit?

How do we choose a line?


  • Ok, aside from eyeballing, how do we choose a line?

  • We begin by looking at the errors

  • These are the leftovers from the model fit:

Data = Fit + error

Errors

An error is the difference between the observed \(Y_i\) and predicted \(\hat{Y_i}\)

Notation

  • The book uses \(\beta_0\) and \(\beta_1\) to represent the true population parameters

  • In practice we don’t know the true population \(\beta_0\) and \(\beta_1\), so we estimate them, just as we did for the mean of the population.

  • In this case, we are going to use the sample data to choose a line (regression coefficients) that is as close as possible to the observed sample data.

  • We use \(\hat{\beta_0}\) and \(\hat{\beta_1}\) to represent our estimates

Closeness

Option 1: Minimize the sum of magnitudes (absolute values) of the residuals \[|\hat{u}_1|+|\hat{u}_2|+\ldots+|\hat{u}_n|\]

Option 2: Minimize the sum of squared residuals – least squares \[\hat{u}_1^2 + \hat{u}_2^2 +\ldots + \hat{u}_n^2\]

The most common approach is to choose the line that produces the “least squares” fit to these data: the ordinary least squares (OLS) estimator.

Why OLS?

  1. Most commonly used
  2. Easier to compute by hand and using software
  3. In many applications, a residual twice as large as another is usually more than twice as bad

The errors

Let \(\hat{\beta_0}\) and \(\hat{\beta_1}\) be some estimators of \(\beta_0\) and \(\beta_1\)

So that:

\[\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_i\] The mistakes are then:

\[Y_i - (\hat{\beta_0} + \hat{\beta_1}X_i) = Y_i - \hat{\beta_0} - \hat{\beta_1}X_i\]

When dealing with a sample, we call the errors (mistakes): residuals or \(\hat{u}\)

Setting up the problem

So our job is to find the values of \(\hat{\beta_0}\) and \(\hat{\beta_1}\) that solve:

\[\min_{\hat{\beta_0},\,\hat{\beta_1}} \sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2\]
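To make the two notions of “closeness” concrete, here is a minimal sketch on simulated data (the names x, y, ssr, sad, and start are illustrative, not from the text). It scores a candidate line by each criterion and lets optim() search numerically for the minimizing intercept and slope; the least-squares answer should agree with what lm() returns.

Code
# Simulated data, loosely in the spirit of class sizes and test scores
set.seed(358)
x <- runif(100, 14, 26)
y <- 700 - 2.3 * x + rnorm(100, sd = 15)

# Option 2: sum of squared residuals for a candidate line b = (intercept, slope)
ssr <- function(b) sum((y - b[1] - b[2] * x)^2)
# Option 1: sum of absolute residuals for the same candidate line
sad <- function(b) sum(abs(y - b[1] - b[2] * x))

start <- c(mean(y), 0)                               # a sensible starting guess
optim(start, ssr, control = list(maxit = 5000))$par  # numerical least squares
coef(lm(y ~ x))                                      # closed-form OLS, for comparison
optim(start, sad, control = list(maxit = 5000))$par  # least-absolute-deviations line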

Rotations

The Sum of Squared Residuals

Minimizing

Just like in Econ 350 when you tried to optimize for the consumer or producer, we can minimize the sum of squared residuals

We first take the partial derivatives:

\[\frac{\partial}{\partial\hat{\beta_0}}\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2 = -2\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)\] and

\[\frac{\partial}{\partial\hat{\beta_1}}\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2 = -2\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)X_i\]

Finding the minimum

We find the minimum by setting the partial derivatives to zero.

Essentially we are looking for the point where the function is flat, that is, where it has zero slope in every direction \[\frac{\partial}{\partial\hat{\beta_0}}\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2 = -2\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i) = 0\] and

\[\frac{\partial}{\partial\hat{\beta_1}}\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2 = -2\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)X_i =0 \]

OLS Estimators

Sparing you the derivation:

\[\hat{\beta_1} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

\[\hat{\beta_0} = \bar{Y}-\hat{\beta_1}\bar{X}\]

The OLS predicted values \(\hat{Y_i}\) and residuals \(\hat{u_i}\) are: \[\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_i, \: i = 1,\ldots,n\]

\[\hat{u_i} = Y_i - \hat{Y_i}, \: i=1,\ldots,n\]

The numerator of \(\hat{\beta_1}\) measures how X and Y vary together, similar to a correlation coefficient.

The denominator measures the variability of the independent variable X.

The intercept equation tells us the portion of Y that is not affected by X.
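As a quick sanity check, the two formulas can be typed almost verbatim into R. The sketch below assumes the CASchools data frame already contains the score and stratio columns used in the regression later in these slides (in the raw AER data these are typically constructed first).

Code
x <- CASchools$stratio
y <- CASchools$score

# Slope: how X and Y vary together, scaled by the variability of X
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: forces the line through the point of means
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(intercept = beta0_hat, slope = beta1_hat)  # should match coef(lm(score ~ stratio, data = CASchools))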

Our Schools Data

Predicted values and residuals

One of the districts in the data set is Antelope, CA, for which STR =19.33 and Test Score = 657.8

Predicted value: \(\hat{Y}_{Antelope} = 698.9-2.28\times19.33 = 654.8\)

Residual: \(\hat{u}_{Antelope} = 657.8 - 654.8 = 3.0\)

In R

Code
model_ols <- lm(score ~ stratio, data = CASchools)

summary(model_ols)

Call:
lm(formula = score ~ stratio, data = CASchools)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.727 -14.251   0.483  12.822  48.540 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
stratio      -2.2798     0.4798  -4.751 2.78e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared:  0.05124,   Adjusted R-squared:  0.04897 
F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
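The Antelope numbers can be reproduced from model_ols. Rather than guessing how the district name is stored in the data, the sketch below simply plugs the slide’s STR value into predict(); with the unrounded coefficients the prediction comes out near 654.9 (the 654.8 on the earlier slide uses the rounded intercept and slope).

Code
str_antelope   <- 19.33   # student-teacher ratio for Antelope, CA
score_antelope <- 657.8   # observed test score

predicted <- predict(model_ols, newdata = data.frame(stratio = str_antelope))
predicted                  # about 654.9
score_antelope - predicted # residual of about 2.9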

Review 1

In the simple linear regression model, the regression slope:

A. indicates by how many units Y increases, given a one-unit increase in X.

B. represents the elasticity of Y on X.

C. indicates by how many percent Y increases, given a one percent increase in X.

D. when multiplied with the explanatory variable will give you the predicted Y.

Review 1

In the simple linear regression model, the regression slope:

A. indicates by how many units Y increases, given a one-unit increase in X.

B. represents the elasticity of Y on X.

C. indicates by how many percent Y increases, given a one percent increase in X.

D. when multiplied with the explanatory variable will give you the predicted Y.

Review 2

In the simple linear regression model \(Y_i = \beta_0 + \beta_1X_i + u_i\):

A. represents the population regression function.

B. the intercept is typically small and unimportant.

C. the absolute value of the slope is typically between 0 and 1.

D. represents the sample regression function.

Review 2

In the simple linear regression model \(Y_i = \beta_0 + \beta_1X_i + u_i\):

A. represents the population regression function.

B. the intercept is typically small and unimportant.

C. the absolute value of the slope is typically between 0 and 1.

D. represents the sample regression function.

Review 3

Assume that you have collected a sample of observations from over 100 households and their consumption and income patterns. Using these observations, you estimate the following regression \(C_i = \hat{\beta_0} + \hat{\beta_1} Y_i + \hat{u_i}\), where C is consumption and Y is disposable income. The estimate \(\hat{\beta_1}\) will tell you:

A. \(\frac{\Delta \text{Income}}{\Delta \text{Predicted Consumption}}\)

B. \(\frac{\Delta \text{Predicted Consumption}}{\Delta \text{Income}}\)

C. The amount you need to consume to survive

D. \(\frac{\text{Predicted Consumption}}{\text{Income}}\)

Review 3

Assume that you have collected a sample of observations from over 100 households and their consumption and income patterns. Using these observations, you estimate the following regression \(C_i = \hat{\beta_0} + \hat{\beta_1} Y_i + \hat{u_i}\), where C is consumption and Y is disposable income. The estimate \(\hat{\beta_1}\) will tell you:

A. \(\frac{\Delta \text{Income}}{\Delta \text{Predicted Consumption}}\)

B. \(\frac{\Delta \text{Predicted Consumption}}{\Delta \text{Income}}\)

C. The amount you need to consume to survive

D. \(\frac{\text{Predicted Consumption}}{\text{Income}}\)

Review 4

The OLS residuals, \(\hat{u_i}\), are sample counterparts of the population:

A. regression function’s predicted values.

B. errors.

C. regression function intercept.

D. regression function slope.

Review 4

The OLS residuals, \(\hat{u_i}\), are sample counterparts of the population:

A. regression function’s predicted values.

B. errors.

C. regression function intercept.

D. regression function slope.

\(R^2\) and SER

Two regression statistics provide complementary measures of how well the regression line “fits” or explains the data:

  • The regression \(R^2\) measures the fraction of the variance of Y that is explained by X; it is unitless and ranges between zero (no fit) and one (perfect fit)

  • The standard error of the regression (SER) measures the magnitude of a typical regression residual in the units of Y.

Sums of Squares

Explained sum of squares (ESS):

\[\sum_{i=1}^{n}(\hat{Y_i} - \bar{Y})^2\] Total sum of squares (TSS):

\[\sum_{i=1}^{n}(Y_i - \bar{Y})^2\] Residual sum of squares (SSR):

\[\sum_{i=1}^{n}\hat{u_i}^2\]

\[TSS = ESS + SSR\]

\(R^2\)

\[R^2 = \frac{ESS}{TSS}\]

  • \(R^2\) is the ratio of the explained variation compared to the total variation
  • It is the fraction of the sample variation in y that is explained by x

\(R^2\) in CASchools regression

summary(model_ols)$r.squared
[1] 0.05124009


\(R^2\) will always lie between 0 (none of the variation in Y is explained by the model) and 1 (all of it is). An \(R^2\) of 0.051 means that very little of the variation in test scores is explained by our model.

\(R^2\) in CASchools regression

Doing this manually:

ESS:

Code
# Calculate the mean of the dependent variable 'score'
mean_score <- mean(CASchools$score)

# Predicted values from the regression model
predicted_values <- predict(model_ols)

# Calculate the explained sum of squares (ESS)
ESS <- sum((predicted_values - mean_score)^2)

print(ESS)
[1] 7794.109

TSS

Code
# Calculate the total sum of squares (TSS)
TSS <- sum((CASchools$score - mean_score)^2)

print(TSS)
[1] 152109.6

Calculate \(R^2\)

Code
R2 <- ESS/TSS

print(R2)
[1] 0.05124009
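To close the loop on the decomposition, the residual sum of squares can be computed from the same fitted model and the identity TSS = ESS + SSR checked directly (this continues with the model_ols, ESS, and TSS objects defined above).

Code
# Residual sum of squares (SSR)
SSR <- sum(residuals(model_ols)^2)
print(SSR)

# The decomposition should hold up to rounding error
all.equal(TSS, ESS + SSR)

# R-squared computed the other way around
1 - SSR/TSS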

Standard Errors

  • The SER measures the spread of the distribution of u. The SER is (almost) the sample standard deviation of the OLS residuals:

\[ SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} (\hat{u_i} - \bar{\hat{u}})^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\hat{u_i}^2} \]

Note

The above makes use of the fact that \(\bar{\hat{u}}\) is zero. Why would that be?

  • In other words, it is the square root of the average squared residual (corrected for degrees of freedom)

  • It has the units of u, which are the units of Y, and it measures the average “size” of the OLS residual (the average “mistake” made by the OLS regression line); a quick check in R follows below
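The SER can be computed from its formula and compared with the “Residual standard error” that summary() reported earlier (18.58 on 418 degrees of freedom); sigma() extracts the same quantity directly from the fitted model.

Code
n <- nobs(model_ols)                                # 420 districts
SER <- sqrt(sum(residuals(model_ols)^2) / (n - 2))

SER                # about 18.58
sigma(model_ols)   # the residual standard error R reports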

Review 5

The slope estimator, \(\hat{\beta_1}\), has a smaller standard error, other things equal, if

A. there is a large variance of the error term, \(u\).

B. the intercept, \(\beta_0\) is small

C. there is more variation in the explanatory variable, \(X\).

D. the sample size is smaller

Review 5

The slope estimator, \(\hat{\beta_1}\), has a smaller standard error, other things equal, if

A. there is a large variance of the error term, \(u\).

B. the intercept, \(\beta_0\) is small

C. there is more variation in the explanatory variable, \(X\).

D. the sample size is smaller

Practice with AgeEarnings in RCloud

Causal inference

  • So far we have treated OLS as a way to draw a straight line through the data on Y and X. Under what conditions does the slope of this line have a causal interpretation?

  • When will the OLS estimator be unbiased for the causal effect on Y of X?

  • What is the variance of the OLS estimator over repeated samples?

Causal inference

  • To answer these questions, we need to make some assumptions about how Y and X are related to each other, and about how they are collected (the sampling scheme)

  • These assumptions – there are three – are known as the Least Squares Assumptions for Causal Inference.

Definition of Causal Effect

  • The causal effect on Y of a unit change in X is the expected difference in Y as measured in a randomized controlled experiment

  • For a binary treatment, the causal effect is the expected difference in means between the treatment and control groups, as discussed in Ch. 3

  • With a binary treatment, for the difference in means to measure a causal effect requires random assignment or as-if random assignment.

  • Random assignment ensures that the treatment (X) is uncorrelated with all other determinants of Y, so that there are no confounding variables

  • The least squares assumptions for causal inference generalize the binary treatment case to regression.

Assumptions for Causal Inference 1

Let \(\beta_1\) be the causal effect on Y of a change in X:

\[Y_i = \beta_0 + \beta_1 X_i + u_i, i = 1, \ldots, n\]

Assumption 1:

The conditional distribution of \(u\) given \(X\) has mean zero, that is,

\[E(u|X=x)=0\]

This implies that \(\hat{\beta_1}\) is unbiased for the causal effect \(\beta_1\)

Mean of \(u\) is zero
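A small simulation (not from the textbook; all numbers here are made up) can show why this assumption matters. When the error is unrelated to X, \(\hat{\beta_1}\) centers on the true slope across repeated samples; when an omitted factor sits in the error and is correlated with X, it does not.

Code
set.seed(358)
true_beta1 <- -2

one_draw <- function(confounded) {
  w <- rnorm(200)                               # an omitted factor in the error
  x <- rnorm(200) + if (confounded) w else 0    # X correlated with w, or not
  u <- 10 * w + rnorm(200)
  y <- 700 + true_beta1 * x + u
  coef(lm(y ~ x))[2]
}

mean(replicate(1000, one_draw(confounded = FALSE)))  # near -2: E(u|X) = 0 holds
mean(replicate(1000, one_draw(confounded = TRUE)))   # pulled away from -2: it fails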

Assumptions for Causal Inference 2

Let \(\beta_1\) be the causal effect on Y of a change in X:

\[Y_i = \beta_0 + \beta_1 X_i + u_i, i = 1, \ldots, n\]

Assumption 2:

\((X_i , Y_i), i = 1, \ldots, n\), are i.i.d.

This is true if \((X,Y)\) are collected by simple random sampling

This delivers the sampling distribution of \(\hat{\beta_0}\) and \(\hat{\beta_1}\)

Assumptions for Causal Inference 3

Let \(\beta_1\) be the causal effect on Y of a change in X:

\[Y_i = \beta_0 + \beta_1 X_i + u_i, i = 1, \ldots, n\]

Assumption 3:

Large outliers in \(X\) and \(Y\) are rare

  • Technically, X and Y have finite fourth moments
  • Outliers can result in meaningless values of \(\hat{\beta_1}\) (see the sketch below)
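The sketch below (simulated data, illustrative numbers only) shows how a single wildly mis-recorded X value can dominate the slope, which is why the finite-fourth-moment condition is worth taking seriously.

Code
set.seed(358)
x <- rnorm(50, mean = 20, sd = 2)      # e.g., class sizes around 20
y <- 700 - 2 * x + rnorm(50, sd = 10)
coef(lm(y ~ x))[2]                     # slope near the true value of -2

x_bad <- x
x_bad[1] <- 200                        # one class size mis-recorded as 200
coef(lm(y ~ x_bad))[2]                 # slope dragged toward zero by that single point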

Importance

The three least squares assumptions are important because:

  • They help us organize the circumstances that pose difficulties for OLS

  • If the least squares assumptions hold, then mathematically, the sampling distributions of the OLS estimators are normal in large samples.

Review 6

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

Suppose further that the only unobservable that can possibly affect both wage and education is intelligence of the individual.

OLS assumption (1): The conditional distribution of \(u_i\) given \(X_i\) has a mean of zero. (\(E(u_i | X_i)=0\))

Which of the following provides evidence in favor of OLS assumption #1?

  1. \(E(Intelligence | Education = x) = E (Intelligence | Education = y)\) for all \(x \neq y\)
  2. \(covariance(Intelligence, Education) \neq 0\)
  3. \(corr(Intelligence, Education) \neq 0\)
  4. \(corr(Intelligence, Education) = 0\)

Review 6

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

Suppose further that the only unobservable that can possibly affect both wage and education is intelligence of the individual.

OLS assumption (1): The conditional distribution of \(u_i\) given \(X_i\) has a mean of zero. (\(E(u_i | X_i)=0\))

Which of the following provides evidence in favor of OLS assumption #1?

  1. \(E(Intelligence | Education = x) = E (Intelligence | Education = y)\) for all \(x \neq y\)
  2. \(covariance(Intelligence, Education) \neq 0\)
  3. \(corr(Intelligence, Education) \neq 0\)
  4. \(corr(Intelligence, Education) = 0\)

Review 7

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

OLS assumption (2): \((X_i, Y_i), i = 1, \ldots, n\) are i.i.d.

Suppose you would like to draw a sample to study the effect of education on wage. Which of the following provides evidence in favor of OLS assumption #2?

  1. a random sample is drawn from a population of college graduates.

  2. A sample consisting of all honor students is drawn from a population of college graduates.

  3. \(corr(Intelligence, Education) = 0\)

  4. A sample consisting of a group of college students is drawn repeatedly each year over the course of their college careers.

Review 7

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

OLS assumption (2): \((X_i, Y_i), i = 1, \ldots, n\) are i.i.d.

Suppose you would like to draw a sample to study the effect of education on wage. Which of the following provides evidence in favor of OLS assumption #2?

  1. a random sample is drawn from a population of college graduates.

  2. A sample consisting of all honor students is drawn from a population of college graduates.

  3. \(corr(Intelligence, Education) = 0\)

  4. A sample consisting of a group of college students is drawn repeatedly each year over the course of their college careers.

Review 8

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

OLS assumption (3): Large outliers are unlikely. Mathematically, X and Y have nonzero finite fourth moments: (\(0<E(X^4_i)<\infty\) and \(0<E(Y^4_i)<\infty\))

Suppose you would like to draw a sample to study the effect of education on wage. Which of the following provides evidence in favor OLS assumption #3?

  1. For some individuals in the sample, years of education were recorded in days rather than years.

  2. The maximum wage an individual can get is a finite number.

  3. Half of the wages in the sample were incorrectly multiplied by 1 million when recorded.

  4. The years of education an individual can get is bounded from above.

Review 8

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

OLS assumption (3): Large outliers are unlikely. Mathematically, X and Y have nonzero finite fourth moments: (\(0<E(X^4_i)<\infty\) and \(0<E(Y^4_i)<\infty\))

Suppose you would like to draw a sample to study the effect of education on wage. Which of the following provides evidence in favor OLS assumption #3?

  1. For some individuals in the sample, years of education were recorded in days rather than years.
  2. The maximum wage an individual can get is a finite number.
  3. Half of the wages in the sample were incorrectly multiplied by 1 million when recorded.
  4. The years of education an individual can get is bounded from above.

Samples

The OLS estimator is computed from a sample of data. A different sample yields a different value of \(\hat{\beta_1}\). This is the source of the “sampling uncertainty” of \(\hat{\beta_1}\)

We want to:

  • quantify the sampling uncertainty associated with \(\hat{\beta_1}\)

  • use \(\hat{\beta_1}\) to test hypotheses such as \(\beta_1=0\)

  • construct a confidence interval for \(\beta_1\)

Sampling distributions

  • Just like \(\bar{Y}\), \(\hat{\beta_1}\) has a sampling distribution

  • Recall that:

    • \(\bar{Y}\) is an estimator of the unknown population mean \(\mu_Y\)
    • We take random samples and compute their means (\(\bar{Y}\)) to try to estimate the population mean (\(\mu_Y\))
    • Because of the random sampling, \(\bar{Y}\) is a random variable
      • \(E(\bar{Y}) = \mu_Y\)
    • if \(n\) is large, then the distribution of sample means is normal

Sampling distribution of \(\hat{\beta_1}\)

  • The OLS estimator is computed from a sample of data. A different sample yields a different value of \(\hat{\beta_1}\). This is the source of the “sampling uncertainty” of \(\hat{\beta_1}\)

  • Under the three least squares assumptions, we can say that \(\hat{\beta_1}\) is an unbiased estimator of \(\beta_1\)

  • If \(n\) is large (more than 100?), the distribution of \(\hat{\beta_1}\) is approximately normal (see the simulation sketch below)

  • In addition, the estimator is consistent

    • as \(n\) increases, the variance falls, and \(\hat{\beta_1}\) becomes tightly concentrated around the true \(\beta_1\)
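Here is a minimal simulation sketch of that sampling distribution (hypothetical population values, not the CASchools data): draw many samples from the same population model, estimate the slope in each, and look at the collection of \(\hat{\beta_1}\) values.

Code
set.seed(358)
true_beta1 <- -2

beta1_draws <- replicate(2000, {
  x <- rnorm(100, mean = 20, sd = 2)
  y <- 700 + true_beta1 * x + rnorm(100, sd = 15)
  coef(lm(y ~ x))[2]
})

mean(beta1_draws)   # close to -2: unbiased
sd(beta1_draws)     # the spread of the sampling distribution
hist(beta1_draws)   # roughly bell-shaped when n is large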

Variance of \(X_i\)

  • The larger the variance of \(X_i\), the smaller the variance of \(\hat{\beta_1}\) (a small simulation below illustrates this)

  • This makes intuitive sense, since an X variable that doesn’t vary will have a very hard time predicting a Y variable that does vary.
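Continuing the simulation idea (again with made-up numbers), compare the spread of \(\hat{\beta_1}\) across repeated samples when X varies a lot versus when it barely varies at all.

Code
set.seed(358)
draw_beta1 <- function(x_sd) {
  replicate(2000, {
    x <- rnorm(100, mean = 20, sd = x_sd)
    y <- 700 - 2 * x + rnorm(100, sd = 15)
    coef(lm(y ~ x))[2]
  })
}

sd(draw_beta1(x_sd = 4))    # X varies a lot: beta1_hat is pinned down tightly
sd(draw_beta1(x_sd = 0.5))  # X barely varies: beta1_hat bounces around much more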

Which would you prefer?

Importance

  • The normal approximation to the sampling distribution of \(\hat{\beta_1}\) (and also \(\hat{\beta_0}\) !) is a powerful tool.

  • With this approximation in hand, we are able to develop methods for making inferences about the true population values of the regression coefficients using only a sample of data.

Review 9

The reason why estimators have a sampling distribution is that:

  1. in real life you typically get to sample many times.

  2. individuals respond differently to incentives.

  3. the values of the explanatory variable and the error term differ across samples.

  4. economics is not a precise science.

Review 9

The reason why estimators have a sampling distribution is that:

  1. in real life you typically get to sample many times.

  2. individuals respond differently to incentives.

  3. the values of the explanatory variable and the error term differ across samples.

  4. economics is not a precise science.