Introduction to Regression

Econ 358: Econometrics

Agenda

  1. Introduction to Regression
  2. Fitting a Line
  3. Measures of Fit
  4. Assumptions for Causal Inference
  5. Sample Distributions

Regression

Regression is about predicting the average value of Y given some value of X.

Ultimately we will want to be able to examine the effect of class size on test scores, controlling for other variables that might also influence test scores.

Regression is going to help get us a bit closer to this ideal.

We are still going to have the problems of core assumptions and extrapolating from a sample to a population.

A scatter plot shows a linear, negative, weak relationship

Regression

Regression attempts to make predictions about the average outcome (test scores) based on some information about other variables (class size, income, etc…)

We will start with the simplest case:

\[ \text{Test Score} = \beta_0 + \beta_1\text{Class Size} + \text{error}\]

This proposes a linear relationship between class size and test scores.

Regression

\[ \text{Test Score} = \beta_0 + \beta_1\text{Class Size} + \text{error}\]

This regression tells you what the test score will be, on average, for districts with a given class size (using X to predict Y)

It does not tell you what specifically the test score will be in any one district

…there is error!

Regression terms

“Explains variable y in terms of variable x”

Interpreting coefficients in a regression

\[ \text{Test Score} = \beta_0 + \beta_1\text{Class Size} + \text{error}\]

The coefficient \(\beta_1\) tells us:

\[\frac{\Delta \text{Test Score}}{\Delta \text{Class Size}}\]

Or in words:

“The change in average test score given a one unit change in Class Size”

Interpreting the “mistakes” in a regression

  • The regression error consists of omitted factors.

  • In general, these omitted factors are other factors that influence Y, other than the variable X.

  • The regression error also includes error in the measurement of Y.

Imagine for a minute

If we had a perfect relationship, what would be true?

How do we fit a line?


Let’s think a bit more about fitting a line to data.


Given a cloud of data, how do we choose a best fitting line?

Which line provides the best fit?

How do we choose a line?


  • Ok, aside from eyeballing, how do we choose a line?

  • We begin by looking at the errors

  • These are the leftovers from the model fit:

Data = Fit + error

Errors

An error is the difference between the observed \(Y_i\) and predicted \(\hat{Y_i}\)

Notation

  • The book uses \(\beta_0\) and \(\beta_1\) to represent the true population parameters

  • In practice we don’t know the true population \(\beta_0\) and \(\beta_1\), so we estimate them, just as we did for the mean of the population.

  • In this case, we are going to use the sample data to choose a line (regression coefficients) that is as close as possible to the observed sample data.

  • We use \(\hat{\beta_0}\) and \(\hat{\beta_1}\) to represent our estimates

Closeness

Option 1: Minimize the sum of magnitudes (absolute values) of the residuals \[|\hat{u}_1|+|\hat{u}_2|+\ldots+|\hat{u}_n|\]

Option 2: Minimize the sum of squared residuals – least squares \[\hat{u}_1^2 + \hat{u}_2^2 +\ldots + \hat{u}_n^2\]

The most common approach is to choose the line that produces the “least squares” fit to these data: the ordinary least squares (OLS) estimator.

Why OLS?

  1. Most commonly used
  2. Easier to compute by hand and using software
  3. In many applications, a residual twice as large as another is usually more than twice as bad

The errors

Let \(\hat{\beta_0}\) and \(\hat{\beta_1}\) be some estimators of \(\beta_0\) and \(\beta_1\)

So that:

\[\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_i\] The mistakes are then:

\[Y_i - (\hat{\beta_0} + \hat{\beta_1}X_i) = Y_i - \hat{\beta_0} - \hat{\beta_1}X_i\]

When dealing with a sample, we call the errors (mistakes): residuals or \(\hat{u}\)

Setting up the problem

So our job is to find the values of \(\hat{\beta_0}\) and \(\hat{\beta_1}\) that solve:

\[\min_{\hat{\beta_0},\,\hat{\beta_1}} \sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2\]
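To make the two notions of “closeness” concrete, here is a minimal sketch on simulated data (the names x, y, ssr, sad, and start are illustrative, not from the text). It scores a candidate line by each criterion and lets optim() search numerically for the minimizing intercept and slope; the least-squares answer should agree with what lm() returns.

Code
# Simulated data, loosely in the spirit of class sizes and test scores
set.seed(358)
x <- runif(100, 14, 26)
y <- 700 - 2.3 * x + rnorm(100, sd = 15)

# Option 2: sum of squared residuals for a candidate line b = (intercept, slope)
ssr <- function(b) sum((y - b[1] - b[2] * x)^2)
# Option 1: sum of absolute residuals for the same candidate line
sad <- function(b) sum(abs(y - b[1] - b[2] * x))

start <- c(mean(y), 0)                               # a sensible starting guess
optim(start, ssr, control = list(maxit = 5000))$par  # numerical least squares
coef(lm(y ~ x))                                      # closed-form OLS, for comparison
optim(start, sad, control = list(maxit = 5000))$par  # least-absolute-deviations line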

Rotations

The Sum of Squared Residuals

Minimizing

Just like in Econ 350 when you tried to optimize for the consumer or producer, we can minimize the sum of squared residuals

We first take the partial derivatives:

\[\frac{\partial}{\partial\hat{\beta_0}}\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2 = -2\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)\] and

\[\frac{\partial}{\partial\hat{\beta_1}}\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2 = -2\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)X_i\]

Finding the minimum

We find the minimum by setting the partial derivatives to zero.

Essentially we are looking for the point where the function is flat, that is, where it has zero slope in every direction \[\frac{\partial}{\partial\hat{\beta_0}}\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2 = -2\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i) = 0\] and

\[\frac{\partial}{\partial\hat{\beta_1}}\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)^2 = -2\sum_{i=1}^{n}(Y_i - \hat{\beta_0} - \hat{\beta_1}X_i)X_i =0 \]

OLS Estimators

Sparing you the derivation:

\[\hat{\beta_1} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

\[\hat{\beta_0} = \bar{Y}-\hat{\beta_1}\bar{X}\]

The OLS predicted values \(\hat{Y_i}\) and residuals \(\hat{u_i}\) are: \[\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_i, \: i = 1,\ldots,n\]

\[\hat{u_i} = Y_i - \hat{Y_i}, \: i=1,\ldots,n\]

The numerator of \(\hat{\beta_1}\) measures how X and Y vary together, similar to a correlation coefficient.

The denominator measures the variability of the independent variable X.

The intercept equation tells us the portion of Y that is not affected by X.
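As a quick sanity check, the two formulas can be typed almost verbatim into R. The sketch below assumes the CASchools data frame already contains the score and stratio columns used in the regression later in these slides (in the raw AER data these are typically constructed first).

Code
x <- CASchools$stratio
y <- CASchools$score

# Slope: how X and Y vary together, scaled by the variability of X
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: forces the line through the point of means
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(intercept = beta0_hat, slope = beta1_hat)  # should match coef(lm(score ~ stratio, data = CASchools))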

Our Schools Data

Predicted values and residuals

One of the districts in the data set is Antelope, CA, for which STR =19.33 and Test Score = 657.8

Predicted value: \(\hat{Y}_{Antelope} = 698.9-2.28\times19.33 = 654.8\)

Residual: \(\hat{u}_{Antelope} = 657.8 - 654.8 = 3.0\)

In R

Code
model_ols <- lm(score ~ stratio, data = CASchools)

summary(model_ols)

Call:
lm(formula = score ~ stratio, data = CASchools)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.727 -14.251   0.483  12.822  48.540 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
stratio      -2.2798     0.4798  -4.751 2.78e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared:  0.05124,   Adjusted R-squared:  0.04897 
F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
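The Antelope numbers can be reproduced from model_ols. Rather than guessing how the district name is stored in the data, the sketch below simply plugs the slide’s STR value into predict(); with the unrounded coefficients the prediction comes out near 654.9 (the 654.8 on the earlier slide uses the rounded intercept and slope).

Code
str_antelope   <- 19.33   # student-teacher ratio for Antelope, CA
score_antelope <- 657.8   # observed test score

predicted <- predict(model_ols, newdata = data.frame(stratio = str_antelope))
predicted                  # about 654.9
score_antelope - predicted # residual of about 2.9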

Review 1

In the simple linear regression model, the regression slope:

A. indicates by how many units Y increases, given a one-unit increase in X.

B. represents the elasticity of Y on X.

C. indicates by how many percent Y increases, given a one percent increase in X.

D. when multiplied with the explanatory variable will give you the predicted Y.

Review 1

In the simple linear regression model, the regression slope:

A. indicates by how many units Y increases, given a one-unit increase in X.

B. represents the elasticity of Y on X.

C. indicates by how many percent Y increases, given a one percent increase in X.

D. when multiplied with the explanatory variable will give you the predicted Y.

Review 2

In the simple linear regression model \(Y_i = \beta_0 + \beta_1X_i + u_i\):

A. represents the population regression function.

B. the intercept is typically small and unimportant.

C. the absolute value of the slope is typically between 0 and 1.

D. represents the sample regression function.

Review 2

In the simple linear regression model \(Y_i = \beta_0 + \beta_1X_i + u_i\):

A. represents the population regression function.

B. the intercept is typically small and unimportant.

C. the absolute value of the slope is typically between 0 and 1.

D. represents the sample regression function.

Review 3

Assume that you have collected a sample of observations from over 100 households and their consumption and income patterns. Using these observations, you estimate the following regression \(C_i = \hat{\beta_0} + \hat{\beta_1} Y_i + \hat{u_i}\), where C is consumption and Y is disposable income. The estimate \(\hat{\beta_1}\) will tell you:

A. \(\frac{\Delta \text{Income}}{\Delta \text{Predicted Consumption}}\)

B. \(\frac{\Delta \text{Predicted Consumption}}{\Delta \text{Income}}\)

C. The amount you need to consume to survive

D. \(\frac{\text{Predicted Consumption}}{\text{Income}}\)

Review 3

Assume that you have collected a sample of observations from over 100 households and their consumption and income patterns. Using these observations, you estimate the following regression \(C_i = \hat{\beta_0} + \hat{\beta_1} Y_i + \hat{u_i}\), where C is consumption and Y is disposable income. The estimate \(\hat{\beta_1}\) will tell you:

A. \(\frac{\Delta \text{Income}}{\Delta \text{Predicted Consumption}}\)

B. \(\frac{\Delta \text{Predicted Consumption}}{\Delta \text{Income}}\)

C. The amount you need to consume to survive

D. \(\frac{\text{Predicted Consumption}}{\text{Income}}\)

Review 4

The OLS residuals, \(\hat{u_i}\), are sample counterparts of the population:

A. regression function’s predicted values.

B. errors.

C. regression function intercept.

D. regression function slope.

Review 4

The OLS residuals, \(\hat{u_i}\), are sample counterparts of the population:

A. regression function’s predicted values.

B. errors.

C. regression function intercept.

D. regression function slope.

\(R^2\) and SER

Two regression statistics provide complementary measures of how well the regression line “fits” or explains the data:

  • The regression \(R^2\) measures the fraction of the variance of Y that is explained by X; it is unitless and ranges between zero (no fit) and one (perfect fit)

  • The standard error of the regression (SER) measures the magnitude of a typical regression residual in the units of Y.

Sums of Squares

Explained sum of squares (ESS):

\[\sum_{i=1}^{n}(\hat{Y_i} - \bar{Y})^2\] Total sum of squares (TSS):

\[\sum_{i=1}^{n}(Y_i - \bar{Y})^2\] Residual sum of squares (SSR):

\[\sum_{i=1}^{n}\hat{u_i}^2\]

\[TSS = ESS + SSR\]

\(R^2\)

\[R^2 = \frac{ESS}{TSS}\]

  • \(R^2\) is the ratio of the explained variation compared to the total variation
  • It is the fraction of the sample variation in y that is explained by x

\(R^2\) in CASchools regression

summary(model_ols)$r.squared
[1] 0.05124009


\(R^2\) will always lie between 0 (none of the variation in Y is explained by the model) and 1 (all of it is). An \(R^2\) of 0.051 means that very little of the variation in test scores is explained by our model.

\(R^2\) in CASchools regression

Doing this manually:

ESS:

Code
# Calculate the mean of the dependent variable 'score'
mean_score <- mean(CASchools$score)

# Predicted values from the regression model
predicted_values <- predict(model_ols)

# Calculate the explained sum of squares (ESS)
ESS <- sum((predicted_values - mean_score)^2)

print(ESS)
[1] 7794.109

TSS

Code
# Calculate the total sum of squares (TSS)
TSS <- sum((CASchools$score - mean_score)^2)

print(TSS)
[1] 152109.6

Calculate \(R^2\)

Code
R2 <- ESS/TSS

print(R2)
[1] 0.05124009
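To close the loop on the decomposition, the residual sum of squares can be computed from the same fitted model and the identity TSS = ESS + SSR checked directly (this continues with the model_ols, ESS, and TSS objects defined above).

Code
# Residual sum of squares (SSR)
SSR <- sum(residuals(model_ols)^2)
print(SSR)

# The decomposition should hold up to rounding error
all.equal(TSS, ESS + SSR)

# R-squared computed the other way around
1 - SSR/TSS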

Standard Errors

  • The SER measures the spread of the distribution of u. The SER is (almost) the sample standard deviation of the OLS residuals:

\[ SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} (\hat{u_i} - \bar{\hat{u}})^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\hat{u_i}^2} \]

Note

The above makes use of the fact that \(\bar{\hat{u}}\) is zero. Why would that be?

  • In other words, it is the square root of the average squared residual (corrected for degrees of freedom)

  • It has the units of u, which are the units of Y, and it measures the average “size” of the OLS residual (the average “mistake” made by the OLS regression line); a quick check in R follows below
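The SER can be computed from its formula and compared with the “Residual standard error” that summary() reported earlier (18.58 on 418 degrees of freedom); sigma() extracts the same quantity directly from the fitted model.

Code
n <- nobs(model_ols)                                # 420 districts
SER <- sqrt(sum(residuals(model_ols)^2) / (n - 2))

SER                # about 18.58
sigma(model_ols)   # the residual standard error R reports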

Review 5

The slope estimator, \(\hat{\beta_1}\), has a smaller standard error, other things equal, if

A. there is a large variance of the error term, \(u\).

B. the intercept, \(\beta_0\) is small

C. there is more variation in the explanatory variable, \(X\).

D. the sample size is smaller

Review 5

The slope estimator, \(\hat{\beta_1}\), has a smaller standard error, other things equal, if

A. there is a large variance of the error term, \(u\).

B. the intercept, \(\beta_0\) is small

C. there is more variation in the explanatory variable, \(X\).

D. the sample size is smaller

Practice with AgeEarnings in RCloud

Causal inference

  • So far we have treated OLS as a way to draw a straight line through the data on Y and X. Under what conditions does the slope of this line have a causal interpretation?

  • When will the OLS estimator be unbiased for the causal effect on Y of X?

  • What is the variance of the OLS estimator over repeated samples?

Causal inference

  • To answer these questions, we need to make some assumptions about how Y and X are related to each other, and about how they are collected (the sampling scheme)

  • These assumptions – there are three – are known as the Least Squares Assumptions for Causal Inference.

Definition of Causal Effect

  • The causal effect on Y of a unit change in X is the expected difference in Y as measured in a randomized controlled experiment

  • For a binary treatment, the causal effect is the expected difference in means between the treatment and control groups, as discussed in Ch. 3

  • With a binary treatment, for the difference in means to measure a causal effect requires random assignment or as-if random assignment.

  • Random assignment ensures that the treatment (X) is uncorrelated with all other determinants of Y, so that there are no confounding variables

  • The least squares assumptions for causal inference generalize the binary treatment case to regression.

Assumptions for Causal Inference 1

Let \(\beta_1\) be the causal effect on Y of a change in X:

\[Y_i = \beta_0 + \beta_1 X_i + u_i, i = 1, \ldots, n\]

Assumption 1:

The conditional distribution of \(u\) given \(X\) has mean zero, that is,

\[E(u|X=x)=0\]

This implies that \(\hat{\beta_1}\) is unbiased for the causal effect \(\beta_1\)

Mean of \(u\) is zero
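A small simulation (not from the textbook; all numbers here are made up) can show why this assumption matters. When the error is unrelated to X, \(\hat{\beta_1}\) centers on the true slope across repeated samples; when an omitted factor sits in the error and is correlated with X, it does not.

Code
set.seed(358)
true_beta1 <- -2

one_draw <- function(confounded) {
  w <- rnorm(200)                               # an omitted factor in the error
  x <- rnorm(200) + if (confounded) w else 0    # X correlated with w, or not
  u <- 10 * w + rnorm(200)
  y <- 700 + true_beta1 * x + u
  coef(lm(y ~ x))[2]
}

mean(replicate(1000, one_draw(confounded = FALSE)))  # near -2: E(u|X) = 0 holds
mean(replicate(1000, one_draw(confounded = TRUE)))   # pulled away from -2: it fails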

Assumptions for Causal Inference 2

Let \(\beta_1\) be the causal effect on Y of a change in X:

\[Y_i = \beta_0 + \beta_1 X_i + u_i, i = 1, \ldots, n\]

Assumption 2:

\((X_i , Y_i), i = 1, \ldots, n\), are i.i.d.

This is true if \((X,Y)\) are collected by simple random sampling

This delivers the sampling distribution of \(\hat{\beta_0}\) and \(\hat{\beta_1}\)

Assumptions for Causal Inference 3

Let \(\beta_1\) be the causal effect on Y of a change in X:

\[Y_i = \beta_0 + \beta_1 X_i + u_i, i = 1, \ldots, n\]

Assumption 3:

Large outliers in \(X\) and \(Y\) are rare

  • Technically, X and Y have finite fourth moments
  • Outliers can result in meaningless values of \(\hat{\beta_1}\) (see the sketch below)
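The sketch below (simulated data, illustrative numbers only) shows how a single wildly mis-recorded X value can dominate the slope, which is why the finite-fourth-moment condition is worth taking seriously.

Code
set.seed(358)
x <- rnorm(50, mean = 20, sd = 2)      # e.g., class sizes around 20
y <- 700 - 2 * x + rnorm(50, sd = 10)
coef(lm(y ~ x))[2]                     # slope near the true value of -2

x_bad <- x
x_bad[1] <- 200                        # one class size mis-recorded as 200
coef(lm(y ~ x_bad))[2]                 # slope dragged toward zero by that single point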

Importance

The three least squares assumptions are important because:

  • They help us organize the circumstances that pose difficulties for OLS

  • If the least squares assumptions hold, then mathematically, the sampling distributions of the OLS estimators are normal in large samples.

Review 6

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

Suppose further that the only unobservable that can possibly affect both wage and education is intelligence of the individual.

OLS assumption (1): The conditional distribution of \(u_i\) given \(X_i\) has a mean of zero. (\(E(u_i | X_i)=0\))

Which of the following provides evidence in favor of OLS assumption #1?

  1. \(E(Intelligence | Education = x) = E (Intelligence | Education = y)\) for all \(x \neq y\)
  2. \(covariance(Intelligence, Education) \neq 0\)
  3. \(corr(Intelligence, Education) \neq 0\)
  4. \(corr(Intelligence, Education) = 0\)

Review 6

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

Suppose further that the only unobservable that can possibly affect both wage and education is intelligence of the individual.

OLS assumption (1): The conditional distribution of \(u_i\) given \(X_i\) has a mean of zero. (\(E(u_i | X_i)=0\))

Which of the following provides evidence in favor of OLS assumption #1?

  1. \(E(Intelligence | Education = x) = E (Intelligence | Education = y)\) for all \(x \neq y\)
  2. \(covariance(Intelligence, Education) \neq 0\)
  3. \(corr(Intelligence, Education) \neq 0\)
  4. \(corr(Intelligence, Education) = 0\)

Review 7

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

OLS assumption (2): \((X_i, Y_i), i = 1, \ldots, n\) are i.i.d.

Suppose you would like to draw a sample to study the effect of education on wage. Which of the following provides evidence in favor of OLS assumption #2?

  1. a random sample is drawn from a population of college graduates.

  2. A sample consisting of all honor students is drawn from a population of college graduates.

  3. \(corr(Intelligence, Education) = 0\)

  4. A sample consisting of a group of college students is drawn repeatedly each year over the course of their college careers.

Review 7

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

OLS assumption (2): \((X_i, Y_i), i = 1, \ldots, n\) are i.i.d.

Suppose you would like to draw a sample to study the effect of education on wage. Which of the following provides evidence in favor of OLS assumption #2?

  1. a random sample is drawn from a population of college graduates.

  2. A sample consisting of all honor students is drawn from a population of college graduates.

  3. \(corr(Intelligence, Education) = 0\)

  4. A sample consisting of a group of college students is drawn repeatedly each year over the course of their college careers.

Review 8

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

OLS assumption (3): Large outliers are unlikely. Mathematically, X and Y have nonzero finite fourth moments: (\(0<E(X^4_i)<\infty\) and \(0<E(Y^4_i)<\infty\))

Suppose you would like to draw a sample to study the effect of education on wage. Which of the following provides evidence in favor OLS assumption #3?

  1. For some individuals in the sample, years of education were recorded in days rather than years.

  2. The maximum wage an individual can get is a finite number.

  3. Half of the wages in the sample were incorrectly multiplied by 1 million when recorded.

  4. The years of education an individual can get is bounded from above.

Review 8

Suppose you are interested in studying the relationship between education and wage. More specifically, suppose that you believe the relationship to be captured by the following linear regression model,

\[Wage = \beta_0 + \beta_1 Education + u\]

OLS assumption (3): Large outliers are unlikely. Mathematically, X and Y have nonzero finite fourth moments: (\(0<E(X^4_i)<\infty\) and \(0<E(Y^4_i)<\infty\))

Suppose you would like to draw a sample to study the effect of education on wage. Which of the following provides evidence in favor OLS assumption #3?

  1. For some individuals in the sample, years of education were recorded in days rather than years.
  2. The maximum wage an individual can get is a finite number.
  3. Half of the wages in the sample were incorrectly multiplied by 1 million when recorded.
  4. The years of education an individual can get is bounded from above.

Samples

The OLS estimator is computed from a sample of data. A different sample yields a different value of \(\hat{\beta_1}\). This is the source of the “sampling uncertainty” of \(\hat{\beta_1}\)

We want to:

  • quantify the sampling uncertainty associated with \(\hat{\beta_1}\)

  • use \(\hat{\beta_1}\) to test hypotheses such as \(\beta_1=0\)

  • construct a confidence interval for \(\beta_1\)

Sampling distributions

  • Just like \(\bar{Y}\), \(\hat{\beta_1}\) has a sampling distribution

  • Recall that:

    • \(\bar{Y}\) is an estimator of the unknown population mean \(\mu_Y\)
    • We take random samples and compute their means (\(\bar{Y}\)) to try to estimate the population mean (\(\mu_Y\))
    • Because of the random sampling, \(\bar{Y}\) is a random variable
      • \(E(\bar{Y}) = \mu_Y\)
    • if \(n\) is large, then the distribution of sample means is normal

Sampling distribution of \(\hat{\beta_1}\)

  • The OLS estimator is computed from a sample of data. A different sample yields a different value of \(\hat{\beta_1}\). This is the source of the “sampling uncertainty” of \(\hat{\beta_1}\)

  • Under the three least squares assumptions, we can say that \(\hat{\beta_1}\) is an unbiased estimator of \(\beta_1\)

  • If \(n\) is large (more than 100?), the distribution of \(\hat{\beta_1}\) is approximately normal (see the simulation sketch below)

  • In addition, the estimator is consistent

    • as \(n\) increases, the variance falls, and \(\hat{\beta_1}\) becomes tightly concentrated around the true \(\beta_1\)
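Here is a minimal simulation sketch of that sampling distribution (hypothetical population values, not the CASchools data): draw many samples from the same population model, estimate the slope in each, and look at the collection of \(\hat{\beta_1}\) values.

Code
set.seed(358)
true_beta1 <- -2

beta1_draws <- replicate(2000, {
  x <- rnorm(100, mean = 20, sd = 2)
  y <- 700 + true_beta1 * x + rnorm(100, sd = 15)
  coef(lm(y ~ x))[2]
})

mean(beta1_draws)   # close to -2: unbiased
sd(beta1_draws)     # the spread of the sampling distribution
hist(beta1_draws)   # roughly bell-shaped when n is large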

Variance of \(X_i\)

  • The larger the variance of \(X_i\), the smaller the variance of \(\hat{\beta_1}\) (a small simulation below illustrates this)

  • This makes intuitive sense, since an X variable that doesn’t vary will have a very hard time predicting a Y variable that does vary.
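Continuing the simulation idea (again with made-up numbers), compare the spread of \(\hat{\beta_1}\) across repeated samples when X varies a lot versus when it barely varies at all.

Code
set.seed(358)
draw_beta1 <- function(x_sd) {
  replicate(2000, {
    x <- rnorm(100, mean = 20, sd = x_sd)
    y <- 700 - 2 * x + rnorm(100, sd = 15)
    coef(lm(y ~ x))[2]
  })
}

sd(draw_beta1(x_sd = 4))    # X varies a lot: beta1_hat is pinned down tightly
sd(draw_beta1(x_sd = 0.5))  # X barely varies: beta1_hat bounces around much more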

Which would you prefer?

Importance

  • The normal approximation to the sampling distribution of \(\hat{\beta_1}\) (and also \(\hat{\beta_0}\) !) is a powerful tool.

  • With this approximation in hand, we are able to develop methods for making inferences about the true population values of the regression coefficients using only a sample of data.

Review 9

The reason why estimators have a sampling distribution is that:

  1. in real life you typically get to sample many times.

  2. individuals respond differently to incentives.

  3. the values of the explanatory variable and the error term differ across samples.

  4. economics is not a precise science.

Review 9

The reason why estimators have a sampling distribution is that:

  1. in real life you typically get to sample many times.

  2. individuals respond differently to incentives.

  3. the values of the explanatory variable and the error term differ across samples.

  4. economics is not a precise science.