Introduction to Econometrics

Introduction

Nathaniel Cline

Agenda

Introductions

Overview of the course

Causality and statistical methods

Review and to do

About me

Prof. Cline, Dr. Cline, or Nate

Born and raised in MD just outside of DC
PhD in Economics from the University of Utah
Research interests in macroeconomics, international political economy, and childcare
In my spare time I make furniture with hand tools!

Contacting

nathaniel_cline@redlands.edu

Schedule an appointment or drop by my office T, Th 3pm - 4pm.

Office: Duke 201

About You

Name
Major/emphasis
What ridiculous thing would you buy?

Course Materials

Stock and Watson, Introduction to Econometrics, 4th edition

MyLab Economics

positCloud RStudio

Where our class lives

Canvas

MyLab

positCloud

Grading

Course Outline

Introduction and Review

Introduction

Review of Probability

Review of Statistics

Introduction to RStudio

Fundamentals of Regression

Regression with one regressor

Hypothesis tests and confidence intervals 1

Regression with multiple regressors

Hypothesis tests and confidence intervals 2

Further Topics

Non-linear regression

Assessing studies

Regression with a binary dependent variable

Instrumental variables

Experiments and quasi-experiments

Time series and forecasting

Overview

Economics suggests important relationships, often with policy implications, but virtually never suggests quantitative magnitudes of causal effects.

What is the quantitative effect of reducing class size on student achievement?
How does another year of education change earnings?
How much do cigarette taxes reduce smoking?
What is the effect on output growth of a 1 percentage point increase in interest rates by the Fed?

Causal Effects

Does health insurance lead to better health outcomes?

Table 1: Husbands
	Some HI	No HI	Difference
Health Index	4.01 [.93]	3.70 [1.01]	.31 (.03)

Table 2: Wives
	Some HI	No HI	Difference
Health Index	4.02 [.92]	3.62 [1.01]	.39 (.04)

Excerpted from Angrist and Pischke (2015) using 2009 National Health Interview Survey (NHIS) data. Standard deviations are in brackets; standard errors are reported in parentheses.

Experimental

The data in the previous table are an example of “observational” or “non-experimental” data.

Experimental data would result from controlled random assignment.

Understanding check

Describe a hypothetical ideal randomized controlled experiment to study the effect on highway traffic deaths of wearing seat belts. Suggest some impediments to implementing this experiment in practice.

Most of the course deals with the difficulties that arise from using observational data to estimate causal effects.

Famous econometric questions and problems

What are the returns to schooling?

Confounding effects: many factors systematically affect both wages and years of schooling.

What is the effect of minimum wage on employment?

Selection bias: states that change their minimum wage might differ systematically from states that do not.

What is the effect of policing on crime?

Simultaneous causality: policing may affect crime, but high crime in an area may also lead to more policing.
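To make the confounding problem concrete, here is a minimal simulation sketch in base R. Everything in it is hypothetical: the variable names, the assumed "true" return to schooling of 0.08, and the way unobserved ability drives both schooling and wages. It compares the slope from observational data with the slope obtained when schooling is assigned at random.

```r
set.seed(1)            # arbitrary seed, only for reproducibility
n <- 10000
true_effect <- 0.08    # assumed effect of one more year of schooling on log wages

ability <- rnorm(n)    # hypothetical unobserved confounder

# Observational world: higher-ability people choose more schooling
school_obs <- 12 + 2 * ability + rnorm(n)
wage_obs   <- true_effect * school_obs + 0.3 * ability + rnorm(n, sd = 0.5)

# Experimental world: schooling is assigned at random, unrelated to ability
school_rct <- 12 + rnorm(n, sd = sqrt(5))
wage_rct   <- true_effect * school_rct + 0.3 * ability + rnorm(n, sd = 0.5)

# Slope of wages on schooling in each world
b_obs <- unname(coef(lm(wage_obs ~ school_obs))[2])
b_rct <- unname(coef(lm(wage_rct ~ school_rct))[2])

c(observational = b_obs, randomized = b_rct)
```

Under these assumed parameters the observational slope is pulled toward 0.08 + 0.3 × 2/5 = 0.20, because schooling also proxies for ability, while the randomized slope stays near the assumed 0.08.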

In this course you will

Learn methods for estimating causal effects using observational data
Learn methods for prediction
Focus on applications
Learn to evaluate the regression analysis of others
Get some hands-on experience with regression analysis in your problem sets.

What is the quantitative effect of reducing class size on student achievement?

The California Test Score Data Set

All K-6 and K-8 California school districts (n = 420)
Book uses:
- 5th grade test scores (Stanford-9 achievement test, combined math and reading), district average
- Student-teacher ratio (STR) = number of students in the district divided by the number of full-time equivalent teachers

Loading Data

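library(AER)        # provides the CASchools data set
library(tidyverse)  # glimpse(), slice(), mutate(), %>%, ggplot()
data(CASchools)

glimpse(slice(CASchools, 1:3))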

Rows: 3
Columns: 14
$ district    <chr> "75119", "61499", "61549"
$ school      <chr> "Sunol Glen Unified", "Manzanita Elementary", "Thermalito …
$ county      <fct> Alameda, Butte, Butte
$ grades      <fct> KK-08, KK-08, KK-08
$ students    <dbl> 195, 240, 1550
$ teachers    <dbl> 10.90, 11.15, 82.90
$ calworks    <dbl> 0.5102, 15.4167, 55.0323
$ lunch       <dbl> 2.0408, 47.9167, 76.3226
$ computer    <dbl> 67, 101, 169
$ expenditure <dbl> 6384.911, 5099.381, 5501.955
$ income      <dbl> 22.690, 9.824, 8.978
$ english     <dbl> 0.000000, 4.583333, 30.000002
$ read        <dbl> 691.6, 660.5, 636.3
$ math        <dbl> 690.0, 661.9, 650.9

But we want the combo test scores and student teacher ratio…

Transforming

CASchools <- CASchools %>%
  mutate(stratio = students/teachers,
         score = (math + read)/2)

summary_stats <- data.frame(
  Metric = c("Student-Teacher Ratio", "Test Score"),
  Mean = c(mean(CASchools$stratio), mean(CASchools$score)),
  SD = c(sd(CASchools$stratio), sd(CASchools$score)),
  P10 = c(quantile(CASchools$stratio, 0.10), quantile(CASchools$score, 0.10)),
  P25 = c(quantile(CASchools$stratio, 0.25), quantile(CASchools$score, 0.25)),
  P40 = c(quantile(CASchools$stratio, 0.40), quantile(CASchools$score, 0.40)),
  Median = c(median(CASchools$stratio), median(CASchools$score)),
  P60 = c(quantile(CASchools$stratio, 0.60), quantile(CASchools$score, 0.60)),
  P75 = c(quantile(CASchools$stratio, 0.75), quantile(CASchools$score, 0.75)),
  P90 = c(quantile(CASchools$stratio, 0.90), quantile(CASchools$score, 0.90))
)

So we generate the variables we want (student-teacher ratio and average math+reading scores) and create summary statistics.
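As a quick hand check of the two derived variables, the sketch below recomputes them in base R for the three districts shown in the glimpse output above. The vector names are illustrative and not part of the data set.

```r
# Values copied from the first three rows of the glimpse() output
students <- c(195, 240, 1550)
teachers <- c(10.90, 11.15, 82.90)
read     <- c(691.6, 660.5, 636.3)
math     <- c(690.0, 661.9, 650.9)

# Student-teacher ratio: students per full-time equivalent teacher
stratio <- students / teachers

# Combined score: simple average of the math and reading scores
score <- (math + read) / 2

round(stratio, 2)  # 17.89 21.52 18.70
round(score, 1)    # 690.8 661.2 643.6
```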

Make a table

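library(knitr)       # kable()
library(kableExtra)  # kable_styling()
# Format every numeric column as text with a consistent number of digits
summary_stats <- summary_stats %>%
  mutate(across(where(is.numeric), ~ format(.x, digits = 4, nsmall = 2)))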

summary_stats_table <- kable(summary_stats, format = "html")

summary_stats_table_formatted <- summary_stats_table %>%
  kable_styling("striped", full_width = FALSE, position = "center", font_size = 20)

summary_stats_table_formatted

Metric	Mean	SD	P10	P25	P40	Median	P60	P75	P90
Student-Teacher Ratio	19.64	1.892	17.35	18.58	19.27	19.72	20.08	20.87	21.87
Test Score	654.16	19.053	630.40	640.05	649.07	654.45	659.40	666.66	678.86

Do you know how to interpret this table?

caschool_scatter <- ggplot(CASchools, aes(x = stratio, y = score)) +
  geom_point(color = "#DD8C6E") +  
  labs(title = "Scatterplot of test score v. student-teacher ratio",
       x = "Student-Teacher Ratio",
       y = "Average Score") +
  theme_minimal() +  
  theme(text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
        panel.border = element_blank(),  
        plot.background = element_rect(fill = "#F8F8F8", color = NA),
        panel.grid = element_blank(),    
        axis.line = element_line(colour = "#3D4C5F"),  
        plot.title = element_text(size = 24, face = "bold", vjust = 2))
  
caschool_scatter

Estimation

Compare average test scores in districts with low STRs to those with high STRs

Hypothesis testing

Test the “null” hypothesis that the mean test scores in the two types of districts are the same

Confidence interval

Estimate an interval for the difference in the mean test scores, high v. low STR districts

Initial data analysis

Class size	\(\bar{Y}\)	\(s_Y\)	n
Small	657.4	19.4	238
Large	650.0	17.9	182

Estimation of \(\Delta\) = difference between group means
Test the hypothesis that \(\Delta\) = 0
Construct confidence interval for \(\Delta\)

1. Estimation

\[\begin{align} \bar{Y}_{small} - \bar{Y}_{large} = \frac{1}{n_{small}} \sum_{i=1}^{n_{small}}Y_{i} - \frac{1}{n_{large}} \sum_{i=1}^{n_{large}}Y_{i} \end{align}\]

\[\begin{align} 657.4-650.0 = 7.4 \end{align}\]

Is this a large difference in a real-world sense?

Standard deviation across districts = 19.1
Difference between 60th and 75th percentiles of test score distribution = 666.7 - 659.4 = 7.3
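To put the 7.4-point gap in context, the sketch below expresses it in standard-deviation units and computes the gap between the 60th and 75th percentiles, using values read from the summary statistics table earlier in these slides. The variable names are illustrative.

```r
diff_means <- 657.4 - 650.0   # small minus large STR districts
sd_scores  <- 19.053          # SD of test scores across districts (summary table)
p60        <- 659.40          # 60th percentile of test scores (summary table)
p75        <- 666.66          # 75th percentile of test scores (summary table)

effect_in_sd   <- diff_means / sd_scores  # difference as a fraction of one SD
percentile_gap <- p75 - p60               # points between the two percentiles

round(effect_in_sd, 2)    # 0.39
round(percentile_gap, 2)  # 7.26
```

So the difference is a bit under 0.4 standard deviations, about the same size as a move from the 60th to the 75th percentile of districts.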

2. Hypothesis testing

Difference-in-means test: compute the t-statistic,

\[\begin{align} t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\frac{s_s^2}{n_s}+{\frac{s_l^2}{n_l}}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)} \end{align}\]

where \(SE(\bar{Y}_s - \bar{Y}_l)\) is the “standard error” of \(\bar{Y}_s - \bar{Y}_l\), the subscripts s and l refer to “small” and “large” STR districts, and:

\[\begin{align} s_s^2 = \frac{1}{n_s - 1} \sum_{i=1}^{n_s}(Y_{i} - \bar{Y}_s)^2 \end{align}\]
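The sample variance formula can be checked by hand against R's built-in `var()`. The five scores below are made up for illustration. The point is only that squared deviations from the group mean are summed and divided by n - 1.

```r
# Hypothetical test scores for a handful of small-STR districts
y_s <- c(661.2, 643.6, 690.8, 655.0, 648.4)

n_s    <- length(y_s)
ybar_s <- mean(y_s)

# Sample variance: squared deviations from the mean, divided by n - 1
s2_manual  <- sum((y_s - ybar_s)^2) / (n_s - 1)
s2_builtin <- var(y_s)

all.equal(s2_manual, s2_builtin)  # TRUE
sqrt(s2_manual)                   # sample standard deviation, same as sd(y_s)
```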

2. Hypothesis testing

Class size	\(\bar{Y}\)	\(s_Y\)	n
Small	657.4	19.4	238
Large	650.0	17.9	182

\[\begin{align} t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\frac{s_s^2}{n_s}+ \frac{s_l^2}{n_l}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)} = \frac{657.4 - 650.0}{\sqrt{\frac{19.4^2}{238}+\frac{17.9^2}{182}}} \end{align}\]

\[\begin{align} = \frac{7.4}{1.83} = 4.05 \end{align}\]

Since \(|t|>1.96\), we reject (at the 5% significance level) the null hypothesis that the two means are the same.
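The same arithmetic can be reproduced in base R from the summary numbers in the table. This is a sketch with illustrative variable names. The p-value line assumes the large-sample normal approximation, which is what the 1.96 critical value relies on.

```r
# Group summaries from the table: mean, standard deviation, sample size
ybar_s <- 657.4; s_s <- 19.4; n_s <- 238   # small STR districts
ybar_l <- 650.0; s_l <- 17.9; n_l <- 182   # large STR districts

# Standard error of the difference in means
se_diff <- sqrt(s_s^2 / n_s + s_l^2 / n_l)

# t-statistic for the null hypothesis of equal means
t_stat <- (ybar_s - ybar_l) / se_diff

round(se_diff, 2)   # 1.83
round(t_stat, 2)    # 4.05
abs(t_stat) > 1.96  # TRUE: reject at the 5% level

# Two-sided p-value under the normal approximation
p_value <- 2 * pnorm(-abs(t_stat))
```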

3. Confidence interval

A 95% confidence interval for the difference between the means is,

\[\begin{align} (\bar{Y}_s - \bar{Y}_l) \pm 1.96 \times SE (\bar{Y}_s - \bar{Y}_l) \end{align}\]

Two equivalent statements:

The 95% confidence interval for \(\Delta\) doesn’t include 0.
The hypothesis that \(\Delta\) = 0 is rejected at the 5% level.
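Plugging the class-size numbers into the interval formula gives a concrete range. The sketch below uses the group means, standard deviations, and sample sizes from the table, with illustrative variable names.

```r
diff_means <- 657.4 - 650.0                       # estimated difference in means
se_diff    <- sqrt(19.4^2 / 238 + 17.9^2 / 182)   # standard error of the difference

# 95% confidence interval: estimate plus or minus 1.96 standard errors
ci <- diff_means + c(-1, 1) * 1.96 * se_diff

round(ci, 1)  # 3.8 11.0
ci[1] > 0     # TRUE: the interval excludes 0
```

The interval runs from about 3.8 to 11.0 points, so it excludes 0, which matches the rejection of the null hypothesis at the 5% level.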

What comes next

The mechanics of estimation, hypothesis testing, and confidence intervals should be familiar
These concepts extend directly to regression and its variants
Before turning to regression, however, we will review some of the underlying theory of estimation, hypothesis testing, and confidence intervals

Review

Introductions

Overview of the course

Causality and statistical methods

To Do

Read Ch.1-2

Getting Started Homework, Getting Started Quiz, RStudio Primer

See you next time!