Introduction to Econometrics

Introduction

Nathaniel Cline

Agenda

  1. Introductions

  2. Overview of the course

  3. Causality and statistical methods

  4. Review and to do

About me

Prof. Cline, Dr. Cline, or Nate

  • Born and raised in MD just outside of DC
  • PhD in Economics from the University of Utah
  • Research interests in macroeconomics, international political economy, and childcare
  • In my spare time I make furniture with hand tools!

Contacting


nathaniel_cline@redlands.edu


Schedule an appointment or drop by my office T, Th 3pm - 4pm.


Office: Duke 201

About You


  1. Name

  2. Major/emphasis

  3. What ridiculous thing would you buy?

Course Materials


  • Stock and Watson, Introduction to Econometrics, 4th edition


  • MyLab Economics


  • Posit Cloud (RStudio)

Where our class lives


  • Canvas


  • MyLab


  • Posit Cloud

Grading

Course Outline

Introduction and Review

Introduction

Review of Probability

Review of Statistics

Introduction to RStudio

Fundamentals of Regression

Regression with one regressor

Hypothesis tests and confidence intervals 1

Regression with multiple regressors

Hypothesis tests and confidence intervals 2

Further Topics

Non-linear regression

Assessing studies

Regression with a binary dependent variable

Instrumental variables

Experiments and quasi-experiments

Time series and forecasting

Overview

Economics suggests important relationships, often with policy implications, but virtually never suggests quantitative magnitudes of causal effects.

  • What is the quantitative effect of reducing class size on student achievement?
  • How does another year of education change earnings?
  • How much do cigarette taxes reduce smoking?
  • What is the effect on output growth of a 1 percentage point increase in interest rates by the Fed?

Causal Effects

Does health insurance lead to better health outcomes?

Table 1: Husbands
               Some HI   No HI    Difference
Health Index   4.01      3.70     .31
               [.93]     [1.01]   (.03)

Table 2: Wives
               Some HI   No HI    Difference
Health Index   4.02      3.62     .39
               [.92]     [1.01]   (.04)

Excerpted from Angrist and Pischke (2015) using 2009 National Health Interview Survey (NHIS) data. Standard deviations are in brackets; standard errors are reported in parentheses.

Experimental


The data in the previous table are an example of “observational” or “non-experimental” data.


Experimental data would result from controlled random assignment.
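
To make the contrast concrete, here is a minimal simulation sketch in R. Everything in it is hypothetical: the "true" effect of 0.10, the `income` confounder, and all coefficients are made-up values chosen for illustration, not estimates from the NHIS data. It shows how a simple difference in means can overstate a causal effect in observational data, while random assignment recovers it.

```r
set.seed(42)          # arbitrary seed so the sketch is reproducible
n <- 100000
true_effect <- 0.10   # hypothetical causal effect of insurance on the health index

# Hypothetical confounder: raises health AND the chance of being insured
income <- rnorm(n)

# Observational world: higher-income people are more likely to be insured
insured_obs <- rbinom(n, 1, plogis(income))
health_obs  <- 3.8 + true_effect * insured_obs + 0.3 * income + rnorm(n)

# Experimental world: insurance assigned by coin flip, independent of income
insured_rct <- rbinom(n, 1, 0.5)
health_rct  <- 3.8 + true_effect * insured_rct + 0.3 * income + rnorm(n)

# Simple difference in group means in each world
diff_obs <- mean(health_obs[insured_obs == 1]) - mean(health_obs[insured_obs == 0])
diff_rct <- mean(health_rct[insured_rct == 1]) - mean(health_rct[insured_rct == 0])

round(c(observational = diff_obs, randomized = diff_rct, truth = true_effect), 3)
```

In this made-up setup the observational difference mixes the effect of insurance with the effect of income, while the randomized difference should land close to the true effect.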

Understanding check


Describe a hypothetical ideal randomized controlled experiment to study the effect on highway traffic deaths of wearing seat belts. Suggest some impediments to implementing this experiment in practice.


Most of the course deals with the difficulties that arise from using observational data to estimate causal effects.

Famous econometric questions and problems

What are the returns to schooling?

Confounding effects: many factors systematically affect both wages and years of schooling.

What is the effect of minimum wage on employment?

Selection bias: states that change their minimum wage might differ systematically from states that do not.

What is the effect of policing on crime?

Simultaneous causality: policing may affect crime, but high crime in an area may also lead to more policing.

In this course you will

  • Learn methods for estimating causal effects using observational data

  • Learn methods for prediction

  • Focus on applications

  • Learn to evaluate the regression analysis of others

  • Get some hands-on experience with regression analysis in your problem sets.

What is the quantitative effect of reducing class size on student achievement?

The California Test Score Data Set

  • All K-6 and K-8 California school districts (n = 420)
  • Book uses:
    • 5th grade test scores (Stanford-9 achievement test, combined math and reading), district average
    • Student-teacher ratio (STR) = no. of students in the district divided by no. of full-time equivalent teachers

Loading Data

Code
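library(AER)    # provides the CASchools data set
library(dplyr)  # provides glimpse(), slice(), mutate(), and %>%
data(CASchools)

# Peek at the first three districts
glimpse(slice(CASchools, 1:3))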
Rows: 3
Columns: 14
$ district    <chr> "75119", "61499", "61549"
$ school      <chr> "Sunol Glen Unified", "Manzanita Elementary", "Thermalito …
$ county      <fct> Alameda, Butte, Butte
$ grades      <fct> KK-08, KK-08, KK-08
$ students    <dbl> 195, 240, 1550
$ teachers    <dbl> 10.90, 11.15, 82.90
$ calworks    <dbl> 0.5102, 15.4167, 55.0323
$ lunch       <dbl> 2.0408, 47.9167, 76.3226
$ computer    <dbl> 67, 101, 169
$ expenditure <dbl> 6384.911, 5099.381, 5501.955
$ income      <dbl> 22.690, 9.824, 8.978
$ english     <dbl> 0.000000, 4.583333, 30.000002
$ read        <dbl> 691.6, 660.5, 636.3
$ math        <dbl> 690.0, 661.9, 650.9

But we want the combined test score and the student-teacher ratio…

Transforming


Code
CASchools <- CASchools %>%
  mutate(stratio = students/teachers,
         score = (math + read)/2)

summary_stats <- data.frame(
  Metric = c("Student-Teacher Ratio", "Test Score"),
  Mean = c(mean(CASchools$stratio), mean(CASchools$score)),
  SD = c(sd(CASchools$stratio), sd(CASchools$score)),
  P10 = c(quantile(CASchools$stratio, 0.10), quantile(CASchools$score, 0.10)),
  P25 = c(quantile(CASchools$stratio, 0.25), quantile(CASchools$score, 0.25)),
  P40 = c(quantile(CASchools$stratio, 0.40), quantile(CASchools$score, 0.40)),
  Median = c(median(CASchools$stratio), median(CASchools$score)),
  P60 = c(quantile(CASchools$stratio, 0.60), quantile(CASchools$score, 0.60)),
  P75 = c(quantile(CASchools$stratio, 0.75), quantile(CASchools$score, 0.75)),
  P90 = c(quantile(CASchools$stratio, 0.90), quantile(CASchools$score, 0.90))
)

So we generate the variables we want (student-teacher ratio and average math+reading scores) and create summary statistics.

Make a table


Code
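library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Format every numeric column as text: 4 significant digits, at least 2 decimals
summary_stats <- summary_stats %>%
  mutate_if(is.numeric, format, digits = 4, nsmall = 2)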

summary_stats_table <- kable(summary_stats, format = "html")

summary_stats_table_formatted <- summary_stats_table %>%
  kable_styling("striped", full_width = FALSE, position = "center", font_size = 20)

summary_stats_table_formatted
Metric Mean SD P10 P25 P40 Median P60 P75 P90
Student-Teacher Ratio 19.64 1.892 17.35 18.58 19.27 19.72 20.08 20.87 21.87
Test Score 654.16 19.053 630.40 640.05 649.07 654.45 659.40 666.66 678.86


Do you know how to interpret this table?

Code
library(ggplot2)  # ggplot(), geom_point(), theme_minimal()

caschool_scatter <- ggplot(CASchools, aes(x = stratio, y = score)) +
  geom_point(color = "#DD8C6E") +  
  labs(title = "Scatterplot of test score v. student-teacher ratio",
       x = "Student-Teacher Ratio",
       y = "Average Score") +
  theme_minimal() +  
  theme(text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
        panel.border = element_blank(),  
        plot.background = element_rect(fill = "#F8F8F8", color = NA),
        panel.grid = element_blank(),    
        axis.line = element_line(colour = "#3D4C5F"),  
        plot.title = element_text(size = 24, face = "bold", vjust = 2))
  
caschool_scatter

  1. Estimation: compare average test scores in districts with low STRs to those with high STRs

  2. Hypothesis testing: test the “null” hypothesis that the mean test scores in the two types of districts are the same

  3. Confidence interval: estimate an interval for the difference in the mean test scores, high v. low STR districts

Initial data analysis

Class size   Average score (\(\bar{Y}\))   Standard deviation (\(s\))   n
Small        657.4                         19.4                         238
Large        650.0                         17.9                         182


  1. Estimation of \(\Delta\) = difference between group means

  2. Test the hypothesis that \(\Delta\) = 0

  3. Construct confidence interval for \(\Delta\)
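
The small/large table above can be rebuilt from the data with a grouped summary. This is a sketch, not the exact code behind the slide. It assumes the AER and dplyr packages are installed, and it assumes "small" means fewer than 20 students per teacher, a cutoff the slide does not state. The names `class_size` and `class_size_table` are made up for this example.

```r
# Runs only if both packages are available
if (requireNamespace("AER", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  data("CASchools", package = "AER")

  class_size_table <- CASchools %>%
    mutate(stratio = students / teachers,
           score   = (math + read) / 2,
           # Assumed cutoff: "Small" = fewer than 20 students per teacher
           class_size = ifelse(stratio < 20, "Small", "Large")) %>%
    group_by(class_size) %>%
    summarise(mean_score = mean(score),  # Y-bar for the group
              sd_score   = sd(score),    # sample standard deviation
              n          = n())          # number of districts

  print(class_size_table)
}
```

If the group sizes do not match the table, the cutoff assumption is the first thing to check.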

1. Estimation

\[\begin{align} \bar{Y}_{small} - \bar{Y}_{large} = \frac{1}{n_{small}} \sum_{i=1}^{n_{small}}Y_{i} - \frac{1}{n_{large}} \sum_{i=1}^{n_{large}}Y_{i} \end{align}\]
\[\begin{align} 657.4-650.0 = 7.4 \end{align}\]

Is this a large difference in a real-world sense?

  • Standard deviation across districts = 19.1
  • Difference between 60th and 75th percentiles of test score distribution = 666.66 - 659.40 ≈ 7.3
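
One way to judge the size of the gap is to express it in standard deviation units. The sketch below uses only numbers reported on these slides; the variable names are illustrative.

```r
ybar_small <- 657.4   # average score, small-STR districts
ybar_large <- 650.0   # average score, large-STR districts
sd_all     <- 19.1    # standard deviation of scores across all districts

delta_hat <- ybar_small - ybar_large  # estimated difference in means
delta_in_sd <- delta_hat / sd_all     # the same difference in SD units

round(c(delta_hat = delta_hat, delta_in_sd = delta_in_sd), 2)
```

The difference works out to a bit under 0.4 of a standard deviation.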

2. Hypothesis testing

Difference-in-means test: compute the t-statistic,

\[\begin{align} t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\frac{s_s^2}{n_s}+{\frac{s_l^2}{n_l}}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)} \end{align}\]

where \(SE(\bar{Y}_s - \bar{Y}_l)\) is the “standard error” of \(\bar{Y}_s - \bar{Y}_l\), the subscripts s and l refer to “small” and “large” STR districts, and:

\[\begin{align} s_s^2 = \frac{1}{n_s - 1} \sum_{i=1}^{n_s}(Y_{i} - \bar{Y}_s)^2 \end{align}\]

(\(s_l^2\) is defined analogously for the large-STR districts.)
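
As a quick check of the sample variance formula, the sketch below applies it by hand to a tiny vector and compares the result with R's built-in `var()`. The four scores are hypothetical values made up for illustration.

```r
y <- c(640, 655, 662, 671)  # hypothetical test scores, illustration only

n <- length(y)
manual_var <- sum((y - mean(y))^2) / (n - 1)  # formula with the n - 1 divisor

# var() uses the same n - 1 divisor, so the two should agree
all.equal(manual_var, var(y))
```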

2. Hypothesis testing

Class size   \(\bar{Y}\)   \(s\)    n
Small        657.4         19.4     238
Large        650.0         17.9     182
\[\begin{align} t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\frac{s_s^2}{n_s}+ \frac{s_l^2}{n_l}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)} = \frac{657.4 - 650.0}{\sqrt{\frac{19.4^2}{238}+\frac{17.9^2}{182}}} \end{align}\]
\[\begin{align} = \frac{7.4}{1.83} = 4.05 \end{align}\]

Since \(|t|>1.96\) we reject (at the 5% significance level) the null hypothesis that the two means are the same.
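
The same arithmetic can be done in R from the summary numbers in the table. This is a sketch with illustrative variable names. The p-value line uses the large-sample normal approximation, which is an assumption consistent with the 1.96 critical value.

```r
ybar_s <- 657.4; s_s <- 19.4; n_s <- 238  # small-STR districts
ybar_l <- 650.0; s_l <- 17.9; n_l <- 182  # large-STR districts

# Standard error of the difference in means (unequal variances)
se_diff <- sqrt(s_s^2 / n_s + s_l^2 / n_l)

# t-statistic for the null hypothesis of equal means
t_stat <- (ybar_s - ybar_l) / se_diff

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(t_stat))

round(c(se_diff = se_diff, t_stat = t_stat), 2)
abs(t_stat) > 1.96  # TRUE means reject at the 5% level
```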

3. Confidence interval

A 95% confidence interval for the difference between the means is,

\[\begin{align} (\bar{Y}_s - \bar{Y}_l) \pm 1.96 \times SE (\bar{Y}_s - \bar{Y}_l) \end{align}\]

Two equivalent statements:

  1. The 95% confidence interval for \(\Delta\) doesn’t include 0.
  2. The hypothesis that \(\Delta\) = 0 is rejected at the 5% level.
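
Plugging in the numbers from the hypothesis-testing slides gives the interval itself. This is a sketch using the reported summary statistics; the variable names are illustrative.

```r
delta_hat <- 657.4 - 650.0                      # difference in group means
se_diff   <- sqrt(19.4^2 / 238 + 17.9^2 / 182)  # its standard error

# 95% confidence interval: estimate plus or minus 1.96 standard errors
ci_lower <- delta_hat - 1.96 * se_diff
ci_upper <- delta_hat + 1.96 * se_diff

round(c(lower = ci_lower, upper = ci_upper), 1)

# The interval excludes 0 exactly when |t| > 1.96
(ci_lower > 0) == (abs(delta_hat / se_diff) > 1.96)
```

The interval runs from roughly 3.8 to 11.0 points, so it does not include 0, which matches the rejection of the null hypothesis.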

What comes next

  • The mechanics of estimation, hypothesis testing, and confidence intervals should be familiar

  • These concepts extend directly to regression and its variants

  • Before turning to regression, however, we will review some of the underlying theory of estimation, hypothesis testing, and confidence intervals

Review

  1. Introductions

  2. Overview of the course

  3. Causality and statistical methods

To Do

  1. Read Ch.1-2

  2. Getting Started Homework, Getting Started Quiz, RStudio Primer

See you next time!