Confidence Intervals and R

Econ 358: Econometrics

A review

  • We want to estimate something (the population mean, proportion, etc…)

  • We want to test our hypothesis (is the mean different from 0, different from another mean, etc…)

  • We want to establish a confidence interval for our estimate (we are sure that 95% of the time an estimate of the mean will lie between…)

Overview of estimates

  • We are often interested in population parameters

  • Complete populations are difficult to collect data on, so we use sample statistics as point estimates for the unknown population parameters of interest.

  • Error in the estimate = difference between population parameter and sample statistic

  • Bias is systematic tendency to over- or under-estimate the true population parameter.

  • Sampling error describes how much an estimate will tend to vary from one sample to the next.

  • Much of statistics is focused on understanding and quantifying sampling error, and sample size is helpful for quantifying this error.

Law of Large Numbers

Law of Large Numbers

Under general conditions, the sample average will be close to the population mean with a very high probability when the sample size is large.

e.g.: As we flip a coin more and more times, the average coin flip will come to approximate 0.5


e.g.: Casinos don’t get mad when you win once…
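The coin-flip convergence is easy to see in a quick simulation (a sketch; the seed and sample size here are arbitrary choices):

```r
# Simulate coin flips and track the running average of heads.
# As n grows, the running average settles near 0.5.
set.seed(42)  # arbitrary seed, for reproducibility
flips <- rbinom(100000, size = 1, prob = 0.5)  # 1 = heads, 0 = tails
running_avg <- cumsum(flips) / seq_along(flips)

running_avg[100]     # still noisy with only a few flips
running_avg[100000]  # very close to 0.5
```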

Law of Large Numbers


More formally:

As the sample size (n) increases, the sample mean (average) of a sequence of i.i.d. random variables converges in probability to the true population mean.


That is, the sample mean is a consistent estimator of the true population mean.

Law of Large Numbers


THE MOST FORMAL


If \((Y_1,\ldots, Y_n)\) are i.i.d. and \(\sigma_Y^2 < \infty\), then \(\bar{Y}\) is a consistent estimator of \(\mu_Y\); that is,

\[\Pr[|\bar{Y}-\mu_Y| < \varepsilon] \rightarrow 1 \text{ as } n \rightarrow \infty \text{ for any } \varepsilon > 0,\]

which can be written \(\bar{Y} \xrightarrow{p} \mu_Y\).

The Central Limit Theorem

  • The CLT helps us understand how our samples would be distributed if we took a large number of samples.

  • In real-world applications, we never actually observe the sampling distribution, yet it is useful to always think of a point estimate (for instance our estimate of the mean using one sample) as coming from such a hypothetical distribution.

  • Understanding the sampling distribution will help us characterize and make sense of the point estimates that we do observe.

Central Limit Theorem

As the sample size grows, the distribution of the sample mean (across repeated samples) approaches a normal distribution, whatever the shape of the underlying population.

The mean and variance of this sampling distribution depend on the specifics of the data, but because the distribution is normal, we can always standardize it (make the mean equal to zero and the variance equal to 1).
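A quick simulation sketch of this idea (the population, sample size, and number of samples here are arbitrary choices): draw many samples from a skewed exponential population and look at the distribution of their means.

```r
# The population is exponential (skewed, mean 1, sd 1), yet the
# distribution of the sample means is approximately bell-shaped.
set.seed(1)  # arbitrary seed, for reproducibility
sample_means <- replicate(5000, mean(rexp(100, rate = 1)))

mean(sample_means)  # near the population mean of 1
sd(sample_means)    # near sigma / sqrt(n) = 1 / sqrt(100) = 0.1
```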

Standardizing

We make the mean equal to zero by subtracting the population mean from the sample mean (the L.L.N. tells us this difference shrinks toward zero):

\[\bar{Y} - \mu_Y\]

We make the variance equal to 1 by dividing by the standard deviation of the sample mean:

\[\frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}}\]
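With illustrative numbers (all hypothetical), the standardization is just arithmetic:

```r
# Standardize a sample mean by hand (hypothetical values)
y_bar <- 13.0  # sample mean
mu    <- 12.0  # population mean
se    <- 0.5   # standard deviation of the sample mean
z <- (y_bar - mu) / se
z  # 2: the sample mean sits 2 standard errors above mu
```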

Standard Normal

By converting our sample distribution to the standard normal, we can more easily

  • conduct hypothesis testing (e.g. t-tests and z-tests)

  • construct confidence intervals

(plus lots of other useful things we will see later)

Our R-Studio Project

The set up

We’ve taken two samples of workers, and collected information on their wages, age, education and gender (this is a real sample from the Current Population Survey). One sample is from 1996, one sample is from 2015.

Our first task will be to figure out whether average wages increased between 1996 and 2015.

  • Are differences between these two samples because they have different population averages? Or is the difference just because of randomness? In other words, are these samples coming from the same distribution?

Chunks

Our QMD file tells RStudio what to produce:

  1. Text chunks get treated as text (with some formatting occasionally specified)

  2. Code chunks (indicated by three “backticks” and a programming language) tell RStudio to run these commands.

Loading the data

  1. Click the green arrow on the packages code chunk. This will run:
library(readxl)
library(tidyverse)
  2. Click the green arrow on the load-data code chunk. This will run:
cps <- read_excel("CPS96_15.xlsx")

Glimpse

Use the glimpse function to see what the data is like:

glimpse(cps)
Rows: 13,201
Columns: 5
$ year     <dbl> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1…
$ ahe      <dbl> 11.171735, 8.653846, 9.615385, 11.217949, 9.615385, 14.423077…
$ bachelor <dbl> 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0…
$ female   <dbl> 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0…
$ age      <dbl> 31, 31, 27, 26, 28, 32, 30, 31, 31, 25, 32, 33, 26, 27, 29, 3…

Computing sample mean

R is, at its core, a calculator: we can ask it to calculate things and save the results.

We want means for 1996 and for 2015, but we don’t have separate columns for these two dates.

The easiest way forward is to split our data into two data sets.

Code chunk

Create a new code chunk by placing your cursor under the text for Question 1.a and then:

  • Click on the “Code” menu at the top of the RStudio window, select “insert new code chunk”

or

  • Use the shortcut:
    • On Windows/Linux: Ctrl + Alt + I
    • On macOS: Cmd + Option + I

Splitting the data

“tidyverse” includes a package called “dplyr”

  • useful for wrangling our data.

The first argument is always a data frame

  • Subsequent arguments say what to do with that data frame

Produces a data frame that we can store by assigning it to a name:

cps_1996 <- cps %>%
  filter(year==1996)

cps_2015 <- cps %>%
  filter(year==2015)

dplyr functions

dplyr functions act as verbs that manipulate data frames:

  • select: pick columns by name
  • arrange: reorder rows
  • slice: pick rows using index(es)
  • filter: pick rows matching criteria
  • distinct: filter for unique rows
  • mutate: add new variables
  • summarise: reduce variables to values
  • group_by: for grouped operations
  • … (many more)
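A toy sketch of a few of these verbs chained together (the data frame here is made up for illustration, not the cps data):

```r
library(dplyr)

# Hypothetical mini data frame
toy <- data.frame(year = c(1996, 2015, 2015),
                  ahe  = c(10, 20, 15))

top_2015 <- toy %>%
  filter(year == 2015) %>%           # pick rows matching criteria
  mutate(high_wage = ahe > 18) %>%   # add a new variable
  arrange(desc(ahe))                 # reorder rows

top_2015
```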

Calculating means

Now that we have two different data sets for each year, we can calculate the means and store them as values (not data frames) in R. So add to your code chunk:

cps_1996 <- cps %>%
  filter(year==1996)

cps_2015 <- cps %>%
  filter(year==2015)


mean_ahe_96 <- mean(cps_1996$ahe)
mean_ahe_15 <- mean(cps_2015$ahe)

Click the green arrow to run this code chunk and look in the “environment” pane to see what happened. Do the means seem different at first glance?

Standard Deviation

Part B asks us to find the standard deviation. This is important on its own, but will also help us test whether the two means are likely to be part of the same distribution or not.

If you’ve been clicking those green arrows, R will have remembered all the prior stuff. So we can write:

sd_ahe_96 <- sd(cps_1996$ahe)
sd_ahe_15 <- sd(cps_2015$ahe)

Confidence interval

Now the question wants us to construct intervals around our sample means.

These intervals will tell us the range within which the true population mean is likely to fall with 95% confidence

i.e.: an interval that contains the true value of the population mean in 95% of repeated samples

http://pewresearch.org/pubs/2191/young-adults-workers-labor-market-pay-careers-advancement-recession

Margin of Error

  • \(41\% \pm 2.9\%\): We are 95% confident that 38.1% to 43.9% of the public believe young adults, rather than middle-aged or older adults, are having the toughest time in today’s economy.


  • \(49\% \pm 4.4\%\): We are 95% confident that 44.6% to 53.4% of 18-34 year olds have taken a job they didn’t want just to pay the bills.

Confidence intervals

Confidence intervals are always of the form:

point estimate ± ME

ME is always calculated as the product of a critical value and the SE.

The standard error provides a guide for how large we should make the confidence interval (a large standard error implies we have to make our interval larger)

The critical value is some value for a test statistic beyond which we are going to reject the null hypothesis.

The Z-score

A z-score (also known as a standard score) quantifies how many standard deviations a data point lies from the mean of a dataset.

In large samples, the standard normal distribution provides our critical values.

Standard Error

The standard error is a measure of the variability or uncertainty associated with estimating the population mean (\(\mu_Y\)) from a sample mean (\(\bar{Y}\)).

\[SE = \frac{s_Y}{\sqrt{n}}\]

where \(s_Y\) is the sample standard deviation, an estimator of the population standard deviation.

We won’t go through a detailed proof of why, but note that as \(n\) increases, SE decreases.
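The shrinkage is easy to check directly (the sample standard deviation here is an arbitrary number):

```r
s_y <- 10  # a hypothetical sample standard deviation

s_y / sqrt(100)    # n = 100   -> SE = 1
s_y / sqrt(10000)  # n = 10000 -> SE = 0.1
```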

Steps to calculate the confidence intervals

  1. Estimate mean

  2. Determine standard deviation

  3. Calculate standard error

  4. Find the critical value

  5. Find lower bound: \(\bar{Y} - \text{critical value} \times SE\)

  6. Find upper bound: \(\bar{Y} + \text{critical value} \times SE\)

Our standard errors

We’ve already calculated the mean and standard deviation, so let’s now calculate the standard error, critical value, and bounds.

  1. standard error
# Sample sizes
n_1996 <- length(cps_1996$ahe)
n_2015 <- length(cps_2015$ahe)

# Standard errors
se_1996 <- (sd_ahe_96 / sqrt(n_1996))
se_2015 <- (sd_ahe_15 / sqrt(n_2015))          

Our critical values


  • The problem asks for a 95% confidence interval. So we want 2.5% on the tails.

  • You can calculate this by hand, or computer, or…

  • it’s 1.96 standard deviations on either side of the mean.
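Rather than memorizing 1.96, you can ask R for the standard normal quantile that leaves 2.5% in the upper tail:

```r
# Critical value for a 95% confidence interval
qnorm(0.975)  # approximately 1.96
```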

Our confidence intervals

ci_lower_1996 <- mean_ahe_96 - (1.96 * se_1996)
ci_upper_1996 <- mean_ahe_96 + (1.96 * se_1996)

ci_lower_2015 <- mean_ahe_15 - (1.96 * se_2015)
ci_upper_2015 <- mean_ahe_15 + (1.96 * se_2015)

CI for 1996: \(12.693 \pm 0.159\)

CI for 2015: \(21.237 \pm 0.148\)

The easy way

# Calculate the 95% confidence interval for AHE in 1996
ci_1996 <- t.test(cps_1996$ahe, conf.level = 0.95)

# Calculate the 95% confidence interval for AHE in 2015
ci_2015 <- t.test(cps_2015$ahe, conf.level = 0.95)

ci_1996

    One Sample t-test

data:  cps_1996$ahe
t = 155.94, df = 6102, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 12.53369 12.85283
sample estimates:
mean of x 
 12.69326 

Taking a look

Code
# Create a data frame for the confidence intervals
ci_data <- data.frame(
  Year = c("1996", "2015"),
  Mean = c(mean(cps_1996$ahe), mean(cps_2015$ahe)),
  Lower = c(ci_1996$conf.int[1], ci_2015$conf.int[1]),
  Upper = c(ci_1996$conf.int[2], ci_2015$conf.int[2])
)

# Create a point estimate with error bars plot
plot <- ggplot(ci_data, aes(x = Year, y = Mean)) +
  geom_point(color = "#DD8C6E", size = 3) +
  geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2) +
  labs(
    title = "95% Confidence Intervals for AHE in 1996 and 2015",
    x = "Year",
    y = "Mean AHE"
  ) +
  theme_minimal() +
  theme(legend.position = "none",
        text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
        panel.border = element_blank(),  
        plot.background = element_rect(fill = "#F8F8F8", color = NA),
        panel.grid = element_blank(),
        axis.line = element_line(colour = "#3D4C5F"))

# Show the plot
plot

Adjusting for inflation

As question 2 points out, we might want to adjust for inflation here:

# Adjust AHE in 1996 to real 2015 dollars
cps_1996$ahe_real <- cps_1996$ahe * (237 / 156.9)

## Calculate mean

mean_real_96 <- mean(cps_1996$ahe_real)

## Calculate standard deviation
sd_real_96 <- sd(cps_1996$ahe_real)

## Calculate standard error
se_real_1996 <- (sd_real_96 / sqrt(n_1996))

## Calculate confidence interval 
ci_real_lower_1996 <- mean_real_96 - (1.96 * se_real_1996)
ci_real_upper_1996 <- mean_real_96 + (1.96 * se_real_1996)

Taking a look

Code
# Create a data frame for the confidence intervals
ci_data <- data.frame(
  Year = c("1996", "2015"),
  Mean = c(mean_real_96, mean_ahe_15),
  Lower = c(ci_real_lower_1996, ci_lower_2015),
  Upper = c(ci_real_upper_1996, ci_upper_2015)
)

# Create a point estimate with error bars plot
plot <- ggplot(ci_data, aes(x = Year, y = Mean)) +
  geom_point(color = "#DD8C6E", size = 3) +
  geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2) +
  labs(
    title = "95% Confidence Intervals for AHE in 1996 and 2015",
    x = "Year",
    y = "Mean AHE"
  ) +
  theme_minimal() +
  theme(legend.position = "none",
        text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
        panel.border = element_blank(),  
        plot.background = element_rect(fill = "#F8F8F8", color = NA),
        panel.grid = element_blank(),
        axis.line = element_line(colour = "#3D4C5F"))
# Show the plot
plot

Construct a 95% confidence interval for the change in population means

change_means <- mean_ahe_15 - mean_real_96

# SE of the difference between two independent sample means:
# sqrt(SE_1996^2 + SE_2015^2), using the real (inflation-adjusted) 1996 SE
se_diff <- sqrt(se_real_1996^2 + se_2015^2)

ci_lower_diff <- change_means - (1.96 * se_diff)
ci_upper_diff <- change_means + (1.96 * se_diff)

Plotting it

Mean of high school

Question 4 wants us to figure out if education makes any meaningful difference to average incomes in 2015. First we get a CI for high school grads:

hs_2015 <- cps_2015 %>%
  filter(bachelor==0)

ci_hs_2015 <- t.test(hs_2015$ahe, conf.level = 0.95)

ci_hs_2015

    One Sample t-test

data:  hs_2015$ahe
t = 111.33, df = 3364, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 16.09262 16.66960
sample estimates:
mean of x 
 16.38111 

Mean of college grads

Next we get a CI for college grads:

col_2015 <- cps_2015 %>%
  filter(bachelor==1)

ci_col_2015 <- t.test(col_2015$ahe, conf.level = 0.95)

ci_col_2015

    One Sample t-test

data:  col_2015$ahe
t = 118.83, df = 3732, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 25.19241 26.03765
sample estimates:
mean of x 
 25.61503 

CI for difference in earnings by education

Now we construct a CI for the difference:

ci_ed_diff_2015 <- t.test(col_2015$ahe, hs_2015$ahe, conf.level = 0.95)

ci_ed_diff_2015

    Welch Two Sample t-test

data:  col_2015$ahe and hs_2015$ahe
t = 35.381, df = 6463.4, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 8.722304 9.745543
sample estimates:
mean of x mean of y 
 25.61503  16.38111 

Repeat with real 1996 data

hs_1996 <- cps_1996 %>%
  filter(bachelor==0)

col_1996 <- cps_1996 %>%
  filter(bachelor==1)

ci_hs_1996 <- t.test(hs_1996$ahe_real, conf.level = 0.95)

ci_col_1996 <- t.test(col_1996$ahe_real, conf.level = 0.95)

ci_ed_diff_1996 <- t.test(col_1996$ahe_real, hs_1996$ahe_real, conf.level = 0.95)

Difference in high school?

Code
hs_compare_data <- data.frame(
  Education = c("HS Grad 1996", "HS Grad 2015"),
  Estimate = c(mean(hs_1996$ahe_real), mean(hs_2015$ahe)),
  Lower_CI = c(ci_hs_1996$conf.int[1], ci_hs_2015$conf.int[1]),
  Upper_CI = c(ci_hs_1996$conf.int[2], ci_hs_2015$conf.int[2])
)

hs_plot <- ggplot(hs_compare_data, aes(x = Education, y = Estimate)) +
  geom_point( color = "#DD8C6E", size = 3) +
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.15) +
  labs(title = "High School Graduates: Wage Estimates and Confidence Intervals",
       y = "Average Hourly Earnings",
       x = "Year") +
   theme_minimal() +
   theme(legend.position = "none",
        text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
        panel.border = element_blank(),  
        plot.background = element_rect(fill = "#F8F8F8", color = NA),
        panel.grid = element_blank(),
        axis.line = element_line(colour = "#3D4C5F"),
        axis.text.x = element_text(angle = 45, hjust = 1))


hs_plot

Difference in college?

Code
col_compare_data <- data.frame(
  Education = c("College Grad 1996", "College Grad 2015"),
  Estimate = c(mean(col_1996$ahe_real), mean(col_2015$ahe)),
  Lower_CI = c(ci_col_1996$conf.int[1], ci_col_2015$conf.int[1]),
  Upper_CI = c(ci_col_1996$conf.int[2], ci_col_2015$conf.int[2])
)

col_plot <- ggplot(col_compare_data, aes(x = Education, y = Estimate)) +
  geom_point(color = "#DD8C6E", size = 3) +
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.15) +
  labs(title = "College Graduates: Wage Estimates and Confidence Intervals",
       y = "Average Hourly Earnings",
       x = "Year") +
  theme_minimal() +
   theme(legend.position = "none",
        text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
        panel.border = element_blank(),  
        plot.background = element_rect(fill = "#F8F8F8", color = NA),
        panel.grid = element_blank(),
        axis.line = element_line(colour = "#3D4C5F"),
        axis.text.x = element_text(angle = 45, hjust = 1))

col_plot

The education gap over time?

Code
diff_compare_data <- data.frame(
  Year = c("1996", "2015"),
  Estimate = c(mean(col_1996$ahe_real) - mean(hs_1996$ahe_real), mean(col_2015$ahe) - mean(hs_2015$ahe)),
  Lower_CI = c(ci_ed_diff_1996$conf.int[1], ci_ed_diff_2015$conf.int[1]),
  Upper_CI = c(ci_ed_diff_1996$conf.int[2], ci_ed_diff_2015$conf.int[2])
)

diff_plot <- ggplot(diff_compare_data, aes(x = Year, y = Estimate)) +
  geom_point(color = "#DD8C6E", size = 3) +
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.15) +
  labs(title = "Difference in Means: College Graduates vs. High School Graduates",
       y = "Difference in Average Hourly Earnings",
       x = "Year") +
  theme_minimal() +
   theme(legend.position = "none",
        text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
        panel.border = element_blank(),  
        plot.background = element_rect(fill = "#F8F8F8", color = NA),
        panel.grid = element_blank(),
        axis.line = element_line(colour = "#3D4C5F"),
        axis.text.x = element_text(angle = 45, hjust = 1))

diff_plot

Gender wage gap

The final question asks you to prepare a table for high school graduates. You don’t have to do anything fancy here, just estimate the following for high school graduates only:

  • mean ahe for men and women in 1996
  • mean ahe for men and women in 2015
  • standard deviation for men and women in 1996
  • standard deviation for men and women in 2015
  • the number of men and the number of women in 1996 and 2015
  • the gender wage gap (difference in mean ahe) in 1996 and 2015
  • the standard error of the gender wage gap in 1996 and 2015
  • the confidence interval for the gender wage gap in 1996 and 2015
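One possible sketch of those pieces using group_by/summarise on a toy data frame (the toy frame and its numbers are made up; with the real data you would filter cps to high school graduates and apply the same steps separately by year):

```r
library(dplyr)

# Hypothetical stand-in for one year of high school graduates
toy <- data.frame(female = c(0, 0, 1, 1),
                  ahe    = c(20, 22, 16, 18))

# Mean, sd, and count by gender
stats <- toy %>%
  group_by(female) %>%
  summarise(mean_ahe = mean(ahe), sd_ahe = sd(ahe), n = n(),
            .groups = "drop")

# Gap, its standard error, and a 95% confidence interval
gap    <- stats$mean_ahe[stats$female == 0] - stats$mean_ahe[stats$female == 1]
se_gap <- sqrt(sum(stats$sd_ahe^2 / stats$n))  # sqrt(s_m^2/n_m + s_f^2/n_f)
ci     <- gap + c(-1.96, 1.96) * se_gap
```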