Econ 358: Econometrics
We want to estimate something (the population mean, proportion, etc…)
We want to test our hypothesis (is the mean different from 0, different from another mean, etc…)
We want to establish a confidence interval for our estimate (we are sure that 95% of the time an estimate of the mean will lie between…)
We are often interested in population parameters
Complete populations are difficult to collect data on, so we use sample statistics as point estimates for the unknown population parameters of interest.
Error in the estimate = difference between population parameter and sample statistic
Bias is systematic tendency to over- or under-estimate the true population parameter.
Sampling error describes how much an estimate will tend to vary from one sample to the next.
Much of statistics is focused on understanding and quantifying sampling error, and sample size is helpful for quantifying this error.
Law of Large Numbers
Under general conditions, the sample average will be close to the population mean with a very high probability when the sample size is large.
e.g.: As we flip a coin more and more times, the average of the flips will come to approximate 0.5
e.g.: Casinos don’t get mad when you win once…
More formally:
As the sample size (n) increases, the sample mean (average) of a sequence of i.i.d. random variables converges in probability to the true population mean.
It is a consistent estimator of the true population mean
THE MOST FORMAL
If \((Y_1,\ldots, Y_n)\) are i.i.d. and \(\sigma_Y^2 < \infty\), then \(\bar{Y}\) is a consistent estimator of \(\mu_Y\), that is,
\[\Pr\left[\,|\bar{Y}-\mu_Y| < \epsilon\,\right] \rightarrow 1 \text{ as } n \rightarrow \infty, \text{ for any } \epsilon > 0\]
which can be written \(\bar{Y} \xrightarrow{p} \mu_Y\)
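A quick way to see this is to simulate the coin-flip example in R (a sketch; the seed and number of flips are arbitrary):

```r
set.seed(358)
flips <- rbinom(100000, size = 1, prob = 0.5)  # 1 = heads, 0 = tails
mean(flips[1:10])      # noisy with only 10 flips
mean(flips[1:1000])    # closer to 0.5
mean(flips[1:100000])  # very close to 0.5
```

The running average settles toward 0.5 as the number of flips grows, which is exactly what consistency promises.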
The CLT helps us understand how our samples would be distributed if we took a large number of samples.
In real-world applications, we never actually observe the sampling distribution, yet it is useful to always think of a point estimate (for instance our estimate of the mean using one sample) as coming from such a hypothetical distribution.
Understanding the sampling distribution will help us characterize and make sense of the point estimates that we do observe.
As we take more samples, the distribution of the sample means will be approximately normal.
The mean and variance will depend on the specifics of the data, but if the distribution is normal, we can always standardize (make the average equal to zero and variance equal to 1)
We make the average equal to zero by subtracting the population mean from the sample mean:
\[\bar{Y} - \mu_Y\] (true because of L.L.N.)
We make the variance equal to 1 by dividing by the standard deviation of the sample mean, \(\sigma_{\bar{Y}}\):
\[\frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}}\]
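A small simulation sketch of this standardization (the exponential population, seed, and sample size here are arbitrary choices, with \(\mu_Y = 2\) and \(\sigma_Y = 2\)):

```r
set.seed(358)
n <- 100
z <- replicate(10000, {
  y <- rexp(n, rate = 1/2)        # a non-normal population with mean 2 and sd 2
  (mean(y) - 2) / (2 / sqrt(n))   # subtract mu_Y, divide by the sd of the sample mean
})
mean(z)  # approximately 0
sd(z)    # approximately 1
```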
By converting our sample distribution to the standard normal, we can more easily
conduct hypothesis testing (e.g. t-tests and z-tests)
construct confidence intervals
(plus lots of other useful things we will see later)
We’ve taken two samples of workers, and collected information on their wages, age, education and gender (this is a real sample from the Current Population Survey). One sample is from 1996, one sample is from 2015.
Our first task will be to figure out whether average wages increased between 1996 and 2015.
Our QMD file tells RStudio what to produce:
Text chunks get treated as text (with some formatting occasionally specified)
Code chunks (indicated by three “backticks” and a programming language) tell RStudio to run these commands.
Use the glimpse function to see what the data is like:
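For example, assuming the CPS extract has been read into a data frame called cps (use whatever name you gave it when loading the data):

```r
library(tidyverse)
glimpse(cps)  # one row per worker; year, ahe, bachelor, female, age
```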
Rows: 13,201
Columns: 5
$ year <dbl> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1…
$ ahe <dbl> 11.171735, 8.653846, 9.615385, 11.217949, 9.615385, 14.423077…
$ bachelor <dbl> 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0…
$ female <dbl> 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0…
$ age <dbl> 31, 31, 27, 26, 28, 32, 30, 31, 31, 25, 32, 33, 26, 27, 29, 3…
R is at its core a calculator - we can ask it to calculate things and save the results.
We want means for 1996 and for 2015, but we don’t have separate columns for these two years.
The easiest way forward is to split our data into two data sets.
Create a new code chunk by placing your cursor under the text for Question 1.a and then pressing Ctrl+Alt+I (Cmd+Option+I on a Mac), or clicking the Insert Chunk button in the editor toolbar.
“tidyverse” includes a package called “dplyr”
The first argument to every dplyr function is always a data frame
Each function produces a data frame that we can store by assigning it to a name
dplyr’s functions act as verbs that manipulate data frames:
select: pick columns by name
arrange: reorder rows
slice: pick rows using index(es)
filter: pick rows matching criteria
distinct: filter for unique rows
mutate: add new variables
summarise: reduce variables to values
group_by: for grouped operations
Here filter is the verb we need: it splits our sample into a 1996 data set and a 2015 data set (see the sketch below).
Now that we have two different data sets, one for each year, we can calculate the means and store them as values (not data frames) in R. So add to your code chunk:
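A minimal sketch of that chunk, again assuming the combined data frame is called cps; the names cps_1996 and cps_2015 match the objects used in the rest of the handout, while mean_ahe_96 is illustrative (mean_ahe_15 reappears later):

```r
# Split the sample by year with filter
cps_1996 <- cps %>% filter(year == 1996)
cps_2015 <- cps %>% filter(year == 2015)

# Store each sample mean as a single value
mean_ahe_96 <- mean(cps_1996$ahe)
mean_ahe_15 <- mean(cps_2015$ahe)
```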
Click the green arrow to run this code chunk and look in the “environment” pane to see what happened. Do the means seem different at first glance?
Part B asks us to find the standard deviation. This is important on its own, but will also help us test whether the two means are likely to be part of the same distribution or not.
If you’ve been clicking those green arrows, R will have remembered all the prior stuff. So we can write:
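For example (the object names for the standard deviations are illustrative):

```r
sd_ahe_96 <- sd(cps_1996$ahe)
sd_ahe_15 <- sd(cps_2015$ahe)
```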
Now the question wants us to construct intervals around our sample means.
These intervals will tell us the range within which the true population mean is likely to fall with 95% confidence
i.e.: an interval that contains the true value of the population mean in 95% of repeated samples
http://pewresearch.org/pubs/2191/young-adults-workers-labor-market-pay-careers-advancement-recession
Confidence intervals are always of the form:
point estimate ± ME
ME (the margin of error) is always calculated as the product of a critical value and the standard error (SE).
The standard error provides a guide for how large we should make the confidence interval (a large standard error implies we have to make our interval larger)
The critical value is some value for a test statistic beyond which we are going to reject the null hypothesis.
A z-score (also known as a standard score) is a statistical measure that quantifies how many standard deviations a data point is away from the mean of a dataset
In large samples, the critical value is the z-score that leaves the desired probability in the tails (for a 95% interval, 2.5% in each tail).
The standard error is a measure of the variability or uncertainty associated with estimating the population mean (\(\mu_Y\)) from a sample mean (\(\bar{Y}\)).
\[SE = \frac{s_Y}{\sqrt{n}}\]
where \(s_Y\) is the sample standard deviation, which is an estimator of the population standard deviation
We won’t go through a detailed proof of why, but note that as n increases SE decreases.
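A quick illustration with a made-up \(s_Y\): quadrupling the sample size halves the standard error.

```r
s_Y <- 8
s_Y / sqrt(c(100, 400, 1600))  # 0.8, 0.4, 0.2
```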
Estimate mean
Determine standard deviation
Calculate standard error
Find the critical value
Find lower bound: \(\bar{Y} - \text{critical value} \times SE\)
Find upper bound: \(\bar{Y} + \text{critical value} \times SE\)
We’ve already calculated the mean and standard deviation, so let’s now calculate the standard error, critical value, and bounds.
The problem asks for a 95% confidence interval. So we want 2.5% on the tails.
You can calculate this by hand, or computer, or…
it’s 1.96, i.e. the interval extends about 1.96 standard errors on either side of the sample mean.
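A sketch of the by-hand calculation, reusing the means and standard deviations from above; n_1996, mean_ahe_15, ci_lower_2015, and ci_upper_2015 match names that appear later in the handout, the rest are illustrative:

```r
# Sample sizes and standard errors
n_1996 <- nrow(cps_1996)
n_2015 <- nrow(cps_2015)
se_1996 <- sd_ahe_96 / sqrt(n_1996)
se_2015 <- sd_ahe_15 / sqrt(n_2015)

# Critical value for a 95% interval (2.5% in each tail)
crit <- qnorm(0.975)  # roughly 1.96

# Lower and upper bounds
ci_lower_1996 <- mean_ahe_96 - crit * se_1996
ci_upper_1996 <- mean_ahe_96 + crit * se_1996
ci_lower_2015 <- mean_ahe_15 - crit * se_2015
ci_upper_2015 <- mean_ahe_15 + crit * se_2015
```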
CI for 1996: \(12.693 \pm 0.159\)
CI for 2015: \(21.237 \pm 0.148\)
# Calculate the 95% confidence interval for AHE in 1996
ci_1996 <- t.test(cps_1996$ahe, conf.level = 0.95)
# Calculate the 95% confidence interval for AHE in 2015
ci_2015 <- t.test(cps_2015$ahe, conf.level = 0.95)
ci_1996
One Sample t-test
data: cps_1996$ahe
t = 155.94, df = 6102, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
12.53369 12.85283
sample estimates:
mean of x
12.69326
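The object returned by t.test stores the interval, so we can pull the bounds out directly (this is what the plotting code below relies on):

```r
ci_1996$conf.int  # lower and upper bound for 1996
ci_2015$conf.int  # lower and upper bound for 2015
```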
# Create a data frame for the confidence intervals
ci_data <- data.frame(
Year = c("1996", "2015"),
Mean = c(mean(cps_1996$ahe), mean(cps_2015$ahe)),
Lower = c(ci_1996$conf.int[1], ci_2015$conf.int[1]),
Upper = c(ci_1996$conf.int[2], ci_2015$conf.int[2])
)
# Create a point estimate with error bars plot
plot <- ggplot(ci_data, aes(x = Year, y = Mean)) +
geom_point(color = "#DD8C6E", size = 3) +
geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2) +
labs(
title = "95% Confidence Intervals for AHE in 1996 and 2015",
x = "Year",
y = "Mean AHE"
) +
theme_minimal() +
theme(legend.position = "none",
text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
panel.border = element_blank(),
plot.background = element_rect(fill = "#F8F8F8", color = NA),
panel.grid = element_blank(),
axis.line = element_line(colour = "#3D4C5F"))
# Show the plot
plot
As question 2 points out, we might want to adjust for inflation here, converting 1996 earnings into 2015 dollars using the CPI:
# Adjust AHE in 1996 to real 2015 dollars
cps_1996$ahe_real <- cps_1996$ahe * (237 / 156.9)
## Calculate mean
mean_real_96 <- mean(cps_1996$ahe_real)
## Calculate standard deviation
sd_real_96 <- sd(cps_1996$ahe_real)
## Calculate standard error
se_real_1996 <- (sd_real_96 / sqrt(n_1996))
## Calculate confidence interval
ci_real_lower_1996 <- mean_real_96 - (1.96 * se_real_1996)
ci_real_upper_1996 <- mean_real_96 + (1.96 * se_real_1996)
# Create a data frame for the confidence intervals
ci_data <- data.frame(
Year = c("1996", "2015"),
Mean = c(mean_real_96, mean_ahe_15),
Lower = c(ci_real_lower_1996, ci_lower_2015),
Upper = c(ci_real_upper_1996, ci_upper_2015)
)
# Create a point estimate with error bars plot
plot <- ggplot(ci_data, aes(x = Year, y = Mean)) +
geom_point(color = "#DD8C6E", size = 3) +
geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2) +
labs(
title = "95% Confidence Intervals for AHE in 1996 and 2015",
x = "Year",
y = "Mean AHE"
) +
theme_minimal() +
theme(legend.position = "none",
text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
panel.border = element_blank(),
plot.background = element_rect(fill = "#F8F8F8", color = NA),
panel.grid = element_blank(),
axis.line = element_line(colour = "#3D4C5F"))
# Show the plot
plot
Question 4 wants us to figure out if education makes any meaningful difference to average incomes in 2015. First we get a CI for high school grads:
hs_2015 <- cps_2015 %>%
filter(bachelor==0)
ci_hs_2015 <- t.test(hs_2015$ahe, conf.level = 0.95)
ci_hs_2015
One Sample t-test
data: hs_2015$ahe
t = 111.33, df = 3364, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
16.09262 16.66960
sample estimates:
mean of x
16.38111
Next we get a CI for college grads:
col_2015 <- cps_2015 %>%
filter(bachelor==1)
ci_col_2015 <- t.test(col_2015$ahe, conf.level = 0.95)
ci_col_2015
One Sample t-test
data: col_2015$ahe
t = 118.83, df = 3732, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
25.19241 26.03765
sample estimates:
mean of x
25.61503
Now we construct a CI for the difference:
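A sketch of the call that produces the output below; by default t.test with two samples runs the Welch test, and ci_ed_diff_2015 matches the name used in the plotting code later:

```r
ci_ed_diff_2015 <- t.test(col_2015$ahe, hs_2015$ahe, conf.level = 0.95)
ci_ed_diff_2015
```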
Welch Two Sample t-test
data: col_2015$ahe and hs_2015$ahe
t = 35.381, df = 6463.4, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
8.722304 9.745543
sample estimates:
mean of x mean of y
25.61503 16.38111
hs_compare_data <- data.frame(
Education = c("HS Grad 1996", "HS Grad 2015"),
Estimate = c(mean(hs_1996$ahe_real), mean(hs_2015$ahe)),
Lower_CI = c(ci_hs_1996$conf.int[1], ci_hs_2015$conf.int[1]),
Upper_CI = c(ci_hs_1996$conf.int[2], ci_hs_2015$conf.int[2])
)
hs_plot <- ggplot(hs_compare_data, aes(x = Education, y = Estimate)) +
geom_point( color = "#DD8C6E", size = 3) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.15) +
labs(title = "High School Graduates: Wage Estimates and Confidence Intervals",
y = "Average Hourly Earnings",
x = "Year") +
theme_minimal() +
theme(legend.position = "none",
text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
panel.border = element_blank(),
plot.background = element_rect(fill = "#F8F8F8", color = NA),
panel.grid = element_blank(),
axis.line = element_line(colour = "#3D4C5F"),
axis.text.x = element_text(angle = 45, hjust = 1))
hs_plot
col_compare_data <- data.frame(
Education = c("College Grad 1996", "College Grad 2015"),
Estimate = c(mean(col_1996$ahe_real), mean(col_2015$ahe)),
Lower_CI = c(ci_col_1996$conf.int[1], ci_col_2015$conf.int[1]),
Upper_CI = c(ci_col_1996$conf.int[2], ci_col_2015$conf.int[2])
)
col_plot <- ggplot(col_compare_data, aes(x = Education, y = Estimate)) +
geom_point(color = "#DD8C6E", size = 3) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.15) +
labs(title = "College Graduates: Wage Estimates and Confidence Intervals",
y = "Average Hourly Earnings",
x = "Year") +
theme_minimal() +
theme(legend.position = "none",
text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
panel.border = element_blank(),
plot.background = element_rect(fill = "#F8F8F8", color = NA),
panel.grid = element_blank(),
axis.line = element_line(colour = "#3D4C5F"),
axis.text.x = element_text(angle = 45, hjust = 1))
col_plot
diff_compare_data <- data.frame(
Year = c("1996", "2015"),
Estimate = c(mean(col_1996$ahe_real) - mean(hs_1996$ahe_real), mean(col_2015$ahe) - mean(hs_2015$ahe)),
Lower_CI = c(ci_ed_diff_1996$conf.int[1], ci_ed_diff_2015$conf.int[1]),
Upper_CI = c(ci_ed_diff_1996$conf.int[2], ci_ed_diff_2015$conf.int[2])
)
diff_plot <- ggplot(diff_compare_data, aes(x = Year, y = Estimate)) +
geom_point(color = "#DD8C6E", size = 3) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.15) +
labs(title = "Difference in Means: College Graduates vs. High School Graduates",
y = "Difference in Average Hourly Earnings",
x = "Year") +
theme_minimal() +
theme(legend.position = "none",
text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
panel.border = element_blank(),
plot.background = element_rect(fill = "#F8F8F8", color = NA),
panel.grid = element_blank(),
axis.line = element_line(colour = "#3D4C5F"),
axis.text.x = element_text(angle = 45, hjust = 1))
diff_plot
The final question asks you to prepare a table for high school graduates. You don’t have to do anything fancy here, just estimate the following for high school graduates only:
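A hedged sketch of one way to assemble such a table; the exact statistics to report are whatever the question lists, and this version just collects the mean, standard deviation, sample size, and 95% CI bounds for high school graduates in each year, reusing objects from earlier:

```r
hs_table <- data.frame(
  Year  = c(1996, 2015),
  Mean  = c(mean(hs_1996$ahe_real), mean(hs_2015$ahe)),
  SD    = c(sd(hs_1996$ahe_real), sd(hs_2015$ahe)),
  N     = c(nrow(hs_1996), nrow(hs_2015)),
  Lower = c(ci_hs_1996$conf.int[1], ci_hs_2015$conf.int[1]),
  Upper = c(ci_hs_1996$conf.int[2], ci_hs_2015$conf.int[2])
)
hs_table
```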