Econ 358: Econometrics
We want to estimate something (the population mean, proportion, etc…)
We want to test our hypothesis (is the mean different from 0, different from another mean, etc…)
We want to establish a confidence interval for our estimate (we are sure that 95% of the time an estimate of the mean will lie between…)
We are often interested in population parameters
Complete populations are difficult to collect data on, so we use sample statistics as point estimates for the unknown population parameters of interest.
Error in the estimate = difference between population parameter and sample statistic
Bias is systematic tendency to over- or under-estimate the true population parameter.
Sampling error describes how much an estimate will tend to vary from one sample to the next.
Much of statistics is focused on understanding and quantifying sampling error, and sample size is helpful for quantifying this error.
Law of Large Numbers
Under general conditions, the sample average will be close to the population mean with a very high probability when the sample size is large.
e.g.: As we flip a coin more and more times, the average of the flips will come to approximate 0.5
e.g.: Casinos don’t get mad when you win once…
More formally:
As the sample size (n) increases, the sample mean (average) of a sequence of i.i.d. random variables converges in probability to the true population mean.
It is a consistent estimator of the true population mean
THE MOST FORMAL
If \((Y_1,\ldots, Y_n)\) are i.i.d. and \(\sigma_Y^2 < \infty\), then \(\bar{Y}\) is a consistent estimator of \(\mu_Y\), that is,
\[\Pr\left[\,|\bar{Y}-\mu_Y| < \epsilon\,\right] \rightarrow 1 \text{ as } n \rightarrow \infty, \text{ for any } \epsilon > 0\]
which can be written \(\bar{Y} \xrightarrow{p} \mu_Y\)
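A quick way to see this is to simulate the coin-flip example in R (a sketch; the seed and number of flips are arbitrary):

```r
set.seed(358)
flips <- rbinom(100000, size = 1, prob = 0.5)  # 1 = heads, 0 = tails
mean(flips[1:10])      # noisy with only 10 flips
mean(flips[1:1000])    # closer to 0.5
mean(flips[1:100000])  # very close to 0.5
```

The running average settles toward 0.5 as the number of flips grows, which is exactly what consistency promises.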
The CLT helps us understand how our samples would be distributed if we took a large number of samples.
In real-world applications, we never actually observe the sampling distribution, yet it is useful to always think of a point estimate (for instance our estimate of the mean using one sample) as coming from such a hypothetical distribution.
Understanding the sampling distribution will help us characterize and make sense of the point estimates that we do observe.
As we take more samples, the distribution of the sample means will be approximately normal.
The mean and variance will depend on the specifics of the data, but if the distribution is normal, we can always standardize (make the average equal to zero and variance equal to 1)
We make the average equal to zero by subtracting the population mean from the sample mean:
\[\bar{Y} - \mu_Y\] (true because of L.L.N.)
We make the variance equal to 1 by dividing by the standard deviation of the sample mean, \(\sigma_{\bar{Y}}\):
\[\frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}}\]
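A small simulation sketch of this standardization (the exponential population, seed, and sample size here are arbitrary choices, with \(\mu_Y = 2\) and \(\sigma_Y = 2\)):

```r
set.seed(358)
n <- 100
z <- replicate(10000, {
  y <- rexp(n, rate = 1/2)        # a non-normal population with mean 2 and sd 2
  (mean(y) - 2) / (2 / sqrt(n))   # subtract mu_Y, divide by the sd of the sample mean
})
mean(z)  # approximately 0
sd(z)    # approximately 1
```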
By converting our sample distribution to the standard normal, we can more easily
conduct hypothesis testing (e.g. t-tests and z-tests)
construct confidence intervals
(plus lots of other useful things we will see later)
We’ve taken two samples of workers, and collected information on their wages, age, education and gender (this is a real sample from the Current Population Survey). One sample is from 1996, one sample is from 2015.
Our first task will be to figure out whether average wages increased between 1996 and 2015.
Our QMD file tells RStudio what to produce:
Text chunks get treated as text (with some formatting occasionally specified)
Code chunks (indicated by three “backticks” and a programming language) tell RStudio to run these commands.
Use the glimpse function to see what the data is like:
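For example, assuming the CPS extract has been read into a data frame called cps (use whatever name you gave it when loading the data):

```r
library(tidyverse)
glimpse(cps)  # one row per worker; year, ahe, bachelor, female, age
```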
Rows: 13,201
Columns: 5
$ year <dbl> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1…
$ ahe <dbl> 11.171735, 8.653846, 9.615385, 11.217949, 9.615385, 14.423077…
$ bachelor <dbl> 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0…
$ female <dbl> 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0…
$ age <dbl> 31, 31, 27, 26, 28, 32, 30, 31, 31, 25, 32, 33, 26, 27, 29, 3…
R is at its core a calculator - we can ask it to calculate things and save the results.
We want means for 1996 and for 2015, but we don’t have separate columns for these two years.
The easiest way forward is to split our data into two data sets.
Create a new code chunk by placing your cursor under the text for Question 1.a and then pressing Ctrl+Alt+I (Cmd+Option+I on a Mac), or clicking the Insert Chunk button in the editor toolbar.
“tidyverse” includes a package called “dplyr”
The first argument to every dplyr function is always a data frame
Each function produces a data frame that we can store by assigning it to a name
dplyr’s functions act as verbs that manipulate data frames:
select: pick columns by name
arrange: reorder rows
slice: pick rows using index(es)
filter: pick rows matching criteria
distinct: filter for unique rows
mutate: add new variables
summarise: reduce variables to values
group_by: for grouped operations
Here filter is the verb we need: it splits our sample into a 1996 data set and a 2015 data set (see the sketch below).
Now that we have two different data sets, one for each year, we can calculate the means and store them as values (not data frames) in R. So add to your code chunk:
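A minimal sketch of that chunk, again assuming the combined data frame is called cps; the names cps_1996 and cps_2015 match the objects used in the rest of the handout, while mean_ahe_96 is illustrative (mean_ahe_15 reappears later):

```r
# Split the sample by year with filter
cps_1996 <- cps %>% filter(year == 1996)
cps_2015 <- cps %>% filter(year == 2015)

# Store each sample mean as a single value
mean_ahe_96 <- mean(cps_1996$ahe)
mean_ahe_15 <- mean(cps_2015$ahe)
```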
Click the green arrow to run this code chunk and look in the “environment” pane to see what happened. Do the means seem different at first glance?
Part B asks us to find the standard deviation. This is important on its own, but will also help us test whether the two means are likely to be part of the same distribution or not.
If you’ve been clicking those green arrows, R will have remembered all the prior stuff. So we can write:
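For example (the object names for the standard deviations are illustrative):

```r
sd_ahe_96 <- sd(cps_1996$ahe)
sd_ahe_15 <- sd(cps_2015$ahe)
```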
Now the question wants us to construct intervals around our sample means.
These intervals will tell us the range within which the true population mean is likely to fall with 95% confidence
i.e.: an interval that contains the true value of the population mean in 95% of repeated samples
http://pewresearch.org/pubs/2191/young-adults-workers-labor-market-pay-careers-advancement-recession
Confidence intervals are always of the form:
point estimate ± ME
ME (the margin of error) is always calculated as the product of a critical value and the standard error (SE).
The standard error provides a guide for how large we should make the confidence interval (a large standard error implies we have to make our interval larger)
The critical value is some value for a test statistic beyond which we are going to reject the null hypothesis.
A z-score (also known as a standard score) is a statistical measure that quantifies how many standard deviations a data point is away from the mean of a dataset
In large samples, the critical value is the z-score that leaves the desired probability in the tails (for a 95% interval, 2.5% in each tail).
The standard error is a measure of the variability or uncertainty associated with estimating the population mean (\(\mu_Y\)) from a sample mean (\(\bar{Y}\)).
\[SE = \frac{s_Y}{\sqrt{n}}\]
where \(s_Y\) is the sample standard deviation, which is an estimator of the population standard deviation
We won’t go through a detailed proof of why, but note that as n increases SE decreases.
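A quick illustration with a made-up \(s_Y\): quadrupling the sample size halves the standard error.

```r
s_Y <- 8
s_Y / sqrt(c(100, 400, 1600))  # 0.8, 0.4, 0.2
```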
Estimate mean
Determine standard deviation
Calculate standard error
Find the critical value
Find lower bound: \(\bar{Y} - \text{critical value} \times SE\)
Find upper bound: \(\bar{Y} + \text{critical value} \times SE\)
We’ve already calculated the mean and standard deviation, so let’s now calculate the standard error, critical value, and bounds.
The problem asks for a 95% confidence interval. So we want 2.5% on the tails.
You can calculate this by hand, or computer, or…
it’s 1.96, i.e. the interval extends about 1.96 standard errors on either side of the sample mean.
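A sketch of the by-hand calculation, reusing the means and standard deviations from above; n_1996, mean_ahe_15, ci_lower_2015, and ci_upper_2015 match names that appear later in the handout, the rest are illustrative:

```r
# Sample sizes and standard errors
n_1996 <- nrow(cps_1996)
n_2015 <- nrow(cps_2015)
se_1996 <- sd_ahe_96 / sqrt(n_1996)
se_2015 <- sd_ahe_15 / sqrt(n_2015)

# Critical value for a 95% interval (2.5% in each tail)
crit <- qnorm(0.975)  # roughly 1.96

# Lower and upper bounds
ci_lower_1996 <- mean_ahe_96 - crit * se_1996
ci_upper_1996 <- mean_ahe_96 + crit * se_1996
ci_lower_2015 <- mean_ahe_15 - crit * se_2015
ci_upper_2015 <- mean_ahe_15 + crit * se_2015
```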
CI for 1996: \(12.693 \pm 0.159\)
CI for 2015: \(21.237 \pm 0.148\)
# Calculate the 95% confidence interval for AHE in 1996
ci_1996 <- t.test(cps_1996$ahe, conf.level = 0.95)
# Calculate the 95% confidence interval for AHE in 2015
ci_2015 <- t.test(cps_2015$ahe, conf.level = 0.95)
ci_1996
One Sample t-test
data: cps_1996$ahe
t = 155.94, df = 6102, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
12.53369 12.85283
sample estimates:
mean of x
12.69326
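The object returned by t.test stores the interval, so we can pull the bounds out directly (this is what the plotting code below relies on):

```r
ci_1996$conf.int  # lower and upper bound for 1996
ci_2015$conf.int  # lower and upper bound for 2015
```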
# Create a data frame for the confidence intervals
ci_data <- data.frame(
Year = c("1996", "2015"),
Mean = c(mean(cps_1996$ahe), mean(cps_2015$ahe)),
Lower = c(ci_1996$conf.int[1], ci_2015$conf.int[1]),
Upper = c(ci_1996$conf.int[2], ci_2015$conf.int[2])
)
# Create a point estimate with error bars plot
plot <- ggplot(ci_data, aes(x = Year, y = Mean)) +
geom_point(color = "#DD8C6E", size = 3) +
geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2) +
labs(
title = "95% Confidence Intervals for AHE in 1996 and 2015",
x = "Year",
y = "Mean AHE"
) +
theme_minimal() +
theme(legend.position = "none",
text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
panel.border = element_blank(),
plot.background = element_rect(fill = "#F8F8F8", color = NA),
panel.grid = element_blank(),
axis.line = element_line(colour = "#3D4C5F"))
# Show the plot
plot
As question 2 points out, we might want to adjust for inflation here, converting 1996 earnings into 2015 dollars using the CPI:
# Adjust AHE in 1996 to real 2015 dollars
cps_1996$ahe_real <- cps_1996$ahe * (237 / 156.9)
## Calculate mean
mean_real_96 <- mean(cps_1996$ahe_real)
## Calculate standard deviation
sd_real_96 <- sd(cps_1996$ahe_real)
## Calculate standard error
se_real_1996 <- (sd_real_96 / sqrt(n_1996))
## Calculate confidence interval
ci_real_lower_1996 <- mean_real_96 - (1.96 * se_real_1996)
ci_real_upper_1996 <- mean_real_96 + (1.96 * se_real_1996)
# Create a data frame for the confidence intervals
ci_data <- data.frame(
Year = c("1996", "2015"),
Mean = c(mean_real_96, mean_ahe_15),
Lower = c(ci_real_lower_1996, ci_lower_2015),
Upper = c(ci_real_upper_1996, ci_upper_2015)
)
# Create a point estimate with error bars plot
plot <- ggplot(ci_data, aes(x = Year, y = Mean)) +
geom_point(color = "#DD8C6E", size = 3) +
geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2) +
labs(
title = "95% Confidence Intervals for AHE in 1996 and 2015",
x = "Year",
y = "Mean AHE"
) +
theme_minimal() +
theme(legend.position = "none",
text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
panel.border = element_blank(),
plot.background = element_rect(fill = "#F8F8F8", color = NA),
panel.grid = element_blank(),
axis.line = element_line(colour = "#3D4C5F"))
# Show the plot
plot
Question 4 wants us to figure out if education makes any meaningful difference to average incomes in 2015. First we get a CI for high school grads:
hs_2015 <- cps_2015 %>%
filter(bachelor==0)
ci_hs_2015 <- t.test(hs_2015$ahe, conf.level = 0.95)
ci_hs_2015
One Sample t-test
data: hs_2015$ahe
t = 111.33, df = 3364, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
16.09262 16.66960
sample estimates:
mean of x
16.38111
Next we get a CI for college grads:
col_2015 <- cps_2015 %>%
filter(bachelor==1)
ci_col_2015 <- t.test(col_2015$ahe, conf.level = 0.95)
ci_col_2015
One Sample t-test
data: col_2015$ahe
t = 118.83, df = 3732, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
25.19241 26.03765
sample estimates:
mean of x
25.61503
Now we construct a CI for the difference:
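A sketch of the call that produces the output below; by default t.test with two samples runs the Welch test, and ci_ed_diff_2015 matches the name used in the plotting code later:

```r
ci_ed_diff_2015 <- t.test(col_2015$ahe, hs_2015$ahe, conf.level = 0.95)
ci_ed_diff_2015
```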
Welch Two Sample t-test
data: col_2015$ahe and hs_2015$ahe
t = 35.381, df = 6463.4, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
8.722304 9.745543
sample estimates:
mean of x mean of y
25.61503 16.38111
hs_compare_data <- data.frame(
Education = c("HS Grad 1996", "HS Grad 2015"),
Estimate = c(mean(hs_1996$ahe_real), mean(hs_2015$ahe)),
Lower_CI = c(ci_hs_1996$conf.int[1], ci_hs_2015$conf.int[1]),
Upper_CI = c(ci_hs_1996$conf.int[2], ci_hs_2015$conf.int[2])
)
hs_plot <- ggplot(hs_compare_data, aes(x = Education, y = Estimate)) +
geom_point( color = "#DD8C6E", size = 3) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.15) +
labs(title = "High School Graduates: Wage Estimates and Confidence Intervals",
y = "Average Hourly Earnings",
x = "Year") +
theme_minimal() +
theme(legend.position = "none",
text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
panel.border = element_blank(),
plot.background = element_rect(fill = "#F8F8F8", color = NA),
panel.grid = element_blank(),
axis.line = element_line(colour = "#3D4C5F"),
axis.text.x = element_text(angle = 45, hjust = 1))
hs_plot
col_compare_data <- data.frame(
Education = c("College Grad 1996", "College Grad 2015"),
Estimate = c(mean(col_1996$ahe_real), mean(col_2015$ahe)),
Lower_CI = c(ci_col_1996$conf.int[1], ci_col_2015$conf.int[1]),
Upper_CI = c(ci_col_1996$conf.int[2], ci_col_2015$conf.int[2])
)
col_plot <- ggplot(col_compare_data, aes(x = Education, y = Estimate)) +
geom_point(color = "#DD8C6E", size = 3) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.15) +
labs(title = "College Graduates: Wage Estimates and Confidence Intervals",
y = "Average Hourly Earnings",
x = "Year") +
theme_minimal() +
theme(legend.position = "none",
text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
panel.border = element_blank(),
plot.background = element_rect(fill = "#F8F8F8", color = NA),
panel.grid = element_blank(),
axis.line = element_line(colour = "#3D4C5F"),
axis.text.x = element_text(angle = 45, hjust = 1))
col_plot
diff_compare_data <- data.frame(
Year = c("1996", "2015"),
Estimate = c(mean(col_1996$ahe_real) - mean(hs_1996$ahe_real), mean(col_2015$ahe) - mean(hs_2015$ahe)),
Lower_CI = c(ci_ed_diff_1996$conf.int[1], ci_ed_diff_2015$conf.int[1]),
Upper_CI = c(ci_ed_diff_1996$conf.int[2], ci_ed_diff_2015$conf.int[2])
)
diff_plot <- ggplot(diff_compare_data, aes(x = Year, y = Estimate)) +
geom_point(color = "#DD8C6E", size = 3) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.15) +
labs(title = "Difference in Means: College Graduates vs. High School Graduates",
y = "Difference in Average Hourly Earnings",
x = "Year") +
theme_minimal() +
theme(legend.position = "none",
text = element_text(family = "Atkinson Hyperlegible", color = "#3D4C5F"),
panel.border = element_blank(),
plot.background = element_rect(fill = "#F8F8F8", color = NA),
panel.grid = element_blank(),
axis.line = element_line(colour = "#3D4C5F"),
axis.text.x = element_text(angle = 45, hjust = 1))
diff_plot
The final question asks you to prepare a table for high school graduates. You don’t have to do anything fancy here, just estimate the following for high school graduates only:
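A hedged sketch of one way to assemble such a table; the exact statistics to report are whatever the question lists, and this version just collects the mean, standard deviation, sample size, and 95% CI bounds for high school graduates in each year, reusing objects from earlier:

```r
hs_table <- data.frame(
  Year  = c(1996, 2015),
  Mean  = c(mean(hs_1996$ahe_real), mean(hs_2015$ahe)),
  SD    = c(sd(hs_1996$ahe_real), sd(hs_2015$ahe)),
  N     = c(nrow(hs_1996), nrow(hs_2015)),
  Lower = c(ci_hs_1996$conf.int[1], ci_hs_2015$conf.int[1]),
  Upper = c(ci_hs_1996$conf.int[2], ci_hs_2015$conf.int[2])
)
hs_table
```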