Week 10: Hypothesis Testing with Estimated Regression Coefficients
POL269 Political Research
Javier Sajuria
2024-03-25
Midterm results
mean = 64.23, sd = 7.76
Plan for Today
Hypothesis Testing Intuition
Null Hypothesis
Alternative Hypothesis
Test Statistic
P-Values
Hypothesis Testing Formal Procedure
Example: Do Small Classes Improve Math Scores?
What Is the Estimated Average Treatment Effect?
Is the Effect Statistically Significant?
The Context
Suppose we are analysing data from a randomized experiment for the purpose of estimating the average causal effect of a treatment on an outcome
In this context, X is _____
And, Y is _____
What do we need to calculate to estimate the average treatment effect? _____
If we want to compute it by fitting a linear model: \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\), which estimated coefficient is equivalent to the difference-in-means estimator? _____
\(\widehat{\beta}\) is the average treatment effect at the sample level: what we can estimate
\(\beta\) (without a hat) is the true value of the average treatment effect at the population level: what we would like to know
As we saw, sample statistics often differ from population parameters because of the noise introduced by sampling variability, so we cannot assume that \(\widehat{\beta}\) equals \(\beta\)
The question we would like to answer is: Looking at the sample data, do we have enough evidence to say that the average treatment effect is likely to be different than zero at the population level?
In other words, can we say that \(\beta\) is likely to not be zero?
To answer this question, we need to do hypothesis testing
Hypothesis Testing
Methodology based on proof by contradiction: We start by assuming the contrary of what we would like to prove and show how this assumption leads to a logical contradiction
In this class, we will use hypothesis testing to determine whether \(\beta\) (the true value of the average treatment effect at the population level) is likely to be different than zero
We will set the null hypothesis to state that \(\beta\) is zero: \[H_0{:} \,\beta = 0\]
We will set the alternative hypothesis to state that \(\beta\) is either positive or negative (this is known as a two-sided alternative hypothesis):\[H_1{:} \,\beta \neq 0\]
Thanks to the CLT, we know that if \(H_0\) is true, then, over repeated samples, the test statistic satisfies:
\[
\textrm{test-statistic} = \frac{\widehat{\beta}}{\textrm{ standard error of }\widehat{\beta}} \sim N(0,1)
\]
That is, if we were to draw multiple large samples from the same target population and calculate \(\widehat{\beta}\) and the standard error of \(\widehat{\beta}\) each time, the resulting test statistics would follow a standard normal distribution
In reality, we only draw one sample, so we won’t be able to observe the distribution of test statistics
We will only observe one test statistic: \(z^{obs}\)
Since we know the distribution of the test statistic under the null (if the null hypothesis is true), we can calculate the probability of observing a test statistic as extreme as, or more extreme than, the one we actually observe if \(H_0\) is true
This is known as the p-value: \(P(Z \leq -|z^{obs}|) + P(Z \geq |z^{obs}|)\)
If the p-value is large: the probability of observing \(z^{obs}\) or a more extreme value is large if \(H_0\) is true
\(z^{obs}\) is common relative to the distribution of test statistics under the null (if the null hypothesis is true)
Our evidence is consistent with \(H_0\) being true
Conclusion: We fail to reject \(H_0\) and conclude the average causal effect is not statistically significant (we cannot statistically distinguish \(\beta\) from zero)
If the p-value is small: the probability of observing \(z^{obs}\) or a more extreme value is small if \(H_0\) is true
\(z^{obs}\) is extreme relative to the distribution of test statistics under the null (if the null hypothesis is true)
Our evidence is inconsistent with \(H_0\) being true (either \(H_0\) is not true or we got unlucky by drawing a fringe sample)
Conclusion: We reject \(H_0\) and conclude the average causal effect is statistically significant (we can statistically distinguish \(\beta\) from zero)
How small does the p-value need to be for us to reject the null hypothesis?
Social scientists conventionally use one of three significance levels: 10%, 5%, and 1%
We will use 5%
When p-value \(>\) 0.05: we will fail to reject \(H_0\) and conclude the average treatment effect is not statistically significant at the 5% level (it is not distinguishable from zero at the population level)
When p-value \(\leq\) 0.05: we will reject \(H_0\) and conclude the average treatment effect is statistically significant at the 5% level (it is distinguishable from zero at the population level)
Note that through this procedure, we NEVER accept the null hypothesis
Failing to reject the null hypothesis is not the same as accepting it
Just because we have not found evidence against the null hypothesis doesn’t mean that we have proven it to be true
On the flip side, however, rejecting the null hypothesis is the same as accepting the alternative hypothesis, although we typically do not express it that way
Shortcut
Recall: \(P(-1.96 \leq Z \leq 1.96) \approx 0.95\)
So, the probability that \(Z\) takes a value less than or equal to -1.96 plus the probability that \(Z\) takes a value greater than or equal to 1.96 is approximately 5% (1-0.95=0.05)
When p-value \(>\) 0.05, it means that \(|z^{obs}|<\) 1.96
When p-value \(\leq\) 0.05, it means that \(|z^{obs}|\geq\) 1.96
We can draw conclusions based on either the value of \(|z^{obs}|\) or the p-value
Both procedures are mathematically equivalent and lead to the same conclusion
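As a quick check of this shortcut in R (nothing here depends on our data), pnorm() gives the standard normal probabilities behind the 1.96 cut-off:
pnorm(1.96) - pnorm(-1.96) # probability that Z falls between -1.96 and 1.96, about 0.95
2 * pnorm(-1.96) # probability in the two tails combined, about 0.05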
Hypothesis Testing
Specify null and alternative hypotheses
\(\textrm{H}_{0} {:} \,\, \beta{=}{0}\) (meaning: the true value of the average treatment effect at the population level is zero)
\(\textrm{H}_{1} {:} \,\, \beta{\neq}{0}\) (meaning: the true value of the average treatment effect at the population level is either positive or negative)
Compute observed value of the test statistic and (perhaps also) the associated p-value
\[
z^{obs} = \frac{\widehat{\beta}}{\textrm{standard error of }\widehat{\beta}}
\]
\[
\textrm{p-value} = 2 \times P(Z \leq -|z^{obs}|)
\]
Conclude
If \(|z^{obs}| < 1.96\) or p-value \(>\) 0.05, fail to reject \(H_0\) and conclude the average treatment effect is not statistically significant at the 5% level
If \(|z^{obs}| \geq 1.96\) or p-value \(\leq\) 0.05, reject \(H_0\) and conclude the average treatment effect is statistically significant at the 5% level
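To make the procedure concrete, here is a minimal R sketch using made-up numbers (beta_hat and se below are hypothetical, not estimates from any dataset):
beta_hat <- 6 # hypothetical estimated coefficient
se <- 2.2 # hypothetical standard error of the estimated coefficient
z_obs <- beta_hat / se # observed test statistic, about 2.73
p_value <- 2 * pnorm(-abs(z_obs)) # two-sided p-value, about 0.006
abs(z_obs) >= 1.96 # TRUE, so we would reject the null hypothesis at the 5% level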
The importance of replication
When an effect is statistically significant at the 5% level, do we know for sure that the true value of the average causal effect at the population level is not zero?
No, we do not
A p-value of 0.05 does not rule out the possibility that the average causal effect is zero
Thanks to the CLT, we know that if the null hypothesis is true, we will wrongly reject the null in 5% of the samples drawn from the target population when using a significance level of 5%
It is important to replicate social scientific studies to confirm that we arrive at similar conclusions when analysing a different sample from the same target population
While the probability of falsely rejecting the null hypothesis in any one sample is 5%, the probability of falsely rejecting the null twice in a row, when analysing two independent samples of data drawn from the same target population, is only 0.25%
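A quick arithmetic check for two independent samples: \(0.05 \times 0.05 = 0.0025 = 0.25\%\)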
Let's return to the STAR dataset and estimate the average causal effect of attending a small class on maths test scores (in the book, we estimated the effect on reading test scores)
The data come from a randomized experiment conducted in Tennessee, where students were randomly assigned to attend either a small class or a regular-size class from kindergarten until 3rd grade
To estimate the average causal effect of attending a small class on math test scores, what estimator can we use? The difference-in-means estimator
Although we could compute it directly, let's compute it by fitting a linear model, so that \(\widehat{\beta}\) is equivalent to the difference-in-means estimator
0. Get Ready for the Analysis
Open new R file, save it in your working directory (POL269 folder)
Load the data and create any variables needed
## set the working directory to the POL269 folder
setwd("~/Desktop/pol269") # if Mac
setwd("C:/user/Desktop/pol269") # if Windows
## load and look at the data
star <- read.csv("STAR.csv") # reads and stores data
head(star) # shows the first few observations
classtype reading math graduated small
1 small 578 610 1 1
2 regular 612 612 1 0
3 regular 583 606 1 0
4 small 661 648 1 1
5 small 614 636 1 1
6 regular 610 603 0 0
The treatment variable is _____
The outcome variable is _____
Is the outcome variable binary? And, if not, what’s its unit of measurement? (this affects the interpretation of \(\widehat{\beta}\))
no, Y is not binary and it is measured in points
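A quick way to check this in R (a sketch, assuming star has been loaded as above):
head(star$math) # math test scores, measured in points
length(unique(star$math)) # many distinct values, so the outcome is not binary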
1. What Is the Estimated Average Treatment Effect?
Fit a linear model so that the estimated slope coefficient is equivalent to the difference-in-means estimator. In this case, the fitted line is: \(\widehat{math} = \widehat{\alpha} + \widehat{\beta} small\)
Store the fitted model in an object called fit and then ask R to provide the contents of fit
R code to fit and store linear model?
fit <- lm(star$math ~ star$small) # fits linear model
fit # shows the contents of fit
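As a sanity check (a sketch, assuming star and fit are defined as above), the estimated slope should match the difference-in-means estimator:
mean(star$math[star$small == 1]) - mean(star$math[star$small == 0]) # difference-in-means estimator
coef(fit)[2] # estimated slope coefficient, the same value (about 5.99)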
Direction, size, and unit of measurement of the effect?
An increase of about 6 (or 5.99) points
Why?
increase because we are measuring a change in \(Y\) and \(\widehat{\beta}\) (which is equivalent to the difference-in-means estimator) is positive
points because math is non-binary and measured in points
about 6 (or 5.99) because we do not need to transform the result since math is non-binary
What’s the estimated average treatment effect?
This is the conclusion statement we learned to write back in Chapter 2 of the book
See Tip on page 43
Ideally you want to mention all the key elements of the analysis: the assumption, why the assumption is reasonable, the treatment, the outcome, as well as the direction, size, and unit of measurement of the average treatment effect
Conclusion Statement
Assuming that the treatment and control groups are comparable (a reasonable assumption because …), we estimate that [the treatment] [increases/decreases] [the outcome] by [size and unit of measurement of the effect], on average.
Answer: Assuming that students who attended a small class were comparable to students who attended a regular-size class (a reasonable assumption because the data come from a randomized experiment), we estimate that attending a small class increases math test scores by about 6 points, on average
2. Is the Effect Statistically Significant?
That is, is the average treatment effect distinguishable from zero at the population level, statistically speaking?
To answer, we need to do hypothesis testing
Specify null and alternative hypotheses
\(H_0 {:} \,\, \beta=0\) (meaning: attending a small class has no average causal effect on math test scores at the population level)
\(H_1 {:} \,\, \beta\neq0\) (meaning: attending a small class either increases or decreases math test scores, on average, at the population level)
Compute observed test statistic and associated p-value
We ask R to compute both for us by running summary() where we specify inside the parentheses the name of the object where we stored the output of the lm() function
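Assuming the fitted model is stored in fit as above, the call is:
summary(fit) # shows estimates, standard errors, test statistics, and p-values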
##
## Call:
## lm(formula = star$math ~ star$small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.827 -27.585 -0.827 26.163 145.163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 628.837 1.476 426.09 < 2e-16 ***
## star$small 5.990 2.178 2.75 0.00604 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.74 on 1272 degrees of freedom
## Multiple R-squared: 0.005911, Adjusted R-squared: 0.00513
## F-statistic: 7.564 on 1 and 1272 DF, p-value: 0.006039
The observed test statistic is ____
The associated p-value is ____
As explained in the book, R does not assume that the sample size is large enough to use the CLT and, as a result, the distribution of the test statistic under the null hypothesis is the t-distribution (not the standard normal distribution)
The name of the test statistic is the t-statistic (instead of the z-statistic), which is why R refers to the observed value of the t-statistic as the t-value
The associated p-values are slightly larger than the ones we would arrive at using the standard normal distribution; they lead to somewhat more conservative inferences
As long as the sample is not very small, however, the difference is typically negligible, so, in this class, we will rely on the p-values provided by R
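To see how small the difference is here (a sketch using the values from the output above, with 1272 degrees of freedom):
2 * pt(-2.75, df = 1272) # p-value based on the t-distribution, about 0.0060
2 * pnorm(-2.75) # p-value based on the standard normal distribution, also about 0.0060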
Do we reject or fail to reject the null hypothesis? We reject the null hypothesis because …
option A: absolute value of the observed test statistic is greater than 1.96 (|2.75| \({>}\) 1.96)
option B: p-value is smaller than 0.05 (0.006 \({<}\) 0.05)
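If you prefer to extract these quantities from R rather than read them off the printed table, here is a minimal sketch (assuming fit is the fitted model from above):
coefs <- summary(fit)$coefficients # matrix of estimates, standard errors, t values, and p-values
t_obs <- coefs["star$small", "t value"] # observed test statistic, about 2.75
p_val <- coefs["star$small", "Pr(>|t|)"] # associated p-value, about 0.006
abs(t_obs) >= 1.96 # option A: TRUE, so we reject the null hypothesis
p_val <= 0.05 # option B: TRUE, so we reject the null hypothesis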
Is the effect statistically significant at the 5% level?
Answer: Yes, the effect is statistically significant at the 5% level. Attending a small class is likely to have a non-zero average causal effect on maths test scores, at the population level.
Today’s class
Hypothesis Testing with Estimated Regression Coefficients
Example with Non-Binary Outcome: Do Small Classes Improve Maths Scores?
What is the estimated average treatment effect?
Is the effect statistically significant?
Next class
Example with Binary Outcome: Do Small Classes Increase the Probability of Graduating?
What is the estimated average treatment effect?
Is the effect statistically significant?
Can we interpret the estimated average treatment effect as a causal effect?
Can we generalise the results to the population level?