Week 10: Hypothesis Testing with Estimated Regression Coefficients

POL269 Political Research

Javier Sajuria

2024-03-25

Midterm results

      mean       sd
1 64.23095 7.755076

Plan for Today

  • Hypothesis Testing Intuition
    • Null Hypothesis
    • Alternative Hypothesis
    • Test Statistic
    • P-Values
  • Hypothesis Testing Formal Procedure
  • Example: Do Small Classes Improve Math Scores?
    • What Is the Estimated Average Treatment Effect?
    • Is the Effect Statistically Significant?

The Context

  • Suppose we are analysing data from a randomized experiment for the purpose of estimating the average causal effect of a treatment on an outcome
  • In this context, X is _____
  • And, Y is _____
  • What do we need to calculate to estimate the average treatment effect? _____
  • If we want to compute it by fitting a linear model: \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\), which estimated coefficient is equivalent to the difference-in-means estimator? _____

  • \(\widehat{\beta}\) is the average treatment effect at the sample level: what we can estimate
  • \(\beta\) (without a hat) is the true value of the average treatment effect at the population level: what we would like to know
  • As we saw, sample statistics often differ from population parameters because of the noise introduced by sampling variability, so we cannot assume that \(\widehat{\beta}\) equals \(\beta\)
  • The question we would like to answer is: Looking at the sample data, do we have enough evidence to say that the average treatment effect is likely to be different than zero at the population level?
    • In other words, can we say that \(\beta\) is likely to not be zero?
  • To answer this question, we need to do hypothesis testing

Hypothesis Testing

  • Methodology based on proof by contradiction: We start by assuming the contrary of what we would like to prove and show how this assumption leads to a logical contradiction
  • In this class, we will use hypothesis testing to determine whether \(\beta\) (the true value of the average treatment effect at the population level) is likely to be different than zero
    • We will set the null hypothesis to state that \(\beta\) is zero: \[H_0{:} \,\beta = 0\]
    • We will set the alternative hypothesis to state that \(\beta\) is either positive or negative (this is known as a two-sided alternative hypothesis):\[H_1{:} \,\beta \neq 0\]

  • Thanks to the CLT, we know that if \(H_0\) is true, then over multiple large samples the test statistic approximately follows a standard normal distribution:

\[ \textrm{test-statistic} = \frac{\widehat{\beta}}{\textrm{ standard error of }\widehat{\beta}} \sim N(0,1) \]

  • That is, if we were to draw multiple large samples from the same target population, calculate \(\widehat{\beta}\) and the standard error of \(\widehat{\beta}\) each time, then the multiple test statistics would be distributed as a standard normal distribution
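The thought experiment above can be sketched as a small simulation. This is only an illustration under stated assumptions: the data are simulated (not from any real study), the seed, sample size, number of samples, and outcome distribution are arbitrary choices, and the outcome is generated independently of the treatment so that \(\beta = 0\) holds by construction.

```r
set.seed(269)    # arbitrary seed so the sketch is reproducible

n_sims <- 2000   # number of hypothetical samples drawn
n <- 500         # size of each hypothetical sample

z_stats <- replicate(n_sims, {
  x <- rbinom(n, size = 1, prob = 0.5)  # random treatment assignment
  y <- rnorm(n, mean = 50, sd = 10)     # outcome unrelated to x, so beta = 0
  est <- coef(summary(lm(y ~ x)))       # coefficient table of the fitted model
  est["x", "Estimate"] / est["x", "Std. Error"]  # test statistic for this sample
})

mean(z_stats)               # should be close to 0
sd(z_stats)                 # should be close to 1
mean(abs(z_stats) >= 1.96)  # share of extreme test statistics, close to 0.05
```

Running hist(z_stats) afterwards should show the familiar bell shape centred at zero.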

  • In reality, we only draw one sample, so we won’t be able to observe the distribution of test statistics
  • We will only observe one test statistic: \(z^{obs}\)
  • Since we know the distribution of the test statistics under the null (if the null hypothesis is true), we can calculate the probability of observing a test statistic as extreme as, or more extreme than, the one we do observe if \(H_0\) is true
  • This is known as the p-value: \(P(Z{\leq}{-}|z^{obs}|)+P(Z{\geq}|z^{obs}|)\)
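As a minimal sketch, the two tail probabilities in the p-value formula can be computed in R with pnorm(), the cumulative distribution function of the normal distribution. The value of z_obs below is made up for illustration.

```r
z_obs <- 1.5  # hypothetical observed test statistic

lower_tail <- pnorm(-abs(z_obs))                    # P(Z <= -|z_obs|)
upper_tail <- pnorm(abs(z_obs), lower.tail = FALSE) # P(Z >= |z_obs|)

p_value <- lower_tail + upper_tail
p_value  # roughly 0.134
```

Because the standard normal distribution is symmetric around zero, the two tails are equal, so the same result can be obtained as 2 * pnorm(-abs(z_obs)).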

  • If the p-value is large: the probability that we observe \(z^{obs}\) or more extreme is large if \(H_0\) is true
  • \(z^{obs}\) is common relative to the distribution of test statistics under the null (if the null hypothesis is true)
  • Our evidence is consistent with \(H_0\) being true
  • Conclusion: We fail to reject \(H_0\) and conclude the average causal effect is not statistically significant (we cannot statistically distinguish \(\beta\) from zero)

  • If the p-value is small: the probability that we observe \(z^{obs}\) or more extreme is small if \(H_0\) is true
  • \(z^{obs}\) is extreme relative to the distribution of test statistics under the null (if the null hypothesis is true)
  • Our evidence is inconsistent with \(H_0\) being true (either \(H_0\) is not true or we got unlucky by drawing a fringe sample)
  • Conclusion: We reject \(H_0\) and conclude the average causal effect is statistically significant (we can statistically distinguish \(\beta\) from zero)

  • How small does the p-value need to be for us to reject the null hypothesis?
    • Social scientists conventionally use one of three significance levels: 10%, 5%, and 1%
    • We will use 5%
  • When p-value \(>\) 0.05: we will fail to reject \(H_0\) and conclude the average treatment effect is not statistically significant at the 5% level (it is not distinguishable from zero at the population level)
  • When p-value \(\leq\) 0.05: we will reject \(H_0\) and conclude the average treatment effect is statistically significant at the 5% level (it is distinguishable from zero at the population level)

  • Note that through this procedure, we NEVER accept the null hypothesis
  • Failing to reject the null hypothesis is not the same as accepting it
  • Just because we have not found evidence against the null hypothesis doesn’t mean that we have proven it to be true
  • On the flip side, however, rejecting the null hypothesis is the same as accepting the alternative hypothesis, although we typically do not express it that way

Shortcut

  • Recall: \(\textrm{\textit{P}}(\textrm{-1.96} \leq Z \leq\textrm{1.96}) \,\, \approx \,\, \textrm{0.95}\)
  • So, the probability that \(Z\) takes a value less than or equal to -1.96 plus the probability that \(Z\) takes a value greater than or equal to 1.96 is approximately 5% (1-0.95=0.05)
  • \(P(Z\leq\textrm{-1.96}) + P(Z\geq\textrm{1.96}) \,\, \approx \,\, \textrm{0.05}\)

  • In short, given the characteristics of \(Z\):
    • When p-value \(>\) 0.05, it means that \(|z^{obs}|<\) 1.96
    • When p-value \(\leq\) 0.05, it means that \(|z^{obs}|\geq\) 1.96
  • We can draw conclusions based on either the value of \(|z^{obs}|\) or the value of p-value
  • Both procedures are mathematically equivalent and lead to the same conclusion
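A quick way to check the shortcut is to compare the two decision rules over a range of test statistics. The grid of values below is hypothetical and chosen only for illustration.

```r
qnorm(0.975)      # critical value of the standard normal, approximately 1.96
2 * pnorm(-1.96)  # two-sided tail probability, approximately 0.05

z_grid <- seq(-4, 4, by = 0.1)     # hypothetical observed test statistics
p_grid <- 2 * pnorm(-abs(z_grid))  # two-sided p-value for each one

reject_by_z <- abs(z_grid) >= 1.96  # rule based on the test statistic
reject_by_p <- p_grid <= 0.05       # rule based on the p-value

all(reject_by_z == reject_by_p)  # TRUE when both rules agree everywhere on the grid
```

Since 1.96 is a rounded critical value, the two rules could in principle disagree for a test statistic that falls between about 1.95996 and 1.96, which rarely matters in practice.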

Hypothesis Testing Formal Procedure

  1. Specify null and alternative hypotheses

    \(\textrm{H}_{0} {:} \,\, \beta{=}{0}\) (meaning: the true value of the average treatment effect at the population level is zero)

    \(\textrm{H}_{1} {:} \,\, \beta{\neq}{0}\) (meaning: the true value of the average treatment effect at the population level is either positive or negative)

  2. Compute observed value of the test statistic and (perhaps also) the associated p-value

\[ z^{obs} = \frac{\widehat{\beta}}{\textrm{standard error of }\widehat{\beta}} \]

\[ \textrm{p-value} = 2 \times P(Z \leq -|z^{obs}|) \]

  3. Conclude

    If \(|z^{obs}|<1.96\) or \(\textrm{p-value} > 0.05\), fail to reject \(H_0\) and conclude the average treatment effect is not statistically significant at the 5% level

    If \(|z^{obs}| \geq 1.96\) or \(\textrm{p-value} \leq 0.05\), reject \(H_0\) and conclude the average treatment effect is statistically significant at the 5% level
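The procedure can be sketched as a short R function. The function name test_coefficient, its arguments, and the input values are hypothetical and used only for illustration; this is one possible way to write it, not a function provided by R or by the book.

```r
# Hypothetical helper: takes an estimated coefficient and its standard error
test_coefficient <- function(beta_hat, se, level = 0.05) {
  z_obs <- beta_hat / se             # step 2: observed test statistic
  p_value <- 2 * pnorm(-abs(z_obs))  # step 2: two-sided p-value
  decision <- if (p_value <= level) "reject H0" else "fail to reject H0"  # step 3
  list(z_obs = z_obs, p_value = p_value, decision = decision)
}

result <- test_coefficient(beta_hat = 3, se = 2)  # made-up inputs
result$z_obs     # 1.5
result$decision  # "fail to reject H0"
```

With these made-up inputs the test statistic is below 1.96 in absolute value, so the function reports a failure to reject the null hypothesis.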

The importance of replication

  • When an effect is statistically significant at the 5% level, do we know for sure that the true value of the average causal effect at the population level is not zero?
    • No, we do not
  • A p-value of 0.05 does not rule out the possibility that the average causal effect is zero
  • Thanks to the CLT, we know that if the null hypothesis is true, in 5% of the samples drawn from the target population, we will wrongly reject the null when using a significance level of 5%

  • It is important to replicate social scientific studies to confirm that we arrive at similar conclusions when analysing a different sample from the same target population
  • While the probability of falsely rejecting the null hypothesis in any one sample is 5%, the probability of falsely rejecting the null twice in a row, when analysing two independent samples of data drawn from the same target population, is only 0.25%
  • Let’s return to the STAR dataset and estimate the average causal effect of attending a small class on maths test scores (in the book, we estimated the effect on reading test scores)

Do Small Classes Improve Maths Scores?

(Based on Frederick Mosteller. 1995. “The Tennessee Study of Class Size in the Early School Grades.” Future of Children, 5 (2): 113–27.)

  • The data come from a randomized experiment conducted in Tennessee, where students were randomly assigned to attend either a small class or a regular-size class from kindergarten until 3rd grade
  • To estimate the average causal effect of attending a small class on math test scores, what estimator can we use? The difference-in-means estimator
  • Although we could compute it directly, let’s compute it by fitting a linear model so that \(\widehat{\beta}\) is equivalent to the difference-in-means estimator
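The equivalence between the difference-in-means estimator and the estimated slope coefficient can be checked on a small example. The sketch below uses simulated data, not the STAR dataset; the variable names treated and score, the seed, and the numbers used to generate the data are all made up.

```r
set.seed(1)  # arbitrary seed for reproducibility

treated <- rep(c(0, 1), each = 50)                # hypothetical binary treatment
score <- 600 + 5 * treated + rnorm(100, sd = 30)  # hypothetical outcome

# difference-in-means estimator computed directly
diff_in_means <- mean(score[treated == 1]) - mean(score[treated == 0])

# estimated slope coefficient from the fitted linear model
slope <- unname(coef(lm(score ~ treated))["treated"])

all.equal(diff_in_means, slope)  # TRUE: the two estimates coincide
```

The equality holds whenever X is a binary variable coded as 0 and 1.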

0. Get Ready for the Analysis

  • Open a new R file and save it in your working directory (POL269 folder)
  • Load the data and create any variables needed
## set the working directory to the POL269 folder (run only the line for your system)
setwd("~/Desktop/pol269") # if Mac
setwd("C:/user/Desktop/pol269") # if Windows
## load and look at the data
star <- read.csv("STAR.csv") # reads and stores data
library(tidyverse)

## create treatment variable
star <- star |> dplyr::mutate(small = case_when(classtype == "small" ~ 1, TRUE ~ 0))

  • Look at the data
star |> head()
  classtype reading math graduated small
1     small     578  610         1     1
2   regular     612  612         1     0
3   regular     583  606         1     0
4     small     661  648         1     1
5     small     614  636         1     1
6   regular     610  603         0     0
  • The treatment variable is small
  • The outcome variable is math
  • Is the outcome variable binary? And, if not, what’s its unit of measurement? (this affects the interpretation of \(\widehat{\beta}\))
    • no, Y is not binary and it is measured in points

1. What Is the Estimated Average Treatment Effect?

  • Fit a linear model so that the estimated slope coefficient is equivalent to the difference-in-means estimator. In this case, the fitted line is: \(\widehat{math} = \widehat{\alpha} + \widehat{\beta} small\)

  • Store the fitted model in an object called fit and then ask R to provide the contents of fit

  • R code to fit and store linear model?

fit <- lm(star$math ~ star$small) # fits linear model

  • R code to ask R to provide contents of fit?
fit # shows contents of object

Call:
lm(formula = star$math ~ star$small)

Coefficients:
(Intercept)   star$small  
     628.84         5.99  
  • \(\widehat{\beta}\) = ?

  • \(\widehat{\beta}\) = 5.99

  • Direction, size, and unit of measurement of the effect?

    • An increase of about 6 (or 5.99) points
  • Why?

    • increase because we are measuring a change in \(Y\) and \(\widehat{\beta}\) (which is equivalent to the difference-in-means estimator) is positive
    • points because math is non-binary and measured in points
    • about 6 (or 5.99) because we do not need to transform the result since math is non-binary

  • What’s the estimated average treatment effect?

    • This is the conclusion statement we learned to write back in Chapter 2 of the book

      • See Tip on page 43
    • Ideally you want to mention all the key elements of the analysis: the assumption, why the assumption is reasonable, the treatment, the outcome, as well as the direction, size, and unit of measurement of the average treatment effect

Conclusion Statement

Assuming that the treatment and control groups are comparable (a reasonable assumption because [why the assumption is reasonable]), we estimate that [the treatment] [increases/decreases] [the outcome] by [size and unit of measurement of the effect], on average.

  • Answer: Assuming that students who attended a small class were comparable to students who attended a regular-size class (a reasonable assumption because the data come from a randomized experiment), we estimate that attending a small class increases math test scores by about 6 points, on average

2. Is the Effect Statistically Significant?

  • That is, is the average treatment effect distinguishable from zero at the population level, statistically speaking?
    • To answer, we need to do hypothesis testing
  1. Specify null and alternative hypotheses

    • \(H_0 {:} \,\, \beta=0\) (meaning: attending a small class has no average causal effect on math test scores at the population level)
    • \(H_1 {:} \,\, \beta\neq0\) (meaning: attending a small class either increases or decreases math test scores, on average, at the population level)

  2. Compute observed test statistic and associated p-value
  • We ask R to compute both for us by running summary() where we specify inside the parentheses the name of the object where we stored the output of the lm() function

    summary(fit) # shows summary of fitted model
    ## 
    ## Call:
    ## lm(formula = star$math ~ star$small)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -119.827  -27.585   -0.827   26.163  145.163 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  628.837      1.476  426.09  < 2e-16 ***
    ## star$small     5.990      2.178    2.75  0.00604 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 38.74 on 1272 degrees of freedom
    ## Multiple R-squared:  0.005911,   Adjusted R-squared:  0.00513 
    ## F-statistic: 7.564 on 1 and 1272 DF,  p-value: 0.006039
  • The observed test statistic is ____

  • The associated p-value is ____
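Instead of reading the numbers off the printed table, the coefficient table can also be accessed as a matrix with coef(summary()). The sketch below uses simulated data rather than the STAR dataset; the names treated, score, and sim_fit, and the numbers used to generate the data, are made up for illustration.

```r
set.seed(2)  # arbitrary seed for reproducibility

treated <- rep(c(0, 1), each = 100)               # hypothetical binary treatment
score <- 600 + 5 * treated + rnorm(200, sd = 30)  # hypothetical outcome
sim_fit <- lm(score ~ treated)                    # fits linear model

coef_table <- coef(summary(sim_fit))  # matrix with one row per coefficient
coef_table

t_obs <- coef_table["treated", "t value"]   # observed test statistic
p_val <- coef_table["treated", "Pr(>|t|)"]  # associated p-value

# the test statistic is the estimate divided by its standard error
all.equal(t_obs, coef_table["treated", "Estimate"] / coef_table["treated", "Std. Error"])
```

The row of interest is the one named after the treatment variable, not the (Intercept) row.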

As explained in the book, R does not assume that the sample size is large enough to use the CLT and, as a result, the distribution of the test statistic under the null hypothesis is the t-distribution (not the standard normal distribution)

  • The name of the test statistic is the t-statistic (instead of the z-statistic), which is why R refers to the observed value of the t-statistic as the t-value
  • The associated p-values are slightly larger than the ones we would arrive at using the standard normal distribution; they lead to somewhat more conservative inferences
  • As long as the sample is not very small, however, the difference is typically negligible, so, in this class, we will rely on the p-values provided by R
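To see how small the difference is here, the p-value can be computed under both distributions using the estimate, standard error, and degrees of freedom reported in the summary() output above. This is a rough check based on the rounded numbers from that output, so the last digits may differ slightly from what R prints.

```r
t_obs <- 5.990 / 2.178  # Estimate / Std. Error from the output, about 2.75
df <- 1272              # residual degrees of freedom from the output

p_t <- 2 * pt(-abs(t_obs), df = df)  # two-sided p-value using the t-distribution
p_z <- 2 * pnorm(-abs(t_obs))        # two-sided p-value using the standard normal

c(t = p_t, normal = p_z)  # both are approximately 0.006
p_t > p_z                 # TRUE: the t-based p-value is slightly larger
```

Both p-values are well below 0.05, so the conclusion is the same under either distribution.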

  3. Do we reject or fail to reject the null hypothesis? We reject the null hypothesis because …

    • option A: absolute value of the observed test statistic is greater than 1.96 (|2.75| \({>}\) 1.96)
    • option B: p-value is smaller than 0.05 (0.006 \({<}\) 0.05)
  • Is the effect statistically significant at the 5% level?
  • Answer: Yes, the effect is statistically significant at the 5% level. Attending a small class is likely to have a non-zero average causal effect on maths test scores, at the population level.

Today’s class

  • Hypothesis Testing with Estimated Regression Coefficients
  • Example with Non-Binary Outcome: Do Small Classes Improve Maths Scores?
    • What is the estimated average treatment effect?
    • Is the effect statistically significant?

Next class

  • Example with Binary Outcome: Do Small Classes Increase the Probability of Graduating?
    • What is the estimated average treatment effect?
    • Is the effect statistically significant?
    • Can we interpret the estimated average treatment effect as a causal effect?
    • Can we generalise the results to the population level?