Week 6: Binary Outcomes and Observational Data

POL269 Political Data Research

Javier Sajuria

2022-02-26

Plan for today

  • Prediction and Linear Regression

  • Example with Binary Outcome Variable: Using Midterm Scores to Predict Probability of Earning a Distinction

  • Review: Causation and Randomized Experiments

  • Observational Studies

  • Confounding Variables or Confounders

Using Midterm Scores to Predict Probability of Earning a Distinction

  1. Load and explore data
  X.1 X Assignment.1 Take.Home.Exam Course.total distinction
1   1 1           75             86         80.5         yes
2   2 2           75             86         80.5         yes
3   3 3           74             86         80.0         yes
4   4 4           55             86         70.5         yes
  • what’s the unit of observation?

  • for each variable: type and unit of measurement?

  • substantively interpret the first observation

  2. Identify X and Y
  • The predictor (X) is the variable we want to use to predict the outcome (Y)
    • in this case, the predictor is Assignment.1
  • let’s visualize the distribution of Assignment.1

  • The outcome (Y) is the variable that we want to predict
    • in this case, the outcome variable is distinction
    • let’s visualize the distribution of distinction

  • What type of variable is distinction?
    • non-numeric, but can be recoded as a numeric binary using case_when()
  • How would you compute the proportion of students who received a distinction?
    • by computing the mean of distinction
    • since distinction is a binary variable, its mean should be interpreted as the proportion of the observations that have the characteristic identified by the variable

  • Code to compute the mean of distinction
    • Answer:
data <- data |> mutate(distinction = case_when(distinction == "yes" ~ 1,
                                               TRUE ~ 0))
data |> 
  summarise(mean = mean(distinction))
   mean
1 0.375
  • Interpretation?
    • 37.5% of the students earned a 70 or above in the class
    • RECALL: you need to multiply the output by 100

  • Since Y is binary
    • unit of measurement of \(\overline{Y}\)?
      • % (after x 100)
    • unit of measurement of \(\widehat{Y}\)?
      • % (after x 100)
    • unit of measurement of \(\widehat{\alpha}\)?
      • % (after x 100)
  • unit of measurement of \(\triangle\overline{Y}\)?
    • p.p. (after x 100)
  • unit of measurement of \(\triangle\widehat{Y}\)?
    • p.p. (after x 100)
  • unit of measurement of \(\widehat{\beta}\)?
    • p.p. (after x 100)

  3. What is the relationship between X and Y?
  • Create scatter plot to visualize the relationship between Assignment.1 and distinction
data |>
  ggplot(aes(Assignment.1, distinction)) + geom_point()

  • what does each dot represent?
  • does the relationship look positive or negative?
  • does the relationship look weakly or strongly linear?

  • Calculate correlation to measure direction and strength of linear association between Assignment.1 and distinction
cor(data$Assignment.1, data$distinction)
[1] 0.6962523
  • we find a moderately strong positive correlation
  • are we surprised by this? No, because in the scatter plot above we observed a positive, moderately strong linear relationship

  4. Fit a linear model using the least squares method
  • R function to fit a linear model: lm()
    • required argument: a formula of the type Y \(\sim\) X
lm(distinction ~ Assignment.1, data = data)

Call:
lm(formula = distinction ~ Assignment.1, data = data)

Coefficients:
 (Intercept)  Assignment.1  
    -1.16031       0.02449  
  • \(\widehat{\alpha}\) = -1.16 and \(\widehat{\beta}\) = 0.02
  • The fitted line is \(\widehat{Y}\) = -1.16 + 0.02 \(X\)
  • More specifically, it is \(\widehat{\textrm{distinction}}\) = -1.16 + 0.02 Assignment.1

data |> ggplot(aes(Assignment.1, distinction)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_ipsum_rc() # theme_ipsum_rc() comes from the hrbrthemes package

\[ \widehat{\textrm{distinction}} = \textrm{-1.16} + \textrm{0.02} \, \textrm{Assignment.1} \]

  5. Interpretation of Coefficients:

substantive interpretation of \(\widehat{\alpha}\)?

  • start with mathematical definition:
    • \(\widehat{\alpha}\) is the \(\widehat{Y}\) when X=0
  • substitute X, Y, and \(\widehat{\alpha}\):
    • \(\widehat{\alpha}\) = -1.16 is the \(\widehat{\textrm{distinction}}\) when Assignment.1 = 0

  • put it in words (using units of measurement):
    • when a student scores 0 points in the midterm, we predict that their probability of earning a distinction in the class is -116%, on average
  • nonsensical (due to extrapolation)
  • unit of measurement of \(\widehat{\alpha}\)?
    • same as \(\overline{Y}\); here, Y is binary so \(\overline{Y}\) and \(\widehat{\alpha}\) are measured in % (after x 100)

  5. Interpretation of Coefficients (continued):
  • substantive interpretation of \(\widehat{\beta}\)?
    • start with mathematical definition:
      • \(\widehat{\beta}\) is the \(\triangle \widehat{Y}\) associated with \(\triangle X\)=1
    • substitute X, Y, and \(\widehat{\beta}\):
      • \(\widehat{\beta}\) =0.02 is the \(\triangle \widehat{\textrm{distinction}}\) associated with \(\triangle\)Assignment.1=1

  • put it in words (using units of measurement):
    • an increase in midterm scores of 1 point is associated with a predicted increase in the probability of earning a distinction in the class of 2 percentage points, on average
  • unit of measurement of \(\widehat{\beta}\)?
    • same as \(\triangle \overline{Y}\); here, Y is binary so \(\triangle \overline{Y}\) and \(\widehat{\beta}\) are measured in p.p. (after x 100)

THE FITTED LINE

\[ \widehat{Y} = \widehat{\alpha} + \widehat{\beta} X \]

  • \(\widehat{\alpha}\) (alpha-hat) is the estimated intercept coefficient: the \(\widehat{Y}\) when \(X{=}\textrm{0}\) (in the same unit of measurement as \(\overline Y\))

  • \(\widehat{\beta}\) (beta-hat) is the estimated slope coefficient: the \(\triangle \widehat{Y}\) associated with \(\triangle X{=}\textrm{1}\) (in the same unit of measurement as \(\triangle\overline Y\))

Using the Fitted Line to Make Predictions

  • To predict \(\widehat{Y}\) based on X: \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\)
  • To predict \(\triangle\widehat{Y}\) based on \(\triangle\)X: \(\triangle\widehat{Y} = \widehat{\beta} \triangle X\)

  6. Make predictions

To predict \(\widehat{Y}\) based on \(X\): \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\)

  • Example 1: Imagine you earn 80 points in the midterm, what would we predict your probability of earning a distinction in the class will be?

\[\begin{eqnarray*} \widehat{\textrm{distinction}} &=& \textrm{-1.16} + \textrm{0.02} \, \textrm{Assignment.1}\\ \widehat{\textrm{distinction}} &=& \textrm{-1.16} + \textrm{0.02} \times \textrm{80} \textrm{ (if Assignment.1 = 80)} \\ \widehat{\textrm{distinction}} &=& \textrm{0.44} \end{eqnarray*}\]

  • If you earn 80 points in the midterm, we would predict that your probability of earning a distinction in the class is 44%, on average
  • Note: since Y is binary, \(\widehat{Y}\) is measured in % (after x 100)

  • Example 2: Imagine you earn 90 points in the midterm, what would we predict your probability of earning a distinction in the class will be?
\[\begin{eqnarray*} \widehat{\textrm{distinction}} &=& \textrm{-1.16} + \textrm{0.02} \, \textrm{Assignment.1}\\ \widehat{\textrm{distinction}} &=& \textrm{-1.16} + \textrm{0.02} \times \textrm{90} \textrm{ (if Assignment.1 = 90)}\\ \widehat{\textrm{distinction}} &=& \textrm{0.64} \end{eqnarray*}\]
  • If you earn 90 points in the midterm, we would predict that your probability of earning a distinction in the class is 64%, on average

To predict \(\triangle \widehat{Y}\) associated with \(\triangle X\): \(\triangle \widehat{Y} = \widehat{\beta} \, \triangle \text{X}\)

  • Example 3: What is the predicted change in the probability of earning a distinction in the class associated with an increase in midterm scores of 10 points?
\[\begin{eqnarray*} \triangle\widehat{\textrm{distinction}} &=& \textrm{0.02} \, \triangle\textrm{Assignment.1}\\ \triangle \widehat{\textrm{distinction}} &=& \textrm{0.02} \times \textrm{10} \textrm{ (if $\triangle$Assignment.1 = 10)} \\ \triangle \widehat{\textrm{distinction}} &=& \textrm{0.2} \end{eqnarray*}\]
  • An increase in midterm scores of 10 points is associated with a predicted increase in the probability of earning a distinction in the class of 20 percentage points, on average

  • Note: since Y is binary, \(\triangle\widehat{Y}\) is in p.p. (after x 100)
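
Both kinds of prediction can also be computed in R with predict() and coef(). Here is a minimal sketch using a small made-up data set (the toy scores below are invented for illustration, so the fitted coefficients will differ from the slides' -1.16 and 0.02):

```r
# Toy data invented for illustration (NOT the course data)
toy <- data.frame(
  Assignment.1 = c(55, 60, 65, 70, 74, 75, 80, 90),
  distinction  = c(0,  0,  0,  1,  1,  1,  1,  1)
)

fit <- lm(distinction ~ Assignment.1, data = toy)

# Y-hat for a given X: predict() applies alpha-hat + beta-hat * X
y_hat_80 <- predict(fit, newdata = data.frame(Assignment.1 = 80))

# Delta Y-hat for Delta X = 10: just beta-hat * 10
delta_y_hat <- unname(coef(fit)["Assignment.1"]) * 10
```

Multiplying `y_hat_80` and `delta_y_hat` by 100 converts them to % and percentage points, respectively, as in the slides.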

  7. Measure how well the model fits the data with \(\textrm{R}^2\)
  • How good is the model at making predictions? How well does the model fit the data?
  • One way of answering is by calculating \(R^2\)

\(R^2\) measures the proportion of the variation in the outcome variable explained by the model

  • It ranges from 0 to 1

  • The higher the \(R^2\), the better the model fits the data

  • In the simple linear model: \(R^2 = \textrm{cor}(X,Y)^2\)

  • When cor(X,Y) = 1 or cor(X,Y) = -1, the relationship between X and Y is perfectly linear
  • \(R^2\) = cor(X,Y)\(^2\) = 1, the model explains 100% of the variation of Y
  • All prediction errors (vertical distance between the dots and the line) = 0

  • When cor(X,Y) = 0, there is no linear association between X and Y
  • \(R^2\)=cor(X,Y)\(^2\)= 0, the model explains 0% of the variation of Y
  • The prediction errors (vertical distance between the dots and the line) are very large

  • Let’s compute \(R^2\)
    cor(data$distinction, data$Assignment.1)^2
## [1] 0.4847673
  • Interpretation?
    • It means that the linear model explains 48% of the variation of the outcome variable (distinction)
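
The same number can be read off the fitted model object: summary() reports `r.squared`, which in a simple linear model equals cor(X,Y)\(^2\). A quick sketch with made-up data:

```r
# Toy data invented for illustration (NOT the course data)
toy <- data.frame(x = c(55, 60, 65, 70, 75, 80, 85, 90),
                  y = c(0, 0, 0, 1, 0, 1, 1, 1))

r2_from_cor   <- cor(toy$x, toy$y)^2                   # squared correlation
r2_from_model <- summary(lm(y ~ x, data = toy))$r.squared

# in the one-predictor case the two quantities are identical
```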

Let’s return to the predictive model from last lecture:


  • R code to compute \(R^2\)?
    • Answer:
cor(data$Assignment.1, data$Course.total)^2
## [1] 0.748613
  • Interpretation?
    • It does NOT mean that the model is right 75% of the time; It means that the linear model explains 75% of the variation of the outcome variable (Course.total)
  • Warning: only compare \(R^2\) between models with the same outcome variable (\(Y\)); some variables are intrinsically harder to predict than others

Predicting Outcomes Using Linear Models:

We look for \(X\) variables that are highly correlated with \(Y\) because the higher the correlation between \(X\) and \(Y\) (in absolute terms), the higher the \(R^2\) and the better the fitted linear model will usually be at predicting \(Y\) using \(X\).

Why do we analyse data?

MEASURE: To infer population characteristics via survey research

  • what proportion of constituents support a particular policy?

PREDICT: To make predictions

  • who is the most likely candidate to win an upcoming election?

EXPLAIN: To estimate the causal effect of a treatment on an outcome

  • what is the effect of small classrooms on student performance?

Book Outline

Chapter                                                       Goal
2. Estimating Causal Effects with Randomized Experiments      EXPLAIN
3. Inferring Population Characteristics via Survey Research   MEASURE
4. Predicting Outcomes Using Linear Regression                PREDICT
5. Estimating Causal Effects with Observational Data          EXPLAIN

Review: Causation

  • To measure causal effects, we need to compare the factual outcome with the counterfactual outcome
    • Fundamental problem: We can never observe the counterfactual outcome
  • To estimate causal effects, we must find or create a situation in which the treatment and control groups are comparable with respect to all the variables that might affect the outcome other than the treatment variable itself
  • Only when that assumption is satisfied can we use the factual outcome of one group as a good proxy for the counterfactual outcome of the other, and vice versa, thus, bypassing the fundamental problem of causal inference

Review: Randomized Experiments

  • In randomized experiments, we can rely on the random assignment of treatment to make treatment and control groups, on average, identical to each other in terms of all observed and unobserved pre-treatment characteristics
  • Thus, we can estimate the average treatment effect with the difference-in-means estimator

\[\overline{Y}_\text{treatment group} - \overline{Y}_\text{control group}\]

Observational Data

  • But, what happens when we cannot conduct a randomized experiment and have to analyse observational data?
    • Observational data: data collected about naturally occurring events (i.e., researchers do not get to assign the treatment)
  • We can no longer assume that treatment and control groups are comparable
  • We need to identify and measure any relevant differences between treatment and control groups (known as confounding variables or confounders)
  • Then, we will need to statistically control for them so that we can make the two groups comparable after statistical controls are applied

Confounders or Confounding Variables

  • A confounding variable is a variable that affects both
    • i. the likelihood of receiving the treatment \(X\), and
    • ii. the outcome \(Y\)
  • In mathematical notation, we represent a confounding variable as \(Z\)

  • Let’s look at a simple example. Suppose we are interested in the average causal effect of attending a private school, as opposed to a public one, on students’ test performance

    • What is the treatment variable \(X\)?
    • What is the outcome variable \(Y\)?
    • Can you think of a confounder \(Z\)?

Why Are Confounders a Problem?

  • They obscure the causal relationship between \(X\) and \(Y\)

  • In the example above, if we observed that, on average, private school students perform better than public school students, we would not know whether it is

    • because they attended a private school or
    • because they came from wealthier families that could afford to provide them with after-school help
  • We would not know what portion of the observed differences in test score performance (the difference-in-means estimator), if any, could be attributed to the causal effect of the treatment (attending a private school) and what portion could be attributed to the confounding variable (coming from a wealthy family)

  • In the presence of confounders, correlation does not necessarily imply causation

  • Just because we observe two variables highly correlated with each other—when we observe one increase, we usually observe the other increase or decrease—it does not automatically mean that one causes the other

    • There could be a third variable that causes both
  • For example, ice cream sales and shark attacks are highly correlated with each other. Does this mean that eating ice cream increases the probability that a shark attacks you?
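
The ice cream/shark example can be simulated. In this sketch, temperature plays the role of the confounder \(Z\): it drives both variables, neither of which causes the other, yet the two end up strongly correlated (all numbers below are invented for illustration):

```r
set.seed(42)  # for reproducibility

temperature <- rnorm(500, mean = 20, sd = 5)         # the confounder Z
ice_cream   <- 10 + 2.0 * temperature + rnorm(500)   # X depends only on Z
sharks      <-  1 + 0.3 * temperature + rnorm(500)   # Y depends only on Z

cor(ice_cream, sharks)  # strongly positive despite no causal link
```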

IN THE PRESENCE OF CONFOUNDERS

  • correlation does NOT necessarily imply causation

  • the treatment and control groups are NOT comparable

  • the difference-in-means estimator does NOT provide a valid estimate of the average treatment effect

Why Don’t We Worry About Confounders in Randomized Experiments?

  • Randomization of treatment assignment eliminates all potential confounders
  • It ensures that treatment and control groups are comparable by breaking the link between any potential confounder and the treatment
  • If we assign who attends a private school at random, we ensure that nothing related to the outcome is also related to the likelihood of receiving the treatment

How Can We Estimate Causal Effects with Observational Data?

  • We cannot rely on random treatment assignment to eliminate potential confounders
  • We need to identify and measure all confounding variables and statistically control for them
  • Before we learn how to do that, we should learn how to fit a simple linear regression model to produce an estimated coefficient equivalent to the difference-in-means estimator
  • Let’s quickly review how we fit a line and interpret the estimated coefficients

  • scatter plot where every dot is an observation

  • first observation: \((X_1, Y_1)\)

  • if we summarise the relationship between X and Y with a line

  • we can use the fitted line to compute \(\widehat{Y}\) for every value of \(X\)

  • prediction errors = vertical distances between dots and line

  • we choose the line with the smallest possible errors

  • the fitted line: \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta}X\)

  • estimated intercept (\(\widehat{\alpha}\)): \(\widehat{Y}\) when \(X{=}0\)

  • estimated slope (\(\widehat{\beta}\)): \(\triangle\widehat{Y}\) associated with \(\triangle X{=}1\)
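
For the simple linear model, the least-squares line has a closed form: \(\widehat{\beta} = \textrm{cov}(X,Y)/\textrm{var}(X)\) and \(\widehat{\alpha} = \overline{Y} - \widehat{\beta}\,\overline{X}\). A quick check against lm() with made-up numbers:

```r
# Toy data invented for illustration
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

beta_hat  <- cov(toy$x, toy$y) / var(toy$x)       # estimated slope
alpha_hat <- mean(toy$y) - beta_hat * mean(toy$x) # estimated intercept

fit <- lm(y ~ x, data = toy)  # lm() recovers the same fitted line
```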

Using the Simple Linear Model to Compute the Difference-in-Means Estimator

When X is the treatment variable and Y is the outcome variable of interest, the estimated slope coefficient (\(\widehat{\beta}\)) is equivalent to the difference-in-means estimator.

  • Let’s examine this …

  • Mathematical definition of \(\widehat{\beta}\): \(\triangle \widehat{Y}\) associated with \(\triangle X{=}\textrm{1}\)
\[\begin{eqnarray*} \widehat{\beta} &=& \triangle \widehat{Y} \textrm{(if } \triangle X{=}\textrm{1}\textrm{)} \\ &=& \widehat{Y}_{\textrm{final}}{-}\widehat{Y}_{\textrm{initial}} \textrm{(if } \triangle X{=}\textrm{1}\textrm{)} \end{eqnarray*}\]
  • If \(X\) is the treatment variable:

    • \(\triangle X{=}\textrm{1}\) is equivalent to changing from the control group (\(X{=}\textrm{0}\)) to the treatment group (\(X{=}\textrm{1}\))
    • the control group is the initial state, and the treatment group is the final state

\[\widehat{\beta}=\widehat{Y}_{\textrm{treatment group}}{-}\widehat{Y}_{\textrm{control group}}\]

  • Recall: \(\widehat{Y}\) are predicted values. In this case: \(\widehat{Y}_{\textrm{treatment group}} = \overline{Y}_{\textrm{treatment group}}\) and \(\widehat{Y}_{\textrm{control group}} = \overline{Y}_{\textrm{control group}}\)

\[ \widehat{\beta} = \overline{Y}_{\textrm{treatment group}} - \overline{Y}_{\textrm{control group}} \]

  • Conclusion: When X is the treatment variable and Y is the outcome variable of interest, the estimated slope coefficient (\(\widehat{\beta}\)) is equivalent to the difference-in-means estimator

  • Let’s return to the exercise from the experiment: Does Social Pressure Affect Turnout?

  • We answer it by analysing data from a randomized experiment where registered voters were randomly assigned to either (a) receive a message designed to induce social pressure, or (b) receive nothing

Does Social Pressure Affect Turnout?

  1. Load and look at the data
    voting <- read.csv("voting.csv") # loads and stores data
    head(voting) # shows first six observations
##   birth message voted
## 1  1981      no     0
## 2  1959      no     1
## 3  1956      no     1
## 4  1939     yes     1
## 5  1968      no     0
## 6  1967      no     0
  2. Create treatment variable
voting <- voting |> 
  mutate(pressure = case_when(
    message == "yes" ~ 1,
    message == "no" ~ 0
  )) 

  • Make sure the new variable was created correctly by looking at the first few observations again:
    head(voting) # shows first six observations
##   birth message voted pressure
## 1  1981      no     0        0
## 2  1959      no     1        0
## 3  1956      no     1        0
## 4  1939     yes     1        1
## 5  1968      no     0        0
## 6  1967      no     0        0

  3. Compute difference-in-means estimator directly
    mean(voting$voted[voting$pressure==1]) -
        mean(voting$voted[voting$pressure==0]) 
## [1] 0.08130991
  4. Alternatively, we can fit a linear model where X is the treatment variable and Y is the outcome variable

  • Recall: the R function to fit a linear model is lm()
    • required argument: a formula of the type Y \(\sim\) X
    lm(voted ~ pressure, data=voting) 

Call:
lm(formula = voted ~ pressure, data = voting)

Coefficients:
(Intercept)     pressure  
    0.29664      0.08131  
  • Fitted model: \(\widehat{\textrm{voted}}\) = 0.30 + 0.08 pressure
  • Note that \(\widehat{\beta}\) has the same value as the difference-in-means estimator above (both equal 0.08)
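
This equivalence holds for any data set with a binary treatment. A minimal sketch with made-up data (voting.csv itself is not reproduced here):

```r
# Toy data invented for illustration (NOT the voting.csv data)
toy <- data.frame(
  pressure = c(1, 1, 1, 1, 0, 0, 0, 0),
  voted    = c(1, 1, 0, 1, 1, 0, 0, 0)
)

# difference-in-means estimator, computed directly
diff_in_means <- mean(toy$voted[toy$pressure == 1]) -
                 mean(toy$voted[toy$pressure == 0])

# estimated slope from the simple linear model
beta_hat <- unname(coef(lm(voted ~ pressure, data = toy))["pressure"])

# beta_hat equals diff_in_means (here both are 0.75 - 0.25 = 0.5)
```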

Interpretation of \(\widehat{\beta}\) When X Is the Treatment Variable and Y Is the Outcome Variable

  • Start same as in predictive models

    • definition: \(\widehat{\beta}\) is the \(\triangle \widehat{Y}\) associated with \(\triangle X\)=1
    • here: \(\widehat{\beta}\) = 0.08 is the \(\triangle \widehat{\textrm{voted}}\) associated with \(\triangle\)pressure=1
    • in words: receiving the message inducing social pressure (i.e., an increase in pressure of 1 by going from pressure=0 to pressure=1) is associated with a predicted increase in the probability of voting of 8 percentage points, on average

  • unit of measurement of \(\widehat{\beta}\)? same as \(\triangle \overline{Y}\); here, Y is binary so \(\triangle \overline{Y}\) is measured in p.p. and so is \(\widehat{\beta}\) (after x 100)

  • Now, since here X is the treatment variable and Y is the outcome variable of interest, \(\widehat{\beta}\) is equivalent to the difference-in-means estimator

  • As a result, we can interpret \(\widehat{\beta}\) using causal language

  • Predictive language: We estimate that receiving the message inducing social pressure is associated with a predicted increase in the probability of voting of 8 percentage points, on average

  • Causal language: We estimate that receiving the message inducing social pressure increases the probability of voting by 8 percentage points, on average

  • This should be a valid estimate of the average treatment effect if there are no confounding variables present

    • if registered voters who received the message are comparable to the registered voters who did not
  • Since the data come from a randomized experiment there should be no confounding variables

  • And thus the difference-in-means estimator should produce a valid estimate of the average treatment effect

  • Whether we compute the difference-in-means estimator directly or we fit a simple linear model where Y is the outcome variable and X is the treatment variable, we arrive at the same conclusion

  • Conclusion: We estimate that receiving the message inducing social pressure increases the probability of voting by 8 percentage points, on average. This is a valid estimate of the average treatment effect if registered voters who received the message are comparable to the registered voters who did not (that is, if there are no confounding variables). Given that the data come from a randomized experiment, this is a reasonable assumption.

Interpretation of the Estimated Slope Coefficient in the Simple Linear Model

  • By default, we interpret \(\widehat{\beta}\) using predictive language: It is the \(\triangle \widehat{Y}\) associated with \(\triangle X{=}\textrm{1}\).

  • When \(X\) is the treatment variable, then \(\widehat{\beta}\) is equivalent to the difference-in-means estimator and, thus, we interpret \(\widehat{\beta}\) using causal language: It is the \(\triangle \widehat{Y}\) caused by \(\triangle X{=}\textrm{1}\) (the presence of the treatment). This causal interpretation is valid if there are no confounding variables present and, thus, the treatment and control groups are comparable.