Week 6: Binary Outcomes and Observational Data

POL269 Political Data Research

Javier Sajuria

2022-02-26

Plan for today

  • Prediction and Linear Regression

  • Example with Binary Outcome Variable: Using Midterm Scores to Predict Probability of Earning a Distinction

  • Review: Causation and Randomized Experiments

  • Observational Studies

  • Confounding Variables or Confounders

Using Midterm Scores to Predict Probability of Earning a Distinction

  1. Load and explore data
  X.1 X Assignment.1 Take.Home.Exam Course.total distinction
1   1 1           75             86         80.5         yes
2   2 2           75             86         80.5         yes
3   3 3           74             86         80.0         yes
4   4 4           55             86         70.5         yes
  • what’s the unit of observation?

  • for each variable: type and unit of measurement?

  • substantively interpret the first observation

  2. Identify X and Y
  • The predictor (X) is the variable we want to use to predict the outcome (Y)
    • in this case, the predictor is Assignment.1
  • let’s visualize the distribution of Assignment.1

  • The outcome (Y) is the variable that we want to predict
    • in this case, the outcome variable is distinction
    • let’s visualize the distribution of distinction

  • What type of variable is distinction?
    • non-numeric, but can be recoded as a numeric binary using case_when()
  • How would you compute the proportion of students who received a distinction?
    • by computing the mean of distinction
    • since distinction is a binary variable, its mean should be interpreted as the proportion of the observations that have the characteristic identified by the variable

  • Code to compute the mean of distinction
    • Answer:
data <- data |> mutate(distinction = case_when(distinction == "yes" ~ 1,
                                               TRUE ~ 0))
data |> 
  summarise(mean = mean(distinction))
   mean
1 0.375
  • Interpretation?
    • 37.5% of the students earned a 70 or above in the class
    • RECALL: you need to multiply the output by 100

  • Since Y is binary
    • unit of measurement of \(\overline{Y}\)?
      • % (after x 100)
    • unit of measurement of \(\widehat{Y}\)?
      • % (after x 100)
    • unit of measurement of \(\widehat{\alpha}\)?
      • % (after x 100)
  • unit of measurement of \(\triangle\overline{Y}\)?
    • p.p. (after x 100)
  • unit of measurement of \(\triangle\widehat{Y}\)?
    • p.p. (after x 100)
  • unit of measurement of \(\widehat{\beta}\)?
    • p.p. (after x 100)

  3. What is the relationship between X and Y?
  • Create scatter plot to visualize the relationship between Assignment.1 and distinction
data |>
  ggplot(aes(Assignment.1, distinction)) + geom_point()

  • what does each dot represent?
  • does the relationship look positive or negative?
  • does the relationship look weakly or strongly linear?

  • Calculate correlation to measure direction and strength of linear association between Assignment.1 and distinction
cor(data$Assignment.1, data$distinction)
[1] 0.6962523
  • we find a moderately strong positive correlation
  • are we surprised by this? No, because in the scatter plot above we observed a positive, moderately strong linear relationship

  4. Fit a linear model using the least squares method
  • R function to fit a linear model: lm()
    • required argument: a formula of the type Y \(\sim\) X
lm(distinction ~ Assignment.1, data = data)

Call:
lm(formula = distinction ~ Assignment.1, data = data)

Coefficients:
 (Intercept)  Assignment.1  
    -1.16031       0.02449  
  • \(\widehat{\alpha}\) = -1.16 and \(\widehat{\beta}\) = 0.02
  • The fitted line is \(\widehat{Y}\) = -1.16 + 0.02 \(X\)
  • More specifically, it is \(\widehat{\textrm{distinction}}\) = -1.16 + 0.02 Assignment.1

data |> ggplot(aes(Assignment.1, distinction)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_ipsum_rc() # theme_ipsum_rc() comes from the hrbrthemes package

\[ \widehat{\textrm{distinction}} = \textrm{-1.16} + \textrm{0.02} \, \textrm{Assignment.1} \]

  5. Interpretation of Coefficients:

substantive interpretation of \(\widehat{\alpha}\)?

  • start with mathematical definition:
    • \(\widehat{\alpha}\) is the \(\widehat{Y}\) when X=0
  • substitute X, Y, and \(\widehat{\alpha}\):
    • \(\widehat{\alpha}\) = -1.16 is the \(\widehat{\textrm{distinction}}\) when Assignment.1 = 0

  • put it in words (using units of measurement):
    • when a student scores 0 points in the midterm, we predict that their probability of earning a distinction in the class is -116%, on average
  • nonsensical (due to extrapolation)
  • unit of measurement of \(\widehat{\alpha}\)?
    • same as \(\overline{Y}\); here, Y is binary so \(\overline{Y}\) and \(\widehat{\alpha}\) are measured in % (after x 100)

  5. Interpretation of Coefficients (continued):
  • substantive interpretation of \(\widehat{\beta}\)?
    • start with mathematical definition:
      • \(\widehat{\beta}\) is the \(\triangle \widehat{Y}\) associated with \(\triangle X\)=1
    • substitute X, Y, and \(\widehat{\beta}\):
      • \(\widehat{\beta}\) =0.02 is the \(\triangle \widehat{\textrm{distinction}}\) associated with \(\triangle\)Assignment.1=1

  • put it in words (using units of measurement):
    • an increase in midterm scores of 1 point is associated with a predicted increase in the probability of earning a distinction in the class of 2 percentage points, on average
  • unit of measurement of \(\widehat{\beta}\)?
    • same as \(\triangle \overline{Y}\); here, Y is binary so \(\triangle \overline{Y}\) and \(\widehat{\beta}\) are measured in p.p. (after x 100)

THE FITTED LINE

\[ \widehat{Y} = \widehat{\alpha} + \widehat{\beta} X \]

  • \(\widehat{\alpha}\) (alpha-hat) is the estimated intercept coefficient: the \(\widehat{Y}\) when \(X{=}\textrm{0}\) (in the same unit of measurement as \(\overline Y\))

  • \(\widehat{\beta}\) (beta-hat) is the estimated slope coefficient: the \(\triangle \widehat{Y}\) associated with \(\triangle X{=}\textrm{1}\) (in the same unit of measurement as \(\triangle\overline Y\))

Using the Fitted Line to Make Predictions

  • To predict \(\widehat{Y}\) based on X: \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\)
  • To predict \(\triangle\widehat{Y}\) based on \(\triangle\)X: \(\triangle\widehat{Y} = \widehat{\beta} \triangle X\)

  6. Make predictions

To predict \(\widehat{Y}\) based on \(X\): \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\)

  • Example 1: Imagine you earn 80 points in the midterm, what would we predict your probability of earning a distinction in the class will be?

\[\begin{eqnarray*} \widehat{\textrm{distinction}} &=& \textrm{-1.16} + \textrm{0.02} \, \textrm{Assignment.1}\\ \widehat{\textrm{distinction}} &=& \textrm{-1.16} + \textrm{0.02} \times \textrm{80} \textrm{ (if Assignment.1 = 80)} \\ \widehat{\textrm{distinction}} &=& \textrm{0.44} \end{eqnarray*}\]

  • If you earn 80 points in the midterm, we would predict that your probability of earning a distinction in the class is 44%, on average
  • Note: since Y is binary, \(\widehat{Y}\) is measured in % (after x 100)

  • Example 2: Imagine you earn 90 points in the midterm, what would we predict your probability of earning a distinction in the class will be?
\[\begin{eqnarray*} \widehat{\textrm{distinction}} &=& \textrm{-1.16} + \textrm{0.02} \, \textrm{Assignment.1}\\ \widehat{\textrm{distinction}} &=& \textrm{-1.16} + \textrm{0.02} \times \textrm{90} \textrm{ (if Assignment.1 = 90)}\\ \widehat{\textrm{distinction}} &=& \textrm{0.64} \end{eqnarray*}\]
  • If you earn 90 points in the midterm, we would predict that your probability of earning a distinction in the class is 64%, on average

To predict \(\triangle \widehat{Y}\) associated with \(\triangle X\): \(\triangle \widehat{Y} = \widehat{\beta} \, \triangle \text{X}\)

  • Example 3: What is the predicted change in the probability of earning a distinction in the class associated with an increase in midterm scores of 10 points?
\[\begin{eqnarray*} \triangle\widehat{\textrm{distinction}} &=& \textrm{0.02} \, \triangle\textrm{Assignment.1}\\ \triangle \widehat{\textrm{distinction}} &=& \textrm{0.02} \times \textrm{10} \textrm{ (if $\triangle$Assignment.1 = 10)} \\ \triangle \widehat{\textrm{distinction}} &=& \textrm{0.2} \end{eqnarray*}\]
  • An increase in midterm scores of 10 points is associated with a predicted increase in the probability of earning a distinction in the class of 20 percentage points, on average

  • Note: since Y is binary, \(\triangle\widehat{Y}\) is in p.p. (after x 100)
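
Both kinds of prediction can also be computed in R with predict() and coef(). Here is a minimal sketch using a small made-up data set (the toy scores below are invented for illustration, so the fitted coefficients will differ from the slides' -1.16 and 0.02):

```r
# Toy data invented for illustration (NOT the course data)
toy <- data.frame(
  Assignment.1 = c(55, 60, 65, 70, 74, 75, 80, 90),
  distinction  = c(0,  0,  0,  1,  1,  1,  1,  1)
)

fit <- lm(distinction ~ Assignment.1, data = toy)

# Y-hat for a given X: predict() applies alpha-hat + beta-hat * X
y_hat_80 <- predict(fit, newdata = data.frame(Assignment.1 = 80))

# Delta Y-hat for Delta X = 10: just beta-hat * 10
delta_y_hat <- unname(coef(fit)["Assignment.1"]) * 10
```

Multiplying `y_hat_80` and `delta_y_hat` by 100 converts them to % and percentage points, respectively, as in the slides.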

  7. Measure how well the model fits the data with \(\textrm{R}^2\)
  • How good is the model at making predictions? How well does the model fit the data?
  • One way of answering is by calculating \(R^2\)

\(R^2\) measures the proportion of the variation in the outcome variable explained by the model

  • It ranges from 0 to 1

  • The higher the \(R^2\), the better the model fits the data

  • In the simple linear model: \(R^2 = \textrm{cor}(X,Y)^2\)

  • When cor(X,Y) = 1 or cor(X,Y) = -1, the relationship between X and Y is perfectly linear
  • \(R^2\) = cor(X,Y)\(^2\) = 1, the model explains 100% of the variation of Y
  • All prediction errors (vertical distance between the dots and the line) = 0

  • When cor(X,Y) = 0, there is no linear association between X and Y
  • \(R^2\)=cor(X,Y)\(^2\)= 0, the model explains 0% of the variation of Y
  • The prediction errors (vertical distance between the dots and the line) are very large

  • Let’s compute \(R^2\)
    cor(data$distinction, data$Assignment.1)^2
## [1] 0.4847673
  • Interpretation?
    • It means that the linear model explains 48% of the variation of the outcome variable (distinction)
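
The same number can be read off the fitted model object: summary() reports `r.squared`, which in a simple linear model equals cor(X,Y)\(^2\). A quick sketch with made-up data:

```r
# Toy data invented for illustration (NOT the course data)
toy <- data.frame(x = c(55, 60, 65, 70, 75, 80, 85, 90),
                  y = c(0, 0, 0, 1, 0, 1, 1, 1))

r2_from_cor   <- cor(toy$x, toy$y)^2                   # squared correlation
r2_from_model <- summary(lm(y ~ x, data = toy))$r.squared

# in the one-predictor case the two quantities are identical
```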

Let’s return to the predictive model from last lecture:


  • R code to compute \(R^2\)?
    • Answer:
cor(data$Assignment.1, data$Course.total)^2
## [1] 0.748613
  • Interpretation?
    • It does NOT mean that the model is right 75% of the time; It means that the linear model explains 75% of the variation of the outcome variable (Course.total)
  • Warning: only compare \(R^2\) between models with the same outcome variable (\(Y\)); some variables are intrinsically harder to predict than others

Predicting Outcomes Using Linear Models:

We look for \(X\) variables that are highly correlated with \(Y\) because the higher the correlation between \(X\) and \(Y\) (in absolute terms), the higher the \(R^2\) and the better the fitted linear model will usually be at predicting \(Y\) using \(X\).

Why do we analyse data?

MEASURE: To infer population characteristics via survey research

  • what proportion of constituents support a particular policy?

PREDICT: To make predictions

  • who is the most likely candidate to win an upcoming election?

EXPLAIN: To estimate the causal effect of a treatment on an outcome

  • what is the effect of small classrooms on student performance?

Book Outline

Chapter                                                       Goal
2. Estimating Causal Effects with Randomized Experiments      EXPLAIN
3. Inferring Population Characteristics via Survey Research   MEASURE
4. Predicting Outcomes Using Linear Regression                PREDICT
5. Estimating Causal Effects with Observational Data          EXPLAIN

Review: Causation

  • To measure causal effects, we need to compare the factual outcome with the counterfactual outcome
    • Fundamental problem: We can never observe the counterfactual outcome
  • To estimate causal effects, we must find or create a situation in which the treatment and control groups are comparable with respect to all the variables that might affect the outcome other than the treatment variable itself
  • Only when that assumption is satisfied can we use the factual outcome of one group as a good proxy for the counterfactual outcome of the other, and vice versa, thus, bypassing the fundamental problem of causal inference

Review: Randomized Experiments

  • In randomized experiments, we can rely on the random assignment of treatment to make treatment and control groups, on average, identical to each other in terms of all observed and unobserved pre-treatment characteristics
  • Thus, we can estimate the average treatment effect with the difference-in-means estimator

\[\overline{Y}_\text{treatment group} - \overline{Y}_\text{control group}\]

Observational Data

  • But, what happens when we cannot conduct a randomized experiment and have to analyse observational data?
    • Observational data: data collected about naturally occurring events (i.e., researchers do not get to assign the treatment)
  • We can no longer assume that treatment and control groups are comparable
  • We need to identify and measure any relevant differences between treatment and control groups (known as confounding variables or confounders)
  • Then, we will need to statistically control for them so that we can make the two groups comparable after statistical controls are applied

Confounders or Confounding Variables

  • A confounding variable is a variable that affects both
    • i. the likelihood of receiving the treatment \(X\), and
    • ii. the outcome \(Y\)
  • In mathematical notation, we represent a confounding variable as \(Z\)

  • Let’s look at a simple example. Suppose we are interested in the average causal effect of attending a private school, as opposed to a public one, on students’ test performance

    • What is the treatment variable \(X\)?
    • What is the outcome variable \(Y\)?
    • Can you think of a confounder \(Z\)?

Why Are Confounders a Problem?

  • They obscure the causal relationship between \(X\) and \(Y\)

  • In the example above, if we observed that, on average, private school students perform better than public school students, we would not know whether it is

    • because they attended a private school or
    • because they came from wealthier families that could afford to provide them with after-school help
  • We would not know what portion of the observed differences in test score performance (the difference-in-means estimator), if any, could be attributed to the causal effect of the treatment (attending a private school) and what portion could be attributed to the confounding variable (coming from a wealthy family)

  • In the presence of confounders, correlation does not necessarily imply causation

  • Just because we observe two variables highly correlated with each other—when we observe one increase, we usually observe the other increase or decrease—it does not automatically mean that one causes the other

    • There could be a third variable that causes both
  • For example, ice cream sales and shark attacks are highly correlated with each other. Does this mean that eating ice cream increases the probability that a shark attacks you?
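
The ice cream/shark example can be simulated. In this sketch, temperature plays the role of the confounder \(Z\): it drives both variables, neither of which causes the other, yet the two end up strongly correlated (all numbers below are invented for illustration):

```r
set.seed(42)  # for reproducibility

temperature <- rnorm(500, mean = 20, sd = 5)         # the confounder Z
ice_cream   <- 10 + 2.0 * temperature + rnorm(500)   # X depends only on Z
sharks      <-  1 + 0.3 * temperature + rnorm(500)   # Y depends only on Z

cor(ice_cream, sharks)  # strongly positive despite no causal link
```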

IN THE PRESENCE OF CONFOUNDERS

  • correlation does NOT necessarily imply causation

  • the treatment and control groups are NOT comparable

  • the difference-in-means estimator does NOT provide a valid estimate of the average treatment effect

Why Don’t We Worry About Confounders in Randomized Experiments?

  • Randomization of treatment assignment eliminates all potential confounders
  • It ensures that treatment and control groups are comparable by breaking the link between any potential confounder and the treatment
  • If we assign who attends a private school at random, we ensure that nothing related to the outcome is also related to the likelihood of receiving the treatment

How Can We Estimate Causal Effects with Observational Data?

  • We cannot rely on random treatment assignment to eliminate potential confounders
  • We need to identify and measure all confounding variables and statistically control for them
  • Before we learn how to do that, we should learn how to fit a simple linear regression model to produce an estimated coefficient equivalent to the difference-in-means estimator
  • Let’s quickly review how we fit a line and interpret the estimated coefficients

  • scatter plot where every dot is an observation

  • first observation: \((X_1, Y_1)\)

  • if we summarise the relationship between X and Y with a line

  • we can use the fitted line to compute \(\widehat{Y}\) for every value of \(X\)

  • prediction errors = vertical distances between dots and line

  • we choose the line with the smallest possible errors

  • the fitted line: \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta}X\)

  • estimated intercept (\(\widehat{\alpha}\)): \(\widehat{Y}\) when \(X{=}0\)

  • estimated slope (\(\widehat{\beta}\)): \(\triangle\widehat{Y}\) associated with \(\triangle X{=}1\)
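
For the simple linear model, the least-squares line has a closed form: \(\widehat{\beta} = \textrm{cov}(X,Y)/\textrm{var}(X)\) and \(\widehat{\alpha} = \overline{Y} - \widehat{\beta}\,\overline{X}\). A quick check against lm() with made-up numbers:

```r
# Toy data invented for illustration
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

beta_hat  <- cov(toy$x, toy$y) / var(toy$x)       # estimated slope
alpha_hat <- mean(toy$y) - beta_hat * mean(toy$x) # estimated intercept

fit <- lm(y ~ x, data = toy)  # lm() recovers the same fitted line
```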

Using the Simple Linear Model to Compute the Difference-in-Means Estimator

When X is the treatment variable and Y is the outcome variable of interest, the estimated slope coefficient (\(\widehat{\beta}\)) is equivalent to the difference-in-means estimator.

  • Let’s examine this …

  • Mathematical definition of \(\widehat{\beta}\): \(\triangle \widehat{Y}\) associated with \(\triangle X{=}\textrm{1}\)
\[\begin{eqnarray*} \widehat{\beta} &=& \triangle \widehat{Y} \textrm{(if } \triangle X{=}\textrm{1}\textrm{)} \\ &=& \widehat{Y}_{\textrm{final}}{-}\widehat{Y}_{\textrm{initial}} \textrm{(if } \triangle X{=}\textrm{1}\textrm{)} \end{eqnarray*}\]
  • If \(X\) is the treatment variable:

    • \(\triangle X{=}\textrm{1}\) is equivalent to changing from the control group (\(X{=}\textrm{0}\)) to the treatment group (\(X{=}\textrm{1}\))
    • the control group is the initial state, and the treatment group is the final state

\[\widehat{\beta}=\widehat{Y}_{\textrm{treatment group}}{-}\widehat{Y}_{\textrm{control group}}\]

  • Recall: \(\widehat{Y}\) are predicted values. In this case: \(\widehat{Y}_{\textrm{treatment group}} = \overline{Y}_{\textrm{treatment group}}\) and \(\widehat{Y}_{\textrm{control group}} = \overline{Y}_{\textrm{control group}}\)

\[ \widehat{\beta} = \overline{Y}_{\textrm{treatment group}} - \overline{Y}_{\textrm{control group}} \]

  • Conclusion: When X is the treatment variable and Y is the outcome variable of interest, the estimated slope coefficient (\(\widehat{\beta}\)) is equivalent to the difference-in-means estimator

  • Let’s return to the exercise from the experiment: Does Social Pressure Affect Turnout?

  • We answer it by analysing data from a randomized experiment where registered voters were randomly assigned to either (a) receive a message designed to induce social pressure, or (b) receive nothing

Does Social Pressure Affect Turnout?

  1. Load and look at the data
    voting <- read.csv("voting.csv") # loads and stores data
    head(voting) # shows first six observations
##   birth message voted
## 1  1981      no     0
## 2  1959      no     1
## 3  1956      no     1
## 4  1939     yes     1
## 5  1968      no     0
## 6  1967      no     0
  2. Create treatment variable
voting <- voting |> 
  mutate(pressure = case_when(
    message == "yes" ~ 1,
    message == "no" ~ 0
  )) 

  • Make sure the new variable was created correctly by looking at the first few observations again:
    head(voting) # shows first six observations
##   birth message voted pressure
## 1  1981      no     0        0
## 2  1959      no     1        0
## 3  1956      no     1        0
## 4  1939     yes     1        1
## 5  1968      no     0        0
## 6  1967      no     0        0

  3. Compute difference-in-means estimator directly
    mean(voting$voted[voting$pressure==1]) -
        mean(voting$voted[voting$pressure==0]) 
## [1] 0.08130991
  4. Alternatively, we can fit a linear model where X is the treatment variable and Y is the outcome variable

  • Recall: the R function to fit a linear model is lm()
    • required argument: a formula of the type Y \(\sim\) X
    lm(voted ~ pressure, data=voting) 

Call:
lm(formula = voted ~ pressure, data = voting)

Coefficients:
(Intercept)     pressure  
    0.29664      0.08131  
  • Fitted model: \(\widehat{\textrm{voted}}\) = 0.30 + 0.08 pressure
  • Note that \(\widehat{\beta}\) has the same value as the difference-in-means estimator above (both equal 0.08)
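
This equivalence holds for any data set with a binary treatment. A minimal sketch with made-up data (voting.csv itself is not reproduced here):

```r
# Toy data invented for illustration (NOT the voting.csv data)
toy <- data.frame(
  pressure = c(1, 1, 1, 1, 0, 0, 0, 0),
  voted    = c(1, 1, 0, 1, 1, 0, 0, 0)
)

# difference-in-means estimator, computed directly
diff_in_means <- mean(toy$voted[toy$pressure == 1]) -
                 mean(toy$voted[toy$pressure == 0])

# estimated slope from the simple linear model
beta_hat <- unname(coef(lm(voted ~ pressure, data = toy))["pressure"])

# beta_hat equals diff_in_means (here both are 0.75 - 0.25 = 0.5)
```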

Interpretation of \(\widehat{\beta}\) When X Is the Treatment Variable and Y Is the Outcome Variable

  • Start same as in predictive models

    • definition: \(\widehat{\beta}\) is the \(\triangle \widehat{Y}\) associated with \(\triangle X\)=1
    • here: \(\widehat{\beta}\) = 0.08 is the \(\triangle \widehat{\textrm{voted}}\) associated with \(\triangle\)pressure=1
    • in words: receiving the message inducing social pressure (i.e., an increase in pressure of 1 by going from pressure=0 to pressure=1) is associated with a predicted increase in the probability of voting of 8 percentage points, on average

  • unit of measurement of \(\widehat{\beta}\)? same as \(\triangle \overline{Y}\); here, Y is binary so \(\triangle \overline{Y}\) is measured in p.p. and so is \(\widehat{\beta}\) (after x 100)

  • Now, since here X is the treatment variable and Y is the outcome variable of interest, \(\widehat{\beta}\) is equivalent to the difference-in-means estimator

  • As a result, we can interpret \(\widehat{\beta}\) using causal language

  • Predictive language: We estimate that receiving the message inducing social pressure is associated with a predicted increase in the probability of voting of 8 percentage points, on average

  • Causal language: We estimate that receiving the message inducing social pressure increases the probability of voting by 8 percentage points, on average

  • This should be a valid estimate of the average treatment effect if there are no confounding variables present

    • if registered voters who received the message are comparable to the registered voters who did not
  • Since the data come from a randomized experiment there should be no confounding variables

  • And thus the difference-in-means estimator should produce a valid estimate of the average treatment effect

  • Whether we compute the difference-in-means estimator directly or we fit a simple linear model where Y is the outcome variable and X is the treatment variable, we arrive at the same conclusion

  • Conclusion: We estimate that receiving the message inducing social pressure increases the probability of voting by 8 percentage points, on average. This is a valid estimate of the average treatment effect if registered voters who received the message are comparable to the registered voters who did not (that is, if there are no confounding variables). Given that the data come from a randomized experiment, this is a reasonable assumption.

Interpretation of the Estimated Slope Coefficient in the Simple Linear Model

  • By default, we interpret \(\widehat{\beta}\) using predictive language: It is the \(\triangle \widehat{Y}\) associated with \(\triangle X{=}\textrm{1}\).

  • When \(X\) is the treatment variable, then \(\widehat{\beta}\) is equivalent to the difference-in-means estimator and, thus, we interpret \(\widehat{\beta}\) using causal language: It is the \(\triangle \widehat{Y}\) caused by \(\triangle X{=}\textrm{1}\) (the presence of the treatment). This causal interpretation is valid if there are no confounding variables present and, thus, the treatment and control groups are comparable.