The predictor (X) is the variable we want to use to predict the outcome (Y)
in this case, the predictor is Assignment.1
let’s visualize the distribution of Assignment.1
The outcome (Y) is the variable that we want to predict
in this case, the outcome variable is distinction
let’s visualize the distribution of distinction
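Neither plot is reproduced in these notes; a minimal sketch of how we might draw both distributions, assuming the data frame is named data (as in the recoding code below) and that tidyverse is loaded:

```r
library(tidyverse) # for ggplot2 and the pipe used below

# distribution of the predictor Assignment.1 (a numeric score)
data |> ggplot(aes(Assignment.1)) + geom_histogram()

# distribution of the outcome distinction (non-numeric: "yes"/"no")
data |> ggplot(aes(distinction)) + geom_bar()
```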
What type of variable is distinction?
non-numeric, but can be recoded as a numeric binary using case_when()
How would you compute the proportion of students who received a distinction?
by computing the mean of distinction
since distinction is a binary variable, its mean should be interpreted as the proportion of the observations that have the characteristic identified by the variable
Code to compute the mean of distinction
Answer:
```r
data <- data |>
  mutate(distinction = case_when(distinction == "yes" ~ 1,
                                 TRUE ~ 0))
data |>
  summarise(mean = mean(distinction))
```
   mean
1 0.375
Interpretation?
- 37.5% of the students earned a 70 or above in the class
- RECALL: you need to multiply the output by 100
Since Y is binary
unit of measurement of \(\overline{Y}\)?
% (after x 100)
unit of measurement of \(\widehat{Y}\)?
% (after x 100)
unit of measurement of \(\widehat{\alpha}\)?
% (after x 100)
unit of measurement of \(\triangle\overline{Y}\)?
p.p. (after x 100)
unit of measurement of \(\triangle\widehat{Y}\)?
p.p. (after x 100)
unit of measurement of \(\widehat{\beta}\)?
p.p. (after x 100)
What is the relationship between X and Y?
Create a scatter plot to visualize the relationship between Assignment.1 and distinction
```r
data |> ggplot(aes(Assignment.1, distinction)) + geom_point()
```
what does each dot represent?
does the relationship look positive or negative?
does the relationship look weakly or strongly linear?
Calculate correlation to measure direction and strength of linear association between Assignment.1 and distinction
cor(data$Assignment.1, data$distinction)
[1] 0.6962523
we find a moderately strong positive correlation
are we surprised by this? No, because in the scatter plot above we observed that the relationship was positive and moderately strongly linear
Fit a linear model using the least squares method
R function to fit a linear model: lm()
required argument: a formula of the type Y \(\sim\) X
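The lm() call itself does not appear in these notes; a minimal sketch consistent with the coefficients quoted below, assuming the recoded data frame data from above (the object name fit is ours):

```r
# fit a linear model of distinction on Assignment.1 by least squares
fit <- lm(distinction ~ Assignment.1, data = data)
coef(fit) # per the text: alpha-hat is about -1.16, beta-hat about 0.02
```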
substantive interpretation of \(\widehat{\alpha}\)?
start with mathematical definition:
\(\widehat{\alpha}\) is the \(\widehat{Y}\) when X=0
substitute X, Y, and \(\widehat{\alpha}\):
\(\widehat{\alpha}\) = -1.16 is the \(\widehat{\textrm{distinction}}\) when Assignment.1=0
put it in words (using units of measurement):
when a student scores 0 points on Assignment.1, we predict that their probability of earning a distinction in the class is -116%, on average
nonsensical (due to extrapolation)
unit of measurement of \(\widehat{\alpha}\)?
same as \(\overline{Y}\); here, Y is binary so \(\overline{Y}\) and \(\widehat{\alpha}\) are measured in % (after x 100)
Interpretation of Coefficients:
substantive interpretation of \(\widehat{\beta}\)?
start with mathematical definition:
\(\widehat{\beta}\) is the \(\triangle \widehat{Y}\) associated with \(\triangle X\)=1
substitute X, Y, and \(\widehat{\beta}\):
\(\widehat{\beta}\) =0.02 is the \(\triangle \widehat{\textrm{distinction}}\) associated with \(\triangle\)Assignment.1=1
put it in words (using units of measurement):
an increase in Assignment.1 of 1 point is associated with a predicted increase in the probability of earning a distinction in the class of 2 percentage points, on average
unit of measurement of \(\widehat{\beta}\)?
same as \(\triangle \overline{Y}\); here, Y is binary so \(\triangle \overline{Y}\) and \(\widehat{\beta}\) are measured in p.p. (after x 100)
THE FITTED LINE
\[
\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X
\]
\(\widehat{\alpha}\) (alpha-hat) is the estimated intercept coefficient: the \(\widehat{Y}\) when \(X{=}\textrm{0}\) (in the same unit of measurement as \(\overline Y\))
\(\widehat{\beta}\) (beta-hat) is the estimated slope coefficient: the \(\triangle \widehat{Y}\) associated with \(\triangle X{=}\textrm{1}\) (in the same unit of measurement as \(\triangle\overline Y\))
Using the Fitted Line to Make Predictions
To predict \(\widehat{Y}\) based on X: \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\)
To predict \(\triangle\widehat{Y}\) based on \(\triangle\)X: \(\triangle\widehat{Y} = \widehat{\beta} \triangle X\)
Make predictions
To predict \(\widehat{Y}\) based on \(X\): \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\)
Example 1: Imagine you earn 90 points on Assignment.1. What would we predict your probability of earning a distinction in the class will be?
If you earn 90 points on Assignment.1, we would predict that your probability of earning a distinction in the class is 64%, on average
To predict \(\triangle \widehat{Y}\) associated with \(\triangle X\): \(\triangle \widehat{Y} = \widehat{\beta} \, \triangle \text{X}\)
Example 3: What is the predicted change in the probability of earning a distinction in the class associated with an increase in Assignment.1 of 10 points?
An increase in Assignment.1 of 10 points is associated with a predicted increase in the probability of earning a distinction in the class of 20 percentage points, on average
Note: since Y is binary, \(\triangle\widehat{Y}\) is in p.p. (after x 100)
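A minimal sketch of both kinds of prediction, reusing the hypothetical fit object from above; the plain-arithmetic lines make the fitted-line formulas explicit:

```r
# Y-hat when Assignment.1 = 90: -1.16 + 0.02 * 90 = 0.64, i.e., 64%
predict(fit, newdata = data.frame(Assignment.1 = 90))
-1.16 + 0.02 * 90

# change in Y-hat for a 10-point increase: 0.02 * 10 = 0.20, i.e., 20 p.p.
0.02 * 10
```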
Measure how well the model fits the data with \(\textrm{R}^2\)
How good is the model at making predictions? How well does the model fit the data?
One way of answering is by calculating \(R^2\)
\(R^2\) measures the proportion of the variation in the outcome variable explained by the model
It ranges from 0 to 1
The higher the \(R^2\), the better the model fits the data
In the simple linear model: \(R^2 = \textrm{cor}(X,Y)^2\)
When cor(X,Y) = 1 or cor(X,Y) = -1, the relationship between X and Y is perfectly linear
\(R^2\) = cor(X,Y)\(^2\) = 1, the model explains 100% of the variation of Y
All prediction errors (vertical distance between the dots and the line) = 0
When cor(X,Y) = 0, there is no linear association between X and Y
\(R^2\)=cor(X,Y)\(^2\)= 0, the model explains 0% of the variation of Y
The prediction errors (vertical distance between the dots and the line) are as large as they can be: the best-fitting line is flat at \(\overline{Y}\)
An \(R^2\) of 0.75, for example, does NOT mean that the model is right 75% of the time; it means that the linear model explains 75% of the variation of the outcome variable (Course.total)
Warning: only compare \(R^2\) between models with the same outcome variable (\(Y\)); some variables are intrinsically harder to predict than others
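A minimal sketch of both ways to obtain \(R^2\) here, reusing the hypothetical fit object (summary() on an lm object reports r.squared):

```r
# R-squared as the squared correlation between X and Y
cor(data$Assignment.1, data$distinction)^2 # 0.6962523^2, about 0.48

# the same quantity taken from the fitted model
summary(fit)$r.squared
```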
Predicting Outcomes Using Linear Models:
We look for \(X\) variables that are highly correlated with \(Y\) because the higher the correlation between \(X\) and \(Y\) (in absolute terms), the higher the \(R^2\) and the better the fitted linear model will usually be at predicting \(Y\) using \(X\).
Why do we analyse data?
MEASURE: To infer population characteristics via survey research
what proportion of constituents support a particular policy?
PREDICT: To make predictions
who is the most likely candidate to win an upcoming election?
EXPLAIN: To estimate the causal effect of a treatment on an outcome
what is the effect of small classrooms on student performance?
Book Outline
| Chapter | Goal |
|---------|------|
| 2. Estimating Causal Effects with Randomized Experiments | EXPLAIN |
| 3. Inferring Population Characteristics via Survey Research | MEASURE |
| 4. Predicting Outcomes Using Linear Regression | PREDICT |
| 5. Estimating Causal Effects with Observational Data | EXPLAIN |
Review: Causation
To measure causal effects, we need to compare the factual outcome with the counterfactual outcome
Fundamental problem: We can never observe the counterfactual outcome
To estimate causal effects, we must find or create a situation in which the treatment and control groups are comparable with respect to all the variables that might affect the outcome other than the treatment variable itself
Only when that assumption is satisfied can we use the factual outcome of one group as a good proxy for the counterfactual outcome of the other, and vice versa, thus, bypassing the fundamental problem of causal inference
Review: Randomized Experiments
In randomized experiments, we can rely on the random assignment of treatment to make treatment and control groups, on average, identical to each other in terms of all observed and unobserved pre-treatment characteristics
Thus, we can estimate the average treatment effect with the difference-in-means estimator
But, what happens when we cannot conduct a randomized experiment and have to analyse observational data?
Observational data: data collected about naturally occurring events (i.e., researchers do not get to assign the treatment)
We can no longer assume that treatment and control groups are comparable
We need to identify and measure any relevant differences between treatment and control groups (known as confounding variables or confounders)
Then, we will need to statistically control for them so that we can make the two groups comparable after statistical controls are applied
Confounders or Confounding Variables
A confounding variable is a variable that affects both
i. the likelihood to receive the treatment \(X\) and
ii. the outcome \(Y\)
In mathematical notation, we represent a confounding variable as \(Z\)
Let’s look at a simple example. Suppose we are interested in the average causal effect of attending a private school, as opposed to a public one, on students’ test performance
What is the treatment variable \(X\)?
What is the outcome variable \(Y\)?
Can you think of a confounder \(Z\)?
Why Are Confounders a Problem?
They obscure the causal relationship between \(X\) and \(Y\)
In the example above, if we observed that, on average, private school students perform better than public school students, we would not know whether it is
because they attended a private school or
because they came from wealthier families that could afford to provide them with after-school help
We would not know what portion of the observed differences in test score performance (the difference-in-means estimator), if any, could be attributed to the causal effect of the treatment (attending a private school) and what portion could be attributed to the confounding variable (coming from a wealthy family)
In the presence of confounders, correlation does not necessarily imply causation
Just because we observe two variables highly correlated with each other—when we observe one increase, we usually observe the other increase or decrease—it does not automatically mean that one causes the other
There could be a third variable that causes both
For example, ice cream sales and shark attacks are highly correlated with each other. Does this mean that eating ice cream increases the probability that a shark attacks you?
IN THE PRESENCE OF CONFOUNDERS
correlation does NOT necessarily imply causation
the treatment and control groups are NOT comparable
the difference-in-means estimator does NOT provide a valid estimate of the average treatment effect
Why Don’t We Worry About Confounders in Randomized Experiments?
Randomization of treatment assignment eliminates all potential confounders
It ensures that treatment and control groups are comparable by breaking the link between any potential confounder and the treatment
If we assign who attends a private school at random, we ensure that nothing related to the outcome is also related to the likelihood of receiving the treatment
How Can We Estimate Causal Effects with Observational Data?
We cannot rely on random treatment assignment to eliminate potential confounders
We need to identify and measure all confounding variables and statistically control for them
Before we learn how to do that, we should learn how to fit a simple linear regression model to produce an estimated coefficient equivalent to the difference-in-means estimator
Let’s quickly review how we fit a line and interpret the estimated coefficients
scatter plot where every dot is an observation
first observation: \((X_1, Y_1)\)
if we summarise the relationship between X and Y with a line
we can use the fitted line to compute \(\widehat{Y}\) for every value of \(X\)
prediction errors = vertical distance between dots and line
we choose the line with the smallest possible errors
the fitted line: \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta}X\)
estimated intercept (\(\widehat{\alpha}\)): \(\widehat{Y}\) when \(X{=}0\)
estimated slope (\(\widehat{\beta}\)): \(\triangle\widehat{Y}\) associated with \(\triangle X{=}1\)
Using the Simple Linear Model to Compute the Difference-in-Means Estimator
When X is the treatment variable and Y is the outcome variable of interest, the estimated slope coefficient (\(\widehat{\beta}\)) is equivalent to the difference-in-means estimator.
Let’s examine this …
Mathematical definition of \(\widehat{\beta}\): \(\triangle \widehat{Y}\) associated with \(\triangle X{=}\textrm{1}\)
\(\triangle X{=}\textrm{1}\) is equivalent to changing from the control group (\(X{=}\textrm{0}\)) to the treatment group (\(X{=}\textrm{1}\))
the control group is the initial state, and the treatment group is the final state
\[\widehat{\beta}=\widehat{Y}_{\textrm{treatment group}}{-}\widehat{Y}_{\textrm{control group}}\]

- Recall: \(\widehat{Y}\) are predicted values. In this case: \(\widehat{Y}_{\textrm{treatment group}} = \overline{Y}_{\textrm{treatment group}}\) and \(\widehat{Y}_{\textrm{control group}} = \overline{Y}_{\textrm{control group}}\)

\[
\widehat{\beta} = \overline{Y}_{\textrm{treatment group}} - \overline{Y}_{\textrm{control group}}
\]

- Conclusion: When X is the treatment variable and Y is the outcome variable of interest, the estimated slope coefficient (\(\widehat{\beta}\)) is equivalent to the difference-in-means estimator
Let’s return to the exercise from the experiment: Does Social Pressure Affect Turnout?
We answer it by analysing data from a randomized experiment where registered voters were randomly assigned to either (a) receive a message designed to induce social pressure, or (b) receive nothing
Does Social Pressure Affect Turnout?
Load and look at the data
```r
voting <- read.csv("voting.csv") # loads and stores data
head(voting)                     # shows first six observations
```
## birth message voted
## 1 1981 no 0
## 2 1959 no 1
## 3 1956 no 1
## 4 1939 yes 1
## 5 1968 no 0
## 6 1967 no 0
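The difference-in-means computation and the fitted model referred to below are not reproduced in these notes; a minimal sketch consistent with the surrounding text, assuming tidyverse is loaded and that the treatment is recoded as a binary variable named pressure (the name used in the interpretation below):

```r
# recode the treatment into a numeric binary (1 = received the message)
voting <- voting |>
  mutate(pressure = case_when(message == "yes" ~ 1,
                              TRUE ~ 0))

# difference-in-means estimator: mean turnout, treatment minus control
mean(voting$voted[voting$pressure == 1]) -
  mean(voting$voted[voting$pressure == 0]) # about 0.08 per the text

# equivalent simple linear model: the slope on pressure should also be ~0.08
lm(voted ~ pressure, data = voting)
```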
Note that \(\widehat{\beta}\) has the same value as the difference-in-means estimator above (both equal 0.08)
Interpretation of \(\widehat{\beta}\) When X Is the Treatment Variable and Y Is the Outcome Variable
Start same as in predictive models
definition: \(\widehat{\beta}\) is the \(\triangle \widehat{Y}\) associated with \(\triangle X\)=1
here: \(\widehat{\beta}\) = 0.08 is the \(\triangle \widehat{\textrm{voted}}\) associated with \(\triangle\)pressure=1
in words: receiving the message inducing social pressure (i.e., an increase in pressure of 1 by going from pressure=0 to pressure=1) is associated with a predicted increase in the probability of voting of 8 percentage points, on average
unit of measurement of \(\widehat{\beta}\)? same as \(\triangle \overline{Y}\); here, Y is binary so \(\triangle \overline{Y}\) is measured in p.p. and so is \(\widehat{\beta}\) (after x 100)
Now, since here X is the treatment variable and Y is the outcome variable of interest, \(\widehat{\beta}\) is equivalent to the difference-in-means estimator
As a result, we can interpret \(\widehat{\beta}\) using causal language
Predictive language: We estimate that receiving the message inducing social pressure is associated with a predicted increase in the probability of voting of 8 percentage points, on average
Causal language: We estimate that receiving the message inducing social pressure increases the probability of voting by 8 percentage points, on average
This should be a valid estimate of the average treatment effect if there are no confounding variables present
if registered voters who received the message are comparable to the registered voters who did not
Since the data come from a randomized experiment there should be no confounding variables
And thus the difference-in-means estimator should produce a valid estimate of the average treatment effect
Whether we compute the difference-in-means estimator directly or we fit a simple linear model where Y is the outcome variable and X is the treatment variable, we arrive at the same conclusion
Conclusion: We estimate that receiving the message inducing social pressure increases the probability of voting by 8 percentage points, on average. This is a valid estimate of the average treatment effect if registered voters who received the message are comparable to the registered voters who did not (that is, if there are no confounding variables). Given that the data come from a randomized experiment, this is a reasonable assumption.
Interpretation of the Estimated Slope Coefficient in the Simple Linear Model
By default, we interpret \(\widehat{\beta}\) using predictive language: It is the \(\triangle \widehat{Y}\) associated with \(\triangle X{=}\textrm{1}\).
When \(X\) is the treatment variable, then \(\widehat{\beta}\) is equivalent to the difference-in-means estimator and, thus, we interpret \(\widehat{\beta}\) using causal language: It is the \(\triangle \widehat{Y}\) caused by \(\triangle X{=}\textrm{1}\) (the presence of the treatment). This causal interpretation is valid if there are no confounding variables present and, thus, the treatment and control groups are comparable.