POL269 Political Data Research
Javier Sajuria
05.02.2024
MEASURE: To infer population characteristics via survey research
PREDICT: To make predictions
EXPLAIN: To estimate the causal effect of a treatment on an outcome
Dear Registered Voter: WHAT IF YOUR NEIGHBORS KNEW WHETHER YOU VOTED? … We’re sending this mailing to you and your neighbours to publicize who does and does not vote. The chart shows the names of some of your neighbours, showing which have voted in the past. After the August 8 election, we intend to mail an updated chart. You and your neighbours will all know who voted and who did not. DO YOUR CIVIC DUTY–VOTE!
MAPEL DR | Name | Aug 2004 | Nov 2004 | Aug 2006 |
---|---|---|---|---|
9993 | JOSEPH JAMES SMITH | Voted | Voted | ?? |
9995 | JENNIFER KAY SMITH | Didn’t vote | Voted | ?? |
9997 | RICHARD B JACKSON | Didn’t vote | Voted | ?? |
9999 | KATHY MARIE JACKSON | Didn’t vote | Voted | ?? |
Unit of observation: Registered voters
Variables:
Variable | Description |
---|---|
birth | year of birth |
message | whether registered voter received the message (“yes” or “no”) |
voted | whether registered voter voted: 1 = yes; 0 = no |
Many of the most important research questions in politics involve estimating a causal effect:
Causal Effects refer to the cause-and-effect connection between two variables:
treatment variable (X): variable whose change may produce a change in the outcome variable
outcome variable (Y): variable that may change as a result of a change in the treatment variable
The causal relationship we are interested in is:
\[ X \rightarrow Y \]
In the voting dataset we have three variables, birth, message, and voted, and we aim to answer the research question: “Does social pressure increase the probability of turning out to vote?”
What is the treatment variable?
What is the outcome variable?
The causal relationship we are interested in is:
\[ message \rightarrow voted \]
In this class, treatment variables will always be binary:
\[ \textrm{X}_i = \begin{cases} \textrm{1} \text{ if individual i takes the treatment} \\ \textrm{0} \text{ if individual i does not take the treatment}\end{cases} \]
In the voting experiment, the treatment variable is:
\[ \textrm{message}_i = \begin{cases} \textrm{1} \text{ if registered voter i received message} \\ \textrm{0} \text{ if registered voter i did not}\end{cases} \]
Based on whether the individual receives the treatment, we speak of two different conditions
treatment is the condition with the treatment: \(X_i{=}\textrm{1}\)
control is the condition without the treatment: \(X_i{=}\textrm{0}\)
We will see different types of outcome variables
binary
non-binary
In the voting experiment, the outcome variable is:
\[ \textrm{voted}_i = \begin{cases} \textrm{1} \text{ if registered voter i voted}\\ \textrm{0} \text{ if registered voter i didn't vote}\end{cases} \]
What type of variable is voted?
The causal effect of X on Y is the change in the outcome variable caused by a change in the treatment variable
\[ \triangle Y_i = Y_i(X_i{=}\textrm{1}) - Y_i(X_i{=}\textrm{0}) \]
In the voting experiment, we aim to measure the extent to which the probability of voting changes as a result of receiving the social pressure message
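Ideally, for each registered voter we would like to observe both potential outcomes:
whether they voted after receiving the social pressure message: voted\(_i\)(message\(_i\)=1)
whether they voted after not receiving the social pressure message: voted\(_i\)(message\(_i\)=0)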
If this were possible, the effect of receiving the social pressure message on the probability of voting would be:
\[ \triangle \textrm{voted}_i = \textrm{voted}_i (\textrm{message}_i = \textrm{1}) - \textrm{voted}_i(\textrm{message}_i = \textrm{0}) \]
Do we ever observe both potential outcomes for the same individual at the exact same time under the same circumstances?
We only observe the factual outcome: potential outcome under the condition received in reality
We can never observe the counterfactual outcome: potential outcome under the opposite condition as the one received in reality
Because we can never observe both potential outcomes for the same individual (the fundamental problem of causal inference), we must find good approximations for the counterfactual outcomes
We move away from individual-level effects and focus on the average causal effects across a group of individuals
The average causal effect of the treatment X on the outcome Y (also known as the average treatment effect) is the average of all the individual causal effects of X on Y within a group
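As a formula, this definition can be sketched as follows, assuming the group contains \(n\) individuals indexed by \(i\) (the symbol \(n\) and the bar notation for the average effect are introduced here for illustration and are not used elsewhere in these slides):

\[ \overline{\triangle Y} = \frac{1}{n} \sum_{i=1}^{n} \triangle Y_i = \frac{1}{n} \sum_{i=1}^{n} \left[ Y_i(X_i{=}\textrm{1}) - Y_i(X_i{=}\textrm{0}) \right] \]

Each term in the sum still contains one unobserved counterfactual outcome, so this average cannot be computed directly and has to be estimated.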
How can we obtain good approximations for the counterfactual outcomes?
We must find or create a situation in which the treated and untreated observations are, at the aggregate level, similar with respect to all the variables that might affect the outcome other than the treatment variable itself
Then, we can use the factual outcome of one group as a proxy for the counterfactual outcome of the other
The best way to accomplish this is by conducting a randomised experiment
A randomised experiment is a type of study design in which treatment assignment is randomised
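Once treatment is administered, we differentiate between:
treatment group: the individuals who received the treatment
control group: the individuals who did not receive the treatment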
In the voting experiment, what are the treatment and control groups?
Random treatment assignment makes the treatment and control groups on average identical to each other in all observed and unobserved pre-treatment characteristics
When treatment assignment is randomised, the only thing that distinguishes the treatment group from the control group, besides the treatment itself, is chance
If the treatment and control groups are comparable before the treatment is administered
we can use the factual outcome of one group as a proxy for the counterfactual outcome of the other
we can estimate the average treatment effect by calculating the difference-in-means estimator
\[ \bar{Y}_\text{treatment group} - \bar{Y}_\text{control group} \]
\(\bar{Y}_\text{treatment group}\): average outcome for the treatment group
\(\bar{Y}_\text{control group}\): average outcome for the control group
Only when the treatment and control groups are comparable does the difference-in-means estimator produce a valid estimate of the average treatment effect
\(\widehat{\textrm{average_effect}} = \bar{Y}_\text{treatment group} - \bar{Y}_\text{control group}\)
The “hat” on top of the name denotes that this is an estimate
\[ \overline{\textrm{voted}}_\text{treatment group} - \overline{\textrm{voted}}_\text{control group} \]
\(\overline{\textrm{voted}}_\text{treatment group}\): proportion of registered voters who voted among those who received the message
\(\overline{\textrm{voted}}_\text{control group}\): proportion of registered voters who voted among those who did not receive the message
Random Treatment Assignment Makes Treatment and Control Groups Comparable When Sample Size is Large Enough
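A small simulation can illustrate this claim. The sketch below uses simulated data, not the data from the voting experiment: the birth years, the seed, the sample sizes, and the helper `balance_gap()` are all hypothetical choices made for illustration.

```r
set.seed(269) # arbitrary seed so the sketch is reproducible

# Difference in mean birth year between two randomly assigned groups of size n/2
# (hypothetical helper, not part of any package)
balance_gap <- function(n) {
  # simulated pre-treatment characteristic
  birth <- sample(1920:1985, size = n, replace = TRUE)
  # random assignment: shuffle a vector that is half 0s and half 1s
  treated <- sample(rep(c(0, 1), length.out = n))
  mean(birth[treated == 1]) - mean(birth[treated == 0])
}

gap_small <- balance_gap(20)     # small sample: the gap can be several years
gap_large <- balance_gap(100000) # large sample: the gap should be close to 0

gap_small
gap_large
```

With a large sample, the two groups should have nearly the same average birth year even though birth year played no role in the assignment. The same logic applies to characteristics we do not observe.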
\[ \% - \% = p.p. \]
\[ \triangle\textrm{vshare} = \textrm{vshare}_{\textrm{final}} - \textrm{vshare}_{\textrm{initial}} = 60\% - 50\% = 10 \textrm{ p.p.} \]
Why not 10%? If someone told us that an initial vote share was 50% and that it increased by 10%, the final vote share would be _________ (instead of 60%)
\[ \textrm{vshare}_{\textrm{final}} = \textrm{vshare}_{\textrm{initial}} + \triangle \textrm{vshare} = 50\% + 5 \textrm{ p.p.} = 55\% \]
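The same arithmetic can be checked in R. This is a minimal sketch using the 50% vote share from the example above; the object names are made up for illustration.

```r
vshare_initial <- 50 # initial vote share, in percent

# An increase of 10 percentage points is added directly to the share
final_if_10_pp <- vshare_initial + 10

# An increase of 10 percent is relative to the initial share:
# 10% of 50% is 5 percentage points
final_if_10_percent <- vshare_initial + vshare_initial * 10 / 100

final_if_10_pp      # 60
final_if_10_percent # 55
```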
(Based on Alan S. Gerber, Donald P. Green, and Christopher W. Larimer. 2008. “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.” American Political Science Review, 102 (1): 33-48.)
Run setwd() with the path for your computer
OPTION 1: ifelse()
First, we need to learn how to use == and ifelse()
The operator == is used to create logical tests that evaluate whether the observations of a variable equal a particular value (the particular values should be in quotes if text but without quotes if numbers)
examples:
data$variable == 1
data$variable == "yes"
ifelse() creates the contents of a new variable based on the values of an existing one
requires three arguments, separated by commas, in the following order:
(1) logical test (using ==),
(2) return value if logical test is true,
(3) return value if logical test is false
example: ifelse(data$variable == "yes", 1, 0)
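Applied to the voting data, creating a binary variable called pressure could look like the sketch below. It assumes the data frame is called voting, as in the text. As stand-in data it uses only the six observations printed later in these slides, not the full dataset from the experiment.

```r
# Stand-in data: the six observations shown in these slides
voting <- data.frame(
  birth = c(1981, 1959, 1956, 1939, 1968, 1967),
  message = c("no", "no", "no", "yes", "no", "no"),
  voted = c(0, 1, 1, 1, 0, 0)
)

voting$pressure <- # stores return values in a new variable
  ifelse(voting$message == "yes", # logical test
         1, # return value if logical test is true
         0) # return value if logical test is false

head(voting) # shows the first six observations
```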
You need to run the code all at once (not line by line)
Remember that R will ignore anything that follows the # sign, until the end of the line
What would have happened had we not added voting$ in front of pressure on the first line of code above?
Whenever we create a new variable, we should make sure it was created correctly by looking at the first few observations of the dataframe again
Note that when message equals “yes”, pressure equals 1; and when message equals “no”, pressure equals 0
OPTION 2: case_when()
This is the tidyverse option, and uses piping (%>%)
It has a structure similar to ifelse(), but it is used inside mutate() and requires specifying the default value with a TRUE condition.
example:
data <- data %>% mutate(new_variable = case_when(variable == "yes" ~ 1, TRUE ~ 0))
This option is slightly longer, but it remains consistent with the use of tidyverse
You can choose either option; the results are the same.
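For the voting data, the tidyverse version could be sketched as follows. It assumes the dplyr package is installed and, as stand-in data, uses only the six observations printed in these slides. The variable is named pressure2 so it can be compared with the ifelse() version.

```r
library(dplyr) # provides %>%, mutate(), and case_when()

# Stand-in data: the six observations shown in these slides
voting <- data.frame(
  birth = c(1981, 1959, 1956, 1939, 1968, 1967),
  message = c("no", "no", "no", "yes", "no", "no"),
  voted = c(0, 1, 1, 1, 0, 0)
)

# OPTION 1, for comparison
voting$pressure <- ifelse(voting$message == "yes", 1, 0)

# OPTION 2: case_when() inside mutate()
voting <- voting %>%
  mutate(pressure2 = case_when(
    message == "yes" ~ 1, # value when the condition is true
    TRUE ~ 0              # default value for all other cases
  ))

head(voting)
identical(voting$pressure, voting$pressure2) # checks that both options agree
```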
birth message voted pressure pressure2
1 1981 no 0 0 0
2 1959 no 1 0 0
3 1956 no 1 0 0
4 1939 yes 1 1 1
5 1968 no 0 0 0
6 1967 no 0 0 0
pressure and pressure2 are the same.
\[ \overline{Y}_\text{treatment group} - \overline{Y}_\text{control group} \]
\(\overline{Y}_\text{treatment group}\): average outcome for the treatment group
\(\overline{Y}_\text{control group}\): average outcome for the control group
\[ \overline{\textrm{voted}}_\text{treatment group} - \overline{\textrm{voted}}_\text{control group} \]
mean(voting$voted) or summarise(mean = mean(voted)) compute the mean of voted for ALL the observations in the dataset
To compute the mean for each group separately, we can use the [] operator or the group_by() function
[] operator:
extracts a selection of observations from a variable
to its left, we specify the variable we want to subset
inside the square brackets, we specify the criteria of selection; we can specify a logical test using the relational operator ==; only the observations for which the test is true will be extracted
example: data$var1[data$var2==1] # extracts the observations of the variable var1 for which the variable var2 equals 1
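With the voting data, the [] operator could be used as in the sketch below. The stand-in data are only the six observations printed in these slides, so the resulting means illustrate the syntax and are not the results of the experiment. The names mean_treatment and mean_control are chosen here for illustration.

```r
# Stand-in data: the six observations shown in these slides
voting <- data.frame(
  birth = c(1981, 1959, 1956, 1939, 1968, 1967),
  message = c("no", "no", "no", "yes", "no", "no"),
  voted = c(0, 1, 1, 1, 0, 0)
)
voting$pressure <- ifelse(voting$message == "yes", 1, 0)

# mean of voted among those who received the message (treatment group)
mean_treatment <- mean(voting$voted[voting$pressure == 1])

# mean of voted among those who did not (control group)
mean_control <- mean(voting$voted[voting$pressure == 0])

mean_treatment
mean_control
```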
group_by() function:
It groups the observations according to the values of a variable
We then pass the grouped data to summarise() to compute the mean within each group
example:
data %>% group_by(var2) %>% summarise(mean = mean(var1))
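The same pattern applied to the voting data could look like this. It is a sketch that assumes dplyr is installed and again uses only the six observations printed in these slides, so the means are illustrative only.

```r
library(dplyr) # provides %>%, group_by(), and summarise()

# Stand-in data: the six observations shown in these slides
voting <- data.frame(
  birth = c(1981, 1959, 1956, 1939, 1968, 1967),
  message = c("no", "no", "no", "yes", "no", "no"),
  voted = c(0, 1, 1, 1, 0, 0)
)
voting$pressure <- ifelse(voting$message == "yes", 1, 0)

# one row per value of pressure, with the mean of voted in each group
group_means <- voting %>%
  group_by(pressure) %>%
  summarise(mean = mean(voted))

group_means
```

The result has one row for the control group (pressure equal to 0) and one for the treatment group (pressure equal to 1).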
Compute the mean of voted for the treatment and control groups, separately
Interpretation of the first mean?
Interpretation of the second mean?
Now, we can compute the difference-in-means estimator as the difference between the two means above:
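A minimal sketch of this computation with the [] operator is shown below. It uses only the six observations printed in these slides as stand-in data, so the number it produces is not the estimate from the experiment (these slides report 38% - 30% = 8 p.p.). The name diff_in_means is chosen here for illustration.

```r
# Stand-in data: the six observations shown in these slides
voting <- data.frame(
  birth = c(1981, 1959, 1956, 1939, 1968, 1967),
  message = c("no", "no", "no", "yes", "no", "no"),
  voted = c(0, 1, 1, 1, 0, 0)
)
voting$pressure <- ifelse(voting$message == "yes", 1, 0)

# difference-in-means estimator: treatment group mean minus control group mean
diff_in_means <- mean(voting$voted[voting$pressure == 1]) -
  mean(voting$voted[voting$pressure == 0])

diff_in_means       # difference between two proportions
diff_in_means * 100 # the same difference expressed in percentage points
```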
Direction, size, and unit of measurement of the effect?
increase because we are measuring a change in \(Y\) and the number is positive
percentage points because it is the result of subtracting two percentages: %-% = p.p. (because voted is binary)
8 (and not 0.08) because we need to multiply the number by 100 to turn it into p.p. (because voted is binary)
38% - 30% = 8 p.p.
Assuming that [the treatment and control groups are comparable] (a reasonable assumption because …), we estimate that [the treatment] [increases/decreases] [the outcome] by [size and unit of measurement of the effect], on average.
Assuming that registered voters who received the message are comparable to the registered voters who did not (a reasonable assumption because the data come from a randomised experiment), we estimate that receiving the message inducing social pressure increases the probability of voting by 8 percentage points, on average.
==, ifelse(), group_by(), summarise(), mean()