POL269 Political Research
Javier Sajuria
12.02.2024
The assessment is a research project that will take the shape of a 3-minute TikTok-style video. You will need to submit the video alongside a script and the R code. The maximum length is 2,000 words: this is a limit, not a goal.
The assessment will comprise a series of tasks, which will require you to manipulate data and produce output using RStudio.
Pro-tip: the practical tasks are the same ones you have done in the seminars.
MEASURE: To infer population characteristics via survey research
PREDICT: To make predictions
EXPLAIN: To estimate the causal effect of a treatment on an outcome
We often want to know the characteristics of a large population such as the residents of a country
Yet collecting data from every individual in the population is either prohibitively expensive or simply infeasible
In the UK, we try to collect data from each individual every ten years (the census)
We use surveys to collect data from a small subset of observations in order to understand the population
The subset of individuals chosen for study is called a sample
In survey research, it is vital for the sample to be representative of the population of interest
A representative sample accurately reflects the characteristics of the population from which it is drawn, that is, characteristics appear in the sample in similar proportions as in the population as a whole
If the sample is not representative, our inferences regarding the population characteristics based on the sample will be wrong
Random sampling makes the sample and the target population on average identical to each other in all observed and unobserved characteristics
Do not confuse random treatment assignment with random sampling: both use a random process, but for two very different reasons
Random treatment assignment means that the treatment is assigned at random
Random sampling means that individuals are selected at random from the population into the sample
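A minimal sketch of why random sampling tends to produce a representative sample. The population, the 30% share, the sample size, and the seed are all made up for illustration; they are not from the module's data.

```r
set.seed(269)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 100,000 individuals;
# 1 = has some characteristic, 0 = does not (true share set to about 30%)
population <- rbinom(100000, size = 1, prob = 0.3)

# Random sampling: every individual has the same chance of being selected
random_sample <- sample(population, size = 1000)

mean(population)     # proportion with the characteristic in the population
mean(random_sample)  # proportion in the sample, typically close to the value above
```

The two proportions will rarely be exactly equal, but with random selection they differ only by chance, not systematically.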
Suppose we have collected data from a sample, now what?
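To understand the content and distribution of each variable, we can create a table of frequencies, create a table of proportions, plot a histogram, and compute descriptive statistics (mean, median, standard deviation, and variance)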
Let’s return to the voting experiment
Unit of observation: Registered voters
Variables:
Variable | Description |
---|---|
birth | year of birth |
message | whether registered voter received the message (“yes” or “no”) |
voted | whether registered voter voted: 1 = yes; 0 = no |
Open RStudio
Download exercise_3.R from the module’s website and open it within RStudio
Run steps 1 through 3
Explore one variable at a time
## STEP 3. Look at the data
head(voting) # shows the first six observations
## birth message voted
## 1 1981 no 0
## 2 1959 no 1
## 3 1956 no 1
## 4 1939 yes 1
## 5 1968 no 0
## 6 1967 no 0
## what's the unit of observation?
## for each variable: type and unit of measurement?
## substantively interpret the first observation
The frequency table of a variable shows the values the variable takes and the number of times each value appears in the variable
R functions: table(), count()
The table of proportions of a variable shows the proportion of observations that take each value in the variable
The proportions in the table should add up to 1
R functions: prop.table(table()), mutate()
Interpretation?
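A short sketch of both tables in base R. To keep it self-contained, it uses only the six values of voted shown by head(voting) above rather than the full dataset, so the numbers are illustrative only.

```r
# The 'voted' values of the six observations shown by head(voting)
voted <- c(0, 1, 1, 1, 0, 0)

# Frequency table: number of times each value appears
freq <- table(voted)
freq

# Table of proportions: share of observations taking each value
props <- prop.table(freq)
props

# The proportions add up to 1
sum(props)
```

In this small subset, half of the observations are 0s and half are 1s, so half of these six registered voters voted.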
The histogram of a variable is the visual representation of its distribution through bins of different heights
The position of the bins along the x-axis indicates the interval of values
The height of the bins indicates the frequency (or count) of the interval of values
R functions: hist(), geom_histogram()
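A minimal base R sketch of a histogram, using only the six birth years shown by head(voting). The ten-year bins are an arbitrary choice made for this illustration.

```r
# The 'birth' values of the six observations shown by head(voting)
birth <- c(1981, 1959, 1956, 1939, 1968, 1967)

# Draw the histogram with ten-year bins (bin width is an assumption)
h <- hist(birth,
          breaks = seq(1930, 1990, by = 10),
          main = "Year of birth",
          xlab = "birth")

h$breaks  # edges of the bins along the x-axis
h$counts  # height of each bin: number of observations in the interval
```

hist() returns the bin edges and counts invisibly, so storing the result in an object is a convenient way to check what the plot is showing.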
example of change in centrality
example of change in spread
As we saw in Chapter 2, the mean of a variable equals the sum of the values across all observations divided by the total number of observations
What is the function in R?
Example:
Interpretations?
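As a sketch, the definition can be checked against mean() using the six observations shown by head(voting); the full dataset would give different numbers.

```r
# Values of the six observations shown by head(voting)
voted <- c(0, 1, 1, 1, 0, 0)
birth <- c(1981, 1959, 1956, 1939, 1968, 1967)

# Mean by definition: sum of the values divided by the number of observations
sum(voted) / length(voted)

# Same result with the built-in function
mean(voted)  # for a binary variable, the mean is the proportion of 1s
mean(birth)  # average year of birth, in years
```

Here mean(voted) is 0.5, which reads as 50% of these six registered voters having voted.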
The median of a variable is the value at the midpoint of the distribution that divides the data into two equal-size groups
When the variable contains an odd number of observations, the median is the middle value of the distribution
When the variable contains an even number of observations, the median is the average of the two middle values
Example: if \(X\) = {10, 4, 6, 8, 22}, what is the median of \(X\)?
First, we need to sort the values of \(X\) in ascending order (as they would be in the distribution):
{4, 6, 8, 10, 22}
The median is \(8\) because that is the value in the middle of the distribution
R function: median()
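The worked example can be reproduced in R. The even-length vector X_even is a hypothetical addition to illustrate the second rule.

```r
# Odd number of observations: the median is the middle value
X <- c(10, 4, 6, 8, 22)
sort(X)    # 4 6 8 10 22
median(X)  # 8

# Even number of observations (hypothetical example):
# the median is the average of the two middle values, 6 and 8
X_even <- c(10, 4, 6, 8)
median(X_even)  # 7
```

median() sorts the values internally, so there is no need to call sort() first.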
\[ \textrm{sd(X)} = \sqrt{\frac{\sum^n_{i=1}(X_i-\overline{X})^2}{n}} \]
\(sd(X)\) stands for the standard deviation of \(X\)
\(X_i\) is a particular observation of \(X\)
\(\overline{X}\) stands for the mean of \(X\)
\(n\) is the total number of observations in the variable
\(\sum^{n}_{i=1} (X_i-\overline{X})^2\) means the sum of all \((X_i-\overline{X})^2\) from \(i=1\) to \(i=n\)
Important
The standard deviation of a variable measures the average distance of the observations to the mean.
The standard deviation of a variable gives us a sense of the range of the data, especially when dealing with bell-shaped distributions known as normal distributions
In normal distributions, about 95% of the observations fall within two standard deviations from the mean
R function: sd()
If birth were normally distributed, about 95% of the registered voters in the voting experiment would have been born between 1927 and 1985
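A sketch that applies the formula above step by step, reusing the small vector from the median example. One thing to be aware of: R's sd() divides by \(n-1\) rather than \(n\), so its result is slightly larger than the formula's; the difference becomes negligible in large datasets.

```r
X <- c(10, 4, 6, 8, 22)
n <- length(X)

# Standard deviation following the formula above (divides by n)
sd_formula <- sqrt(sum((X - mean(X))^2) / n)
sd_formula

# R's built-in function (divides by n - 1)
sd_r <- sd(X)
sd_r

# The two differ only by a factor that depends on n
sd_r * sqrt((n - 1) / n)  # equals sd_formula
```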
The variance of a variable is the square of its standard deviation
R function: var()
Alternatively: sd()^2
We are usually better off using standard deviations as our measure of spread, as they are easier to interpret because they are in the same unit of measurement as the variable
If we are given a variance, we can compute the standard deviation by taking the square root of the variance.
What is the R function to compute a square root?
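A brief sketch of the link between the two measures, using a small made-up vector. The square root function in R is sqrt().

```r
X <- c(10, 4, 6, 8, 22)  # made-up values for illustration

var(X)        # variance, in squared units of X
sd(X)^2       # same value: the standard deviation squared
sqrt(var(X))  # back to the standard deviation, in the same units as X
sd(X)
```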
Imagine we have two variables:
X | Y | |
---|---|---|
4 | 2 | First, let’s plot this point: (\(x_1\), \(y_1\)) = (4, 2) |
8 | 5 | Now, let’s plot this point: (\(x_2\), \(y_2\)) = (8, 5) |
10 | 3 | Finally, let’s plot this point: (\(x_3\), \(y_3\)) = (10, 3) |
We can create the scatter plot by plotting one layer at a time:
ggplot()
ggplot(aes(x, y))
geom_point(), annotate(), geom_segment()
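A minimal sketch of the layered approach for the three points in the table, assuming the ggplot2 package is installed. The data frame name xy and the position of the text label are choices made for this illustration.

```r
library(ggplot2)  # assumes ggplot2 is installed

# The two variables from the table above
xy <- data.frame(X = c(4, 8, 10),
                 Y = c(2, 5, 3))

p <- ggplot(xy, aes(x = X, y = Y)) +  # layer 1: data and axes
  geom_point() +                      # layer 2: one dot per observation
  annotate("text", x = 4, y = 2.3,    # layer 3: label the first point
           label = "(4, 2)")
p
```

Each `+` adds one layer, so running the code with only the first line of the pipeline, then the first two, and so on shows the plot being built up step by step.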
Let’s use the data from Project STAR:
classtype reading math graduated
1 small 578 610 1
2 regular 612 612 1
3 regular 583 606 1
4 small 661 648 1
5 small 614 636 1
6 regular 610 603 0
Unit of observation?
Unit of measurement of reading and math?
What would be the code to create the scatter plot between reading and math, where reading is on the x-axis and math is on the y-axis?
Answer:
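One possible answer, sketched with ggplot2. The data frame name star is an assumption, and to keep the example self-contained it is rebuilt here from the six rows shown above rather than loaded from the full dataset.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Six rows shown above; 'star' is an assumed name for the data frame
star <- data.frame(
  classtype = c("small", "regular", "regular", "small", "small", "regular"),
  reading   = c(578, 612, 583, 661, 614, 610),
  math      = c(610, 612, 606, 648, 636, 603),
  graduated = c(1, 1, 1, 1, 1, 0)
)

# reading on the x-axis, math on the y-axis
star_plot <- ggplot(star, aes(x = reading, y = math)) +
  geom_point()
star_plot
```

In base R, plot(star$reading, star$math) would draw an equivalent scatter plot.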
The correlation between X and Y is written cor(X,Y) in mathematical notation
cor(X,Y) ranges from -1 to 1
The sign reflects the direction of the linear association:
cor(X,Y) > 0 if the slope of the line of best fit is positive
cor(X,Y) < 0 if the slope of the line of best fit is negative
The absolute value reflects the strength of the linear association:
|cor(X,Y)| = 0 if there is no linear association
|cor(X,Y)| = 1 if there is a perfect linear association
|cor(X,Y)| increases as the observations move closer to the line of best fit and the linear association becomes stronger
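These properties can be illustrated with small made-up vectors. The values of x and noise below are invented for this sketch.

```r
x <- 1:10

# Perfect positive linear association: all points on an upward-sloping line
y_pos <- 2 * x + 1
cor(x, y_pos)  # 1

# Perfect negative linear association: all points on a downward-sloping line
y_neg <- -3 * x + 5
cor(x, y_neg)  # -1

# Positive but imperfect association: points scattered around the line
noise <- c(2, -1, 3, -2, 1, -3, 2, -1, 1, -2)  # made-up deviations
y_noisy <- x + noise
cor(x, y_noisy)  # between 0 and 1
```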
Example of change in direction of the linear association between two variables:
positive linear association | negative linear association |
---|---|
positive correlation | negative correlation |
Example of change in strength of the linear association between two variables:
weak linear association | strong linear association |
---|---|
absolute value close to 0 | absolute value close to 1 |
R function: cor()
The order of the variables does not matter: cor(X,Y) = cor(Y,X)
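A quick check of the symmetry, using only the six Project STAR rows shown earlier; the correlation in the full dataset will differ.

```r
# reading and math scores from the six rows shown earlier
reading <- c(578, 612, 583, 661, 614, 610)
math    <- c(610, 612, 606, 648, 636, 603)

cor(reading, math)  # positive in this small subset
cor(math, reading)  # identical: the order of the variables does not matter
```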
Is the correlation what we expected?
sign of slope of best line? ___
strong linear association? ___
cor(X,Y) = ___
sign of slope of best line? ___
strong linear association? ___
cor(X,Y) = ___
sign of slope of best line? ___
strong linear association? ___
cor(X,Y) \(\approx\) ___
line of best fit is steeper in ___ (first/second) scatter plot
correlation is higher in ___ (first/second) scatter plot
A steeper line of best fit does not necessarily mean a higher correlation in absolute terms, or vice versa
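A sketch with made-up data: the first relationship has a shallow line of best fit but a perfect correlation, while the second has a steeper line of best fit but a weaker correlation. The vector noise is invented for illustration.

```r
x <- 1:10
noise <- c(2, -1, 3, -2, 1, -3, 2, -1, 1, -2)  # made-up deviations

y_flat  <- 0.5 * x             # shallow slope, points exactly on the line
y_steep <- 5 * x + 10 * noise  # steep slope, points scattered around the line

# Slopes of the lines of best fit
slope_flat  <- coef(lm(y_flat ~ x))[["x"]]
slope_steep <- coef(lm(y_steep ~ x))[["x"]]
c(slope_flat, slope_steep)

# Correlations
c(cor(x, y_flat), cor(x, y_steep))
```

The slope depends on the units of the variables, while the correlation depends on how tightly the points cluster around the line.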
If two variables have a correlation of zero, it does not necessarily mean that there is no relationship between them; it just means that there is no linear relationship between the two variables
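A made-up example of a perfect but nonlinear relationship whose correlation is zero: y is fully determined by x, yet the line of best fit is flat.

```r
x <- c(-2, -1, 0, 1, 2)
y <- x^2  # y depends entirely on x, but not linearly

cor(x, y)  # 0: no linear association, despite the clear relationship
```

A scatter plot of these two variables would show a U shape, which is why it helps to look at the plot and not only at the correlation coefficient.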
table(), count()
prop.table(table())
hist(), geom_histogram()
mean()
median()
sd()
var()