Week 4: Survey Research and Exploring Data

POL269 Political Research

Javier Sajuria

12.02.2024

Notes on the midterm assessment

The assessment is a research project that will take the form of a 3-minute TikTok-style video. You will need to submit the video alongside a script and the R code. The word limit is 2,000 words - this is a limit, not a goal.
- Instructions and data will be posted on February 16, and we will dedicate part of the lecture to discussing them
- You will have to submit on QMPlus; the assignment is due on November 9
The assessment will comprise a series of tasks, which will require you to manipulate data and produce output using RStudio
Pro-tip: the practical tasks are the same ones you have done in the seminars.

Why do we analyse data?

MEASURE: To infer population characteristics via survey research

What proportion of constituents support a particular policy?

PREDICT: To make predictions

Who is the most likely candidate to win an upcoming election?

EXPLAIN: To estimate the causal effect of a treatment on an outcome

What is the effect of small classrooms on student performance?

Plan for today

Sample vs. Population
Representative samples
Random Sampling
Random Treatment Assignment vs. Random Sampling
Exploring One Variable At a Time
- Table of frequencies
- Table of proportions
- Histogram
- Descriptive Statistics: mean, median, standard deviation, and variance
Exploring the Relationship Between Two Variables
- Scatter plots
- Correlations

Sample vs. Population

We often want to know the characteristics of a large population such as the residents of a country
Yet collecting data from every individual in the population is either prohibitively expensive or simply infeasible
In the UK, we try to collect data from each individual every ten years
- The 2021 census cost approximately £1 billion (the population at that time was around 60 million)
- This is not feasible for research purposes!
We use surveys to collect data from a small subset of observations in order to understand the population

The subset of individuals chosen for study is called a sample

In the UK, researchers typically survey only about 1,200 people to infer the characteristics of more than 35 million adult citizens (n=1,200, N=35 million)

Representative Samples

In survey research, it is vital for the sample to be representative of the population of interest
A representative sample accurately reflects the characteristics of the population from which it is drawn, that is, characteristics appear in the sample in similar proportions as in the population as a whole
If the sample is not representative, our inferences regarding the population characteristics based on the sample will be wrong

Are you a representative sample of UK residents?
Are you a representative sample of QMUL students?
Are you a representative sample of QMUL Politics and IR students?
Are you a representative sample of POL269 students?
What would be the best way to draw a representative sample of QMUL students?
- using random sampling
- get the list of all QMUL students and select n students at random

Random Sampling

The best way to draw a representative sample is to select individuals at random from the population
- This procedure is called random sampling

Random sampling makes the sample and the target population, on average, identical to each other in all observed and unobserved characteristics

Random sampling ensures that the sample is representative of the target population
- enabling us to infer valid population characteristics from the sample
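
A minimal sketch of this idea, using made-up numbers rather than module data. The population below is hypothetical (100,000 people, exactly 30% of whom support a policy), and the names `population` and `survey`, as well as the sample size of 1,200, are illustrative assumptions.

```r
## Hypothetical population: 1 = supports the policy, 0 = does not
population <- rep(c(1, 0), times = c(30000, 70000))
mean(population) # true proportion in the population: 0.3

## Draw a random sample of n = 1,200 (without replacement)
set.seed(269) # makes the random draw reproducible
survey <- sample(population, size = 1200)

## The sample proportion should be close to the population proportion
mean(survey)
```

Any single sample will differ slightly from the population; it is on average, across many random samples, that the two match.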

Random Treatment Assignment vs. Random Sampling

Do not confuse them with each other: they both use a random process but for two very different reasons
Random treatment assignment means that the treatment is assigned at random
- makes treatment and control groups comparable
- enables us to produce valid estimates of the average treatment effect (using the difference-in-means estimator)
Random sampling means that individuals are selected at random from the population into the sample
- makes sample representative of the population
- enables us to infer valid population characteristics from the sample
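
The two procedures can be sketched side by side in R. Everything here is hypothetical: the list of 1,000 student IDs, the sample size of 100, and the names `student_id`, `sampled_id`, `treatment`, and `study` are assumptions made for illustration.

```r
## Hypothetical list of 1,000 student IDs (the population)
student_id <- 1:1000

set.seed(269) # makes the random draws reproducible

## Random sampling: select n = 100 students at random into the sample
sampled_id <- sample(student_id, size = 100)

## Random treatment assignment: within the sample, assign half of the
## students to treatment and half to control at random
treatment <- sample(rep(c("treatment", "control"), each = 50))

study <- data.frame(id = sampled_id, group = treatment)
table(study$group) # 50 in each group
```

The first call to `sample()` decides who is in the study; the second decides who receives the treatment. A study can use one, both, or neither.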

Exploring One Variable At a Time

Suppose we have collected data from a sample, now what?
To understand the content and distribution of each variable we can
- create a table of frequencies
- create a table of proportions
- create a histogram
- compute descriptive statistics
Let’s return to the voting experiment
- data collected from a sample of registered voters in the state of Michigan

The voting dataset

Unit of observation: Registered voters

Variables:

Variable	Description
birth	year of birth
message	whether registered voter received the message (“yes” or “no”)
voted	whether registered voter voted: 1= yes; 0=no

In-Class Exercise

Open RStudio
Download exercise_3.R from the module’s website and open it within RStudio
Run steps 1 through 3
Explore one variable at a time
- follow along

## STEP 1. Set the working directory (run only the line that matches your system)
setwd("~/Desktop/POL269") # example if Mac
setwd("C:/user/Desktop/POL269") # example if Windows

## STEP 2. Load the dataset
voting <- read.csv("voting.csv") # reads and stores data

## STEP 3. Look at the data
head(voting) # shows the first six observations
##   birth message voted
## 1  1981      no     0
## 2  1959      no     1
## 3  1956      no     1
## 4  1939     yes     1
## 5  1968      no     0
## 6  1967      no     0
## what's the unit of observation?
## for each variable: type and unit of measurement?
## substantively interpret the first observation

Table of Frequencies

The frequency table of a variable shows the values the variable takes and the number of times each value appears in the variable
R functions: table(), count()

library(tidyverse)

table(voting$voted)
## 
##      0      1 
## 158276  71168
# OR
voting %>%
  count(voted)
##   voted      n
## 1     0 158276
## 2     1  71168

Interpretation?

Table of Proportions

The table of proportions of a variable shows the proportion of observations that take each value in the variable
The proportions in the table should add up to 1

R functions: prop.table(table( )), mutate()

prop.table(table(voting$voted))
## 
##         0         1 
## 0.6898241 0.3101759
# OR
voting %>%
  count(voted) %>%
  mutate(proportion = n / sum(n))
##   voted      n proportion
## 1     0 158276  0.6898241
## 2     1  71168  0.3101759

Interpretation?

Histogram

The histogram of a variable is the visual representation of its distribution through bins of different heights
The position of the bins along the x-axis indicates the interval of values
The height of the bins indicates the frequency (or count) of the interval of values
R functions: hist(), geom_histogram()

hist(voting$birth)
# OR
voting %>%
  ggplot(aes(x=birth)) +
  geom_histogram()

Interpretation?

voting %>%
  ggplot(aes(x = birth)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Histogram of birth", x = "Year of birth", y = "Frequency")

Descriptive Statistics

The descriptive statistics of a variable numerically summarize the main characteristics of its distribution
- summarise the centre of the distribution
  - mean
  - median
- summarise the spread of the distribution (amount of variation of the distribution relative to its centre)
  - standard deviation
  - variance

example of change in centrality

example of change in spread

Mean

As we saw in Chapter 2, the mean of a variable equals the sum of the values across all observations divided by the total number of observations
What is the function in R?

Example:

mean(voting$birth)

[1] 1956.18

mean(voting$voted)

[1] 0.3101759

# OR
voting %>%
  summarise(mean_birth = mean(birth), mean_voted = mean(voted))

  mean_birth mean_voted
1    1956.18  0.3101759

Interpretations?
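
As a quick check of the definition, the mean can be computed by hand on a small made-up vector (`voted_example` is a hypothetical name and the values are not from the voting dataset):

```r
## Made-up binary variable: 1 = voted, 0 = did not vote
voted_example <- c(1, 0, 0, 1, 0)

## Mean by hand: sum of the values divided by the number of observations
sum(voted_example) / length(voted_example)

## Same result with mean()
mean(voted_example)
```

For a binary variable coded 0/1, the sum counts the 1s, so the mean equals the proportion of observations coded 1.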

Median

The median of a variable is the value at the midpoint of the distribution that divides the data into two equal-size groups
When the variable contains an odd number of observations, the median is the middle value of the distribution
When the variable contains an even number of observations, the median is the average of the two middle values

Example: if $X = \{10, 4, 6, 8, 22\}$, what is the median of $X$?
- First, we need to sort the values of $X$ in ascending order (as they would be in the distribution):
  
  $\{4, 6, 8, 10, 22\}$
- The median is $8$ because that is the value in the middle of the distribution
R function: median()

## compute medians
median(voting$birth) # of birth

[1] 1956

# OR
voting %>%
  summarise(median_birth = median(birth))

  median_birth
1         1956
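
The worked example for $X$ can be reproduced in R. The even-sized case uses an extra made-up value (30), added only to illustrate the averaging rule.

```r
X <- c(10, 4, 6, 8, 22)
sort(X)    # values in ascending order: 4 6 8 10 22
median(X)  # middle value: 8

## Even number of observations (made-up extra value of 30 added)
X_even <- c(X, 30)
sort(X_even)   # 4 6 8 10 22 30
median(X_even) # average of the two middle values, 8 and 10: 9
```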

Standard Deviation

The standard deviation of a variable is a measure of the spread of its distribution

\[ sd(X) = \sqrt{\frac{\sum^{n}_{i=1}(X_i-\overline{X})^2}{n}} \]

$sd(X)$ stands for the standard deviation of $X$
$X_i$ is a particular observation of $X$
$\overline{X}$ stands for the mean of $X$
$n$ is the total number of observations in the variable
$\sum^{n}_{i=1} (X_i{-}\overline{X})^2$ means the sum of all $(X_i{-}\overline{X})^2$ from $i{=}1$ to $i{=}n$
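
A small sketch of the formula, computed step by step on made-up values. One caveat worth knowing: R's built-in `sd()` divides by $n - 1$ rather than $n$, so it will not match the formula exactly on small samples. The names `sd_formula` and `sd_builtin` are illustrative.

```r
X <- c(10, 4, 6, 8, 22) # made-up values
n <- length(X)

## Standard deviation following the formula above (divides by n)
sd_formula <- sqrt(sum((X - mean(X))^2) / n)

## R's built-in sd() divides by n - 1 instead of n
sd_builtin <- sd(X)

sd_formula
sd_builtin

## The two are related by a factor of sqrt((n - 1) / n)
sd_builtin * sqrt((n - 1) / n)
```

With a large number of observations, as in the voting dataset, the difference between dividing by $n$ and by $n - 1$ is negligible.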

Important

The standard deviation of a variable measures the average distance of the observations to the mean.

The larger the standard deviation, the flatter the distribution
Which distribution has a larger standard deviation?

The standard deviation of a variable gives us a sense of the range of the data, especially when dealing with bell-shaped distributions known as normal distributions
In normal distributions, about 95% of the observations fall within two standard deviations of the mean

R function: sd()
```
sd(voting$birth)
```
```
[1] 14.46019
```
If birth were normally distributed, about 95% of the registered voters in the voting experiment would have been born between 1927 and 1985
- $\overline{X} - 2 \times sd(X) = 1956 - 2 \times 14.5 = 1927$
- $\overline{X} + 2 \times sd(X) = 1956 + 2 \times 14.5 = 1985$
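
The two-standard-deviation rule can be checked with a simulation. This is a sketch on simulated values, not the voting data: the draws come from `rnorm()` with a mean and standard deviation chosen to resemble birth, and `sim`, `lower`, and `upper` are hypothetical names.

```r
set.seed(269) # makes the simulation reproducible

## Simulate 100,000 draws from a normal distribution
## (mean and sd chosen to resemble birth; the draws are not real data)
sim <- rnorm(100000, mean = 1956, sd = 14.5)

lower <- mean(sim) - 2 * sd(sim)
upper <- mean(sim) + 2 * sd(sim)

## Proportion of simulated values within two standard deviations of the mean
mean(sim >= lower & sim <= upper) # roughly 0.95
```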

Variance

Another measure of spread of the distribution
The variance of a variable is simply the square of the standard deviation \[var(X) = sd(X)^2\]

$var(X)$ stands for the variance of $X$
$sd(X)$ stands for the standard deviation of $X$

R function: var()
```
var(voting$birth)
```
```
[1] 209.0971
```
Alternatively: sd()^2
```
sd(voting$birth)^2
```
```
[1] 209.0971
```
We are usually better off using standard deviations as our measure of spread, as they are easier to interpret because they are in the same unit of measurement as the variable
If we are given a variance, we can compute the standard deviation by taking the square root of the variance.

What is the R function to compute a square root?

Understanding How the Mean and the Standard Deviation of a Variable Change the Variable’s Distribution (link to interactive graph)

Understanding the relationship between two variables

We saw how to explore one variable at a time
- creating table of frequencies and/or proportions
- creating histograms
- computing descriptive statistics: mean, median, standard deviation, and variance
Most data analyses are about understanding the relationship between two variables
To explore the relationship between two variables we
- create scatter plots
- compute the correlation coefficient
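
As a first look at both tools, here is a minimal sketch using base R on made-up data. The vectors `x` and `y` are hypothetical and have nothing to do with the voting dataset.

```r
## Made-up data for five observations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

## Scatter plot: one point per observation
plot(x, y)

## Correlation coefficient: ranges from -1 to 1
cor(x, y)
```

A positive value indicates that larger values of `x` tend to go with larger values of `y`; a negative value indicates the opposite.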