Week 2: Data, Datasets, Computing Means

POL269 Political Research

Javier Sajuria

29.01.2024

Plan for Today

What are data/datasets?
- what is an observation?
- what is a variable?
Types of variables based on content
- character vs. numeric variables
- binary vs. non-binary variables
How to load and make sense of data in R
- new R functions and operators: setwd(), read.csv(), View(), head(), dim()

Plan for Today

Average or mean of a variable
- how to compute it?
- how to interpret it?
Practice in R
- how to access a variable in dataframe: data$variable
- how to compute means: mean()

What are data/datasets?

Datasets capture the characteristics of a particular set of individuals or entities:
- students, classrooms, schools, …
Datasets are typically organized as dataframes where rows are observations and columns are variables

Example of a dataframe
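
A minimal sketch of what such a dataframe could look like in R. The rows are made up for illustration (they are not the figure from the slide), and the object name `students` is an assumption; the unit of observation is taken to be students.

```r
# Hypothetical stand-in data: one row per student, one column per variable.
students <- data.frame(
  first_name  = c("ana", "elena", "maria"),
  test_scores = c(80, 75, 99)
)

head(students)  # shows the first rows of the dataframe (up to six)
dim(students)   # number of rows (observations), then number of columns (variables)
students[1, ]   # the first observation, i = 1
```

Here `dim()` reports rows first and columns second, which mirrors the "rows are observations, columns are variables" convention.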

What is an observation?

It is the information collected from a particular entity or individual in the study
The unit of observation of the dataset defines the individuals or the entities that each observation in the dataframe represents
- if unit of observation is students, each row in the dataframe represents a different student
We usually refer to an observation by its row number in the dataframe, which we denote as $i$
- what is the first observation ($i$=1) in the dataframe above?

What is a variable?

A variable contains the values of a changing characteristic for the various individuals or entities in the study
Every column of data in a dataframe is a variable
- if unit of observation is students, each variable captures a specific characteristic of the students, for all the students in the study
We usually refer to a variable by its name
- first_name, test_scores
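
As a rough sketch of how referring to a variable by name looks in R, the example below loads a small CSV file and then selects one column with `$`. The file contents are hypothetical and are written to a temporary file only so that the example runs on its own; with real data you would read your own file instead.

```r
# Hypothetical data saved to a temporary CSV file (an assumption for illustration).
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(first_name  = c("ana", "elena", "maria"),
             test_scores = c(80, 75, 99)),
  csv_path,
  row.names = FALSE
)

# With a real file, you would typically call setwd() first to point R at the
# folder that holds the file, and then pass the file name to read.csv().
data <- read.csv(csv_path)

names(data)        # the variable names (one per column)
data$test_scores   # the contents of one variable, selected by name
# View(data)       # opens the dataframe in a viewer; interactive sessions only
```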

Notation

When defining new variables, we represent a variable and its contents in the following format:

\[ X = \{10, 5, 8\} \]

On the left-hand side of the equal sign, we identify the name of the variable:
- what is the name of the variable here?
On the right-hand side of the equal sign and inside curly brackets, we have the content of the variable: multiple observations, separated by commas
- what are the observations in $X$?

Notation

\[ X = \{10, 5, 8\} \]

To represent each individual observation we use $X_i$, where $i$ stands for the observation number
- the subscript $i$ means that we have a different value of $X$ for each value of $i$
- what is $X_3$?
The total number of observations is denoted as $n$
- what does $n$ equal here?
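
One way to mirror this notation in R, as a small sketch: a variable becomes a vector, $X_i$ becomes indexing with square brackets, and $n$ is the length of the vector.

```r
X <- c(10, 5, 8)  # X = {10, 5, 8}

X[3]       # X_3: the third observation of X
length(X)  # n: the total number of observations in X
```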

Types of Variables Based on Content

Character vs. Numeric

Character variables contain text
- first_names = {ana, elena, maria, ...}
Numeric variables contain numbers
- test_score = {80, 75, 99, ...}
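
A quick way to check which type a variable has in R is `class()`. The sketch below uses only the values listed above; the elided values ("...") are left out.

```r
first_names <- c("ana", "elena", "maria")  # text values go in quotes
test_score  <- c(80, 75, 99)               # numbers are written without quotes

class(first_names)  # "character"
class(test_score)   # "numeric"
```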

Numeric: Binary variables

Binary variables can take only two values: 1s and 0s
They represent the presence/absence of a trait:
- 1 if individual $i$ has the trait
- 0 if individual $i$ does not have the trait
Example: voted = {1, 0, 0, 1, 1, 1, 0} where

\[ \textit{voted}_i = \begin{cases} 1 & \text{if individual } i \text{ voted} \\ 0 & \text{if individual } i \text{ didn't vote} \end{cases} \]

Can you think of another example?

Numeric: Non-binary variables

Non-binary variables can take more than two values
- distance = {1.452, 2.345, 0.298}
- dice_roll = {2, 4, 6}
Can you think of another example?
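
A possible way to tell the two kinds of numeric variable apart in R is to test whether every value is a 0 or a 1. The helper `is_binary()` below is a hypothetical name introduced for this sketch, not a built-in R function, and it only looks at the values actually observed.

```r
# Hypothetical helper: TRUE when every observed value is either 0 or 1.
is_binary <- function(x) all(x %in% c(0, 1))

voted     <- c(1, 0, 0, 1, 1, 1, 0)
distance  <- c(1.452, 2.345, 0.298)
dice_roll <- c(2, 4, 6)

is_binary(voted)      # TRUE
is_binary(distance)   # FALSE
is_binary(dice_roll)  # FALSE
```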

Average or Mean of a Variable: How to Compute it?

Sum the values across all observations and divide the result by the total number of observations

\[ \bar{X} = \frac{\sum^{n}_{i=1}{X_i}}{n} = \frac{X_1+X_2+...+X_n}{n} \]

$\bar{X}$ (pronounced X-bar) stands for the average of $X$
$\sum^{n}_{i=1} X_i$ stands for the sum of all $X_i$ (the observations of $X$) from $i$=1 to $i$=$n$, that is, from the first observation of $X$ to the last one ($\sum$ is the Greek capital letter sigma)
$X_i$ stands for a particular observation of $X$, where $i$ denotes the position of the observation and $n$ is the total number of observations in the variable

Example: if $X$={10, 4, 6, 8, 22}, then:
- $n$ = ?
- $\bar{X}$ = ?
Let’s compute it!

\[ \begin{aligned} \bar{X} &= \frac{\sum^{n}_{i=1}{X_i}}{n} = \frac{X_1 + X_2 + X_3 + X_4 + X_5}{5} \\ &= \frac{10 + 4 + 6 + 8 + 22}{5} = \frac{50}{5} = 10 \end{aligned} \]
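
The same computation can be sketched in R in two ways: by following the formula step by step with `sum()` and `length()`, or by calling `mean()` directly. Both should return the same value.

```r
X <- c(10, 4, 6, 8, 22)

n <- length(X)  # total number of observations
sum(X) / n      # the formula: sum of all X_i divided by n
mean(X)         # the built-in function for the same calculation
```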

Average or Mean of a Variable: How to Interpret it?

First, we need to figure out what the values of the variable are measured in
Whenever interpreting numeric results, you should make it clear whether the number is measured in points, percents, miles, kilometers, etc.
This is called the unit of measurement

Unit of Measurement of the Mean of a Variable

When the variable is non-binary, the mean should be interpreted as an average in the same unit of measurement as the values in the variable
Example: if $X$={10, 4, 6, 8, 22} and measured in miles
- $\bar{X}$ = ?
- what type of variable is $X$ (binary or non-binary)?
- shall we interpret $\bar{X}$ as an average or a proportion?
- unit of measurement of $\overline{X}$ = 10?

When the variable is binary, the mean should be interpreted as a proportion, expressed as a percentage after multiplying the result by 100
- Why?
- Because the mean of a binary variable is equivalent to the proportion of the observations that have the characteristic identified by the variable (i.e., that meet a criterion)

Example: if $X$={1, 1, 1, 0, 0, 0}, then:

\[ \begin{aligned} \bar{X} &= \frac{\sum^{n}_{i=1}{X_i}}{n} = \frac{X_1 + X_2 + X_3 + X_4 + X_5 + X_6}{6} \\ &= \frac{1 + 1 + 1 + 0 + 0 + 0}{6} = \frac{3}{6} = 0.5 \end{aligned} \]

what type of variable is $X$ (binary or non-binary)?
shall we interpret $\bar{X}$ as an average or a proportion?
interpretation of $\bar{X}$ = 0.5 (including units)?
- 50% of the observations are 1s, that is, have the characteristic identified by $X$ (0.5 × 100 = 50%)
- note that the fraction $\frac{3}{6}$ is equivalent to the proportion of the observations that are 1s

The proportion of observations in a variable that meet a criterion is calculated as:

\[ \frac{\textrm{number of observations that meet criterion}}{\textrm{total number of observations}} \]

Example: if $X$={1, 1, 1, 0, 0, 0}, the proportion of observations in $X$ that are 1s is:
- $\frac{\textrm{3}}{\textrm{6}}$= 0.50
to interpret the result of this fraction as a percentage, we multiply the decimal by 100 (0.50 × 100 = 50%)
interpretation: 50% of the observations in $X$ are 1s
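
A short sketch in R of why the mean of a binary variable equals a proportion: counting the 1s and dividing by the number of observations gives the same number as `mean()`.

```r
X <- c(1, 1, 1, 0, 0, 0)

sum(X == 1) / length(X)  # observations that are 1s, divided by all observations
mean(X)                  # the mean of the binary variable: the same value
mean(X) * 100            # the proportion expressed as a percentage
```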

Today’s lecture

Data/datasets/dataframes
Observations and variables
Unit of observation
Character vs. numeric variables / Binary vs. non-binary variables
How to compute and interpret means
$\sum$
Unit of measurement

Why do we analyse data?

MEASURE: To infer population characteristics via survey research

what proportion of constituents support a particular policy?

PREDICT: To make predictions

who is the most likely candidate to win an upcoming election?

EXPLAIN: To estimate the causal effect of a treatment on an outcome

what is the effect of small classrooms on student performance?

We will progress from simple to more complex methods
We begin with EXPLAIN by learning how to estimate causal effects with randomized experiments
- involves relatively simple maths
Then, we will learn how to MEASURE the characteristics of an entire population from a sample of survey respondents
- visualizations, descriptive statistics, correlation
Then, we will learn how to PREDICT outcome variables
- simple linear regression
Then, we will return to EXPLAIN and estimate causal effects with observational data
- multiple linear regression

Next lecture

Causal effects
Randomized experiments
Difference-in-means estimator