Week 2: Data, Datasets, Computing Means
POL269 Political Research
Javier Sajuria
29.01.2024
Plan for Today
- What are data/datasets?
- what is an observation?
- what is a variable?
- Types of variables based on content
- character vs. numeric variables
- binary vs. non-binary variables
- How to load and make sense of data in R
- new R functions and operators:
setwd(), read.csv(), View(), head(), dim()
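A minimal sketch of how the functions listed above might fit together. The file name `students.csv`, the objects `folder`, `example_data`, `old_wd` and `students`, and the data themselves are hypothetical. The file is written to a temporary folder only so that the example is self-contained; in practice you would point setwd() to the folder that holds your own data file.

```r
# Hypothetical example data, written to a temporary folder so the sketch runs anywhere
folder <- tempdir()
example_data <- data.frame(
  name = c("ana", "elena", "maria"),
  test_score = c(80, 75, 99)
)
write.csv(example_data, file.path(folder, "students.csv"), row.names = FALSE)

# Set the working directory to the folder that contains the file
# (setwd() returns the previous working directory, which we keep to restore later)
old_wd <- setwd(folder)

# Read the CSV file and store its contents as a dataframe
students <- read.csv("students.csv")

# Open the dataframe in a spreadsheet-like viewer (only in an interactive session)
if (interactive()) View(students)

# Show the first rows of the dataframe (up to six by default)
head(students)

# Number of rows (observations) and number of columns (variables)
dim(students)

# Restore the previous working directory
setwd(old_wd)
```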
Plan for Today
- Average or mean of a variable
- how to compute it?
- how to interpret it?
- Practice in R
- how to access a variable in dataframe:
data$variable
- how to compute means:
mean()
Example of a dataframe
What is an observation?
It is the information collected from a particular entity or individual in the study
The unit of observation of the dataset defines the individuals or the entities that each observation in the dataframe represents
- if unit of observation is students, each row in the dataframe represents a different student
- We usually refer to an observation by the row number in the dataframe, which we denote as \(i\)
- what is the first observation (\(i\)=1) in the dataframe above?
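To make the idea of an observation concrete, here is a small sketch with a hypothetical dataframe called `students` (the names and scores are made up for illustration and are not the dataframe shown in the slide). The square-bracket row indexing is standard R, but it is an addition to the functions covered in this lecture.

```r
# Hypothetical dataframe: the unit of observation is students,
# so each row holds the information collected from one student
students <- data.frame(
  name = c("ana", "elena", "maria"),
  test_score = c(80, 75, 99)
)

# First observation (i = 1): everything recorded for the first student
students[1, ]

# Total number of observations (rows) in the dataframe
nrow(students)
```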
What is a variable?
- A variable contains the values of a changing characteristic for the various individuals or entities in the study
- Every column of data in a dataframe is a variable
- if unit of observation is students, each variable captures a specific characteristic of the students, for all the students in the study
- We usually refer to a variable by its name
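As a sketch of what referring to a variable by its name looks like in R, the example below uses a hypothetical dataframe `students` (made-up data) and the `data$variable` syntax from the plan for today.

```r
# Hypothetical dataframe: each column is a variable, each row is a student
students <- data.frame(
  name = c("ana", "elena", "maria"),
  test_score = c(80, 75, 99)
)

# Names of the variables (columns) in the dataframe
names(students)

# Access one variable by its name: the test scores of all the students
students$test_score
```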
Notation
- When defining new variables, we represent a variable and its contents in the following format:
\[
X = \{10, 5, 8\}
\]
- On the left-hand side of the equal sign, we identify the name of the variable:
- what is the name of the variable here?
- On the right-hand side of the equal sign and inside curly brackets, we have the content of the variable: multiple observations, separated by commas
- what are the observations in \(X\)?
Notation
\[
X = \{10, 5, 8\}
\]
- To represent each individual observation we use \(X_i\), where \(i\) stands for the observation number
- the subscript \(i\) means that we have a different value of \(X\) for each value of \(i\)
- what is \(X_3\)?
- The total number of observations is denoted as \(n\)
- what does \(n\) equal here?
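The notation above maps fairly directly onto R. This is only a sketch: `c()` creates the variable, square brackets pick out the observation in position \(i\), and `length()` counts the observations. These three functions are standard R but are not among the functions listed for this lecture.

```r
# X = {10, 5, 8}
X <- c(10, 5, 8)

# X_i: the observation in position i, here i = 3
X[3]

# n: the total number of observations in X
length(X)
```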
Types of Variables Based on Content
Character vs. Numeric
- Character variables contain text
first_names = {ana, elena, maria, ...}
- Numeric variables contain numbers
test_score = {80, 75, 99, ...}
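One way to check the type of a variable in R is `class()`, which is not among the functions listed for this lecture. The sketch below uses only the first three values shown above, since the remaining values are not given.

```r
# Character variable: contains text, so values are written in quotes
first_names <- c("ana", "elena", "maria")

# Numeric variable: contains numbers
test_score <- c(80, 75, 99)

# class() reports the type of content stored in each variable
class(first_names)
class(test_score)
```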
Numeric: Binary variables
- Binary variables can take only two values: 1s and 0s
- They represent the presence/absence of a trait:
- 1 if individual \(i\) has the trait
- 0 if individual \(i\) does not have the trait
- Example: voted = {1, 0, 0, 1, 1, 1, 0} where
\[
\textit{voted}_i = \begin{cases} 1 & \text{if individual } i \text{ voted} \\
0 & \text{if individual } i \text{ didn't vote} \end{cases}
\]
- Can you think of another example?
Numeric: Non-binary variables
- Non-binary variables can take more than two values
- distance = {1.452, 2.345, 0.298}
- dice_roll = {2, 4, 6}
- Can you think of another example?
Average or Mean of a Variable: How to Compute it?
- Sum the values across all observations and divide the result by the total number of observations
\[
\bar{X} = \frac{\sum^{n}_{i=1}{X_i}}{n} = \frac{X_1+X_2+...+X_n}{n}
\]
\(\overline{X}\) (pronounced X-bar) stands for the average of \(X\)
\(\sum^{n}_{i=1} X_i\) stands for the sum of all \(X_i\) (observations of \(X\)) from \(i=1\) to \(i=n\), meaning from the first observation of the variable \(X\) to the last one (\(\sum\) is the Greek capital letter sigma)
\(X_i\) stands for a particular observation of \(X\), where \(i\) denotes the position of the observation and \(n\) is the total number of observations in the variable
- Example: if \(X\)={10, 4, 6, 8, 22}, then:
- \(n\) = ?
- \(\bar{X}\) = ?
- Let’s compute it!
\[
\begin{aligned}
\bar{X} &= \frac{\sum^{n}_{i=1}{X_i}}{n} = \frac{X_1 + X_2 + X_3 + X_4 + X_5}{5} \\
&= \frac{10 + 4 + 6 + 8 + 22}{5} = \frac{50}{5} = 10
\end{aligned}
\]
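The same computation can be sketched in R, first by following the formula step by step and then with `mean()`. The functions `c()`, `length()` and `sum()` are standard R but are additions to the functions listed for this lecture.

```r
# X = {10, 4, 6, 8, 22}
X <- c(10, 4, 6, 8, 22)

# n: total number of observations
n <- length(X)

# Following the formula: sum of all observations divided by n
sum(X) / n

# The same result using mean()
mean(X)
```

Both approaches give the same number; `mean()` simply does the sum and the division in one step.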
Average or Mean of a Variable: How to Interpret it?
- First, we need to figure out the quantity in which the value is measured
- Whenever interpreting numeric results, you should make it clear whether the number is measured in points, percents, miles, kilometers, etc.
- This is called the unit of measurement
Unit of Measurement of the Mean of a Variable
- When the variable is non-binary, the mean should be interpreted as an average in the same unit of measurement as the values in the variable
- Example: if \(X\)={10, 4, 6, 8, 22} and measured in miles
- \(\bar{X}\) = ?
- what type of variable is \(X\) (binary or non-binary)?
- should we interpret \(\bar{X}\) as an average or a proportion?
- unit of measurement of \(\overline{X}\) = 10?
- When the variable is binary, the mean should be interpreted as a proportion, expressed in % after multiplying the result by 100
- Why?
- Because the mean of a binary variable is equivalent to the proportion of the observations that have the characteristic identified by the variable (i.e., that meet a criterion)
- Example: if \(X\)={1, 1, 1, 0, 0, 0}, then:
\[
\begin{aligned}
\bar{X} &= \frac{\sum^{n}_{i=1}{X_i}}{n} = \frac{X_1 + X_2 + X_3 + X_4 + X_5 + X_6}{6} \\
&= \frac{1 + 1 + 1 + 0 + 0 + 0}{6} = \frac{3}{6} = 0.5
\end{aligned}
\]
- what type of variable is \(X\) (binary or non-binary)?
- should we interpret \(\bar{X}\) as an average or a proportion?
- interpretation of \(\bar{X}\) = 0.5 (including units)?
- 50% of the observations are 1s, that is, have the characteristic identified by \(X\) (0.5 × 100 = 50%)
- note that the fraction \(\frac{3}{6}\) is equivalent to the proportion of the observations that are 1s
- The proportion of observations in a variable that meet a criterion is calculated as:
\[
\frac{\textrm{number of observations that meet criterion}}{\textrm{total number of observations}}
\]
- Example: if \(X\)={1, 1, 1, 0, 0, 0}, the proportion of observations in \(X\) that are 1s is:
- \(\frac{3}{6}\) = 0.50
- to interpret the result of this fraction as a percentage, we multiply the decimal by 100 (0.50 × 100 = 50%)
- interpretation: 50% of the observations in \(X\) are 1s
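A short sketch in R of why the mean of a binary variable equals the proportion of 1s. The comparison `X == 1` and the functions `sum()` and `length()` are standard R but are additions to the functions listed for this lecture.

```r
# X = {1, 1, 1, 0, 0, 0}
X <- c(1, 1, 1, 0, 0, 0)

# Proportion computed directly:
# number of observations that meet the criterion (are 1s) / total number of observations
sum(X == 1) / length(X)

# The mean of the binary variable gives the same proportion
mean(X)

# Multiply by 100 to interpret the proportion as a percentage
mean(X) * 100
```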
Today’s lecture
- Data/datasets/dataframes
- Observations and variables
- Unit of observation
- Character vs. numeric variables / Binary vs. non-binary variables
- How to compute and interpret means
- \(\sum\)
- Unit of measurement
Why do we analyse data?
MEASURE: To infer population characteristics via survey research
- what proportion of constituents support a particular policy?
PREDICT: To make predictions
- who is the most likely candidate to win an upcoming election?
EXPLAIN: To estimate the causal effect of a treatment on an outcome
- what is the effect of small classrooms on student performance?
- We will progress from simple to more complex methods
- We begin with EXPLAIN by learning how to estimate causal effects with randomized experiments
- involves relatively simple maths
- Then, we will learn how to MEASURE the characteristics of an entire population from a sample of survey respondents
- visualizations, descriptive statistics, correlation
- Then, we will learn how to PREDICT outcome variables
- Then, we will return to EXPLAIN and estimate causal effects with observational data
- multiple linear regression
Next lecture
- Causal effects
- Randomized experiments
- Difference-in-means estimator