Week 2: Data, Datasets, Computing Means

POL269 Political Research

Javier Sajuria

29.01.2024

Plan for Today

  • What are data/datasets?
    • what is an observation?
    • what is a variable?
  • Types of variables based on content
    • character vs. numeric variables
    • binary vs. non-binary variables
  • How to load and make sense of data in R
    • new R functions and operators: setwd(), read.csv(), View(), head(), dim()
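
A minimal sketch of how these functions fit together. The file name students.csv and its contents are hypothetical; the file is written to a temporary folder only so that the example runs on its own.

```r
# Hypothetical example data, written to a temporary file so the sketch is self-contained.
csv_path <- file.path(tempdir(), "students.csv")
writeLines(c("first_name,test_scores",
             "ana,80",
             "elena,75",
             "maria,99"),
           csv_path)

# setwd() sets the working directory, i.e. the folder where R looks for files.
# It returns the previous working directory, which we keep so we can restore it.
old_wd <- setwd(tempdir())

# read.csv() reads the CSV file and stores its contents as a dataframe.
students <- read.csv("students.csv")

# head() shows the first rows of the dataframe (up to six by default).
head(students)

# dim() returns the number of rows (observations) and columns (variables).
dim(students)

# View(students) would open the dataframe in a spreadsheet-style viewer;
# it is commented out here because it only works in an interactive session.

# Restore the previous working directory.
setwd(old_wd)
```

In practice, setwd() would point to the folder on your own computer that holds the dataset.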

Plan for Today

  • Average or mean of a variable
    • how to compute it?
    • how to interpret it?
  • Practice in R
    • how to access a variable in a dataframe: data$variable
    • how to compute means: mean()

What are data/datasets?

  • Datasets capture the characteristics of a particular set of individuals or entities:

    • students, classrooms, schools, …
  • Datasets are typically organized as dataframes where rows are observations and columns are variables

Example of a dataframe

What is an observation?

  • It is the information collected from a particular entity or individual in the study

  • The unit of observation of the dataset defines the individuals or the entities that each observation in the dataframe represents

    • if unit of observation is students, each row in the dataframe represents a different student
    • We usually refer to an observation by the row number in the dataframe, which we denote as \(i\)
    • what is the first observation (\(i\)=1) in the dataframe above?

What is a variable?

  • A variable contains the values of a changing characteristic for the various individuals or entities in the study
  • Every column of data in a dataframe is a variable
    • if unit of observation is students, each variable captures a specific characteristic of the students, for all the students in the study
  • We usually refer to a variable by its name
    • first_name, test_scores
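
As a rough illustration of rows as observations and columns as variables, the sketch below builds a small dataframe in R. The dataframe name students and its values are made up for this example; only the variable names first_name and test_scores come from the slide.

```r
# A hypothetical dataframe where the unit of observation is students.
students <- data.frame(
  first_name  = c("ana", "elena", "maria"),
  test_scores = c(80, 75, 99)
)

# Each row is one observation; this is the first observation (i = 1).
students[1, ]

# Each column is one variable; we refer to it by its name using $.
students$test_scores

# Names of all the variables in the dataframe.
names(students)

# Number of observations (rows) and number of variables (columns).
nrow(students)
ncol(students)
```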

Notation

  • When defining new variables, we represent a variable and its contents in the following format:

\[ X = \{10, 5, 8\} \]

  • On the left-hand side of the equal sign, we identify the name of the variable:
    • what is the name of the variable here?
  • On the right-hand side of the equal sign and inside curly brackets, we have the content of the variable: multiple observations, separated by commas
    • what are the observations in \(X\)?

Notation

\[ X = \{10, 5, 8\} \]

  • To represent each individual observation we use \(X_i\), where \(i\) stands for the observation number
    • the subscript \(i\) means that we have a different value of \(X\) for each value of \(i\)
    • what is \(X_3\)?
  • The total number of observations is denoted as \(n\)
    • what does \(n\) equal here?
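
The same notation can be mirrored in R. This is only a sketch: storing the variable in a vector called X is our choice for the example.

```r
# The variable X = {10, 5, 8} stored as a numeric vector.
X <- c(10, 5, 8)

# X[i] plays the role of X_i: the observation in position i.
X[3]

# length() gives n, the total number of observations.
n <- length(X)
n
```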

Types of Variables Based on Content

Character vs. Numeric

  • Character variables contain text
    • first_names = {ana, elena, maria, ...}
  • Numeric variables contain numbers
    • test_score = {80, 75, 99, ...}

Numeric: Binary variables

  • Binary variables can take only two values: 1s and 0s
  • They represent the presence/absence of a trait:
    • 1 if individual \(i\) has the trait
    • 0 if individual \(i\) does not have the trait
  • Example: voted = {1, 0, 0, 1, 1, 1, 0} where

\[ \textit{voted}_i = \begin{cases} 1 & \text{if individual } i \text{ voted} \\ 0 & \text{if individual } i \text{ didn't vote} \end{cases} \]

  • Can you think of another example?

Numeric: Non-binary variables

  • Non-binary variables can take more than two values
    • distance = {1.452, 2.345, 0.298}
    • dice_roll = {2, 4, 6}
  • Can you think of another example?
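
One way to check these types in R is sketched below, using the example values from the slides. The helper is_binary() is hypothetical (it is not a built-in R function) and is written here only to illustrate the definition of a binary variable.

```r
# Example variables taken from the slides.
first_names <- c("ana", "elena", "maria")
test_score  <- c(80, 75, 99)
voted       <- c(1, 0, 0, 1, 1, 1, 0)

# class() reports whether a variable is stored as text or as numbers.
class(first_names)  # "character"
class(test_score)   # "numeric"

# Hypothetical helper: a variable is binary if it is numeric
# and every value is either 0 or 1.
is_binary <- function(x) {
  is.numeric(x) && all(x %in% c(0, 1))
}

is_binary(voted)        # TRUE
is_binary(test_score)   # FALSE
is_binary(first_names)  # FALSE
```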

Average or Mean of a Variable: How to Compute it?

  • Sum the values across all observations and divide the result by the total number of observations

\[ \bar{X} = \frac{\sum^{n}_{i=1}{X_i}}{n} = \frac{X_1+X_2+...+X_n}{n} \]

  • \(\overline{X}\) (pronounced X-bar) stands for the average of \(X\)

  • \(\sum^{n}_{i=1} X_i\) stands for the sum of all \(X_i\) (observations of \(X\)) from \(i=1\) to \(i=n\), meaning from the first observation of the variable \(X\) to the last one (\(\sum\) is the Greek capital letter sigma)

  • \(X_i\) stands for a particular observation of \(X\), where \(i\) denotes the position of the observation and \(n\) is the total number of observations in the variable

  • Example: if \(X\)={10, 4, 6, 8, 22}, then:
    • \(n\) = ?
    • \(\bar{X}\) = ?
  • Let’s compute it!

\[ \begin{aligned} \bar{X} &= \frac{\sum^{n}_{i=1}{X_i}}{n} = \frac{X_1 + X_2 + X_3 + X_4 + X_5}{5} \\ &= \frac{10 + 4 + 6 + 8 + 22}{5} = \frac{50}{5} = 10 \end{aligned} \]
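
The same calculation can be sketched in R, first by following the formula step by step and then with the built-in mean() function. The vector name X is chosen to match the notation above.

```r
# The variable X = {10, 4, 6, 8, 22} stored as a numeric vector.
X <- c(10, 4, 6, 8, 22)

# n is the total number of observations.
n <- length(X)

# Following the formula: sum all observations, then divide by n.
sum(X) / n   # 10

# mean() performs the same computation in one step.
mean(X)      # 10
```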

Average or Mean of a Variable: How to Interpret it?

  • First, we need to figure out the quantity in which the value is measured
  • Whenever interpreting numeric results, you should make it clear whether the number is measured in points, percents, miles, kilometers, etc.
  • This is called the unit of measurement

Unit of Measurement of the Mean of a Variable

  • When the variable is non-binary, the mean should be interpreted as an average in the same unit of measurement as the values in the variable
  • Example: if \(X\)={10, 4, 6, 8, 22} and measured in miles
    • \(\bar{X}\) = ?
    • what type of variable is \(X\) (binary or non-binary)?
    • shall we interpret \(\bar{X}\) as an average or a proportion?
    • unit of measurement of \(\overline{X}\) = 10?

  • When the variable is binary, the mean should be interpreted as a proportion, which we express as a percentage after multiplying the result by 100
    • Why?
    • Because the mean of a binary variable is equivalent to the proportion of the observations that have the characteristic identified by the variable (i.e., that meet a criterion)

  • Example: if \(X\)={1, 1, 1, 0, 0, 0}, then:

\[ \begin{aligned} \bar{X} &= \frac{\sum^{n}_{i=1}{X_i}}{n} = \frac{X_1 + X_2 + X_3 + X_4 + X_5 + X_6}{6} \\ &= \frac{1 + 1 + 1 + 0 + 0 + 0}{6} = \frac{3}{6} = 0.5 \end{aligned} \]

  • what type of variable is \(X\) (binary or non-binary)?

  • shall we interpret \(\bar{X}\) as an average or a proportion?

  • interpretation of \(\bar{X}\) = 0.5 (including units)?

    • 50% of the observations are 1s, that is, have the characteristic identified by \(X\) (\(0.5 \times 100 = 50\%\))
    • note that the fraction \(\frac{3}{6}\) is equivalent to the proportion of the observations that are 1s

  • The proportion of observations in a variable that meet a criterion is calculated as:

\[ \frac{\textrm{number of observations that meet criterion}}{\textrm{total number of observations}} \]

  • Example: if \(X\)={1, 1, 1, 0, 0, 0}, the proportion of observations in \(X\) that are 1s is:

    • \(\frac{3}{6} = 0.50\)
  • to interpret the result of this fraction as a percentage, we multiply the decimal by 100 (\(0.50 \times 100 = 50\%\))

  • interpretation: 50% of the observations in \(X\) are 1s
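
A short R sketch of the same idea: for a binary variable, the mean and the proportion of 1s are the same number. The vector name X follows the notation above.

```r
# A binary variable: 1 if the observation has the trait, 0 if it does not.
X <- c(1, 1, 1, 0, 0, 0)

# Proportion of observations that meet the criterion (are 1s):
# number of 1s divided by the total number of observations.
sum(X == 1) / length(X)   # 0.5

# The mean of a binary variable gives the same result.
mean(X)                   # 0.5

# Multiply by 100 to express the proportion as a percentage.
mean(X) * 100             # 50
```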

Today’s lecture

  • Data/datasets/dataframes
  • Observations and variables
  • Unit of observation
  • Character vs. numeric variables / Binary vs. non-binary variables
  • How to compute and interpret means
  • \(\sum\)
  • Unit of measurement

Why do we analyse data?

MEASURE: To infer population characteristics via survey research

  • what proportion of constituents support a particular policy?

PREDICT: To make predictions

  • who is the most likely candidate to win an upcoming election?

EXPLAIN: To estimate the causal effect of a treatment on an outcome

  • what is the effect of small classrooms on student performance?

  • We will progress from simple to more complex methods
  • We begin with EXPLAIN by learning how to estimate causal effects with randomized experiments
    • involves relatively simple maths
  • Then, we will learn how to MEASURE the characteristics of an entire population from a sample of survey respondents
    • visualizations, descriptive statistics, correlation
  • Then, we will learn how to PREDICT outcome variables
    • simple linear regression
  • Then, we will return to EXPLAIN and estimate causal effects with observational data
    • multiple linear regression

Next lecture

  • Causal effects
  • Randomized experiments
  • Difference-in-means estimator