Week 1: Introduction and first steps into data analysis

POL269 Political Data Research

Javier Sajuria

22.01.2024

Dr Javier Sajuria

Reader in Comparative Politics

📧 : j.sajuria@qmul.ac.uk

💻 : www.sajuria.com

📍 : ArtsOne 2.29

A&F hours: Mondays 3.30pm - 4.30pm. Book via Calendly or by appointment

Dr Elizabeth Simon

Postdoctoral Researcher in British Politics

📧 : e.simon@qmul.ac.uk

💻 : QM profile

📍 : TBC

A&F hours: TBC

Plan for today

First Part

  • What are the course goals?

  • How will your grade be determined?

  • What resources will be available to you?

  • Syllabus review

  • Questions?

  • Introductions

Second Part

  • Become familiar with RStudio

  • Become familiar with R

    • Do calculations: +, -, *, /

    • Create objects: <-, ""

    • Use functions: (), sqrt(), #

What are the module goals?

  • This is a very applied class
    • We are going to teach you practical skills
  • From the very beginning, we will use
    • the programming language R
    • to analyze real-world data
    • for the purpose of answering social and political questions
  • In the process, you will learn
    • statistics
    • how to code in R

What are the module goals?

  • This course will teach you how to
    • analyze data and
    • evaluate someone else’s data analysis
  • Why do political scientists need to learn how to analyse data and evaluate data analyses? (Why are you required to take POL269 as a Politics/IR/Sociology major?)
    • Most political decisions are made based on data
    • If you cannot do the analysis yourself, you need to at least know how to distinguish a good quantitative study from a poorly conducted one

Statistics and Coding

  • For the purpose of analyzing data, I am going to have to teach you statistics and coding 😱😱😱
  • Do not panic! The course is set up so that anyone can do well in it. I assume no prior knowledge of either
  • Good news! These are very practical skills, currently in high demand for well-paid jobs

How are you going to be taught?

Lectures

  • Lectures are required and run on Mondays from 14.00-16.00
  • They will take 2 hours, with some breaks in the middle

Seminars

  • Seminars are mandatory. They will take 50 minutes.

  • They will take place on Mondays and Tuesdays; check your timetable to find out the right time and place

How will your grade be determined?

  • 40% of the course mark is based on a TikTok video

  • 60% of the course mark is based on a final take-home research project

  • The two assignments will require you to:

    1. understand the theoretical concepts
    2. answer applied questions
    3. work with R
  • Details will follow during the term

What resources will be available to you?

  • Course Website
  • QMPlus
  • Textbook
  • Lizzie
  • Me!
  • Assigned Partner

Course website

  • Course website is: pol269.sajuria.com

  • This is where you will find lecture slides, seminar activities, and solutions.

QMPlus

  • QMPlus access is essential for this course

  • Students should be automatically enrolled, but please let me know if you have problems accessing it

    • All substantive questions should be asked via the forum at QMPlus. The tutors will respond as soon as possible.
    • All administrative questions should be asked either before/after lecture or in office hours
    • When possible, avoid e-mails. There is a forum on QMPlus for substantive questions.

Textbook

Elena Llaudet and Kosuke Imai. Data Analysis for Social Science: A Friendly and Practical Introduction, Princeton University Press, 2022

  • Teaches from scratch R, statistics, and the fundamentals of quantitative social science
  • Specifically written for beginners—it assumes no prior knowledge of statistics or coding and only minimal knowledge of math
  • Learn-by-doing textbook – it walks you through exercises
    • please do the exercises on your own computer as you do the readings
    • the folder with files to follow along is on the course website
  • You are expected to complete the required readings prior to attending lecture

Advice on the Readings

  • When doing the readings:

    • Pay close attention to the TIPs; they are written especially for students with no prior experience

    • Skip the FORMULAS IN DETAIL, unless I tell you otherwise; they contain more advanced-level material

    • Pay special attention to the key concepts and R operators and functions

Assigned Partner

  • When working on seminars and other exercises, you are required to collaborate with your assigned partner

  • In the next week, I will send an email to each of you introducing you to your assigned partner

    • partners are chosen based on an algorithm
  • Collaboration can be in person, via Zoom, by phone, email, text, pigeons…

  • The important thing is that you can ask and answer each other’s questions on a regular basis

Important Module Information (aka Syllabus)

Take a few minutes to look over the important module information on QMPlus. Notice that:

  1. attendance is mandatory
  2. you are expected to come to class on time and ready to participate (having done the readings due on that day)

Syllabus: Tentative Course Schedule

  • For most classes, you are asked to do some readings ahead of time and follow along with the exercises in the book on your own computer
    • make sure to focus on the key concepts and R operators/functions listed on the syllabus for that day

CAUTION: Cumulative Material

  • Material is cumulative: lectures later in the course assume that you know what was covered earlier in the course

  • Make sure to keep up with the material and take some time to review each week!

  • If you miss class, make sure to watch the recording

    • class recordings are usually posted on the course website 24 hours after class

Workload

  • It’s a 15-credit course: You are expected to work an average of 10 hours per week
    • to do well in this course, you really need to
  • This is not an easy course
    • Please take it seriously from the beginning!
    • You do not need to have a strong math background to do well but you need to put in time and effort

Workload

How are you supposed to fill those 10 hours of work a week?

  1. Attend all seminars and lectures

  2. Review old material: make sure you understand everything we have already covered by reviewing all previous lecture notes in sequence

  3. Learn new material: do the new readings, following along with the exercises on your own computer, and attend lectures and seminars

Questions?

Meet Your Classmates

  • There are going to be plenty of resources available to you
    • Probably the best one: your classmates
    • So… let’s get to know each other!
  • Please pair up and introduce yourselves. Find out:
    • name
    • current favorite TV show and/or activity
  • In a few minutes, some of you will be asked to introduce your partner to the rest of the class

R and RStudio

  • As part of Homework #0, you should have installed two programs on your computer:

    • R and RStudio
  • R is the statistical program that will perform calculations and create graphics for us (it’s the engine)

  • RStudio is the user-friendly interface that we will use to communicate with R

  • We will never open R directly; we will always start by opening RStudio (RStudio will open R by itself)

RStudio

  • Go ahead and open RStudio

  • Then, open a new R script:

    • dropdown menu: File > New File > R Script
  • What is an R script?

    • type of file we use to store the code we write to analyse data

RStudio Layout

  • R Script (upper left window): where we write and run code

  • R Console (lower left window): where R displays the executed code and its outputs, including errors

  • Environment (upper right window): storage room of current R session; lists objects that we have created

  • Help and Plots tabs (lower right window)

The R Programming Language

  • To use R, we need to learn its language

    • the R programming language
    • R is both the name of the program and the name of the language
  • Learning a programming language is like learning a foreign language

    • not easy
    • requires practice
    • requires patience

We will use R to:

  1. Do calculations

  2. Create objects

  3. Use functions

1. Do calculations

  • We can use R as a calculator

    • R understands arithmetic operators such as +, -, *, /
  • Let’s ask R to calculate 20 plus 5

    • First, we type on the R script (upper left window): 20+5

    • Then, to run this code, we highlight it and either click the Run icon or use the shortcut command+enter on a Mac or ctrl+enter on Windows

    • Go ahead and do it

In the Console, you should see the following:

20+5
[1] 25
  • first, the executed code
  • then, the output of the executed code in black
  • what does the [1] mean?
    • it indicates that the output immediately to its right is the first (and only, in this case) output
  • The title of the R script is now red to indicate that you have unsaved changes
    • to save the R script, either use shortcuts (command+S or ctrl+S) or click on File > Save or Save As…
    • name it “lecture1” so that you know what it refers to
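The other arithmetic operators work the same way as +. A minimal sketch, with numbers chosen only for illustration; each line can be highlighted and run on its own:

```r
# R as a calculator: one operator per line
20 + 5   # addition
20 - 5   # subtraction
20 * 5   # multiplication
20 / 5   # division
```

In the Console, each line should be echoed and followed by its result: 25, 15, 100 and 4, each preceded by [1].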

2. Create objects

  • R stores information in the form of objects

  • In order to analyse data, we will need to create objects

  • An object is like a box that can contain anything

    To create one, we need to:

    • give it a name

    • specify its contents

    • use the assignment operator

In R, we use the assignment operator <- to create an object:

  • To its left, we specify the name of the object

    • name cannot begin with a number or contain spaces or special symbols like $ or % that are reserved for other purposes

    • name can contain _ underscores, which are good substitutes for spaces

  • To its right, we specify the content of the object

object_name <- object_contents
  • For example, type and run:

    twentyfive <- 25
  • After running this code, the object twentyfive will show up in the Environment (the upper right window of RStudio)

  • To find out the contents of an object, you can run the name of the object in R:

    twentyfive
    [1] 25
  • This is equivalent to asking R: what is inside twentyfive?

  • Objects can contain text as well as numbers. Run for example:

    class <- "pol269"
  • Now in the environment there should be two objects

  • What are they? Note that in this last piece of code we used " around the contents, but we did not use " in the previous piece of code

    • Why?

  • When do we need to use " when writing code in R?

    • the names of objects, names of functions, and names of arguments as well as special values such as TRUE, FALSE, NA, and NULL should NOT be in quotes

    • all other text should be in quotes

    • numbers should never be in quotes unless you want R to treat them as text
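The difference between a number and a number in quotes can be checked directly. A small sketch: the object name twentyfive_text is made up for this illustration, and is.numeric() and is.character() are base R functions that are not otherwise covered in this lecture.

```r
# Without quotes: a number that R can calculate with
twentyfive <- 25
twentyfive + 5

# With quotes: text, even though it looks like a number
twentyfive_text <- "25"

# Ask R what kind of content each object holds
is.numeric(twentyfive)         # TRUE: it is a number
is.numeric(twentyfive_text)    # FALSE: it is not a number
is.character(twentyfive_text)  # TRUE: it is text
```

Because twentyfive_text contains text, R cannot do arithmetic with it; trying twentyfive_text + 5 produces an error.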

What would happen if you run instead: class <- pol269?

class <- pol269
  • R returns an error saying that the object pol269 was not found: without the ", R thinks that pol269 is the name of an object, and there is no object called pol269 in the environment

  • Running into errors is part of the coding process

    • do not be discouraged

    • if you have problems figuring out what a particular error means, google it; there are lots of Q&A sites

    • if that doesn’t help, post the code and error in our discussion board

  • R will overwrite objects if you assign new content to an existing object name

    class <- "data analysis"
  • After running the code above, class will contain the text “data analysis” instead of “pol269”

  • R is case sensitive:

    • class is different than Class
    • to avoid confusion, we use lower-case letters when naming objects
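Overwriting and case sensitivity can be seen together in a short sketch. The object names module and Module are made up for this illustration:

```r
module <- "pol269"          # create the object
module <- "data analysis"   # same name: the old content is replaced
module                      # now contains "data analysis"

Module <- "POL269"          # capital M: R treats this as a separate object
Module
```

After running these lines, the Environment should list both module and Module, each with its own content.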

3. Use functions

  • Think of a function as an action that you request R to perform on a particular object or piece of data, such as calculating the square root of 25

AN R FUNCTION

  • A function:
    • takes input(s)
      • example: takes the number 25
    • performs an action with the input(s)
      • computes \(\sqrt{25}\)
    • produces an output
      • produces the number 5

  • We will learn how to use these functions: sqrt(), setwd(), read.csv(), View(), head(), dim(), mean(), ifelse(), table(), prop.table(), na.omit(), hist(), median(), sd(), var(), plot(), abline(), cor(), lm(), log(), c(), sample(), rnorm(), pnorm(), print(), nrow(), predict(), abs(), summary(), among others
  • In time, we will learn:
    • their names
    • the actions they perform
    • the inputs they require
    • the outputs they produce

  • The name of a function (without quotes) is always followed by parentheses: function_name()
  • Inside the parentheses, we specify the inputs, which we refer to as arguments: function_name(arguments)
  • Most functions require that we specify at least one argument but can take many optional arguments
    • some arguments are required, others are optional
  • When multiple arguments are specified inside the parentheses, they are separated by commas: function_name(argument1, argument2)

  • To specify the arguments, we enter them in a particular order or include the name of the argument (without quotes) in our specification:
function_name(argument1, argument2)

or

function_name(argument1_name = argument1,
              argument2_name = argument2)
  • We always specify required arguments first. If there is more than one required argument, we enter them in the order expected by R. We specify any optional arguments we want next and include their names:
function_name(required_argument,
              optional_argument_name = optional_argument)

Using R functions

We typically write code in one of these two formats:

function_name(required_argument)

or

function_name(required_argument,
              optional_argument_name = optional_argument)

  • Fictitious Example: Suppose R were capable of baking and that it had a function named bake() that, by default, bakes the specified ingredient for 60 minutes at 180°C
    • Required argument: the ingredient
      • example: cake mix
    • Optional arguments: named degrees and minutes to change the default temperature and duration of the bake, respectively
      • degrees=200 changes temperature to 200°C
      • minutes=30 changes duration of bake to 30 minutes

The following code would ask R to bake a cake mix for 30 minutes at 200°C, so that we can have cake as the output:

bake(cake_mix, degrees = 200, minutes = 30)
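The same pattern applies to real functions. As a sketch, mean() and c() (both on the list of functions above) can illustrate a required argument followed by a named optional one. The object name scores and its values are made up for this illustration:

```r
# c() combines values into one object; NA marks a missing value
scores <- c(70, 65, NA, 58)

# Required argument only: the result is NA because one value is missing
mean(scores)

# Optional argument na.rm, specified by name: ignore the missing value
mean(scores, na.rm = TRUE)
```

The first call returns NA; the second returns the average of the three remaining values, about 64.33.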

  • Example: sqrt() computes the square root of the argument specified inside the parentheses. To compute \(\sqrt{25}\), run:

    sqrt(25)
    [1] 5
  • sqrt is the name of the function, which, like all function names, is followed by parentheses ()

  • 25 is the required argument

  • 5 is the output

  • Alternatively, since the object twentyfive contains the number 25, we can run:
sqrt(twentyfive)
[1] 5
  • R will give you an error message if you run this line of code before creating the object twentyfive
    • Code is sequential! One must run code in order
    • Whenever returning to work on an R script, run all the code from the beginning

  • It is good practice to comment code
    • include short notes to yourself or your collaborators explaining what the code does
  • To comment code, we use #
    • R ignores everything that follows this character until the end of the line
  • Example:
sqrt(25) #calculates square root of 25
[1] 5
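Comments also help to keep track of the order in which code must be run. A minimal sketch of a short commented script, where the object name root is made up for this illustration:

```r
# Step 1: create the object
twentyfive <- 25

# Step 2: use the object (this only works once step 1 has been run)
root <- sqrt(twentyfive)  # stores the square root of 25
root                      # shows the contents of root
```

Run from top to bottom, the last line should show 5 in the Console. Running step 2 in a new session without step 1 produces an error, because twentyfive does not exist yet.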

  • Before closing your computer, remember to save the R script, otherwise you risk losing unsaved changes
    • either use shortcuts (command+S or ctrl+S) or
    • click on File > Save
  • If you quit RStudio, R will ask whether you want to save the workspace image, which contains all the objects you have created during the R session
    • I recommend that you do not save it
    • You can always re-create the objects by re-running the code in your R script

Today’s lecture

  • Introductions and Housekeeping

  • R and RStudio, scripts, console and the environment

  • R calculations, objects, functions

Next lecture

  • What are data/datasets?

    • What is an observation?

    • What is a variable?

  • Types of variables based on content

  • How to load and make sense of data

  • Computing and interpreting means