This lab will use some libraries you’ve seen before, and one you may not have. We’ll load them all now.

library("MASS")
library("tidyverse")
library("broom")

Some useful tools

z <- rnorm(10, mean = 100, sd = 1)  # draw randomly from the normal distribution

z1 <- runif(10, 0, 1)  #draw from a uniform distribution between min and max values

sample(z, 2)  # samples two numbers from the z vector

sort(z)  # puts z in (ascending) order.
sort(z, decreasing = TRUE)


z3 <- sample(c("red", "blue", "yellow", "green"), 10, replace = TRUE)


sample(c("red", "blue", "yellow", "green"), 3, replace = FALSE)  #note you cannot have a higher number of draws when replace=FALSE


which(z3 == "green")  # tells you the position(s) where the condition is met

which.max(z1)

IF conditions

a <- 10

if (a == 10) print("Yes, a is 10.")

print("This gets printed no matter what value a was.")

The statement immediately after the if statement is run only when the argument in parentheses evaluates to TRUE. Try #setting ‘a’ to another value and see what happens.

Note: use braces {} after the if statement if you want multiple commands to be executed when the condition is met:

if (a == 10) {
  print("Yes, a is 10.")
  print("Because a is 10 I'm doing this too.")
} else {
  print("DIDNT WORK")
}

print("This gets printed no matter what value a was.")

FOR loops

for (i in 1:10) {
  print(i^2)
}  # close the i loop

You can put as many lines of code as you want in between the {} braces, including other for loops and if statements, and whatever else you want.

Examples using loops

(Example 1 for data online)

Read in the Ross data from last week’s lab (selecting and renaming variables as appropriate:

rossdata <- read.csv("https://raw.githubusercontent.com/GarciaRios/govt_6029/master/data/ross_2012.csv", 
  stringsAsFactors = FALSE)

data <- rossdata %>% dplyr::select(cty, year, polity, logoil_gdpcap2000_sup_Q_1, 
  logGDPcap, oecd)

data <- data %>% rename(oil = logoil_gdpcap2000_sup_Q_1, loggdp = logGDPcap)

We’re going to examine the extent of missing data by country (what proportion of each country’s observations remain after listwise deletion?)

First, create a list of every country name

uc <- unique(data$cty)

Then calculate the number of countries in the dataset:

numcountries <- length(uc)

Then create empty vectors to hold our results (we’ll use them in the for loop)

country_obs <- rep(NA, numcountries)
country_na <- rep(NA, numcountries)

Next, we write the loop. This loop is going to go through each individual country in turn, and find out two things:

  • How many observations are there for that country?
  • How many observations for that country are left after listwise deletion of missing data?

After evaluating these questions for each country, and storing the results in two different vectors, the loop will repeat the same tasks for the next country. The counter variable i simply goes up by 1 for each iteration of the loop. By referencing uc[i], we are asking the lines of code inside the loop to apply to the first country, then the second country, then the third, and so on…

for (i in 1:numcountries) {
  tempdata <- filter(data, cty == uc[i])
  na_omit_tempdata <- na.omit(tempdata)
  country_obs[i] <- nrow(tempdata)
  country_na[i] <- nrow(na_omit_tempdata)
}

dta_na <- cbind(uc, country_obs, country_na, country_na/country_obs)
  1. Why are there quotes around the numbers? How might you fix this?
na_table <- data_frame(Country = uc, All_obs = country_obs, Remaining_obs = country_na, 
  Proportion_remaining = round(country_na/country_obs, 2))

Another loop example, with regressions:

Run a separate regression model for each year in the dataset and summarize changing magnitude of effects (hint, this is really easy to do with dplyr and broom, in case you needed illustration of why the packages are useful and not just extra things to learn!)

First, set up a vector of all unique years in the data frame

uy <- unique(data$year)

Then, create a dichotomous variable of whether a country is a democracy or not:

data <- mutate(data, dem_indicator = polity >= 8) %>% tbl_df()

Next we create a data frame of empty vectors in which to store the regression results:

results <- data_frame(year = uy, dem_coef = rep(NA, length(uy)), 
  dem_se = rep(NA, length(uy)), oil_coef = rep(NA, length(uy)), 
  oil_se = rep(NA, length(uy)), df = rep(NA, length(uy)))

results <- expand.grid(year = uy, dem_coef = NA, dem_se = NA, 
  oil_coef = NA, oil_se = NA, df = NA)

And now we make a loop that checks if we have enough complete observations and runs a regression for every year, and then stores the coefficients and standard errors.

for (i in 1:length(uy)) {
  tempdata <- filter(data, year == uy[i])
  if (nrow(na.omit(subset(tempdata, select = c("oil", "dem_indicator")))) > 
    1) {
    res_temp <- lm(loggdp ~ dem_indicator + oil, data = tempdata)
    results$dem_coef[i] <- coef(res_temp)[2]
    results$dem_se[i] <- sqrt(diag(vcov(res_temp)))[2]
    results$oil_coef[i] <- coef(res_temp)[3]
    results$oil_se[i] <- sqrt(diag(vcov(res_temp)))[3]
    results$df[i] <- res_temp$df
  } else {
    print(paste("Year", uy[i], "does not have enough complete observations.", 
      sep = " "))
  }
}

Let’s look at our results:

results
ggplot(results, aes(x = year, y = dem_coef, ymin = dem_coef - 
  1.96 * dem_se, ymax = dem_coef + 1.96 * dem_se)) + geom_line() + 
  geom_ribbon(alpha = 0.3)
  1. create a similar plot to explore how the effect of oil has changed over time.
Your Code Here

Another example: Using simulations to summarize a regression model:

This is basically an alternate way to work through part 1.e. of Homework 2

Load the data and run the basic regression with interaction:

sprinters <- read.csv("https://raw.githubusercontent.com/GarciaRios/govt_6029/master/data/sprinters.csv", 
  na.strings = ".")


sprinter_model <- lm(finish ~ year * women, data = sprinters)

Create a range of hypothetical years to get expected values:

year_range <- seq(from = min(sprinters$year), to = max(sprinters$year), 
  by = 2)

Create a data frame that will relate years to the estimated expected values and CIs (currently empty columns):

expected <- expand.grid(year = year_range,
  # point estimates from model for men
  EVs_male = NA,
  lowerCIs_male = NA,
  upperCIs_male = NA,
  # point estimates from model for women
  EVs_female = NA,
  lowerCIs_female = NA,
  upperCIs_female = NA
  ) 
  


expected <- data_frame(
  year = year_range,
  # point estimates from model for men
  EVs_male = rep(NA, length(year_range)),
  lowerCIs_male = rep(NA, length(year_range)),
  upperCIs_male = rep(NA, length(year_range)),
  # point estimates from model for women
  EVs_female = rep(NA, length(year_range)),
  lowerCIs_female = rep(NA, length(year_range)),
  upperCIs_female = rep(NA, length(year_range))
  )

Get the coefficients and variance-covariance matrix from the model:

pe <- coef(sprinter_model)
vc <- vcov(sprinter_model)

Take 1000 draws from the multivariate normal distribution:

simbetas <- mvrnorm(1000, pe, vc)

Set up a loop that goes through all values in year.range, with FEMALE set to 0 We need define the covariate values for the current scenario. These must be in the same order as in model2$coefficients: (Intercept), year, women, year:women.

for (i in 1:length(year_range)) {
  x.current <- c("(Intercept)" = 1,
                 "year" = year_range[i],
                 "women" = 0,
                 "year:women" = year_range[i] * 0
  )
  # calculate the 1000 predictions:
  pred.current <- simbetas %*% x.current
  # Now slot the mean, lowerCI and upperCI values into the appropriate
  # positions in the results vectors:
  expected$EVs_male[i] <- mean(pred.current)
  # for illustration; quantile is a better way to do this
  # see CIs for women below
  expected$lowerCIs_male[i] <- sort(pred.current)[25]
  expected$upperCIs_male[i]<-sort(pred.current)[975]
} # close i loop

Now repeat that same loop, but change to FEMALE=1 (both in the base term and the int. term)

for (i in 1:length(year_range)) {
  # define the covariate values for the current scenario
  x.current <- c(`(Intercept)` = 1, year = year_range[i], women = 1, 
    `year:women` = year_range[i] * 1)
  # calculate the 1000 predictions:
  pred.current <- simbetas %*% x.current
  # Now slot the mean, lowerCI and upperCI values into the
  # appropriate positions in the results vectors:
  expected$EVs_female[i] <- mean(pred.current)
  expected$lowerCIs_female[i] <- quantile(pred.current, 0.025)
  expected$upperCIs_female[i] <- quantile(pred.current, 0.975)
}  # close i loop

Plotting

ggplot(expected, aes(x = year, y = EVs_male, ymin = lowerCIs_male, 
  ymax = upperCIs_male)) + geom_line() + geom_ribbon(alpha = 0.3)
  1. plot the line for women
Your Code Here

Now… map. My favorite

data  %>% 
  dplyr::select(loggdp, dem_indicator,  oil, year) %>% 
  na.omit() %>% 
  split(.$year) %>% # from base R
  map(~ lm(loggdp ~ dem_indicator + oil, data = .)) %>% 
  map(tidy) %>% 
  enframe(name = "Year") %>% 
  unnest()

On your own

Monty Hall Simulation

On Let’s Make a Deal, host Monty Hall offers you the following choice:

What should you do?

The simulation approach:

  • Set up the doors, goats, and car
  • Contestant picks a door
  • Monty “picks” a remaining door
  • Record where the car and goats were
  • Do all of the above many many times
  • Print the fraction of times a car was found
sims <- 10000  # Simulations run
doors <- c(1, 0, 0)  # The car (1) and the goats (0)
cars.stay <- 0  # Save times cars won with first choice here
cars.switch <- 0  # Save times cars won from switching here
for (i in 1:sims) {
  random.doors <- sample(doors, 3, replace = FALSE)
  cars.stay <- cars.stay + random.doors[1]  #First choose 'door number 1'
  cars.switch <- cars.switch + sort(random.doors[2:3])[2]  #Do you understand what this line of code is doing?
  # cars.switch <- cars.switch + (sum(random.doors[2:3])) # an
  # alternative approach
}

paste("Probability of a car from staying with 1st door", cars.stay/sims, 
  sep = ": ")
paste("Probability of a car from switching to 2nd door", cars.switch/sims, 
  sep = ": ")

Birthday simulation

Question: What is the probability that at least two students will share the same birthday in a class of this size?

Use a simulation method to answer this. (Ignore leap years.)

Bonus question: Modify your quote to answer the question for varying class sizes.

Your Code Here