This lab will use some libraries you’ve seen before, and one you may not have. We’ll load them all now.
library("MASS")
library("tidyverse")
library("broom")
z <- rnorm(10, mean = 100, sd = 1) # draw randomly from the normal distribution
z1 <- runif(10, 0, 1) #draw from a uniform distribution between min and max values
sample(z, 2) # samples two numbers from the z vector
sort(z) # puts z in (ascending) order.
sort(z, decreasing = TRUE)
z3 <- sample(c("red", "blue", "yellow", "green"), 10, replace = TRUE)
sample(c("red", "blue", "yellow", "green"), 3, replace = FALSE) #note you cannot have a higher number of draws when replace=FALSE
which(z3 == "green") # tells you the position(s) where the condition is met
which.max(z1)
a <- 10
if (a == 10) print("Yes, a is 10.")
print("This gets printed no matter what value a was.")
The statement immediately after the if statement is run only when the argument in parentheses evaluates to TRUE. Try #setting ‘a’ to another value and see what happens.
Note: use braces {} after the if statement if you want multiple commands to be executed when the condition is met:
if (a == 10) {
print("Yes, a is 10.")
print("Because a is 10 I'm doing this too.")
} else {
print("DIDNT WORK")
}
print("This gets printed no matter what value a was.")
for (i in 1:10) {
print(i^2)
} # close the i loop
You can put as many lines of code as you want in between the {} braces, including other for loops and if statements, and whatever else you want.
Read in the Ross data from last week’s lab (selecting and renaming variables as appropriate:
rossdata <- read.csv("https://raw.githubusercontent.com/GarciaRios/govt_6029/master/data/ross_2012.csv",
stringsAsFactors = FALSE)
data <- rossdata %>% dplyr::select(cty, year, polity, logoil_gdpcap2000_sup_Q_1,
logGDPcap, oecd)
data <- data %>% rename(oil = logoil_gdpcap2000_sup_Q_1, loggdp = logGDPcap)
We’re going to examine the extent of missing data by country (what proportion of each country’s observations remain after listwise deletion?)
First, create a list of every country name
uc <- unique(data$cty)
Then calculate the number of countries in the dataset:
numcountries <- length(uc)
Then create empty vectors to hold our results (we’ll use them in the for
loop)
country_obs <- rep(NA, numcountries)
country_na <- rep(NA, numcountries)
Next, we write the loop. This loop is going to go through each individual country in turn, and find out two things:
After evaluating these questions for each country, and storing the results in two different vectors, the loop will repeat the same tasks for the next country. The counter variable i
simply goes up by 1 for each iteration of the loop. By referencing uc[i]
, we are asking the lines of code inside the loop to apply to the first country, then the second country, then the third, and so on…
for (i in 1:numcountries) {
tempdata <- filter(data, cty == uc[i])
na_omit_tempdata <- na.omit(tempdata)
country_obs[i] <- nrow(tempdata)
country_na[i] <- nrow(na_omit_tempdata)
}
dta_na <- cbind(uc, country_obs, country_na, country_na/country_obs)
na_table <- data_frame(Country = uc, All_obs = country_obs, Remaining_obs = country_na,
Proportion_remaining = round(country_na/country_obs, 2))
Run a separate regression model for each year in the dataset and summarize changing magnitude of effects (hint, this is really easy to do with dplyr and broom, in case you needed illustration of why the packages are useful and not just extra things to learn!)
First, set up a vector of all unique years in the data frame
uy <- unique(data$year)
Then, create a dichotomous variable of whether a country is a democracy or not:
data <- mutate(data, dem_indicator = polity >= 8) %>% tbl_df()
Next we create a data frame of empty vectors in which to store the regression results:
results <- data_frame(year = uy, dem_coef = rep(NA, length(uy)),
dem_se = rep(NA, length(uy)), oil_coef = rep(NA, length(uy)),
oil_se = rep(NA, length(uy)), df = rep(NA, length(uy)))
results <- expand.grid(year = uy, dem_coef = NA, dem_se = NA,
oil_coef = NA, oil_se = NA, df = NA)
And now we make a loop that checks if we have enough complete observations and runs a regression for every year, and then stores the coefficients and standard errors.
for (i in 1:length(uy)) {
tempdata <- filter(data, year == uy[i])
if (nrow(na.omit(subset(tempdata, select = c("oil", "dem_indicator")))) >
1) {
res_temp <- lm(loggdp ~ dem_indicator + oil, data = tempdata)
results$dem_coef[i] <- coef(res_temp)[2]
results$dem_se[i] <- sqrt(diag(vcov(res_temp)))[2]
results$oil_coef[i] <- coef(res_temp)[3]
results$oil_se[i] <- sqrt(diag(vcov(res_temp)))[3]
results$df[i] <- res_temp$df
} else {
print(paste("Year", uy[i], "does not have enough complete observations.",
sep = " "))
}
}
Let’s look at our results:
results
ggplot(results, aes(x = year, y = dem_coef, ymin = dem_coef -
1.96 * dem_se, ymax = dem_coef + 1.96 * dem_se)) + geom_line() +
geom_ribbon(alpha = 0.3)
Your Code Here
This is basically an alternate way to work through part 1.e. of Homework 2
Load the data and run the basic regression with interaction:
sprinters <- read.csv("https://raw.githubusercontent.com/GarciaRios/govt_6029/master/data/sprinters.csv",
na.strings = ".")
sprinter_model <- lm(finish ~ year * women, data = sprinters)
Create a range of hypothetical years to get expected values:
year_range <- seq(from = min(sprinters$year), to = max(sprinters$year),
by = 2)
Create a data frame that will relate years to the estimated expected values and CIs (currently empty columns):
expected <- expand.grid(year = year_range,
# point estimates from model for men
EVs_male = NA,
lowerCIs_male = NA,
upperCIs_male = NA,
# point estimates from model for women
EVs_female = NA,
lowerCIs_female = NA,
upperCIs_female = NA
)
expected <- data_frame(
year = year_range,
# point estimates from model for men
EVs_male = rep(NA, length(year_range)),
lowerCIs_male = rep(NA, length(year_range)),
upperCIs_male = rep(NA, length(year_range)),
# point estimates from model for women
EVs_female = rep(NA, length(year_range)),
lowerCIs_female = rep(NA, length(year_range)),
upperCIs_female = rep(NA, length(year_range))
)
Get the coefficients and variance-covariance matrix from the model:
pe <- coef(sprinter_model)
vc <- vcov(sprinter_model)
Take 1000 draws from the multivariate normal distribution:
simbetas <- mvrnorm(1000, pe, vc)
Set up a loop that goes through all values in year.range, with FEMALE set to 0 We need define the covariate values for the current scenario. These must be in the same order as in model2$coefficients
: (Intercept)
, year
, women
, year:women
.
for (i in 1:length(year_range)) {
x.current <- c("(Intercept)" = 1,
"year" = year_range[i],
"women" = 0,
"year:women" = year_range[i] * 0
)
# calculate the 1000 predictions:
pred.current <- simbetas %*% x.current
# Now slot the mean, lowerCI and upperCI values into the appropriate
# positions in the results vectors:
expected$EVs_male[i] <- mean(pred.current)
# for illustration; quantile is a better way to do this
# see CIs for women below
expected$lowerCIs_male[i] <- sort(pred.current)[25]
expected$upperCIs_male[i]<-sort(pred.current)[975]
} # close i loop
Now repeat that same loop, but change to FEMALE=1 (both in the base term and the int. term)
for (i in 1:length(year_range)) {
# define the covariate values for the current scenario
x.current <- c(`(Intercept)` = 1, year = year_range[i], women = 1,
`year:women` = year_range[i] * 1)
# calculate the 1000 predictions:
pred.current <- simbetas %*% x.current
# Now slot the mean, lowerCI and upperCI values into the
# appropriate positions in the results vectors:
expected$EVs_female[i] <- mean(pred.current)
expected$lowerCIs_female[i] <- quantile(pred.current, 0.025)
expected$upperCIs_female[i] <- quantile(pred.current, 0.975)
} # close i loop
ggplot(expected, aes(x = year, y = EVs_male, ymin = lowerCIs_male,
ymax = upperCIs_male)) + geom_line() + geom_ribbon(alpha = 0.3)
Your Code Here
map
. My favoritedata %>%
dplyr::select(loggdp, dem_indicator, oil, year) %>%
na.omit() %>%
split(.$year) %>% # from base R
map(~ lm(loggdp ~ dem_indicator + oil, data = .)) %>%
map(tidy) %>%
enframe(name = "Year") %>%
unnest()
Monty Hall Simulation
On Let’s Make a Deal, host Monty Hall offers you the following choice:
What should you do?
sims <- 10000 # Simulations run
doors <- c(1, 0, 0) # The car (1) and the goats (0)
cars.stay <- 0 # Save times cars won with first choice here
cars.switch <- 0 # Save times cars won from switching here
for (i in 1:sims) {
random.doors <- sample(doors, 3, replace = FALSE)
cars.stay <- cars.stay + random.doors[1] #First choose 'door number 1'
cars.switch <- cars.switch + sort(random.doors[2:3])[2] #Do you understand what this line of code is doing?
# cars.switch <- cars.switch + (sum(random.doors[2:3])) # an
# alternative approach
}
paste("Probability of a car from staying with 1st door", cars.stay/sims,
sep = ": ")
paste("Probability of a car from switching to 2nd door", cars.switch/sims,
sep = ": ")
Question: What is the probability that at least two students will share the same birthday in a class of this size?
Use a simulation method to answer this. (Ignore leap years.)
Bonus question: Modify your quote to answer the question for varying class sizes.
Your Code Here
Science should be open! Here at Cornell and everywhere, this lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported.