Lab 4: Multivariate Linear Regression, and Predicted Values

Initial Setup
Reading in the Data
Data Management
Sacatterplots
Visualiazing Regression Results
Regression tables
Predicted Values

For this lab we will use the replication data from Michael Ross’s “The Oil Curse: How Petroleum Wealth Shapes the Development of Nations.”

We will be exploring the relationship between oil dependency and democracy.

Initial Setup

This lab will use some libraries you’ve seen before and we should load them now plus broomand coefplot which you might need to install

library(tidyverse)
library(broom)
library(coefplot)

Reading in the Data

Read in the Ross data

rossdata <- read.csv("https://raw.githubusercontent.com/GarciaRios/govt_6029/master/data/ross_2012.csv",   stringsAsFactors = FALSE)

head (rossdata)
glimpse(rossdata)

This is a pretty big data-set, we do not need all the variables let’s subset our data to include only the variables we will use.

Using only what you need
- Create a new data-set containing only:
  - cty
  - year
  - polity
  - logoil_gdpcap2000_sup_Q_1
  - logGDPcap
  - oecd
- Call this new data-frame rossdata_short

Data Management

Some of those names are too long, we should change it to something meaningful and short. This can be done easily using dplyr and rename.

rossdata_short <- rossdata_short %>%
  rename(oil = logoil_gdpcap2000_sup_Q_1, 
         gdp = logGDPcap)

This data-frame is way easier to glimpse at:

glimpse(rossdata_short)

A lot of missing values here. Let’s omit them and then look at a summary of the data set

rossdata_short <- rossdata_short %>% 
  na.omit()

rossdata_short %>%
  summary()

Finally! we are ready to start data-analyzing..

Sacatterplots

We are going to be exploring the relationship between Democracy level (polity) and other covariates.

Let’s explore these relationships with plots.

Preliminary Exploration
- Create a plot that explores the relationship between democracy level and at least another variable then try to include more than two covariates using different colors and shapes.

Okay, let’s do it together

We begin simple…

Unfortunately, a simple scatter plot makes it hard to detect any relationship. However, ggplot2 makes it easy to add different colors and shapes which might help identify trends.

It seems like the higher the GDP the more democratic countries are, except if you are a high oil producer or a non-OECD. Let’s explore these relationships using a regression.

model1 <- lm(polity ~ oil, data = rossdata_short)
summary(model1)

Let’s now include controls for GDP per capita and OECD membership

model2<-lm(polity ~ gdp + oil + oecd, data = rossdata_short)
summary(model2)

Interpreting a multivariate model
- Which has a larger impact on the level of democracy: oil dependence or OECD membership?
- Which has a larger impact on the level of democracy: oil dependence or GDP per capital?

Recall the OECD membership clustering? Let’s try an interaction

model3 <- lm(polity ~ gdp + oil*oecd, data = rossdata_short)
summary(model3)

Interpreting a multivariate model with an interaction
- How would you interpret these results?

Visualiazing Regression Results

broom has three main functions, all of which return data frames (not lists, numeric vectors, or other types of object). glance returns a data frame with a single row summary of the model:

glance(model2)

tidy returns a data frame with a row for each coefficient estimate:

tidy(model2)

augment returns the original data frame used in the model with additional columns for fitted values, the standard errors of those fitted values, residuals, etc.

head(augment(model2))

How about a coefficient plot, roppeladder… etc.

ggplot(tidy(model2) %>% filter(term != "(Intercept)"), 
       aes(x = term, y = estimate, 
           ymin = estimate - 2 * std.error, 
           ymax = estimate + 2 * std.error)) + 
  geom_pointrange() + 
  coord_flip()

We can also use coefplot

coefplot(model2, coefficients = c("oecd", "oil", "gdp"))

Coefficient Plots
- What is wrong with this plot?
- Is it useful?
- Why? Why not?

Regression tables

Several packages (stargazer, texreg, apsrtable) are useful for creating publication type regression tables. stargazer and texreg are the most complete package. Both allow output to either LaTeX or HTML tables for many types of statistical models. We’ll use stargazer here:

library(stargazer)
stargazer(model1, model2, model3, type = "html")

Predicted Values

We are going to use predict to get predicted values. We first have to set up a newdata

xnew <- data.frame(
  gdp = 5.9, oil = 0, oecd = 1)

predict(model2, newdata = xnew, interval = "confidence")

xnew2 <- data.frame(gdp = 5.9, oil = 0, oecd = 0)

predict(model2, newdata = xnew2, interval = "confidence")

What is this really doing?

model2
names(model2)

pe2<-model2$coefficients
pe2
1*pe2[1] + 5.9*pe2[2] + 0*pe2[3] + 1*pe2[4]


1*pe2[1] + 5.9*pe2[2] + 0*pe2[3] + 0*pe2[4]

Going Beyond a simple prediction

We can create a matrix of hypothetical data to obtain predictions for a range of values:

xnew <- expand.grid(gdp = seq(4,11),
                    oil = 0,
                    oecd = 1)

xnew <- expand.grid(gdp = seq(4,11),
                    oil = mean(rossdata_short$oil),
                    oecd = mean(rossdata_short$oecd))

Now we feed this new data into predict

pred.res <- predict(model2, newdata = xnew, interval="confidence")

Ploting Predicted Values

To plot these predicted values we have to create a data frame containing both the predicted values generated by predict and the data used to generate those values

mod2_pred_df <- cbind(xnew, pred.res)

mod2_pred_df

We now have a data-frame that can easily be taken by ggplot

ggplot(mod2_pred_df, aes(x = gdp, y = fit,
                         ymin = lwr, ymax = upr)) +
  geom_line() +
  geom_ribbon(alpha = 0.2)+
  labs(y = "Democracy", x = "GDP")+
  theme_bw()

What about oil?

xnew <- expand.grid(gdp = mean(rossdata_short$gdp),
                    oil = seq(0,11),
                    oecd = mean(rossdata_short$oecd))

pred.res <- predict(model2, newdata = xnew, interval = "confidence")

mod2_pred_df <- cbind(xnew, pred.res)


ggplot(mod2_pred_df, aes(x = oil, y = fit,
                         ymin = lwr, ymax = upr)) +
  geom_line() +
  geom_ribbon(alpha = 0.2)+
  labs(y = "Democracy", x = "Oil")+
  theme_bw()

Predicting Interactions

Let’s get predictions from the model with the interaction.

summary(model3)

xnew <- expand.grid(gdp = mean(rossdata_short$gdp),
                    oil = seq(0,11),
                    oecd = c(0,1))



pred.res <- predict(model3, newdata=xnew, interval="confidence")

mod3_pred_df <- cbind(xnew, pred.res)

Now we can ggplot it

ggplot(mod3_pred_df, aes(x = oil , y = fit, 
                         ymin = lwr, ymax = upr,
                         color = factor(oecd),
                         fill = factor(oecd))) +
  geom_line() +
  geom_ribbon( alpha = 0.7) +
  labs(title = "Predicted Democracy by GDP and OECD membership",
       y = "Predicted Value of Democracy",
       x = "Oil") + 
  scale_fill_discrete(name = "OECD",
                      labels = c("Non Member","Member")) +
  scale_color_discrete(name = "OECD",
                      labels = c("Non Member","Member")) +
  theme_bw()

Science should be open! Here at Cornell and everywhere, this lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported.