We will use the following libraries so let’s make sure you install them
library(tidyverse)
library(texreg)
library(stargazer)This dataset is cross sectional data on industrial democracies. Containing:
povred |
Percent of citizens lifted out of poverty by taxes and transfers |
enp |
Effective number of parties |
lnenp |
Natural log of effective number of parties |
maj |
Majoritarian election system dummy |
pr |
Proportional representation dummy |
unam |
Unanimity government dummy |
Source of data and model Torben Iversen and David Soskice, 2002, ``Why do some democracies redistribute more than others?’’ Harvard University.
You can download it and load it (again, most of the time you will not be able to load it directly from the web)
iver <- read_csv("https://raw.githubusercontent.com/GarciaRios/govt_6029/master/data/iver.csv")ok, let’s take a quick look
glimpse(iver) Now a quick summary
summary(iver)The basic command for linear regression in R is lm(). A call to this function takes the generic form illustrated below:
res <- lm(y ~ x1 + x2 + x3, data = your.dataframe)Let’s run a regression using the Iverson and Soskice data. We save the regression as an object and then print the summary of the results:
Here let’s use povred as the outcome variable and lnenp as our only covariate
lm_bivar <- lm(povred ~ lnenp, data = iver)
summary(lm_bivar)It’s probably much more complicated than the number of parties. Let’s add more covariates
What else do we have?
head(iver)Ok, how about the type of electoral system
lm_multi <- lm(povred ~ lnenp + elec_sys, data = iver)
summary(lm_multi)Let’s add an interaction here. It will help us observe observe the joint effect
lm_multi_interact <- lm(povred ~ lnenp * elec_sys, data = iver)
summary(lm_multi_interact)
lm_multi_interact <- lm(povred ~ lnenp * elec_sys, data = iver)So the first thing that we would have done, and that you should do every time is to strat by familiarizing with the data before you run regressions
Let’s run a few ggplots to see what’s going on
Note: the first argument to the function is an R formula. Formulas appear throughout many R functions, and have some special features of their syntax, some of which are illustrated below.
In lm, the formula is used to generate the exact \(X\) matrix that will be used to estimate the model. To see the matrix being generated internally by lm, add the argument x = TRUE to the lm() call:
lm_cat <- lm(povred ~ lnenp + elec_sys, data = iver, x = TRUE)lm_cat$xTake a look at the help page for formula
?formulaTo better understand what the formula is doing, let’s look at the model matrix one of the more complex formulas above generates, note the interaction term and the base terms of this interaction:
lm_multi_interact <- lm(povred ~ lnenp * elec_sys, data = iver, x = TRUE)
lm_multi_interact$xMatrices are collections of equal-length vectors in row and column arrangement. Matrices store information in a rectangular format, so look like data frames, but are less flexible as all data must be the same type (you can’t mix character and numeric data, for example). At first when you start working in R, matrices will be in use behind the scenes more than something you work with much.
As an example, see what happens when we convert our iver data frame into a matrix:
iver_matrix <- as.matrix(iver)
iver_matrixWhy is everything in quotation marks now?
You can index a matrix using square brackets, indicating first the row you want, then the column. This is the same way that you could directly extract specific values from a data frame
iver_matrix[2:5, ]A blank before or after the comma indicates all rows or columns, respectively
iver_matrix[4, 3]If we leave out the character vectors and convert the data frame to a matrix, then the matrix will be numeric,
as.matrix(select(iver, - cty, - elec_sys))names() and glimpse() (or str() it will do the same as glimpse in this case) to explore the contents of one of the lm objects you’ve created. (Look at the help file for lm for further details)If you notice the standard errors are missing from the lm() list. Let’s calculate it
se_multi <-
lm_multi %>%
vcov() %>%
diag %>%
sqrt
se_multiThis calculates the standard errors by calculating the square root of the diagonal of the variance-covariance matrix of the parameters of the model object. vcov() is an example of a function that has a specific “method” for different types of objects: it knows we have an lm object and acts accordingly.
Another way to get the fitted values is with predict():
predict(lm_bivar)The nice thing about predict is that it will actually let us calculate the expected values from our model for any set of real or hypothetical data with the same X variables:
Here’s the general form of a call to predict, giving 95% confidence intervals:
predict(object, # lm object
newdata, # a data frame with same x vars as data, but new or specific values
interval = "confidence",
level = 0.95 #the default
)So the newdata component refers to the specific values we want. For instance, if our model is:
\(y_1 = x1 + x2, x3\) where:
We might want to specify this:
newdata =
data_frame(
x1 = 5,
x2 = 320,
x3 = "Texas"
)
Let’s try this with our models.
Of course, we can use more than one value of interest
predict(lm_cat,
newdata = data_frame(lnenp = seq(1:5),
elec_sys = "maj"),
interval = "confidence")Same thing as just creating the new hypothetical data and then just insert it into the predict function
hypx <- data_frame(lnenp = seq(1:5),
elec_sys = "maj")
hypxpredict(lm_cat, newdata = hypx, interval="confidence")I can do this for multiple values and combinations:
hypx <- expand.grid(lnenp = seq(0,2, .1),
elec_sys = c("pr", "maj"))
predict(lm_cat, newdata = hypx, interval="confidence")Plotting regression results can be even more informative and more intuitive than regression tables!
To plot a regression line (not just using the lm smoother in ggplot2), you can either fit a line to the observed values of X and the fitted values and CIs from predict, or fit a line to hypothetical data to illustrate the estimated relationship (the latter can help you to have smoother confidence intervals where you have fewer observations).
We will use the hypothetical data we calculated above:
preds <- predict(lm_cat, newdata = hypx, interval="confidence", level = .90)This is calculating expected values of povred for various hypothetical values of lnenp, setting the other covariates to a fixed level, illustrating the effect of a change in lnenp, all else equal (for a “typical” respondent, use the mean of the covariates you are keeping fixed). Remember to keep variable names identical to those in the model!
So those predictions are great and we could plot as they are them but let’s merge (bind) our hypothetical data and these predictions. It will make things easier when we graph
pred_plot <- cbind(hypx, preds)
pred_plotNow we can graph it
ggplot(pred_plot, aes(x = lnenp, y = fit,
ymin = lwr, ymax = upr,
color = elec_sys,
fill = elec_sys )) +
geom_line() +
geom_ribbon(alpha = .3) +
facet_wrap(~ elec_sys)There are mutiple package that could help with the prese ntaiton for your regression tables.
cat(texreg::htmlreg(list(lm_cat),
html.tag = FALSE,
head.tag = FALSE,
body.tag = FALSE,
doctype = FALSE))| Model 1 | |
|---|---|
| (Intercept) | 17.66 |
| (12.69) | |
| lnenp | 26.69 |
| (14.15) | |
| elec_syspr | 9.22 |
| (11.34) | |
| elec_sysunam | -48.95* |
| (17.86) | |
| R2 | 0.74 |
| Adj. R2 | 0.66 |
| Num. obs. | 14 |
| p < 0.001; p < 0.01; p < 0.05 | |
stargazer(lm_cat, type = "html")| Dependent variable: | |
| povred | |
| lnenp | 26.693* |
| (14.154) | |
| elec_syspr | 9.221 |
| (11.341) | |
| elec_sysunam | -48.952** |
| (17.864) | |
| Constant | 17.658 |
| (12.686) | |
| Observations | 14 |
| R2 | 0.738 |
| Adjusted R2 | 0.659 |
| Residual Std. Error | 12.367 (df = 10) |
| F Statistic | 9.381*** (df = 3; 10) |
| Note: | p<0.1; p<0.05; p<0.01 |
I use stargazer most of the time but there many others like these. Here is a list of some of these packages
stargazerxtablekablepandertexregLike always, use whatever works for you.
Science should be open! Here at Cornell and everywhere, this lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported.