We will use the following libraries so let’s make sure you install them
library(tidyverse)
library(texreg)
library(stargazer)
This dataset is cross sectional data on industrial democracies. Containing:
povred |
Percent of citizens lifted out of poverty by taxes and transfers |
enp |
Effective number of parties |
lnenp |
Natural log of effective number of parties |
maj |
Majoritarian election system dummy |
pr |
Proportional representation dummy |
unam |
Unanimity government dummy |
Source of data and model Torben Iversen and David Soskice, 2002, ``Why do some democracies redistribute more than others?’’ Harvard University.
You can download it and load it (again, most of the time you will not be able to load it directly from the web)
<- read_csv("https://raw.githubusercontent.com/GarciaRios/govt_6029/master/data/iver.csv") iver
ok, let’s take a quick look
glimpse(iver)
Now a quick summary
summary(iver)
The basic command for linear regression in R is lm()
. A call to this function takes the generic form illustrated below:
<- lm(y ~ x1 + x2 + x3, data = your.dataframe) res
Let’s run a regression using the Iverson and Soskice data. We save the regression as an object and then print the summary of the results:
Here let’s use povred
as the outcome variable and lnenp
as our only covariate
<- lm(povred ~ lnenp, data = iver)
lm_bivar
summary(lm_bivar)
It’s probably much more complicated than the number of parties. Let’s add more covariates
What else do we have?
head(iver)
Ok, how about the type of electoral system
<- lm(povred ~ lnenp + elec_sys, data = iver)
lm_multi
summary(lm_multi)
Let’s add an interaction here. It will help us observe observe the joint effect
<- lm(povred ~ lnenp * elec_sys, data = iver)
lm_multi_interact
summary(lm_multi_interact)
<- lm(povred ~ lnenp * elec_sys, data = iver) lm_multi_interact
So the first thing that we would have done, and that you should do every time is to strat by familiarizing with the data before you run regressions
Let’s run a few ggplots
to see what’s going on
Note: the first argument to the function is an R formula. Formulas appear throughout many R functions, and have some special features of their syntax, some of which are illustrated below.
In lm
, the formula is used to generate the exact \(X\) matrix that will be used to estimate the model. To see the matrix being generated internally by lm
, add the argument x = TRUE
to the lm()
call:
<- lm(povred ~ lnenp + elec_sys, data = iver, x = TRUE) lm_cat
$x lm_cat
Take a look at the help page for formula
?formula
To better understand what the formula is doing, let’s look at the model matrix one of the more complex formulas above generates, note the interaction term and the base terms of this interaction:
<- lm(povred ~ lnenp * elec_sys, data = iver, x = TRUE)
lm_multi_interact
$x lm_multi_interact
Matrices are collections of equal-length vectors in row and column arrangement. Matrices store information in a rectangular format, so look like data frames, but are less flexible as all data must be the same type (you can’t mix character and numeric data, for example). At first when you start working in R, matrices will be in use behind the scenes more than something you work with much.
As an example, see what happens when we convert our iver
data frame into a matrix:
<- as.matrix(iver)
iver_matrix
iver_matrix
Why is everything in quotation marks now?
You can index a matrix using square brackets, indicating first the row you want, then the column. This is the same way that you could directly extract specific values from a data frame
2:5, ] iver_matrix[
A blank before or after the comma indicates all rows or columns, respectively
4, 3] iver_matrix[
If we leave out the character vectors and convert the data frame to a matrix, then the matrix will be numeric,
as.matrix(select(iver, - cty, - elec_sys))
names()
and glimpse()
(or str()
it will do the same as glimpse
in this case) to explore the contents of one of the lm objects you’ve created. (Look at the help file for lm
for further details)If you notice the standard errors are missing from the lm()
list. Let’s calculate it
<-
se_multi %>%
lm_multi vcov() %>%
%>%
diag
sqrt
se_multi
This calculates the standard errors by calculating the square root of the diagonal of the variance-covariance matrix of the parameters of the model object. vcov()
is an example of a function that has a specific “method” for different types of objects: it knows we have an lm
object and acts accordingly.
Another way to get the fitted values is with predict()
:
predict(lm_bivar)
The nice thing about predict is that it will actually let us calculate the expected values from our model for any set of real or hypothetical data with the same X variables:
Here’s the general form of a call to predict, giving 95% confidence intervals:
predict(object, # lm object
# a data frame with same x vars as data, but new or specific values
newdata, interval = "confidence",
level = 0.95 #the default
)
So the newdata component refers to the specific values we want. For instance, if our model is:
\(y_1 = x1 + x2, x3\) where:
We might want to specify this:
newdata =
data_frame(
x1 = 5,
x2 = 320,
x3 = "Texas"
)
Let’s try this with our models.
Of course, we can use more than one value of interest
predict(lm_cat,
newdata = data_frame(lnenp = seq(1:5),
elec_sys = "maj"),
interval = "confidence")
Same thing as just creating the new hypothetical data and then just insert it into the predict function
<- data_frame(lnenp = seq(1:5),
hypx elec_sys = "maj")
hypx
predict(lm_cat, newdata = hypx, interval="confidence")
I can do this for multiple values and combinations:
<- expand.grid(lnenp = seq(0,2, .1),
hypx elec_sys = c("pr", "maj"))
predict(lm_cat, newdata = hypx, interval="confidence")
Plotting regression results can be even more informative and more intuitive than regression tables!
To plot a regression line (not just using the lm
smoother in ggplot2), you can either fit a line to the observed values of X and the fitted values and CIs from predict
, or fit a line to hypothetical data to illustrate the estimated relationship (the latter can help you to have smoother confidence intervals where you have fewer observations).
We will use the hypothetical data we calculated above:
<- predict(lm_cat, newdata = hypx, interval="confidence", level = .90) preds
This is calculating expected values of povred
for various hypothetical values of lnenp
, setting the other covariates to a fixed level, illustrating the effect of a change in lnenp
, all else equal (for a “typical” respondent, use the mean of the covariates you are keeping fixed). Remember to keep variable names identical to those in the model!
So those predictions are great and we could plot as they are them but let’s merge (bind) our hypothetical data and these predictions. It will make things easier when we graph
<- cbind(hypx, preds)
pred_plot pred_plot
Now we can graph it
ggplot(pred_plot, aes(x = lnenp, y = fit,
ymin = lwr, ymax = upr,
color = elec_sys,
fill = elec_sys )) +
geom_line() +
geom_ribbon(alpha = .3) +
facet_wrap(~ elec_sys)
There are mutiple package that could help with the prese ntaiton for your regression tables.
cat(texreg::htmlreg(list(lm_cat),
html.tag = FALSE,
head.tag = FALSE,
body.tag = FALSE,
doctype = FALSE))
Model 1 | |
---|---|
(Intercept) | 17.66 |
(12.69) | |
lnenp | 26.69 |
(14.15) | |
elec_syspr | 9.22 |
(11.34) | |
elec_sysunam | -48.95* |
(17.86) | |
R2 | 0.74 |
Adj. R2 | 0.66 |
Num. obs. | 14 |
p < 0.001; p < 0.01; p < 0.05 |
stargazer(lm_cat, type = "html")
Dependent variable: | |
povred | |
lnenp | 26.693* |
(14.154) | |
elec_syspr | 9.221 |
(11.341) | |
elec_sysunam | -48.952** |
(17.864) | |
Constant | 17.658 |
(12.686) | |
Observations | 14 |
R2 | 0.738 |
Adjusted R2 | 0.659 |
Residual Std. Error | 12.367 (df = 10) |
F Statistic | 9.381*** (df = 3; 10) |
Note: | p<0.1; p<0.05; p<0.01 |
I use stargazer most of the time but there many others like these. Here is a list of some of these packages
stargazer
xtable
kable
pander
texreg
Like always, use whatever works for you.
Science should be open! Here at Cornell and everywhere, this lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported.