Lab 3: Linear Regression, Matrices, and Predicted Values

Initial set up, reading in data and initial data exploration
- Load libraries
- Data
Regression!
- A simple bivariate model
- A multivariate model
Interaction models
Prelimnary analysis:
- Aside on the formula
- Another aside, now on matrices
Fitted values and predictions
- Expanding the posibilities of the preditions
Plotting regression results
Formating regression tables.

Initial set up, reading in data and initial data exploration

Load libraries

We will use the following libraries so let’s make sure you install them

library(tidyverse)
library(texreg)
library(stargazer)

Data

This dataset is cross sectional data on industrial democracies. Containing:

`povred`	Percent of citizens lifted out of poverty by taxes and transfers
`enp`	Effective number of parties
`lnenp`	Natural log of effective number of parties
`maj`	Majoritarian election system dummy
`pr`	Proportional representation dummy
`unam`	Unanimity government dummy

Source of data and model Torben Iversen and David Soskice, 2002, ``Why do some democracies redistribute more than others?’’ Harvard University.

You can download it and load it (again, most of the time you will not be able to load it directly from the web)

iver <- read_csv("https://raw.githubusercontent.com/GarciaRios/govt_6029/master/data/iver.csv")

ok, let’s take a quick look

glimpse(iver)

Now a quick summary

summary(iver)

Regression!

The basic command for linear regression in R is lm(). A call to this function takes the generic form illustrated below:

res <- lm(y ~ x1 + x2 + x3, data = your.dataframe)

Let’s run a regression using the Iverson and Soskice data. We save the regression as an object and then print the summary of the results:

A simple bivariate model

Here let’s use povred as the outcome variable and lnenp as our only covariate

lm_bivar <- lm(povred ~ lnenp, data = iver)

summary(lm_bivar)

Bivariate Regression:
- How do we interpret this output?

A multivariate model

It’s probably much more complicated than the number of parties. Let’s add more covariates

What else do we have?

head(iver)

Ok, how about the type of electoral system

lm_multi <- lm(povred ~ lnenp + elec_sys, data = iver)

summary(lm_multi)

Multivariate Regression:
- How do we interpret this output?
- What changed?
- Try changing the categorical variable and use the dummies instead
  - Are there any changes?

Interaction models

Let’s add an interaction here. It will help us observe observe the joint effect

lm_multi_interact <- lm(povred ~ lnenp * elec_sys, data = iver)

summary(lm_multi_interact)


lm_multi_interact <- lm(povred ~ lnenp * elec_sys, data = iver)

Prelimnary analysis:

So the first thing that we would have done, and that you should do every time is to strat by familiarizing with the data before you run regressions

Let’s run a few ggplots to see what’s going on

Aside on the formula

Note: the first argument to the function is an R formula. Formulas appear throughout many R functions, and have some special features of their syntax, some of which are illustrated below.

In lm, the formula is used to generate the exact \(X\) matrix that will be used to estimate the model. To see the matrix being generated internally by lm, add the argument x = TRUE to the lm() call:

lm_cat <- lm(povred ~ lnenp + elec_sys, data = iver, x = TRUE)

lm_cat$x

Take a look at the help page for formula

?formula

To better understand what the formula is doing, let’s look at the model matrix one of the more complex formulas above generates, note the interaction term and the base terms of this interaction:

lm_multi_interact <- lm(povred ~ lnenp * elec_sys, data = iver, x = TRUE)


lm_multi_interact$x

Another aside, now on matrices

Matrices are collections of equal-length vectors in row and column arrangement. Matrices store information in a rectangular format, so look like data frames, but are less flexible as all data must be the same type (you can’t mix character and numeric data, for example). At first when you start working in R, matrices will be in use behind the scenes more than something you work with much.

As an example, see what happens when we convert our iver data frame into a matrix:

iver_matrix <- as.matrix(iver)

iver_matrix

Why is everything in quotation marks now?

You can index a matrix using square brackets, indicating first the row you want, then the column. This is the same way that you could directly extract specific values from a data frame

iver_matrix[2:5, ]

A blank before or after the comma indicates all rows or columns, respectively

iver_matrix[4, 3]

If we leave out the character vectors and convert the data frame to a matrix, then the matrix will be numeric,

as.matrix(select(iver, - cty, - elec_sys))

Anatomy of lm objects
- Use names() and glimpse() (or str() it will do the same as glimpse in this case) to explore the contents of one of the lm objects you’ve created. (Look at the help file for lm for further details)
- Extract and save as separate objects:
  - The coefficients
  - The residuals (what are the residuals?)
  - The fitted values (what are the fitted values?)

Standard errors

If you notice the standard errors are missing from the lm() list. Let’s calculate it

se_multi <- 
  lm_multi %>% 
  vcov() %>% 
  diag %>% 
  sqrt

se_multi

This calculates the standard errors by calculating the square root of the diagonal of the variance-covariance matrix of the parameters of the model object. vcov() is an example of a function that has a specific “method” for different types of objects: it knows we have an lm object and acts accordingly.

Fitted values and predictions

Another way to get the fitted values is with predict():

predict(lm_bivar)

The nice thing about predict is that it will actually let us calculate the expected values from our model for any set of real or hypothetical data with the same X variables:

Here’s the general form of a call to predict, giving 95% confidence intervals:

predict(object, # lm object
        newdata, # a data frame with same x vars as data, but new or specific values
        interval = "confidence",
        level = 0.95 #the default
)

So the newdata component refers to the specific values we want. For instance, if our model is:

\(y_1 = x1 + x2, x3\) where:

We might want to specify this:

newdata =  
  data_frame(
    x1 = 5,
    x2 = 320,
    x3 = "Texas"
)

Let’s try this with our models.

Predicting Values
- Using the bivariate model What is the predicted level of poverty with 3 parties
- What would we expect the level of poverty reduction to be for a majoritarian country with 2 parties?
- What would we expect the level of poverty reduction to be for a PR country as it goes from 1 to 5 parties?
hint (refer to data frame info above for how to create a new dataframe for newdata argument)

Expanding the posibilities of the preditions

Of course, we can use more than one value of interest

predict(lm_cat, 
        newdata = data_frame(lnenp = seq(1:5),  
                             elec_sys = "maj"),
        interval = "confidence")

Same thing as just creating the new hypothetical data and then just insert it into the predict function

hypx <-  data_frame(lnenp = seq(1:5), 
                    elec_sys = "maj")

hypx

predict(lm_cat, newdata = hypx, interval="confidence")

I can do this for multiple values and combinations:

hypx <-  expand.grid(lnenp = seq(0,2, .1),
                    elec_sys = c("pr", "maj"))




predict(lm_cat, newdata = hypx, interval="confidence")

Plotting regression results

Plotting regression results can be even more informative and more intuitive than regression tables!

To plot a regression line (not just using the lm smoother in ggplot2), you can either fit a line to the observed values of X and the fitted values and CIs from predict, or fit a line to hypothetical data to illustrate the estimated relationship (the latter can help you to have smoother confidence intervals where you have fewer observations).

We will use the hypothetical data we calculated above:

preds <-  predict(lm_cat, newdata = hypx, interval="confidence", level = .90)

This is calculating expected values of povred for various hypothetical values of lnenp, setting the other covariates to a fixed level, illustrating the effect of a change in lnenp, all else equal (for a “typical” respondent, use the mean of the covariates you are keeping fixed). Remember to keep variable names identical to those in the model!

So those predictions are great and we could plot as they are them but let’s merge (bind) our hypothetical data and these predictions. It will make things easier when we graph

pred_plot <- cbind(hypx, preds)
pred_plot

Now we can graph it

ggplot(pred_plot, aes(x = lnenp, y = fit,
                      ymin = lwr, ymax =  upr, 
                      color = elec_sys,
                      fill = elec_sys )) +
  geom_line() +
  geom_ribbon(alpha = .3) +
  facet_wrap(~ elec_sys)

Formating regression tables.

There are mutiple package that could help with the prese ntaiton for your regression tables.

cat(texreg::htmlreg(list(lm_cat),
                html.tag = FALSE,
                head.tag = FALSE,
                body.tag = FALSE,
                doctype = FALSE))

Statistical models
	Model 1
(Intercept)	17.66
	(12.69)
lnenp	26.69
	(14.15)
elec_syspr	9.22
	(11.34)
elec_sysunam	-48.95^*
	(17.86)
R²	0.74
Adj. R²	0.66
Num. obs.	14
p < 0.001; p < 0.01; p < 0.05

stargazer(lm_cat, type = "html")


	Dependent variable:

	povred

lnenp	26.693^*
	(14.154)

elec_syspr	9.221
	(11.341)

elec_sysunam	-48.952^**
	(17.864)

Constant	17.658
	(12.686)


Observations	14
R²	0.738
Adjusted R²	0.659
Residual Std. Error	12.367 (df = 10)
F Statistic	9.381^*** (df = 3; 10)

Note:	p<0.1; p<0.05; p<0.01

I use stargazer most of the time but there many others like these. Here is a list of some of these packages

stargazer
xtable
kable
pander
texreg

Like always, use whatever works for you.

Science should be open! Here at Cornell and everywhere, this lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported.