Advanced Regression Analysis - GOVT 6029: Problem Set 1

Due via Canvas: Fri, March 4

Instructions

Create a new R Markdown file called hw1_lastname (where lastname is your actual last name, of course).
Do all your analysis in the R Markdown document.
Compile an HTML document.
Submit the HTML and R Markdown files through Canvas. The submission should contain all the code and materials another person needs to run your R Markdown file.
Try to work on your own as much as possible. Troubleshooting as a group is often helpful, but please push yourself as much as you can.
Do not use new packages that we haven’t used in class/labs. I know some additional packages will make things easier but that’s not the point.

Some other guidance

All problems should be answerable in at most a few lines of R code. Questions which require looking up values should be answered using R code and not manually checking the value through the RStudio GUI.
Problems are divided thematically, but each bullet point should be treated as a separate exercise.
Do not print unnecessary output; you will be penalized for printing long strings of unnecessary output. These reports should be clean and as concise as possible.
- You can use additional options in each code chunk to control how much output is rendered in the HTML
  - You can find more information on how to control display options here
Use ggplot2 for your plots. Some of the plots in the problem set might be easier using base R, but the purpose of the problem set is to use the skills we are learning.
Like everything else in the world of coding, there are multiple ways to do this: some simpler (requiring only one or two verbs or lines of code), others more complex, where you might need to combine multiple verbs and perhaps do some googling. I actually want you to do this.
When writing interpretations for the questions, use markdown text; do not use comments inside the chunks.
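As a minimal sketch of the chunk options mentioned above, the following could go in a setup chunk near the top of the .Rmd file. The particular defaults chosen here are an assumption for illustration, not a course requirement; `knitr::opts_chunk$set` applies them to every later chunk.

```r
# Document-wide chunk defaults (illustrative choices, adjust as needed).
knitr::opts_chunk$set(
  echo = TRUE,      # keep the code visible in the HTML output
  message = FALSE,  # hide messages such as package start-up notes
  warning = FALSE   # hide warnings
)
```

Individual chunks can override these defaults in their own header, for example with `results = "hide"` to suppress printed output for a single chunk.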

Data

The file democracy.csv contains data from Przeworski et al., Democracy and Development: Political Institutions and Well-Being in the World, 1950-1990¹. The data have been slightly recoded so that higher values indicate higher levels of political liberty and democracy.

Variable	Description
`COUNTRY`	numerical code for each country
`CTYNAME`	name of each country
`REGION`	name of region containing country
`YEAR`	year of observation
`GDPW`	GDP per capita in real international prices
`EDT`	average years of education
`ELF60`	ethnolinguistic fractionalization
`MOSLEM`	percentage of Muslims in country
`CATH`	percentage of Catholics in country
`OIL`	whether oil accounts for 50+% of exports
`STRA`	count of recent regime transitions
`NEWC`	whether country was created after 1945
`BRITCOL`	whether country was a British colony
`POLLIB`	degree of political liberty (1–7 scale, rising in political liberty)
`CIVLIB`	degree of civil liberties (1–7 scale, rising in civil liberties)
`REG`	presence of democracy (0=non-democracy, 1=democracy)

Problems

Initial set up
- Load the Democracy dataset into memory as a dataframe. Use the read.csv function, and the stringsAsFactors = FALSE option. Note that missing values are indicated by “.” in the data. Find the option in read.csv that controls the string used to indicate missing values.
Initial data exploration
- Report summary statistics (means and medians, at least) for all variables.
- Create a histogram for political liberties.
- Now, create a histogram for political liberties in which each unique value of the variable is in its own panel. What is new in this plot as compared to the previous one?
- Create a histogram for GDP per capita.
- Create a histogram for log GDP per capita. How does this histogram differ from the one for unlogged GDP per capita?
Explore relationships
- Create a scatterplot of political liberties against GDP per capita. That is, political liberties is the dependent variable.
- When there is a lot of overlap in a scatter plot it is useful to “jitter” the points (randomly move them up and down). Make the previous plot but jitter the points to mitigate the problem of overplotting. (Only jitter the points vertically). You can use geom_jitter in ggplot2 for this.
- Create a scatterplot of political liberties against log GDP per capita. Jitter the points. How does the relationship differ from the one seen when GDP per capita was not logged?
- Create a boxplot of GDP per capita for oil-producing and non-oil-producing nations. Make sure both groups appear in a single graph.
- Add a substantive interpretation to this graph.
- Now, create a graph with boxplots of each region’s GDP per capita in which oil-producing and non-oil-producing nations are shown in different colors.
- Add a substantive interpretation to this graph. How does it compare to the previous graph?
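To illustrate vertical-only jittering, here is a minimal sketch on R's built-in `mtcars` data rather than the problem set data. It assumes ggplot2 is installed; the jitter height of 0.2 is an arbitrary illustrative value.

```r
library(ggplot2)

# mtcars ships with R; cyl takes only the values 4, 6 and 8,
# so points stack on three horizontal lines without jitter.
p <- ggplot(mtcars, aes(x = mpg, y = cyl)) +
  geom_jitter(width = 0, height = 0.2)  # width = 0 leaves x positions untouched
p
```

Setting `width = 0` keeps the horizontal positions exact, so only the discrete variable on the y-axis is perturbed.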
Transform data and analyze
- Calculate the mean GDP per capita in countries with at least 40 percent Catholics. How does it compare to mean GDP per capita for all countries?
- Calculate the average GDP per capita in countries with greater than 60% ethnolinguistic fractionalization, less than 60%, and missing ethnolinguistic fractionalization. Hint: you can calculate this with the dplyr verbs filter, mutate, group_by and/or summarise.
- What was the median of the average years of education in 1985 for all countries? One country is right at the median; which country is it?
- Which countries were closest to the median years of education in 1985 among all countries?
- What was the median of the average years of education in 1985 for democracies?
- Which democracy was (or democracies were) closest to the median years of education in 1985 among all democracies?
- What were the 25th and 75th percentiles of ethnolinguistic fractionalization for new and old countries?
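The dplyr verbs named in the hint can be combined roughly as follows. This is a sketch on hypothetical toy data: the columns `score` and `value` and the 0.5 cutoff are made up for illustration and are not taken from democracy.csv.

```r
library(dplyr)

# Hypothetical toy data; not columns of democracy.csv.
toy <- data.frame(
  score = c(0.2, 0.8, NA, 0.7, 0.1, NA),
  value = c(10, 20, 30, 40, 60, 60)
)

result <- toy %>%
  # Label each row, giving missing scores their own category.
  mutate(group = case_when(
    is.na(score) ~ "missing",
    score > 0.5  ~ "high",
    TRUE         ~ "low"
  )) %>%
  group_by(group) %>%
  summarise(mean_value = mean(value, na.rm = TRUE))

result
```

Checking `is.na()` first matters: a comparison such as `score > 0.5` returns `NA` for missing scores, so those rows would otherwise fall through to the catch-all branch.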

Notes:

¹ Przeworski, Adam, Michael E. Alvarez, Jose Antonio Cheibub, and Fernando Limongi. 2000. Democracy and Development: Political Institutions and Well-Being in the World, 1950-1990. Cambridge University Press.

Sergio I. Garcia-Rios

Feb 14, 2022
