The objectives of this lab are:
- Use %>% to simplify complicated code by chaining expressions together

For this lab we will use the libraries dplyr and ggplot2. We can load those individually or load tidyverse, which is less a single package than a collection of packages. Lately, I only load tidyverse since it contains all the packages that I frequently use. Remember to install these packages before you load them.
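As a minimal sketch of the setup described above, assuming a fresh machine: the install.packages() line is commented out because it only needs to run once, and the choice between the two loading options is a matter of taste.

```r
# Install once per machine (uncomment if the packages are missing)
# install.packages(c("tidyverse", "gapminder"))

# Option 1: load the two packages individually
library(dplyr)
library(ggplot2)

# Option 2: load the whole collection at once (it includes dplyr and ggplot2)
# library(tidyverse)
```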
You will be using the gapminder data again.
In the first lab, you loaded the data from a .csv file. In this lab, you will use the same data as distributed in the gapminder package.
To load a data set included with an R package, use the data() function.
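For instance, here is a small sketch using the diamonds data set that ships with ggplot2. It is chosen only for illustration, so as not to give away the challenge that follows.

```r
# Load the diamonds data set from the ggplot2 package into the workspace
data("diamonds", package = "ggplot2")

# Check what was loaded
class(diamonds)
dim(diamonds)
```

The package argument lets you load a data set without attaching the package first.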
You can see which data sets are included in a package. gapminder is not the only one: most packages ship with some data sets. See, for instance, ggplot2:
data(package = "gapminder")
data(package = "ggplot2")
Challenge
- Which data set in the gapminder package is country data?
- Use the data() function to load the gapminder data.
dplyr
dplyr is a package for data manipulation. It provides a few core verbs, and most data manipulations can be done by combining these verbs together, something which becomes even easier with the %>% operator.
- filter(): subset observations by logical conditions
- slice(): subset observations by row numbers
- arrange(): sort the data by variables
- select(): select a subset of variables
- rename(): rename variables
- distinct(): keep only distinct rows
- mutate() and transmute(): add new variables
- group_by(): group the data according to variables
- summarise(): summarize multiple values into a single value
- sample_n() and sample_frac(): select a random sample of rows

dplyr also offers the function glimpse() to quickly view the data:
glimpse(gapminder)
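Some of the verbs listed above are not demonstrated elsewhere in this lab, so here is a minimal sketch of a few of them. The object names (first_three, continents, life_by_continent) are made up for illustration.

```r
library(dplyr)
library(gapminder)

# slice(): keep rows by position (here, the first three rows)
first_three <- slice(gapminder, 1:3)

# distinct(): keep one row per unique value of continent
continents <- distinct(gapminder, continent)

# group_by() + summarise(): collapse each continent to one summary row
life_by_continent <- summarise(
  group_by(gapminder, continent),
  mean_lifeExp = mean(lifeExp)
)

# sample_n(): draw five random rows (different on every run)
sample_n(gapminder, 5)
```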
We are ready to begin exploring our data set in more depth.
For this lab we want to explore the relationship between life expectancy and GDP. Let’s use some dplyr verbs to explore our data. For you Stata users missing “if statements”, let’s begin with filter():
filter(gapminder, lifeExp < 29)
filter(gapminder, country == "Rwanda")
You can combine filter statements.
Including multiple logical statements is equivalent to combining them with “and”.
This will give all observations in Africa before 1966 that have a life expectancy below 40.
gapminder2 <-
filter(gapminder, continent == "Africa", year < 1966, lifeExp < 40)
That is equivalent to
filter(gapminder, continent == "Africa" & year < 1966 & lifeExp < 40)
To combine logical statements with “or” you need to explicitly use |. To find observations from Afghanistan or Albania:
filter(gapminder, country == "Afghanistan" | country == "Albania")
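When the “or” conditions all test the same variable, the %in% operator is a more compact alternative. A minimal sketch, with with_or and with_in as illustrative names:

```r
library(dplyr)
library(gapminder)

# Explicit "or" between two conditions on the same variable
with_or <- filter(gapminder, country == "Afghanistan" | country == "Albania")

# The same subset, written as set membership
with_in <- filter(gapminder, country %in% c("Afghanistan", "Albania"))

# Compare the two results
all.equal(with_or, with_in)
```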
arrange() to sort rows in a given order
Because the world is not always ordered the way we want it, arrange() sorts the rows by the values of one or more columns:
arrange(gapminder, pop)
arrange(gapminder, -pop)
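The minus sign works for numeric columns. As a sketch of a more general approach, dplyr's desc() helper also handles character and factor columns, and arrange() accepts several sorting variables at once:

```r
library(dplyr)
library(gapminder)

# Largest populations first, using desc() instead of the minus sign
by_pop_desc <- arrange(gapminder, desc(pop))

# Sort by continent, then by life expectancy from highest to lowest
by_continent_life <- arrange(gapminder, continent, desc(lifeExp))
```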
select() to subset the data on variables or columns
Most of the time we don’t need to see all the variables and are often interested in just a few of them. Here’s a conventional call:
select(gapminder, year, lifeExp)
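Beyond naming columns one by one, select() supports a few shortcuts. A brief sketch (the object names are illustrative):

```r
library(dplyr)
library(gapminder)

# Drop a column with a minus sign
no_pop <- select(gapminder, -pop)

# Keep a range of adjacent columns
first_cols <- select(gapminder, country:year)

# Keep columns whose names start with a given prefix
c_cols <- select(gapminder, starts_with("c"))
```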
Challenge
Use a combination of filter(), select(), and slice() to create a data frame that shows only the year and life expectancy of Cambodia for the first two observations.
%>% to join the XXI century
Before we go any further, we should introduce the pipe operator that dplyr imports from the magrittr package.
This is going to change your (data-analysis) life.
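As a sketch, assuming the Cambodia challenge is solved with filter(), select(), and slice() in that order, the nested and piped versions might look like this:

```r
library(dplyr)
library(gapminder)

# Nested calls: read from the inside out
nested <- slice(select(filter(gapminder, country == "Cambodia"), year, lifeExp), 1:2)

# Piped: read from top to bottom, one step per line
piped <- gapminder %>%
  filter(country == "Cambodia") %>%
  select(year, lifeExp) %>%
  slice(1:2)

# Both approaches produce the same data frame
all.equal(nested, piped)
```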
Notice we can do the same computation as above but without having to create new objects. We are basically passing down the result from the previous line into the following.
We think of the %>% operator as a “then” statement: each step, such as selecting the year and lifeExp columns, finishes and then hands its result to the next verb.
mutate() to add new variables
Imagine we wanted to recover each country’s GDP. We do have data for population and GDP per capita. What do we do?
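One possible answer, as a minimal sketch: multiply population by GDP per capita inside mutate(). The names gdp and gapminder_gdp are assumptions made for this example.

```r
library(dplyr)
library(gapminder)

# Total GDP = population * GDP per capita
gapminder_gdp <- gapminder %>%
  mutate(gdp = pop * gdpPercap)

gapminder_gdp %>%
  select(country, year, pop, gdpPercap, gdp)
```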
So… total GDP on its own is hard to interpret because it doesn’t give a baseline, which is why we often use per capita figures. A baseline is often even more useful, so how about comparing each country to another one, say, the USA?
Yes USA, USA, USA!
Let’s first create a data frame containing only US data; we use filter() here. We also need to rename one of the variables so that it does not clash with gdpPercap when we join:
just_usa <- gapminder %>%
filter(country == "United States") %>%
select(year, gdpPercap) %>%
rename(usa_gdpPercap = gdpPercap)
We can join (or merge) this data set to the gapminder data using the left_join() function.
There are several ways to merge data sets with dplyr: left_join(), right_join(), inner_join(), and full_join() (a full outer join). They serve different purposes. Here we use left_join(), which keeps every row of gapminder and adds the matching US value for each year:
gapminder <-
left_join(gapminder, just_usa, by = c("year"))
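To make the difference between the join types concrete, here is a sketch on two tiny made-up tables (left_tbl and right_tbl are hypothetical, not part of the gapminder data):

```r
library(dplyr)

# Two toy tables that share a year column but cover different years
left_tbl  <- tibble(year = c(1952, 1957, 1962), x = c("a", "b", "c"))
right_tbl <- tibble(year = c(1957, 1962, 1967), y = c(10, 20, 30))

# left_join(): all rows of left_tbl; y is NA where no match exists
joined_left <- left_join(left_tbl, right_tbl, by = "year")

# inner_join(): only the years present in both tables
joined_inner <- inner_join(left_tbl, right_tbl, by = "year")

# full_join(): every year from either table
joined_full <- full_join(left_tbl, right_tbl, by = "year")
```

Because every gapminder year also appears in the US subset, a left join here adds a column without dropping or duplicating any rows.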
Now we can take gdpPercap and divide it by usa_gdpPercap to obtain a figure relative to the US.
gapminder <-
gapminder %>%
mutate(gdpPercapRel = gdpPercap / usa_gdpPercap)
Now, compute a general summary of the relative GDP
gapminder %>%
select(gdpPercapRel) %>%
summary()
Nice, now we can do something like this:
Look at the GDP per capita of Mexico and Canada relative to the US by year:
gapminder %>%
filter(country == "Canada") %>%
select(country, year, gdpPercap, usa_gdpPercap, gdpPercapRel)
gapminder %>%
filter(country == "Mexico") %>%
select(country, year, gdpPercap, usa_gdpPercap, gdpPercapRel)
Or this:
df_NAFTA <-
gapminder %>%
filter(country %in% c("Mexico", "Canada")) %>%
select(country, year, gdpPercap, usa_gdpPercap, gdpPercapRel)
Challenge
What about life expectancy? Create a life expectancy variable relative to the US, and compare the three NAFTA countries: US, Canada, and Mexico.
For the most part, to visualize results in this course we will be using the graphics package ggplot2, which is one of the most popular, but it is only one of several graphics packages in R (some others are listed at the end of this lab).
Unlike many other graphics systems, functions in ggplot2 do not correspond to separate types of graphs. There are no scatterplot, histogram, or line chart functions per se. Instead, plots are built up from component functions.
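Main components: the data, the aesthetic mapping (aes), and the geometric objects (geoms)
Other: scales, facets, and themes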
Let’s continue using the gapminder data. Take another look at it:
glimpse(gapminder)
Great, let the plotting begin:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
This just initializes the plot with the basic mapping. We still need to tell ggplot the geometric objects (or geoms) that we will use to represent the data in this mapping.
Points seem like a good representation, so let’s use geom_point():
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point()
That looks okay, but it would probably look better if we log transform GDP per capita. Notice that we don’t have to create a new variable; we can just do the transformation inside the aesthetics.
ggplot(gapminder, aes(x = log10(gdpPercap), y = lifeExp)) +
geom_point()
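A better way to log transform is to transform the scale rather than the variable, which keeps the axis labeled in the original units: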
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
scale_x_log10()
Note the common workflow: gradually build up the plot you want, re-define the plot by adding (literally) new elements.
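This incremental workflow can be sketched by storing the plot in an object (here called p, an arbitrary name) and adding one element at a time:

```r
library(ggplot2)
library(gapminder)

# Start with the data and the mapping only
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Add a layer of points
p <- p + geom_point()

# Add a log scale for the x-axis
p <- p + scale_x_log10()

# Printing the object draws the plot
p
```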
Now, let’s add another variable by coloring the points according to the continent they belong to.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
scale_x_log10()
Let’s try to address over-plotting by setting the alpha transparency and size of the points to fixed values. There is an even better way to do it, but this is ok for now.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha = .3, size = 3) +
scale_x_log10()
Now, add a fitted curve or line. Let’s forget about continent for now.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(alpha = .3, size = 3) +
geom_smooth() +
scale_x_log10()
Let’s remove the confidence band
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(lwd = 2, se = FALSE) +
scale_x_log10()
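By default, geom_smooth() chooses a smoother based on the size of the data: local polynomial regression (aka LOESS) for fewer than 1,000 observations and a generalized additive model (GAM) otherwise. We can force any other type of fit. Let’s try a linear model, lm: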
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(lwd = 2, se = FALSE, method = "lm") +
scale_x_log10()
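To see what that straight line represents, here is a sketch of the equivalent model fitted directly with lm(). Because scale_x_log10() transforms the data before the smoother runs, the line is a regression of lifeExp on log10(gdpPercap), not on raw gdpPercap. The name fit is arbitrary.

```r
library(gapminder)

# Same model that geom_smooth(method = "lm") fits on the log10 scale
fit <- lm(lifeExp ~ log10(gdpPercap), data = gapminder)

# Intercept and slope of the fitted line
coef(fit)
```

The slope is the expected change in life expectancy for a tenfold increase in GDP per capita.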
That’s great but I actually want to revive our interest in continents!
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
geom_smooth(lwd = 2, se = FALSE, method = "lm") +
scale_x_log10()
That’s a lot of info. Let’s try something else called faceting. This creates various panels or subplots by a given variable.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
scale_x_log10() +
facet_wrap(~ continent)
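facet_wrap() takes a few arguments that control the panel layout. A short sketch of two of them, nrow and scales (the object names are illustrative):

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()

# Put all panels in a single row
one_row <- base_plot + facet_wrap(~ continent, nrow = 1)

# Let each panel choose its own y-axis range
free_y <- base_plot + facet_wrap(~ continent, scales = "free_y")

one_row
```

Shared axes (the default) make panels directly comparable, so free scales are best reserved for cases where the ranges differ a lot.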
Still want best fit lines? Let’s add them
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha = .2) +
geom_smooth() +
scale_x_log10() +
facet_wrap(~ continent)
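Notice what color = continent is doing here: it’s affecting both the points and the line. In the next plot we set a fixed color inside geom_smooth(), which overrides the mapping for the line so that continent colors only the points. Note that this version also puts year on the x-axis and jitters the points.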
ggplot(gapminder, aes(x = year, y = lifeExp, color = continent)) +
geom_jitter( alpha = .2) +
geom_smooth(color = "black") +
scale_x_log10() +
facet_wrap(~ continent)
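An alternative sketch of the same idea maps color inside geom_point() itself, so the mapping never reaches the smoother in the first place. This version keeps gdpPercap on the x-axis, and the name p_points is arbitrary.

```r
library(ggplot2)
library(gapminder)

p_points <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  # color is mapped only in this layer
  geom_point(aes(color = continent), alpha = .2) +
  # the smoother inherits x and y but no color mapping
  geom_smooth(color = "black") +
  scale_x_log10() +
  facet_wrap(~ continent)

p_points
```

Aesthetics set in ggplot() are inherited by every layer, while aesthetics set in a geom apply to that layer alone.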
Challenge
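Great! So we need to account for time. Let’s use some of the cool capabilities of the extensions in the ggplot-verse with the gganimate package. Make sure to also install the package gifski, which gganimate uses to render GIFs. The country_colors palette used below comes with the gapminder package.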
library(gganimate)
ggplot(gapminder %>% filter(continent != "Oceania"), aes(gdpPercap, lifeExp, size = pop, color = country)) +
geom_point(alpha = 0.7, show.legend = FALSE) +
scale_colour_manual(values = country_colors) +
# scale_size(range = c(2, 12)) +
scale_x_log10() +
facet_wrap(~continent, nrow = 1) +
theme_bw() +
# Here comes the gganimate specific bits
labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'Life Expectancy') +
transition_time(year) +
ease_aes('linear')
Ok, back to the analysis. What if I am only interested in the US?
ggplot(filter(gapminder, country == "United States"),
aes(x = year, y = lifeExp)) +
geom_line() +
geom_point()
Let’s just look at five countries
some_countries <- c("United States", "Canada", "Rwanda", "Cambodia", "Mexico")
ggplot(filter(gapminder, country %in% some_countries),
aes(x = year, y = lifeExp, color = country)) +
geom_line() +
geom_point()
So what’s up with Mexico?
Not really… Let’s add yet another variable, GDP per capita, mapped to the size of the points:
ggplot(filter(gapminder, country %in% some_countries),
aes(x = year, y = lifeExp, color = country)) +
geom_line() +
geom_point(aes(size=gdpPercap))
You can change the way the plot looks overall using theme
ggplot(subset(gapminder, country %in% some_countries),
aes(x = year, y = lifeExp, color = country)) +
geom_line() +
geom_point(aes(size=gdpPercap)) +
theme_minimal() +
scale_color_brewer(palette = "Dark2")
In addition to the themes included with ggplot, several other themes are available in the ggthemes package.
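As a sketch, assuming ggthemes is installed, swapping in one of its themes works the same way as with the built-in ones. theme_economist() is just one of the themes that package provides.

```r
library(dplyr)
library(ggplot2)
library(gapminder)
library(ggthemes)

some_countries <- c("United States", "Canada", "Rwanda", "Cambodia", "Mexico")

p_theme <- ggplot(filter(gapminder, country %in% some_countries),
                  aes(x = year, y = lifeExp, color = country)) +
  geom_line() +
  geom_point(aes(size = gdpPercap)) +
  # a theme from ggthemes instead of theme_minimal()
  theme_economist() +
  scale_color_brewer(palette = "Dark2")

p_theme
```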
You may encounter these other graphics packages in other classes, or in code samples online:
- Base R graphics: plot(), barplot(), hist(). See http://www.statmethods.net/graphs/index.html
- lattice: barchart(), densityplot(), dotplot(), xyplot(), histogram(). See http://www.statmethods.net/advgraphs/trellis.html