Your reproducible lab report: Before you get started, download the R Markdown template for this lab. Remember all of your code and answers go in this document:
download.file("https://github.com/GarciaRios/govt_3990/raw/gh-pages/Labs/lab2/lab2.Rmd", destfile = "lab2.Rmd")
Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.
A note on expectations: For each exercise and “On Your Own” question you answer, include any relevant output (tables, summary statistics, plots) in your answer. Doing this is easy: place the relevant R code in a code chunk and click Knit HTML.
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.
We begin by loading the data set of 20,000 observations into the R work space. After launching RStudio, enter the following command.
load(url("https://github.com/GarciaRios/govt_3990/raw/gh-pages/Labs/lab2/Data/cdc.RData"))
The data set cdc that shows up in your work space is a data matrix, with each row representing a case and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.
To view the names of the variables, type the command
names(cdc)
This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey.
The following list describes each variable. We call this a codebook:
genhlth: Respondent’s self-evaluation of their general health.
exerany: Whether the respondent exercised in the past month: yes (1) or did not (0).
hlthplan: Whether the respondent had some form of health coverage: yes (1) or did not (0).
smoke100: Whether the respondent had smoked at least 100 cigarettes in their lifetime.
height: Respondent’s height in inches.
weight: Respondent’s weight in pounds.
wtdesire: Respondent’s desired weight.
age: Respondent’s age in years.
gender: Respondent’s gender.
A very useful function for taking a quick peek at your dataset and viewing its dimensions and data types is str.
str(cdc)
Note that R calls categorical variables factors.
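To see what str reports, here is a minimal sketch on a small made-up data frame. The object toy and its three rows are hypothetical and are not taken from the cdc data; only the column names mirror the codebook.

```r
# Hypothetical three-row data frame; the values are invented for illustration.
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair")),  # categorical -> factor
  height  = c(70, 64, 67),                           # numeric
  weight  = c(175, 125, 160)                         # numeric
)

str(toy)                             # dimensions plus the type of every column
is_factor <- sapply(toy, is.factor)  # TRUE only for the categorical column
is_factor
```

The same two calls work on cdc itself once it has been loaded.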
The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics.
Previously we used qplot, which is the simplified version of ggplot. Let’s start with some histograms. We can create a histogram of the age of our respondents with the following command:
library(ggplot2)  # provides ggplot() and the geom_* functions
ggplot(cdc, aes(x = age)) +
  geom_histogram()
This command says to plot age on the x-axis; this information and more goes into the aesthetic mapping, aes. It also defines a geom (short for geometric object), which describes the type of plot you will produce.
Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data are split between the different bins. You can easily define the binwidth you want to use. In current versions of ggplot2, binwidth is an argument of geom_histogram rather than an aesthetic, so set it inside the geom and not inside aes:
ggplot(cdc, aes(x = age)) +
  geom_histogram(binwidth = 20)
ggplot(cdc, aes(x = age)) +
  geom_histogram(binwidth = 1)
How do these histograms with the various binwidths compare?
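One way to see why the shape changes is to count how many observations land in each bin. The sketch below does this in base R with hist(plot = FALSE) on twelve made-up ages (the vector ages is hypothetical). ggplot2 may place its bin boundaries differently, so treat the counts as an illustration of the idea rather than a reproduction of the plots.

```r
# Twelve invented ages, not taken from the cdc data.
ages <- c(18, 22, 25, 31, 38, 44, 47, 52, 59, 63, 71, 80)

# Same data, two bin widths: 20 years and 5 years.
wide   <- hist(ages, breaks = seq(0, 100, by = 20), plot = FALSE)
narrow <- hist(ages, breaks = seq(0, 100, by = 5),  plot = FALSE)

wide$counts    # a few tall bars: detail is smoothed away
narrow$counts  # many short bars: more detail, but also more noise
```

Both versions account for all twelve observations; only the grouping differs.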
As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, third quartile, maximum.
To calculate the summary statistics for weight, type
library(dplyr)  # provides %>%, select(), summarise(), group_by(), mutate(), filter()
cdc %>%
  select(weight) %>%
  summary()
A note on piping: We can read these three lines of code as follows:
“Take the cdc dataset and pipe it into the select function. Using this function, select the variable called weight, and pipe this variable into the summary function.”
The %>% operator is called the piping operator. It takes the output of the code on its left and passes it as the first argument to the function on its right, which is usually written on the following line.
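The pipe is only a different way of writing a nested function call. The sketch below checks this on a hypothetical four-row data frame named toy (invented values), assuming the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data; not taken from the cdc data set.
toy <- data.frame(weight = c(150, 180, 165, 200),
                  height = c(65, 70, 68, 72))

# Piped form: read left to right, top to bottom.
piped <- toy %>%
  select(weight) %>%
  summary()

# Nested form: the same calls, read from the inside out.
nested <- summary(select(toy, weight))

all.equal(piped, nested)  # compares the two results
```

The piped form becomes easier to read than the nested form as more steps are chained together.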
Since R also functions like a very fancy calculator, you can use these statistics to then calculate the interquartile range for the respondents’ weight, as the value of Q3 - Q1.
190 - 140
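Rather than typing the quartiles by hand, the same arithmetic can be done with quantile, or in one step with IQR. A minimal sketch on seven made-up weights (the vector weights is hypothetical):

```r
# Seven invented weights in pounds.
weights <- c(120, 140, 150, 165, 175, 190, 210)

q1 <- quantile(weights, 0.25, names = FALSE)  # first quartile
q3 <- quantile(weights, 0.75, names = FALSE)  # third quartile

q3 - q1        # interquartile range computed by hand
IQR(weights)   # the same value from the built-in function
```

Using the functions avoids copying rounded numbers from the summary output.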
You can also individually calculate summary statistics, and make your own customized list of them using the summarise function:
cdc %>%
  summarise(mean_wt = mean(weight), sd_wt = sd(weight),
            mean_ht = mean(height), sd_ht = sd(height))
Note that in the summarise function we created a list of four elements. The names of these elements are user-defined, like mean_wt, sd_wt, etc., and you can customize these names as you like (just don’t use spaces in your names). Calculating these summary statistics also requires that you know the function calls. Some useful function calls for summary statistics for a single numerical variable are as follows:
mean
median
sd
var
IQR
range
min
max
Another useful functionality is being able to quickly calculate summary statistics for various groups in your dataset. For example, we can modify the above command using the group_by function to get the same summary stats for males and females:
cdc %>%
  group_by(gender) %>%
  summarise(mean_wt = mean(weight), sd_wt = sd(weight),
            mean_ht = mean(height), sd_ht = sd(height))
Here, we first grouped the data by gender, and then calculated the summary statistics.
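To make the effect of group_by concrete, here is a minimal sketch on a hypothetical four-row data frame (the object toy and its values are invented), assuming dplyr is installed. It also uses n(), a dplyr helper that counts the rows in each group and does not appear in the commands above.

```r
library(dplyr)

# Hypothetical data: two men and two women.
toy <- data.frame(gender = c("m", "f", "m", "f"),
                  weight = c(180, 130, 200, 150))

by_gender <- toy %>%
  group_by(gender) %>%                  # split the rows by gender
  summarise(mean_wt = mean(weight),     # one mean per group
            n       = n())              # number of rows per group

by_gender  # one row per gender instead of one row overall
```

Without the group_by line, summarise would return a single row computed over all four people.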
weights of people who did and did not exercise in the last month.
While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type
cdc %>%
select(smoke100) %>%
table()
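or instead look at the relative frequency distribution by typing either
cdc %>%
  select(smoke100) %>%
  table()/20000
or
cdc %>%
  select(smoke100) %>%
  table() %>%
  prop.table()
Notice how R automatically divides all entries in the table by 20,000 in the first command. The prop.table function in the second command divides each entry by the table total instead, which gives the same result as long as no responses are missing.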
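The relationship between counts and relative frequencies can be checked on a small example. The sketch below uses eight made-up responses (the vector smoke100 here is hypothetical, not the cdc column) and base R only.

```r
# Eight invented responses: 1 = has smoked 100 cigarettes, 0 = has not.
smoke100 <- c(1, 0, 0, 1, 1, 0, 0, 0)

counts <- table(smoke100)       # frequency distribution
props  <- prop.table(counts)    # relative frequency distribution

counts
props
counts / length(smoke100)       # dividing by the sample size gives the same proportions
```

The relative frequencies always sum to 1, which makes them easier to compare across samples of different sizes.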
To make a bar plot of these data, use the following:
ggplot(cdc, aes(x = smoke100)) +
geom_bar()
gender and also for genhlth. How many males are in the sample? What proportion of the sample reports being in excellent health?
The table command can be used to create contingency tables as well. For example, to examine which participants have smoked across each gender, we could use the following.
cdc %>%
select(gender, smoke100) %>%
table()
cdc %>%
select(gender, smoke100) %>%
table() %>%
prop.table(.,2)
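The second argument of prop.table selects the margin over which proportions are computed: 1 gives proportions within each row, 2 within each column. With gender in the rows and smoke100 in the columns, the 2 in the command above gives the share of each gender within each smoke100 group. A minimal base-R sketch on six made-up respondents (both vectors are hypothetical):

```r
# Six invented respondents.
gender   <- c("m", "m", "f", "f", "m", "f")
smoke100 <- c(1, 0, 0, 1, 1, 0)

tab <- table(gender, smoke100)   # rows: gender, columns: smoke100

row_props <- prop.table(tab, 1)  # each row sums to 1
col_props <- prop.table(tab, 2)  # each column sums to 1

tab
row_props
col_props
```

Which margin to use depends on the question: row proportions answer "what share of each gender has smoked?", while column proportions answer "what share of smokers is of each gender?".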
In the last lab we visualized relationships between two numerical variables using scatterplots. As a quick reminder, let’s make one more of those:
ggplot(cdc, aes(x = weight, y= height)) +
geom_point()
How are these two variables related?
To create segmented bar plots, we can use either of the following two options. The first stacks the segments within each bar; the second places them side by side.
ggplot(cdc, aes(x = gender)) +
geom_bar(aes(fill = smoke100))
ggplot(cdc, aes(x = gender)) +
geom_bar(aes(fill = smoke100), position= position_dodge())
The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with
ggplot(cdc, aes(x= gender, y = height)) +
geom_boxplot()
Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI) (http://en.wikipedia.org/wiki/Body_mass_index). BMI is a weight to height ratio, and can be calculated as:
\[ BMI = \frac{weight~(lb)}{height~(in)^2} * 703 \]
703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and pounds).
We can use the mutate function to create this new variable and add it to the cdc dataset:
cdc <- cdc %>%
mutate(bmi = (weight / height^2) * 703)
Notice that (weight / height^2) * 703 is just some arithmetic, but it’s applied to all 20,000 values in the dataset. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it lets us perform computations like this using very simple expressions.
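The sketch below illustrates both points on three made-up people (the vectors are hypothetical): the arithmetic is applied element by element, and the factor 703 reproduces, to a close approximation, the metric formula of kilograms divided by meters squared. The conversion constants are the exact definitions 1 lb = 0.45359237 kg and 1 in = 0.0254 m.

```r
# Three invented people.
weight_lb <- c(150, 200, 125)
height_in <- c(65, 70, 62)

# One expression, three results: R works on whole vectors at once.
bmi_imperial <- (weight_lb / height_in^2) * 703

# The same quantity computed in metric units.
weight_kg  <- weight_lb * 0.45359237
height_m   <- height_in * 0.0254
bmi_metric <- weight_kg / height_m^2

round(bmi_imperial, 1)
round(bmi_metric, 1)
```

The two versions differ only slightly, because 703 is a rounded value of 0.45359237 / 0.0254^2.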
Now we can use this new variable in our analysis. For example, let’s create side-by-side box plots of bmi over the levels of genhlth:
ggplot(cdc, aes(y = bmi, x = genhlth)) +
geom_boxplot()
Discuss what this box plot shows.
It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We can do this easily using the filter function and a series of logical operators. The most commonly used logical operators for data analysis are
== means “equal to”
!= means “not equal to”
> or < means “greater than” or “less than”
>= or <= means “greater than or equal to” or “less than or equal to”
Using these, we can create a subset of the cdc dataset for just the men, and save this as a new dataset called males:
males <- cdc %>%
filter(gender == "m")
Here, we’ve created a new object called males. The special symbol <- performs an assignment: it takes the output of the code on its right and saves it into the object named on its left.
We can take a look at the contents of this object by clicking on it in the Environment tab, or we can view its first few lines by typing the following command:
head(males)
We don’t want to print the entire dataset since it’s pretty large; taking a look at just the first few lines should give you a good idea of what it looks like.
This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables using the select function we learned earlier, but this is not relevant here since we are focusing on subsetting a dataset based on values of one or more variables.
As an aside, you can use several of these conditions together with & and |. The & is read “and” so that
males_and_over30 <- cdc %>%
filter(gender == "m" & age > 30)
will give you the data for men over the age of 30. The | character is read “or” so that
males_or_over30 <- cdc %>%
filter(gender == "m" | age > 30)
will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like when forming a subset.
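Behind the scenes, each condition produces a vector of TRUE and FALSE values, one per row, and filter keeps the rows where the combined value is TRUE. A minimal base-R sketch on five made-up respondents (both vectors are hypothetical):

```r
# Five invented respondents.
age    <- c(19, 35, 42, 27, 25)
gender <- c("m", "f", "m", "m", "f")

is_male <- gender == "m"   # TRUE for each man
over30  <- age > 30        # TRUE for each person older than 30

is_male & over30           # TRUE only when both conditions hold
is_male | over30           # TRUE when at least one condition holds

sum(is_male & over30)      # number of rows an "and" filter would keep
sum(is_male | over30)      # number of rows an "or" filter would keep
```

An “and” can never keep more rows than either condition alone, while an “or” can never keep fewer.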
Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise, and report the number of cases that meet these criteria.
At this point, we’ve done a good first pass at analyzing the information in the BRFSS questionnaire. We’ve found an interesting association between smoking and gender, and we can say something about the relationship between people’s assessment of their general health and their own BMI. We’ve also picked up essential computing tools (summary statistics, subsetting, and plots) that will serve us well throughout this course.
Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning the result to a new variable in the cdc dataset called wdiff.
What type of data is wdiff? If an observation’s wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative? Hint: This is the only exercise in this lab that doesn’t require code to answer.
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots and numerical summaries you use. What does this tell us about how people feel about their current weight?
Using numerical summaries and side-by-side box plots, determine if men tend to view their weight differently than women.
Now it’s time to get creative. Make a scatter plot of weight vs desired weight and use the color aesthetic to add an interesting variable of your choice to the plot. Explain your findings.