Introduction to data

Getting started
Describing distributions
Visualizing relationships
Subsetting
Recap
On Your Own

Your reproducible lab report: Before you get started, download the R Markdown template for this lab. Remember all of your code and answers go in this document:

download.file("https://github.com/GarciaRios/govt_3990/raw/gh-pages/Labs/lab2/lab2.Rmd", destfile = "lab2.Rmd")

Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.

A note on expectations: For each exercise and on your own question you answer include any relevant output (tables, summary statistics, plots) in your answer. Doing this is easy! Just place any relevant R code in a code chunk, and hit Knit HTML.

Getting started

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

We begin by loading the data set of 20,000 observations into the R work space. After launching RStudio, enter the following command.

load(url("https://github.com/GarciaRios/govt_3990/raw/gh-pages/Labs/lab2/Data/cdc.RData"))

The data set cdc that shows up in your work space is a data matrix, with each row representing a case and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.

To view the names of the variables, type the command

names(cdc)

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey.

We call this a codebook:

genhlth Respondent’s self evaluation their general health:
- excellent
- very good
- good
- fair
- poor
exerany Whether the respondent exercised in the past month: yes (1) or did not (0).
hlthplan Whether the respondent had some form of health coverage: yes (1) or did not (0).
smoke100 Whether the respondent had smoked at least 100 cigarettes in her lifetime.
height Respondent’s height in inches.
weight Respondent’s weight in pounds.
wtdesire Respondent’s desired weight.
age Respondent’s age in years.
gender Respondent’s gender.

A very useful function for taking a quick peek at your dataset, and viewing its dimensions and data types is str.

str(cdc)

Note that R calls categorical variables factors.

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical: ordinal or not, numerical: continuous or discrete). Do not just rely on the R output, also think about the nature of the variables.

Describing distributions

Histograms

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics.

Previously we used qplot which is the simplified version of ggplot

Let’s start with some histograms. We can create a the histogram for the age of our respondents with the following command:

ggplot(cdc, aes(x = age)) + 
  geom_histogram()

This functions says to plot age on the x-axis, this information and more goes into the aesthetic components aes. It also defines a geom (short for geometric object), which describes the type of plot you will produce.

Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins. You can easily define the binwidth you want to use (notice that it can go in the initial set up or inside the geom):

ggplot(cdc, aes(x = age)) + 
  geom_histogram(binwidth = 20)

ggplot(cdc, aes(x = age, binwidth = 1)) + 
  geom_histogram()

How do these histograms with the various binwidths compare?

Summary statistics

As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, third quartile, maximum.

To calculate the summary statistics for weight, type

cdc %>%
  select(weight) %>%
  summary()

A note on piping: Note that we can read these three lines of code as the following:

“Take the cdc dataset and pipe it into the select function. Using this function select the variable called weight, and pipe this variable into the summary function.”

The %>% operator is called the piping operator. Basically, it takes the output of the current line and pipes it into the following line of code.

Since R also functions like a very fancy calculator, you can use these statistics to then calculate the interquartile range for the respondents’ weight, as the value of Q3 - Q1.

190 - 140

You can also individually calculate summary statistics, and make your own customized list of them using the summarise function:

cdc %>%
  summarise(mean_wt = mean(weight), sd_wt = sd(weight), mean_ht = mean(height), sd_ht = sd(height))

Note that in the summarise function we created a list of four elements. The names of these elements are user defined, like mean_wt, sd_wt, etc. and you could customize these names as you like (just don’t use spaces in your names). Calculating these summary statistics also require that you know the function calls. Some useful function calls for summary statistics for a single numerical variable are as follows:

mean
median
sd
var
IQR
range
min
max

Another useful functionality is being able to quickly calculate summary statistics for various groups in your dataset. For example, we can modify the above command using the group_by function to get the same summary stats for males and females:

cdc %>%
  group_by(gender) %>%
  summarise(mean_wt = mean(weight), sd_wt = sd(weight), mean_ht = mean(height), sd_ht = sd(height))

Here, we first grouped the data by gender, and then calculated the summary statistics.

Calculate the median and interquartile range for weights of people who did and did not exercise in the last month.

Tables

While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type

cdc %>%
  select(smoke100) %>%
  table()

or instead look at the relative frequency distribution by typing

cdc %>%
  select(smoke100) %>%
  table()/20000

cdc %>%
  select(smoke100) %>%
  table() %>% 
  prop.table()

Notice how R automatically divides all entries in the table by 20,000 in the command above.

To make a bar plot of these data, use the following:

ggplot(cdc, aes(x = smoke100)) + 
         geom_bar()

Compute the relative frequency distribution for gender and also for genhlth. How many males are in the sample? What proportion of the sample reports being in excellent health?

The table command can be used to create contingency tables as well. For example, to examine which participants have smoked across each gender, we could use the following.

cdc %>%
  select(gender, smoke100) %>%
  table()


cdc %>% 
  select(gender, smoke100) %>%
  table() %>% 
  prop.table(.,2)

a. What percent of males are smokers? What percent of females are smokers? b. What percent of the sample are males who are smokers? What percent of the sample are females who are smokers? c. Which pair of statistics is more useful for determining whether males or females are more likely to be smokers? Explain your reasoning.

Visualizing relationships

Between two numercial variables

In the last lab we visualized relationships between two numerical variables using scatterplots. As a quick reminder, let’s make one more of those:

ggplot(cdc, aes(x = weight, y= height)) + 
  geom_point()

How are these two variables related?

Between two categorical variables

To create a segmented bar plots we can use the following two options.

ggplot(cdc, aes(x = gender)) + 
  geom_bar(aes(fill = smoke100))

ggplot(cdc, aes(x = gender)) + 
  geom_bar(aes(fill = smoke100), position= position_dodge())

How are the two plots different from each other? Which one is more useful for comparing the proportions of male and female smokers?

Between a numerical and a categorical variable

The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with

ggplot(cdc, aes(x= gender, y = height)) + 
  geom_boxplot()

Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI) (http://en.wikipedia.org/wiki/Body_mass_index). BMI is a weight to height ratio, and can be calculated as:

\[ BMI = \frac{weight~(lb)}{height~(in)^2} * 703 \]

703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and pounds).

We can use the mutate function to create this new variable and add it to the cdc dataset:

cdc <- cdc %>%
  mutate(bmi = (weight / height^2) * 703)

Notice that (weight / height^2) * 703 is just some arithmetic, but it’s applied to all 20,000 values in the dataset. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it lets us perform computations like this using very simple expressions.

Now we can use this new variable in our analysis. For example, let’s create side-by-side box plots of bmi over the levels of genhlth:

ggplot(cdc, aes(y = bmi, x = genhlth)) + 
  geom_boxplot()

Discuss what this box plot shows.

Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, create side-by-side box plots of the distribution of BMI over this variable, and provide an interpretation for this plot.

Subsetting

It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We can do this easily using the filter function and a series of logical operators. The most commonly used logical operators for data analysis are

== means “equal to”
!= means “not equal to”
> or < means “greater than” or “less than”
>= or <= means “greater than or equal to” or “less than or equal to”

Using these, we can create a subset of the cdc dataset for just the men, and save this as a new dataset called males:

males <- cdc %>%
  filter(gender == "m")

Here, we’ve created a new object, called males. The special symbol <- performs an assignment, taking the output of one line of code and saving it into this new object that you defined.

We can take look at the contents of this object by clicking on it in the Environment tab, or by typing the following command we can view the first few lines of it

head(males)

We don’t want to print the entire dataset since it’s pretty large, taking a look at just the first few lines should give you a good idea of what it looks like.

This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables using the select function we learned earlier, but this is not relevant here since we are focusing on subsetting a dataset based on values of one or more variables.

As an aside, you can use several of these conditions together with & and |. The & is read “and” so that

males_and_over30 <- cdc %>%
  filter(gender == "m" & age > 30)

will give you the data for men over the age of 30. The | character is read “or” so that

males_or_over30 <- cdc %>%
  filter(gender == "m" | age > 30)

will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like when forming a subset.

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise, and report the number of cases that meet this criteria.

Recap

At this point, we’ve done a good first pass at analyzing the information in the BRFSS questionnaire. We’ve found an interesting association between smoking and gender, and we can say something about the relationship between people’s assessment of their general health and their own BMI. We’ve also picked up essential computing tools – summary statistics, subsetting, and plots – that will serve us well throughout this course.

On Your Own

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new variable in the cdc dataset called wdiff.
What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative? Hint: This is the only exercise in this lab that doesn’t require code to answer.
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots and numerical summaries you use. What does this tell us about how people feel about their current weight?
Using numerical summaries and side-by-side box plots, determine if men tend to view their weight differently than women.
Now it’s time to get creative. Make a scatter plot of weight vs desired weight and add a new aesthetic element to the scatter with color with your choice of an interesting variable. Explain your findings.