Install R and RStudio
R is the name of the programming language, and RStudio is a convenient and widely used interface to that language.
Since you will be using it for the remainder of the course, you should familiarize yourself with the RStudio GUI.
It consists of four windows:
Bottom left: The console window. You type commands at the > prompt and R executes them.
Top left: The editor window. Here you can edit and save R scripts which contain multiple R commands.
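Top right: The environment and history window. It lists the objects in your workspace and the commands you have run.
Bottom right: The files, plots, packages, and help window.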
RStudio documentation can be found at http://www.rstudio.com/ide/docs/. Of those, the most likely to be useful to you are:
Keeping all the files associated with a project organized together (input data, R scripts, analytical results, figures) is such a wise and common practice that RStudio has built-in support for it via its projects. Read this for more information about RStudio projects.
You will use RStudio projects for your labs, homeworks, and final paper. Create an RStudio project that you will use for all your labs.
For this course, you will be using R Markdown documents for homeworks. Create your first R Markdown document and save it with Ctrl-S.
Cheat sheets and additional resources about R Markdown are available at http://rmarkdown.rstudio.com/.
Although it is so much more, you can use R as a calculator. For example, to add, subtract, multiply or divide:
2 + 3
2 - 3
2 * 3
2 / 3
The power of a number is calculated with ^, e.g. \(4^2\) is,
4 ^ 2
R includes many functions for standard math operations. For example, the square root function is sqrt, e.g. \(\sqrt{2}\),
sqrt(2)
And you can combine many of them together,
(2 * 4 + 3) / 10
sqrt(2 * 2)
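When several operations are combined, R follows the usual order of operations, so parentheses matter. The short sketch below is only an illustration of that point; the specific expressions are chosen for this example.

```r
# Multiplication and division are evaluated before addition,
# so these two expressions give different results.
(2 * 4 + 3) / 10      # (8 + 3) / 10, which is 1.1
2 * 4 + 3 / 10        # 8 + 0.3, which is 8.3

# Functions can be nested inside larger expressions.
sqrt(2 * 2) + 4 ^ 2   # 2 + 16, which is 18
```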
In R, you can save the results of calculations into objects that you can use later. This is done using the special symbol, <-. For example, this saves the results of 2 + 2 to an object named foo
foo <- 2 + 2
You can see that foo is equal to 4
foo
And you can reuse foo in other calculations,
foo + 3
foo / 2 * 8 + foo
You can use = instead of <- for assignment. You may see this in some other code. There are some technical reasons to use <- instead of =, but the primary reason we will use <- instead of = is that this is the convention used in modern R programs.
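As a small illustration of the two assignment forms, the sketch below assigns the same value both ways and checks that the results match. The object names bar and baz are made up for this example.

```r
# Both statements create an object holding the value 4.
bar <- 2 + 2   # the convention used in this lab
baz = 2 + 2    # also works at the top level, but less conventional

# The two objects are the same.
identical(bar, baz)

# Inside a function call, = names an argument instead of assigning,
# so here `digits = 1` is simply passed to round().
round(3.14159, digits = 1)
```

One practical consequence is that = does double duty (assignment and argument naming), while <- always means assignment, which makes code a little easier to scan.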
Missing data is particularly important.
foo <- c(1, 2, NA, 3, 4)
The function na.omit is particularly useful. It removes any row in a dataset with a missing value in any column. For example:
dfrm <- data.frame(x = c(NA, NA, 4, 3),
                   y = c(NA, NA, 7, 8))
dfrm
na.omit(dfrm)
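Challenges:
What is the result of 2 + NA?
What is the result of mean(foo)?
Look at the documentation of mean to change how that function handles missing values.
How does median(foo) work?
Evaluate foo > 2. Are all the entries TRUE and FALSE?
What does is.na(foo) do? What about ! is.na(foo)?
What does foo[! is.na(foo)] do?
What does na.omit do with the following data frame?
dfrm2 <- data.frame(x = c(NA, 2, NA, 4), y = c(NA, NA, 7, 8))
dfrm2
na.omit(dfrm2)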
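After working through the challenges, the general pattern for missing values can be sketched on a different vector. The vector scores is made up for this illustration, and sum is used here only as one example of a function that accepts an na.rm argument.

```r
# `scores` is a made-up vector with one missing value.
scores <- c(10, NA, 30)

# Arithmetic involving NA yields NA: the missing value propagates.
sum(scores)

# Many summary functions accept na.rm = TRUE to drop missing values first.
sum(scores, na.rm = TRUE)   # 10 + 30, which is 40

# is.na() flags the missing entries; ! negates the flags.
is.na(scores)               # FALSE TRUE FALSE
scores[!is.na(scores)]      # keeps only 10 and 30
```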
For the remainder of this lab you will be using a dataset of GDP per capita and fertility from Gapminder.
Download the csv (“comma-separated values”) from here.
Then load the file
gapminder <- read.csv("gapminder.csv", stringsAsFactors = FALSE)
This creates a data frame. A data frame is a type of R object that corresponds to what you usually think of as a dataset or a spreadsheet — rows are observations and columns are variables.
gapminder
This is a lot of information. How can we get a more useful picture of the dataset as a whole?
dim(gapminder)
names(gapminder)
head(gapminder)
tail(gapminder)
summary(gapminder)
dim() shows the dimensions of the data frame as the number of rows, columns.
names() shows the column names of the data frame.
head() shows the first few observations.
tail() shows the last few observations.
summary() calculates summary statistics for all variables in the data frame.
You can extract single variables (or columns) and perform different operations on them. To extract a variable, we use the dollar sign ($) extraction operator.
gapminder$lifeExp
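To see what the $ operator returns without printing a long column, here is a minimal sketch on a tiny data frame. The data frame toy and its values are invented for this illustration; only the column name lifeExp mirrors the gapminder data.

```r
# `toy` is a made-up data frame, not the gapminder data.
toy <- data.frame(country = c("A", "B", "C"),
                  lifeExp = c(70.5, 65.2, 80.1),
                  stringsAsFactors = FALSE)

# $ returns the column as an ordinary vector.
toy$lifeExp

# Double brackets with the column name select the same vector.
toy[["lifeExp"]]

# Because the result is a vector, it can be passed straight to a function.
mean(toy$lifeExp)
```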
Again, perhaps a summary may be more interesting. We can do more specific operations on this variable alone:
mean(gapminder$lifeExp)
median(gapminder$lifeExp)
sd(gapminder$lifeExp)
min(gapminder$lifeExp)
max(gapminder$lifeExp)
quantile(gapminder$lifeExp)
length() calculates the length of a vector.
unique() returns the unique values in a vector.
Make sure your lab compiles neatly. Make sure that you are not printing unnecessary output.
You can find the R Markdown code that I used to create this document on the class website. Download it and check the code that I use to keep it nice and clean, as well as the Markdown code that I use throughout the text (e.g. to create lists and other styling).
footnotes:
If you are curious as to why the variable was named foo, read this.
Dataset from the gapminder R package. The dataset in that package is an excerpt from the Gapminder data. Gapminder data is released under the Creative Commons Attribution 3.0 Unported license. See their terms of use.
Some text and the data set used in this are taken from Jenny Bryan, R basics, workspace and working directory, RStudio projects, licensed under CC BY-NC 3.0.
Science should be open! Here at Cornell and everywhere, this lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.
Comments
Any R code following a hash (#) is not executed. These are called comments, and can and should be used to annotate and explain your code. For example, this doesn’t do anything.
And in this, nothing after the # is executed,
Challenge: What is this equal to?