Due via Canvas: Wed, March 24
hw2_lastname (where lastname should be your actual last name, of course).

Some other guidance:
Use the Tidyverse (dplyr, ggplot, etc.) as much as possible (as opposed to base R). I know it sometimes feels like base R would be easier, or that it’s all the same, but I want you to push yourself outside of that comfort zone. You can use base R later; for now, try to adhere as much as possible to the Tidyverse.

This problem uses sprinters.csv, which contains the winning times from the meter sprint in Olympic competitions going back to 1900.[^1]
The dataset sprinters contains the following variables:
Variable | Description |
---|---|
finish | best time in seconds in the meter sprint |
year | the year of the competition |
women | 1 if the time is the women’s best; 0 if the time is the men’s best |
In R, create a matrix \(X\) comprising three columns: a column of ones, a column made up of the variable year, and a column made up of the variable women.
Create a matrix \(y\) comprised of a single column, made up of the variable finish.
Compute the following using R’s matrix commands (note that you will need to use the matrix multiplication operator %*%):
\[
b = (X' X)^{-1} X' y
\]
Report the result of this calculation; that is, interpret the results.
See Matrices in R for more information on how to use matrices in R.
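As a rough illustration of the matrix commands involved, the sketch below applies the formula to a small toy data set. The data frame `toy` and its columns `x1`, `x2`, and `y` are invented for this example and are not the sprinters data; the point is only to show how `t()`, `solve()`, and `%*%` fit together.

```r
# Toy data, invented purely for illustration (not the sprinters data).
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(0, 1, 0, 1, 0, 1),
  y  = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)
)

# Design matrix: a column of ones, then the two predictors.
X <- cbind(intercept = 1, x1 = toy$x1, x2 = toy$x2)

# Outcome as a single-column matrix.
y <- matrix(toy$y, ncol = 1)

# b = (X'X)^{-1} X'y, using t() for the transpose, solve() for the inverse,
# and %*% for matrix multiplication.
b <- solve(t(X) %*% X) %*% t(X) %*% y
b
```

The result `b` is a 3 x 1 matrix with one row per column of `X`.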
Make a nice regression plot exploring the relationship between finish and year. Make sure the graph is labeled nicely, so that anyone who does not know your variable names can still read it.
Using the function lm, run a regression of finish on year and women.
Compare the results with the calculation you did in Section 1. What do you see?
Redo the plot and try to account for each level of women. Interpret the graph.
Rerun the regression, adding an interaction between women and year.
Use the predict function to calculate the expected finishing time for men and for women.
geom_smooth). Explain why you chose to use the elements that you used in this graph.

Use predict to do so, for both men and women, and report 95% confidence intervals for your results.
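For reference, here is a minimal sketch of how `predict` can return confidence intervals for new observations. The data frame `toy`, its columns `time`, `group`, and `y`, and the value `time = 7` are all made up for this illustration; they are not the sprinters data or the values the assignment asks about.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  time  = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6),
  group = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  y     = c(10.2, 9.9, 9.7, 9.6, 9.3, 9.1, 12.1, 11.6, 11.0, 10.7, 10.1, 9.8)
)

# Linear model with an interaction between time and group.
fit <- lm(y ~ time * group, data = toy)

# New observations: one row per group at a hypothetical time value.
new_obs <- data.frame(time = c(7, 7), group = c(0, 1))

# interval = "confidence" returns the fitted value (fit) plus the lower (lwr)
# and upper (upr) bounds of the confidence interval for the expected outcome.
# level = 0.95 is the default and is written out here for clarity.
pred <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
cbind(new_obs, pred)
```

The column names in `newdata` must match the predictor names used in the model formula.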
geom_smooth)

This question will use a dataset included with R.
data("anscombe")
The dataset consists of 4 separate datasets, each with an \(x\) and \(y\) variable. The original dataset is not a tidy dataset. The following code creates a tidy dataset of the anscombe data that is easier to analyze.
library("tidyverse")
anscombe2 <- anscombe %>%
  mutate(obs = row_number()) %>%
  gather(variable_dataset, value, -obs) %>%
  separate(variable_dataset, c("variable", "dataset"), sep = 1L) %>%
  spread(variable, value) %>%
  arrange(dataset, obs)
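The `gather` and `spread` functions still work, but tidyr now marks them as superseded in favor of `pivot_longer` and `pivot_wider`. As an optional alternative, the sketch below should produce the same tidy layout in one reshaping step. The name `anscombe_tidy` and the regular expression are choices made for this example, not part of the assignment.

```r
library(dplyr)
library(tidyr)

data("anscombe")

anscombe_tidy <- anscombe %>%
  mutate(obs = row_number()) %>%
  # Split names such as "x1" into the variable ("x" or "y") and the dataset
  # number. The ".value" sentinel turns the first piece into column names.
  pivot_longer(
    -obs,
    names_to = c(".value", "dataset"),
    names_pattern = "([xy])([1-4])"
  ) %>%
  arrange(dataset, obs)

head(anscombe_tidy)
```

The result has one row per observation per dataset, with columns `obs`, `dataset`, `x`, and `y`.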
For each dataset: calculate the mean and standard deviation of x and y, and the correlation between x and y.
Run a linear regression between x and y for each dataset.
How similar do you think that these datasets will look?
Create a scatter plot of each dataset and its linear regression fit. Hint: you can do this easily with facet_wrap.
How do we make sense of these plots?
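To show the general shape of a faceted scatter plot with a regression line, here is a small sketch using `mtcars`, another dataset that ships with R. The choice of `mtcars` and of the columns `wt`, `mpg`, and `cyl` is purely illustrative; the same pattern should carry over to other data with a grouping column.

```r
library(ggplot2)

# mtcars ships with R; it stands in here for any data with a grouping column.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # method = "lm" overlays a least-squares line; se = FALSE hides the band.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # One panel per value of cyl.
  facet_wrap(~ cyl) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
p
```

Each panel gets its own fitted line because `geom_smooth` is computed separately within each facet.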
In just a few sentences, describe your project. Think of this as what you would say next semester, when we finally go back to White Hall, if you ran into a faculty member you haven’t had a chance to interact with and they asked about your research in the elevator. Yes, this is your elevator pitch!
Now, describe your data.
Describe, in as precise terms as possible, the distribution of the outcome variable you plan to use. If you have the data in hand, a histogram would be ideal; if you do not, give a verbal description of what you expect the distribution to look like. Be sure to indicate if the data are continuous or categorical.
What challenges would your data pose for analysis by least squares regression? Be sure to discuss any potential violations of the assumptions of the Gauss-Markov theorem, as well as any other complications or difficulties you see in modeling your data.
If you are still collecting your data, report as much as you can or as much as you have. If you do not have data at this point, talk to me ASAP.
It’s okay if you end up doing something different for your final paper, or if you are still unsure about the project. The point of this is to get you working with your data as soon as possible, so that if problems arise early we can deal with them now, when something can be done, and not later, when it is too late.