Due via Canvas: Wed, March 24

Instructions

Some other guidance

Problem 1

Data

This problem uses sprinters.csv which contains the winning times from the meter sprint in Olympic competitions going back to 1900.[^1]

The dataset sprinters contains the following variables:

Variable   Description
finish     best time in seconds in the meter sprint
year       the year of the competition
women      1 if the time is the women’s best; 0 if the time is the men’s best
  1. Matrix Form
  • In R, create a matrix \(X\) composed of three columns: a column of ones, a column made up of the variable year, and a column made up of the variable women.

  • Create a matrix \(y\) comprised of a single column, made up of the variable finish.

  • Compute the following using R’s matrix commands (note that you will need to use the matrix multiplication operator %*%): \[ b = (X' X)^{-1} X' y \] Report the result of this calculation. That is, interpret the results.

    See Matrices in R for more information on how to use matrices in R.
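
The formula for \(b\) can be sketched on a small made-up dataset. The values below are invented for illustration and are not taken from sprinters.csv; the names toy, outcome, time, and group are placeholders, not variables from the assignment.

```r
# Hypothetical toy data (NOT the sprinters data): an outcome, a numeric
# predictor, and a 0/1 indicator.
toy <- data.frame(
  outcome = c(10.2, 9.9, 9.7, 11.8, 11.3, 11.0),
  time    = c(1, 2, 3, 1, 2, 3),
  group   = c(0, 0, 0, 1, 1, 1)
)

# Design matrix: a column of ones, then the two predictors.
X <- cbind(intercept = 1, time = toy$time, group = toy$group)

# Response as a one-column matrix.
y <- matrix(toy$outcome, ncol = 1)

# b = (X'X)^{-1} X'y, using t() for the transpose, %*% for matrix
# multiplication, and solve() for the inverse.
b <- solve(t(X) %*% X) %*% t(X) %*% y
b
```

The rows of b line up with the columns of X, so the first entry is the intercept and the remaining entries are the slopes on each predictor.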

  2. Fitting a linear model
  • Make a nice plot exploring the relationship between finish and year. Make sure the graph is labeled nicely, so that anyone who does not know your variable names could still read it.

    • Use ggplot alone for this plot. You do not need to obtain predicted values for this plot.
  • Using the function lm, run a regression of finish on year and women.

  • Compare the results with the calculation you did in Section 1. What do you see?

  • Redo the plot and try to account for each level of women. Interpret the graph.

  • Rerun the regression, adding an interaction between women and year.

    • Interpret the results. Did anything change as compared to the previous model?
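
As a rough sketch of what adding an interaction does, the example below fits an additive model and an interaction model to made-up data. The data frame toy and its columns are hypothetical and are not the sprinters data; only the lm formula syntax carries over.

```r
# Hypothetical toy data (NOT the sprinters data).
toy <- data.frame(
  outcome = c(10.2, 9.9, 9.7, 11.8, 11.3, 11.0),
  time    = c(1, 2, 3, 1, 2, 3),
  group   = c(0, 0, 0, 1, 1, 1)
)

# Additive model: one common slope for time, plus a shift for group == 1.
fit_add <- lm(outcome ~ time + group, data = toy)

# Interaction model: time * group expands to time + group + time:group,
# so each group gets its own slope.
fit_int <- lm(outcome ~ time * group, data = toy)

coef(fit_add)
coef(fit_int)

# The slope for group 0 is the `time` coefficient; the slope for group 1
# adds the interaction coefficient to it.
slope_group0 <- unname(coef(fit_int)["time"])
slope_group1 <- unname(coef(fit_int)["time"] + coef(fit_int)["time:group"])
```

In the additive model both groups are forced to share a slope, so comparing coef(fit_add) with the two group-specific slopes shows what the interaction term adds.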
  3. Predicted Values
  • Suppose that an Olympics had been held in 2001. Use the predict function to calculate the expected finishing time for men and for women.
    • Calculate 95% confidence intervals for the predictions.
    • Interpret the results.
  • Now do the same for 1981, 1991, and 2001. Use the predict function to calculate the expected finishing time for men and for women.
    • Calculate 95% confidence intervals for the predictions.
    • Interpret the results.
    • Make a plot that would best summarize and present the most information in relation to these predictions (do not use geom_smooth). Explain why you chose the elements that you used in this graph.
  • The authors of the Nature article were interested in predicting the finishing times for the 2156 Olympics. Use predict to do so, for both men and women, and report 95% confidence intervals for your results.
    • Make a plot summarizing these predictions (do not use geom_smooth).
    • Do you trust the model’s predictions? Is there reason to trust the 2001 prediction more than the 2156 prediction?
    • Is any regression assumption of the model being abused or overworked to make this prediction?
    Hint: Try predicting the finishing times in the year 3000 C.E.
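
The prediction steps above can be sketched on made-up data. Everything named here (toy, new_data, pred, and the values of time) is hypothetical and unrelated to the sprinters data. The sketch assumes a model with an interaction, and geom_pointrange is only one possible way to draw intervals without geom_smooth.

```r
library(ggplot2)

# Hypothetical toy data (NOT the sprinters data).
toy <- data.frame(
  outcome = c(10.2, 9.9, 9.7, 9.6, 11.8, 11.3, 11.0, 10.6),
  time    = c(1, 2, 3, 4, 1, 2, 3, 4),
  group   = c(0, 0, 0, 0, 1, 1, 1, 1)
)
fit <- lm(outcome ~ time * group, data = toy)

# One row per combination of values to predict for.
# time = 3 lies inside the observed range; time = 20 is far outside it.
new_data <- expand.grid(time = c(3, 20), group = c(0, 1))

# interval = "confidence" returns the columns fit, lwr, and upr.
pred <- cbind(
  new_data,
  predict(fit, newdata = new_data, interval = "confidence", level = 0.95)
)
pred

# Point estimates with their intervals, built without geom_smooth.
# Call print(p) to draw the figure.
p <- ggplot(pred, aes(x = factor(time), y = fit, colour = factor(group))) +
  geom_pointrange(aes(ymin = lwr, ymax = upr),
                  position = position_dodge(width = 0.3)) +
  labs(x = "Time (toy units)", y = "Predicted outcome",
       colour = "Group indicator")
```

In this sketch the interval at time = 20 is much wider than the one at time = 3, and even that wider interval still takes for granted that the fitted straight line continues to hold outside the observed range.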

Problem 2

Data

This question will use a dataset included with R.

data("anscombe")

The dataset consists of 4 separate datasets, each with an \(x\) and \(y\) variable. The original dataset is not a tidy dataset. The following code creates a tidy dataset of the anscombe data that is easier to analyze.

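library("tidyverse")
anscombe2 <- anscombe %>%
    # Number the rows so each observation can be tracked after reshaping.
    mutate(obs = row_number()) %>%
    # Stack the columns x1..x4 and y1..y4 into name/value pairs.
    gather(variable_dataset, value, - obs) %>%
    # Split names such as "x1" after the first character: "x" and "1".
    separate(variable_dataset, c("variable", "dataset"), sep = 1L) %>%
    # Give x and y their own columns again.
    spread(variable, value) %>%
    arrange(dataset, obs)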
  1. Looking at your data beyond summary statistics
  • For each dataset, calculate the mean and standard deviation of x and y, and the correlation between x and y.

  • Run a linear regression between x and y for each dataset.

  • How similar do you think that these datasets will look?

  • Create a scatter plot of each dataset and its linear regression fit. Hint: you can do this easily with facet_wrap.

  • How do we make sense of these plots?
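
The group-by-group workflow in this section can be sketched on a different built-in dataset. The example below uses iris, grouped by Species, purely as a stand-in; the choice of iris and of these two columns is an assumption made for illustration and says nothing about what anscombe2 will show. It uses base R for the summaries and ggplot2 for the faceted plot.

```r
library(ggplot2)

# Split a built-in data set (iris, used here only as a stand-in) by group.
by_group <- split(iris, iris$Species)

# Mean, standard deviation, and correlation within each group.
stats <- t(sapply(by_group, function(d) {
  c(mean_x = mean(d$Sepal.Length), sd_x = sd(d$Sepal.Length),
    mean_y = mean(d$Petal.Length), sd_y = sd(d$Petal.Length),
    cor_xy = cor(d$Sepal.Length, d$Petal.Length))
}))
stats

# One regression per group; collect the intercepts and slopes.
coefs <- t(sapply(by_group, function(d) {
  coef(lm(Petal.Length ~ Sepal.Length, data = d))
}))
coefs

# One panel per group, each with its own least-squares line.
# Call print(p) to draw the figure.
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ Species)
```

Putting the table of summaries next to the panels makes it easier to see where the numbers and the pictures agree and where they do not.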


Problem 3

  1. Research project

If you are still collecting your data, report as much as you can or as much as you have. If you do not have data at this point, talk to me ASAP.

It’s okay if you end up doing something different for your final paper, or if you are still unsure about the project. The point of this is to get you working with your data as soon as possible, so that if problems arise early we can deal with them now, when something can still be done, and not later, when it is too late.