Advanced Regression Analysis - GOVT 6029: Problem Set 2

Instructions
Problem 1
- Data
Problem 2
- Data
Problem 3

Due via Canvas: Wed, March 24

Instructions

Create a new R Markdown file called hw2_lastname (where lastname is your actual last name, of course).
Do all your analysis in the R markdown document.
- Note that this is just the name of the file, not the title of the document.
Use appropriate markdown code to make a nice and clean document, including titles, subtitles, and relevant text outside of code chunks.
Compile an HTML document.
Submit the R Markdown file and your HTML file. The submission should contain all necessary code and materials for another person to run your R Markdown file.

Some other guidance

All problems should be answerable in at most a few lines of R code. Questions which require looking up values should be answered using R code and not manually checking the value through the RStudio GUI.
Problems are thematically divided, but each bullet point should be seen as a separate exercise.
Try to use the Tidyverse (dplyr, ggplot, etc.) as much as possible, as opposed to base R. I know it sometimes feels like base R would be easier or that it's all the same, but I want you to push yourself outside of that comfort zone. You can use base R later; for now, try to adhere as much as possible to the Tidyverse.

Problem 1

Data

This problem uses sprinters.csv, which contains the winning times from the 100-meter sprint in Olympic competitions going back to 1900.[^1]

The dataset sprinters contains the following variables:

Variable	Description
`finish`	best time in seconds in the 100-meter sprint
`year`	the year of the competition
`women`	1 if the time is the women's best; 0 if the time is the men's best

Matrix Form

In R, create a matrix \(X\) comprised of three columns: a column of ones, a column made up of the variable year, and a column made up of the variable women.
Create a matrix \(y\) comprised of a single column, made up of the variable finish.
Compute the following using R's matrix commands (note that you will need to use the matrix multiplication operator %*%): \[ b = (X' X)^{-1} X' y \] Report the result of this calculation and interpret it.

See Matrices in R for more information on how to use matrices in R.
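As a rough sketch of the matrix commands involved, here is the same formula applied to the built-in mtcars data rather than to sprinters.csv. The choice of mtcars and of the variables wt, hp, and mpg is purely illustrative and is not part of the assignment.

```r
# Illustration on the built-in mtcars data (not the problem-set data).
# Design matrix: a column of ones plus two predictors.
X <- cbind(1, mtcars$wt, mtcars$hp)

# Outcome as a single-column matrix.
y <- matrix(mtcars$mpg, ncol = 1)

# b = (X'X)^{-1} X'y: t() transposes, solve() inverts, %*% multiplies.
b <- solve(t(X) %*% X) %*% t(X) %*% y

# Coefficients from lm() on the same variables, for comparison.
fit <- lm(mpg ~ wt + hp, data = mtcars)
cbind(matrix_form = as.numeric(b), lm = unname(coef(fit)))
```

The two columns should agree up to numerical precision, since lm computes the same least-squares solution by a different numerical route.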

Fitting a linear model

Make a nice regression plot exploring the relationship between finish and year. Make sure the graph is labeled nicely, so that anyone who does not know your variable names could still read it.
- Use ggplot alone for this plot. You do not need to obtain predicted values for this plot.
Using the function lm, run a regression of finish on year and women.
Compare the results with the calculation you did in the Matrix Form section above. What do you see?
Redo the plot and try to account for each level of women. Interpret the graph.
Rerun the regression, adding an interaction between women and year.
- Interpret the results. Did anything change as compared to the previous model?
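For reference, a minimal sketch of a labeled ggplot with one fitted line per group, followed by additive and interaction models. It uses the built-in mtcars data as a stand-in, so wt, mpg, and am are illustrative substitutes for the variables in this problem, and the object names are made up for the example.

```r
library(ggplot2)

# Illustration on mtcars: mpg (outcome), wt (continuous), am (0/1 group).
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  # One least-squares line per level of the grouping variable.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Transmission (0 = automatic, 1 = manual)",
    title = "Fuel economy by weight and transmission type"
  )
p

# Additive model: common slope, group-specific intercept shift.
fit_additive <- lm(mpg ~ wt + am, data = mtcars)

# Interaction model: wt * am expands to wt + am + wt:am,
# so the slope on wt is allowed to differ between groups.
fit_interact <- lm(mpg ~ wt * am, data = mtcars)
summary(fit_interact)
```

In the interaction model the coefficient on wt:am is the difference in slopes between the two groups, which is what the second plot shows visually.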

Predicted Values

Suppose that an Olympics had been held in 2001. Use the predict function to calculate the expected finishing time for men and for women.
- Calculate 95% confidence intervals for the predictions.
- Interpret the results
Now do the same for 1981, 1991, 2001. Use the predict function to calculate the expected finishing time for men and for women.
- Calculate 95% confidence intervals for the predictions.
- Interpret the results
- Make a plot that best summarizes these predictions and presents the most information about them (do not use geom_smooth). Explain why you chose the elements you used in this graph.
The authors of the Nature article were interested in predicting the finishing times for the 2156 Olympics. Use predict to do so, for both men and women, and report 95% confidence intervals for your results.
- Make a plot summarizing these predictions (do not use geom_smooth)
- Do you trust the model’s predictions? Is there reason to trust the 2001 prediction more than the 2156 prediction?
- Is any regression assumption of the model being abused or overworked to make this prediction?
Hint: Try predicting the finishing times in the year 3000 C.E.
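A minimal sketch of how predict can be combined with a grid of new values to get point predictions and confidence intervals. It again uses mtcars as a stand-in; the model, the grid values, and the names new_data and pred_df are assumptions made for illustration only.

```r
# Illustration on mtcars; wt and am stand in for a continuous and a 0/1 variable.
fit <- lm(mpg ~ wt * am, data = mtcars)

# One row per scenario: every combination of the chosen wt values and groups.
new_data <- expand.grid(wt = c(2.5, 3.0, 3.5), am = c(0, 1))

# interval = "confidence" returns the columns fit, lwr, and upr.
pred <- predict(fit, newdata = new_data, interval = "confidence", level = 0.95)

# Keep the scenarios and their predictions side by side.
pred_df <- cbind(new_data, as.data.frame(pred))
pred_df
```

A data frame shaped like pred_df can be passed directly to ggplot, for example with geom_pointrange or geom_errorbar mapped to fit, lwr, and upr.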

Problem 2

Data

This question will use a dataset included with R.

data("anscombe")

The dataset consists of 4 separate datasets, each with an \(x\) and \(y\) variable. The original dataset is not a tidy dataset. The following code creates a tidy version of the anscombe data that is easier to analyze.

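library("tidyverse")
anscombe2 <- anscombe %>%
    # Number the rows so each observation keeps its identity after reshaping
    mutate(obs = row_number()) %>%
    # Stack x1..x4 and y1..y4 into (variable_dataset, value) pairs
    gather(variable_dataset, value, - obs) %>%
    # Split names such as "x1" into variable = "x" and dataset = "1"
    separate(variable_dataset, c("variable", "dataset"), sep = 1L) %>%
    # Put x and y back into their own columns
    spread(variable, value) %>%
    arrange(dataset, obs)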
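The reshaping code above relies on gather and spread, which tidyr now marks as superseded. One possible equivalent with pivot_longer is sketched below. It is offered as an alternative, not as the required approach, and the name anscombe_long is made up for this example.

```r
library(dplyr)
library(tidyr)

data("anscombe")

anscombe_long <- anscombe %>%
  # Number the rows so each observation can be identified after reshaping.
  mutate(obs = row_number()) %>%
  pivot_longer(
    -obs,
    # ".value" sends the first captured character (x or y) to column names;
    # the second captured character (1 to 4) becomes the dataset column.
    names_to = c(".value", "dataset"),
    names_pattern = "(.)(.)"
  ) %>%
  arrange(dataset, obs)

head(anscombe_long)
```

The result has the columns obs, dataset, x, and y, with one row per observation per dataset.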

Looking at your data beyond summary statistics

For each dataset, calculate the mean and standard deviation of x and y, and the correlation between x and y.
Run a linear regression between x and y for each dataset.
How similar do you think that these datasets will look?
Create a scatter plot of each dataset and its linear regression fit. Hint: you can do this easily with facet_wrap.
How do we make sense of these plots?
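As a rough illustration of the grouped-summary and faceting pattern these questions call for, the sketch below uses the built-in iris data, with Species playing the role that dataset plays here. The variable choices and the names iris_summary and p are assumptions for the example.

```r
library(dplyr)
library(ggplot2)

# Per-group means, standard deviations, and correlation.
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    mean_x = mean(Sepal.Length),
    sd_x   = sd(Sepal.Length),
    mean_y = mean(Sepal.Width),
    sd_y   = sd(Sepal.Width),
    cor_xy = cor(Sepal.Length, Sepal.Width)
  )
iris_summary

# One panel per group, each with its own least-squares line.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ Species)
p
```

Because the data are grouped before summarise is called, each statistic is computed separately within each group, and facet_wrap applies the same split to the plot.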

Problem 3

Research project

In just a few sentences, describe your project. Think of this as what you would say next semester, when we finally go back to White Hall, if you ran into a faculty member you haven't had a chance to interact with and they asked you about your research in the elevator. Yes, this is your elevator pitch!
Now, describe your data.
- Do you have it in a form that you can load it into R?
- What variables does it include?
- What are their descriptions and types?
Describe, in as precise terms as possible, the distribution of the outcome variable you plan to use. If you have the data in hand, a histogram would be ideal; if you do not, give a verbal description of what you expect the distribution to look like. Be sure to indicate if the data are continuous or categorical.
What challenges would your data pose for analysis by least squares regression? Be sure to discuss any potential violations of the assumptions of the Gauss-Markov theorem, as well as any other complications or difficulties you see in modeling your data.

If you are still collecting your data, report as much as you can or as much as you have. If you do not have data at this point, talk to me ASAP.

It's okay if you end up doing something different for your final paper, or if you are still unsure about the project. The point of this is to get you working with your data as soon as possible, so that if problems arise we can deal with them now, when something can still be done, and not later, when it is too late.

Sergio I. Garcia-Rios

March 15, 2021
