Interpretation Problems

Interpretation Problem 1: Beauty Is in the Eyes of the Student, Part II

We return to the data set from Problem 4 of Exam 1. Go back and re-read the prompt for that exam problem to remind yourself about the data set.

The economists recorded several other characteristics of the professors in addition to their course evaluation scores and attractiveness ratings. Additional variables included:

  • female : an indicator variable that equals 1 if the professor’s biological sex was female and 0 otherwise.
  • age : the chronological age of the professor.

In this problem, you will investigate how beauty, age, and biological sex of a professor are related to their course evaluation scores in the data set.

a. Load in the data set, and create exploratory data analysis plots for each of the predictor variables. Also create a matrix plot showing how the predictor variables are related to each other and the response variable.

Discuss each plot you create.

b. Fit three different regression models for predicting the courseevaluation variable using:

  • Model 1: beauty as the only predictor
  • Model 2: beauty and female as predictors
  • Model 3: beauty, female, and age as predictors

For each model, discuss what the coefficients of the fitted regression mean in the context of this data set. Are the coefficients different across the three models? Why?

c. Refit the three models from Part b, but now standardize beauty and age. Interpret the coefficients of the linear regressions fit using the standardized predictors. Are these easier to interpret?
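
The mechanics of standardizing a predictor before refitting can be sketched in R on simulated data. This sketch assumes that "standardize" means subtracting the sample mean and dividing by the sample standard deviation; if the course uses a different convention, change the helper accordingly. The data frame `sim`, its columns `x1`, `g`, `x2`, and `y`, and the helper `standardize` are hypothetical stand-ins, not the course evaluation data.

```r
# Simulated stand-in data (hypothetical; NOT the course evaluation data set).
set.seed(1)
n <- 200
sim <- data.frame(
  x1 = rnorm(n, mean = 0, sd = 0.8),        # stands in for a continuous rating
  g  = rbinom(n, size = 1, prob = 0.4),     # stands in for a 0/1 indicator
  x2 = round(runif(n, min = 30, max = 70))  # stands in for an age in years
)
sim$y <- 4 + 0.15 * sim$x1 - 0.2 * sim$g - 0.005 * sim$x2 + rnorm(n, sd = 0.5)

# Assumed convention: subtract the sample mean, divide by the sample sd.
standardize <- function(x) (x - mean(x)) / sd(x)

sim$x1_std <- standardize(sim$x1)
sim$x2_std <- standardize(sim$x2)

# Fit the same model with raw and with standardized continuous predictors.
# The indicator g is left unstandardized in both fits.
fit_raw <- lm(y ~ x1 + g + x2, data = sim)
fit_std <- lm(y ~ x1_std + g + x2_std, data = sim)

print(coef(fit_raw))
print(coef(fit_std))
```

Printing both coefficient vectors side by side is a convenient starting point for thinking about how the units of each slope change under standardization.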

d. Create coefficient plots for the third standardized model, both including and excluding the intercept. Do any of the predictors seem more relevant to predicting the course evaluation of a professor? Do any of the predictors seem less relevant? Why?

e. Create all relevant residual diagnostic plots for the standardized Model 3. Comment on the relevant features of the diagnostic plots, and determine whether the MLRGN model seems appropriate.

f. At the 0.25 significance level, test the claim that the expected course evaluation decreases with each additional year of age, while also accounting for the attractiveness and biological sex of the professor.

Does this result contradict the result of a test of the claim that there is a non-zero effect of age while accounting for attractiveness and biological sex at the 0.25 significance level? Why or why not?

Can we trust either of these results, given your analysis of the residual diagnostic plots? Why or why not?

g. Construct a confidence curve for the expected effect of a professor’s age on their course evaluation, while also accounting for their attractiveness and biological sex. Interpret the confidence curve in the context of this problem.
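
One way to draw a confidence curve in R is to stack the confidence intervals for a single coefficient across a grid of confidence levels. The sketch below assumes that definition; the course notes may define the curve differently (for example, in terms of P-values), in which case the axes would change. It uses simulated data, and the names `sim`, `x1`, `g`, `x2`, `y`, and `ends` are hypothetical.

```r
# Simulated stand-in data (hypothetical; NOT the course evaluation data set).
set.seed(2)
n <- 150
sim <- data.frame(x1 = rnorm(n), g = rbinom(n, 1, 0.5), x2 = runif(n, 30, 70))
sim$y <- 4 + 0.1 * sim$x1 - 0.2 * sim$g - 0.004 * sim$x2 + rnorm(n, sd = 0.5)
fit <- lm(y ~ x1 + g + x2, data = sim)

# Grid of confidence levels; 0 and 1 are excluded because the interval
# collapses to a point at level 0 and is unbounded at level 1.
levels <- seq(0.01, 0.99, by = 0.01)

# For each level, confint() returns the (lower, upper) endpoints for x2.
ends <- t(sapply(levels, function(cl) confint(fit, parm = "x2", level = cl)))
colnames(ends) <- c("lower", "upper")

# Plot both endpoints against the confidence level.
plot(ends[, "lower"], levels, type = "l", xlim = range(ends),
     xlab = "Coefficient on x2", ylab = "Confidence level")
lines(ends[, "upper"], levels)
abline(v = coef(fit)["x2"], lty = 2)  # point estimate
```

Reading horizontally across the plot at a given height gives the confidence interval at that level.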

Interpretation Problem 2: It Takes a Village (and a High School Diploma)

In this problem, we consider a subset of data from the National Longitudinal Survey of Youth. This subsample contains cognitive test scores (kid_score) for three- and four-year-old children, as well as characteristics of their mothers. The recorded characteristics of the mothers include:

  • mom_age : the age of the mother when the child was born.
  • mom_hs : an indicator variable for whether the mother completed high school: 1 for yes and 0 for no.

In this problem, you will investigate how a mother’s age and high school completion are related to the test score of her child.

a. Load in the data set, and create exploratory data analysis plots for each of the predictor variables. Also create a matrix plot showing how the predictor variables are related to each other and the response variable.

Discuss each plot you create.

b. Fit two models to predict a child’s test score, using:

  • Model 1: the standardized age of their mother.
  • Model 2: the standardized age of their mother, as well as (unstandardized) whether the mother completed high school.

For each model, discuss what the coefficients of the fitted regression mean in the context of this data set. Does the expected effect of a mother’s age on her child’s test score change when we also account for whether she has completed high school? If so, why might this be the case?

c. Create a coefficient plot for the second model from Part b. Does one of the predictors seem more important for predicting the child’s test score?

d. Create all relevant residual diagnostic plots for Model 2. Comment on the relevant features of the diagnostic plots, and determine whether the MLRGN model seems appropriate.

e. Test the claim that there is a positive relationship between a mother’s age and the expected score of a child on the test at the 0.1 significance level.

Hint: Which model should you use to answer this question?

Does your conclusion change if you test whether there is any relationship between a mother’s age and the expected score of her child? Why?

f. Construct a confidence curve for the expected effect of a mother’s standardized age on her child’s score, while also accounting for whether the mother completed high school. Interpret the confidence curve in the context of this problem.

Computation Problems

Computation Problem 1

a. Verify by direct computation the estimated noise variances reported by lm for the multiple linear regressions \[ \texttt{courseevaluation} \sim \texttt{beauty} + \texttt{female} + \texttt{age}\] and \[ \texttt{courseevaluation} \sim \widetilde{\texttt{beauty}} + \texttt{female} + \widetilde{\texttt{age}}\] from Interpretation Problem 1, where the tilde over a predictor variable indicates that we use the standardized predictor.
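
As a rough guide to what "direct computation" might look like, the following R sketch compares a hand-computed noise variance estimate with the value implied by the residual standard error in the `lm` summary. It assumes the estimator is the residual sum of squares divided by \(n - (p + 1)\), where \(p\) is the number of predictors excluding the intercept. The data are simulated, and the names `sim`, `x1`, `x2`, and `y` are hypothetical.

```r
# Simulated stand-in data (hypothetical; NOT the course evaluation data set).
set.seed(3)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n, sd = 1.5)

fit <- lm(y ~ x1 + x2, data = sim)

p <- 2                               # predictors, not counting the intercept
rss <- sum(residuals(fit)^2)         # residual sum of squares
sigma2_direct <- rss / (n - (p + 1)) # assumed estimator of the noise variance

# summary(fit)$sigma is the residual standard error; squaring it gives
# the noise variance estimate that lm's summary reports.
sigma2_lm <- summary(fit)$sigma^2

print(c(direct = sigma2_direct, lm = sigma2_lm))
```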

b. Are the estimated noise variances the same? Why?

Computation Problem 2

a. Verify by direct computation the estimated standard errors for the coefficient estimates reported by lm for the multiple linear regression \[ \texttt{kid_score} \sim \widetilde{\texttt{mom_age}} + \texttt{mom_hs}\] from Interpretation Problem 2, where the tilde over a predictor variable indicates that we use the standardized predictor.

Hint: It will be easier if you work with the formula given in terms of the design matrix rather than the formula involving the multiple \(R^{2}\) value.
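
A minimal R sketch of the design-matrix route, on simulated data, is given here. It assumes the formula in question is \(\widehat{\operatorname{Var}}(\widehat{\boldsymbol{\beta}}) = \widehat{\sigma}^{2} (\mathbf{X}^{T} \mathbf{X})^{-1}\), with the standard errors given by the square roots of the diagonal entries. The names `sim`, `x`, `d`, and `y` are hypothetical stand-ins.

```r
# Simulated stand-in data (hypothetical; NOT the NLSY subsample).
set.seed(4)
n <- 120
sim <- data.frame(x = rnorm(n), d = rbinom(n, 1, 0.7))
sim$y <- 80 + 2 * sim$x + 10 * sim$d + rnorm(n, sd = 20)

fit <- lm(y ~ x + d, data = sim)

# Design matrix, including the leading column of ones for the intercept.
X <- model.matrix(fit)

# Noise variance estimate: RSS divided by (n - number of columns of X).
sigma2_hat <- sum(residuals(fit)^2) / (nrow(X) - ncol(X))

# Assumed formula: estimated covariance matrix is sigma2_hat * (X^T X)^{-1}.
cov_hat <- sigma2_hat * solve(t(X) %*% X)
se_direct <- sqrt(diag(cov_hat))

# Standard errors as reported in the lm summary table.
se_lm <- summary(fit)$coefficients[, "Std. Error"]

print(cbind(direct = se_direct, lm = se_lm))
```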

b. Would you get the same estimated standard errors for the multiple linear regression \[ \texttt{kid_score} \sim \texttt{mom_age} + \texttt{mom_hs}\text{?}\] Why?

Hint: It will be easier if you think about this using the formula given in terms of the multiple \(R^{2}\) value.

Theory Problems

Theory Problem 1

In class, I stated that, when we include an intercept in a simple linear regression, the mean of the residuals from that regression is zero: \[ \frac{1}{n} \sum_{i = 1}^{n} e_{i} = 0.\] This is one of the “constraints” placed on the residuals that induces a loss of one degree of freedom for the \(t\) and \(\chi^{2}\) distributions associated with the SLRGN and MLRGN models.

In this problem, we will prove that the residuals of a fitted multiple linear regression model \[ Y_{i} = b_{0} + \sum_{j = 1}^{p} b_{j} X_{ij} + e_{i}, \quad i = 1, 2, \ldots, n,\] have mean 0, and that the residuals are uncorrelated with each of the predictors included in the model. That is, we will show that: \[\begin{aligned} \frac{1}{n} \sum_{i = 1}^{n} e_{i} &= 0, \\ \frac{1}{n} \sum_{i = 1}^{n} X_{ij} e_{i} &= 0, \quad j = 1, 2, \ldots, p. \end{aligned}\]

a. The “residual maker” matrix \(\mathbf{M} = \mathbf{I} - \mathbf{X} (\mathbf{X}^{T} \mathbf{X})^{-1} \mathbf{X}^{T}\) is so-called because it transforms the response vector \(\mathbf{Y}\) into the residual vector \(\mathbf{e}\), \[ \mathbf{M} \mathbf{Y} = \mathbf{e}.\] Verify this property of the residual maker matrix.

b. Show that \(\mathbf{X}^{T} \mathbf{e} = \mathbf{0}\) by using the result you verified above.
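
Before writing the proofs, it may help to check both matrix identities numerically. The R sketch below does this on simulated data. It is only a sanity check up to floating-point error, not a substitute for the algebra, and the names `sim`, `x1`, `x2`, and `y` are hypothetical.

```r
# Simulated stand-in data (hypothetical names and values).
set.seed(5)
n <- 50
sim <- data.frame(x1 = rnorm(n), x2 = runif(n))
sim$y <- 1 + 0.5 * sim$x1 - 2 * sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)
X <- model.matrix(fit)  # n x (p + 1) design matrix with a column of ones
Y <- sim$y
e <- residuals(fit)

# Residual maker matrix M = I - X (X^T X)^{-1} X^T.
M <- diag(n) - X %*% solve(t(X) %*% X) %*% t(X)

# Largest absolute difference between M Y and the residuals from lm.
err_MY <- max(abs(as.vector(M %*% Y) - e))

# Largest absolute entry of X^T e.
err_Xte <- max(abs(t(X) %*% e))

print(c(err_MY = err_MY, err_Xte = err_Xte))
```

Both printed values should be zero up to rounding error.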

c. Explain why \(\mathbf{X}^{T} \mathbf{e} = \mathbf{0}\) indicates that \[\frac{1}{n} \sum_{i = 1}^{n} e_{i} = 0 \] and we can conclude that the sample mean of the residuals is 0.

d. Explain why \(\mathbf{X}^{T} \mathbf{e} = \mathbf{0}\) indicates that \[\frac{1}{n} \sum_{i = 1}^{n} X_{ij} e_{i} = 0, j = 1, 2, \ldots, p \] and we can conclude that the sample correlation between the residuals and each predictor included in the model is 0.

e. What do the results from Part c and Part d indicate about the slope and intercept of a simple linear regression model that predicts the residuals using any one of the predictors included in the multiple linear regression? How does this relate to why we always use a nonparametric smoother to search for systematic deviations from 0 in the residual versus predictor plots?