Study Guide for Exam 1

This will be a closed-book exam. You will be allowed to use R on a school computer for computations, but only Base R: you will not have access to, nor will you need to use, add-on packages such as mosaic, ggformula, or confcurve. You should, however, still know the commands from these packages, as you may be asked for the code that generates a provided figure, output from lm, etc.

You are permitted one 8.5 inch by 11 inch sheet of handwritten notes (one side) for this exam.

To do well on the exam, you should be able to do the following:

Chapter 1

Section 1.1: Relations Between Variables

Explain the difference between a functional relationship and a statistical relationship between two variables.
State examples of functional relationships from the physical sciences, and posit statistical relationships from the social and biological sciences.
Interpret a scatter plot showing the association between two variables.

Section 1.2: Regression Models and Their Uses

State the two key components of a regression model.
Recognize the terminology predictor and response, and given a statistical relationship, identify which variable is the predictor and which variable is the response.
State the three main uses of a regression model.
Recognize that prediction of $Y$ using $X$ does not indicate that $X$ causes $Y$.

Section 1.4: Data for Regression Analysis

Distinguish between a population and a sample from that population.
Identify the two main sources of uncertainty associated with any inference from a sample to a population.

Lines of Best Fit (Portion of Section 1.3)

State the functional form of a line of best fit.
Interpret the slope and intercept of a line of best fit in the context of a particular regression problem as they relate to prediction.

R

RStudio and R Markdown Notebooks

Create an R Markdown Notebook in RStudio.
Recognize the syntax for, create, and run code chunks in an R Markdown document.

Data Management in R

Load a .rda file into an R session.
Access columns of a data frame using $ notation.
Access columns, rows, and elements of a data frame using dafr[i, j] ‘matrix’ notation.
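
A minimal Base R sketch of these three tasks, using the built-in cars data frame as a stand-in for course data. The file name cars_copy.rda and the use of a temporary directory are illustrative assumptions, not part of the course materials.

```r
# Save a copy of the built-in cars data frame to a temporary .rda file,
# then load it back. load() restores objects under their saved names.
dafr <- cars
rda_path <- file.path(tempdir(), "cars_copy.rda")  # hypothetical file name
save(dafr, file = rda_path)
rm(dafr)
load(rda_path)             # 'dafr' is back in the workspace

speed_col <- dafr$speed    # one column, via $ notation
first_row <- dafr[1, ]     # row 1, all columns (still a data frame)
dist_col  <- dafr[, 2]     # all rows, column 2 (a numeric vector)
one_cell  <- dafr[3, 2]    # the element in row 3, column 2
```

Leaving i or j blank in dafr[i, j] selects every row or every column, respectively.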

Package Management in R

Install a package from the Comprehensive R Archive Network (CRAN) using install.packages.
Load a package into an R session using library.

mosaic and ggformula

State the ‘grammar’ of a command in ggformula.
Use the graphing functionality of ggformula to create univariate summaries of data, including:
- rug plots
- histograms
- density plots
- scatter plots
Use the graphing functionality of ggformula to create bivariate summaries of data, including:
- scatter plots
- nonparametric smoothers
- linear regression lines
Use the %>% operator from the magrittr package to overlay multiple graphics in the same plot.
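
As a hedged reminder of the ggformula grammar, the sketch below uses the built-in cars data frame (an assumption; any data frame with two numeric columns would do). It assumes the ggformula package is installed and that loading it makes %>% available. If %>% is not found, library(magrittr) provides it.

```r
library(ggformula)

# Grammar: gf_plottype(formula, data = dataframe)
# Univariate summaries use a one-sided formula, ~ x.
p_hist <- gf_histogram(~ speed, data = cars)
p_dens <- gf_density(~ speed, data = cars)

# Bivariate summaries use a two-sided formula, y ~ x.
# Layers are overlaid by chaining calls with %>%.
p_scatter <- gf_point(dist ~ speed, data = cars) %>%
  gf_smooth() %>%          # nonparametric smoother
  gf_lm(color = "red")     # least squares line

print(p_scatter)
```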

knitr, LaTeX, and MathPix Snip

Knit an R Markdown document into a PDF or HTML file.
Recognize the two main “modes” for math in LaTeX.
Pronounce LaTeX correctly (“lay”-“tek”).
Typeset basic mathematical expressions using LaTeX.
Use MathPix Snip to generate LaTeX expressions from handwritten and typeset mathematical expressions.
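
For reference, a small sketch of the two math modes. Inline math sits inside a sentence between single dollar signs, while display math is set on its own line between \[ and \]. The fitted-line expression is only an illustrative choice.

```latex
The fitted line is $\hat{Y} = b_{0} + b_{1} X$, written in inline mode.

In display mode, the same expression is centered on its own line:
\[
  \hat{Y} = b_{0} + b_{1} X
\]
```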

Section 1.3: Simple Linear Regression Model with Distribution of Error Terms Unspecified

State the model for simple linear regression when the distribution of the error terms is unspecified except for their means, variances, and covariances.
Explain which part of the simple linear regression model is “signal” and which part is “noise.”
State the mean and variance of the response in the simple linear regression model.
Relate the estimators for the population slope and intercept to the ordinary least squares solutions.
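
One way to write out the model and its consequences, using the bracket notation that appears elsewhere in this guide and treating the predictor values $X_{i}$ as fixed:

```latex
\[
  Y_{i} = \beta_{0} + \beta_{1} X_{i} + \epsilon_{i}, \qquad i = 1, \ldots, n,
\]
where the error terms satisfy
\[
  E[\epsilon_{i}] = 0, \qquad
  \sigma^{2}[\epsilon_{i}] = \sigma_{\epsilon}^{2}, \qquad
  \sigma[\epsilon_{i}, \epsilon_{j}] = 0 \ \text{ for } i \neq j .
\]
The signal is $\beta_{0} + \beta_{1} X_{i}$ and the noise is $\epsilon_{i}$, so
\[
  E[Y_{i}] = \beta_{0} + \beta_{1} X_{i}, \qquad
  \sigma^{2}[Y_{i}] = \sigma_{\epsilon}^{2} .
\]
```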

Section 1.6: Estimation of Regression Functions

State the ordinary least squares solutions for the slope and intercept of a simple linear regression.
Compute the ordinary least squares solutions for the slope and intercept of a simple linear regression given the relevant sample statistics for $X$ and $Y$.
State the objective function that is minimized to find the ordinary least squares solutions for the slope and intercept.
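
The following Base R sketch computes the least squares slope and intercept from sample statistics alone and evaluates the objective function. It uses the built-in cars data frame as an assumed example, and the helper name Q is illustrative.

```r
x <- cars$speed
y <- cars$dist

# Sample statistics for X and Y
xbar <- mean(x); ybar <- mean(y)
sx <- sd(x); sy <- sd(y)
r <- cor(x, y)

# Ordinary least squares solutions
b1 <- r * sy / sx        # slope
b0 <- ybar - b1 * xbar   # intercept

# Objective function: sum of squared vertical deviations from the line
Q <- function(beta0, beta1) sum((y - beta0 - beta1 * x)^2)

# Moving away from (b0, b1) increases Q
Q(b0, b1) < Q(b0 + 1, b1)
Q(b0, b1) < Q(b0, b1 + 0.1)
```

The values of b0 and b1 should agree with coef(lm(dist ~ speed, data = cars)).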

Section 1.7: Estimation of Error Terms Variance $\sigma^{2}$

State the estimator for the noise variance $\sigma^{2}[\epsilon] = \sigma_{\epsilon}^{2}$.

Section 1.8: Normal Error Regression Model

Recognize “Gaussian” as a synonym for “normal”.
Recognize the acronym “SLRGN” for “simple linear regression with Gaussian noise”.
State the population parameters of the SLRGN model.
Draw a (rough) schematic of the SLRGN model given values for the population parameters.
Compute the probability that the response will fall within an interval, given the population parameters of the SLRGN model and a value for the predictor.

Use lm to fit a simple linear regression.
Use makeFun from the mosaic package to turn an output from lm into a callable function.
Extract the estimate of the noise variance from an output from lm.
Use pnorm to compute probabilities for a Gaussian (“normal”) random variable.
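
A Base R sketch of these computations follows. It uses the built-in cars data frame, and it uses predict as a Base R stand-in for makeFun from mosaic. The SLRGN parameter values in the last part are hypothetical numbers chosen for illustration, not estimates.

```r
fit <- lm(dist ~ speed, data = cars)

# Estimate of the noise variance: SSE / (n - 2)
n <- nrow(cars)
s2_by_hand <- sum(residuals(fit)^2) / (n - 2)
s2_from_lm <- sigma(fit)^2   # same value as summary(fit)$sigma^2

# A callable function of the predictor built from the fit
# (Base R stand-in for mosaic::makeFun)
mean_dist <- function(speed) {
  unname(predict(fit, newdata = data.frame(speed = speed)))
}

# P(17 < Y < 23 | X = 5) under assumed SLRGN parameters
beta0 <- 10; beta1 <- 2; sigma_eps <- 3   # hypothetical values
x <- 5
mu <- beta0 + beta1 * x                   # mean response at x, here 20
prob <- pnorm(23, mean = mu, sd = sigma_eps) -
  pnorm(17, mean = mu, sd = sigma_eps)    # about 0.683
```

The interval here runs from one standard deviation below the mean response to one above it, which is why the probability is roughly 0.68.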

Chapter 2

Section 2.1: Inferences Concerning $\beta_{1}$

State the mean and variance of $b_{1}$ under the SLR model.
Explain why the variance of $b_{1}$ depends on: the sample size, the noise variance, and the spacing of the predictor values.
State the distribution of $b_{1}$ under the SLRGN model.
State the distribution of $b_{1}$ after studentization under the SLRGN model.
Construct a confidence interval for $\beta_{1}$ under the SLRGN model.
Perform a hypothesis test for $\beta_{1}$ under the SLRGN model.
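
As a compact summary of the results behind these objectives, written in the bracket notation used in this guide, with $s^{2} = SSE / (n - 2)$ denoting the estimate of the noise variance:

```latex
\[
  E[b_{1}] = \beta_{1}, \qquad
  \sigma^{2}[b_{1}] = \frac{\sigma_{\epsilon}^{2}}{\sum_{i=1}^{n} (X_{i} - \bar{X})^{2}},
  \qquad \text{where } \sum_{i=1}^{n} (X_{i} - \bar{X})^{2} = (n - 1) s_{X}^{2} .
\]
Under the SLRGN model, $b_{1}$ is Gaussian with this mean and variance, and after studentization
\[
  \frac{b_{1} - \beta_{1}}{s[b_{1}]} \sim t_{n-2}, \qquad
  s^{2}[b_{1}] = \frac{s^{2}}{\sum_{i=1}^{n} (X_{i} - \bar{X})^{2}} .
\]
A two-sided confidence interval with confidence level $1 - \alpha$ is
\[
  b_{1} \pm t_{1 - \alpha/2;\, n-2} \; s[b_{1}] .
\]
```

The identity in the first display makes the dependence on the sample size and on the spread of the predictor values explicit.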

Section 2.2: Inferences Concerning $\beta_{0}$

State the mean and variance of $b_{0}$ under the SLR model.
Explain why the variance of $b_{0}$ depends on: the sample size, the noise variance, the spacing of the predictor values, and the sample mean of the predictors.
State the distribution of $b_{0}$ under the SLRGN model.
State the distribution of $b_{0}$ after studentization under the SLRGN model.
Construct a confidence interval for $\beta_{0}$ under the SLRGN model.
Perform a hypothesis test for $\beta_{0}$ under the SLRGN model.

Compute the estimates of the standard errors for $b_{0}$ and $b_{1}$ directly from the sample statistics of a data frame.
Extract the estimates of the standard errors of $b_{0}$ and $b_{1}$ using summary and an output from lm.
Construct confidence intervals for $\beta_{0}$ and $\beta_{1}$ under the SLRGN model directly from the sample statistics of a data frame.
Construct confidence intervals for $\beta_{0}$ and $\beta_{1}$ under the SLRGN model using confint and an output from lm.
Perform hypothesis tests for $\beta_{0}$ and $\beta_{1}$ under the SLRGN model directly from the sample statistics of a data frame.
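
The sketch below works through these computations in Base R for the built-in cars data frame (an assumed example), first from sample statistics and then via summary and confint. The 95% level and the null value of 0 are illustrative choices.

```r
x <- cars$speed
n <- length(x)
fit <- lm(dist ~ speed, data = cars)

b <- unname(coef(fit))       # b[1] is b0, b[2] is b1
s <- sigma(fit)              # square root of SSE / (n - 2)
Sxx <- (n - 1) * var(x)      # sum of squared deviations of X

# Estimated standard errors from sample statistics
se_b1 <- s / sqrt(Sxx)
se_b0 <- s * sqrt(1 / n + mean(x)^2 / Sxx)

# 95% confidence intervals
t_crit <- qt(0.975, df = n - 2)
ci_b0 <- b[1] + c(-1, 1) * t_crit * se_b0
ci_b1 <- b[2] + c(-1, 1) * t_crit * se_b1

# Two-sided test of H0: beta1 = 0
t_b1 <- (b[2] - 0) / se_b1
p_b1 <- 2 * pt(-abs(t_b1), df = n - 2)

# The same quantities from lm output
summary(fit)$coefficients
confint(fit, level = 0.95)
```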

Section 2.4: Interval Estimation for the Mean Response

State the mean and variance of the estimate of the mean response under the SLR model.
State the sampling distribution of the estimate of the mean response under the SLRGN model.
Explain how to studentize the estimate of the mean response under the SLRGN model, and how this is useful for performing hypothesis tests, constructing confidence intervals, etc., related to the mean response.
State and compute the confidence interval for the mean response under the SLRGN model.
Sketch, roughly, how the confidence interval for the mean response behaves as a function of the predictor variable.
Properly interpret what a confidence interval for the mean response indicates, given its confidence level.

Section 2.5: Prediction Intervals for a New Response

Define a prediction interval for a new response, and contrast it with the confidence interval for the mean response.
State and compute the prediction interval for a new response under the SLRGN model when the population parameters are known.
State and compute the prediction interval for a new response under the SLRGN model when the population parameters are unknown.
Sketch, roughly, how the prediction interval for a new response behaves as a function of the predictor variable.
Sketch, roughly, how the confidence interval for the mean response and the prediction interval for a new response compare in terms of their widths.
State the limiting behavior of the width of confidence intervals for the mean response and prediction intervals for a new response under the SLRGN model.

Computing and Plotting SLRGN Intervals in R (Lecture Notes for Lecture 9)

Compute confidence intervals for the mean response under the SLRGN model using an output from makeFun.
Compute prediction intervals for a new response under the SLRGN model using an output from makeFun.
Plot confidence intervals for the mean response and prediction intervals for a new response using gf_lm.
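
For practice without mosaic or ggformula, the following Base R sketch computes both kinds of interval with predict and then reproduces them by hand at one predictor value. The cars data frame, the predictor values 10, 15, and 20, and the 95% level are assumptions made for illustration.

```r
fit <- lm(dist ~ speed, data = cars)
new_x <- data.frame(speed = c(10, 15, 20))

# Intervals from predict(): columns are fit, lwr, upr
ci_mean <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
pi_new  <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)

# The same intervals by hand at x_h = 15
x <- cars$speed
n <- length(x)
s <- sigma(fit)
Sxx <- sum((x - mean(x))^2)
xh <- 15
yhat <- sum(coef(fit) * c(1, xh))

se_mean <- s * sqrt(1 / n + (xh - mean(x))^2 / Sxx)       # mean response
se_pred <- s * sqrt(1 + 1 / n + (xh - mean(x))^2 / Sxx)   # new response

t_crit <- qt(0.975, df = n - 2)
ci_hand <- yhat + c(-1, 1) * t_crit * se_mean
pi_hand <- yhat + c(-1, 1) * t_crit * se_pred
```

The extra 1 under the square root in se_pred accounts for the noise in the new response, which is why the prediction interval is always wider than the confidence interval at the same predictor value.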

Chapter 3

Section 3.1: Diagnostics for Predictor Variables

Construct exploratory plots for the predictor in a simple linear regression.

Section 3.2: Residuals

State the six most important departures from the SLRGN model.
Explain why it is necessary to check that the SLRGN model holds before computing inferential statistics under the SLRGN model.

Section 3.3: Diagnostics for Residuals

State the five common diagnostic plots involving the sample residuals.
Sketch diagnostic plots for the sample residuals when all of the assumptions of the SLRGN model are met.
Given diagnostic plots from sample residuals, identify what departures from the SLRGN model are indicated (or not) by the diagnostic plots.
Explain the rationale for each of the five diagnostic plots in terms of the assumptions of the SLRGN model.

Section 3.9: Transformations

Explain why a model of the form
\[Y = \beta_{0} + \beta_{1} f(X) + \epsilon\]
is still a simple linear regression in the transformed variable $f(X)$.
Identify appropriate transformations of a predictor variable from a plot of the response versus the predictor.

Access the residuals from an output from lm.
Construct residual diagnostic plots “by hand” using functions from ggformula.
Construct residual diagnostic plots using plot’s built-in functionality with an output from lm.
Fit models of the form
\[Y = \beta_{0} + \beta_{1} f(X) + \epsilon\]
using lm where $f$ is a polynomial or logarithmic function.
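
A Base R sketch of these tasks, again using the built-in cars data frame as an assumed example. The by-hand plots here use Base R graphics rather than ggformula, and the two plots shown are only a subset of the diagnostic plots covered in the section.

```r
fit <- lm(dist ~ speed, data = cars)

# Residuals and fitted values from the lm output
e <- residuals(fit)
yhat <- fitted(fit)

# Diagnostic plots by hand
plot(yhat, e, xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)     # residuals should scatter evenly about 0
qqnorm(e); qqline(e)       # points near the line suggest Gaussian noise

# Built-in diagnostic plots for lm objects
plot(fit, which = 1)       # residuals versus fitted values
plot(fit, which = 2)       # normal Q-Q plot

# Models of the form Y = beta0 + beta1 f(X) + epsilon
fit_log <- lm(dist ~ log(speed), data = cars)   # f(X) = log(X)
fit_sq  <- lm(dist ~ I(speed^2), data = cars)   # f(X) = X^2
```

The I() wrapper is needed for the squared term because ^ has a special meaning inside a model formula.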

Non-book-based Lecture Notes

Duality Between Confidence Intervals and Hypothesis Testing (Lecture Notes for Lecture 7)

Use a two-sided confidence interval with confidence level $c$ to perform a two-sided hypothesis test with significance level $\alpha = 1 - c$.
Explain, loosely, why we reject a point (two-sided) null hypothesis when the null value of the parameter does not fall in a confidence interval.
Compare and contrast accept/reject hypothesis testing with confidence intervals in terms of their strengths and weaknesses.
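
The duality can be checked numerically. In this Base R sketch, which assumes the cars data frame, a 95% confidence level, and a null value of 0 for the slope, the decision based on the confidence interval agrees with the decision based on the $P$-value.

```r
fit <- lm(dist ~ speed, data = cars)

conf_level <- 0.95
alpha <- 1 - conf_level
null_value <- 0            # H0: beta1 = 0

ci <- confint(fit, "speed", level = conf_level)
p_value <- summary(fit)$coefficients["speed", "Pr(>|t|)"]

# Reject H0 when the null value falls outside the confidence interval
reject_by_ci <- null_value < ci[1] || null_value > ci[2]

# Reject H0 when the P-value is below the significance level
reject_by_p <- p_value < alpha
```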

Practical versus Statistical Significance (Lecture Notes for Lecture 7)

Explain the difference between statistical and practical significance.
Give the origin of the phrase “statistical significance.”
Explain why a $P$-value, on its own, indicates nothing about practical significance.
Explain how a confidence interval can be used to identify both statistical and practical significance.
Construct a hypothetical scenario where a result may be:
- statistically significant but not practically significant.
- practically significant but not statistically significant.

Confidence Curves (Lecture Notes for Lecture 7)

Give a constructive definition of a confidence curve in terms of two-sided confidence intervals for a population parameter.
State how to construct the confidence curve for the mean of a Gaussian population.
State how to construct the confidence curves for the population slope and intercept under the SLRGN model.
Given a confidence curve for a population parameter and a ruler, identify:
- a point estimate for a population parameter
- a confidence level $c$ confidence interval for the population parameter
- a $P$-value for the two-sided hypothesis test for the point null hypothesis $\theta = \theta_{0}$.
Interpret a confidence curve in terms of what it indicates about our certainty about the value of a population parameter.
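
One common construction, sketched here in Base R for the slope under the SLRGN model, defines the height of the confidence curve at $\theta_{0}$ as one minus the two-sided $P$-value for the null hypothesis $\theta = \theta_{0}$. The function names and the cars data frame are illustrative assumptions, and this sketch does not use the confcurve package.

```r
fit <- lm(dist ~ speed, data = cars)
est <- summary(fit)$coefficients["speed", ]
b1 <- est[["Estimate"]]
se <- est[["Std. Error"]]
df <- df.residual(fit)

# Two-sided P-value for H0: beta1 = theta0
p_value_at <- function(theta0) 2 * pt(-abs(b1 - theta0) / se, df = df)

# Height of the curve at theta0: the confidence level of the two-sided
# interval that has theta0 exactly on its boundary
conf_curve <- function(theta0) 1 - p_value_at(theta0)

theta_grid <- seq(b1 - 4 * se, b1 + 4 * se, length.out = 401)
plot(theta_grid, conf_curve(theta_grid), type = "l",
     xlab = "Slope", ylab = "Confidence level")
abline(h = 0.95, lty = 2)  # crosses the curve at the 95% interval endpoints
```

On such a plot, the point estimate is where the curve touches zero, a confidence interval of level $c$ is read off where a horizontal line at height $c$ crosses the curve, and the $P$-value at $\theta_{0}$ is one minus the height of the curve there.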

confcurve in R (Lecture Notes for Lecture 7)

Use confcurve to plot SLRGN confidence curves for the population slope and intercept using an output from lm.
Use confcurve to compute SLRGN two-sided $P$-values for the population slope and intercept using an output from lm.

Bootstrapping to Approximate the Sampling Distribution of a Statistic (Lecture Notes for Lecture 8)

Explain why inferential statistics derived under the SLRGN model become “junk numbers” when the SLRGN model fails to hold.
Identify when inferential statistics derived under the SLRGN model are likely to be “junk numbers”.
Explain how a resampling-based inferential method replaces a model-based sampling distribution with a sampling distribution based on the sample itself.

The Case Resampling Bootstrap for Simple Linear Regression (Lecture Notes for Lecture 8)

Describe the procedure for the Case Resampling Bootstrap.
Given a scatter plot, explain how you could generate a bootstrap sample using the Case Resampling Bootstrap.
Explain the tradeoffs between small and large values for $B$, the number of bootstrap samples.

The Percentile Bootstrap (Lecture Notes for Lecture 8)

Describe how to compute a percentile bootstrap confidence interval from bootstrapped estimates of parameter values.
State the name of the improved confidence interval method implemented in confcurve.
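
A minimal Base R sketch of the case resampling bootstrap followed by a plain percentile interval for the slope. The cars data frame, the seed, B = 1000, and the 95% level are assumptions chosen for illustration. This is the basic percentile interval, not the improved method implemented in confcurve.

```r
set.seed(1)
B <- 1000                  # number of bootstrap samples
n <- nrow(cars)
boot_slopes <- numeric(B)

for (b in seq_len(B)) {
  # Resample whole cases (rows, i.e. points on the scatter plot)
  # with replacement
  rows <- sample(n, size = n, replace = TRUE)
  boot_fit <- lm(dist ~ speed, data = cars[rows, ])
  boot_slopes[b] <- coef(boot_fit)[["speed"]]
}

# 95% percentile bootstrap confidence interval for the slope
percentile_ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
```

Larger values of B reduce the Monte Carlo variability of the interval endpoints at the cost of more computation.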

Confidence Curves from the Bootstrap Distribution (Lecture Notes for Lecture 8)

Explain how one can construct a bootstrapped confidence curve for a parameter value using the percentile bootstrap confidence interval.

$P$-values from the Bootstrapped Confidence Curve (Lecture Notes for Lecture 8)

Explain how one can compute a bootstrapped $P$-value for a two-sided hypothesis test using a bootstrapped confidence curve.
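
One way to sketch a percentile-based confidence curve and its $P$-value in Base R is shown below. The height of the curve at $\theta_{0}$ is taken to be the confidence level of the equal-tailed percentile interval with $\theta_{0}$ on its boundary. The function names, the seed, B = 1000, and the cars data frame are assumptions, and confcurve may compute these quantities differently.

```r
set.seed(1)
B <- 1000
n <- nrow(cars)

# Case resampling bootstrap estimates of the slope
boot_slopes <- replicate(B, {
  rows <- sample(n, size = n, replace = TRUE)
  coef(lm(dist ~ speed, data = cars[rows, ]))[["speed"]]
})

# Percentile-based confidence curve at a single value theta0
boot_conf_curve <- function(theta0) {
  F_hat <- mean(boot_slopes <= theta0)   # bootstrap distribution function
  abs(1 - 2 * F_hat)
}

# Bootstrapped two-sided P-value: one minus the height of the curve
boot_p_value <- function(theta0) 1 - boot_conf_curve(theta0)

p_null_zero <- boot_p_value(0)   # H0: beta1 = 0
p_null_four <- boot_p_value(4)   # H0: beta1 = 4
```

A $P$-value of exactly 0 from this construction means only that no bootstrap estimate fell on the far side of the null value, so it is better reported as less than 1/B.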

confcurve in R (Lecture Notes for Lecture 8)

Use bootcurve.lm to generate bootstrapped parameter estimates from a data frame.
Use confcurve to compute a BCa bootstrap confidence interval for the population slope and intercept using an output from bootcurve.lm.
Use plot.confcurve to plot a BCa bootstrap confidence curve for the population slope and intercept using an output from bootcurve.lm.
Use confpvalue to compute bootstrapped two-sided $P$-values for the population slope and intercept using an output from bootcurve.lm.
Compare and contrast SLRGN-based and bootstrap-based confidence curves for the same data set in terms of how violations of the SLRGN model impact the properties of its inferential statistics.