Friday, 11:40 AM — 1:00 PM, Howard Hall 308

Here's the official description:

Covers topics related to multiple regression techniques, including testing the assumptions required for each to be valid. This includes applications to yield curve smoothing, pricing, and investment models, and the use of principal component analysis. Also covered are techniques for the analysis and modeling of time series data and forecasting.

This class covers linear statistical models: what they are, how to estimate and use them, and how to make inferences from them. Our focus will be on data analysis, statistical modeling and prediction, and critical thinking about what a statistical model can and cannot tell us about a scientific question. We will cover both classical methods of inference, including their often unrealistic assumptions, and methods that make fewer assumptions. Ultimately, our goal is to use linear statistical models to explore and analyze data and report our findings to a community of peers.

MA 116 or MA 118 or MA 126, passed with a grade of C- or higher, and MA 151 or MA 220 or BE 251, passed with a grade of C- or higher.

Dr. David Darmon | ddarmon [at] monmouth.edu | Howard Hall 241

This is currently a *tentative* listing of topics, in order.

- **Statistical models for bivariate data:** Statistical prediction: the predictor and the response. The line of best fit as a data-summary device. A statistical model as a story for how the data are generated. Where does the 'randomness' come from in the first place? Populations, super-populations, and no-populations.
- **Exploratory Data Analysis with R:** Histograms, scatter plots, and smoothers.
- **Statistical models for simple linear regression:** The least-squares line. Interpreting the regression coefficients: slope and intercept. Viewing the data as a random sample from a population. The standard line: simple linear regression with Gaussian noise (SLRGN).
- **Checking assumptions before we infer:** Sample residuals. Residual plots from the sample residuals. Where the SLRGN model can go wrong, and what it looks like. Checking validity using out-of-sample performance. Transforming the predictor and response to handle common violations.
- **Inferences for parameters of a simple linear regression:** Properties of the simple linear regression estimators under random sampling from a population, in general, and under the SLRGN model, in particular. Estimation and hypothesis testing under the SLRGN model. Confidence curves as summaries of all available information about regression parameters. Bootstrapping: when the data are a population of their own. Confidence intervals and \(P\)-values via bootstrapping. Confidence curves via bootstrapping.
- **Inferences for expected values of a simple linear regression:** Standard errors and confidence sets for the population line. Prediction intervals for a new measurement from the population.
- **Assessing the simple linear regression as a predictive model:** Predictive error. In-sample error. Out-of-sample error. Cross-validated error. Bootstrapping estimates of errors.
- **Multiple linear regression:** Like simple linear regression, but more. Predicting a response with more than one predictor. Least squares and the hyperplane-of-best-fit. Interpreting the regression coefficients. The dependence of population-level regression coefficients on the predictors included in the model.
- **Statistical models for multiple linear regression and model checking:** The multiple linear regression with Gaussian noise (MLRGN) model. Assumption checking for the MLRGN model. Common departures from the MLRGN model, and how they show up in the sample residuals.
- **Inferences for parameters of a multiple linear regression:** The sampling distribution of multiple linear regression coefficients under the MLRGN model. Confidence curves for the coefficients of a multiple linear regression via both the MLRGN model and bootstrapping. Adjusting for multiple comparisons. Testing one, more, or all coefficients in a multiple linear regression. Do you really care so much about that \(P\)-value?
- **Enhancements and adjustments to multiple linear regression:** Nonlinearity in predictor variables. Categorical predictors. Multicollinearity amongst predictors. Interactions between predictors. Influential points and outliers.
- **Model selection:** Choosing amongst a set of candidate models. Predictor variable selection as model selection. Theoretical and data-driven approaches to model selection. The perils of naive inference post-model selection.
- **Beyond Ordinary Least Squares:** Heteroskedasticity in the noise. Weighted least squares for handling heteroskedasticity. Nonparametric regression. Regression as a smoothing problem. Nearest neighbor and kernel regressions. Tuning parameters in nonparametric regression, and how to tune them.
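Many of these ideas can be previewed in a few lines of R. The sketch below uses R's built-in `cars` data set (chosen only for illustration; it is not one of the course data sets) to fit a least-squares line and compute a case-resampling bootstrap interval for its slope:

```r
# Fit the least-squares line: stopping distance as a function of speed.
fit <- lm(dist ~ speed, data = cars)
coef(fit)  # intercept and slope of the line of best fit

# Case-resampling bootstrap: resample (x, y) pairs with replacement,
# refit the line, and collect the slope estimates.
set.seed(1)
B <- 2000
slopes <- replicate(B, {
  idx <- sample(nrow(cars), replace = TRUE)
  coef(lm(dist ~ speed, data = cars[idx, ]))[["speed"]]
})

# Percentile bootstrap 95% confidence interval for the slope.
quantile(slopes, c(0.025, 0.975))
```

The percentile interval here is the simplest bootstrap interval; we will also see how to turn the full bootstrap distribution into a confidence curve.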

I will have office hours at the following four times each week:

- Tuesday, 03:00–04:00 PM, Howard Hall 241
- Thursday, 10:00–11:00 AM, Howard Hall 241
- Thursday, 01:30–02:30 PM, Howard Hall 241
- Friday, 09:00–10:00 AM, Howard Hall 241

I have an open-door policy during those times: you can show up unannounced. If you cannot make the scheduled office hours, please e-mail me about making an appointment.

If you are struggling with the homework, having difficulty with the quizzes, or just want to chat, please visit me during my office hours. I am here to help.

- 30% for 2 in-class exams (15% each)
- 20% for homework problem sets
- 40% for data analysis projects
- 10% for class participation

- \([90, 100] \to \text{A}\)
- \([80, 90) \,\,\, \to \text{B}\)
- \([70, 80) \,\,\, \to \text{C}\)
- \([60, 70) \,\,\, \to \text{D}\)
- \([0, 60) \,\,\,\,\,\, \to \text{F}\)

There will be three take-home data analysis projects over the course of the semester. For each data analysis project, you will be provided with a research question and a data set collected to answer that question. You will analyze the data set, and write up your analysis and findings as a scientific report. For the final data analysis project, you will also give a 15–20 minute presentation on your analysis and findings. Each of the first two data analysis projects will count for 10% of your final grade and the final data analysis project will count for 20% of your final grade. A rubric and template report will be provided prior to the assignment of the first data analysis project.

The **required** textbook is:

- Michael H. Kutner, Christopher J. Nachtsheim, and John Neter.
*Applied Linear Regression Models*, Fourth Edition (McGraw-Hill, 2004).

We will use R, a programming language for statistical computing, throughout the semester for in-class activities and homework assignments. I will cover the relevant features of R throughout the course.

You can access R from any web-accessible computer using RStudio Cloud. You will need to create an account on RStudio Cloud from their Registration page. I will send out a link via email for you to join a Space on RStudio Cloud for this course. Resources for homework, labs, etc., will be hosted on RStudio Cloud for easy access.

You can also install R on your personal computer, if you have one. You can install R by following the instructions for Windows here, for macOS here, or for Linux here. You will also want to install RStudio, an Integrated Development Environment for R, which you can find here.

We will use R as a scripting language and statistical calculator, and thus will not get into the nitty-gritty of programming in R. The R Tutorial by Kelly Black is a good reference for the basics of using R. I will demonstrate R's functionality in class and handouts as we need it.
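To give a feel for the "statistical calculator" style of use, here is a minimal sketch; the data are made up for illustration:

```r
# Vectors, summaries, and critical values: the bread and butter of the course.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

mean(x)            # sample mean of x
sd(y)              # sample standard deviation of y
qt(0.975, df = 3)  # t critical value with n - 2 = 3 degrees of freedom

fit <- lm(y ~ x)   # least-squares line through the (x, y) pairs
coef(fit)          # intercept and slope
```

Everything above runs in the R console one line at a time, which is how we will often work in class.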

You may use Anki for spaced retrieval practice throughout the semester. Anki is open-source, free (as in both *gratis* and *libre*) software. You can download Anki to your personal computer from this link. If you have ever used flashcards, then Anki should be fairly intuitive. If you would like more details you can find Anki's User Manual here.

Anki decks will be provided after each class covering the material that I expect you to know for homework assignments, exams, data analysis projects, etc. You may also want to make your own Anki cards.

- September 3, Lecture 1:
  - **Topics:** Statistical modeling. Drawing lines through scatter plots. Which line did it best? Statistical models as data summaries. Statistical models as stories about populations. Sources of uncertainty in inference: sampling variability, measurement error, and chance.
  - **Sections:** 1.1, 1.2, 1.4
  - Learning Objectives
  - **Homework 1.** Due Lecture 4
- September 6, Lecture 2:
  - **Topics:** Introduction to R. Exploratory data analysis with R. Histograms, scatter plots, and smoothers. R Markdown and `knitr`. Basics of LaTeX (with a pointer to MathPix).
  - **Sections:** Handout
  - Learning Objectives
  - **Lab 1.** Due Lecture 3
- September 10, Lecture 3:
  - **Topics:** Simple linear regression. The least-squares line, and hints at its derivation. Simple linear regression with a random sample from a population. Stories we tell: the simple linear regression with Gaussian noise (SLRGN) model.
  - **Sections:** 1.3, 1.6, 1.7, 1.8
  - Learning Objectives
  - Ordinary Least Squares Demo
  - Simple Linear Regression Schematic Demo
- September 13, Lecture 4:
  - **Topics:** Checking the assumptions of the simple linear model. Residual plots. Out-of-sample generalization. Transformations to handle violations of assumptions.
  - **Sections:** 3.1, 3.2, 3.3, 3.9
  - Learning Objectives
  - Simple Linear Regression Diagnostic Plots Demo
  - **Homework 2.** Due Lecture 6
- September 17, Lecture 5:
  - **Topics:** Properties of the simple linear regression estimators under random sampling from a population. Properties of the simple linear regression estimators under the SLRGN model. Estimation and hypothesis testing under the SLRGN model. The estimate, not the parameter, is significant. What is a \(P\)-value, again? Statistical significance and practical significance.
  - **Sections:** 2.1, 2.2
  - Learning Objectives
  - Demo of Sampling Variability of the Estimators for the Slope and Intercept
- September 20, Lecture 6:
  - **Topics:** Review. Quantiles and critical values. Quantiles and critical values in R. Interval estimators and confidence intervals. Interpreting the confidence level of an interval estimator. Hypothesis tests. How all of this relates to making inferences under the SLRGN model.
  - **Sections:** Review
  - **Homework 3.** Due Lecture 8
  - Demo on Quantiles and Critical Values for the \(t\)-distribution
  - Demo on Interpretation of the Confidence Level of an Interval Estimator
- September 24, Lecture 7:
  - **Topics:** More on estimation and hypothesis testing. The duality between estimation and hypothesis testing. Why 0.05? Use all the numbers: the confidence interval at all confidence levels. Example with the interval estimator for the mean of a Gaussian population. The confidence curve. \(P\)-values (if you must) from the confidence curve.
  - **Sections:** Handout
  - Learning Objectives
  - Demo of Using Confidence Curves to Estimate the Mean of a Gaussian Population
- September 27, Lecture 8:
  - **Topics:** Estimation and hypothesis testing when SLRGN won't do. Bootstrapping. Bootstrap confidence intervals. Bootstrap hypothesis testing. Thank von Neumann for digital computers. Confidence curves from the bootstrap distribution.
  - **Sections:** 11.5, Handout
  - Demo of Case Resampling Bootstrap for Simple Linear Regression
- October 1, Lecture 9:
  - **Topics:** Inferences for expected values: standard errors and confidence sets. Inferences for new measurements: prediction intervals.
  - **Sections:** 2.4, 2.5, 2.6
- October 4, Lecture 10:
  - **Topics:** But how did we do? Assessing the performance of a simple linear regression model at prediction. In-sample error. Out-of-sample error. Cross-validated error. Bootstrapping estimates of error.
  - **Sections:** Handout, 9.3, 9.6
- October 11, Lecture 11:
  - **Topics:** A taste of linear algebra: simple linear regression via matrices and vectors. Review of matrix-vector arithmetic. The inverse of a matrix. Doing linear algebra in R. Simple linear regression as a matrix-vector equation. The solution to ordinary least squares as a matrix-vector equation.
  - **Sections:** 5.1, 5.2, 5.3, 5.6, 5.9, 5.10
- October 15, Lecture 12:
  - **Topics:** Exam 1.
  - **Sections:** Chapters 1-3 and Lecture Notes 1-8
- October 18, Lecture 13:
  - **Topics:** Multiple linear regression. Predicting with multiple predictors. Least squares with multiple predictors. Why multiple linear regression is different from combining several simple linear regressions. Why regression coefficients change when we change the predictors in our model.
  - **Sections:** 6.1, 6.3, 6.4
- October 22, Lecture 14:
  - **Topics:** Stories we tell: the multiple linear regression with Gaussian noise (MLRGN) model. Assumption-checking for multiple linear regression: the same, but different. Properties of the ordinary least squares estimators under the MLRGN model.
  - **Sections:** Handout, 6.6, 6.8
- October 25, Lecture 15:
  - **Topics:** Standardizing the predictors and response to make multiple linear regression coefficients more interpretable. Hypothesis testing for a single regression coefficient in a multiple linear regression. What makes an estimate significant? Statistical versus practical significance: redux. Confidence curves for the coefficients of a multiple linear regression model under the MLRGN model. Confidence curves for the coefficients of a multiple linear regression model via bootstrapping. Adjusting for multiple comparisons.
  - **Sections:** Handout, 6.6, 11.5
- October 29, Lecture 16:
  - **Topics:** Testing and confidence sets for multiple coefficients. Coefficient plots (with confidence intervals) in place of tables. Testing for groups of coefficients in the context of a larger model. Testing all the slopes in the context of a larger model. Confidence rectangles via multiple comparisons. Confidence ellipsoids.
  - **Sections:** Handout, 7.3
- November 1, Lecture 17:
  - **Topics:** Dealing with nonlinearity. Adding polynomial terms. Dealing with categorical predictors by adding "dummy" variables. Interpreting the coefficients on categorical variables.
  - **Sections:** 8.1, 8.3
- November 5, Lecture 18:
  - **Topics:** Multicollinearity: what it is and why it's a problem. Identifying collinearity from pairwise plots of the predictors. Dropping predictors to remove collinearity. Ridge regression for handling multicollinearity.
  - **Sections:** Handout, 7.6, 11.2
- November 8, Lecture 19:
  - **Topics:** Interactions. Interactions in a linear model. Interactions between numerical and categorical variables.
  - **Sections:** 8.2, 8.5
- November 12, Lecture 20:
  - **Topics:** Influential points and outliers. Influence of a data point on the ordinary least squares estimates. Detecting outliers. Dealing with outliers and influential points: deletion and robust regression.
  - **Sections:** 10.2, 10.3, 10.4, 11.3
- November 15, Lecture 21:
  - **Topics:** Lab 2.
- November 19, Lecture 22:
  - **Topics:** Exam 2.
  - **Sections:** Chapters 5-8 and Lecture Notes 9-19
- November 22, Lecture 23:
  - **Topics:** Model selection. Who wore it best? Silly things you may see people do. Modern approaches to model selection. The perils of inference after model selection.
  - **Sections:** Handout, 9.3, 9.4
- November 26, Lecture 24:
  - **Topics:** Variable selection as a special case of model selection. Finding important variables. What do we mean by "important"? Please leave \(P\) out of this. Cross-validation for variable selection. All subsets selection. Stepwise regression. The perils of inference after model selection, redux.
  - **Sections:** Handout, 9.3, 9.4
- December 3, Lecture 25:
  - **Topics:** When the noise misbehaves: heteroskedasticity and non-constant noise variance. Weighted least squares. Weighted least squares in R.
  - **Sections:** 11.1
- December 6, Lecture 26:
  - **Topics:** The world is neither linear, nor quadratic, nor... Nonparametric regression: it's regression, Jim, but not as we know it. Regression as local smoothing. Linear regression as a simple smoother. Nearest neighbor regression. Kernel regression. Tuning parameters and their selection. R packages for nonparametric regression.
  - **Sections:** Handout, 3.10, 11.4