David Darmon

MA 440-01, Regression and Time Series Analysis

Fall 2019

Tuesday, 11:40 AM — 1:00 PM, Howard Hall 212
Friday, 11:40 AM — 1:00 PM, Howard Hall 308

Here's the official description:

Covers topics related to multiple regression techniques, including testing the assumptions required for each to be valid. This includes applications to yield curve smoothing, pricing, and investment models, and the use of principal component analysis. Also covered are techniques for the analysis and modeling of time series data and forecasting.

This class covers linear statistical models: what they are, how to estimate and use them, and how to make inferences from them. Our focus will be on data analysis, statistical modeling and prediction, and critical thinking about what a statistical model can and cannot tell us about a scientific question. We will cover both classical methods of inference, including their often unrealistic assumptions, and methods that make far fewer assumptions. Ultimately, our goal is to use linear statistical models to explore and analyze data and report our findings to a community of peers.

Prerequisites

MA 116 or MA 118 or MA 126, passed with a grade of C- or higher, and MA 151 or MA 220 or BE 251, passed with a grade of C- or higher.

Professor

Dr. David Darmon ddarmon [at] monmouth.edu
Howard Hall 241

Topics, Notes, Readings

This is a tentative listing of topics, in order.

Statistical models for bivariate data: Statistical prediction: the predictor and the response. The line of best fit as a data-summary device. A statistical model as a story for how the data are generated. Where does the 'randomness' come from in the first place? Populations, super-populations, and no-populations.
Exploratory Data Analysis with R: Histograms, scatter plots, and smoothers.
Statistical models for simple linear regression: The least-squares line. Interpreting the regression coefficients: slope and intercept. Viewing the data as a random sample from a population. The standard line: simple linear regression with Gaussian noise (SLRGN).
Checking assumptions before we infer: Sample residuals. Residual plots from the sample residuals. Where the SLRGN model can go wrong, and what it looks like. Checking validity using out-of-sample performance. Transforming the predictor and response to handle common violations. (An R sketch of this fit-then-check workflow follows this list.)
Inferences for parameters of a simple linear regression: Properties of the simple linear regression estimators under random sampling from a population, in general, and under the SLRGN model, in particular. Estimation and hypothesis testing under the SLRGN model. Confidence curves as summaries of all available information about regression parameters. Bootstrapping: treating the sample as a population of its own. Confidence intervals and \(P\)-values via bootstrapping. Confidence curves via bootstrapping.
Inferences for expected values of a simple linear regression: Standard errors and confidence sets for the population line. Prediction intervals for a new measurement from the population.
Assessing the simple linear regression as a predictive model: Predictive error. In-sample error. Out-of-sample error. Cross-validated error. Bootstrapping estimates of errors.
Multiple linear regression: Like simple linear regression, but more. Predicting a response with more than one predictor. Least squares and the hyperplane-of-best-fit. Interpreting the regression coefficients. The dependence of population-level regression coefficients on the predictors included in the model.
Statistical models for multiple linear regression and model checking: The multiple linear regression with Gaussian noise (MLRGN) model. Assumption checking for the MLRGN model. Common departures from the MLRGN model, and how they show up in the sample residuals.
Inferences for parameters of a multiple linear regression: The sampling distribution of multiple linear regression coefficients under the MLRGN model. Confidence curves for the coefficients of a multiple linear regression via both the MLRGN model and bootstrapping. Adjusting for multiple comparisons. Testing one, some, or all coefficients in a multiple linear regression. Do you really care so much about that \(P\)-value?
Enhancements and adjustments to multiple linear regression: Nonlinearity in predictor variables. Categorical predictors. Multicollinearity amongst predictors. Interactions between predictors. Influential points and outliers.
Model selection: Choosing amongst a set of candidate models. Predictor variable selection as model selection. Theoretical and data-driven approaches to model selection. The perils of naive inference post-model selection.
Beyond Ordinary Least Squares: Heteroskedasticity in the noise. Weighted least squares for handling heteroskedasticity. Nonparametric regression. Regression as a smoothing problem. Nearest neighbor and kernel regressions. Tuning parameters in nonparametric regression, and how to tune them.
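To preview the fit-then-check workflow in the topics above, here is a minimal R sketch. It assumes a data frame d with a predictor column x and a response column y; all three names are placeholders, not part of any assignment.

    # Fit the least-squares line and summarize the estimated coefficients.
    fit <- lm(y ~ x, data = d)
    summary(fit)

    # Check the model: plot the sample residuals against the fitted values.
    plot(fitted(fit), resid(fit),
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)   # residuals should scatter evenly about zero

We will build up to, and well beyond, this workflow over the semester.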
See the end of this page for the current lecture schedule, subject to revision. Homework and additional resources will be linked there, as available.

Course Mechanics

Office Hours

I will have office hours at the following four times each week:

Tuesday, 3:00–4:00 PM, Howard Hall 241
Thursday, 10:00–11:00 AM, Howard Hall 241
Thursday, 1:30–2:30 PM, Howard Hall 241
Friday, 9:00–10:00 AM, Howard Hall 241

I have an open-door policy during those times: you can show up unannounced. If you cannot make the scheduled office hours, please e-mail me about making an appointment.

If you are struggling with the homework, having difficulty with the quizzes, or just want to chat, please visit me during my office hours. I am here to help.

Grading Policy

Your final grade will be determined by the following weighting scheme:
30% for two in-class exams (15% each)
20% for homework problem sets
40% for data analysis projects
10% for class participation
I will use the standard 10-point breakdown to assign letter grades to numerical grades.

Homework

Homework will be assigned on Fridays and listed in the Schedule section of this page. Homework assignments are due at the beginning of class on the following Friday.

Data Analysis Projects

There will be three take-home data analysis projects over the course of the semester. For each data analysis project, you will be provided with a research question and a data set collected to answer that question. You will analyze the data set, and write up your analysis and findings as a scientific report. For the final data analysis project, you will also give a 15-20 minute presentation on your analysis and findings. Each of the first two data analysis projects will count for 10% of your final grade and the final data analysis project will count for 20% of your final grade. A rubric and template report will be provided prior to the assignment of the first data analysis project.

Class Participation

I expect you to be fully engaged during each class, to ask questions when confused, and to attempt to answer questions when called on. During class, attempting to answer a question is more important than getting the correct answer, and class participation will be assigned accordingly.

Attendance

Required. If you expect to miss two or three sessions of the course, you should take the course during another semester.

Examination Absences

If you miss an examination, your grade will be zero for that exam. If you know you will be absent for an exam, you must let me know at least one week in advance to schedule a make-up exam.

Textbook

The required textbook is the regression text by Kutner, Nachtsheim, and Neter; all chapter and section references in the Schedule below are to this book.

Collaboration, Cheating, and Plagiarism

All submitted work should be your own. You are welcome and encouraged to consult with others while working on an assignment, including other students in the class and tutors in the Mathematics Learning Center. However, whenever you have had assistance with a problem, you must say so at the beginning of the problem solution. Unless this mechanism is abused, there will be no reduction in credit for using and reporting such assistance. This policy applies to both individual and group work; in group work, you only need to acknowledge help from outside the group. This policy does not apply to examinations.

Statement on Special Accommodations

Students with disabilities who need special accommodations for this class are encouraged to meet with me or the appropriate disability service provider on campus as soon as possible. In order to receive accommodations, students must be registered with the appropriate disability service provider on campus as set forth in the student handbook and must follow the University procedure for self-disclosure, which is stated in the University Guide to Services and Accommodations for Students with Disabilities. Students will not be afforded any special accommodations for academic work completed prior to the disclosure of the disability, nor will they be afforded any special accommodations prior to the completion of the documentation process with the appropriate disability office.

R

We will use R, a programming language for statistical computing, throughout the semester for in-class activities and homework assignments. I will cover the relevant features of R throughout the course.

You can access R from any web-accessible computer using RStudio Cloud. You will need to create an account on RStudio Cloud from their Registration page. I will send out a link via email for you to join a Space on RStudio Cloud for this course. Resources for homework, labs, etc., will be hosted on RStudio Cloud for easy access.

You can also install R on your personal computer, if you have one, by following the instructions for Windows here, for macOS here, or for Linux here. You will also want to install RStudio, an integrated development environment for R, which you can find here.

We will use R as a scripting language and statistical calculator, and thus will not get into the nitty-gritty of programming in R. The R Tutorial by Kelly Black is a good reference for the basics of using R. I will demonstrate R's functionality in class and handouts as we need it.
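As a taste of R in its role as a statistical calculator, here is a minimal sketch; the data are made up for illustration.

    # Vectors and summary statistics, as on a calculator.
    heights <- c(1.62, 1.75, 1.68, 1.80)   # illustrative data, in meters
    mean(heights)
    sd(heights)

    # Built-in probability functions: the 0.975 quantile of the standard normal.
    qnorm(0.975)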

Anki

You may use Anki for spaced retrieval practice throughout the semester. Anki is open-source, free (as in both gratis and libre) software. You can download Anki to your personal computer from this link. If you have ever used flashcards, then Anki should be fairly intuitive. If you would like more details you can find Anki's User Manual here.

After each class, an Anki deck will be provided covering the material that I expect you to know for homework assignments, exams, data analysis projects, etc. You may also want to make your own Anki cards.

Schedule

Subject to revision. Assignments and solutions will all be linked here, as they are available. All readings are from the textbook by Kutner, Nachtsheim, and Neter unless otherwise noted.
September 3, Lecture 1:
Topics: Statistical modeling. Drawing lines through scatter plots. Which line did it best? Statistical models as data summaries. Statistical models as stories about populations. Sources of uncertainty in inference: sampling variability, measurement error, and chance.
Sections: 1.1, 1.2, 1.4
Learning Objectives
Homework 1. Due Lecture 4
September 6, Lecture 2:
Topics: Introduction to R. Exploratory data analysis with R. Histograms, scatter plots, and smoothers. R Markdown and knitr. Basics of LaTeX (with a pointer to MathPix).
Sections: Handout
Learning Objectives
Lab 1. Due Lecture 3
September 10, Lecture 3:
Topics: Simple linear regression. The least-squares line, and hints at its derivation. Simple linear regression with a random sample from a population. Stories we tell: the simple linear regression with Gaussian noise (SLRGN) model.
Sections: 1.3, 1.6, 1.7, 1.8
Learning Objectives
Ordinary Least Squares Demo
Simple Linear Regression Schematic Demo
September 13, Lecture 4:
Topics: Checking the assumptions of the simple linear model. Residual plots. Out-of-sample generalization. Transformations to handle violations of assumptions.
Sections: 3.1, 3.2, 3.3, 3.9
Learning Objectives
Simple Linear Regression Diagnostic Plots Demo
Homework 2. Due Lecture 6
September 17, Lecture 5:
Topics: Properties of the simple linear regression estimators under random sampling from a population. Properties of the simple linear regression estimators under the SLRGN model. Estimation and hypothesis testing under the SLRGN model. The estimate, not the parameter, is significant. What is a \(P\)-value, again? Statistical significance and practical significance.
Sections: 2.1, 2.2
Learning Objectives
Demo of Sampling Variability of the Estimators for the Slope and Intercept
September 20, Lecture 6:
Topics: Review. Quantiles and critical values. Quantiles and critical values in R. Interval estimators and confidence intervals. Interpreting the confidence level of an interval estimator. Hypothesis tests. How all of this relates to making inferences under the SLRGN model.
Sections: Review
Homework 3. Due Lecture 8
Demo on Quantiles and Critical Values for the \(t\)-distribution
Demo on Interpretation of the Confidence Level of an Interval Estimator
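A minimal sketch of the kind of computation this lecture covers: quantiles and critical values of the \(t\)-distribution in R. The degrees of freedom and the observed statistic below are illustrative.

    # Two-sided critical value for a 95% confidence interval,
    # with n - 2 = 23 degrees of freedom.
    qt(0.975, df = 23)

    # Two-sided P-value for an observed t-statistic of 2.1.
    2 * pt(-abs(2.1), df = 23)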
September 24, Lecture 7:
Topics: More on estimation and hypothesis testing. The duality between estimation and hypothesis testing. Why 0.05? Use all the numbers: the confidence interval at all confidence levels. Example with the interval estimator for the mean of a Gaussian population. The confidence curve. \(P\)-values (if you must) from the confidence curve.
Sections: Handout
Learning Objectives
Demo of Using Confidence Curves to Estimate the Mean of a Gaussian Population
September 27, Lecture 8:
Topics: Estimation and hypothesis testing when SLRGN won't do. Bootstrapping. Bootstrap confidence intervals. Bootstrap hypothesis testing. Thank von Neumann for digital computers. Confidence curves from the bootstrap distribution.
Sections: 11.5, Handout
Demo of Case Resampling Bootstrap for Simple Linear Regression
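A minimal sketch of the case-resampling bootstrap for the slope of a simple linear regression, assuming a data frame d with columns x and y (placeholder names):

    B <- 2000                      # number of bootstrap resamples
    n <- nrow(d)
    boot.slopes <- numeric(B)
    for (b in 1:B) {
      # Resample the cases (rows) with replacement, then refit the line.
      d.star <- d[sample(n, replace = TRUE), ]
      boot.slopes[b] <- coef(lm(y ~ x, data = d.star))[2]
    }
    # Percentile bootstrap 95% confidence interval for the slope.
    quantile(boot.slopes, c(0.025, 0.975))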
October 1, Lecture 9:
Topics: Inferences for expected values: standard errors and confidence sets. Inferences for new measurements: prediction intervals.
Sections: 2.4, 2.5, 2.6
October 4, Lecture 10:
Topics: But how did we do? Assessing the performance of a simple linear regression model at prediction. In-sample error. Out-of-sample error. Cross-validated error. Bootstrapping estimates of error.
Sections: Handout, 9.3, 9.6
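A minimal sketch of leave-one-out cross-validation for the prediction error of a simple linear regression, again assuming a data frame d with columns x and y:

    n <- nrow(d)
    sq.errs <- numeric(n)
    for (i in 1:n) {
      # Fit without the i-th case, then predict the held-out case.
      fit <- lm(y ~ x, data = d[-i, ])
      sq.errs[i] <- (d$y[i] - predict(fit, newdata = d[i, ]))^2
    }
    mean(sq.errs)   # cross-validated estimate of the squared prediction error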
October 11, Lecture 11:
Topics: A taste of linear algebra: simple linear regression via matrices and vectors. Review of matrix-vector arithmetic. The inverse of a matrix. Doing linear algebra in R. Simple linear regression as a matrix-vector equation. The solution to ordinary least squares as a matrix-vector equation.
Sections: 5.1, 5.2, 5.3, 5.6, 5.9, 5.10
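A minimal sketch of the ordinary least squares solution \(\hat{\beta} = (X^{T} X)^{-1} X^{T} y\), computed directly with R's matrix operations and assuming numeric vectors x and y:

    # Design matrix: a column of ones (intercept) next to the predictor.
    X <- cbind(1, x)

    # Solve the normal equations (X'X) beta = X'y.
    beta.hat <- solve(t(X) %*% X, t(X) %*% y)
    beta.hat

    coef(lm(y ~ x))   # should agree with the matrix solution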
October 15, Lecture 12:
Topics: Exam 1.
Sections: Chapters 1-3 and Lecture Notes 1-8
October 18, Lecture 13:
Topics: Multiple linear regression. Predicting with multiple predictors. Least squares with multiple predictors. Why multiple linear regression is different from combining several simple linear regressions. Why regression coefficients change when we change the predictors in our model.
Sections: 6.1, 6.3, 6.4
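A minimal sketch of the jump from one predictor to several, assuming a data frame d with response y and predictors x1 and x2 (placeholder names):

    # Ordinary least squares with two predictors: a plane of best fit.
    fit <- lm(y ~ x1 + x2, data = d)
    coef(fit)   # an intercept plus one slope per predictor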
October 22, Lecture 14:
Topics: Stories we tell: the multiple linear regression with Gaussian noise (MLRGN) model. Assumption-checking for multiple linear regression: the same, but different. Properties of the ordinary least squares estimators under the MLRGN model.
Sections: Handout, 6.6, 6.8
October 25, Lecture 15:
Topics: Standardizing the predictors and response to make multiple linear regression coefficients more interpretable. Hypothesis testing for a single regression coefficient in a multiple linear regression. What makes an estimate significant? Statistical versus practical significance: redux. Confidence curves for the coefficients of a multiple linear regression model under the MLRGN model. Confidence curves for the coefficients of a multiple linear regression model via bootstrapping. Adjusting for multiple comparisons.
Sections: Handout, 6.6, 11.5
October 29, Lecture 16:
Topics: Testing and confidence sets for multiple coefficients. Coefficient plots (with confidence intervals) in place of tables. Testing for groups of coefficients in the context of a larger model. Testing all the slopes in the context of a larger model. Confidence rectangles via multiple comparisons. Confidence ellipsoids.
Sections: Handout, 7.3
November 1, Lecture 17:
Topics: Dealing with nonlinearity. Adding polynomial terms. Dealing with categorical predictors by adding "dummy" variables. Interpreting the coefficients on categorical variables.
Sections: 8.1, 8.3
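A minimal sketch of both enhancements in R's formula language, assuming a data frame d with a numerical predictor x, a categorical predictor g, and a response y (placeholder names):

    # Nonlinearity via a quadratic term in the predictor.
    fit.quad <- lm(y ~ poly(x, 2), data = d)

    # R builds dummy variables automatically for factors.
    d$g <- factor(d$g)
    fit.cat <- lm(y ~ x + g, data = d)
    summary(fit.cat)   # one coefficient per non-baseline level of g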
November 5, Lecture 18:
Topics: Multicollinearity: what it is and why it's a problem. Identifying collinearity from pairwise plots of the predictors. Dropping predictors to remove collinearity. Ridge regression for handling multicollinearity.
Sections: Handout, 7.6, 11.2
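A minimal sketch of ridge regression using the MASS package, which ships with R; the data frame d, the predictors x1 and x2, and the penalty grid are placeholders.

    library(MASS)

    # Fit ridge regressions across a grid of penalty values lambda.
    fit.ridge <- lm.ridge(y ~ x1 + x2, data = d,
                          lambda = seq(0, 10, by = 0.1))

    # Suggested penalties, including the generalized cross-validation choice.
    select(fit.ridge)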
November 8, Lecture 19:
Topics: Interactions. Interactions in a linear model. Interactions between numerical and categorical variables.
Sections: 8.2, 8.5
November 12, Lecture 20:
Topics: Influential points and outliers. Influence of a data point on the ordinary least squares estimates. Detecting outliers. Dealing with outliers and influential points: deletion and robust regression.
Sections: 10.2, 10.3, 10.4, 11.3
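A minimal sketch of two tools from this lecture, assuming a data frame d with columns x and y:

    # Cook's distance measures each case's influence on the OLS fit.
    fit <- lm(y ~ x, data = d)
    cooks.distance(fit)

    # Robust regression (Huber M-estimation) downweights outlying cases.
    library(MASS)
    fit.robust <- rlm(y ~ x, data = d)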
November 15, Lecture 21:
Topics: Lab 2.
Sections:
November 19, Lecture 22:
Topics: Exam 2.
Sections: Chapters 5-8 and Lecture Notes 9-19
November 22, Lecture 23:
Topics: Model selection. Who wore it best? Silly things you may see people do. Modern approaches to model selection. The perils of inference after model selection.
Sections: Handout, 9.3, 9.4
November 26, Lecture 24:
Topics: Variable selection as a special case of model selection. Finding important variables. What do we mean by "important"? Please leave \(P\) out of this. Cross-validation for variable selection. All subsets selection. Stepwise regression. The perils of inference after model selection, redux.
Sections: Handout, 9.3, 9.4
December 3, Lecture 25:
Topics: When the noise misbehaves: heteroskedasticity and non-constant noise variance. Weighted least squares. Weighted least squares in R.
Sections: 11.1
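A minimal sketch of weighted least squares via the weights argument to lm(), under the illustrative assumption that the noise standard deviation grows proportionally with x (so the noise variance grows with x squared):

    # Weights should be inversely proportional to the noise variance.
    w <- 1 / d$x^2
    fit.wls <- lm(y ~ x, data = d, weights = w)
    summary(fit.wls)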
December 6, Lecture 26:
Topics: The world is neither linear, nor quadratic, nor... Nonparametric regression: it's regression, Jim, but not as we know it. Regression as local smoothing. Linear regression as a simple smoother. Nearest neighbor regression. Kernel regression. Tuning parameters and their selection. R packages for nonparametric regression.
Sections: Handout, 3.10, 11.4
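A minimal sketch of two smoothers available in base R, assuming numeric vectors x and y; the bandwidth and span below are tuning parameters of the kind this lecture discusses.

    # Kernel regression with a Gaussian kernel.
    fit.kernel <- ksmooth(x, y, kernel = "normal", bandwidth = 0.5)

    # Local regression (loess); span controls the amount of smoothing.
    fit.loess <- loess(y ~ x, span = 0.75)

    # Overlay both smooths on the scatter plot.
    plot(x, y)
    lines(fit.kernel, col = "blue")
    o <- order(x)
    lines(x[o], fitted(fit.loess)[o], col = "red")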