David Darmon
MA 440-01, Regression and Time Series Analysis
Fall 2019
Tuesday, 11:40 AM–1:00 PM, Howard Hall 212
Friday, 11:40 AM–1:00 PM, Howard Hall 308
Here's the official description:
Covers topics related to multiple regression techniques, including testing the assumptions required for each to be valid. This includes applications to yield curve smoothing, pricing, and investment models, and the use of principal component analysis. Also covered are techniques for the analysis and modeling of time series data and forecasting.
This class covers linear statistical models: what they are, how to estimate and use them, and how to make inferences from them. Our focus will be on data analysis, statistical modeling and prediction, and critical thinking about what a statistical model can and cannot tell us about a scientific question. We will cover both classical methods of inference, with their often unrealistic assumptions, and methods that make far fewer assumptions. Ultimately, our goal is to use linear statistical models to explore and analyze data and report our findings to a community of peers.
Prerequisites
MA 116 or MA 118 or MA 126, passed with a grade of C- or higher, and MA 151 or MA 220 or BE 251, passed with a grade of C- or higher.
Professor
Dr. David Darmon | ddarmon [at] monmouth.edu | Howard Hall 241 |
Topics, Notes, Readings
This is a tentative listing of topics, in order; a brief R sketch illustrating a few of them appears after the list.
- Statistical models for bivariate data: Statistical prediction: the predictor and the response. The line of best fit as a data-summary device. A statistical model as a story for how the data are generated. Where does the 'randomness' come from in the first place? Populations, super-populations, and no-populations.
- Exploratory Data Analysis with R: Histograms, scatter plots, and smoothers.
- Statistical models for simple linear regression: The least-squares line. Interpreting the regression coefficients: slope and intercept. Viewing the data as a random sample from a population. The standard line: simple linear regression with Gaussian noise (SLRGN).
- Checking assumptions before we infer: Sample residuals. Residual plots from the sample residuals. Where the SLRGN model can go wrong, and what it looks like. Checking validity using out-of-sample performance. Transforming the predictor and response to handle common violations.
- Inferences for parameters of a simple linear regression: Properties of the simple linear regression estimators under random sampling from a population, in general, and under the SLRGN model, in particular. Estimation and hypothesis testing under the SLRGN model. Confidence curves as summaries of all available information about regression parameters. Bootstrapping: when the data are a population of their own. Confidence intervals and \(P\)-values via bootstrapping. Confidence curves via bootstrapping.
- Inferences for expected values of a simple linear regression: Standard errors and confidence sets for the population line. Prediction intervals for a new measurement from the population.
- Assessing the simple linear regression as a predictive model: Predictive error. In-sample error. Out-of-sample error. Cross-validated error. Bootstrapping estimates of errors.
- Multiple linear regression: Like simple linear regression, but more. Predicting a response with more than one predictor. Least squares and the hyperplane-of-best-fit. Interpreting the regression coefficients. The dependence of population-level regression coefficients on the predictors included in the model.
- Statistical models for multiple linear regression and model checking: The multiple linear regression with Gaussian noise (MLRGN) model. Assumption checking for the MLRGN model. Common departures from the MLRGN model, and how they show up in the sample residuals.
- Inferences for parameters of a multiple linear regression: The sampling distribution of multiple linear regression coefficients under the MLRGN model. Confidence curves for the coefficients of a multiple linear regression via both the MLRGN model and bootstrapping. Adjusting for multiple comparisons. Testing one, more, or all coefficients in a multiple linear regression. Do you really care so much about that \(P\)-value?
- Enhancements and adjustments to multiple linear regression: Nonlinearity in predictor variables. Categorical predictors. Multicollinearity amongst predictors. Interactions between predictors. Influential points and outliers.
- Model selection: Choosing amongst a set of candidate models. Predictor variable selection as model selection. Theoretical and data-driven approaches to model selection. The perils of naive inference post-model selection.
- Beyond Ordinary Least Squares: Heteroskedasticity in the noise. Weighted least squares for handling heteroskedasticity. Nonparametric regression. Regression as a smoothing problem. Nearest neighbor and kernel regressions. Tuning parameters in nonparametric regression, and how to tune them.
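As a brief preview of what several of these topics look like in practice, here is a minimal R sketch, using the built-in `cars` data set purely for illustration, that fits a least-squares line, plots the sample residuals, and bootstraps a confidence interval for the slope:

```r
# A minimal preview: simple linear regression in R with the built-in
# 'cars' data set (stopping distance as a function of speed).

# Fit the least-squares line and examine the estimated coefficients.
fit <- lm(dist ~ speed, data = cars)
summary(fit)

# Residual plot: a first check of the model assumptions.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Case-resampling bootstrap for the slope: resample rows of the data,
# refit the model, and record the slope estimate each time.
B <- 2000
boot_slopes <- replicate(B, {
  rows <- sample(nrow(cars), replace = TRUE)
  coef(lm(dist ~ speed, data = cars[rows, ]))["speed"]
})

# A 95% bootstrap percentile confidence interval for the slope.
quantile(boot_slopes, c(0.025, 0.975))
```

Everything in this sketch will be developed carefully during the semester; the point here is only to show the flavor of the workflow.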
See the end of this page for the current lecture schedule, subject to revision. Homework and additional resources will be linked there, as available.
Course Mechanics
Office Hours
I will have office hours at the following four times each week:
Tuesday, 03:00–04:00 PM | Howard Hall 241 |
Thursday, 10:00–11:00 AM | Howard Hall 241 |
Thursday, 01:30–02:30 PM | Howard Hall 241 |
Friday, 09:00–10:00 AM | Howard Hall 241 |
I have an open-door policy during those times: you can show up unannounced. If you cannot make the scheduled office hours, please e-mail me about making an appointment.
If you are struggling with the homework, having difficulty with the quizzes, or just want to chat, please visit me during my office hours. I am here to help.
Grading Policy
Your final grade will be determined by the following weighting scheme:
- 30% for 2 in-class exams (15% each)
- 20% for homework problem sets
- 40% for data analysis projects
- 10% for class participation
I will use the standard 10-point breakdown to convert numerical grades to letter grades (a worked example follows the breakdown):
- \([90, 100] \to \text{A}\)
- \([80, 90) \,\,\, \to \text{B}\)
- \([70, 80) \,\,\, \to \text{C}\)
- \([60, 70) \,\,\, \to \text{D}\)
- \([0, 60) \,\,\,\,\,\, \to \text{F}\)
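For example, a hypothetical student who averages 85 on the exams, 90 on the homework, 88 on the data analysis projects, and 100 on class participation would earn a final numerical grade of
\[
0.30 \times 85 + 0.20 \times 90 + 0.40 \times 88 + 0.10 \times 100 = 88.7,
\]
which falls in \([80, 90)\) and therefore maps to a B.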
Homework
Homework will be assigned on Fridays and listed in the Schedule section of this page. Homework assignments are due at the beginning of class on the following Friday.
Data Analysis Projects
There will be three take-home data analysis projects over the course of the semester. For each data analysis project, you will be provided with a research question and a data set collected to answer that question. You will analyze the data set and write up your analysis and findings as a scientific report. For the final data analysis project, you will also give a 15–20 minute presentation on your analysis and findings. Each of the first two data analysis projects will count for 10% of your final grade, and the final data analysis project will count for 20% of your final grade. A rubric and a template report will be provided prior to the assignment of the first data analysis project.
Class Participation
I expect you to be fully engaged during each class, to ask questions when confused, and to attempt to answer questions when called on. During class, attempting to answer a question is more important than getting the correct answer, and class participation credit will be assigned accordingly.
Attendance
Required. If you expect to miss 2-3 sessions of the course, you should take the course during another semester.
Examination Absences
If you miss an examination, your grade for that exam will be zero. If you know you will be absent for an exam, you must let me know at least one week in advance to schedule a make-up exam.
Textbook
The required textbook is:
- Michael H. Kutner, Christopher J. Nachtsheim, and John Neter. Applied Linear Regression Models, Fourth Edition (McGraw-Hill, 2004).
Collaboration, Cheating, and Plagiarism
All submitted work should be your own. You are welcome and encouraged to consult with others while working on an assignment, including other students in the class and tutors in the Mathematics Learning Center. However, whenever you have had assistance with a problem, you must say so at the beginning of the problem solution. Unless this mechanism is abused, there will be no reduction in credit for using and reporting such assistance. This policy applies to both individual and group work. In group work, you only need to acknowledge help from outside the group. This policy does not apply to examinations.
Statement on Special Accommodations
Students with disabilities who need special accommodations for this class are encouraged to meet with me or the appropriate disability service provider on campus as soon as possible. In order to receive accommodations, students must be registered with the appropriate disability service provider on campus as set forth in the student handbook and must follow the University procedure for self-disclosure, which is stated in the University Guide to Services and Accommodations for Students with Disabilities. Students will not be afforded any special accommodations for academic work completed prior to the disclosure of the disability, nor will they be afforded any special accommodations prior to the completion of the documentation process with the appropriate disability office.
R
We will use R, a programming language for statistical computing, throughout the semester for in-class activities and homework assignments. I will cover the relevant features of R throughout the course.
You can access R from any web-accessible computer using RStudio Cloud. You will need to create an account on RStudio Cloud from their Registration page. I will send out a link via email for you to join a Space on RStudio Cloud for this course. Resources for homeworks, labs, etc., will be hosted on RStudio Cloud for easy access.
You can also install R on your personal computer, if you have one. You can install R by following the instructions for Windows here, for macOS here, or for Linux here. You will also want to install RStudio, an Integrated Development Environment for R, which you can find here.
We will use R as a scripting language and statistical calculator, and thus will not get into the nitty-gritty of programming in R. The R Tutorial by Kelly Black is a good reference for the basics of using R. I will demonstrate R's functionality in class and handouts as we need it.
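To give a flavor of this, here is a short, self-contained sketch of R used as a statistical calculator; the numbers are made up purely for illustration:

```r
# R as a statistical calculator: enter some made-up data, compute
# summary statistics, and draw a quick plot.
x <- c(1.2, 2.4, 3.1, 4.8, 5.0)   # hypothetical predictor values
y <- c(2.0, 4.1, 6.3, 9.5, 10.1)  # hypothetical response values

mean(x)     # sample mean of x
sd(y)       # sample standard deviation of y
cor(x, y)   # sample correlation between x and y

plot(x, y)          # scatter plot of y against x
abline(lm(y ~ x))   # overlay the least-squares line
```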
Anki
You may use Anki for spaced retrieval practice throughout the semester. Anki is open-source, free (as in both gratis and libre) software. You can download Anki to your personal computer from this link. If you have ever used flashcards, then Anki should be fairly intuitive. If you would like more details, you can find Anki's User Manual here.
Anki decks will be provided after each class, covering the material that I expect you to know for homework assignments, exams, data analysis projects, etc. You may also want to make your own Anki cards.
Schedule
Subject to revision. Assignments and solutions will all be linked here, as they are available. All readings are from the textbook by Kutner, Nachtsheim, and Neter unless otherwise noted.
- September 3, Lecture 1:
- Topics: Statistical modeling. Drawing lines through scatter plots. Which line did it best? Statistical models as data summaries. Statistical models as stories about populations. Sources of uncertainty in inference: sampling variability, measurement error, and chance.
- Sections: 1.1, 1.2, 1.4
- Learning Objectives
- Homework 1. Due Lecture 4
- September 6, Lecture 2:
- Topics: Introduction to R. Exploratory data analysis with R. Histograms, scatter plots, and smoothers. R Markdown and knitr. Basics of LaTeX (with a pointer to MathPix).
- Sections: Handout
- Learning Objectives
- Lab 1. Due Lecture 3
- September 10, Lecture 3:
- Topics: Simple linear regression. The least-squares line, and hints at its derivation. Simple linear regression with a random sample from a population. Stories we tell: the simple linear regression with Gaussian noise (SLRGN) model.
- Sections: 1.3, 1.6, 1.7, 1.8
- Learning Objectives
- Ordinary Least Squares Demo
- Simple Linear Regression Schematic Demo
- September 13, Lecture 4:
- Topics: Checking the assumptions of the simple linear model. Residual plots. Out-of-sample generalization. Transformations to handle violations of assumptions.
- Sections: 3.1, 3.2, 3.3, 3.9
- Learning Objectives
- Simple Linear Regression Diagnostic Plots Demo
- Homework 2. Due Lecture 6
- September 17, Lecture 5:
- Topics: Properties of the simple linear regression estimators under random sampling from a population. Properties of the simple linear regression estimators under the SLRGN model. Estimation and hypothesis testing under the SLRGN model. The estimate, not the parameter, is significant. What is a \(P\)-value, again? Statistical significance and practical significance.
- Sections: 2.1, 2.2
- Learning Objectives
- Demo of Sampling Variability of the Estimators for the Slope and Intercept
- September 20, Lecture 6:
- Topics: Review. Quantiles and critical values. Quantiles and critical values in R. Interval estimators and confidence intervals. Interpreting the confidence level of an interval estimator. Hypothesis tests. How all of this relates to making inferences under the SLRGN model.
- Sections: Review
- Homework 3. Due Lecture 8
- Demo on Quantiles and Critical Values for the \(t\)-distribution
- Demo on Interpretation of the Confidence Level of an Interval Estimator
- September 24, Lecture 7:
- Topics: More on estimation and hypothesis testing. The duality between estimation and hypothesis testing. Why 0.05? Use all the numbers: the confidence interval at all confidence levels. Example with the interval estimator for the mean of a Gaussian population. The confidence curve. \(P\)-values (if you must) from the confidence curve.
- Sections: Handout
- Learning Objectives
- Demo of Using Confidence Curves to Estimate the Mean of a Gaussian Population
- September 27, Lecture 8:
- Topics: Estimation and hypothesis testing when SLRGN won't do. Bootstrapping. Bootstrap confidence intervals. Bootstrap hypothesis testing. Thank von Neumann for digital computers. Confidence curves from the bootstrap distribution.
- Sections: 11.5, Handout
- Learning Objectives
- Demo of Case Resampling Bootstrap for Simple Linear Regression
- Homework 4. Due Lecture 10
- October 1, Lecture 9:
- Topics: Inferences for expected values: standard errors and confidence sets. Inferences for new measurements: prediction intervals.
- Sections: 2.4, 2.5
- Learning Objectives
- Demo of Confidence Interval for the Expected Response and Prediction Interval for a New Response
- October 4, Lecture 10:
- Topics: A taste of linear algebra: simple linear regression via matrices and vectors. Review of matrix-vector arithmetic. The inverse of a matrix. Doing linear algebra in R.
- Sections: 5.1, 5.2, 5.3, 5.4, 5.6
- Learning Objectives
- Homework 5. Due Lecture 11
- October 11, Lecture 11:
- Topics: Random vectors and random matrices. Expectations and variances of random vectors. Simple linear regression as a matrix-vector equation. The solution to ordinary least squares as a matrix-vector equation. Sampling properties of the ordinary least squares estimator under simple linear regression, via matrix algebra.
- Sections: 5.8, 5.9, 5.10, 5.13
- Learning Objectives
- October 15, Lecture 12:
- Topics: Exam 1.
- Sections: Chapters 1-3 and Lecture Notes 1-9
- Exam 1 Study Guide
- October 18, Lecture 13:
- Topics: Multiple linear regression. Predicting with multiple predictors. Least squares with multiple predictors. Why multiple linear regression is different from combining several simple linear regressions. Why regression coefficients change when we change the predictors in our model.
- Sections: 6.1, 6.3, 6.4
- Learning Objectives
- Multiple Linear Regression Schematic Demo
- Homework 6. Due Lecture 15
- October 22, Lecture 14:
- Topics: Stories we tell: the multiple linear regression with Gaussian noise (MLRGN) model. Assumption-checking for multiple linear regression: the same, but different. Properties of the ordinary least squares estimators under the MLRGN model.
- Sections: Handout, 6.1, 6.8
- Learning Objectives
- October 25, Lecture 15:
- Topics: Standardizing the predictors to make multiple linear regression coefficients more interpretable. Coefficient plots (with confidence intervals) in place of tables. Hypothesis testing for a single regression coefficient under the MLRGN model. What makes an estimate significantly different from 0? Statistical versus practical significance: redux. Confidence curves for a single coefficient under the MLRGN model.
- Sections: Handout, 6.6
- Learning Objectives
- Homework 7. Due Lecture 17
- October 29, Lecture 16:
- Topics: Lab.
- Sections: Lab Handout
- November 1, Lecture 17:
- Topics: Confidence curves for a single coefficient via bootstrapping. Adjusting for multiple comparisons. Confidence sets for multiple coefficients. Confidence rectangles via multiple comparisons. Confidence ellipsoids via the sampling distribution of the coefficient estimator.
- Sections: Handout, 4.1, 7.3, 11.5
- Learning Objectives
- Homework 8. Due Lecture 19
- November 5, Lecture 18:
- Topics: Testing for multiple coefficients. Testing for groups of coefficients in the context of a larger model. Testing all the slopes in the context of a larger model. Categorical predictors. Dealing with categorical predictors by adding "dummy" variables. Reference and baseline categories. Interpreting the coefficients on categorical variables. Oh-NOVA.
- Sections: 7.3, 8.3
- Learning Objectives
- November 8, Lecture 19:
- Topics: Interactions. Interactions in a linear model. Interactions between numerical and categorical variables. Interactions between categorical variables.
- Sections: 8.2, 8.5
- Learning Objectives
- Homework 9. Due Lecture 21
- November 12, Lecture 20:
- Topics: Dealing with nonlinearity. Adding polynomial terms. Multicollinearity: what it is and why it's a problem.
- Sections: Handout, 8.1, 7.6, 11.2
- November 15, Lecture 21:
- Topics: Multicollinearity. Identifying collinearity from pairwise plots of the predictors. Dropping predictors to remove collinearity. Ridge regression for handling multicollinearity.
- Sections: Handout, 7.6, 11.2
- November 19, Lecture 22:
- Topics: Exam 2.
- Sections: Chapters 5-8 and Lecture Notes 9-19
- November 22, Lecture 23:
- Topics: Influential points and outliers. Influence of a data point on the ordinary least squares estimates. Detecting outliers.
- Sections: 10.2, 10.3, 10.4, 11.3
- Demo of Outliers, Leverage, and Influential Points in Simple Linear Regression
- November 26, Lecture 24:
- Topics: Dealing with outliers and influential points: deletion and robust regression.
- Sections: 10.2, 10.3, 10.4, 11.3
- December 3, Lecture 25:
- Topics: Variable selection. Why drop variables? Finding important variables. What do we mean by "important" variables? Please leave \(P\) out of this. Comparing models in-sample. Model selection criteria. Best subset selection. Stepwise selection.
- Sections: Handout, 9.3, 9.4
- Homework 10. Due December 10
- December 6, Lecture 26:
- Topics: Variable selection. The perils of inference after model selection. Valid post-selection inferences by splitting the data into selection and estimation sets. Shrinkage methods as an alternative to subset selection. Ridge regression and LASSO.
- Sections: Handout, 9.3, 9.4
- Demo of the Geometry of Ridge Regression and LASSO
- December 13, Final Exam:
- Time: 11:35 AM–2:25 PM
- Location: Howard Hall 212 (HH 212)