David Darmon
MA 350-01, Computation and Statistics
Spring 2020
Monday and Thursday, 1:15 PM — 2:35 PM, Howard Hall 307
The modern approach to statistics can broadly be broken into two distinct tasks: algorithms for computing statistical quantities (point estimates, confidence intervals, \(P\)-values, etc.) from data, and inferences to be made from those quantities based on statistical models of how the data were generated. Until now in your statistical education, the focus has been on the inferential aspect of statistics. This class is a first course on the algorithmic side of statistics. You will learn the core concepts of programming (functions, objects, data structures, flow control, debugging, logical design, and abstraction) by writing code to implement statistical algorithms. You will also learn how to use stochastic simulation to evaluate a statistical algorithm in terms of its inferential properties. Programming and programming languages are inherently social, so you will also learn the customs and conventions of the tribe of programmers who use R, as well as best practices for organizing and commenting your code.
Prerequisites
MA 116 or MA 118 or MA 126, passed with a grade of C- or higher, and MA 151 or MA 220 or BE 251, passed with a grade of C- or higher.
Professor
Dr. David Darmon | ddarmon [at] monmouth.edu | Howard Hall 241
Topics, Notes, Readings
This is currently a tentative listing of topics, in order.
- Introduction to R: The R console. Using R as a fancy calculator. The RStudio integrated development environment. R Markdown for literate programming.
- Objects, the nouns of programming: Objects and data types. Variables. Floats and integers. Booleans. Characters and strings.
- Data structures: Vectors for one-dimensional collections of the same data type. Arrays for multi-dimensional collections of the same data type. Lists for one-dimensional collections of different data types. Data frames for multi-dimensional collections of different data types.
- Flow control: If-Else and more complex conditional statements. Iteration. For loops and while loops. Vectorization to avoid iteration. *apply for vectorization with multi-dimensional objects.
- Operating on text: More on the character data type. Extracting substrings of a string. Splitting strings at delimiters. Building strings from a pattern. Counting strings.
- Functions, the verbs of programming: Using functions to collect together related commands. Functions as input-output devices. Arguments (inputs) and return values (outputs). Named arguments and defaults. Scope of variables in a function.
- Plotting in R: Use the eyes that Nature gave you! Base R graphics. ggplot and ggformula. Exploratory data analysis using mosaic.
- Stochastic simulation: Random numbers. Sources of randomness in nature. Pseudo-random numbers, and how to generate them. Pseudo-random number generators in R. The p*, d*, q*, and r* functions in R.
- Summarizing data: Using the empirical distribution function to summarize a univariate data set. Density estimation as a smoothed summary of a data set. Numerical summaries: mean, median, standard deviation, range, and much much more. Assessing the statistical properties of numerical summaries via stochastic simulation.
- Debugging and testing code: What is R trying to tell you with all that red text? Let me Google that for you: using the internet wisely for help with programming. Debugging by brute-print. Debugging intelligently using browser in RStudio. Using assertions to catch bugs. Using unit testing to keep code bug-free.
- Version control: Always click save. Version control systems. git and GitHub. Committing changes. Pushing and pulling commits. Using GitHub to create your own R package.
- Numerical methods: Root finding: determining where a function takes a value. Optimization: finding the peaks and valleys of a function. Quadrature: integrating under a curve. Symbolic differentiation: a computer is really good at Calculus I.
- Statistical simulation: Using simulation when the math is too hard. Why simulation can perform complicated computations as if by magic. Simulation to assess the level and power of a hypothesis test, with an application to equivalence testing. Simulation to assess the coverage and expected width of a confidence interval, with an application to the population correlation coefficient.
- Welcome to the tidyverse: Hadley Wickham and the tidyverse. dplyr and the Split/Apply/Combine pattern for analyzing a structured data set. Tidy data and tidyr.
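As a taste of the statistical simulation topic in the list above, here is a minimal sketch of using simulation to estimate the coverage and expected width of a confidence interval. It is an illustration only, not the method used in lecture: it assumes a Normal population and the standard one-sample t interval for a mean, and the sample size, number of simulations, seed, and population parameters are all arbitrary choices.

```r
# Sketch: estimate the coverage and expected width of a 95% t interval
# for a population mean. All settings below are assumptions for illustration.
set.seed(350)  # arbitrary seed so the run is reproducible

n          <- 20    # sample size per simulated data set
n_sims     <- 2000  # number of simulated data sets
mu         <- 5     # true population mean
sigma      <- 2     # true population standard deviation
conf_level <- 0.95  # nominal confidence level

covered <- logical(n_sims)  # did interval i capture mu?
widths  <- numeric(n_sims)  # width of interval i

for (i in seq_len(n_sims)) {
  x  <- rnorm(n, mean = mu, sd = sigma)               # simulate a sample
  ci <- t.test(x, conf.level = conf_level)$conf.int   # compute the interval
  covered[i] <- ci[1] <= mu && mu <= ci[2]
  widths[i]  <- ci[2] - ci[1]
}

coverage       <- mean(covered)  # proportion of intervals containing mu
expected_width <- mean(widths)   # average interval width
```

Because the assumptions of the t interval hold exactly here, the estimated coverage should land close to the nominal 0.95. Swapping `rnorm` for a skewed distribution shows how coverage can degrade when they do not.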
See the end of this page for the current lecture schedule, subject to revision. Homework and additional resources will be linked there, as available.
Course Mechanics
Office Hours
I will have office hours at the following four times each week:
Monday, 10:00–11:00 AM | Howard Hall 241
Tuesday, 3:00–4:00 PM | Howard Hall 241
Thursday, 10:00–11:00 AM | Howard Hall 241
Thursday, 3:00–4:00 PM | Howard Hall 241
I have an open-door policy during those times: you can show up unannounced. If you cannot make the scheduled office hours, please e-mail me about making an appointment.
If you are struggling with the homework, having difficulty with the quizzes, or just want to chat, please visit me during my office hours. I am here to help.
Grading Policy
Your final grade will be determined by the following weighting scheme:
- 30% for homework problem sets
- 10% for labs
- 30% for quizzes
- 20% for final project
- 10% for programming journal
I will use the standard 10-point breakdown to assign letter grades to numerical grades:
- \([90, 100] \to \text{A}\)
- \([80, 90) \,\,\, \to \text{B}\)
- \([70, 80) \,\,\, \to \text{C}\)
- \([60, 70) \,\,\, \to \text{D}\)
- \([0, 60) \,\,\,\,\,\, \to \text{F}\)
with pluses and minuses assigned by dividing the intervals into thirds.
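The weighting scheme and letter-grade breakdown above can be sketched in R as follows. The component scores are made-up values for illustration, and `letter_grade` is a hypothetical helper name. The sketch returns only the base letter, because the exact cutoffs for pluses and minuses are not spelled out here.

```r
# Hypothetical component scores on a 0-100 scale (made-up values).
scores <- c(homework = 88, labs = 95, quizzes = 76, project = 90, journal = 100)

# Weights from the grading policy; they must sum to 1.
weights <- c(homework = 0.30, labs = 0.10, quizzes = 0.30,
             project = 0.20, journal = 0.10)
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Final numerical grade: weighted average of the component scores.
final_grade <- sum(weights * scores[names(weights)])

# Map a numerical grade to a base letter using the 10-point breakdown.
# right = FALSE makes each interval closed on the left, e.g. [80, 90);
# include.lowest = TRUE closes the top interval at 100, giving [90, 100].
letter_grade <- function(grade) {
  as.character(cut(grade,
                   breaks = c(0, 60, 70, 80, 90, 100),
                   labels = c("F", "D", "C", "B", "A"),
                   right = FALSE, include.lowest = TRUE))
}

letter_grade(final_grade)  # final_grade is about 86.7, which maps to "B"
```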
Homework
Homework will be assigned regularly, announced in class, and listed in the Schedule section of this page.
Each homework problem will be graded out of three points: one point for making an honest effort to complete the problem, one point for a technically correct solution, and one point for clean, well-formatted, documented, and easily readable code.
You will have the opportunity to revise your homework within a week after it is returned and resubmit the new solution to earn back partial credit.
Labs
There will be (approximately) weekly labs. The labs will involve brief exercises related to the latest homework assignment. Attending labs is mandatory. If you miss a lab, your grade for that lab will be a 0.
Pair programming: Labs will be completed in groups using "pair programming," a software development technique where programmers alternate between two roles: a driver who writes the code and an observer who reviews each line of code for correctness and strategizes ways to improve the code moving forward. For each lab, you will randomly be assigned to a partner and the role of either driver or observer. Halfway through the lab, you will switch roles with your partner.
Quizzes
There will be a 20-minute quiz at the beginning of each lab session. Each quiz will contain problems related to the material covered in the readings and during the lectures for a given unit of material. If you miss a quiz, your grade will be a zero for that quiz. Your lowest two quiz grades will be dropped.
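To see how dropping the two lowest quiz grades works, here is a small sketch in R. The quiz scores are made up, and the sketch assumes that the remaining quizzes are averaged with equal weight, which is not stated above.

```r
# Hypothetical quiz scores out of 100; the 0 represents a missed quiz.
quiz_scores <- c(70, 85, 0, 90, 60, 95, 80)

# Sort in increasing order and drop the two lowest scores.
kept <- sort(quiz_scores)[-(1:2)]

# Equal-weight average of the remaining quizzes (an assumption).
quiz_average <- mean(kept)
quiz_average  # 84, compared with mean(quiz_scores) of about 68.6
```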
Final Project
There will be a month-long final project at the end of the course. You will be assigned to a small group or, if you prefer, may work alone. You will select a project topic from a list that I will provide at the beginning of April. You will write code, document it, write a report, and prepare a presentation on your work to give during the final exam period for this class.
Programming Journal
Learning to program, like learning any new skill, can be frustrating. The computer will always do what you tell it to do, which is not always what you want it to do, and determining how to bridge that gap is simultaneously the challenge and the reward of programming. Over the course of the semester, you will keep a programming journal, a document where you record your thoughts about your journey into programming, the struggles you have faced, and the challenges you have overcome. You will also use the programming journal to document common errors that you run into and how to solve them.
You should maintain your programming journal regularly, as you would a personal journal. Updates to your programming journal will be due on the first lecture following each lab.
Class Participation
I expect you to be fully engaged during each class, to ask questions when confused, and to attempt to answer questions when called on. During class, attempting to answer a question is more important than getting the correct answer.
Attendance
Required. If you expect to miss 2-3 sessions of the course, you should take the course during another semester.
Textbook
The required textbooks are:
- Norman Matloff. The Art of R Programming: A Tour of Statistical Software Design (No Starch Press, 2011). Abbreviated TARP in the schedule.
- J.D. Long and Paul Teetor. R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics, 2nd edition (O'Reilly Media, 2019). Abbreviated RC in the schedule.
Collaboration, Cheating, and Plagiarism
All submitted work should be your own. You are welcome and encouraged to use the Internet and consult with others while working on an assignment. However, whenever you have had assistance with a problem, you must state so at the beginning of the problem solution. Unless this mechanism is abused, there will be no reduction in credit for using and reporting such assistance. This policy applies to both individual and group work. In group work, you only need to acknowledge help from outside the group.
Statement on Special Accommodations
Students with disabilities who need special accommodations for this class are encouraged to meet with me or the appropriate disability service provider on campus as soon as possible. In order to receive accommodations, students must be registered with the appropriate disability service provider on campus as set forth in the student handbook and must follow the University procedure for self-disclosure, which is stated in the University Guide to Services and Accommodations for Students with Disabilities. Students will not be afforded any special accommodations for academic work completed prior to the disclosure of the disability, nor will they be afforded any special accommodations prior to the completion of the documentation process with the appropriate disability office.
R
We will use R regularly throughout the semester.
You can access R from any web-accessible computer using RStudio Cloud. You will need to create an account on RStudio Cloud from their Registration page. I will send out a link via email for you to join a Space on RStudio Cloud for this course. Resources for homeworks, labs, etc., will be hosted on RStudio Cloud for easy access.
You should also install R on your personal computer, if you have one. You can install R by following the instructions for Windows here, for macOS here, or for Linux here. You will also want to install RStudio, an Integrated Development Environment for R, which you can find here.
Schedule
Subject to revision. Assignments and solutions will all be linked here, as they are available.
- January 23, Lecture 1:
- Topics: Intro to course. Statistical Computing and Computational Statistics. R. Basic data types and data structures.
- Sections: TARP: 1.2, 1.4, 1.7; RC: 1.4, 1.8, 2.1, 2.2, 16.1, 16.2, 16.3, 16.4, 16.6
- Learning Objectives
- January 27, Lecture 2:
- Topics: Data structures: vectors, arrays, lists, and data frames.
- Sections: TARP: 2.1, 2.2, 2.4, 2.8, 3.1, 3.2, 3.5, 4.1, 4.2, 4.3, 5.1, 5.2, 5.3
- Learning Objectives
- January 30, Lecture 3:
- Topics: Lab 1.
- Study Guide for Quiz 1
- Lab and Homework Formatting Instructions: Rmd File, HTML
- Programming Journal Template: Rmd file
- February 3, Lecture 4:
- Topics: Conditional statements. Iteration. Vectorization to avoid iteration. apply.
- Sections: TARP: 7.1, 2.6; RC: 15.1, 15.2, 6.3, 6.4
- Learning Objectives
- February 6, Lecture 5:
- Topics: Text. Characters and strings. Substrings. Splitting strings. Building strings. Counting strings.
- Sections: TARP: 11.1; RC: 7.1, 7.2, 7.3, 7.4, 7.6
- Learning Objectives
- February 10, Lecture 6:
- Topics: Lab 2.
- Study Guide for Quiz 2
- February 13, Lecture 7:
- Topics: Lab 2 (cont'd).
- February 17, Lecture 8:
- Topics: Functions. Writing functions.
- Sections: TARP: 1.3, 7.3, 7.4, 7.5, 7.13
- Learning Objectives
- gmp.dat file
- Demo of Steepest Descent
- February 20, Lecture 9:
- Topics: More with functions. Families of functions for acting on the same object. Sub-functions for breaking up a large task. Sourcing functions from a file.
- Sections: Handout
- Learning Objectives
- February 24, Lecture 10:
- Topics: Plotting in R. Base R graphics: plot, points, lines. "Decorating" Base R graphics. ggformula graphics: gf_* functions.
- Sections: TARP: 12.1, 12.2, 12.3
- Learning Objectives
- February 27, Lecture 11:
- Topics: Lab 3.
- Study Guide for Quiz 3
- March 2, Lecture 12:
- Topics: Using randomness for fun and profit. Randomness in nature. Pseudo-random number generators. Why generate pseudo-random numbers? They're the (H-)bomb. Basic R functions for parametric distributions.
- Sections: TARP: 8.2, 8.6; RC: 8.3, 8.4, 8.5
- Learning Objectives
- March 5, Lecture 13:
- Topics: Summarizing data. The empirical distribution function. Density estimation. Numerical summaries. Sampling distributions of test statistics via simulation.
- Sections: Handout
- Learning Objectives
- Demo of Kernel Density Estimation
- March 9, Lecture 14:
- Topics: Lab 4.
- Study Guide for Quiz 4
- March 12, Lecture 15:
- Topics: Debugging and testing code. Interpreting an error message from R. Debugging with cat and print. Debugging with browser.
- Sections: TARP: 13.1, 13.2, 13.3, 13.4
- Learning Objectives
- March 23, Lecture 16:
- Topics: Adding break points to R scripts. When in doubt, internet: how to read a Stack Overflow (or similar) page. Version control and git. GitHub. Interfacing GitHub with RStudio Cloud.
- Sections: Handout
- Learning Objectives
- Walkthrough of Interfacing RStudio Cloud and GitHub
- March 26, Lecture 17:
- Topics: Lab 5.
- Study Guide for Quiz 5
- Setting up a GitHub Repo for Collaborating on Lab 5
- March 30, Lecture 18:
- Topics: Numerical methods in R. Root finding. Optimization.
- Sections: RC: 8.1
- Learning Objectives
- Demo of Successive Parabolic Interpolation
- April 2, Lecture 19:
- Topics: Numerical methods in R. Numerical quadrature (i.e. integration). Symbolic differentiation.
- Sections: Handout
- Learning Objectives
- April 6, Lecture 20:
- Topics: Simulation. Simulation to approximate sampling distributions. Simulation to assess the properties of a confidence interval. Fisher's confidence interval for a population correlation.
- Sections: Handout
- Learning Objectives
- April 9, Lecture 21:
- Topics: Lab 6.
- Study Guide for Quiz 6
- April 13, Lecture 22:
- Topics: More simulation. Simulation to assess the properties of a hypothesis test. The one-sample sign test for medians. Simulating to assess the properties of a one-sample sign test.
- Sections: Handout
- Learning Objectives
- April 16, Lecture 23:
- Topics: Bootstrapping as a special type of simulation. The parametric bootstrap. The nonparametric bootstrap.
- Sections: RC: 13.8
- Learning Objectives
- April 20, Lecture 24:
- Topics: Lab 7.
- Study Guide for Quiz 7
- Demo of Principal Component Analysis (PCA) for Trivariate Data
- April 23, Lecture 25:
- Topics: No Lecture. Work on final projects.
- Sections:
- April 27, Lecture 26:
- Topics: No Lecture. Work on final projects.
- Sections:
- May 4, Final Exam:
- Time: 2:40 PM - 5:30 PM
- Location: Howard Hall 307 (HH 307) Remote