David Darmon

MA 350-01, Computation and Statistics

Spring 2020

Monday and Thursday, 1:15 PM - 2:35 PM, Howard Hall 307

The modern approach to statistics can broadly be broken into two distinct tasks: algorithms for computing statistical quantities (point estimates, confidence intervals, \(P\)-values, etc.) from data, and inferences to be made from those quantities based on statistical models of how the data were generated. Until now in your statistical education, the focus has been on the inferential aspect of statistics. This class is a first course on the algorithmic side of statistics. You will learn the core concepts of programming (functions, objects, data structures, flow control, debugging, logical design, and abstraction) by writing code to implement statistical algorithms. You will also learn how to use stochastic simulation to evaluate a statistical algorithm in terms of its inferential properties. Programming and programming languages are inherently social, so you will also learn the customs and conventions of the tribe of programmers who use R, as well as best practices for organizing and commenting your code.
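To give a flavor of what "evaluating a statistical algorithm by stochastic simulation" can look like, here is a minimal sketch in R. It estimates how often a nominal 95% t-interval covers the true mean. The population values, sample size, number of replications, and seed are all assumptions chosen only for illustration; this is not taken from the course materials.

```r
# Estimate, by simulation, how often a nominal 95% t-interval for a mean
# actually contains the true mean when the data are Normal.
set.seed(350)    # arbitrary seed, used only to make the run reproducible

mu    <- 10      # true population mean (hypothetical)
sigma <- 2       # true population standard deviation (hypothetical)
n     <- 25      # sample size of each simulated data set
B     <- 5000    # number of simulated data sets

covers <- replicate(B, {
  x  <- rnorm(n, mean = mu, sd = sigma)          # simulate one data set
  ci <- t.test(x, conf.level = 0.95)$conf.int    # compute its interval
  ci[1] <= mu && mu <= ci[2]                     # did the interval cover mu?
})

coverage <- mean(covers)   # proportion of intervals that contained mu
coverage
```

If the interval procedure behaves as advertised, `coverage` should land close to 0.95, up to simulation error.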

Prerequisites

MA 116 or MA 118 or MA 126, passed with a grade of C- or higher, and MA 151 or MA 220 or BE 251, passed with a grade of C- or higher.

Professor

Dr. David Darmon ddarmon [at] monmouth.edu
Howard Hall 241

Topics, Notes, Readings

This is currently a tentative listing of topics, in order.

Introduction to R: The R console. Using R as a fancy calculator. The RStudio integrated development environment. R Markdown for literate programming.
Objects, the nouns of programming: Objects and data types. Variables. Floats and integers. Booleans. Characters and strings.
Data structures: Vectors for one-dimensional collections of the same data type. Arrays for multi-dimensional collections of the same data type. Lists for one-dimensional collections of different data types. Data frames for multi-dimensional collections of different data types.
Flow control: If-Else and more complex conditional statements. Iteration. For loops and while loops. Vectorization to avoid iteration. *apply for vectorization with multi-dimensional objects.
Operating on text: More on the character data type. Extracting substrings of a string. Splitting strings at delimiters. Building strings from a pattern. Counting strings.
Functions, the verbs of programming: Using functions to collect together related commands. Functions as input-output devices. Arguments (inputs) and return values (outputs). Named arguments and defaults. Scope of variables in a function.
Plotting in R: Use the eyes that Nature gave you! Base R graphics. ggplot and ggformula. Exploratory data analysis using mosaic.
Stochastic simulation: Random numbers. Sources of randomness in nature. Pseudo-random numbers, and how to generate them. Pseudo-random number generators in R. The p*, d*, q*, and r* functions in R.
Summarizing data: Using the empirical distribution function to summarize a univariate data set. Density estimation as a smoothed summary of a data set. Numerical summaries: mean, median, standard deviation, range, and much, much more. Assessing the statistical properties of numerical summaries via stochastic simulation.
Debugging and testing code: What is R trying to tell you with all that red text? Let me Google that for you: using the internet wisely for help with programming. Debugging by brute-print. Debugging intelligently using browser in RStudio. Using assertions to catch bugs. Using unit testing to keep code bug-free.
Version control: Always click save. Version control systems. git and GitHub. Committing changes. Pushing and pulling commits. Using GitHub to create your own R package.
Numerical methods: Root finding: determining where a function takes a value. Optimization: finding the peaks and valleys of a function. Quadrature: integrating under a curve. Symbolic differentiation: a computer is really good at Calculus I.
Statistical simulation: Using simulation when the math is too hard. Why simulation can perform complicated computations as if by magic. Simulation to assess the level and power of a hypothesis test, with an application to equivalence testing. Simulation to assess the coverage and expected width of a confidence interval, with an application to the population correlation coefficient.
Welcome to the tidyverse: Hadley Wickham and the tidyverse. dplyr and the Split/Apply/Combine pattern for analyzing a structured data set. Tidy data and tidyr.
See the end of this page for the current lecture schedule, subject to revision. Homework and additional resources will be linked there, as available.
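As a small preview of the d*, p*, q*, and r* family named in the topics above, here is a brief sketch using the standard Normal distribution. The particular inputs and the seed are arbitrary illustrative choices.

```r
# The four function families, shown for the standard Normal distribution.
density_at_zero <- dnorm(0)       # d*: density; equals 1 / sqrt(2 * pi)
prob_below      <- pnorm(1.96)    # p*: cumulative probability P(Z <= 1.96)
upper_quantile  <- qnorm(0.975)   # q*: quantile; the inverse of pnorm

set.seed(1)                       # arbitrary seed for reproducibility
draws <- rnorm(1000)              # r*: pseudo-random draws

# p* and q* undo each other, so this round trip returns 1.3.
round_trip <- qnorm(pnorm(1.3))
```

The same prefixes apply to other distributions in base R, for example `dbinom`, `pbinom`, `qbinom`, and `rbinom` for the binomial.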

Course Mechanics

Office Hours

I will have office hours at the following four times each week:

Monday, 10:00-11:00 AM, Howard Hall 241
Tuesday, 3:00-4:00 PM, Howard Hall 241
Thursday, 10:00-11:00 AM, Howard Hall 241
Thursday, 3:00-4:00 PM, Howard Hall 241

I have an open-door policy during those times: you can show up unannounced. If you cannot make the scheduled office hours, please e-mail me about making an appointment.

If you are struggling with the homework, having difficulty with the quizzes, or just want to chat, please visit me during my office hours. I am here to help.

Grading Policy

Your final grade will be determined by the following weighting scheme:
30% for homework problem sets
10% for labs
30% for quizzes
20% for final project
10% for programming journal
I will use the standard 10-point breakdown to assign letter grades to numerical grades, with pluses and minuses assigned by dividing the intervals into thirds.
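The weighting scheme above amounts to a weighted average of the five component scores. Here is a rough sketch in R. The component scores are made up for illustration, and the letter cutoffs reflect one common reading of a "10-point breakdown" (90 and above is an A, 80 to 89 is a B, and so on). Both are assumptions, and the plus/minus subdivision is left out.

```r
# Weights taken from the grading policy; they should sum to 1.
weights <- c(homework = 0.30, labs = 0.10, quizzes = 0.30,
             project = 0.20, journal = 0.10)
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Hypothetical component scores on a 0-100 scale.
scores <- c(homework = 88, labs = 100, quizzes = 76,
            project = 92, journal = 95)

# Weighted average, matching components by name rather than by position.
final_grade <- sum(weights * scores[names(weights)])
final_grade   # 87.1

# Base letter grade under the assumed cutoffs (no pluses or minuses).
base_letter <- as.character(cut(final_grade,
                                breaks = c(-Inf, 60, 70, 80, 90, Inf),
                                labels = c("F", "D", "C", "B", "A"),
                                right = FALSE))
base_letter   # "B"
```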

Homework

Homework will be assigned regularly, announced in class, and listed in the Schedule section of this page.

Each homework problem will be graded out of three points: one point for making an honest effort to complete the problem, one point for a technically correct solution, and one point for clean, well-formatted, documented, and easily readable code.

You will have the opportunity to revise your homework within a week after it is returned and resubmit the new solution to earn back partial credit.

Labs

There will be (approximately) weekly labs. The labs will involve brief exercises related to the latest homework assignment. Attending labs is mandatory. If you miss a lab, your grade for that lab will be a 0.

Pair programming: Labs will be completed in groups using "pair programming," a software development technique where programmers alternate between two roles: a driver who writes the code and an observer who reviews each line of code for correctness and strategizes ways to improve the code moving forward. For each lab, you will randomly be assigned to a partner and the role of either driver or observer. Halfway through the lab, you will switch roles with your partner.

Quizzes

There will be a 20-minute quiz at the beginning of each lab session. Each quiz will contain problems related to the material covered in the readings and during the lectures for a given unit of material. If you miss a quiz, your grade will be a zero for that quiz. Your lowest two quiz grades will be dropped.
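For concreteness, dropping the two lowest quiz grades can be sketched in R as follows. The scores are hypothetical, with a 0 standing in for a missed quiz.

```r
# Hypothetical quiz scores for one student; the 0 represents a missed quiz.
quiz_scores <- c(85, 0, 92, 70, 100, 64, 88)

# Sort from highest to lowest and keep all but the last two entries.
n_kept <- length(quiz_scores) - 2
kept   <- sort(quiz_scores, decreasing = TRUE)[seq_len(n_kept)]

quiz_average <- mean(kept)
quiz_average   # 87
```

Here the missed quiz (0) and the 64 are the two grades dropped, so the average is taken over the remaining five scores.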

Final Project

There will be a month-long final project at the end of the course. You will be assigned to small groups or, if you prefer, may work alone. You will select a project topic from a list that I will provide at the beginning of April. You will write code, document it, write a report, and prepare a presentation on your work to present during the final exam period for this class.

Programming Journal

Learning to program, like learning any new skill, can be frustrating. The computer will always do what you tell it to do, which is not always what you want it to do, and determining how to bridge that gap is simultaneously the challenge and the reward of programming. Over the course of the semester, you will keep a programming journal, a document where you record your thoughts about your journey into programming, the struggles you have faced, and the challenges you have overcome. You will also use the programming journal to document common errors that you run into and how to solve them.

You should maintain your programming journal regularly, as you would a personal journal. Updates to your programming journal will be due on the first lecture following each lab.

Class Participation

I expect you to be fully engaged during each class, to ask questions when confused, and to attempt to answer questions when called on. During class, attempting to answer a question is more important than getting the correct answer.

Attendance

Required. If you expect to miss 2-3 sessions of the course, you should take the course during another semester.

Textbook

The required textbooks are:

Collaboration, Cheating, and Plagiarism

All submitted work should be your own. You are welcome and encouraged to use the Internet and consult with others while working on an assignment. However, whenever you have had assistance with a problem, you must state so at the beginning of the problem solution. Unless this mechanism is abused, there will be no reduction in credit for using and reporting such assistance. This policy applies to both individual and group work. In group work, you only need to acknowledge help from outside the group.

Statement on Special Accommodations

Students with disabilities who need special accommodations for this class are encouraged to meet with me or the appropriate disability service provider on campus as soon as possible. In order to receive accommodations, students must be registered with the appropriate disability service provider on campus as set forth in the student handbook and must follow the University procedure for self-disclosure, which is stated in the University Guide to Services and Accommodations for Students with Disabilities. Students will not be afforded any special accommodations for academic work completed prior to the disclosure of the disability, nor will they be afforded any special accommodations prior to the completion of the documentation process with the appropriate disability office.

R

We will use R regularly throughout the semester.

You can access R from any web-accessible computer using RStudio Cloud. You will need to create an account on RStudio Cloud from their Registration page. I will send out a link via email for you to join a Space on RStudio Cloud for this course. Resources for homework, labs, etc., will be hosted on RStudio Cloud for easy access.

You should also install R on your personal computer, if you have one. You can install R by following the instructions for Windows here, for macOS here, or for Linux here. You will also want to install RStudio, an Integrated Development Environment for R, which you can find here.

Schedule

Subject to revision. Assignments and solutions will all be linked here, as they are available.
January 23, Lecture 1:
Topics: Intro to course. Statistical Computing and Computational Statistics. R. Basic data types and data structures.
Sections: TARP: 1.2, 1.4, 1.7 RC: 1.4, 1.8, 2.1, 2.2, 16.1, 16.2, 16.3, 16.4, 16.6
Learning Objectives
January 27, Lecture 2:
Topics: Data structures: vectors, arrays, lists, and data frames.
Sections: TARP: 2.1, 2.2, 2.4, 2.8, 3.1, 3.2, 3.5, 4.1, 4.2, 4.3, 5.1, 5.2, 5.3
Learning Objectives
January 30, Lecture 3:
Topics: Lab 1.
Study Guide for Quiz 1
Lab and Homework Formatting Instructions: Rmd File, HTML
Programming Journal Template: Rmd file
February 3, Lecture 4:
Topics: Conditional statements. Iteration. Vectorization to avoid iteration. apply.
Sections: TARP: 7.1, 2.6 RC: 15.1, 15.2, 6.3, 6.4
Learning Objectives
February 6, Lecture 5:
Topics: Text. Characters and strings. Substrings. Splitting strings. Building strings. Counting strings.
Sections: TARP: 11.1 RC: 7.1, 7.2, 7.3, 7.4, 7.6
Learning Objectives
February 10, Lecture 6:
Topics: Lab 2.
Study Guide for Quiz 2
February 13, Lecture 7:
Topics: Lab 2 (cont'd).
February 17, Lecture 8:
Topics: Functions. Writing functions.
Sections: TARP: 1.3, 7.3, 7.4, 7.5, 7.13
Learning Objectives
gmp.dat file
Demo of Steepest Descent
February 20, Lecture 9:
Topics: More with functions. Families of functions for acting on the same object. Sub-functions for breaking up a large task. Sourcing functions from a file.
Sections: Handout
Learning Objectives
February 24, Lecture 10:
Topics: Plotting in R. Base R graphics: plot, points, lines. "Decorating" Base R graphics. ggformula graphics: gf_* functions.
Sections: TARP: 12.1, 12.2, 12.3
Learning Objectives
February 27, Lecture 11:
Topics: Lab 3.
Study Guide for Quiz 3
March 2, Lecture 12:
Topics: Using randomness for fun and profit. Randomness in nature. Pseudo-random number generators. Why generate pseudo-random numbers? They're the (H-)bomb. Basic R functions for parametric distributions.
Sections: TARP: 8.2, 8.6 RC: 8.3, 8.4, 8.5
Learning Objectives
March 5, Lecture 13:
Topics: Summarizing data. The empirical distribution function. Density estimation. Numerical summaries. Sampling distributions of test statistics via simulation.
Sections: Handout
Learning Objectives
Demo of Kernel Density Estimation
March 9, Lecture 14:
Topics: Lab 4.
Study Guide for Quiz 4
March 12, Lecture 15:
Topics: Debugging and testing code. Interpreting an error message from R. Debugging with cat and print. Debugging with browser.
Sections: TARP: 13.1, 13.2, 13.3, 13.4
Learning Objectives
March 23, Lecture 16:
Topics: Adding break points to R scripts. When in doubt, internet: how to read a Stack Overflow (or similar) page. Version control and git. GitHub. Interfacing GitHub with RStudio Cloud.
Sections: Handout
Learning Objectives
Walkthrough of Interfacing RStudio Cloud and GitHub
March 26, Lecture 17:
Topics: Lab 5.
Study Guide for Quiz 5
Setting up a GitHub Repo for Collaborating on Lab 5
March 30, Lecture 18:
Topics: Numerical methods in R. Root finding. Optimization.
Sections: RC: 8.1
Learning Objectives
Demo of Successive Parabolic Interpolation
April 2, Lecture 19:
Topics: Numerical methods in R. Numerical quadrature (i.e. integration). Symbolic differentiation.
Sections: Handout
Learning Objectives
April 6, Lecture 20:
Topics: Simulation. Simulation to approximate sampling distributions. Simulation to assess the properties of a confidence interval. Fisher's confidence interval for a population correlation.
Sections: Handout
Learning Objectives
April 9, Lecture 21:
Topics: Lab 6.
Study Guide for Quiz 6
April 13, Lecture 22:
Topics: More simulation. Simulation to assess the properties of a hypothesis test. The one-sample sign test for medians. Simulating to assess the properties of a one-sample sign test.
Sections: Handout
Learning Objectives
April 16, Lecture 23:
Topics: Bootstrapping as a special type of simulation. The parametric bootstrap. The nonparametric bootstrap.
Sections: RC: 13.8
Learning Objectives
April 20, Lecture 24:
Topics: Lab 7.
Study Guide for Quiz 7
Demo of Principal Component Analysis (PCA) for Trivariate Data
April 23, Lecture 25:
Topics: No Lecture. Work on final projects.
Sections:
April 27, Lecture 26:
Topics: No Lecture. Work on final projects.
Sections:
May 4, Final Exam:
Time: 2:40 PM - 5:30 PM
Location: Remote (originally Howard Hall 307)