Lab 1: Introduction to R Markdown and Exploratory Data Analysis Using R

Due Date:

Complete this assignment by carefully reading and following the directions.

This assignment is due by the beginning of Lecture 5. See the submission details at the end of this lab.

Introduction

In this lab, you will explore a (fake!) data set of the attributes of students in a MA 151 course. In the process, you will learn how to:

Organize your work in R.
Create and use an R Markdown file to store your work.
Load and inspect a data frame.
Compute summaries of categorical and quantitative data.

Setting Up a Directory for MA 220

It is a good idea to save all of the files you will create in R in an organized fashion. You may already have an organizational system for your files from previous courses. If so, feel free to use that system, since you are already familiar with it.

If you do not have an organizational system for your files, I recommend using the Documents folder on your computer as home base, and then creating a folder for MA 220, sub-folders for homeworks and labs, etc. So your file system should look like:

Documents
  School
    sp21
      ma220
        hws
          hw1
        labs
          lab1
        lectures
          lecture1
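
If you would rather build these folders from R than by hand, a minimal sketch is shown below. It assumes the folder names from the tree above and, to stay harmless, builds them inside R's temporary directory. The names base.dir and sub.dirs are made up for this illustration. Replace tempdir() with the path to your own Documents folder to create the folders for real.

```r
# Hypothetical starting point: a throwaway temporary folder.
# Swap tempdir() for the path to your Documents folder to use this for real.
base.dir <- file.path(tempdir(), "School", "sp21", "ma220")

# One entry per lowest-level folder in the tree above.
sub.dirs <- c(file.path("hws", "hw1"),
              file.path("labs", "lab1"),
              file.path("lectures", "lecture1"))

# recursive = TRUE also creates any missing parent folders.
for (d in sub.dirs) {
  dir.create(file.path(base.dir, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to base.dir.
list.dirs(base.dir, full.names = FALSE)
```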

R can be finicky in how it reads in files. Windows, macOS, and Linux allow you to do some crazy things with file names, but you should aim to keep your file names as vanilla as possible.

As a rule:

Avoid spaces in file names. R may get confused, since spaces are also used to separate commands.
Avoid special characters (-, +, _, %, etc.) in file names. Again, R may get confused.
Put all of the files for a given assignment in the same directory. This should include the .Rmd file, the .csv file, any additional notes you might write, etc.

Creating a New R Notebook

You will need to create an R Notebook to store your work. R Notebooks are plain text files containing R code that RStudio can interpret, execute, and use to embed graphics, tables, etc.

To create a new R Notebook file, select

File > New File... > R Notebook ...

from the File menu at the top of your screen. An R Notebook will now appear in the top left panel of your screen.

The R Notebook will be pre-populated with text. You should delete everything below the second line of three hyphens (---).

You should save the file using

File > Save

from the File menu, naming it lab1.Rmd. Save lab1.Rmd in the lab1 folder you created in the previous section.

Downloading the Data

You can find the data file SynthCourseStats.csv at this link.

The name SynthCourseStats.csv ends in the suffix .csv, which stands for “comma-separated values.” This is similar to the .xls, .doc, etc., suffixes you may be familiar with from Microsoft Excel, Microsoft Word, etc. It tells your computer that this type of file contains plain text where values are separated by commas.

You should move SynthCourseStats.csv to the lab1 folder you have already created.

Writing Code Chunks in R Markdown

In R Markdown, we can mix plain text with R code. We need to tell R Markdown how to recognize when a chunk of text should be interpreted as plain text (something a person can read) versus R code (something the computer can read). R Markdown does this using “code chunks,” which are separated from the main text by a block like this:

```{r}

```

The ` symbol is the grave accent (also called a backtick) and can be found at the top left of the keyboard, just below the Escape key. So any time you see a block of code like this:

# Block of code

you should put it inside of a code chunk in R Markdown like this:

```{r}
# Block of code
```

Try this now by creating a code chunk that contains the code

2 + 2

You can run a code chunk by clicking the green play button at the top-right corner of the code chunk. Try this now.
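
For reference, here is a sketch of what that chunk does when it runs. The name result is introduced only for this illustration. The chunk in your notebook needs nothing more than 2 + 2.

```r
# R evaluates the expression and prints the value beneath the chunk.
2 + 2

# Storing the value in a (hypothetical) variable lets you reuse it later.
result <- 2 + 2
result
```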

Loading Data

We will begin our data analysis by loading the data file into RStudio. You can do this with the read.csv command:

course.stats <- read.csv('SynthCourseStats.csv')

Again, you should put this in its own code chunk in your R Notebook.

You can then run this code chunk by clicking on the green play button on the right of the code chunk.

Assuming you are in the correct working directory, RStudio will now load in the SynthCourseStats.csv file and store it in the course.stats data frame. If you get an error, it’s most likely because R is not looking in the correct directory. If this happens, select

Session > Set Working Directory > To Source File Location

and the working directory (where R looks for files) will now be set to the correct folder. Rerun the code chunk and the data frame should load. If you still cannot get the data frame to load, raise your hand and I will troubleshoot the issue with you.
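
If the file still will not load, it can help to ask R where it is looking. The sketch below uses only base R functions. The name data.file is made up for this example.

```r
# Show the folder R is currently using as its working directory.
getwd()

# Hypothetical helper variable holding the file name used in this lab.
data.file <- "SynthCourseStats.csv"

# file.exists() is TRUE only if the file is in the working directory.
if (file.exists(data.file)) {
  course.stats <- read.csv(data.file)
} else {
  message("Cannot find ", data.file, " in ", getwd())
}
```

If the message appears, the .csv file and the working directory do not match, and setting the working directory as described above is the likely fix.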

Inspecting a Data Frame

We can inspect the first few rows of the course.stats data frame using the head command in R. “Head” here means the top; there is a corresponding command tail which will print the last few rows of a data frame.

head(course.stats)

##   student.number      ID major midterm.1 midterm.2 midterm.3 final.exam
## 1              1 1114716    HE      65.3      94.3      78.6       83.8
## 2              2 1260907    CE      75.2      99.8      91.8       96.8
## 3              3 1102104    BY      84.2      76.3      64.8       73.6
## 4              4 1144623    HE      93.6      69.6      85.9       75.4
## 5              5 1179867    BY      79.0      85.7      83.3       94.3
## 6              6 1148060    HE      83.9      78.6      90.6       73.5
##   course.score course.grade
## 1       80.500            B
## 2       90.900            A
## 3       74.725            C
## 4       81.125            B
## 5       85.575            B
## 6       81.650            B

Again, you should put the above code in a code chunk in your R Notebook and click the green play button to run it. From here on out, I will not keep mentioning this. But you need to remember to do it!
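
A few other base R functions are handy for inspecting a data frame. The sketch below re-enters the six rows printed by head above, so it runs without the .csv file. The name first.six is made up for this illustration, and in your own notebook you would call these functions on course.stats instead.

```r
# The six rows shown by head(course.stats), typed in by hand.
first.six <- data.frame(
  student.number = 1:6,
  ID             = c(1114716, 1260907, 1102104, 1144623, 1179867, 1148060),
  major          = c("HE", "CE", "BY", "HE", "BY", "HE"),
  midterm.1      = c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9),
  midterm.2      = c(94.3, 99.8, 76.3, 69.6, 85.7, 78.6),
  midterm.3      = c(78.6, 91.8, 64.8, 85.9, 83.3, 90.6),
  final.exam     = c(83.8, 96.8, 73.6, 75.4, 94.3, 73.5),
  course.score   = c(80.500, 90.900, 74.725, 81.125, 85.575, 81.650),
  course.grade   = c("B", "A", "C", "B", "B", "B")
)

tail(first.six, 2)  # the last two rows
dim(first.six)      # number of rows, then number of columns
names(first.six)    # the column names
str(first.six)      # the type of each column
```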

Installing and Loading `mosaic`

We will use the mosaic package to help analyze and visualize this data.

Before using mosaic, we need to (a) install it and (b) load it.

To install mosaic, run the following command in a code chunk.

Note: You may receive prompts in the Console after running this command. If so, respond to the prompts in the Console to complete the installation process.

install.packages('mosaic')

Once you have successfully installed mosaic, you can comment out the command (so it will not run again) by adding a pound symbol at the start of the line, as below:

#install.packages('mosaic')

To load a package into R, you use the library command. So to load the mosaic package into R, you should create and run the code chunk:

library(mosaic)

Now all of the functionality of mosaic is available to you.
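
If you are not sure whether mosaic is already installed, one option is to check before loading it. This is only a sketch: requireNamespace is a base R function, and mosaic.installed is a name made up for this example.

```r
# TRUE if the mosaic package is installed on this computer, FALSE otherwise.
mosaic.installed <- requireNamespace("mosaic", quietly = TRUE)

if (mosaic.installed) {
  library(mosaic)
} else {
  message("mosaic is not installed yet. Run install.packages('mosaic') once, ",
          "then rerun this chunk.")
}
```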

Using `tally` to Aggregate Data

Let’s use some of mosaic’s functionality to explore the course.stats data frame. We will use the tally function to summarize the distribution of majors in this MA 151 class.

Remember that the “grammar” of functions in mosaic follows the structure

[Figure: The grammar of mosaic]
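
In mosaic's own documentation, this structure is usually written as the template goal(y ~ x, data = mydata), where goal is the function to run, y ~ x is a formula naming the variables, and mydata is the data frame. The sketch below illustrates the formula part using base R only. The data frame toy is made up and is not the lab data.

```r
# Template, shown as a comment because goal, y, x, and mydata are placeholders:
#   goal(y ~ x, data = mydata)

# Made-up data frame for illustration only.
toy <- data.frame(major = c("HE", "CE", "HE"))

# A one-sided formula: nothing to the left of ~, one variable to the right.
f <- ~ major
class(f)     # "formula"
all.vars(f)  # "major"

# Base R's xtabs() uses the same formula-plus-data pattern to count rows.
xtabs(f, data = toy)
```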

Since we want to tally the students by their major, and the data is stored in the course.stats data frame, the command we want to use is

tally(~ major, data = course.stats)

## major
## BY CE HE 
##  8  5 14

We can also pass additional arguments to an R function, to change how it presents its output. So, for example, to output the tally in terms of the proportion of students by major, we can pass ‘proportion’ to the format argument:

tally(~ major, data = course.stats, format = 'proportion')

## major
##        BY        CE        HE 
## 0.2962963 0.1851852 0.5185185

To output the tally in terms of the percentage of students, we pass ‘percent’ instead:

tally(~ major, data = course.stats, format = 'percent')

## major
##       BY       CE       HE 
## 29.62963 18.51852 51.85185
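
The proportions and percentages are just the counts divided by the class size. Here is a small base R check of that arithmetic, using the counts printed by the first tally call. The names major.counts, proportions, and percents are made up for this example.

```r
# Counts copied from the output of tally(~ major, data = course.stats).
major.counts <- c(BY = 8, CE = 5, HE = 14)

# Total number of students in the class.
sum(major.counts)

# Proportion = count / total; percent = 100 * proportion.
proportions <- major.counts / sum(major.counts)
percents <- 100 * proportions

round(proportions, 7)
round(percents, 5)
```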

Generating Histograms from Data

One of the things students and professors care about is the distribution of scores on any given exam in a course. We can visualize the distribution of the Midterm 1 scores in the class using a histogram:

gf_histogram(~ midterm.1, data = course.stats)

Notice that this is exactly the same grammar as we used for the tally command, but now our goal is a histogram, so we use gf_histogram, and we want to know about the Midterm 1 grade, so we use the midterm.1 column in the data frame.

Suppose the professor used a standard ten point scale to assign letter grades to the exam scores. The bins that gf_histogram used by default make it difficult to determine the number of students who earned an A versus a B versus a C, etc., on the exam. We can use the binwidth and boundary arguments of gf_histogram to set the bins to have a width of 10, and to start the bins at 0:

gf_histogram(~ midterm.1, data = course.stats,
             binwidth = 10, boundary = 0)
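
To see which bin each score lands in without drawing a plot, here is a base R sketch using only the six Midterm 1 scores printed by head earlier (not the whole class). The names scores, bins, and bin.counts are made up for this example. Note that cut() includes the right endpoint of each bin by default. None of these six scores falls exactly on a boundary, so that choice does not matter here.

```r
# Midterm 1 scores of the first six students, copied from the head() output.
scores <- c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9)

# Bins of width 10 starting at 0: (0,10], (10,20], ..., (90,100].
bins <- cut(scores, breaks = seq(0, 100, by = 10))

# Count the scores in each bin and show only the non-empty bins.
bin.counts <- table(bins)
bin.counts[bin.counts > 0]
```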

We can also add a rug plot to the histogram, to display the actual scores of the students, using the gf_rugx command. To add an element to a plot, you use the %>% operator:

gf_histogram(~ midterm.1, data = course.stats,
             binwidth = 10, boundary = 0) %>% 
  gf_rugx(~ midterm.1)

Finally, we can break down the distribution of Midterm 1 scores by major using the | (vertical bar or “pipe”) character. You will learn later in this course that, in writing mathematical expressions involving probabilities, the vertical bar denotes “given,” so “midterm.1 | major” can be read as “break down the Midterm 1 scores given that we know the major of a particular student”:

gf_histogram(~ midterm.1 | major, data = course.stats)

Generating Density Plots from Data

Histograms are easy to understand, but suffer from the fact that the choice of bin width can drastically change the appearance of a histogram. A density plot is a “smoothed” histogram that avoids this problem. We can create a density plot for the Midterm 1 grades and add a rug plot using:

gf_dens(~ midterm.1, data = course.stats) %>% gf_rugx(~ midterm.1)

Again, notice how this uses the same grammar as tally and gf_histogram.
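
To see what “smoothed” means numerically, base R's density() function computes a kernel density estimate, which is the same kind of curve a density plot draws. The sketch below uses only the six Midterm 1 scores printed by head earlier. The names scores, dens, and area are made up for this example.

```r
# Midterm 1 scores of the first six students, copied from the head() output.
scores <- c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9)

# Kernel density estimate: a grid of x values and the curve height at each.
dens <- density(scores)

# The bandwidth plays the role that bin width plays for a histogram.
dens$bw

# Approximate the area under the curve with the trapezoid rule.
# For a density, this should be close to 1.
area <- sum(diff(dens$x) * (head(dens$y, -1) + tail(dens$y, -1)) / 2)
area
```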

Finally, we can break up the distributions by major as before:

gf_dens(~ midterm.1 | major, data = course.stats) %>% gf_rugx(~ midterm.1)

If we want to show all of the densities overlaid on the same plot, we can use the color argument of gf_dens:

gf_dens(~ midterm.1, color = ~ major, data = course.stats)

Including Additional Information Using `tally`

Let’s return to tally, and see what happens when we use the vertical bar (|) to break down the course.grade variable (the letter grade for each student in the course) by major:

tally(~ course.grade | major, data = course.stats)

##             major
## course.grade BY CE HE
##            A  0  2  0
##            B  5  1  7
##            C  3  2  7

tally is telling us that 5 biology majors got Bs, 3 biology majors got Cs, 1 chemistry major got a B, etc. (Reminder: these data are fake.)

We can also use the format argument to show the distribution of grades within each major as percentages:

tally(~ course.grade | major, data = course.stats, format = 'percent')

##             major
## course.grade   BY   CE   HE
##            A  0.0 40.0  0.0
##            B 62.5 20.0 50.0
##            C 37.5 40.0 50.0

So we see that amongst biology majors, 62.5% got Bs, and amongst health majors, 50% got Bs, etc.
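
Each percentage is a count divided by the total for its column (its major). A base R sketch that reproduces the percent table from the count table printed above follows. The names grade.counts and grade.percents are made up for this example.

```r
# Counts copied from tally(~ course.grade | major, data = course.stats).
# matrix() fills column by column: BY first, then CE, then HE.
grade.counts <- matrix(
  c(0, 5, 3,
    2, 1, 2,
    0, 7, 7),
  nrow = 3,
  dimnames = list(course.grade = c("A", "B", "C"),
                  major = c("BY", "CE", "HE"))
)

# Column totals are the number of students in each major.
colSums(grade.counts)

# margin = 2 divides each entry by its column total.
grade.percents <- round(100 * prop.table(grade.counts, margin = 2), 1)
grade.percents
```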

Knitting an R Notebook

Now that you’ve completed the lab, you can “knit” the R Notebook together into an HTML file using the R package knitr. To do this, select

File > Knit Document

from the File menu.

This will create a file called lab1.nb.html that contains all of the code chunks you wrote and the tables and plots they generated.

You can add your name to the top of the HTML file by including an author: field at the top of the R Notebook:
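
As a sketch, the header at the top of lab1.Rmd might then look like the following. The title and author values here are placeholders, and the output line is the one RStudio normally fills in for a new R Notebook. Keep whatever your file already has and add only the author line.

```yaml
---
title: "Lab 1"
author: "Your Name"
output: html_notebook
---
```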

Submitting on eCampus

You should submit both your lab1.Rmd and lab1.nb.html files to eCampus, under the Lab 1 assignment. This submission is due by the beginning of Lecture 5.