Lab 1: Introduction to R Markdown and Exploratory Data Analysis Using R

Due Date:

Complete this assignment by carefully reading and following the directions.

This assignment is due by the beginning of Lecture 4. See the submission details at the end of this lab.

Introduction

In this lab, you will explore a (fake!) data set of the attributes of students in a MA 151 course. In the process, you will learn how to:

Organize your work in RStudio.
Create and use an R Markdown file to store your work.
Load and inspect a data frame.
Compute summaries of categorical and quantitative data.

RStudio Cloud

I have created an RStudio Cloud Project that you can work in for this Lab. You can find RStudio Cloud here. You should have already created an account and joined the RStudio Cloud Space MA 220-02, Spring 2020.

You can find the MA 220-02 Space by clicking on the “Your Workspace” icon at the top of the page and then selecting “MA 220-02, Spring 2020” from the menu on the left:

You will then see a list of all the Projects hosted in the MA 220-02 Space. You want to select the lab1 Project:

Some Pointers About Naming Files for Use with R

R can be finicky in how it reads in files. Windows, macOS, and Linux allow you to do some crazy things with file names, but you should aim to keep your file names as vanilla as possible.

As a rule:

Avoid spaces in file names. R may get confused, since spaces are also used to separate commands.
Avoid special characters (-, +, _, %, etc.) in file names. Again, R may get confused.
Put all of the files for a given assignment in the same directory. This should include the .Rmd file, the .rda file, any additional notes you might write, etc.

Creating a New R Notebook

You will need to create an R Notebook to store your work. R Notebooks are plain text files containing R code that RStudio can interpret, execute, and use to embed graphics, tables, etc.

To create a new R Notebook file, select

File > New File... > R Notebook ...

from the File menu at the top of your screen. An R Notebook will now appear in the top left panel of your screen:

The R Notebook will be pre-populated with text. You should delete everything below the second set of triple-hyphens ---:

You should save the file using

File > Save

from the File menu, naming it lab1.Rmd. Save lab1.Rmd directly in the project folder in the RStudio Cloud Project.

Writing Code Chunks in R Markdown

In R Markdown, we can mix plain text with R code. We need to tell R Markdown how to recognize when a chunk of text should be interpreted as plain text (something a person can read) versus R code (something the computer can read). R Markdown does this using “code chunks,” which are separated from the main text by a block like this:

```{r}

```

The ` symbol is a grave accent (also called a backtick) and can be found at the top left of the keyboard, just below the Escape key. So any time you see a block of code like this:

# Block of code

you should put it inside of a code chunk in R Markdown like this:

```{r}
# Block of code
```

Try this now by creating a code chunk that contains the code

2 + 2

as below:
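The original screenshot is not reproduced here. As a minimal sketch, the contents of the chunk would be the single line below, placed between the chunk delimiters shown earlier:

```r
# Contents of the code chunk: a single arithmetic expression.
# Running the chunk prints [1] 4 underneath it.
2 + 2
```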

You can run a code chunk by clicking the green play button at the top-right corner of the code chunk. Try this now.

Loading Data

We will begin our data analysis by loading a data file into RStudio. The data is stored in SynthCourseStats.rda. The file has the suffix .rda, which stands for RData format. This is similar to the .xls, .doc, etc., suffixes you may be familiar with from Microsoft Excel, Microsoft Word, etc. It tells your computer that this type of file should be opened by a program that can read R files.

You load SynthCourseStats.rda into the current R session using the command:

load('SynthCourseStats.rda')

Again, you should put this in a code chunk, so your R Notebook should now look like this:

You can then run this code chunk by clicking on the green play button on the right of the code chunk.

Assuming you are in the correct working directory, RStudio will now load in the course.stats data frame. If you get an error, it’s most likely because R is not looking in the correct directory. If this happens, select

Session > Set Working Directory > To Source File Location

and the working directory (where R looks for files) will now be set to the correct folder. Rerun the code chunk and the data frame should load. If you still cannot get the data frame to load, raise your hand and I will troubleshoot the issue with you.
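Before changing the working directory, it can help to check where R is currently looking. The following is a minimal sketch using base R functions; the variable name data.file is introduced here for illustration and is not part of the lab.

```r
# Name of the data file used in this lab
data.file <- 'SynthCourseStats.rda'

# Print the directory R is currently looking in
getwd()

# List the files R can see in that directory
list.files()

# Load the file only if R can find it; otherwise report where R looked
if (file.exists(data.file)) {
  load(data.file)
} else {
  message('Could not find ', data.file, ' in ', getwd())
}
```

If SynthCourseStats.rda does not appear in the output of list.files(), the working directory is probably not the project folder.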

Inspecting a Data Frame

We can inspect the first few rows of the course.stats data frame using the head command in R. “Head” here means the top; there is a corresponding command tail which will print the last few rows of a data frame.

head(course.stats)

##   student.number      ID major midterm.1 midterm.2 midterm.3 final.exam
## 1              1 1114716    HE      65.3      94.3      78.6       83.8
## 2              2 1260907    CE      75.2      99.8      91.8       96.8
## 3              3 1102104    BY      84.2      76.3      64.8       73.6
## 4              4 1144623    HE      93.6      69.6      85.9       75.4
## 5              5 1179867    BY      79.0      85.7      83.3       94.3
## 6              6 1148060    HE      83.9      78.6      90.6       73.5
##   course.score course.grade
## 1       80.500            B
## 2       90.900            A
## 3       74.725            C
## 4       81.125            B
## 5       85.575            B
## 6       81.650            B

Again, you should put the above code in a code chunk in your R Notebook and click the green play button to run it. From here on out, I will not keep mentioning this. But you need to remember to do it!
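A few other base R commands are useful for inspecting a data frame. The sketch below assumes a stand-in data frame, course.stats.head, typed in by hand from the six rows printed above so that the chunk runs without SynthCourseStats.rda. The name is made up for illustration, and the text columns are stored as plain character strings here, which may differ from how the real file stores them.

```r
# Stand-in for course.stats: only the six rows printed by head() above
course.stats.head <- data.frame(
  student.number = 1:6,
  ID = c(1114716, 1260907, 1102104, 1144623, 1179867, 1148060),
  major = c('HE', 'CE', 'BY', 'HE', 'BY', 'HE'),
  midterm.1 = c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9),
  midterm.2 = c(94.3, 99.8, 76.3, 69.6, 85.7, 78.6),
  midterm.3 = c(78.6, 91.8, 64.8, 85.9, 83.3, 90.6),
  final.exam = c(83.8, 96.8, 73.6, 75.4, 94.3, 73.5),
  course.score = c(80.500, 90.900, 74.725, 81.125, 85.575, 81.650),
  course.grade = c('B', 'A', 'C', 'B', 'B', 'B')
)

tail(course.stats.head, n = 2)  # the last two rows
dim(course.stats.head)          # number of rows, then number of columns
names(course.stats.head)        # the column names
str(course.stats.head)          # the type of each column
```

With the real data loaded, the same commands can be run on course.stats directly.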

Loading `mosaic`

To load a package into R, you use the library command. So to load the mosaic package into R, you should create and run the code chunk:

library(mosaic)

Now all of the functionality of mosaic is available to you.

Using `tally` to Aggregate Data

Let’s use some of mosaic’s functionality to explore the course.stats data frame. We will use the tally function to summarize the distribution of majors in this MA 151 class.

The “grammar” of functions in mosaic follows the structure

The grammar of mosaic
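The figure is not reproduced here. The mosaic documentation usually writes this template as goal(y ~ x, data = mydata), where goal is the function that computes what we want, y and x are variables, and mydata is the data frame that holds them. For a single variable, the left-hand side is dropped, giving goal(~ x, data = mydata). The sketch below assumes mosaic is installed and uses a small stand-in data frame named mydata, containing only the majors of the six students printed by head earlier, to show the one-variable form:

```r
library(mosaic)

# Stand-in data frame for illustration only (not the full lab data)
mydata <- data.frame(major = c('HE', 'CE', 'BY', 'HE', 'BY', 'HE'))

# Template: goal(~ x, data = mydata)
#   goal   -> tally   (what we want R to compute)
#   x      -> major   (the variable of interest)
#   mydata -> mydata  (the data frame that holds the variable)
major.counts <- tally(~ major, data = mydata)
major.counts
```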

Since we want to tally the students by their major, and the data is stored in the course.stats data frame, the command we want to use is

tally(~ major, data = course.stats)

## major
## BY CE HE 
##  8  5 14

We can also pass additional arguments to an R function, to change how it presents its output. So, for example, to output the tally in terms of the proportion of students by major, we can pass 'proportion' to the format argument:

tally(~ major, data = course.stats, format = 'proportion')

## major
##        BY        CE        HE 
## 0.2962963 0.1851852 0.5185185

To output the tally in terms of the percentage of students, we pass 'percent' to the format argument:

tally(~ major, data = course.stats, format = 'percent')

## major
##       BY       CE       HE 
## 29.62963 18.51852 51.85185
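The proportions and percentages above can be checked by hand from the counts. A minimal sketch in base R, with the counts copied from the first tally output (the variable names are introduced here for illustration):

```r
# Counts by major, copied from the tally output above
major.counts <- c(BY = 8, CE = 5, HE = 14)

# Total number of students in the class
n.students <- sum(major.counts)

# Proportion = count / total; percentage = 100 * proportion
major.props <- major.counts / n.students
round(major.props, 7)
round(100 * major.props, 5)
```

For example, 8 of the 27 students are biology majors, and 8 / 27 is about 0.296, or 29.6%.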

Generating Histograms from Data

One of the things students and professors care about is the distribution of scores on any given exam in a course. We can visualize the distribution of the Midterm 1 scores in the class using a histogram:

gf_histogram(~ midterm.1, data = course.stats)

Notice that this is exactly the same grammar as we used for the tally command, but now our goal is a histogram, so we use gf_histogram, and we want to know about the Midterm 1 grade, so we use the midterm.1 column in the data frame.

Suppose the professor used a standard ten point scale to assign letter grades to the exam scores. The bins that gf_histogram used by default make it difficult to determine the number of students who earned an A versus a B versus a C, etc., on the exam. We can use the binwidth and boundary arguments of gf_histogram to set the bins to have a width of 10, and to start the bins at 0:

gf_histogram(~ midterm.1, data = course.stats,
             binwidth = 10, boundary = 0)
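To see which scores land in which bin, the same bins can be built by hand in base R. This is only a sketch: it uses the six Midterm 1 scores printed by head earlier rather than the full class, the variable names are made up for illustration, and it assumes each bin includes its right endpoint (the default for cut).

```r
# Midterm 1 scores of the six students shown by head(course.stats)
midterm.1.scores <- c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9)

# Bin edges: width 10, starting at 0 (like binwidth = 10, boundary = 0)
bin.edges <- seq(0, 100, by = 10)

# Assign each score to a bin, then count the scores in each bin
score.bins <- cut(midterm.1.scores, breaks = bin.edges)
bin.counts <- table(score.bins)

# Show only the bins that contain at least one score
bin.counts[bin.counts > 0]
```

Each bin now lines up with one letter grade on a ten point scale, so the height of a bar is the number of students who earned that grade.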

We can also add a rug plot to the histogram, to display the actual scores of the students, using the gf_rugx command. To add an element to a plot, you use the %>% operator:

gf_histogram(~ midterm.1, data = course.stats,
             binwidth = 10, boundary = 0) %>% 
  gf_rugx(~ midterm.1)

Finally, we can break down the distribution of Midterm 1 scores by major using the | (vertical bar or “pipe”) character. You will learn later in this course that, in writing mathematical expressions involving probabilities, the vertical bar denotes “given,” so “midterm.1 | major” can be read as “break down the Midterm 1 scores given that we know the major of a particular student”:

gf_histogram(~ midterm.1 | major, data = course.stats)

Generating Boxplots from Data

Histograms are easy to visually inspect, but they do not summarize the data as much as we might want. A boxplot is a compromise between a visual summary and a numerical summary of a data set. To generate a boxplot for the Midterm 1 grades and add a rug plot, we can use:

gf_boxplot(~ midterm.1, data = course.stats) %>% gf_rugy(~ midterm.1)

Again, notice how this uses the same grammar as tally and gf_histogram. We used the function gf_rugy instead of gf_rugx because we want the rug plot on the \(y\)-axis.
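The box in a boxplot spans the first to the third quartile, with a line at the median. A minimal sketch of computing these numerical summaries in base R, using only the six Midterm 1 scores printed by head earlier (so the values will differ from those for the full class); the variable names are introduced here for illustration:

```r
# Midterm 1 scores of the six students shown by head(course.stats)
midterm.1.scores <- c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9)

# Minimum, first quartile, median, third quartile, and maximum
score.summary <- quantile(midterm.1.scores, probs = c(0, 0.25, 0.5, 0.75, 1))
score.summary
```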

Finally, we can break up the boxplots by major as before:

gf_boxplot(~ midterm.1 | major, data = course.stats) %>% gf_rugy(~ midterm.1)

If we want to show all of the boxplots overlaid on the same plot, we can use the color argument of gf_boxplot:

gf_boxplot(~ midterm.1, color = ~ major, data = course.stats)

Including Additional Information Using `tally`

Let’s return to tally, and see what happens when we use the vertical bar (|) to break down the course.grade variable (the letter grade for each student in the course) by major:

tally(~ course.grade | major, data = course.stats)

##             major
## course.grade BY CE HE
##            F  0  0  0
##            D  0  0  0
##            C  3  2  7
##            B  5  1  7
##            A  0  2  0

tally is telling us that 5 biology majors got Bs, 3 biology majors got Cs, 1 chemistry major got a B, etc. (Reminder: these data are fake.)

We can also use the format argument to report the distribution of grades within each major as percentages:

tally(~ course.grade | major, data = course.stats, format = 'percent')

##             major
## course.grade   BY   CE   HE
##            F  0.0  0.0  0.0
##            D  0.0  0.0  0.0
##            C 37.5 40.0 50.0
##            B 62.5 20.0 50.0
##            A  0.0 40.0  0.0

So we see that amongst biology majors, 62.5% got Bs, and amongst health majors, 50% got Bs, etc.
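These percentages are computed within each major, so each column adds up to 100. A minimal sketch of the same calculation in base R, with the counts copied from the first table in this section (the names grade.counts and grade.percents are introduced here for illustration):

```r
# Counts of letter grades by major, copied from the tally output above.
# matrix() fills column by column, so each line below is one major.
grade.counts <- matrix(
  c(0, 0, 3, 5, 0,   # BY: F, D, C, B, A
    0, 0, 2, 1, 2,   # CE: F, D, C, B, A
    0, 0, 7, 7, 0),  # HE: F, D, C, B, A
  nrow = 5,
  dimnames = list(course.grade = c('F', 'D', 'C', 'B', 'A'),
                  major = c('BY', 'CE', 'HE'))
)

# Divide each column by its total (margin = 2 means "by column")
grade.percents <- 100 * prop.table(grade.counts, margin = 2)
grade.percents
```

For example, 5 of the 8 biology majors got Bs, and 5 / 8 = 62.5%.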

Knitting an R Notebook

Now that you’ve completed the lab, you can “knit” the R Notebook together into an HTML file using the R package knitr. To do this, select

File > Knit Document

from the File menu.

This will create a file called lab1.nb.html that contains all of the code chunks you wrote and the tables and plots they generated.

You can add your name to the top of the HTML file by including an author: field at the top of the R Notebook:

Downloading a File from RStudio Cloud

The files you created are hosted remotely on the RStudio Cloud server. To download a file from the RStudio Cloud server, select the check box next to each file you want to download, click the "More" (gear) icon, and select the "Export..." option from the drop-down menu:

Click the Download button in the resulting dialog box. This will save a zip file to your computer containing the selected files.

Submitting on eCampus

You should submit both your lab1.Rmd file and the lab1.nb.html file to eCampus, under the Lab 1 assignment. This submission is due by the beginning of Lecture 4.

David Darmon

1/29/2020
