Complete this assignment by carefully reading and following the directions.
This assignment is due by the beginning of Lecture 4. See the submission details at the end of this lab.
In this lab, you will explore a (fake!) data set of the attributes of students in a MA 151 course. In the process, you will learn how to:
I have created an RStudio Cloud Project that you can work in for this Lab. You can find RStudio Cloud here. You should have already created an account and joined the RStudio Cloud Space MA 220-02, Spring 2020.
You can find the MA 220-02 Space by clicking on the “Your Workspace” icon at the top of the page and then selecting “MA 220-02, Spring 2020” from the menu on the left:
You will then see a list of all the Projects hosted in the MA 220-02 Space. You want to select the lab1 Project:
R can be finicky in how it reads in files. Windows, macOS, and Linux allow you to do some crazy things with file names, but you should aim to keep your file names as vanilla as possible.
As a rule:
.Rmd file, the .Rda file, any additional notes you might write, etc.
You will need to create an R Notebook to store your work. R Notebooks are plain text files containing R code that RStudio can interpret, execute, and use to embed graphics, tables, etc.
To create a new R Notebook file, select File > New File... > R Notebook ... from the File menu at the top of your screen. An R Notebook will now appear in the top left panel of your screen:
The R Notebook will be pre-populated with text. You should delete everything below the second set of triple-hyphens (---):
You should save the file using File > Save from the File menu, naming it lab1.Rmd. Save lab1.Rmd directly in the project folder in the RStudio Cloud Project.
In R Markdown, we can mix plain text with R code. We need to tell R Markdown how to recognize when a chunk of text should be interpreted as plain text (something a person can read) versus R code (something the computer can read). R Markdown does this using “code chunks,” which are separated from the main text by a block like this:
```{r}
```
The ` symbol is the grave accent (also called a backtick) and can be found on the top left of the keyboard, just below the Escape key. So any time you see a block of code like this:
# Block of code
you should put it inside of a code chunk in R Markdown like this:
```{r}
# Block of code
```
Try this now by creating a code chunk that contains the code
2 + 2
as below:
You can run a code chunk by clicking the green play button at the top-right corner of the code chunk. Try this now.
We will begin our data analysis by loading a data file into RStudio. The data is stored in SynthCourseStats.rda. The file has the suffix .rda, which stands for RData Format. This is similar to the .xls, .doc, etc., suffixes you may be familiar with from Microsoft Excel, Microsoft Word, etc. It tells your computer that this type of file should be opened by a program that can read R files.
You load SynthCourseStats.rda into the current R session using the command:
load('SynthCourseStats.rda')
Again, you should put this in a code chunk, so your R Notebook should now look like this:
You can then run this code chunk by clicking on the green play button on the right of the code chunk.
Assuming you are in the correct working directory, RStudio will now load in the course.stats data frame. If you get an error, it’s most likely because R is not looking in the correct directory. If this happens, select Session > Set Working Directory > To Source File Location and the working directory (where R looks for files) will now be set to the correct folder. Rerun the code chunk and the data frame should load. If you still cannot get the data frame to load, raise your hand and I will troubleshoot the issue with you.
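Before raising your hand, you can also check from inside a code chunk where R is looking and whether it can see the file. The sketch below uses only base R functions. The object demo.scores and the file demo.rda are hypothetical names made up for this illustration; they are not part of the lab files.

```r
# Where is R currently looking for files?
getwd()

# Can R see the data file from there? FALSE means load() would fail.
file.exists("SynthCourseStats.rda")

# Round trip with a hypothetical object, written to a temporary folder
demo.scores <- data.frame(student = 1:3, score = c(65.3, 75.2, 84.2))
demo.path <- file.path(tempdir(), "demo.rda")
save(demo.scores, file = demo.path)

# Remove the object, then bring it back with load()
rm(demo.scores)
loaded.names <- load(demo.path)

# load() restores objects under their original names and returns those names
loaded.names
exists("demo.scores")
```

This is why running load('SynthCourseStats.rda') makes a data frame called course.stats appear without you assigning it to anything: the name is stored inside the file.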
We can inspect the first few rows of the course.stats data frame using the head command in R. “Head” here means the top; there is a corresponding command tail which will print the last few rows of a data frame.
head(course.stats)
## student.number ID major midterm.1 midterm.2 midterm.3 final.exam
## 1 1 1114716 HE 65.3 94.3 78.6 83.8
## 2 2 1260907 CE 75.2 99.8 91.8 96.8
## 3 3 1102104 BY 84.2 76.3 64.8 73.6
## 4 4 1144623 HE 93.6 69.6 85.9 75.4
## 5 5 1179867 BY 79.0 85.7 83.3 94.3
## 6 6 1148060 HE 83.9 78.6 90.6 73.5
## course.score course.grade
## 1 80.500 B
## 2 90.900 A
## 3 74.725 C
## 4 81.125 B
## 5 85.575 B
## 6 81.650 B
Again, you should put the above code in a code chunk in your R Notebook and click the green play button to run it. From here on out, I will not keep mentioning this. But you need to remember to do it!
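If you want to experiment with head and tail without touching the real data, here is a small sketch. The data frame demo.stats is a hypothetical stand-in that I built by hand from a few columns of the six rows printed above; it is not the full course.stats data frame.

```r
# Hypothetical stand-in built from the six rows shown in the head() output
demo.stats <- data.frame(
  student.number = 1:6,
  major = c("HE", "CE", "BY", "HE", "BY", "HE"),
  midterm.1 = c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9),
  course.grade = c("B", "A", "C", "B", "B", "B")
)

# The n argument controls how many rows are printed
head(demo.stats, n = 3)   # first three rows
tail(demo.stats, n = 2)   # last two rows

# str() gives a compact overview of the columns and their types
str(demo.stats)
```

The same n argument works on course.stats, for example head(course.stats, n = 10).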
mosaic
To load a package into R, you use the library command. So to load the mosaic package into R, you should create and run the code chunk:
library(mosaic)
Now all of the functionality of mosaic is available to you.
tally to Aggregate Data
Let’s use some of mosaic’s functionality to explore the course.stats data frame. We will use the tally function to summarize the distribution of majors in this MA 151 class.
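The “grammar” of functions in mosaic follows the structure
goal(~ x, data = mydata)
where goal is the function that does what we want, x is the variable we are interested in, and mydata is the data frame that contains x.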
Since we want to tally the students by their major, and the data is stored in the course.stats data frame, the command we want to use is
tally(~ major, data = course.stats)
## major
## BY CE HE
## 8 5 14
We can also pass additional arguments to an R function, to change how it presents its output. So, for example, to output the tally in terms of the proportion of students by major, we can pass 'proportion' to the format argument:
tally(~ major, data = course.stats, format = 'proportion')
## major
## BY CE HE
## 0.2962963 0.1851852 0.5185185
To output the tally in terms of the percentage of students, we use 'percent':
tally(~ major, data = course.stats, format = 'percent')
## major
## BY CE HE
## 29.62963 18.51852 51.85185
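To see where these numbers come from, you can redo the arithmetic by hand. The sketch below uses base R rather than mosaic, and the vector major.counts is simply the output of the first tally typed in by hand.

```r
# Counts copied from the first tally output above
major.counts <- c(BY = 8, CE = 5, HE = 14)

# Total number of students in the class
total.students <- sum(major.counts)

# A proportion is a count divided by the total
major.props <- major.counts / total.students

# A percentage is a proportion multiplied by 100
major.percents <- 100 * major.props

round(major.props, 7)
round(major.percents, 5)
```

The rounded values agree with the 'proportion' and 'percent' outputs of tally shown above.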
One of the things students and professors care about is the distribution of scores on any given exam in a course. We can visualize the distribution of the Midterm 1 scores in the class using a histogram:
gf_histogram(~ midterm.1, data = course.stats)
Notice that this is exactly the same grammar as we used for the tally command, but now our goal is a histogram, so we use gf_histogram, and we want to know about the Midterm 1 grade, so we use the midterm.1 column in the data frame.
Suppose the professor used a standard ten point scale to assign letter grades to the exam scores. The bins that gf_histogram used by default make it difficult to determine the number of students who earned an A versus a B versus a C, etc., on the exam. We can use the binwidth and boundary arguments of gf_histogram to set the bins to have a width of 10, and to start the bins at 0:
gf_histogram(~ midterm.1, data = course.stats,
             binwidth = 10, boundary = 0)
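To get a feel for what bins of width 10 starting at 0 look like as counts, here is a rough base R sketch. It uses only the six Midterm 1 scores printed in the head output earlier, not the whole class. I am assuming that the histogram includes the right-hand endpoint of each bin, which I believe is the default of the underlying plotting code, so a score of exactly 70 would be counted in the 60 to 70 bin.

```r
# The six Midterm 1 scores shown in the head() output earlier
midterm.1.demo <- c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9)

# Bin edges at 0, 10, 20, ..., 100
bin.edges <- seq(0, 100, by = 10)

# cut() assigns each score to a bin; by default each bin includes its
# right endpoint, for example (60,70]
score.bins <- cut(midterm.1.demo, breaks = bin.edges)

# Count the scores in each bin
bin.counts <- table(score.bins)
bin.counts
```

Each nonzero count corresponds to the height of one bar in a histogram of these six scores.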
We can also add a rug plot to the histogram, to display the actual scores of the students, using the gf_rugx command. To add an element to a plot, you use the %>% operator:
gf_histogram(~ midterm.1, data = course.stats,
             binwidth = 10, boundary = 0) %>%
  gf_rugx(~ midterm.1)
Finally, we can break down the distribution of Midterm 1 scores by major using the | (vertical bar or “pipe”) character. You will learn later in this course that, in writing mathematical expressions involving probabilities, the vertical bar denotes “given,” so “midterm.1 | major” can be read as “break down the Midterm 1 scores given that we know the major of a particular student”:
gf_histogram(~ midterm.1 | major, data = course.stats)
Histograms are easy to visually inspect, but they do not summarize the data as much as we might want. A boxplot is a compromise between a visual summary and a numerical summary of a data set. To generate a boxplot for the Midterm 1 grades and add a rug plot, we can use:
gf_boxplot(~ midterm.1, data = course.stats) %>% gf_rugy(~ midterm.1)
Again, notice how this uses the same grammar as tally and gf_histogram. We used the function gf_rugy instead of gf_rugx because we want the rug plot on the \(y\)-axis.
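The numerical summary behind a boxplot can also be computed directly. In a standard boxplot, the box runs from the first quartile to the third quartile with a line at the median, and the whiskers reach out to the most extreme scores that are within 1.5 times the box height of the box. The sketch below uses base R and only the six Midterm 1 scores printed in the head output earlier, so the values are for illustration and will not match the boxplot of the whole class.

```r
# The six Midterm 1 scores shown in the head() output earlier
midterm.1.demo <- c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9)

# The line in the middle of the box
demo.median <- median(midterm.1.demo)

# The bottom and top edges of the box (first and third quartiles)
demo.quartiles <- quantile(midterm.1.demo, probs = c(0.25, 0.75))

# The height of the box, called the interquartile range
demo.iqr <- IQR(midterm.1.demo)

demo.median
demo.quartiles
demo.iqr

# The smallest and largest scores
range(midterm.1.demo)
```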
Finally, we can break up the boxplots by major as before:
gf_boxplot(~ midterm.1 | major, data = course.stats) %>% gf_rugy(~ midterm.1)
If we want to show all of the boxplots overlaid on the same plot, we can use the color argument of gf_boxplot:
gf_boxplot(~ midterm.1, color = ~ major, data = course.stats)
tally
Let’s return to tally, and see what happens when we use the pipe operation to break down the course.grade variable (the letter grade for each student in the course) by major:
tally(~ course.grade | major, data = course.stats)
## major
## course.grade BY CE HE
## F 0 0 0
## D 0 0 0
## C 3 2 7
## B 5 1 7
## A 0 2 0
tally is telling us that 5 biology majors got Bs, 3 biology majors got Cs, 1 chemistry major got a B, etc. (Reminder: these data are fake.)
We can also use the format argument to break down the distribution of grades by major:
tally(~ course.grade | major, data = course.stats, format = 'percent')
## major
## course.grade BY CE HE
## F 0.0 0.0 0.0
## D 0.0 0.0 0.0
## C 37.5 40.0 50.0
## B 62.5 20.0 50.0
## A 0.0 40.0 0.0
So we see that amongst biology majors, 62.5% got Bs, and amongst health majors, 50% got Bs, etc.
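Notice that each column of the percent table adds up to 100: the percentages are computed within each major, not across the whole class. You can check this by hand with base R. In the sketch below, the matrix grade.counts is just the table of counts from the earlier tally output typed in by hand.

```r
# Counts copied from the tally of course.grade by major above.
# matrix() fills column by column, so each line below is one major.
grade.counts <- matrix(
  c(0, 0, 3, 5, 0,    # BY
    0, 0, 2, 1, 2,    # CE
    0, 0, 7, 7, 0),   # HE
  nrow = 5,
  dimnames = list(course.grade = c("F", "D", "C", "B", "A"),
                  major = c("BY", "CE", "HE"))
)

# margin = 2 divides each count by its column (major) total
grade.percents <- 100 * prop.table(grade.counts, margin = 2)
grade.percents

# Every major's column should add up to 100
colSums(grade.percents)
```

For example, 5 of the 8 biology majors got Bs, and 5 / 8 is 62.5%.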
Now that you’ve completed the lab, you can “knit” the R Notebook together into an HTML file using the R package knitr. To do this, select File > Knit Document from the File menu.
This will create a file called lab1.nb.html that contains all of the code chunks you wrote and the tables and plots they generated.
You can add your name to the top of the HTML file by including an author: field at the top of the R Notebook:
The files you created are hosted remotely on the RStudio Cloud server. To download a file from the RStudio Cloud server, select the check box next to each file you want to download, click the "More" (gear) icon, and select the "Export..." option from the drop-down menu:
Click the Download button in the resulting dialog box. This will save a zip file to your computer containing the selected files.
You should submit both your lab1.Rmd file and the lab1.nb.html file to eCampus, under the Lab 1 assignment. This submission is due by the beginning of Lecture 4.