Complete this assignment by carefully reading and following the directions.
This assignment is due by the beginning of Lecture 3. See the submission details at the end of this lab.
In this lab, you will explore a (fake!) data set of the attributes of students in a MA 151 course. In the process, you will learn how to:
It is a good idea to save all of the files you will create in R in an organized fashion. You may already have an organizational system for your files from previous courses. If so, feel free to use that system, since you are already familiar with it.
If you do not have an organizational system for your files, I recommend using the Documents folder on your computer as home base, and then creating a folder for MA 151, sub-folders for homeworks and labs, etc. So your file system should look like:
Documents
  School
    fa20
      ma151
        hws
          hw1
        labs
          lab1
        lectures
          lecture1
R can be finicky in how it reads in files. Windows, macOS, and Linux allow you to do some crazy things with file names, but you should aim to keep your file names as vanilla as possible.
As a rule:
You will need to create an R Notebook to store your work. R Notebooks are plain text files containing R code that RStudio can interpret, execute, and use to embed graphics, tables, etc.
To create a new R Notebook file, select
File > New File... > R Notebook ...
from the File menu at the top of your screen. An R Notebook will now appear in the top left panel of your screen:
The R Notebook will be pre-populated with text. You should delete everything below the second set of triple hyphens (---):
You should save the file using
File > Save
from the File menu, naming it lab1.Rmd. Save lab1.Rmd in the lab1 folder you created in the previous section.
You can find the data file SynthCourseStats.csv at this link.
The data is stored in SynthCourseStats.csv. The file has the suffix .csv, which stands for “comma-separated values.” This is similar to the .xls, .doc, etc., suffixes you may be familiar with from Microsoft Excel, Microsoft Word, etc. It tells your computer that this type of file contains plain text where values are separated by commas.
You should move SynthCourseStats.csv to the lab1 folder you have already created.
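To make the “comma-separated values” idea concrete, here is a small sketch that writes a tiny data frame to a temporary .csv file and prints the raw text of that file. The data frame toy and its values are made up for illustration; they are not the contents of SynthCourseStats.csv.

```r
# A tiny, made-up data frame (NOT the real course data).
toy <- data.frame(major = c("BY", "CE"), midterm.1 = c(84.2, 75.2))

# Write it to a temporary .csv file, then read the file back as plain text.
toy.file <- tempfile(fileext = ".csv")
write.csv(toy, toy.file, row.names = FALSE)
raw.lines <- readLines(toy.file)
print(raw.lines)

# read.csv turns the comma-separated text back into a data frame.
toy.again <- read.csv(toy.file)
print(toy.again)
```

Each line of the file holds one row of the data frame, with the values in that row separated by commas.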
In R Markdown, we can mix plain text with R code. We need to tell R Markdown how to recognize when a chunk of text should be interpreted as plain text (something a person can read) versus R code (something the computer can read). R Markdown does this using “code chunks,” which are separated from the main text by a block like this:
```{r}
```
The ` symbol is the grave accent (also called a backtick) and can be found on the top left of the keyboard, just below the Escape key. So any time you see a block of code like this:
# Block of code
you should put it inside of a code chunk in R Markdown like this:
```{r}
# Block of code
```
Try this now by creating a code chunk that contains the code
2 + 2
as below:
You can run a code chunk by clicking the green play button at the top-right corner of the code chunk. Try this now.
We will begin our data analysis by loading the data file into RStudio. You can do this with the read.csv command:
course.stats = read.csv('SynthCourseStats.csv')
Again, you should put this in a code chunk, so your R Notebook should now look like this:
You can then run this code chunk by clicking on the green play button on the right of the code chunk.
Assuming you are in the correct working directory, RStudio will now load in the SynthCourseStats.csv file and store it in the course.stats data frame. If you get an error, it’s most likely because R is not looking in the correct directory. If this happens, select
Session > Set Working Directory > To Source File Location
and the working directory (where R looks for files) will now be set to the correct folder. Rerun the code chunk and the data frame should load. If you still cannot get the data frame to load, raise your hand and I will troubleshoot the issue with you.
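If you want to check the working directory yourself before asking for help, the following sketch may be useful. It only uses base R functions; the names current.dir and file.found are just illustrative.

```r
# Where is R currently looking for files?
current.dir <- getwd()
print(current.dir)

# Is the data file visible from that folder?
# TRUE means read.csv('SynthCourseStats.csv') should be able to find it.
file.found <- file.exists("SynthCourseStats.csv")
print(file.found)

# List any .csv files R can see in the working directory.
print(list.files(pattern = "\\.csv$"))
```

If file.found is FALSE, the file is probably in a different folder than the one printed by getwd().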
We can inspect the first few rows of the course.stats data frame using the head command in R. “Head” here means the top; there is a corresponding command tail which will print the last few rows of a data frame.
head(course.stats)
##   student.number      ID major midterm.1 midterm.2 midterm.3 final.exam
## 1              1 1114716    HE      65.3      94.3      78.6       83.8
## 2              2 1260907    CE      75.2      99.8      91.8       96.8
## 3              3 1102104    BY      84.2      76.3      64.8       73.6
## 4              4 1144623    HE      93.6      69.6      85.9       75.4
## 5              5 1179867    BY      79.0      85.7      83.3       94.3
## 6              6 1148060    HE      83.9      78.6      90.6       73.5
##   course.score course.grade
## 1       80.500            B
## 2       90.900            A
## 3       74.725            C
## 4       81.125            B
## 5       85.575            B
## 6       81.650            B
Again, you should put the above code in a code chunk in your R Notebook and click the green play button to run it. From here on out, I will not keep mentioning this. But you need to remember to do it!
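As a small sketch of how head and tail behave, the code below rebuilds a few columns of the six rows printed above into a stand-in data frame (called first.six here, a name chosen only for this example) and inspects it. With the real course.stats data frame, the same commands apply.

```r
# Rebuild a few columns of the six rows printed by head(course.stats) above.
first.six <- data.frame(
  student.number = 1:6,
  major          = c("HE", "CE", "BY", "HE", "BY", "HE"),
  midterm.1      = c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9),
  course.grade   = c("B", "A", "C", "B", "B", "B")
)

head(first.six, 3)   # the first three rows
tail(first.six, 2)   # the last two rows
nrow(first.six)      # the number of rows
names(first.six)     # the column names
```

The second argument to head and tail sets how many rows are shown; if you leave it out, you get six rows.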
mosaic
To load a package into R, you use the library command. So to load the mosaic package into R, you should create and run the code chunk:
library(mosaic)
Now all of the functionality of mosaic is available to you.
Let’s use some of mosaic’s functionality to explore the course.stats data frame. We will use the tally function to summarize the distribution of majors in this MA 151 class.
Remember that the “grammar” of functions in mosaic follows the structure
Since we want to tally the students by their major, and the data is stored in the course.stats data frame, the command we want to use is
tally(~ major, data = course.stats)
## major
## BY CE HE
## 8 5 14
We can also pass additional arguments to an R function, to change how it presents its output. So, for example, to output the tally in terms of the proportion of students by major, we can pass ‘proportion’ to the format argument:
tally(~ major, data = course.stats, format = 'proportion')
## major
## BY CE HE
## 0.2962963 0.1851852 0.5185185
To output the tally in terms of the percentage of students, we pass 'percent' to the format argument:
tally(~ major, data = course.stats, format = 'percent')
## major
## BY CE HE
## 29.62963 18.51852 51.85185
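To see where the proportions and percentages come from, here is a rough base R sketch that starts from the counts reported above (8 BY, 5 CE, 14 HE) and divides by the total. It uses table and prop.table rather than mosaic's tally, so treat it as an illustration of the arithmetic, not as how tally works internally.

```r
# Rebuild the major column from the counts reported by tally: 8 BY, 5 CE, 14 HE.
major <- rep(c("BY", "CE", "HE"), times = c(8, 5, 14))

counts      <- table(major)          # counts per major
proportions <- prop.table(counts)    # each count divided by the total (27)
percents    <- 100 * proportions     # proportions rescaled to percentages

print(counts)
print(proportions)
print(percents)
```

For example, the proportion of BY majors is 8 / 27, which is about 0.296, or about 29.6%.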
One of the things students and professors care about is the distribution of scores on any given exam in a course. We can visualize the distribution of the Midterm 1 scores in the class using a histogram:
gf_histogram(~ midterm.1, data = course.stats)
Notice that this is exactly the same grammar as we used for the tally command, but now our goal is a histogram, so we use gf_histogram, and we want to know about the Midterm 1 grade, so we use the midterm.1 column in the data frame.
Suppose the professor used a standard ten point scale to assign letter grades to the exam scores. The bins that gf_histogram used by default make it difficult to determine the number of students who earned an A versus a B versus a C, etc., on the exam. We can use the binwidth and boundary arguments of gf_histogram to set the bins to have a width of 10, and to start the bins at 0:
gf_histogram(~ midterm.1, data = course.stats,
             binwidth = 10, boundary = 0)
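To see what bins of width 10 starting at 0 mean for individual scores, here is a small base R sketch that sorts the six Midterm 1 scores printed by head(course.stats) into those bins and counts them. The choice right = FALSE (so that a bin looks like [80,90)) is an assumption about how a score exactly on a bin edge should be counted, and gf_histogram may treat edge cases differently.

```r
# The six Midterm 1 scores printed by head(course.stats) earlier in the lab.
midterm.1 <- c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9)

# Bin edges at 0, 10, ..., 100: bins of width 10, starting at 0.
edges <- seq(0, 100, by = 10)

# Assign each score to a bin, then count the scores in each bin.
# right = FALSE gives bins like [80,90), closed on the left.
bins <- cut(midterm.1, breaks = edges, right = FALSE)
bin.counts <- table(bins)
print(bin.counts)
```

On a ten point scale, the [90,100) bin would hold the A range, [80,90) the B range, and so on.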
We can also add a rug plot to the histogram, to display the actual scores of the students, using the gf_rugx command. To add an element to a plot, you use the %>% operation:
gf_histogram(~ midterm.1, data = course.stats,
             binwidth = 10, boundary = 0) %>%
  gf_rugx(~ midterm.1)
Finally, we can break down the distribution of Midterm 1 scores by major using the | (vertical bar or “pipe”) character. You will learn later in this course that, in writing mathematical expressions involving probabilities, the vertical bar denotes “given,” so “midterm.1 | major” can be read as “break down the Midterm 1 scores given that we know the major of a particular student”:
gf_histogram(~ midterm.1 | major, data = course.stats)
Histograms are easy to understand, but suffer from the fact that the choice of bin width can drastically change the appearance of a histogram. A density plot is a “smoothed” histogram that avoids this problem. We can create a density plot for the Midterm 1 grades and add a rug plot using:
gf_dens(~ midterm.1, data = course.stats) %>% gf_rugx(~ midterm.1)
Again, notice how this uses the same grammar as tally and gf_histogram.
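If you are curious what the “smoothing” looks like outside of mosaic, here is a minimal base R sketch using the density function on the six Midterm 1 scores printed by head(course.stats). This is only an illustration with six scores; gf_dens on the full data frame may use different settings and will produce a different-looking curve.

```r
# The six Midterm 1 scores printed by head(course.stats) earlier in the lab.
midterm.1 <- c(65.3, 75.2, 84.2, 93.6, 79.0, 83.9)

# Compute a kernel density estimate (a smoothed version of a histogram).
dens <- density(midterm.1)

# Draw the smooth curve, then add a rug of the actual scores along the x-axis.
plot(dens, main = "Density of six Midterm 1 scores")
rug(midterm.1)
```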
Finally, we can break up the distributions by major as before:
gf_dens(~ midterm.1 | major, data = course.stats) %>% gf_rugx(~ midterm.1)
If we want to show all of the densities overlaid on the same plot, we can use the color argument of gf_dens:
gf_dens(~ midterm.1, color = ~ major, data = course.stats)
Let’s return to tally, and see what happens when we use the | character to break down the course.grade variable (the letter grade for each student in the course) by major:
tally(~ course.grade | major, data = course.stats)
##             major
## course.grade BY CE HE
##            A  0  2  0
##            B  5  1  7
##            C  3  2  7
tally is telling us that 5 biology majors got Bs, 3 biology majors got Cs, 1 chemistry major got a B, etc. (Reminder: these data are fake.)
We can also use the format argument to show the distribution of grades within each major as percentages:
tally(~ course.grade | major, data = course.stats, format = 'percent')
##             major
## course.grade   BY   CE   HE
##            A  0.0 40.0  0.0
##            B 62.5 20.0 50.0
##            C 37.5 40.0 50.0
So we see that amongst biology majors, 62.5% got Bs, and amongst health majors, 50% got Bs, etc.
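Notice that the percentages are computed within each major, so each column adds up to 100. As a rough check of that arithmetic, the base R sketch below rebuilds the table of counts from the tally output above and divides each column by its total. The names grade.counts and grade.percents are just for this example.

```r
# Rebuild the table of counts reported by tally(~ course.grade | major).
# matrix() fills column by column, so each line below is one major.
grade.counts <- matrix(
  c(0, 5, 3,    # BY: A, B, C
    2, 1, 2,    # CE: A, B, C
    0, 7, 7),   # HE: A, B, C
  nrow = 3,
  dimnames = list(course.grade = c("A", "B", "C"),
                  major = c("BY", "CE", "HE"))
)

# Divide each column by its total (margin = 2), then rescale to percent.
grade.percents <- 100 * prop.table(grade.counts, margin = 2)
print(grade.percents)

# Each major's percentages should add up to 100.
print(colSums(grade.percents))
```

For example, 5 of the 8 biology majors got Bs, and 5 / 8 = 62.5%.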
Now that you’ve completed the lab, you can “knit” the R Notebook together into an HTML file using the R package knitr. To do this, select
File > Knit Document
from the File menu.
This will create a file called lab1.nb.html that contains all of the code chunks you wrote and the tables and plots they generated.
You can add your name to the top of the HTML file by including an author: field at the top of the R Notebook:
You should submit both your lab1.Rmd file and your lab1.nb.html file to eCampus, under the Lab 1 assignment. This submission is due by the beginning of Lecture 3.