Due Date:

Complete this assignment by carefully reading and following the directions.

This assignment is due by the beginning of Lecture 3. See the submission details at the end of this lab.

Introduction

We will explore a data set created by the Gapminder Foundation, a non-profit organization dedicated to the promotion of sustainable global development.

This data set contains statistics about several countries over the past 50+ years, including life expectancy, population, and Gross Domestic Product (GDP) per capita. You may have seen this data set in a TED Talk by Hans Rosling.

In the process of exploring this data, you will learn how to:

  1. Organize your work in R.
  2. Create and use an R Markdown file to store your work.
  3. Load and inspect a data frame.
  4. Perform univariate and bivariate exploratory data analysis with R.
  5. Typeset mathematical expressions using LaTeX.

Setting Up a Directory for MA 440

Note: If you are working in RStudio Cloud, you will not need to create the directory structure suggested below. However, if you plan to work locally on your personal computer, now or later, rather than in RStudio Cloud, I recommend using this sort of directory structure.

It is a good idea to save all of the files you will create in R in an organized fashion. You may already have an organizational system for your files from previous courses. If so, feel free to use that system, since you are already familiar with it.

If you do not have an organizational system for your files, I recommend using the Documents folder on your computer as home base, and then creating a folder for MA 440, sub-folders for homeworks and labs, etc. So your file system should look like:

Documents
  ma440
    hw
      hw1
    labs
      lab1
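If you would rather build these folders from R than by hand, the following is a minimal sketch. It assumes a stand-in base folder inside R's temporary directory (the variable names base and leaves are just for illustration), so it is safe to run as-is; to create the real folders you would point base at your own Documents folder.

```r
# Assumption: `base` is a stand-in folder inside R's temporary directory,
# so this sketch is safe to run. Replace it with "~/Documents" (or wherever
# your Documents folder lives) to create the real folders.
base <- file.path(tempdir(), "Documents")

# The two innermost folders from the layout above.
leaves <- c(
  file.path(base, "ma440", "hw", "hw1"),
  file.path(base, "ma440", "labs", "lab1")
)

# recursive = TRUE also creates the parent folders (ma440, hw, labs).
for (leaf in leaves) {
  dir.create(leaf, recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the base folder.
list.dirs(base, full.names = FALSE)
```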

R can be finicky in how it reads in files. Windows, macOS, and Linux allow you to do some crazy things with file names, but you should aim to keep your file names as vanilla as possible.

As a rule:

Creating a New R Notebook File

You will need to create an R Notebook file to store your work. R Notebook files are plain text files containing R code that RStudio can interpret, execute, and use to embed graphics, tables, etc.

To create a new R Notebook file, select

File > New File... > R Markdown

from the File menu at the top of your screen.

You will be prompted with a dialog box to give a title to your R Markdown file, and to add your name as author. You can leave all of the remaining options at their default values.

You should save the file using

File > Save

and name it lab1.Rmd.

Writing Code Chunks in R Markdown

In R Markdown, we can mix plain text with R code. We need to tell R Markdown how to recognize when a chunk of text should be interpreted as plain text (something a person can read) versus R code (something the computer can read). R Markdown does this using “code chunks,” which are set off from the main text by an opening line of three ` symbols followed by {r} and a closing line of three ` symbols.

The ` symbol is the grave accent (also called a backtick) and can be found on the top left of the keyboard, just below the Escape key. So any time you see a block of code like this:

# Block of code

you should put it inside of a code chunk in R Markdown like this:
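For reference, here is what a standard R code chunk looks like in the source of an R Markdown file. The opening line is three backticks followed by {r}, and the closing line is three backticks on their own:

````markdown
```{r}
# Block of code
```
````

In RStudio, you can run a chunk by clicking the green "play" arrow at its top right corner.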

Installing Packages

Note: If you are using RStudio Cloud, you will not need to install mosaic, since I have already done that for you. However, if you work locally on your personal computer, you will need to install mosaic onto your computer.

To create histograms, density plots, etc., we need to install the mosaic package. To install a package in R, you use the install.packages command. So to install mosaic, create a code chunk with the following command and run it:

install.packages('mosaic')

You only need to do this once per computer that you will be using R on. After a package is installed, it will remain installed on that computer. RStudio will also warn you if you try to run an R Markdown file with uninstalled packages.

To load a package into R, you use the library command. So to load the mosaic package into R, you should create and run the code chunk:

library(mosaic)

Now all of the functionality of mosaic is available to you.

The data set we will consider is included in the gapminder package for R.

You should install the gapminder package by creating and running the following code chunk

install.packages('gapminder')

and load it by creating and running the following code chunk

library(gapminder)
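Since a package only needs to be installed once per computer, it can be handy to check what is missing before installing anything. The sketch below is one way to do that; missing_packages is a hypothetical helper written for this lab, not a function from any package, and the install line is left commented out so that nothing is downloaded unless you choose to.

```r
# Return the names in `pkgs` that are not installed on this computer.
# `missing_packages` is a hypothetical helper, not part of any package.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("mosaic", "gapminder"))

if (length(to_install) > 0) {
  # Uncomment the next line to install them; this needs an internet connection.
  # install.packages(to_install)
  message("Not yet installed: ", paste(to_install, collapse = ", "))
}
```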

Inspecting a Data Frame

The gapminder data set is now loaded into the R environment, which you can see by entering

gapminder

in the console.

We can inspect the first few rows of the gapminder data frame using the head command in R. “Head” here means the top; there is a corresponding command tail which will print the last few rows of a data frame.

head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
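Beyond head, a few other base R commands are useful for a first look at a data frame. This is a short sketch of the ones I reach for most often; it assumes only that the gapminder package is installed.

```r
library(gapminder)

dim(gapminder)              # number of rows, then number of columns
names(gapminder)            # the column (variable) names
str(gapminder)              # the storage type of each column
tail(gapminder, n = 3)      # the last three rows
summary(gapminder$lifeExp)  # five-number summary plus the mean
```

The $ in gapminder$lifeExp pulls a single column out of the data frame.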

Generating Histograms and Density Plots from Data

Let’s start by considering how the life expectancy is distributed in the entire data set.

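Remember that the “grammar” of functions in mosaic follows the structure

goal(y ~ x, data = mydata)

where goal is the name of the plot or summary we want, y and x are variables (y is left out when we only have one variable), and mydata is the data frame containing them.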

Our goal is to create a histogram, so we will use the gf_histogram function. The “gf” here is an abbreviation of “ggformula”, which is the package that mosaic uses for plotting. And the “gg” in “ggformula” stands for grammar of graphics, a system for organizing graphical displays of information created by Leland Wilkinson.

gf_histogram(~ lifeExp, data = gapminder)
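The shape of a histogram depends on how wide the bins are, so it is worth trying more than one width. As a small sketch, the binwidth argument below groups life expectancies into 5-year-wide bins; the choice of 5 is only an example, not a recommendation.

```r
library(mosaic)
library(gapminder)

# binwidth = 5 groups life expectancies into bins that are 5 years wide.
p <- gf_histogram(~ lifeExp, data = gapminder, binwidth = 5)
p
```

Smaller bin widths show more detail but also more noise; larger ones smooth the picture out.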

We can add a rug plot to the histogram using the “pipe” %>% command from the dplyr package (which is automatically loaded along with mosaic). You can read a command like a %>% b as “do a and then do b.” So to generate a histogram and then add a rug plot, the command is:

gf_histogram(~ lifeExp, data = gapminder) %>%
  gf_rugx(~ lifeExp, data = gapminder)

We can just as easily generate a density plot, now using the gf_dens command, to which we add a rug plot as before:

gf_dens(~ lifeExp, data = gapminder) %>%
  gf_rugx(~ lifeExp, data = gapminder)

Subsetting on a Categorical Variable

One nice feature of ggformula is that it easily allows you to subset your data based on a categorical variable. Notice that each row in the data set has a continent variable associated with it, which indicates the continent on which a given country is located. To subset our density plots by continent, we can use the “|” command. This is analogous to the “given” notation we use in probability. Recall from introductory statistics that \(P(A \mid B)\) asks the probability of an event \(A\) given that we know event \(B\) occurred.

To generate the density plots subset by continent and add a rug plot, run:

gf_dens(~ lifeExp | continent, data = gapminder) %>%
  gf_rugx(~ lifeExp)
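It often helps to put numbers next to a plot like this one. As a sketch, mosaic's favstats function uses the same formula grammar as the plotting functions and reports a handful of summary statistics for each continent:

```r
library(mosaic)
library(gapminder)

# One row of summary statistics (min, quartiles, max, mean, sd, n) per continent.
by_continent <- favstats(lifeExp ~ continent, data = gapminder)
by_continent
```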

We can also draw all of the density plots on a single set of axes, color-coded by continent, by using the col argument to gf_dens. This is more useful if we want to directly compare the distribution of life expectancy across continents:

gf_dens(~ lifeExp, col = ~ continent, data = gapminder) %>%
  gf_rugx(~ lifeExp, col = ~continent)

Each row of the data frame also has a year variable, which we can use to subset the data by year:

gf_dens(~ lifeExp | year, data = gapminder) %>%
  gf_rugx(~ lifeExp, data = gapminder)

If we want to know how the distribution of life expectancy has changed over time, we might try color-coding by year, as we did above for continent:

gf_dens(~ lifeExp, col = ~ year, data = gapminder) %>%
  gf_rugx(~ lifeExp, data = gapminder, col = ~year)

we get a plot that doesn’t make a ton of sense. Why did ggformula only plot one density? It is because the year variable is an integer:

typeof(gapminder$year)
## [1] "integer"

which is a quantitative variable, so ggformula has treated it as such.

If we want to treat year as a categorical variable, we can create a new variable yearFac using the factor command. The factor command generates a “factor” (R’s name for a categorical variable):

gapminder$yearFac = factor(gapminder$year)

gf_dens(~ lifeExp, col = ~ yearFac, data = gapminder) %>%
  gf_rugx(~ lifeExp, data = gapminder, col = ~yearFac)

Now we can clearly see that, from 1952 to 2007, the distribution of life expectancies has shifted higher.
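We can check both steps numerically. The sketch below confirms that factor turns year into a categorical variable with one level per survey year, and then uses mosaic's formula version of median to compute the median life expectancy within each year. The names year_fac and med_by_year are just for illustration.

```r
library(mosaic)
library(gapminder)

# Converting year to a factor gives one category per survey year.
year_fac <- factor(gapminder$year)
class(year_fac)    # "factor" rather than "integer"
levels(year_fac)   # the survey years, as category labels

# Median life expectancy within each year, using mosaic's formula interface.
med_by_year <- median(lifeExp ~ year, data = gapminder)
med_by_year
```

If the density plots are telling the truth, the medians should be higher in the later years than in the earlier ones.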

Creating Boxplots and Violin Plots

You can easily create almost all of the graphical summaries you learned about in introductory statistics using ggformula, as long as you remember to look for the appropriate gf_* function, and to use ggformula’s grammar.

For example, to create a boxplot from all of the life expectancies, you can use the following command:

gf_boxplot(~ lifeExp, data = gapminder)

To subset by continent, you use the | functionality, as before:

gf_boxplot(~ lifeExp | continent, data = gapminder)

To draw the life expectancy boxplots for all of the continents on the same plot, you can color-code by continent, as before:

gf_boxplot(~ lifeExp + continent, col = ~continent, data = gapminder)

Another useful data visualization you may not have seen is a violin plot. A violin plot is just a density plot, mirrored on either side:

gf_violin(~ lifeExp + continent, col = ~continent, data = gapminder)

You can also add a boxplot to the violin plot, again using the pipe operator %>%.

gf_violin(~ lifeExp + continent, col = ~continent, data = gapminder, alpha = 0.2) %>%
  gf_boxplot(alpha = 0.2)

alpha here sets the transparency of the plot, so we can see both the violin plots and the boxplots at the same time.
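As an alternative sketch, the same comparison can be written with a two-sided formula, which follows the y ~ x pattern directly: life expectancy goes on the vertical axis and continent on the horizontal axis. This is a different orientation from the plots above, not a replacement for them.

```r
library(mosaic)
library(gapminder)

# Two-sided formula: lifeExp on the vertical axis, one box per continent.
p <- gf_boxplot(lifeExp ~ continent, color = ~ continent, data = gapminder)
p
```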

Creating Scatter Plots

One of the main skills you will learn in MA 440 is how to describe and model the relationship between two quantitative variables. The first step towards any such description is creating a scatter plot from the two variables. To generate a scatter plot using ggformula, we will use the gf_point (“point” because we are plotting points) function.

For example, to create a scatter plot of lifeExp versus gdpPercap, run the command:

gf_point(lifeExp ~ gdpPercap, data = gapminder)

Remember that ggformula’s formula syntax is always y ~ x, so the variable to the left of the tilde is what we want on the vertical axis, and the variable to the right of the tilde is what we want on the horizontal axis.

Adding Smoothers

From the scatter plot above, we can almost see a general trend in the association between life expectancy and GDP per capita: it looks like as GDP per capita increases, so does life expectancy, to a point. We can add a “smoother” to the plot to highlight that relationship. This is done using the gf_smooth command:

gf_point(lifeExp ~ gdpPercap, data = gapminder) %>%
  gf_smooth()

For the majority of this course, we will focus on linear smoothers, which we can tell R to use by passing 'lm' to the method argument of gf_smooth. lm is short for “linear model,” since we are assuming the relationship between the x and y is linear:

gf_point(lifeExp ~ gdpPercap, data = gapminder) %>%
  gf_smooth(method = 'lm')
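The straight line drawn by the 'lm' smoother is a least-squares line, and we can ask R for its intercept and slope directly with the lm function. This is a small sketch; fit_raw is just a name chosen for illustration.

```r
library(gapminder)

# Fit the same straight line that the 'lm' smoother draws.
fit_raw <- lm(lifeExp ~ gdpPercap, data = gapminder)

# The intercept and the slope of the fitted line.
coef(fit_raw)
```

Notice that lm uses the same y ~ x formula as gf_point.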

Scatter Plots with Transformed Data

We can see that the linear fit is a pretty bad one: it underpredicts the life expectancy at low GDP / capita, and overpredicts the life expectancy at high GDP / capita. In cases like these, it can make sense to transform either the response (life expectancy) or the predictor (GDP / capita). We can log-transform the gdpPercap variable directly in the gf_point command using the following command:

gf_point(lifeExp ~ log(gdpPercap), data = gapminder)

and add the smoother using the log-transformed gdpPercap predictor:

gf_point(lifeExp ~ log(gdpPercap), data = gapminder) %>%
  gf_smooth()

Now a line looks like it might do a decent job of describing the trend, which we can add as before:

gf_point(lifeExp ~ log(gdpPercap), data = gapminder) %>%
  gf_smooth(method = 'lm')
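One rough way to put a number on "a decent job" is to compare the R-squared of the two straight-line fits, with and without the log transformation. This is only a sketch of one possible comparison (we will discuss better tools later in the course), and the names fit_raw, fit_log, and r2 are chosen for illustration.

```r
library(gapminder)

# Straight-line fits using the raw and the log-transformed predictor.
fit_raw <- lm(lifeExp ~ gdpPercap, data = gapminder)
fit_log <- lm(lifeExp ~ log(gdpPercap), data = gapminder)

# Proportion of the variation in lifeExp explained by each fit.
r2 <- c(raw = summary(fit_raw)$r.squared,
        log = summary(fit_log)$r.squared)
r2
```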

Scatter Plots and Smoothers by Categorical Variable

The | command works with gf_point exactly the same way it did with gf_dens and gf_boxplot. We can look at the relationship between life expectancy and GDP / capita by continent using

gf_point(lifeExp ~ log(gdpPercap) | continent, data = gapminder) %>%
  gf_smooth()

and by year using

gf_point(lifeExp ~ log(gdpPercap) | yearFac, data = gapminder) %>%
  gf_smooth()

Finally, if we just want to look at the trends-by-continent, without including the scatter plots, we can use gf_smooth by itself, coloring by continent:

gf_smooth(lifeExp ~ log(gdpPercap), col = ~continent, data = gapminder)

gf_smooth(lifeExp ~ log(gdpPercap), col = ~continent, method = 'lm', data = gapminder)
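Coloring the 'lm' smoother by continent draws a separate least-squares line for each continent. As a sketch, we can recover those intercepts and slopes by splitting the data frame by continent and fitting lm to each piece; by_continent and coefs are names chosen for illustration.

```r
library(gapminder)

# Split the data frame into a list with one data frame per continent.
by_continent <- split(gapminder, gapminder$continent)

# Fit a separate line to each continent and collect the coefficients.
coefs <- sapply(by_continent, function(d) {
  coef(lm(lifeExp ~ log(gdpPercap), data = d))
})

# One row per continent: intercept, then slope.
round(t(coefs), 2)
```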

Markdown and LaTeX

The web page that you are currently reading was written in R Markdown and typeset into HTML. R Markdown, as the name suggests, is a spin-off of Markdown, a markup language created by John Gruber and Aaron Swartz. Unlike Microsoft Word, for example, which is WYSIWYG (pronounced “wizzy-wig,” and stands for What You See Is What You Get), Markdown is a lightweight markup language, similar to HTML but much simpler. Everything is written in plain text, and then Markdown (or R Markdown) is interpreted into the desired output like HTML, LaTeX, Word (if you must…), etc.

You can include in-line math in R Markdown by using \( \) around an expression. For example, this:

The expected value of a random variable \(X\) is denoted by \(\mu\).

will typeset as this:

The expected value of a random variable \(X\) is denoted by \(\mu\).

You can include equations by using \[ \], so that

\[ E[Y] = \beta_{0} + \beta_{1} X\]

will typeset as: \[ E[Y] = \beta_{0} + \beta_{1} X\]
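R Markdown source files also accept dollar-sign delimiters for math, and that is the style you will see most often in other people's .Rmd files: single dollar signs for in-line math and double dollar signs for displayed equations. As a sketch, the two examples above would be written this way:

```latex
The expected value of a random variable $X$ is denoted by $\mu$.

$$ E[Y] = \beta_{0} + \beta_{1} X $$
```

If one style of delimiter does not typeset when you knit, try the other.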

There are lots of commands to learn in LaTeX. For example, to use “teletype” font, you can use the \texttt command. LaTeX commands that take arguments are followed by curly braces {} enclosing those arguments. So this:

\[ E[\texttt{lifeExp}] = \beta_{0} + \beta_{1} \log(\texttt{gdpPercap})\]

will typeset as this: \[ E[\texttt{lifeExp}] = \beta_{0} + \beta_{1} \log(\texttt{gdpPercap})\]

The hardest part of learning LaTeX is gaining familiarity with all of the commands for symbols, typesetting, etc. “Back in my day,” the only way to do this was by browsing through a textbook [1] or consulting someone who already knew LaTeX. Now, there are tools that make finding a given command easier. Two that I can recommend are:

With Mathpix Snip, you can take a photo of either typeset (in a book, webpage, etc.) or hand-written (on a sheet of paper, white board, etc.) mathematics, and Snip will attempt to convert it to LaTeX.

Try typesetting the following expression (for the density of a standard Gaussian random variable) using Mathpix Snip as a guide: \[ f(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}z^{2}}\]

Knitting an R Markdown File

Now that you’ve completed the lab, you can “knit” the R Markdown file together into an HTML file using the R package knitr. To do this, either select

File > Knit Document

from the File menu, or click the Knit icon in the ribbon of the Rmd file (the blue ball of yarn).

This will create a file called lab1.html that contains all of the code chunks you wrote and the tables and plots they generated.
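You can also knit from the console rather than with the button. The sketch below uses the render function from the rmarkdown package (which calls knitr for you); it assumes that lab1.Rmd is saved in your current working directory and that you want HTML output.

```r
# Assumption: lab1.Rmd is saved in the current working directory.
rmd_file <- "lab1.Rmd"

if (file.exists(rmd_file)) {
  # Knit the file to HTML; the output is written next to the .Rmd file.
  rmarkdown::render(rmd_file, output_format = "html_document")
} else {
  message("Could not find ", rmd_file, " in ", getwd())
}
```

If you see the "Could not find" message, check the folder shown against the one where you saved your lab.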

Downloading a File from RStudio Cloud

If you are using RStudio Cloud, the files you created are hosted remotely on their server. To download a file from the RStudio Cloud server, select the check box next to each file you want to download, click the “More” (gear) icon, and select the “Export…” option from the drop-down menu.

Click the Download button in the resulting dialog box. This will save a zip file to your computer containing the selected files.

Submitting on eCampus

You should submit both your lab1.Rmd file and your lab1.nb.html file to eCampus, under the Lab 1 assignment. This submission is due by the beginning of class on Tuesday.


  1. When I took the equivalent of Intro to Math Reasoning at my undergraduate institution, we had to typeset all of our homework in LaTeX. I spent 30% of my time solving the problems, and 70% of my time typesetting the solutions. Fortunately for you, the user-friendliness of LaTeX has increased considerably since 2007.