We will explore a data set created by the Gapminder Foundation, a non-profit organization dedicated to the promotion of sustainable global development.
This data set contains statistics about several countries over the past 50+ years, including life expectancy, population, and Gross Domestic Product (GDP) per capita. You may have seen this data set in a TED Talk by Hans Rosling.
In the process of exploring this data, you will learn how to:
Note: If you are working in RStudio Cloud, you do not need to create the directory structure suggested below. However, if you plan to work locally on your personal computer, now or in the future, I recommend using this sort of directory structure.
It is a good idea to save all of the files you will create in R in an organized fashion. You may already have an organizational system for your files from previous courses. If so, feel free to use that system, since you are already familiar with it.
If you do not have an organizational system for your files, I recommend using the Documents folder on your computer as home base, and then creating a folder for MA 440, sub-folders for homeworks and labs, etc. So your file system should look like:
R can be finicky in how it reads in files. Windows, macOS, and Linux allow you to do some crazy things with file names, but you should aim to keep your file names as vanilla as possible.
As a rule:
You will need to create an R Notebook file to store your work. R Notebook files are plain text files containing R code that RStudio can interpret, execute, and use to embed graphics, tables, etc.
To create a new R Notebook file, select
File > New File... > R Markdown
from the File menu at the top of your screen.
You will be prompted with a dialog box to give a title to your R Markdown file, and to add your name as author. You can leave all of the remaining options at their default values.
You should save the file using
File > Save
and name it lab1.Rmd.
In R Markdown, we can mix plain text with R code. We need to tell R Markdown how to recognize when a chunk of text should be interpreted as plain text (something a person can read) versus R code (something the computer can read). R Markdown does this using “code chunks,” which are separated from the main text by a block like this:
The ` symbol is the grave accent (also called a backtick) and can be found at the top left of the keyboard, just below the Escape key. So any time you see a block of code like this:
# Block of code
you should put it inside of a code chunk in R Markdown like this:
Note: If you are using RStudio Cloud, you will not need to install mosaic, since I have already done that for you. However, if you work locally on your personal computer, you will need to install mosaic onto your computer.
To create histograms, density plots, etc., we need to install the mosaic package. To install a package in R, you use the install.packages command. So to install mosaic, create a code chunk with the following command and run it:
You only need to do this once per computer that you will be using R on. After a package is installed, it will remain installed on that computer. RStudio will also warn you if you try to run an R Markdown file with uninstalled packages.
To load a package into R, you use the library command. So to load the mosaic package into R, you should create and run the code chunk:
Now all of the functionality of mosaic is available to you.
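As a minimal sketch, the install and load steps above could be combined in one chunk. The `requireNamespace()` guard is an addition that is not part of the instructions above; it simply skips the download when mosaic is already installed.

```r
# Install mosaic only if it is not already installed on this computer.
# (The guard is an assumption added for convenience; a plain
# install.packages("mosaic") also works, it just downloads the package again.)
if (!requireNamespace("mosaic", quietly = TRUE)) {
  install.packages("mosaic")
}

# Load mosaic for the current R session.
library(mosaic)
```

The same pattern should work for any other package; only the package name changes.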
The data set we will consider is included in the gapminder package for R.
You should install the gapminder package by creating and running the following code chunk
and load it by creating and running the following code chunk
The gapminder data set is now loaded into the R environment, which you can see by entering
in the console.
We can inspect the first few rows of the gapminder data frame using the head command in R. “Head” here means the top; there is a corresponding command tail which will print the last few rows of a data frame.
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
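The output above could be reproduced with a chunk like the following sketch, which assumes the gapminder package is already installed. The `n` argument (how many rows to print) is not mentioned above; it is shown only as an optional extra.

```r
library(gapminder)

# First six rows (the default for head)
head(gapminder)

# Last six rows
tail(gapminder)

# Both commands accept an optional n argument;
# this prints only the first three rows
head(gapminder, n = 3)
```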
Let’s start by considering how the life expectancy is distributed in the entire data set.
Remember that the “grammar” of functions in mosaic follows the structure
Our goal is to create a histogram, so we will use the gf_histogram function. The “gf” here is an abbreviation of “ggformula”, which is the package that mosaic uses for plotting. And the “gg” in “ggformula” stands for grammar of graphics, a system for organizing graphical displays of information created by Leland Wilkinson.
gf_histogram(~ lifeExp, data = gapminder)
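The mosaic documentation usually writes its template as `goal(y ~ x, data = mydata)`, and the same formula-plus-data pattern applies to numerical summaries as well as plots. The sketch below assumes mosaic and gapminder are installed; `favstats` and the formula version of `mean` are mosaic functions that are not used elsewhere in this lab, so treat them as illustrative extras.

```r
library(mosaic)
library(gapminder)

# One-sided formula: summarize a single variable
favstats(~ lifeExp, data = gapminder)

# Two-sided formula: mean life expectancy within each continent
continent_means <- mean(lifeExp ~ continent, data = gapminder)
continent_means
```

In both calls the "goal" is the function name, the formula says which variables to use, and `data` says where to find them, just as in `gf_histogram`.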
We can add a rug plot to the histogram using the “pipe” command %>% from the dplyr package (which is automatically loaded along with mosaic). You can read a command like a %>% b as “do a and then do b.” So to generate a histogram and then add a rug plot, the command is:
gf_histogram(~ lifeExp, data = gapminder) %>%
gf_rugx(~ lifeExp, data = gapminder)
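To see the “do a and then do b” reading on something smaller than a plot, here is a short sketch using plain numbers. It assumes dplyr is installed; the vector `x` is made up for illustration.

```r
library(dplyr)  # provides the %>% pipe

x <- c(1, 4, 9)

# Nested form: read from the inside out
nested <- sum(sqrt(x))

# Piped form: take x, then take square roots, then add them up
piped <- x %>% sqrt() %>% sum()

nested
piped
```

Both versions return 6; the piped form just lists the steps in the order they happen.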
We can just as easily generate a density plot, now using the gf_dens command, to which we add a rug plot as before:
gf_dens(~ lifeExp, data = gapminder) %>%
gf_rugx(~ lifeExp, data = gapminder)
One nice feature of ggformula is that it easily allows you to subset your data based on a categorical variable. Notice that each row in the data set has a continent variable associated with it, which indicates the continent on which a given country is located. To subset our density plots by continent, we can use the “|” command. This is analogous to the “given” notation we use in probability. Recall from introductory statistics that \(P(A \mid B)\) asks for the probability of an event \(A\) given that we know event \(B\) occurred.
To generate the density plots subset by continent and add a rug plot, run:
gf_dens(~ lifeExp | continent, data = gapminder) %>%
gf_rugx(~ lifeExp)
We can also plot all of the density plots at once, color-coded by continent, by using the col argument to gf_dens. This is more useful if we want to directly compare the distribution of life expectancy across the different continents:
gf_dens(~ lifeExp, col = ~ continent, data = gapminder) %>%
gf_rugx(~ lifeExp, col = ~continent)
Each row of the data frame also has a year variable, which we can use to subset the data by year:
gf_dens(~ lifeExp | year, data = gapminder) %>%
gf_rugx(~ lifeExp, data = gapminder)
If we want to know how the distribution of life expectancy has changed over time, we might try color-coding by year like we did above:
gf_dens(~ lifeExp, col = ~ year, data = gapminder) %>%
gf_rugx(~ lifeExp, data = gapminder, col = ~year)
we get a plot that doesn’t make a ton of sense. Why was only one density plotted? It is because the year variable is an integer:
## [1] "integer"
which is a quantitative variable, so ggformula has treated it as such.
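The type check whose output appears above can be run with `class`. This sketch assumes the gapminder package is installed; the comparison with `continent` is an added illustration.

```r
library(gapminder)

# year is stored as whole numbers, so R treats it as quantitative
class(gapminder$year)

# continent, by contrast, is already stored as a factor (categorical)
class(gapminder$continent)
```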
If we want to treat year as a categorical variable, we can create a new variable yearFac using the factor command. The factor command generates a “factor” (R’s name for a categorical variable):
gapminder$yearFac = factor(gapminder$year)
gf_dens(~ lifeExp, col = ~ yearFac, data = gapminder) %>%
gf_rugx(~ lifeExp, data = gapminder, col = ~yearFac)
Now we can clearly see that, from 1952 to 2007, the distribution of life expectancies has shifted higher over time.
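A quick way to confirm what `factor` did is to look at the class and levels of the result. The sketch below assumes gapminder is installed and stores the factor in a separate, hypothetical vector `year_fac` so that it runs on its own.

```r
library(gapminder)

# Convert the integer years into a categorical variable
year_fac <- factor(gapminder$year)

# The result is a factor rather than an integer
class(year_fac)

# Each distinct year becomes one level (one category)
levels(year_fac)
nlevels(year_fac)
```

The data set records every fifth year from 1952 to 2007, so there are 12 levels, which is why the color-coded plot shows 12 separate densities.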
You can easily create almost all of the graphical summaries you learned about in introductory statistics using ggformula, as long as you remember to look for the appropriate gf_* function, and to use ggformula’s grammar.
For example, to create a boxplot from all of the life expectancies, you can use the following command:
gf_boxplot(~ lifeExp, data = gapminder)
To subset by continent, you use the | functionality, as before:
gf_boxplot(~ lifeExp | continent, data = gapminder)
To plot all of the boxplots of the life expectancy-by-continent on the same plot, you can color code by continent, as before:
gf_boxplot(~ lifeExp + continent, col = ~continent, data = gapminder)
Another useful data visualization you may not have seen is a violin plot. A violin plot is just a density plot, mirrored on either side:
gf_violin(~ lifeExp + continent, col = ~continent, data = gapminder)
You can also add a boxplot to the violin plot, again using the pipe operator:
gf_violin(~ lifeExp + continent, col = ~continent, data = gapminder, alpha = 0.2) %>%
gf_boxplot(alpha = 0.2)
The alpha argument here sets the transparency of the plot, so we can see both the violin plots and the boxplots at the same time.
One of the main skills you will learn in MA 440 is how to describe and model the relationship between two quantitative variables. The first step towards any such description is creating a scatter plot from the two variables. To generate a scatter plot using ggformula, we will use the gf_point function (“point” because we are plotting points).
For example, to create a scatter plot of lifeExp versus gdpPercap, run the command:
gf_point(lifeExp ~ gdpPercap, data = gapminder)
Remember that ggformula’s formula syntax is always y ~ x, so the variable to the left of the tilde is what we want on the vertical axis, and the variable to the right of the tilde is what we want on the horizontal axis.
From the scatter plot above, we can almost see a general trend in the association between life expectancy and GDP per capita: it looks like as GDP per capita increases, so does life expectancy, to a point. We can add a “smoother” to the plot to highlight that relationship. This is done using the gf_smooth command:
gf_point(lifeExp ~ gdpPercap, data = gapminder) %>%
gf_smooth()
For the majority of this course, we will focus on linear smoothers, which we can tell R to use by passing 'lm' to the method argument of gf_smooth. lm is short for “linear model,” since we are assuming the relationship between x and y is linear:
gf_point(lifeExp ~ gdpPercap, data = gapminder) %>%
gf_smooth(method = 'lm')
We can see that the linear fit is a pretty bad one: it underpredicts the life expectancy at low GDP / capita, and overpredicts the life expectancy at high GDP / capita. In cases like these, it can make sense to transform either the response (life expectancy) or the predictor (GDP / capita). We can log-transform the gdpPercap variable directly in the gf_point command:
gf_point(lifeExp ~ log(gdpPercap), data = gapminder)
and add the smoother using the log-transformed variable:
gf_point(lifeExp ~ log(gdpPercap), data = gapminder) %>%
gf_smooth()
Now a line looks like it might do a decent job of describing the trend, which we can add as before:
gf_point(lifeExp ~ log(gdpPercap), data = gapminder) %>%
gf_smooth(method = 'lm')
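The straight line drawn by `gf_smooth(method = 'lm')` is a least-squares fit, so its intercept and slope can also be computed directly with base R's `lm` function. This is a sketch that assumes gapminder is installed; the object name `fit` is arbitrary.

```r
library(gapminder)

# Fit life expectancy as a linear function of log GDP per capita
fit <- lm(lifeExp ~ log(gdpPercap), data = gapminder)

# Estimated intercept and slope of the fitted line
coef(fit)
```

The second coefficient is the slope: the estimated change in life expectancy for a one-unit increase in log(gdpPercap).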
The | command works with gf_point exactly the same way it did with gf_histogram, et al. We can look at the relationship between life expectancy and GDP / capita by continent using
gf_point(lifeExp ~ log(gdpPercap) | continent, data = gapminder) %>%
gf_smooth(method = 'lm')
and by year using
gf_point(lifeExp ~ log(gdpPercap) | yearFac, data = gapminder) %>%
gf_smooth(method = 'lm')
Finally, if we just want to look at the trends-by-continent, without including the scatter plots, we can use gf_smooth by itself, coloring by continent:
gf_smooth(lifeExp ~ log(gdpPercap), col = ~continent, data = gapminder)
gf_smooth(lifeExp ~ log(gdpPercap), col = ~continent, method = 'lm', data = gapminder)
The web page that you are currently reading was written in R Markdown and typeset into HTML. R Markdown, as the name suggests, is a spin-off of Markdown, a markup language created by John Gruber and Aaron Swartz. Unlike Microsoft Word, for example, which is WYSIWYG (pronounced “wizzy-wig,” and stands for What You See Is What You Get), Markdown is a lightweight markup language, similar to HTML but much simpler. Everything is written in plain text, and then Markdown (or R Markdown) is interpreted into the desired output like HTML, LaTeX, Word (if you must…), etc.
You can include in-line math in R Markdown by using $ $ around an expression. For example, this:
The expected value of a random variable $X$ is denoted by $\mu$.
will typeset as this:
The expected value of a random variable \(X\) is denoted by \(\mu\).
You can include equations by using $$ $$, so that
$$ E[Y] = \beta_{0} + \beta_{1} X $$
will typeset as: \[ E[Y] = \beta_{0} + \beta_{1} X\]
There are lots of commands to learn in LaTeX. For example, to use “teletype” font, you can use the \texttt command. LaTeX commands that take arguments are followed by curly braces {} enclosing those arguments. So this:
$$ E[\texttt{lifeExp}] = \beta_{0} + \beta_{1} \log(\texttt{gdpPercap}) $$
will typeset as this: \[ E[\texttt{lifeExp}] = \beta_{0} + \beta_{1} \log(\texttt{gdpPercap})\]
The hardest part of learning LaTeX is gaining familiarity with all of the commands for symbols, typesetting, etc. “Back in my day,” the only way to do this was by browsing through a textbook¹ or consulting someone who already knew LaTeX. Now, there are tools that make finding a given command easier. Two that I can recommend are:
With Mathpix Snip, you can take a photo of either typeset (in a book, webpage, etc.) or hand-written (on a sheet of paper, white board, etc.) mathematics, and Snip will attempt to convert it to LaTeX.
Try typesetting the following expression (for the density of a standard Gaussian random variable) using Mathpix Snip as a guide: \[ f(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}z^{2}}\]
Now that you’ve completed the lab, you can “knit” the R Markdown file together into an HTML file using the R package knitr. To do this, either select
File > Knit Document
from the File menu, or click the Knit icon in the ribbon of the Rmd file (the blue ball of yarn).
This will create a file called lab1.html that contains all of the code chunks you wrote and the tables and plots they generated.
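Knitting can also be started from the console rather than the menu. The sketch below uses `rmarkdown::render` and assumes that the rmarkdown package is installed and that lab1.Rmd sits in the current working directory.

```r
rmd_file <- "lab1.Rmd"

# Only try to knit if the file is actually in the working directory
if (file.exists(rmd_file)) {
  # Writes the output file next to the .Rmd file
  rmarkdown::render(rmd_file)
} else {
  message("Could not find ", rmd_file, " in ", getwd())
}
```

If the file is not found, the message prints the working directory, which is usually the first thing to check.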
If you are using RStudio Cloud, the files you created are hosted remotely on their server. To download files from the RStudio Cloud server, select the check box next to each file you want to download, click the “More” (gear) icon, and select the “Export…” option from the drop-down menu:
Click the Download button in the resulting dialog box. This will save a zip file to your computer containing the selected files.
You should submit both your lab1.Rmd file and the lab1.nb.html file.
¹ When I took the equivalent of Intro to Math Reasoning at my undergraduate institution, we had to typeset all of our homework in LaTeX. I spent 30% of my time solving the problems, and 70% of my time typesetting the solutions. Fortunately for you, the user-friendliness of LaTeX has increased considerably since 2007.