Study Guide for Exam 1

This will be a closed-book exam. You will need a calculator. A calculator that performs arithmetic operations, reciprocals, square roots, powers and logarithms (base 10) is sufficient. Graphing calculators are permitted.

You will not be using R or RStudio during the exam, but you should know the R commands to generate a given graphic, compute a given statistic, etc.

To do well on the exam, you should be able to do the following:

What is Statistics? (Lecture Notes)

State examples of statistical questions.
State the five main steps in the statistical analysis of data.

Chapter 1

Individuals and variables

Define individual and variable in the context of a data set.
Identify the individuals and variables given the description of a data set.
Explain how the concept of individual and variable relates to the rows and columns of a typical spreadsheet containing a data set.

Identifying categorical and quantitative variables

State the characteristics of quantitative and categorical variables.
Given a variable, determine whether it is quantitative or categorical.

Quantitative variables: histograms

Construct a rug plot by hand given a (small) data set.
Construct a histogram by hand given a (small) data set and the desired bin width and starting boundary.

Interpreting histograms

Identify key characteristics of the shape of a histogram.
Identify key characteristics of the center of a histogram.
Identify key characteristics of the spread of a histogram.
Identify potential outliers using a histogram in conjunction with a rug plot.
Interpret a histogram in the context of the data set it summarizes.

Chapter 2

Measures of center: median, mean

State the formula for the sample mean of a data set.
Recognize the notation \(\bar{x}\) for the sample mean.
Compute the sample mean of a (small) data set by hand.
State a physical interpretation of the sample mean in terms of the rug plot of the data.
State the definition of the sample median.
Compute the sample median of a (small) data set by hand.
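
As a refresher for the objectives above, here is a minimal worked example of the sample mean and sample median. The five data values are made up for illustration and do not come from the course materials.

```latex
% Sample mean of the observations x_1, ..., x_n
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

% Made-up data, already sorted: 2, 4, 4, 5, 10  (n = 5)
\[ \bar{x} = \frac{2 + 4 + 4 + 5 + 10}{5} = \frac{25}{5} = 5 \]

% Sample median: the middle value of the sorted data when n is odd
% (the average of the two middle values when n is even)
\[ \text{median} = 4 \]
```

The large value 10 pulls the mean (5) above the median (4). The mean is commonly described as the balance point of the rug plot, which is one way to phrase its physical interpretation.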

Measures of spread: percentiles, standard deviation

State the formula for the sample variance and sample standard deviation.
Recognize the notation \(s\) for the sample standard deviation.
Compute the sample variance and sample standard deviation of a (small) data set by hand.
Define the sample quartiles of a data set.
Recognize the notation \(Q_{1}\), \(Q_{2}\), and \(Q_{3}\) for the first, second, and third sample quartiles.
Define the interquartile range of a data set.
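
A minimal worked example of the measures of spread listed above, using a made-up data set (2, 4, 4, 5, 10) whose sample mean is 5. The symbol \(s^2\) for the sample variance and the abbreviation IQR are notation assumed here. Textbooks differ on how to compute \(Q_1\) and \(Q_3\) for small data sets, so only the definition of the interquartile range is shown.

```latex
% Sample variance and sample standard deviation
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2} \]

% Made-up data: 2, 4, 4, 5, 10 with \bar{x} = 5
% Deviations from the mean: -3, -1, -1, 0, 5
\[ s^2 = \frac{(-3)^2 + (-1)^2 + (-1)^2 + 0^2 + 5^2}{5 - 1} = \frac{36}{4} = 9, \qquad s = \sqrt{9} = 3 \]

% Interquartile range: distance between the third and first quartiles
\[ \mathrm{IQR} = Q_3 - Q_1 \]
```

The second quartile \(Q_{2}\) is the sample median.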

Graphical displays of numerical summaries

State the five components of the “five-number summary.”
Draw a boxplot for a data set given a five-number summary of the data set.
Interpret a boxplot in terms of symmetry versus skewness of a distribution and identification of outliers.

Chapter 3

Explanatory and response variables

Define explanatory and response variables in terms of their roles in predicting one variable from another.
Given a predictive question about two variables, identify which variable is the explanatory variable and which is the response variable.

Relationship between two quantitative variables: scatterplots

Interpret a scatter plot showing the relationship between an explanatory variable and response variable in a data set.
Construct a scatter plot from a (small) data set with two quantitative variables.

Adding categorical variables to scatterplots

Explain how to construct a scatter plot that also includes the value of a categorical variable for each individual in a data set.

Measuring linear association: correlation

State the formula for the sample covariance and sample correlation.
State the major properties of the sample correlation.
Given a scatter plot, identify whether the sample correlation for the points is positive, negative, or nearly zero.
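
The formulas named above can be sketched as follows. The symbols \(s_{xy}\) for the sample covariance and \(r\) for the sample correlation are assumed notation, and the five \((x, y)\) pairs are made up for illustration.

```latex
% Sample covariance of the paired observations (x_1, y_1), ..., (x_n, y_n)
\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]

% Sample correlation: covariance divided by the two sample standard deviations
\[ r = \frac{s_{xy}}{s_x \, s_y}, \qquad -1 \le r \le 1 \]

% Made-up pairs: (1, 2), (2, 4), (3, 5), (4, 4), (5, 5), so \bar{x} = 3 and \bar{y} = 4
\[ s_{xy} = \frac{(-2)(-2) + (-1)(0) + (0)(1) + (1)(0) + (2)(1)}{4} = \frac{6}{4} = 1.5 \]
\[ s_x = \sqrt{\tfrac{10}{4}} \approx 1.581, \qquad s_y = \sqrt{\tfrac{6}{4}} \approx 1.225, \qquad r = \frac{1.5}{\sqrt{2.5 \times 1.5}} \approx 0.775 \]
```

Here \(r\) is positive, which matches a scatter plot whose points tend to rise from left to right.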

Chapter 4

The least-squares regression line

Specify how a regression function is related to the task of predicting one outcome from another.
Specify the form of a simple linear regression of a response variable \(y\) on an explanatory variable \(x\). Equivalently, identify the form of a simple linear regression model for predicting a response variable \(y\) using a predictor variable \(x\).
Recall, from either high school or college algebra / precalculus, the equation for a line and the interpretation of the slope and intercept of the line.
Identify the slope and intercept from a simple linear regression model, and interpret the slope and intercept in the context of a prediction problem.
Use a simple linear regression model to predict a response variable at a given value of the predictor.
Determine the residual / error of a prediction given a simple linear regression model, a value for the explanatory variable, and a response variable at that value of the explanatory variable.
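
A minimal worked example of prediction and residuals. The symbols \(a\) (intercept), \(b\) (slope), and \(\hat{y}\) (predicted response) are assumed notation, and the fitted line and data point are hypothetical.

```latex
% Form of a simple linear regression line: intercept a, slope b
\[ \hat{y} = a + b x \]

% Hypothetical fitted line with intercept 2.2 and slope 0.6
\[ \hat{y} = 2.2 + 0.6 x \]

% Prediction at x = 4
\[ \hat{y} = 2.2 + 0.6(4) = 4.6 \]

% Residual for an observed response y = 4 at x = 4: observed minus predicted
\[ y - \hat{y} = 4 - 4.6 = -0.6 \]
```

In this sketch, the slope 0.6 is the change in the predicted response for a one-unit increase in \(x\), and the intercept 2.2 is the predicted response at \(x = 0\). The negative residual means the line over-predicts at this point.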

Association does not imply causation

Give examples where an observed association between two variables does not result from a causal influence from one variable to the other.
Explain the motto “association does not imply causation,” or its more popular counterpart “correlation does not imply causation.”

Nonlinear Relationships (Handout)

Explain why it is sometimes appropriate to transform either the explanatory variable, the response variable, or both, before computing a sample correlation.

Chapter 5

Marginal distributions

Identify when a two-way table is appropriate for summarizing the relationship between an explanatory variable and a response variable.
Given a two-way table, compute the marginal distributions of the two-way table both as counts and percentages / proportions.
Identify what a marginal distribution, when given as a percentage or proportion, should sum to.

Conditional Distributions

Given a “word problem” asking for the proportion of individuals with one characteristic who have another characteristic, identify the appropriate numerator and denominator using a two-way table.
Explain why a conditional distribution is so-named.
Given a two-way table, compute the conditional distributions for one of the categorical variables as percentages / proportions.
Identify what a conditional distribution, when given as a percentage or proportion, should sum to.
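
The marginal and conditional computations can be sketched with a small two-way table. The counts and the labels (Group A, Group B, Yes, No) are made up for illustration. The `\text` command assumes the amsmath package.

```latex
% Made-up counts: rows = explanatory variable, columns = response variable
\[
\begin{array}{l|cc|c}
               & \text{Yes} & \text{No} & \text{Total} \\ \hline
\text{Group A} & 30 & 20 & 50 \\
\text{Group B} & 10 & 40 & 50 \\ \hline
\text{Total}   & 40 & 60 & 100
\end{array}
\]

% Marginal distribution of the response: column totals over the grand total
\[ \frac{40}{100} = 0.40, \qquad \frac{60}{100} = 0.60, \qquad 0.40 + 0.60 = 1 \]

% Conditional distribution of the response given Group A: row counts over the row total
\[ \frac{30}{50} = 0.60, \qquad \frac{20}{50} = 0.40, \qquad 0.60 + 0.40 = 1 \]
```

For a question such as "what proportion of individuals in Group A answered Yes?", the denominator is the Group A row total (50) and the numerator is the count in the Group A, Yes cell (30).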

Association versus Causation for Categorical Variables (Lecture Notes)

Distinguish between association and causation for categorical variables.
Posit potential “lurking variables” for an observed association between two categorical variables.

Chapter 6

Observation versus experiment

Compare and contrast observational and experimental studies.
Given the description of a study, identify whether the study was observational or experimental.
Define confounding in terms of the observed association between an explanatory variable and a response variable.
Propose confounding variables that might explain an association in an observational study.

Sampling

Compare and contrast a population and a sample.
Given the description of a study, identify the target population, the sample, and the actual population.

Sampling Designs

Define sampling design.
Explain bias in a sampling design.
Given the description of a study and its sampling design, identify possible biases in its sampling design.
Define a convenience sampling design.
Define a simple random sample (SRS) sampling design.

R (Lecture Notes)

Use R for basic arithmetic.
Load a data frame into RStudio.
Load a package (such as mosaic) into RStudio.
State the grammar used by functions in mosaic.
Generate a rug plot using mosaic’s gf_rugx function.
Generate a histogram using mosaic’s gf_histogram function, and specify the bin width and starting boundary by passing arguments to gf_histogram.
Add a rug plot to a histogram generated by mosaic using the %>% operator.
Compute the sample mean, sample median, sample standard deviation, and five-number summary using mosaic in R.
Plot a boxplot using mosaic in R.
Construct a scatter plot using mosaic in R.
Construct a scatter plot that includes a categorical variable using mosaic in R.
Compute the sample correlation between two variables using mosaic in R.
Compute \(\log_{10} (x)\) in R.
Compute the sample correlation between transformed explanatory and response variables.
Perform a simple linear regression using R.
Relate the output of lm to the slope and intercept of the fitted simple linear regression equation.
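
The commands listed above can be sketched in one short script. This is a minimal review sketch, not the course's official code: the data frame `toy`, its columns `x`, `y`, and `group`, and the file name `my_data.csv` are all hypothetical, and the bin width and boundary values are arbitrary.

```r
library(mosaic)  # attaches ggformula too, which supplies the gf_* plotting functions

# Loading a data frame from a CSV file (hypothetical file name, so left commented out):
# my_data <- read.csv("my_data.csv")

# Made-up toy data frame, for illustration only
toy <- data.frame(
  x     = c(1, 2, 3, 4, 5),
  y     = c(2, 4, 5, 4, 5),
  group = c("a", "a", "b", "b", "b")
)

# Basic arithmetic and base-10 logarithms
half_sum <- (3 + 4) / 2   # 3.5
log_val  <- log10(1000)   # 3

# mosaic grammar: goal(y ~ x, data = mydata); one-variable summaries use ~ x
x_mean   <- mean(~ x, data = toy)
x_median <- median(~ x, data = toy)
x_sd     <- sd(~ x, data = toy)
x_stats  <- favstats(~ x, data = toy)  # includes min, Q1, median, Q3, max

# Plots are stored in variables here; type a variable name (e.g. p_hist) to draw it
p_rug     <- gf_rugx(~ x, data = toy)
p_hist    <- gf_histogram(~ x, data = toy, binwidth = 1, boundary = 0.5)
p_both    <- p_hist %>% gf_rugx(~ x, data = toy)   # histogram with a rug added
p_box     <- gf_boxplot(~ x, data = toy)
p_scatter <- gf_point(y ~ x, data = toy)
p_color   <- gf_point(y ~ x, data = toy, color = ~ group)  # color by a categorical variable

# Sample correlation, raw and after a log10 transform of both variables
r_raw <- cor(y ~ x, data = toy)
r_log <- cor(log10(toy$x), log10(toy$y))

# Simple linear regression of y on x; coef() returns the intercept, then the slope
fit <- lm(y ~ x, data = toy)
coef(fit)   # (Intercept) = 2.2, x = 0.6
```

In the `lm` output, the value labeled `(Intercept)` is the intercept of the fitted line and the value labeled with the explanatory variable's name (here `x`) is the slope, so this toy fit corresponds to a predicted response of 2.2 + 0.6 x.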