Study Guide for Final Exam

This will be a closed-book exam.

You are permitted one 8.5 inch by 11 inch sheet of handwritten notes (both sides) for the final exam.

You will be allowed to use R on a school computer for computations. This means you will not have access to, nor will you need to use, anything beyond Base R (e.g. you will not need to use mosaic, MUsaic, etc.) for the exam.

You will have 2 hours to complete the exam. The exam will start promptly at 11:35 AM, so please arrive on time.

To do well on the exam, you should be able to do the following:

Chapter 27

Comparing two samples: the Wilcoxon rank sum test

Describe useful exploratory plots to generate before performing Wilcoxon’s Rank Sum Test, and interpret exploratory plots in this context.
Define the rank of a data value in a sample.
Explain the rationale for Wilcoxon’s Rank Sum Test for comparing two population distributions.
State Wilcoxon’s Rank Sum Test Statistic \(W\).
Given a (small) combined sample with the associated ranks, compute Wilcoxon’s Rank Sum Test Statistic.
State appropriate null and alternative hypotheses for Wilcoxon’s Rank Sum Test, given a claim about two populations.
Perform Wilcoxon’s Rank Sum Test using wilcox.test in R.
Interpret wilcox.test’s output for the Rank Sum Test.
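
As a minimal sketch of the rank sum computation in Base R, the code below ranks a combined sample, sums the ranks belonging to the first sample, and then runs wilcox.test. The two small samples and all variable names are made up for illustration. Note that the W reported by wilcox.test is the first sample's rank sum minus n1(n1 + 1)/2, so it can differ by that constant from a rank sum computed by hand.

```r
# Two small made-up samples (illustrative only, not course data)
x <- c(1.1, 2.3, 3.8, 5.0)
y <- c(2.9, 4.4, 6.1, 7.2, 8.5)

# Rank every value within the combined sample
combined_ranks <- rank(c(x, y))

# Rank sum for the first sample: add up the ranks of the x values
rank_sum_x <- sum(combined_ranks[seq_along(x)])

# Rank sum test in Base R (two-sided by default)
n1 <- length(x)
result <- wilcox.test(x, y)
w_reported <- unname(result$statistic)

rank_sum_x                      # rank sum computed by hand
w_reported                      # W as reported by wilcox.test
rank_sum_x - n1 * (n1 + 1) / 2  # matches w_reported
result$p.value
```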

Matched pairs: the Wilcoxon signed rank test

Describe useful exploratory plots to generate before performing Wilcoxon’s Signed Rank Test, and interpret exploratory plots in this context.
Explain the rationale for Wilcoxon’s Signed Rank Test for the median of a distribution.
State the assumption needed for Wilcoxon’s Signed Rank Test.
State Wilcoxon’s Signed Rank Test Statistic \(W^{+}\).
State appropriate null and alternative hypotheses for Wilcoxon’s Signed Rank Test, given a claim about a single population or two matched populations.
Perform Wilcoxon’s Signed Rank Test using wilcox.test in R.
Interpret wilcox.test’s output for the Signed Rank Test.
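
The signed rank statistic can be sketched the same way. The matched measurements below are made up for illustration, and the variable names are assumptions. The code ranks the absolute differences, sums the ranks attached to positive differences, and compares the result with the statistic that wilcox.test labels V when paired = TRUE.

```r
# Made-up matched pairs (illustrative only)
before <- c(121, 104, 98, 115, 130, 109)
after  <- c(110, 109, 81, 95, 127, 82)

# Differences within each pair
d <- before - after

# Rank the absolute differences, then sum the ranks of the positive ones
abs_ranks <- rank(abs(d))
w_plus <- sum(abs_ranks[d > 0])

# Signed rank test in Base R; the reported statistic V is the same sum
result <- wilcox.test(before, after, paired = TRUE)
v_reported <- unname(result$statistic)

w_plus
v_reported
result$p.value
```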

Chapter 24

Comparing several means

Use boxplots generated using gf_boxplot to compare samples from two or more populations.
State the null and alternative hypotheses for a claim about equality amongst three or more population means.
Explain why the alternative hypothesis of an “all means are equal” null hypothesis does not specify precisely which population means, if any, differ.
Recognize that a test related to three or more population means is called an Analysis of Variance.
Recognize the acronym ANOVA for Analysis of Variance.

The \(F\) statistic

State the test statistic for a one-way ANOVA.
Explain the effect on the \(F\)-statistic of increasing / decreasing the variability amongst the sample means and increasing / decreasing the variability within each sample.
Indicate what values of an \(F\)-statistic indicate evidence against a null hypothesis.
Given rug plots and box plots for several samples, indicate whether the \(F\)-statistic will be small or large.
State the sampling distribution of the \(F\) statistic when the null hypothesis is true.
State the type of \(P\)-value (right-sided, left-sided, or two-sided) computed from the \(F\) distribution, and explain why this type of \(P\)-value is used.

The analysis of variance \(F\) test

Use the functions aov and summary to perform an \(F\) test in R.
Interpret the output of an aov object passed to summary.
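
A minimal sketch of the \(F\) test in Base R, assuming three made-up groups of three observations each (the data and names are illustrative). It fits the model with aov, reads the \(F\) statistic and \(P\)-value from summary, and then recomputes \(F\) by hand as the ratio of the variability among the sample means to the variability within the samples. The \(P\)-value is the right-tail area under the \(F\) distribution.

```r
# Made-up data: three groups of three observations
values <- c(4, 5, 6, 7, 8, 9, 10, 11, 12)
group  <- factor(rep(c("A", "B", "C"), each = 3))

# One-way ANOVA with aov and summary
fit <- aov(values ~ group)
tab <- summary(fit)[[1]]
f_aov <- tab[["F value"]][1]
p_aov <- tab[["Pr(>F)"]][1]

# The same F statistic by hand
k <- nlevels(group)   # number of groups
n <- length(values)   # total sample size
grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)
group_sizes <- tapply(values, group, length)

ms_between <- sum(group_sizes * (group_means - grand_mean)^2) / (k - 1)
ms_within  <- sum((values - group_means[group])^2) / (n - k)
f_hand <- ms_between / ms_within

# Right-sided P-value from the F distribution with (k - 1, n - k) df
p_hand <- pf(f_hand, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

c(f_aov = f_aov, f_hand = f_hand)
c(p_aov = p_aov, p_hand = p_hand)
```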

Conditions for ANOVA

State the assumptions of a one-way ANOVA.
State an alternative to a one-way ANOVA when the assumptions of the one-way ANOVA fail or cannot be checked.

Pairwise Comparisons (Lecture Notes for Lecture 21)

Explain why pairwise comparisons are necessary after finding a statistically significant \(F\) statistic.
Recognize Tukey’s Honest Significant Difference as a pairwise comparison method.
Use the function TukeyHSD to perform pairwise comparisons in R.
Interpret the output of TukeyHSD when passed an output from aov, including:
- the estimated pairwise differences
- the lower and upper endpoints of the confidence intervals for the pairwise differences
- the adjusted \(P\)-value for the pairwise comparisons
State the null hypothesis implicit in the \(P\)-values reported by TukeyHSD.
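
As a sketch of what TukeyHSD returns, the code below reuses the same kind of made-up three-group data (illustrative only). Each row of the output compares one pair of groups; the columns diff, lwr, upr, and p adj hold the estimated difference, the lower and upper confidence interval endpoints, and the adjusted \(P\)-value.

```r
# Made-up data: three groups of three observations
values <- c(4, 5, 6, 7, 8, 9, 10, 11, 12)
group  <- factor(rep(c("A", "B", "C"), each = 3))

fit <- aov(values ~ group)

# Pairwise comparisons; the result is a list with one matrix per factor
hsd <- TukeyHSD(fit)
pairwise <- hsd$group

pairwise              # rows: B-A, C-A, C-B
pairwise[, "diff"]    # estimated pairwise differences
pairwise[, "p adj"]   # adjusted P-values
```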

Chapter 21

Hypotheses for goodness of fit

Recognize and state a claim about population proportions for categories of a categorical variable in a population.
Given claimed proportions of categories in a population, state the null and alternative hypothesis corresponding to that claim.
Explain why the alternative hypothesis of a “the population proportions equal the specified values” null hypothesis does not specify precisely which population proportions, if any, differ.

Expected counts and chi-square statistic

Given a one-way table of counts and null values for population proportions, compute the expected count for each category.
Compute the deviation between the observed counts in a sample and the expected counts under the null model.
Compute the \(\chi^{2}\)-statistic given observed counts and population proportions.
Recognize the Greek letter \(\chi\) (“chi”, pronounced “ki” as in “kite”) as the Greek analog to the Roman letter \(x\).
Compute the \(\chi^{2}\)-statistic from observed counts and null proportions using xchisq.test from mosaic.
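
The expected counts and the \(\chi^{2}\)-statistic can be sketched in Base R as follows. The counts and null proportions are made up for illustration. Because mosaic is not available on the exam, this sketch uses chisq.test from Base R in place of xchisq.test; that substitution is an assumption of the example, not the course's stated workflow.

```r
# Made-up one-way table of counts and null proportions
observed <- c(30, 50, 20)
p_null   <- c(0.25, 0.50, 0.25)

# Expected count = sample size times null proportion
expected <- sum(observed) * p_null   # 25, 50, 25

# Deviations and the chi-square statistic
deviation <- observed - expected
chi_sq <- sum(deviation^2 / expected)

# Same statistic from Base R's chisq.test
result <- chisq.test(observed, p = p_null)

# Right-tail P-value with (number of categories - 1) degrees of freedom
p_hand <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

chi_sq
unname(result$statistic)
c(p_hand = p_hand, p_test = result$p.value)
```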

The chi-square test for goodness of fit

Explain why it is more appropriate to call the \(\chi^{2}\) “goodness-of-fit” test a “lack-of-fit” test.
Interpret the output of xchisq.test in terms of a \(\chi^{2}\) lack-of-fit test.
Use the output of xchisq.test to test a hypothesis about the proportions of some category in a population.

Interpreting significant chi-square results

Construct simultaneous confidence intervals for the population proportions of a categorical variable using gf_pop_props from MUsaic.
Interpret the output of gf_pop_props.
State the implicit null hypothesis tested by checking for inclusion of a null population proportion in a confidence interval returned by gf_pop_props.

Conditions for the chi-square test

State the assumptions on a sample for the \(\chi^{2}\) statistic to follow a \(\chi^{2}\) distribution.

Chapter 22

Two-way tables

Explain how a two-way table can be used to summarize counts of a categorical variable across two or more populations.
Give an analogy between one-, two-, and multi-sample tests for population means and a two-way table for testing claims about proportions of a category across one, two, and more than two populations.
State the convention we use in this class for what the rows and columns of a two-way table correspond to.

Hypotheses for two-way tables of counts

Explain what we mean by two variables being associated.
Explain what we mean by two variables being independent.
Describe how independence between two variables manifests in a two-way table.
State the generic form of the null and alternative hypotheses about two categorical variables in a population.
Given a problem about two categorical variables, identify the relevant null and alternative hypotheses from the problem.

Expected counts and the chi-square statistic

Explain under what hypothesis the expected counts for the \(\chi^{2}\) statistic for association are calculated.
State the \(\chi^{2}\) statistic for association.
State the sampling distribution of the \(\chi^{2}\) statistic for association under the null hypothesis.
Construct a matrix from a two-way table using matrix in R.
Compute the \(\chi^{2}\) statistic for association using xchisq.test from mosaic.
Interpret the output of xchisq.test in terms of the \(\chi^{2}\) statistic for association.
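
A minimal sketch of the \(\chi^{2}\) statistic for association in Base R. The counts are made up, and placing groups in rows and outcomes in columns is an assumption of this example rather than a statement of the class convention. The sketch uses chisq.test from Base R in place of xchisq.test, with correct = FALSE so that no continuity correction is applied and the result matches the hand computation.

```r
# Made-up two-way table, filled row by row
counts <- matrix(c(20, 30,
                   30, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group   = c("Treatment", "Control"),
                                 outcome = c("Yes", "No")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Chi-square statistic for association
chi_sq <- sum((counts - expected)^2 / expected)

# Same statistic from chisq.test, without the continuity correction
result <- chisq.test(counts, correct = FALSE)

expected
chi_sq
unname(result$statistic)
result$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
result$p.value
```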

The chi-square test

Interpret the output of xchisq.test in terms of the \(\chi^{2}\) test for association.
Perform a hypothesis test for association using xchisq.test.

Conditions for the chi-square test

State the assumptions on the data for the \(\chi^{2}\) statistic for association to follow a \(\chi^{2}\) distribution.

Chapter 9

Risk and odds

Define the population risk of a negative outcome.
Define the population odds of a negative outcome.
Compute the sample risk and sample odds given a table summarizing positive and negative outcomes in a sample.
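
As a small worked example with made-up counts, the sample risk is the proportion of negative outcomes among all outcomes, while the sample odds is the number of negative outcomes divided by the number of positive outcomes. The variable names are illustrative.

```r
# Made-up counts of outcomes in a sample
negative <- 15
positive <- 85

# Risk: negative outcomes out of all outcomes
risk <- negative / (negative + positive)

# Odds: negative outcomes relative to positive outcomes
odds <- negative / positive

risk
odds
risk / (1 - risk)   # the odds can also be recovered from the risk
```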

Chapter 20

Two-sample problems: proportions

Give examples of two-sample problems that involve proportions.

Relative risk and odds ratios

Define the population relative risk of a negative outcome in a treatment group compared to a control group.
Define the population odds ratio of a negative outcome in a treatment group compared to a control group.
Compute the sample relative risk and sample odds ratio given a two-way table.
State what value of relative risk / odds ratio corresponds to “no difference” in risk between the treatment and control populations.
Explain why relative risks and odds ratios are most appropriately considered on a logarithmic scale.
Recognize that odds are not risks, and odds ratios are not relative risks.
Interpret relative risks and odds ratios in terms of whether they indicate the treatment or control condition leads to lower risk.
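
The sample relative risk and odds ratio can be sketched in Base R from a two-way table. The counts below are made up, and the table layout (groups in rows, outcomes in columns) is an assumption of this example. Both ratios compare the treatment group to the control group, so a value of 1 corresponds to no difference and a value below 1 indicates lower risk under treatment.

```r
# Made-up two-way table: rows are groups, columns are outcomes
counts <- matrix(c(10, 90,
                   20, 80),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group   = c("Treatment", "Control"),
                                 outcome = c("Negative", "Positive")))

# Sample risk and odds of a negative outcome within each group
risk <- counts[, "Negative"] / rowSums(counts)
odds <- counts[, "Negative"] / counts[, "Positive"]

# Treatment relative to control
relative_risk <- unname(risk["Treatment"] / risk["Control"])
odds_ratio    <- unname(odds["Treatment"] / odds["Control"])

relative_risk
odds_ratio

# On the log scale, halving and doubling the risk are equally far from 0
log(relative_risk)
log(1 / relative_risk)
```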

Inferences for Relative Risks and Odds Ratios Using R (Lecture Notes for Lecture 24)

State the four conditions on the count of a categorical variable for it to be binomial.
Compute the sample risks, relative risk, and odds ratio from a two-way table using oddsRatio from the mosaic package.
Compute confidence intervals for the population relative risk and odds ratio from a two-way table using oddsRatio from the mosaic package.
Use a confidence interval for either a population relative risk or population odds ratio to test for a difference in risk between a treatment and control population.
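
To illustrate how a confidence interval can be used as a test, the sketch below builds a large-sample (Wald) interval for the odds ratio on the log scale in Base R. This is a common textbook construction and is an assumption of the example; it is not claimed to reproduce the exact output or layout of oddsRatio. The counts are made up.

```r
# Made-up counts: a, b = treatment negative/positive; c, d = control
a <- 10; b <- 90
c_neg <- 20; d <- 80

# Sample odds ratio, treatment relative to control
odds_ratio <- (a / b) / (c_neg / d)

# Standard error of the log odds ratio (large-sample approximation)
se_log_or <- sqrt(1 / a + 1 / b + 1 / c_neg + 1 / d)

# 95% interval on the log scale, then transformed back
z <- qnorm(0.975)
ci <- exp(log(odds_ratio) + c(-1, 1) * z * se_log_or)

odds_ratio
ci

# If the interval contains 1, the data are consistent with no difference
contains_one <- ci[1] <= 1 && 1 <= ci[2]
contains_one
```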

Chapter 23

The regression parameters

State the assumptions of the Simple Linear Regression with Normal Noise (SLRNN) model.
State the three parameters of the SLRNN model.
Relate the components of the SLRNN model (intercept, slope, errors) to their sample analogs (intercept, slope, residuals).
Explain in what sense the SLRNN model is “a line plus noise.”
State the estimator for the standard deviation of the noise term in the SLRNN model.
Identify the estimates for the intercept, slope, and standard deviation of the noise term in an output from summary.
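
As a sketch of where the three estimates appear in R, the code below fits a simple linear regression to made-up data (the values and names are illustrative). The intercept and slope come from the coefficient table, and the estimate of the noise standard deviation is the value that summary prints as the residual standard error. It is recomputed by hand using n - 2 in the denominator.

```r
# Made-up data (illustrative only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

fit <- lm(y ~ x)
fit_summary <- summary(fit)

# Estimated intercept and slope
intercept_hat <- unname(coef(fit)["(Intercept)"])
slope_hat     <- unname(coef(fit)["x"])

# Estimated standard deviation of the noise, from summary
sigma_hat <- fit_summary$sigma

# The same estimate by hand from the residuals
n <- length(y)
sigma_hand <- sqrt(sum(resid(fit)^2) / (n - 2))

c(intercept = intercept_hat, slope = slope_hat)
c(sigma_hat = sigma_hat, sigma_hand = sigma_hand)
```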

Checking the conditions for inference

Use the following four plots to diagnose the validity of the SLRNN model from residuals of a simple linear regression:
a. A Q-Q plot of the residuals.
b. A plot of the residuals against the fitted values of the response.
c. A plot of the squared residuals against fitted values of the response.
d. A plot of the residuals against the individual index.
Identify which SLRNN model assumption is checked using the four residual plots from the previous learning objective.
Use mplot and gf_line in mosaic to generate residual diagnostic plots from an output from lm.
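
The four diagnostic plots can also be sketched with Base R graphics. This is a substitute for mplot and gf_line, which are not available without mosaic, and it uses made-up data. The comments note the assumption each plot is commonly used to check; confirm the mapping against the lecture notes.

```r
# Made-up data (illustrative only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.3, 3.8, 6.1, 8.2, 9.7, 12.4, 13.8, 16.1)

fit <- lm(y ~ x)
res <- resid(fit)
fit_vals <- fitted(fit)

old_par <- par(mfrow = c(2, 2))

# a. Q-Q plot of the residuals: Normality of the noise
qqnorm(res)
qqline(res)

# b. Residuals against fitted values: linearity of the mean
plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# c. Squared residuals against fitted values: constant noise variance
plot(fit_vals, res^2, xlab = "Fitted values", ylab = "Squared residuals")

# d. Residuals against index: independence of the noise
plot(seq_along(res), res, type = "b", xlab = "Index", ylab = "Residuals")
abline(h = 0, lty = 2)

par(old_par)
```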

Testing the hypothesis of no linear relationship

State the \(T\)-statistic for the sample slope and its sampling distribution under the null hypothesis when the SLRNN model assumptions hold.
State the estimate of the standard error of the slope estimate in simple linear regression.
Identify the estimate of the standard error of the slope estimate in an output from summary.
Perform a hypothesis test for a claim about a population slope.
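
A minimal sketch of the slope test in Base R, using made-up data and a null value of 0 for the population slope. The \(T\)-statistic is the estimated slope divided by its standard error, and the two-sided \(P\)-value comes from a \(t\) distribution with n - 2 degrees of freedom. Both match the corresponding row of the coefficient table from summary.

```r
# Made-up data (illustrative only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

fit <- lm(y ~ x)
coef_table <- summary(fit)$coefficients

# Slope estimate and its estimated standard error
slope_hat <- coef_table["x", "Estimate"]
slope_se  <- coef_table["x", "Std. Error"]

# T-statistic for the null hypothesis that the population slope is 0
t_stat <- (slope_hat - 0) / slope_se

# Two-sided P-value from a t distribution with n - 2 degrees of freedom
n <- length(y)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(t_stat = t_stat, p_value = p_value)
coef_table["x", c("t value", "Pr(>|t|)")]
```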

Confidence intervals for the regression slope

State the confidence interval for a population slope under the SLRNN model assumptions.
Construct a confidence interval for a population slope using the output from summary and qt.
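
As a sketch of the interval construction, the code below combines the slope estimate and standard error from summary with a critical value from qt, using made-up data and a 95% confidence level chosen for illustration. The result is compared with Base R's confint as a check.

```r
# Made-up data (illustrative only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

fit <- lm(y ~ x)
coef_table <- summary(fit)$coefficients

slope_hat <- coef_table["x", "Estimate"]
slope_se  <- coef_table["x", "Std. Error"]

# Critical value from a t distribution with n - 2 degrees of freedom
n <- length(y)
conf_level <- 0.95
t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 2)

# Interval: estimate plus or minus critical value times standard error
ci_hand <- slope_hat + c(-1, 1) * t_crit * slope_se

ci_hand
confint(fit, "x", level = conf_level)
```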