Study Guide for Final Exam
This will be a closed-book exam.
You are permitted one 8.5 inch by 11 inch sheet of handwritten notes (both sides) for the final exam.
You will be allowed to use R on a school computer for computations. This means you will not have access to, nor will you need to use, anything beyond Base R (e.g. you will not need to use mosaic, MUsaic, etc.) for the exam.
You will have 2 hours to complete the exam. The exam will start promptly at 11:35 AM, so please arrive on time.
To do well on the exam, you should be able to do the following:
Chapter 27
Comparing two samples: the Wilcoxon rank sum test
- Describe useful exploratory plots to generate before performing Wilcoxon’s Rank Sum Test, and interpret exploratory plots in this context.
- Define the rank of a data value in a sample.
- Explain the rationale for Wilcoxon’s Rank Sum Test for comparing two population distributions.
- State Wilcoxon’s Rank Sum Test Statistic \(W\).
- Given a (small) combined sample with the associated ranks, compute Wilcoxon’s Rank Sum Test Statistic.
- State appropriate null and alternative hypotheses for Wilcoxon’s Rank Sum Test, given a claim about two populations.
- Perform Wilcoxon’s Rank Sum Test using wilcox.test in R.
- Interpret wilcox.test’s output for the Rank Sum Test.
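As a rough illustration of the rank sum objectives above, the following base R sketch ranks a small combined sample, sums the ranks belonging to the first sample, and then runs wilcox.test. The data values and variable names are made up for illustration. The sketch assumes \(W\) is defined as the sum of the ranks of the first sample; note that the statistic wilcox.test reports (also labeled W) is that rank sum minus \(n_1(n_1 + 1)/2\), where \(n_1\) is the size of the first sample.

```r
# Hypothetical samples from two populations (no ties)
x <- c(1.1, 2.3, 4.0, 5.2)
y <- c(3.1, 6.5, 7.2, 8.8, 9.0)

# Rank every value within the combined sample
combined_ranks <- rank(c(x, y))

# Rank sum for the first sample (assumed definition of W)
W_rank_sum <- sum(combined_ranks[seq_along(x)])

# wilcox.test with two samples performs the rank sum test
result <- wilcox.test(x, y, alternative = "two.sided")

# R reports the rank sum shifted down by n1 * (n1 + 1) / 2
n1 <- length(x)
W_reported <- unname(result$statistic)

print(W_rank_sum)
print(W_reported)
print(result$p.value)
```

A small P-value here would be evidence that one population tends to produce larger values than the other.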
Matched pairs: the Wilcoxon signed rank test
- Describe useful exploratory plots to generate before performing Wilcoxon’s Signed Rank Test, and interpret exploratory plots in this context.
- Explain the rationale for Wilcoxon’s Signed Rank Test for the median of a distribution.
- State the assumption needed for Wilcoxon’s Signed Rank Test.
- State Wilcoxon’s Signed Rank Test Statistic \(W^{+}\).
- State appropriate null and alternative hypotheses for Wilcoxon’s Signed Rank Test, given a claim about a single population or two matched populations.
- Perform Wilcoxon’s Signed Rank Test using wilcox.test in R.
- Interpret wilcox.test’s output for the Signed Rank Test.
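The signed rank objectives above can be sketched in base R as follows. The matched-pair data are hypothetical. The sketch computes \(W^{+}\) by hand as the sum of the ranks of the absolute differences that belong to positive differences, and compares it with the statistic that wilcox.test reports (labeled V) when paired = TRUE.

```r
# Hypothetical matched pairs (e.g. a measurement before and after treatment)
before <- c(102, 98, 115, 121, 109, 95)
after  <- c(91, 99, 100, 115, 101, 92)

# Differences within each pair
diffs <- before - after

# Rank the absolute differences, then sum the ranks of the positive ones
abs_ranks <- rank(abs(diffs))
W_plus <- sum(abs_ranks[diffs > 0])

# paired = TRUE requests the signed rank test on before - after
result <- wilcox.test(before, after, paired = TRUE, alternative = "two.sided")

print(W_plus)
print(unname(result$statistic))
print(result$p.value)
```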
Chapter 24
Comparing several means
- Use boxplots generated using gf_boxplot to compare samples from two or more populations.
- State the null and alternative hypotheses for a claim about equality amongst three or more population means.
- Explain why the alternative hypothesis of an “all means are equal” null hypothesis does not specify precisely which population means, if any, differ.
- Recognize that a test related to three or more population means is called an Analysis of Variance.
- Recognize the acronym ANOVA for Analysis of Variance.
The \(F\) statistic
- State the test statistic for a one-way ANOVA.
- Explain the effect on the \(F\)-statistic of increasing / decreasing the variability amongst the sample means and increasing / decreasing the variability within each sample.
- Indicate what values of an \(F\)-statistic indicate evidence against a null hypothesis.
- Given rug plots and box plots for several samples, indicate whether the \(F\)-statistic will be small or large.
- State the sampling distribution of the \(F\) statistic when the null hypothesis is true.
- State the type of \(P\)-value (right-sided, left-sided, or two-sided) computed from the \(F\) distribution, and explain why this type of \(P\)-value is used.
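To make the \(F\) statistic objectives above concrete, here is a minimal base R sketch that computes the statistic by hand as the ratio of the variability among the sample means to the variability within the samples. The data and the names msg and mse are illustrative assumptions, not notation taken from the course. The P-value is right-sided because only large values of \(F\) count as evidence against the null hypothesis.

```r
# Hypothetical data: three samples of size 3
values <- c(4, 5, 6, 7, 8, 9, 10, 11, 12)
group  <- factor(rep(c("A", "B", "C"), each = 3))

n_total <- length(values)   # total number of observations
k       <- nlevels(group)   # number of groups

grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)
group_sizes <- tapply(values, group, length)

# Variability among the sample means (mean square for groups)
msg <- sum(group_sizes * (group_means - grand_mean)^2) / (k - 1)

# Variability within the samples (mean square for error);
# ave() returns each observation's own group mean
mse <- sum((values - ave(values, group))^2) / (n_total - k)

F_stat <- msg / mse

# Right-sided P-value from the F distribution with (k - 1, n_total - k) df
p_value <- pf(F_stat, df1 = k - 1, df2 = n_total - k, lower.tail = FALSE)

print(F_stat)
print(p_value)
```

Spreading the group means further apart increases msg and hence \(F\); adding spread within each group increases mse and shrinks \(F\).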
The analysis of variance \(F\) test
- Use the functions aov and summary to perform an \(F\) test in R.
- Interpret the output of an aov object passed to summary.
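A minimal sketch of the aov and summary workflow, using a hypothetical data frame (the names scores, value, and group are made up for illustration):

```r
# Hypothetical data: one response measured in three groups of size 4
scores <- data.frame(
  value = c(12, 15, 14, 13, 18, 17, 19, 18, 14, 16, 15, 15),
  group = factor(rep(c("A", "B", "C"), each = 4))
)

# Fit the one-way ANOVA model: response ~ grouping factor
fit <- aov(value ~ group, data = scores)

# summary() prints the ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)
print(summary(fit))

# The table can also be pulled out as a data frame to extract single entries
anova_table <- summary(fit)[[1]]
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
```

In the printed table, the row for group holds the between-group quantities, the Residuals row holds the within-group quantities, and Pr(>F) is the right-sided P-value.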
Conditions for ANOVA
- State the assumptions of a one-way ANOVA.
- State an alternative to a one-way ANOVA when the assumptions of the one-way ANOVA fail or cannot be checked.
Pairwise Comparisons (Lecture Notes for Lecture 21)
- Explain why pairwise comparisons are necessary after finding a statistically significant \(F\) statistic.
- Recognize Tukey’s Honest Significant Difference as a pairwise comparison method.
- Use the function TukeyHSD to perform pairwise comparisons in R.
- Interpret the output of TukeyHSD when passed an output from aov, including:
- the estimated pairwise differences
- the lower and upper endpoints of the confidence intervals for the pairwise differences
- the adjusted \(P\)-value for the pairwise comparisons
- State the null hypothesis implicit in the \(P\)-values reported by TukeyHSD.
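The pairwise comparison objectives above might look like this in base R. The data frame is hypothetical. Each row of the TukeyHSD output compares one pair of groups, and the adjusted P-value in that row corresponds to the null hypothesis that the two population means in that pair are equal.

```r
# Hypothetical data: one response measured in three groups of size 4
scores <- data.frame(
  value = c(12, 15, 14, 13, 18, 17, 19, 18, 14, 16, 15, 15),
  group = factor(rep(c("A", "B", "C"), each = 4))
)

fit <- aov(value ~ group, data = scores)

# Tukey's Honest Significant Difference for all pairs of groups
tukey <- TukeyHSD(fit, conf.level = 0.95)
print(tukey)

# The comparisons are stored as a matrix with one row per pair and columns
#   diff  = estimated difference in means (second group minus first)
#   lwr   = lower endpoint of the confidence interval
#   upr   = upper endpoint of the confidence interval
#   p adj = adjusted P-value
pairs <- tukey$group
diff_B_A <- unname(pairs["B-A", "diff"])
```

A pair whose interval excludes 0 (equivalently, whose adjusted P-value is small) shows evidence of a difference between those two population means.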
Chapter 21
Hypotheses for goodness of fit
- Recognize and state a claim about population proportions for categories of a categorical variable in a population.
- Given claimed proportions of categories in a population, state the null and alternative hypotheses corresponding to that claim.
- Explain why the alternative hypothesis of a “the population proportions equal the specified values” null hypothesis does not specify precisely which population proportions, if any, differ.
Expected counts and chi-square statistic
- Given a one-way table of counts and null values for population proportions, compute the expected count for each category.
- Compute the deviation between the observed counts in a sample and the expected counts under the null model.
- Compute the \(\chi^{2}\)-statistic given observed counts and population proportions.
- Recognize the Greek letter \(\chi\) (“chi”, pronounced “ki” as in “kite”) as the Greek analog to the Roman letter \(x\).
- Compute the \(\chi^{2}\)-statistic from observed counts and null proportions using xchisq.test from mosaic.
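The expected-count and \(\chi^{2}\) computations above can be sketched by hand in base R. The counts and null proportions are hypothetical. Base R's chisq.test is used here as a stand-in for xchisq.test, on the assumption that both report the same statistic for the same inputs.

```r
# Hypothetical one-way table of observed counts and null proportions
observed   <- c(30, 50, 20)
null_props <- c(0.25, 0.50, 0.25)

n <- sum(observed)

# Expected count for each category under the null model
expected <- n * null_props

# Deviations between observed and expected counts
deviations <- observed - expected

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum(deviations^2 / expected)

# Right-sided P-value with (number of categories - 1) degrees of freedom
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

# The same test using base R
base_result <- chisq.test(observed, p = null_props)

print(expected)
print(chi_sq)
print(p_value)
```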
The chi-square test for goodness of fit
- Explain why it is more appropriate to call the \(\chi^{2}\) “goodness-of-fit” test a “lack-of-fit” test.
- Interpret the output of xchisq.test in terms of a \(\chi^{2}\) lack-of-fit test.
- Use the output of xchisq.test to test a hypothesis about the proportions of some category in a population.
Interpreting significant chi-square results
- Construct simultaneous confidence intervals for the population proportions of a categorical variable using gf_pop_props from MUsaic.
- Interpret the output of gf_pop_props.
- State the implicit null hypothesis tested by checking for inclusion of a null population proportion in a confidence interval returned by gf_pop_props.
Conditions for the chi-square test
- State the assumptions on a sample for the \(\chi^{2}\) statistic to follow a \(\chi^{2}\) distribution.
Chapter 22
Two-way tables
- Explain how a two-way table can be used to summarize counts of a categorical variable across two or more populations.
- Give an analogy between one-, two-, and multi-sample tests for population means and a two-way table for testing claims about proportions of a category across one, two, and more than two populations.
- State the convention we will use in the class in terms of what rows and columns correspond to in a two-way table.
Hypotheses for two-way tables of counts
- Explain what we mean by two variables being associated.
- Explain what we mean by two variables being independent.
- Describe how independence between two variables manifests in a two-way table.
- State the generic form of the null and alternative hypotheses about two categorical variables in a population.
- Given a problem about two categorical variables, identify the relevant null and alternative hypotheses from the problem.
Expected counts and the chi-square statistic
- Explain under what hypothesis the expected counts for the \(\chi^{2}\) statistic for association are calculated.
- State the \(\chi^{2}\) statistic for association.
- State the sampling distribution of the \(\chi^{2}\) statistic for association under the null hypothesis.
- Construct a matrix from a two-way table using matrix in R.
- Compute the \(\chi^{2}\) statistic for association using xchisq.test from mosaic.
- Interpret the output of xchisq.test in terms of the \(\chi^{2}\) statistic for association.
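As a sketch of the two-way table objectives above, the following base R code builds a matrix from a hypothetical table, computes the expected counts under the null hypothesis of no association, and forms the \(\chi^{2}\) statistic. The counts and the row and column labels are invented, and which variable goes in the rows is an arbitrary choice here rather than the class convention. Base R's chisq.test stands in for xchisq.test, on the assumption that the two agree on the statistic.

```r
# Hypothetical two-way table: 2 populations (rows) by 3 categories (columns)
counts <- matrix(
  c(10, 20, 20,
    20, 20, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(population = c("first", "second"),
                  category   = c("low", "medium", "high"))
)

# Expected counts assuming no association:
# (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Chi-square statistic for association
chi_sq <- sum((counts - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# The same test using base R
base_result <- chisq.test(counts)

print(expected)
print(chi_sq)
print(p_value)
```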
The chi-square test
- Interpret the output of xchisq.test in terms of the \(\chi^{2}\) test for association.
- Perform a hypothesis test for association using xchisq.test.
Conditions for the chi-square test
- State the assumptions on the data for the \(\chi^{2}\) statistic for association to follow a \(\chi^{2}\) distribution.
Chapter 9
Risk and odds
- Define the population risk of a negative outcome.
- Define the population odds of a negative outcome.
- Compute the sample risk and sample odds given a table summarizing positive and negative outcomes in a sample.
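For instance, with hypothetical counts of negative and positive outcomes, the sample risk and sample odds could be computed in base R like this (the variable names are illustrative):

```r
# Hypothetical sample: counts of negative and positive outcomes
negative <- 20
positive <- 80

# Risk: proportion of all individuals with the negative outcome
sample_risk <- negative / (negative + positive)

# Odds: negative outcomes relative to positive outcomes
sample_odds <- negative / positive

# Equivalent relationship between odds and risk
odds_from_risk <- sample_risk / (1 - sample_risk)

print(sample_risk)
print(sample_odds)
```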
Chapter 20
Two-sample problems: proportions
- Give examples of two-sample problems that involve proportions.
Relative risk and odds ratios
- Define the population relative risk of a negative outcome in a treatment group compared to a control group.
- Define the population odds ratio of a negative outcome in a treatment group compared to a control group.
- Compute the sample relative risk and sample odds ratio given a two-way table.
- State what value of relative risk / odds ratio corresponds to “no difference” in risk between the treatment and control populations.
- Explain why relative risks and odds ratios are most appropriately considered on a logarithmic scale.
- Recognize that odds are not risks, and odds ratios are not relative risks.
- Interpret relative risks and odds ratios in terms of whether they indicate the treatment or control condition leads to lower risk.
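A small base R sketch of the sample relative risk and odds ratio, computed by hand from a hypothetical two-way table. The counts and names are assumptions for illustration only.

```r
# Hypothetical two-way table of outcomes by group
outcomes <- matrix(
  c(10, 90,
    20, 80),
  nrow = 2, byrow = TRUE,
  dimnames = list(group   = c("treatment", "control"),
                  outcome = c("negative", "positive"))
)

# Sample risk of the negative outcome in each group
risk <- outcomes[, "negative"] / rowSums(outcomes)

# Sample odds of the negative outcome in each group
odds <- outcomes[, "negative"] / outcomes[, "positive"]

# Treatment compared to control; a value of 1 means "no difference"
relative_risk <- unname(risk["treatment"] / risk["control"])
odds_ratio    <- unname(odds["treatment"] / odds["control"])

# On the log scale, "no difference" sits at 0, and halving or doubling
# the risk are the same distance from it
log_rr <- log(relative_risk)

print(relative_risk)
print(odds_ratio)
print(log_rr)
```

Here both ratios fall below 1, so in this sample the treatment group has the lower risk of the negative outcome. The two ratios are not equal, which illustrates that an odds ratio is not a relative risk.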
Inferences for Relative Risks and Odds Ratios Using R (Lecture Notes for Lecture 24)
- State the four conditions on the count of a categorical variable for it to be binomial.
- Compute the sample risks, relative risk, and odds ratio from a two-way table using oddsRatio from the mosaic package.
- Compute confidence intervals for the population relative risk and odds ratio from a two-way table using oddsRatio from the mosaic package.
- Use a confidence interval for either a population relative risk or population odds ratio to test for a difference in risk between a treatment and control population.
Chapter 23
The regression parameters
- State the assumptions of the Simple Linear Regression with Normal Noise (SLRNN) model.
- State the three parameters of the SLRNN model.
- Relate the components of the SLRNN model (intercept, slope, errors) to their sample analogs (intercept, slope, residuals).
- Explain in what sense the SLRNN model is “a line plus noise.”
- State the estimator for the standard deviation of the noise term in the SLRNN model.
- Identify the estimates for the intercept, slope, and standard deviation of the noise term in an output from summary.
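As an illustration of the parameter estimates above, this base R sketch fits a simple linear regression to made-up data, computes the estimate of the noise standard deviation by hand from the residuals (dividing by \(n - 2\)), and compares it with the value that summary reports as the residual standard error.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

# Estimated intercept and slope
estimates <- coef(fit)

# Estimate of the noise standard deviation:
# square root of (sum of squared residuals / (n - 2))
n <- length(y)
s_by_hand <- sqrt(sum(residuals(fit)^2) / (n - 2))

# summary() prints this value as "Residual standard error"
fit_summary <- summary(fit)
s_reported <- fit_summary$sigma

print(estimates)
print(s_by_hand)
print(fit_summary)
```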
Checking the conditions for inference
- Use the following four plots to diagnose the validity of the SLRNN model from residuals of a simple linear regression:
a. A Q-Q plot of the residuals.
b. A plot of the residuals against the fitted values of the response.
c. A plot of the squared residuals against fitted values of the response.
d. A plot of the residuals against the individual index.
- Identify which SLRNN model assumption is checked using the four residual plots from the previous learning objective.
- Use mplot and gf_line in mosaic to generate residual diagnostic plots from an output from lm.
Testing the hypothesis of no linear relationship
- State the \(T\)-statistic for the sample slope and its sampling distribution under the null hypothesis when the SLRNN model assumptions hold.
- State the estimate of the standard error of the slope estimate in simple linear regression.
- Identify the estimate of the standard error of the slope estimate in an output from summary.
- Perform a hypothesis test for a claim about a population slope.
Confidence intervals for the regression slope
- State the confidence interval for a population slope under the SLRNN model assumptions.
- Construct a confidence interval for a population slope using the output from summary and qt.
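A minimal sketch of slope inference in base R, using made-up data. It reads the slope estimate and its standard error from the coefficient table in summary, forms the \(T\)-statistic for a null slope of 0, and builds a 95% confidence interval with qt using \(n - 2\) degrees of freedom. The built-in confint is called only as a cross-check.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)
coef_table <- summary(fit)$coefficients

# Slope estimate and its estimated standard error
b1    <- coef_table["x", "Estimate"]
se_b1 <- coef_table["x", "Std. Error"]

# T-statistic for H0: population slope = 0, with n - 2 degrees of freedom
df <- length(y) - 2
t_stat <- b1 / se_b1
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence interval: estimate +/- t* x standard error
t_star <- qt(0.975, df = df)
slope_ci <- c(b1 - t_star * se_b1, b1 + t_star * se_b1)

print(t_stat)
print(p_value)
print(slope_ci)

# Cross-check against R's built-in interval
print(confint(fit, "x", level = 0.95))
```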