Chapter 2 Introduction to Statistical Modeling and Designed Experiments

We begin our dive into Statistical Modeling by building from some introductory statistics material, namely two-sample inference and analysis of variance. The learning objectives of this unit include:

  • Understanding the basic structure of a statistical model.
  • Understanding that statistical inference can be formulated as the comparison of models.
  • Distinguishing between observational studies and designed experiments.
  • Performing a full analysis involving a one-factor experiment.

2.1 Statistical Analysis is Modeling

Think of a model airplane. It is just a simplified representation of the real thing.

In many statistical problems (and all the problems we will encounter in this course), we are interested in describing the relationship that may exist among several different measured variables. For example, your current GPA could be impacted by how much you study. Or there may be several contributing factors – your credit hour load, your ACT score, whether or not you have a part-time job, etc.

Of course, your GPA is the result of a highly complex combination of factors, some important and others not so important.

A statistical model is a mechanism we will use to try to describe the structural relationship between some measured outcome (called the response variable) and one or more impacting variables (called predictor variables or factors or features, depending on the contextual circumstance) in a simplified mathematical way.

It is in this sense that you can think of a statistical model as a “simplification” in much the same way as the model airplane: we know that the true relationship between response and predictor variables is very detailed and complex. The goal of the model is not to capture all the intricacies of the complexity but rather, the model only seeks to describe the essential features of any relationships that exist.

2.1.1 Example of a two-sample \(t\)-test as a model

It turns out, we have already fit a statistical model. Consider a variation of the two-sample \(t\)-test we performed in the previous chapter on the University Admissions data (Kutner et al., 2004). For cohesiveness we input the data again.

uadata <- read_table("https://tjfisher19.github.io/introStatModeling/data/univadmissions.txt")
uadata.trim <- uadata %>%
  filter(year %in% c(1996, 2000) )

t.test(act ~ year, data=uadata.trim, conf.level=0.98, 
       alternative="two.sided", var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  act by year
## t = 0.3013, df = 277, p-value = 0.763
## alternative hypothesis: true difference in means between group 1996 and group 2000 is not equal to 0
## 98 percent confidence interval:
##  -0.916072  1.186864
## sample estimates:
## mean in group 1996 mean in group 2000 
##            24.6901            24.5547

Looking at the code, you’ll note we are performing a two.sided, or “not equal to,” test. More importantly, note the act ~ year notation – this is known as a formula in R. In particular, we are telling R that the ACT scores are a function of the student’s incoming year. This concept should be familiar, as it is similar to the regression ideas from your Intro Statistics course.

It turns out that nearly all statistical inference can be framed in terms of modeling. To see this we first must cover some light math.

Recall from Intro Statistics that a random variable \(Y\) is typically assumed to come from a Normal distribution with mean \(\mu\) and standard deviation \(\sigma\). Standard shorthand notation for this is to write \[Y \sim N(\mu, \sigma).\] You may also remember that if you subtract off the mean, \(\mu\), from a random variable it will just shift the distribution by \(\mu\) units. For example \[ Y - \mu \sim N(0, \sigma)\] We could reformulate the above via \[\begin{equation} Y = \mu + \varepsilon \tag{2.1} \end{equation}\] where \[\varepsilon \sim N(0, \sigma).\]

The equation (2.1) can be considered a baseline model. Here, we are simply stating that the random variable \(Y\) is a function of a mean \(\mu\) and random component \(\varepsilon\), which happens to be from a Normal distribution with mean 0 and standard deviation \(\sigma\). In words, we have the model \[\textbf{Data} = \textbf{Systematic Structure} + \textbf{Random Variation}\] where in the baseline model the Systematic Structure is simply the constant \(\mu\). We will consider more complicated structures later.

Estimation

A quick note about estimation. In your Intro Stat class you worked with this model regularly. We would estimate \(\mu\) with the sample mean \(\bar{Y}\) and estimate \(\sigma\) with the sample standard deviation \(S\). You learned about one-sample hypothesis testing and confidence intervals all based on this baseline model.
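For instance, here is a minimal sketch (assuming the uadata.trim data from Section 2.1.1 is still loaded; the object names ybar and s are just illustrative) of estimating the baseline model for the ACT scores:

ybar <- mean(uadata.trim$act)   # sample mean, our estimate of mu
s <- sd(uadata.trim$act)        # sample standard deviation, our estimate of sigma
c(ybar, s)

t.test(uadata.trim$act)         # one-sample inference about mu under the baseline model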

Two-sample \(t\)-test as a Model

Now consider the case of testing if ACT scores were different in 1996 compared to 2000. Let’s introduce a new term, \(\tau\), that measures how the 2000 ACT scores deviate from the 1996 ACT scores. That is, consider the model \[Y_i = \mu + \tau + \varepsilon_i\] Here the value \(\mu\) corresponds to the mean ACT score in 1996 and is the baseline, \(\tau\) measures how the year 2000 ACT scores differ from 1996, and \(\varepsilon_i\) is the underlying random part of the \(i^\mathrm{th}\) observation \(Y_i\).

Another way to frame this model is to consider a variable \(X\) such that if observation \(i\) is from 1996 then \(X_i\) takes on the value zero and if the observation is from 2000 then \(X_i\) takes on the value one. We’ll let \(Y_i\) be the observed ACT score for student \(i\) and consider the following

\[\begin{equation} Y_i = \mu + \delta\cdot X_i + \varepsilon_i. \tag{2.2} \end{equation}\]

Now \(\mu\) is the mean of 1996 ACT scores, \(\delta\) is the effect the year 2000 has on ACT scores (i.e., if ACT scores increased by 4 units from 1996 to 2000, \(\delta=4\)), thus \(\mu+\delta\) is the mean ACT score in the year 2000 (the \(\tau\) effect in the above). Lastly, \(\varepsilon_i\) is how observation \(i\) deviates from the mean and we assume \(\varepsilon_i \sim N(0, \sigma)\) for standard deviation \(\sigma\) for all \(i=1,\ldots,n\).

When performing a two-sample \(t\)-test we are essentially testing \(H_0:\ \delta=0\) versus \(H_A:\ \delta\neq 0\). Note that if \(H_0\) were true, we are back to the baseline model (2.1).

So to put it another way, a two-sample \(t\)-test essentially is a choice of

\[H_0: \textrm{Baseline Model} ~~ Y_i = \mu + \varepsilon_i\] and \[H_A: \textrm{More complex model} ~~ Y_i = \mu + \delta\cdot X_i + \varepsilon_i.\]

To tie everything together note that in \(H_A\) we are essentially saying observation \(Y_i\) is a function of the variable \(X_i\).

Because of this we typically refer to \(Y_i\) as the response (or dependent) variable and \(X_i\) as a predictor (or independent) variable. This coincides with the R formula notation, response ~ predictor.
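As a quick (optional) check, the model in (2.2) can be fit directly; a minimal sketch, assuming the uadata.trim data from above is still loaded (the indicator X and the object fit.act are ours for illustration). The intercept estimates \(\mu\) (the 1996 mean), the coefficient on X estimates \(\delta\), and with a binary predictor its \(t\)-statistic matches (up to sign) the equal-variance two-sample \(t\)-test shown earlier.

uadata.trim <- uadata.trim %>%
  mutate(X = ifelse(year == 2000, 1, 0))   # X = 0 for 1996, X = 1 for 2000

fit.act <- lm(act ~ X, data = uadata.trim)
summary(fit.act)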

2.2 Observational Studies versus designed experiments

Before we introduce more statistical models we need to step back and discuss the source of our data. Are the data a “sample of convenience,” or were they obtained via a designed experiment or some random sampling scheme? How the data were collected has a crucial impact on what conclusions can be meaningfully made.

It is important to recognize that there are two primary methods for obtaining data for analysis: designed experiments and observational studies. There is a third type of data collected through survey sampling methods that incorporates elements from both designed experiments and observational studies, but we exclude it here. It is important to know the distinction because each type of data results in a different approach to interpreting estimated models.

2.2.1 Observational Studies

In many situations a practitioner (perhaps you, or a client) simply collects measurements on predictor and response variables as they naturally occur, without intervention from the data collector. Such data is called observational data.

Interpreting models built on observational data can be challenging. There are many opportunities for error and any conclusions will carry with them substantial unquantifiable uncertainty. Nevertheless, there are many important questions for which only observational data will ever be available. For example, how else would we study something like differences in prevalence of obesity, diabetes and other cardiovascular risk factors between different ethnic groups? Or the effect of socio-economic status on self esteem? It may be impossible to design experiments to investigate these, so we must make the attempt to build good models with observational data in spite of their shortcomings.

In observational studies, establishing causal connections between response and predictor variables is nearly impossible. In the limited scope of a single study, the best one can hope for is to establish associations between predictor variables and response variables. But even this can be difficult due to the uncontrolled nature of observational data. Why? It is because unmeasured and possibly unsuspected “lurking” variables may be the real cause of an observed relationship between \(Y\) and some predictor \(X\).

In observational studies, it is important to adjust for the effects of possible confounding variables. Unfortunately, one can never be sure that all relevant confounding variables have been identified. As a result, one must take care in interpreting estimated models involving observational data.

This topic will be more thoroughly visited later.

2.2.2 Designed experiments

In a designed experiment, the researcher has control over the settings of the predictor variables. For example, suppose we wish to study several physical exercise regimens and how they impact calorie burn. Typically the values of the predictor variables are discrete (that is, a finite number of controlled values). Because of this, predictor variables of this type are known as factors. The experimental units (EUs) are the people we use for the study. We can control some of the predictors, such as the amount of time spent exercising or the amount of carbohydrates consumed prior to exercising. Some other predictors may not be controlled but could be measured, such as baseline metabolic variables. Other variables, such as the temperature in the room or the type of exercise done, could be held fixed.

Having control over the conditions in an experiment allows us to make stronger conclusions from the analysis. Another key feature from a designed experiment is that the model is chosen for us! We will see later in the class that with observational data we typically need to select a model. But with a designed experiment the model is predetermined based on the experiment.

2.3 Designed experiment vocabulary

2.3.1 What is an experiment?

In an experiment, a researcher manipulates one or more variables, while holding all other variables constant. The main advantage of well-designed experiments over observational studies is that we can establish cause and effect relationships between the predictors and response.

One of the most important things to keep in mind with analyzing designed experiments is that the structure of the experiment dictates how the analysis may proceed. There are a plethora of structural possibilities of which we will only scratch the surface here.

Consider, for example, these four scenarios:

  1. Suppose a study is conducted on the effects a new drug has on patient blood pressure. At the start of the study the blood pressure of all participants is measured. After receiving the drug for two months, the blood pressure of each patient is measured again. The analysis to determine the effect of the drug on blood pressure is the paired \(t\)-test from your Intro Stat class. Here the paired measurements for each patient are the before and after blood pressures, and the analysis is based on their difference.

  2. We may have a simple design with only one factor, where each subject (experimental unit) is measured only under one of the factor levels. For example, suppose three different drugs are administered to 18 subjects. Each person is randomly assigned to receive only one of the three drugs. A response is measured after drug administration. The analysis we use for such data is known as a one-way analysis of variance.

  3. We may have a design with two factors, where every level of the first factor appears with every level of the second factor. For example, suppose your data are from an experiment in which alertness levels of male and female subjects were measured after they had been given one of two possible dosages of a drug. Such an experimental design is known as a factorial design. Here, the two factors are gender and dosage. The analysis we use for such data is known as a two-way analysis of variance.

  4. Suppose we have a design with only one factor of interest (word type). Five subjects are asked to memorize a list of words. The words on this list are of three types: positive words, negative words and neutral words. The response variable is the number of words recalled by word type, with a goal of determining if the ability to recall words is affected by the word type. Note that even though there is only one factor of interest (word type) with three levels (negative, neutral and positive), this data structure differs greatly from a one-way analysis of variance mentioned earlier because each subject is observed under every factor level. Thus, there will be variability in the responses within subjects as well as between subjects. This fact will affect how we analyze the data.

We will cover each of these analyses.

2.3.2 Analysis of variance

Analysis of variance (ANOVA) is an analytic tool in statistics to partition the variation in a response variable and attribute it to known sources in the experiment. It is essentially an extension of the more familiar \(t\)-test. As stated earlier, there are many different experimental structures possible when conducting an experiment. Some of the other widely used structures in practice are known as nested designs, split-plot designs, and repeated measures designs. It is worth saying again that the structure of the experimental data will dictate how an analysis of variance is used to model the data. In each of these designs, some variant of an Analysis of Variance is used for analysis.

2.3.3 Elements of a designed experiment

All experiments have factors, a response variable, and experimental units. Here are some definitions and terms:

  • Factors: A factor is an explanatory variable manipulated by the experimenter. Each factor has two or more levels (i.e., different values of the factor). Combinations of factor levels form what are called treatments. The table below shows factors, factor levels, and treatments for a hypothetical experiment:

    \[\begin{array}{c|ccc}
     & \textbf{Vitamin C: 0 mg} & \textbf{Vitamin C: 250 mg} & \textbf{Vitamin C: 500 mg} \\ \hline
    \textbf{Vitamin E: 0 mg} & \textrm{Treatment 1} & \textrm{Treatment 2} & \textrm{Treatment 3} \\
    \textbf{Vitamin E: 400 mg} & \textrm{Treatment 4} & \textrm{Treatment 5} & \textrm{Treatment 6} \\
    \end{array}\]

    In this hypothetical experiment, the researcher is studying the possible effects of Vitamin C and Vitamin E on health. There are two factors: dosage of Vitamin C and dosage of Vitamin E. The Vitamin C factor has three levels: 0 mg per day, 250 mg per day, and 500 mg per day. The Vitamin E factor has two levels: 0 mg per day and 400 mg per day. The experiment has six treatments. Treatment 1 is 0 mg of E and 0 mg of C; Treatment 2 is 0 mg of E and 250 mg of C; and so on. (A short R sketch enumerating these treatments appears after this list.)

  • Response variable: In the hypothetical experiment above, the researcher is looking at the effect of vitamins on health. The response variable in this experiment would be some measure of health (annual doctor bills, number of colds caught in a year, number of days hospitalized, etc.).

  • Experimental units: The recipients of experimental treatments are called experimental units. The experimental units in an experiment could be anything - people, plants, animals, or even inanimate objects.

In the hypothetical experiment above, the experimental units would probably be people (or lab animals). But in an experiment to measure the tensile strength of string, the experimental units might be pieces of string. When the experimental units are people, they are often called participants; when the experimental units are animals, they are often called subjects.
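As referenced in the Factors bullet above, the treatments are simply all combinations of the factor levels. A minimal sketch enumerating them in R for the hypothetical vitamin experiment (the object name treatments is ours for illustration):

# Each treatment is one combination of the Vitamin C and Vitamin E dosage levels
treatments <- expand.grid(VitaminC = c("0 mg", "250 mg", "500 mg"),
                          VitaminE = c("0 mg", "400 mg"))
treatments   # 6 rows, one per treatment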

Three characteristics of a well-designed experiment. While the elements above are common to all experiments, there are aspects to the manner in which the experiment is run which are critical to it being a “well-designed” experiment. A well-designed experiment includes design features that allow researchers to eliminate extraneous variables as an explanation for the observed relationship between the factor(s) and the response variable. Some of these features are listed below.

  • Control: Control refers to steps taken to reduce the effects of extraneous variables (i.e., variables other than the factor(s) and response). These extraneous variables are called lurking variables. Control involves making the experiment as similar as possible for experimental units in each treatment condition. Two control strategies are control groups and placebos:

    • Control group: A control group is a baseline group that receives no treatment or a neutral treatment. To assess treatment effects, the experimenter compares results in the treatment group to results in the control group.

    • Placebo: Often, participants in an experiment respond differently after they receive a treatment, even if the treatment is neutral. A neutral treatment that has no “real” effect on the dependent variable is called a placebo, and a participant’s positive response to a placebo is called the placebo effect. To control for the placebo effect, researchers often administer a neutral treatment (i.e., a placebo) to the control group. The classic example is using a sugar pill in drug research. The drug is effective only if participants who receive the drug have better outcomes than participants who receive the sugar pill.

    • Blinding: Of course, if participants in the control group know that they are receiving a placebo, the placebo effect will be reduced or eliminated; and the placebo will not serve its intended control purpose. Blinding is the practice of not telling participants whether they are receiving a placebo. In this way, participants in the control and treatment groups experience the placebo effect equally. Often, knowledge of which groups receive placebos is also kept from people who administer or evaluate the experiment. This practice is called double blinding. It prevents the experimenter from “spilling the beans” to participants through subtle cues; and it assures that the analyst’s evaluation is not tainted by awareness of actual treatment conditions.

  • Randomization: Randomization refers to the practice of using chance methods (random number generation, etc.) to assign experimental units to treatments. In this way, the potential effects of lurking variables are distributed at chance levels (hopefully roughly evenly) across treatment conditions. For example, suppose we had 12 experimental units which we wanted to completely randomly allocate to three different treatments (call them A, B and C). We can accomplish this as follows in R:

ids <- 1:12
trt <- rep(c('A','B','C'), each=4)
randomized <- sample(ids, size=length(ids), replace=FALSE)
CRD <- data.frame(trt, randomized)
CRD
##    trt randomized
## 1    A          5
## 2    A          6
## 3    A          8
## 4    A          1
## 5    B          3
## 6    B          2
## 7    B         10
## 8    B         11
## 9    C          9
## 10   C          4
## 11   C          7
## 12   C         12

Such a design is known as a completely randomized design (or CRD).

Randomization is arguably one of the most important pieces of any well-designed experiment. Failure to randomize can lead to confounded conclusions (see discussion below) and biased results!

Example: Consider a contrived, but simple, experiment where you might be comparing a new heart disease drug to a placebo. Suppose your sample consists of one hundred adults with diagnosed heart disease. Further, suppose the ages of the adults are uniformly distributed across the range 20 to 90. Out of convenience, suppose the new drug is administered to the youngest 50 volunteers and the placebo to the oldest 50 volunteers. Would you trust the results of this experiment?

However, if the drug or placebo had instead been randomly assigned to patients of different ages and backgrounds, the randomization would help eliminate any bias due to age.

  • Replication: Replication refers to the practice of assigning each treatment to many experimental units. The purpose of replication is as follows:

    • Estimate the noise in the system. Recall earlier in the text when we discussed sources of variation in data. What do I mean by that? Well, think of this: if you were to apply precisely the same treatment to a set of different individuals, the response would naturally vary due to myriad reasons of little to no interest to the researcher. So, responses vary to some degree due to simple random chance fluctuations, or “noise.” For our purposes, this constitutes a single source of variation.
    • Now, let’s apply different treatments to a set of individuals. The responses would still vary, but now there are two sources of variation: the random chance element from above (because we still have different individuals), but now we add in the effect that the different treatments have on the response. So, one purpose of replication is so that we can estimate how much noise is naturally in the system, so that we can see if the variability in response introduced by changing the treatment rises above the noise. The more replication in an experiment, the better you can estimate the noise. However, this comes at an expense!
    • Help wash out the effect of lurking variables. This was also mentioned as a reason for randomization. When coupled with randomization, replication provides the “opportunity” for your experiment to have a mix of all those lurkers appear under each treatment, thus keeping their effects from biasing your results.

2.3.3.1 Confounding

This sounds serious. What is it?

Confounding is a condition that occurs when the experimental controls do not allow the experimenter to reasonably eliminate plausible alternative explanations for an observed relationship between factors and the response. Needless to say, confounding renders experimental results seriously impaired, if not totally useless. Consider this example:

Example: A drug manufacturer tests a new cold medicine with 200 participants: 100 men and 100 women. The men receive the drug, and the women do not. At the end of the test period, the men report fewer colds.

What is the problem? This experiment implements no controls! As a result, many variables are confounded, and it is impossible to say whether the drug was effective. For example:

  • Gender is confounded with drug use. Perhaps, men are less vulnerable to the particular cold virus circulating during the experiment, and the new medicine had no effect at all. Or perhaps the men experienced a placebo effect.

This experiment could be strengthened with a few controls. Women and men could be randomly assigned to treatments. One treatment group could receive a placebo. Blinding could be implemented to reduce the influence of “expectation” on the part of participants with regard to the outcome of a treatment. Then, after all this, if the treatment group (i.e., the group getting the medicine) exhibits sufficiently fewer colds than the control group, it would be reasonable to conclude that the medicine was the reason for effectively preventing colds, and not some other lurking reason.

2.4 Paired \(t\)-test

Now that the framework for experimental design is complete let’s consider our first example, the paired \(t\)-test (also known as the matched pairs design). Arguably the simplest example of such a design is a before and after study as outlined in the fictional example below.

Example. In an effort to study the effectiveness of a new cholesterol drug, 20 patients with high cholesterol are selected and their cholesterol is measured at the beginning of the study. The patients then take the new drug for 45 days. At the conclusion of the 45 days, their cholesterol is measured and compared to that at the beginning of the study.

The above is an example of a designed experiment. The experimental units are the 20 individuals, the response is the difference in cholesterol (before and after) and the pairing attempts to control for confounding variability (variability within an individual is somewhat controlled via the pairing mechanism).

The use of the paired \(t\)-test is not limited to experimental designs; it also appears in observational studies. The paired \(t\)-test does follow the model form discussed above but is actually somewhat complex due to the pairing mechanism. We will revisit the model statement in a later section.

2.4.1 Paired \(t\)-test example

To demonstrate the implementation of a paired \(t\)-test in R, we consider another example.

Example. Carbon Emissions (originally in Weiss (2012)).

The makers of the MAGNETIZER Engine Energizer System (EES) claim that it improves the gas mileage and reduces emissions in automobiles by using magnetic free energy to increase the amount of oxygen in the fuel for greater combustion efficiency. Test results for the 14 vehicles that underwent testing are available in the file carCOemissions.csv. The data includes the emitted carbon monoxide (CO) levels, in parts per million, of each of the 14 vehicles tested before and after installation of the EES.

This is a classic before-after (Paired design) study. Below is some code to implement the test.

www <- "https://tjfisher19.github.io/introStatModeling/data/carCOemissions.csv"
vehicleCO <- read_csv(www, col_types = cols())
vehicleCO
## # A tibble: 14 x 3
##       id before after
##    <dbl>  <dbl> <dbl>
##  1     1   1.6   0.15
##  2     2   0.3   0.2 
##  3     3   3.8   2.8 
##  4     4   6.2   3.6 
##  5     5   3.6   1   
##  6     6   1.5   0.5 
##  7     7   2     1.6 
##  8     8   2.6   1.6 
##  9     9   0.15  0.06
## 10    10   0.06  0.16
## 11    11   0.6   0.35
## 12    12   0.03  0.01
## 13    13   0.1   0   
## 14    14   0.19  0

First we note each row of the data consists of an id with a before and after measurement. In the paired \(t\)-test, we analyze the differences between the before and after measurements. We perform this test in two different ways: (1) we explicitly calculate the difference, and (2) we use features in the t.test function.

Method 1

vehicleCO <- vehicleCO %>%
  mutate(Difference = before - after)
t.test(vehicleCO$Difference)
## 
##  One Sample t-test
## 
## data:  vehicleCO$Difference
## t = 3.146, df = 13, p-value = 0.00773
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.239443 1.289129
## sample estimates:
## mean of x 
##  0.764286

Method 2

t.test(vehicleCO$before, vehicleCO$after, paired=TRUE)
## 
##  Paired t-test
## 
## data:  vehicleCO$before and vehicleCO$after
## t = 3.146, df = 13, p-value = 0.00773
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.239443 1.289129
## sample estimates:
## mean of the differences 
##                0.764286

In the first method, since we explicitly calculated the Difference as a new variable, we are determining if it is significantly different than zero (essentially it reduces to a one-sample \(t\)-test from your Intro Stat course). In the second method, we compare the before and after measures but we tell R that the data are paired with the paired=TRUE option. Lastly we note both methods provide the same findings with a \(p\)-value of approximately 0.0077, thus we have strong evidence to support that the EES system changes carbon monoxide levels.
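To reinforce that Method 1 is just a one-sample \(t\)-test on the differences, the test statistic can be computed by hand; a minimal sketch using the Difference variable created above (the names n, dbar and se_dbar are ours for illustration):

n <- nrow(vehicleCO)
dbar <- mean(vehicleCO$Difference)               # mean of the paired differences
se_dbar <- sd(vehicleCO$Difference) / sqrt(n)    # standard error of that mean
dbar / se_dbar                                   # the t-statistic (matches t = 3.146 above)
2 * pt(abs(dbar / se_dbar), df = n - 1, lower.tail = FALSE)   # two-sided p-value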

Graphical comparison

To graphically compare paired data we must use some caution and consider the pairing feature in the data. Ultimately we are interested in the pairwise difference between observations. Thus, side-by-side boxplots are not appropriate since they do not account for the pairing. We can, however, report a visualization of the differences. Figure 2.1 provides a horizontal boxplot, with overlaid jittered observations, of the differences (the coord_flip() function causes the vertical-horizontal switch, while the jittering just adds some minor random perturbation to the data to visually separate observations, specifying a maximum shift with width=0.25).

ggplot(vehicleCO) + 
  geom_boxplot(aes(x="",y=Difference) )+
  geom_jitter(aes(x="", y=Difference), width=0.25 ) +
  labs(x="", y="Changes in CO Levels (ppm)", 
       title="Effects of Engine Energizer System (EES) on 14 vehicles", 
       subtitle="Change in Carbon Monoxide Emissions after installation of EES") +
  coord_flip() +
  theme_minimal()

Figure 2.1: Box-whiskers plot showing the distribution of differences in CO emissions (Before installation of EES - After installation), demonstrating that CO emissions decreased for most vehicles under study.

Another type of plot that can be handy for this sort of data (and will be more important later) is known as a profile plot. Here we use the id values to group the data. Basically, we wish to draw a line for each individual vehicle. We begin with some data manipulation.

vehicleCO.tall <- vehicleCO %>%
  select(-Difference) %>% 
  pivot_longer(c(before, after), names_to="Time", values_to="CO") %>%
  mutate(id=factor(id),
         Time=factor(Time, levels=c("before","after")))

Let’s step through the above R code. In the first statement select(-Difference) we are telling R to drop the Difference variable as we do not need it for the plot. Then we are pivoting from wide-to-long format (or wide-to-tall) with the before and after measurements. The resulting dataset has two new variables, one called Time which states if it is the before or after measure and another called CO with the corresponding Carbon Monoxide emissions at that time. Lastly, we tell R to treat id as a factor and set the levels of Time so that before is first (otherwise it would put things in alphabetical order, by default).

Now that the data is in a tall format, we can make our profile plot in Figure 2.2. Here we simply tell R to treat the factor Time as the \(x\)-variable with CO as the \(y\)-variable, but group each line by vehicle id.

ggplot(vehicleCO.tall) + 
  geom_line(aes(x=Time, y=CO, group=id) ) +
  theme_minimal() + 
  labs(y="CO Emissions (ppm)", x="Installation of EES")

Figure 2.2: Profiles of each vehicle demonstrating a general decrease in CO emissions after EES installation.

To jazz up our plot we can add the overall effect by including some summary values. First we calculate summary statistics. Since we use a group variable for each vehicle we need to trick R into plotting the overall summary line as well, thus we create an id="Summary" variable in vehicleCO.summary to use as a group variable. We also tweak the \(x\)-axis scale by limiting the amount of space around the before and after markings. The code below produces Figure 2.3.

vehicleCO.summary <- vehicleCO.tall %>%
  group_by(Time) %>%
  summarize(CO=mean(CO),
            id="Summary")

ggplot(vehicleCO.tall) + 
  geom_line(aes(x=Time, y=CO, group=id), color="gray50" ) +
  geom_line(data=vehicleCO.summary, aes(x=Time, y=CO, group=id), color="darkred", size=2) +
  labs(y="CO Emissions (ppm)", x="Installation of EES") + 
  scale_x_discrete(expand=c(0.075, 0.075) ) + 
  theme_minimal()

Figure 2.3: Profiles of each vehicle, with average profile highlighted in a thick dark red line, demonstrating a general decrease in CO emissions after EES installation.

Based on the two plots, coupled with the results of the paired \(t\)-test, it is clear there is a significant decrease in CO emissions after installation of the EES.

2.5 One-Way ANOVA

When there is a single factor whose levels may only change between different EUs, we can analyze the effect of the factor on the response by using a one-way analysis of variance (one-way ANOVA) on the between-subjects factor. If you had a single factor with just two levels, you could still use a one-way ANOVA, but it would be equivalent to running an independent samples \(t\)-test (see earlier material).

In this design, we observe random samples from \(k\) different populations. As a designed experiment, these populations may be defined on the basis of an administered treatment. The levels of the factor being manipulated by the researcher form the different treatments. Frequently, the data are obtained by collecting a random sample of individuals to participate in the study, and then randomly allocating a single treatment to each of the study participants. If so, then the individual subjects (experimental units) receiving treatment \(j\) may be thought of as a random sample from the population of all individuals who could be administered treatment \(j\).

The one-way data structure looks like the following:

\[\begin{array}{cccc} \hline \textbf{Treatment 1} & \textbf{Treatment 2} & \cdots & \textbf{Treatment}~k \\ \hline Y_{11} & Y_{21} & \ldots & Y_{k1} \\ Y_{12} & Y_{22} & \ldots & Y_{k2} \\ Y_{13} & Y_{23} & \ldots & Y_{k3} \\ \vdots & \vdots & \vdots & \vdots \\ Y_{1n_{1}} & Y_{2n_{2}} & \ldots & Y_{kn_{k}} \\ \hline \end{array}\]

The model for such data has the form \[\begin{equation} Y_{ji} = \mu_j + \varepsilon_{ji} \tag{2.3} \end{equation}\] where \(\mu_j\) is the mean of the \(j^\mathrm{th}\) treatment, \(j=1,\ldots,k\), or more commonly, \[Y_{ji} = \mu + \tau_j + \varepsilon_{ji}\] where

  • \(Y_{ji}\) is the \(i^\mathrm{th}\) observation in the \(j^\mathrm{th}\) treatment.
  • \(\mu\) is the overall mean of all the populations combined.
  • \(\tau_j\) is the deviation of the mean by the \(j^\mathrm{th}\) treatment population from the overall mean \(\mu\).
  • \(\varepsilon_{ji}\) is the random error term.

The usual test of interest in a one-way analysis of variance is to compare the population means. The null and alternative hypotheses are given by:

\[H_0: \mu_1 = \mu_2 = \ldots = \mu_k\] \[H_a: \textrm{at least two of the population means differ}\]

Expressed in terms of the more commonly used model parameters, we really want to test to see if there are no treatment mean deviations among the populations, so the above hypotheses can be equivalently expressed as follows:

\[H_0: \tau_1 = \tau_2 = \ldots = \tau_k = 0\] \[H_a: \textrm{at least one } \tau_j \neq 0\]

In terms of modeling, we are testing

\[H_0: \textrm{Baseline Model:} ~~ Y_{ji} = \mu + \varepsilon_{ji}\]

\[H_a: \textrm{More Complex Model:} ~~ Y_{ji} = \mu + \tau_j + \varepsilon_{ji},~~\textrm{at least one}~\tau_j\neq0\]

We can test these hypotheses by partitioning the total variability in the response into two components: (1) variation between treatments, and (2) variation within treatments. The latter partition is essentially residual error (unexplained variability) while the former is the explained variability. An \(F\)-test is performed to run the test.

The \(F\)-test works like other hypothesis tests. The test statistic is a ratio of variance estimates and follows the Fisher-Snedecor distribution (or \(F\)-distribution for short). The details of this distribution are not important for this course and are covered in a Mathematical Statistics course. An example of a one-way ANOVA in R is provided below.

2.5.1 Example of One-Way ANOVA

Example: Drug effects on “knockdown” times (from the text Walpole et al. (2007)).

Immobilization of free-ranging white-tailed deer by drugs allows researchers the opportunity to closely examine deer and gather valuable physiological information. Wildlife biologists tested the “knockdown” time (time, measured in minutes, from injection to immobilization) of three different immobilizing drugs. Thirty male white-tailed deer were randomly assigned to one of three treatments: group A received 5 mg of liquid succinylcholine chloride (SCC); group B received 8 mg of powdered SCC; and group C received 200 mg of phencyclidine hydrochloride.

Original Source: Wesson (1976)

The data appear in the R dataframe deerKnockdown.RData in our repository: (https://tjfisher19.github.io/introStatModeling/data/deerKnockdown.RData)

site <- "https://tjfisher19.github.io/introStatModeling/data/deerKnockdown.RData"
load(url(site))
glimpse(deer_knockdown_times)
## Rows: 30
## Columns: 2
## $ Drug           <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", ~
## $ Knockdown_time <dbl> 11, 5, 14, 7, 10, 7, 23, 4, 11, 11, 10, 7, 16, 7, 7, 5,~

Note the subtle change in the code compared to read_csv() and read_table(). When loading an R data.frame from a website, we need to use the url() function to open and close the internet connection inside the load() function.

We can then plot the data.

ggplot(deer_knockdown_times, aes(x=Drug, y=Knockdown_time )) + 
  geom_boxplot() + 
  geom_jitter(width=0.25) +
  stat_summary(fun="mean", geom="point", shape=23, fill="gray60", size=4) + 
  labs(x="Drug Treatment", y="Time to Immobilization (min)") + 
  theme_minimal()

Figure 2.4: Side-by-side box-whiskers plots, with overlaid jittered data and treatment means (gray diamonds), showing the distribution of ‘knockdown’ times for each of the three drug treatments, A, B and C. Overall we see immobilization times are highest in treatment A, followed by B and C. We also note that variability appears greatest in treatment A, followed by B and C.

Visually, the immobilization times under drug C appear noticeably lower on average than for the other two drugs. Also, we may suspect the variance of immobilization times differs between the three drugs. That is, the distributions of times for drugs A and B are more dispersed than for drug C.

To perform One-Way ANOVA we employ the aov function (for Analysis Of Variance). We use the summary() function to see its full output.

fit.drug <- aov(Knockdown_time ~ Drug, data=deer_knockdown_times)
summary(fit.drug)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Drug         2    159    79.4    5.46   0.01 *
## Residuals   27    393    14.6                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see the One-Way ANOVA reports a significant \(p\)-value (0.0102), but we’ll soon see there are problems with this result.
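Before turning to those problems, it is worth tying this back to the model-comparison framing from Section 2.1. A minimal sketch (the object names fit.base and fit.full are ours for illustration): the same \(F\)-test can be obtained by explicitly fitting the baseline and more complex models and comparing them.

fit.base <- lm(Knockdown_time ~ 1, data = deer_knockdown_times)     # baseline: Y = mu + error
fit.full <- lm(Knockdown_time ~ Drug, data = deer_knockdown_times)  # more complex: Y = mu + tau_j + error
anova(fit.base, fit.full)   # model-comparison F-test; matches the ANOVA table above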

2.6 Assumption Checking

Arguably the most important aspect in statistics is the assumptions made with the chosen statistical model. In the field of statistics there are two types of assumptions made:

  1. Assumptions about the systematic structure.
  2. Assumptions about the random variation.

Most of the time, when referring to assumptions we are concerned with the second: assumptions about the random variation. However, it is also important to consider the assumptions about the systematic structure. In all of the models/tests discussed above, we are assuming the response variable is a linear function of the predictor variable (consider (2.2)). Based on context, it is important to consider whether a linear structure is appropriate.

In this class, along with nearly all methods you learned in your Intro Stat course, we make the following assumptions about the underlying stochastic (i.e., random) part of the response variable, \(\varepsilon\):

  • The \(\varepsilon_i\) terms are independent.
  • The variance of \(\varepsilon_i\) is homogeneous; i.e., \(Var(\varepsilon_i)=\sigma^2\) is constant for all \(i=1,\ldots,n\). This is sometimes known as being homoskedastic.
  • The \(\varepsilon_i\) terms are Normally distributed.

Collectively, the three assumptions simply state \[\varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2), ~~\textrm{for variance}~\sigma^2\]

Important Note. In the above bullet points, the assumptions are listed in their order of importance. If you lack independence, nearly all statistical methods we will cover in this text are invalid. The constant variance assumption can be important but in many cases we can address it. Lastly, the Normality assumption tends to not be all that important except in small sample sizes and when the data is heavily skewed – we can also address it in many cases.

2.6.1 Independence

In general, independence (or lack thereof) is the result of the sampling scheme employed during data collection. If you collect your data according to a well designed experiment or a simple random sampling scheme with one observation per subject, there is usually no reason to suspect you will have a problem. Data collected sequentially in time (e.g., daily high temperatures or repeated measures on the same subject) will exhibit a phenomenon known as autocorrelation (or serial correlation) and is a major problem! Specialized methods are needed for this type of data (see Section 4.2).

2.6.2 Constant Variance

To assess the constant variance assumption we typically consider two simple plots: a Residuals vs Fitted plot and Scale-Location plot as seen in Figure 2.5 (code to generate the plot available below).


Figure 2.5: Residual diagnostic plots showing a fanning effect in the residuals vs fitted and increasing linear trend in the Scale-Location plot, thus suggesting the variance increases with the mean.

What is a residual?

Consider the baseline model (2.1). We can estimate \(\mu\) with \(\bar{Y}\) and we would predict each observed \(Y_i\) with \(\hat{Y}_i=\bar{Y}\). From there, we can approximate the random variation of the \(i^\mathrm{th}\) observation \(\varepsilon_i\) with \(e_i = Y_i - \hat{Y}_i = Y_i - \bar{Y}\). The values \(\hat{Y}_i\) are known as the fitted, or predicted, values and \(e_i\) as the sample residuals.

In the case of the One-Way ANOVA model (2.3), we would calculate a different mean \(\bar{Y}_j\) for each of the \(k\) treatments and each \(Y_{ji}\) would be predicted by the appropriate \(\bar{Y}_j\) for \(j=1,\ldots,k\). The sample residuals would be found by \(e_{ji} = Y_{ji} - \hat{Y}_{ji} = Y_{ji} - \bar{Y}_j\).
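A short sketch verifying this, assuming the fit.drug object from above (the object deer_check is ours for illustration): the fitted values from the one-way ANOVA are the treatment means, and the residuals are the observations minus those means.

deer_check <- deer_knockdown_times %>%
  group_by(Drug) %>%
  mutate(fitted_by_hand = mean(Knockdown_time),                  # treatment mean, Ybar_j
         resid_by_hand  = Knockdown_time - fitted_by_hand) %>%   # e_ji = Y_ji - Ybar_j
  ungroup()

# Both comparisons should return TRUE
all.equal(deer_check$fitted_by_hand, unname(fitted(fit.drug)))
all.equal(deer_check$resid_by_hand,  unname(residuals(fit.drug)))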

How to read these plots?

A residuals versus fitted plot is provided in the left panel of Figure 2.5. If all is well, you should see fairly uniform (“constant”) spread of the points in the vertical direction and the scatter should be symmetric vertically around zero. The vertical spread here looks like it might be expanding as the fitted values increase (fanning effect), suggesting that there may be non-constant error variance.

An attempt at refining the residuals vs fitted plot is given in the plot on the right in Figure 2.5, called the Scale-Location plot. The difference now is that instead of plotting the raw residuals \(e_{ji}\) on the vertical axis, R first standardizes them (so you can better check for extreme cases), takes their absolute value (to double the resolution in the plot) and then takes their square root (to remove skew that sometimes affects these plots). R then adds a trend line through the points: if the constant variance assumption is OK, this trend line should be roughly horizontal. Here, you can see it trending upward, so we may have a constant variance problem.

Using the absolute values of the residuals is a good refinement. The reason is that we really do not care whether a residual is positive or negative: all we are looking for is whether the scatter is uniform across the plot. Thus, by looking only at the magnitude of the residuals (and not their direction), we are actually improving the resolution of the information that the plot contains in addressing constant variance.
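To make the construction concrete, here is a sketch of building the Scale-Location panel by hand from fit.drug, using rstandard() for the standardized residuals (the object diag_df is ours for illustration); the autoplot() function shown later builds this for us.

# Scale-Location by hand: sqrt of |standardized residuals| versus fitted values
diag_df <- data.frame(fitted_vals  = fitted(fit.drug),
                      sqrt_abs_std = sqrt(abs(rstandard(fit.drug))))

ggplot(diag_df, aes(x = fitted_vals, y = sqrt_abs_std)) + 
  geom_point() + 
  geom_smooth(se = FALSE) +     # trend line; roughly flat if the variance is constant
  labs(x = "Fitted values", y = "Sqrt of |standardized residuals|") + 
  theme_minimal()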

2.6.3 Checking Normality

We typically make this assessment with a Normal quantile-quantile (Q-Q) plot.


Figure 2.6: Normal Q-Q plot demonstrating the residuals may deviate from the Normality assumption.

In a normal Q-Q plot, the observed sample values (the “sample quantiles”) are plotted against the corresponding quantiles from a reference normal distribution (the “theoretical quantiles”). If the parent population is normally distributed, then the sample values should reasonably match up one-to-one with normal reference quantiles, resulting in a plot of points that line up in a linear (straight line) fashion.
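Such a plot of the model residuals can also be drawn directly in ggplot2; a minimal sketch using the fit.drug object from above:

ggplot(data.frame(resids = residuals(fit.drug)), aes(sample = resids)) + 
  stat_qq() +          # sample quantiles vs theoretical Normal quantiles
  stat_qq_line() +     # reference line
  labs(x = "Theoretical quantiles", y = "Sample quantiles") + 
  theme_minimal()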

A normal Q-Q plot of the residuals is presented in Figure 2.6. A common tool to analyze these plots is sometimes known as the “fat pencil” test. Consider the thick pencils children are given when learning to write – lay that pencil on the plotted line; if all the points are covered by the pencil, the data are approximately normally distributed.

In Figure 2.6 there is some deviation from the line (observations 13 and 7). Coupled with the issues with variance outlined above, we may have evidence to suggest the normality assumption is not met.

Note: It is quite common for violations of constant variance and normality to go hand-in-hand. That is, if there are concerns about one, it is common for there to be concerns regarding the other.

2.6.4 Code to check assumptions

The above plots can be generated via the autoplot function after loading the ggfortify library (Horikoshi et al., 2019). The autoplot function provides four plots by default, seen in Figure 2.7 (note: we will discuss the Constant Leverage plot in Section 8.6).

library(ggfortify)
autoplot(fit.drug)

Figure 2.7: Residual plots for the One-Way ANOVA studying the effect of different drugs on immobilization times - Top-left a Residuals vs Fitted Plot; Top-right a Normal Q-Q plot of the residuals; Bottom-left the Scale-Location plot; and Bottom-right a Residuals vs Leverage plot.

2.6.5 Transforming your response

We have evidence from the above plots that the variance appears to be increasing with the response (determined by assessing the constant variance assumption) and that there may be some violations of normality. A typical way to address this issue is to transform the response variable. Common transformations include \(Y^*=\log(Y)\), \(Y^*=\sqrt{Y}\) and \(Y^*=1/Y\); later in the class we will describe these transformations in more detail. Here we use a logarithmic transformation, as it is fairly common. We can achieve this in R with the following, which provides the residual plots in Figure 2.8.

fit.log.drug <- aov(log(Knockdown_time) ~ Drug, data=deer_knockdown_times)
autoplot(fit.log.drug)

Figure 2.8: Residual diagnostic plots when the response variable is the logarithm of the immobilization time. Here the heteroskedasticity and non-normality in the original model has been improved, albeit not entirely mitigated, with some minor fanning in the residuals vs fitted plot.

This is a bit better but other transformations can be considered. For now we proceed with the logarithmic transformation.

The test comparing the three drug population means (on a log scale) is given via the summary() command:

summary(fit.log.drug)
##             Df Sum Sq Mean Sq F value Pr(>F)   
## Drug         2   2.84   1.420    8.22 0.0016 **
## Residuals   27   4.67   0.173                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result is significant (\(F\) = 8.219, \(\textrm{df}_1\) = 2, \(\textrm{df}_2\) = 27, \(p\)-value = 0.00163). Thus, we could conclude that at least two of the drugs produce significantly different log-mean immobilization times.

Math Tangent: Note that the logarithm function is a monotone increasing function. So if \(a>b\) then \(\log(a)>\log(b)\). Thus, we can infer that the mean immobilization times for the three drug treatments also differ.

Of course, what remains to be seen is which pairs of drugs produce different log-mean immobilization times. For that, we will need to look at techniques known as multiple comparison procedures.

2.7 Follow-up procedures – Multiple Comparisons

When there is no significant difference detected among treatment means, we can safely conclude that the treatment had no effect on the mean response that was any larger than the type of change we might expect just due to random chance. In such a case, the analysis is essentially over, except that we might want to assess the size of the observed treatment differences for descriptive purposes.

However, if significant differences are detected among the treatments (as in the example above), the ANOVA \(F\)-test does not inform you where the differences lie. In such a situation, a follow-up procedure known as a multiple comparison procedure is warranted for seeking out pairwise mean differences where they exist.

There are numerous multiple comparison methods in existence; they all have certain advantages and disadvantages, and some are designed for specific sorts of follow-up comparisons. Here we will discuss two procedures for handling the following scenarios:

  1. All possible pairwise comparisons
  2. All comparisons versus a control group

So for example, if we had a one-way design with five treatment groups A, B, C, D and E, there would be 10 possible pairwise comparisons:

\[\begin{array}{cccc} A~\textrm{vs.}~B & B~\textrm{vs.}~C & C~\textrm{vs.}~D & D~\textrm{vs.}~E \\ A~\textrm{vs.}~C & B~\textrm{vs.}~D & C~\textrm{vs.}~E & \\ A~\textrm{vs.}~D & B~\textrm{vs.}~E & & \\ A~\textrm{vs.}~E & & & \end{array}\]

whereas if (say) A was a control group and we were only interested in comparisons versus a control group, then there would be only 4 pairwise comparisons of interest:

\[\begin{array}{c} A~\textrm{vs.}~B \\ A~\textrm{vs.}~C \\ A~\textrm{vs.}~D \\ A~\textrm{vs.}~E \\ \end{array}\]

So what is the big deal? Well, if you find yourself in a multiple comparison situation, it is because you have declared that at least two groups differ based on the result of a single hypothesis test (the \(F\)-test). For that test, you used a specific criterion to make the decision (probably \(p\)-value < 0.05).

But to tease out where the differences are, you now need to run multiple tests under the same “umbrella.” That gives rise to something known as the multiple testing problem. The problem is an increase in the false positive rate (also known as Type I error rate) that occurs when statistical tests are used repeatedly.

Multiple testing problem. If \(m\) independent comparisons are made, each using a Type I error rate of \(\alpha_{PCER}\), then the family-wise (overall) Type I error rate for all comparisons considered simultaneously is given by \[\alpha_{FWER} = 1 - (1 - \alpha_{PCER})^m.\]

Illustration: Coin flips. Suppose that we declare that a coin is biased if in 10 flips it landed heads at least 9 times. Indeed, if one assumes as a null hypothesis that the coin is fair, then it can be shown that the probability that a fair coin would come up heads at least 9 out of 10 times is 0.0107. This is relatively unlikely, and under usual statistical criteria such as “reject \(H_0\) if \(p\)-value < 0.05,” one would declare that the null hypothesis should be rejected — i.e., declare the coin is unfair.

A multiple testing problem arises if one wanted to use this test (which is appropriate for testing the fairness of a single coin) to test the fairness of many coins. Now, imagine if you were to test 100 coins by this method. Given that the probability of a fair coin coming up 9 or 10 heads in 10 flips is 0.0107, one would expect that when flipping 100 coins ten times each, to see a particular (i.e., pre-selected) coin come up heads 9 or 10 times would still be very unlikely, but seeing some coin (as in any of the 100 coins) behave that way, without concern for which one, would be more likely than not. In fact, the likelihood that all 100 fair coins are identified as fair by this criterion is \((1 - 0.0107)^{100} \approx 0.34\). Therefore the application of a “single-test coin-fairness criterion” to multiple comparisons would more likely than not falsely identify at least one fair coin as unfair!
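These probabilities are easy to verify; a short sketch computing the single-coin tail probability and the family-wise rates implied by the formula above (the name p_single is ours for illustration):

p_single <- pbinom(8, size = 10, prob = 0.5, lower.tail = FALSE)   # P(9 or 10 heads in 10 flips)
p_single               # approximately 0.0107

(1 - p_single)^100     # chance all 100 fair coins pass the criterion; approximately 0.34

1 - (1 - 0.05)^10      # family-wise error rate for 10 independent tests at alpha = 0.05; about 0.40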

But alas, multiple testing is a reality and a natural follow-up procedure to ANOVA. To address the problem, many multiple comparison procedures have been developed that help control the accumulation of the false positive rate. Some methods perform better than others, but common to all of them is making adjustments to \(p\)-values or CIs based upon how many comparisons are being made. Multiple comparisons in an ANOVA framework can be performed using the add-on R package emmeans (Lenth, 2019). We will focus on two useful methods: Tukey’s HSD and Dunnett’s.

2.7.1 Tukey’s HSD method

One of the more widely used methods is attributable to statistician John Tukey (creator of the Box-Whiskers plot and credited with creating the word “bit”). It is sometimes referred to as Tukey’s Honestly Significant Difference method (honestly!). It has the advantage of controlling the Type I error rate when performing all possible pairwise comparisons. First we load the library and then run our fitted model fit.log.drug through the emmeans() function. Then we tell it to do a “pairwise” comparison via the contrast() function.

library(emmeans)
drug.mc <- emmeans(fit.log.drug, "Drug")
contrast(drug.mc, "pairwise")
##  contrast estimate    SE df t.ratio p.value
##  A - B      0.0754 0.186 27   0.406  0.9136
##  A - C      0.6872 0.186 27   3.696  0.0027
##  B - C      0.6117 0.186 27   3.291  0.0076
## 
## Results are given on the log (not the response) scale. 
## P value adjustment: tukey method for comparing a family of 3 estimates

The significant differences are what we expected based on our earlier EDA: drug C is significantly different in terms of log-mean immobilization time as compared to the other two drugs. This is borne out from the Tukey-adjusted \(p\)-values of 0.0027 (when comparing drugs A and C) and 0.0076 (when comparing drugs B and C). B and A are not significantly different. The \(p\)-values in this output have been adjusted, so we can just compare them to our usual 0.05.

We can also obtain simultaneous 95% confidence intervals for the true mean differences due to each drug pairing using the confint() function on our contrast of an emmeans object:

confint(contrast(drug.mc, "pairwise"))
##  contrast estimate    SE df lower.CL upper.CL
##  A - B      0.0754 0.186 27   -0.386    0.536
##  A - C      0.6872 0.186 27    0.226    1.148
##  B - C      0.6117 0.186 27    0.151    1.073
## 
## Results are given on the log (not the response) scale. 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates

These may be interpreted as follows:

95% CI for \(\mu_A - \mu_B\) = (-0.386, 0.536). This interval contains 0, so we cannot conclude that the true log-mean immobilization times significantly differ between drugs A and B.

95% CI for \(\mu_A - \mu_C\) = (0.226, 1.148). We can be 95% confident that drug A produces a true log-mean immobilization time that is between 0.226 to 1.148 higher than drug C.

95% CI for \(\mu_B - \mu_C\) = (0.151, 1.073). We can be 95% confident that drug B produces a true log-mean immobilization time that is between 0.151 and 1.073 higher than drug C.

To put things back in the original units, we could “un”-log the intervals by exponentiating (use the exp() function, since by default log() takes the natural log).
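For example, a quick sketch exponentiating the endpoints of the A vs. C interval quoted above (keep in mind that exponentiating a difference of logs yields a multiplicative, ratio-type comparison on the original scale):

exp(c(0.226, 1.148))   # back-transformed endpoints, roughly 1.25 to 3.15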

We can also visualize these results via a plot:

plot(contrast(drug.mc, "pairwise"))

Figure 2.9: Tukey adjusted confidence interval plots comparing treatments A, B and C. Here, we see that treatments B and C are significantly different, as are treatments A and C. We also note there is no significant difference between treatments A and B.

Here, we are interested in comparing each plotted confidence band to the value of 0.

2.7.2 Dunnett multiple comparisons

The Dunnett method is applicable when one of your experimental groups serves as a natural control to which you want to compare the remaining groups. Note that there is no control group in our working example, but we include the code below for demonstration purposes.

In situations where a control group is present, it is advisable to use a method that only controls the family-wise error rate for the comparisons of interest, rather than over-controlling for all possible comparisons, such as when using a method like Tukey’s HSD. We still use the emmeans package but specify the comparisons are against a control group. The code is below:

contrast(drug.mc, "trt.vs.ctrl")
##  contrast estimate    SE df t.ratio p.value
##  B - A     -0.0754 0.186 27  -0.406  0.8750
##  C - A     -0.6872 0.186 27  -3.696  0.0019
## 
## Results are given on the log (not the response) scale. 
## P value adjustment: dunnettx method for 2 tests

The two comparisons provided are A vs. B and A vs. C. The only significant difference is between A and C (\(p\)-value = 0.0019). As before, here are simultaneous CIs and plots:

confint(contrast(drug.mc, "trt.vs.ctrl"))
##  contrast estimate    SE df lower.CL upper.CL
##  B - A     -0.0754 0.186 27   -0.512    0.361
##  C - A     -0.6872 0.186 27   -1.123   -0.251
## 
## Results are given on the log (not the response) scale. 
## Confidence level used: 0.95 
## Conf-level adjustment: dunnettx method for 2 estimates
plot(contrast(drug.mc, "trt.vs.ctrl"))

Figure 2.10: Dunnett adjusted confidence interval plots demonstrating that treatment C is significantly different from control group A, while treatment B is not significantly different from control group A.

It is interesting to compare the results you get via the two methods for the same comparisons (e.g., A vs. C). Can you explain what you see?

Note. Group A was automatically chosen as the reference group for Dunnett (the default behavior is to choose the first factor level [in this case the first alphabetically]). To change this from the default, just tell R to do so in the contrast statement. In the below example we set the second (2) factor, B, to be the control with ref=2.

contrast(drug.mc, "trt.vs.ctrl", ref=2)
##  contrast estimate    SE df t.ratio p.value
##  A - B      0.0754 0.186 27   0.406  0.8750
##  C - B     -0.6117 0.186 27  -3.291  0.0054
## 
## Results are given on the log (not the response) scale. 
## P value adjustment: dunnettx method for 2 tests

References

Horikoshi, M. and Tang, Y., Ggfortify: Data Visualization Tools for Statistical Analysis Results, from https://CRAN.R-project.org/package=ggfortify, 2019.
Kutner, M. H., Nachtsheim, C., Neter, J. and Li, W., Applied Linear Statistical Models, New York, NY: McGraw-Hill Irwin, 2004.
Lenth, R., Emmeans: Estimated Marginal Means, Aka Least-Squares Means, from https://CRAN.R-project.org/package=emmeans, 2019.
Walpole, R. E., Myers, R. H., Myers, S. L. and Ye, K., Probability & Statistics for Engineers and Scientists, Upper Saddle River: Pearson Education, 2007.
Weiss, N. A., Introductory Statistics, Pearson Education, 2012.
Wesson, J. A., Influence of Physical Restraint and Restraint-Facilitating Drugs on Blood Measurements of White-Tailed Deer and Other Selected Mammals, Master’s thesis, Virginia Polytechnic Institute and State University, 1976.