Chapter 3 Multiple Factor Designed Experiments

In the previous chapters we have seen how one variable (perhaps a factor with multiple treatments) can influence a response variable. It should not be much of a stretch to consider the case of multiple predictor variables influencing the response.

Example. Consider the admissions dataset with ACT scores as a response; in addition to the year of entry, perhaps a student’s home neighborhood or state could also serve as a predictor.

In this chapter we will explore two key analyses for designed experiments – the block design and the two-factor experiment. The learning objectives of this unit include:

  • Understanding the structure of a multi-factor model.
  • Understanding the rationale of a block design.
  • Understanding the rationale and structure of a two-factor design.
  • Performing a full analysis involving multiple predictor variables.

3.1 Blocking

Randomized Block Designs are a bit of an odd duck. The type of design itself is straightforward, but the analysis might seem a bit unusual.

The design is best described in terms of the agricultural field trials that gave birth to it. When conducting field trials to compare plant varieties, there is concern that some areas of a field may be more fertile than others (e.g., soil nutrients, access to water, shade versus sun, etc…). For example, if one were comparing three plant varieties (call them A, B and C), it would not be a good idea to use one variety in one location (say next to a creek), another variety in a different location, and the third variety way back in another part of the field near a forest because the effects of the varieties would be confounded with the natural fertility of the land. See Figure 3.1 as an example of a poorly designed experiment.


Figure 3.1: Diagram of a poorly designed experiment that is ignoring the potential confounding factors of a creek and forest.

The natural fertility (next to the creek, compared to next to the forest, compared to the middle) in this situation is a confounding factor. A confounding factor is something we are not necessarily interested in, but that nonetheless will have an impact on our assessment of the factor of interest (e.g., plant variety).

We could attempt to mitigate the effect of this confounding factor through randomization. For example, suppose by random allocation the treatments are distributed as shown in Figure 3.2.


Figure 3.2: Randomized design that attempts to account for the confounding factors of a creek and forest.

In Figure 3.2 the treatments have been randomly assigned to the different locations in an effort to mitigate the effects of the stream and forest on land fertility. However, by random chance we note that treatment “C” has 4 of its 6 replicates in proximity to the stream, with the remaining 2 next to the forest. A block design looks to address this limitation of randomization when there are suspected (known) confounding effects.

A block is defined to be “a relatively homogeneous set of experimental material.” This essentially means that we would like to be able to assess the treatment effects without undue influence from a contaminating or confounding factor, such as land fertility. One way to do this would be to apply all the different treatments (e.g., the three plant varieties) to the same area of the field. Within one area (say next to the creek in Figures 3.1 and 3.2), we should see consistency in natural fertility levels. Thus any differences observed in a measured response (such as crop yield) between plant varieties will not be due to variation in fertility levels, but rather due to the different plant varieties themselves.

A block design is one that attempts to reduce the noise in the measured response in order to “clarify” the effect due to the treatments under study.

A randomized block design in the field study would be conducted in the following way:

  • The field is divided into blocks.
    • In our working example we use 3 blocks for the land next to the creek, next to the forest and that in the middle.
  • Each block is divided into a number of units equal to, or more than, the number of treatments.
    • In the example we have 6 units in each block.
  • Within each block, the treatments are assigned at random so that a different treatment is applied to each unit. When all treatments are assigned at least once in each block, it is known as a “Complete” Block Design. The defining feature of the Randomized (Complete) Block Design is that each block sees each treatment at least once.
    • We have randomly allocated 2 replicates of each treatment in each block for a complete and balanced design.
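As a small illustration of this within-block randomization, the allocation could be generated in R along the following lines. This is only a sketch; the seed, block labels, and data frame name (field_design) are hypothetical and used purely for demonstration.

set.seed(363)   # arbitrary seed so the allocation below is reproducible
blocks <- c("Creek", "Middle", "Forest")
field_design <- data.frame(
  Block = rep(blocks, each = 6),
  Unit  = rep(1:6, times = 3),
  # within each block, randomly shuffle two copies of each treatment
  Treatment = as.vector(replicate(3, sample(rep(c("A", "B", "C"), 2))))
)
head(field_design)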

Figure 3.3 displays a randomized complete block design for the same experiment involving three plant varieties, “A,” “B” and “C.”


Figure 3.3: Diagram of a Randomized Complete Block Design that accounts for the confounding factors of a creek and forest.

A good experimental design will be able to “parse out” the variability due to potential confounding factors, and thus give clarity to our assessment of the factor of interest. A block design can accomplish this task.

What can serve as a block? It is important to note that subjects (i.e., experimental units) themselves may form blocks in a block design. It is also possible that different experimental units may collectively form a “block.” It depends on the circumstances. All that is needed is that a block, however formed, creates a “relatively homogeneous set of experimental material.”

An example. You’ve already seen the concept of a block design! Consider the paired \(t\)-test included in Section 2.4. There, the individual vehicles are essentially treated as blocks. We observe two responses within each block (or vehicle).

3.1.1 Data structure, model form and analysis of variance of a Randomized Block Design

Here is the data structure for a randomized block design (RBD) with one treatment replicate per block:

\[\begin{array}{c|cccc} \hline & \textbf{Treatment 1} & \textbf{Treatment 2} & \ldots & \textbf{Treatment } k \\ \hline \textbf{Block 1} & Y_{11} & Y_{21} & \ldots & Y_{k1} \\ \textbf{Block 2} & Y_{12} & Y_{22} & \ldots & Y_{k2} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ \textbf{Block } b & Y_{1b} & Y_{2b} & \ldots & Y_{kb} \\ \hline \end{array}\]

The model for such data has the form

\[Y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}\]

where

  • \(Y_{ij}\) is the observation for treatment \(i\) within block \(j\).
  • \(\mu\) is the overall mean.
  • \(\tau_i\) is the effect of the \(i^\mathrm{th}\) treatment on the mean response.
  • \(\beta_j\) is the effect of the \(j^\mathrm{th}\) block on the mean response.
  • \(\varepsilon_{ij}\) is the random error term.

The usual test of interest is the same as in a one-way analysis of variance: to compare the population means due to the different treatments. The null and alternative hypotheses are given by: \[H_0: \tau_1 = \tau_2 = \ldots = \tau_k = 0 ~~\textrm{versus}~~ H_a: \textrm{at least one } \tau_i \neq 0\] We may also test for differences between blocks, but this is usually of secondary interest at best.
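As a preview of the analyses that follow, a model of this form can be fit in R with aov() by placing both the block and treatment factors on the right-hand side of the formula. Below is a minimal sketch, assuming a data frame rbd_data with a numeric response Response and factor variables Block and Treatment (hypothetical names).

rbd.fit <- aov(Response ~ Block + Treatment, data = rbd_data)
summary(rbd.fit)   # the Treatment row provides the F-test of H0: tau_1 = ... = tau_k = 0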

3.1.2 Block ANOVA Example

As a first example consider the following classic barley yield experiment.

Example. Barley Yields (from Fisher (1935)).

Data was collected on the yields of five varieties of barley grown at six different locations in 1931 and 1932. Of statistical interest is the mean difference in yield between the five varieties.

Original Source: Immer et al. (1934)

Before proceeding with the data analysis, we first note the variables and factors of interest in this study.

  • Barley yield - this is the response variable.
  • Barley variety - the variety, or species, of barley. This is the factor variable of interest. It has 5 levels.
  • Location and Year - The combinations of location and year (12 in total) serve as the block term in this model. Within each location and year, the growing environment should be fairly homogeneous, so each location-year combination serves as a block.

The data is available in the file fisherBarley.RData on the class repository.

load(url("https://tjfisher19.github.io/introStatModeling/data/fisherBarley.RData"))
glimpse(fisherBarleyData)
## Rows: 60
## Columns: 5
## $ Yield        <dbl> 81.0, 80.7, 146.6, 100.4, 82.3, 103.1, 119.8, 98.9, 98.9,~
## $ Variety      <chr> "Manchuria", "Manchuria", "Manchuria", "Manchuria", "Manc~
## $ LocationYear <chr> "L1-31", "L1-32", "L2-31", "L2-32", "L3-31", "L3-32", "L4~
## $ Location     <chr> "University Farm", "University Farm", "Waseca", "Waseca",~
## $ Year         <dbl> 1931, 1932, 1931, 1932, 1931, 1932, 1931, 1932, 1931, 193~

We note the data includes a LocationYear variable that combines the information of both the Location and Year. We also note we have a single observation for each Variety and LocationYear combination – thus we have a randomized complete block design with a single observation per treatment-block combination.
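One quick way to verify this complete block structure is to count the observations for every Variety and LocationYear combination; with a single replicate, every count should equal 1. A short check (assuming the tidyverse is loaded, as in the code above):

fisherBarleyData %>%
  count(Variety, LocationYear) %>%
  filter(n != 1)    # returns zero rows when each combination is observed exactly once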

We first proceed with an inappropriate simple visualization of the data. We do this for demonstration and learning purposes.

p1 <- ggplot(fisherBarleyData) + 
  geom_boxplot(aes(x=LocationYear, y=Yield) ) +
  labs(x="Location & Year", y="Yield (bushels)")
p2 <- ggplot(fisherBarleyData) + 
  geom_boxplot(aes(x=Variety, y=Yield)) +
  labs(x="Barley Variety", y="Yield (bushels)")
grid.arrange(p1, p2, nrow=1)

Figure 3.4: Boxplot distribution of the barley yields by LocationYear and barley Variety.

First, a quick note about the above code. We build a simple boxplot of the yield by the block term and call it p1. Similarly we build a boxplot of yield by variety, called p2. We then use the grid.arrange() function in the gridExtra package (Auguie, 2017) to put the two plots side-by-side. The results are in Figure 3.4.

Incorrect plots. We note that these plots are not necessarily appropriate for this data. For one, boxplots display a five-number summary – for each block term we only have five observations (one for each variety), so the figure on the left appears to summarize a sample of data when in fact we only have 5 observations – why not just display the raw data? Second, the figure on the right is ignoring the block structure – this is effectively equivalent to ignoring the pairing in a matched pairs design. Each environmental condition consisting of a location & year will have its own climate, rainfall, soil nutrients, etc… For example, it is possible a particular barley variety may grow well in one location and year due to optimal precipitation and sunshine levels, but another variety may grow poorly there. Likewise, a different variety may prefer a different climate – this information cannot be deduced from this plot.
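To address the first concern, one simple alternative is to plot the raw yields rather than boxplot summaries. A minimal sketch is below; it displays the five observations within each block but still ignores which variety produced each yield.

ggplot(fisherBarleyData) + 
  geom_point(aes(x=LocationYear, y=Yield) ) +
  labs(x="Location & Year", y="Yield (bushels)")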

A better plot to consider is a profile plot similar to that used in the matched pairs design in Figure 2.2 from Section 2.4. Here, we explore the response at each treatment level within each block term.

ggplot(fisherBarleyData) + 
  geom_line(aes(x=Variety, y=Yield, 
                color=LocationYear, group=LocationYear) ) +
  labs(x="Barley Variety", y="Yield (bushels)", color="Location & Year") + 
  theme_minimal()

Figure 3.5: Profiles of barley yield by variety across each of the 12 Location & Year combinations.

This plot in Figure 3.5 is an improvement over the previous result but can be further tweaked for visual improvement. In particular, each line corresponds to one of the block terms (this is the common approach for these plots) but there are 12 different colors displayed, and it can be difficult to distinguish them. This can be considered visual clutter. Since the specific Location & Year combinations are not necessarily of interest in our study, we can mute their impact by plotting each line with the same color as in Figure 3.6.

ggplot(fisherBarleyData) + 
  geom_line(aes(x=Variety, y=Yield, 
                group=LocationYear), color="gray30" ) +
  labs(x="Barley Variety", y="Yield (bushels)") + 
  theme_minimal()

Figure 3.6: Profiles of barley yield by variety across each of the 12 Location & Year combinations, with all profiles drawn in the same color.

From the profile plots in Figures 3.5 and 3.6 it is clear that certain Location & Years were more fruitful than others, in terms of barley yield. However, this is not the research question of interest - we are interested in determining how the different varieties influence yield. From the plots, this is not as clear. There is some evidence that the Trebi variety, and perhaps the Peatland variety, may result in higher yields across the blocks.

To determine the effect variety has on yield we proceed with a formal analysis using a call to aov() – making sure to check the underlying assumptions in Figure 3.7.

barley.fit <- aov(Yield ~ LocationYear + Variety, data=fisherBarleyData)
autoplot(barley.fit)

Figure 3.7: Residual diagnostic plots when the response variable is the barley yield as a function of location & year (block) and variety. Here, the residuals appear fairly homoskedastic based on the Residuals vs Fitted and Scale-Location plots. The normality assumption is satisfied based on the Normal Q-Q plot.

The assumptions look generally fine here (nothing too concerning about constant variance or normality). So, on to the analysis:

summary(barley.fit)
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## LocationYear 11  31913    2901   17.00 2.1e-12 ***
## Variety       4   5310    1327    7.78 7.9e-05 ***
## Residuals    44   7509     171                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that there is a significant difference in the yield between the five varieties (\(F\) = 7.779, \(\textrm{df}_1\) = 4, \(\textrm{df}_2\) = 44, \(p\)-value = 0.0000785)
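As a quick sanity check, this \(F\)-statistic is simply the ratio of the Variety mean square to the residual mean square reported in the table, and the \(p\)-value follows from the corresponding \(F\) distribution.

1327 / 171                                        # approximately 7.76 (7.78 before rounding the mean squares)
pf(7.779, df1 = 4, df2 = 44, lower.tail = FALSE)  # reproduces the reported p-value of about 0.0000785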

You can also see how the total variation was partitioned into three components:

  1. Between-block (LocationYear) sum of squares (31913).
  2. Between variety (i.e., treatment) sum of squares (5310).
  3. Residual sum of squares (7509).

We see that R also reports an \(F\)-statistic and associated \(p\)-value for the LocationYear. This is the block term and is not of concern to our analysis. So we ignore that \(F\)-statistic and \(p\)-value!

We can investigate the differences in yields between the five varieties using a multiple comparison procedure.

barley.mc <- emmeans(barley.fit, "Variety")
plot(contrast(barley.mc, "pairwise") )

Figure 3.8: Tukey adjusted confidence interval plots comparing the five barley varieties (treatments). Here, we see that the Trebi variety is statistically different from the other four varieties, while those four are all statistically similar (recall, we look for zero in the confidence interval of the differences).

We see that six of the ten confidence intervals in Figure 3.8 contain the value zero, so we conclude there is no statistical difference between those varieties of barley. However, zero is not included in four of the intervals, and all four involve the ‘Trebi’ variety of barley - thus we have statistical evidence to conclude that the Trebi variety results in a different yield than the other four varieties. Furthermore, the values of these intervals suggest that the Trebi variety results in a greater yield than the other varieties (Trebi \(-\) Variety \(>0\) implies Trebi \(>\) Variety, and Variety \(-\) Trebi \(<0\) implies Trebi \(>\) Variety).
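If the numeric bounds of these Tukey-adjusted intervals are desired in addition to the plot, they can be printed by applying confint() to the same set of contrasts.

confint(contrast(barley.mc, "pairwise"))   # Tukey-adjusted 95% intervals for the ten pairwise differences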

3.1.3 Another Block ANOVA Example

The previous block design had its roots in agriculture, much like our original description. Here is a different example of a block term in a design.

Example. Weight loss study (published in Walpole et al. (2007)).

Two reduction treatments and a control treatment were studied for their effects on weight change in obese women. The two reduction treatments were a self-induced weight reduction program and a therapist-controlled reduction program. Each of 10 subjects was assigned to the three treatment programs in a random order and measured for weight loss.

Original Source: Hall (1972)

\[\begin{array}{c|ccc} \hline \textbf{Subject} & \textbf{Control} & \textbf{Self-induced} & \textbf{Therapist} \\ \hline 1 & 1.00 & -2.25 & -10.50 \\ 2 & 3.75 & -6.00 & -13.50 \\ 3 & 0.00 & -2.00 & 0.75 \\ 4 & -0.25 & -1.50 & -4.50 \\ 5 & -2.25 & -3.25 & -6.00 \\ 6 & -1.00 & -1.50 & 4.00 \\ 7 & -1.00 & -10.75 & -12.25 \\ 8 & 3.75 & -0.75 & -2.75 \\ 9 & 1.50 & 0.00 & -6.75 \\ 10 & 0.50 & -3.75 & -7.00 \\ \hline \end{array}\]

The data are available in the file weightlossStudy.csv on the class repository (https://tjfisher19.github.io/introStatModeling/data/). Below is an analysis of this data.

weightloss <- read_tsv("https://tjfisher19.github.io/introStatModeling/data/weightlossStudy.csv")
glimpse(weightloss)
## Rows: 30
## Columns: 3
## $ Subject   <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, ~
## $ Treatment <chr> "Control", "Self-induced", "Therapist", "Control", "Self-ind~
## $ WtChanges <dbl> 1.00, -2.25, -10.50, 3.75, -6.00, -13.50, 0.00, -2.00, 0.75,~

Each subject gives three responses instead of one as in a usual one-way design. In this manner, each subject (or person) forms a block (a person is a fairly homogeneous set of experimental material). The design is a randomized block design, because the experimenter randomly determines the order of the treatments for each subject.
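To see this block structure explicitly, the long-format data can be reshaped so that each subject (block) occupies a single row, mirroring the table above. A minimal sketch, assuming the tidyr package (part of the tidyverse) is loaded:

weightloss %>%
  pivot_wider(names_from = Treatment, values_from = WtChanges)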

Before the formal analysis we build profile plots to explore the effects of the three treatments within each subject in Figure 3.9.

ggplot(weightloss, aes(x=Treatment, y=WtChanges)) + 
  geom_line(aes(group=Subject), color="gray30" ) +
  stat_summary(fun="mean", geom="line", size=2, group=1, color="blue") + 
  labs(x="Weight Loss Treatment", y="Weight Change") + 
  theme_minimal()

Figure 3.9: Profiles of each subject’s weight loss for the different treatments, with an overlaid ‘average’ highlighted in blue.

Overall, in Figure 3.9 the effect of the different treatments appears quite stark, with the Therapist and Self-induced treatments improving over the control. There is also quite a bit of subject-to-subject variability (i.e., each person has their own profile!) and some subjects do not follow the “average” profile.

Here is the formal analysis in a call to aov() – again checking the underlying assumptions in Figure 3.10.

weightloss.fit <- aov(WtChanges ~ Subject + Treatment, data=weightloss)
autoplot(weightloss.fit)

Figure 3.10: Residual diagnostic plots when the response variable is the weight change as a function of subject (block) and treatment. Here, the residuals appear to show some minor violations to the constant variance assumption as indicated by the Residuals vs Fitted and Scale-Location plots. Except for a single concerning observation, the normality assumption is satisfied based on the Normal Q-Q plot.

There appears to be a minor violation in the constant variance assumption (there is some indication that fitted values farther from zero are more variable). But this type of violation cannot easily be remedied by the standard techniques (a logarithm or square root transformation is not possible due to the negative values, and a reciprocal transformation is not possible due to a 0 observation). A single observation (observation 16) has a large \(z\)-score and deviates from the Normal line. We proceed with the analysis making note of these potential minor assumption violations.

summary(weightloss.fit)
##             Df Sum Sq Mean Sq F value Pr(>F)   
## Subject      1      3     3.0    0.19  0.663   
## Treatment    2    210   105.0    6.88  0.004 **
## Residuals   26    397    15.3                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that there is a significant difference in the mean weight change between the three treatments (\(F\) = 6.878, \(\textrm{df}_1\) = 2, \(\textrm{df}_2\) = 26, \(p\)-value = 0.004).

As a follow-up, we can investigate the differences in mean weight changes between the three treatments using a multiple comparison procedure and noting that there is a control treatment.

weightloss.mc <- emmeans(weightloss.fit, "Treatment")
plot(contrast(weightloss.mc, "trt.vs.ctrl") )

Figure 3.11: Dunnett adjusted confidence interval plots comparing the two weight-loss treatments against the control. Here, we see that Self-induced treatment is not statistically different than the control (zero is in the interval) while the Therapist based treatment results in significantly more weight loss than the control.

The confidence interval comparing the therapist-based weight loss treatment to the control in Figure 3.11 does not contain the value zero, so we have evidence to suggest that the therapist-based method results in a larger weight change. However, the interval comparing the self-induced method to the control indicates no statistical difference (zero is included in the interval).
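As with the barley example, numeric bounds for these Dunnett-adjusted intervals can be obtained with confint().

confint(contrast(weightloss.mc, "trt.vs.ctrl"))   # Dunnett-adjusted 95% intervals versus the control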

3.2 Two-factor Designs

A factorial structure in an experimental design is one in which there are two or more factors, and the levels of the factors are all observed in combination. For example, suppose you are testing for the effectiveness of three drugs (A, B, C) at three different dosages (5 mg, 10 mg, 25 mg). In a factorial design, you would observe a total of nine treatments, each of which consists of a combination of drug and dose (e.g., treatment 1 is drug A administered at 5 mg, treatment 2 is drug A administered at 10 mg, etc.). Note how quickly things grow if you have more than two factors!

In this design, the data are usually obtained by collecting a random sample of individuals to participate in the study, and then randomly allocating a single treatment to each of the study participants as in a one-way design structure. The only difference now is that a “treatment” consists of a combination of different factors. If there are two factors, the data may be analyzed using a two-way ANOVA.

The two-way data structure with two factors \(\mathbf{A}\) and \(\mathbf{B}\) looks like the following:

\[\begin{array}{c|cccc} \hline & \mathbf{A_1} & \mathbf{A_2} & \mathbf{\ldots} & \mathbf{A_a} \\ \hline \mathbf{B_1} & Y_{111}, Y_{112}, \ldots & Y_{211}, Y_{212}, \ldots, & \ldots & Y_{a11}, Y_{a12}, \ldots \\ \mathbf{B_2} & Y_{121}, Y_{122}, \ldots & Y_{221}, Y_{222}, \ldots, & \ldots & Y_{a21}, Y_{a22}, \ldots \\ \mathbf{B_3} & Y_{131}, Y_{132}, \ldots & Y_{231}, Y_{232}, \ldots, & \ldots & Y_{a31}, Y_{a32}, \ldots \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ \mathbf{B_b} & Y_{1b1}, Y_{1b2}, \ldots & Y_{2b1}, Y_{2b2}, \ldots, & \ldots & Y_{ab1}, Y_{ab2}, \ldots \\ \hline \end{array}\]

Note there is replication (i.e., multiple independent observations) in each treatment (combination of factors \(\mathbf{A}\) and \(\mathbf{B}\)).

The general model for such data has the form

\[Y_{ijk} = \mu + \alpha_i + \beta_j + \alpha\beta_{ij} + \varepsilon_{ijk}\] where

  • \(Y_{ijk}\) is the \(k^\mathrm{th}\) observation in the \(i^\mathrm{th}\) level of factor \(A\) and the \(j^\mathrm{th}\) level of factor \(B\).
  • \(\mu\) is the overall mean.
  • \(\alpha_i\) is the “main effect” of the \(i^\mathrm{th}\) level of factor \(A\) on the mean response.
  • \(\beta_j\) is the “main effect” of the \(j^\mathrm{th}\) level of factor \(B\) on the mean response.
  • \(\alpha\beta_{ij}\) is the “interaction effect” of the \(i^\mathrm{th}\) level of factor \(A\) and the \(j^\mathrm{th}\) level of factor \(B\) on the mean response.
  • \(\varepsilon_{ijk}\) is the random error term.

3.2.1 Analysis of a two-factor design

First recall an important feature of designed experiments – the analysis is determined by the experiment! In most cases of factorial designs, the primary interest is to determine if some interaction between the two factors influences the response. Therefore it is crucial to test the interaction term before assessing the effects of factor \(A\) alone on the response, because if \(A\) and \(B\) interact, then we cannot separate their effects. In other words, if the interaction between \(A\) and \(B\) significantly influences the response, then the effect that factor \(A\) has on the response changes depending on the level of factor \(B\).

The usual testing strategy is as follows:

  1. Fit a full interaction model.
  2. Test for significant interaction between \(A\) and \(B\):
    1. If the interaction term is significant, you must look at comparisons in terms of treatment combination; i.e., you cannot separate the effects of \(A\) and \(B\).
    2. If the interaction term is non-significant, you can look into deleting the interaction term from your model, fitting a reduced main-effects model. Then (and only then) may you look at the effects of \(A\) and \(B\) separately.
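In aov(), the interaction model in step 1 is specified by adding a factor1:factor2 term (or, equivalently, by the * shorthand, which expands to the two main effects plus their interaction). A minimal sketch, assuming a data frame dat with response y and factors A and B (hypothetical names):

full.fit <- aov(y ~ A + B + A:B, data = dat)   # equivalent to aov(y ~ A * B, data = dat)
summary(full.fit)                              # inspect the A:B row of the ANOVA table first
reduced.fit <- aov(y ~ A + B, data = dat)      # main-effects model, fit only if A:B is not significant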

We consider two examples below.

3.2.2 Example with No Interaction

Example. Highway Signage (published in Weiss (2012)).

An experiment was conducted to determine the effects that sign size and sign material have on detection distance (the distance at which drivers can first detect highway caution signs). Four drivers were randomly selected for each combination of sign size (small, medium, and large) and sign material (marked 1, 2, and 3). Each driver covered the same stretch of highway at a constant speed during the same time of day, and the detection distance (in feet) was determined for the driver’s assigned caution sign.

Original Source: Younes (1994)

The data appear in the file highwaySigns.tsv in the textbook repository.

First note the data are stored in a “tab-separated values” file, or TSV. This is similar to the more common “comma-separated values,” or CSV, format. The readr package includes a read_tsv() function that works just like read_csv() and read_table().

highway_signs <- read_tsv("https://tjfisher19.github.io/introStatModeling/data/highwaySigns.tsv", col_types=cols())
glimpse(highway_signs)
## Rows: 36
## Columns: 3
## $ Distance <dbl> 2503, 1929, 1512, 2774, 2187, 2288, 3354, 2920, 2928, 2427, 2~
## $ Size     <chr> "Small", "Small", "Small", "Medium", "Medium", "Medium", "Lar~
## $ Material <dbl> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3~

Take a look at the detection distances by treatment and note that we need to coerce the Material variable to be a factor variable since it is coded as a number. We will also turn Size into a factor with a more logical order (recall that R will put the levels in alphabetical order by default).

highway_signs <- highway_signs %>%
  mutate(Material = as.factor(Material),
         Size = factor(Size, levels=c("Small", "Medium", "Large")))

Now we wish to construct a plot of the data. We note that we have 4 observations per treatment (combination of size and material) so a boxplot makes no sense here. We attempt to plot the raw data, using color to distinguish the three sign materials. We also add some jittering and dodging to separate data points and materials.

ggplot(highway_signs) + 
  geom_point(aes(x=Size, y=Distance, color=Material),
              position=position_jitterdodge() ) + 
  labs(x="Sign Size", y="Detection Distance (ft)") +
  theme_bw()

Figure 3.12: Detection distances as a function of sign size and material.

From Figure 3.12, there appears to be a size effect: Large signs have larger detection distances, followed by medium sized and then small. There may be a material effect as material type 3 appears to have lower detection distances than type 1. Material type 2 is somewhere between types 1 and 3 in terms of detection distances.

We fit the interaction model in aov() by telling R to include an interaction term with the notation factor1:factor2. Below is the specific example for this study but first run a check on the residuals in Figure 3.13.

highway.fit <- aov(Distance ~ Size + Material + Size:Material, data=highway_signs)
autoplot(highway.fit)

Figure 3.13: Residual diagnostic plots when the response variable is the detection distance as a function of sign size and sign material. Here, the residuals appear fairly homoskedastic based on the Residuals vs Fitted and Scale-Location plots. The normality assumption is satisfied based on the Normal Q-Q plot.

The assumptions look to be reasonably met. While there may be some minor issues with constant variance, it does not appear to be systemically related to the predicted (fitted) values, so transformations will not be helpful. Likewise, those few observations in the QQ-Plot are not overly concerning given their \(z\)-score values are in the range \((-2,2)\).

At this point, we can also generate tables of response means after we fit the interaction model. These consist of:

  • The “grand mean” (the overall mean with all data pooled together; an estimate of \(\mu\)).
  • Main effect means (i.e., means for each factor level, taken one factor at a time).
  • Interaction means (i.e., treatment means for all factor combinations).

These means are estimates from the data and are purely descriptive (like the plots were), but informative nonetheless. We could calculate each one of the above using a series of tidyverse commands or using the R function model.tables():

model.tables(highway.fit, "means")
## Tables of means
## Grand mean
##         
## 2552.33 
## 
##  Size 
## Size
##  Small Medium  Large 
## 2104.8 2505.9 3046.2 
## 
##  Material 
## Material
##      1      2      3 
## 2884.9 2522.9 2249.2 
## 
##  Size:Material 
##         Material
## Size     1    2    3   
##   Small  2486 2158 1670
##   Medium 2804 2381 2333
##   Large  3365 3029 2744
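For reference, the treatment (interaction) means in the last table could also be computed with a short tidyverse pipeline such as the following (assuming the tidyverse is loaded):

highway_signs %>%
  group_by(Size, Material) %>%
  summarize(MeanDistance = mean(Distance), .groups="drop")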

We proceed to the hypothesis tests:

summary(highway.fit)
##               Df  Sum Sq Mean Sq F value  Pr(>F)    
## Size           2 5356373 2678187   50.91 6.9e-10 ***
## Material       2 2440644 1220322   23.20 1.4e-06 ***
## Size:Material  4  217044   54261    1.03    0.41    
## Residuals     27 1420315   52604                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first test to inspect is the interaction test. The interaction is insignificant (\(F\) = 1.031, \(\textrm{df}_1\) = 4, \(\textrm{df}_2\) = 27, \(p\)-value = 0.409). Thus, we could conclude the effect that sign size type has on the mean detection distance does not depend on material type.

A visualization of the interaction effect may be obtained using an interaction plot. An interaction plot is basically a plot of treatment means, whereby the means for all treatments having a given fixed level of one of the factors are visually “connected.” We can build the interaction plot seen in Figure 3.14 using aesthetic options in ggplot() and the stat_summary() functions. Note both the color and group aesthetics are determined by the Material variable.

ggplot(highway_signs, 
       aes(x=Size, y=Distance, color=Material, group=Material) ) +  
  stat_summary(fun=mean, geom="point") + 
  stat_summary(fun=mean, geom="line") +
  labs(x="Sign Size", y="Mean Detection Distance (ft)", color="Material")

Figure 3.14: Interaction plot demonstrating that the sign size and material factors do not appear to interact.

There are three treatment mean traces, one for each material type. Each trace connects the mean detection distance of the sign size within each material.

The lack of significant interaction is evidenced by the parallelism of these traces: the material effect for small signs is depicted by the “gaps” between the traces on the left. Similarly, the material effect for the medium signs is depicted by the gaps at the middle points. Likewise, the material effect for the large signs is depicted by the gaps at the right points. Overall, the three traces are reasonably parallel and the interaction is not significant, which implies that the effect of material type is about the same across all three sign sizes.

In other words, the effect of material type does not depend on sign size.

The interaction term was non-significant, and visually we have support that the two factors do not interact. We can therefore consider a reduced model that eliminates the interaction term and includes only main effects:

highway.main.fit <- aov(Distance ~ Size + Material, data=highway_signs)
summary(highway.main.fit)
##             Df  Sum Sq Mean Sq F value  Pr(>F)    
## Size         2 5356373 2678187    50.7 1.7e-10 ***
## Material     2 2440644 1220322    23.1 7.2e-07 ***
## Residuals   31 1637358   52818                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results here (or with the interaction model), we see that both material type and sign size appear to influence the detection distance (both \(p\)-values are smaller than typically used significance levels).

At this point, the follow-up work would involve multiple comparisons among the levels of the significant main effect factors. Here, since there is no interaction, we can perform multiple comparisons within each of the main effects separately. First, we look at the influence of sign size.

twoway.size.mc <- emmeans(highway.main.fit, "Size")
confint(contrast(twoway.size.mc, "pairwise"))
##  contrast       estimate   SE df lower.CL upper.CL
##  Small - Medium     -401 93.8 31     -632     -170
##  Small - Large      -941 93.8 31    -1172     -710
##  Medium - Large     -540 93.8 31     -771     -309
## 
## Results are averaged over the levels of: Material 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates

We note that none of the adjusted 95% confidence intervals include the value 0; thus all sign sizes result in different mean detection distances.

We repeat the comparison for the three different material types.

twoway.material.mc <- emmeans(highway.main.fit, "Material")
confint(contrast(twoway.material.mc, "pairwise"))
##  contrast estimate   SE df lower.CL upper.CL
##  1 - 2         362 93.8 31    131.1      593
##  1 - 3         636 93.8 31    404.8      867
##  2 - 3         274 93.8 31     42.8      505
## 
## Results are averaged over the levels of: Size 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates

Similar to the above, we note that all three confidence intervals do not include the value of 0, indicating that all three material types result in different mean detection distances.

3.2.3 Example with Interaction

Example. Tooth growth in guinea pigs (original from Crampton (1947)).

Consider an example that investigates the effects of ascorbic acid and delivery method on tooth growth in guinea pigs. Sixty guinea pigs are randomly assigned to receive one of three levels of ascorbic acid (0.5, 1.0 or 2.0 mg) via one of two delivery methods (orange juice or Vitamin C), under the restriction that each treatment combination has an equal number of guinea pigs. The response variable len is the length of odontoblasts (cells responsible for tooth growth). The two-way data structure here looks like this:

\[\begin{array}{c|ccc} \hline \textbf{Supplement} & \textbf{Dose: 0.5 mg} & \textbf{1.0 mg} & \textbf{2.0 mg} \\ \hline \textbf{Orange Juice} & \textrm{Treatment 1} & \textrm{Treatment 2} & \textrm{Treatment 3} \\ \textbf{Vitamin C} & \textrm{Treatment 4} & \textrm{Treatment 5} & \textrm{Treatment 6} \\ \hline \end{array}\]

As alluded to before, there are ten replicate guinea pigs per treatment. The data are available as the data frame ToothGrowth included in the datasets package (R Core Team, 2019). Here is the head of the dataframe:

data("ToothGrowth")
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Let us bypass generating numeric descriptive statistics for the moment, and instead jump to an interaction plot of the length response in Figure 3.15.

ggplot(ToothGrowth, aes(x=dose, y=len, group=supp, color=supp) ) +  
  stat_summary(fun=mean, geom="point") + 
  stat_summary(fun=mean, geom="line") +
  labs(x="Dosage (mg per day)", y="Length of dontoblasts",
       color="Supplement")

Figure 3.15: Interaction plot demonstrating that the dose and supplement may interact in predicting tooth length.

There are a couple of things to note here:

  1. There appears to be a substantial dose effect, in that higher doses of ascorbic acid generally result in higher mean length.
  2. However, there may be a substantial interaction effect between the type of supplement and dose in determining mean tooth length. In particular, orange juice appears to be more effective than Vitamin C as a delivery method (longer tooth length), but only at lower dose levels. At the high dose level, there appears to be no difference between the supplements.

Once again, non-parallel traces are a red flag for interaction. But we need to run the ANOVA test for interaction to confirm whether this is a significant effect. First, check the assumptions in Figure 3.16.

tooth.fit <- aov(len ~ factor(dose) + supp + factor(dose):supp, data=ToothGrowth)
autoplot(tooth.fit)

Figure 3.16: Residual diagnostic plots when the response variable is the tooth length as a function of supplement and dosage. Here, the residuals appear fairly homoskedastic based on the Residuals vs Fitted and Scale-Location plots. The normality assumption is satisfied based on the Normal Q-Q plot.

All assumptions look fine here. So, proceed to the ANOVA tests:

summary(tooth.fit)
##                   Df Sum Sq Mean Sq F value  Pr(>F)    
## factor(dose)       2   2426    1213   92.00 < 2e-16 ***
## supp               1    205     205   15.57 0.00023 ***
## factor(dose):supp  2    108      54    4.11 0.02186 *  
## Residuals         54    712      13                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test for interaction is significant (\(F\) = 4.107, \(\textrm{df}_1\) = 2, \(\textrm{df}_2\) = 54, \(p\)-value = 0.0218). Thus, we conclude the effect that supplement has on the mean tooth growth does depend on the dosage level. We then ignore the main effects tests if the interaction is significant.

Remember the strategy we introduced earlier:

  1. Fit a full interaction model.
  2. Test for significant interaction between \(A\) and \(B\):
    1. If interaction is significant, you must look at comparisons in terms of treatment combinations; i.e., you cannot separate the effects of \(A\) and \(B\).
    2. If interaction is non-significant, you may delete the interaction term from the model, fitting a reduced main-effects model. Then (and only then) may you look at the effects of \(A\) and \(B\) separately.

Performing multiple comparisons. In the case of a significant interaction effect in a two-factor ANOVA model, we typically follow up by performing multiple comparisons for the levels of one of the factors while holding the other factor fixed (the “conditioning” factor). So for example, in the present problem we could either

  • Compare the supplements at each of the dose levels (3 tests total), or
  • Compare the dose levels within each of the supplement types (3 × 2 = 6 tests total).

Frequently in practice, the context of the problem will dictate to the researcher which way makes the most sense. We’ll provide code that performs both cases using the emmeans package.

tooth.mc1 <- emmeans(tooth.fit, pairwise ~ factor(dose) | supp)
tooth.mc2 <- emmeans(tooth.fit, pairwise ~ supp | factor(dose) )
tooth.mc1
## $emmeans
## supp = OJ:
##  dose emmean   SE df lower.CL upper.CL
##   0.5  13.23 1.15 54    10.93     15.5
##   1.0  22.70 1.15 54    20.40     25.0
##   2.0  26.06 1.15 54    23.76     28.4
## 
## supp = VC:
##  dose emmean   SE df lower.CL upper.CL
##   0.5   7.98 1.15 54     5.68     10.3
##   1.0  16.77 1.15 54    14.47     19.1
##   2.0  26.14 1.15 54    23.84     28.4
## 
## Confidence level used: 0.95 
## 
## $contrasts
## supp = OJ:
##  contrast estimate   SE df t.ratio p.value
##  0.5 - 1     -9.47 1.62 54  -5.831  <.0001
##  0.5 - 2    -12.83 1.62 54  -7.900  <.0001
##  1 - 2       -3.36 1.62 54  -2.069  0.1060
## 
## supp = VC:
##  contrast estimate   SE df t.ratio p.value
##  0.5 - 1     -8.79 1.62 54  -5.413  <.0001
##  0.5 - 2    -18.16 1.62 54 -11.182  <.0001
##  1 - 2       -9.37 1.62 54  -5.770  <.0001
## 
## P value adjustment: tukey method for comparing a family of 3 estimates
tooth.mc2
## $emmeans
## dose = 0.5:
##  supp emmean   SE df lower.CL upper.CL
##  OJ    13.23 1.15 54    10.93     15.5
##  VC     7.98 1.15 54     5.68     10.3
## 
## dose = 1.0:
##  supp emmean   SE df lower.CL upper.CL
##  OJ    22.70 1.15 54    20.40     25.0
##  VC    16.77 1.15 54    14.47     19.1
## 
## dose = 2.0:
##  supp emmean   SE df lower.CL upper.CL
##  OJ    26.06 1.15 54    23.76     28.4
##  VC    26.14 1.15 54    23.84     28.4
## 
## Confidence level used: 0.95 
## 
## $contrasts
## dose = 0.5:
##  contrast estimate   SE df t.ratio p.value
##  OJ - VC      5.25 1.62 54   3.233  0.0021
## 
## dose = 1.0:
##  contrast estimate   SE df t.ratio p.value
##  OJ - VC      5.93 1.62 54   3.651  0.0006
## 
## dose = 2.0:
##  contrast estimate   SE df t.ratio p.value
##  OJ - VC     -0.08 1.62 54  -0.049  0.9609

Here we see the output includes the estimates and confidence intervals for the means (while holding one factor fixed) and also includes the contrasts (comparisons) of one factor while holding the second factor constant.

At a dose level of 0.5 mg, ascorbic acid delivery via orange juice produces significantly larger mean tooth growth than does delivery via Vitamin C. We determined this based on the second set of output, where we see an adjusted \(p\)-value of 0.0021 comparing orange juice with Vitamin C. Similarly, at dose=1.0 there is still a difference, but not at dose=2.0.
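If confidence intervals for these supplement comparisons are preferred over \(p\)-values, they can be obtained by applying confint() to the contrasts element of the emmeans object.

confint(tooth.mc2$contrasts)   # 95% intervals for OJ - VC at each dose level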

We can also use plotting features to graphically explore the factors while holding another factor constant.

plot(tooth.mc1)

Figure 3.17: Graphical exploration of all treatments noting that the OJ supplement tends to result in higher tooth lengths than Vitamin C.

Here we explore the performance of the dosage amounts conditioning (essentially faceting) on the delivery type. Graphically we can see that the OJ method tends to have higher tooth lengths than Vitamin C for the low dosage levels of ascorbic acid.

References

Auguie, B., gridExtra: Miscellaneous Functions for "Grid" Graphics, from https://CRAN.R-project.org/package=gridExtra, 2017.
Crampton, E. W., The Growth of the Odontoblasts of the Incisor Tooth as a Criterion of the Vitamin C Intake of the Guinea Pig: Five Figures, The Journal of Nutrition, vol. 33, no. 5, pp. 491–504, from https://doi.org/10.1093/jn/33.5.491, May 1947. DOI: 10.1093/jn/33.5.491
Fisher, R. A., The Design of Experiments, Edinburgh: Oliver &amp; Boyd, 1935.
Hall, S. M., Self-Control and Therapist Control in the Behavioral Treatment of Overweight Women, Behaviour Research and Therapy, vol. 10, no. 1, pp. 59–68, 1972. DOI: https://doi.org/10.1016/0005-7967(72)90008-3
Immer, F. R., Hayes, H. K. and Powers, L., Statistical Determination of Barley Varietal Adaptation, Agronomy Journal, vol. 26, no. 5, pp. 403–19, from https://acsess.onlinelibrary.wiley.com/doi/abs/10.2134/agronj1934.00021962002600050008x, 1934. DOI: https://doi.org/10.2134/agronj1934.00021962002600050008x
R Core Team, R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing, from https://www.R-project.org/, 2019.
Walpole, R. E., Myers, R. H., Myers, S. L. and Ye, K., Probability & Statistics for Engineers and Scientists, Upper Saddle River: Pearson Education, 2007.
Weiss, N. A., Introductory Statistics, Pearson Education, 2012.
Younes, S. S., Highway Construction Safety and the Aging Driver: Detection and Legibility Distances of Construction Warning Signs, PhD thesis, Arizona State University, 1994.