Chapter 4 Advanced Designs

The previous chapters have established the groundwork of experimental design. We discussed the key elements of experimental units, factors, response variables, treatments, and confounding influences, as well as the purpose of replication and randomization. That work only scratches the surface of the field of experimental design and the analyses associated with those designs. The objectives of this chapter include:

  • Understanding the basic structure of factorial designs.
  • Understanding the basic structure of within-subject designs.
  • Building an appreciation for the complexity of experimental design.

4.1 Higher order factor models

In the previous two chapters we have seen the One-Way and Two-Way ANOVA. In the former, a single factor consisted of \(k\) levels (and thus \(k\) treatments), while in the latter the first factor may consist of \(k\) levels and the second factor \(j\) levels (and therefore \(k\times j\) treatments). This idea can, of course, be expanded to even more factors; consider the following example.

4.1.1 Example of Three-factor design

Example. Marketing research contractors (Kutner et al., 2004).

A marketing research consultant evaluated the effects of fee schedule, scope of work, and type of supervisory control on the quality of work performed under contract by independent marketing research agencies. The design of this experiment can be summarized as follows:

\[\begin{array}{ll} \textbf{Factor} & \textbf{Factor Levels} \\ \hline \textrm{Fee level} & \textrm{High} \\ & \textrm{Average} \\ & \textrm{Low}\\ \hline \textrm{Scope} & \textrm{In house} \\ & \textrm{Subcontracted} \\ \hline \textrm{Supervision} & \textrm{Local Supervisors}\\ & \textrm{Traveling Supervisors} \\ \end{array}\]

The quality of work performed was measured by an index taking into account several characteristics of quality.

You should note there are three factors in this experiment, with 3, 2 and 2 levels, respectively. This means the number of treatments is \(3\times 2\times 2 = 12\), and the treatments can be summarized in the following table.

                 Supervision
                 Local Supervisors               Traveling Supervisors
Fee Level        In house       Subcontracted    In house       Subcontracted
High             Treatment 1    Treatment 2      Treatment 3    Treatment 4
Average          Treatment 5    Treatment 6      Treatment 7    Treatment 8
Low              Treatment 9    Treatment 10     Treatment 11   Treatment 12

Recall the main idea from two-factor experiments: the interaction terms are typically very important. In this three-factor design we have the three main effects (Fee level, Scope and Supervision), but we also have pairwise interactions between these variables and an interaction among all three. From a statistical modeling perspective, we are now essentially fitting the model

\[Y_{ijkm} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} + \varepsilon_{ijkm}\]

where

  • \(Y_{ijkm}\) is the \(m^\mathrm{th}\) observation in the \(i^\mathrm{th}\) level of the first factor, the \(j^\mathrm{th}\) level of the second factor and the \(k^\mathrm{th}\) level of the third factor.
  • \(\mu\) is an overall mean.
  • Fee level main effect \(\alpha_i\).
  • Scope main effect \(\beta_j\).
  • Supervision main effect \(\gamma_k\).
  • Fee level and Scope interaction effect \((\alpha\beta)_{ij}\).
  • Fee level and Supervision interaction effect \((\alpha\gamma)_{ik}\).
  • Scope and Supervision interaction effect \((\beta\gamma)_{jk}\).
  • Fee level, Scope and Supervision interaction effect \((\alpha\beta\gamma)_{ijk}\).
  • \(\varepsilon_{ijkm}\) is the random error term.

As you can see, this model has grown greatly compared to the two-way ANOVA examples from before.
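
To make this growth concrete, here is a sketch of the three-factor model in R formula notation, written term by term and then with the * shorthand. It uses the Quality response and marketing data frame that are introduced in the next subsection.

# Term-by-term specification: three main effects, three two-way interactions
# and the single three-way interaction (sketch; the data are read in Section 4.1.2)
aov(Quality ~ Fee + Scope + Supervision +
      Fee:Scope + Fee:Supervision + Scope:Supervision +
      Fee:Scope:Supervision, data = marketing)

# Equivalent shorthand: the * operator expands to all main effects and interactions
aov(Quality ~ Fee * Scope * Supervision, data = marketing)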

What will happen if there are four factors?

In general, if an experiment consisted of \(n\) factors, we would potentially have \(2^n - 1\) terms in our model. Thus with four factors you are looking at \(2^4 - 1 = 16-1=15\) terms in your ANOVA decomposition. With five factors, 31 terms. This will quickly grow out of control.
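
As a quick sanity check on this count, the terms can be tallied directly in R; a small sketch:

# With n factors there are choose(n, 1) main effects, choose(n, 2) two-way
# interactions, ..., choose(n, n) highest-order interaction; the total is 2^n - 1
n_terms <- function(n) sum(choose(n, 1:n))
sapply(2:5, n_terms)
## [1]  3  7 15 31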

Fortunately, for higher order factor problems there are statisticians available to help formulate a better design and potentially optimize some of the variables under study. We leave out the details of these topics in this introductory text, but we want you to be aware that these advanced designs and analyses can be done. Please consult a statistician before your experiment if you are planning a complex design!

4.1.2 Three-factor ANOVA Example

Example continued. Below we complete the example involving the marketing research contractors.

First we read in the data and make sure R properly treats the factors.

# Packages assumed to be loaded for this chapter: tidyverse (readr, dplyr,
# tidyr, ggplot2), knitr (for kable) and emmeans
library(tidyverse)

data_site <- "https://tjfisher19.github.io/introStatModeling/data/marketingResearch.txt"
marketing <- read_table(data_site, col_types=cols() )
marketing <- marketing %>%
  mutate(Fee = as.factor(Fee),
         Scope = as.factor(Scope),
         Supervision = as.factor(Supervision))

We leave out EDA and residual analysis from this example for brevity. We fit the ANOVA model.

marketing.anova <- aov(Quality ~ Fee*Scope*Supervision, data=marketing)

Here we used the notation Fee*Scope*Supervision so R will fit a full factorial model (that is, all main effects and interaction terms). Now look at the ANOVA output.

summary(marketing.anova)
##                       Df Sum Sq Mean Sq F value  Pr(>F)    
## Fee                    2  10044    5022  679.34 < 2e-16 ***
## Scope                  1   1834    1834  248.08 < 2e-16 ***
## Supervision            1   3832    3832  518.40 < 2e-16 ***
## Fee:Scope              2      2       1    0.11    0.90    
## Fee:Supervision        2      1       0    0.05    0.95    
## Scope:Supervision      1    575     575   77.75 1.6e-10 ***
## Fee:Scope:Supervision  2      4       2    0.27    0.77    
## Residuals             36    266       7                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As with the two-factor analysis, always look at the most complex interaction term first. Here we see that the three-factor interaction term is not significant, so we proceed to the pairwise interactions.

The interaction between Scope and Supervision is significant. Interactions involving the Fee level factor are not significant, so we can interpret the Fee level main effect as a significant influence on quality. We can also conclude that some combination of Scope and Supervision influences quality. We leave out a full multiple comparisons analysis here, as it gets very complex when three factors are involved, but a starting-point sketch follows.
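
For readers who want that starting point, a minimal sketch with the emmeans package (used again later in this chapter) might look like the following; we do not pursue its output here.

library(emmeans)
# Fee has no significant interactions, so its levels may be compared directly
fee.mc <- emmeans(marketing.anova, ~ Fee)
contrast(fee.mc, "pairwise")
# Scope and Supervision interact, so compare Scope within each Supervision level
ss.mc <- emmeans(marketing.anova, ~ Scope | Supervision)
contrast(ss.mc, "pairwise")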

4.2 Within-subject designs

In the previous two chapters, you have seen different experimental designs where the levels of one or more factors are manipulated by the researcher in order to measure a response. In simpler design settings, the factors that are manipulated are of the “between-subjects” type. The one-way ANOVA example in chapter 2 on drug effects on alertness is illustrative of this: each participant (experimental unit) had exactly one drug randomly assigned to them. As such, there is only one observation per experimental unit in the design. Drug was a between-subjects factor.

Why is that important to note? Because, as you should realize by now, it ensures that the observations in the data are all independent. It will be elaborated upon later in this course how crucial this assumption is to the validity of any statistical inferences we draw from the tests/CIs we derive. In fact, independence is more important to inference validity than either the normality or constant variance assumptions.

Now, contrast this with the paired \(t\)-test (see section 2.4). An example was discussed that involved giving a sample of 20 people a drug treatment in order to study its effect on the Cholesterol of the patients. Think of that problem from a design of experiments point of view: what was/were the factor(s) that were manipulated by the researcher?

  • It is not the drug treatment! Why? Because in the experiment, only one drug was studied: there was no “comparison” drug or placebo used to form a basis for comparison. As such, the drug treatment in that problem was not a variable at all: it was a constant.
  • So what is left? The answer is TIME. The researcher measured Cholesterol at two time points on each patient (experimental unit): [1] just before the start of the drug treatment, and [2] 45 days later. The factor here is time, and it has two levels. Time is a within-subject factor.

In a designed experiment that involves several measurements made on the same experimental unit over time, the design is called a repeated measures design.

The important thing to recognize here is that each patient provided two observations, not one. It stands to reason that what a particular subject’s Cholesterol is after 45 days will be related to their starting Cholesterol before the drug treatment. That is the layman’s way of saying that the measurements within a subject are not independent (i.e., a patient is not independent of themselves – the observations are correlated). If in any analysis we ignored this basic truth, that analysis would be seriously flawed.

So how did we handle this before? A simple strategy, common with paired data comparisons, is to calculate each subject’s difference score (yielding a single difference measurement per subject) and then perform a one-sample \(t\)-test on the differences. Each difference score is independent because each came from a different randomly chosen subject. This is what was done in the paired \(t\)-test analysis in section 2.4. Check back and remind yourself of this.

4.2.1 Blocks revisited: an approach to handling within-subjects factors

The notion and purpose of blocking in an experimental design was introduced in section 3.1. If you recall that witty definition of blocks as relatively homogeneous sets of experimental material, it may occur to you that subjects can serve as blocks. In fact, this was already stated in section 3.1. What this means for us now is that using a block design approach in the analysis is one way to handle correlated (non-independent) observations, such as those in the paired \(t\)-test problem above. We illustrate that here.

Example. Carbon Emissions (originally in Weiss, 2012).

The makers of the MAGNETIZER Engine Energizer System (EES) claim that it improves gas mileage and reduces emissions in automobiles by using magnetic free energy to increase the amount of oxygen in the fuel for greater combustion efficiency. Test results for 14 vehicles that underwent testing are available in the file carCOemissions.csv. The data include the emitted carbon monoxide (CO) levels, in parts per million, of each of the 14 vehicles tested before and after installation of the EES.

This is a classic before-after (Paired design) study. Below is some code to implement the test.

First, we quickly recap the paired \(t\)-test result from section 2.4.

www <- "https://tjfisher19.github.io/introStatModeling/data/carCOemissions.csv"
vehicleCO <- read_csv(www, col_types = cols())
vehicleCO
## # A tibble: 14 x 3
##       id before after
##    <dbl>  <dbl> <dbl>
##  1     1   1.6   0.15
##  2     2   0.3   0.2 
##  3     3   3.8   2.8 
##  4     4   6.2   3.6 
##  5     5   3.6   1   
##  6     6   1.5   0.5 
##  7     7   2     1.6 
##  8     8   2.6   1.6 
##  9     9   0.15  0.06
## 10    10   0.06  0.16
## 11    11   0.6   0.35
## 12    12   0.03  0.01
## 13    13   0.1   0   
## 14    14   0.19  0

and the paired \(t\)-test

t.test(vehicleCO$before, vehicleCO$after, paired=TRUE)
## 
##  Paired t-test
## 
## data:  vehicleCO$before and vehicleCO$after
## t = 3.146, df = 13, p-value = 0.00773
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.239443 1.289129
## sample estimates:
## mean of the differences 
##                0.764286
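
As an aside, the difference-score strategy described earlier gives an identical result; a quick check on the same data (output omitted):

# A paired t-test is equivalent to a one-sample t-test on the per-vehicle
# difference scores; it returns the same t = 3.146, df = 13 and p-value
t.test(vehicleCO$before - vehicleCO$after)

# The within-vehicle dependence can also be quantified directly by the
# correlation between the before and after readings
cor(vehicleCO$before, vehicleCO$after)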

Now, we will perform this same analysis using a block ANOVA. The data are structured so that the 2 observations (the before and after measurements) for each vehicle are on a single row of the data frame. We will reshape this data into a tall, or long, format.

vehicleCO_tall <- vehicleCO %>%
  pivot_longer(cols=c(before, after),
               names_to="Time",
               values_to="Emitted_CO")
head(vehicleCO_tall)
## # A tibble: 6 x 3
##      id Time   Emitted_CO
##   <dbl> <chr>       <dbl>
## 1     1 before       1.6 
## 2     1 after        0.15
## 3     2 before       0.3 
## 4     2 after        0.2 
## 5     3 before       3.8 
## 6     3 after        2.8

Note the difference in the layout of the data here as compared to the original vehicleCO used in the paired \(t\)-test: each observation occupies a single row of the data file now. There is an identifier variable (id) that identifies which subject every observation belongs to. The variable Time refers to the time factor and Emitted_CO records the emitted Carbon Monoxide levels (ppm).

We will analyze this same data (addressing the same research question) by using a block design approach:

vehicleco.block.aov <- aov(Emitted_CO ~ factor(id) + Time, data=vehicleCO_tall)
summary(vehicleco.block.aov)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## factor(id)  13   56.7    4.36    10.6 7.2e-05 ***
## Time         1    4.1    4.09     9.9  0.0077 ** 
## Residuals   13    5.4    0.41                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Look at the \(F\)-test for the time factor (labeled Time here). This result is identical to the result of the paired t-test above (\(F\) = 9.897, \(\textrm{df}_1\) = 1, \(\textrm{df}_2\) = 13, \(p\)-value \(= 0.00773\)). Note, \(\sqrt{9.897} =\) 3.146, the value of the paired \(t\)-test statistic.

How does the block approach work here? You will note that the test of Time (the within-subjects effect) uses the residual effect as the error term. Because the blocks are accounted for in the model (and here, the blocks constitute the overall between-vehicle variability source), the only variation left over for the “residuals” is the variability due to within-vehicle differences. In essence, there are two components to the random error now:

  • Aggregated variability between vehicles (block effect)
  • Remaining unexplained (random) variability after accounting for aggregate vehicle-to-vehicle differences and aggregated time-to-time differences. We will call this leftover stuff “pure” error.

For a correct analysis, it is important that each factor of interest in the study be tested against the proper component of the error. Here, we test a within-subjects effect against the random error component dealing with “pure” random error.
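
In symbols, the within-subjects \(F\) statistic is simply the ratio of the Time mean square to this “pure” error mean square. Using the (rounded) values from the block ANOVA table above,

\[F_{\textrm{Time}} = \frac{MS_{\textrm{Time}}}{MS_{\textrm{Residual}}} = \frac{4.09}{5.4/13} \approx 9.9,\]

which matches, up to rounding, the value reported for Time.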

Perhaps a more general approach to within-subjects analysis is to specify to R that there are layers of random variability in the data, introduced by the fact that we have more than one measurement per experimental unit. This can be done using an aov model specification that explicitly identifies these layers with an Error() term. The model specification in R is generally

`aov(response ~ between-subjects factor structure + Error(EU variable/WS factor))`

In the present example, this would be as follows:

WS_aov <- aov(Emitted_CO ~ Time + Error(factor(id)/Time), data=vehicleCO_tall)
summary(WS_aov)
## 
## Error: factor(id)
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 13   56.7    4.36               
## 
## Error: factor(id):Time
##           Df Sum Sq Mean Sq F value Pr(>F)   
## Time       1   4.09    4.09     9.9 0.0077 **
## Residuals 13   5.37    0.41                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again we see the same result. It turns out that for more complex studies this general approach is preferred.

4.2.2 A more involved repeated measures case study

If we can handle within-subjects factors using blocks, why not just stick with that all the time?
Well, the reason is that repeated measures designs can get complicated: multiple time measurement points (more than just two), the addition of between-subjects factors, etc. The following case study example illustrates this.

Example. Cholesterol study (from the University of Sheffield (Sheffield, 2021)).

A study tested whether cholesterol was reduced after using a certain brand of margarine as part of a low fat, low cholesterol diet. The subjects consumed on average 2.31g of the active ingredient, stanol ester, a day. This data set cholesterolreduction.csv contains information on 18 people using margarine to reduce cholesterol over three time points.

Variable Name   Description                                   Data type
ID              Participant number
Before          Cholesterol before the diet (mmol/L)          Scale
After4weeks     Cholesterol after 4 weeks on diet (mmol/L)    Scale
After8weeks     Cholesterol after 8 weeks on diet (mmol/L)    Scale
Margarine       Margarine type A or B                         Binary

chol <- read_csv("https://tjfisher19.github.io/introStatModeling/data/cholesterolreduction.csv", col_types=cols())
glimpse(chol)
## Rows: 18
## Columns: 5
## $ ID          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Before      <dbl> 6.42, 6.76, 6.56, 4.80, 8.43, 7.49, 8.05, 5.05, 5.77, 3.91~
## $ After4weeks <dbl> 5.83, 6.20, 5.83, 4.27, 7.71, 7.12, 7.25, 4.63, 5.31, 3.70~
## $ After8weeks <dbl> 5.75, 6.13, 5.71, 4.15, 7.67, 7.05, 7.10, 4.67, 5.33, 3.66~
## $ Margarine   <chr> "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "B"~

There are two factors of interest here:

  • Time point (3 levels, within-subjects): Before, After4weeks, After8weeks
  • Margarine type (2 levels, between-subjects): A or B

Note that the data is currently in wide form. We need to convert it to tall form:

chol_tall <- chol %>% 
  pivot_longer(cols=Before:After8weeks, 
               names_to="time.pt", 
               values_to="chol.level")
head(chol_tall)
## # A tibble: 6 x 4
##      ID Margarine time.pt     chol.level
##   <dbl> <chr>     <chr>            <dbl>
## 1     1 B         Before            6.42
## 2     1 B         After4weeks       5.83
## 3     1 B         After8weeks       5.75
## 4     2 A         Before            6.76
## 5     2 A         After4weeks       6.2 
## 6     2 A         After8weeks       6.13
chol_tall <- chol_tall %>%
  mutate(time.pt = factor(time.pt, 
                          levels=c("Before", "After4weeks", "After8weeks"),
                          labels=c("Before", "After 4 weeks", "After 8 weeks")))

Note here we have changed the variable time.pt to be a factor and we order the factor levels so they are contextually sequential (rather than in alphabetical order).

4.2.2.1 Exploratory Data Analysis

We are now ready for some EDA. A table of descriptive statistics by combinations of time and margarine group is our first cut:

chol.summary <- chol_tall %>%
  group_by(Margarine, time.pt) %>%
  summarise(Mean = mean(chol.level),
            SD = sd(chol.level),
            SE = sd(chol.level)/sqrt(length(chol.level)),
            n = length(chol.level))
kable(chol.summary)
Margarine  time.pt        Mean      SD        SE        n
A          Before         6.03556   1.363232  0.454411  9
A          After 4 weeks  5.55000   1.320672  0.440224  9
A          After 8 weeks  5.48889   1.307282  0.435761  9
B          Before         6.78000   0.919008  0.306336  9
B          After 4 weeks  6.13333   0.863713  0.287904  9
B          After 8 weeks  6.06889   0.825825  0.275275  9

We generally observe that cholesterol levels with margarine A appear lower than with margarine B. Also, there does appear to be some reduction in cholesterol level using both products; however, there also appears to be more unexplained variability (“noise”) in the cholesterol measurements under margarine A. We can further check this using some data visualizations. First let us try boxplots (not entirely appropriate given the design, as discussed below), as seen in Figure 4.1.

ggplot(chol_tall) +
  geom_boxplot(aes(x=time.pt, y=chol.level)) +
  facet_wrap(~Margarine) +
  xlab("Time Point of Measurement") +
  ylab("Cholesterol level (mmol/L)")

Figure 4.1: Boxplot distribution of the Cholesterol level as a function of Treatment and Time of measurement.

We see in Figure 4.1 that perhaps cholesterol levels are generally more consistent under margarine B, and that it is just some outlier(s) that are the issue. However, this plot does not reveal a crucial piece of information: the responses are correlated over time. A better depiction would allow us to see how each individual subject’s cholesterol changed over the span of the experiment as seen in Figure 4.2.

# Plot the individual-level cholesterol profiles
ggplot(chol_tall, aes(x=time.pt, y=chol.level, colour=Margarine, group=ID)) +
  geom_line() + 
  geom_point(shape=21, fill="white") + 
  facet_wrap(~Margarine) +
  xlab("Time Point of Measurement") +
  ylab("Cholesterol level (mmol/L)") +
  ggtitle("Subject-level cholesterol profiles by Margarine group")

Figure 4.2: Profile plots of the Cholesterol level as a function of Treatment and Time of measurement.

In Figure 4.2 we see some things more clearly. The general pattern over time appears to be fairly consistent for every experimental unit (visually, each “line” in the plot). There are two individuals in margarine B whose cholesterol levels seem unusually higher than the others; this might warrant further investigation. Now we provide a summary profile of the two treatments in Figure 4.3.

# we include a position_dodge element to avoid overlapping points and error bars
ggplot(chol.summary, aes(x=time.pt, y=Mean, colour=Margarine)) + 
  geom_errorbar(aes(ymin=Mean-SE, ymax=Mean+SE), width=.1, position=position_dodge(0.3)) +
  geom_line(aes(group=Margarine), position=position_dodge(0.3)) +
  geom_point(position=position_dodge(0.3)) +
  xlab("Time Point of Measurement") +
  ylab("Cholesterol level (mmol/L)") +
  ggtitle("Mean Cholesterol level (with Standard Error bars)")

Figure 4.3: Average Profile of Cholesterol level by Treatment in Time.

In the summary version (Figure 4.3) we see that cholesterol levels under margarine A appear lower than under margarine B. It is worth noting that the group assigned to margarine A had lower cholesterol “before” the study began.

4.2.2.2 Preliminary inferential analysis

Of more concern to a fully correct analysis is addressing why there appears to be a difference between the margarine groups at the “Before” stage. At that point, no margarines had been administered, so if the subjects had been truly randomly assigned to the margarine groups (as they should be in a properly designed experiment!), we would have expected consistent cholesterol readings between the groups at that point. This can be formally tested as follows, with a simple independent samples \(t\)-test comparing the mean cholesterol levels of the A and B groups at the “Before” stage:

BeforeDataOnly <- chol_tall %>%
  filter(time.pt == "Before")
t.test(chol.level ~ Margarine, data=BeforeDataOnly, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  chol.level by Margarine
## t = -1.358, df = 16, p-value = 0.193
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##  -1.906205  0.417316
## sample estimates:
## mean in group A mean in group B 
##         6.03556         6.78000

The result is not statistically significant (\(p\)-value = 0.1932). That is actually good news here! Although the group means appear visually different, the difference is not statistically significant.

4.2.2.3 A conditional repeated measures analysis

To get started, let’s do a within-subjects (repeated measures) ANOVA on the Margarine A data only. The goal here is to see if there is an effective reduction in mean cholesterol over the eight-week trial by using margarine A. Here goes: follow the code!

MargarineA <- chol_tall %>% 
  filter(Margarine == "A")
RMaov_MargA <- aov(chol.level ~ time.pt + Error(factor(ID)/time.pt), data=MargarineA)
summary(RMaov_MargA)
## 
## Error: factor(ID)
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals  8   42.4     5.3               
## 
## Error: factor(ID):time.pt
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## time.pt    2  1.615   0.808     106 5.8e-10 ***
## Residuals 16  0.122   0.008                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test of interest is labeled time.pt. It is significant (\(F\) = 106.3, \(\textrm{df}_1\) = 2, \(\textrm{df}_2\) = 16, \(p\)-value = \(5.7 \times 10^{-10}\)). What this means is that, using Margarine A, there is significant evidence to conclude that the true mean cholesterol level differs between at least two of the time points. Of course, that result does not mean that the margarine is effective at reducing cholesterol levels. All it means is that the mean cholesterol level is changing at some point over the experiment. Since there are three time points, a multiple comparison procedure is warranted with the output below and in Figure 4.4.

MargA.mc <- emmeans(RMaov_MargA, ~ time.pt)
## Note: re-fitting model with sum-to-zero contrasts
contrast(MargA.mc, "pairwise")
##  contrast                      estimate     SE df t.ratio p.value
##  Before - After 4 weeks          0.4856 0.0411 16  11.817  <.0001
##  Before - After 8 weeks          0.5467 0.0411 16  13.304  <.0001
##  After 4 weeks - After 8 weeks   0.0611 0.0411 16   1.487  0.3229
## 
## P value adjustment: tukey method for comparing a family of 3 estimates
plot(contrast(MargA.mc, "pairwise"))

Figure 4.4: Comparing the effects of time within the Margarine A treatment group. We see there is no statistical difference between 4 and 8 weeks after treatment begins.

The significant reductions in mean cholesterol level under Margarine A occur right after the start of the experiment. The mean cholesterol level is reduced by 0.485 (SE = 0.041) by week 4 (\(p\)-value < 0.0001), and by 0.546 (SE = 0.041) by week 8 (\(p\)-value < 0.0001). There is no appreciable change in mean cholesterol level between weeks 4 and 8 (\(p\)-value = 0.3229).

4.2.2.4 Full analysis: Formal Comparison of Margarines

It is always critically important to know the structure of the data, which here stems from the design of the experiment. In this case, we cannot just “compare the margarines” because there is a subject-level time profile under each margarine. Since the same time points are observed under each margarine group, the factors Margarine and time.pt are crossed effects, which means we have a factorial data structure (see section 3.2). The “wrinkle” here is that one of the two factors is a between-subjects factor (Margarine) and the other is within-subjects (time.pt). But we can handle this in aov:

RMaov_all <- aov(chol.level ~ Margarine + time.pt + Margarine:time.pt + 
                              Error(factor(ID)/time.pt), data=chol_tall)
summary(RMaov_all)
## 
## Error: factor(ID)
##           Df Sum Sq Mean Sq F value Pr(>F)
## Margarine  1    5.5    5.46    1.45   0.25
## Residuals 16   60.4    3.78               
## 
## Error: factor(ID):time.pt
##                   Df Sum Sq Mean Sq F value Pr(>F)    
## time.pt            2   4.32   2.160  259.49 <2e-16 ***
## Margarine:time.pt  2   0.08   0.040    4.78  0.015 *  
## Residuals         32   0.27   0.008                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As with the factorial models before, the first thing we must check is the significance of the interaction term. If it is significant, it means that the effect of time on mean cholesterol differs between the two margarines. That is the case here (\(F\) = 4.777, \(\textrm{df}_1\) = 2, \(\textrm{df}_2\) = 32, \(p\)-value = 0.0153). As a result, we should do our multiple comparisons by conditioning on one of the two factors. Experimental context will dictate which of the two factors is the more reasonable to condition on. Here, I choose to condition on Margarine group, so I will get a time-based comparison of mean cholesterol level, but specific to each Margarine group:

RMall.mc <- emmeans(RMaov_all, ~ time.pt | Margarine)
## Note: re-fitting model with sum-to-zero contrasts
contrast(RMall.mc, "pairwise")
## Margarine = A:
##  contrast                      estimate    SE df t.ratio p.value
##  Before - After 4 weeks          0.4856 0.043 32  11.290  <.0001
##  Before - After 8 weeks          0.5467 0.043 32  12.711  <.0001
##  After 4 weeks - After 8 weeks   0.0611 0.043 32   1.421  0.3424
## 
## Margarine = B:
##  contrast                      estimate    SE df t.ratio p.value
##  Before - After 4 weeks          0.6467 0.043 32  15.036  <.0001
##  Before - After 8 weeks          0.7111 0.043 32  16.535  <.0001
##  After 4 weeks - After 8 weeks   0.0644 0.043 32   1.498  0.3051
## 
## P value adjustment: tukey method for comparing a family of 3 estimates
plot(contrast(RMall.mc, "pairwise"))

Figure 4.5: Graphical exploration of all treatments, noting that there is no statistical difference between 4 and 8 weeks after the study began, regardless of Margarine treatment.

Even though at first glance the time-based comparisons look about the same between the two margarines, the point estimates and plots reveal the story: the reductions in mean cholesterol following the initiation of the study are larger for margarine B (mean=0.7111 mmol/L, SE=0.043 by week 8) than for margarine A (mean=0.546 mmol/L, SE=0.043 by week 8).

4.2.3 Further study

More thorough analyses of within-subjects and repeated measures designs are possible using advanced methods beyond the scope of this course. These methods include:

  • processes that allow the analyst to estimate the degree of correlation between repeated measurements
  • processes that allow one to model different correlation structures based upon experimental realities in the data collection process

For example, suppose we had a repeated measures experiment with subjects measured across four time points: weeks 1, 2, 3 and 8. Because of the time spacing, we might expect measurements from adjacent weeks to be more similar within a subject than weeks far removed in time. For example, it is reasonable to expect that results in week 2 are more strongly correlated with week 1 results than week 3 results are with week 1. Week 8’s results might be so far removed from the first few weeks that they could safely be presumed to be nearly independent of the earlier results. We could even be more restrictive, using an approach that assumes only adjacent time point results are correlated and, beyond that, assumes independence.

Subtleties like this can be modeled and checked using advanced modeling techniques.
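
For instance, a mixed-effects model from the nlme package (one of the advanced methods beyond the scope of this course) could re-fit the cholesterol data while estimating an AR(1)-type correlation among the repeated measurements within each subject. A sketch only, assuming nlme is installed:

library(nlme)

# Random intercept for each subject plus an AR(1) correlation structure for
# the repeated measurements within a subject (a sketch, not a definitive model)
chol.lme <- lme(chol.level ~ Margarine * time.pt,
                random = ~ 1 | ID,
                correlation = corAR1(form = ~ 1 | ID),
                data = chol_tall)
summary(chol.lme)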

References

Kutner, M. H., Nachtsheim, C., Neter, J. and Li, W., Applied Linear Statistical Models, New York, NY: McGraw-Hill Irwin, 2004.
Sheffield, University of, Mathematics and Statistics Help (MASH), Datasets for Teaching, January 2021.
Weiss, N. A., Introductory Statistics, Pearson Education, 2012.