# packages
::p_load(conflicted, # package function conflicts
pacman# data handling
dplyr, purrr, tibble, tidyr, stringr, # plot
gganimate, ggplot2, gifski, viridis, # mixed modelling
nlme, lme4, glmmTMB, sommer, # mixed model extractions
AICcmodavg, mixedup)
# package function conflicts
conflict_prefer("filter", "dplyr")
# data
<- agriTutorial::sorghum %>%
dat rename(block = Replicate, plot = factplot) %>%
::select(y, variety, block, plot, factweek, varweek) %>%
dplyras_tibble()
y | variety | block | plot | factweek | varweek |
---|---|---|---|---|---|
5.00 | 1 | 1 | 1 | 1 | 1 |
4.84 | 1 | 1 | 1 | 2 | 2 |
4.02 | 1 | 1 | 1 | 3 | 3 |
3.75 | 1 | 1 | 1 | 4 | 4 |
3.13 | 1 | 1 | 1 | 5 | 5 |
4.42 | 1 | 2 | 2 | 1 | 1 |
4.30 | 1 | 2 | 2 | 2 | 2 |
3.67 | 1 | 2 | 2 | 3 | 3 |
3.23 | 1 | 2 | 2 | 4 | 4 |
2.83 | 1 | 2 | 2 | 5 | 5 |
4.42 | 1 | 3 | 3 | 1 | 1 |
4.10 | 1 | 3 | 3 | 2 | 2 |
3.46 | 1 | 3 | 3 | 3 | 3 |
3.09 | 1 | 3 | 3 | 4 | 4 |
2.82 | 1 | 3 | 3 | 5 | 5 |
4.01 | 1 | 4 | 4 | 1 | 1 |
3.89 | 1 | 4 | 4 | 2 | 2 |
3.21 | 1 | 4 | 4 | 3 | 3 |
2.89 | 1 | 4 | 4 | 4 | 4 |
2.56 | 1 | 4 | 4 | 5 | 5 |
3.36 | 1 | 5 | 5 | 1 | 1 |
3.10 | 1 | 5 | 5 | 2 | 2 |
2.67 | 1 | 5 | 5 | 3 | 3 |
2.47 | 1 | 5 | 5 | 4 | 4 |
2.16 | 1 | 5 | 5 | 5 | 5 |
5.82 | 2 | 1 | 6 | 1 | 1 |
5.60 | 2 | 1 | 6 | 2 | 2 |
5.05 | 2 | 1 | 6 | 3 | 3 |
4.72 | 2 | 1 | 6 | 4 | 4 |
4.46 | 2 | 1 | 6 | 5 | 5 |
5.73 | 2 | 2 | 7 | 1 | 1 |
5.59 | 2 | 2 | 7 | 2 | 2 |
5.00 | 2 | 2 | 7 | 3 | 3 |
4.65 | 2 | 2 | 7 | 4 | 4 |
4.42 | 2 | 2 | 7 | 5 | 5 |
5.31 | 2 | 3 | 8 | 1 | 1 |
5.19 | 2 | 3 | 8 | 2 | 2 |
4.86 | 2 | 3 | 8 | 3 | 3 |
4.44 | 2 | 3 | 8 | 4 | 4 |
4.22 | 2 | 3 | 8 | 5 | 5 |
4.92 | 2 | 4 | 9 | 1 | 1 |
4.66 | 2 | 4 | 9 | 2 | 2 |
4.56 | 2 | 4 | 9 | 3 | 3 |
4.16 | 2 | 4 | 9 | 4 | 4 |
3.99 | 2 | 4 | 9 | 5 | 5 |
3.96 | 2 | 5 | 10 | 1 | 1 |
3.86 | 2 | 5 | 10 | 2 | 2 |
3.50 | 2 | 5 | 10 | 3 | 3 |
3.13 | 2 | 5 | 10 | 4 | 4 |
2.95 | 2 | 5 | 10 | 5 | 5 |
5.65 | 3 | 1 | 11 | 1 | 1 |
5.97 | 3 | 1 | 11 | 2 | 2 |
5.27 | 3 | 1 | 11 | 3 | 3 |
5.07 | 3 | 1 | 11 | 4 | 4 |
4.52 | 3 | 1 | 11 | 5 | 5 |
5.39 | 3 | 2 | 12 | 1 | 1 |
5.49 | 3 | 2 | 12 | 2 | 2 |
5.08 | 3 | 2 | 12 | 3 | 3 |
4.87 | 3 | 2 | 12 | 4 | 4 |
4.32 | 3 | 2 | 12 | 5 | 5 |
5.15 | 3 | 3 | 13 | 1 | 1 |
5.28 | 3 | 3 | 13 | 2 | 2 |
4.93 | 3 | 3 | 13 | 3 | 3 |
4.67 | 3 | 3 | 13 | 4 | 4 |
4.15 | 3 | 3 | 13 | 5 | 5 |
4.50 | 3 | 4 | 14 | 1 | 1 |
4.89 | 3 | 4 | 14 | 2 | 2 |
4.74 | 3 | 4 | 14 | 3 | 3 |
4.49 | 3 | 4 | 14 | 4 | 4 |
4.10 | 3 | 4 | 14 | 5 | 5 |
3.75 | 3 | 5 | 15 | 1 | 1 |
3.74 | 3 | 5 | 15 | 2 | 2 |
3.55 | 3 | 5 | 15 | 3 | 3 |
3.28 | 3 | 5 | 15 | 4 | 4 |
3.00 | 3 | 5 | 15 | 5 | 5 |
5.86 | 4 | 1 | 16 | 1 | 1 |
5.60 | 4 | 1 | 16 | 2 | 2 |
5.37 | 4 | 1 | 16 | 3 | 3 |
5.00 | 4 | 1 | 16 | 4 | 4 |
4.37 | 4 | 1 | 16 | 5 | 5 |
5.82 | 4 | 2 | 17 | 1 | 1 |
5.55 | 4 | 2 | 17 | 2 | 2 |
5.29 | 4 | 2 | 17 | 3 | 3 |
4.95 | 4 | 2 | 17 | 4 | 4 |
4.07 | 4 | 2 | 17 | 5 | 5 |
5.26 | 4 | 3 | 18 | 1 | 1 |
5.06 | 4 | 3 | 18 | 2 | 2 |
4.76 | 4 | 3 | 18 | 3 | 3 |
4.48 | 4 | 3 | 18 | 4 | 4 |
3.94 | 4 | 3 | 18 | 5 | 5 |
4.87 | 4 | 4 | 19 | 1 | 1 |
4.75 | 4 | 4 | 19 | 2 | 2 |
4.55 | 4 | 4 | 19 | 3 | 3 |
4.33 | 4 | 4 | 19 | 4 | 4 |
3.83 | 4 | 4 | 19 | 5 | 5 |
3.96 | 4 | 5 | 20 | 1 | 1 |
3.76 | 4 | 5 | 20 | 2 | 2 |
3.56 | 4 | 5 | 20 | 3 | 3 |
3.18 | 4 | 5 | 20 | 4 | 4 |
2.96 | 4 | 5 | 20 | 5 | 5 |
The example in this chapter is taken from Example 4 in Piepho & Edmondson (2018) (see also the Agritutorial vigniette). It considers data from a sorghum trial laid out as a randomized complete block
design (5 blocks) with variety
(4 sorghum varities) being the only treatment factor. Thus, we have a total of 20 plot
s. It is important to note that our response variable (y
), the leaf area index, was assessed in five consecutive weeks on each plot starting 2 weeks after emergence. Therefore, the dataset contains a total of 100 values and what we have here is longitudinal data, a.k.a. repeated measurements over time, a.k.a. a time series analysis.
As Piepho & Edmondson (2018) put it: “the week factor is not a treatment factor that can be randomized. Instead, repeated measurements are taken on each plot on five consecutive occasions. Successive measurements on the same plot are likely to be serially correlated, and this means that for a reliable and efficient analysis of repeated-measures data we need to take proper account of the serial correlations between the repeated measures (Piepho, Büchse & Richter, 2004; Pinheiro & Bates, 2000).”
<- ggplot(data = dat, aes(
gganimate_plot y = y,
x = varweek,
group = variety,
color = variety
+
)) geom_boxplot(outlier.shape = NA) +
geom_point(alpha = 0.5, size = 3) +
scale_color_viridis(option = "D",
discrete = TRUE,
name = "Variety") +
scale_y_continuous(
name = "Leaf area index",
limits = c(0, 6.5),
expand = c(0, 0),
breaks = c(0:6)
+
) xlab("Week") +
theme_bw() +
theme(legend.position = "bottom") +
transition_time(varweek) +
shadow_mark(exclude_layer = 2)
animate(gganimate_plot, renderer = gifski_renderer()) # render gif
Our goal is therefore to build a suitable model taking serial correlation into account. In order to do this, we will initially consider the model for a single time point. Then, we extend this model to account for multiple weeks by allowing for week-speficic effects. Finally, we further allow for serially correlated error terms.
When looking at data from a single time point (e.g. the first week), we merely have 20 observations from a randomized complete block design with a single treatment factor. It can therefore be analyzed with a simple one-way ANOVA (fixed variety
effect) for randomized complete block designs (fixed block
effect):
<- dat %>% filter(factweek == "1") # subset data from first week only
dat.wk1
<- lm(formula = y ~ variety + block,
mod.wk1 data = dat.wk1)
We could now go on and look at the ANOVA via anova(mod.wk1)
and it would indeed not be wrong to simply repeat this for each week. Yet, one may not be satisfied with obtaining multiple ANOVA results - namely one per week. This is especially likeliy in case the results contradict each other, because e.g. the variety effects are found to be significant in only two out of five weeks. Therefore, one may want to analyze the entire dataset i.e. the multiple weeks jointly.
Going from the single-week-analysis to jointly analyzing the entire dataset is more than just changing the data =
statement in the model. This is because “it is realistic to assume that the treatment effects evolve over time and thus are week-specific. Importantly, we must also allow for the block effects to change over time in an individual manner. For example, there could be fertility or soil type differences between blocks and these could have a smooth progressive or cumulative time-based effect on differences between the blocks dependent on factors such as temperature or rainfall” (Piepho & Edmondson, 2018). We implement this by taking the model in mod.wk1
and multipyling each effect with factweek
. Note that this is also true for the general interecept (µ) in mod.wk1
, meaning that we would like to include one intercept per week, which can be achieved by simply adding factweek
as a main effect as well. This leaves us with fixed main effects for factweek
, variety
, and block
, as well as the week-specific effects of the latter two factweek:variety
and factweek:block
.
Finally, note that we are not doing anything about the model’s error term at this point. More specifically this means that its variance structure is still the default iid (independent and identically distributed) - see our summary on correlation/variance strucutres here.
Since the models in this chapter do not contain any random effects, we make use of gls()
instead of lme()
. Furthermore, the above named model factweek + variety + block + factweek:variety + factweek:block
can be written in a shorter syntax as:
<- nlme::gls(model = y ~ factweek * (variety + block),
mod.iid.nlme correlation = NULL, # default, i.e. homoscedastic, independent errors
data = dat)
# Extract variance component estimates
<- tibble(varstruct = "iid") %>%
mod.iid.nlme.VC mutate(sigma = mod.iid.nlme$sigma) %>%
mutate(Variance = sigma^2)
varstruct | sigma | Variance |
---|---|---|
iid | 0.152 | 0.023 |
Since the models in this chapter do not contain any random effects, we cannot use lmer()
or any other function of the lme4
package. However, even if there were random effects in our models, the short answer here is that with lme4
it is not possible to fit any variance structures.
More specifically, we can read in an lme4
vigniette: “The main advantage of nlme
relative to lme4
is a user interface for fitting models with structure in the residuals (various forms of heteroscedasticity and autocorrelation) and in the random-effects covariance matrices (e.g., compound symmetric models). With some extra effort, the computational machinery of lme4
can be used to fit structured models that the basic lmer
function cannot handle (see Appendix A)”
Michael Clark puts it as “Unfortunately, lme4 does not provide the ability to model the residual covariance structure, at least not in a straightforward fashion”
Accordingly, there is no info on this package for this chapter beyond this point.
In glmmTMB()
it is -to our knowledge- not possible to adjust the variance structure of the error.
Just like in
nlme
, there is aweights=
argument inglmmTMB()
. However, to our understanding, they have different functions:In
nlme
, it requires “an optionalvarFunc
object or one-sided formula describing the within-group heteroscedasticity structure” (nlme RefMan) and we make use of this in the chapter at hand.In
glmmTMB
, the RefMan only states “weights, as inglm
. Not automatically scaled to have sum 1”. Following this trail, theglm
documentation description for theweights=
argument is “an optional vector of ‘prior weights’ to be used in the fitting process. Should be NULL or a numeric vector.”. Accordingly, it cannot be used to allow for heterogeneous error variances in this package.
We can, however, “fix the residual variance to be 0 (actually a small non-zero value)” and therefore “force variance into the random effects” (glmmTMB RefMan) via adding the dispformula = ~ 0
argument. Thus, when doing so, we need to make sure to also add a random term to the model with the desired variance structure. By taking both of these actions, we are essentially mimicing the error (variance) as a random effect (variance). We achieve this by first creating a unit
column in the data with different entries for each data point.
Note that doing this may not be necessary for mod.iid.glmm
with the default homoscedastic, independent error variance structure, but since it will be necessary for the following model with a more sophisticated variance structure, we will apply it here, too.
Finally, for extracting the variance component estimates from glmmTMB
objects in this chapter, we use the extract_vc
function from the mixedup
helper package.
<- dat %>%
dat mutate(unit = 1:n() %>% as.factor) # new column with running number
<- glmmTMB(formula = y ~ factweek * (variety + block)
mod.iid.glmm + (1 | unit), # add random unit term to mimic error variance
dispformula = ~ 0, # fix original error variance to 0
REML = TRUE, # needs to be stated since default = ML
data = dat)
# Extract variance component estimates
<- mod.iid.glmm %>%
mod.iid.glmm.VC extract_vc(ci_scale = "var")
group | effect | variance | var_2.5 | var_97.5 | |
---|---|---|---|---|---|
unit | unit | Intercept | 0.023 | 0.016 | 0.033 |
1 | Residual | 0.000 |
It is quite easy to extract variance component estimates from sommer
objects in a table format we are used to:
<- mmer(fixed = y ~ factweek + variety + block + factweek:variety + factweek:block,
mod.iid.somm rcov = ~ units, # default = iid
data = dat, verbose=F)
## Version out of date. Please update sommer to the newest version using:
## install.packages('sommer') in a new session
## Use the 'date.warning' argument to disable the warning message.
# Extract variance component estimates
<- summary(mod.iid.somm)$varcomp mod.iid.somm.VC
grp | VarComp | VarCompSE | Zratio | Constraint |
---|---|---|---|---|
units.y-y | 0.023 | 0.004 | 5.477 | Positive |
In SAS, there is an actual subject=
statement that is used to identify the subjects on which repeated measurements are taken. This may not be necessary for the iid
with the default homoscedastic, independent error variance structure, but since it will be necessary for the following model with a more sophisticated variance structure, we will apply it here, too. In the end, the entire repeated
line below could also be left out as it simply explicitly states the default setting.
mod | CovParm | Subject | Estimate |
---|---|---|---|
iid | factweek | plot | 0.023 |
After all, a decision must be made on whether the ar1
variance structure for the error term was a good choice for modelling this dataset. More broadly speaking, it must be decided which of the models should be used for further investigations such as a test of fixed effects etc. As is standard procedure, we can do a model selection based on goodness-of-fit statistics such as the AIC (Akaike information criterion).
For nlme
and glmmTMB
, we make use of the aictab()
function from the helper package AICcmodavg.
<- aictab(list(mod.iid.nlme, mod.ar1.nlme), modnames = c("iid", "ar1")) AIC.nlme
Modnames | K | AICc | Delta_AICc | Res.LL | |
---|---|---|---|---|---|
2 | ar1 | 42 | 101.4 | 0.0 | 23.0 |
1 | iid | 41 | 137.6 | 36.2 | 1.9 |
not possible - see above.
<- aictab(list(mod.iid.glmm, mod.ar1.glmm), modnames = c("iid", "ar1")) AIC.glmm
Modnames | K | AICc | Delta_AICc | LL | |
---|---|---|---|---|---|
2 | ar1 | 42 | 101.4 | 0.0 | 23.0 |
1 | iid | 41 | 137.6 | 36.2 | 1.9 |
Comparing AIC values from the two models once again shows that sommer
does not really do what we are aiming for in this example. The AIC values for the two models are identical. The main reason for this is that they do not actually differ in the number of estimated parameters, since rho as the potentially second parameter here, was not estimated, but fixed by us.
[ Is this the whole story? …in progress… ]
<- list(mod.iid.somm, mod.ar1.somm)
somm.mods
<- tibble(
AIC.somm Modnames = c("iid", "ar1"),
AIC = somm.mods %>% map("AIC") %>% unlist,
LL = somm.mods %>% map("monitor") %>% lapply(. %>% `[`(1, ncol(.))) %>% unlist) %>% # last element in first row of "monitor"
arrange(AIC)
Modnames | AIC | LL |
---|---|---|
iid | -73.3 | 76.67347 |
ar1 | -73.3 | 76.67347 |
The model fit statistics have already been created via the ods output
statement of the respective models and are displayed here jointly:
mod | -2 Res Log-Likelihood | AIC | AICC | BIC |
---|---|---|---|---|
iid | -3.7 | -1.7 | -1.7 | -0.7 |
ar1 | -46.0 | -42.0 | -41.7 | -40.0 |
The ar1
model has the smaller AIC value. Thus, we choose the ar1
over the iid
model, since the former has the better fit.
Notice that this outcome satisfies the aim of this chapter (i.e. having an ar1
variance structure for the error term) but regarding the analysis of the underlying dataset, more steps can be taken in order to get concluding results. You can find further steps in Example 4 in Piepho & Edmondson (2018) (see also the Agritutorial vigniette). On one hand, they investigate more variance structure alternatives to iid
and on the other hand, they continue selecting a suitable regression model for the time trend after selecting a sutiable variance structure.
package | model syntax |
---|---|
nlme |
correlation = corAR1(form = ~ TIME | SUBJECT) |
lme4 |
not possible |
glmmTMB |
random = ~ ar1(TIME + 0 | SUBJECT), dispformula = ~ 0 |
sommer |
random = ~ vs(TIME, Gu = AR1(TIME, rho = FIXED.RHO)) |
SAS |
repeated TIME/sub=SUBJECT type=ar(1); |
mod | estimate | nlme | lme4 | glmmTMB | sommer | SAS |
---|---|---|---|---|---|---|
iid | variance | 0.0231758 | 0.023 | 0.0231758 | 0.02318 |
mod | estimate | nlme | lme4 | glmmTMB | sommer | SAS |
---|---|---|---|---|---|---|
ar1 | variance | 0.0221555 | 0.022 | 0.02216 | ||
ar1 | correlation | 0.7489923 | 0.749 | 0.74900 |
mod | estimate | nlme | lme4 | glmmTMB | sommer | SAS | |
---|---|---|---|---|---|---|---|
2 | iid | AIC | 101.4131 | 101.4130 | -73.34693 | -1.7 | |
1 | ar1 | AIC | 137.6407 | 137.6407 | -42.0 |
Please feel free to contact us about any of this!
schmidtpaul1989@outlook.com