# packages
::p_load(tidyverse, # data import and handling
pacman# handling function conflicts
conflicted, # adjusted mean comparisons
emmeans, multcomp, multcompView, # plots
ggplot2, desplot)
# conflicts between functions with the same name
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
This example is taken from Chapter “2 Randomized complete block
design” of the course material “Mixed models for metric data (3402-451)” by Prof. Dr. Hans-Peter Piepho. It considers data
published in Clewer and Scarisbrick (2001) from a yield (t/ha)
trial laid out as a randomized complete block
design (3
blocks) with cultivar
(4 cultivars) being the only
treatment factor. Thus, we have a total of 12 plots.
# data (import via URL)
<- "https://raw.githubusercontent.com/SchmidtPaul/DSFAIR/master/data/Clewer%26Scarisbrick2001.csv"
dataURL <- read_csv(dataURL)
dat
dat
## # A tibble: 12 × 5
## block cultivar yield row col
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 B1 C1 7.4 2 1
## 2 B1 C2 9.8 3 1
## 3 B1 C3 7.3 1 1
## 4 B1 C4 9.5 4 1
## 5 B2 C1 6.5 1 2
## 6 B2 C2 6.8 4 2
## 7 B2 C3 6.1 3 2
## 8 B2 C4 8 2 2
## 9 B3 C1 5.6 2 3
## 10 B3 C2 6.2 1 3
## 11 B3 C3 6.4 3 3
## 12 B3 C4 7.4 4 3
Before anything, the columns block
and
cultivar
should be encoded as factors, since R by default
encoded them as character.
<- dat %>%
dat mutate_at(vars(block, cultivar), as.factor)
In order to obtain a field layout of the trial, we can use the
desplot()
function. Notice that for this we need two data
columns that identify the row
and column
of
each plot in the trial.
desplot(data = dat, flip = T,
form = cultivar ~ col + row, # fill color per cultivar
out1 = block, # bold lines between blocks
text = yield, cex = 1, shorten = F, # show yield for each plot
main = "Field layout", show.key = F) # formatting
We could also have a look at the arithmetic means and standard
deviations for yield per cultivar
or
block
.
%>%
dat group_by(cultivar) %>%
summarize(mean = mean(yield),
std.dev = sd(yield))
## # A tibble: 4 × 3
## cultivar mean std.dev
## <fct> <dbl> <dbl>
## 1 C1 6.5 0.9
## 2 C2 7.6 1.93
## 3 C3 6.6 0.624
## 4 C4 8.3 1.08
%>%
dat group_by(block) %>%
summarize(mean = mean(yield),
std.dev = sd(yield))
## # A tibble: 3 × 3
## block mean std.dev
## <fct> <dbl> <dbl>
## 1 B1 8.5 1.33
## 2 B2 6.85 0.819
## 3 B3 6.4 0.748
We can also create a plot to get a better feeling for the data.
ggplot(data = dat,
aes(y = yield, x = cultivar, color = block)) +
geom_point() + # scatter plot
ylim(0, NA) + # force y-axis to start at 0
theme_classic() # clearer plot format
Finally, we can decide to fit a linear model with yield
as the response variable and (fixed) cultivar
and
block
effects.
<- lm(yield ~ cultivar + block, data = dat) mod
Thus, we can conduct an ANOVA for this model. As can be seen, the
F-test of the ANOVA finds the cultivar
effects to be
statistically significant (p = 0.037 < 0.05).
%>% anova() mod
## Analysis of Variance Table
##
## Response: yield
## Df Sum Sq Mean Sq F value Pr(>F)
## cultivar 3 6.63 2.21 5.525 0.036730 *
## block 2 9.78 4.89 12.225 0.007651 **
## Residuals 6 2.40 0.40
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Following a significant F-test, one will want to compare cultivar means.
<- mod %>%
mean_comparisons emmeans(specs = ~ cultivar) %>% # get adjusted means for cultivars
cld(adjust="tukey", Letters=letters) # add compact letter display
mean_comparisons
## cultivar emmean SE df lower.CL upper.CL .group
## C1 6.5 0.365 6 5.22 7.78 a
## C3 6.6 0.365 6 5.32 7.88 ab
## C2 7.6 0.365 6 6.32 8.88 ab
## C4 8.3 0.365 6 7.02 9.58 b
##
## Results are averaged over the levels of: block
## Confidence level used: 0.95
## Conf-level adjustment: sidak method for 4 estimates
## P value adjustment: tukey method for comparing a family of 4 estimates
## significance level used: alpha = 0.05
## NOTE: If two or more means share the same grouping symbol,
## then we cannot show them to be different.
## But we also did not show them to be the same.
Note that if you would like to see the underyling individual
contrasts/differences between adjusted means, simply add
details = TRUE
to the cld()
statement. Also,
find more information on mean comparisons and the Compact Letter Display
in the separate Compact Letter
Display Chapter
For this example we can create a plot that displays both the raw data and the results, i.e. the comparisons of the adjusted means that are based on the linear model. If you would rather have e.g. a bar plot to show these results, check out the separate Compact Letter Display Chapter
ggplot() +
# black dots representing the raw data
geom_point(
data = dat,
aes(y = yield, x = cultivar)
+
) # red dots representing the adjusted means
geom_point(
data = mean_comparisons,
aes(y = emmean, x = cultivar),
color = "red",
position = position_nudge(x = 0.1)
+
) # red error bars representing the confidence limits of the adjusted means
geom_errorbar(
data = mean_comparisons,
aes(ymin = lower.CL, ymax = upper.CL, x = cultivar),
color = "red",
width = 0.1,
position = position_nudge(x = 0.1)
+
) # red letters
geom_text(
data = mean_comparisons,
aes(y = emmean, x = cultivar, label = str_trim(.group)),
color = "red",
position = position_nudge(x = 0.2),
hjust = 0
+
) ylim(0, NA) + # force y-axis to start at 0
ylab("Yield in t/ha") + # label y-axis
xlab("Cultivar") + # label x-axis
labs(caption = "Black dots represent raw data
Red dots and error bars represent adjusted mean with 95% confidence limits per cultivar
Means followed by a common letter are not significantly different according to the Tukey-test") +
theme_classic() # clearer plot format
This example is taken from “Example 5.11” of the course material “Quantitative Methods in Biosciences (3402-420)” by Prof. Dr. Hans-Peter Piepho. It considers data published in Gomez & Gomez (1984). A randomized complete block experiment was conducted to assess the yield (kg/ha) of rice cultivar IR8 at six different seeding densities (kg/ha).
Important: The treatment factor
(density
) is actually quantitative variable, but for this
exercise you should define it as a factor
variable with 6
levels anyway. (Notice, however, that in such a case a regression
analysis with density
being a numeric variable is actually
more efficient that ANOVA followed by multiple comparison of means.)
desplot()
# data (import via URL)
<- "https://raw.githubusercontent.com/SchmidtPaul/DSFAIR/master/data/Gomez%26Gomez1984b.csv"
dataURL <- read_csv(dataURL) ex1dat
R-Code and exercise solutions
Please click here to find a folder with .R
files. Each file contains
Please feel free to contact me about any of this!
schmidtpaul1989@outlook.com