This vignette builds on the information in Vignette 1, which covered analysis at a single location. This vignette illustrates how to model the covariance between multiple locations, using a dataset of potato yield trials across 8 locations (TX was excluded) and 6 years, with a population of 336 genotypes (Schmitz Carley et al. 2019).
library(StageWise)
geno.file <- system.file("vignette_data", "geno2.csv", package = "StageWise")
geno <- read_geno(filename=geno.file,ploidy=4,map=FALSE,dominance=TRUE)
## Minor allele threshold = 5 genotypes
## Number of markers = 5262
## Number of genotypes = 336
pheno.file <- system.file("vignette_data", "pheno2.csv", package = "StageWise")
pheno <- read.csv(pheno.file)
head(pheno)
## id loc env Yield.Mg.ha
## 1 A01143-3C CA CA_2011 90.98
## 2 A01143-3C CA CA_2011 95.12
## 3 AC00206-2W CA CA_2011 74.44
## 4 AC01151-5W CA CA_2011 74.44
## 5 AC01151-5W CA CA_2011 95.12
## 6 AC03452-2W CA CA_2011 86.85
##
## CA FL MI MO NC NY OR WI
## 804 815 820 482 847 832 612 757
The input file contains not only a column named ‘env’ for environment
but also one named ‘loc’ for location. The location information is not
explicitly used by Stage1
but is automatically detected and
retained in the output for use in Stage2
.
## Offline License checked out Thu Jul 18 03:16:21 2024
## env id BLUE loc
## 1 CA_2011 A01143-3C 93.050 CA
## 2 CA_2011 AC00206-2W 74.440 CA
## 3 CA_2011 AC01151-5W 84.780 CA
## 4 CA_2011 AC03452-2W 86.850 CA
## 5 CA_2011 AC05153-1W 49.630 CA
## 6 CA_2011 Accumulator 105.455 CA
library(ggplot2)
ggplot(data=ans1$fit,aes(x=loc,y=H2)) + stat_boxplot(outlier.color="red") + xlab("Location") + ylab(expression(paste("Broad-sense ",H^2," (plot basis)")))
The above figure shows the variation in broad-sense heritability across locations and years.
When the data frame of BLUEs passed to Stage2
has a
column labeled ‘loc’, genotype x location effects are included using a
separable covariance structure. The genetic covariance between locations
for the highest order genetic effect (i.e., additive when marker data
are included) follows a 2nd order factor-analytic (FA2) model. (For
non-additive genetic effects, the correlation is constrained at 1.) For
large datasets with many locations (such as this one), I recommend first
analyzing it without marker data, as the computation proceeds more
quickly.
Here is a joint analysis of the 8 locations without using the marker data.
## $var
## Variance PVE
## env 286 NA
## genotype 53 0.355
## g x loc 15 0.098
## g x env 22 0.150
## Stage1.error 59 0.397
##
## $cor
## CA FL MI MO NC NY OR WI
## CA 1.000 0.642 0.685 0.728 0.839 0.686 0.869 0.751
## FL 0.642 1.000 0.665 0.711 0.699 0.699 0.734 0.717
## MI 0.685 0.665 1.000 0.932 0.860 0.930 0.908 0.932
## MO 0.728 0.711 0.932 1.000 0.921 0.998 0.972 0.999
## NC 0.839 0.699 0.860 0.921 1.000 0.904 0.953 0.928
## NY 0.686 0.699 0.930 0.998 0.904 1.000 0.956 0.996
## OR 0.869 0.734 0.908 0.972 0.953 0.956 1.000 0.979
## WI 0.751 0.717 0.932 0.999 0.928 0.996 0.979 1.000
As in the single location analysis, running the summary
command on the ‘vars’ output shows the partitioning of variance, but now
a g x loc effect is also present. Because g x year effects are not
repeatable and often small, they are not included in the
Stage2
model.
The FA2 model should have enough complexity for a set of correlated
locations used by a single breeding program, but it may be inadequate
when analyzing disparate locations. The factor loadings returned by
Stage2
can be used with the function uniplot
to visualize model sufficiency and correlation structure (Cullis et al. 2010).
The squared radius for each location is the proportion of variance explained (PVE) by the latent factors. With the exception of FL, the FA2 model appears to provide a good representation of the covariance structure. The numeric PVE values can be obtained as follows:
## CA FL MI MO NC NY OR WI
## 1.0000000 0.5388560 0.8694623 1.0000000 0.9077776 1.0000000 1.0000000 1.0000000
The cosine of the angle between locations equals the correlation due
to the latent factors (recall cos(0) = 1). Focusing on the 6 highly
correlated locations-WI,MI,NC,OR,NY,MO-we now add the marker data to
Stage2
to estimate the additive correlation and predict
breeding values.
locs <- c("WI","MI","OR","NY","NC","MO")
blues <- ans1$blues[ans1$blues$loc %in% locs,]
tmp <- sapply(strsplit(names(ans1$vcov),split="_"),"[[",1)
vcov <- ans1$vcov[tmp %in% locs]
ans2b <- Stage2(data=blues,vcov=vcov,geno=geno,non.add="g.resid")
ans2c <- Stage2(data=blues,vcov=vcov,geno=geno,non.add="dom")
## non.add AIC
## 1 g.resid 16725.00
## 2 dom 16694.96
The dominance model is selected over the genetic residual model because of its lower AIC.
## $var
## Variance PVE
## env 311 NA
## additive 35 0.208
## add x loc 8 0.046
## dominance 36 0.211
## heterosis 4 0.024
## g x env 20 0.117
## Stage1.error 67 0.393
##
## $cor
## MI MO NC NY OR WI
## MI 1.000 0.881 0.700 0.884 0.890 0.908
## MO 0.881 1.000 0.907 0.934 0.976 0.939
## NC 0.700 0.907 1.000 0.737 0.831 0.707
## NY 0.884 0.934 0.737 1.000 0.945 0.967
## OR 0.890 0.976 0.831 0.945 1.000 0.961
## WI 0.908 0.939 0.707 0.967 0.961 1.000
The correlation matrix and uniplot, which are based on the additive values, show that yield in NC was somewhat different compared to the other five sites.
Compared to the single location analysis in Vignette 1, an additional
argument is needed for the blup
function to specify the
index coefficients, which are interpreted as the relative weight for
each location after standardization to unit variance. The following code
compares the reliability for BV predictions in WI vs. NC.
prep1 <- blup_prep(data=blues,vcov=vcov,geno=geno,vars=ans2c$vars)
WI.index <- c(WI=1, NC=0, OR=0, NY=0, MO=0, MI=0)
NC.index <- c(WI=0, NC=1, OR=0, NY=0, MO=0, MI=0)
WI.pred <- blup(data=prep1,geno=geno,index.coeff=WI.index,what="BV")
NC.pred <- blup(data=prep1,geno=geno,index.coeff=NC.index,what="BV")
pred <- merge(NC.pred,WI.pred,by="id")
colnames(pred) <- c("id","BV.NC","r2.NC","BV.WI","r2.WI")
ggplot(pred,aes(x=r2.NC,y=r2.WI)) + geom_point() + coord_fixed(ratio=1) + xlim(0.3,0.8) + ylim(0.3,0.8) + geom_line(data=data.frame(x=c(0.3,0.8),y=c(0.3,0.8)),mapping=aes(x=x,y=y),linetype=2) + theme_bw() +
xlab("NC") + ylab("WI") + ggtitle("BV Reliability")
The above figure shows that breeding values were predicted with higher reliability in WI than NC, which is to be expected from the high correlation between WI and the other four sites (OR,MO,MI,NY) compared to NC.
The mask
argument for blup_prep
can be used
to mask individuals at one or more locations, to explore the accuracy of
prediction into new environments. Whereas the previous analysis
determined the reliability of GEBVs in WI when WI phenotypes are
available, the next analysis excludes WI phenotypes:
WI.env <- unique(blues$env[blues$loc=="WI"])
prep2 <- blup_prep(data=blues,vcov=vcov,geno=geno,vars=ans2c$vars,
mask=data.frame(env=WI.env))
pred2 <- blup(data=prep2,geno=geno,index.coeff=WI.index,what="BV")
plot.data <- merge(pred2[,c("id","r2")],WI.pred[,c("id","r2")],by="id")
colnames(plot.data) <- c("id","without.WI.pheno","with.WI.pheno")
ggplot(plot.data,aes(x=without.WI.pheno,y=with.WI.pheno)) + geom_point() + coord_fixed(ratio=1) + xlim(0.3,0.8) + ylim(0.3,0.8) + geom_line(data=data.frame(x=c(0.3,0.8),y=c(0.3,0.8)),mapping=aes(x=x,y=y),linetype=2) + theme_bw() +
xlab("Without WI phenotypes") + ylab("With WI phenotypes") + ggtitle("WI BV Reliability")
As expected, the reliability was higher with WI phenotypes.