When estimating BLUEs in Stage1, an unstructured trait covariance matrix for the residuals is included. The phenotypes for one environment are denoted \(y_{ikl}\), where \(i\) represents genotype, \(k\) is trait, and \(l\) represents one or more indices for treatments, covariates, etc.
\[ y_{ikl} = \mu + g_{ik} + \dots + \epsilon_{ikl} \] The genotype effects \(g_{ik}\) are fixed, and the residuals are multivariate normal with \(var[\boldsymbol{\epsilon}] = \mathbf{I} \otimes \boldsymbol\Sigma_\epsilon\), where \(\boldsymbol\Sigma_\epsilon\) is an unstructured var-cov matrix for traits.
The response variable for Stage2 is \(BLUE[g_{ijk}]\), where the subscript \(j\) is now included to represent environment. The Stage2 model with multivariate normal additive \(\boldsymbol{a}\) and dominance \(\boldsymbol{d}\) effects is
\[BLUE[g_{ijk}] = E_{jk} + a_{ik} + d_{ik} - b_k F_i + gE_{ijk} + s_{ijk}\]
where \(E_{jk}\) is the fixed effect for environment, \(F_i\) is the genomic inbreeding coefficient, and its regression coefficient \(b_k\) represents baseline heterosis, i.e., the difference between the population at panmictic equilibrium vs. fully inbred. The other effects are multivariate normal with zero mean and \(var[\boldsymbol{a}]=\mathbf{G} \otimes \boldsymbol\Sigma_a\), \(var[\boldsymbol{d}]=\mathbf{D} \otimes \boldsymbol\Sigma_d\), and \(var[\boldsymbol{gE}]=\mathbf{I} \otimes \boldsymbol\Sigma_{gE}\). The preceding \(\boldsymbol\Sigma\) are unstructured var-cov matrices for traits. The Stage1 error term \(\boldsymbol{s}\) is also multivariate normal, with var-cov matrix equal to the direct sum of the var-cov matrices of the Stage1 BLUEs for each environment.
In Vignette 1, three traits were analyzed independently: yield, maturity, and fry color. The syntax for analyzing multiple traits is very similar.
library(StageWise)
pheno.file <- system.file("vignette_data", "pheno1a.csv", package = "StageWise")
ans1 <- Stage1(filename=pheno.file,traits=c("total.yield","vine.maturity","fry.color"),
effects <- data.frame(name=c("block","stand.count"),
fixed=c(FALSE,TRUE),
factor=c(TRUE,FALSE)))
## Online License checked out Wed Jul 24 14:18:21 2024
## [1] "blues" "vcov" "fit" "resid"
As with the single trait analysis, Stage1
returns a data
frame of BLUEs and a list of their var-cov matrices. Instead of residual
diagnostic plots, however, the residual covariance matrices are returned
in “resid”.
As of v1.04 of the package, the Stage2
command has an
additional argument “pairwise”, which indicates whether all traits
should be analyzed at once, or if the traits should be analyzed in pairs
to build up the full covariance matrix. The pairwise approach may be
needed with many traits to get convergence in ASReml-R. Let’s compare
the two approaches with this dataset.
geno.file <- system.file("vignette_data", "geno1.csv", package = "StageWise")
geno <- read_geno(geno.file,ploidy=4,map=TRUE,dominance = TRUE)
## Minor allele threshold = 5 genotypes
## Number of markers = 12242
## Number of genotypes = 943
ans2 <- Stage2(data=ans1$blue, vcov=ans1$vcov, geno=geno, non.add="dom",
pairwise=FALSE)
ans2p <- Stage2(data=ans1$blue, vcov=ans1$vcov, geno=geno, non.add="dom",
pairwise=TRUE)
summary(ans2$vars)
## $var
## total.yield vine.maturity fry.color
## env 68.0 0.50 17.7
## additive 43.3 1.04 5.1
## dominance 26.9 0.22 0.8
## heterosis 3.9 0.01 0.0
## g x env 41.8 0.42 4.9
## Stage1.error 31.9 0.99 6.1
##
## $PVE
## total.yield vine.maturity fry.color
## additive 0.293 0.390 0.303
## dominance 0.182 0.080 0.049
## heterosis 0.026 0.003 0.000
## g x env 0.283 0.158 0.291
## Stage1.error 0.216 0.369 0.357
##
## $cor.mat
## total.yield vine.maturity fry.color
## total.yield 1.000 0.473 -0.194
## vine.maturity 0.473 1.000 0.112
## fry.color -0.194 0.112 1.000
## total.yield vine.maturity fry.color
## total.yield 1.000 0.502 -0.246
## vine.maturity 0.502 1.000 0.131
## fry.color -0.246 0.131 1.000
When “pairwise=FALSE”, the summary
command shows the
variances and proportion of variation explained (PVE) as separate
tables, as well as the additive genetic correlation between traits. When
“pairwise=TRUE”, only the additive genetic correlation matrix is
available. The correlation matrices are similar and show that later
maturity is correlated with higher yield, which will need to be
addressed when constructing a selection index.
The next step in the workflow is blup_prep
, which has
exactly the same syntax as the single trait analysis.
For multi-trait analyses, the blup
command was designed
to predict total genetic merit, not individual traits, so the user must
supply index coefficients.
The gain
command allows breeders to compare the expected
response for each trait under different indices. The argument “merit”
specifies the contribution of each standardized trait for total genetic
merit. For example, if genetic gains for fry color and yield are equally
valuable, and maturity is ignored, the code is
merit.coeff=c("total.yield"=1, "vine.maturity"=0, "fry.color"=1)
gain1 <- gain(prep1, merit=merit.coeff)
kable(gain1$table)
trait | response | index |
---|---|---|
total.yield | 0.436 | 0.707 |
vine.maturity | 0.384 | 0.000 |
fry.color | 0.412 | 0.707 |
In the table returned by gain
, because multi-trait BLUPs
are used in the selection index, the index coefficients have the same
ratio as the merit coefficients (1:0:1)–they have just been rescaled to
have unit norm. The new information is the expected response for each
trait, in units of intensity x genetic standard deviation (\(i\sigma\)). By default breeding values are
used for \(\sigma\), but this can be
controlled with argument “gamma” (see reference manual).
For a given selection intensity, the set of all possible responses is
an ellipsoid. The argument “traits” in gain
can be used to
see the ellipse for exactly 2 traits, which illustrates selection
tradeoffs.
The dashed red line shows the direction of the merit vector, while the blue line segment is the projection of the point on the ellipsoid that maximizes genetic merit.
Because our maturity trait scale ranges from 1 = early to 9 = late,
the positive response implies selection for later maturity, which is
undesirable. The “restricted” argument in the gain
command
can be used to restrict the response for particular traits while
maximizing genetic merit for other traits. The form of the argument is a
data frame with two columns: “trait” and “sign”. The “sign” column can
have one of three symbols: “=”, “<”, “>”, which indicate whether
the response is \(= 0\), \(\leq 0\), or \(\geq 0\), respectively. In this case, we
want the maturity response to be less than or equal to zero:
gain2 <- gain(input=prep1, merit=merit.coeff,
restricted=data.frame(trait="vine.maturity", sign="<"))
kable(gain2$table)
trait | response | index |
---|---|---|
total.yield | 0.286 | 0.644 |
vine.maturity | 0.000 | -0.414 |
fry.color | 0.428 | 0.644 |
Compared with the response table for the unrestricted index, the yield response decreased from \(0.44i\sigma\) to \(0.29i\sigma\), while the response for fry color slightly increased. Multiplying the response vector by the merit coefficients and summing generates the response for total genetic merit, which is necessarily lower for the restricted index:
## [1] 0.848
## [1] 0.714
One other option in gain
is “desired”, which allows the
user to specify the desired gains for each trait (in units of \(i\sigma\)), rather than specifying their
contribution to genetic merit. Either “desired” or “merit” can be
specified, not both. To achieve equal gains for yield and fry color,
while restricting maturity, the desired gains vector is numerically
equal to the previous merit coefficient vector:
trait | response | index |
---|---|---|
total.yield | 0.354 | 0.695 |
vine.maturity | 0.000 | -0.431 |
fry.color | 0.354 | 0.576 |
## [1] 0.708
The above result shows the response for total merit is slightly less under the desired gains index, if we continue to assume equal merit for yield and fry color. But if yield is actually more important than fry color, this solution seems desirable.
Let’s use the desired gains index to compute BLUPs.
index.coeff <- gain3$table$index
names(index.coeff) <- gain3$table$trait
GEBV <- blup(prep1, geno, what="BV", index.coeff=index.coeff)
head(GEBV)
## id value r2
## 1 AF5392-8 35.62543 0.4499786
## 2 AF5393-1 36.30372 0.5350524
## 3 AF5429-3 37.24289 0.6022596
## 4 AF5445-2 37.71679 0.5367704
## 5 AF5450-7 37.05010 0.5219305
## 6 AF5484-3 36.03935 0.6062087
As explained in Vignette 1, the dominance
function
expresses the inbreeding depression and dominance variance estimates
relative to the additive SD and variance estimates, respectively. This
also works for multi-trait models once the index coefficients are
specified. To be consistent with the above gain
calculation, gamma=1/3 is needed for tetraploid breeding values:
## ratio estimate
## 1 Vd/Va 0.4506758
## 2 b/SDa 7.6863051
One last comment: groups of unrelated traits can be analyzed
independently through blup_prep
, and the outputs from this
command can be combined as a list in blup
to make the
predictions. This assumes zero correlation between traits in the
different groups and requires the same genetic model to have been
used.
The “mask” argument for blup_prep
makes it easy to
investigate the potential benefit of using a correlated, secondary trait
to improve genomic selection. For example, many plant breeding programs
are exploring the use of spectral measurements from high-throughput
phenotyping platforms to improve selection for yield. The following
example is based on data from Rutkoski et al. (2016), who showed that
canopy temperature (CT) during grain fill was predictive of yield in
wheat. The G matrix and Stage 1 BLUEs from the drought and extreme
drought environments are distributed with the package. As with the
potato dataset in Vignette 1, including the Stage 1 errors in Stage 2
lowers the AIC substantially.
## env id trait BLUE
## 1 drought 6569128 GY 3.395834
## 2 drought 6569128 CT 36.950000
## 3 drought 6688880 GY 3.494048
## 4 drought 6688880 CT 37.902333
## 5 drought 6688916 GY 3.459078
## 6 drought 6688916 CT 38.706000
ans2a <- Stage2(data=wheat.blues, vcov=wheat.vcov, geno=wheat.geno,
non.add="none")
ans2b <- Stage2(data=wheat.blues, geno=wheat.geno, non.add="none")
data.frame(vcov=c(TRUE,FALSE), AIC=c(ans2a$aic,ans2b$aic))
## vcov AIC
## 1 TRUE -16.97154
## 2 FALSE 539.30801
Because the wheat lines are inbred, the genetic residual option in StageWise could be used for modeling non-additive values, but the AIC was lower without it (non.add=“none”). Genomic heritability was 0.45-0.50 for yield and canopy temperature, with an additive genetic correlation of -0.81.
## $var
## GY CT
## env 0.621 1.58
## additive 0.101 0.61
## g x env 0.048 0.44
## Stage1.error 0.055 0.30
##
## $PVE
## GY CT
## additive 0.495 0.454
## g x env 0.237 0.326
## Stage1.error 0.269 0.221
##
## $cor.mat
## GY CT
## GY 1.000 -0.807
## CT -0.807 1.000
For genomic predictions, first we will do a tenfold cross validation without using CT data for the selection candidates, which can be called marker-based selection (MBS, see Vignette 1). Since the goal is yield prediction, the index coefficients are 1 and 0 for GY and CT, respectively.
id <- unique(wheat.blues$id)
N <- length(id)
folds <- split(sample(id),cut(1:N,10))
MBS <- NULL
for (i in 1:10) {
prep <- blup_prep(wheat.blues, wheat.vcov, wheat.geno, ans2a$vars,
mask=data.frame(id=folds[[i]]))
pred <- blup(prep, geno=wheat.geno, what="BV",
index.coeff=c(GY=1, CT=0))
MBS <- rbind(MBS, pred[pred$id %in% folds[[i]],])
}
In the above code, the “mask” argument for blup_prep
only has the variable “id”, which means that all Stage 1 BLUEs for those
individuals are masked. To only mask grain yield and use CT as a
secondary trait for marker-assisted selection (MAS), a second variable
named “trait” is used.
MAS <- NULL
for (i in 1:10) {
prep <- blup_prep(wheat.blues, wheat.vcov, wheat.geno, ans2a$vars,
mask=data.frame(id=folds[[i]], trait="GY"))
pred <- blup(prep, geno=wheat.geno, what="BV",
index.coeff=c(GY=1, CT=0))
MAS <- rbind(MAS, pred[pred$id %in% folds[[i]],])
}
ans <- merge(MBS,MAS,by="id")
library(ggplot2)
ggplot(ans,aes(x=r2.x, y=r2.y)) + geom_hex() + coord_fixed(ratio=1) + geom_line(data=data.frame(x=c(0.2,0.8),y=c(0.2,0.8)),mapping=aes(x=x,y=y),linetype=2) + ggtitle("Reliability") +
xlab("MBS") + ylab("MAS")
The above figure shows that using CT increased the reliability of genomic prediction, by 0.2 on average.