Vignette 3: Multi-trait analysis

Multi-Trait Model

When estimating BLUEs in Stage1, an unstructured trait covariance matrix for the residuals is included. The phenotypes for one environment are denoted \(y_{ikl}\), where \(i\) represents genotype, \(k\) is trait, and \(l\) represents one or more indices for treatments, covariates, etc.

\[ y_{ikl} = \mu + g_{ik} + \dots + \epsilon_{ikl} \] The genotype effects \(g_{ik}\) are fixed, and the residuals are multivariate normal with \(var[\boldsymbol{\epsilon}] = \mathbf{I} \otimes \boldsymbol\Sigma_\epsilon\), where \(\boldsymbol\Sigma_\epsilon\) is an unstructured var-cov matrix for traits.

The response variable for Stage2 is \(BLUE[g_{ijk}]\), where the subscript \(j\) is now included to represent environment. The Stage2 model with multivariate normal additive \(\boldsymbol{a}\) and dominance \(\boldsymbol{d}\) effects is

\[BLUE[g_{ijk}] = E_{jk} + a_{ik} + d_{ik} - b_k F_i + gE_{ijk} + s_{ijk}\]

where \(E_{jk}\) is the fixed effect for environment, \(F_i\) is the genomic inbreeding coefficient, and its regression coefficient \(b_k\) represents baseline heterosis, i.e., the difference between the population at panmictic equilibrium vs. fully inbred. The other effects are multivariate normal with zero mean and \(var[\boldsymbol{a}]=\mathbf{G} \otimes \boldsymbol\Sigma_a\), \(var[\boldsymbol{d}]=\mathbf{D} \otimes \boldsymbol\Sigma_d\), and \(var[\boldsymbol{gE}]=\mathbf{I} \otimes \boldsymbol\Sigma_{gE}\). The preceding \(\boldsymbol\Sigma\) are unstructured var-cov matrices for traits. The Stage1 error term \(\boldsymbol{s}\) is also multivariate normal, with var-cov matrix equal to the direct sum of the var-cov matrices of the Stage1 BLUEs for each environment.

Potato dataset

In Vignette 1, three traits were analyzed independently: yield, maturity, and fry color. The syntax for analyzing multiple traits is very similar.

library(StageWise)
pheno.file <- system.file("vignette_data", "pheno1a.csv", package = "StageWise")
ans1 <- Stage1(filename=pheno.file,traits=c("total.yield","vine.maturity","fry.color"),
              effects <- data.frame(name=c("block","stand.count"),
                      fixed=c(FALSE,TRUE),
                      factor=c(TRUE,FALSE)))
## Online License checked out Wed Jul 24 14:18:21 2024
names(ans1)
## [1] "blues" "vcov"  "fit"   "resid"

As with the single trait analysis, Stage1 returns a data frame of BLUEs and a list of their var-cov matrices. Instead of residual diagnostic plots, however, the residual covariance matrices are returned in “resid”.

As of v1.04 of the package, the Stage2 command has an additional argument “pairwise”, which indicates whether all traits should be analyzed at once, or if the traits should be analyzed in pairs to build up the full covariance matrix. The pairwise approach may be needed with many traits to get convergence in ASReml-R. Let’s compare the two approaches with this dataset.

geno.file <- system.file("vignette_data", "geno1.csv", package = "StageWise")
geno <- read_geno(geno.file,ploidy=4,map=TRUE,dominance = TRUE)
## Minor allele threshold = 5 genotypes
## Number of markers = 12242
## Number of genotypes = 943
ans2 <- Stage2(data=ans1$blue, vcov=ans1$vcov, geno=geno, non.add="dom",
               pairwise=FALSE)
ans2p <- Stage2(data=ans1$blue, vcov=ans1$vcov, geno=geno, non.add="dom",
                pairwise=TRUE)

summary(ans2$vars)
## $var
##              total.yield vine.maturity fry.color
## env                 68.0          0.50      17.7
## additive            43.3          1.04       5.1
## dominance           26.9          0.22       0.8
## heterosis            3.9          0.01       0.0
## g x env             41.8          0.42       4.9
## Stage1.error        31.9          0.99       6.1
## 
## $PVE
##              total.yield vine.maturity fry.color
## additive           0.293         0.390     0.303
## dominance          0.182         0.080     0.049
## heterosis          0.026         0.003     0.000
## g x env            0.283         0.158     0.291
## Stage1.error       0.216         0.369     0.357
## 
## $cor.mat
##               total.yield vine.maturity fry.color
## total.yield         1.000         0.473    -0.194
## vine.maturity       0.473         1.000     0.112
## fry.color          -0.194         0.112     1.000
summary(ans2p$vars)
##               total.yield vine.maturity fry.color
## total.yield         1.000         0.502    -0.246
## vine.maturity       0.502         1.000     0.131
## fry.color          -0.246         0.131     1.000

When “pairwise=FALSE”, the summary command shows the variances and proportion of variation explained (PVE) as separate tables, as well as the additive genetic correlation between traits. When “pairwise=TRUE”, only the additive genetic correlation matrix is available. The correlation matrices are similar and show that later maturity is correlated with higher yield, which will need to be addressed when constructing a selection index.

The next step in the workflow is blup_prep, which has exactly the same syntax as the single trait analysis.

prep1 <- blup_prep(ans1$blues, vcov=ans1$vcov, geno=geno, 
                   vars=ans2$vars)

For multi-trait analyses, the blup command was designed to predict total genetic merit, not individual traits, so the user must supply index coefficients.

The gain command allows breeders to compare the expected response for each trait under different indices. The argument “merit” specifies the contribution of each standardized trait for total genetic merit. For example, if genetic gains for fry color and yield are equally valuable, and maturity is ignored, the code is

merit.coeff=c("total.yield"=1, "vine.maturity"=0, "fry.color"=1)

gain1 <- gain(prep1, merit=merit.coeff)

kable(gain1$table)
trait response index
total.yield 0.436 0.707
vine.maturity 0.384 0.000
fry.color 0.412 0.707

In the table returned by gain, because multi-trait BLUPs are used in the selection index, the index coefficients have the same ratio as the merit coefficients (1:0:1)–they have just been rescaled to have unit norm. The new information is the expected response for each trait, in units of intensity x genetic standard deviation (\(i\sigma\)). By default breeding values are used for \(\sigma\), but this can be controlled with argument “gamma” (see reference manual).

For a given selection intensity, the set of all possible responses is an ellipsoid. The argument “traits” in gain can be used to see the ellipse for exactly 2 traits, which illustrates selection tradeoffs.

gain1 <- gain(input=prep1, merit=merit.coeff,
              traits=c("total.yield","vine.maturity"))
gain1$plot

The dashed red line shows the direction of the merit vector, while the blue line segment is the projection of the point on the ellipsoid that maximizes genetic merit.

Because our maturity trait scale ranges from 1 = early to 9 = late, the positive response implies selection for later maturity, which is undesirable. The “restricted” argument in the gain command can be used to restrict the response for particular traits while maximizing genetic merit for other traits. The form of the argument is a data frame with two columns: “trait” and “sign”. The “sign” column can have one of three symbols: “=”, “<”, “>”, which indicate whether the response is \(= 0\), \(\leq 0\), or \(\geq 0\), respectively. In this case, we want the maturity response to be less than or equal to zero:

gain2 <- gain(input=prep1, merit=merit.coeff,
              restricted=data.frame(trait="vine.maturity", sign="<"))
kable(gain2$table)
trait response index
total.yield 0.286 0.644
vine.maturity 0.000 -0.414
fry.color 0.428 0.644

Compared with the response table for the unrestricted index, the yield response decreased from \(0.44i\sigma\) to \(0.29i\sigma\), while the response for fry color slightly increased. Multiplying the response vector by the merit coefficients and summing generates the response for total genetic merit, which is necessarily lower for the restricted index:

sum(merit.coeff*gain1$table$response)
## [1] 0.848
sum(merit.coeff*gain2$table$response)
## [1] 0.714

One other option in gain is “desired”, which allows the user to specify the desired gains for each trait (in units of \(i\sigma\)), rather than specifying their contribution to genetic merit. Either “desired” or “merit” can be specified, not both. To achieve equal gains for yield and fry color, while restricting maturity, the desired gains vector is numerically equal to the previous merit coefficient vector:

gain3 <- gain(input=prep1, desired=merit.coeff)
kable(gain3$table)
trait response index
total.yield 0.354 0.695
vine.maturity 0.000 -0.431
fry.color 0.354 0.576
sum(merit.coeff*gain3$table$response)
## [1] 0.708

The above result shows the response for total merit is slightly less under the desired gains index, if we continue to assume equal merit for yield and fry color. But if yield is actually more important than fry color, this solution seems desirable.

Let’s use the desired gains index to compute BLUPs.

index.coeff <- gain3$table$index
names(index.coeff) <- gain3$table$trait
GEBV <- blup(prep1, geno, what="BV", index.coeff=index.coeff)
head(GEBV)
##         id    value        r2
## 1 AF5392-8 35.62543 0.4499786
## 2 AF5393-1 36.30372 0.5350524
## 3 AF5429-3 37.24289 0.6022596
## 4 AF5445-2 37.71679 0.5367704
## 5 AF5450-7 37.05010 0.5219305
## 6 AF5484-3 36.03935 0.6062087

As explained in Vignette 1, the dominance function expresses the inbreeding depression and dominance variance estimates relative to the additive SD and variance estimates, respectively. This also works for multi-trait models once the index coefficients are specified. To be consistent with the above gain calculation, gamma=1/3 is needed for tetraploid breeding values:

dominance(ans2$params, index.coeff=index.coeff, gamma=1/3)
##   ratio  estimate
## 1 Vd/Va 0.4506758
## 2 b/SDa 7.6863051

One last comment: groups of unrelated traits can be analyzed independently through blup_prep, and the outputs from this command can be combined as a list in blup to make the predictions. This assumes zero correlation between traits in the different groups and requires the same genetic model to have been used.

Genomic prediction with secondary traits

The “mask” argument for blup_prep makes it easy to investigate the potential benefit of using a correlated, secondary trait to improve genomic selection. For example, many plant breeding programs are exploring the use of spectral measurements from high-throughput phenotyping platforms to improve selection for yield. The following example is based on data from Rutkoski et al. (2016), who showed that canopy temperature (CT) during grain fill was predictive of yield in wheat. The G matrix and Stage 1 BLUEs from the drought and extreme drought environments are distributed with the package. As with the potato dataset in Vignette 1, including the Stage 1 errors in Stage 2 lowers the AIC substantially.

data(wheat) #load the wheat data
head(wheat.blues)
##       env      id trait      BLUE
## 1 drought 6569128    GY  3.395834
## 2 drought 6569128    CT 36.950000
## 3 drought 6688880    GY  3.494048
## 4 drought 6688880    CT 37.902333
## 5 drought 6688916    GY  3.459078
## 6 drought 6688916    CT 38.706000
ans2a <- Stage2(data=wheat.blues, vcov=wheat.vcov, geno=wheat.geno,
                non.add="none")
ans2b <- Stage2(data=wheat.blues, geno=wheat.geno, non.add="none")

data.frame(vcov=c(TRUE,FALSE), AIC=c(ans2a$aic,ans2b$aic))
##    vcov       AIC
## 1  TRUE -16.97154
## 2 FALSE 539.30801

Because the wheat lines are inbred, the genetic residual option in StageWise could be used for modeling non-additive values, but the AIC was lower without it (non.add=“none”). Genomic heritability was 0.45-0.50 for yield and canopy temperature, with an additive genetic correlation of -0.81.

summary(ans2a$vars)
## $var
##                 GY   CT
## env          0.621 1.58
## additive     0.101 0.61
## g x env      0.048 0.44
## Stage1.error 0.055 0.30
## 
## $PVE
##                 GY    CT
## additive     0.495 0.454
## g x env      0.237 0.326
## Stage1.error 0.269 0.221
## 
## $cor.mat
##        GY     CT
## GY  1.000 -0.807
## CT -0.807  1.000

For genomic predictions, first we will do a tenfold cross validation without using CT data for the selection candidates, which can be called marker-based selection (MBS, see Vignette 1). Since the goal is yield prediction, the index coefficients are 1 and 0 for GY and CT, respectively.

id <- unique(wheat.blues$id)
N <- length(id)
folds <- split(sample(id),cut(1:N,10))
MBS <- NULL
for (i in 1:10) {
  prep <- blup_prep(wheat.blues, wheat.vcov, wheat.geno, ans2a$vars, 
                    mask=data.frame(id=folds[[i]]))
  pred <- blup(prep, geno=wheat.geno, what="BV", 
               index.coeff=c(GY=1, CT=0))
  MBS <- rbind(MBS, pred[pred$id %in% folds[[i]],])
}

In the above code, the “mask” argument for blup_prep only has the variable “id”, which means that all Stage 1 BLUEs for those individuals are masked. To only mask grain yield and use CT as a secondary trait for marker-assisted selection (MAS), a second variable named “trait” is used.

MAS <- NULL
for (i in 1:10) {
  prep <- blup_prep(wheat.blues, wheat.vcov, wheat.geno, ans2a$vars, 
                    mask=data.frame(id=folds[[i]], trait="GY"))
  pred <- blup(prep, geno=wheat.geno, what="BV", 
               index.coeff=c(GY=1, CT=0))
  MAS <- rbind(MAS, pred[pred$id %in% folds[[i]],])
}

ans <- merge(MBS,MAS,by="id")

library(ggplot2)
ggplot(ans,aes(x=r2.x, y=r2.y)) + geom_hex() + coord_fixed(ratio=1) + geom_line(data=data.frame(x=c(0.2,0.8),y=c(0.2,0.8)),mapping=aes(x=x,y=y),linetype=2) +  ggtitle("Reliability") +
  xlab("MBS") + ylab("MAS")

The above figure shows that using CT increased the reliability of genomic prediction, by 0.2 on average.