COMA Vignette

Jeff Endelman

The COMA package implements a new algorithm for optimum mate allocation (OMA), as well as the more traditional method, optimum contribution selection (OCS).

COMA requires two input files. The first file contains the marker effects and allele dosage data for the parental candidates. Dominance marker effects are optional, but when included should correspond to a breeding value parameterization, including the effect of heterosis/inbreeding depression (Endelman 2023). The blup command in R/StageWise is one option to compute marker effects.

The vignette dataset comes from the University of Wisconsin potato breeding program. There are 170 tetraploid clones genotyped at 12K markers, and the marker effects are derived from a multi-trait index.

geno.file <- system.file("vignette_data", "geno.csv", package = "COMA")
geno <- read.csv(geno.file,check.names=F)
geno[1:4,1:5]
##                marker    add   dom W17037-1 W17037-10
## 1 solcap_snp_c2_51460 -18.71  8.09     0.60      1.34
## 2 solcap_snp_c2_36608   8.48 19.14     2.00      1.00
## 3 solcap_snp_c2_36615   7.51  8.00     1.66      1.40
## 4 solcap_snp_c2_36658  46.53  4.71     4.00      2.00

The second input file is the kinship matrix to control inbreeding. Based on simulation results (Endelman 2024), the pedigree IBD kinship matrix is recommended for either OCS or OMA. The kinship matrix for the potato clones was generated using R package AGHmatrix.

K.file <- system.file("vignette_data", "kinship.csv", package = "COMA")

#preview files
K <- as.matrix(read.csv(K.file,check.names=F,row.names=1))
K[1:3,1:3]
##           W17037-1 W17037-10 W17037-11
## W17037-1    0.2631    0.1397    0.1397
## W17037-10   0.1397    0.2631    0.1397
## W17037-11   0.1397    0.1397    0.2631
summary(K[upper.tri(K,diag=F)])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00600 0.04700 0.08600 0.08062 0.09690 0.18820
#inbreeding coefficients from A matrix
F.A <- (4*diag(K)-1)/(4-1)
summary(F.A)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00360 0.01747 0.01820 0.02354 0.02050 0.09293

Now we are ready for read_data:

library(COMA)
data <- read_data(geno.file=geno.file,
                  kinship.file=K.file,
                  ploidy=4, matings="all",
                  standardize=TRUE, n.core=2)

head(data$parents)
##          id      merit
## 1  W17037-1 -0.6518888
## 2 W17037-10  0.6155129
## 3 W17037-11  0.4235588
## 4 W17037-12  0.8016658
## 5 W17037-13  0.3160459
## 6 W17037-15  0.6178174
head(data$matings)
##     parent1  parent2       merit
## 1  W17037-1 W17037-1 -2.06643337
## 2 W17037-10 W17037-1 -0.47212478
## 3 W17037-11 W17037-1 -0.42772748
## 4 W17037-12 W17037-1 -0.08777775
## 5 W17037-13 W17037-1 -0.24849597
## 6 W17037-15 W17037-1 -0.32733990

The matings argument for read_data indicates how to create the table of possible matings. The above example used “all”, which creates all unique parental combinations. Consult the function documentation for other options. When standardize=TRUE, genetic values are scaled by the additive standard deviation (\(\sigma_A\)). This is for convenience when interpreting the response values.

As shown above, the parents output contains the predicted merit (GEBV) for each individual. The matings output contains the predicted merit (F1 progeny mean) for each possible mating. For dioecious species (with separate sexes), a data frame with sex information is needed as argument sex in read_data, and the column headers for the matings are “female”,“male” instead of “parent1”,“parent2”.

OMA and OCS

The first argument of the oma function is dF=c(min,max), which represents lower and upper bounds on the inbreeding rate for the next generation, dF1. The upper bound is also applied to the average inbreeding rate over two generations, dF2, assuming random mating of the progeny. The second argument, parents=data.frame(id,min,max), contains lower and upper bounds on the contribution of each parent (do not include a column for parental merit). The third argument, matings, is a data frame of the merit for each possible mating, as well as lower and upper bounds on the allocation.

parents <- data.frame(id=data$parents$id, min=0, max=1)
matings <- data.frame(data$matings, min=0, max=1)
ans0 <- oma(dF=c(-1,0.005), parents=parents, matings=matings, ploidy=4, K=data$K)
ans0$response
##   dF1 dF2 merit n.parent n.mate
## 1  NA  NA    NA        0      0

The above result shows there is no feasible solution at 0.5% inbreeding for both generations, due to the relatedness of the clones. The argument dF.adapt can be used to automatically increasing the upper bound until a solution is found:

ans1 <- oma(dF=c(-1,0.005), parents=parents, matings=matings, ploidy=4, K=data$K,
            dF.adapt=list(step=0.001,max=0.05))
ans1$response
##       dF1   dF2    merit n.parent n.mate
## 1 -0.0082 0.008 1.215909       33     31

With no lower bound, a solution was found at dF2 = 0.8%, while the negative value for dF1 indicates the progeny would be 0.8% more outbred than the parents. When a similar approach was used in a simulation study (Endelman 2024), the population became trapped in a local optimum with limited long-term gain. Imposing a lower bound of 0.5% for dF1 led to more long-term gain in the simulation. As the following code shows, for the potato dataset, this required increasing the dF2 rate to 1.0%.

ans2 <- oma(dF=c(0.005,0.008), parents=parents, matings=matings, ploidy=4, K=data$K,
            dF.adapt=list(step=0.001,max=0.05))
ans2$response
##     dF1  dF2    merit n.parent n.mate
## 1 0.005 0.01 1.228329       43     43

Interestingly, the second solution uses more parents even though the inbreeding rate is higher. The optimal parental contributions are returned as oc, and the optimal mate allocations are returned as om.

head(ans2$oc)
##           id        value
## 40  W17038-4 0.0015535796
## 43 W17039-11 0.0008559787
## 71 W17039-69 0.0944907096
## 77  W17041-3 0.0167267015
## 78  W17042-3 0.0116170009
## 79  W17042-4 0.0227763478
head(ans2$om)
##           parent1   parent2        value
## 5987    W17062-26  W17038-4 0.0031071592
## 6374    W17062-26 W17039-11 0.0017119573
## 9497     W17042-8 W17039-69 0.0292801800
## 9581  W17AF6677-8 W17039-69 0.0197916027
## 9583 W17AF6684-10 W17039-69 0.0004466013
## 9584  W17AF6685-2 W17039-69 0.1394630353

The parental contributions of the two solutions can be visually compared with the COMA function plot_ribbon:

library(ggplot2)
plot_ribbon(oc=list(`-0.008`=ans1$oc,`0.005`=ans2$oc)) + 
   theme(legend.key.size=unit(0.1,'cm'))

COMA also has a function for OCS, which optimizes parental contributions but not specific matings. The arguments are similar, except dF is only a single number for the upper bound on group kinship, and the parents data frame includes a column for merit:

par2 <- data.frame(data$parents, min=0, max=1)
ans3 <- ocs(dF=0.005, parents=par2, ploidy=4, K=data$K, 
            dF.adapt=list(step=0.001, max=0.05))
ans3$response
##      dF    merit n.parent
## 1 0.015 1.222502       29

The smallest feasible inbreeding rate under OCS was 1.5%. The predicted merit is based on parental breeding values and therefore neglects specific combining ability (SCA). When SCA was accounted for, the OCS solution (using random mating) decreased from 1.2 to 1.0 \(\sigma_A\).

ocs.matings <- merge(expand.grid(parent1=ans3$oc$id,parent2=ans3$oc$id,stringsAsFactors = F),
                     data$matings)
i1 <- match(ocs.matings$parent1,ans3$oc$id)
i2 <- match(ocs.matings$parent2,ans3$oc$id)
ocs.matings$value <- ifelse(i1==i2,ans3$oc$value[i1]*ans3$oc$value[i1],
                            2*ans3$oc$value[i1]*ans3$oc$value[i2]) 
sum(ocs.matings$merit*ocs.matings$value)
## [1] 1.038668

Inbred Lines

Although COMA was developed for outbred populations, it can also be used to manage diversity under long-term recurrent selection in inbred crops. In this case, a sensible choice for the base, or reference, population to compute inbreeding rate is a hypothetical progeny population generated by random mating (RM) of the parents. This is controlled by the argument base in the ocs and oma functions. The default value is “current”, which is appropriate for outbred parents. For inbred parents, use “RM”.

Other Constraints

Because COMA is based on convex optimization, additional linear constraints can be incorporated. The potato dataset contains presence/absence information for a genetic marker for potato virus Y resistance (PVYR), which is a critical trait. Two different selection approaches will be illustrated: (1) only allowing matings with a resistant parent; (2) imposing a lower bound on the R gene frequency.

The following code implements method 1 with the 0.5% lower bound:

library(dplyr)
par.file <- system.file("vignette_data", "parents.csv", 
                          package = "COMA")
data2 <- read.csv(par.file)
PVYR.id <- data2$id[data2$PVYR=="Y"]

#calculate proportion of matings with PVYR parent from previous solution
n.mate <- ans2$response$n.mate
nR <- sum(ans2$om$parent1 %in% PVYR.id | ans2$om$parent2 %in% PVYR.id)

#add constraint
ans2a <- oma(dF=c(0.005,0.01), ploidy=4L, K=data$K, parents=parents, 
            matings=filter(matings, parent1%in%PVYR.id | parent2%in%PVYR.id))

#calculate frequency of PVYR gene, assuming single copy in parents
R.parents <- which(ans2$oc$id %in% PVYR.id)
freq2 <- sum(ans2$oc$value[R.parents])*0.25

R.parents <- which(ans2a$oc$id %in% PVYR.id)
freq2a <- sum(ans2a$oc$value[R.parents])*0.25

result <- data.frame(constraint=c("none","method 1"),
                     rbind(ans2$response,ans2a$response),
                     prop.resistant.matings=c(nR/n.mate,1),
                     Ry.freq=c(freq2,freq2a))
kable(result,digits=3)
constraint dF1 dF2 merit n.parent n.mate prop.resistant.matings Ry.freq
none 0.005 0.01 1.228 43 43 0.442 0.086
method 1 0.005 0.01 1.018 42 40 1.000 0.125

The above table compares the solution with and without the PVYR constraint. As expected, increasing the proportion of PVYR matings to 100% reduced the predicted merit. The frequency of the R gene in the progeny is also shown, assuming a single copy in the resistant parents.

Under method 2, the R gene frequency is constrained directly, which is less restrictive than method 1 because some matings could have two PVYR parents while others have none. Both ocs and oma allow linear inequality or equality constraints on any linear combination of the parental contributions. When the coefficients, \(\mathbf{s}\), of the contribution variables, \(\mathbf{y}\), are the R gene frequencies of each parent, this creates the desired constraint. The coefficients are included as an additional column in the parents data.frame, and the column header is parsed to complete the equation. In this case, the header “gt0.125” represents greater than or equal to 0.125: \[ \mathbf{s}^\prime \mathbf{y} \ge 0.125\]

par3 <- data.frame(parents,
                   gt0.125=ifelse(parents$id %in% PVYR.id, 0.25, 0))
ans2b <- oma(dF=c(0.005,0.01), ploidy=4L, K=data$K, parents=par3, matings=matings)

nR <- sum(ans2b$om$parent1 %in% PVYR.id | ans2b$om$parent2 %in% PVYR.id)
R.parents <- which(ans2b$oc$id %in% PVYR.id)
freq2b <- sum(ans2b$oc$value[R.parents])*0.25

result <- rbind(result,
                data.frame(constraint="method 2",ans2b$response,
                           prop.resistant.matings=nR/n.mate, Ry.freq=freq2b))

kable(result,digits=3)
constraint dF1 dF2 merit n.parent n.mate prop.resistant.matings Ry.freq
none 0.005 0.01 1.228 43 43 0.442 0.086
method 1 0.005 0.01 1.018 42 40 1.000 0.125
method 2 0.005 0.01 1.070 43 42 0.628 0.125

As expected, the predicted merit under method 2 was higher than method 1.

Other measures of genetic merit

The function read_data computes genetic merit based on genomic prediction of the progeny mean. But the oma and ocs functions are not limited to this definition; the user can supply their own merit values for each candidate parent or mating. For OCS, individual phenotypes can be used. For OMA, it is possible to incorporate progeny variance into the merit definition, e.g., through the “usefulness” criterion.