In diploid species, where the genome is organized as two sets of homologous chromosomes, a haploid contains only one set of chromosomes (typically the maternal set). In autotetraploids, the term “dihaploid”, which is a contraction of diploid and haploid, is used to describe individuals with half the number of somatic chromosomes (i.e., two sets of chromosomes).
Dihaploids of tetraploid potato (S. tuberosum Group Tuberosum) are typically created by pollination with certain diploid individuals from S. tuberosum Group Phureja, which are called haploid inducers. Historically, potato dihaploids were crossed to diploid wild relatives to introgress beneficial alleles. After recurrent selection, the improved diploids would be crossed back to elite tetraploids, exploiting their potential to produce unreduced gametes (“unilateral sexual polyploidization”) and generate tetraploid offspring for further breeding. In the past decade, diploid breeding efforts have shifted to focus on the development of inbred lines and F1 hybrids for variety development.
Not all seeds harvested from the maternal plant after pollination with the haploid inducer will be diploid; some will be aneuploid or tetraploid. In Vignette #1, the process of making tetraploid genotype calls based on a normal mixture model with 5 components was illustrated. This same model can be used for dihaploids to identify which individuals are eudiploid.
Data are provided for 158 potato dihaploids from 21 tetraploid parents, 2 of which are also included. The dihaploids are named following the convention [Tetraploid Parent]DH[Number]. The samples were genotyped using version 3 of the potato SNP array, which has 21,027 bi-allelic SNPs. The input data file contains the signal ratio Y/(X+Y) from the array, which provides an analog measure of allele dosage.
data.filename <- system.file("vignette_data", "potato_dihaploids.csv.gz", package = "polyBreedR")
data <- as.matrix(read.csv(data.filename,check.names=F,row.names=1))
id <- colnames(data)
parents <- unique(sapply(strsplit(id,"-DH",fixed=T),function(z){z[1]}))
parents #21 tetraploid parents
## [1] "Accumulator" "Atlantic" "W14137-10R" "Hodag" "W13069-5"
## [6] "W12078-76" "Lelah" "NY148" "OneidaGold" "W14NYQ9-2"
## [11] "W14NYQ29-5" "W13NYP12-2" "W13058-19" "W13058-4" "W13NYP19-2"
## [16] "W14137-7R" "W6609-3" "W6822-3" "W8822-1" "W9905-3"
## [21] "W9968-5"
## [1] "W14NYQ9-2" "W14NYQ29-5"
Several methods are available for classifying the allele ratio into
discrete values 0-4, including the normal mixture model implemented in R
package fitPoly.
Two thousand potato samples were used to fit the model for each marker,
and 11,043 SNPs remained after filtering. The model file provided with
the package contains 15 columns: five means, five standard deviations,
and 5 mixture probabilities. To detect aneuploids or tetraploids, we
will initially treat the dihaploid samples as tetraploid when using
geno_call
:
model.filename <- system.file("vignette_data", "potato_V3array_model.csv", package = "polyBreedR")
model <- read.csv(model.filename,check.names=F,row.names=1)
head(model)
## Mean0 Mean1 Mean2 Mean3 Mean4 SD0 SD1 SD2 SD3
## D_locus_(DFR)_C_LG02 0.2420 0.814 0.974 1.12 1.44 0.0446 0.0446 0.0446 0.0446
## D_locus_(DFR)_D_LG02 0.1110 0.678 0.851 1.02 1.48 0.0478 0.0478 0.0478 0.0478
## D_locus_(DFR)_G_LG02 0.0799 0.619 0.813 1.00 1.50 0.0473 0.0473 0.0473 0.0473
## D_locus_(DFR)_H_LG02 0.1740 0.687 0.853 1.02 1.43 0.0408 0.0408 0.0408 0.0408
## D_locus_(DFR)_I_LG02 0.3330 0.764 0.930 1.08 1.35 0.0408 0.0408 0.0408 0.0408
## Gro14_c_Paal_LG07 0.6150 0.804 0.978 1.16 1.45 0.0449 0.0449 0.0449 0.0449
## SD4 Prob0 Prob1 Prob2 Prob3 Prob4
## D_locus_(DFR)_C_LG02 0.0446 0.199000 0.3960 0.295 0.0979 0.0122
## D_locus_(DFR)_D_LG02 0.0478 0.200000 0.3960 0.295 0.0974 0.0121
## D_locus_(DFR)_G_LG02 0.0473 0.011900 0.0967 0.294 0.3970 0.2010
## D_locus_(DFR)_H_LG02 0.0408 0.011900 0.0964 0.293 0.3970 0.2010
## D_locus_(DFR)_I_LG02 0.0408 0.011900 0.0965 0.294 0.3970 0.2010
## Gro14_c_Paal_LG07 0.0449 0.000799 0.0158 0.117 0.3870 0.4790
## [1] 11043
library(polyBreedR)
geno4x <- geno_call(data=data,filename=model.filename,model.ploidy=4,sample.ploidy=4)
hist(geno4x)
With a perfect model and codominant marker, diploid samples should
have no dosage 1 or 3 calls. (The two SNP alleles are present in equal
numbers for dosage 2, which is equivalent to a diploid heterozygote.) In
reality, the number of dosage 1 or 3 calls is small but not zero.
Looking at a large dataset is the best way to determine how “small” this
number is, which in turn provides a basis for classifying polyploid
outliers. The function check_ploidy
computes the proportion
of markers for each chromosome with dosage 1 or 3 and displays the
result as a stacked bar chart. The map file used in this example is
based on version
6.1 of the DM potato reference genome and contains 10,931 markers
from the V3 array.
map.filename <- system.file("vignette_data", "potato_V3array_map.csv", package = "polyBreedR")
map <- read.csv(map.filename,as.is=T)
head(map)
## marker chrom bp
## 1 solcap_snp_c2_36608 chr01 508800
## 2 solcap_snp_c2_36658 chr01 527068
## 3 solcap_snp_c1_10930 chr01 566972
## 4 PotVar0120126 chr01 603013
## 5 PotVar0120085 chr01 681193
## 6 PotVar0120080 chr01 681379
##
## chr01 chr02 chr03 chr04 chr05 chr06 chr07 chr08 chr09 chr10 chr11 chr12
## 1480 1144 978 1015 944 808 864 724 760 648 871 695
The output plot is sorted from smallest to largest departure from the
ideal diploid. The two tetraploid parents are present near the end, so
we can conclude that a number of the putative dihaploids are in fact
tetraploid. Near the transition from diploid to tetraploid, aneuploids
are also visible based on having one chromosome with a large proportion
of dosage 1 or 3 markers. The function check_ploidy
also
returns the result in matrix format, with dimensions individual x
chromosome, which can be analyzed to see the aneuploids more
clearly.
plot.data <- data.frame(max=apply(ans$mat,1,max),mean=apply(ans$mat,1,mean),
group="ambiguous",stringsAsFactors = F)
plot.data$group <- ifelse(plot.data$mean > 0.2,"tetraploid",plot.data$group)
plot.data$group <- ifelse(plot.data$mean < 0.12,"diploid",plot.data$group)
plot.data$group <- ifelse(plot.data$max > 0.2 & plot.data$mean < 0.12,"aneuploid",
plot.data$group)
library(ggplot2)
plot.data$group <- factor(plot.data$group)
ggplot(data=plot.data,aes(x=mean,y=max,colour=group)) + geom_point() +
geom_abline(slope=1,intercept=0) + coord_fixed(ratio=1) +
xlab("Chrom Average") + ylab("Chrom Maximum") +
scale_color_brewer(palette="Set2", name="")
The above figure shows the maximum value observed for any chromosome plotted against the chromosome average. The diploids are in the lower left corner and tetraploids in the upper right. The three individuals with an average value similar to the diploids but one chromosome much higher are predicted to be aneuploid.
To obtain diploid genotype calls (0,1,2) for the diploids, use
geno_call
but with sample.ploidy=2
to omit the
dosage 1 and 3 components of the normal mixture model:
geno2x <- geno_call(data=data[,which(plot.data$group=="diploid")],
filename=model.filename, model.ploidy=4, sample.ploidy=2)
hist(geno2x)
In Vignette #1, the function check_trio
was demonstrated
for F1 tetraploid offspring of two tetraploid parents. This same
function can be used to check the parentage of dihaploids because, with
respect to the zygote probability distribution, haploid induction is
equivalent to using a father with dosage 0. To run this analysis, use
the reserved word “haploid” in the father column of the pedigree. The
genotype matrix must include the dihaploids called as diploids, and the
parents called as tetraploids. In this example, we will analyze
dihaploid progeny of the two tetraploids included in the dataset:
id <- colnames(geno2x) #names of diploids
parents <- sapply(strsplit(id,"-DH"),function(z){z[1]}) #their parents
geno.parents <- intersect(parents,colnames(geno4x)) #genotyped parents
k <- which(parents %in% geno.parents) #which dihaploids have genotyped parents
ped <- data.frame(id=id[k],mother=parents[k],father="haploid",stringsAsFactors = F)
geno <- cbind(geno2x[,k],geno4x[,geno.parents])
ans2 <- check_trio(parentage=ped,geno=geno,ploidy=4)
ans2 <- ans2[order(ans2$mother,ans2$error),]
head(ans2)
## id mother father error
## 32 W14NYQ29-5-DH24 W14NYQ29-5 haploid 0.2
## 34 W14NYQ29-5-DH30 W14NYQ29-5 haploid 0.2
## 38 W14NYQ29-5-DH25 W14NYQ29-5 haploid 0.2
## 33 W14NYQ29-5-DH26 W14NYQ29-5 haploid 0.3
## 36 W14NYQ29-5-DH11 W14NYQ29-5 haploid 0.4
## 31 W14NYQ29-5-DH23 W14NYQ29-5 haploid 0.5
Based on checking hundreds of trios, approximately 1% error is expected even with the correct parentage, when using the model parameter file included with the package. Values much higher than this indicate the parentage is incorrect. This can be demonstrated by re-running the analysis for the progeny of W14NYQ29-5 but with W14NYQ9-2 listed as the mother:
ped2 <- ped[ped$mother=="W14NYQ29-5",]
ped2$mother <- "W14NYQ9-2"
check_trio(parentage=ped2,geno=geno,ploidy=4)
## id mother father error
## 31 W14NYQ29-5-DH23 W14NYQ9-2 haploid 10.4
## 32 W14NYQ29-5-DH24 W14NYQ9-2 haploid 12.0
## 33 W14NYQ29-5-DH26 W14NYQ9-2 haploid 12.7
## 34 W14NYQ29-5-DH30 W14NYQ9-2 haploid 10.2
## 36 W14NYQ29-5-DH11 W14NYQ9-2 haploid 12.1
## 37 W14NYQ29-5-DH12 W14NYQ9-2 haploid 11.3
## 38 W14NYQ29-5-DH25 W14NYQ9-2 haploid 10.7