polyBreedR Vignette 2: Dihaploids

Jeff Endelman

Introduction

In diploid species, where the genome is organized as two sets of homologous chromosomes, a haploid contains only one set of chromosomes (typically the maternal set). In autotetraploids, the term “dihaploid”, which is a contraction of diploid and haploid, is used to describe individuals with half the number of somatic chromosomes (i.e., two sets of chromosomes).

Dihaploids of tetraploid potato (S. tuberosum Group Tuberosum) are typically created by pollination with certain diploid individuals from S. tuberosum Group Phureja, which are called haploid inducers. Historically, potato dihaploids were crossed to diploid wild relatives to introgress beneficial alleles. After recurrent selection, the improved diploids would be crossed back to elite tetraploids, exploiting their potential to produce unreduced gametes (“unilateral sexual polyploidization”) and generate tetraploid offspring for further breeding. In the past decade, diploid breeding efforts have shifted to focus on the development of inbred lines and F1 hybrids for variety development.

Not all seeds harvested from the maternal plant after pollination with the haploid inducer will be diploid; some will be aneuploid or tetraploid. In Vignette #1, the process of making tetraploid genotype calls based on a normal mixture model with 5 components was illustrated. This same model can be used for dihaploids to identify which individuals are eudiploid.

Ploidy analysis

Data are provided for 158 potato dihaploids from 21 tetraploid parents, 2 of which are also included. The dihaploids are named following the convention [Tetraploid Parent]DH[Number]. The samples were genotyped using version 3 of the potato SNP array, which has 21,027 bi-allelic SNPs. The input data file contains the signal ratio Y/(X+Y) from the array, which provides an analog measure of allele dosage.

data.filename <- system.file("vignette_data", "potato_dihaploids.csv.gz", package = "polyBreedR")
data <- as.matrix(read.csv(data.filename,check.names=F,row.names=1))
id <- colnames(data)
parents <- unique(sapply(strsplit(id,"-DH",fixed=T),function(z){z[1]}))
parents  #21 tetraploid parents
##  [1] "Accumulator" "Atlantic"    "W14137-10R"  "Hodag"       "W13069-5"   
##  [6] "W12078-76"   "Lelah"       "NY148"       "OneidaGold"  "W14NYQ9-2"  
## [11] "W14NYQ29-5"  "W13NYP12-2"  "W13058-19"   "W13058-4"    "W13NYP19-2" 
## [16] "W14137-7R"   "W6609-3"     "W6822-3"     "W8822-1"     "W9905-3"    
## [21] "W9968-5"
intersect(parents,id) #two parents included
## [1] "W14NYQ9-2"  "W14NYQ29-5"

Several methods are available for classifying the allele ratio into discrete values 0-4, including the normal mixture model implemented in R package fitPoly. Two thousand potato samples were used to fit the model for each marker, and 11,043 SNPs remained after filtering. The model file provided with the package contains 15 columns: five means, five standard deviations, and 5 mixture probabilities. To detect aneuploids or tetraploids, we will initially treat the dihaploid samples as tetraploid when using geno_call:

model.filename <- system.file("vignette_data", "potato_V3array_model.csv", package = "polyBreedR")
model <- read.csv(model.filename,check.names=F,row.names=1)
head(model)
##                       Mean0 Mean1 Mean2 Mean3 Mean4    SD0    SD1    SD2    SD3
## D_locus_(DFR)_C_LG02 0.2420 0.814 0.974  1.12  1.44 0.0446 0.0446 0.0446 0.0446
## D_locus_(DFR)_D_LG02 0.1110 0.678 0.851  1.02  1.48 0.0478 0.0478 0.0478 0.0478
## D_locus_(DFR)_G_LG02 0.0799 0.619 0.813  1.00  1.50 0.0473 0.0473 0.0473 0.0473
## D_locus_(DFR)_H_LG02 0.1740 0.687 0.853  1.02  1.43 0.0408 0.0408 0.0408 0.0408
## D_locus_(DFR)_I_LG02 0.3330 0.764 0.930  1.08  1.35 0.0408 0.0408 0.0408 0.0408
## Gro14_c_Paal_LG07    0.6150 0.804 0.978  1.16  1.45 0.0449 0.0449 0.0449 0.0449
##                         SD4    Prob0  Prob1 Prob2  Prob3  Prob4
## D_locus_(DFR)_C_LG02 0.0446 0.199000 0.3960 0.295 0.0979 0.0122
## D_locus_(DFR)_D_LG02 0.0478 0.200000 0.3960 0.295 0.0974 0.0121
## D_locus_(DFR)_G_LG02 0.0473 0.011900 0.0967 0.294 0.3970 0.2010
## D_locus_(DFR)_H_LG02 0.0408 0.011900 0.0964 0.293 0.3970 0.2010
## D_locus_(DFR)_I_LG02 0.0408 0.011900 0.0965 0.294 0.3970 0.2010
## Gro14_c_Paal_LG07    0.0449 0.000799 0.0158 0.117 0.3870 0.4790
nrow(model) 
## [1] 11043
library(polyBreedR)
geno4x <- geno_call(data=data,filename=model.filename,model.ploidy=4,sample.ploidy=4)
hist(geno4x)

With a perfect model and codominant marker, diploid samples should have no dosage 1 or 3 calls. (The two SNP alleles are present in equal numbers for dosage 2, which is equivalent to a diploid heterozygote.) In reality, the number of dosage 1 or 3 calls is small but not zero. Looking at a large dataset is the best way to determine how “small” this number is, which in turn provides a basis for classifying polyploid outliers. The function check_ploidy computes the proportion of markers for each chromosome with dosage 1 or 3 and displays the result as a stacked bar chart. The map file used in this example is based on version 6.1 of the DM potato reference genome and contains 10,931 markers from the V3 array.

map.filename <- system.file("vignette_data", "potato_V3array_map.csv", package = "polyBreedR")
map <- read.csv(map.filename,as.is=T)
head(map)
##                marker chrom     bp
## 1 solcap_snp_c2_36608 chr01 508800
## 2 solcap_snp_c2_36658 chr01 527068
## 3 solcap_snp_c1_10930 chr01 566972
## 4       PotVar0120126 chr01 603013
## 5       PotVar0120085 chr01 681193
## 6       PotVar0120080 chr01 681379
table(map$chrom)
## 
## chr01 chr02 chr03 chr04 chr05 chr06 chr07 chr08 chr09 chr10 chr11 chr12 
##  1480  1144   978  1015   944   808   864   724   760   648   871   695
ans <- check_ploidy(geno=geno4x,map=map)
ans$plot

The output plot is sorted from smallest to largest departure from the ideal diploid. The two tetraploid parents are present near the end, so we can conclude that a number of the putative dihaploids are in fact tetraploid. Near the transition from diploid to tetraploid, aneuploids are also visible based on having one chromosome with a large proportion of dosage 1 or 3 markers. The function check_ploidy also returns the result in matrix format, with dimensions individual x chromosome, which can be analyzed to see the aneuploids more clearly.

plot.data <- data.frame(max=apply(ans$mat,1,max),mean=apply(ans$mat,1,mean),
                        group="ambiguous",stringsAsFactors = F)
plot.data$group <- ifelse(plot.data$mean > 0.2,"tetraploid",plot.data$group)
plot.data$group <- ifelse(plot.data$mean < 0.12,"diploid",plot.data$group)
plot.data$group <- ifelse(plot.data$max > 0.2 & plot.data$mean < 0.12,"aneuploid",
                          plot.data$group)

library(ggplot2)
plot.data$group <- factor(plot.data$group)
ggplot(data=plot.data,aes(x=mean,y=max,colour=group)) + geom_point() + 
  geom_abline(slope=1,intercept=0) + coord_fixed(ratio=1) + 
  xlab("Chrom Average") + ylab("Chrom Maximum") + 
  scale_color_brewer(palette="Set2", name="")

The above figure shows the maximum value observed for any chromosome plotted against the chromosome average. The diploids are in the lower left corner and tetraploids in the upper right. The three individuals with an average value similar to the diploids but one chromosome much higher are predicted to be aneuploid.

To obtain diploid genotype calls (0,1,2) for the diploids, use geno_call but with sample.ploidy=2 to omit the dosage 1 and 3 components of the normal mixture model:

geno2x <- geno_call(data=data[,which(plot.data$group=="diploid")],
                    filename=model.filename, model.ploidy=4, sample.ploidy=2)
hist(geno2x)

Pedigree analysis

In Vignette #1, the function check_trio was demonstrated for F1 tetraploid offspring of two tetraploid parents. This same function can be used to check the parentage of dihaploids because, with respect to the zygote probability distribution, haploid induction is equivalent to using a father with dosage 0. To run this analysis, use the reserved word “haploid” in the father column of the pedigree. The genotype matrix must include the dihaploids called as diploids, and the parents called as tetraploids. In this example, we will analyze dihaploid progeny of the two tetraploids included in the dataset:

id <- colnames(geno2x)  #names of diploids
parents <- sapply(strsplit(id,"-DH"),function(z){z[1]}) #their parents
geno.parents <- intersect(parents,colnames(geno4x)) #genotyped parents
k <- which(parents %in% geno.parents)  #which dihaploids have genotyped parents

ped <- data.frame(id=id[k],mother=parents[k],father="haploid",stringsAsFactors = F)
geno <- cbind(geno2x[,k],geno4x[,geno.parents])
ans2 <- check_trio(parentage=ped,geno=geno,ploidy=4)
ans2 <- ans2[order(ans2$mother,ans2$error),]
head(ans2)
##                 id     mother  father error
## 32 W14NYQ29-5-DH24 W14NYQ29-5 haploid   0.2
## 34 W14NYQ29-5-DH30 W14NYQ29-5 haploid   0.2
## 38 W14NYQ29-5-DH25 W14NYQ29-5 haploid   0.2
## 33 W14NYQ29-5-DH26 W14NYQ29-5 haploid   0.3
## 36 W14NYQ29-5-DH11 W14NYQ29-5 haploid   0.4
## 31 W14NYQ29-5-DH23 W14NYQ29-5 haploid   0.5

Based on checking hundreds of trios, approximately 1% error is expected even with the correct parentage, when using the model parameter file included with the package. Values much higher than this indicate the parentage is incorrect. This can be demonstrated by re-running the analysis for the progeny of W14NYQ29-5 but with W14NYQ9-2 listed as the mother:

ped2 <- ped[ped$mother=="W14NYQ29-5",]
ped2$mother <- "W14NYQ9-2"
check_trio(parentage=ped2,geno=geno,ploidy=4)
##                 id    mother  father error
## 31 W14NYQ29-5-DH23 W14NYQ9-2 haploid  10.4
## 32 W14NYQ29-5-DH24 W14NYQ9-2 haploid  12.0
## 33 W14NYQ29-5-DH26 W14NYQ9-2 haploid  12.7
## 34 W14NYQ29-5-DH30 W14NYQ9-2 haploid  10.2
## 36 W14NYQ29-5-DH11 W14NYQ9-2 haploid  12.1
## 37 W14NYQ29-5-DH12 W14NYQ9-2 haploid  11.3
## 38 W14NYQ29-5-DH25 W14NYQ9-2 haploid  10.7