polyBreedR Vignette 3: VCF and Marker Imputation

Jeff Endelman

2024-01-02

This vignette documents functions developed for the research of Endelman et al. (2024) “Targeted genotyping-by-sequencing of potato and data analysis with R/polyBreedR”. Please cite this work if you use the software.

Variant Call Format (VCF)

This vignette illustrates functions for manipulating genome-wide marker data in Variant Call Format (Danecek et al. 2021). This vignette assumes some familiarity with the structure of VCF files.

An attractive feature of VCF is that it contains both genotype calls and the underlying information supporting that call. For genotyping-by-sequencing, the supporting data is the Allele Depth (AD). For microarrays, the supporting data is the normalized signal intensity for the alleles, which is represented as a percentage in the field Allele Intensity (AI) by the polyBreedR function array2vcf. This function was designed to convert an Illumina Genome Studio Final Report to VCF. Here is a potato example, which can be contrasted with the approach used in Vignette 1 (written several years before Vignette 3):

library(polyBreedR)

array.file <- system.file("vignette_data", "potato_V3array_XYdata.txt", package = "polyBreedR")
map.file <- system.file("vignette_data", "potato_V4array.vcf", package = "polyBreedR")
model.file <- system.file("vignette_data", "potato_V4array_model.csv", package = "polyBreedR")

array2vcf(array.file=array.file, map.file=map.file, model.file=model.file, ploidy=4,
          vcf.file="potato_example.vcf.gz")

As shown in the above code, array2vcf requires a VCF map file, and optionally it takes a model file with parameters for a normal mixture model to make genotype calls.

The DArTag targeted GBS platform is being utilized for mid-density genotyping in many crops. The function dart2vcf generates a VCFv4.3 compliant file from the two standard DArTag CSV files (“Allele_Dose_Report” and “Allele_match_counts_collapsed”). A small DArTag dataset of 85 potato clones is included with polyBreedR to illustrate:

counts.file <- system.file("vignette_data", "DArTag_Allele_match_counts_collapsed.csv", 
                           package = "polyBreedR")
dosage.file <- system.file("vignette_data", "DArTag_Allele_Dose_Report.csv", 
                           package = "polyBreedR")

dart2vcf(counts.file=counts.file, dosage.file=dosage.file, ploidy=4,
         vcf.file="DArTag.vcf.gz")

gbs(in.file="DArTag.vcf.gz", out.file="DArTag_gbs.vcf.gz", ploidy=4, 
    n.core=2, silent=TRUE)

The above code illustrates using the gbs function to replace DArT genotype calls with calls based on the R/updog software (Gerard et al. 2018), using the “norm” prior. R/updog can account for several departures from the traditional binomial model, including allelic bias and overdispersion, and these parameters are stored in the VCF. There may be situations when you want to apply the updog model generated with a previous dataset to new samples. After you import the model parameters from the INFO field of the old VCF file to the new file (e.g., using bcftools annotate), run the gbs function using option “model.fit=FALSE”. This is illustrated in the supplemental file of the publication.

The read.vcfR command in package VCFR is convenient for reading VCF data into R. The command extrac.gt can then be used to construct matrices (markers x id) for different genotype fields.

library(vcfR)
data1 <- read.vcfR("DArTag.vcf.gz")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 19
##   header_line: 20
##   variant count: 2503
##   column count: 94
## Meta line 19 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 2503
##   Character matrix gt cols: 94
##   skip: 0
##   nrows: 2503
##   row_num: 0
## Processed variant 1000Processed variant 2000Processed variant: 2503
## All variants processed
tmp <- getFIX(data1)
map <- data.frame(marker=tmp[,3],
                  chrom=tmp[,1],
                  pos=as.integer(tmp[,2]))
head(map)
##                marker chrom     pos
## 1 solcap_snp_c2_36615 chr01  510745
## 2       PotVar0120130 chr01  602954
## 3 solcap_snp_c2_36639 chr01  717629
## 4 solcap_snp_c1_10926 chr01  828025
## 5  solcap_snp_c1_2426 chr01 1177430
## 6       PotVar0071738 chr01 1254782
GT1 <- extract.gt(data1)
table(GT1[1,])
## 
## 0/0/0/0 0/0/0/1 0/0/1/1 0/1/1/1 
##      16      32      31       6
geno1 <- GT2DS(GT1,n.core=2)
table(geno1[1,])
## 
##  0  1  2  3 
## 16 32 31  6

As shown above, the GT2DS function in polyBreedR can be used to convert VCF GT character strings to integer allele dosages (DS). The code below compares the genotype calls between DArT (data1) and updog (data2) for the first marker, which were identical except for one sample.

data2 <- read.vcfR("DArTag_gbs.vcf.gz",verbose=F)
GT2 <- extract.gt(data2)
geno2 <- GT2DS(GT2,n.core=2)
table(geno1[1,], geno2[1,])
##    
##      0  1  2  3
##   0 16  0  0  0
##   1  0 32  0  0
##   2  0  1 30  0
##   3  0  0  0  6

The plot_geno function from the updog package produces nice plots for visualizing the relationship between allele counts and genotype calls.

AD1 <- extract.gt(data1, element="AD")
ALT1 <- ADsplit(AD1,ALT=TRUE,n.core=2)
REF1 <- ADsplit(AD1,ALT=F,n.core=2)

library(updog)
library(ggplot2)

k = 1 #first marker
plot_geno(refvec = ALT1[k,], sizevec=ALT1[k,] + REF1[k,],
          geno=geno1[k,], ploidy=4) + ylab("ALT") + xlab("REF") +
  ggtitle("DArT")

The allelic bias (AB) estimates from updog are stored as INFO in the VCF file and can be retrieved using extract.info. When AB = 1, or equivalently logAB=0, there is no bias. Polyploid genotype calls become less reliable as allelic bias increases.

AB <- extract.info(data2,element="AB",as.numeric=T)
logAB <- log(AB)/log(2) #base 2 log
hist(logAB)

Imputation

Missing data arises in GBS datasets when the number of fragments in the reduced genome representation exceeds the total number of reads per sample. This commonly occurs with RAD-seq, which uses restriction enzymes, but is less of an issue with targeted GBS, which use specific primer pairs or oligonucleotide baits.

The polyBreedR function impute was designed for datasets where some percentage of the samples for any given marker may be missing. It works with VCF input and output but is not limited to GBS data; array data could also be imputed. Provided at least several hundred samples are present, the Random Forest (RF) method in impute is likely to be the most accurate. For more information, consult the help page and this recorded presentation.

For this vignette, the focus is imputing from low- to high-density marker platforms. This situation arises when a lower density platform, such as DArTag, is used for genomic selection candidates, while the training population has been genotype with a higher density platform, like an array or sequence capture. Another application occurs when genotyping platforms are upgraded over time, to include more markers. For example, the original (V1) potato SNP array had 8K markers, and the current one (V4) has 31K.

Two functions were created in polyBreedR for low-to-high imputation. impute_L2H uses the Random Forest method based on a training population of individuals genotyped under both platforms. In general, a larger training population leads to better the imputation accuracy. The default behavior is to use the 100 closest markers and 100 classification trees, but these are adjustable parameters. In the publication, using 25 markers provided the lowest imputation error, but this will vary by dataset.

The other function, impute_LA, was designed to impute markers in F1 populations based on linkage analysis (LA). The current implementation requires the software PolyOrigin, for which you will need to install Julia and PolyOrigin separately and prepare for command line execution. For impute_LA, the high-density marker file contains the phased parental genotypes in 0|1 format.

To illustrate, the low and high density marker files are provided for a half-diallel population of 85 clones, derived from 5 parents. The high density file contains 10,695 phased SNPs based on array genotyping of their offspring. The low density file contains 1865 DArTag markers with good quantitative agreement with the SNP array (Endelman et al. 2024).

ped.file <- system.file("vignette_data", "diallel_pedigree.csv", 
                        package = "polyBreedR")
high.file <- system.file("vignette_data", "diallel_phased_parents.csv", 
                        package = "polyBreedR")
low.file <- system.file("vignette_data", "diallel_DArTag.vcf.gz", 
                        package = "polyBreedR")

#peek at phased parental genotype file
head(read.csv(high.file, check.names=F))
##                marker chromosome position     bp W6609-3 W12078-76 W13NYP102-7
## 1 solcap_snp_c2_36608      chr01        0 508800 0|0|1|1   1|1|0|1     0|1|0|1
## 2 solcap_snp_c2_36658      chr01        0 527068 0|1|0|0   0|0|1|0     0|0|0|0
## 3 solcap_snp_c1_10930      chr01        0 566972 0|1|0|0   0|0|1|0     0|0|0|0
## 4       PotVar0120126      chr01        0 603013 0|1|1|1   1|1|1|1     1|1|1|1
## 5       PotVar0120085      chr01        0 681193 0|0|0|0   0|0|0|0     0|0|0|0
## 6       PotVar0120080      chr01        0 681379 1|1|1|1   1|1|1|1     1|1|1|1
##   W14NYQ4-1 W14NYQ9-2
## 1   0|1|0|0   0|0|0|0
## 2   0|0|0|0   1|0|0|1
## 3   0|0|0|0   1|0|0|1
## 4   0|1|1|0   1|0|0|1
## 5   0|0|0|0   0|0|0|0
## 6   1|1|1|1   1|1|1|1
#peek at pedigree file
read.csv(ped.file, check.names=F)[1:8,]
##            id pop    mother      father ploidy
## 1     W6609-3   0         0           0      4
## 2   W12078-76   0         0           0      4
## 3 W13NYP102-7   0         0           0      4
## 4   W14NYQ4-1   0         0           0      4
## 5   W14NYQ9-2   0         0           0      4
## 6   W19012-12   1 W12078-76 W13NYP102-7      4
## 7   W19012-13   1 W12078-76 W13NYP102-7      4
## 8   W19012-14   1 W12078-76 W13NYP102-7      4
impute_LA(ped.file = ped.file, high.file = high.file,
          low.file = low.file, low.format = "GT", out.file="imputed.csv.gz")

The following code computes the root-mean-squared-error of the imputation.

array.file <- system.file("vignette_data", "diallel_array.vcf.gz", 
                        package = "polyBreedR")
array <- read.vcfR(array.file,verbose=F)
geno.array <- GT2DS(extract.gt(array))

imputed <- read.csv("imputed.csv.gz", check.names=F)
geno.imputed <- as.matrix(imputed[,-(1:3)]) #remove map
rownames(geno.imputed) <- imputed$marker
marks <- intersect(rownames(geno.imputed),rownames(geno.array))
id <- intersect(colnames(geno.imputed),colnames(geno.array))

#RMSE
sqrt(mean((geno.imputed[marks,id] - geno.array[marks,id])^2))
## [1] 0.1340864