This vignette documents functions developed for the research of Endelman et al. (2024) “Targeted genotyping-by-sequencing of potato and data analysis with R/polyBreedR”. Please cite this work if you use the software.
This vignette illustrates functions for manipulating genome-wide marker data in Variant Call Format (Danecek et al. 2021) and assumes some familiarity with the structure of VCF files.
An attractive feature of VCF is that it contains both genotype calls and the underlying information supporting those calls. For genotyping-by-sequencing, the supporting data is the Allele Depth (AD). For microarrays, the supporting data is the normalized signal intensity for the alleles, which is represented as a percentage in the field Allele Intensity (AI) by the polyBreedR function array2vcf.
This function was designed to convert an Illumina Genome Studio Final
Report to VCF. Here is a potato example, which can be contrasted with
the approach used in Vignette 1 (written several years before Vignette
3):
library(polyBreedR)
array.file <- system.file("vignette_data", "potato_V3array_XYdata.txt", package = "polyBreedR")
map.file <- system.file("vignette_data", "potato_V4array.vcf", package = "polyBreedR")
model.file <- system.file("vignette_data", "potato_V4array_model.csv", package = "polyBreedR")
array2vcf(array.file=array.file, map.file=map.file, model.file=model.file, ploidy=4,
vcf.file="potato_example.vcf.gz")
As shown in the above code, array2vcf requires a VCF map file, and it optionally takes a model file with parameters for a normal mixture model to make genotype calls.
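To make the idea of the model file concrete, here is a minimal sketch of mixture-model genotype calling, not the array2vcf internals: each sample is assigned the dosage whose normal component has the highest posterior density at the observed intensity value. The function name `call_mixture` and the parameter values are hypothetical.

```r
# Illustrative sketch (not the array2vcf internals): call tetraploid
# genotypes from a normalized allele-intensity value by maximum
# posterior probability under a 1D normal mixture, one component per
# dosage class (0-4).
call_mixture <- function(x, means, sds, props) {
  # posterior is proportional to prior (props) times normal density
  post <- sapply(seq_along(means), function(k)
    props[k] * dnorm(x, mean = means[k], sd = sds[k]))
  if (is.null(dim(post))) post <- matrix(post, nrow = 1)
  max.col(post) - 1L  # component index -> dosage 0..4
}

# hypothetical mixture parameters for a well-behaved tetraploid marker
means <- c(0.03, 0.27, 0.50, 0.73, 0.97)
sds   <- rep(0.04, 5)
props <- rep(0.2, 5)

call_mixture(c(0.05, 0.49, 0.95), means, sds, props)
```

With equal standard deviations and mixture proportions, this reduces to nearest-mean assignment; unequal parameters (as estimated from real data) shift the decision boundaries between dosage classes.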
The DArTag targeted GBS platform is being used for mid-density genotyping in many crops. The function dart2vcf generates a VCFv4.3-compliant file from the two standard DArTag CSV files (“Allele_Dose_Report” and “Allele_match_counts_collapsed”). A small DArTag dataset of 85 potato clones is included with polyBreedR to illustrate:
counts.file <- system.file("vignette_data", "DArTag_Allele_match_counts_collapsed.csv",
package = "polyBreedR")
dosage.file <- system.file("vignette_data", "DArTag_Allele_Dose_Report.csv",
package = "polyBreedR")
dart2vcf(counts.file=counts.file, dosage.file=dosage.file, ploidy=4,
vcf.file="DArTag.vcf.gz")
gbs(in.file="DArTag.vcf.gz", out.file="DArTag_gbs.vcf.gz", ploidy=4,
n.core=2, silent=TRUE)
The above code illustrates using the gbs function to replace DArT genotype calls with calls based on the R/updog software (Gerard et al. 2018), using the “norm” prior. R/updog can account for several departures from the traditional binomial model, including allelic bias and overdispersion, and these parameters are stored in the VCF. There may be situations when you want to apply the updog model generated with a previous dataset to new samples. After you import the model parameters from the INFO field of the old VCF file to the new file (e.g., using bcftools annotate), run the gbs function with the option “model.fit=FALSE”. This is illustrated in the supplemental file of the publication.
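The basic model that updog generalizes can be sketched in a few lines: under a simple binomial model, dosage g implies an expected ALT read fraction of g/ploidy, and the call is the dosage maximizing the binomial likelihood of the observed allele depths. This sketch (function name `call_binomial` is hypothetical) ignores the bias and overdispersion terms that updog adds.

```r
# Simplified sketch of genotype calling from allele depths under the
# basic binomial model that updog extends: dosage g in 0..ploidy gives
# an expected ALT read fraction g/ploidy, adjusted here by a small
# sequencing error rate. Allelic bias and overdispersion are ignored.
call_binomial <- function(alt, ref, ploidy = 4, err = 0.001) {
  g <- 0:ploidy
  p <- g / ploidy
  p <- p * (1 - err) + (1 - p) * err  # fold in sequencing error
  loglik <- sapply(p, function(pk) dbinom(alt, alt + ref, pk, log = TRUE))
  if (is.null(dim(loglik))) loglik <- matrix(loglik, nrow = 1)
  g[max.col(loglik)]  # maximum-likelihood dosage per sample
}

# e.g., 12 ALT reads out of 50 suggests dosage 1 (expected fraction 0.25)
call_binomial(alt = c(0, 12, 26, 50), ref = c(50, 38, 24, 0))
```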
The read.vcfR command in package vcfR is convenient for reading VCF data into R. The command extract.gt can then be used to construct matrices (markers x id) for different genotype fields.
data1 <- read.vcfR("DArTag.vcf.gz")
## Scanning file to determine attributes.
## File attributes:
## meta lines: 19
## header_line: 20
## variant count: 2503
## column count: 94
## Meta line 19 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 2503
## Character matrix gt cols: 94
## skip: 0
## nrows: 2503
## row_num: 0
## Processed variant 1000
## Processed variant 2000
## Processed variant: 2503
## All variants processed
tmp <- getFIX(data1)
map <- data.frame(marker=tmp[,3],
chrom=tmp[,1],
pos=as.integer(tmp[,2]))
head(map)
## marker chrom pos
## 1 solcap_snp_c2_36615 chr01 510745
## 2 PotVar0120130 chr01 602954
## 3 solcap_snp_c2_36639 chr01 717629
## 4 solcap_snp_c1_10926 chr01 828025
## 5 solcap_snp_c1_2426 chr01 1177430
## 6 PotVar0071738 chr01 1254782
GT1 <- extract.gt(data1)
table(GT1[1,])
##
## 0/0/0/0 0/0/0/1 0/0/1/1 0/1/1/1
##      16      32      31       6
geno1 <- GT2DS(GT1, n.core=2)
table(geno1[1,])
##
##  0  1  2  3
## 16 32 31  6
As shown above, the GT2DS function in polyBreedR can be used to convert VCF GT character strings to integer allele dosages (DS). The code below compares the genotype calls between DArT (data1) and updog (data2) for the first marker; they were identical except for one sample.
data2 <- read.vcfR("DArTag_gbs.vcf.gz",verbose=F)
GT2 <- extract.gt(data2)
geno2 <- GT2DS(GT2,n.core=2)
table(geno1[1,], geno2[1,])
##
## 0 1 2 3
## 0 16 0 0 0
## 1 0 32 0 0
## 2 0 1 30 0
## 3 0 0 0 6
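The conversion performed by GT2DS can be sketched in base R, not the polyBreedR internals: the dosage is simply the count of non-reference alleles in the slash- or pipe-delimited GT string (the function name `gt2ds` is hypothetical, and any non-0 allele is counted as ALT, so multiallelic sites are simplified here).

```r
# Base-R sketch of the GT -> DS conversion: dosage equals the number
# of non-REF alleles in the GT string; missing calls become NA.
gt2ds <- function(gt) {
  sapply(strsplit(gt, "[/|]"), function(a) {
    if (any(a == ".")) return(NA_integer_)
    sum(a != "0")
  })
}

gt2ds(c("0/0/0/0", "0/0/1/1", "0|1|1|1", "./././."))
```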
The plot_geno function from the updog package produces nice plots for visualizing the relationship between allele counts and genotype calls.
AD1 <- extract.gt(data1, element="AD")
ALT1 <- ADsplit(AD1,ALT=TRUE,n.core=2)
REF1 <- ADsplit(AD1,ALT=F,n.core=2)
library(updog)
library(ggplot2)
k = 1 #first marker
plot_geno(refvec = ALT1[k,], sizevec=ALT1[k,] + REF1[k,],
geno=geno1[k,], ploidy=4) + ylab("ALT") + xlab("REF") +
ggtitle("DArT")
The allelic bias (AB) estimates from updog are stored as INFO in the VCF file and can be retrieved using extract.info. When AB = 1, or equivalently logAB = 0, there is no bias. Polyploid genotype calls become less reliable as allelic bias increases.
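To see why bias degrades the calls, here is a small sketch using the updog-style adjustment ξ = p / (h(1−p) + p), where p = dosage/ploidy and h is the bias (h = 1, i.e. log h = 0, means no bias). Whether the AB value stored in the VCF matches this exact parameterization is an assumption here, and the function name `biased_fraction` is hypothetical.

```r
# Sketch of how allelic bias distorts the expected allele read
# fraction for each dosage class (updog-style parameterization,
# assumed here): xi = p / (h*(1 - p) + p), with h = 1 meaning no bias.
biased_fraction <- function(dosage, ploidy = 4, h = 1) {
  p <- dosage / ploidy
  p / (h * (1 - p) + p)
}

# with no bias, the fractions are the naive dosage/ploidy values
biased_fraction(0:4, ploidy = 4, h = 1)
# a 2-fold bias compresses the heterozygous classes toward one end,
# shrinking the separation between adjacent dosages
round(biased_fraction(0:4, ploidy = 4, h = 2), 3)
```

The compressed spacing between heterozygous classes under h = 2 is exactly what makes the dosage classes harder to distinguish from read counts.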
Missing data arises in GBS datasets when the number of fragments in the reduced genome representation exceeds the total number of reads per sample. This commonly occurs with RAD-seq, which uses restriction enzymes, but is less of an issue with targeted GBS, which uses specific primer pairs or oligonucleotide baits.
The polyBreedR function impute was designed for datasets where some percentage of the samples for any given marker may be missing. It works with VCF input and output but is not limited to GBS data; array data could also be imputed. Provided at least several hundred samples are present, the Random Forest (RF) method in impute is likely to be the most accurate. For more information, consult the help page and this recorded presentation.
For this vignette, the focus is imputing from low- to high-density marker platforms. This situation arises when a lower-density platform, such as DArTag, is used for genomic selection candidates, while the training population has been genotyped with a higher-density platform, like an array or sequence capture. Another application occurs when genotyping platforms are upgraded over time to include more markers. For example, the original (V1) potato SNP array had 8K markers, while the current one (V4) has 31K.
Two functions were created in polyBreedR for low-to-high imputation. impute_L2H uses the Random Forest method, based on a training population of individuals genotyped with both platforms. In general, a larger training population leads to better imputation accuracy. The default behavior is to use the 100 closest markers and 100 classification trees, but these are adjustable parameters. In the publication, using 25 markers provided the lowest imputation error, but this will vary by dataset.
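The "closest markers" step above can be sketched in base R: for each high-density target marker, select the k nearest low-density markers (by position on the same chromosome), which would then serve as predictors in a per-marker Random Forest classifier. This is an illustration of the idea, not the impute_L2H internals; the function name `closest_markers` and the toy map are hypothetical.

```r
# Sketch of predictor selection for low-to-high imputation: return the
# k low-density markers closest in bp to a target position on the same
# chromosome (these would feed a per-marker classifier).
closest_markers <- function(target.pos, target.chrom, low.map, k = 100) {
  same <- low.map[low.map$chrom == target.chrom, ]
  same$marker[order(abs(same$pos - target.pos))][seq_len(min(k, nrow(same)))]
}

# toy low-density map
low.map <- data.frame(marker = paste0("M", 1:5),
                      chrom  = c("chr01", "chr01", "chr01", "chr02", "chr02"),
                      pos    = c(100, 500, 900, 200, 600))
closest_markers(target.pos = 450, target.chrom = "chr01", low.map, k = 2)
```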
The other function, impute_LA, was designed to impute markers in F1 populations based on linkage analysis (LA). The current implementation requires the software PolyOrigin, for which you will need to install Julia and PolyOrigin separately and prepare for command line execution. For impute_LA, the high-density marker file contains the phased parental genotypes in 0|1 format.
To illustrate, low- and high-density marker files are provided for a half-diallel population of 85 clones derived from 5 parents. The high-density file contains 10,695 phased SNPs based on array genotyping of their offspring. The low-density file contains 1865 DArTag markers with good quantitative agreement with the SNP array (Endelman et al. 2024).
ped.file <- system.file("vignette_data", "diallel_pedigree.csv",
package = "polyBreedR")
high.file <- system.file("vignette_data", "diallel_phased_parents.csv",
package = "polyBreedR")
low.file <- system.file("vignette_data", "diallel_DArTag.vcf.gz",
package = "polyBreedR")
#peek at phased parental genotype file
head(read.csv(high.file, check.names=F))
## marker chromosome position bp W6609-3 W12078-76 W13NYP102-7
## 1 solcap_snp_c2_36608 chr01 0 508800 0|0|1|1 1|1|0|1 0|1|0|1
## 2 solcap_snp_c2_36658 chr01 0 527068 0|1|0|0 0|0|1|0 0|0|0|0
## 3 solcap_snp_c1_10930 chr01 0 566972 0|1|0|0 0|0|1|0 0|0|0|0
## 4 PotVar0120126 chr01 0 603013 0|1|1|1 1|1|1|1 1|1|1|1
## 5 PotVar0120085 chr01 0 681193 0|0|0|0 0|0|0|0 0|0|0|0
## 6 PotVar0120080 chr01 0 681379 1|1|1|1 1|1|1|1 1|1|1|1
## W14NYQ4-1 W14NYQ9-2
## 1 0|1|0|0 0|0|0|0
## 2 0|0|0|0 1|0|0|1
## 3 0|0|0|0 1|0|0|1
## 4 0|1|1|0 1|0|0|1
## 5 0|0|0|0 0|0|0|0
## 6 1|1|1|1 1|1|1|1
#peek at pedigree file
read.csv(ped.file)[1:8,]
## id pop mother father ploidy
## 1 W6609-3 0 0 0 4
## 2 W12078-76 0 0 0 4
## 3 W13NYP102-7 0 0 0 4
## 4 W14NYQ4-1 0 0 0 4
## 5 W14NYQ9-2 0 0 0 4
## 6 W19012-12 1 W12078-76 W13NYP102-7 4
## 7 W19012-13 1 W12078-76 W13NYP102-7 4
## 8 W19012-14 1 W12078-76 W13NYP102-7 4
impute_LA(ped.file = ped.file, high.file = high.file,
low.file = low.file, low.format = "GT", out.file="imputed.csv.gz")
The following code computes the root mean squared error (RMSE) of the imputation.
array.file <- system.file("vignette_data", "diallel_array.vcf.gz",
package = "polyBreedR")
array <- read.vcfR(array.file,verbose=F)
geno.array <- GT2DS(extract.gt(array))
imputed <- read.csv("imputed.csv.gz", check.names=F)
geno.imputed <- as.matrix(imputed[,-(1:3)]) #remove map
rownames(geno.imputed) <- imputed$marker
marks <- intersect(rownames(geno.imputed),rownames(geno.array))
id <- intersect(colnames(geno.imputed),colnames(geno.array))
#RMSE
sqrt(mean((geno.imputed[marks,id] - geno.array[marks,id])^2))
## [1] 0.1340864