polyBreedR Vignette 1: SNP array and pedigree data

Genotype calls from SNP array intensities

SNP arrays provide signal intensity data for two different alleles, which can be visualized as the X and Y axes of a Cartesian plane. Function readXY reads a “FinalReport”-style output file from GenomeStudio that contains the XY data in long format; that is, each row corresponds to a different sample-marker combination. An example file containing 9 individuals genotyped with Version 3 of the potato SNP array (which has 21K markers) is provided with the vignette:

data.filename <- system.file("vignette_data", "potato_V3array_XYdata.txt", package = "polyBreedR")
dataXY <- read.table(data.filename, sep="\t",skip=9,check.names=F,as.is=T,header=T)
head(dataXY)
##   SNP Name Sample ID     X     Y
## 1   CTRL-1  W16102-3 0.007 0.002
## 2  CTRL-10  W16102-3 1.809 0.098
## 3 CTRL-100  W16102-3 0.029 0.013
## 4 CTRL-101  W16102-3 0.026 0.008
## 5 CTRL-103  W16102-3 0.020 0.001
## 6 CTRL-104  W16102-3 0.023 0.010
library(polyBreedR)
data <- readXY(filename=data.filename,skip=9)
str(data)
##  num [1:21226, 1:9] 0.2222 0.0514 0.3095 0.2353 0.0476 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:21226] "CTRL-1" "CTRL-10" "CTRL-100" "CTRL-101" ...
##   ..$ : chr [1:9] "W16102-3" "W16101-11" "W16100-13" "W16103-2" ...
hist(data)

As shown above, the output from readXY is a matrix with dimensions [markers x individuals], and the data values are the ratio Y/(X+Y), which varies from 0 to 1 and is an analog measure of allele dosage (theta values are also possible; see the manual). Several methods are available for classifying this ratio into discrete values 0-4, including the normal mixture model implemented in R package fitPoly. Two thousand potato samples were used to fit the model for each marker, and 11,043 SNPs remained after filtering. The model file provided with the package contains 15 columns: five means, five standard deviations, and 5 mixture probabilities.

model.filename <- system.file("vignette_data", "potato_V3array_model.csv", package = "polyBreedR")
model.params <- read.csv(model.filename)
head(model.params)
##                 Marker  Mean0 Mean1 Mean2 Mean3 Mean4    SD0    SD1    SD2
## 1 D_locus_(DFR)_C_LG02 0.2420 0.814 0.974  1.12  1.44 0.0446 0.0446 0.0446
## 2 D_locus_(DFR)_D_LG02 0.1110 0.678 0.851  1.02  1.48 0.0478 0.0478 0.0478
## 3 D_locus_(DFR)_G_LG02 0.0799 0.619 0.813  1.00  1.50 0.0473 0.0473 0.0473
## 4 D_locus_(DFR)_H_LG02 0.1740 0.687 0.853  1.02  1.43 0.0408 0.0408 0.0408
## 5 D_locus_(DFR)_I_LG02 0.3330 0.764 0.930  1.08  1.35 0.0408 0.0408 0.0408
## 6    Gro14_c_Paal_LG07 0.6150 0.804 0.978  1.16  1.45 0.0449 0.0449 0.0449
##      SD3    SD4    Prob0  Prob1 Prob2  Prob3  Prob4
## 1 0.0446 0.0446 0.199000 0.3960 0.295 0.0979 0.0122
## 2 0.0478 0.0478 0.200000 0.3960 0.295 0.0974 0.0121
## 3 0.0473 0.0473 0.011900 0.0967 0.294 0.3970 0.2010
## 4 0.0408 0.0408 0.011900 0.0964 0.293 0.3970 0.2010
## 5 0.0408 0.0408 0.011900 0.0965 0.294 0.3970 0.2010
## 6 0.0449 0.0449 0.000799 0.0158 0.117 0.3870 0.4790

Use function geno_call to make genotype (allele dosage) calls from the allele ratio data matrix and model parameters file:

geno <- geno_call(data=data,filename=model.filename)
str(geno)
##  num [1:11043, 1:9] 0 0 4 4 4 3 2 2 1 3 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:11043] "D_locus_(DFR)_C_LG02" "D_locus_(DFR)_D_LG02" "D_locus_(DFR)_G_LG02" "D_locus_(DFR)_H_LG02" ...
##   ..$ : chr [1:9] "W16102-3" "W16101-11" "W16100-13" "W16103-2" ...
#Look at relationship between XY and dosage for one marker 
snp <- "PotVar0028728"
ix <- which(dataXY$`SNP Name`==snp)
plot.data <- data.frame(X=dataXY$X[ix],
                        Y=dataXY$Y[ix],
                        dosage=factor(geno[snp,dataXY$`Sample ID`[ix]])
                        )

library(ggplot2)
ggplot(data=plot.data,aes(x=X,y=Y,colour=dosage)) + geom_point() + scale_colour_brewer(palette="Set1") + coord_fixed(ratio=1) + xlim(c(0,1)) + ylim(c(0,1)) + ggtitle(snp)

Pedigree analysis

Genome-wide markers are very useful for quality control in a breeding program, including to ensure seed mixtures have not occurred and to check the accuracy of the recorded parentage. A pedigree file for the 9 potato clones in the sample dataset is provided with the vignette:

pedfile <- system.file("vignette_data", "potato_pedigree.csv", package = "polyBreedR")
ped <- read.csv(pedfile,as.is=T,na.strings="NA")
nrow(ped)
## [1] 79
tail(ped)
##           id    mother      father
## 74  NYWJ11-5 NYC101-10       Marcy
## 75  W16118-3  NYWJ11-5     Nicolet
## 76     Hodag      Pike DakotaPearl
## 77  W16126-7     Hodag    MSR127-2
## 78 W16127-13     Hodag     Nicolet
## 79 W16130-10   W6609-3    MSR127-2

The full pedigree has 70 ancestors. Portions of the pedigree for certain individuals can be extracted with the function get_pedigree:

id <- colnames(geno)
ped1 <- get_pedigree(id = id[1],pedfile=pedfile,na.string="NA")
nrow(ped1)
## [1] 40
ped2 <- get_pedigree(id = id[1:2],pedfile=pedfile,na.string="NA")
nrow(ped2)
## [1] 43
print(parentage <- ped[ped$id %in% id,])
##           id         mother   father
## 34 W16100-13 SaginawChipper  Nicolet
## 36 W16101-11 SaginawChipper   Tundra
## 46  W16102-3 SaginawChipper  W6609-3
## 55  W16103-2       MSR127-2  Nicolet
## 56  W16105-1       MSR127-2  W6609-3
## 75  W16118-3       NYWJ11-5  Nicolet
## 77  W16126-7          Hodag MSR127-2
## 78 W16127-13          Hodag  Nicolet
## 79 W16130-10        W6609-3 MSR127-2

By subsetting the pedigree to just the 9 genotyped individuals, we see they are derived from 7 parents, which were genotyped years earlier using Version 1 of the potato SNP array. The Version 1 array had 8303 markers, and allele dosage data for 4167 of these markers for the 7 parents are provided with the vignette.

V1datafile <- system.file("vignette_data", "potato_V1array_geno.csv", package = "polyBreedR")
geno.parents <- as.matrix(read.csv(V1datafile,check.names=F,row.names=1))
str(geno.parents)
##  int [1:4167, 1:7] 2 2 0 1 3 0 3 2 1 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:4167] "solcap_snp_c1_1" "solcap_snp_c1_1000" "solcap_snp_c1_10000" "solcap_snp_c1_10011" ...
##   ..$ : chr [1:7] "SaginawChipper" "MSR127-2" "NYWJ11-5" "Hodag" ...

When both parents of an individual have been genotyped, the function check_trio can be used to calculate what percent of the markers are inconsistent (using a random bivalents model). For example, if the mother has dosage 0 and father has dosage 1, then only dosage values of 0 and 1 are possible in the offspring. Because the parents and progeny were genotyped with different versions of the array, we need to first combine the two genotype matrices using only the markers in common:

common.marks <- intersect(rownames(geno.parents),rownames(geno))
geno2 <- cbind(geno.parents[common.marks,],geno[common.marks,])
str(geno2)
##  num [1:2781, 1:16] 2 1 3 3 2 3 1 4 1 2 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2781] "solcap_snp_c1_1" "solcap_snp_c1_10011" "solcap_snp_c1_10012" "solcap_snp_c1_10042" ...
##   ..$ : chr [1:16] "SaginawChipper" "MSR127-2" "NYWJ11-5" "Hodag" ...
trio.ans <- check_trio(parentage=parentage,geno=geno2,ploidy=4)
trio.ans
##           id         mother   father error
## 34 W16100-13 SaginawChipper  Nicolet   0.3
## 36 W16101-11 SaginawChipper   Tundra   0.6
## 46  W16102-3 SaginawChipper  W6609-3   0.4
## 55  W16103-2       MSR127-2  Nicolet   0.3
## 56  W16105-1       MSR127-2  W6609-3   0.3
## 75  W16118-3       NYWJ11-5  Nicolet   0.1
## 77  W16126-7          Hodag MSR127-2   0.2
## 78 W16127-13          Hodag  Nicolet   0.7
## 79 W16130-10        W6609-3 MSR127-2   0.8

For correct trios, the percent error is limited by the accuracy of the genotype data. Based on checking hundreds of trios, approximately 1% error is expected when using the model parameter file included with the vignette. Values much higher than this indicate the parentage is incorrect. To illustrate, we will replace MSR127-2 with Nicolet in the parentage of the last individual to introduce a pedigree error:

ped_with_error <- ped
ped_with_error$father[ped_with_error$id=="W16130-10"] <- "Nicolet"
parentage_with_error <- ped_with_error[ped_with_error$id %in% id,]
check_trio(parentage=parentage_with_error,geno=geno2,ploidy=4)
##           id         mother   father error
## 34 W16100-13 SaginawChipper  Nicolet   0.3
## 36 W16101-11 SaginawChipper   Tundra   0.6
## 46  W16102-3 SaginawChipper  W6609-3   0.4
## 55  W16103-2       MSR127-2  Nicolet   0.3
## 56  W16105-1       MSR127-2  W6609-3   0.3
## 75  W16118-3       NYWJ11-5  Nicolet   0.1
## 77  W16126-7          Hodag MSR127-2   0.2
## 78 W16127-13          Hodag  Nicolet   0.7
## 79 W16130-10        W6609-3  Nicolet   9.3

In this example, the incorrect parentage leads to almost 10% error. The plotting function GvsA can be used to determine which parent is wrong and identify the correct parent if it is in the dataset. G refers to the additive relationship matrix from markers, and A is the additive relationship from pedigree, which can be computed using the G_mat and A_mat functions, respectively.

G <- G_mat(geno=geno2,ploidy=4)
A <- A_mat(ped=ped,ploidy=4)
A <- A[rownames(G),colnames(G)]

A_error <-  A_mat(ped=ped_with_error,ploidy=4)
A_error <- A_error[rownames(G),colnames(G)]

#Plot with correct parentage
GvsA(parentage[9,],G,A)

The above figure shows the expected positive correlation between G and A when the parentage is correct, with the parents and any full-siblings (W16105-1 in this example) in the upper right corner, and more distantly related individuals in the lower left. You can control which points get labeled using the parameters thresh.G and thresh.A. Now look at the plot when the parentage is incorrect:

GvsA(parentage_with_error[9,],G,A_error,thresh.G=0.1)

The position of W6609-3 looks appropriate, but the G coefficient with Nicolet is too small for it to be the parent. Conversely, the G coefficient with MSR127-2 is too high relative to its A coefficient, which suggests it is the missing parent.