SNP arrays provide signal intensity data for two different alleles, which can be visualized as the X and Y axes of a Cartesian plane. Function readXY
reads a “FinalReport”-style output file from GenomeStudio that contains the XY data in long format; that is, each row corresponds to a different sample-marker combination. An example file containing 9 individuals genotyped with Version 3 of the potato SNP array (which has 21K markers) is provided with the vignette:
<- system.file("vignette_data", "potato_V3array_XYdata.txt", package = "polyBreedR")
data.filename <- read.table(data.filename, sep="\t",skip=9,check.names=F,as.is=T,header=T)
dataXY head(dataXY)
## SNP Name Sample ID X Y
## 1 CTRL-1 W16102-3 0.007 0.002
## 2 CTRL-10 W16102-3 1.809 0.098
## 3 CTRL-100 W16102-3 0.029 0.013
## 4 CTRL-101 W16102-3 0.026 0.008
## 5 CTRL-103 W16102-3 0.020 0.001
## 6 CTRL-104 W16102-3 0.023 0.010
library(polyBreedR)
<- readXY(filename=data.filename,skip=9)
data str(data)
## num [1:21226, 1:9] 0.2222 0.0514 0.3095 0.2353 0.0476 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:21226] "CTRL-1" "CTRL-10" "CTRL-100" "CTRL-101" ...
## ..$ : chr [1:9] "W16102-3" "W16101-11" "W16100-13" "W16103-2" ...
hist(data)
As shown above, the output from readXY
is a matrix with dimensions [markers x individuals], and the data values are the ratio Y/(X+Y), which varies from 0 to 1 and is an analog measure of allele dosage (theta values are also possible; see the manual). Several methods are available for classifying this ratio into discrete values 0-4, including the normal mixture model implemented in R package fitPoly. Two thousand potato samples were used to fit the model for each marker, and 11,043 SNPs remained after filtering. The model file provided with the package contains 15 columns: five means, five standard deviations, and 5 mixture probabilities.
<- system.file("vignette_data", "potato_V3array_model.csv", package = "polyBreedR")
model.filename <- read.csv(model.filename)
model.params head(model.params)
## Marker Mean0 Mean1 Mean2 Mean3 Mean4 SD0 SD1 SD2
## 1 D_locus_(DFR)_C_LG02 0.2420 0.814 0.974 1.12 1.44 0.0446 0.0446 0.0446
## 2 D_locus_(DFR)_D_LG02 0.1110 0.678 0.851 1.02 1.48 0.0478 0.0478 0.0478
## 3 D_locus_(DFR)_G_LG02 0.0799 0.619 0.813 1.00 1.50 0.0473 0.0473 0.0473
## 4 D_locus_(DFR)_H_LG02 0.1740 0.687 0.853 1.02 1.43 0.0408 0.0408 0.0408
## 5 D_locus_(DFR)_I_LG02 0.3330 0.764 0.930 1.08 1.35 0.0408 0.0408 0.0408
## 6 Gro14_c_Paal_LG07 0.6150 0.804 0.978 1.16 1.45 0.0449 0.0449 0.0449
## SD3 SD4 Prob0 Prob1 Prob2 Prob3 Prob4
## 1 0.0446 0.0446 0.199000 0.3960 0.295 0.0979 0.0122
## 2 0.0478 0.0478 0.200000 0.3960 0.295 0.0974 0.0121
## 3 0.0473 0.0473 0.011900 0.0967 0.294 0.3970 0.2010
## 4 0.0408 0.0408 0.011900 0.0964 0.293 0.3970 0.2010
## 5 0.0408 0.0408 0.011900 0.0965 0.294 0.3970 0.2010
## 6 0.0449 0.0449 0.000799 0.0158 0.117 0.3870 0.4790
Use function geno_call
to make genotype (allele dosage) calls from the allele ratio data matrix and model parameters file:
<- geno_call(data=data,filename=model.filename)
geno str(geno)
## num [1:11043, 1:9] 0 0 4 4 4 3 2 2 1 3 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:11043] "D_locus_(DFR)_C_LG02" "D_locus_(DFR)_D_LG02" "D_locus_(DFR)_G_LG02" "D_locus_(DFR)_H_LG02" ...
## ..$ : chr [1:9] "W16102-3" "W16101-11" "W16100-13" "W16103-2" ...
#Look at relationship between XY and dosage for one marker
<- "PotVar0028728"
snp <- which(dataXY$`SNP Name`==snp)
ix <- data.frame(X=dataXY$X[ix],
plot.data Y=dataXY$Y[ix],
dosage=factor(geno[snp,dataXY$`Sample ID`[ix]])
)
library(ggplot2)
ggplot(data=plot.data,aes(x=X,y=Y,colour=dosage)) + geom_point() + scale_colour_brewer(palette="Set1") + coord_fixed(ratio=1) + xlim(c(0,1)) + ylim(c(0,1)) + ggtitle(snp)
Genome-wide markers are very useful for quality control in a breeding program, including to ensure seed mixtures have not occurred and to check the accuracy of the recorded parentage. A pedigree file for the 9 potato clones in the sample dataset is provided with the vignette:
<- system.file("vignette_data", "potato_pedigree.csv", package = "polyBreedR")
pedfile <- read.csv(pedfile,as.is=T,na.strings="NA")
ped nrow(ped)
## [1] 79
tail(ped)
## id mother father
## 74 NYWJ11-5 NYC101-10 Marcy
## 75 W16118-3 NYWJ11-5 Nicolet
## 76 Hodag Pike DakotaPearl
## 77 W16126-7 Hodag MSR127-2
## 78 W16127-13 Hodag Nicolet
## 79 W16130-10 W6609-3 MSR127-2
The full pedigree has 70 ancestors. Portions of the pedigree for certain individuals can be extracted with the function get_pedigree
:
<- colnames(geno)
id <- get_pedigree(id = id[1],pedfile=pedfile,na.string="NA")
ped1 nrow(ped1)
## [1] 40
<- get_pedigree(id = id[1:2],pedfile=pedfile,na.string="NA")
ped2 nrow(ped2)
## [1] 43
print(parentage <- ped[ped$id %in% id,])
## id mother father
## 34 W16100-13 SaginawChipper Nicolet
## 36 W16101-11 SaginawChipper Tundra
## 46 W16102-3 SaginawChipper W6609-3
## 55 W16103-2 MSR127-2 Nicolet
## 56 W16105-1 MSR127-2 W6609-3
## 75 W16118-3 NYWJ11-5 Nicolet
## 77 W16126-7 Hodag MSR127-2
## 78 W16127-13 Hodag Nicolet
## 79 W16130-10 W6609-3 MSR127-2
By subsetting the pedigree to just the 9 genotyped individuals, we see they are derived from 7 parents, which were genotyped years earlier using Version 1 of the potato SNP array. The Version 1 array had 8303 markers, and allele dosage data for 4167 of these markers for the 7 parents are provided with the vignette.
<- system.file("vignette_data", "potato_V1array_geno.csv", package = "polyBreedR")
V1datafile <- as.matrix(read.csv(V1datafile,check.names=F,row.names=1))
geno.parents str(geno.parents)
## int [1:4167, 1:7] 2 2 0 1 3 0 3 2 1 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:4167] "solcap_snp_c1_1" "solcap_snp_c1_1000" "solcap_snp_c1_10000" "solcap_snp_c1_10011" ...
## ..$ : chr [1:7] "SaginawChipper" "MSR127-2" "NYWJ11-5" "Hodag" ...
When both parents of an individual have been genotyped, the function check_trio
can be used to calculate what percent of the markers are inconsistent (using a random bivalents model). For example, if the mother has dosage 0 and father has dosage 1, then only dosage values of 0 and 1 are possible in the offspring. Because the parents and progeny were genotyped with different versions of the array, we need to first combine the two genotype matrices using only the markers in common:
<- intersect(rownames(geno.parents),rownames(geno))
common.marks <- cbind(geno.parents[common.marks,],geno[common.marks,])
geno2 str(geno2)
## num [1:2781, 1:16] 2 1 3 3 2 3 1 4 1 2 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2781] "solcap_snp_c1_1" "solcap_snp_c1_10011" "solcap_snp_c1_10012" "solcap_snp_c1_10042" ...
## ..$ : chr [1:16] "SaginawChipper" "MSR127-2" "NYWJ11-5" "Hodag" ...
<- check_trio(parentage=parentage,geno=geno2,ploidy=4)
trio.ans trio.ans
## id mother father error
## 34 W16100-13 SaginawChipper Nicolet 0.3
## 36 W16101-11 SaginawChipper Tundra 0.6
## 46 W16102-3 SaginawChipper W6609-3 0.4
## 55 W16103-2 MSR127-2 Nicolet 0.3
## 56 W16105-1 MSR127-2 W6609-3 0.3
## 75 W16118-3 NYWJ11-5 Nicolet 0.1
## 77 W16126-7 Hodag MSR127-2 0.2
## 78 W16127-13 Hodag Nicolet 0.7
## 79 W16130-10 W6609-3 MSR127-2 0.8
For correct trios, the percent error is limited by the accuracy of the genotype data. Based on checking hundreds of trios, approximately 1% error is expected when using the model parameter file included with the vignette. Values much higher than this indicate the parentage is incorrect. To illustrate, we will replace MSR127-2 with Nicolet in the parentage of the last individual to introduce a pedigree error:
<- ped
ped_with_error $father[ped_with_error$id=="W16130-10"] <- "Nicolet"
ped_with_error<- ped_with_error[ped_with_error$id %in% id,]
parentage_with_error check_trio(parentage=parentage_with_error,geno=geno2,ploidy=4)
## id mother father error
## 34 W16100-13 SaginawChipper Nicolet 0.3
## 36 W16101-11 SaginawChipper Tundra 0.6
## 46 W16102-3 SaginawChipper W6609-3 0.4
## 55 W16103-2 MSR127-2 Nicolet 0.3
## 56 W16105-1 MSR127-2 W6609-3 0.3
## 75 W16118-3 NYWJ11-5 Nicolet 0.1
## 77 W16126-7 Hodag MSR127-2 0.2
## 78 W16127-13 Hodag Nicolet 0.7
## 79 W16130-10 W6609-3 Nicolet 9.3
In this example, the incorrect parentage leads to almost 10% error. The plotting function GvsA
can be used to determine which parent is wrong and identify the correct parent if it is in the dataset. G refers to the additive relationship matrix from markers, and A is the additive relationship from pedigree, which can be computed using the G_mat
and A_mat
functions, respectively.
<- G_mat(geno=geno2,ploidy=4)
G <- A_mat(ped=ped,ploidy=4)
A <- A[rownames(G),colnames(G)]
A
<- A_mat(ped=ped_with_error,ploidy=4)
A_error <- A_error[rownames(G),colnames(G)]
A_error
#Plot with correct parentage
GvsA(parentage[9,],G,A)
The above figure shows the expected positive correlation between G and A when the parentage is correct, with the parents and any full-siblings (W16105-1 in this example) in the upper right corner, and more distantly related individuals in the lower left. You can control which points get labeled using the parameters thresh.G
and thresh.A
. Now look at the plot when the parentage is incorrect:
GvsA(parentage_with_error[9,],G,A_error,thresh.G=0.1)
The position of W6609-3 looks appropriate, but the G coefficient with Nicolet is too small for it to be the parent. Conversely, the G coefficient with MSR127-2 is too high relative to its A coefficient, which suggests it is the missing parent.