POLYPATEX: an R package for paternity exclusion in autopolyploids
ALEXANDER B. ZWART,* CAROLE ELLIOTT, TARA HOPLEY, DAVID LOVELL and
ANDREW YOUNG
*CSIRO Data61, GPO Box 664, Canberra, ACT 2601, Australia, CSIRO National Facilities and Collections, GPO Box 1600,
Canberra, ACT 2601, Australia, Queensland University of Technology (QUT), GPO Box 2434, Brisbane, QLD 4001, Australia
Abstract
Microsatellite markers have demonstrated their value for performing paternity exclusion and hence exploring mating patterns in plants and animals. Methodology is well established for diploid species, and several software packages exist for elucidating paternity in diploids; however, these issues are not so readily addressed in polyploids due to the increased complexity of the exclusion problem and a lack of available software. We introduce POLYPATEX, an R package for paternity exclusion analysis using microsatellite data in autopolyploid, monoecious or dioecious/bisexual species with a ploidy of 4n, 6n or 8n. Given marker data for a set of offspring, their mothers and a set of candidate fathers, POLYPATEX uses allele matching to exclude candidates whose marker alleles are incompatible with the alleles in each offspring–mother pair. POLYPATEX can analyse marker data sets in which allele copy numbers are known (genotype data) or unknown (allelic phenotype data); for data sets in which allele copy numbers are unknown, comparisons are made taking into account all possible genotypes that could arise from the compared allele sets. POLYPATEX is a software tool that provides population geneticists with the ability to investigate the mating patterns of autopolyploids using paternity exclusion analysis on data from codominant markers having multiple alleles per locus.
Keywords: allele matching, microsatellite, pollen dispersal, polyploid
Received 22 July 2014; revision received 13 November 2015; accepted 19 November 2015
ambiguity in determining allele frequencies when dealing with polyploid allelic phenotype data (Jones et al. 2010), unless the data sets are treated as dominant instead of codominant markers (e.g. Riday et al. 2013, 2015; Wang & Scribner 2014), as allele frequencies are the foundation for all likelihood-based methods. Wang & Scribner (2014) transform polyploid codominant genotypes from microsatellite data to pseudodiploid-dominant genotypes (presence/absence only) and demonstrate how useful this conversion can be, when coupled with the program COLONY (Jones & Wang 2002), for determining parentage and sibship in polyploids. However, this transformation results in the loss of valuable information, and with increasing polyploidy levels, there is a decrease in parentage assignment and exclusion accuracies (Wang & Scribner 2014).

POLYPATEX is an R (R Core Team 2015) package for conducting paternity exclusion analysis in autopolyploid monoecious, dioecious or bisexual species having ploidy level p of 4n, 6n or 8n. Developed in the context of microsatellite data, POLYPATEX can also be used for other codominant markers having multiple alleles per locus, for example allozymes. POLYPATEX is not optimized for data sets having very large numbers of loci, such as SNP data. For plants, self-compatible (i.e. mother included as a candidate father) and self-incompatible breeding systems are options in the algorithm. POLYPATEX applies allele matching at each locus to determine whether a candidate father's allele set (the (up to p) alleles observed at the given locus) is compatible with the corresponding allele sets in an offspring–mother pair. POLYPATEX can analyse marker data sets in which allele copy numbers are known (genotype data) or unknown (allelic phenotype data). For data sets in which allele copy numbers are unknown, POLYPATEX considers all possible genotypes for each locus arising from the observed allele sets in candidate father, offspring and mother, to determine whether a match is possible. Per-locus comparisons are then summarized across all loci, resulting in tables giving the identification (ID) and total counts of nonexcluded candidate fathers (which we refer to as potential fathers) for each offspring.

The advantage of the POLYPATEX algorithm over a simpler presence-/absence-based algorithm is that all available information is utilized to exclude candidates. For example, consider a tetraploid where the offspring has the genotype ABCE and the mother ABCD, with one candidate father having the genotype EFGG, and another CEGG. In this situation, the simpler presence/absence algorithm would exclude neither candidate because both candidates contain the allele not found in the mother (E). However, using the POLYPATEX algorithm, the first candidate would be excluded, because it could not have contributed a full gamete (two alleles) to the offspring. This advantage also applies to similar phenotypic situations (i.e. tetraploid offspring BC and mother AB; candidate excluded with POLYPATEX and included with presence/absence method: CEFG; candidates not excluded with either method: CE or CFG). Therefore, the genotype and gamete simulations implemented in POLYPATEX provide a robust basis for refining the level of detail to which candidates can be excluded.

The genetic model

POLYPATEX assumes the following genetic model, which is based on Mendelian rules of inheritance, to exclude candidates as potential fathers using incompatibilities between parent and offspring pairs. In an autopolyploid species of ploidy p, at a given locus,

• p/2 of the p alleles in the offspring are assumed to have been selected, without replacement, from the mother's p alleles (the maternal gamete).
• The remaining p/2 alleles in the offspring are assumed to have been selected, without replacement, from the father's p alleles (the paternal gamete).
• The relationship between mother and offspring is known, and the aim of the analysis is to determine which of one or more candidate fathers is capable of producing a paternal gamete compatible with the alleles in the offspring–mother pair.
• Comparisons are made on the basis of allele presence/absence only; POLYPATEX does not use population allele frequencies to compute likelihoods for paternity.

To allow for the possibility of phenomena such as double reduction that violate the above model (as well as the inevitable genotyping errors), POLYPATEX functions potentialFatherIDs and potentialFatherCounts include an argument, mismatches, that can be used to specify the maximum number of mismatching (and nonmissing) loci between candidate father and offspring that are allowed before the candidate is excluded as a potential father. The default is to allow no mismatching loci in a nonexcluded candidate.

Data input format, loading and preprocessing

POLYPATEX functions require data to be presented as a table with one row per individual (mother, candidate father or offspring). Initial columns contain individual IDs, population identifier, ID of each offspring's mother and, for dioecious species, adult gender. For ploidy p and k observed loci, k further blocks of p columns each contain the allele labels. Cells in these k blocks should be left blank as necessary when fewer than p alleles were observed.
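The table layout described for the input data can be illustrated with a small base-R sketch for a tetraploid (p = 4) with two loci (k = 2). The column names used here (id, popn, mother, Loc1_1 and so on) and the file name are placeholders chosen for illustration, not the headers the package necessarily expects; the package documentation should be consulted for the exact requirements.

```r
ploidy <- 4   # p
n_loci <- 2   # k

# Identifier columns (names are illustrative assumptions only).
# A monoecious species is assumed, so no gender column is included.
ids <- data.frame(
  id     = c("M1", "C1", "O1"),   # a mother, a candidate father, an offspring
  popn   = c("popA", "popA", "popA"),
  mother = c("", "", "M1"),       # only the offspring row names its mother
  stringsAsFactors = FALSE
)

# One block of `ploidy` allele columns per locus; "" marks an unobserved slot
alleles <- matrix(
  c("A", "B", "",  "",   "C", "D", "E", "",    # M1
    "B", "",  "",  "",   "C", "E", "",  "",    # C1
    "A", "B", "",  "",   "",  "",  "",  ""),   # O1: locus 2 entirely missing
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, paste0("Loc", rep(seq_len(n_loci), each = ploidy),
                               "_", rep(seq_len(ploidy), times = n_loci)))
)

toy <- cbind(ids, as.data.frame(alleles, stringsAsFactors = FALSE))

# Write the table as CSV, the file type the text says inputData reads
csv_path <- file.path(tempdir(), "toy_phenotype.csv")
write.csv(toy, csv_path, row.names = FALSE)
```

Each row carries 3 identifier columns followed by k × p = 8 allele columns, so an individual with fewer than p observed alleles at a locus simply has trailing blank cells in that block.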
POLYPATEX provides the function inputData to load the data from a comma-separated value (CSV) formatted file into R, storing the data set as an R data frame. inputData passes this data frame to function preprocessData, which performs a number of checks and preprocessing steps to help ensure the validity of the data set for analysis by other POLYPATEX functions. In particular, preprocessData checks for mismatches at each locus between mother and offspring. These arise when the mother's allele set cannot (in the absence of, say, a mutation) generate a gamete compatible with the offspring's allele set for that locus. The user may specify the maximum number of such mismatches (0 to k − 1) that are allowed before the offspring is removed from the data set. When an offspring is not removed, loci that mismatch with its mother are set to contain no alleles (i.e. they are set to be missing) in the offspring. The default is to remove offspring containing any mismatches with their mothers. Mother–offspring mismatch details are reported to the users so that they can investigate these cases for genotyping errors and check whether the problem may lie in a mother's allele sets rather than in those of her offspring.

For genotype data in a species of ploidy p, POLYPATEX requires every allele set to contain exactly p alleles, or none. When preprocessData finds nonmissing allele sets with fewer than p alleles in genotype data sets, they are reported to the user and then set to be missing. In allelic phenotype data sets, there is no requirement to observe exactly p unique alleles, so this adjustment is only relevant to genotypic data.

After these and other checks, preprocessData removes individuals from the data set having fewer than a user-specified minimum number (lociMin) of nonmissing allele sets. When a mother is so removed, all of her offspring are also removed from the data frame.

Paternity exclusion

Two functions are provided by POLYPATEX for performing paternity exclusion, one for analysing genotype data (genotPPE) and one for allelic phenotype data (phenotPPE).

For comparison of genotypes, genotPPE partitions each offspring allele set into alleles that also appear in the mother, and alleles that do not (and hence must be provided by the father). For a candidate not to be excluded as a potential father, it must account for all of the latter alleles, plus as many of the shared alleles as are needed to make a complete gamete of p/2 alleles. Function genotPPE takes proper account of allele multiplicities in making these comparisons.

In the more common case of allelic phenotype data, a nonmissing allele set at a given locus may consist of 1 to p observed alleles. The locus genotype is known when p alleles are observed, and when only one allele is observed, the locus genotype can be inferred as comprising p copies of that allele. But for 2 to p − 1 observed alleles, the phenotype gives rise to more than one possible genotype. In the allele-matching process, therefore, the possible combinations of genotypes arising from offspring, mother and candidate phenotypes must be searched for genotype combinations that allow a match, before a candidate can be claimed as a possible match for the offspring. For efficiency reasons, this search is implemented in phenotPPE using lookup tables, for each ploidy covered by POLYPATEX (4n, 6n and 8n at time of publication).

In either routine, when one or more of the allele sets in offspring, mother and candidate father are missing, the comparisons cannot be made at that locus, so the affected locus is subsequently ignored for that trio of individuals.

In the situation where more than one candidate father is the potential donor for an offspring, the user needs to consider the method best suited to their research question and species when handling an exclusion analysis that is unresolved. Some options include the following: (i) consider the offspring as derived from that group of candidate fathers without a definitive level of exclusion; (ii) fractionally assign males an equal proportion of an offspring (Goto et al. 2004); (iii) choose the closest male (plants or other sessile organisms only) (Hardesty et al. 2006); or (iv) fractionally assign to a male (sessile organisms only) based on their location/distance from the mother. For a general review of methods of parentage analysis, see Jones et al. (2010).

Exclusion analysis results summaries and export

Results from either exclusion routine are returned in an R list structure, whose contents are explained further in the POLYPATEX documentation. Two further POLYPATEX functions provide more convenient summarizations of the per-locus exclusion results across loci. Function potentialFatherCounts returns an R data frame with columns containing offspring ID, corresponding mother ID and the number of candidates flagged as potential fathers for each offspring. Function potentialFatherIDs provides a similar data frame listing the IDs of these potential fathers. Both functions include the argument mismatches, which sets the maximum number of mismatching loci allowed between candidate father and offspring before the candidate is excluded as a potential father. Results from the two summary functions can be exported to file using R functions such as write.csv.
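The genotype comparison and the mismatch threshold described in the text can be made concrete with a brute-force sketch in base R. It is not the package's implementation: the function names (is_submultiset, locus_compatible, is_potential_father, split_alleles) are hypothetical, missing loci are not handled, and the search simply tries every way of splitting the offspring genotype into a maternal and a paternal half.

```r
# TRUE if every allele in `x` occurs in `y` at least as many times (multiset inclusion)
is_submultiset <- function(x, y) {
  all(vapply(unique(x), function(a) sum(x == a) <= sum(y == a), logical(1)))
}

# Can the offspring genotype at one locus be split into a maternal gamete drawn
# from `mother` and a paternal gamete drawn from `father` (p/2 alleles each)?
locus_compatible <- function(offspring, mother, father) {
  p <- length(offspring)
  halves <- combn(p, p / 2)  # index sets for every candidate maternal half
  for (j in seq_len(ncol(halves))) {
    maternal <- offspring[halves[, j]]
    paternal <- offspring[-halves[, j]]
    if (is_submultiset(maternal, mother) && is_submultiset(paternal, father)) {
      return(TRUE)
    }
  }
  FALSE
}

# A candidate remains a potential father if at most `mismatches` loci are
# incompatible. Each argument is a list holding one allele vector per locus.
is_potential_father <- function(offspring, mother, father, mismatches = 0) {
  ok <- mapply(locus_compatible, offspring, mother, father)
  sum(!ok) <= mismatches
}

# Helper for single-letter allele labels: "ABCE" -> c("A", "B", "C", "E")
split_alleles <- function(g) strsplit(g, "")[[1]]

# The tetraploid example from the text: offspring ABCE, mother ABCD
off <- split_alleles("ABCE")
mum <- split_alleles("ABCD")
locus_compatible(off, mum, split_alleles("EFGG"))  # FALSE: no gamete with E plus one of A, B, C
locus_compatible(off, mum, split_alleles("CEGG"))  # TRUE: paternal gamete CE, maternal gamete AB
```

Because the check works on allele copies rather than on allele presence alone, candidate EFGG is excluded even though it carries the non-maternal allele E.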
Example

The POLYPATEX R package contains two example microsatellite data sets in the required input file format. Once POLYPATEX is installed, the following R command will print out the location of the files:

> system.file("extdata", package = "PolyPatEx")

File GF_Phenotype.csv contains data from seven loci in a hexaploid, monoecious species, Eremophila glabra ssp. glabra (Elliott 2010). File FR_Genotype.csv contains data from seven loci in a tetraploid, dioecious species, Salix cinerea (Hopley 2011). Appendix 1 shows example R code to load the allelic phenotype data set into R, perform the exclusion analysis, and output results from the summary functions potentialFatherCounts and potentialFatherIDs to CSV files for scrutiny in a spreadsheet application. The code in Appendix 1 assumes that the data file has been copied to a suitable working directory, and that the working directory of the R session has been set to this directory prior to running the code.

The first 12 lines of the table produced by potentialFatherIDs are shown in Table 1.

Table 1 The first twelve lines of the output table produced by potentialFatherIDs. For each progeny, each potential father identified by the algorithm is listed, along with the number of loci at which a match was made (FLCount: Father Locus Count), and the total number of loci at which a valid comparison was possible (VLTotal: Valid Loci Total). NA is R's code for a missing datum, and appears in columns FLCount and VLTotal when no potential father has been identified for a given offspring

Progeny    Mother   PotentialFather   FLCount   VLTotal
GF1-2310   GF1      None              NA        NA
GF1-2311   GF1      None              NA        NA
GF1-2315   GF1      GF21              6         7
GF1-2315   GF1      GF6               7         7
GF1-2316   GF1      GF14              7         7
GF1-2316   GF1      GF23              6         7
GF1-2317   GF1      GF21              6         6
GF1-2317   GF1      GF6               6         6
GF3-2337   GF3      GF2               7         7
GF3-2338   GF3      GF14              7         7
GF3-2339   GF3      None              NA        NA
GF3-2341   GF3      GF13              7         7

Performance testing and simulations

We compared the performance of POLYPATEX (i.e. exclusion based) against COLONY (i.e. likelihood based) using the phenotypic, autohexaploid example data set of E. glabra. We converted the data to pseudodiploid-dominant genotypes following Wang & Scribner (2014) and analysed it in COLONY (version 2.0.5.9; Jones & Wang 2002) using several different parameter settings. The type of analysis method was 'full-likelihood' with 'polygamous', 'inbreeding' and 'monoecious' as the core parameters. The data set contained 95 loci, and we defined the known maternal sibships. POLYPATEX was run as per Appendix 1, except that the number of mismatches allowed in the father was set to zero.

Table 2 The number of offspring from the Eremophila glabra example data set (n = 40) that fall into the following paternity assignment groups: (i) selfing, maternal plant included in the paternity list; (ii) single father, only one father listed; (iii) multiple fathers, more than one candidate available to sire the offspring; and (iv) no fathers, no candidate fathers assigned from the list of potential fathers, based on the outcome of POLYPATEX (PPE) and COLONY analysis. COLONY was run at four probabilities that the father was included in the candidature file (pr = 0.20, pr = 0.50, pr = 0.80 and pr = 0.95). The results of COLONY are split into individuals that had the same outcome as PPE and those that were different to the PPE outcome within each probability class. The ranges in likelihoods from COLONY for each group (parent pair or maternal only) are in parentheses (pr = 1.00 if none presented)

POLYPATEX 1 6 7 26
COLONY 0.20 Same 1 3 (0.99–1.00) 4* 23 (0.99–1.00)
Different 4 5
0.50 Same 1 3 (0.99–1.00) 4* 23 (0.91–1.00)
Different 4 5
0.80 Same 1 3 (0.99–1.00) 4* 23 (0.64–1.00)
Different 4 (0.99–1.00) 5 (0.96–1.00)
0.95 Same 1 4 (0.57–1.00) 6* 21 (0.55–1.00)
Different 1 5 (0.86–1.00) 2 (0.90–1.00)
*COLONY listed only most likely father (pr = 1.00), but it was one of the fathers listed in PPE.

We found that both programs produced similar results for paternity assignments of the majority of the offspring tested (Table 2). As expected, POLYPATEX was more accurate in certain situations (described above), whereas COLONY inferred father candidates that could not be candidates based on the original autohexaploid phenotype or did not infer a potential paternal relationship when POLYPATEX did. In addition, POLYPATEX was more inclusive when there were multiple candidate fathers available for an offspring, despite the user needing to decide how to handle these unresolved exclusions. In
Riday H, Johnson DW, Heyduk K, Raasch JA, Darling ME, Sandman JM (2013) Paternity testing in an autotetraploid alfalfa breeding polycross. Euphytica, 194, 335–349.
Riday H, Smith MA, Peel MD (2015) A simple model for pollen-parent fecundity distributions in bee-pollinated forage legume polycrosses. Theoretical and Applied Genetics, 128, 1865–1879.
Selkoe KA, Toonen RJ (2006) Microsatellites for ecologists: a practical guide to using and evaluating microsatellite markers. Ecology Letters, 9, 615–629.
Spielmann A, Harris SA, Boshier DH, Vinson CC (2015) ORCHARD: paternity program for autotetraploid species. Molecular Ecology Resources, 15, 915–920.
Thrall PH, Young A (2000) AUTOTET: a program for analysis of autotetraploid genotypic data. Journal of Heredity, 91, 348–349.
Wang J, Scribner KT (2014) Parentage and sibship inference from markers in polyploids. Molecular Ecology Resources, 14, 541–553.
Warnes GR, Bolker B, Lumley B (2015) gtools: Various R Programming Tools. R package version 3.5.0. http://CRAN.R-project.org/package=gtools.
Wood TE, Takebayashi N, Barker MS, Mayrose I, Greenspoon PB, Rieseberg LH (2009) The frequency of polyploid speciation in vascular plants. Proceedings of the National Academy of Sciences, 106, 13875–13879.

A.Y. suggested the concept. A.Z. developed the R package. C.E. and T.H. contributed data and aided in testing and suggesting improvements to the package and documentation. D.L. developed theory used in further testing/confirmation of the validity of the package. All authors contributed to the discussions around the development of the problem and the code.

Data Accessibility

The POLYPATEX R package (which includes documentation and example data sets) is available from the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/PolyPatEx) or one of its regional mirrors. The example data sets FR_Genotype.csv and GF_Phenotype.csv are included with the package, and are also archived at the Dryad Digital Repository (http://datadryad.org/), via doi:10.5061/dryad.64482. Also available via this DOI are an R script to implement the example in this article, R scripts to implement the simulations that produced Figs 1 and 2, and the Eremophila glabra ssp. glabra data set (GF_Phenotype) recoded for use with the COLONY software.

Appendix 1

Example R code for paternity analysis of a hexaploid, monoecious species ('>' denotes the R command prompt; '+' is R's prompt indicating continuation of a command). Arguments to potentialFatherCounts and potentialFatherIDs specify that a comparison must involve at least two nonmissing loci in mother, offspring and candidate father for it to be included in the summary (VLTMin = 2), and that at most one mismatching locus is allowed in a candidate that is flagged as a potential father (mismatches = 1). The latter argument provides some allowance for genotyping errors or mutations in the data set. In this example, rather than storing the output from potentialFatherCounts and potentialFatherIDs as R objects, R function write.csv is used to immediately export these tables as CSV files in the current working directory.

> require(PolyPatEx)
> # Load and preprocess the allelic phenotype data set
> adata <- inputData("GF_Phenotype.csv",
+ numLoci=7,
+ ploidy=6,
+ dataType="phenotype",
+ dioecious=FALSE,
+ selfCompatible=TRUE)
> # Run the exclusion analysis for allelic phenotype data
> pe1 <- phenotPPE(adata)
> # Summarize across loci and export the two summary tables
> write.csv(potentialFatherCounts(pe1,mismatches=1,VLTMin=2),
+ "potentialFatherCounts.csv")
> write.csv(potentialFatherIDs(pe1,mismatches=1,VLTMin=2),
+ "potentialFatherIDs.csv")
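For allelic phenotype data, the text describes a search over all genotypes consistent with each observed allele set. The following base-R sketch illustrates that idea by brute-force enumeration; it is not how phenotPPE works internally (the paper states that lookup tables are used), and all function names here are hypothetical.

```r
# TRUE if every allele in `x` occurs in `y` at least as many times
is_submultiset <- function(x, y) {
  all(vapply(unique(x), function(a) sum(x == a) <= sum(y == a), logical(1)))
}

# Genotype-level check: can the offspring be split into a maternal and a
# paternal gamete of p/2 alleles each, drawn from mother and father?
locus_compatible <- function(offspring, mother, father) {
  p <- length(offspring)
  halves <- combn(p, p / 2)
  for (j in seq_len(ncol(halves))) {
    if (is_submultiset(offspring[halves[, j]], mother) &&
        is_submultiset(offspring[-halves[, j]], father)) return(TRUE)
  }
  FALSE
}

# All genotypes of ploidy `p` in which every observed allele appears at least once
phenotype_genotypes <- function(alleles, p) {
  alleles <- sort(unique(alleles))
  extra <- p - length(alleles)          # slots whose allele identity is unknown
  stopifnot(length(alleles) >= 1, extra >= 0)
  if (extra == 0) return(list(alleles)) # p distinct alleles: genotype is known
  fills <- expand.grid(rep(list(alleles), extra), stringsAsFactors = FALSE)
  genos <- lapply(seq_len(nrow(fills)), function(i) {
    sort(c(alleles, unlist(fills[i, ], use.names = FALSE)))
  })
  unique(genos)                         # drop orderings of the same genotype
}

# A candidate is retained at a locus if any combination of possible
# genotypes for offspring, mother and father allows a match
phenotype_compatible <- function(offspring, mother, father, p) {
  for (o in phenotype_genotypes(offspring, p))
    for (m in phenotype_genotypes(mother, p))
      for (f in phenotype_genotypes(father, p))
        if (locus_compatible(o, m, f)) return(TRUE)
  FALSE
}

# Tetraploid phenotype example from the text: offspring BC, mother AB
phenotype_compatible(c("B", "C"), c("A", "B"), c("C", "E", "F", "G"), p = 4)  # FALSE
phenotype_compatible(c("B", "C"), c("A", "B"), c("C", "E"), p = 4)            # TRUE
phenotype_compatible(c("B", "C"), c("A", "B"), c("C", "F", "G"), p = 4)       # TRUE
```

Enumeration of this kind grows quickly with ploidy and allele number, which is consistent with the paper's choice of precomputed lookup tables for efficiency.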