Professional Documents
Culture Documents
© The Author (2011). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com 1
chromosomes, based on SNP genotypes alone. Prior to the actual . Set NL = NL ∪ smin . In the backward elimination step, compute
imputation procedure, the employed statistical model is constructed
by iteratively selecting informative SNPs in the reference set from smax = argmax M(NL r s)
s∈NL
those that are typed in both the reference set and the sample to be
typed. The performance of LDMhc is influenced by the size and and remove the SNP smax with the highest score from NL , unless smin =
diversity of the training set and the initial model building algorithm smax . Continue alternating forward and backward steps, until a predetermi-
(Leslie et al., 2008). ned maximum number of SNPs has been reached or the reduction in M for
Here, we present three developments of the LDMhc algorithm: two subsequent added SNPs has fallen below a pre-defined threshold; then,
a modification of the original SNP-selection algorithm that leads set SL = NL .
to improved imputation call rates and accuracy; a parallelisation Now, we describe the new SNP selection optimality measure. In our
of this algorithm that enables rapid analysis of large data sets; and implementation, M is the sum of posterior error probabilities in a leave-
an integrated software suite (HLA*IMP) that enables researchers to one-out cross-validation analysis of all chromosomes in the reference set
H:
perform classical HLA allele imputation from genotype data collec-
ted from several available genome-wide SNP sets through reference
M(X) =
X 1 − P(hla type(c) = V(c); {H r c}; X),
to a reference data set of over 2,500 samples of European ancestry
2
Downloaded from bioinformatics.oxfordjournals.org at Andhra University on February 27, 2011
Fig. 1. Visualization of the L&S Hidden Markov Model states for a group of Fig. 2. The front-end of HLA*IMP controls for missing data, aligns com-
reference chromosomes carrying the A allele. Usually, the computation of an plementary SNPs and phases haplotypes in a largely automated manner. In
emission probability for a given chromosome c would involve filling the cor- this screenshot: graphical output from the alignment procedure, comparing
responding forward table (Rabiner, 1989) from s1 to sn and summing over SNP allele frequencies in the user dataset to HapMap allele frequencies,
the entries in sn . However, the emission probability can also be calculated at before (left) and after (middle) alignment. Complementary SNPs are aligned
any point s in the HMM, by combining the forward- and backward-tables up using an EM-based procedure. A straight line of data points (right) indica-
to s. In our parallelization approach, we compute both tables for each chro- tes that there are no gross deviations between EM-estimated and HapMap
mosome in advance (grey cells in the figure, polymorphisms highlighted in frequencies.
dark grey) and add the specific transition probabilities for any given SNP s
(middle column), which can be performed in parallel without changing the
pre-computed table values. and Illumina SNP genotyping platforms; uploads from other platforms are
currently not supported. HLA*IMP is free for academic use and available
from http://oxfordhla.well.ox.ac.uk. The online resource includes detailed
A missing data threshold of 5% was applied to SNPs and individuals and user information and a tutorial with a sample data set.
all SNPs were checked for strand inconsistencies. SNP haplotypes for the
1958BC and CEU+ samples were phased using IMPUTE v2 (Howie et al., 3 RESULTS AND DISCUSSION
2009) using the trio-phased HapMap samples as a reference dataset. Classi-
cally typed HLA genotypes were then phased into SNP haplotypes by using 3.1 Effects of modified SNP selection
PHASE (Stephens and Scheet, 2005) applying standard settings for multial- To assess the effects of the new SNP selection function, we repeated
lelic loci. The combined reference dataset consists of 5024 haplotypes with one of the validation experiments from Leslie et al. (2008), using
data on 7733 SNPs in the HLA region. This splits up into 2474 (HLA-A),
exactly the same datasets and exactly the same validation metho-
3090 (HLA-B), 2022 (HLA-C), 175 (HLA-DQA1), 2629 (HLA-DQB1), 2665
dology. We find that the new SNP selection algorithm based on
(HLA-DRB1) locus-specific haplotypes which are used for inference.
Data for the validation experiment was generated by conducting a random optimizing posterior probabilities typically outperforms the old SNP
2/3 – 1/3 split of the set of reference data, using the 2/3 part as reference data selection function, particularly when a threshold is applied to the
to impute the HLA types of the remaining 1/3. For the validation experiment certainty of calls (Supplementary Data, Table S1). This effect is
the model is built using only the 2/3 part of the data to avoid overfitting to largely driven by an increase in call rate rather than any increase
the 1/3 part, which is used as validation data. The 1/3 part of the data was not in accuracy, e.g. from 29% up to 75% for HLA-DQB1 at a call
re-phased, however we established empirically that the phasing results from threshold of T = 0.9. At this threshold the total number of correctly
the internal haplotype imputation module of HLA*IMP are very similar to imputed alleles increases by 44% across all loci. At lower thres-
the results from IMPUTE. holds this number is typically, though not consistently, increased.
The data for the disease association example presented in this paper was
Note that much greater gains in accuracy are obtained by increasing
prepared by the WTCCC2 and is described elsewhere (Strange et al., 2010).
the size of the reference panel (see below).
2.5 HLA*IMP software implementation
HLA*IMP is implemented in C++ and Perl. It consists of a front-end and
3.2 Cross-validation experiment
a back-end. The front-end is designed to assist end users in preparing their The new reference set of over 2,500 samples was split in two parts
data – it has inbuilt modules for quality control, SNP strand alignment and and one of the two parts (2/3) was used to impute HLA types of
haplotype phasing. Users are guided through these steps in a wizard-like the remaining part (1/3). Imputations were validated at the haplo-
sequential manner (Figure 2). Output files from some popular genotype type level and at 4-digit (amino acid identity) / 2-digit (sharing of
callers, including PLINK (Purcell et al., 2007), Birdsuite (Korn et al., serotypical features) resolution (see Table 1). A call threshold of
2008) and CHIAMO (Marchini et al., 2007), can be read in directly, as
T = 0.7 on the modes of the posterior HLA type distributions was
well as a simple generic format. The back-end part, implemented as an
online web-service, carries out the computationally intensive parts of the
employed, as our experience suggests that T = 0.7 represents a good
imputation process. It can automatically process the files generated by the compromise between accuracy and call rate. At 2-digit resolution,
front-end and notifies the end-user via email of completed processes. As between 97 (HLA-C) and 99% (HLA-DQB1) of calls are correct (i.e.
the result of the parallelized SNP selection depends on the initial set of they agree with the lab-based types), at call rates between 98 and
available SNPs, we have pre-selected SNPs for some popular Affymetrix 100%. At 4-digit accuracy, call rates are from 95 (HLA-DRB1) to
3
Table 1. Accuracy and Call Rate for a 2/3 (training data) – 1/3 (validation data) cross-validation experiment, using a
call threshold of T = 0.7. User datasets are always imputed using the full set of training data, which should result in
greater accuracy.
Locus # validated Call Rate (2-digit) Accuracy (2-digit) Call Rate (4-digit) Accuracy (4-digit)
4
Table 2. P values and odds ratios for imputed
HLA-C*06:02 alleles and the most predictive
SNP from Strange et al. (2010).
Howie, B. N., Donnelly, P., and Marchini, J. (2009). A flexible Strange, A., Capon, F., Spencer, C. C., Knight, J., Weale, M. E.,
and accurate genotype imputation method for the next generation of Allen, M. H., Barton, A., Band, G., Bellenguez, C., Bergboer, J. G.,
genome-wide association studies. PLoS Genet, 5, e1000529. Blackwell, J. M., Bramon, E., Bumpstead, S. J., Casas, J. P., Cork,