
Single Nucleotide Polymorphisms (SNPs), Haplotypes, Linkage Disequilibrium, and the Human Genome

Gururaj p

Overview
Biological Background
Terminology
SNP related general information
SNP detection techniques
SNP Applications
References

Biological Background
How can researchers hope to identify and study all the changes that occur in so many different diseases? How can they explain why some people respond to treatment and not others?

SNPs are a key part of the answer to these questions

So what exactly are SNPs? How are they involved in so many different aspects of health?

What is a SNP?
A SNP is defined as a single base change in a DNA sequence that occurs in a significant proportion (more than 1 percent) of a large population.

Variations in Genome

Terminology
Polymorphism
Linkage Disequilibrium
- Correlation of character states among polymorphic sites
- Insufficient passage of time to randomize character states by meiotic recombination

Haplotype

Some Facts
In human beings, 99.9 percent of bases are the same. The remaining 0.1 percent makes a person unique.
- Different attributes / characteristics / traits: how a person looks, which diseases he or she develops.

These variations can be:
- Harmless (change in phenotype)
- Harmful (diabetes, cancer, heart disease, Huntington's disease, and hemophilia)
- Latent (variations found in coding and regulatory regions that are not harmful on their own; the change in each gene only becomes apparent under certain conditions, e.g. susceptibility to lung cancer)

SNP facts
SNPs are found in
- coding and (mostly) noncoding regions.

They occur with a very high frequency
- estimates range from about 1 in 1,000 bases to 1 in every 100 to 300 bases.

The abundance of SNPs and the ease with which they can be measured make these genetic variations significant. A SNP close to a particular gene can act as a marker for that gene. SNPs in coding regions may alter the structure of the protein made by that coding region.

SNPs may / may not alter protein structure

SNPs act as gene markers

SNP maps
Sequence the genomes of a large number of people.
Compare the base sequences to discover SNPs.
Generate a single map of the human genome containing all possible SNPs => SNP maps
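
The compare step above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any real SNP-mapping project: it assumes the sequences are already aligned to equal length, and the function name `find_snps`, the toy sequences, and the 25% threshold in the usage line are hypothetical (the 1% default mirrors the definition of a SNP given earlier).

```python
from collections import Counter

def find_snps(aligned_seqs, min_minor_freq=0.01):
    """Return (position, base_counts) for every aligned column whose
    second most common base reaches min_minor_freq in the sample."""
    n = len(aligned_seqs)
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        if len(counts) < 2:
            continue  # every individual carries the same base here
        minor_count = counts.most_common(2)[1][1]
        if minor_count / n >= min_minor_freq:
            snps.append((pos, dict(counts)))
    return snps

# Toy sample of four aligned sequences (made-up data).
samples = ["ACGTTA", "ACGTTA", "ACATTA", "ACGTTC"]
snp_positions = [pos for pos, _ in find_snps(samples, min_minor_freq=0.25)]
print(snp_positions)
```

With only four toy sequences a single variant base is already 25% of the sample; a real study needs a large population before the 1% cut-off is meaningful.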

SNP Profiles
The genome of each individual contains a distinct SNP pattern.
People can be grouped based on their SNP profile.
SNP profiles are important for identifying response to drug therapy.
Correlations might emerge between certain SNP profiles and specific responses to treatment.

Techniques to detect known Polymorphisms


Hybridization Techniques
- Microarrays
- Real-time PCR

Enzyme-based Techniques
- Nucleotide extension
- Cleavage
- Ligation

Reaction product detection and display

Comparison of Techniques used

Hybridization Techniques
Microarrays
- Sequencing by hybridization: utilizes a set of tiling oligonucleotides
- Somewhat complex pooling and processing of PCR amplicons, which are subsequently hybridized to a DNA microarray and visualized
- Theoretically capable of genotyping thousands of polymorphisms simultaneously
- Success rate of 97% (somewhat low for this kind of analysis)
- High false rates of 11-21%
- Design and fabrication of microarrays is expensive, hence users are confined to the set of genotypes established by the manufacturer

Real-Time PCR
- Utilizes TaqMan DNA probes to detect PCR products in real time
- The TaqMan probe contains a fluorescent reporter at the 5' end and a fluorescence resonance energy transfer (FRET) moiety at the 3' end, which quenches the fluorescent signal of the reporter.
- The probe sequence is complementary to the PCR amplicon and is designed to anneal at the extension temperature.
- During extension, the 5'-3' exonuclease activity of Taq DNA polymerase I cleaves the probe, emitting signal due to the separation of the reporter from the quencher.
- Polymorphism is determined solely by hybridization and not by the ability of the enzyme to discriminate.
- Because the enzyme does not confer specificity in detection, this technique is classified as hybridization-based.
- Depending on the optical thermocycler platform, 384 reactions can be monitored for each cycle without removing any sample
- Amenable to robotic automation.

Enzyme based Techniques


Nucleotide extension
- Among the simplest techniques for known polymorphism detection
- Exists in numerous variations (also known as minisequencing, SNuPE, GBA, APEX, AS-PE capture, FNC, TDI or PROBE); the assay typically involves the single-base extension of an oligonucleotide by a polymerase
- The oligonucleotide is designed to anneal immediately upstream of the polymorphism locus, and differentially labeled fluorescent dideoxynucleotides are utilized as substrates for polymerase extension.
- The fluorescent signal emitted corresponds to the nucleotide incorporated and thus the sequence of the polymorphism.
- Simplicity and accuracy in distinguishing between heterozygous and homozygous genotypes.
- Targets need to be PCR amplified, and the PCR reagents must then be removed.
- False negatives can occur due to mis-priming
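
The logic of single-base extension can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not a description of any commercial assay: the template and primer sequences are made up, the helper names (`incorporated_ddntp`, `call_genotype`) are hypothetical, and the 20% signal-share threshold used to call an allele is an arbitrary illustrative value.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def incorporated_ddntp(template, primer):
    """Template is written 5'->3'. The primer anneals to its reverse
    complement on the template, and the polymerase adds the single
    dideoxynucleotide that pairs with the next template base."""
    site = template.find(revcomp(primer))
    if site <= 0:
        raise ValueError("primer does not anneal next to a template base")
    return COMPLEMENT[template[site - 1]]

def call_genotype(signals, threshold=0.2):
    """signals maps each labeled base to its fluorescence intensity.
    Bases holding at least `threshold` of the total signal are called."""
    total = sum(signals.values())
    if total <= 0:
        return None
    called = sorted(b for b, s in signals.items() if s / total >= threshold)
    if len(called) == 1:
        return called[0] * 2          # homozygous
    if len(called) == 2:
        return "".join(called)        # heterozygous
    return None                       # ambiguous, no call

primer = "TGGTAC"                     # anneals just upstream of the SNP
allele_1 = incorporated_ddntp("TTGACGTACCA", primer)
allele_2 = incorporated_ddntp("TTGATGTACCA", primer)
print(allele_1, allele_2)
print(call_genotype({"A": 0.85, "G": 0.90, "C": 0.02, "T": 0.03}))
```

The two templates differ only at the base next to the primer binding site, so each allele yields a different labeled nucleotide, and a sample showing both signals is called heterozygous.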

Cleavage
- The Invader assay utilizes the exonuclease activity of Cleavase VIII on overlapping oligonucleotide strands.
- Two oligonucleotides, an invader probe and either a wild-type or mutant primary probe, overlap each other at a single nucleotide position on the template only if they are complementary to the polymorphism being queried.
- Cleavage occurs when the specific overlapping conformation is present, freeing an oligonucleotide referred to as a flap.
- This flap can be detected in a multiplex manner by size, mass or sequence.
- Commonly the flap participates in a second cleavage assay with another complementary target, causing release of a fluorescent signal.
- Advantage: the same flap may bind to many targets, generating a cascading signal amplification and thereby obviating the need for PCR amplification.
- Single-tube, one-step reaction.

Ligation
- One of the most specific assays, due to the high specificity of T4 ligase (oligo ligation assay) and the even higher specificity of thermostable ligases (ligation detection reaction, LDR)
- Two primers are designed to anneal adjacent to one another on the target of interest
- Generally, the upstream primer (discriminating primer) contains a fluorescent label at the 5' end, with the 3' nucleotide overlapping the polymorphic base.
- The fluorescent signal corresponds to the allele being queried at the 3' position of the discriminating primer
- When the discriminating primer forms a perfect complement with the target at the junction, the ligase covalently attaches the adjacent downstream primer (common primer)
- The resulting product is approximately twice as long as each of the individual primers and can be easily monitored for detection by means of capillary electrophoresis or by display on a microarray
- Advantage: very good sensitivity and specificity

Techniques to detect unknown Polymorphisms

Direct Sequencing
Microarray
Cleavage / Ligation
Electrophoretic mobility assays
Comparison of Techniques used

Direct Sequencing
Sanger dideoxy sequencing can detect any type of unknown polymorphism and its position when the majority of the DNA contains that polymorphism.
It misses polymorphisms and mutations when the DNA is heterozygous, so it has limited utility for analysis of solid tumors or pooled DNA samples because of its low sensitivity.
Once a sample is known to contain a polymorphism in a specific region, direct sequencing is particularly useful for identifying the polymorphism and its specific position.
Even if the identity of the polymorphism cannot be discerned in the first pass, multiple sequencing attempts have proven quite successful in elucidating sequence and position information.

Microarray
Variation detection arrays (VDAs) scan large sequence blocks and identify regions containing unknown polymorphisms. This methodology suffers from the same limitations in fabrication and design as observed in known polymorphism analysis, but has demonstrated much greater success in the context of unknown polymorphism detection for both SNP and tumor analysis. With respect to SNP analysis, a recent study of chromosome 21 successfully identified approximately half of the estimated number of common SNPs (frequency of 10-50%) across the entire chromosome. The experimental design required a sacrifice in sensitivity in order to minimize false positives. This explains the decrease in successful identification from 80% to 50%.

Cleavage/Ligation
Unknown polymorphisms can also be identified by the cleavage of mismatches in DNA-DNA heteroduplexes. This can be achieved either chemically [chemical cleavage method (CCM)] or enzymatically (T4 endonuclease VII, MutY cleavage or Cleavase). Typically, at least two samples are PCR amplified (one sample can be sufficient for solid tumor samples with high levels of stromal contamination), denatured and then hybridized to create DNA-DNA heteroduplexes of the variant strands. Enzymes cleave adjacent to the mismatch and products are resolved via gel or capillary electrophoresis. Unfortunately, the cleavage enzymes often nick complementary regions of DNA as well. This increases background noise, lowers specificity, and reduces the pooling capacity of the assay.

SNP Applications
Gene discovery and mapping
Association-based candidate polymorphism testing
Diagnostics/risk profiling
Response prediction
Homogeneity testing/study design
Gene function identification

High-resolution haplotype structure in the human genome


Mark J. Daly, John D. Rioux, Stephen F. Schaffner, Thomas J. Hudson & Eric S. Lander

Abstract
The authors describe a high-resolution analysis of the haplotype structure across 500 kb on chromosome 5q31 using 103 SNPs in a European-derived population. They developed an analytical model for linkage disequilibrium (LD) mapping based on high-resolution haplotype blocks, which offers a coherent framework for creating a haplotype map of the human genome.

Data used
A 500 kb region on human chromosome 5q31 that is implicated as containing a genetic risk factor for Crohn disease.
- Rioux, J. D. et al. Hierarchical linkage disequilibrium mapping of a susceptibility gene for Crohn's disease to the cytokine cluster on chromosome 5. Nature Genet. 29, 223-228 (2001)

103 common (>5% minor allele frequency) SNPs genotyped from a European-derived population. The study describes 258 chromosomes transmitted to individuals with Crohn disease and 258 untransmitted chromosomes.

Data used
The genotype data used in the study provide the highest-resolution picture of the patterns of genetic variation across a large genomic region, with a marker density of 1 SNP roughly every 5 kb.

Study
The focus is on identifying the underlying haplotypes. The authors' initial focus was on untransmitted control chromosomes; however, the same haplotype structure was seen in the chromosomes transmitted to individuals with Crohn disease, the only difference being that one of the haplotypes was enriched in frequency, reflecting its association with Crohn disease.

Study
It became evident during the study that the region could be largely decomposed into discrete haplotype blocks, each with a lack of diversity. As haplotype block structure was the same in both groups, they presented combined data from all chromosomes (transmitted and untransmitted).

Haplotype block structure on 5q31

a. Common haplotype patterns in each block of low diversity. Dashed lines indicate locations where more than 2% of all chromosomes are observed to transition from one common haplotype to a different one.

Haplotype block structure on 5q31

b. Percentage of observed chromosomes that match one of the common patterns exactly (total chromosomes = 258 transmitted + 258 untransmitted).

Haplotype block structure on 5q31

c. Percentage of each of the common patterns among 258 untransmitted chromosomes.

Haplotype block structure on 5q31

d. Rate of haplotype exchange between the blocks, as estimated by the hidden Markov model (HMM).

Haplotype block structure on 5q31

- The haplotype blocks span up to 100 kb and contain multiple (five or more) common SNPs.
- The blocks have only a few (2-4) haplotypes, which show no evidence of being derived from one another by recombination, and which account for nearly all chromosomes (>90%) in all cases in the sample.

Haplotype block structure on 5q31

For example, an 84 kb block shows only two distinct haplotypes that together account for 95% of the observed chromosomes (Table 1).
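
The kind of summary quoted for a block (how many common haplotypes it has and what share of chromosomes they cover) can be computed as in the sketch below. The data are invented to reproduce a 95% coverage figure and are not the study's genotypes; the function name `block_summary` and the 10% cut-off for calling a haplotype "common" are assumptions made for the example.

```python
from collections import Counter

def block_summary(haplotypes, min_freq=0.10):
    """haplotypes: one string per chromosome, giving its alleles at the
    SNPs of a single block. Returns the common haplotypes with their
    frequencies and the fraction of chromosomes they account for."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    common = {h: c / n for h, c in counts.items() if c / n >= min_freq}
    coverage = sum(common.values())
    return common, coverage

# 20 toy chromosomes: two common patterns plus one rare variant.
chromosomes = ["GGACA"] * 11 + ["AATTC"] * 8 + ["GGATC"]
common, coverage = block_summary(chromosomes)
print(common)
print(round(coverage, 2))
```

Here two haplotypes pass the cut-off and together cover 19 of the 20 chromosomes, which is the pattern of low within-block diversity the slides describe.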

Study
The discrete blocks are separated by intervals in which several independent historical recombination events seem to have occurred, giving rise to greater haplotype diversity for regions spanning the blocks. The most common recombination events are indicated in the previous figure by lines connecting the haplotypes. The recombination events appear to be clustered; multiple obligate exchanges must have occurred between most blocks, with little or no exchange within blocks.

Study
Although there is detectable recombination between blocks, it is modest enough for there to be clear long-range correlation (that is, LD) among blocks. The haplotypes at the various blocks can be readily assigned to one of the four ancestral long-range haplotypes. Indeed, 38% of the chromosomes studied carried one of these four haplotypes across the entire length of the region.

Study
Using an HMM, they developed an approach to define the block structure formally. The HMM simultaneously assigns every position along each observed chromosome to one of the four ancestral haplotypes and estimates the maximum-likelihood values of the historical recombination frequency (Θ) between each pair of markers.

Study
The quantity Θ provides a convenient summary of the degree of haplotype exchange across inter-marker intervals and relates directly to conventional measures of LD. In this study, Θ is estimated at less than 1% for 73 of the inter-marker intervals, 1-4% for 14 of the intervals, and more than 4% for only 9 of the intervals.

Methods: Individuals and marker selection

The individuals studied (Canadians from metropolitan Toronto of predominantly European descent) and the genotyping methodologies are described in the paper
- Rioux, J. D. et al. Hierarchical linkage disequilibrium mapping of a susceptibility gene for Crohn's disease to the cytokine cluster on chromosome 5. Nature Genet. 29, 223-228 (2001)

To ensure the ability to reconstruct multi-marker haplotypes, SNPs for haplotype analysis were selected from the set of markers for which full genotypes were available for all members. SNPs at CpG sites were not included, to prevent potential confounding of common haplotype patterns by recurrent mutations.

Methods: Haplotype counting
Haplotype percentages in the "Haplotype block structure on 5q31" figure were computed using haplotypes generated by the transmission disequilibrium test (TDT) implementation in Genehunter 2.0 (ref. 22 in the paper), followed by use of an EM-type algorithm (refs. 23, 24 in the paper) to include the minority of chromosomes that had one or more markers with ambiguous phase or where one marker was missing genotype data.

Methods: Hidden Markov model


The observation that, over long distances, most haplotypes can be described as belonging to one of a small number of common haplotype categories suggested the use of an HMM in which the haplotype categories were defined as states. The authors assigned observed chromosomes to those hidden states and simultaneously estimated the transition probability in each map interval by using an EM algorithm, making the simplifying assumption that there was a single transition probability for each map interval rather than allowing specific transition probabilities from each state to each state. The output of this method was a maximum-likelihood assignment to a haplotype category at each position and ML estimates of Θ, indicating how significantly recombination has acted to increase haplotype diversity in each map interval.
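
The assignment half of such a model can be sketched with a standard Viterbi decoder. This is an illustration under explicit assumptions, not the authors' implementation: the per-interval switch probabilities (`thetas`) are supplied by hand rather than estimated by EM, a switch is spread evenly over the other states, and the genotyping error rate `eps`, the two-state toy haplotypes and the function name `viterbi_assign` are all hypothetical.

```python
import math

def viterbi_assign(observed, ancestors, thetas, eps=0.01):
    """Most likely ancestral haplotype (state index) at each marker.
    observed  : string of alleles on one chromosome
    ancestors : list of equal-length allele strings (hidden states)
    thetas    : switch probability for each inter-marker interval,
                shared by all states (one value per interval)
    eps       : assumed probability that an allele mismatches its state"""
    k, n = len(ancestors), len(observed)

    def emit(state, i):
        match = ancestors[state][i] == observed[i]
        return math.log(1 - eps) if match else math.log(eps)

    score = [math.log(1 / k) + emit(s, 0) for s in range(k)]
    back = []
    for i in range(1, n):
        stay = math.log(1 - thetas[i - 1])
        switch = math.log(thetas[i - 1] / (k - 1))
        new_score, pointers = [], []
        for s in range(k):
            # best predecessor for state s at marker i
            prev = max(range(k),
                       key=lambda p: score[p] + (stay if p == s else switch))
            step = stay if prev == s else switch
            new_score.append(score[prev] + step + emit(s, i))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

ancestors = ["AAAAAA", "CCCCCC"]
thetas = [0.01, 0.01, 0.20, 0.01, 0.01]   # exchange concentrated mid-region
path = viterbi_assign("AAACCC", ancestors, thetas)
print(path)
```

The decoded path switches state once, in the interval with the elevated switch probability, which is how block boundaries show up in this kind of model. Estimating the `thetas` themselves would require the EM step described above, which this sketch omits.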

Discussion of Study
The region of chromosome 5q31 may be largely divided into discrete blocks of 10-100 kb; each block has only a few common haplotypes; and the haplotype correlation between blocks gives rise to long-range LD. Focusing on haplotype blocks greatly clarifies LD analyses. Once the haplotype blocks are identified, they can be treated as alleles and tested for LD (instead of single-marker analyses of LD).

Discussion of Study
In analogous fashion, the haplotype structure provides a crisp approach for testing the association of genomic segments with disease. By contrast, disease association studies traditionally involve testing individual SNPs in and around a gene. Once the haplotype blocks are defined, it is straightforward to examine a subset of SNPs that uniquely distinguish the common haplotypes in each block. This allows the common variation in a gene to be tested exhaustively for association with disease.

Discussion of Study
This approach provides a precise framework for creating a comprehensive haplotype map of the human genome. By testing a sufficiently large collection of SNPs, it should be possible to define all of the common haplotypes underlying blocks of LD. Once such a map is created, it will be possible to select an optimal reference set of SNPs for any subsequent genotyping study. This detailed understanding of common human variation represents an important step in the Human Genome Project.

Linkage Disequilibrium
Uses unrelated individuals
Good for fine-scale mapping because there has been greater opportunity for recombination to occur
Gives a map of loci that contribute to inherited genetic disorders
Marker states cannot be considered independent because they are related by distance and recombination, so an individual haplotype may not be the cause of disease, but rather a combination of several haplotypes in blocks

Linkage Disequilibrium
The greater the distance between genes, the greater the chance of recombination
The smaller the distance between genes, the smaller the chance of recombination
Knowing the above and observing inherited alleles, one can estimate the relative distance between genes
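
As a worked illustration of estimating distance from observed recombination, the sketch below converts a recombination fraction into map distance with Haldane's map function. Haldane's function is a standard textbook choice that assumes no crossover interference; the slides do not say which map function, if any, was intended, and the counts in the usage lines are invented.

```python
import math

def recombination_fraction(recombinants, total):
    """Observed share of offspring chromosomes that are recombinant."""
    return recombinants / total

def haldane_cm(r):
    """Map distance in centiMorgans for recombination fraction r,
    using Haldane's map function (assumes no interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1 - 2 * r)

def haldane_r(distance_cm):
    """Inverse: expected recombination fraction at a given distance."""
    return 0.5 * (1 - math.exp(-2 * distance_cm / 100))

r = recombination_fraction(recombinants=12, total=200)   # toy counts
print(round(haldane_cm(r), 2))        # short distances: cM is close to 100 * r
print(round(haldane_r(50), 3))        # at 50 cM the fraction is well below 0.5
```

For closely spaced genes the distance in cM is almost identical to the recombination percentage; the two diverge as distance grows, because multiple crossovers between the same pair of genes can cancel each other out.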

Measures of Linkage Disequilibrium


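cM - centiMorgans
1 cM corresponds to roughly a 1% chance of recombination between two genes per generation.
- Over long distances the relationship is no longer linear: the recombination fraction cannot exceed 50%, which is the value seen for genes that are far apart or unlinked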

Importance of Linkage Disequilibrium


Offers a way to measure the distance between genes
A measure of the non-random association between markers and disease mutations
Can be used to map disease genes, because markers in high LD with a disease mutation tend to lie close to it, with little historical recombination in between
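
The non-random association between two markers is commonly quantified with the coefficients D, D' and r². The sketch below computes them from counts of the four possible two-locus haplotypes; these are the standard population-genetics definitions rather than anything specific to these slides, and the function name `ld_measures` and the toy counts are assumptions for the example.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Pairwise LD between two biallelic loci (alleles A/a and B/b),
    from counts of the four haplotypes. Returns (D, D_prime, r_squared)."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    if p_A in (0, 1) or p_B in (0, 1):
        raise ValueError("both loci must be polymorphic")
    d = n_AB / n - p_A * p_B
    # Largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max
    r_squared = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r_squared

print(ld_measures(50, 0, 0, 50))     # complete LD
print(ld_measures(25, 25, 25, 25))   # alleles combine at random, no LD
print(ld_measures(40, 10, 10, 40))   # partial LD
```

When only two of the four haplotypes are present, both D' and r² equal 1; when alleles at the two loci combine at random, D is 0.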

Data Mining Applied to Linkage Disequilibrium Mapping


HPM - Haplotype Pattern Mining
A data mining method for LD-based gene mapping
Uses haplotypes as inputs, which can be obtained from programs such as GENEHUNTER
An extension of traditional association analysis
Searches for shared, flexible haplotype patterns and finds which ones are strongly associated with a disease
Uses a non-parametric statistical model, without any genetic models, based on the locations of the haplotype patterns

What we know
LD, the non-random association of haplotypes with a disease, is likely to be strongest around the disease susceptibility (DS) gene.
The DS locus is most likely to be where the strongest associations are.

Notation
The marker map M consists of k markers: M = (m1, ..., mk)
A haplotype pattern P on M is a vector (p1, ..., pk), where each pi is either an allele of mi or a wild-card (*)
P occurs in a haplotype vector H = (h1, ..., hk), which is simply the chromosome, if pi = hi or pi = * for every i
Example: the pattern P1 occurs in the chromosome PC
- P1 = (*, 2, 5, *, 3, *, *, *, *, *)
- PC = (4, 2, 5, 1, 3, 2, 6, 4, 5, 3)
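
The occurrence test in this notation is simple to express in code. The sketch below is a direct reading of the definition using the slide's own example vectors; the function names are hypothetical and the second chromosome is invented to show a non-match.

```python
WILDCARD = "*"

def occurs(pattern, haplotype):
    """True if the pattern occurs in the haplotype: every position is
    either a wild-card or carries exactly the haplotype's allele."""
    if len(pattern) != len(haplotype):
        raise ValueError("pattern and haplotype must cover the same markers")
    return all(p == WILDCARD or p == h for p, h in zip(pattern, haplotype))

def fixed_markers(pattern):
    """Indices of the markers that the pattern actually constrains."""
    return [i for i, p in enumerate(pattern) if p != WILDCARD]

P1 = ("*", 2, 5, "*", 3, "*", "*", "*", "*", "*")
PC = (4, 2, 5, 1, 3, 2, 6, 4, 5, 3)
other = (4, 2, 1, 1, 3, 2, 6, 4, 5, 3)   # differs from PC at the third marker

print(occurs(P1, PC), occurs(P1, other))
print(fixed_markers(P1))
```

The wild-card at the fourth marker sits between constrained markers, which is the kind of gap the method allows within a pattern.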

Issues in Shape of Haplotype Pattern


1. Length of the pattern
- Defined as the maximal distance between any 2 markers, measured in centiMorgans
- Extremely long sequences don't give us much information, so the size of P is constrained in HPM

2. Gaps in sequences
- Account for mutations, errors, missing data, and recombination
- Gap size and number can be controlled in HPM

Procedure
A depth-first search finds all haplotype patterns that exceed the lower-bound frequency threshold and meet the association measure
Calculate the frequency f(mi) of marker mi with respect to (M, H, Y, x), where Y = phenotype and x = positive association threshold
Markers with the highest frequencies are predicted to be in the area of the DS gene, assuming a DS gene is present
The prediction is made at the granularity of the marker density
Markers are ranked based on frequency
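
The marker-frequency idea can be illustrated with a deliberately simplified sketch. It is not the HPM algorithm: instead of a depth-first search over patterns with gaps, it only enumerates contiguous windows of up to `max_len` markers, and it uses a plain 2x2 chi-square statistic as the association measure with an assumed threshold x = 3.84. The data, function names and parameters are all invented for the example.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table:
    a, b = cases / controls with the pattern; c, d = without it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def marker_frequencies(haplotypes, is_case, max_len=2, x=3.84):
    """Fraction of positively associated patterns that include each marker."""
    k = len(haplotypes[0])
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case
    # Candidate patterns: every contiguous window seen in some haplotype.
    patterns = {(start, h[start:start + length])
                for h in haplotypes
                for length in range(1, max_len + 1)
                for start in range(k - length + 1)}
    hits, total = [0] * k, 0
    for start, alleles in patterns:
        end = start + len(alleles)
        a = sum(1 for h, y in zip(haplotypes, is_case)
                if y and h[start:end] == alleles)
        b = sum(1 for h, y in zip(haplotypes, is_case)
                if not y and h[start:end] == alleles)
        c, d = n_case - a, n_ctrl - b
        # Keep patterns enriched in cases and above the threshold.
        if a * d > b * c and chi_square(a, b, c, d) >= x:
            total += 1
            for i in range(start, end):
                hits[i] += 1
    return [h / total if total else 0.0 for h in hits]

# Toy data: cases and controls differ only at the third marker (index 2).
cases = ["11211", "12212", "21221", "22222", "11221", "22212"]
controls = ["11111", "12112", "21121", "22122", "11121", "22112"]
freqs = marker_frequencies(cases + controls, [True] * 6 + [False] * 6)
print(freqs)
print(freqs.index(max(freqs)))
```

Every associated pattern in the toy data contains the third marker, so it receives the highest frequency and would be the predicted location. A fuller implementation would add gapped patterns, a pattern-length limit in centiMorgans, and pruning during the search.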

Results: Simulated Data


A founder population that grows from 300 to ~100,000 in 500 years was simulated in the Populus simulator package
Simulated data were used because they are cheaper and can be easily manipulated

List of the 11 most strongly disease-associated haplotype patterns in the simulated data
The chromosome has 101 markers
Dashed line indicates the true gene location

Frequency histogram of the previous slide's data, but with patterns exceeding the threshold of association
Dashed line indicates the true gene location
Marker 5 now has the highest frequency

The actual vs. predicted locations for 100 data sets

a) Mutation-carrying chromosomes, denoted by A
b) Sample founder population size
c) Corrupted data
d) Missing data

Real Data: HLA complex


The data consisted of affected sib-pair families with type 1 diabetes from the UK that were genotyped for 25 markers
The markers spanned 14 Mb and covered the entire HLA complex
The HLA-DQB1 and HLA-DRB1 loci, which are located in the middle of these 14 Mb, are known to be the primary factors for type 1 diabetes
A random sample of 200 was drawn from the 385 available, to compare with the simulated results

Frequency vs. map location of HLA markers
Solid line: HPM-calculated frequencies
Dashed line: background LD frequencies
Vertical lines indicate true locations of markers

Discussion of HPM Technique


Robust to lost and erroneous data
Applicable to complex gene mapping
Works well with small data sets, but accuracy increases as the amount of data increases
Works with real and simulated data
Does not include any previously derived models

References
Introduction to SNPs: Discovery of Markers of Disease
SNP seeking long term association with complex diseases
SNP mapping using Genome-wide Unique Sequences
The Structure of Haplotypes Blocks in Human Genome
Using Haplotype blocks to map human complex trait loci
High Resolution haplotype structure in human genome
Detection of regulatory variation in mouse genes
http://linkage.rockefeller.edu/wli/lld.html
http://statwww.epfl.ch/davison/teaching/Microarrays/snp.ppt
http://www.cs.helsinki.fi/u/htoivone/pubs/ajhg_2000.pdf
Resolution of Haplotypes and Haplotype Frequencies from SNP Genotype of Pooled Samples
http://www.journals.uchicago.edu/AJHG/journal/issues/v71n6/024386/024386.html
http://www.ncbi.nlm.nih.gov/entrez/utils/fref.fcgi?http://www.sciencemag.org/cgi/pmidlookup?view=full&pmid=11452081
http://www.genome.gov/10001665
http://walnut.usc.edu/~magnus/papers/tig.pdf
