
Statistics for Association Studies
DEPT. OF ANIMAL GENETICS & BREEDING
COLLEGE OF VETERINARY SCIENCE AND ANIMAL HUSBANDRY
ANAND AGRICULTURAL UNIVERSITY
ANAND - 388 001.
Maul! U"a#$%a%
REG. NO.&- O'-1'0(-)010
AGB-*+1
MAJOR ADVISOR
DR. C. G. Joshi
Professor and Head
Dept. of Animal Biotechnology
MINOR ADVISOR
DR. D. N. Rank
Professor and Head
Dept. of Animal Genetics & Breeding
POST-GRADUATE
SEMINAR
ON
Flow of Presentation
• Definition, need and scope
• Linkage and association studies
• SNP quality control, missing data imputation
• Single SNP, multiple SNP
• Haplotype models
• Bayesian introduction
• Methods to control
• Multiple correction
• Conclusion
Increasing Trend...
Nature Reviews Genetics carried nearly 30 review articles related to association analysis in one way or another (up to 2007).
Lancet published a series of review and introductory articles in 2005 on genetic epidemiology with association as the major component.
Annual Review journals published many reviews that can be linked to association studies.
(Lee, 2007)
[Figure: Published GWA Reports, 2005 - 6/2011. Horizontal axis: Calendar Quarter; vertical axis: Total Number of Publications; value label: 951.]
(http://www.genome.gov/gwastudies/)
Definition
An association between a SNP and a phenotype that is present in the population from which a sample is taken.
(Stephens and Balding, 2009)
Genetic association studies aim to detect association between one or more genetic polymorphisms and a trait, which might be some quantitative characteristic or a discrete attribute or disease.
(Cordell and Clayton, 2005)
Genetic association studies assess correlations between genetic variants and trait differences on a population scale.
(Cardon and Bell, 2001)
Example of Association in Case-Control Study
Control : 0 1 1 1 1 1 0 1 0 2 1 2 2 0 1 0 0 0 1 1
          2 0 1 1 1 2 0 0 0 1 0 1 1 0 1 1 0 1 0 0
          2 0 1 2 2 0 1 2 1 0 0 1 1 0 1 0 0 1 1 1
          1 2 1 1 2 1 1 1 1 0 1 1 1 0 0 2 2 2 0 2
Cases   : 1 1 2 1 0 1 2 1 1 1 1 2 1 2 1 2 1 2 1 1
          2 2 1 2 0 1 0 0 0 1 2 2 1 2 1 2 1 0 2 1
          0 1 1 0 0 2 1 0 0 2 1 1 1 2 1 1 2 0 1 0
          0 1 1 0 0 1 0 2 2 1 1 1 1 2 0 1 2 1 1 2
Goal : To identify the genetic basis of given phenotypes or diseases
(Ref. : Oxford University Website, http://www.stats.ox.ac.uk/~mcvean/gwa4.pdf)
Linkage and Association
Association differs from linkage in that the same allele (or alleles) is associated with the trait in a similar manner across the whole population, while linkage allows different alleles to be associated with the trait in different families.
(Cardon and Bell, 2001)
(Ref. : Oxford University Website, http://www.stats.ox.ac.uk/~mcvean/gwa4.pdf)
Causes of association
An observed association can arise because:
• the polymorphism has a causal role (direct association);
• the polymorphism has no causal role but is associated with a nearby causal variant (indirect association); or
• the association is due to some underlying stratification or admixture of the population (confounded association).
(Cordell and Clayton, 2005)
Types of genetic association
• Candidate polymorphism
• Candidate gene
• Fine mapping
• Genome wide association
(Stephens and Balding, 2009)
Designs for genetic association studies
Following are the different types of designs of genetic association studies:

Design                                     Statistical analysis
1) Cross sectional                         Logistic / linear regression, chi-square test
2) Cohort studies                          Survival analysis method
3) Case control                            Logistic / linear regression, chi-square test
4) Extreme value                           Linear regression & permutation approach
5) Case-parent triad                       TDT, logistic, log-linear method
6) Case-parent-grandparent septets         Log-linear methods
7) General pedigree                        PDT, family based association test, TDT
8) Case only                               Logistic regression, chi-square
9) DNA-pooling                             Variance component estimation
(Cordell and Clayton, 2005)
Test of Association
Single SNP association
• Chi-square test
• Armitage test
• Fisher's exact test
• General linear model
• Logistic regression models
(Balding, 2006)
Test of Association
Multiple SNP association
• MDR
• SNP set association
• Logistic regression
• Haplotype based regression model
SNP Quality Control
The quality control (QC) filtering of single nucleotide polymorphisms (SNPs) is an important step, especially in genome-wide association studies, to minimize potential false findings.
SNP QC commonly uses expert-guided filters based on QC variables to remove SNPs with insufficient genotyping quality, such as:
• Hardy-Weinberg equilibrium
• missing proportion (MSP)
• minor allele frequency (MAF)
Following are some of the criteria for SNP QC:
(i) percentage of SNPs excluded due to low quality
(ii) inflation factor of the test statistics (λ)
(iii) number of false associations found in the filtered dataset
(iv) number of true associations missed in the filtered dataset.
(Pongpanich et al., 2010)
SNP quality control (QC) is commonly safeguarded by 'supervised' (i.e. expert-guided) filters to exclude low-quality SNPs.
The 'supervised' expert filters aim to remove SNPs that fall into the extremes of QC variables including Hardy-Weinberg equilibrium (HWE), missing proportion (MSP) and minor allele frequency (MAF).
The rationale is clear:
• extreme deviation from HWE is typically used to identify gross genotyping error (Teo et al., 2007)
• a high MSP indicates poor genotype probe performance and low genotyping accuracy (Neale and Purcell, 2008; WTCCC, 2007)
• SNPs with low MAF are more prone to error, as fewer samples would be within a genotype cluster and most clustering-based calling algorithms do not perform well with rare alleles (Neale and Purcell, 2008; Teo, 2008)
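The MSP and MAF filters described above can be sketched in a few lines of Python. This is only an illustration: the function name `snp_qc_pass`, the genotype coding (0/1/2 copies of one allele, `None` for a missing call) and the thresholds (5% missing, 1% MAF) are assumptions made for the example, not values taken from the cited papers.

```python
def snp_qc_pass(genotypes, max_missing=0.05, min_maf=0.01):
    """Return True if a SNP passes the missing-proportion and MAF filters.

    genotypes: list of 0/1/2 allele counts, None for a missing call.
    The thresholds are illustrative assumptions.
    """
    n = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_prop = 1 - len(called) / n
    freq = sum(called) / (2 * len(called))   # frequency of the counted allele
    maf = min(freq, 1 - freq)                # minor allele frequency
    return missing_prop <= max_missing and maf >= min_maf


print(snp_qc_pass([0, 1, 2, 1, 0, 1, 2, 1, 0, 1]))   # common SNP, no missing calls
print(snp_qc_pass([0] * 99 + [None]))                # monomorphic, so MAF is 0
```

An HWE filter would be added as a third condition in the same way, using an exact or chi-square HWE test on the control genotypes.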
Missing Genotypic Imputation
For single-SNP analyses, a few missing genotypes are not much of a problem.
For multipoint SNP analyses, missing data can be more problematic because many individuals might have one or more missing genotypes.
One convenient solution is data imputation: replacing missing genotypes with predicted values that are based on the observed genotypes at neighbouring SNPs.
Genotype imputation is the term used to describe the process of predicting or imputing genotypes that are not directly assayed in a sample of individuals.
(Balding, 2006)
There are several distinct scenarios in which genotype imputation is desirable, but the term now most often refers to the situation in which a reference panel of haplotypes at a dense set of SNPs is used to impute into a study sample of individuals that have been genotyped at a subset of the SNPs.
(Marchini and Howie, 2010)
Imputation methods work by combining a reference panel of individuals genotyped at a dense set of polymorphic sites (usually single-nucleotide polymorphisms, or "SNPs") with a study sample collected from a genetically similar population and genotyped at a subset of these sites. (Howie et al., 2009)
Imputation methods either seek a 'best' prediction of a missing genotype, such as a maximum likelihood estimate (single imputation), or randomly select it from a probability distribution (multiple imputation). (Balding, 2006)
The goal is to predict the genotypes at the SNPs that are not directly genotyped in the study sample.
Imputation of Genotypes
Panel labels: Observed genotypes; Imputation reference; Some algorithms; Posterior probability; Predicted genotypes

SNP    Possible genotypes   Observed genotype   Posterior probability
A/C    AA / AC / CC         AC                  0.0   1.0   0.0
A/T    AA / AT / TT         --                  0.2   0.5   0.3
G/T    GG / GT / TT         GT                  0.0   1.0   0.0
A/G    AA / AG / GG         AA                  1.0   0.0   0.0
A/C    AA / AC / CC         --                  0.1   0.0   0.9
C/G    CC / CG / GG         GG                  0.0   0.0   1.0
(Marchini and Howie, 2010)
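Posterior genotype probabilities of the kind shown in the imputation figure are usually summarised before association testing. The sketch below, under the assumption that the three probabilities refer to 0, 1 and 2 copies of the second-listed allele, returns the most probable ("best guess") genotype and the expected allele count (dosage). The function name `summarize_posterior` is hypothetical.

```python
def summarize_posterior(probs):
    """probs: (P(0 copies), P(1 copy), P(2 copies)) of the counted allele.

    Returns the most probable genotype code and the expected allele dosage.
    """
    best_guess = max(range(3), key=lambda g: probs[g])
    dosage = probs[1] + 2 * probs[2]
    return best_guess, dosage


# An untyped genotype with posterior probabilities 0.2, 0.5, 0.3
print(summarize_posterior((0.2, 0.5, 0.3)))
```

Using the dosage rather than the best guess keeps some of the imputation uncertainty in the downstream test.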
Genotype Imputation Methods     How it works?
IMPUTE v1                       Extension of HMM
IMPUTE v2                       More flexible than v1; SNPs divided into two sets, Set T & Set U; uses HMM & MCMC
fastPHASE                       Uses the observation that haplotypes tend to cluster into groups of closely related or similar haplotypes; HMM
BIMBAM                          Bayesian approach
MACH                            HMM; iteratively assigns haplotypes to the genotypes based on the converging model
BEAGLE (CE)                     Graphical model of a set of haplotypes; iteration method
PLINK (CE), SNPSTAT,
UNPHASED, TUNA (CE)             SNP tagging approach
(Marchini and Howie, 2010)
Uses of Imputation
• Boosting power
• Fine mapping
• Meta analysis
• Imputation of untyped variation
• Imputation of non-SNP variation
• Correction of genotyping variation
(Marchini and Howie, 2010)
Single Locus association analysis
Pearson goodness-of-fit test
Categorical data may be displayed in contingency tables.
The chi-square statistic compares the observed count in each table cell to the count which would be expected under the assumption of no association between the row and column classifications.
The chi-square statistic may be used to test the hypothesis of no association between two or more groups, populations, or criteria.
For a single SNP with alleles A and B tested in a case-control study, the data generated consist of six counts of the numbers of genotypes (AA, AB and BB) in cases and controls:

            AA           AB           BB           Total
Cases       a            b            c            n_case (R_1)
Controls    d            e            f            n_cont (R_2)
Total       n_AA (C_1)   n_AB (C_2)   n_BB (C_3)   N
(a) Full genotype table for a general genetic model

Observed value for AA genotypes in cases: $O_1 = a$
Expected value for AA genotypes in cases: $E_1 = \frac{R_1 C_1}{N}$
Chi-square statistic: $\chi^2 = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i}$
(b) Dominant model: allele B increases risk
            AA        AB+BB
Case        a         b+c
Control     d         e+f

(c) Recessive model: two copies of allele B required for increased risk
            AA+AB     BB
Case        a+b       c
Control     d+e       f

(d) Multiplicative model: r-fold increased risk for AB, r^2 increased risk for BB. Analysed by allele, not by genotype
            A         B
Case        2a+b      b+2c
Control     2d+e      e+2f
(Lewis, 2002)
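The collapsed tables for the general, dominant, recessive and multiplicative (allelic) models can all be tested with the same Pearson chi-square computation. The following is a minimal sketch using only the standard library; the helper names `chi_square`, `p_value` and `collapse` are hypothetical, and the counts in the usage line are made up for illustration.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n      # expected count under no association
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value(stat, df):
    """Upper-tail chi-square probability (closed forms for df = 1 and df = 2)."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("only df = 1 or 2 are handled in this sketch")


def collapse(a, b, c, d, e, f):
    """Build the four tables from case counts (a, b, c) and control counts (d, e, f)."""
    return {
        "general":   [[a, b, c], [d, e, f]],
        "dominant":  [[a, b + c], [d, e + f]],
        "recessive": [[a + b, c], [d + e, f]],
        "allelic":   [[2 * a + b, b + 2 * c], [2 * d + e, e + 2 * f]],
    }


for model, table in collapse(10, 20, 30, 30, 20, 10).items():
    stat, df = chi_square(table)
    print(model, round(stat, 2), df, p_value(stat, df))
```

The sketch assumes every row and column total is non-zero; with sparse tables an exact test is the safer choice.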
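Exact test
Consider a sample of SNP genotypes for N unrelated diploid individuals measured at an autosomal locus.
$n_A$ : number of copies of the rare allele
$n_B$ : number of copies of the common allele
$n_{AA} = (n_A - n_{AB})/2$, $n_{BB} = (n_B - n_{AB})/2$

            AA       AB       BB       Total
Cases       n_AA     n_AB     n_BB     N

Possible arrangements of alleles in the sample: $\frac{(2N)!}{n_A!\, n_B!}$
Arrangements with exactly $n_{AB}$ heterozygotes: $\frac{2^{n_{AB}}\, N!}{n_{AA}!\, n_{AB}!\, n_{BB}!}$
Thus, under the assumption of HWE, the probability of observing exactly $n_{AB}$ heterozygotes in a sample of N individuals with $n_A$ minor alleles is
$P(N_{AB} = n_{AB} \mid N, n_A) = \frac{2^{n_{AB}}\, N!}{n_{AA}!\, n_{AB}!\, n_{BB}!} \times \frac{n_A!\, n_B!}{(2N)!}$
This equation holds for each possible number of heterozygotes, $n_{AB}$.
(Wigginton et al., 2005)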
The expression for $P(N_{AB} = n_{AB} \mid N, n_A)$ given in the equation leads to natural tests for HWE.
One-sided tests:
• Deficit of heterozygotes: $P_{low} = P(N_{AB} \le n_{AB} \mid N, n_A)$ (inbreeding, stratification)
• Excess of heterozygotes: $P_{high} = P(N_{AB} \ge n_{AB} \mid N, n_A)$ (genotyping error)
In each case, the statistic can be calculated by simply summing the equation over all possible values of $N_{AB}$ that are no higher (for $P_{low}$) or no lower (for $P_{high}$) than the value observed in the actual data.
(Wigginton et al., 2005)
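The one-sided exact HWE tests can be sketched directly from the probability of each feasible heterozygote count. The code below is an illustrative implementation rather than the authors' published algorithm: it evaluates the factorial formula on the log scale with `math.lgamma`, and the function names are hypothetical.

```python
import math


def hwe_het_probs(n, n_a):
    """P(N_AB = k | N = n, n_A = n_a) for every feasible heterozygote count k."""
    n_b = 2 * n - n_a
    log_const = (math.lgamma(n + 1) + math.lgamma(n_a + 1)
                 + math.lgamma(n_b + 1) - math.lgamma(2 * n + 1))
    probs = {}
    # k must have the same parity as n_a and cannot exceed either allele count
    for k in range(n_a % 2, min(n_a, n_b) + 1, 2):
        n_aa = (n_a - k) // 2
        n_bb = (n_b - k) // 2
        log_p = (log_const + k * math.log(2)
                 - math.lgamma(n_aa + 1) - math.lgamma(k + 1)
                 - math.lgamma(n_bb + 1))
        probs[k] = math.exp(log_p)
    return probs


def hwe_one_sided(n_aa, n_ab, n_bb):
    """Return (P_low, P_high) for the observed genotype counts."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab
    probs = hwe_het_probs(n, n_a)
    p_low = sum(p for k, p in probs.items() if k <= n_ab)
    p_high = sum(p for k, p in probs.items() if k >= n_ab)
    return p_low, p_high


print(hwe_one_sided(5, 11, 84))   # hypothetical counts: N = 100, n_A = 21
```

A small `P_low` points to a deficit of heterozygotes and a small `P_high` to an excess.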
Control genotypes should be in Hardy-Weinberg equilibrium, provided the population they are selected from is random mating and is large in size.
Suppose the population frequency of allele A is p and that of allele B is q = 1 - p; then the genotypes AA, AB and BB should have frequencies $p^2$, $2pq$ and $q^2$.
Provided the controls are in HWE, the cases may then be tested. If the SNP has a true genetic effect that does not follow a multiplicative model, the cases will not be in HWE (although again, the test has little power to detect small departures from HWE). If the cases are in HWE, the data may be analysed by allele counting, as any genetic effect is consistent with a multiplicative model.
A significant result showing that controls are not in Hardy-Weinberg equilibrium (HWE) could arise because of:
• random chance
• genotyping problems
• heterogeneous population
(Lewis, 2002)
The Odds Ratio - a Measure of Association
A useful statistic for measuring the level of association in contingency tables is the odds ratio.
If the odds are equal, their ratio equals one. A sample estimator of the odds ratio is denoted OR.
Odds Ratio: A measurement of association that is commonly used in case-control studies. It is defined as the odds of exposure to the susceptible genetic variant in cases compared with that in controls. If the odds ratio is significantly greater than one, then the genetic variant is associated with the disease.
(Wang et al., 2005)
Log odds ratio: $\mathrm{LOD} = \ln(\mathrm{OR})$

Confidence Interval & Interpretation
The standard error (SE) of the log odds ratio is needed to find the confidence interval for testing the null hypothesis of no association:
CI for OR = $\mathrm{OR} \pm 1.96 \times \mathrm{OR} \times \mathrm{SE}$
CI for LOD = $\mathrm{LOD} \pm 1.96 \times \mathrm{SE}$
The SNP has no influence on disease if the 95% CI for OR includes '1' or the CI for LOD includes '0'.
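As a worked illustration of the interval rule, the sketch below computes the odds ratio and its 95% confidence interval for a 2 x 2 table. It assumes Woolf's standard error for the log odds ratio, sqrt(1/a + 1/b + 1/c + 1/d), which the slides do not state, and it builds the interval on the log scale and exponentiates it, a common alternative to the symmetric interval around OR. The function name and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Cases: a exposed, b unexposed. Controls: c exposed, d unexposed."""
    or_hat = (a * d) / (b * c)
    lod = math.log(or_hat)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # Woolf's SE of log(OR), an assumption here
    lod_ci = (lod - z * se, lod + z * se)
    or_ci = (math.exp(lod_ci[0]), math.exp(lod_ci[1]))
    return or_hat, lod_ci, or_ci


or_hat, lod_ci, or_ci = odds_ratio_ci(20, 10, 10, 20)
print(or_hat, lod_ci, or_ci)
```

If the log-scale interval excludes 0, the exponentiated interval excludes 1, so the two readings of the rule always agree.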
Armitage's Trend test
The disadvantages arising from population stratification and confounding factors are overcome, to some extent, by applying Armitage's trend test, as suggested by Armitage (1955), Sasieni (1997), and Schaid and Jacobsen (1999).
There are three common choices of scoring system:
1) co-dominant score: $x_0 = 0$, $x_1 = 1$, and $x_2 = 2$;
2) dominant score: $x_0 = 0$, $x_1 = 1$, and $x_2 = 1$;
3) recessive score: $x_0 = 0$, $x_1 = 0$, and $x_2 = 1$.
Here, the names of the scoring systems refer to the minor allele "m".
(Fang et al., 2009)

Genotypes   MM       Mm       mm       Total
Case        n_10     n_11     n_12     N_1
Control     n_00     n_01     n_02     N_0
Total       N_+0     N_+1     N_+2     N
Score       x_0      x_1      x_2

Two-sided alternative hypothesis:
• Genotypes at a SNP are associated with the disease of interest
• Test statistic :

One-sided alternative hypothesis :
• Minor allele is positively associated with disease
OR
• Major allele is positively associated with disease
• Test statistic :
(Fang et al., 2009)
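For the two-sided case, a commonly used form of the Cochran-Armitage trend statistic can be computed from the 2 x 3 genotype table and a score vector. The sketch below uses that standard textbook form, which may differ in notation from the statistic of Fang et al.; the function name and example counts are hypothetical.

```python
import math


def armitage_trend(cases, controls, scores=(0, 1, 2)):
    """Two-sided Cochran-Armitage trend test for genotype counts (MM, Mm, mm).

    Returns the chi-square statistic (1 df) and its p-value.
    """
    n1, n0 = sum(cases), sum(controls)
    n = n1 + n0
    col = [r + s for r, s in zip(cases, controls)]          # genotype totals
    sx_case = sum(x * r for x, r in zip(scores, cases))
    sx = sum(x * c for x, c in zip(scores, col))
    sxx = sum(x * x * c for x, c in zip(scores, col))
    chi2 = n * (n * sx_case - n1 * sx) ** 2 / (n1 * n0 * (n * sxx - sx ** 2))
    return chi2, math.erfc(math.sqrt(chi2 / 2))


print(armitage_trend([10, 20, 30], [30, 20, 10]))            # additive pattern
print(armitage_trend([20, 40, 20], [30, 20, 30]))            # heterozygote-only effect
```

Passing `scores=(0, 1, 1)` or `scores=(0, 0, 1)` gives the dominant and recessive versions. The second call returns a statistic of zero even though the genotype distributions differ, which illustrates the loss of power when risks are far from additive.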
Figure 2 | Armitage test of single-SNP association with case-control outcome.
The dots indicate the proportion of cases, among cases and controls combined, at each of three SNP genotypes (coded as 0, 1 and 2), together with their least-squares line.
The Armitage test corresponds to testing the hypothesis that the line has zero slope. Here, the line fits the data reasonably well as the heterozygote risk estimate is intermediate between the two homozygote risk estimates; this corresponds to additive genotype risks.
The test has good power in this case but power is reduced by deviations from additivity.
In an extreme scenario, if the two homozygotes have the same risk but the heterozygote risk is different (overdominance), then the Armitage test will have no power for any sample size even though there is a true association.
(Balding, 2006)
Transmission Disequilibrium Test
The TDT tests for both linkage and association in families with observed transmissions from parents to affected offspring (Spielman et al., 1993).
It was originally developed to test for linkage in the presence of association, but its most common usage is now to test for association in the presence of linkage, since it is robust against population stratification.
The TDT tests for distortion in transmission of alleles from a heterozygous parent to an affected offspring.
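In its usual form the TDT reduces to a McNemar-type statistic on the transmissions from heterozygous parents. The sketch below assumes that standard form, (b - c)^2 / (b + c) with 1 degree of freedom, where b and c are the numbers of times the allele of interest was and was not transmitted; the function name and counts are hypothetical.

```python
import math


def tdt(transmitted, untransmitted):
    """McNemar-type TDT statistic (1 df) and p-value.

    transmitted / untransmitted: counts of heterozygous parents who did / did not
    pass the allele of interest to an affected offspring.
    """
    b, c = transmitted, untransmitted
    chi2 = (b - c) ** 2 / (b + c)
    return chi2, math.erfc(math.sqrt(chi2 / 2))


print(tdt(60, 40))
```

Under no transmission distortion b and c are expected to be equal, so the statistic is close to zero.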
(Lewis, 2002)
Continuous outcomes: Linear Regression (GLM)
Linear models are used to study how a quantitative variable depends on one or more predictors or explanatory variables.
The predictors themselves may be quantitative or qualitative.
(Rodriguez, 2007)
$y = \beta_0 + \beta_1 x + \varepsilon$
where:
$y$ = dependent variable
$x$ = independent variable
$\beta_0, \beta_1$ = regression parameters
$\varepsilon$ = random error
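For a quantitative trait, the regression parameters can be estimated by ordinary least squares with the genotype code (0, 1, 2) as the predictor. This is a minimal sketch with made-up data; the function name `fit_line` is hypothetical and no standard errors or tests are computed.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # slope: change in trait per copy of the allele
    b0 = my - b1 * mx         # intercept: expected trait for genotype code 0
    return b0, b1


genotype = [0, 1, 2, 0, 1, 2]
trait = [1.0, 3.0, 5.0, 1.0, 3.0, 5.0]
print(fit_line(genotype, trait))
```

A test of association then asks whether the slope differs from zero.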
Software used : SAS 8.02 (PROC GLM)
Generalized Linear Model
Generalized linear models (GLMs) are a large class of statistical models for relating responses to linear combinations of predictor variables, including many commonly encountered types of dependent variables and error structures as special cases.
(Jackman, 2002)
Advantages of using GLMs:
• No need to transform the data to normality
• GLMs unify a wide variety of statistical methods.
A GLM generalizes ordinary regression models in two ways: first, it allows Y to have a distribution other than the normal; second, it allows modelling some function of the mean.
Both generalizations are important for categorical data.
(Agresti, 2007)
GLMs for binary data
• Logit model: logit link
• Probit model: probit link (transform to 'Z' scores from the standard normal distribution)
Logit for single SNP
Each subject in our sample consists of a $(y_i; x_i)$ pair, where $y_i$ is case/control status (1/0) and $x_i$ (0, 1, 2) is the genotype at the typed locus:

Genotype   x_i   Odds           Parameters
aa         0     α              β_0
Aa         1     α(1 + a)       β_1
AA         2     α(1 + a)^2     β_2
(Ref. : Oxford University Website, http://www.stats.ox.ac.uk/~mcvean/gwa4.pdf)
Now the transformation $\operatorname{logit}(\pi) = \log\left(\frac{\pi}{1 - \pi}\right)$ is applied to $\pi_i$, the disease risk of the i-th individual.
The value of $\operatorname{logit}(\pi_i)$ is equated to either $\beta_0$, $\beta_1$ or $\beta_2$, according to the genotype of individual i ($\beta_1$ for heterozygotes).
The likelihood-ratio test of this general model against the null hypothesis $\beta_0 = \beta_1 = \beta_2$ has 2 d.f. and, for large sample sizes, is equivalent to the Pearson 2-d.f. test.
Users can improve the power to detect specific disease risks, at the cost of lower power against some other risk models, by restricting the values of $\beta_0$, $\beta_1$ and $\beta_2$.
Tests for recessive or dominant effects can be obtained by requiring that $\beta_0 = \beta_1$ or $\beta_1 = \beta_2$.
(Balding, 2006)
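Because the general model has one parameter per genotype, its maximum likelihood fit is simply the observed proportion of cases within each genotype, and the 2-d.f. likelihood-ratio test can be written without an iterative fitting routine. The sketch below relies on that observation; the function names and counts are hypothetical, and every genotype class is assumed to contain at least one subject.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def binom_loglik(cases, total, p):
    """Binomial log-likelihood of `cases` among `total` at risk p (constant dropped)."""
    ll = 0.0
    if cases:
        ll += cases * math.log(p)
    if total - cases:
        ll += (total - cases) * math.log(1 - p)
    return ll


def genotype_lrt(cases, controls):
    """Likelihood-ratio test of beta_0 = beta_1 = beta_2 for genotype codes 0, 1, 2."""
    totals = [r + s for r, s in zip(cases, controls)]
    p_null = sum(cases) / sum(totals)
    ll_null = sum(binom_loglik(r, t, p_null) for r, t in zip(cases, totals))
    ll_full = sum(binom_loglik(r, t, r / t) for r, t in zip(cases, totals))
    stat = max(2 * (ll_full - ll_null), 0.0)
    # fitted beta_g = logit of the case proportion (undefined at 0 or 1)
    betas = [logit(r / t) if 0 < r < t else None for r, t in zip(cases, totals)]
    return stat, math.exp(-stat / 2), betas     # exp(-x/2) is the 2-df chi-square tail


print(genotype_lrt([10, 20, 30], [30, 20, 10]))
```

For these counts the statistic is close to the Pearson 2-d.f. chi-square value of 20 for the same table, in line with the large-sample equivalence noted above.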
Logistic Regression of Melanoma status on Genotype

Risk Factor                              Odds Ratio   95% CI       P Value
Models without covariate
SNP: no. of copies of "T" allele         0.78         0.67-0.93    0.004
Models with intermediate factor as covariate
SNP: no. of copies of "T" allele         0.89         0.74-1.07    0.23
Nevus count                              2.60         2.28-2.97    <10^-43
(Zeggini and Morris, 2011)

Test of association : Multiple SNPs
SNP set analysis
Set association is used to evaluate sets of SNP markers at various positions in the genome (in particular, in different susceptibility genes).
This method performs a simultaneous significance test on several sets of loci while keeping the overall type I error under control.
SNP-set-based analysis borrows information from different but correlated SNPs that are grouped on the basis of prior biological knowledge and hence has the possibility of providing results with improved reproducibility and increased power, especially when individual-SNP effects are moderate, as well as improved interpretability.
(Wu et al., 2010)
To increase the power of the test, it is sometimes feasible to combine relevant sources of information for a given SNP, such as allelic association (AA), Hardy-Weinberg disequilibrium (HWD), and evidence for genotyping errors. (Heidema et al., 2007)
This mode of analysis proceeds via a two-step procedure:
• SNPs are assigned to sets on the basis of some meaningful biological criteria (genomic features), e.g. genes
• Then, tests for the association between each genomic feature and a disease phenotype are performed with the use of a logistic kernel machine-based multilocus test, across the genome.
SNP-set analysis can prove advantageous over the standard analysis of individual SNPs. By forming SNP sets and testing each SNP set as a unit, we reduce the number of hypotheses being tested and thus relax the stringent conditions for reaching genome-wide significance in the case of GWA.
There are the following ways of grouping SNPs into sets:
• SNP location in or near a gene (gene based set analysis)
• Set formation on the basis of KEGG pathway
• Grouping SNPs into evolutionarily conserved regions
• Grouping SNPs into haplotype blocks
(Wu et al., 2010)
Genome wide SNP set testing
Assume population based case-control status (for a single set):
• Let $z_{i1}, z_{i2}, \dots, z_{ip}$ be the genotype values for the SNPs in the SNP set for the i-th subject ($i = 1, \dots, n$).
• The case-control status for the i-th subject is denoted by $y_i$ ($y_i = 1$ for cases, and $y_i = 0$ for controls).
• $z_{ij} = 0, 1, 2$ corresponds to homozygotes for the major allele, heterozygotes, and homozygotes for the minor allele, respectively.
• Further assume a collection of m additional demographic, environmental and other confounding variables.
For the i-th subject let $x_{i1}, x_{i2}, \dots, x_{im}$ denote the values of the covariates that we would like to adjust for.
The goal of SNP-set analysis is then to test the global null of whether any of the p SNPs are related to the outcome while adjusting for the additional covariates.
(Wu et al., 2010)
Logistic Kernel Machine Regression Model
The kernel-machine framework has become very popular for modelling high-dimensional biomedical data because of its ability to allow for complex/nonlinear relationships between the dependent and independent variables (Brown et al., 2000) while adjusting for covariate effects.
Under the logistic kernel machine regression model, the following is the model for the joint effect of the SNPs, taking other covariates into account:
$\operatorname{logit} P(y_i = 1) = \alpha_0 + \alpha_1 x_{i1} + \dots + \alpha_m x_{im} + h(z_{i1}, \dots, z_{ip})$
• in which $\alpha_0$ is the intercept
• $\alpha_1, \alpha_2, \dots, \alpha_m$ are regression coefficients corresponding to the environmental and demographic covariates.
• The SNPs, $z_{i1}, \dots, z_{ip}$, influence $y_i$ through the general function $h(\cdot)$, which is an arbitrary function that has a form defined only by a positive, semidefinite kernel function $K(\cdot, \cdot)$.
(Wu et al., 2010)
Multifactor Dimensionality Reduction
MDR is a nonparametric data mining approach.
It reduces two or more SNPs, for example, to a new single variable that is then evaluated using a classifier such as Bayes or logistic regression.
In MDR, each multi-locus genotype of a SNP combination is assigned to a high-risk or low-risk group, depending on the ratio of cases to non-cases with this multi-locus genotype.
If this ratio exceeds a certain threshold, the multi-locus genotype is assigned to the high-risk group; otherwise it is assigned to the low-risk group.
By assigning all multi-locus genotypes for a certain combination of SNPs to either high-risk or low-risk, MDR reduces the number of multi-locus genotypes to one risk factor consisting of two levels, high-risk or low-risk.
The aim is to construct a new risk factor that facilitates the detection of nonlinear interactions among SNPs such that the prediction of the outcome variable is improved over the original representation of the data.
(Ritchie et al., 2001)
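The high-risk / low-risk labelling step can be sketched as follows. This covers only the reduction step, not the cross-validation and model selection of a full MDR analysis; the function name `mdr_labels` is hypothetical, and using the overall case:control ratio as the default threshold is an assumption.

```python
from collections import Counter


def mdr_labels(genos, status, threshold=None):
    """Label each multi-locus genotype 'high' or 'low' risk.

    genos: list of tuples of genotype codes, one tuple per subject.
    status: 1 = case, 0 = control (at least one control is assumed).
    """
    cases = Counter(g for g, s in zip(genos, status) if s == 1)
    ctrls = Counter(g for g, s in zip(genos, status) if s == 0)
    if threshold is None:
        threshold = sum(status) / (len(status) - sum(status))   # overall case:control ratio
    labels = {}
    for g in set(genos):
        # cases / controls > threshold, written without dividing by a zero count
        labels[g] = "high" if cases[g] > threshold * ctrls[g] else "low"
    return labels


genos = [(0, 0), (0, 0), (0, 0), (1, 2), (1, 2), (1, 2)]
status = [1, 1, 0, 0, 0, 1]
print(mdr_labels(genos, status))
```

The resulting two-level factor replaces the original SNP combination in the downstream classifier.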
(Lee et al., 2008)
Logistic regression
Logistic regression analyses for L SNPs are a natural extension of the single-SNP analyses that are discussed in previous slides: there is now a coefficient ($\beta_0$, $\beta_1$ or $\beta_2$) for each SNP, leading to a general test with 2L d.f. By constraining the coefficients, tests with L d.f. can be obtained.
Covariates such as sex, age or environmental exposures are readily included. Similarly, interactions between SNPs can be included.
(Balding, 2006)
Including interactions conveys little benefit, and can reduce power to detect an association, if there is a single underlying causal variant and little or no recombination between SNPs, but it is potentially useful for investigating epistatic effects.
(Wu et al., 2010)
Haplotype based methods
When hundreds of thousands of SNPs are genotyped, it happens that most of them are in high linkage disequilibrium (they are called a haplotype if they happen to be adjacent on the chromosome).
A few methods have been proposed in the literature for identifying haplotypes: PHASE (Stephens et al., 2001), SNPHAP, FASTPHASE (Scheet & Stephens, 2006), Haploview, PLINK (Purcell et al., 2007) etc. Most of these are available as software.
The above tools can be used to identify haplotypes in GWA datasets and to replace the entire haplotype block with a representative SNP called a 'Tag SNP'.
tSNPs can sometimes provide greater analytical power than single-marker analysis for genetic association studies.
This is because haplotypes are inherited together in the majority of cases, and they incorporate linkage disequilibrium information (Akey and Xiong, 2003; Schaid, 2004).
Conversely, haplotype-based statistical analysis has a weakness since haplotypes are often not directly observable.
Hence, haplotypes and their frequencies are inferred by statistical methods such as the Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Excoffier and Slatkin, 1995) or the Bayesian method (Stephens et al., 2001; Lin et al., 2002).
Given haplotype assignments, the simplest analysis involves testing for independence of rows and columns in a $2 \times k$ contingency table, where k denotes the number of distinct haplotypes (Sham, 1998).
Alternative approaches can be based on the estimated haplotype proportions among cases and controls, without an explicit haplotype assignment for individuals (Schaid, 2004): the test compares the product of separate multinomial likelihoods for cases and controls with that obtained by combining cases and controls.
A haplotype based regression model is very useful in haplotype based association studies.
(Wang et al., 2005)
Regression Models for Haplotypes
Within the framework of the generalized linear model (GLM), the haplotype effect on traits can be statistically described and tested.
The model can be expressed as $E(Y) = f^{-1}(X\beta)$
where Y denotes the trait,
X represents the haplotypes that are coded into the design matrix,
$\beta$ denotes the effects of the haplotypes, and
f is a function that generalizes the usual linear regression, such as logistic regression in the case-control study.
(Sohee et al., 2007)
Let Y = {0 for Control and 1 for Case}
Let $(h_i, h_j)$ be a random variable that denotes the pair of haplotypes for each individual, with $i = j$ or $i \ne j$. Let $H = \{h_1, h_2, \dots, h_p\}$ be a set of haplotypes.
The maximum number of possible haplotypes is $2^m$, where m is the number of SNPs.
In association studies, the main interest lies in estimating the effects of H on Y.
So, in a nutshell, regression models for haplotypes consist of:
Predictor - Haplotype counts
Regression parameters - Phenotypic effect of each haplotype
Outcome - The phenotype of interest
(Sohee et al., 2007)
Direct Design Matrix

Individual   Haplotypes   Probability   Direct design matrix
                                        h_1     h_2     h_3     h_4
Y_1          (h_1, h_1)   1             1       0       0       0
Y_2          (h_1, h_4)   0.2           0.1     0.4     0.4     0.1
             (h_2, h_3)   0.8
Y_3          (h_2, h_2)   1             0       1       0       0
Y_4          (h_2, h_4)   1             0       0.5     0       0.5
Y_5          (h_1, h_2)   0.2           0.25    0.20    0.25    0.30
             (h_1, h_4)   0.3
             (h_2, h_3)   0.2
             (h_3, h_4)   0.3

The direct type of design matrix relies on the estimated haplotype probabilities (proportions).
(Sohee et al., 2007)
Indirect Design Matrix

Individual   Haplotypes   Probability   Indirect design matrix          Weight
                                        h_1     h_2     h_3     h_4
Y_1          (h_1, h_1)   1             2       0       0       0       1
Y_2          (h_1, h_4)   0.2           1       0       0       1       0.2
             (h_2, h_3)   0.8           0       1       1       0       0.8
Y_3          (h_2, h_2)   1             0       2       0       0       1
Y_4          (h_2, h_4)   1             0       1       0       1       1
Y_5          (h_1, h_2)   0.2           1       1       0       0       0.2
             (h_1, h_4)   0.3           1       0       0       1       0.3
             (h_2, h_3)   0.2           0       1       1       0       0.2
             (h_3, h_4)   0.3           0       0       1       1       0.3
(Sohee et al., 2007)
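The two codings are related: each indirect row holds the haplotype counts of one possible pair together with its probability as a weight, and the direct row is the probability-weighted average of those counts divided by two. The sketch below illustrates this for one individual; the function name `design_rows` and the 1-based haplotype indexing are assumptions made for the example.

```python
def design_rows(pairs, n_haps):
    """Build indirect rows and the direct row for one individual.

    pairs: list of ((i, j), prob) with 1-based haplotype indices; probs sum to 1.
    Returns (indirect, direct): indirect is a list of (counts, weight) rows,
    direct is the list of expected haplotype proportions.
    """
    indirect = []
    direct = [0.0] * n_haps
    for (i, j), prob in pairs:
        counts = [0] * n_haps
        counts[i - 1] += 1
        counts[j - 1] += 1
        indirect.append((counts, prob))
        for k, c in enumerate(counts):
            direct[k] += prob * c / 2      # two haplotypes per individual
    return indirect, direct


# Individual with pair (h1, h4) at probability 0.2 and (h2, h3) at probability 0.8
print(design_rows([((1, 4), 0.2), ((2, 3), 0.8)], 4))
```

The direct coding gives one row per individual, whereas the indirect coding gives one weighted row per compatible haplotype pair.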
Introduction to Bayesian Approach
The Bayesian approach is a statistical school of thought that holds that inferences about any unknown parameter or hypothesis should be encapsulated in a probability distribution, given the observed data. Computing this posterior probability distribution usually proceeds by specifying a prior distribution that summarizes knowledge about the unknown before the observed data are considered, and then using Bayes' theorem to transform the prior distribution into a posterior distribution.
Bayesian methods provide an alternative approach to assessing associations that alleviates the limitations of p-values at the cost of some additional modelling assumptions.
Bayesian methods compute measures of evidence that can be directly compared among SNPs within and across studies.
(Stephens & Balding, 2009)
Calculating Probabilities of Association
This deals with computing, for each SNP in a GWAS, the probability that it is truly associated with the phenotype, also known as the "Posterior Probability of Association (PPA)".
The PPA can be thought of as the Bayesian analogue of a p-value obtained, for example, by using the Armitage trend test (ATT) or the Fisher exact test.
The calculation of the PPA can be split into three steps:
• Choose a value for $\pi$, the prior probability of $H_1$
• Compute a Bayes factor for each SNP
• Calculate the posterior odds on $H_1$
(Stephens & Balding, 2009)
Step I : Choose a value for $\pi$, the prior probability of $H_1$
• The value of $\pi$ quantifies our prior belief that each SNP is associated.
• The value of $\pi$ for $H_1$ depends on prior knowledge, for example: MAF, proximity to certain genes of interest, etc.
• If $\pi$ is assumed to be the same for all SNPs, it can be interpreted as a prior estimate of the overall proportion of SNPs that are truly associated with a phenotype.
• Typically, only a minority of SNPs is expected to be truly associated with a given phenotype: the range $10^{-4}$ to $10^{-6}$ has been suggested for $\pi$. (The Wellcome Trust Case Control Consortium, 2007)
• The probability of $H_0$ is taken to be $1 - \pi$.
Step II : Compute a Bayes factor for each SNP
• A Bayes factor (BF) is the ratio between the probabilities of the data under $H_1$ and under $H_0$.
• The BF is similar to a likelihood ratio, but it compares two different models rather than two parameter values in a model.
• The observed data are BF times more likely under $H_1$ than under $H_0$.
(Stephens & Balding, 2009)
Step III : Calculate the posterior odds on $H_1$
• The BF and $\pi$ can be used to compute the posterior odds (PO) on $H_1$: $\mathrm{PO} = \mathrm{BF} \times \frac{\pi}{1 - \pi}$
• This can be used to calculate the PPA: $\mathrm{PPA} = \frac{\mathrm{PO}}{1 + \mathrm{PO}}$
• The PPA can be interpreted directly as a probability, irrespective of power, sample size or how many other SNPs were tested.
• Intuitively, the PPA combines the evidence in the observed association data (the BF) with the prior probability ($\pi$) that a SNP is truly associated with the phenotype. Because $\pi$ is typically so small, the BF has to be large (for example, $>10^4$ to $10^6$) to provide convincing evidence for an association (that is, to give a PPA close to 1).
• The requirement for a large BF is analogous to setting a stringent threshold for genome-wide significance in a frequentist approach.
(Stephens and Balding, 2009)
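The three steps combine into a two-line calculation: posterior odds = BF x prior odds, and PPA = posterior odds / (1 + posterior odds). The sketch below shows how strongly the result depends on the prior; the function name `ppa` and the example values are illustrative.

```python
def ppa(bayes_factor, prior):
    """Posterior probability of association from a Bayes factor and prior probability."""
    posterior_odds = bayes_factor * prior / (1 - prior)
    return posterior_odds / (1 + posterior_odds)


for bf in (1e2, 1e4, 1e6):
    print(bf, ppa(bf, 1e-4))
```

With a prior of 1 in 10,000, a Bayes factor of about 10,000 only brings the PPA to roughly one half, which is why much larger values are needed for convincing evidence.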
Population stratification
Population stratification refers to differences in allele frequencies between cases and controls due to systematic differences in ancestry rather than association of genes with disease (Freedman et al., 2004).
When cases and controls have different allele frequencies attributable to diversity in background population, unrelated to outcome status, a study is said to have population stratification.
(Cardon and Palmer, 2003)
Population stratification is probably the most often cited reason for non-replication of genetic association results, which have unfortunately been more the rule than the exception.
(Tabor et al., 2003; Weiss and Terwilliger, 2000)
Leading scientific journals have noted the importance of population stratification as a cause of non-replicated association outcomes (Anon, 1999), and it is usual practice in grant applications and manuscript peer-review to demand that stratification is explicitly addressed. (Gauderman, 1999)
Two conditions must be met for population stratification to affect genetic association studies:
i. differences in disease prevalence must exist between cases and controls; and
ii. variations in allele frequency between groups must be present.
(Stephen et al., 2003)
Q-Q PLOT
(McCarthy et al., 2008)
Genomic Control
This somewhat older method, pioneered by Devlin and Roeder (Devlin and Roeder, 1999), notes that the chi-squared statistics from association tests confounded by stratification will be more "spread out" than they should be.
The result is a higher median than the median of a true chi-square distribution. Several models exist for how much the distribution should be spread out, depending on the test type, but the distribution will usually be uniformly spread out by a certain "inflation factor", also known as the "variance inflation factor".
The effect of stratification on association studies. (a) Stratification inflates $\chi^2$ association statistics by a factor $\lambda$, which changes depending on the sample size. Scenario 1 corresponds to gross stratification; scenarios 2 and 3 correspond to the range of stratification estimated in the African American prostate cancer study; and scenario 4 corresponds to no stratification.
(Freedman et al., 2004)
In Genomic Control (GC) the Armitage test statistic is
computed at each of the null SNPs, and λ is calculated as
the empirical median divided by its expectation under the
χ² distribution with 1 degree of freedom.
(Devlin and Roeder, 1999)
Then the Armitage test is applied at the candidate SNPs,
and if λ > 1 the test statistics are divided by λ.
The motivation for GC is that, as we expect few if any of the
null SNPs to be associated with the phenotype, a value of
λ > 1 is likely to be due to the effect of population
stratification, and dividing by λ cancels this effect for the
candidate SNPs.
GC performs well under many scenarios, but it is limited in
applicability to the simplest, single-SNP analyses, and can
be conservative in extreme settings (and anti-conservative
if insufficient null SNPs are used).
(Marchini et al., 2004)
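
The GC steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular package: the function names and the example statistics are hypothetical, and the expected median of the χ² distribution with 1 degree of freedom is obtained as the square of the 75th percentile of a standard normal.

```python
from statistics import NormalDist, median

# Median of a chi-squared distribution with 1 degree of freedom:
# the square of the 75th percentile of a standard normal (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(null_stats):
    """Estimate lambda: empirical median over its chi-squared(1) expectation."""
    return median(null_stats) / CHI2_1DF_MEDIAN


def genomic_control(candidate_stats, null_stats):
    """Divide candidate statistics by lambda, but only when lambda > 1."""
    lam = inflation_factor(null_stats)
    if lam <= 1.0:
        return list(candidate_stats), lam
    return [stat / lam for stat in candidate_stats], lam


# Hypothetical Armitage statistics, made up purely for illustration.
null_stats = [0.2, 0.5, 0.9, 1.4, 3.1]
corrected, lam = genomic_control([12.0, 4.5], null_stats)
```

The corrected statistics are then compared with the usual χ² critical values; with only a handful of null SNPs, as here, the estimate of λ would be very noisy.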
Structured Association
These approaches are based on the idea of attributing the
genomes of study individuals to hypothetical
subpopulations, and testing for association that is
conditional on this subpopulation allocation.
(Pritchard et al., 2000)
These approaches are computationally demanding, and
because the notion of subpopulation is a theoretical
construct that only imperfectly reflects reality, the question
of the correct number of subpopulations can never be fully
resolved.
Basic idea: Try to infer (discover) the structure and then condition
on the structure when testing for association.
Other approaches
Null SNPs can mitigate the effects of population structure when
included as covariates in regression analyses.
(Setakis et al., 2006)
Like GC, this approach does not explicitly model the population
structure and is computationally fast, but it is much more flexible
than GC because epistatic and covariate effects can be included in
the regression model.
(Balding, 2006)
Empirically, the logistic regression approaches show greater power
than GC, but their type-1 error rate must be assessed through
simulation. (Setakis et al., 2006)
Multiple testing
It refers to the problem that arises when many null
hypotheses are tested; some significant results are likely
even if every null hypothesis is true.
(Balding, 2006)
This matters especially in GWA, where each SNP that is analyzed
constitutes one hypothesis test. In traditional hypothesis testing, the
significance level is often set at 5%.
However, as shown in the table below, as the number of
SNPs tested increases, the number of SNPs falsely claimed
to be significant increases, assuming that none of the SNPs is
truly associated.
No. of SNPs tested    False positives
100                   5
10,000                500
500,000               25,000
(Scott, 2009)
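
The arithmetic behind the table can be sketched as follows. The function names are hypothetical, and the family-wise error rate formula additionally assumes that the tests are independent, which real SNPs in linkage disequilibrium are not.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Reproduce the rows of the table at the 5% significance level.
for n_tests in (100, 10_000, 500_000):
    print(n_tests, round(expected_false_positives(n_tests)))
```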
There are two different approaches to multiple testing:
• Bayesian approach
• Frequentist approach
The frequentist paradigm of controlling the overall type-1 error
rate sets a significance level α (often 5%), and all the tests
that the investigator plans to conduct should together
generate no more than probability α of a false positive.
The following two methods have been most widely used in
multiple testing:
• Sequential Bonferroni rejection
• False discovery rate (FDR)
(Balding, 2006)
FDR was proposed by Benjamini and Hochberg (1995); they
also noted that the classical approaches, despite their use in
industry, are less likely to be used in genetics.
(Scott, 2009)
Sequential Bonferroni rejection
This was proposed by Holm (1979).
The method is based on the Bonferroni test and requires
the type-I error to be as small as possible.
Philosophically, for each of these n tests, the probability of
committing a type-I error is less than or equal to a small
predetermined value α.
Notation which we will use further in this method:
n null hypotheses: H_1, H_2, H_3, ..., H_n
Alternative hypotheses: N_1, N_2, N_3, ..., N_n
Test statistics: f_1, f_2, f_3, ..., f_n
n critical regions: C_1, C_2, C_3, ..., C_n
(Scott, 2009)
Now let the corresponding p-values generated from the test
statistics f_1, f_2, f_3, ..., f_n be P_1, P_2, P_3, ..., P_n
(that is, P_k where k = 1, 2, ..., n).
When these p-values are ordered, P_(1) ≤ P_(2) ≤ P_(3) ≤ ... ≤ P_(n),
along with their corresponding hypotheses
H_(1), H_(2), H_(3), ..., H_(n), the most significant ones would
have the smallest p-values.
The SRB method attempts to solve this multiple testing
problem by adjusting the significance level α for each
hypothesis tested, before comparing it with the p-values.
Specifically, the ordered p-values are compared to the corresponding
levels α/n, α/(n − 1), ..., α/1.
Hypotheses are rejected in this order until no other rejections are
possible, that is, until the first p-value exceeds its level. Since the
most important hypotheses would have the smallest p-values, they are
compared with the smallest levels, α/(n − i + 1), where i = 1, 2, 3, ..., n.
(Scott, 2009)
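
A minimal sketch of the sequential rejection rule, assuming the step-down form of Holm (1979) in which the i-th smallest p-value is compared with α/(n − i + 1). The function name `holm_reject` and the example p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Return one True/False rejection flag per input p-value."""
    n = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda k: p_values[k])
    rejected = [False] * n
    for i, k in enumerate(order, start=1):
        level = alpha / (n - i + 1)
        if p_values[k] > level:
            break  # no further rejections are possible
        rejected[k] = True
    return rejected


# Made-up p-values for four SNPs.
flags = holm_reject([0.01, 0.04, 0.03, 0.005], alpha=0.05)
```

With these made-up values the two smallest p-values fall below their levels (0.0125 and about 0.0167), while the third, 0.03, exceeds 0.025, so rejection stops there.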
Conclusions
Family-based linkage mapping is the best approach to detect
regions involved in recessive, highly penetrant Mendelian
disease but has low resolution, while population-based
genetic association study helps in mapping complex traits
and has high resolution.
Case-control design applies to disease traits and two-tailed
sampling design applies to quantitative traits.
Huge genotyping data due to increased usage of GWAS
and spurious association due to study design have put
forward computational challenges.
HWE, MSP and MAF are the main components in QC, and
imputation methods use HMM, ML and multiple imputation for
imputing missing alleles.
Single-locus association testing in the case of a disease trait
requires tests for categorical data, such
as the chi-squared test and the odds ratio.
Logistic regression can adjust for the effect of covariates, giving
it an advantage over other categorical data tests.
In the case of a quantitative trait, a linear regression model and its
variants such as fixed, random or mixed models can be
applied, depending on the types of independent variables.
Multiple-SNP models, apart from logistic regression and
linear regression, use SNP-set analysis, which gives
better results using a kernel function, and also machine
learning methods such as MDR and RFA.
Bayesian models compute the BF and use various models to
obtain the PPA, which is the analogue of the p-value in the
frequentist approach.
Population stratification is one of the main reasons behind
non-replication of GWAS in case-control design.
Haplotype-based regression models alleviate the problem of
multiple testing and also increase the power of the test.
Genomic control is a widely used method to detect
stratification.
Multiple testing requires the application of Bonferroni
correction and FDR to control type I error.
Thank You
“In God we trust, all others must bring
data”
- W. Edwards Deming
Supplementary Slides
Data Quality Control
• For GWAS conducted by the Wellcome Trust Case Control
Consortium (WTCCC), the criteria for retaining a SNP are:
HWE p-value > 5.7×10^-7, MSP ≤ 5% if MAF ≥ 5%,
MSP ≤ 1% if MAF < 5%, and MAF > 0.01 (WTCCC, 2007).
Sladek et al. (2007) included SNPs when the HWE p-value > 0.001,
MSP < 5% and MAF > 0.01. Anoki et al. (200?) included SNPs
when the HWE p-value > 10^-? and MSP < 10%.
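
The WTCCC retention rule quoted above can be written as a small filter. This is a sketch under the thresholds as stated on this slide; the function name `retain_snp` and its argument names are hypothetical, with MSP and MAF given as proportions rather than percentages.

```python
def retain_snp(hwe_p, msp, maf):
    """Apply the WTCCC-style criteria: True if the SNP passes QC."""
    if hwe_p <= 5.7e-7:   # departure from Hardy-Weinberg equilibrium
        return False
    if maf <= 0.01:       # minor allele too rare
        return False
    if maf >= 0.05:       # common SNP: tolerate up to 5% missing genotypes
        return msp <= 0.05
    return msp <= 0.01    # rarer SNP: tolerate at most 1% missing genotypes


# A common SNP with 3% missing data passes; a rarer one does not.
common_ok = retain_snp(hwe_p=0.5, msp=0.03, maf=0.20)
rare_ok = retain_snp(hwe_p=0.5, msp=0.03, maf=0.03)
```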

Statistical method used: PCA
Genotyping Error
• Variation in DNA sequence
• Low quantity and quality of DNA
• Biochemical artifact and low quality reagent
• Human factor
SNP QC software
• SAQC (SNP array Quality Control) on R
• DBSCAN
PCA
• PCA is a mathematical procedure that uses
an orthogonal transformation to convert a set of
observations of possibly correlated variables into
a set of values of linearly uncorrelated variables
called principal components.
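
To make the definition concrete, the sketch below extracts the first principal component of a small data set. It uses power iteration on the sample covariance matrix, a technique chosen here only to keep the example dependency-free; the function name and the data are hypothetical, and a real analysis would normally rely on an established linear-algebra library.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit direction, scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered variables.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeated multiplication converges to the leading
    # eigenvector, provided the start vector is not orthogonal to it.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Scores: projection of each centered observation onto the direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Two perfectly correlated made-up variables (second = 2 x first).
direction, scores = first_principal_component([(1, 2), (2, 4), (3, 6)])
```

Because the two made-up variables are perfectly correlated, a single component captures all of the variation, and its direction is proportional to (1, 2).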
Haplotype analysis
• Analysis methods based on single SNPs have
limited power to detect a true genetic effect that
requires a specific allele at several SNPs. This
may be detected using haplotype-based methods,
analysing all SNPs concurrently.
• Genehunter allows haplotype analysis of up to
four SNPs. One of the most flexible programs for
TDT-type analysis is Transmit.
Software for TDT and Haplotype analysis
TDT/SibTDT   http://genomics.med.upenn.edu/spielman/TDT.htm
Genehunter   http://www.fhcrc.org/labs/kruglyak/Downloads/index.html
PDT          http://www.chg.duke.edu/software/pdt.html
ETDT         http://www.mds.qmw.ac.uk/statgen/dcurtis/software.html
Transmit     http://ftp-gene.cimr.cam.ac.uk/clayton/software/
EH           ftp://linkage.rockefeller.edu/software/eh/
Ehplus       http://www.iop.kcl.ac.uk/IoP/Departments/PsychMed/GEpiBSt/softwar
Example of Logistic Regression
where x_i^T is the ith row of the design matrix. With suitable choice of design
matrix, the regression coefficients, β, are the logarithms of the odds ratio
parameters.
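
A sketch of the model being described, assuming the standard logistic form and writing p_i (a symbol introduced here) for the probability that individual i is a case:

```latex
\log\left(\frac{p_i}{1 - p_i}\right) = x_i^{T}\beta ,
\qquad
\mathrm{OR}_j = e^{\beta_j}
```

Under this form, exponentiating coefficient β_j gives the odds ratio associated with a one-unit increase in the j-th column of the design matrix, for example one additional copy of the risk allele under an additive coding.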
