Linkage Mapping and QTL Notes 04 Rev

Linkage Mapping and QTL Analysis
Chapter January 2004

DOI: 10.13140/2.1.4501.9846
ICRISAT
Draft Version 1.3
Linkage Mapping and QTL Analysis
Subhash Chandra
Biometrics Unit
Global Theme Biotechnology
International Crops Research Institute for the Semi-Arid Tropics
Patancheru 502 324, India
2004
Contents
QTL Analysis: An Overview
Chapter 1. Things we need to get there

1.1 The basic ingredients
1.2 Selection of parents
1.3 Generating a mapping population
1.4 Mapping population size
1.5 Selection of marker system
1.6 Number of markers vis-à-vis mapping population size
1.7 The data: What do they look like?
1.8 Genotyping of mapping population
1.9 Linkage map
Chapter 2. Phenotyping of mapping population

2.1 Physical dimensions of phenotyping
2.2 Outputs expected from phenotyping
2.3 Accuracy and precision: Theory and reality
2.4 Ingredients of reliable phenotyping
2.5 Experimental design: Some basics
2.6 Ingredients of a statistically sound experimental design
2.7 Replication vs blocking
2.8 Experimental design options
2.9 Number of environments and replications: A balancing act
2.10 Biometric analysis of data: Plan it ahead
2.11 Understanding the block structure
2.12 Single trial analysis using GenStat
2.13 Pooled analysis across trials using GenStat
2.14 Spatial analysis using ASReml
Chapter 3. Building a linkage map

3.1 Single-locus analysis for data screening
3.2 Two-locus analysis for estimation and detection of linkage
3.3 Minimum sample size to detect linkage
3.4 Linkage grouping
3.5 Linkage grouping criteria
3.6 Locus ordering
3.7 Marker coverage and map density
3.8 Predicting marker coverage and map density
Chapter 4. QTL analysis

4.1 The key idea
4.2 Quantitative genetic models
4.3 Prelim estimate of number of loci/QTLs
4.4 Genetic models for F2 and BC
4.5 QTL mapping strategies
4.6 Single marker analysis
4.7 Multiple marker analysis using multiple regression
4.8 Simple interval mapping
4.9 Composite interval mapping
4.10 Significance threshold for QTL detection
4.11 Estimation of key MAS parameters
4.12 Cross validation and bootstrapping
4.13 In summary
Appendix 1. Fixed, Random and Mixed Models
QTL Analysis: An Overview

A quantitative trait locus (QTL) is a region of the genome that
contains gene(s) associated with a quantitative trait (Single gene?)
QTL analysis: A means to estimate the locations, number,
magnitude of phenotypic effects, and mode of gene action of
individual genetic loci (QTLs) that contribute to the inheritance
of a quantitative trait
Polymorphic markers are needed for QTL analysis
QTL analysis can be done with or without a genetic linkage map
A linkage map needs adequate polymorphic markers and
allows mapping of QTLs and estimation of their number, phenotypic
effects, and modes of gene action
Without a linkage map, only QTL-harboring markers can be
identified; QTL location, number, effect and mode of gene
action cannot be determined
QTL analysis can be done using a mapping population (Linkage
analysis) or a natural/breeding population (Association analysis)
Why QTL analysis?
- Identify important trait-linked markers (if there is no map)
- Detect regions of the genome associated with the trait (with a map)
  → How many QTLs are there, and where are they? Identify flanking markers
- Describe the effect of QTL on the trait (Quantitative Genetics)
  How much variation is caused by the QTL ($R_a^2$)?
  What is the gene action associated with the QTL? Additive, dominance
  Which allele is associated with the favorable effect?
- Facilitate marker-assisted selection, breeding, introgression: Note that it is
  MARKER-assisted, not QTL-assisted!
- Comparative QTL mapping across related species
A quantitative trait exhibits continuous variation in its
phenotypic expression
The trait expression may be controlled by many genes (a few,
hopefully, of large effect and the others of small effect): Major
QTL ($R_a^2 > 15\%$), Minor QTL ($R_a^2$ below 15% but above 5%) and
Polygenes ($R_a^2 < 5\%$): The L-shaped distribution hypothesis
Trait expression may be affected by non-genetic factors such as
macro- and micro-environmental variation as well as measurement
errors: NOISE
The genes may interact with each other (gene x gene
interaction; epistasis) and with the environment (gene x
environment interaction) in determining phenotypic expression
of the trait: NOISE
The fundamental challenge in QTL analysis is to disentangle the
genetic SIGNAL at any individual locus from the NOISE
A QTL represents a significant statistical association between
trait phenotype and marker genotype: Significance Threshold?
The ability to resolve a QTL (the SIGNAL) depends upon
(Heritability?)
- The magnitude of its phenotypic effect: Large, small
- The extent to which its effect is modified by non-linear
  interactions with other QTLs (epistasis)
- The degree to which macro- and micro-environmental factors and
  measurement errors obscure the resulting phenotype
- Density of genetic markers to scan the genome for QTLs
We can influence these factors by
- Choice of mapping population (RIL, F2, ...)
- Number of segregating progenies
- Type (dominant or co-dominant) and spacing of genetic markers
- Error-free genotyping
- Appropriate field block design to account for micro-environmental
  variation for accurate phenotyping (Definition of phenotype?)
- Appropriate data analysis to account for micro-environmental
  variation not accounted for by the design used: accurate phenotyping
These factors impact, in varying degrees, the reliability of the
estimates of the number, magnitude, and
distribution/position of QTLs on the genome
The reliability of QTL analysis results is critically dependent
on the sample size (number of progenies genotyped and phenotyped)
Unbiased QTL analysis requires a (costly) two-stage sequential
process: detecting QTLs from sample A, and using this
information to estimate QTL parameters from another
independent sample B of mapping population individuals
Normally these two actions are performed on the same single
sample, whose size (still) normally varies between 100 and 200
mapping population individuals because phenotyping and
genotyping are expensive
The same sample is also used to build the linkage map
Repeated use of the same sample produces severe overestimation of
QTL parameters. This results in highly optimistic, hence
misleading, prospects for MAS (False positives)
The resampling (simulation) procedures of cross-validation and
bootstrapping can be used to obtain realistic estimates of QTL
parameters from a single sample of available experimental data on
genotyping and phenotyping
These notes are divided into four chapters and discuss only
mapping-population-based QTL analysis
- Chapter 1 is about The Things We Need To Get There
- Chapter 2 discusses issues related to Phenotyping
- Chapter 3 discusses Building a Linkage Map
- Chapter 4 discusses the methods for QTL Analysis including
  cross-validation and bootstrapping
Suggested Reading
Kearsey MJ, Farquhar AGL (1998). QTL analysis in plants: Where are we now? Heredity
80:137-142
Asins MJ (2002). Present and future of quantitative trait locus analysis in plant
breeding. Plant Breeding 121:281-291
Doerge RW (2002). Mapping and analysis of quantitative trait loci in experimental
populations. Nature Reviews Genetics 3:43-52
Tuberosa R et al. (2002). Mapping QTLs regulating morpho-physiological traits and
yield: Case studies, shortcomings, and perspectives in drought-stressed maize. Annals
of Botany 89:941-963
Paterson AH (2002). What has QTL mapping taught us about plant domestication? New
Phytologist 154:591-608
Members of the Complex Trait Consortium (2003). The nature and identification of
quantitative trait loci: A community's view. Nature Reviews Genetics 4:911-916
Xu S (2003). Theoretical basis of the Beavis effect. Genetics 165:2259-2268
Bernardo R (2004). What proportion of declared QTL in plants are false? Theoretical
and Applied Genetics 109:419-424
Schön CC et al. (2004). Quantitative trait locus mapping based on resampling in a vast
maize testcross experiment and its relevance to quantitative genetics for complex
traits. Genetics 167:485-498
Kraakman ATW et al. (2004). Linkage disequilibrium mapping of yield and yield
stability in modern spring barley cultivars. Genetics 168:435-446
Chapter 1
Things we need to get there
1.1 The basic ingredients
- Identifying the trait(s)
- Definition of trait phenotype
- Selection of parents
- Generating a mapping population
- Mapping population size
- Phenotyping of mapping population: Accurate
- Phenotyping data analyses
- Selection of marker system
- Number of markers
- Genotyping of mapping population: Error-free
- Genotyping data analysis
- Building a linkage map
- Linking phenotype to genotype to find QTLs
1.2 Selection of parents
Parents must have sufficient variation at both the phenotypic and
molecular levels. Variation at the molecular level is essential to
trace recombination events [Links with molecular diversity analysis]
Completely inbred lines are ideal as parents. They create
marker-trait associations because their F1s are in complete
linkage disequilibrium (LD) for genes differing between the lines
LD = Non-random association of alleles from two loci on the same chromosome
1.3 Generating a mapping population (Show DH)
Inbred Parent 1 x Inbred Parent 2 (Divergent Inbred Parents)
gives the F1, which is then advanced by selfing or by backcrossing:

Selfing route                 Backcross route
F1 (self)                     F1 x Inbred Parent
F2 (self)                     BC1 x Inbred Parent (donor)
F3-1 F3-2 F3-3 F3-4 (self)    BC2 x Inbred Parent (donor)
F4-1 F4-2 F4-3 F4-4           BC3 x Inbred Parent (donor)
...                           ...
RIL1 RIL2 RIL3 RIL4           BC Introgression Line (donor genotype)
The most widely used mating designs to generate a mapping
population are F2, BC, RIL, and DH
A great advantage of RILs and DHs is their "eternity": they can
be multiplied without losing genetic identity. This is
important for QTL detection as it enables the use of replicates and
the collection of data in multiple environments
The low proportion of heterozygotes in an RIL population produces
a higher recombination frequency between marker pairs →
higher resolution of QTL mapping
1.4 Mapping population size
Representative sample of mapping population
Number of individuals, ng: 500-1000 [h2, QTL effect size, ...]
ng affects map resolution and marker order: more is better
Small ng → Poor quality linkage map
           Low power to detect QTL
           QTL effects are overestimated
H0: There is no QTL
Prob (Type I Error) = Level of Significance = Risk = False Positive
: Inferring the presence of a QTL that does not exist
: Low Type I Error → Fewer false positives: Unavoidable!
Prob (Type II Error) = False Negative
: Inferring the absence of a QTL that exists
: Low Type II Error → Fewer false negatives: Unavoidable!
Power = 1 - Prob (Type II Error) = Prob (Identify a QTL | There is a QTL)
For a fixed sample size: Lowering one error increases the other one
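The trade-off between the two error rates can be illustrated with a minimal sketch. It assumes a one-sided z-test on a standardized effect with known standard deviation, which is a simplification chosen for illustration and not the test used by QTL software; the function name `power_one_sided` and all numbers are hypothetical.

```python
from statistics import NormalDist

def power_one_sided(effect, sd, n, alpha):
    """Approximate power of a one-sided z-test of H0: effect = 0.

    effect, sd, n and alpha are illustrative inputs, not values from the notes.
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1.0 - alpha)  # rejection threshold under H0
    shift = effect / (sd / n ** 0.5)            # true effect in standard-error units
    return 1.0 - std_normal.cdf(critical - shift)

# Fixed sample size: a stricter alpha (fewer false positives) lowers power,
# i.e. it raises the Type II error rate.
for alpha in (0.05, 0.01, 0.001):
    print("alpha", alpha, "power", round(power_one_sided(0.3, 1.0, 100, alpha), 3))

# Fixed alpha: only a larger sample reduces both errors at once.
for n in (50, 100, 200, 500):
    print("n", n, "power", round(power_one_sided(0.3, 1.0, n, 0.01), 3))
```

Under these assumptions, the only way to lower both error rates together is to increase the sample size, which is the argument for a large ng.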
1.5 Selection of marker system (Include description of SSR, AFLP markers)
Different marker systems have different levels of resolution
for detecting genomic variation
- Select a system that allows experimental detection of
  heritable genomic variation among individuals of the
  mapping population
- Co-dominant markers provide more power for QTL
  detection; dominant markers are generally less informative
- For the "eternal" RIL and DH populations, both types of markers are equally
  informative
1.6 Number of markers vis-à-vis mapping population size
How many markers (nm)? For locating QTLs, a dense map (large
nm) is preferable to a sparse map (small nm) as this allows
greater precision of QTL location
However, increasing nm beyond a certain average density
makes little sense if ng is not increased at the same time
QTL mapping, like linkage mapping, relies on the frequency of
detectable recombination events, which, beyond a given
marker density (average inter-marker distance of 15 cM), can
only be increased by increasing ng
A possible approach is to initially use an evenly spaced sparse
map to detect significant chromosomal regions, which can
subsequently be saturated with more markers for fine-scale
localization of QTLs
Suggested further reading
Jones N, Ougham H, Thomas H (1997). Markers and mapping: we are all geneticists
now. New Phytol 137:165-177
Malyshev SV, Kartel NA (1997). Molecular markers in mapping of plant genomes.
Molecular Biology 31(2):163-171
Lee M (1995). DNA markers and plant breeding programs. Advances in Agronomy
55:265-344
Dudley JW (1993). Molecular markers in plant improvement: Manipulation of genes
affecting quantitative traits. Crop Science 33:660-668
Williams JGK et al. (1990). DNA polymorphisms amplified by arbitrary primers are
useful as genetic markers. Nucleic Acids Research 18(22):6531-6535
1.7 The data: What do they look like?
Example: 500 F8:9 RILs, 200 SSR markers
Marker  RIL1  RIL2  RIL3  RIL4  RIL5  RIL6  RIL7  ...  RIL500
M1      B     A     A     B     A     B     A     ...  B
M2      A     A     B     A     A     B     A     ...  B
M3      B     A     B     A     A     B     A     ...  B
M4      A     B     A     A     B     B     B     ...  A
M5      B     A     A     A     B     B     B     ...  A
...
M200    A     A     A     B     B     A     A     ...  B
Pheno
Yield   23    20    10    8     25    11    45    ...  34
A: Parent 1, B: Parent 2 (one high, one low); (A,B) can also be coded as (1,0)
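As a minimal sketch of how such data might be held for analysis, the snippet below stores only the eight RIL columns shown in the table (RIL1 to RIL7 and RIL500), applies the (A,B) → (1,0) coding, and computes the mean yield of each genotype class at one marker. The variable and function names are illustrative assumptions, not part of any package.

```python
# Genotype calls for the eight RILs shown in the table (RIL1-RIL7, RIL500).
genotypes = {
    "M1": "BAABABAB",
    "M2": "AABAABAB",
    "M3": "BABAABAB",
    "M4": "ABAABBBA",
    "M5": "BAAABBBA",
    "M200": "AAABBAAB",
}
yields = [23, 20, 10, 8, 25, 11, 45, 34]

def encode(calls):
    """Code Parent 1 alleles (A) as 1 and Parent 2 alleles (B) as 0."""
    return [1 if call == "A" else 0 for call in calls]

# Marker-by-individual matrix of 0/1 codes (rows = markers, as in the table).
matrix = {marker: encode(calls) for marker, calls in genotypes.items()}

def class_means(codes, phenotype):
    """Mean phenotype of the A (code 1) and B (code 0) genotype classes."""
    a_values = [y for g, y in zip(codes, phenotype) if g == 1]
    b_values = [y for g, y in zip(codes, phenotype) if g == 0]
    return sum(a_values) / len(a_values), sum(b_values) / len(b_values)

print(matrix["M1"])
print(class_means(matrix["M1"], yields))
```

With only eight individuals the class means carry no statistical weight; the point is the layout of the marker-by-individual matrix alongside the phenotype vector.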
1.8 Genotyping of mapping population (Show scoring of CD & D markers)
Error-free genotyping of the ng mapping population individuals
with respect to each of the nm markers
Marker genotype data contain information on segregation at
various positions of the genome
Genotyping errors may lead to distorted segregation, an inflated
linkage map, and biased QTL inferences
1.9 Linkage map
The (ng x nm) = (500 x 200) genotype data matrix forms the raw
material to build a linkage map
A linkage map must be available for QTL mapping
Chapter 3 discusses how to build a linkage map
Chapter 2
Phenotyping of mapping population
Reliable QTL mapping demands reliable phenotyping of traits
of interest under defined target environmental (e.g. drought)
conditions
Phenotype data contain information on segregation and the
phenotypic effects of QTLs
2.1 Physical dimensions of phenotyping
- Large number ng of individuals
- Multi-Environment Phenotyping (MEP) to obtain a more
  realistic estimate of line heritability ($H^2$ vs Effect & #QTLs)
  $H^2 = \sigma_G^2 / [\sigma_G^2 + (\sigma_{GE}^2/n_E) + \{\sigma_e^2/(n_E n_r)\}]$
- A large experimental facility in each environment,
  depending on plot size and number nr of replications
2.2 Outputs expected from phenotyping
Accurate and precise estimate m of genetic parameter $\mu$.
Genetic parameters to be estimated are
- Expected phenotypic values (G in individual as well as
  across trials, and GEI) of entries. BLUPs of G and GEI are
  more appropriate to use than BLUE/GLSE (Explain BLUP, BLUE).
  Separation of G from GEI will facilitate mapping of QTLs
  for G and GEI, to assess QTL x Env interactions and to
  appreciate implications for MAS
- Variance components $\sigma_G^2$, $\sigma_{GE}^2$, $\sigma_e^2$
- Line heritability of the trait ($H^2$)
2.3 Accuracy and precision: Theory and Reality
$\mathrm{MSE}(m) = E(m-\mu)^2$ = Total error/uncertainty in estimate m of $\mu$
$= E\{m-E(m)\}^2 + \{E(m)-\mu\}^2$
$= \{\mathrm{SE}(m)\}^2 + \{\mathrm{Bias}(m)\}^2$
= Imprecision + Inaccuracy
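The decomposition can be checked numerically. The sketch below is an illustration under stated assumptions: it simulates a deliberately biased estimator (0.9 times the mean of four normal observations, a hypothetical choice) and confirms that the empirical MSE equals the empirical variance plus the squared bias. The true value, spread, and number of simulations are arbitrary.

```python
import random
import statistics

def mse_decomposition(estimates, true_value):
    """Return (MSE, variance, squared bias) of a set of simulated estimates."""
    mean_estimate = statistics.fmean(estimates)
    mse = statistics.fmean((m - true_value) ** 2 for m in estimates)
    variance = statistics.fmean((m - mean_estimate) ** 2 for m in estimates)
    bias_squared = (mean_estimate - true_value) ** 2
    return mse, variance, bias_squared

random.seed(1)          # fixed seed so the run is reproducible
true_value = 50.0       # hypothetical true genetic parameter
estimates = []
for _ in range(2000):
    sample = [random.gauss(true_value, 10.0) for _ in range(4)]
    estimates.append(0.9 * statistics.fmean(sample))  # shrunken, hence biased

mse, variance, bias_squared = mse_decomposition(estimates, true_value)
print("MSE           :", round(mse, 3))
print("Var + Bias^2  :", round(variance + bias_squared, 3))
```

The two printed numbers agree up to floating-point rounding, showing that a small standard error does not help if the bias term is large.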
Obtain an unbiased estimate m of $\mu$ with minimum SE(m) by
choosing an appropriate experimental design (ED) and biometric analysis tools.
A statistically unbiased estimate may not necessarily be a
practically unbiased estimate!
Keep in mind: Practical unbiasedness requires careful and
conscious conduct of the experiment during its entire course
Uniform application of standardized experimental protocols
2.4 Ingredients of reliable phenotyping
- Use of an appropriate intra-environment experimental
  design (field blocking) that closely matches the
  (unknown/expected) inherent (spatial) variability
  pattern of the experimental field;
- Determination of nE (number and type) and nr; and
- Use of appropriate biometric analysis tools to obtain an
  accurate and precise estimate m of the (unknown) genetic
  parameter $\mu$.
2.5 Experimental design: Some Basics
- Functions of an experimental design (GH or field) are
  To disentangle genotypic variation (SIGNAL) from inter-plot
  variation (NOISE).
  To separate chaff from grain to obtain unbiased
  estimates with minimum SE.
  To provide a valid and unbiased estimate of uncertainty (SE)
  in estimates of genetic parameters.
2.6 Ingredients of a statistically sound experimental design
Replication of entries
Randomization of entries
Local control of error arising from inter-plot variation
The functions of replication are
To obtain an internal estimate of experimental error ($\sigma_e^2$);
To permit separation of $\sigma_{GE}^2$ from $\sigma_e^2$ in MEP for a more
realistic estimation of $H^2$.
The functions of randomization are
To provide validity to results from statistical analyses.
To protect against bias in estimates of genetic parameters.
(Physical) local control (or blocking): Most critical
Grouping plots into (in)complete blocks such that
Intra-block variation is minimized, and
Inter-block variation is maximized.
The basic function of local control is to reduce
experimental error; blocking does not remove intra-block
variation
Intra-block variation can be further reduced by careful
identification and plot-wise quantification of extraneous
factors and using them as COVARIATES. Spatial analysis is
another alternative to explore.
2.7 Replication vs blocking
Replication is a numerical entity indicating only the number
of plots assigned to an entry;
Increasing nr
- Is not a device to reduce $\sigma_e^2$;
- Provides only a more stable estimate of $\sigma_e^2$.
Blocking is a physical concept
- To control/reduce $\sigma_e^2$;
- To facilitate field operations.
A replicate and a block are not always the same thing;
Replication could be costlier than blocking;
$\mathrm{SE}_m = \sqrt{\sigma_e^2/n_r}$; reduce $\sigma_e^2$ by blocking rather than by increasing nr.
2.8 Experimental design options
Alpha design: Ideal for phenotyping large ng
Much more flexible than lattices, which require ng = k^2;
Could be chosen to conform to expected field variation;
Greater convenience for management of experiment;
ng = k x b; k<b; k = number of plots in a block, b = number of blocks in a replicate
150 = 3 x 50 = 5 x 30 = 10 x 15
300 = 3 x 100 = 5 x 60 = 6 x 50 = 10 x 30 = 15 x 20;
Orient blocks perpendicular to the (expected) field gradient for
the (major) trait of interest.
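The candidate block sizes for a given ng can be listed mechanically. The sketch below enumerates every factorization ng = k x b with 1 < k < b; the function name is an assumption, and the output includes options beyond those listed in the notes, which show only a selection.

```python
def alpha_block_options(n_entries):
    """List (k, b) with k * b == n_entries and 1 < k < b.

    k = plots per incomplete block, b = blocks per replicate.
    """
    options = []
    k = 2
    while k * k < n_entries:
        if n_entries % k == 0:
            options.append((k, n_entries // k))
        k += 1
    return options

for ng in (150, 300):
    print(ng, alpha_block_options(ng))
```

Which of the listed (k, b) pairs to use remains a field decision: the block size should match the scale over which plots are expected to be reasonably uniform.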
Augmented designs
Permit unreplicated phenotyping of a large number of test
entities, with check entities replicated.
Replicated checks enable estimation of $\sigma_e^2$ and adjustment of
the performance of test entities for field variation.
Frequency of checks? 2-10, depending on the magnitude of field
variation; higher for higher variability.
Wanted: Augmented designs with a low frequency of checks to
allow more test entities as well as effective control of field
variation!
Usual augmented designs look something like
C T T T T T C T T T T T C T T T T T C   Block 1
C T T T T T C T T T T T C T T T T T C   Block 2   (incomplete blocks)
C T T T T T C T T T T T C T T T T T C   Block 3
A better approach is to fix the position of checks, as above,
and use for them, e.g., an RCBD with r=3 blocks and 4 different
checks randomized within blocks
C1 T T T T T C4 T T T T T C3 T T T T T C2   Block 1
The above design allows both testing of differences between check
and test entities, and adjustment of the latter for field variation.
For the normally used rectangular plots, Lin & Poushinsky (1985, Can
J Plant Sci 65:743-749) suggest MAD-2, which can be adapted to
any standard design. An example of a 3x6 row-column design to
test 72 test lines is
                 Column
         1   2   3   4   5   6
         T   T   T   T   T   T
         T   T   T   T   T   T
Row 1    C   C   C   C   C   C
         T   T   T   T   T   T
         T   T   T   T   T   T

         T   T   T   T   T   T
         T   T   T   T   T   T
Row 2    C   C   C   C   C   C
         T   T   T   T   T   T
         T   T   T   T   T   T

         T   T   T   T   T   T
         T   T   T   T   T   T
Row 3    C   C   C   C   C   C
         T   T   T   T   T   T
         T   T   T   T   T   T
Each row-column combination is like a main plot in a split-plot
design with 5 subplots, the control allocated to the middle plot and
test entities assigned to the remaining 4 plots.
The number of subplots in a whole plot should be such that they
together form a nearly square area
2.9 Number of environments and replications
nE depends on the expected variability among targeted
environments;
Representative selection of nE environments! Min 2 Mid 4 - Max
nr=2 is adequate because in
$H^2 = \sigma_G^2 / [\sigma_G^2 + (\sigma_{GE}^2/n_E) + \{\sigma_e^2/(n_E n_r)\}]$
the error variance of a genotype mean
$(\sigma_{GE}^2/n_E) + \{\sigma_e^2/(n_E n_r)\}$
is reduced more by a larger nE than by a larger nr.
nr=2 will allow internal estimation of the error variance for each
individual trial, to assess the relative magnitude of errors
across environments and also to separate $\sigma_{GE}^2$ from $\sigma_e^2$;
For nr=2 and a given nE, an attempt should be made to reduce
$\sigma_e^2$ in individual trials, to compensate for the smaller nr,
- By covariance adjustments;
- By spatial analysis.
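The claim that extra environments help more than extra replications can be checked directly from the heritability formula. The sketch below is illustrative: the function name and the variance components (1, 1, 2) are hypothetical values, not estimates from any trial.

```python
def line_heritability(var_g, var_ge, var_e, n_env, n_rep):
    """H2 = var_g / [var_g + var_ge/n_env + var_e/(n_env * n_rep)]."""
    error_of_mean = var_ge / n_env + var_e / (n_env * n_rep)
    return var_g / (var_g + error_of_mean)

# Hypothetical variance components, for illustration only.
var_g, var_ge, var_e = 1.0, 1.0, 2.0

# Both of the last two layouts use 8 plots per entry in total.
for n_env, n_rep in [(2, 2), (2, 4), (4, 2)]:
    h2 = line_heritability(var_g, var_ge, var_e, n_env, n_rep)
    print("nE =", n_env, "nr =", n_rep, "H2 =", round(h2, 3))
```

With the same total of 8 plots per entry, 4 environments x 2 replications gives a higher H2 than 2 environments x 4 replications, because only nE divides the genotype x environment variance.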
2.10 Biometric analysis of data
Before starting analysis
- Bring data into a format amenable to analysis: Software-specific
- Validate data for correctness: Do it BEFORE analysis!!!
  Saves time & trouble
- Understand Data Structure = Treatment Structure + Block Structure
- Nature of Data recorded on each variable [discrete,
  continuous, %, ...]
- Nature of Factors: Fixed, Random (See Appendix 1)
  This makes a lot of difference in analysis and inference
- Build a Model according to Structure and Nature of Data
2.11 Understanding the block structure
This refers to the manner in which local control (and
randomization) is (physically) exercised in the experiment.
Appreciation of this is crucial, since this is what
determines the valid error term(s) in an analysis.
In an Alpha design, field blocking is done at three levels: the field is
first divided into nr replicates, each replicate is subdivided into a
number of incomplete blocks (IB), and each IB is subdivided into a
number of plots, leading to the block structure
Replicate → Block → Plots     In GenStat: Replicate/Block
2.12 Single trial analysis using GenStat (Lab)
Use REML to get BLUPs of entry performance (G), treating entry,
block, and replicate effects as random, though with nr=2 or 3 the
replicate variance component may not be well estimated.
REML provides
Unbiased estimates of variance components;
BLUEs/GLSEs of fixed effects; and
BLUPs of random effects
But it requires the data to be normally distributed.
In GenStat, use the REML directive with
FIXED MODEL (blank)
RANDOM MODEL entry + replicate/block
to get entry BLUPs and their SE, and estimates of $\sigma_G^2$ and $\sigma_e^2$ and
their SE.
Use residual plots to check normality and homoscedasticity
Check effectiveness of blocking by comparing $\sigma_b^2$ with its SE
Line heritability for the trial can be computed as
$H^2 = \sigma_G^2 / [\sigma_G^2 + (\sigma_e^2/n_r)]$
The effect of alpha blocking on the magnitude of the parameter
estimates can be assessed by running an alternative REML
analysis in the following way
FIXED MODEL (blank)
RANDOM MODEL entry + replicate
2.13 Pooled analysis across trials using GenStat (Lab)
Use REML to get BLUPs of G and GEI, treating entry, block, replicate,
and environment effects as random.
In GenStat, use the REML directive with
FIXED MODEL (blank)
RANDOM MODEL entry*env + env.replicate/block
to get BLUPs of G and GEI and their SE, and estimates of $\sigma_G^2$, $\sigma_{GE}^2$
and $\sigma_e^2$ and their SE.
Line heritability across the nE trials can be estimated from
$H^2 = \sigma_G^2 / [\sigma_G^2 + (\sigma_{GE}^2/n_E) + \{\sigma_e^2/(n_E n_r)\}]$
The effect of alpha blocking on the magnitude of the parameter
estimates can be assessed by running an alternative REML analysis in
the following way
FIXED MODEL (blank)
RANDOM MODEL entry*env + env.replicate
2.14 Spatial analysis using GenStat/ASReml (Lab)
Field blocking using an experimental design must, of necessity,
be done a priori, using our best knowledge of the pattern
of underlying field variability
What matters, and is often neglected, is the pattern of
field variability that presents itself at the time of data
recording; this may not match the field blocking used a
priori: many things happen during the course of the
experiment that might create unknown and systematic
patterns of field variability along the rows and/or columns of
the experimental field
Analyzing data using the adopted design structure is then
unlikely to produce unbiased estimates of entry means
Spatial analysis can be used to detect/model these unknown
and systematic patterns of field variability and to use this
information to obtain unbiased estimates of entry means
Depending on the magnitude and pattern of this unknown and
systematic field variability, the entry means, after
spatial adjustment, can be entirely different from the entry
means obtained from a design-based analysis!
The spatially adjusted entry means may even result in reversal
of the sign of the phenotypic effect of a QTL, with disastrous
consequences for MAS
ASReml and GenStat can be used to get spatially adjusted
entry means (BLUPs/BLUEs)
It is better to do spatial analysis for each individual trial
separately, since the pattern of spatial variability may differ from
one trial to another
Chapter 3
Building a linkage map
Linkage map: linear arrangement of genetic markers (loci) on the
genome obtained on the basis of estimates of recombination
fractions among the markers
Data analysis to construct a genetic linkage map requires only
marker genotype data on mapping population individuals. It involves
- Single-locus analysis: Quality control by screening each marker
  to test conformity to expected Mendelian segregation
- Two-locus analysis: Estimation of recombination fraction and
  linkage detection
- Linkage grouping
- Locus ordering
3.1 Single-locus analysis for data screening
3.1A Backcross (BC) mapping population (Show BC)
A backcross (M/M x M/m) produces the following distribution of zygotes
for n progenies
                        M/M       M/m       Total
Expected frequency      1/2       1/2       1
Expected number (Ei)    E2=n/2    E1=n/2    n
Observed number (Oi)    O2=n2     O1=n1     n
$\chi^2 = \sum_i [(O_i - E_i)^2/E_i] \sim \chi^2(1)$, i=1,2 (large n)
$\chi^2 = \sum_i [\{(O_i - E_i)^2 - |O_i - E_i| + 1/4\}/E_i] \sim \chi^2(1)$, i=1,2 (small n)
Alternatively, a likelihood ratio (LR) test can be used
$LR = G = 2 \sum_i O_i \log_e(O_i/E_i) \sim \chi^2(1)$, i=1,2 (large n)
3.1B F2 mapping population (Show F2)
An F2 cross (M/m x M/m) produces the following distribution of zygotes
for n progenies
                        M/M     M/m     m/m     Total
Expected frequency      1/4     1/2     1/4     1
Expected number (Ei)    n/4     n/2     n/4     n
Observed number (Oi)    n2      n1      n0      n
$\chi^2 = \sum_i [(O_i - E_i)^2/E_i] \sim \chi^2(2)$, i=1,2,3 (large n)
$\chi^2 = \sum_i [\{(O_i - E_i)^2 - |O_i - E_i| + 1/4\}/E_i] \sim \chi^2(2)$, i=1,2,3 (small n)
$LR = G = 2 \sum_i O_i \log_e(O_i/E_i) \sim \chi^2(2)$, i=1,2,3 (large n)
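The large-sample tests above can be sketched with the standard library alone. The code below is an illustration under stated assumptions: the function names and the marker counts are hypothetical, and the p-values use the closed-form chi-square tail probabilities that exist for 1 and 2 degrees of freedom, which cover the BC and F2 cases.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x), closed form for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch supports only df = 1 or 2")

def segregation_test(observed, ratios):
    """Pearson chi-square and G (likelihood ratio) tests against a Mendelian ratio.

    observed: genotype class counts; ratios: expected ratio, e.g. (1, 2, 1) for F2.
    Returns (chi2, G, p_chi2, p_G).
    """
    n = sum(observed)
    expected = [n * ratio / sum(ratios) for ratio in ratios]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Classes with a zero count contribute nothing to G.
    g = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    df = len(observed) - 1
    return chi2, g, chi2_upper_tail(chi2, df), chi2_upper_tail(g, df)

# Hypothetical counts: an F2 marker (M/M, M/m, m/m) and a BC marker (M/M, M/m).
print(segregation_test([30, 50, 20], (1, 2, 1)))
print(segregation_test([60, 40], (1, 1)))
```

In this made-up example the F2 marker is consistent with 1:2:1 segregation, while the BC marker departs from 1:1 at the 5% level and would be flagged for checking.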
3.1C Effects of segregation distortion
Each marker should conform to expected Mendelian segregation.
If a few markers exhibit segregation distortion, drop them (!) from
further analysis because
- It biases estimation of recombination frequency between
  markers (downward),
- It reduces statistical power to identify QTLs (the frequency of false
  negatives increases), and
- It biases estimation of QTL position and effect
Causes of segregation distortion:
- Genotyping errors
- ...
3.2 Two-locus analyses for recombination fraction
estimation and linkage detection
Linkage among markers forms the basis for constructing linkage
maps and for subsequent molecular dissection of a quantitative trait (QT) using the
map
Linkage analysis is based on Mendelian laws about co-segregation
and co-transmission of different genes to the next
progeny generation
A prerequisite of linkage analysis between any 2 markers is
their known allelic arrangement (or linkage phase) on the
homologous chromosomes
With known linkage phases, parental vs non-parental
haplotypes can be readily ascertained (Add about inbred-P-based populations)
3.2A BC mapping population
Consider two markers A & B, each having two alleles, (A,a) and
(B,b)
Possible genotypes of these markers are (A/A, A/a) and (B/B,
B/b)
Linkage is the association of two genes located on the same
chromosome (Shift in previous Section)
If an offspring's genotype differs from the parental genotypes at
that marker, a recombination is observed (Show BC scheme)
Observed count and expected frequency of each two-marker genotype:
          B/B               B/b
A/A       n1, (1-r)/2       n2, r/2
A/a       n3, r/2           n4, (1-r)/2
Total recombinant events/individuals = n2 + n3 = nR
The ML estimator of r is
$r_{AB} = (n_2 + n_3)/n$, where $n = n_1 + n_2 + n_3 + n_4$
$\mathrm{Var}(r_{AB}) = r_{AB}(1 - r_{AB})/n$
The null hypothesis of no linkage ($H_0: r = 1/2$ vs $H_1: r < 1/2$) can be
tested using the $\chi^2$ test
$\chi^2 = (n_{NR} - n_R)^2/n$, df=1, where $n_{NR} = n_1 + n_4$ is the number of non-recombinants
The LOD score (log10 of odds, an LR test statistic) is more
frequently used for linkage detection
$Z = \log_{10}[L(r = r_{AB})/L(r = 1/2)]$
A threshold value of Z=3 is commonly used (in human
genetics) to declare existence of linkage.
Z=3 means that the observed data are 1000 times more likely
at r = rAB than at r=1/2
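The BC two-point quantities can be computed in a few lines. The sketch below is illustrative: the function name and the counts (45, 5, 5, 45) are hypothetical, the likelihood is the binomial likelihood of nR recombinants among n progeny, and the estimate is capped at 1/2 for the LOD score as an assumption of this sketch so that the score is never negative.

```python
import math

def bc_two_point(n1, n2, n3, n4):
    """Two-point linkage statistics for a backcross.

    n1, n4 = parental (non-recombinant) classes; n2, n3 = recombinant classes.
    Returns (r_hat, var_r, chi2, lod).
    """
    n = n1 + n2 + n3 + n4
    n_rec = n2 + n3
    n_nonrec = n1 + n4
    r_hat = n_rec / n
    var_r = r_hat * (1.0 - r_hat) / n
    chi2 = (n_nonrec - n_rec) ** 2 / n

    def log10_likelihood(r):
        # Binomial log-likelihood, dropping the constant term; empty
        # classes are skipped so that log10(0) is never evaluated.
        value = 0.0
        if n_rec:
            value += n_rec * math.log10(r)
        if n_nonrec:
            value += n_nonrec * math.log10(1.0 - r)
        return value

    r_for_lod = min(r_hat, 0.5)  # assumption: restrict the estimate to r <= 1/2
    lod = log10_likelihood(r_for_lod) - log10_likelihood(0.5)
    return r_hat, var_r, chi2, lod

r_hat, var_r, chi2, lod = bc_two_point(45, 5, 5, 45)
print("r =", r_hat, "SE =", round(math.sqrt(var_r), 4))
print("chi2 =", chi2, "LOD =", round(lod, 2))
```

For these made-up counts the recombination fraction is 0.10 and the LOD score is well above the conventional threshold of 3.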
Distorted segregation tends to increase Z, leading to detection
of spurious linkage. To overcome this problem, JoinMap uses a
modified LOD score computed as
$Z^* = G^* / [2 \ln(10)]$
$G^* = [(4 - k)k - 3](d - 1) + G_d$
$k = e^{-G_d/\{2(d-1)\}}$
$G_d = 2 \sum_{ij} [O_{ij} \ln(O_{ij}/E_{ij})]$, i=1,...,r; j=1,...,c; df = d = (r-1)(c-1),
where r and c here denote the numbers of rows and columns of the 2-way table
$O_{ij}$ = observed frequency in i-th row and j-th column of the 2-way
table for two given markers
$E_{ij}$ = expected frequency in i-th row and j-th column of the above
2-way table
3.3 Minimum sample size (ng) to detect linkage
An F2 mapping population with co-dominant markers has
the highest statistical power relative to all other models.
For F2 populations, dominant markers should generally be
avoided as they usually have a mixture of coupling and
repulsion linkage phases. Repulsion-linked dominant
markers provide the least linkage information and statistical
power.
Table 1. Minimum sample size needed to detect linkage at α=0.05
(Adapted from Liu 1998, Statistical Genomics, CRC Press, p 190)
True r   Power   BC    F2-CC   F2-CD   F2-DDc   F2-DDr
0.05     0.80    16    11      21      21       97
         0.90    21    15      27      28       130
------------------------------------------------------
0.10     0.80    21    16      28      30       108
         0.90    29    22      38      40       144
------------------------------------------------------
0.20     0.80    41    35      57      63       156
         0.90    55    47      77      85       209
------------------------------------------------------
0.30     0.80    95    88      139     166      297
         0.90    128   118     186     223      397
BC: Back cross; F2-CC: both co-dominant markers; F2-CD: one co-dominant and one dominant
marker; F2-DDc: both dominant markers in coupling phase; F2-DDr: both dominant markers in
repulsion phase.
3.4 Linkage grouping
Linkage grouping refers to placing loci into linkage groups
based on their linkage relationship (estimate of recombination
fraction).
A linkage group is a group of loci/markers where each locus is
linked (r < 1/2) to at least one other marker. (Ideal is r < 0.35)
Biologically, a linkage group is defined as a group of genes
with their loci located on the same chromosome.
Statistically, a linkage group is a group of genes inherited
together according to a certain statistical criterion, e.g. the Z* score
Loci on the same chromosome may be grouped into different
linkage groups based on a statistical criterion because loci on
a large segment of a chromosome may not be observed.
3.5 Linkage grouping criteria
Linkage grouping is usually based on estimates of either
(rAB, ZAB) or (rAB, pAB), where rAB, ZAB, and pAB are respectively the
two-point recombination fraction, LOD score, and significance
P-value for a pair of loci A and B
Criteria based on (rAB, ZAB): loci A and B belong to the same linkage group
If [{rAB ≤ c} and {ZAB ≥ a}], or alternatively
If [{rAB ≤ c} or {ZAB ≥ a}] (JoinMap recommends a ∈ [4,7])
Criteria based on (rAB, pAB): loci A and B belong to the same linkage group
If [{rAB ≤ c} and {pAB ≤ b}], or alternatively
If [{rAB ≤ c} or {pAB ≤ b}]
c = maximum recombination fraction value for declaring
a linkage
a = minimum LOD score value for declaring a linkage
b = maximum significant P-value for declaring a linkage
In practice, joint consideration of c, a, b, and biological
information, such as the number of chromosomes, could be used
to infer linkage groups.
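A minimal sketch of grouping under the "and" criterion on (rAB, ZAB) follows. It assumes single-linkage grouping, where a locus joins a group if it is linked to at least one member, implemented with a small union-find; the function name, the default thresholds (c = 0.35, a = 3) and the pairwise values are hypothetical, and this is not the procedure of any particular mapping package.

```python
def linkage_groups(markers, pairwise, c=0.35, a=3.0):
    """Group markers so that each is linked to at least one other group member.

    pairwise: {(marker1, marker2): (r, lod)} two-point estimates.
    A pair is declared linked when r <= c and lod >= a.
    """
    parent = {marker: marker for marker in markers}

    def find(marker):
        # Follow parent links to the group representative, compressing the path.
        while parent[marker] != marker:
            parent[marker] = parent[parent[marker]]
            marker = parent[marker]
        return marker

    for (m1, m2), (r, lod) in pairwise.items():
        if r <= c and lod >= a:
            parent[find(m1)] = find(m2)  # merge the two groups

    groups = {}
    for marker in markers:
        groups.setdefault(find(marker), []).append(marker)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical two-point results for five markers.
pairwise = {
    ("M1", "M2"): (0.10, 8.0),
    ("M2", "M3"): (0.20, 4.5),
    ("M1", "M3"): (0.28, 2.1),   # fails the LOD threshold on its own
    ("M3", "M4"): (0.48, 0.1),   # unlinked
    ("M4", "M5"): (0.05, 12.0),
}
print(linkage_groups(["M1", "M2", "M3", "M4", "M5"], pairwise))
```

M1 and M3 end up in the same group even though their own pairwise test fails, because both are linked to M2. Raising a or lowering c splits groups; relaxing them merges groups, which is why the expected number of chromosomes is a useful check.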
3.6 Locus ordering: Critically important

Locus order is defined as the relative linear arrangement of
loci/markers on a linkage group. Multiple-locus ordering is
based on minimizing the number of crossovers.
Let $(a_1, a_2, \ldots, a_k)$, $k \ge 2$,
denote k loci on a linkage group with map distances (cM)
$(d_{12}, d_{23}, \ldots, d_{k-1,k})$
where map distance $d_{ij}$ is derived from the recombination fraction
$r_{ij}$ and crossover interference (1-C) between loci i and j.
         rXY            rYZ
_____X__________Y_________________Z___
         dXY            dYZ
Haldane's (H) map function [interference absent, (1-C) = 0]
$d_{ij} = -(1/2) \ln(1 - 2 r_{ij})$
Kosambi's (K) map function [interference present, (1-C) > 0]
$d_{ij} = (1/4) \ln[(1 + 2 r_{ij}) / (1 - 2 r_{ij})]$
Both functions give $d_{ij}$ in Morgans; multiply by 100 to obtain cM.
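Recombination fractions are not additive because of multiple crossovers.
For example, for locus order (X, Y, Z)
$r_{XZ} = r_{XY} + r_{YZ} - 2 C r_{XY} r_{YZ}$
where C is the coefficient of coincidence, and (1-C) is the coefficient
of interference. Crossover interference means that a crossover in one
interval changes the probability of a second crossover in an adjacent
interval; C is the ratio of the observed double-crossover frequency to
the frequency expected if crossovers were independent.
Map distances, on the other hand, are additive, i.e.
$d_{XZ} = d_{XY} + d_{YZ}$
Two markers are said to be separated by d cM if d is the
expected number of crossovers between the markers in
100 meiotic products (a recombinant is observed only when an odd
number of crossovers occurs, which is why r understates d).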
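As a worked check of the two map functions and of the additivity statement, here is a minimal Python sketch. The function names and the example recombination fractions are illustrative choices, and distances are returned in cM.

```python
import math


def haldane_cm(r):
    """Haldane map distance (cM) for recombination fraction r, no interference."""
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map distance (cM) for recombination fraction r."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


# Locus order (X, Y, Z) with illustrative recombination fractions.
r_xy, r_yz = 0.10, 0.20
coincidence = 1.0  # C = 1, i.e. no interference (Haldane's assumption)
r_xz = r_xy + r_yz - 2.0 * coincidence * r_xy * r_yz

# Recombination fractions do not add up, but Haldane distances do.
d_xz = haldane_cm(r_xz)
d_sum = haldane_cm(r_xy) + haldane_cm(r_yz)
print(round(r_xz, 4), round(d_xz, 3), round(d_sum, 3))

# For the same r, Haldane never gives a shorter distance than Kosambi.
print(round(haldane_cm(0.20), 2), round(kosambi_cm(0.20), 2))
```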
The effect of the map function on QTL mapping is small. For a given
$r_{ij}$, the following relationship between H and K holds:
$d_{ij}(\text{Haldane}) \ge d_{ij}(\text{Kosambi})$.
For k loci, there are k!/2 possible locus orders if the orientation
of orders is ignored.
# loci:          2    3    5     10           20
# locus orders:  1    3    60    1,814,400    1.22 x 10^18
Finding the correct locus order is a computationally intensive
problem because the number of possible locus orders increases very
quickly with the number of loci.
Multiple-locus ordering approaches: Minimum SARF (sum of adjacent
recombination fractions), Minimum PARF (product of adjacent
recombination fractions), Maximum SALOD (sum of adjacent LOD
scores), Stam's (Weighted) Least Squares (Stam
1993; The Plant Journal 3(5):739-744, implemented in JoinMap).
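To make the size of the search space and the minimum-SARF idea concrete, here is a small brute-force sketch. It is illustrative only: the recombination fractions are made up (consistent with a true order A, B, C, D under no interference), and real software uses heuristic searches because exhaustive enumeration is only feasible for a handful of loci.

```python
import math
from itertools import permutations


def n_orders(k):
    """Number of distinct locus orders for k loci, ignoring orientation."""
    return math.factorial(k) // 2


def sarf(order, r):
    """Sum of adjacent recombination fractions for a given order."""
    return sum(r[frozenset((x, y))] for x, y in zip(order, order[1:]))


def min_sarf_order(loci, r):
    """Exhaustively search all orders and return (SARF, order) with minimum SARF."""
    best = None
    for perm in permutations(loci):
        if perm[0] > perm[-1]:
            continue  # skip the mirror image of an order already evaluated
        score = sarf(perm, r)
        if best is None or score < best[0]:
            best = (score, perm)
    return best


# Hypothetical two-point recombination fractions for four loci.
r = {
    frozenset(("A", "B")): 0.10, frozenset(("B", "C")): 0.15,
    frozenset(("C", "D")): 0.20, frozenset(("A", "C")): 0.22,
    frozenset(("B", "D")): 0.29, frozenset(("A", "D")): 0.332,
}
best_sarf, best_order = min_sarf_order(["A", "B", "C", "D"], r)
print(best_order, round(best_sarf, 3))
print([n_orders(k) for k in (2, 3, 5, 10, 20)])
```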
JoinMap uses values of r and Z*, accompanied by a $\chi^2$
goodness-of-fit test, to arrive at a locus order. Threshold
limits of $r \le 0.499$ and $Z^* \ge 0.001$ are recommended in order to
use all available information.
JoinMap, relative to MapMaker, offers more comprehensive
analytical tools and data-quality checks to build a genetic
map. The software DrawMap or GGT could be used to draw a
(publishable) genetic linkage map. The latter can be
downloaded from the internet. (Map length in MapMaker < JoinMap!)
3.7 Marker coverage and map density

The quality of a genetic map can be quantified using the
confidence in the estimated locus order and the distribution of
loci on the map.
An ideal genetic map is one with a high level of confidence in
the estimated locus order, with markers evenly distributed on
the map, and with sufficient density (e.g. at least one marker
within each 5 cM segment).
Factors affecting marker coverage and map density include,
among others, genome length, number of markers,
distribution of markers on the genome, mapping population size
(n), and type of mapping strategy.
Estimating the marker coverage and map density is therefore
useful not only to quantify the quality of a map, but also for
designing a more efficient genomic experiment.
3.8 Predicting marker coverage and map density
Given total genome length (L) and a genome map (with map
length L*), these can be easily estimated.
Marker coverage (c) is the ratio of genome map length and

total genome length, c=L*/L.
Map density is the average or the maximum map distance

between adjacent markers (gap).
Assuming that markers are randomly distributed, two

approaches are available to predict marker coverage and map
density
- Marker coverage approach

- Confidence probability approach
Marker coverage is the proportion of genome flanked by two

markers with a certain minimum map distance (say < 2d M)
between them over the whole genome.
Confidence probability approach considers that at least one

marker is located within a 2d M genome segment.
Both approaches are approximately the same when the ratio

of minimum map distance to genome length is small.
_.__._____.__.____.___.______ L1
__.___._____.___._.____.___._ L2
_.___.____.__.____.___.__.___ L3

+d 0 -d L=L1+L2+L3
Figure 1. The probability that a random marker is located within a 2d
genome segment is 2d/L
3.8A Marker coverage approach
Marker coverage (c) can be estimated as (Lange and Boehnke
1982, Am J Hum Genet 34:842-845)
$c = 1 - e^{-2md/L}$
m = number of markers, L = total genome length (cM)
Given the marker coverage c, genome length L, and minimum
distance 2d, the number of markers (m) needed can be
estimated from
$m = -L \ln(1 - c) / (2d)$
3.8B Confidence probability approach
The probability that at least one marker is located within a 2d
M genome segment is
$P = 1 - [1 - (2d/L)]^m$
The number of markers needed to have at least one marker
within a 2d M genome segment, given genome length L and a
confidence probability P, is
$m = \log(1 - P) / \log[1 - (2d/L)]$
The map density (2d) can be estimated from
$2d = L [1 - (1 - P)^{1/m}]$
Table 2. Number of markers needed at confidence probability P or marker
coverage c, assuming a genome length of 1 M (adapted from Liu 1998,
Statistical Genomics, CRC Press, p 352)

2d (cM)   --------- P ---------    --------- c ---------
          0.80    0.90    0.99     0.80    0.90    0.99
 1         161     230     459      161     231     461
 5          32      45      90       33      47      93
10          16      22      44       17      24      47
20           8      11      21        9      12      24
30           5       7      13        6       8      16
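The entries of Table 2 can be reproduced from the two formulas of Sections 3.8A and 3.8B. The sketch below assumes that the tabulated values are the formula results rounded up to the next whole marker and that the genome length is 100 cM (1 M); the function names are illustrative.

```python
import math


def markers_for_coverage(c, two_d, genome_len):
    """Markers needed for marker coverage c (marker coverage approach)."""
    return math.ceil(-genome_len * math.log(1.0 - c) / two_d)


def markers_for_confidence(p, two_d, genome_len):
    """Markers needed so that a 2d segment holds at least one marker with prob. p."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - two_d / genome_len))


genome_len = 100.0  # cM, i.e. 1 Morgan as assumed in Table 2
for two_d in (1, 5, 10, 20, 30):
    by_p = [markers_for_confidence(p, two_d, genome_len) for p in (0.80, 0.90, 0.99)]
    by_c = [markers_for_coverage(c, two_d, genome_len) for c in (0.80, 0.90, 0.99)]
    print(two_d, by_p, by_c)
```

The two approaches agree closely when 2d/L is small and diverge slightly for wide segments, as the table shows.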
Chapter 4
QTL Analysis
4.1 The key idea
A genetic marker that tends to co-segregate with the trait is
likely to be close to a QTL controlling that trait.
We seek an association between marker alleles
(genotypes) and trait values (phenotypes).
The purpose of QTL analysis is to infer QTL genotypes in order
to estimate QTL effects and locations using their statistical
association with known genetic markers: MAS vs QAS
But we only know
- Marker genotypes (M) for mapping population individuals
- Trait values (Y) of mapping population individuals
- Genetic markers located at certain positions on
genome for each of the mapping population individuals
The key to solving the problem is to use the statistical concept of
conditional probability.
The conditional probability that the QTL genotype is $Q_k$, given that the
observed marker genotype is $M_j$, is
$\Pr(Q_k \mid M_j) = \Pr(Q_k, M_j) / \Pr(M_j)$
where $\Pr(Q_k, M_j)$ and $\Pr(M_j)$ are the joint and marginal probabilities.
For example, in a backcross with recombination fraction r between
marker A and QTL Q, $\Pr(QQ, AA) = (1-r)/2$ and $\Pr(AA) = 1/2$, so that
$\Pr(QQ \mid AA) = 1 - r$.
The joint probability $\Pr(Q_k, M_j)$ is the probability of
co-segregation of the marker and the QTL.
Values of $\Pr(Q_k, M_j)$ and $\Pr(M_j)$ depend on the mapping population
used and the position of the putative QTL with respect to the
marker loci (known linkage map).
The trait values Y are then modeled as an appropriate
function of the conditional probabilities $\Pr(Q_k \mid M_j)$ to map
QTLs onto a given genetic map:
$Y = f\{\Pr(Q_k \mid M_j)\}$
The derivation of the conditional probabilities $\Pr(Q_k \mid M_j)$ requires
specification of a quantitative genetic model corresponding to the
mapping population, the genetic markers used, and a linkage map.
The goal is to identify a few QTLs with large effects for application
in MAS, MAB, and introgression.
4.2 Quantitative genetic models
A genetic model for a quantitative trait, in classical quantitative

genetics, is usually defined in terms of
Number of genes (say m)
(Phenotypic) effects of genes [additive (a), dominance (d)]
Gene frequencies
Relationships among genes (epistatic interactions), and
Relationship between environment and gene action (GxE
interactions)
4.3 Prelim estimate of number of Loci/QTLs
- To get some indication of chance of success to locate QTLs
- Easier to locate genes when only a few affect the trait
- Use Wright-Castle Index to get a lower bound on m
$m = (P_1 - P_2)^2 / (8 \sigma_a^2)$ (A)
where $P_1$ and $P_2$ are the trait means of the two parents and
$\sigma_a^2$ is the F2 additive genetic variance arising from differences
in allele frequencies of the two parental populations (Castle 1921,
Science 54:541-553)
- Assumptions in equation (A)
The m loci have equal effects and are additive
No linkage between the m loci
Complete fixation of alleles in P1 and P2 for the trait
- See also (for more on accurate estimation of m)
Comstock & Enfield 1981, TAG 59:373-379
Cockerham 1986, Genetics 114:659-664
Zeng et al. 1990, Genetics 126:235-247
Zeng 1992, Genetics 131:987-1001
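A quick numerical check of equation (A), as a sketch with made-up numbers: if m = 5 unlinked loci each have additive effect a = 2 and the parents are fixed for opposite alleles, then P1 - P2 = 2ma = 20 and the F2 additive variance is m a^2 / 2 = 10, so the index should return 5. The function name is illustrative.

```python
def castle_wright(mean_p1, mean_p2, var_additive_f2):
    """Wright-Castle lower-bound estimate of the number of loci, equation (A)."""
    return (mean_p1 - mean_p2) ** 2 / (8.0 * var_additive_f2)


n_loci, effect = 5, 2.0
mean_p1 = n_loci * effect            # all favourable alleles fixed in P1
mean_p2 = -n_loci * effect           # all unfavourable alleles fixed in P2
var_additive_f2 = n_loci * effect ** 2 / 2.0
m_hat = castle_wright(mean_p1, mean_p2, var_additive_f2)
print(m_hat)
```

When the assumptions listed above are violated (unequal effects, linkage, incomplete fixation), the index underestimates m, which is why it is treated as a lower bound.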
4.4 Genetic models for F2 and BC
Consider a bi-allelic locus $Q_k$ (k = 1, ..., m) with the values of the
three genotypes, measured as deviations from the midpoint of the two
homozygotes, defined as

Genotype   QkQk    Qkqk    qkqk
Value      $a_k$   $d_k$   $-a_k$

where $a_k$ is the additive effect and $d_k$ is the dominance effect.
Assume populations P1 and P2 are fixed for alleles Qk and qk,
respectively, at each of the m loci. Also, assume there is no epistasis.
4.4A F2 model (Qkqk x Qkqk)
Genotype   Obs#    Value         Variance
QkQk       nQQ     $\mu_{QQ}$    $\sigma^2$
Qkqk       nQq     $\mu_{Qq}$    $\sigma^2$
qkqk       nqq     $\mu_{qq}$    $\sigma^2$
Total      n
Additive effect $a_k = (\mu_{QQ} - \mu_{qq})/2$
Dominance effect $d_k = \mu_{Qq} - (\mu_{QQ} + \mu_{qq})/2$
The additive effect is the same as the average effect of gene
substitution since the expected allele frequencies of the two alleles
are the same (= 0.5) in an F2
4.4B BC model (QkQk x Qkqk)
Genotype   Obs#    Value         Variance
QkQk       nQQ     $\mu_{QQ}$    $\sigma^2$
Qkqk       nQq     $\mu_{Qq}$    $\sigma^2$
Total      n
Genetic effect $g_k = (\mu_{QQ} - \mu_{Qq})/2$
$= [\mu_{QQ} - (1/2)(\mu_{QQ} + \mu_{qq} + 2 d_k)]/2$
$= (a_k - d_k)/2$
=> $a_k$ and $d_k$ cannot be separated
4.4C Models for DH, RIL and test cross (TC)
These can be treated like a BC in terms of data analysis since
the expected genotypic frequencies are the same as in a BC.
But the interpretation of QTL mapping results will be different:
- QTL effects in BC are a mixture of a and d effects
- QTL effects in DH and RIL are purely additive
4.5 QTL analysis approaches
These can be classified according to the number of genetic
markers used as the unit of analysis:
Single-Marker Analysis (SMA)
(QTL Cartographer, ...)
Two-Marker Analysis
[Simple Interval Mapping (SIM)]
(QTL Cartographer, MapQTL, PlabQTL, ...)
Multiple-Marker Analysis
[Multiple regression, CIM, MQM]
(QTL Cartographer, PlabQTL, MapQTL, ...)
Multiple Interval Mapping (MIM)
A multiple QTL-oriented method that allows estimation of
number, positions, effects, and epistatic interactions
among significant QTLs simultaneously (QTL Cartographer)
Multiple-Trait IM and CIM
Testing and estimating QTLs affecting multiple traits;
Testing pleiotropy and pleiotropy-vs-linkage; Testing QTL x
environment interactions (QTL Cartographer, MultiQTL)
Non-Parametric Mapping (MapQTL)
4.6 Single-marker analysis (BC as an example)

                   r
------A--------------------------Q--------
------a--------------------------q--------
   marker                       QTL

Joint segregation of QTL and marker genotypes

Joint probability Pr(Q, A)
Marker     Obs#    Marginal     QTL genotype
genotype           Pr(A)        QQ          Qq
AA         nAA     1/2          (1-r)/2     r/2
Aa         nAa     1/2          r/2         (1-r)/2

Conditional probability Pr(Q | A)
Marker     Obs#    QTL genotype         Expected trait value
genotype           QQ        Qq
AA         nAA     1-r       r          $\mu_{AA} = (1-r)\mu_{QQ} + r\mu_{Qq}$
Aa         nAa     r         1-r        $\mu_{Aa} = r\mu_{QQ} + (1-r)\mu_{Qq}$

Above, the expected trait value, e.g. for marker genotype AA,
is derived as
$E(Y_{AA}) = \mu_{AA} = \Pr(QQ \mid AA)\,\mu_{QQ} + \Pr(Qq \mid AA)\,\mu_{Qq}$
$= (1-r)\,\mu_{QQ} + r\,\mu_{Qq}$
Single marker analysis can be done without a linkage map
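The step from joint to conditional probabilities and on to expected marker-class means can be traced numerically. The sketch below uses illustrative values (r = 0.1, QTL genotype means of 12 and 10); the function name is hypothetical.

```python
def bc_marker_means(r, mu_QQ, mu_Qq):
    """Conditional QTL probabilities and expected trait means per marker class (BC)."""
    joint = {
        "AA": {"QQ": (1 - r) / 2, "Qq": r / 2},
        "Aa": {"QQ": r / 2, "Qq": (1 - r) / 2},
    }
    result = {}
    for marker, row in joint.items():
        marginal = sum(row.values())  # Pr(marker genotype) = 1/2
        cond = {qtl: pr / marginal for qtl, pr in row.items()}
        mean = cond["QQ"] * mu_QQ + cond["Qq"] * mu_Qq
        result[marker] = (cond, mean)
    return result


res = bc_marker_means(r=0.1, mu_QQ=12.0, mu_Qq=10.0)
mean_AA, mean_Aa = res["AA"][1], res["Aa"][1]
print(res["AA"][0], round(mean_AA, 3))
print(res["Aa"][0], round(mean_Aa, 3))
# The marker contrast shrinks the QTL contrast by the factor (1 - 2r).
print(round(mean_AA - mean_Aa, 3), round((1 - 2 * 0.1) * (12.0 - 10.0), 3))
```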
4.6A Statistical tests for QTL detection
t-test
$t = [m(AA) - m(Aa)] / [s^2 \{(1/n_{AA}) + (1/n_{Aa})\}]^{1/2}$, df = n-2
$s^2$ = pooled variance within the two marker genotypes,
m(AA) = phenotypic mean of marker genotype AA
The expected value of the difference between the two marker genotype
means is
$E[m(AA) - m(Aa)] = \mu_{AA} - \mu_{Aa}$
$= (1-2r)(\mu_{QQ} - \mu_{Qq}) = 2(1-2r)\,g$
$= (1-2r)(a - d)$
$H_0: \mu_{AA} - \mu_{Aa} = 0$
=> (a - d) = 0, i.e. there is no genetic effect,
or r = 1/2, i.e. QTL and marker are independent
Since P1 and P2 were chosen to be different for the trait, the
condition $g \ne 0$ will be satisfied unless allele Q is completely
dominant to allele q (d = a).
Analysis of variance
A one-way ANOVA could also be used which, in the case of a BC having
only two marker genotype classes, gives the same inference
as a t-test since ANOVA $F = t^2$.
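The equivalence of the two tests is easy to verify on a toy data set. The trait values below are invented for illustration, and the helper names are hypothetical.

```python
def pooled_t(y_aa, y_ab):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    n1, n2 = len(y_aa), len(y_ab)
    m1, m2 = sum(y_aa) / n1, sum(y_ab) / n2
    ss_within = sum((y - m1) ** 2 for y in y_aa) + sum((y - m2) ** 2 for y in y_ab)
    s2 = ss_within / (n1 + n2 - 2)
    return (m1 - m2) / (s2 * (1 / n1 + 1 / n2)) ** 0.5, n1 + n2 - 2


def anova_f(y_aa, y_ab):
    """One-way ANOVA F statistic for two marker genotype classes."""
    n1, n2 = len(y_aa), len(y_ab)
    m1, m2 = sum(y_aa) / n1, sum(y_ab) / n2
    grand = (sum(y_aa) + sum(y_ab)) / (n1 + n2)
    ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
    ss_within = sum((y - m1) ** 2 for y in y_aa) + sum((y - m2) ** 2 for y in y_ab)
    return (ss_between / 1) / (ss_within / (n1 + n2 - 2))


# Hypothetical trait values for marker genotype classes AA and Aa.
y_aa = [12.1, 11.4, 12.8, 11.9, 12.5]
y_ab = [10.2, 10.9, 9.8, 10.5]
t_stat, df = pooled_t(y_aa, y_ab)
f_stat = anova_f(y_aa, y_ab)
print(round(t_stat, 3), df, round(f_stat, 3))
```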
Regression approach
Model $y_j = \beta_0 + \beta X_j + e_j$
where $y_j$ = trait value for test-unit j in the mapping population
$\beta_0$ = intercept = overall mean of trait
$\beta$ = slope of regression line
$X_j$ = 1 if test-unit j is AA, -1 if test-unit j is Aa
$e_j$ = random error of test-unit j, $e_j \sim N(0, \sigma^2)$
The expected means, variances, and covariances needed to estimate the
regression coefficients are
$E[m(X)] = (1/2)(1) + (1/2)(-1) = 0$
$E(s_X^2) = (1/2)(1)^2 + (1/2)(-1)^2 = 1$
$E[m(y)] = (1/2)[(1-r)\mu_{QQ} + r\mu_{Qq} + r\mu_{QQ} + (1-r)\mu_{Qq}] = (\mu_{QQ} + \mu_{Qq})/2$
$E(s_{Xy}) = (1/2)(1)[(1-r)\mu_{QQ} + r\mu_{Qq}] + (1/2)(-1)[r\mu_{QQ} + (1-r)\mu_{Qq}]$
$= (1/2)(1-2r)(\mu_{QQ} - \mu_{Qq})$
Therefore
$\beta_0 = E[m(y)] = (\mu_{QQ} + \mu_{Qq})/2$
$\beta = E(s_{Xy}) / E(s_X^2) = E(s_{Xy}) = (1/2)(1-2r)(\mu_{QQ} - \mu_{Qq})$
$= (1/2)(1-2r)(a - d)$
$H_0: \beta = 0$
=> (a - d) = 0, i.e. there is no genetic effect,
or r = 1/2, i.e. QTL and marker are independent
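The expected slope can be checked by simulation. The following sketch simulates a backcross under assumed values (r = 0.1, QTL genotype means 12 and 10, residual SD 1), codes the marker as +1/-1, and fits the regression by ordinary least squares. All names and parameter values are illustrative.

```python
import random


def simulate_bc(n, r, mu_QQ, mu_Qq, sigma, seed=1):
    """Simulate marker codes (+1 = AA, -1 = Aa) and trait values for a BC."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        qtl_is_QQ = rng.random() < 0.5
        recombinant = rng.random() < r
        marker_is_AA = qtl_is_QQ != recombinant  # marker matches QTL unless recombinant
        x.append(1 if marker_is_AA else -1)
        y.append(rng.gauss(mu_QQ if qtl_is_QQ else mu_Qq, sigma))
    return x, y


def ols(x, y):
    """Intercept and slope of the simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


x, y = simulate_bc(n=2000, r=0.1, mu_QQ=12.0, mu_Qq=10.0, sigma=1.0)
intercept, slope = ols(x, y)
y_AA = [yi for xi, yi in zip(x, y) if xi == 1]
y_Aa = [yi for xi, yi in zip(x, y) if xi == -1]
half_diff = (sum(y_AA) / len(y_AA) - sum(y_Aa) / len(y_Aa)) / 2
expected_slope = 0.5 * (1 - 2 * 0.1) * (12.0 - 10.0)
print(round(slope, 3), round(half_diff, 3), expected_slope)
```

With the +1/-1 coding the fitted slope is exactly half the difference between the two marker-class means, so the regression and the t-test address the same contrast.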
Maximum likelihood approach
- Assume that the trait values for each QTL genotype QQ
and Qq follow a normal distribution with means $\mu_{QQ}$ and
$\mu_{Qq}$ respectively and constant variance $\sigma^2$.
- Then, since
$\mu_{AA} = (1-r)\,\mu_{QQ} + r\,\mu_{Qq}$
$\mu_{Aa} = r\,\mu_{QQ} + (1-r)\,\mu_{Qq}$
the distribution of trait values for each of the two marker
genotype classes AA and Aa will be a mixture of two
normal distributions
- The likelihood for a BC population with n individuals is
$L(\mu_{QQ}, \mu_{Qq}, \sigma^2, r \mid y_i) = L(\mu_{QQ}, \mu_{Qq}, \sigma^2, r)$
$= (2\pi\sigma^2)^{-n/2} \prod_{i=1}^{n} \sum_{j=1}^{2} \Pr(Q_j \mid A_i) \exp[-(y_i - \mu_j)^2 / (2\sigma^2)]$ (1)
i = 1, ..., n ; j = 1, 2
which is the joint density function for a mixture of two normal
distributions in unequal proportions (1-r) and r.
- It is assumed in L that each of the four marker-QTL genotype
classes has a constant variance $\sigma^2$, where
$\Pr(Q_j \mid A_i)$ = conditional probability of the QTL genotype
being $Q_j$ given that the marker genotype is $A_i$
(the mixing proportion, a function of r)
$y_i$ = observed trait value of individual i
$\mu_j$ = expected trait value of QTL genotype j
Test statistic to test $H_0: r = 1/2$
- Log-likelihood ratio (LR) test
$G = 2 \ln [L\{m_{QQ}, m_{Qq}, s^2, \hat{r}\} / L(m_{QQ}, m_{Qq}, s^2, r = 1/2)]$
which is distributed asymptotically as a $\chi^2$ with df = 1.
- LOD score
$Z = \log_{10} [L\{m_{QQ}, m_{Qq}, s^2, \hat{r}\} / L(m_{QQ}, m_{Qq}, s^2, r = 1/2)]$
- Relation between LR statistic G and LOD score
G is computed using the natural logarithm.
LOD is computed using the base-10 logarithm.
LOD = 0.21715 G, G = 4.60517 LOD
- G is interpreted through its P-value: the probability, under the
null hypothesis, of a value at least as large as the one observed,
obtained from the theoretical $\chi^2$ distribution.
- LOD is interpreted using the concept of an odds ratio. A LOD = 2
means that the observed data are $10^2 = 100$ times more
likely under the alternative hypothesis than under the null hypothesis.
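The conversion between G and LOD, and the chi-square P-value for G with df = 1, can be computed with the standard library alone. This is a small sketch with illustrative function names; the P-value uses the identity that the upper tail of a chi-square with 1 df equals erfc(sqrt(G/2)).

```python
import math


def lod_from_g(g):
    """Convert the LR statistic G (natural log) to a LOD score (base-10 log)."""
    return g / (2.0 * math.log(10.0))


def g_from_lod(lod):
    """Convert a LOD score to the LR statistic G."""
    return 2.0 * math.log(10.0) * lod


def chi2_df1_pvalue(g):
    """Upper-tail probability of a chi-square(1) variable exceeding g."""
    return math.erfc(math.sqrt(g / 2.0))


print(round(lod_from_g(1.0), 5), round(g_from_lod(1.0), 5))
print(round(chi2_df1_pvalue(3.841), 4))             # close to 0.05
print(round(chi2_df1_pvalue(g_from_lod(2.0)), 4))   # pointwise P-value for LOD = 2
```

Note that this is a pointwise P-value; Section 4.10 explains why a genome-wide threshold needs a stricter criterion.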
4.7 Multiple marker analysis using multiple regression

No linkage map is needed.
A simple extension of single-marker regression analysis to
identify markers simultaneously linked to trait loci:
$y_j = \beta_0 + \sum_m \beta_m X_{jm} + e_j$, m = 1, ..., M; j = 1, ..., n
Identify markers with $\beta_m \ne 0$.
With M markers at hand, there are $2^M$ possible models to
search to find the most parsimonious model, so a search strategy
and a criterion to choose the final model are needed.
Search strategy: step-wise regression, backward elimination,
forward selection, MCMC, ...
Model selection criterion: adjusted $R^2$ ($R_a^2$), AIC, BIC, ...
Our experience: Use step-wise regression with increasingly
stringent $F_{in}$ and $F_{out}$ to filter out the most important unlinked
trait-linked markers ($F_{in} = F_{out} \in \{3, 4, \ldots, 15\}$).
BIC is another good candidate to identify important unlinked
trait-linked markers.
With a large sample of progenies, results from stringent
$F_{in}$-$F_{out}$-based step-wise regression closely match those from
map-based QTL analysis.
4.8 Simple interval mapping (SIM)

The single-marker analysis has shortcomings
- Gives indirect estimates of QTL position and effect since both
are intrinsically confounded [$H_0: (1 - 2r_{AQ})(\mu_{QQ} - \mu_{Qq}) = 0$]
- Cannot separate linked QTLs
- Ignores effect of other QTLs
- Order of loci cannot be resolved: QA or AQ?
In SIM, intervals formed by adjacent markers are used as the unit
of analysis to test for the presence of a single QTL by using
genetic information from the two flanking markers.
It becomes possible to separate r and the QTL effect, and to infer
the position of Q relative to both markers.
More precision and power are expected from use of the additional
information on the second marker.
        r1            r2
___A__________Q_______________B______
___a__________q_______________b______
   |<---------- r ----------->|
Assuming no crossover interference (1-C = 0), the relationship among the
recombination fractions is
$r = r_1 + r_2 - 2 r_1 r_2$
With small r, the double-crossover term ($2 r_1 r_2$) may be ignored, which
reduces the above equation to
$r = r_1 + r_2$
The QTL position can be represented by a point relative to the interval
between the two markers by a position parameter
$\rho = r_1 / r$
with $1 - \rho = 1 - (r_1 / r) = r_2 / r$
A linkage map must be available for SIM.
4.8a SIM in a BC population
Parents   AAQQBB x aaqqbb
F1        AaQqBb x AAQQBB

BC progeny                      Joint
(8 marker-QTL genotypes)        prob
AAQQBB, AaQqBb                  (1-r)/2
AAQqBb, AaQQBB                  r1/2
AAQQBb, AaQqBB                  r2/2
AAQqBB, AaQQBb                  0 (resulting from double crossover)

If A and B are tightly linked, the possibility of a double crossover in
both segments AQ and QB can be ignored.
Then, the frequency of marker-QTL genotypes AAQqBB and AaQQBb,
being results of double crossovers, is relatively small and can be
practically treated as zero, as shown above.

Marginal and joint probabilities
Marker     Obs#    Marginal      Joint probability P(Q, M)
genotype           prob P(M)     QQ          Qq
AABB       n1      (1-r)/2       (1-r)/2     0
AABb       n2      r/2           r2/2        r1/2
AaBB       n3      r/2           r1/2        r2/2
AaBb       n4      (1-r)/2       0           (1-r)/2

See the two-locus analysis in Chapter 3
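Conditional probabilities P(Q | M) and expected trait values
for marker genotypes, with $\rho = r_1 / r$

Marker     P(Q | M)                            Expected trait value
genotype   QQ               Qq                 of marker genotype
AABB       1                0                  $\mu_{QQ} = \mu_{AABB}$
AABb       r2/r = 1-$\rho$  r1/r = $\rho$      $(1-\rho)\mu_{QQ} + \rho\mu_{Qq} = \mu_{AABb}$
AaBB       r1/r = $\rho$    r2/r = 1-$\rho$    $\rho\mu_{QQ} + (1-\rho)\mu_{Qq} = \mu_{AaBB}$
AaBb       0                1                  $\mu_{Qq} = \mu_{AaBb}$
Mean       $\mu_{QQ}$       $\mu_{Qq}$         $(\mu_{QQ} + \mu_{Qq})/2$

If $\rho = 0$ --> QTL located right at marker A
$\rho = 1/2$ --> QTL located in the middle of A and B
$\rho = 1$ --> QTL located right at marker B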
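The conditional probabilities for the four flanking-marker classes follow directly from r1 and r2 once double crossovers are ignored. The sketch below writes the position parameter r1/r as `rho`; that name, the function name, and the numeric values are illustrative assumptions.

```python
def sim_bc_expectations(r1, r2, mu_QQ, mu_Qq):
    """P(QQ | flanking marker class) and expected class means in a BC.

    Double crossovers are ignored, so r = r1 + r2 and rho = r1 / r.
    """
    r = r1 + r2
    rho = r1 / r
    prob_QQ = {"AABB": 1.0, "AABb": 1.0 - rho, "AaBB": rho, "AaBb": 0.0}
    means = {
        marker: p * mu_QQ + (1.0 - p) * mu_Qq for marker, p in prob_QQ.items()
    }
    return rho, prob_QQ, means


rho, prob_QQ, means = sim_bc_expectations(r1=0.05, r2=0.15, mu_QQ=12.0, mu_Qq=10.0)
print(round(rho, 3))
for marker in ("AABB", "AABb", "AaBB", "AaBb"):
    print(marker, round(prob_QQ[marker], 3), round(means[marker], 3))
```

Here the QTL sits a quarter of the way from A to B, so the recombinant class AABb (which kept the A allele of the QQ parent) is three times more likely to be QQ than Qq.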
Regression approach
- Model: $y = \beta_0 + X^* a + e$
$X^* = P(Q_j \mid M_i)$, computed for each map position
a = effect of putative QTL
- Regression is computed at regular positions in each interval.
- The LR test statistic for each map position is computed as
$LR = n \ln(SSRes_{red} / SSRes_{full})$
$= -n \ln(1 - R^2)$
$= 2 \ln(10) \cdot LOD$
$\approx p\,(MS_{reg} / MS_{res}) = p\,F_{reg}$
p is the number of parameters (including QTL position) fitted,
$R^2$ is the coefficient of determination, n is the number of progenies
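A minimal regression scan over one marker interval might look as follows. This is a sketch under simplifying assumptions: a backcross, double crossovers ignored, positions indexed by rho = r1/r rather than by cM, and a made-up data set of 12 progenies. Function and variable names are hypothetical.

```python
import math


def regression_lr(genotypes, y, rho):
    """Regress y on X* = P(QQ | flanking markers) at position rho; return statistics."""
    x_star = {"AABB": 1.0, "AABb": 1.0 - rho, "AaBB": rho, "AaBb": 0.0}
    x = [x_star[g] for g in genotypes]
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    ss_red = sum((yi - my) ** 2 for yi in y)   # residual SS of the model without QTL
    ss_full = ss_red - sxy ** 2 / sxx          # residual SS of the model with QTL
    r2 = 1.0 - ss_full / ss_red
    lr = -n * math.log(1.0 - r2)
    return {"effect": sxy / sxx, "r2": r2, "lr": lr,
            "lod": lr / (2.0 * math.log(10.0)),
            "ss_red": ss_red, "ss_full": ss_full}


# Hypothetical flanking-marker classes and trait values for 12 BC progenies.
genotypes = ["AABB"] * 4 + ["AABb"] * 2 + ["AaBB"] * 2 + ["AaBb"] * 4
y = [12.3, 11.8, 12.6, 11.9, 11.7, 10.9, 10.6, 11.2, 10.1, 9.8, 10.4, 9.9]
scan = {rho: regression_lr(genotypes, y, rho) for rho in (0.0, 0.25, 0.5, 0.75, 1.0)}
for rho, res in scan.items():
    print(rho, round(res["effect"], 3), round(res["lod"], 3))
```

The position with the largest LOD in the scan is the regression estimate of the QTL position within the interval.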
- Regression-LR provides a very close approximation to ML-LR if

the distance between adjacent markers is <20 cM and there are
not many missing data
- Regression approach offers many advantages

Faster computations
Application of standard statistical software
Flexibility in choice of genetic models (including epistasis)
Incorporation of experimental design features
Robust to non-normality
- The LR (or LOD) is plotted against genome position and compared
with a genome-wide threshold, giving the LR (or LOD) profile
- Wherever the LOD curve exceeds the threshold, presence of a

QTL is inferred
- The point at which the LOD is maximum is used as an estimate of

QTL position
- A one- or two-LOD support interval around the inferred QTL

position is used as an interval estimate of QTL position
- SIM provides no improvement over single marker analysis in

detecting QTLs when using a relatively dense marker map (10 cM
spacing or less) and a small or moderate number of progeny (500
or less)
- The benefit of SIM is in providing more precise estimates of QTL

location and effects
4.9 Composite interval mapping

SIM is a single-QTL model, so multiple QTLs cannot be clearly
determined.
Location of QTLs may not be clearly resolved, and ghost QTLs can
occur.
- Ghost effects occur when a QTL is located in an interval, and
adjacent intervals too exhibit a significant LOD score
- The problem is that IM does not take into account all markers at once,
but uses them only two at a time. This makes it difficult to
differentiate between actual and ghost QTLs that may exist simply
due to the relative density of the map
- If there are QTLs Q1 and Q2 in non-adjacent intervals (M1,M2)
and (M3,M4), there may be a spurious indication of a QTL in the
intervening interval (M2,M3) (Martinez & Curnow 1992, TAG 85:480-488)
With other QTLs segregating in the mapping population, the effects
of these and their interactions with the QTL in the target SIM
segment are pooled into the experimental error (genetic
background effects) since they are not included in the
statistical model. This reduces the power to detect QTL.
The basic assumption of SIM is that there is only a single
segregating QTL affecting the trait. However, if more than one
QTL exists, the SIM test statistics in different segments are not
independent.
Composite interval mapping (CIM) is a combination of interval
mapping and multiple linear regression. For a BC population, the
statistical model for the CIM analysis of a genome segment between
markers Mi and Mi+1 is
          |<--- r --->|
          |<r1>|<-r2->|
___.______.____.______._____________._________
  Mi-1    Mi   Q     Mi+1          Mi+2
         (A)  QTL    (B)
Model: $y_j = \beta_0 + \beta_i X_{ij} + \sum_{k \ne i, i+1} \beta_k X_{kj} + e_j$
j = 1, ..., n; k = 1, ..., $n_m^*$
$\beta_0$ = intercept of the model
$\beta_i$ = genetic effect of putative QTL Q located between
markers Mi and Mi+1
$X_{ij}$ = 1 for marker genotype AABB
0 for marker genotype AaBb
1 with prob = $\rho$ and 0 with prob = $1-\rho$ for AaBB
1 with prob = $1-\rho$ and 0 with prob = $\rho$ for AABb
(so that $X_{ij} = 1$ indicates QTL genotype QQ, with $\rho = r_1/r$)
$\beta_k$ = partial regression coefficient of $y_j$ on marker k
($k \ne i$, $k \ne i+1$)
$X_{kj}$ = 1 for marker genotype AA
0 for marker genotype Aa
$e_j$ = random error, $e_j \sim N(0, \sigma^2)$
$n_m^*$ = number of markers selected as cofactors
The effects of other QTLs, if any, other than the one in the test
interval (Mi,Mi+1), are removed through the regression on
markers outside the test interval. This increases the power of QTL
detection.
The CIM model parameters can be estimated using maximum
likelihood. Alternatively, a LS regression approach, replacing
X by X* as in SIM, can be used. The basic procedure for both is
similar to SIM described earlier.
The LR test statistic G or LOD score Z can be computed for

each position on the genome in a similar manner as for SIM
The values of the test-statistic (G or Z) are plotted against

genome positions. The resulting plot, using an appropriate
significance threshold value, can be used to infer the
presence, position, and effects of putative QTLs
4.9a Practical implementation of CIM

Step 1: Selection of markers to act as co-factors
This can be done by SIM, or by multiple regression on markers using a
step-wise regression procedure
Step 2: QTL detection
Two possible alternatives are to use as co-factors
a. Only unlinked or distantly linked markers selected in Step 1
b. These markers plus the closest linked markers with a minimum
distance of 20 to 30 cM from the study interval
Then
c. Compare the results from a and b
d. A putative QTL is declared to be present if the test statistic
shows a clear peak at least under (b) and exceeds the critical
threshold either under (a) or (b).
Step 3: Estimation of QTL effects
Obtain these by separately fitting a model in which all QTL positions
identified in Step 2 are considered simultaneously
4.9b Step-wise regression for cofactor selection
$2^u$ possible models with u markers/predictors;
Specify $F_{in}$ and $F_{out}$ values; PlabQTL recommends both to be
set at 3.5 for $\alpha = 0.25$
Select the first predictor as the one having the largest F-value
Enter extra predictors if
$R_{in} = [SSRes_u - SSRes_{u+1}] / [SSRes_{u+1} / (n-u-2)] > F_{in}$
Delete a predictor from the model if
$R_{out} = [SSRes_{u-1} - SSRes_u] / [SSRes_u / (n-u-1)] < F_{out}$
Stepwise procedure performs better when predictors are highly

correlated
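The entry and deletion rules of Section 4.9b reduce to two ratios of residual sums of squares. The sketch below evaluates them for made-up numbers (n = 100 progenies, residual SS dropping from 500 to 470 when a third marker is added); it does not fit any regression itself, and the function names are illustrative.

```python
def f_to_enter(ss_res_u, ss_res_u_plus_1, n, u):
    """R_in for adding one predictor to a model that already has u predictors."""
    return (ss_res_u - ss_res_u_plus_1) / (ss_res_u_plus_1 / (n - u - 2))


def f_to_remove(ss_res_u_minus_1, ss_res_u, n, u):
    """R_out for deleting one predictor from a model that has u predictors."""
    return (ss_res_u_minus_1 - ss_res_u) / (ss_res_u / (n - u - 1))


F_IN = F_OUT = 3.5   # thresholds quoted in the notes
n = 100

# Candidate third marker: residual SS falls from 500 (u = 2) to 470 (u = 3).
r_in = f_to_enter(500.0, 470.0, n, u=2)
print(round(r_in, 3), "enter" if r_in > F_IN else "skip")

# The same marker, once in the model (u = 3), is tested for deletion.
r_out = f_to_remove(500.0, 470.0, n, u=3)
print(round(r_out, 3), "delete" if r_out < F_OUT else "keep")
```

Both ratios use the same residual degrees of freedom here (96), so a marker that just entered is not immediately deleted when the two thresholds are equal.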
4.10 Significance threshold for QTL detection
A significance threshold for the test statistic G (or Z) depends on
the Type I error $\alpha$ (incorrectly inferring the presence of a QTL, a
false positive) and the number of markers genotyped, the latter
resulting in, say, M intervals.
The issue is complicated because
- the distribution of the test statistic under H0 is often not clear
- multiple simultaneous tests are performed, many of which are
not independent
In CIM, G (or Z) is nearly independent in different intervals.
One could therefore use the Bonferroni approximation $\alpha/M$ for
individual tests to control a specified overall Type I error $\alpha$.
Another alternative is to use permutation methods to derive an
empirical significance threshold from the experimental data.
Use at least 1000 permutations for $\alpha = 0.05$, and at least 10000
for $\alpha = 0.01$.
The choice of $\alpha$ is not a technical issue. It depends on the goals of
the experiment. QTL researchers often tend to use $\alpha = 0.05$. But
for exploratory QTL studies, $\alpha = 0.25$ could be acceptable (Beavis
1998, In Molecular Dissection of Complex Traits, Ed Paterson, CRC Press,
pp 145-162).
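A permutation threshold can be sketched as follows. The example is deliberately simplified and all names and data are hypothetical: the test statistic is the largest absolute single-marker t value over five markers rather than a CIM LOD profile, the data are simulated under the null hypothesis, and only 200 permutations are used to keep the run short (the notes recommend at least 1000).

```python
import math
import random


def pooled_t(y0, y1):
    """Two-sample t statistic with pooled variance."""
    n0, n1 = len(y0), len(y1)
    m0, m1 = sum(y0) / n0, sum(y1) / n1
    ss = sum((v - m0) ** 2 for v in y0) + sum((v - m1) ** 2 for v in y1)
    s2 = ss / (n0 + n1 - 2)
    return (m0 - m1) / math.sqrt(s2 * (1 / n0 + 1 / n1))


def max_abs_t(markers, y):
    """Largest |t| over all markers (the genome-wide test statistic)."""
    best = 0.0
    for column in markers:
        y0 = [yi for gi, yi in zip(column, y) if gi == 0]
        y1 = [yi for gi, yi in zip(column, y) if gi == 1]
        best = max(best, abs(pooled_t(y0, y1)))
    return best


def permutation_threshold(markers, y, alpha=0.05, n_perm=1000, seed=0):
    """Empirical (1 - alpha) quantile of the maximum statistic under permutation."""
    rng = random.Random(seed)
    shuffled = list(y)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any marker-trait association
        null_max.append(max_abs_t(markers, shuffled))
    null_max.sort()
    index = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return null_max[index]


n, n_markers = 60, 5
markers = [[(i // (k + 1)) % 2 for i in range(n)] for k in range(n_markers)]
data_rng = random.Random(42)
y = [data_rng.gauss(10.0, 1.0) for _ in range(n)]

thr_05 = permutation_threshold(markers, y, alpha=0.05, n_perm=200)
thr_01 = permutation_threshold(markers, y, alpha=0.01, n_perm=200)
bonferroni_alpha = 0.05 / n_markers
print(round(thr_05, 3), round(thr_01, 3), bonferroni_alpha)
```

Permuting the trait values keeps the marker data, and hence the correlation among tests, intact, which is what the Bonferroni approximation ignores.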
4.11 Estimation of key MAS parameters
Relative to phenotypic selection, the efficiency of MAS is
defined as $\sqrt{p / H^2}$ (Lande & Thompson 1990, Genetics 124:743-756),
where p is the proportion of genetic variance accounted for by
detected QTLs, i.e., $p = \sigma^2_{QTL} / \sigma^2_G$. Parameter p can be
estimated as
$p = R_a^2 / H^2$
$R_a^2 = 1 - \{(n-1)/(n-u)\}(1 - R^2) = \sigma^2_{QTL} / \sigma^2_P$
$H^2 = \sigma^2_G / \sigma^2_P$
$R^2 = SS_{reg} / SS_{total} = SS_{QTL} / SS_{total}$
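As a worked example of these estimators, the sketch below plugs in made-up values (R^2 = 0.40 from a fit with u = 3 on n = 344 genotypes, and H^2 = 0.80). The function names are illustrative, and the adjustment uses the (n-1)/(n-u) factor exactly as written above.

```python
def adjusted_r2(r2, n, u):
    """Adjusted R^2 with the (n - 1)/(n - u) correction used in the notes."""
    return 1.0 - ((n - 1) / (n - u)) * (1.0 - r2)


def proportion_explained(r2, n, u, h2):
    """Estimate of p, the share of genetic variance explained by detected QTLs."""
    return adjusted_r2(r2, n, u) / h2


r2_adj = adjusted_r2(0.40, n=344, u=3)
p_hat = proportion_explained(0.40, n=344, u=3, h2=0.80)
print(round(r2_adj, 4), round(p_hat, 4))
```

The adjustment matters little at n = 344 but becomes substantial for small populations fitted with many QTLs.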
Factors affecting p are
Sample size n
Population type
Heritability
Genetic architecture
Data set used for    Common practice    L&T
QTL detection        D                  D
QTL estimation       D (D = E)          E
4.12 Cross validation and bootstrapping

Cross validation (CV) is a statistical method to evaluate the
predictive capability of a given set of models.
Utz et al. (2000, Genetics 154:1839-1849) used k-fold CV to assess

the bias and sampling error in estimated p arising from
genotypic and environmental sampling. The principle is shown
in Fig. 2.
Fig. 2. Subdivision of the data set used for cross validation (CV) into
sub-samples. Data from the estimation set (ES) serve for QTL
detection and mapping. Data from the test set (TS) are used
for obtaining asymptotically unbiased estimates of p.
[Figure labels: Genotypic sample N = 344; Estimation set (ES) N = 276, u = 3]
In 5-fold cross validation, entire data set (EDS), having N

genotypes, is divided into five equal parts. QTL detection and
position-estimation by an established method (e.g., CIM by
the regression approach) is based on the estimation set (ES),
comprising genotypes in four subsets using phenotypic data
from all environments but one
Genotypic data from 5th subset together with phenotypic data

from remaining environment, the test set (TS), are used for
estimation of QTL effects and p. This procedure is repeated
250 times with different partitioning of EDS into ES and TS
Step                              Estimation Set (ES)               Test Set (TS)
QTL detection (model selection)   $X^*_{es}$                        $X^*_{ts}$
Estimation of QTL effects         $b_{es}$                          Predict $Y_{ts}$ as $y_{ts} = X^*_{ts} b_{es}$
Estimation of p                   $p_{es} = R^2_{adj} / H^2_{es}$   $p_{ts} = r^2(y_{ts}, Y_{ts}) / H^2_{ts}$
Bootstrap (BS) is a resampling technique to measure the

quality of statistical estimates and models (Efron & Tibshirani
1993, An Introduction to the Bootstrap, Chapman & Hall)
A valid application of BS requires that the sample being
bootstrapped consists of independent observations. This
assumption is satisfied by the data used in QTL analysis,
because the progenies are a representative sample from a
segregating population
The original sample of N progenies in EDS is used to generate

so-called BS samples each of size N using sampling with
replacement.
A large number (b = 1, ..., B) of independent BS samples are
generated. The original sample is used for QTL mapping and
yields an estimate $\hat{p}$ of p. Likewise, each BS sample is used
for QTL mapping, with the same decision rules (LOD threshold,
selection of cofactors etc.) as for the original
sample, and yields an estimate $\hat{p}_b$.
The standard error (SE) and bias of $\hat{p}$ can be estimated as
follows:
$SE(\hat{p}) = SD(\hat{p}_b) = [\sum_b (\hat{p}_b - \bar{p}_B)^2 / (B-1)]^{1/2}$, where $\bar{p}_B = \sum_b \hat{p}_b / B$
$bias(\hat{p}) = \sum_b \hat{p}_b / B - \hat{p}$
A bias-corrected estimate of p can be obtained as
$\hat{p}^* = \hat{p} - bias(\hat{p}) = 2\hat{p} - \sum_b \hat{p}_b / B$
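The bootstrap recipe can be sketched generically. In the example below the estimator is only a stand-in: it is the squared correlation between a single marker code and the trait, not a full QTL mapping run, and the data are simulated. Function names, sample size, and the number of bootstrap samples are illustrative assumptions.

```python
import random
from statistics import mean, stdev


def squared_correlation(pairs):
    """Stand-in estimator: r^2 between marker code x and trait value y."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy * sxy / (sxx * syy)


def bootstrap(sample, estimator, n_boot=500, seed=0):
    """Return estimate, bootstrap SE, bootstrap bias, and bias-corrected estimate."""
    rng = random.Random(seed)
    n = len(sample)
    estimate = estimator(sample)
    boot = []
    for _ in range(n_boot):
        resample = [sample[rng.randrange(n)] for _ in range(n)]  # with replacement
        boot.append(estimator(resample))
    se = stdev(boot)                 # uses the divisor B - 1
    bias = mean(boot) - estimate
    corrected = 2.0 * estimate - mean(boot)
    return estimate, se, bias, corrected


data_rng = random.Random(7)
sample = [(i % 2, 10.0 + 1.0 * (i % 2) + data_rng.gauss(0.0, 1.0)) for i in range(40)]
estimate, se, bias, corrected = bootstrap(sample, squared_correlation)
print(round(estimate, 3), round(se, 3), round(bias, 3), round(corrected, 3))
```

In a real application `estimator` would rerun the whole mapping procedure, including cofactor selection and thresholding, on each bootstrap sample.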
4.13 QTL mapping in summary

- What we need to have
  Marker genotype data: error free
  Genetic linkage map: good quality
  Trait phenotype data: accurate (bias-free)
- What we need to appreciate
  Theoretical linkage relationships between marker(s) and QTL
  [conditional probabilities $\Pr(Q_k \mid M_j)$] as per the QG model for
  the mapping population and marker system employed
  Expected trait values for marker genotype classes (AA, Aa, ...),
  which contain a mixture of putative QTL genotypes in
  different proportions depending on the strength of linkage
- What we are looking for
  Does a putative QTL exist?
  If yes, what is the QTL effect?
  Which are the important trait-linked markers?
  What is the genomic position of the QTL?
- What statistical methods to use
  Use composite interval mapping, multiple regression, ...
  Use an appropriate significance threshold! $\alpha = 0.25$
  Use CV/BS to get unbiased estimates of QTL parameters
Bottom line: The care and expense invested in generating marker
and trait data should be accompanied by equal care in the biometric
analysis of data
Appendix 1: Fixed, Random, and Mixed Models
Consider the example of a field trial laid out in an RCBD. Suppose there are t=16
groundnut varieties as treatments each decided to be tested with r=3 replications.
The experimental field is divided into b=r=3 blocks, each individual block containing
k=t=16 plots. There are N=txr=48 plot observations corresponding to the 48 plots used
in the trial.
The following linear additive model is commonly assumed to represent (model) any
individual plot observation in an RCBD
Plot Obs. = Trial Mean + Variety Effect + Block Effect + Error (1A)
$Y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}$ (1B)
where i = 1, ..., t (# treatments), j = 1, ..., b (# blocks).
The model above contains four terms [$\mu$, $\tau_i$, $\beta_j$, $\varepsilon_{ij}$]. We need to specify which of these
terms is a fixed and which is a random effect.
WHAT IS A FIXED EFFECT? A model term, representing the effect of a certain factor,
is said to be a fixed effect if the different levels of the factor, say t levels of the
factor variety included in the trial, represent t distinct populations (treatments), and
interest lies in estimation (BLUE) of the means of those, and only those, distinct
populations (treatments). For example, if the 16 groundnut varieties represent 16
distinct genetic populations, the corresponding model term $\tau_i$ will be taken as a fixed
effect. Our interest will be only in estimating the means of, and possibly in some
well-defined differences among, these 16 genetic populations.
WHAT IS A RANDOM EFFECT? A model term, representing the effect of a certain

factor, is defined to be a random effect if the different levels of the factor, included
in the trial, represent a random sample from all possible levels of a single population.
For example, if the 16 groundnut varieties represent a random sample of 16 genotypes
from a single genetic population, the corresponding model term $\tau_i$ will be taken as a
random effect. Interest here may lie in the variability within this single population
from which the sample came (variance component estimation), and/or perhaps in the
prediction (BLUP) of the mean of particular levels (here genotypes).
DECIDING WHETHER A MODEL TERM SHOULD BE TAKEN AS FIXED OR
RANDOM (RJ Baker 1996)
Step 1: A Research Point of View
Initially declare the term to be a random effect.
Now ask two questions:
(a) Is it physically possible for the used factor levels to be repeated at some
future time or in some other place?
(b) If the answer to (a) is YES, would it be reasonable in the context of this research
for you or someone else to choose the same levels for repetition of this
research at some future time or in some other place?
If the answers to questions (a) and (b) are BOTH YES, declare the term
as a fixed effect.
Step 2: A Statistical Point of View
If a fixed-effect factor involves a large number of levels (say >10) and there is no
structure among those levels, it might perhaps be best to declare the factor effect
as random and use BLUP to predict mean values.
The basis for this recommendation is: With a large number of unstructured
levels, it is likely that the extremely low or high means will partially reflect a
fortuitous combination of random error effects and should therefore be shrunken
toward the trial mean.
If a random-effect factor involves too few levels (say if the factor represents an
uninteresting nuisance factor in the trial), and if comparisons among levels of this
factor provide no information about other factors (no inter-class information), then
declare the factor effect as fixed. This will ease the computing demands and
provide identical results for the interesting factors.
NOTE 1: Whatever the type of model (Fixed, Random, or Mixed), $\mu$ is always taken as
fixed, and $\varepsilon_{ij}$, due to random allocation of treatments to plots, is always taken as
random. What makes a model Fixed, Random, or Mixed then depends on the nature
of the remaining terms in the model.
WHAT IS A FIXED-EFFECTS MODEL (Model 1)? Under the proviso of NOTE 1, if all
other model terms represent fixed effects, the model is called a fixed-effects (or a
fixed) model.
WHAT IS A RANDOM-EFFECTS MODEL (Model 2)? Under the proviso of NOTE 1, if all
other model terms represent random effects, the model is said to be a random-effects
(or a random) model.
WHAT IS A MIXED-EFFECTS MODEL (Model 3)? Under the proviso of NOTE 1, if some of
the remaining model terms are fixed effects and some are random effects, the model
is defined to be a mixed-effects (or a mixed) model.
MAPMAKER/QTL
Model: Interval mapping
Population: F2, BC, RIL, DHL
Computer platform: PC Windows
Contact: Eric Lander (mapmaker@genome.wi.mit.edu)

QTL Cartographer
Model: Composite interval mapping
Population: F2, BC
Computer platform: Mac, PC Windows
Contact: Christopher Basten (basten@essjp.stat.ncsu.edu)

MAPQTL
Model: Interval mapping
Population: F2, BC, RIL, DHL, Heterozygous F1
Computer platform: Mac, PC
Contact: Johan van Ooijen (j.w.vanooijen@cpro.dlo.nl)

Map Manager QTL
Model: Interval mapping using regression
Population: F2, BC
Computer platform: Mac, PC
Contact: Kenneth Manly (kmanly@mcbio.med.bufflo.edu)

Qgene
Model: Linear regression
Population: F2, BC
Computer platform: Mac
Contact: James Nelson (jcn5@cornell.edu)
Allelic effect
I. Dominance deviation:
This deviation can be detected in an F2 population.
It is calculated as: Heterozygous - [(P1 + P2)/2]
A positive effect reflects growth of the heterozygote that exceeds the midparent value.
A negative effect reflects growth that is less than the midparent value.
II. Additive effect: the effect of the homozygotes.

It is calculated as: (Homozygous for P1 - Homozygous for P2)/2
A positive effect reflects greater growth of the P1 homozygote.
A negative effect reflects greater growth of the P2 homozygote.
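The two calculations can be sketched as follows. This is a minimal illustration that assumes the inputs are the mean trait values of the three marker genotype classes in an F2 population (homozygous for the P1 allele, heterozygous, homozygous for the P2 allele). The function name and the example values are hypothetical.

```python
def allelic_effects(mean_p1_homozygote, mean_heterozygote, mean_p2_homozygote):
    """Return (additive effect, dominance deviation) from genotype class means."""
    # Additive effect: half the difference between the two homozygote means
    additive = (mean_p1_homozygote - mean_p2_homozygote) / 2
    # Dominance deviation: heterozygote mean minus the midparent value
    midparent = (mean_p1_homozygote + mean_p2_homozygote) / 2
    dominance = mean_heterozygote - midparent
    return additive, dominance


# Hypothetical genotype class means for a growth trait
additive, dominance = allelic_effects(12.0, 11.0, 8.0)
print(additive, dominance)
```

In this example both effects are positive: the P1 homozygote grows more than the P2 homozygote, and the heterozygote exceeds the midparent value.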
Steps to obtain high-resolution QTL mapping
- Quantitative trait analysis
- Marker analysis
- Major peaks and potential linked QTLs
- Fill gaps (more markers in the target region)
- More individuals (more recombinants for the target region)
Marker-Assisted Selection

Advantages:
Eliminates the effects of environment
Allows selection at an early stage
Speeds up the breeding cycle
Saves land and labor
Disadvantages:
Requires a well-equipped lab and well-trained workers
High cost