
See

discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/268747417

Linkage Mapping and QTL Analysis

Chapter January 2004


DOI: 10.13140/2.1.4501.9846


1 author:

Subhash Chandra
Agriculture Research Branch


Retrieved on: 05 September 2016
ICRISAT

Draft Version 1.3

Linkage Mapping and QTL Analysis

Subhash Chandra

Biometrics Unit
Global Theme Biotechnology
International Crops Research Institute for the Semi-Arid Tropics
Patancheru 502 324, India

2004

Contents
QTL Analysis: An Overview

Chapter 1. Things we need to get there
1.1 The basic ingredients
1.2 Selection of parents
1.3 Generating a mapping population
1.4 Mapping population size
1.5 Selection of marker system
1.6 Number of markers vis-à-vis mapping population size
1.7 The data: What do they look like?
1.8 Genotyping of mapping population
1.9 Linkage map

Chapter 2. Phenotyping of mapping population
2.1 Physical dimensions of phenotyping
2.2 Outputs expected from phenotyping
2.3 Accuracy and precision: Theory and reality
2.4 Ingredients of reliable phenotyping
2.5 Experimental design: Some basics
2.6 Ingredients of a statistically sound experimental design
2.7 Replication vs blocking
2.8 Experimental design options
2.9 Number of environments and replications: A balancing act
2.10 Biometric analysis of data: Plan it ahead
2.11 Understanding the block structure
2.12 Single trial analysis using GenStat
2.13 Pooled analysis across trials using GenStat
2.14 Spatial analysis using ASReml

Chapter 3. Building a linkage map
3.1 Single-locus analysis for data screening
3.2 Two-locus analysis for estimation and detection of linkage
3.3 Minimum sample size to detect linkage
3.4 Linkage grouping
3.5 Linkage grouping criteria
3.6 Locus ordering
3.7 Marker coverage and map density
3.8 Predicting marker coverage and map density

Chapter 4. QTL analysis
4.1 The key idea
4.2 Quantitative genetic models
4.3 Prelim estimate of number of loci/QTLs
4.4 Genetic models for F2 and BC
4.5 QTL mapping strategies
4.6 Single marker analysis
4.7 Multiple marker analysis using multiple regression
4.8 Simple interval mapping
4.9 Composite interval mapping
4.10 Significance threshold for QTL detection
4.11 Estimation of key MAS parameters
4.12 Cross validation and bootstrapping
4.13 In summary

Appendix 1. Fixed, Random and Mixed Models


QTL Analysis: An Overview


A quantitative trait locus (QTL) is a region of the genome that contains gene(s) associated with a quantitative trait (Single gene?)

QTL analysis: A means to estimate the locations, number, magnitude of phenotypic effects, and mode of gene action of individual genetic loci (QTLs) that contribute to the inheritance of a quantitative trait

Polymorphic markers are needed for QTL analysis

QTL analysis can be done with or without a genetic linkage map

A linkage map needs adequate polymorphic markers and allows mapping of QTLs, estimating their number, phenotypic effects, and modes of gene action

Without a linkage map, only QTL-harboring markers can be identified; QTL location, number, effect and mode of gene action cannot be determined

QTL analysis can be done using a mapping population (linkage analysis) or a natural/breeding population (association analysis)

Why QTL analysis?
- Identify important trait-linked markers (if there is no map)
- Detect regions of the genome associated with the trait (with a map)
  → How many QTLs are there, and where are they? Identify flanking markers
- Describe the effect of QTL on the trait (Quantitative Genetics)
  How much variation is caused by the QTL ($R_a^2$)?
  What is the gene action associated with the QTL? Additive, dominance
  Which allele is associated with the favorable effect?
- Facilitate marker-assisted selection, breeding, introgression: Note that it is MARKER-assisted, not QTL-assisted!
- Comparative QTL mapping across related species

A quantitative trait exhibits continuous variation in its phenotypic expression


The trait expression may be controlled by many genes (a few hopefully of large effect and the others of small effect): Major QTL ($R_a^2$ > 15%), Minor QTL ($R_a^2$ below 15% but above 5%) and Polygenes ($R_a^2$ < 5%): The L-shape distribution hypothesis

Trait expression may be affected by non-genetic factors such as macro- and micro-environment variations as well as measurement errors: NOISE

The genes may interact with each other (gene x gene interaction; epistasis) and with the environment (gene x environment interaction) in determining phenotypic expression of the trait: NOISE

The fundamental challenge in QTL analysis is to disentangle the genetic SIGNAL at any individual locus from the NOISE

A QTL represents a significant statistical association between trait phenotype and marker genotype: Significance Threshold?

The ability to resolve a QTL (the SIGNAL) depends upon (Heritability?)
- The magnitude of its phenotypic effect: Large, small
- The extent to which its effect is modified by non-linear interactions with other QTLs (epistasis)
- The degree to which macro- and micro-environmental factors and measurement errors obscure the resulting phenotype
- Density of genetic markers to scan the genome for QTLs

We can influence these factors by
- Choice of mapping population (RIL, F2, ...)
- Number of segregating progenies
- Type (dominant or co-dominant) and spacing of genetic markers
- Error-free genotyping
- Appropriate field block design to account for micro-environmental variation for accurate phenotyping (Definition of phenotype?)
- Appropriate data analysis to account for micro-environmental variation not accounted for by the design used: accurate phenotyping


These factors impact, in varying degrees, the reliability of the estimates of the number, magnitude, and the distribution/position of QTLs on the genome

The reliability of QTL analysis results is critically dependent on the sample size (# progenies genotyped and phenotyped)

Unbiased QTL analysis requires a (costly) two-stage sequential process: detecting QTLs from sample A, and using this information in estimating QTL parameters from another independent sample B of mapping population individuals

Normally these two actions are performed on the same single sample, whose size (still) normally varies between 100 and 200 mapping population individuals due to the expensive phenotyping and genotyping

The same sample is also used to build the linkage map

Repeated use of the same sample produces severe overestimation of QTL parameters. This results in highly optimistic, hence misleading, prospects for MAS (False positives)

The resampling (simulation) procedures of cross-validation and bootstrapping can be used to obtain realistic estimates of QTL parameters from a single sample of available experimental data on genotyping and phenotyping

These notes are divided into four chapters and discuss only the mapping-population-based QTL analysis
- Chapter 1 is about The Things We Need To Get There
- Chapter 2 discusses issues related to Phenotyping
- Chapter 3 discusses Building a Linkage Map
- Chapter 4 discusses the methods for QTL Analysis including cross-validation and bootstrapping


Suggested Reading

Kearsey MJ, Farquhar AGL (1998). QTL analysis in plants: Where are we now? Heredity 80:137-142

Asins MJ (2002). Present and future of quantitative trait locus analysis in plant breeding. Plant Breeding 121:281-291

Doerge RW (2002). Mapping and analysis of quantitative trait loci in experimental populations. Nature Reviews Genetics 3:43-52

Tuberosa R et al. (2002). Mapping QTLs regulating morpho-physiological traits and yield: Case studies, shortcomings, and perspectives in drought-stressed maize. Annals of Botany 89:941-963

Paterson AH (2002). What has QTL mapping taught us about plant domestication? New Phytologist 154:591-608

Members of Complex Trait Consortium (2003). The nature and identification of quantitative trait loci: A community's view. Nature Reviews Genetics 4:911-916

Xu S (2003). Theoretical basis of the Beavis Effect. Genetics 165:2259-2268

Bernardo R (2004). What proportion of declared QTL in plants are false? Theoretical and Applied Genetics 109:419-424

Schön CC et al. (2004). Quantitative trait locus mapping based on resampling in a vast maize testcross experiment and its relevance to quantitative genetics for complex traits. Genetics 167:485-498

Kraakman ATW et al. (2004). Linkage disequilibrium mapping of yield and yield stability in modern spring barley cultivars. Genetics 168:435-446


Chapter 1
Things we need to get there

1.1 The basic ingredients

- Identifying the trait(s)

- Definition of trait phenotype

- Selection of parents

- Generating a mapping population

- Mapping population size

- Phenotyping of mapping population: Accurate

- Phenotyping data analyses

- Selection of marker system

- Number of markers

- Genotyping of mapping population: Error-free

- Genotyping data analysis

- Building a linkage map

- Linking phenotype to genotype to find QTLs


1.2 Selection of parents

Parents must have sufficient variation at both phenotypic and molecular levels. Variation at molecular level is essential to trace recombination events [Links with molecular diversity analysis]

Completely inbred lines are ideal as parents. They create marker-trait association due to their F1s being in complete linkage disequilibrium (LD) for genes differing between lines

LD = Non-random association of alleles from two loci on the same chromosome

1.3 Generating a mapping population (Show DH)

Inbred Parent 1 x Inbred Parent 2 (Divergent Inbred Parents)
        |
        F1                          F1 x Inbred Parent
        | Self                       |
        F2                          BC1 x Inbred Parent (donor)
        | Self                       |
F3-1 F3-2 F3-3 F3-4                 BC2 x Inbred Parent (donor)
        | Self                       |
F4-1 F4-2 F4-3 F4-4                 BC3 x Inbred Parent (donor)
        :                            :
RIL1 RIL2 RIL3 RIL4                 BC Introgression Line (donor genotype)

Most widely used mating designs to generate a mapping population are F2, BC, RIL, DH

A great advantage of RILs and DHs is their eternity: they can be multiplied without losing genetic identity. This is important for QTL detection as it enables the use of replicates and the collection of data in multiple environments

Low proportion of heterozygotes in an RIL population produces higher recombination frequency between marker-pairs → higher resolution of QTL mapping


1.4 Mapping population size


Representative sample of mapping population
Number of individuals, ng: 500-1000 [$h^2$, QTL effect size, ...]
ng affects map resolution and marker order: more is better
Small ng →
- Poor quality linkage map
- Low power to detect QTL
- QTL effects are overestimated

H0: There is no QTL

Prob(Type I Error) = α = Level of Significance = Risk of a False Positive
: Inferring the presence of a QTL that does not exist
: Low Type I Error → Fewer false positives: Unavoidable!
Prob(Type II Error) = Risk of a False Negative
: Inferring the absence of a QTL that exists
: Low Type II Error → Fewer false negatives: Unavoidable!
Power = 1 - Prob(Type II Error) = Prob(Identify a QTL | There is a QTL)

For a fixed sample size: Lowering one error increases the other one
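
The trade-off between the two error types at a fixed sample size can be illustrated numerically. The sketch below is not a QTL test: it uses a generic one-sided z-test as a stand-in, and the effect size (0.3 standard deviations), the sample size (100) and the function name are assumptions made only for illustration.

```python
from statistics import NormalDist


def power_one_sided_z(effect_in_sd, n, alpha):
    """Power of a one-sided z-test for a mean shift of `effect_in_sd`
    standard deviations, with n observations and Type I error alpha."""
    z = NormalDist()
    critical = z.inv_cdf(1.0 - alpha)  # rejection threshold under H0
    return 1.0 - z.cdf(critical - effect_in_sd * n ** 0.5)


# Hypothetical effect size and sample size, fixed while alpha is tightened.
for alpha in (0.05, 0.01, 0.001):
    power = power_one_sided_z(0.3, 100, alpha)
    print(f"alpha={alpha}: power={power:.3f}, Type II error={1 - power:.3f}")
```

Tightening alpha lowers the false-positive risk but raises the Type II error; in this sketch only a larger n improves both at once.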


1.5 Selection of marker system (Include description of SSR, AFLP markers)


Different marker systems have different levels of resolution for detecting genomic variation

- Select a system that allows experimental detection of heritable genomic variation among individuals of the mapping population

- Co-dominant markers provide more power for QTL detection; Dominant markers are generally less informative

- For eternal RIL and DH, both types of markers are equally informative

1.6 Number of markers vis-à-vis mapping population size

How many markers (nm)? For locating QTLs, a dense map (large nm) is preferable to a sparse map (small nm) as this allows greater precision of QTL location

However, increasing nm beyond a certain average density makes little sense if ng is not increased at the same time

QTL mapping, like linkage mapping, relies on the frequency of detectable recombination events, which, beyond a given marker density (average inter-marker distance of 15 cM), can only be increased by increasing ng

A possible approach is to initially use an evenly spaced sparse map to detect significant chromosomal regions, which could subsequently be saturated with more markers for fine-scale localization of QTLs

Suggested further reading

Jones N, Ougham H, Thomas H (1997). Markers and mapping: we are all geneticists now. New Phytol 137:165-177

Malyshev SV, Kartel NA (1997). Molecular markers in mapping of plant genomes. Molecular Biology 31(2):163-171

Lee M (1995). DNA markers and plant breeding programs. Advances in Agronomy 55:265-344

Dudley JW (1993). Molecular markers in plant improvement: Manipulation of genes affecting quantitative traits. Crop Science 33:660-668

Williams JGK et al. (1990). DNA polymorphisms amplified by arbitrary primers as useful genetic markers. Nucleic Acids Research 18(22):6531-6535

1.7 The data: What do they look like?

Example: 500 F8:9 RILs, 200 SSR markers

Marker   RIL1  RIL2  RIL3  RIL4  RIL5  RIL6  RIL7  ...  RIL500
M1       B     A     A     B     A     B     A     ...  B
M2       A     A     B     A     A     B     A     ...  B
M3       B     A     B     A     A     B     A     ...  B
M4       A     B     A     A     B     B     B     ...  A
M5       B     A     A     A     B     B     B     ...  A
...
M200     A     A     A     B     B     A     A     ...  B

Pheno
Yield    23    20    10    8     25    11    45    ...  34

A: Parent 1, B: Parent 2 (one high, one low); (A,B) can also be coded as (1,0)
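
As a minimal sketch of the (A,B) → (1,0) coding, the snippet below takes the first five markers and the eight columns printed in the example and counts the two allele classes per marker. The dictionary layout and the function name `code_alleles` are illustrative choices, not part of any particular software's input format.

```python
# First five markers and the eight RIL columns shown in the example table.
genotypes = {
    "M1": "BAABABAB",
    "M2": "AABAABAB",
    "M3": "BABAABAB",
    "M4": "ABAABBBA",
    "M5": "BAAABBBA",
}
yield_values = [23, 20, 10, 8, 25, 11, 45, 34]


def code_alleles(calls, parent1="A"):
    """Return 1 for the Parent 1 allele and 0 for the Parent 2 allele."""
    return [1 if call == parent1 else 0 for call in calls]


coded = {marker: code_alleles(calls) for marker, calls in genotypes.items()}

for marker, row in coded.items():
    n_a = sum(row)
    print(marker, row, f"A:{n_a} B:{len(row) - n_a}")
```

The per-marker A and B counts are the raw input for the single-locus segregation screening described in Chapter 3.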

1.8 Genotyping of mapping population (Show scoring of CD & D markers)

Error-free genotyping of ng mapping population individuals with respect to each of the nm markers
Marker genotype data contain information on segregation at various positions of the genome
Genotyping errors may lead to distorted segregation, inflated linkage map, biased QTL inferences

1.9 Linkage map

The (ng x nm) = (500 x 200) genotype data matrix forms the raw material to build a linkage map
A linkage map must be available for QTL mapping
Chapter 3 discusses how to build a linkage map


Chapter 2
Phenotyping of mapping population

Reliable QTL mapping demands reliable phenotyping of traits of interest under defined target environmental (e.g. drought) conditions

Phenotype data contain information on segregation and the phenotypic effects of QTLs

2.1 Physical dimensions of phenotyping

- Large number ng of individuals
- Multi-Environment Phenotyping (MEP) to obtain a more realistic estimate of line heritability ($H^2$ vs Effect & #QTLs)

  $H^2 = \sigma_G^2 / [\sigma_G^2 + (\sigma_{GE}^2/n_E) + \{\sigma_e^2/(n_E n_r)\}]$

  where $\sigma_G^2$, $\sigma_{GE}^2$ and $\sigma_e^2$ are the genotypic, genotype x environment and error variance components, nE is the number of environments and nr the number of replications

- A large experimental facility in each environment, depending on plot size and number nr of replications

2.2 Outputs expected from phenotyping

Accurate and precise estimate m of genetic parameter $\mu$.

Genetic parameters to be estimated are
- Expected phenotypic values (G in individual trials as well as across trials, and GEI) of entries. BLUPs of G and GEI are more appropriate to use than BLUE/GLSE (Explain BLUP, BLUE). Separation of G from GEI will facilitate mapping of QTLs for G and GEI to assess QTL x Env interactions and to appreciate their implications for MAS
- Variance components $\sigma_G^2$, $\sigma_{GE}^2$, $\sigma_e^2$
- Line heritability of the trait ($H^2$)


2.3 Accuracy and precision: Theory and Reality

$MSE(m) = E(m-\mu)^2$ = Total error/uncertainty in estimate m of $\mu$

$\quad = E\{m-E(m)\}^2 + \{E(m)-\mu\}^2$

$\quad = \{SE(m)\}^2 + \{Bias(m)\}^2$

$\quad$ = Imprecision + Inaccuracy
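
The step from the first to the second line of this decomposition may be worth spelling out. The following is a standard derivation, written with $\mu$ assumed to be the (unknown) parameter being estimated by m:

```latex
\begin{align*}
E(m-\mu)^2 &= E\big\{[m-E(m)] + [E(m)-\mu]\big\}^2 \\
           &= E\{m-E(m)\}^2 + 2\,[E(m)-\mu]\,E\{m-E(m)\} + \{E(m)-\mu\}^2 \\
           &= \{SE(m)\}^2 + 0 + \{Bias(m)\}^2
\end{align*}
```

The cross term vanishes because $E(m)-\mu$ is a constant and $E\{m-E(m)\}=0$.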

Obtain unbiased estimate m of $\mu$ with minimum SE(m) by choosing appropriate experimental design (ED) and biometric analysis tools.
A statistically unbiased estimate may not necessarily be a practically unbiased estimate!
Keep in mind: Practical unbiasedness requires careful and conscious conduct of the experiment during the entire course of the experiment

Uniform application of standardized experimental protocols


2.4 Ingredients of reliable phenotyping

- Use of appropriate intra-environment experimental design (field blocking) that closely matches the (unknown/expected) inherent (spatial) variability pattern of the experimental field;

- Determination of nE (number and type) and nr; and

- Use of appropriate biometric analysis tools to obtain accurate and precise estimate m of (unknown) genetic parameter $\mu$.

2.5 Experimental design: Some Basics

- Functions of an experimental design (GH or field) are

To disentangle genotypic variation (SIGNAL) from inter-plot variation (NOISE).

To separate chaff from grain to obtain unbiased estimates with minimum SE.

To provide valid and unbiased estimate of uncertainty (SE) in estimates of genetic parameters.

2.6 Ingredients of a statistically sound experimental design

Replication of entries
Randomization of entries
Local control of error arising from inter-plot variation

The functions of replication are

To obtain an internal estimate of experimental error ($\sigma_e^2$);

To permit separation of $\sigma_{GE}^2$ from $\sigma_e^2$ in MEP for a more realistic estimation of $H^2$.


The functions of randomization are

Provide validity to results from statistical analyses.

Protect against bias in estimates of genetic parameters.

(Physical) local control (or blocking): Most critical

Grouping plots into (in)complete blocks such that
- Intra-block variation is minimized, and
- Inter-block variation is maximized.

The basic function of Local Control is to reduce experimental error; Blocking does not remove intra-block variation

Intra-block variation could be further reduced by careful identification and plot-wise quantification of extraneous factors and using them as COVARIATES. Spatial Analysis is another alternative to explore.

2.7 Replication vs blocking

Replication is a numerical entity indicating only the number of plots assigned to an entry;

Increasing nr
- Is not a device to reduce $\sigma_e^2$;
- Provides only a more stable estimate of $\sigma_e^2$.

Blocking is a physical concept
- To control/reduce $\sigma_e^2$;
- To facilitate field operations.

A replicate and a block are not always the same thing;

Replication could be costlier than blocking;

$SE_m = \sqrt{\sigma_e^2/n_r}$; Reduce $\sigma_e^2$ by blocking rather than by increasing nr.
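
A small numerical sketch of this point, assuming the standard error of an entry mean is the square root of the error variance divided by the number of replicates. The error variances used (4.0 without effective blocking, 1.0 with it) are hypothetical.

```python
import math


def se_of_entry_mean(var_e, n_rep):
    """Standard error of an entry mean from n_rep plots with error variance var_e."""
    return math.sqrt(var_e / n_rep)


# Hypothetical error variances, chosen only for illustration.
print(se_of_entry_mean(4.0, 2))  # 2 replicates, no effective blocking
print(se_of_entry_mean(4.0, 8))  # four times as many replicates
print(se_of_entry_mean(1.0, 2))  # 2 replicates, blocking removes 3/4 of the error variance
```

Under these assumed numbers, the last two lines give the same SE: cutting the error variance fourfold by blocking is worth as much as quadrupling the number of plots per entry.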


2.8 Experimental design options


Alpha design: Ideal for phenotyping large ng

Much more flexible than lattices that require $n_g = k^2$;

Could be chosen to conform to expected field variation;

Greater convenience for management of experiment;

ng = k x b; k<b; k = #plots in a block, b = #blocks in a replicate
150 = 3 x 50 = 5 x 30 = 10 x 15
300 = 3 x 100 = 5 x 60 = 6 x 50 = 10 x 30 = 15 x 20;

Orient blocks perpendicular to (expected) field gradient for the (major) trait of interest.
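
The candidate block sizes for a given number of entries follow from simple factorization. The helper below is an illustrative sketch (the function name and the minimum block size of 2 are assumptions); it lists every (k, b) with ng = k x b and k < b, so it returns a few more options than the examples given above.

```python
import math


def alpha_block_options(n_entries, min_block_size=2):
    """List (k, b) pairs with n_entries = k * b and k < b,
    where k = plots per block and b = blocks per replicate."""
    options = []
    for k in range(min_block_size, math.isqrt(n_entries) + 1):
        b, remainder = divmod(n_entries, k)
        if remainder == 0 and k < b:
            options.append((k, b))
    return options


print(alpha_block_options(150))
print(alpha_block_options(300))
```

Which of the listed options to use remains a field decision: the block size should match the scale of the expected field variation.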

Augmented designs

Permit unreplicated phenotyping of large number of test entities with check entities replicated.

Replicated checks enable estimation of $\sigma_e^2$ and adjustment of performance of test entities for field variation.

Frequency of checks? 2-10 depending on magnitude of field variation; higher for higher variability.

Wanted: Augmented designs with low frequency of checks to allow more test entities as well as effective control of field variation!

Usual augmented designs look something like (C = check, T = test entity; each row is an incomplete block)

C T T T T T C T T T T T C T T T T T C   Block 1
C T T T T T C T T T T T C T T T T T C   Block 2
C T T T T T C T T T T T C T T T T T C   Block 3


A better approach is to fix the position of checks, as above, and use for them, e.g., RCBD with r=3 blocks with 4 different checks randomized within blocks

C1 T T T T T C4 T T T T T C3 T T T T T C2   Block 1
C2 T T T T T C3 T T T T T C4 T T T T T C1   Block 2
C3 T T T T T C1 T T T T T C2 T T T T T C4   Block 3

The above design allows both testing of differences between check and test entities, and adjustment of the latter for field variation.

For normally used rectangular plots, Lin & Poushinsky (1985, Can
J Plant Sci 65:743-749) suggest MAD-2, which can be adapted to
any standard design. An example of a 3x6 row-column design to
test 72 test lines is

                Column
          1  2  3  4  5  6

          T  T  T  T  T  T
          T  T  T  T  T  T
Row 1     C  C  C  C  C  C
          T  T  T  T  T  T
          T  T  T  T  T  T

          T  T  T  T  T  T
          T  T  T  T  T  T
Row 2     C  C  C  C  C  C
          T  T  T  T  T  T
          T  T  T  T  T  T

          T  T  T  T  T  T
          T  T  T  T  T  T
Row 3     C  C  C  C  C  C
          T  T  T  T  T  T
          T  T  T  T  T  T

Each row-column combination is like a main-plot in a split-plot design with 5 subplots, the control allocated to the middle plot and test entities assigned to the remaining 4 plots.

Number of subplots in a whole plot should be such that they together form a nearly square area


2.9 Number of environments and replications


nE depends on the expected variability among targeted environments;
Representative selection of nE environments! Min 2 Mid 4 - Max

nr=2 is adequate because in

$H^2 = \sigma_G^2 / [\sigma_G^2 + (\sigma_{GE}^2/n_E) + \{\sigma_e^2/(n_E n_r)\}]$

the error variance of a genotype mean

$(\sigma_{GE}^2/n_E) + \{\sigma_e^2/(n_E n_r)\}$

is reduced more by larger nE than by larger nr.
nr=2 will allow internal estimation of error variance for each individual trial to assess the relative magnitude of errors across environments and also to separate $\sigma_{GE}^2$ from $\sigma_e^2$;
For nr=2 and given nE, an attempt should be made to reduce $\sigma_e^2$ in individual trials, to compensate for the smaller nr,
- By covariance adjustments;
- By spatial analysis.
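
The claim that extra environments help more than extra replicates can be checked directly from the heritability formula. In the sketch below the three variance components are hypothetical values chosen for illustration, and the function name is not from any package.

```python
def line_heritability(var_g, var_ge, var_e, n_env, n_rep):
    """Line heritability: var_g / (var_g + var_ge/n_env + var_e/(n_env*n_rep))."""
    error_of_mean = var_ge / n_env + var_e / (n_env * n_rep)
    return var_g / (var_g + error_of_mean)


# Hypothetical variance components (genotypic, G x E, error).
VAR_G, VAR_GE, VAR_E = 1.0, 1.0, 2.0

for n_env, n_rep in [(2, 2), (2, 4), (4, 2)]:
    h2 = line_heritability(VAR_G, VAR_GE, VAR_E, n_env, n_rep)
    print(f"nE={n_env} nr={n_rep} plots per entry={n_env * n_rep} H2={h2:.3f}")
```

With the same 8 plots per entry, 4 environments x 2 replicates gives a higher line heritability than 2 environments x 4 replicates, because only a larger nE shrinks the G x E term.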

2.10 Biometric analysis of data


Before starting analysis
- Bring data into a format amenable for analysis: Software-specific
- Validate data for correctness: Do it BEFORE analysis!!! Saves time & trouble
- Understand Data Structure = Treatment Structure + Block Structure
- Nature of Data recorded on each variable [discrete, continuous, %, ...]
- Nature of Factors: Fixed, Random (See Appendix 1). This makes a lot of difference in analysis and inference
- Build a Model according to Structure and Nature of Data

2.11 Understanding the block structure


This refers to the manner in which local control (and randomization) are (physically) exercised in the experiment. Appreciation of this is most crucial since this is what determines the valid error term(s) in an analysis.

In an Alpha design, field blocking is done at three levels: the field is first divided into nr replicates, each replicate is subdivided into a number of incomplete blocks (IB), and each IB is subdivided into a number of plots, leading to a block structure

Replicate → Block → Plots. In GenStat: Replicate/Block

2.12 Single trial analysis using GenStat (Lab)

Use REML to get BLUPs of entry performance (G) treating entry, block, and replicate effects as random, though with nr=2 or 3, the replicate variance component may not be well estimated.

REML provides
- Unbiased estimates of variance components;
- BLUEs/GLSEs of fixed effects; and
- BLUPs of random effects

But requires the data to be normally distributed.

In GenStat, use REML directive with

FIXED MODEL (blank)
RANDOM MODEL entry + replicate/block

to get entry BLUPs and their SE, and estimates of $\sigma_G^2$ and $\sigma_e^2$ and their SE.


Use residual plots to check normality, homoscedasticity

Check effectiveness of blocking by comparing the block variance component $\sigma_b^2$ with its SE

Line heritability for the trial can be computed as

$H^2 = \sigma_G^2 / [\sigma_G^2 + (\sigma_e^2/n_r)]$

Effect of alpha blocking on magnitude of the estimates of parameters can be assessed by running alternative REML analysis in the following way

FIXED MODEL (blank)
RANDOM MODEL entry + replicate

2.13 Pooled analysis across trials using GenStat (Lab)

Use REML to get BLUPs of G and GEI treating entry, block, replicate, and environment effects as random.

In GenStat, use REML directive with

FIXED MODEL (blank)
RANDOM MODEL entry*env + env.replicate/block

to get BLUPs of G and GEI and their SE, and estimates of $\sigma_G^2$, $\sigma_{GE}^2$ and $\sigma_e^2$ and their SE.

Line heritability across the nE trials can be estimated from

$H^2 = \sigma_G^2 / [\sigma_G^2 + (\sigma_{GE}^2/n_E) + \{\sigma_e^2/(n_E n_r)\}]$

Effect of alpha blocking on magnitude of the estimates of parameters can be assessed by running alternative REML analysis in the following way

FIXED MODEL (blank)
RANDOM MODEL entry*env + env.replicate


2.14 Spatial analysis using GenStat/ASReml (Lab)

Field blocking using an experimental design must, of necessity, be done a priori using our best knowledge of the pattern of underlying field variability

What matters, and is often neglected, is the pattern of field variability that presents itself at the time of data recording; this may not match the field blocking used a priori. Many things happen during the course of the experiment that might create unknown and systematic patterns of field variability along the rows and/or columns of the experimental field

Analyzing data using the adopted design structure is then unlikely to produce unbiased estimates of entry means

Spatial analysis can be used to detect/model these unknown and systematic patterns of field variability and use this information to obtain unbiased estimates of entry means

Depending on the magnitude and pattern of this unknown and systematic field variability, the entry means, after spatial adjustment, can be entirely different from the entry means obtained from a design-based analysis!

The spatially adjusted entry means may even result in reversal of the sign of phenotypic effect of a QTL, with disastrous consequences for MAS

ASReml and GenStat can be used to get spatially adjusted entry means (BLUPs/BLUEs)

It is better to do spatial analysis for each individual trial separately since the pattern of spatial variability may differ from one trial to another


Chapter 3
Building a linkage map

Linkage map: linear arrangement of genetic markers (loci) on the genome obtained on the basis of estimates of recombination fractions among the markers

Data analysis to construct a genetic linkage map requires only marker genotype data on mapping population individuals. It involves

- Single-locus analysis: Quality control by screening each marker to test conformity to expected Mendelian segregation

- Two-locus analysis: Estimation of recombination fraction and linkage detection

- Linkage grouping

- Locus ordering


3.1 Single-locus analysis for data screening


3.1A Backcross (BC) mapping population (Show BC)

A backcross (M/M x M/m) produces the following distribution of zygotes for n progenies

                        M/M       M/m       Total
Expected frequency      1/2       1/2       1
Expected number (Ei)    E2=n/2    E1=n/2    n
Observed number (Oi)    O2=n2     O1=n1     n

$\chi^2 = \sum_i [(O_i - E_i)^2/E_i] \sim \chi^2(1)$, i=1,2 (large n)

$\chi^2 = \sum_i [\{(O_i - E_i)^2 - |O_i - E_i| + 1/4\}/E_i] \sim \chi^2(1)$, i=1,2 (small n)

Alternatively, a likelihood ratio (LR) test can be used

$LR = G = 2 \sum_i O_i \log_e(O_i/E_i) \sim \chi^2(1)$, i=1,2 (large n)

3.1B F2 mapping population (Show F2)

An F2 cross (M/m x M/m) produces the following distribution of zygotes for n progenies

                        M/M    M/m    m/m    Total
Expected frequency      1/4    1/2    1/4    1
Expected number (Ei)    n/4    n/2    n/4    n
Observed number (Oi)    n2     n1     n0     n

$\chi^2 = \sum_i [(O_i - E_i)^2/E_i] \sim \chi^2(2)$, i=1,2,3 (large n)

$\chi^2 = \sum_i [\{(O_i - E_i)^2 - |O_i - E_i| + 1/4\}/E_i] \sim \chi^2(2)$, i=1,2,3 (small n)

$LR = G = 2 \sum_i O_i \log_e(O_i/E_i) \sim \chi^2(2)$, i=1,2,3 (large n)
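
The three single-locus statistics can be computed in a few lines. The sketch below assumes the small-sample form is the usual continuity correction, (|Oi - Ei| - 1/2)^2 / Ei, which expands to the expression given above; the observed counts are hypothetical, and the p-value helper only covers the two cases needed here (df = 1 for BC, df = 2 for F2), where closed forms exist.

```python
import math


def segregation_tests(observed, ratios):
    """Return (chi-square, continuity-corrected chi-square, G) for observed
    genotype counts against an expected Mendelian ratio such as (1, 1) or (1, 2, 1)."""
    n = sum(observed)
    total = sum(ratios)
    expected = [n * ratio / total for ratio in ratios]
    pairs = list(zip(observed, expected))
    chi2 = sum((o - e) ** 2 / e for o, e in pairs)
    chi2_corrected = sum((abs(o - e) - 0.5) ** 2 / e for o, e in pairs)
    g = 2.0 * sum(o * math.log(o / e) for o, e in pairs if o > 0)
    return chi2, chi2_corrected, g


def chi2_p_value(x, df):
    """Upper-tail chi-square probability for df = 1 or df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch handles df = 1 or 2 only")


# Hypothetical counts: a BC marker (M/M, M/m) and an F2 marker (M/M, M/m, m/m).
bc = segregation_tests([60, 40], (1, 1))
f2 = segregation_tests([30, 50, 20], (1, 2, 1))
print("BC:", bc, "p =", chi2_p_value(bc[0], df=1))
print("F2:", f2, "p =", chi2_p_value(f2[0], df=2))
```

A marker whose p-value falls below the chosen significance level would be flagged as showing segregation distortion.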


3.1C Effects of segregation distortion

Each marker should conform to expected Mendelian segregation.

If a few markers exhibit segregation distortion, drop them (!) from further analysis because

- It biases estimation of recombination frequency between markers (downward),

- It reduces statistical power to identify QTLs (frequency of false negatives increases), and

- It biases estimation of QTL position and effect

Causes of segregation distortion:

- Genotyping errors

- ...


3.2 Two-locus analyses for recombination fraction estimation and linkage detection

Linkage among markers forms the basis to construct linkage maps and for subsequent molecular dissection of a quantitative trait (QT) using the map

Linkage analysis is based on Mendelian laws about co-segregation and co-transmission of different genes to the next progeny generation

Pre-requisite of linkage analysis between any 2 markers is their known allelic arrangements (or linkage phases) on the homologous chromosomes

With known linkage phases, parental vs non-parental haplotypes can be readily ascertained (Add about inbred-P-based populations)

3.2A BC mapping population

Consider two markers A & B each having two alleles (A,a) and (B,b)

Possible genotypes of these markers are (A/A, A/a) and (B/B, B/b)

Linkage is the association of two genes located on the same chromosome (Shift in previous Section)

If an offspring's genotype differs from parental genotypes at that marker, a recombination is observed (Show BC scheme)

Observed counts, with expected frequencies in parentheses:

          B/B               B/b
A/A       n1  ((1-r)/2)     n2  (r/2)
A/a       n3  (r/2)         n4  ((1-r)/2)

Total recombinant events/individuals = n2 + n3 = nR


The ML estimator of r is
$r_{AB} = (n_2 + n_3)/n$, where $n = n_1 + n_2 + n_3 + n_4$

$Var(r_{AB}) = r_{AB}(1-r_{AB})/n$

The null hypothesis of no linkage ($H_0: r = 1/2$ vs $H_1: r < 1/2$) can be tested using a $\chi^2$ test

$\chi^2 = (n_{NR} - n_R)^2/n$, df=1, where $n_{NR} = n_1 + n_4$ is the number of non-recombinants

LOD score (log10 of odds, an LR test-statistic) is more frequently used for linkage detection
$Z = \log_{10}[L(r=r_{AB})/L(r=1/2)]$

A threshold value of Z=3 is commonly used (in human genetics) to declare existence of linkage.

Z=3 means that the observed data are 1000 times more likely under linkage at $r=r_{AB}$ than under no linkage (r=1/2)
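
The backcross two-point quantities above can be computed together. This is a minimal sketch with hypothetical counts; it uses the fact that the BC likelihood is proportional to (1-r)^nNR r^nR, so the factors of 1/2 in the class probabilities cancel in the LOD ratio.

```python
import math


def backcross_two_point(n1, n2, n3, n4):
    """Return (r_AB, SE, chi-square, LOD) from the four BC genotype class counts.
    n1 and n4 are the non-recombinant classes, n2 and n3 the recombinant ones."""
    n = n1 + n2 + n3 + n4
    n_rec = n2 + n3
    n_nonrec = n1 + n4
    r_hat = n_rec / n
    se = math.sqrt(r_hat * (1.0 - r_hat) / n)
    chi2 = (n_nonrec - n_rec) ** 2 / n
    # log10 likelihood at r_hat, skipping empty classes to avoid log(0)
    log_l_hat = 0.0
    if n_rec > 0:
        log_l_hat += n_rec * math.log10(r_hat)
    if n_nonrec > 0:
        log_l_hat += n_nonrec * math.log10(1.0 - r_hat)
    lod = log_l_hat - n * math.log10(0.5)
    return r_hat, se, chi2, lod


# Hypothetical counts for 100 backcross progenies.
r_hat, se, chi2, lod = backcross_two_point(45, 5, 5, 45)
print(f"r = {r_hat:.3f} (SE {se:.3f}), chi2 = {chi2:.1f}, LOD = {lod:.2f}")
```

With these counts the LOD is far above the threshold of 3, so linkage would be declared.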

Distorted segregation tends to increase Z, leading to detection of spurious linkage. To overcome this problem, JoinMap uses a modified LOD score computed as

$Z^* = G^* / [2 \ln(10)]$

$G^* = [(4 - k)k - 3](d - 1) + G_d$

$k = \exp[-G_d/\{2(d-1)\}]$

$G_d = G = 2 \sum_{ij} [O_{ij} \ln(O_{ij}/E_{ij})]$, i=1,...,r; j=1,...,c; df = d = (r-1)(c-1)

$O_{ij}$ = observed frequency in the i-th row and j-th column of the 2-way table for two given markers

$E_{ij}$ = expected frequency in the i-th row and j-th column of the above 2-way table
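
A sketch of the modified LOD as reconstructed from the formulas above. Two points are assumptions rather than statements from these notes: the expected frequencies Eij are computed from the row and column totals of the two-way table (a test of independence), and for d = 1 the correction term is taken to vanish so that G* = G. This is not JoinMap's own code.

```python
import math


def g_statistic(table):
    """G = 2 * sum O_ij * ln(O_ij / E_ij), with E_ij from the marginal totals
    (assumed here; empty cells contribute nothing)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / n
                g += observed * math.log(observed / expected)
    return 2.0 * g


def modified_lod(table):
    """Modified LOD Z* = G* / (2 ln 10) for an r x c table of genotype counts."""
    d = (len(table) - 1) * (len(table[0]) - 1)
    g = g_statistic(table)
    if d == 1:
        g_star = g  # the (d - 1) factor removes the correction term
    else:
        k = math.exp(-g / (2.0 * (d - 1)))
        g_star = ((4.0 - k) * k - 3.0) * (d - 1) + g
    return g_star / (2.0 * math.log(10.0))


# Hypothetical BC (2 x 2) and F2 co-dominant (3 x 3) count tables.
print(modified_lod([[45, 5], [5, 45]]))
print(modified_lod([[20, 5, 0], [5, 40, 5], [0, 5, 20]]))
```

For the undistorted 2 x 2 table the result coincides with the ordinary LOD score for the same counts, since both marginal ratios are exactly 1:1.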


3.3 Minimum sample size (ng) to detect linkage

Use of F2 mapping population with co-dominant markers has the highest statistical power relative to all other models.

For F2 populations, dominant markers should generally be avoided as they usually have a mixture of coupling and repulsion linkage phase. The repulsion-linked dominant markers provide the least linkage information and statistical power.

Table 1. Minimum sample size needed to detect linkage at α=0.05 (Adapted from Liu 1998, Statistical Genomics, CRC Press, p 190)

True r   Power    BC    F2-CC   F2-CD   F2-DDc   F2-DDr
0.05     0.80     16     11      21      21       97
         0.90     21     15      27      28      130
0.10     0.80     21     16      28      30      108
         0.90     29     22      38      40      144
0.20     0.80     41     35      57      63      156
         0.90     55     47      77      85      209
0.30     0.80     95     88     139     166      297
         0.90    128    118     186     223      397

BC: Back cross; F2-CC: both co-dominant markers; F2-CD: one co-dominant and one dominant marker; F2-DDc: both dominant markers in coupling phase; F2-DDr: both dominant markers in repulsion phase.

3.4 Linkage grouping

Linkage grouping refers to placing loci into linkage groups based on their linkage relationship (estimate of recombination fraction).

A linkage group is a group of loci/markers where each locus is linked (r < 1/2) to at least one other marker. (Ideal is r < 0.35)

Biologically, a linkage group is defined as a group of genes with their loci located on the same chromosome.


Statistically, a linkage group is a group of genes inherited together according to a certain statistical criterion, e.g. Z* score

Loci on the same chromosome may be grouped into different linkage groups based on a statistical criterion because loci on a large segment of a chromosome may not be observed.

3.5 Linkage grouping criteria


Linkage grouping is usually based on estimates of either
($r_{AB}$, $Z_{AB}$) or ($r_{AB}$, $p_{AB}$), where $r_{AB}$, $Z_{AB}$, and $p_{AB}$ are respectively the
two-point recombination fraction, LOD score, and significance
P-value for a pair of loci A and B.

Criteria based on ($r_{AB}$, $Z_{AB}$):

If [{$r_{AB} \le c$} and {$Z_{AB} \ge a$}]
Then, loci A and B belong to the same linkage group

If [{$r_{AB} \le c$} or {$Z_{AB} \ge a$}] (JoinMap recommends $a \in [4,7]$)
Then, loci A and B belong to the same linkage group

Criteria based on ($r_{AB}$, $p_{AB}$):

If [{$r_{AB} \le c$} and {$p_{AB} \le b$}]
Then, loci A and B belong to the same linkage group

If [{$r_{AB} \le c$} or {$p_{AB} \le b$}]
Then, loci A and B belong to the same linkage group

c = maximum recombination fraction for declaring linkage
a = minimum LOD score for declaring linkage
b = maximum P-value for declaring linkage

In practice, joint consideration of c, a, b, and biological
information, such as the number of chromosomes, could be used
to infer linkage groups.
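
The "and" criterion above can be sketched in a few lines of Python. This is only an illustration, not the algorithm of JoinMap or any other package: the marker names and the two-point estimates are hypothetical, and the function name linkage_groups is made up for this example. Markers are joined whenever a pair satisfies r <= c and Z >= a, and groups are the resulting connected sets.

```python
from itertools import combinations

def linkage_groups(loci, r, lod, c=0.35, a=4.0):
    """Join loci A and B when r_AB <= c and Z_AB >= a; return the groups."""
    parent = {m: m for m in loci}

    def find(m):
        # Follow parent pointers to the group representative.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for x, y in combinations(loci, 2):
        key = (x, y) if (x, y) in r else (y, x)
        if key in r and r[key] <= c and lod[key] >= a:
            parent[find(x)] = find(y)  # merge the two groups

    groups = {}
    for m in loci:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical two-point estimates for five markers
loci = ["M1", "M2", "M3", "M4", "M5"]
r = {("M1", "M2"): 0.10, ("M2", "M3"): 0.20, ("M1", "M3"): 0.40,
     ("M4", "M5"): 0.05, ("M1", "M4"): 0.48, ("M3", "M5"): 0.30}
lod = {("M1", "M2"): 12.0, ("M2", "M3"): 6.5, ("M1", "M3"): 1.2,
       ("M4", "M5"): 15.0, ("M1", "M4"): 0.1, ("M3", "M5"): 2.0}

groups = linkage_groups(loci, r, lod)
print(groups)
```

Note that M1 and M3 end up in the same group although their own pair fails the criterion: each needs to be linked to at least one other member, as in the definition of a linkage group in Section 3.4.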


3.6 Locus ordering: Critically important


Locus order is defined as the relative linear arrangement of
loci/markers on a linkage group. Multiple locus ordering is
based on minimizing the number of crossovers.

Let $(a_1, a_2, \ldots, a_k)$, $k \ge 2$,

denote k loci on a linkage group with map distances (cM)

$(d_{12}, d_{23}, \ldots, d_{k-1,k})$

where map distance $d_{ij}$ is derived from the recombination fraction
$r_{ij}$ and crossover interference (1-C) between loci i and j.

rXY rYZ
_____X__________Y_________________Z___
dXY dYZ

Haldane's (H) mapping function [interference absent, (1-C)=0]

$d_{ij} = -(1/2) \log_e (1 - 2 r_{ij})$

Kosambi's (K) mapping function [interference present, $(1-C) \ne 0$]

$d_{ij} = (1/4) \log_e [(1 + 2 r_{ij}) / (1 - 2 r_{ij})]$

Both functions give $d_{ij}$ in Morgans; multiply by 100 to obtain cM.

Recombination fractions are not additive because of multiple crossovers. For
example, for locus order (X,Y,Z)

$r_{XZ} = r_{XY} + r_{YZ} - 2 C r_{XY} r_{YZ}$

where C is the coefficient of coincidence, and (1-C) is the coefficient of interference.

Map distances, on the other hand, are additive, i.e.

$d_{XZ} = d_{XY} + d_{YZ}$

Two markers are said to be separated by d cM if d is the
expected (odd) number of crossovers between the markers in
100 meiotic products.
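
As a small numerical check of the two mapping functions and of the additivity of map distances, the following Python sketch may help. The function names and the example recombination fractions are illustrative assumptions; distances are returned in cM.

```python
import math

def haldane_cm(r):
    """Haldane map distance in cM (no interference)."""
    return -50.0 * math.log(1.0 - 2.0 * r)

def kosambi_cm(r):
    """Kosambi map distance in cM (interference present)."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

def haldane_r(d_cm):
    """Inverse Haldane function: recombination fraction from cM."""
    return 0.5 * (1.0 - math.exp(-d_cm / 50.0))

# Hypothetical recombination fractions for locus order (X, Y, Z)
r_xy, r_yz = 0.10, 0.20

# With no interference the coefficient of coincidence is C = 1
r_xz = r_xy + r_yz - 2.0 * 1.0 * r_xy * r_yz

d_sum = haldane_cm(r_xy) + haldane_cm(r_yz)  # d_XY + d_YZ
d_xz = haldane_cm(r_xz)                      # d_XZ from r_XZ directly

print(round(r_xz, 4), round(d_sum, 3), round(d_xz, 3))
print(round(haldane_cm(0.20), 2), round(kosambi_cm(0.20), 2))
```

The recombination fractions do not add (0.10 + 0.20 is not 0.26), whereas the Haldane distances do, which is why maps are reported in cM rather than in recombination fractions.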


The effect of the map function on QTL mapping is small. For a given
$r_{ij}$, the following relationship between H and K holds:

$d_{ij}(\text{Haldane}) \ge d_{ij}(\text{Kosambi})$.

For k loci, there are (k!/2) possible locus orders if the orientation
of orders is ignored.

# loci         : 2   3   5    10          20
# locus orders : 1   3   60   1,814,400   1.22 x 10^18

Finding the correct locus order is a computationally intensive
problem because the number of possible locus orders increases very
quickly with the number of loci.

Multiple locus ordering approaches: Minimum SARF (sum of adjacent
recombination fractions), Minimum PARF (product of adjacent
recombination fractions), Maximum SALOD (sum of adjacent LOD scores),
Stam's (Weighted) Least Squares (Stam
1993; The Plant Journal 3(5):739-744, implemented in JoinMap).
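
For a handful of loci the minimum-SARF criterion can be applied by exhaustive search, which also makes the k!/2 count concrete. The sketch below is an illustration under stated assumptions, not how JoinMap or MapMaker order loci: the four loci and their pairwise recombination fractions are hypothetical, and the function names are made up for the example.

```python
from itertools import permutations
from math import factorial

def n_orders(k):
    """Number of distinct locus orders when orientation is ignored."""
    return factorial(k) // 2 if k > 1 else 1

def min_sarf_order(loci, r):
    """Exhaustive search for the order with the smallest sum of
    adjacent recombination fractions (SARF)."""
    def rf(x, y):
        return r[(x, y)] if (x, y) in r else r[(y, x)]

    best, best_sarf = None, float("inf")
    for order in permutations(loci):
        if order[0] > order[-1]:
            continue  # skip the mirror image of an order already seen
        sarf = sum(rf(x, y) for x, y in zip(order, order[1:]))
        if sarf < best_sarf:
            best, best_sarf = order, sarf
    return best, best_sarf

# Hypothetical pairwise recombination fractions, true order A-B-C-D
r = {("A", "B"): 0.10, ("B", "C"): 0.15, ("C", "D"): 0.05,
     ("A", "C"): 0.22, ("B", "D"): 0.185, ("A", "D"): 0.248}

best_order, best_sarf = min_sarf_order(["C", "A", "D", "B"], r)
print(best_order, round(best_sarf, 3))
print([n_orders(k) for k in (2, 3, 5, 10)])
```

Exhaustive search is only feasible for small k; with 20 loci there are already about 1.22 x 10^18 orders, which is why heuristic search strategies are used in practice.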

JoinMap uses values of r and Z*, accompanied by a $\chi^2$
goodness-of-fit test, to arrive at a locus order. Threshold
limits of $r \le 0.499$ and $Z^* \ge 0.001$ are recommended in order to
use all available information.

JoinMap, relative to MapMaker, offers more comprehensive
analytical tools and data-quality checks to build a genetic
map. The software DrawMap or GGT could be used to draw a
(publishable) genetic linkage map. The latter can be
downloaded from the internet. (Map length in MapMaker < JoinMap!)


3.7 Marker coverage and map density


The quality of a genetic map can be quantified using the
confidence of estimated locus order and loci distribution on
the map.

An ideal genetic map is one with a high level of confidence in
the estimated locus order, with markers evenly distributed on
the map, and with sufficient density (e.g. at least one marker
within a 5 cM segment).

Factors affecting marker coverage and map density include,
among others, genome length, number of markers,
distribution of markers on the genome, mapping population size
(n), and type of mapping strategy.

Estimating the marker coverage and map density, therefore, is
useful not only for quantifying the quality of a map, but also for
designing a more efficient genomic experiment.

3.8 Predicting marker coverage and map density

Given total genome length (L) and a genome map (with map
length L*), these can be easily estimated.

Marker coverage (c) is the ratio of genome map length to
total genome length, c=L*/L.

Map density is the average or the maximum map distance
between adjacent markers (gap).

Assuming that markers are randomly distributed, two
approaches are available to predict marker coverage and map
density

- Marker coverage approach
- Confidence probability approach


Marker coverage is the proportion of the genome flanked by two
markers with a certain minimum map distance (say < 2d M)
between them over the whole genome.

The confidence probability approach considers that at least one
marker is located within a 2d M genome segment.

Both approaches are approximately the same when the ratio
of minimum map distance to genome length is small.

_.__._____.__.____.___.______ L1

__.___._____.___._.____.___._ L2

_.___.____.__.____.___.__.___ L3

+d 0 -d L=L1+L2+L3

Figure 1. Probability that a random marker is located within a 2d genome-
segment is 2d/L

3.8A Marker coverage approach

Marker coverage (c) can be estimated as (Lange and Boehnke
1982, Am J Hum Genet 34:842-845)

$c = 1 - e^{-2md/L}$

m = number of markers, L = total genome length (cM)

Given the marker coverage c, genome length L, and minimum
distance 2d, the number of markers (m) needed can be
estimated from

$m = -L \log_e(1-c) / (2d)$


3.8B Confidence probability approach

The probability that at least one marker is located within a 2d
M genome segment is

$P = 1 - [1 - (2d/L)]^m$

The number of markers needed to have at least one marker
within a 2d M genome segment, given genome length L and a
confidence probability P, is

$m = [\log(1-P)] / [\log\{1 - (2d/L)\}]$

The map density (2d) can be estimated from

$2d = L [1 - (1-P)^{1/m}]$

Table 2. Numbers of markers needed at confidence probability P or marker coverage c assuming
a genome length of 1 M (100 cM) (Adapted from Liu 1998, Statistical Genomics, CRC Press, p 352)

2d (cM)   ---------- P ----------   ---------- c ----------
          0.80    0.90    0.99      0.80    0.90    0.99
   1      161     230     459       161     231     461
   5       32      45      90        33      47      93
  10       16      22      44        17      24      47
  20        8      11      21         9      12      24
  30        5       7      13         6       8      16
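
The formulas of Sections 3.8A and 3.8B are easy to evaluate directly. The Python sketch below is an illustration; the function names are made up, and rounding up to the next whole marker is an assumption that happens to reproduce the entries of Table 2 for L = 100 cM.

```python
import math

def markers_for_confidence(P, two_d, L=100.0):
    """m = log(1-P) / log(1 - 2d/L), rounded up to a whole marker."""
    return math.ceil(math.log(1.0 - P) / math.log(1.0 - two_d / L))

def markers_for_coverage(c, two_d, L=100.0):
    """m = -L log(1-c) / (2d), rounded up to a whole marker."""
    return math.ceil(-L * math.log(1.0 - c) / two_d)

def map_density(P, m, L=100.0):
    """2d = L [1 - (1-P)^(1/m)] for m markers and confidence P."""
    return L * (1.0 - (1.0 - P) ** (1.0 / m))

for two_d in (1, 5, 10, 20, 30):
    row_P = [markers_for_confidence(P, two_d) for P in (0.80, 0.90, 0.99)]
    row_c = [markers_for_coverage(c, two_d) for c in (0.80, 0.90, 0.99)]
    print(two_d, row_P, row_c)

# Segment length (cM) containing at least one of 45 markers with P = 0.90
print(round(map_density(0.90, 45), 2))
```

For a genome of a different length, pass L in cM; for example, markers_for_confidence(0.90, 10, L=1500.0) gives the number of markers for a 1500 cM genome under the same assumptions.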


Chapter 4
QTL Analysis
4.1 The key idea
A genetic marker that tends to co-segregate with the trait is
likely to be close to a QTL controlling that trait.

We seek an association between marker alleles
(genotypes) and trait values (phenotypes).

The purpose of QTL analysis is to infer QTL genotypes in order
to estimate QTL effects and locations using their statistical
association with known genetic markers: MAS vs QAS

But we only know
- Marker genotypes (M) for mapping population individuals
- Trait values (Y) of mapping population individuals
- Genetic markers located at certain positions on the
genome for each of the mapping population individuals

The key to solving the problem is to use the statistical concept of
conditional probability.
The conditional probability that the QTL genotype is $Q_k$, given that the
observed marker genotype is $M_j$, is

$\Pr(Q_k \mid M_j) = \Pr(Q_k M_j) / \Pr(M_j)$

where $\Pr(Q_k M_j)$ and $\Pr(M_j)$ are the joint and marginal probabilities.

The joint probability $\Pr(Q_k M_j)$ is the probability of
co-segregation of the marker and the QTL.

Values of $\Pr(Q_k M_j)$ and $\Pr(M_j)$ depend on the mapping population
used and the position of the putative QTL with respect to the
marker loci (known linkage map).


The trait values Y are then modeled as an appropriate
function of the conditional probabilities $\Pr(Q_k \mid M_j)$ for mapping
QTLs onto a given genetic map

$Y = f\{\Pr(Q_k \mid M_j)\}$

The derivation of conditional probabilities $\Pr(Q_k \mid M_j)$ requires
specification of a quantitative genetic model corresponding to the
mapping population, genetic markers used, and a linkage map.

The goal is to identify a few QTLs with large effects for application
in MAS, MAB, introgression

4.2 Quantitative genetic models

A genetic model for a quantitative trait, in classical quantitative
genetics, is usually defined in terms of

- Number of genes (say m)
- (Phenotypic) effects of genes [additive (a), dominance (d)]
- Gene frequencies
- Relationships among genes (epistatic interactions), and
- Relationship between environment and gene action (GxE
  interactions)


4.3 Prelim estimate of number of Loci/QTLs

- To get some indication of the chance of success in locating QTLs

- It is easier to locate genes when only a few affect the trait

- Use the Wright-Castle index to get a lower bound on m

$m = (P_1 - P_2)^2 / (8 \sigma_a^2)$ (A)

where $P_1$ and $P_2$ are the trait means of the two parents and
$\sigma_a^2$ is the F2 additive genetic variance arising from differences
in allele frequencies of the two parental populations (Castle 1921,
Science 54:541-553)

- Assumptions in equation (A)

  The m loci have equal effects and are additive
  No linkage between the m loci
  Complete fixation of alleles in P1 and P2 for the trait

- See also (for more on accurate estimation of m)

  Comstock & Enfield 1981, TAG 59:373-379
  Cockerham 1986, Genetics 114:659-664
  Zeng et al. 1990, Genetics 126:235-247
  Zeng 1992, Genetics 131:987-1001
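
Equation (A) can be checked against its own assumptions with a few lines of Python. The sketch is illustrative only: the number of loci and the per-locus additive effect are hypothetical, and the F2 additive variance is taken as m a^2 / 2, which holds for m unlinked, equal-effect, purely additive loci with allele frequency 0.5.

```python
def castle_wright(mean_p1, mean_p2, var_additive_f2):
    """Wright-Castle index: m = (P1 - P2)^2 / (8 * sigma_a^2)."""
    return (mean_p1 - mean_p2) ** 2 / (8.0 * var_additive_f2)

# Hypothetical trait controlled by m_true loci, each with additive effect a
m_true, a = 5, 2.0
mean_p1 = 20.0 + m_true * a       # parent fixed for all increasing alleles
mean_p2 = 20.0 - m_true * a       # parent fixed for all decreasing alleles
var_a_f2 = m_true * a ** 2 / 2.0  # F2 additive variance under the assumptions

m_hat = castle_wright(mean_p1, mean_p2, var_a_f2)
print(m_hat)
```

When the assumptions fail (unequal effects, linkage, incomplete fixation), the index underestimates m, which is why it is used only as a lower bound.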


4.4 Genetic models for F2 and BC

Consider a bi-allelic locus $Q_k$ with the effects of the three
genotypes defined as (k=1,...,m)

Genotype   QkQk     Qkqk     qkqk
Effect     $a_k$    $d_k$    $-a_k$

Assume populations P1 and P2 are fixed for alleles Qk and qk
respectively at each of the m loci. Also, assume there is no epistasis.

4.4A F2 model (Qkqk x Qkqk)

Genotype   Obs#         Value          Variance
QkQk       $n_{QQ}$     $\mu_{QQ}$     $\sigma^2$
Qkqk       $n_{Qq}$     $\mu_{Qq}$     $\sigma^2$
qkqk       $n_{qq}$     $\mu_{qq}$     $\sigma^2$
Total      n

Additive effect $a_k = (\mu_{QQ} - \mu_{qq})/2$

Dominance effect $d_k = \mu_{Qq} - [(\mu_{QQ} + \mu_{qq})/2]$

The additive effect is the same as the average effect of gene
substitution since the expected frequencies of the two alleles
are the same (=0.5) in F2

4.4B BC model (QkQk x Qkqk)

Genotype   Obs#         Value          Variance
QkQk       $n_{QQ}$     $\mu_{QQ}$     $\sigma^2$
Qkqk       $n_{Qq}$     $\mu_{Qq}$     $\sigma^2$
Total      n

Genetic effect $g_k = (\mu_{QQ} - \mu_{Qq})/2$
               $= [\mu_{QQ} - (1/2)(\mu_{QQ} + \mu_{qq} + 2d_k)]/2$
               $= (a_k - d_k)/2$

=> $a_k$ and $d_k$ cannot be separated


4.4C Models for DH, RIL and test cross (TC)

These can be treated like a BC in terms of data analysis since
the expected genotypic frequencies are the same as in BC.

However, the interpretation of QTL mapping results will be different

- QTL effects in BC are a mixture of a and d effects

- QTL effects in DH and RIL are purely additive

4.5 QTL analysis approaches

These could be classified according to the number of genetic
markers used as the unit of analysis in analyzing the data

- Single-Marker Analysis (SMA)
  (QTL Cartographer, ...)

- Two-Marker Analysis
  [Simple Interval Mapping (SIM)]
  (QTL Cartographer, MapQTL, PlabQTL, ...)

- Multiple-Marker Analysis
  [Multiple regression, CIM, MQM]
  (QTL Cartographer, PlabQTL, MapQTL, ...)

- Multiple Interval Mapping (MIM)
  A multiple QTL-oriented method that allows estimation of
  number, positions, effects, and epistatic interactions
  among significant QTLs simultaneously (QTL Cartographer)

- Multiple-Trait IM and CIM
  Testing and estimating QTLs affecting multiple traits;
  Testing pleiotropy and pleiotropy-vs-linkage; Testing QTL x
  environment interactions (QTL Cartographer, MultiQTL)

- Non-Parametric Mapping (MapQTL)


4.6 Single-marker analysis (BC as an example)


r
------A--------------------------Q--------
------a--------------------------q--------
marker QTL

Joint Segregation of QTL and Marker Genotypes

Marker   Obs#       Marginal    QTL Genotype           Expected
Geno                Prob        QQ         Qq          Trait Value
(M)                 [Pr(A)]

Joint Probability $\Pr(Q A)$

AA       $n_{AA}$   1/2         (1-r)/2    r/2
Aa       $n_{Aa}$   1/2         r/2        (1-r)/2

Conditional Probability $\Pr(Q \mid A)$

AA       $n_{AA}$               (1-r)      r           $\mu_{AA} = (1-r)\mu_{QQ} + r\mu_{Qq}$
Aa       $n_{Aa}$               r          (1-r)       $\mu_{Aa} = r\mu_{QQ} + (1-r)\mu_{Qq}$

Above, the Expected Trait Value, e.g. for marker genotype AA,
is derived as

$E(Y_{AA}) = \mu_{AA} = \Pr(QQ \mid AA)\,\mu_{QQ} + \Pr(Qq \mid AA)\,\mu_{Qq}$
$E(Y_{AA}) = \mu_{AA} = (1-r)\,\mu_{QQ} + r\,\mu_{Qq}$

Single marker analysis can be done without a linkage map
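
The step from joint to conditional probabilities, and from there to expected trait values of the marker classes, can be traced numerically. The Python sketch below is an illustration for the BC case; the recombination fraction and the two QTL genotype means are hypothetical values.

```python
def bc_single_marker(r, mu_QQ, mu_Qq):
    """Marginal, conditional probabilities and expected marker-class
    means for one marker and one QTL in a backcross."""
    joint = {("AA", "QQ"): (1 - r) / 2, ("AA", "Qq"): r / 2,
             ("Aa", "QQ"): r / 2,       ("Aa", "Qq"): (1 - r) / 2}
    # Marginal probability of each marker genotype
    marginal = {m: joint[(m, "QQ")] + joint[(m, "Qq")] for m in ("AA", "Aa")}
    # Conditional probability Pr(Q | A) = Pr(Q, A) / Pr(A)
    cond = {k: v / marginal[k[0]] for k, v in joint.items()}
    mu = {"QQ": mu_QQ, "Qq": mu_Qq}
    expected = {m: sum(cond[(m, q)] * mu[q] for q in mu) for m in ("AA", "Aa")}
    return marginal, cond, expected

# Hypothetical values: r = 0.1, QTL genotype means 12 and 10
marginal, cond, expected = bc_single_marker(0.1, 12.0, 10.0)
diff = expected["AA"] - expected["Aa"]
print(marginal, cond[("AA", "QQ")], expected, round(diff, 3))
```

The difference between the two marker-class means equals (1-2r) times the difference between the QTL genotype means, so it shrinks to zero as r approaches 1/2.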


4.6A Statistical tests for QTL detection

t-test

$t = [m(AA) - m(Aa)] / [s^2\{(1/n_{AA}) + (1/n_{Aa})\}]^{1/2}$, df = n-2

$s^2$ = pooled variance within the two marker genotypes,
m(AA) = phenotypic mean of marker genotype AA

The expected value of the difference between the two marker genotype
means is

$E[m(AA) - m(Aa)] = \mu_{AA} - \mu_{Aa}$
                  $= (1-2r)(\mu_{QQ} - \mu_{Qq}) = (1-2r)\delta$
                  $= (1-2r)(a - d)$

where $\delta = \mu_{QQ} - \mu_{Qq}$.

$H_0: \mu_{AA} - \mu_{Aa} = 0$

=> (a - d) = 0, i.e. there is no genetic effect

or r = 1/2, i.e. QTL and marker are independent

Since P1 and P2 were chosen to be different for the trait, the
condition $\delta \ne 0$ will be satisfied unless allele Q is completely
dominant to allele q.

Analysis of variance

A one-way ANOVA could also be used which, in the case of a BC having
only two marker genotype classes, would give the same inference
as a t-test since ANOVA $F = t^2$.
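
The equivalence of the two tests for two marker classes can be verified on any small data set. The following Python sketch uses hypothetical trait values for the AA and Aa marker classes; the function names are made up for the example.

```python
from statistics import mean

def pooled_t(y_AA, y_Aa):
    """Two-sample t statistic with pooled variance, df = n - 2."""
    n1, n2 = len(y_AA), len(y_Aa)
    m1, m2 = mean(y_AA), mean(y_Aa)
    ss_within = sum((y - m1) ** 2 for y in y_AA) + sum((y - m2) ** 2 for y in y_Aa)
    s2 = ss_within / (n1 + n2 - 2)  # pooled within-class variance
    t = (m1 - m2) / (s2 * (1 / n1 + 1 / n2)) ** 0.5
    return t, n1 + n2 - 2

def anova_f(y_AA, y_Aa):
    """One-way ANOVA F statistic for two marker genotype classes."""
    groups = [y_AA, y_Aa]
    grand = mean(y_AA + y_Aa)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_within = len(y_AA) + len(y_Aa) - 2
    return (ss_between / 1) / (ss_within / df_within)

# Hypothetical trait values of BC progeny grouped by marker genotype
y_AA = [12.1, 11.4, 12.8, 11.9, 12.5]
y_Aa = [10.2, 10.9, 9.8, 10.5]

t_stat, df = pooled_t(y_AA, y_Aa)
f_stat = anova_f(y_AA, y_Aa)
print(round(t_stat, 3), df, round(f_stat, 3), round(t_stat ** 2, 3))
```

With three marker classes, as in an F2 with a co-dominant marker, only the ANOVA form applies directly.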


Regression approach

Model $y_j = \beta_0 + \beta X_j + e_j$

where $y_j$ = trait value for test-unit j in the mapping population
$\beta_0$ = intercept = overall mean of trait
$\beta$ = slope of regression line
$X_j$ = 1 if test-unit j is AA, -1 if test-unit j is Aa
$e_j$ = random error of test-unit j, $e_j \sim N(0, \sigma^2)$

The expected means, variances, and covariances to estimate
regression coefficients are

$E[m(X)] = (1/2)(1) + (1/2)(-1) = 0$
$E(s_X^2) = (1/2)(1)^2 + (1/2)(-1)^2 = 1$
$E[m(y)] = (1/2)[(1-r)\mu_{QQ} + r\mu_{Qq} + r\mu_{QQ} + (1-r)\mu_{Qq}] = (\mu_{QQ} + \mu_{Qq})/2$
$E(s_{Xy}) = (1/2)(1)[(1-r)\mu_{QQ} + r\mu_{Qq}] + (1/2)(-1)[r\mu_{QQ} + (1-r)\mu_{Qq}]$
           $= (1/2)(1-2r)(\mu_{QQ} - \mu_{Qq})$

Therefore

$\beta_0 = E[m(y)] = (\mu_{QQ} + \mu_{Qq})/2$

$\beta = E(s_{Xy}) / E(s_X^2) = E(s_{Xy}) = (1/2)(1-2r)(\mu_{QQ} - \mu_{Qq})$
       $= (1/2)(1-2r)(a - d)$

$H_0: \beta = 0$

=> (a - d) = 0, i.e. there is no genetic effect

or r = 1/2, i.e. QTL and marker are independent
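
The expected intercept and slope derived above can be checked by simulation. The Python sketch below is an illustration under stated assumptions: the population size, recombination fraction, QTL genotype means and error standard deviation are hypothetical, and the regression is fitted by ordinary least squares written out by hand.

```python
import random

def simulate_bc(n, r, mu_QQ, mu_Qq, sigma, rng):
    """Simulate marker codes (+1 = AA, -1 = Aa) and trait values in a BC."""
    xs, ys = [], []
    for _ in range(n):
        marker_is_AA = rng.random() < 0.5
        recombinant = rng.random() < r
        # The QTL genotype matches the marker unless a recombination occurred
        qtl_is_QQ = marker_is_AA != recombinant
        xs.append(1 if marker_is_AA else -1)
        ys.append(rng.gauss(mu_QQ if qtl_is_QQ else mu_Qq, sigma))
    return xs, ys

def ols(xs, ys):
    """Intercept and slope of the simple linear regression of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(2004)
r, mu_QQ, mu_Qq = 0.10, 12.0, 10.0
xs, ys = simulate_bc(20000, r, mu_QQ, mu_Qq, 1.0, rng)
b0, b1 = ols(xs, ys)

expected_b0 = (mu_QQ + mu_Qq) / 2
expected_b1 = 0.5 * (1 - 2 * r) * (mu_QQ - mu_Qq)
print(round(b0, 2), round(b1, 2), expected_b0, expected_b1)
```

The estimated slope recovers only the product of (1-2r) and the QTL effect; the two cannot be separated with a single marker.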


Maximum likelihood approach

- Assume that the trait values for each QTL genotype QQ
and Qq follow a normal distribution with means $\mu_{QQ}$ and
$\mu_{Qq}$ respectively and constant variance $\sigma^2$.

- Then, since

$\mu_{AA} = (1-r)\,\mu_{QQ} + r\,\mu_{Qq}$

$\mu_{Aa} = r\,\mu_{QQ} + (1-r)\,\mu_{Qq}$

the distribution of trait values for each of the two marker
genotype classes AA and Aa will be a mixture of two
normal distributions

- The likelihood for a BC population with n individuals is

$L(\mu_{QQ}, \mu_{Qq}, \sigma^2, r \mid y_i) = L(\mu_{QQ}, \mu_{Qq}, \sigma^2, r)$

$= (2\pi\sigma^2)^{-n/2} \prod_{i} \sum_{j} \Pr(Q_j \mid A_i) \exp[-(y_i - \mu_j)^2 / (2\sigma^2)]$ (1)

i=1,...,n ; j=1,2

which is the joint density function for a mixture of two normal
distributions in unequal proportions (1-r) and r.

- It is assumed in L that each of the four marker-QTL genotype
classes has a constant variance $\sigma^2$, where

$\Pr(Q_j \mid A_i)$ = conditional probability of the QTL genotype
being $Q_j$ given that the marker genotype is $A_i$
(mixing proportion, a function of r)

$y_i$ = observed trait value of individual i

$\mu_j$ = expected trait value of QTL genotype j


Test-statistic to test $H_0: r = 1/2$

- Log-likelihood ratio (LR) test

$G = 2 \log_e [L\{m_{QQ}, m_{Qq}, s^2, \mathrm{est}(r)\} / L(m_{QQ}, m_{Qq}, s^2, r=1/2)]$

which is distributed asymptotically as a $\chi^2$ with df=1.

- LOD score

$Z = \log_{10} [L\{m_{QQ}, m_{Qq}, s^2, \mathrm{est}(r)\} / L(m_{QQ}, m_{Qq}, s^2, r=1/2)]$

- Relation between LR-statistic G and LOD-score

G is computed using the natural logarithm.

LOD is computed using the base-10 logarithm.

LOD = 0.21715*G     G = 4.60517*LOD

- G is interpreted through its P-value, i.e. the probability, under
the null hypothesis, of a value at least as large as the one observed,
obtained from the theoretical $\chi^2$ distribution.

- LOD is interpreted using the concept of odds ratio. A LOD=2
means that the observed data are $10^2 = 100$ times more
likely under the alternative hypothesis than under the null hypothesis.
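
The mixture likelihood (1) and the two test statistics can be illustrated with a small grid search over r. This Python sketch is a simplification under stated assumptions: the data are hypothetical, and the QTL genotype means and the variance are treated as known values rather than being estimated jointly with r, as a full maximum likelihood analysis would do.

```python
import math

def log_lik(ys, markers, mu_QQ, mu_Qq, sigma2, r):
    """Log of likelihood (1): each observation is a two-component
    normal mixture with mixing proportions determined by r."""
    ll = 0.0
    for y, m in zip(ys, markers):
        p_QQ = (1 - r) if m == "AA" else r  # Pr(QQ | marker genotype)
        dens_QQ = math.exp(-(y - mu_QQ) ** 2 / (2 * sigma2))
        dens_Qq = math.exp(-(y - mu_Qq) ** 2 / (2 * sigma2))
        ll += math.log(p_QQ * dens_QQ + (1 - p_QQ) * dens_Qq)
        ll -= 0.5 * math.log(2 * math.pi * sigma2)
    return ll

# Hypothetical BC data: six AA and six Aa individuals
markers = ["AA"] * 6 + ["Aa"] * 6
ys = [12.3, 11.8, 12.1, 10.1, 12.4, 11.6,
      10.2, 9.9, 10.4, 12.0, 9.7, 10.3]
mu_QQ, mu_Qq, sigma2 = 12.0, 10.0, 0.25  # assumed known for illustration

r_grid = [i / 100 for i in range(1, 51)]  # 0.01, 0.02, ..., 0.50
r_hat = max(r_grid, key=lambda r: log_lik(ys, markers, mu_QQ, mu_Qq, sigma2, r))

G = 2 * (log_lik(ys, markers, mu_QQ, mu_Qq, sigma2, r_hat)
         - log_lik(ys, markers, mu_QQ, mu_Qq, sigma2, 0.5))
LOD = G / (2 * math.log(10))  # LOD = 0.21715 * G
print(r_hat, round(G, 2), round(LOD, 2))
```

In this data set two of the twelve individuals look like recombinants, so the estimate of r lands close to 2/12.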


4.7 Multiple marker analysis using multiple regression


No linkage map is needed

A simple extension of single-marker regression analysis to
identify markers simultaneously linked to trait loci

$y_j = \beta_0 + \sum_m \beta_m X_{jm} + e_j$,  m=1,...,M;  j=1,...,n

Identify markers with $\beta_m \ne 0$

With M markers at hand, there are $2^M$ possible models to
search to find the most parsimonious model: Search strategy?

Need a criterion to choose the final model

Search strategy: step-wise regression, backward elimination,
forward selection, MCMC, ...

Model selection criterion: $R_a^2$ (adjusted $R^2$), AIC, BIC, ...

Our experience: Use step-wise regression with increasingly
stringent $F_{in}$ and $F_{out}$ to filter out the most important
trait-linked markers that are not linked to each other
($F_{in} = F_{out} \in [3, 4, \ldots, 15]$)

BIC is another good candidate to identify important
trait-linked markers that are not linked to each other

With a large sample of progenies, results from stringent
$F_{in}$-$F_{out}$-based step-wise regression closely match those from
map-based QTL analysis


4.8 Simple interval mapping (SIM)


The single-marker analysis has shortcomings
- Gives indirect estimates of QTL position and effect since both
are intrinsically confounded [$H_0: (1-2r_{AQ})\delta = 0$]
- Cannot separate linked QTLs
- Ignores effect of other QTLs
- Order of loci cannot be resolved: QA or AQ?

In SIM, intervals formed by adjacent markers are used as the unit
of analysis to test for the presence of a single QTL by using
genetic information from the two flanking markers

Possible to separate r and $\delta$, and to infer the position of Q relative
to both markers.

More precision and power expected from use of additional
information on the second marker

        r1              r2
___A__________Q_______________B______
___a__________q_______________b______
   |<------------- r ------------->|

Assuming no crossover interference (1-C=0), the relationship among the recombination
fractions is

$r = r_1 + r_2 - 2 r_1 r_2$

With small r, the double-crossover term ($2 r_1 r_2$) may be ignored, which reduces the above
equation to

$r = r_1 + r_2$

The QTL position can be represented by a point relative to the interval between the two
markers by a position parameter

$\rho = r_1/r$

with $1 - \rho = 1 - (r_1/r) = r_2/r$

A linkage map must be available for SIM


4.8a SIM in a BC population

Parents   AAQQBB x aaqqbb

F1        AaQqBb x AAQQBB

BC Progeny                         Joint
(8 marker-QTL genotypes)           Prob

AAQQBB, AaQqBb                     (1-r)/2
AAQqBb, AaQQBB                     $r_1/2$
AAQQBb, AaQqBB                     $r_2/2$
AAQqBB, AaQQBb                     0 (resulting from double crossover)

If A and B are tightly linked, the possibility of a double crossover in
both segments AQ and QB can be ignored.
Then, the frequency of marker-QTL genotypes AAQqBB and AaQQBb,
being results of double crossovers, is relatively small and can be
practically treated as zero as shown above

Marginal and Joint Probabilities

Marker   Obs#      Marginal      Joint Probability P(Q M)
Geno               Prob P(M)     QQ           Qq

AABB     $n_1$     (1-r)/2       (1-r)/2      0
AABb     $n_2$     r/2           $r_2/2$      $r_1/2$
AaBB     $n_3$     r/2           $r_1/2$      $r_2/2$
AaBb     $n_4$     (1-r)/2       0            (1-r)/2

See 2-locus analysis in Chapter 3


Conditional probabilities $P(Q \mid M)$ and expected trait values
for marker genotypes

Marker   $P(Q \mid M)$                              Expected Trait Value
Geno     QQ                   Qq                    of marker genotype

AABB     1                    0                     $\mu_{QQ} = \mu_{AABB}$
AABb     $r_2/r = (1-\rho)$   $r_1/r = \rho$        $(1-\rho)\mu_{QQ} + \rho\mu_{Qq} = \mu_{AABb}$
AaBB     $r_1/r = \rho$       $r_2/r = (1-\rho)$    $\rho\mu_{QQ} + (1-\rho)\mu_{Qq} = \mu_{AaBB}$
AaBb     0                    1                     $\mu_{Qq} = \mu_{AaBb}$

Mean     $\mu_{QQ}$           $\mu_{Qq}$            $(\mu_{QQ} + \mu_{Qq})/2$

If $\rho = 0$   --> QTL located right at marker A
   $\rho = 1/2$ --> QTL located in the middle of A and B
   $\rho = 1$   --> QTL located right at marker B

Regression approach

- Model: $y = \beta_0 + X^* a + e$
  $X^* = P(Q_j \mid M_i)$ computed for each map position
  a = effect of putative QTL
- Regression is computed at regular positions in each interval.
- LR test-statistic for each map position is computed as
  $LR = n \log_e(SSRes_{red}/SSRes_{full})$
     $= -n \log_e(1 - R^2)$
     $= 2 (\log_e 10)\, LOD$
     $\approx p\,(MS_{reg}/MS_{res}) = p\,F_{reg}$
  p is the number of parameters (including QTL position) fitted,
  $R^2$ is the coefficient of determination, n is the number of progenies
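
The regression approach can be sketched for a single BC interval as follows. This Python example is an illustration under stated assumptions: the marker genotypes and trait values are hypothetical, the position parameter rho = r1/r is scanned on a coarse grid, and double crossovers are ignored as in the conditional probability table above.

```python
import math

def x_star(geno, rho):
    """P(QQ | flanking marker genotype) at relative position rho."""
    return {"AABB": 1.0, "AABb": 1.0 - rho, "AaBB": rho, "AaBb": 0.0}[geno]

def lr_at(genos, ys, rho):
    """LR = n log_e(SSRes_red / SSRes_full) for the regression of y on X*."""
    n = len(ys)
    xs = [x_star(g, rho) for g in genos]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_red = sum((y - my) ** 2 for y in ys)  # model with intercept only
    ss_full = ss_red - sxy ** 2 / sxx        # model with the putative QTL
    return n * math.log(ss_red / ss_full)

# Hypothetical BC progeny: flanking-marker genotypes and trait values
genos = ["AABB"] * 4 + ["AABb"] * 2 + ["AaBB"] * 2 + ["AaBb"] * 4
ys = [12.2, 11.7, 12.4, 11.9, 11.8, 10.4, 10.3, 11.6, 10.1, 9.8, 10.3, 9.9]

rho_grid = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
lr_profile = [lr_at(genos, ys, rho) for rho in rho_grid]
lod_profile = [lr / (2 * math.log(10)) for lr in lr_profile]

best = max(range(len(rho_grid)), key=lambda i: lr_profile[i])
print(rho_grid[best], round(lr_profile[best], 2), round(lod_profile[best], 2))
```

Repeating the scan over all intervals and plotting the LOD values against map position gives the LOD profile used to locate QTLs.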


- Regression-LR provides a very close approximation to ML-LR if
the distance between adjacent markers is <20 cM and there are
not many missing data

- Regression approach offers many advantages
  Faster computations
  Application of standard statistical software
  Flexibility in choice of genetic models (including epistasis)
  Incorporation of experimental design features
  Robust to non-normality

- The LR (or LOD) is plotted against genome position and compared
with a genome-wide threshold --> LR (or LOD) profile

- Wherever the LOD curve exceeds the threshold, presence of a
QTL is inferred

- The point at which the LOD is maximum is used as an estimate of
QTL position

- A one- or two-LOD support interval around the inferred QTL
position is used as an interval estimate of QTL position

- SIM provides no improvement over single marker analysis in
detecting QTLs when using a relatively dense marker map (10 cM
spacing or less) and a small or moderate number of progeny (500
or less)

- The benefit of SIM is in providing more precise estimates of QTL
location and effects


4.9 Composite interval mapping


SIM is a single-QTL model; multiple QTLs cannot be clearly
determined

Location of QTLs may not be clearly resolved; ghost QTLs can
occur
- Ghost effects occur when a QTL is located in an interval, and
adjacent intervals too exhibit significant LOD scores
- Problem is: IM does not take into account all markers at once,
but uses them only two at a time. This makes it difficult to
differentiate between actual and ghost QTLs that may exist simply
due to relative density of the map
- If there are QTLs Q1 and Q2 in non-adjacent intervals (M1,M2)
and (M3,M4), there may be a spurious indication of a QTL in the
intervening interval (M2,M3) (Martinez & Curnow 1992, TAG 85:480-488)

With other QTLs segregating in the mapping population, effects
of these and their interactions with the QTL in the target SIM
segment are pooled into the experimental error (genetic
background effects) since they are not included in the
statistical model. This reduces the power to detect QTLs

The basic assumption of SIM is that there is only a single
segregating QTL affecting the trait. If more than one
QTL exists, the SIM test statistics in different segments are not independent.

Composite interval mapping (CIM) is a combination of interval
mapping and multiple linear regression. For a BC population, the
statistical model for the CIM analysis of a genome segment between
markers Mi and Mi+1 is given below

           |<------- r ------->|
           |<- r1 ->|<-- r2 -->|
___.______.________.__________._____________.______
  Mi-1    Mi        Q        Mi+1          Mi+2
          (A)      QTL       (B)


Model: $y_j = \beta_0 + \beta_i X_{ij} + \sum_k \beta_k X_{kj} + e_j$
       j=1,...,n;  k=1,...,nm*;  $k \ne i, i+1$

$\beta_0$ = intercept of the model

$\beta_i$ = genetic effect of putative QTL Q located between
markers Mi and Mi+1

$X_{ij}$ = 1 for marker genotype AABB
           0 for marker genotype AaBb
           1 with prob=$\rho$ and 0 with prob=$1-\rho$ for AaBB
           1 with prob=$1-\rho$ and 0 with prob=$\rho$ for AABb

$\beta_k$ = partial regression coefficient of $y_j$ on marker k
($k \ne i$, $k \ne i+1$)

$X_{kj}$ = 1 for marker genotype AA
           0 for marker genotype Aa

$e_j$ = random error, $e_j \sim N(0, \sigma^2)$

nm* = number of markers selected as cofactors

The effects of other QTLs, if any, other than the one in the test
interval (Mi,Mi+1), are removed through the regression on
markers outside the test interval. This increases the power of QTL
detection

The CIM model parameters can be estimated using maximum
likelihood. Alternatively, a LS regression approach, replacing
X by X* as in SIM, can be used. The basic procedure for both is
similar to SIM described earlier

The LR test statistic G or LOD score Z can be computed for
each position on the genome in a similar manner as for SIM

The values of the test-statistic (G or Z) are plotted against
genome positions. The resulting plot, using an appropriate
significance threshold value, can be used to infer the
presence, position, and effects of putative QTLs


4.9a Practical implementation of CIM


Step 1: Selection of markers to act as co-factors
This can be done by SIM, or by multiple regression of markers using
a step-wise regression procedure
Step 2: QTL detection
Two possible alternatives are to use as co-factors
a. Only unlinked or distantly linked markers selected in Step 1
b. These markers plus the closest linked markers with a minimum
distance of 20 to 30 cM from the study interval
c. Compare the results from a and b
d. A putative QTL is declared to be present if the test statistic
shows a clear peak at least under (b) and exceeds the critical
threshold either under (a) or (b).
Step 3: Estimation of QTL effects
Obtain these by separately fitting a model in which all QTL positions
identified in Step 2 are considered simultaneously

4.9b Step-wise regression for cofactor selection

$2^u$ possible models with u markers/predictors;

Specify $F_{in}$ and $F_{out}$ values; PlabQTL recommends both to be
set at 3.5 for $\alpha$=0.25

Select first predictor having largest F-value

Enter extra predictors if

$R_{in} = [SSRes_u - SSRes_{u+1}] / [SSRes_{u+1}/(n-u-2)] > F_{in}$

Delete predictor from model if

$R_{out} = [SSRes_{u-1} - SSRes_u] / [SSRes_u/(n-u-1)] < F_{out}$

Stepwise procedure performs better when predictors are highly
correlated

4.10 Significance threshold for QTL detection

A significance threshold for test-statistic G (or Z) depends on the
Type-1 error $\alpha$ (incorrectly inferring presence of a QTL, false
positive) and the number of markers genotyped, the latter resulting
in say M intervals.

Issue is complicated because

- Distribution of test-statistic under $H_0$ is often not clear

- Multiple simultaneous tests are performed, many of which are
not independent

In CIM, G (or Z) is nearly independent in different intervals.
One could therefore use the Bonferroni approximation $\alpha/M$ for
individual tests to control a specified overall Type-1 error $\alpha$.

Another alternative is to use permutation methods to derive an
empirical significance threshold from the experimental data.
Use at least 1000 permutations for $\alpha$=0.05, and at least 10000
for $\alpha$=0.01.

The choice of $\alpha$ is not a technical issue. It depends on the goals of
the experiment. QTL researchers often tend to use $\alpha$=0.05. But
for exploratory QTL studies, $\alpha$=0.25 could be acceptable (Beavis
1998, In Molecular Dissection of Complex Traits, Ed Paterson, CRC Press,
pp 145-162).
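
A permutation threshold can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of any particular QTL package: the marker and trait data are simulated, the test statistic is the single-marker regression LR, and the genome-wide statistic is its maximum over markers. Shuffling the trait values breaks any marker-trait association while keeping the marker data intact.

```python
import math
import random

def lr_single_marker(xs, ys):
    """LR = -n log_e(1 - R^2) for the regression of y on one marker."""
    n = len(ys)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -n * math.log(1.0 - sxy * sxy / (sxx * syy))

def permutation_threshold(markers, ys, alpha=0.05, n_perm=1000, seed=1):
    """Empirical (1 - alpha) quantile of the genome-wide maximum LR
    obtained after shuffling trait values among individuals."""
    rng = random.Random(seed)
    ys = list(ys)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(ys)
        maxima.append(max(lr_single_marker(xs, ys) for xs in markers))
    maxima.sort()
    return maxima[math.ceil((1 - alpha) * n_perm) - 1]

# Simulated data: 60 BC individuals, 10 markers coded +1 (AA) / -1 (Aa)
rng = random.Random(11)
markers = [[rng.choice((1, -1)) for _ in range(60)] for _ in range(10)]
ys = [10.0 + 0.8 * markers[3][j] + rng.gauss(0.0, 1.0) for j in range(60)]

threshold = permutation_threshold(markers, ys, alpha=0.05, n_perm=1000)
observed = [lr_single_marker(xs, ys) for xs in markers]
print(round(threshold, 2), [round(v, 1) for v in observed])
```

Markers whose observed LR exceeds the threshold are declared significant at the genome-wide level alpha.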


4.11 Estimation of key MAS parameters

Relative to phenotypic selection, the efficiency of MAS is
defined as $\sqrt{p/H^2}$ (Lande & Thompson 1990, Genetics 124:743-756),
where p is the proportion of genetic variance accounted for by
detected QTLs, i.e., $p = \sigma^2_{QTL}/\sigma^2_G$. Parameter p can be estimated
as

$p = R_a^2 / H^2$

$R_a^2 = 1 - \{(n-1)/(n-u)\}(1 - R^2) = \sigma^2_{QTL}/\sigma^2_P$

$H^2 = \sigma^2_G/\sigma^2_P$

$R^2 = SS_{reg}/SS_{total} = SS_{QTL}/SS_{total}$

Factors affecting p are

- Sample size n
- Population type
- Heritability
- Genetic architecture

Data set used for    Common Practice    L&T
QTL detection        D                  D
QTL estimation       E (D = E)          E


4.12 Cross validation and bootstrapping


Cross validation (CV) is a statistical method to evaluate the
predictive capability of a given set of models.

Utz et al. (2000, Genetics 154:1839-1849) used k-fold CV to assess
the bias and sampling error in estimated p arising from
genotypic and environmental sampling. The principle is shown
in Fig. 2.

[Figure 2 panel labels: Genotypic sample N = 344; Estimation set (ES) N = 276, u = 3]

Fig. 2. Subdivision of the data set used for cross validation (CV) into
sub-samples. Data from the estimation set (ES) serve for QTL
detection and mapping. Data from the test set (TS) are used
for obtaining asymptotically unbiased estimates of p

In 5-fold cross validation, the entire data set (EDS), having N
genotypes, is divided into five equal parts. QTL detection and
position-estimation by an established method (e.g., CIM by
the regression approach) is based on the estimation set (ES),
comprising genotypes in four subsets using phenotypic data
from all environments but one

Genotypic data from the 5th subset together with phenotypic data
from the remaining environment, the test set (TS), are used for
estimation of QTL effects and p. This procedure is repeated
250 times with different partitioning of EDS into ES and TS

                                  Estimation Set (ES)               Test Set (TS)
QTL detection (model selection)   $X^*_{es}$                        $X^*_{ts}$
Estimation of QTL effects         $b_{es}$                          Predict $Y_{ts}$ as $y_{ts} = X^*_{ts} b_{es}$
Estimation of p                   $p_{es} = R^2_{adj}/H^2_{es}$     $p_{ts} = r^2(y_{ts}, Y_{ts})/H^2_{ts}$


Bootstrap (BS) is a resampling technique to measure the
quality of statistical estimates and models (Efron & Tibshirani
1993, An Introduction to the Bootstrap, Chapman & Hall)

A valid application of BS requires that the sample being
bootstrapped consist of independent observations. This
assumption is satisfied by the data used in QTL analysis,
because the progenies are a representative sample from a
segregating population

The original sample of N progenies in EDS is used to generate
so-called BS samples, each of size N, using sampling with
replacement.

A large number (b=1,...,B) of independent BS samples are
generated. The original sample is used for QTL mapping and
yields an estimate $\hat{p}$ of p. Likewise, each BS sample is used
for QTL mapping, with the same decision rules (LOD threshold,
selection of cofactors etc.) as for the original
sample, and yields an estimate $\hat{p}_b$.

The standard error (SE) and bias of $\hat{p}$ can be estimated as
follows, where $\bar{p}_B = \sum_b \hat{p}_b / B$ is the mean of the BS estimates:

$SE(\hat{p}) = SD(\hat{p}_b) = [\sum_b (\hat{p}_b - \bar{p}_B)^2 / (B-1)]^{1/2}$,

$bias(\hat{p}) = \sum_b \hat{p}_b / B - \hat{p}$

A bias-corrected estimate of p can be obtained as

$\hat{p}^* = \hat{p} - bias(\hat{p}) = 2\hat{p} - \sum_b \hat{p}_b / B$
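
The bootstrap summary statistics can be sketched in Python as follows. This is an illustration under stated assumptions: a full QTL analysis per bootstrap sample is replaced by a simple stand-in estimator (the R^2 of a one-predictor regression), the data are simulated, and the bias is taken in the usual sense of bootstrap mean minus original estimate.

```python
import random

def r_squared(pairs):
    """Stand-in estimator: R^2 of the regression of y on x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy * sxy / (sxx * syy)

def bootstrap_summary(data, estimator, B=500, seed=1):
    """Original estimate, bootstrap SE, bias and bias-corrected estimate."""
    rng = random.Random(seed)
    n = len(data)
    p_hat = estimator(data)
    # Each BS sample has size n and is drawn with replacement
    p_b = [estimator([data[rng.randrange(n)] for _ in range(n)])
           for _ in range(B)]
    mean_b = sum(p_b) / B
    se = (sum((p - mean_b) ** 2 for p in p_b) / (B - 1)) ** 0.5
    bias = mean_b - p_hat
    p_star = 2.0 * p_hat - mean_b  # bias-corrected estimate
    return p_hat, se, bias, p_star

# Simulated (predictor, trait) pairs for 40 hypothetical progenies
rng = random.Random(7)
data = []
for _ in range(40):
    x = rng.gauss(0.0, 1.0)
    data.append((x, 0.6 * x + rng.gauss(0.0, 1.0)))

p_hat, se, bias, p_star = bootstrap_summary(data, r_squared)
print(round(p_hat, 3), round(se, 3), round(bias, 3), round(p_star, 3))
```

In a real application the estimator argument would rerun the complete QTL mapping procedure, with the same decision rules, on each bootstrap sample.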


4.13 QTL mapping in summary


What we need to have

- Marker genotype data: error free
- Genetic linkage map: good quality
- Trait phenotype data: accurate (bias-free)

What we need to appreciate

- Theoretical linkage relationships between marker(s) and QTL
  [conditional probabilities $\Pr(Q_k \mid M_j)$] as per the QG model for the
  mapping population and marker system employed
- Expected trait values for marker genotype classes (AA, Aa, ...)
  which contain a mixture of putative QTL genotypes in
  different proportions depending on the strength of linkage

What we are looking for

- Do putative QTLs exist?
- If yes, what is the QTL effect?
- Which are the important trait-linked markers?
- What is the genomic position of the QTL?

What statistical methods to use

- Use composite interval mapping, multiple regression, ...
- Use an appropriate significance threshold! $\alpha$=0.25
- Use CV/BS to get unbiased estimates of QTL parameters

Bottom line: The care and expense invested in generating marker
and trait data should be accompanied by equal care in biometric
analysis of data


Appendix 1: Fixed, Random, and Mixed Models

Consider the example of a field trial laid out in an RCBD. Suppose there are t=16
groundnut varieties as treatments each decided to be tested with r=3 replications.
The experimental field is divided into b=r=3 blocks, each individual block containing
k=t=16 plots. There are N=txr=48 plot observations corresponding to the 48 plots used
in the trial.

The following linear additive model is commonly assumed to represent (model) any
individual plot observation in an RCBD

Plot Obs. = Trial Mean + Variety Effect + Block Effect + Error (1A)

$Y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij}$ (1B)

where i=1,...,t (# treatments), j=1,...,b (# blocks).

The model above contains four terms [$\mu$, $\tau_i$, $\beta_j$, $\epsilon_{ij}$]. We need to specify which of these
terms is a fixed and which is a random effect.

WHAT IS A FIXED EFFECT? A model term, representing the effect of a certain factor,
is said to be a fixed effect if the different levels of the factor, say t levels of the
factor variety included in the trial, represent t distinct populations (treatments), and
interest lies in estimation (BLUE) of the means of those, and only those, distinct
populations (treatments). For example, if the 16 groundnut varieties represent 16
distinct genetic populations, the corresponding model term $\tau_i$ will be taken as a fixed
effect. Our interest will be only in estimating the means of, and possibly in some well-
defined differences among, these 16 genetic populations.

WHAT IS A RANDOM EFFECT? A model term, representing the effect of a certain
factor, is defined to be a random effect if the different levels of the factor, included
in the trial, represent a random sample from all possible levels of a single population.
For example, if the 16 groundnut varieties represent a random sample of 16 genotypes
from a single genetic population, the corresponding model term $\tau_i$ will be taken as a
random effect. Interest here may lie in the variability within this single population
from which the sample came (variance component estimation), and/or perhaps in the
prediction (BLUP) of the mean of particular levels (here genotypes).

DECIDING WHETHER A MODEL TERM BE TAKEN AS FIXED OR
RANDOM (RJ Baker 1996)
Step 1: A Research Point of View

Initially declare the term to be a random effect

Now ask two questions

(a) Is it physically possible for the used factor levels to be repeated at some
future time or in some other place?

(b) If the answer to (a) is YES, would it be reasonable in the context of this research
for you or someone else to choose the same levels for repetition of this
research at some future time or in some other place?

If the answers to questions (a) and (b) are BOTH YES, declare the term
as a fixed effect.

Step 2: A Statistical Point of View

If a fixed-effect factor involves a large number of levels (say >10) and there is no
structure among those levels, it might perhaps be best to declare the factor effect
as random and use BLUP to predict mean values.

The basis for this recommendation is: With a large number of unstructured
levels, it is likely that the extremely low or high means will partially reflect a
fortuitous combination of random error effects and should therefore be shrunken
toward the trial mean.

If a random-effect factor involves too few levels (say, because the factor represents an
uninteresting nuisance factor in the trial), and if comparisons among levels of this
factor provide no information about other factors (no inter-class information), then
declare the factor effect as fixed. This will ease the computing demands and
give identical results for the factors of interest.
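
The two-step rule above can be written as a small decision function. This is only a sketch of the logic as stated in Steps 1 and 2. The function name, the argument names, and the default threshold of 10 levels (taken from the "say >10" guideline) are assumptions made for illustration, and the judgments fed into it (is repetition possible, are the levels structured, is there inter-class information) remain the researcher's.

```python
def classify_term(repeatable, would_reuse, n_levels, structured=False,
                  too_few_levels=False, no_interclass_info=False,
                  many_levels=10):
    """Return 'fixed' or 'random' for one model term (hypothetical helper).

    repeatable:         answer to question (a)
    would_reuse:        answer to question (b)
    n_levels:           number of levels of the factor in the trial
    structured:         True if there is structure among the levels
    too_few_levels:     researcher's judgment that the factor has too few levels
    no_interclass_info: True if comparing levels tells nothing about other factors
    many_levels:        assumed cut-off for "a large number of levels"
    """
    # Step 1 (research view): start from random; switch to fixed only if
    # both questions (a) and (b) are answered YES.
    effect = "fixed" if (repeatable and would_reuse) else "random"

    # Step 2 (statistical view): many unstructured fixed levels -> treat as
    # random so that BLUP can shrink extreme means toward the trial mean.
    if effect == "fixed" and n_levels > many_levels and not structured:
        return "random"

    # Step 2: a random nuisance factor with too few levels and no
    # inter-class information -> treat as fixed.
    if effect == "random" and too_few_levels and no_interclass_info:
        return "fixed"

    return effect


# 16 unstructured varieties that could be, and would be, used again
print(classify_term(True, True, 16))
```

For the 16 groundnut varieties, Step 1 alone would give a fixed effect, but Step 2 turns the term into a random effect because there are more than 10 unstructured levels.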

NOTE 1: Whatever the type of model (Fixed, Random, or Mixed), the general mean μ is
always taken as fixed, and the error term ε_ij, which arises from the random allocation
of treatments to plots, is always taken as random. Whether a model is Fixed, Random,
or Mixed then depends on the nature of the remaining terms in the model.

WHAT IS A FIXED-EFFECTS MODEL (Model 1)? Under the proviso of NOTE 1, if all
other model terms represent fixed effects, the model is called a fixed-effects (or a
fixed) model.

WHAT IS A RANDOM-EFFECTS MODEL (Model 2)? Under the proviso of NOTE 1, if all
other model terms represent random effects, the model is said to be a random-effects
(or a random) model.

WHAT IS A MIXED-EFFECTS MODEL (Model 3)? Under the proviso of NOTE 1, if some of
the remaining model terms are fixed effects and some are random effects, the model
is defined to be a mixed-effects (or a mixed) model.
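
The three definitions above amount to a simple classification of the terms that remain once the two terms covered by NOTE 1 are set aside. A minimal sketch follows. The function name and the example term names are hypothetical, and the caller is assumed to have already excluded the term that is always fixed and the term that is always random.

```python
def model_type(term_effects):
    """Classify a model from its remaining terms (hypothetical helper).

    term_effects: dict mapping each remaining term name to 'fixed' or 'random'.
    The two terms covered by NOTE 1 must NOT be included.
    """
    kinds = set(term_effects.values())
    if not kinds:
        raise ValueError("supply at least one remaining model term")
    if not kinds <= {"fixed", "random"}:
        raise ValueError("each term must be labelled 'fixed' or 'random'")

    if kinds == {"fixed"}:
        return "fixed-effects model (Model 1)"
    if kinds == {"random"}:
        return "random-effects model (Model 2)"
    return "mixed-effects model (Model 3)"


# Example: varieties taken as fixed, blocks taken as random
print(model_type({"variety": "fixed", "block": "random"}))
```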

MAPMAKER/QTL
    Model:             Interval mapping
    Population:        F2, BC, RIL, DHL
    Computer platform: PC (Windows)
    Contact:           Eric Lander (mapmaker@genome.wi.mit.edu)

QTL Cartographer
    Model:             Composite interval mapping
    Population:        F2, BC
    Computer platform: Mac, PC (Windows)
    Contact:           Christopher Basten (basten@essjp.stat.ncsu.edu)

MAPQTL
    Model:             Interval mapping
    Population:        F2, BC, RIL, DHL, heterozygous F1
    Computer platform: Mac, PC
    Contact:           Johan van Ooijen (j.w.vanooijen@cpro.dlo.nl)

Map Manager QTL
    Model:             Interval mapping using regression
    Population:        F2, BC
    Computer platform: Mac, PC
    Contact:           Kenneth Manly (kmanly@mcbio.med.bufflo.edu)

Qgene
    Model:             Linear regression
    Population:        F2, BC
    Computer platform: Mac
    Contact:           James Nelson (jcn5@cornell.edu)

Allelic effect
I. Dominance deviation:
This deviation can be detected in an F2 population.
It is calculated as: Heterozygote - (P1 + P2)/2
A positive effect reflects growth of the heterozygote that exceeds the midparent value.
A negative effect reflects growth that is less than the midparent value.

II. Additive effect: the effect of the homozygotes.
It is calculated as: (Homozygote for P1 - Homozygote for P2)/2
A positive effect reflects greater growth of the P1 homozygote.
A negative effect reflects greater growth of the P2 homozygote.
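
The two formulas above can be applied directly to the genotype-class means at a marker in an F2 population. The sketch below is illustrative only. The function name, the genotype codes ('P1', 'H', 'P2') and the trait values are assumptions made for the example, and each class mean is taken as the simple average of the plants in that class.

```python
from statistics import mean


def allelic_effects(genotypes, phenotypes):
    """Additive and dominance effects at one marker (hypothetical helper).

    genotypes:  one code per plant: 'P1' (homozygous for the P1 allele),
                'H' (heterozygous) or 'P2' (homozygous for the P2 allele)
    phenotypes: trait value of each plant, in the same order
    Returns (additive, dominance).
    """
    groups = {"P1": [], "H": [], "P2": []}
    for g, y in zip(genotypes, phenotypes):
        groups[g].append(y)

    # Mean trait value of each genotype class
    m_p1, m_het, m_p2 = (mean(groups[k]) for k in ("P1", "H", "P2"))

    midparent = (m_p1 + m_p2) / 2
    additive = (m_p1 - m_p2) / 2    # > 0: P1 homozygote grows more
    dominance = m_het - midparent   # > 0: heterozygote exceeds the midparent
    return additive, dominance


# Made-up F2 data: class means are P1 = 12, H = 11, P2 = 8
geno = ["P1", "P1", "H", "H", "P2", "P2"]
pheno = [11.0, 13.0, 10.0, 12.0, 7.0, 9.0]
print(allelic_effects(geno, pheno))
```

In this made-up example the additive effect is positive (the P1 homozygote grows more), and the dominance deviation is positive (the heterozygote exceeds the midparent value of 10).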

Steps to obtain high-resolution QTL mapping

Quantitative trait analysis

Marker analysis

Major peaks and potential linked QTLs

Fill gaps (more markers in the target region)

More individuals (more recombinants for the target region)

Marker Assisted Selection

Advantages:
Eliminates the effects of environment
Allows selection at an early stage
Speeds up breeding cycles
Saves land and labor

Disadvantages:
Requires a well-equipped lab and well-trained workers
High cost
