[Figure: accuracy (empirical - reported quality) by dinucleotide, original vs. recalibrated]
Base recalibration steps:
1. Model the error modes
2. Apply recalibration and write to file
3. Make before/after plots
How do we identify the error modes in the data?
Several relevant basecall features:
- Reported quality score
- Position within the read (machine cycle)
- Sequence context (sequencing chemistry effects)
[Figure: empirical - reported quality by dinucleotide (AA ... TG), RMSE = 4.188]
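As a rough illustration of how these features can expose error modes, the sketch below bins base calls by a chosen feature and compares each bin's Phred-scaled empirical error rate with the reported quality. It is not GATK's implementation: the `Observation` record, the toy data and the +1/+2 smoothing are all assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import NamedTuple


class Observation(NamedTuple):
    """One base call at a site not known to be variant (hypothetical record)."""
    reported_q: int   # quality score reported by the sequencer
    cycle: int        # position within the read (machine cycle)
    dinuc: str        # previous base + current base, e.g. "AG"
    mismatch: bool    # True if the call disagrees with the reference


def phred(error_rate: float) -> float:
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(error_rate)


def empirical_quality(observations, key):
    """Group observations by key(obs) and return {bin: empirical Phred quality}.

    The +1/+2 smoothing keeps bins with zero mismatches finite; it is an
    assumption of this sketch, not necessarily what GATK does.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total]
    for obs in observations:
        tally = counts[key(obs)]
        tally[0] += obs.mismatch
        tally[1] += 1
    return {b: phred((m + 1) / (n + 2)) for b, (m, n) in counts.items()}


# Toy data: 998 calls reported as Q30, 9 of which are mismatches.
calls = [Observation(30, 1, "AG", i < 9) for i in range(998)]
by_quality = empirical_quality(calls, key=lambda o: o.reported_q)
print(round(by_quality[30], 1))  # 20.0: this Q30 bin behaves like Q20
```

Passing `key=lambda o: o.cycle` or `key=lambda o: o.dinuc` bins the same calls by machine cycle or sequence context instead.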
[Figure: before/after recalibration plots in five panels (unlabeled in this extract). Row 1: empirical quality vs. reported quality. Row 2: accuracy (empirical - reported quality) vs. machine cycle, with first-of-pair and second-of-pair reads where paired. Row 3: accuracy (empirical - reported quality) vs. dinucleotide.]

RMSE per panel:
Plot                          Series        Panel 1  Panel 2  Panel 3  Panel 4  Panel 5
Empirical vs. reported qual.  Original      5.242    2.556    1.215    4.479    5.634
Empirical vs. reported qual.  Recalibrated  0.196    0.213    0.756    0.235    0.135
Accuracy vs. machine cycle    Original      2.207    1.784    1.688    2.679    2.609
Accuracy vs. machine cycle    Recalibrated  0.186    0.136    0.213    0.182    0.089
Accuracy vs. dinucleotide     Original      2.598    2.169    1.656    3.503    2.469
Accuracy vs. dinucleotide     Recalibrated  0.052    0.135    0.088    0.06     0.083
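The RMSE values in these plots summarise how far empirical quality sits from reported quality. A minimal sketch of that kind of summary follows, assuming a plain unweighted RMSE over bins; whether the plotted values weight bins by the number of bases is not stated here, and the numbers below are made up for illustration, not read from the figure.

```python
import math


def rmse(reported, empirical):
    """Root-mean-square error between paired reported and empirical qualities."""
    if len(reported) != len(empirical) or not reported:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - r) ** 2 for r, e in zip(reported, empirical)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up bins for illustration only (not values from the figure).
reported = [10, 20, 30, 40]
original_empirical = [12, 17, 26, 33]
recalibrated_empirical = [10, 20, 30, 39]

print(round(rmse(reported, original_empirical), 3))      # 4.416
print(round(rmse(reported, recalibrated_empirical), 3))  # 0.5
```

A well-calibrated data set has empirical quality close to reported quality in every bin, so its RMSE approaches zero.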
Per-base indel error rate also varies by lane, sequence context and sequencing technology
[Figure: empirical gap open penalty by context suffix (trinucleotides GGG ... TAT), one series per read group: HiSeq 20FUK.1 to 20FUK.8, and PacBio; an "AAAAA context" label also appears]
Per-base indel error estimates are required for accurate indel calling, particularly on new technologies with indel-rich error models, such as Pacific Biosciences.
Empirical estimates of base insertion and base deletion error rates unify SNP and indel error models
(Extra quals for indels make the BAMs significantly bigger)
[Figure: quality score accuracy before and after recalibration, plotted against reported quality score, cycle covariate and context covariate (trinucleotide and dinucleotide contexts); legend entries Original, Recalibrated and BQSRv2; scale log10(nBases)]
PROTOCOL
Base Recalibration steps/tools
- Model the error modes: BaseRecalibrator
- Apply recalibration and write to file: PrintReads
- Make before/after plots: AnalyzeCovariates
BQSR involves two complementary paths
[Diagram: BaseRecalibrator (1), AnalyzeCovariates, PROCESSED DATA, PLOTS]
Base Recalibration workflow: data processing path
BaseRecalibrator -> Recalibration table -> PrintReads -> Recalibrated BAM file
First we build the model
TOOL TIPS: BaseRecalibrator
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R human.fasta \
        -I realigned.bam \
        -knownSites dbsnp137.vcf \
        -knownSites gold.standard.indels.vcf \
        [ -L exome_targets.intervals ] \
        -o recal.table
Why specify -L intervals when running BaseRecalibrator on WEx?
Second step: actually recalibrate the data
TOOL TIPS: PrintReads
    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R human.fasta \
        -I realigned.bam \
        -BQSR recal.table \
        -o recal.bam
Creates a new BAM file, using the recalibration table generated previously, with exquisitely accurate base substitution, insertion, and deletion quality scores
Great, now how do we get the plots?
Make before/after plots: AnalyzeCovariates
Let's look at the other path
[Diagram: BaseRecalibrator (1), AnalyzeCovariates, PROCESSED DATA, PLOTS]
Returning to the overall BQSR workflow ...
BaseRecalibrator -> Recalibration table -> PrintReads -> Recalibrated BAM file
We keep the first step (actual recalibration) as the base that we'll branch off from
BaseRecalibrator -> Recalibration table
and substitute the plotting workflow
[Diagram: BaseRecalibrator (1), BaseRecalibrator (2), AnalyzeCovariates, Plots]
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R human.fasta \
        -I realigned.bam \
        -knownSites dbsnp137.vcf \
        -knownSites gold.standard.indels.vcf \
        -BQSR recal.table \
        -o after_recal.table
producing a second recalibration table
    java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
        -R human.fasta \
        -before recal.table \
        -after after_recal.table \
        -plots recal_plots.pdf
There is an option to keep the intermediate .csv file used for plotting, if you want to play with the plot data.
... and now we have before/after recalibration plots!
[Diagram: BaseRecalibrator (1), AnalyzeCovariates, PROCESSED DATA, BEFORE and AFTER PLOTS]
RESULTS
[Figure: overlapping before/after recalibration panels (empirical vs. reported quality, accuracy vs. machine cycle, accuracy vs. dinucleotide) with original and recalibrated RMSE values; legible annotation fragments: "(doesn't fix it!)" and "should remain"]
Non-humans: no known resources for recalibration?
BQSR bootstrapping workflow: round #1
[Diagram: HaplotypeCaller (-BQSR recal.table), Raw VCF file, VariantFiltration, BaseRecalibrator, Recalibration table]
BQSR bootstrapping workflow: rounds #2 to #N
[Diagram: HaplotypeCaller -BQSR recal.table, Raw VCF file, VariantFiltration, BaseRecalibrator, Recalibration table]
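One possible reading of the bootstrapping diagrams is sketched below: each round calls variants, filters them, and feeds the filtered calls to BaseRecalibrator as known sites, with rounds #2 to #N applying the previous round's table during calling. The function only builds command lines and runs nothing. The file names, the filter expression and its threshold, and the filter name are placeholders invented for the example, and the sketch does not show whether filtered records should be removed before being used as known sites.

```python
def bootstrap_commands(rounds, ref="reference.fasta", bam="realigned.bam"):
    """Return, per round, the three command lines suggested by the diagrams.

    File names and the filter expression are placeholders; nothing is executed.
    """
    gatk = ["java", "-jar", "GenomeAnalysisTK.jar"]
    plan = []
    previous_table = None
    for n in range(1, rounds + 1):
        raw = f"raw.{n}.vcf"
        filtered = f"filtered.{n}.vcf"
        table = f"recal.{n}.table"
        call = gatk + ["-T", "HaplotypeCaller", "-R", ref, "-I", bam, "-o", raw]
        if previous_table is not None:
            # Rounds #2..#N: call variants with the previous table applied.
            call += ["-BQSR", previous_table]
        filt = gatk + ["-T", "VariantFiltration", "-R", ref, "-V", raw,
                       "--filterExpression", "QD < 2.0",  # placeholder threshold
                       "--filterName", "bootstrap_filter", "-o", filtered]
        recal = gatk + ["-T", "BaseRecalibrator", "-R", ref, "-I", bam,
                        "-knownSites", filtered, "-o", table]
        plan.append([call, filt, recal])
        previous_table = table
    return plan


for round_commands in bootstrap_commands(2):
    for command in round_commands:
        print(" ".join(command))
```

Keeping the plan as data makes it easy to inspect each round before running anything.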
We are here in the Best Practices workflow
Next Step: Variant Discovery talks
Further reading
http://www.broadinstitute.org/gatk/guide/best-practices
http://www.broadinstitute.org/gatk/guide/article?id=44
http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html
http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_PrintReads.html
http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_AnalyzeCovariates.html