Base Quality Score Recalibration

Assigning accurate confidence scores to each sequenced base

We are here in the Best Practices workflow:
Base Recalibration

PURPOSE
Real data is messy -> properly estimating the evidence is critical

Quality scores emitted by sequencing machines are biased and inaccurate
Quality scores are critical for all downstream analysis
Systematic biases are a major contributor to bad calls

Example of bias: qualities reported depending on nucleotide context


[Figure: empirical minus reported quality by dinucleotide (AA, AG, CA, CG, GA, GG, TA, TG). Original: RMSE = 4.188; recalibrated: RMSE = 0.281.]

BQSR method identifies bias and applies correction
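
The RMSE values quoted on these plots summarize how far the reported qualities sit from the empirical ones. A minimal sketch, assuming the statistic is the plain root-mean-square of (empirical minus reported) over the plotted bins; the slides do not say whether bins are weighted, and the per-dinucleotide numbers below are made up for illustration.

```python
import math

def rmse(deltas):
    """Root-mean-square of per-bin (empirical - reported) quality differences."""
    return math.sqrt(sum(d * d for d in deltas) / len(deltas))

# Hypothetical per-dinucleotide differences in Phred units (not taken from the slides).
before = {"AA": -3.0, "AG": 4.0, "CA": -5.0, "CG": 2.0}
after = {"AA": 0.2, "AG": -0.1, "CA": 0.3, "CG": -0.2}

rmse_before = rmse(list(before.values()))
rmse_after = rmse(list(after.values()))
print(f"original RMSE = {rmse_before:.3f}, recalibrated RMSE = {rmse_after:.3f}")
```

A well-calibrated data set would show differences close to zero in every bin, and therefore an RMSE close to zero.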



PRINCIPLES
Base Recalibration phases

Model the error modes
Apply recalibration and write to file
Make before/after plots
How do we identify the error modes in the data?

Systematic errors correlate with basecall features

Several relevant features:
-- Reported quality score
-- Position within the read (machine cycle)
-- Sequence context (sequencing chemistry effects)

[Figure: empirical minus reported quality by dinucleotide for the original data, RMSE = 4.188]

Calculate error empirically and find patterns in how error varies with basecall features

Method is empowered by looking at entire lane of data (works per read group)
Covariation patterns allow us to calculate adjustment factors

For each base in each read:

-- is it in AA context? -> adjust by X points


-- ...
-- is it at 3rd position? -> adjust by Y points
-- ...
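
The per-base adjustment described above can be sketched as a sum of per-covariate shifts. This is only an illustration of the idea on the slide: the function name, the shift tables, and the clamping range are all assumptions, and the actual tool may combine covariates differently.

```python
def adjust_quality(reported_q, context, cycle, context_shift, cycle_shift, max_q=60):
    """Apply additive per-covariate shifts to one base's reported quality.

    max_q and the lower bound of 1 are illustrative limits, not values from the slides.
    """
    q = reported_q + context_shift.get(context, 0) + cycle_shift.get(cycle, 0)
    return max(1, min(max_q, q))

# Hypothetical shift tables: "X points" per dinucleotide, "Y points" per machine cycle.
context_shift = {"AA": -2, "CG": -4}
cycle_shift = {3: 1, 100: -5}

new_q = adjust_quality(30, "AA", 3, context_shift, cycle_shift)
print(new_q)  # 30 - 2 + 1 = 29
```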
How do we derive the adjustment factors used for recalibration?

Any sequence mismatch = error, except at known variant sites*

Keep track of the number of observations and the number of errors as a function of various error covariates (lane, original quality score, machine cycle, and sequencing context)

    (# of reference mismatches + 1) / (# of observed bases + 2)  ->  PHRED-scaled quality score

* If you don't have known variation, bootstrap (see later on)
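
The counting scheme above can be sketched as follows. The smoothed ratio (mismatches + 1) / (bases + 2) comes from the slide; the Phred conversion is the standard Q = -10 * log10(p). The key layout, function names, and simulated observations are assumptions made for illustration.

```python
import math
from collections import defaultdict

def empirical_quality(n_mismatches, n_bases):
    """Phred-scale the smoothed error rate (mismatches + 1) / (bases + 2)."""
    error_rate = (n_mismatches + 1) / (n_bases + 2)
    return -10 * math.log10(error_rate)

# counts[(lane, reported quality, machine cycle, context)] = [observations, errors]
counts = defaultdict(lambda: [0, 0])

def observe(key, is_mismatch, at_known_variant):
    """Tally one base; mismatches at known variant sites are not counted as errors."""
    if at_known_variant:
        return
    counts[key][0] += 1
    counts[key][1] += int(is_mismatch)

# Simulated data: 998 bases in one covariate bin, 9 of them mismatching the reference.
key = ("lane1", 30, 3, "AA")
for i in range(998):
    observe(key, is_mismatch=(i < 9), at_known_variant=False)
observe(key, is_mismatch=True, at_known_variant=True)  # skipped: known variant site

n_bases, n_errors = counts[key]
print(n_bases, n_errors, round(empirical_quality(n_errors, n_bases), 2))
```

In this toy bin the bases were reported as Q30 but behave like roughly Q20, which is the kind of discrepancy recalibration is meant to correct. The +1 and +2 terms also keep the estimate defined for bins with no observations, where it comes out at about Q3.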


Any sequence mismatch = error except known variants

Goal is to identify the signal within the noise

[Figure: all mismatches compared with known variation]


Base Quality Score Recalibration provides a calibrated error model from which to make mutation calls

Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!

[Figure: before/after recalibration plots for five platforms (SLX GA, 454, SOLiD, Complete Genomics, HiSeq). Rows: empirical quality vs. reported quality; accuracy (empirical minus reported quality) vs. machine cycle, with first-of-pair and second-of-pair reads marked; accuracy (empirical minus reported quality) vs. dinucleotide.]

RMSE per panel, with columns in the order the platforms are listed in the figure:

Panel              Data           SLX GA   454     SOLiD   Complete Genomics   HiSeq
Reported quality   Original       5.242    2.556   1.215   4.479               5.634
Reported quality   Recalibrated   0.196    0.213   0.756   0.235               0.135
Machine cycle      Original       2.207    1.784   1.688   2.679               2.609
Machine cycle      Recalibrated   0.186    0.136   0.213   0.182               0.089
Dinucleotide       Original       2.598    2.169   1.656   3.503               2.469
Dinucleotide       Recalibrated   0.052    0.135   0.088   0.06                0.083
Per-base indel error rate also varies by lane,
sequence context and sequencing technology

[Figure: empirical gap open penalty by three-base sequence context (suffix), one line per read group: HiSeq read groups 20FUK.1 to 20FUK.8, and PacBio. Annotation: AAAAA context.]

Per-base indel error estimates are required for accurate indel calling, particularly on new technologies with indel-rich error models such as Pacific Biosciences.
Empirical estimates of base insertion and base deletion error rates unify SNP and indel error models

Indels can also be recalibrated, using an additional resource with known indels
New base insertion and deletion quals will be used by HaplotypeCaller for better indel calls
(Extra quals for indels make the BAMs significantly bigger)

[Figure: BQSRv2 recalibration plots for base substitution, base insertion and base deletion, original vs. recalibrated, shaded by log10(nBases). Rows: empirical quality score vs. reported quality score; quality score accuracy vs. cycle covariate; quality score accuracy vs. context covariate.]


PROTOCOL
Base Recalibration steps/tools

Model the error modes: BaseRecalibrator
Apply recalibration and write to file: PrintReads
Make before/after plots: AnalyzeCovariates

BQSR involves two complementary paths

BaseRecalibrator (1) -> PrintReads -> PROCESSED DATA
BaseRecalibrator (1) -> BaseRecalibrator (2) -> AnalyzeCovariates -> PLOTS
Base Recalibration workflow: data processing path

Original BAM file + Known sites -> BaseRecalibrator -> Recalibration table
Original BAM file + Recalibration table -> PrintReads -> Recalibrated BAM file
First we build the model

TOOL TIPS
BaseRecalibrator

Builds recalibration model

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R human.fasta \
    -I realigned.bam \
    -knownSites dbsnp137.vcf \
    -knownSites gold.standard.indels.vcf \
    [ -L exome_targets.intervals \ ]
    -o recal.table

Why specify -L intervals when running BaseRecalibrator on WEx?

-- BQSR depends on a key assumption: every mismatch is an error, except at sites in known variants
-- Off-target sequence is likely to have higher error rates with different error modes
-- If off-target sequence is included in recalibration, it may skew the model and mess up results

Use the -L argument with BaseRecalibrator to restrict recalibration to capture targets.
BaseRecalibrator produces the recalibration table

Second step: actually recalibrate the data

TOOL TIPS
PrintReads

General-use tool co-opted with the -BQSR flag and fed a recalibration report

java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R human.fasta \
    -I realigned.bam \
    -BQSR recal.table \
    -o recal.bam

Creates a new BAM file, using the input table generated previously, which has exquisitely accurate base substitution, insertion, and deletion quality scores

Original qualities can be retained with the OQ tag (not default)



PrintReads produces the recalibrated BAM file

Great, now how do we get the plots?

Make before/after plots: AnalyzeCovariates

Let's look at the other path

Returning to the overall BQSR workflow ...

We keep the first step (actual recalibration) as a base that we'll branch off from

and substitute the plotting workflow

Original BAM file + Known sites -> BaseRecalibrator (1) -> Recalibration table (1)
Recalibration table (1) -> BaseRecalibrator (2) -> Recalibration table (2)
Recalibration tables (1) and (2) -> AnalyzeCovariates -> Plots


We already did the first step earlier:



which produced the original recalibration table



So now we do a second pass with BaseRecalibrator:



TOOL TIPS
BaseRecalibrator

Second pass evaluates what the data looks like after recalibration

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R human.fasta \
    -I realigned.bam \
    -knownSites dbsnp137.vcf \
    -knownSites gold.standard.indels.vcf \
    -BQSR recal.table \
    -o after_recal.table

producing a second recalibration table



Finally, we use the two tables to make the plots



TOOL TIPS
AnalyzeCovariates

Makes plots based on before/after recalibration tables

java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
    -R human.fasta \
    -before recal.table \
    -after after_recal.table \
    -plots recal_plots.pdf

There is an option to keep the intermediate .csv file used for plotting, if you want to play with the plot data.
and now we have before/after recalibration plots!



So to recap the two complementary paths:

BaseRecalibrator (1) -> PrintReads -> PROCESSED DATA
BaseRecalibrator (1) [BEFORE] + BaseRecalibrator (2) [AFTER] -> AnalyzeCovariates -> PLOTS
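
The two paths recapped here can be written out as ordered command lists. This sketch only assembles the GATK command lines used in this talk, with the same example file names, and prints them; it does not run anything. The function and variable names are hypothetical.

```python
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]
COMMON = ["-R", "human.fasta", "-I", "realigned.bam"]
KNOWN = ["-knownSites", "dbsnp137.vcf", "-knownSites", "gold.standard.indels.vcf"]

def bqsr_commands():
    """Return the command lines for each path; the first pass is shared by both."""
    first_pass = GATK + ["-T", "BaseRecalibrator"] + COMMON + KNOWN + ["-o", "recal.table"]
    apply_recal = GATK + ["-T", "PrintReads"] + COMMON + ["-BQSR", "recal.table", "-o", "recal.bam"]
    second_pass = (GATK + ["-T", "BaseRecalibrator"] + COMMON + KNOWN
                   + ["-BQSR", "recal.table", "-o", "after_recal.table"])
    make_plots = GATK + ["-T", "AnalyzeCovariates", "-R", "human.fasta",
                         "-before", "recal.table", "-after", "after_recal.table",
                         "-plots", "recal_plots.pdf"]
    return {
        "processed_data": [first_pass, apply_recal],
        "plots": [first_pass, second_pass, make_plots],
    }

commands = bqsr_commands()
for path, steps in commands.items():
    print(path)
    for step in steps:
        print("  " + " ".join(step))
```

Both paths start from the same first-pass table, so in practice BaseRecalibrator (1) only needs to be run once.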

RESULTS

Recalibration produces a more accurate estimation of error (doesn't fix it!)

Post-recalibration quality scores should fit the empirically-derived quality scores very well; no obvious systematic biases should remain

[Figure: overlapping excerpts of the before/after recalibration plots shown earlier (empirical vs. reported quality, accuracy vs. machine cycle, accuracy vs. dinucleotide), with their original and recalibrated RMSE values.]

Non-humans: no known resources for recalibration?

Solution: bootstrap a set of known variants

-- Call variants on realigned, unrecalibrated data
-- Filter resulting variants with stringent filters
-- Use variants that pass filters as known for BQSR
-- Repeat until convergence
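
The bootstrapping loop above can be sketched as a small driver. Everything here is a hedged illustration: the three callables stand in for variant calling, stringent filtering, and recalibration, and "convergence" is assumed to mean that the set of passing variants stops changing between rounds, which the slides do not define. The toy stand-ins at the bottom only exist to make the sketch runnable.

```python
def bootstrap_known_sites(bam, call_variants, stringent_filter, recalibrate, max_rounds=10):
    """Iterate call -> filter -> recalibrate until the passing set stops changing."""
    known = None
    data = bam  # round 1 calls variants on unrecalibrated data
    for round_no in range(1, max_rounds + 1):
        passing = frozenset(stringent_filter(call_variants(data)))
        if passing == known:
            return known, round_no  # converged
        known = passing
        data = recalibrate(bam, known)  # always recalibrate from the original data
    return known, max_rounds

# Toy stand-ins (hypothetical): "data" is a dict holding false-positive positions.
TRUE_SITES = {100, 250, 900}

def toy_call(data):
    return TRUE_SITES | data["artifacts"]

def toy_filter(calls):
    return calls  # a real filter would drop low-confidence calls

def toy_recalibrate(bam, known_sites):
    return {"artifacts": set()}  # pretend recalibration removes the artifacts

known_sites, rounds = bootstrap_known_sites(
    {"artifacts": {7}}, toy_call, toy_filter, toy_recalibrate
)
print(sorted(known_sites), rounds)
```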


BQSR bootstrapping workflow: round #1

Original BAM file + Known sites
BaseRecalibrator
Recalibration table
HaplotypeCaller
VariantFiltration
(-BQSR recal.table)
Raw VCF file

BQSR bootstrapping workflow: rounds #2 to #N

Original BAM file + Known sites
BaseRecalibrator
Recalibration table
HaplotypeCaller
VariantFiltration
-BQSR recal.table
Raw VCF file

We are here in the Best Practices workflow
Next Step: Variant Discovery

Further reading
http://www.broadinstitute.org/gatk/guide/best-practices

http://www.broadinstitute.org/gatk/guide/article?id=44

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_PrintReads.html

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_AnalyzeCovariates.html
