
Genomics 101
An Introduction To The Genomic Workflow

INTRODUCTION
We have come a very long way since DNA was first isolated back in 1869. As Friedrich Miescher
was investigating nuclein, I doubt he could have imagined the advances that would take place in
the subsequent century and a half. The latter half of the 20th century in particular saw
tremendous intellects drive forward what would eventually develop into the field of
genomics. 1953 saw James Watson and Francis Crick describe the structure of the DNA helix, and as
technology advanced, the 21st century began with the publication of the human genome in 2001.

Today, genomics not only represents the pinnacle of our understanding of human biology, but also an
industry of extraordinary potential set to impact several aspects of our lives.
The raw requirements to generate, manage, analyse and interpret genomic data have become far
more accessible in recent years. This has led to a phenomenal boom in not just data creation, but our
understanding and leveraging of that data. Simply put, there has never been a better time to adopt
genomic technology. And that is exactly what so many of you are already doing.
This is where this handbook comes in. Genomics is moving at such a rapid pace that easy-to-understand
information explaining how it all works is hard to come by. Here at Front Line Genomics,
we want to do what we can to lower the barrier to adoption of genomic technology. With
the help of some of the leading technology companies (our Strategic Partners, Agilent Technologies,
Seven Bridges Genomics, and Twist Bioscience, and partners Affymetrix, DNAnexus, New England
Biolabs and WuXi NextCODE), we've put together the Genomics 101 as a guided tour through the world
of human genomics.
The 101 is not intended to offer detailed protocols to take into the lab. Our intention is to help you
understand the kinds of questions you can use genomic approaches to ask, the kinds of platforms
available to you, and how they work. We'll help you explore DNA microarrays and Next Generation
Sequencing. We'll familiarise you with the basic chemistries involved in producing sequence
information. We'll then explain how all that data gets turned into something you can use to help
improve patients' lives.
There is a lot that we could have included in this edition, but we tried to focus on the core technology
areas that are at the heart of genomics today and shaping the future of the field. We will endeavour to
keep these chapters up to date and to add new content as technology and applications progress.
For now, we hope you find the handbook an interesting read, and above all useful.

CONTENTS

1 INTRODUCTION
3 GLOSSARY
4 CHAPTER 1: DESIGNING GENOMICS EXPERIMENTS
What is the right method to use for an experiment? In this chapter we guide you through the available tools and the wide range of uses for genomic data.
10 AFFYMETRIX: PIONEERING WHOLE-GENOME ANALYSIS
14 CHAPTER 2: TURNING DNA INTO DATA
To sequence a strand of DNA involves a series of precise steps that begin with sample preparation. Here we take a look at how different sequencing techniques actually work, and the critical steps involved in preparing a sample for sequencing.
16 AGILENT TECHNOLOGIES: DEEP CLONAL PROFILING OF PURIFIED TUMOR CELL POPULATIONS FROM FFPE SAMPLES
20 NEW ENGLAND BIOLABS: MEETING THE CHANGING DEMANDS OF NGS SAMPLE PREPARATION WITH NEBNEXT
24 CHAPTER 3: ANALYSING DATA
Generating image data from sequencing is just the beginning. From producing raw read data to reconstructing an entire sequence, this chapter focusses on how we build a DNA sequence for study and research.
26 SEVEN BRIDGES GENOMICS: DISCOVERY IN MILLIONS OF GENOMES
30 DNANEXUS: DNANEXUS MADE RIDICULOUSLY SIMPLE
34 CHAPTER 4: NGS INTERPRETATION AND DISCOVERY
So you have reconstructed a DNA sequence, what comes next? Here we explore the process of interpreting sequence data, searching for disease-linked variants, and the computing challenges facing researchers and clinicians.
38 WUXI NEXTCODE: THE OS OF THE GENOME
42 CHAPTER 5: NGS IN THE CLINIC
Perhaps the greatest challenge in clinical genomics is reporting the outcomes of genetic testing to patients. In this chapter we explore how the rise and rapid evolution of NGS tests have affected the nature of clinical reporting, and what the future holds for genomic medicine.
46 AGILENT TECHNOLOGIES: ACMG RECOMMENDATIONS ON SEQUENCE VARIANT INTERPRETATION: IMPLEMENTATION ON THE BENCH NGS PLATFORM
50 CHAPTER 6: GENOME EDITING
Editing the human genome has enormous potential to treat disease, but no topic has attracted more debate and ethical discussion. In this chapter we examine the history of gene editing, the rise of CRISPR, and look to the future of the technology.
54 TWIST BIOSCIENCE: REIMAGINE GENOME-SCALE RESEARCH

GLOSSARY
ADAPTORS
Short nucleotide molecules that bind to each end of a DNA fragment prior to sequencing.

ALLELE
One of two or more forms of a gene, or other portion of DNA, located at the same place on a chromosome.

COMPLEMENTARY DNA (CDNA)
A double-stranded DNA molecule synthesised from mRNA, often used during gene cloning.

COMPLEX TRAIT/DISEASE
A trait or disease that is determined by more than one gene, and/or environmental factors. Traits or diseases determined by just one gene are called single-gene traits/diseases.

COPY NUMBER VARIATION
Variation between individuals in the number of copies of a particular region of genomic DNA.

CRISPR/CAS9
A genome-editing technique that uses cas genes found in bacteria to cut, edit and regulate the genome of an organism.

DNA MICROARRAY
Known DNA sequence fragments attached to a slide or membrane, allowing for the detection of specific sequences in an unknown DNA sample.

EXOME
The region of an organism's genome that codes for proteins. Coding regions of the genome are called exons, while non-coding regions are called introns.

GENE EXPRESSION
The process by which the information from a gene is used to create a functional protein product.

GENE PANEL
A selection of genes relevant to a particular condition that can be sequenced in order to make a clinical diagnosis.

GENOME
The full genetic sequence of an organism, including both coding and non-coding regions.

GENOME-WIDE ASSOCIATION STUDY (GWAS)
A study that evaluates the genomes of a large number of participants, looking for correlations between genetic variation and particular traits or diseases.

GENOTYPE
The complete genetic makeup of an organism.

GERMLINE CELLS
Cells that will go on to become sperm and ova. Mutations in the germline can be transmitted from parent to offspring.

MESSENGER RNA (MRNA)
A single-stranded template for creating specific proteins, created during transcription.

MUTATION
A DNA sequence variation that differs from the reference sequence. This can be a SNP, an insertion, or a deletion of base pairs in the sequence.

NEXT GENERATION SEQUENCING (NGS)
A range of DNA sequencing technologies that can sequence millions of DNA fragments at once, creating larger datasets and more efficient results.

PHENOTYPE
The physical and behavioural manifestations of an organism's genotype.

POLYMERASE CHAIN REACTION (PCR)
A technique used to replicate a particular stretch of DNA rapidly and selectively.

PRECISION/PERSONALISED MEDICINE
Customised or tailored healthcare solutions based upon a patient's genome, environment, lifestyle, etc.

READS
In sequencing, a read refers to a data string of A, T, C and G bases from the sample DNA. Different sequencing techniques generate different length reads.

REFERENCE GENOME
A fully sequenced and assembled genome that acts as a template for reconstructing new sequences.

SINGLE NUCLEOTIDE POLYMORPHISM (SNP)
SNPs (pronounced "snips") are single base-pair variations at specific locations in the genome, and are one of the most common forms of genetic variation.

SOMATIC CELLS
Cells that are not destined to become reproductive cells. Mutations in somatic cells are not passed on from parent to offspring.

TRANSCRIPTION
The process of creating a messenger RNA (mRNA) from a DNA sequence.

TRANSLATION
The process of creating a protein chain, composed of amino acids, from a strand of mRNA.

WHOLE EXOME SEQUENCING (WES)
The process of sequencing the entire coding portion of an individual's genome.

WHOLE GENOME SEQUENCING (WGS)
The process of sequencing all of an individual's DNA.

CHAPTER 1: DESIGNING GENOMICS EXPERIMENTS

INTRODUCTION
In this first chapter of the Genomics 101, we take a look at the broad
range of options available to anyone looking to generate, or make use of
genomic data. Genomic data can range from whole genome to just the
exome, or to a subset of genes down to just a single gene. The data can take
the form of DNA sequence, single nucleotide polymorphisms, copy number
variations, or structural variations. In addition to reading the genome, we can
now also generate profiles of the genes expressed, to gain another layer of
information that can help us understand disease at a cellular level by exposing novel transcripts,
splice variants, and non-coding RNAs, which can become valuable
biomarkers for diagnostic tests. Before we look into the different
methods and platforms available to you, it is important to take a step
back to consider what it is that you are trying to achieve.
Since the Human Genome Project was completed more than a
decade ago, different whole genome analysis technologies have
become available. Whole genome analysis using microarrays has
been the traditional work horse for gene expression profiling
as well as genotyping applications. They are the perfect tool
for scientists in and out of the clinic, due to their affordability,
consistency, and quick, easy, and standardised data analysis.
Relatively new next generation sequencing (NGS) methods have
improved dramatically over the past few years. You may have seen
several graphs showing how sequence output and thus the cost of
sequencing has dropped faster than the rate described by Moore's Law
(which describes a long-term trend in the computer industry during
which compute power doubles every two years). The dramatic increases

in sequence output over the past 10 years have now made it possible to
consider sequencing projects that were impossible or at least completely
unaffordable prior to these advances. January 2014 brought the news
that Illumina had delivered the first commercially available $1,000
genome with their HiSeq X Ten sequencer. Although that figure includes the
cost of reagents and sample preparation, there is still an argument
that, to truly break the $1,000 barrier, the cost must also include
interpretation of the data produced, as well as storage of the resulting data.
The improvements in sequencing technology have led to a flood
of genomic information. This has greatly increased the level of
understanding of the genome and roles of specific genes. As well
as advancing research, this is also leading to the development of
more powerful DNA microarrays leveraging the growing wealth of
identified gene variants.
Before you get too excited about NGS, consider whether it is really
the best option for what you want to do. Although much cheaper
than it used to be, NGS is still relatively expensive, and requires
considerable IT capabilities (as we'll see in the Analysis chapter).
If you are undertaking discovery, or hypothesis-free research,
have sufficient funds, and have the necessary infrastructure to perform
and analyse the data, NGS may well be the right option for you.
However, if you are undertaking a hypothesis-driven study or a large-sample-size
epidemiology study, or are working with difficult samples
such as FFPE or limited material such as a fine needle
biopsy, you can leverage genomic data much more cost effectively
by using well-designed microarrays.


As we guide you through your options, try to keep your end result
in mind. What kind of data do you need to answer your question
most efficiently? You can then begin to make a judgement on
platforms based on your operating restrictions:

What is the question that you are asking?
What is your budget?
What are your overall accuracy and reproducibility requirements?
How many samples do you have access to and need to analyse?
What are your turnaround time requirements?
Is a large, active user-community and support network important?
How user-friendly do you need your platform to be?
Do you have the bioinformatics support for translating raw data
to answers for your biological questions?

NEXT GENERATION SEQUENCING (NGS)

Next Generation Sequencing (NGS) is a catch-all term used to
describe the current generation of sequencing technologies. Depending on
how much of the genome you need, there are three different
approaches, each with its own pros and cons.

WHOLE GENOME SEQUENCING (WGS)

As the name suggests, this lets you look at the full genome. This
is the primary approach being used by the Precision Medicine
Initiative (USA), the 100,000 Genomes Project (UK), and several other
national sequencing projects. A considerable amount of work
is being carried out to better understand how to integrate WGS
into a healthcare system at a national scale. For now, WGS is still
mainly a powerful research tool, with the notable exception of
the pioneering work of Steven Kingsmore and colleagues, who are
routinely using it clinically in the NICU (Neonatal Intensive Care
Unit) setting. Whole genome studies are the basis of most, if not all,
genomic applications. WGS can provide you with a very full picture of one
individual, or can help you identify disease-causing variants within a
population. For hypothesis-free research and discovery, WGS will give
you a lot to work with, and ensure you don't miss anything of interest
that may better help you understand your disease or trait of interest.

WHOLE EXOME SEQUENCING (WES)


While the whole genome is undoubtedly very useful, much
of the contained information may be irrelevant to your
application. There are several instances when looking at the
exome (protein-coding DNA) is much more practical. As the
exome accounts for less than 2% of the genome, it is also
considerably cheaper to read and generates much more
manageable volumes of data.
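To make that difference concrete, here is a rough back-of-the-envelope sketch in Python; the genome and exome sizes, coverage depths and the very idea of a single "bases needed" figure are illustrative assumptions for this example, not fixed properties of any platform or kit.

# Illustrative comparison of WGS vs WES data volumes (all figures are assumptions).
GENOME_SIZE_BP = 3_100_000_000   # approximate human genome size
EXOME_TARGET_BP = 50_000_000     # typical exome capture target (assumed)

def raw_bases(target_bp, mean_depth):
    """Bases of sequence needed to reach a given mean depth over a target."""
    return target_bp * mean_depth

wgs_bases = raw_bases(GENOME_SIZE_BP, 30)    # 30x whole genome
wes_bases = raw_bases(EXOME_TARGET_BP, 100)  # 100x exome

print(f"WGS at 30x : ~{wgs_bases / 1e9:.0f} Gb of sequence")
print(f"WES at 100x: ~{wes_bases / 1e9:.0f} Gb of sequence")
print(f"WES/WGS ratio: {wes_bases / wgs_bases:.1%}")

Even at a much higher depth, the exome run works out to only a few percent of the raw data generated by a whole genome, which is where the cost and storage savings come from.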
WES is particularly useful when trying to map rare variants in
complex disorders. Disease-causing variants with large effects
will typically be found within the exome. Complex disorders are
governed by multiple genes, so you will typically need a very large
sample size to discover variants of interest. In this instance, WGS is
not a practical option.
By contrast, Mendelian disorders typically have far fewer causative
variants behind the condition. Selection pressures are likely
to make these variants extremely rare; they may be missed by
standard genotyping assays, but still not require WGS.
WES can also be used as a diagnostic tool. Ambry Genetics were the
first CLIA-certified laboratory to offer exome sequencing for clinical
diagnostic purposes. WES is also now being clinically utilized by
such places as Washington University in St. Louis.

TARGETED GENE PANELS


In a clinical setting, WES still presents a few challenges. The cost can
still be relatively high, particularly if you are sequencing a child and
both parents. Perhaps the biggest drawback is a long turnaround
time. As we will see in later chapters, sequencing is only the start of
the journey. Returning full results can take months. WES is likely to
uncover multiple variants, so the identification of the causal variant
of interest can be a difficult task.
One of the contentious issues surrounding WGS and WES today
is around incidental findings. This is when sequencing uncovers
potentially medically relevant or actionable results not related to
the indication you were originally testing for. Is it the patient's
right to receive all information, or to decline results? Or is it the
clinician's duty to inform the patient regardless? We will cover this
in more detail in the Genomics In The Clinic chapter, later on.


One way to avoid these issues is to use gene panels. Rather than
sequencing the whole exome, you can choose to sequence a panel
or selection of genes relevant to a particular phenotype. This will
certainly bring the operating costs down and simplify the
analysis, but it does introduce a new problem: panel design.
WGS and WES can give us too much information; gene
panels can give us too little. One testing
laboratory's design may differ from another laboratory's,
because the design itself will be based on published results
and how we choose to interpret them. What is deemed to be
relevant to an indication may vary from person to person, which
can lead to a lack of uniformity in gene panel design. There are
several commercially available panels for the major
sequencing platforms, and you can of course design your own to
answer your own specific questions.
This raises an interesting question. Is it better to have thousands
of different panels optimised to answer specific questions, or to
have a single CLIA-certified exome that can be used to answer all
of those same questions? Mayo Clinic in the USA have pursued
the gene panel route to simplify reimbursement issues, while
Washington University have taken the bolder WES strategy.


NGS PLATFORMS
These platforms all have their advantages and disadvantages. How
heavily those sway your potential decision should come down to
what your own set of parameters are. So do investigate them all,
and try to find first-hand testimonials from existing users.
The following are the most common NGS platforms available today:
454 Life Sciences: This Roche-owned company produced high-throughput
sequencing machines based on its pyrosequencing technology.
While the cost per run is relatively high, the machines are
quite fast and produce longer reads than most, at around 700 base
pairs. However, Roche is no longer supporting the platform, as
other technologies have superseded it.
Illumina: Sequencing by synthesis is by far the most popular way to
sequence today. Illumina have a dominant market share due to the cost
effectiveness of sequencing through their platforms, and the potential
for particularly high yields. The company produces a wide range of NGS
machines, from the smallest, the MiniSeq (capable of 7.5 Gb of
sequence per run), all the way up to the HiSeq X Ten (the platform
on which the cost of WGS has dropped below $1,000 if you sequence at a
high enough volume). The main drawbacks here are the initial cost of
the equipment itself, the potentially short lifespan of the instrumentation,
and the uneven coverage associated with short reads.
Ion Torrent Sequencing: The ion semiconductor method of sequencing
proved very popular when it first hit the market. The sequencer itself
tends to be very competitively priced and is exceptionally fast. This
helped it find a home in several diagnostic laboratories, where quick
turnaround times are crucial, and absolute base-pair accuracy less so.
This platform is not viable for WGS as its output is significantly
below what is needed for complex genomes. However, it is ideally
suited for small gene panels up to exome-based analysis that we
discussed in the previous section.
Pacific Biosciences: Using single-molecule real-time sequencing
(SMRT), this platform is known for producing long reads (up to
60,000 base pairs with their latest machine). This gives you a
considerable advantage if you want to identify structural variations,
and increase your coverage in difficult to amplify areas of the
genome. However, the Pacific Biosciences equipment does come
in at a higher cost than most, and doesn't quite have the same
throughput as some of the other platforms available.
Another advantage of this platform is its potential ability to
sequence modified bases (such as 5-methyl-cytosine). Currently
this is not a viable platform for whole genome sequencing, simply
because of its limited output. However, if cost is no object, it is a
good option for examining even the most complex genomes. Where it
particularly excels is in scaffolding together genome assemblies
built from other short-read technologies.
Looking Ahead
Developing sequencing technologies could offer alternatives to the existing
NGS platforms and, down the road, potentially even replace some of
them. The most prominent among these is nanopore sequencing.


Oxford Nanopore Technologies are recognised as the leading
company in this field. Their technology operates on the principle of
reading the DNA molecule directly by passing it through a nanopore
and measuring the effects on ions and electrical current flowing
through the pore.
This effectively allows you to read DNA in real time, and would be
considerably cheaper and faster than current methods. Oxford
Nanopore's MinION is roughly the size of a USB stick, and plugs directly
into a computer. Unfortunately, sequence accuracy on all current single
molecule sequencers is quite poor, so in order to obtain high sequence
accuracy you need considerably more sequence coverage of each
base to reach a decent consensus sequence. If sequencing accuracy improves,
and the output of this platform can be increased by at
least several orders of magnitude, it could help bring NGS
into the clinic to quickly identify a person's associated risk of disease, or
to identify their pharmacogenomic profile.

DNA MICROARRAYS
DNA microarray is a technology by which known DNA sequences
are either deposited, or synthesised, onto a surface. This allows us
to detect the presence, and concentration, of sequences of interest.
The turn of the century saw a dramatic increase in our
understanding of the human genome. New production methods,
and fluorescent detection, were adapted to build modern
microarrays. While DNA arrays have been around in early forms
since the 1970s, it was only in the 1990s that microarrays started to
become the invaluable tool we know them as today.
Microarrays can be used for whole genome as well as targeted
analysis. Technological advancements over the years are enabling
affordable and fast turnaround of uber-sized epidemiology studies
by large biobanks, providing robust assays to address challenging
samples such as FFPE and single cells, and obtaining regulatory
clearance for diagnostic use in the clinic. Key applications for
microarrays include gene expression profiling of mRNA and miRNA,
genotyping and copy number variation analysis, and high-resolution
chromosomal abnormality detection.
Gene Expression: RNA is isolated and enriched from the sample.
It can then be amplified and labelled ready to be hybridised to
a microarray. After a wash to remove unbound material the
microarray is scanned to measure fluorescence at each spot.
Depending on how many genes you are measuring, and across
how many samples, you will likely find yourself with a multicoloured grid, a heat map. Your image requires processing to
convert colour and intensity of fluorescence into numbers. This
then allows you to see which genes are expressed in which
samples and at what levels.

If you are assaying multiple samples at the same time, you can
begin to tease apart meaningful information, such as differential
gene expression level between different sample types. Calculating
similarities in gene expression across samples allows you to put
them into hierarchical clusters. Clustering genes and samples
can help build up an interesting picture of the genetics and
biology of your indication of interest. With the advancement of
microarray technology, such as the whole transcriptome arrays
from Affymetrix, it is now possible to not only measure gene-level
differences, but also exon-level and alternative splice variants. With
the standardisation of microarray gene expression data analysis,
one can easily derive meaningful biological information in weeks
rather than months.
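As a rough illustration of the kind of processing described above, the sketch below (Python, assuming numpy and scipy are available) log-transforms a tiny, invented intensity matrix and groups the samples by hierarchical clustering; real microarray pipelines add background correction, normalisation and quality control on top of this.

# Minimal expression-clustering sketch; all gene names, sample names and
# intensity values below are invented for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
samples = ["tumour_1", "tumour_2", "normal_1", "normal_2"]

# Raw fluorescence intensities (rows = genes, columns = samples).
intensities = np.array([
    [12000.0, 11000.0,   800.0,   900.0],
    [  300.0,   350.0,  4000.0,  4200.0],
    [ 5000.0,  5200.0,  5100.0,  4900.0],
    [ 7000.0,  6500.0,   600.0,   650.0],
])

log_expr = np.log2(intensities + 1.0)   # convert intensity to a log2 scale

# Cluster samples by the similarity of their expression profiles.
tree = linkage(log_expr.T, method="average", metric="correlation")
groups = fcluster(tree, t=2, criterion="maxclust")

for sample, group in zip(samples, groups):
    print(sample, "-> cluster", group)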
Genotyping: As NGS costs continue to fall, this is still one area
in which microarrays continue to be the dominant, and much
more cost-effective, technology. The most common methods
of detecting single-nucleotide-polymorphisms (SNPs) are allele
discrimination by hybridisation, allele specific extension and
ligation to a bar-code, or extending arrayed DNA across the
SNP in a single nucleotide extension reaction. Affymetrix and
Illumina both produce highly effective SNP genotyping arrays
that have been used extensively around the world. As well as
being able to detect over 1 million different human SNPs with
high degrees of accuracy and reproducibility, the arrays can also
be used to detect copy number variations. While SNPs are crucial
biomarkers, copy number variations (a structural variation
within a cell's DNA that gives it fewer, the normal number of, or more copies
of a certain section of DNA) have also been associated with
susceptibility and resistance to some diseases. Large biobank
studies, such as UK Biobank and the Million Veteran Program in
the US, are utilising microarrays to generate genotyping
data to understand the relationship between genes, lifestyle,
environment, and medical history from 500,000 and 1,000,000
volunteers, respectively. The resulting databases are being used
by researchers to understand diseases, with the goal of better
diagnosis and treatment. Genotyping arrays are also what
direct-to-consumer companies such as 23andMe use to generate
customer profiles.
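To give a flavour of how such genotyping data are mined for associations, the toy sketch below runs a chi-square test on an invented 2x2 table of allele counts in cases and controls (Python, assuming scipy is installed); a real GWAS repeats something like this, or a regression model, at hundreds of thousands of SNPs with stringent correction for multiple testing.

# Toy single-SNP association test; the counts are made up for illustration.
from scipy.stats import chi2_contingency

#                 risk allele   other allele
allele_counts = [[ 620,          380 ],   # cases
                 [ 510,          490 ]]   # controls

chi2, p_value, dof, expected = chi2_contingency(allele_counts)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3g}")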


Pioneering microarray analysis


Late 1980s to early 1990s
Stephen P.A. Fodor and a team of scientists integrated semiconductor manufacturing
techniques with combinatorial chemistry on a small silicon chip to enable the collection of
vast amounts of genetic data that would help scientists learn the biology of disease
at the molecular level.
Their findings, published in Science in 1991, launched the microarray industry and forever
changed how genomics studies are done.

Enabling the translation of scientific


discoveries to the clinic
Affymetrix's GeneChip array
measures only 1.5 x 3 inches.
The silicon chip contains
6,903,680 probes.

1990s and beyond


The first commercial GeneChip array contained ~18,000 oligonucleotides. Today a GeneChip
array can contain up to 6.9 million oligonucleotides, enabling the analysis of the entire human
genome on one array.
What were once visions, such as precision medicine, are becoming realities as Affymetrix
continues to bring new products to laboratories and clinics worldwide.

1996
The first GeneChip array is commercialized for the identification of novel mutations
associated with drug resistance in HIV patients.1

2002
An early-stage GeneChip expression array is used in a study identifying 95 genes whose
expression could be used to predict sensitivity of leukemic cells to STI571, a promising
agent for treatment of advanced Philadelphia-chromosome-positive (Ph+) acute
lymphoblastic leukemia.2

2004
A small-scale genotyping study using a GeneChip array pinpoints a genetic mutation
associated with sudden infant death to dysgenesis of the testes in males among the
Old Order Amish.3

2008
Affymetrix's DMET array is used in a study finding a DNA variant in CYP4F2, a gene for
drug-metabolizing enzymes, that affects warfarin dose requirements.4

Today
More than 70,000 publications cite Affymetrix's technology, many in clinical studies
contributing to improved diagnosis and treatment.

Advancing clinical research



2005
Collaboration between Wellcome Trust and Affymetrix to identify genetic associations in 7 common diseases
The Wellcome Trust Case Control Consortium (WTCCC) genotypes 17,000 samples to identify genetic variants associated with
type 1 and type 2 diabetes, coronary heart disease, hypertension, bipolar disorder, rheumatoid arthritis, and Crohn's disease.

2009
Collaboration between UCSF, Kaiser Permanente, and Affymetrix to enable genomic studies of common diseases
100,000 samples are genotyped in 15 months and the database made available to qualified scientists. Among recent studies published
are the identification of potential biomarkers to improve prostate cancer screening5 and the finding of a genetic susceptibility to
Staphylococcus aureus that could pave the way to new treatment and prevention of antibiotic-resistant infections like MRSA.6

Empowering biobanks to discover the interplay of genes, environment, and lifestyle

2013
Collaboration between UK Biobank and Affymetrix to genotype 500,000 volunteers
for prospective study
The resulting biomedical database is made available to scientists worldwide by UK Biobank.
Many scientists have already published their discoveries on lung function, smoking behavior,
neurobiological disorders, and many other conditions.

2015
Million Veteran Program builds huge database of genotyping data using Affymetrixs platform
The US Department of Veterans Affairs' Office of Research and Development funds the Million Veteran
Program, resulting in one of the world's largest medical databases that includes genotyping data.
The data collected from one million veteran volunteers furthers scientists' understanding of how
genes affect health, especially military-related illnesses.

Realizing the vision of precision medicine


2004 to today

Roche and Affymetrix


launch first FDA-cleared microarray-based test
AmpliChip CYP450 Test
detects cytochrome P450
genetic variations that
impact drug metabolism.
The test results can help
physicians individualize
patient treatment, a first
for precision medicine.

New test reports


the probable tissue of
origin for 15 common
types of tumors
Pathwork Diagnostics
introduces the Tissue of
Origin Test (now offered
by Cancer Genetics, Inc.),
the first FDA-cleared test
based on a customized
gene-expression array
from Affymetrix.

Ariosa, now part


of Roche, selects
Affymetrixs platform
for noninvasive prenatal
testing development
Affymetrix supplies a custom
array for the development
of a noninvasive prenatal
test that is faster and
more accurate than
next-gen sequencing.

Affymetrix launches
first test to help
diagnose postnatal
developmental delay
CytoScan Dx Assay is
the first and only FDA-cleared, whole-genome,
microarray-based genetic
test to aid the increase in
diagnostic yield for postnatal
developmental delay and
intellectual disability.

Translating biomarkers
from lab to clinic
Affymetrix collaborates
with diagnostic companies,
such as Almac, GenomeDx,
Lineagen, PathGEN Dx,
SkylineDx, and Veracyte,
turning their biomarker
signatures into microarray-based tests for improved
diagnosis and treatment.

References
1. Kozal M. J. et al. Nat Med 2(7):753-59 (1996). 2. Hofmann W. K. et al. Lancet 359(9305):481-86 (2002). 3. Puffenberger E. G. et al. Proc Natl Acad Sci USA 101(32):11689-694 (2004).
4. Caldwell M. D. et al. Blood 111(8):4106-12 (2008). 5. Hoffmann T. J. et al. Cancer Discov 5(8):878-91 (2015). 6. DeLorenze G. N. et al. J Infect Dis 213(5):816-23 (2016).
© 2016 Affymetrix, Inc. All rights reserved. Unless otherwise noted, Affymetrix products are For Research Use Only. Not for use in diagnostic procedures.



Chromosomal Microarray Analysis (CMA): CMA is increasingly
used to detect chromosomal abnormalities, including
submicroscopic ones that are too small to be detected by
conventional karyotyping. CMA is now recommended as a
first-tier test in the genetic evaluation of infants and children
with unexplained developmental delay and intellectual disability.
High-resolution chromosomal microarrays, containing both SNP
and copy number probes, can elucidate allelic imbalances and
identify LOH/AOH that can be associated with uniparental disomy
or consanguinity, both of which increase the risk of recessive
disorders. In 2014, Affymetrix introduced the CytoScan Dx Assay,
the first-of-its-kind FDA-cleared and CE-marked whole-genome
postnatal blood test to aid in the diagnosis of developmental
delay, intellectual disabilities, congenital anomalies, or dysmorphic
features in children.


NOT ALL MICROARRAYS ARE CREATED EQUAL


Here are the different ways a microarray can be manufactured and
the pros and cons of each.
In-situ Synthesized Arrays: These types of arrays were pioneered
by the founder of Affymetrix (Fodor et al.) in the 1990s. These rely
on photolithography, a method that uses UV masking and light-directed combinatorial chemical synthesis on a solid support to
selectively synthesize probes directly on the surface of the array,
one nucleotide at a time per spot. This technology saw early use
to detect mutations in the reverse transcriptase and protease
genes in the HIV-1 genome and to measure variation in the human
mitochondrial genome. Since then, these kinds of arrays have been
developed for a wide range of applications in gene expression


analysis, genotyping, copy number analysis, and chromosomal
abnormality detection.
In 1996, Blanchard et al. published a method adapting inkjet
printer heads to microarrays. Picoliter volumes of nucleotides are
printed onto the array surface in repeated rounds of base-by-base
printing that extends the length of specific probes. The method was
commercialised by Rosetta Inpharmatics and eventually licensed to
Agilent Technologies.
The shorter probes and manufacturing method used by Affymetrix
produce much higher-density microarrays, which are much better
suited for whole transcriptome analysis including splice variants,
whole-genome genotyping, and SNP+CNV analysis. The Agilent
designs allow the synthesis of longer probes at lower density,
limiting the applications to expression profiling and CGH.
Self-Assembled/High-Density Bead Arrays: Originally developed
by David Walt's group at Tufts University in the late 1990s and 2000,
this technology was licensed to Illumina. DNA is synthesised onto
small silica beads, which are deposited onto the ends of fibre-optic
bundles or silicon slides pitted with microwells for the beads.
Different types of DNA are synthesised on different beads, and are
randomly assembled into an array. Hybridising and detecting short,
fluorescently labelled oligonucleotides in a sequential series of
steps allows the beads to be decoded.
The simplicity of the assay and lower bioinformatics burden make
microarrays not only a very powerful way to leverage today's
genomic knowledge, but also much cheaper and faster to turn
around than NGS methods. However, to design a
microarray you have to know the sequence of the genome, whereas
with NGS you can sequence and characterise both known and
unknown genomes.

SUMMARY
This chapter is not intended to explain how sequencing or
microarrays work. It is intended to show you that you have a range
of options available to you. At the start of the chapter we asked
you to keep a few questions in mind. Principally, what kind of data
do you need to be able to answer your question most efficiently? If
you are looking for novel or rare variants in an individual, whole
genome sequencing will be the way to go. If you want
to explore known regions of interest, then just take a look at the
exome or get a bit more specific with a targeted approach. Maybe
you want to genotype a large population to identify associations?
A genotyping array is going to be much cheaper and easier to
manage than NGS. Pick the technology that works for you, your
operating restrictions, and which will produce the data you need.
In the next chapter we take a look at the chemistry involved in
turning your DNA into data. Now that you've decided what you
want to do with your sample, you'll need to know how to prepare it
for the right platform.


CHAPTER 2: TURNING DNA INTO DATA

INTRODUCTION
Over the last twenty years, fundamental advances in sequencing
technology have brought us a long way from the 13 years and
roughly three billion dollars spent on the Human Genome Project. Next
Generation Sequencing (NGS) and microarray techniques
have dramatically reduced the time and the cost associated
with large-scale genome exploration. Today, whether you are
analysing a panel of genes, exploring an exome, or shooting for
an entire genome, there is an extraordinary wealth of different
techniques available.
We will explore the basic science behind these different techniques,
taking a look at how genetic sequencing actually works and how
we generate high quality sequence data for research and clinical
application. Successful analysis is critically dependent on accurate
sample preparation, so we'll take a look at the basics of the
sample preparation process for a range of sequencing techniques,
including NGS and microarrays.
There are numerous kits and methods available for NGS sample
preparation, but several of the basic steps needed to prepare
DNA for sequencing are conserved across different sequencing
techniques. So, for example, preparing DNA for Illumina
sequencing, Ion Torrent sequencing, or a DNA microarray all requires
DNA fragmentation. Given the widespread use of the Illumina
platform, for this chapter we will largely focus on the crucial steps
in preparing DNA for Illumina sequencing.
As well as understanding genomic sequences, there are also
sequencing methods that enable us to explore gene expression:
which genes or gene regions are active at a particular point in
time. During this chapter we will look at how these methods
work, when they are used and how sample preparation differs
for these protocols.

FIRST, WE NEED SOME DNA...


The first step for DNA sequencing is collecting a tissue sample.
In a clinical setting DNA for sequencing is extracted from patient
samples, such as peripheral blood, bone marrow, fresh tissue, or
formalin-fixed paraffin-embedded (FFPE) tissue.
Extracting DNA from a tissue sample involves three steps.
First the tissue cells are broken open to expose the DNA in
the nucleus, which is commonly referred to as cell disruption
or cell lysis. This can be done using physical methods such as
blending, chemical methods, or sonication, in which high-frequency ultrasound is applied to samples to disrupt the
cell membranes. Second, the remaining membrane lipids are
removed by adding detergents.
Third, after extraction, the DNA has to be purified to remove
the detergents, proteins, salts and reagents used during cell
lysis. Finally, the DNA sample is run through an amplification
polymerase chain reaction (PCR) to enrich the sample ready for
library preparation.


DEEP CLONAL PROFILING OF PURIFIED


TUMOR CELL POPULATIONS FROM
FFPE SAMPLES
SUREPRINT G3 HUMAN CGH MICROARRAYS AND SURESELECT EXOME SEQUENCING
FACILITATED HIGH DEFINITION GENOMIC PROFILING OF PURIFIED TUMOR CELL
POPULATIONS FROM FFPE SAMPLES

Formalin-fixed paraffin-embedded (FFPE) tissues are a
vast resource of clinically annotated samples. High definition
genomics of these informative materials could improve patient
management and provide a molecular basis for the selection
of personalized therapeutics. A collaboration between several
research groups around the world, including the laboratory of Dr.
Michael T. Barrett at Translational Genomics Research Institute (TGen),
has developed a method based on flow cytometry cell-sorting to isolate
individual tumor cell populations from FFPE tissues, as well as methods
for their accurate genomic analysis.1
Recent studies have described various methods to interrogate FFPE
samples with array and sequencing technologies. These methods
typically select for samples that exceed a threshold for tumor cell
content using histological methods. Solid tumors, however, exhibit
high degrees of tissue heterogeneity. Current approaches for enriching
tumor samples prior to analysis of cancer genomes in FFPE tissue,
such as laser capture microdissection (LCM), are limited in their ability
to sufficiently distinguish and isolate different cell types in a timely
manner, making them less suitable for clinical research applications of
highly sensitive single molecule-resolution approaches such as NGS.

ACCURATE NEXT GENERATION SEQUENCING DATA ON FFPE


CLONAL POPULATIONS
NGS analysis of even highly tumor cell-enriched bulk cancer
samples, including those prepared by LCM, cannot accurately
distinguish whether aberrations in the tumor DNA are present in
a single cancer genome or if they are distributed in multiple clonal
populations in each biopsy. In contrast, NGS analysis of these
highly defined clonal populations can provide accurate sequence
information on specific tumor cell types. An analysis was performed
on sorted FFPE samples prepared from a PDA cell line whose
exome has been extensively studied. The PDA cell line, primary FF
tissue from which the cell line was derived, and the corresponding
FFPE blocks were used to validate the sorting-based NGS analyses.
A comparison of the paired end reads alignments against the
reference genome in each of the three samples showed that almost
80% of the target areas had at least 20X coverage in all three samples.
The overlap of unique reads and the detection of known mutations
across the three independent sample preparations demonstrated
that sorted FFPE samples can generate accurate NGS Data, using the
SureSelect Human All Exon Kit.
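For readers curious how a figure like "80% of target areas at 20X" is derived, the short sketch below (Python with numpy, using an invented per-base depth array) computes breadth of coverage at a chosen depth threshold; in practice the per-base depths would come from a tool such as samtools depth run over the capture target.

# Breadth-of-coverage sketch; the depth values are made up for illustration.
import numpy as np

depths = np.array([35, 28, 22, 19, 40, 55, 18, 25, 30, 21])  # depth at each target base

min_depth = 20
breadth = np.mean(depths >= min_depth)   # fraction of target bases at >= 20X
mean_depth = depths.mean()

print(f"Mean target depth : {mean_depth:.1f}X")
print(f"Bases at >= {min_depth}X : {breadth:.0%}")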

RELIABLE CGH ARRAY DATA FROM CLONAL PROFILING OF


FFPE SAMPLES
Recent advances in cytometry-based cell sorting technology facilitate the
detection of relatively rare events in dilute admixed samples, enabling
DNA content-based flow cytometry assays for high definition analyses
of human cancer biopsies. These assays provide intact nuclei for DNA
extraction, eliminate the need and bias to preselect samples based on
tumor content and non-quantitative morphology-based measures, and
greatly increase the number of samples for analyses. To evaluate the use
of sorted solid tissue FFPE samples, fresh frozen (FF) pancreatic ductal
adenocarcinoma (PDA) tissue samples were compared to matching
FFPE samples. Genomic intervals for ADM2 were used to measure the
reproducibility of aCGH data in the matching FFPE and FF samples. The
top 20 ranked amplicons in the FFPE sample were used for this analysis.
In 19 of these, the overlap was >90% with the same ADM2-defined
interval in the sorted fresh frozen sample. The global utility of the CGH
assay was determined with different tissues, including triple negative
breast cancer, bladder carcinoma, glioblastoma and small cell carcinoma
of the ovary. The CGH assay was able to discriminate homozygous and
partial deletions, map breakpoints, and amplicon boundaries to the single
gene level in these sorted samples, regardless of tumor cell content.

CONCLUSIONS
These highly sensitive and quantitative sorting assays provide
pure and objectively defined populations of neoplastic cells prior
to analysis. The deep and unbiased clonal profiling of sorted FFPE
samples by aCGH and NGS provides a valuable methodology with
broad application for cancer research which can advance the
development of personalized patient therapies.
Agilent offers a wide range of resources on CGH microarrays and NGS
that include applications notes, featured articles, how-to videos and
much more. These resources can be accessed from the links below:


NGS Cancer Research Resource Center


www.agilent.com/genomics/NGSCancer
NGS Constitutional Research Resource Center
www.agilent.com/genomics/NGSConstitutional
CGH Resource Center
www.agilent.com/genomics/CGHResource


Notes
1. T. Holley et al., Deep Clonal Profiling of Formalin Fixed Paraffin Embedded Clinical Samples. PLoS ONE 7(11): e50586. doi:10.1371/journal.pone.0050586.
This article was adapted from Agilent Publication 5991-3333EN.
For Research Use Only. Not for use in diagnostic procedures.


GenetiSure CGH + SNP Arrays

HAVE A LOT TO SAY ABOUT

EXON-LEVEL COVERAGE
Two Catalog Arrays for Postnatal and Cancer Research
Designed for exon-level coverage of disease-associated regions recommended
by ClinGen/ISCA or COSMIC and Cancer Genetics Consortium databases
Enhanced loss of heterozygosity detection with a resolution validated to 2.5Mb
Easily customize your microarray at no additional cost
www.agilent.com/genomics/GenetiSureCGH+SNP
For Research Use Only. Not for use in diagnostic procedures.


The same process is applied when


extracting RNA for gene expression
studies, such as microarray tests. DNA
microarrays designed to measure gene
expression are compatible with a large
spectrum of different sample types.
RNA extracted from cells, blood, saliva,
fecal samples, FFPE, and fresh/frozen
tissue can all be processed on gene
expression arrays.

HOW DOES SEQUENCING WORK?


In order to understand the crucial
steps in library preparation, we'll need
to do a run-through of what happens
during sequencing itself. All sequencing
methods involve a precise set of chemical
reactions, and the DNA needs to be
properly prepared.
The basic principle of DNA sequencing
is converting the bases on a DNA strand
into detectable physical events, such as
fluorescence (Illumina, Sanger, SOLiD)
or a change in pH (Ion Torrent). A
sequencing machine can detect changes
in this physical event and translate that
into a read-out of base pairs.
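As a purely illustrative toy, the Python sketch below mimics that principle: one recorded signal per cycle is looked up and translated into a base call, with unclear cycles reported as "N". The colour-to-base mapping is invented for this example; real platforms perform calibrated image or signal processing and attach per-base quality scores.

# Toy "basecaller": translate one detected signal per cycle into a base.
SIGNAL_TO_BASE = {"green": "A", "red": "C", "yellow": "G", "blue": "T"}

def call_bases(signals):
    """Convert a list of per-cycle signals into a read, using 'N' for unclear cycles."""
    return "".join(SIGNAL_TO_BASE.get(s, "N") for s in signals)

cycles = ["green", "green", "blue", "red", "smeared", "yellow"]
print(call_bases(cycles))   # -> AATCNG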


SANGER SEQUENCING
Used during the Human Genome Project, Sanger sequencing involves
the production of a large number of different length DNA fragments from
the same base sequence, all ending in a fluorescently labelled base.
A laser is used to excite the fluorescent label, allowing the sequencer to record
which base is present, and the DNA fragments are sorted from smallest to
largest, allowing the sequencer to reconstruct the original order of the bases.

454 LIFE SCIENCES
Also known as pyrosequencing, the 454 method detects the activity of
the DNA polymerase enzyme, through emitted light, as bases are added to a
DNA sequence.
Fragmented DNA strands are attached to an agarose bead 1 μm in diameter.
The light-producing reaction takes place inside a picolitre-scale well as
bases are added.

ION TORRENT
When a nucleotide is incorporated into a DNA strand by an enzyme, an
H+ ion is released. Instead of detecting fluorescence, Ion Torrent sequencing
detects the pH changes caused by H+ ion release.
During sequencing, each well of the sequencer is flooded with one nucleotide
at a time until a pH change is recorded, indicating a base match.

SOLID SEQUENCING
As with the 454 protocol, as part of SOLiD (Sequencing by Oligonucleotide
Ligation and Detection) identical DNA fragments are attached to agarose
beads. However, during SOLiD the fluorescence used to record the DNA
sequence is generated by the action of the DNA ligase enzyme as sections
of DNA, rather than individual bases, are joined together.

SINGLE-MOLECULE REAL-TIME SEQUENCING (SMRT)
Allowing for longer DNA reads than other NGS protocols, SMRT uses
fluorescently labelled nucleotides to sequence DNA strands in real time.
A single molecule of DNA is immobilised at the bottom of a zero-mode
waveguide (a tube whose dimensions are smaller than the wavelength of
light, approximately 70 nm in diameter), along with a single DNA polymerase
enzyme. A detector records which bases are incorporated during DNA
synthesis by measuring fluorescence.

ILLUMINA
Illumina sequencing all takes place on a specialised flow cell coated in a
lawn of primers. Fragments of DNA are hybridised (or attached) to the
two-dimensional surface, forming localised clusters of about 2,000 identical
DNA fragments. This step is called cluster generation.
During sequencing these clusters are bathed in fluorescently labelled
nucleotides, along with a DNA polymerase enzyme that attaches each
fluorescent nucleotide to its non-fluorescent correspondent, so fluorescent A
binds with non-fluorescent T, and so on.
As with Sanger sequencing and other fluorescent methods, the surface of
the flow cell is then imaged using laser excitation and the resulting colours
used to record the DNA sequence of each cluster.


SEQUENCING FOR GENE EXPRESSION


As well as determining the presence and absence of genes, sequencing techniques can also
be used to capture a snapshot of gene activity under particular conditions, also known as
the transcriptome. As with the majority of DNA sequencing, fluorescent labelling is used to
convert gene activity into a measurable, physical effect. So the stronger the fluorescent signal,
the greater the level of expression in the sample.
There are two main methods for studying gene expression: microarrays and RNA-Seq. The crucial
difference between the two techniques is that microarrays detect known transcripts that correspond
to known genomic sequences, whereas RNA-Seq can detect both known transcripts as well as
potential novel transcripts without the need to have prior knowledge of the genomic sequences.
Broadly speaking, microarrays are particularly useful in a clinical environment because they can be
used for rapid, accurate assessment and diagnosis of established clinical variants. RNA-Seq is often
favoured during research into unknown areas of the transcriptome.

RNA-SEQ
The newer of the two technologies, RNA sequencing or RNA-Seq is exactly
that: reading the sequence of a strand of RNA. RNA-Seq provides a snapshot
of the presence and quantity of transcripts from a genome at a given moment
in time, allowing for experiments that interrogate gene expression under
particular conditions.
During Illumina RNA-Seq, mRNA molecules are fragmented and converted to
double-stranded cDNA. The cDNA is then subjected to similar library
preparation steps as for DNA sequencing: end repair, adaptor ligation, and
PCR amplification. The sequencing reaction for RNA is the same as for DNA,
beginning with the all-important cluster formation.

MICROARRAYS
A microarray is typically a glass slide or chip coated in DNA probes that the
sample DNA fragments can attach to.
A DNA microarray protocol begins in much the same way as RNA-Seq: by
isolating mRNA from biological samples, and converting those to cDNA.
During the creation of cDNA, a fluorescent label is added to the cDNA
fragment.
These labelled fragments hybridise to the microarray, and the fluorescence is
activated by laser to generate a signal. These signals are used to create a
digital image of the array, which is used for analysis.

THE LONG AND THE SHORT OF IT
One of the limitations of the
most popular NGS methods is
that they produce short reads of
DNA sequence.
Short reads are great for rapid,
high throughput sequencing,
but can make accurate genome
reconstruction challenging.
Short reads also provide
limited coverage of certain parts
of the genome, for example
areas that are rich in GC
nucleotide bases, which have a
higher melting temperature than
AT-rich stretches. As a result, during
PCR, GC-rich regions are less well
amplified than AT-rich regions.
In recent years sequencing
techniques that produce longer
reads have emerged to address
this problem. Long reads
make genome reconstruction
simpler by making the puzzle
pieces larger, and increase the
coverage of genome areas that
are harder to sequence.
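A simple way to see where such problem regions lie is to scan a sequence for GC-rich windows; the Python sketch below does this with an invented sequence and an arbitrary 65% threshold, chosen only for illustration.

# Flag GC-rich windows in a sequence; sequence and threshold are illustrative.
def gc_fraction(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_rich_windows(seq, window=20, threshold=0.65):
    """Return (start position, GC fraction) for windows above the GC threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1):
        frac = gc_fraction(seq[start:start + window])
        if frac >= threshold:
            hits.append((start, round(frac, 2)))
    return hits

example = "ATATATGCGCGGCGGCCGCGGGCATATATATATGGGCGCGCCA"
print(gc_rich_windows(example))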

WHAT IS PCR?
Developed in 1983, the
polymerase chain reaction or
PCR has become an essential
part of any genetics toolkit.
PCR is a molecular
photocopier, which amplifies,
or makes multiple copies of
small segments of DNA. Genetic
sequencing requires large
amounts of sample DNA, making
the process almost impossible
without PCR.
The DNA sample is heated,
so that the two DNA strands
denature and pull apart into two
separate strands.
Short DNA primers then bind
either side of the target region,
marking out the stretch to be
copied. Next, an enzyme called
Taq polymerase builds two new
strands of DNA using the original
strands as a template. This
process creates two identical
versions of the original strand,
which can then be used to create
two new copies, and so on, with
the number of copies doubling
at each cycle.
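The arithmetic behind that doubling is straightforward; the small Python sketch below works it through, with an optional per-cycle efficiency term (the 0.9 figure is an assumption for illustration, since real reactions fall short of perfect doubling).

# Ideal PCR amplification: copies = starting_copies * 2**cycles.
def pcr_copies(starting_copies, cycles, efficiency=1.0):
    """Copies after PCR, assuming a fixed per-cycle efficiency between 0 and 1."""
    return starting_copies * (1 + efficiency) ** cycles

print(pcr_copies(100, 30))                  # perfect doubling: ~1.1e11 copies
print(pcr_copies(100, 30, efficiency=0.9))  # 90% efficiency: ~2.3e10 copies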


MEETING THE CHANGING DEMANDS OF NGS


SAMPLE PREPARATION WITH NEBNEXT

LIBRARY PREPARATION IS A CRITICAL PART OF THE NEXT GENERATION SEQUENCING WORKFLOW;
SUCCESSFUL SEQUENCING REQUIRES LIBRARIES OF SUFFICIENT YIELD AND QUALITY.

As sequencing technologies improve and capacities expand, boundaries are also being pushed on
library construction. High performance is required from ever-decreasing input quantities and from
samples of lower quality or those with extreme GC content. At the same time, the need is increasing
for faster, automatable protocols that perform reliably and do not compromise the quality of the
libraries produced.

At New England Biolabs, we understand the challenges you face, and are uniquely positioned to help
you meet them. For over 40 years, NEB has been a leading supplier of molecular biology enzymes for
the life science community. We have been at the forefront of developing recombinant enzymes, as well
as stringent quality controls that ensure product purity and performance, and since 2009 we have
been applying this expertise to improve sample preparation products for NGS.
To meet the increasing applications and new challenges of NGS
sample preparation, we continue to expand our NEBNext product
portfolio. Available for the Illumina and Ion Torrent platforms,
NEBNext reagents are designed to streamline workflows, minimize
inputs and improve library yields and quality. NEBNext sample
preparation kits are available for genomic DNA, ChIP DNA, FFPE
DNA, microbiome DNA, RNA and small RNA samples. In addition to
the extensive QCs on individual kit components, all NEBNext kits are
functionally validated by library preparation, followed by sequencing
on the appropriate platform.
SUBSTANTIALLY IMPROVED LIBRARY PREPARATION WITH THE
NEBNEXT ULTRA II DNA LIBRARY PREP KIT FOR ILLUMINA
The NEBNext Ultra II DNA Library Prep Kit pushes the limits of library
preparation. Each component of the kit has been carefully formulated,
resulting in a several-fold increase in library yield with as little as 500
pg of human DNA. These advances deliver unprecedented performance, while enabling lower inputs and fewer PCR cycles, all with a
fast, streamlined workflow.
IMPROVEMENTS IN LIBRARY YIELD AND CONVERSION RATE
An important measure of the success of library preparation is the yield
of the final library. The reformulation of each step in the library prep
workflow enables substantially higher yields from the NEBNext Ultra II Kit compared to other
commercially available kits (Figure 1). Even when using very low input amounts (e.g. 500 pg of human
DNA), high yields of high quality libraries can be obtained, using fewer PCR cycles.

Figure 1. NEBNext Ultra II produces the highest library yields (nM) from a broad range of input amounts
(100 ng down to 500 pg), compared with Kapa Hyper and TruSeq Nano. Libraries were prepared from
Human NA19240 genomic DNA using the input amounts and numbers of PCR cycles shown; manufacturers'
recommendations were followed, with the exception that size selection was omitted.

The efficiency of the end repair, dA-tailing and adaptor ligation steps during library construction can be
measured separately from the PCR step by qPCR quantitation of adaptor-ligated fragments prior to library
amplification. This enables determination of the rate of conversion of input DNA to adaptor-ligated
fragments, i.e. sequenceable molecules. Measuring conversion rates is therefore another way to assess
the efficiency of library construction, and it also provides information on the diversity of the library.
Again, NEBNext Ultra II enables substantially higher rates of conversion as compared to other
commercially available kits (Figure 2).
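As a minimal sketch of that comparison, the Python snippet below divides invented qPCR quantities of adaptor-ligated molecules for each kit by the Ultra II value, mirroring how the relative conversion rates in Figure 2 are expressed; the competitor kit names and all numbers here are hypothetical.

# Normalise hypothetical qPCR quantities of adaptor-ligated molecules to Ultra II.
adaptor_ligated_nM = {
    "Ultra II": 95.0,
    "Kit B": 62.0,
    "Kit C": 41.0,
}

baseline = adaptor_ligated_nM["Ultra II"]
for kit, quantity in adaptor_ligated_nM.items():
    print(f"{kit}: relative conversion rate = {quantity / baseline:.2f}")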

Figure 2. NEBNext Ultra II produces the highest rates of conversion to adaptor-ligated molecules from a
broad range of input amounts (100 ng down to 500 pg), relative to Kapa Hyper and TruSeq Nano. Libraries
were prepared from Human NA19240 genomic DNA using the input amounts and library prep kits shown,
without an amplification step and following manufacturers' recommendations. qPCR was used to quantitate
adaptor-ligated molecules, and quantitation values were then normalized to the conversion rate for Ultra II.
The Ultra II kit produces the highest rate of conversion to adaptor-ligated molecules for a broad range of
input amounts.

MINIMIZATION OF PCR CYCLES


In general, it is preferable to use as few PCR cycles as possible to amplify libraries. In addition to reducing workflow time, this also limits the risk
of introducing bias during PCR. A consequence of increased efficiency
of end repair, dA-tailing and adaptor ligation is that fewer PCR cycles
are required to achieve the library yields necessary for sequencing or
other intermediate downstream workflows (Figure 3).
IMPROVEMENTS IN LIBRARY QUALITY
While sufficient yield of a library is required for successful sequencing,
quantity alone is not enough. The quality of a library is also critical,
regardless of the input amount or GC content of the sample DNA. A high
quality library will have uniform representation of the original sample, as
well as even coverage across the GC spectrum.

(Figure 1, continued: DNA inputs of 100 ng, 10 ng, 1 ng and 500 pg were amplified with 5, 8, 11
and 14 PCR cycles, respectively. Libraries were prepared from Human NA19240 genomic DNA using the
input amounts and numbers of PCR cycles shown. Manufacturers' recommendations were followed, with
the exception that size selection was omitted.)


REFERENCE (1) Kozarewa, I. et al. (2009). Amplification-free Illumina sequencing library preparation facilitates
improved mapping and assembly of (G+C) biased genomes. Nat. Methods 6:291–295.
NEW ENGLAND BIOLABS, NEB, NEBNEXT are registered trademarks of New England Biolabs, Inc.
ULTRA is a trademark of New England Biolabs, Inc.
ILLUMINA, MISEQ and TRUSEQ are registered trademarks of Illumina, Inc.
ION TORRENT is a trademark owned by Life Technologies, Inc.
KAPA is a trademark of Kapa Biosystems.



Figure 3. Number of PCR cycles required to generate 1 µg of
amplified library for target enrichment. (Y-axis: PCR Cycles;
DNA inputs: 1 µg, 100 ng and 10 ng; kits compared: Ultra II and Kapa Hyper.)
Ultra II libraries were prepared from Human NA19240 genomic DNA using NEBNext Ultra II and the
input amounts shown. Yields were measured after each PCR cycle and the number of cycles required to
generate at least 1 µg of amplified library determined. Cycle numbers for Kapa Hyper were obtained from
the Kapa Biosystems website and plotted alongside the cycle numbers obtained experimentally for Ultra II.

UNIFORM GC COVERAGE
Libraries from varying input amounts of three microbial genomic DNAs
with low, medium and high GC content (H. influenzae, E. coli and H. palustris) were prepared using the
NEBNext Ultra II Kit. In all cases, uniform coverage was obtained, regardless of GC content and input
amount (Figure 4A). GC coverage of libraries prepared using other commercially available kits was also
analyzed using the same trio of genomic DNAs. Again, NEBNext Ultra II provided good GC coverage
(Figure 4B).

When amplification is required to obtain sufficient library yields, it is


important to ensure that no bias is introduced, and that representation of
GC-rich and AT-rich regions is not skewed in the final library. Comparison
with libraries produced without amplification (PCR-free) is also a useful
measure (1). Coverage of libraries prepared from human genomic DNA
using NEBNext Ultra II, as well as other commercially available kits, was
compared to a PCR-free library. Results demonstrated that the Ultra II
library coverage is most similar to the PCR-free library, and also covers
the range of GC content (data not shown).
For more performance data using NEBNext Ultra II,
visit NEBNextUltraII.com and download the full technical note.

Figure 4. NEBNext Ultra II provides uniform GC coverage for microbial genomic DNA over a broad range of GC composition
and input amounts. (A: Ultra II at 100 ng, 1 ng and 500 pg inputs. B: Ultra II, Kapa Hyper and TruSeq Nano at 100 ng input.)
Libraries were made using 500 pg, 1 ng and 100 ng of the genomic DNAs shown and the Ultra II DNA Library Prep Kit (A), or using 100 ng of the genomic DNAs and the library prep kits shown (B), and sequenced on an
Illumina MiSeq. Reads were mapped using Bowtie 2.2.4 and GC coverage information was calculated using Picard's CollectGCBiasMetrics (v1.117). Expected normalized coverage of 1.0 is indicated by the horizontal grey
line, the number of 100 bp regions at each GC% is indicated by the vertical grey bars, and the colored lines represent the normalized coverage for each library.


BUILDING A DNA LIBRARY


Just as the speed of genetic sequencing has increased phenomenally
over the past decade, so has the speed of library preparation.
Early protocols, such as Sanger sequencing, used molecular cloning to
create libraries of identical DNA fragments. This process involves using
host bacteria to store and replicate target DNA. The downside of this
approach, aside from the speed, was that the resulting DNA sequence
might contain parts of the bacterial cloning vector. In the past few years,
the time needed for Illumina NGS library preparation has dropped from
1-2 days to around 90 minutes.
There can be a lot of variation in library preparation methods, some
of which are specific to particular sequencing methods or products,
but broadly speaking there are five main stages involved.
1. DNA FRAGMENTATION
The first step in library preparation is to fragment the sample DNA.
Fragmentation is a crucial step in the library preparation process, as it
ensures that all the DNA strands are of a similar size before sequencing.
This is because different length strands can behave in different
ways during library preparation, particularly during the later PCR
stages, where smaller DNA fragments may be over-amplified
compared to larger ones.
There are numerous methods for fragmenting DNA, but three of
the most common are:
Acoustic shearing: using high-frequency acoustic energy waves to
break DNA strands
Nebulisation: forcing the DNA through a small hole in a nebuliser unit
Enzyme restriction: using enzymes to break the DNA strands into
smaller pieces
For NGS the fragments produced are typically less than 800 base
pairs in length, as these approaches produce short read DNA data.
However, for the newly emerging long read sequencing methods
the DNA strands are fragmented into lengths of around 10 kb.
Microarrays developed for SNP genotyping or genome-wide
DNA copy number detection can be used with DNA extracted
from a wide assortment of samples. These assays, unlike NGS assays, do not require shearing of the DNA. The DNA is simply
used as is, or can be restriction-enzyme digested.
2. END REPAIR
DNA fragmentation leaves a range of 3′ and 5′ ends (recessed,
overhang, blunt) on DNA fragments. As with fragments of different
length, fragments with different ends may react differently later on
in the process, and so need to be repaired.
To repair the DNA strands, a series of different treatments is
applied that remove overhangs and fill in recesses, creating a
sample of entirely blunt-ended fragments.


dA-tailing
During Illumina sequencing there is an additional step called
dA-tailing of the 3′ end of the repaired fragment. An A nucleotide
overhang is attached to the 3′ end of each DNA strand, which will
enable the right adaptors to ligate or attach to the DNA strand in
the next step.
3. ADAPTER LIGATION
Quite simply, this involves attaching known sequences
(adaptors) to the ends of the prepared DNA fragments whose
sequence is unknown. Adaptors are needed further downstream
in the sequencing process, and are essential for sequencing to
work properly.
For example, during Illumina sequencing the adaptors are
needed to hybridise the DNA strands to the flow cell. The flow
cell itself is covered in a dense lawn of primers to which the
DNA fragments attach. Adaptors can also contain an index
sequence, allowing for multiple different samples to be studied
in a single flow cell.
During SOLiD or 454 sequencing protocols, the adaptors are
required to bind the DNA fragments to the agarose beads on which
the sequencing reaction takes place.
4. AMPLIFY
Finally, a PCR amplification is performed to create a robust library
of DNA fragments that is suitable for sequencing. This step
increases the amount of library, and ensures that only molecules
with an adaptor at each end are selected for sequencing.
5. CLEAN-UP AND QUANTIFY
For Illumina sequencing, a final round of gel electrophoresis is
often used to purify the final product, and conclude the library
preparation process.
Before sequencing it is important to determine that the library
contains a suitable number of molecules that are ready to be
sequenced: that the right number of DNA fragments, with attached
adaptors, is present in the sample.
Another reason to quantitate a library is if more than one library is
due to be sequenced at the same time, as is possible with
Illumina sequencing.
There are several different methods for library quantitation, but
they all broadly work in the same way, measuring the amount of
correctly sized DNA present in the sample. For example:
Spectrophotometry: this method detects the absorption of UV
light by macromolecules in the sample. The more DNA present,
the greater the UV absorption.
Fluorimetry: this method involves binding a fluorescent dye to
the DNA molecules and measuring the fluorescence. The more DNA
present, the brighter the fluorescence.
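Where a fragment-size estimate is also available (for example from an electrophoresis trace), the mass-based reading can be converted into the molar concentration that loading protocols typically ask for. Below is a minimal sketch, assuming the commonly used approximation of ~660 g/mol per double-stranded base pair; the function name and example values are illustrative rather than part of any kit protocol.

def library_molarity_nM(conc_ng_per_ul: float, mean_fragment_bp: int) -> float:
    """Convert a mass concentration (ng/ul) to molarity (nM),
    assuming ~660 g/mol per double-stranded base pair."""
    grams_per_mole = 660.0 * mean_fragment_bp
    # ng/ul equals mg/l; dividing by molecular weight (g/mol) and scaling gives nmol/l
    return conc_ng_per_ul / grams_per_mole * 1_000_000

# Example: a 2 ng/ul library with a mean fragment size of 400 bp
print(round(library_molarity_nM(2.0, 400), 2))  # ~7.58 nM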


JUST THE BEGINNING


Generating a series of DNA sequence fragments, whether using
NGS techniques or DNA microarrays, is the first step in generating
high quality genomic data that can form the basis of clinical testing
or research. In the following chapters we will explore what happens
next, including reconstructing entire genomes from those DNA
sequence fragments, data analysis for microarrays, how to be
confident in data quality, and how that sequence data can be used
to inform both research and clinical diagnostics. n

MAKING A LIBRARY: DNA FRAGMENTATION → END REPAIR → ADAPTER LIGATION → PCR ENRICHMENT

GENERATING A SERIES OF DNA SEQUENCE FRAGMENTS IS THE FIRST STEP IN
GENERATING HIGH QUALITY SEQUENCE DATA

CHAPTER 3:

ANALYSING DATA

INTRODUCTION
In the previous two chapters we have considered the kinds of
genomic data we can generate, and the chemistry that makes it
possible. At the heart of genomics is data analysis. Once you have
digitised your DNA, you can start to explore it, understand it, and
query it. In this chapter we will look at how to analyse microarray
and NGS data, and how to turn it into useful information.

MICROARRAY DATA ANALYSIS


INTRODUCTION
Over the last several years the analysis of microarray data has
standardised to relatively simple workflows. There are standard
methodologies and analysis pipelines that can be used to generate
genotypes, identify copy number aberrations, identify regions of
absence of heterozygosity (AOH) or loss of heterozygosity (LOH), identify
differential gene expression and identify splice variants from microarray
data. The results from an analysis of microarray data are available on
the order of minutes or hours as opposed to days. In addition, analysis
of microarray data usually does not require any specific hardware and is

commonly performed using a standard laptop computer. The exception


to this is the large genotyping studies with hundreds of thousands of
samples that are more commonly analysed using Linux clusters or cloud
computing resources. In general, microarray analysis can be done by
bench scientists using standard computers and programs provided by
the microarray providers.
QUALITY CONTROL
One critical component of any analysis pipeline, including those for
microarrays, is quality control checks. Each microarray manufacturer
has a set of quality control processes and guidelines recommended
for each of their microarray products. For example, there are
spike-in controls like the ERCC RNA controls developed by NIST,
recommendations to process standard sample controls concurrently
with experimental samples to evaluate the performance of the
reagents independent of sample quality, and specific algorithm metrics.
Generally, each of the application-specific algorithms referred to
below contains metrics indicating the quality or confidence attributed
to its output. In the overall analysis workflow, these quality control
checks exist at multiple places, providing insight into everything from
the quality of the sample to the performance of the reagents and
of the algorithm. In discussing analysis pipelines, quality control
processes and guidelines are often omitted. Yet they are probably
the single most important part of any analysis.


DISCOVERY IN MILLIONS
OF GENOMES
Julia Fan Li, Senior Vice President, Seven Bridges

STUDIES THAT ANALYZE MILLIONS OF GENOMES AT ONCE WON'T JUST BE TECHNICAL FEATS
-- THEY WILL LEAD US TO TARGETED TREATMENTS FOR SUFFERERS OF MANY DISEASES,
INCLUDING CANCER.

Discovering pathogenic variants, or those that might lead
to drug targets, is a numbers game. The more patients
we sequence, the higher our statistical power to work
out the relationships between genotype and phenotype,
expression, or disease.
Getting to the numbers is more complex than anticipated. We
need to examine an individual patient in the context of millions of
others and the data required to do so is massive.
Working with millions of genomes in a single study is about to
become a reality.
There are projects around the world sequencing and analyzing
more than 100,000 genomes alongside other biomedical data.
These range from work at Human Longevity, to the International
Cancer Genome Consortium, the Million Veteran Program, and
SequenceBio.
We've also already seen the value that higher statistical power can
bring. Lawrence et al., in a Nature paper1 published out of the Getz lab at
the Broad Institute, showed that for cancers with low signal-to-noise
ratios (i.e. mutations that occur in less than two percent of tumors
of a given type), we only begin to identify mutations when sampling
more than 5,000 patients. The paper posited that we will need
significantly more data to make the kinds of advances we all hope
will result in ever more precise medicine.
The question, then, is how this is accomplished. How do we work
with petabytes of genomic and other data more efficiently, and how
do we put it to use?
In our work with large-scale and national genomics projects,
Seven Bridges has identified three key trends that will shape how we
deliver on the promise of precision medicine.
COMPUTATION CENTERS
With the rise of elastically available cloud computation resources,
the first is, in hindsight, blindingly obvious: Computation centers will
replace data repositories.
Historically, any researcher wanting to work on large-scale
genomic data needed large quantities of both money and time.
Grants would contain funding for local high-performance computing
(HPC) clusters and would expect researchers to spend months
making copies of datasets to these facilities.
Only then could one start work and by the time research
begins, a dataset could already be out of date. Further, there is an
ongoing expense to maintain the computing infrastructure.
Through projects like the National Cancer Institute's Cancer
Genomics Cloud (CGC) <www.cancergenomicscloud.org>, we've
discovered a better way.
We bring the biological questions to the data, not the other
way around.
By co-locating large datasets with computing resources,
researchers can upload their tools, pipelines, and metadata (all of
which are orders of magnitude smaller than the original datasets)
and pay only for the computation they need in a given experiment.
This offers four tangible benefits. First, when combined with the
right software these centers enable simple and secure collaboration.
We know that the best science is done in teams, and giving everyone
the exact access they need and not more is transformational.
A PI can review all the data, and inspect every pipeline with
every parameter. She can manage multiple funding sources, and
distribute work among her lab. Meanwhile, a lab assistant can help
construct a pipeline and select relevant cases to explore, without
the ability to execute large computational tasks (or, in other words,
spend money without approval).
Second, it reduces costs because funding bodies now only need
to pay to keep one copy of a given dataset. We estimate that storing
one copy of The Cancer Genome Atlas (TCGA) on the CGC will cost
approximately $2 million per year.
Third, it increases access to the data because any researcher,
not just those with available HPCs, can run their analysis. Instead
of making a large capital expenditure, researchers can turn
computation into an operating expense that scales directly
with how much they actually use it. Better still, with the huge
competition in the cloud-computing market, these prices are
consistently falling.2
Fourth, it greatly accelerates research. Not only are the months-long waiting times to make copies of data eliminated, but the time
to access computational resources is also reduced to just minutes.
Further, in cases where time is of the absolute essence, researchers
can optimise to take advantage of vastly more cores than they
normally would have in an HPC to further accelerate a task. For
example, we've worked with customers to reduce whole genome
pipelines to just about five hours.

(Figure: a graph genome, in which the alternative bases observed at each position are stored together
with their frequencies in the population. This is what we mean when we say graph genomes are
self-improving: it gets better the more you use it.)
PORTABLE, REPRODUCIBLE WORKFLOWS
The second trend weve identified with our partners is the need for
completely portable and thus reproducible workflows and pipelines.
The more large-scale data analysis enters the every-day practice
of science and medicine, the clearer it is that algorithms and
the software used to implement them have become an integral
and important part of research methods.3 But the complexity
of tracking let alone sharing software methods increases in
lockstep with the complexity of the tools themselves. For example,
a typical TCGA marker paper uses more than 50 bioinformatic tools,
each of which comes in multiple versions and with many different
parameters. Standardising the way we document tools, parameters,
and their dependencies is crucial to making the process of repeating
methods easier.
These issues were the impetus for the bioinformatics community
to develop the Common Workflow Language (CWL)
<www.commonwl.org>. CWL is a specification, much like HTML, that uses
plain text to store every piece of a complex computational workflow.
Better still, it was defined with Docker <www.docker.com> in mind,
meaning CWL-compliant software can also perfectly reproduce a
given workflow in the future by re-downloading the exact version of
any given application. And, because CWL is an open specification, it
prevents lock-in: researchers can use any analysis tool they prefer or
even write their own.
We've become big believers in CWL, and have built it into both the
CGC and the Seven Bridges Platform. Other industry partners are
doing the same, including the Institute for Systems Biology, the
Sanger Institute, the Galaxy Project, and the Broad Institute.
Making reproducibility a copy-and-paste affair is not just good
for science; it accelerates the pace at which we can build off the
discoveries of the entire community.
ADVANCED DATA STRUCTURES
The final trend we see across all our projects, from large
pharmaceutical companies to national projects like Genomics England, is a
need to bring genomic data structures into the 21st century.
The linear data formats of traditional genomics tools can't scale to
the number of samples we need to analyze simultaneously.
Today, when we want to understand an individual patient we align
their reads to a static reference, and store the results in static, flat
files. And we repeat this process for each new patient. Worse still, the
static reference is updated only once every three years, and only
represents a small collection of individuals, leading to inherent bias.
Instead, we need a reference that can be updated immediately
with new evidence. We need a reference that learns. We need a
reference that contains knowledge of an entire population.
We do this through a new technology we call the Graph Genome,
which advances genetic analysis in two key ways. First, it helps us
create an ever-more accurate view of both an individual's genetic
makeup and that of the population as a whole. Second, it is a more
efficient method to store and analyze vast quantities of genetic data.

ACCURATE INDIVIDUALS AND POPULATIONS


The Graph Genome gets increasingly precise by keeping and learning
from data that other genetic analysis tools would throw away.
A sequencer does not read a person's entire genome from
beginning to end. Instead, it breaks the genome up into smaller
chunks and reads them at random. Then, large numbers of
computers piece the small segments back together like a person
completes a jigsaw puzzle, comparing the piece in her hand to the
completed picture on the box top.
Until the Graph Genome, that box top, called the reference
genome, was a composite of just a few people. The majority comes
from one person.
But the Graph Genome does something different.
Instead of comparing pieces to the box top and stopping, we
update the overall image with the unique pieces of each individual.
Over time, as we complete more puzzles (sequence more people),
the box top starts to look like the entire population. Importantly, we
don't store identifiable individuals, but instead update the frequency
of their unique variations with each new sequence.
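To make the frequency-update idea concrete, here is a toy sketch (not Seven Bridges' actual implementation; the class and field names are invented) of a store that keeps, for each genomic position, only the observed alleles and how often they have been seen, updating those frequencies as each new genome is added:

from collections import defaultdict

class AlleleGraph:
    """Toy graph-genome node store: per position, allele counts only.
    Individual genomes are not retained, only aggregate frequencies."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # position -> allele -> count
        self.samples = 0

    def add_sample(self, variants):
        """variants: mapping of position -> observed allele for one genome."""
        for pos, allele in variants.items():
            self.counts[pos][allele] += 1
        self.samples += 1

    def frequency(self, pos, allele):
        return self.counts[pos][allele] / self.samples if self.samples else 0.0

graph = AlleleGraph()
graph.add_sample({1042: "A", 5310: "G"})
graph.add_sample({1042: "C", 5310: "G"})
print(graph.frequency(1042, "A"))  # 0.5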
FASTER AND MORE EFFICIENT
Traditionally, the data for each genome is stored separately. This
makes genomic datasets so large that they are impractical to store
and move. Even with the fastest connections in the world, copying
a dataset could take weeks. We allow researchers to work with the
data in the cloud and download only the results of their work. By
collapsing many individuals into only the differences between them,
we not only need less storage space, but can also more efficiently
read and work with that data.
There's also less of something else in graph genomes: patient-identifying
information. Because graph-based references can be
updated with just the new frequencies of variation, without the need
to store an individual's path through the graph, they also represent
a way to truly anonymise patient data while simultaneously
In this instance, less is actually more.

Julia is a Senior Vice President at Seven Bridges, the biomedical data


analysis company accelerating breakthroughs in genomics research
for cancer, drug development and precision medicine. She leads the
Seven Bridges UK office, which focuses on advancing the state of the
art in graph genome research, as well as serving national governments
on their largest, million-genome-scale projects.

Notes
1. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature.
2014 Jan 23;505(7484):495–501. doi: 10.1038/nature12912. Epub 2014 Jan 5.
2. http://www.economist.com/news/business/21648685-cloud-computing-prices-keepfalling-whole-it-business-will-change-cheap-convenient
3. Software with Impact. Nature Methods 11, 211 (2014). doi:10.1038/nmeth.2880


When evaluating different platforms and analysis pipelines, make sure
to understand the recommendations and limitations of the platform's
quality control systems to ensure generation of accurate and robust
data. The famous computer science adage 'Garbage in, garbage out'
applies equally well to any data analysis pipeline, and microarrays and
NGS are no exceptions.
ANALYSIS
The fundamental steps in analysing microarray data, after quality
control, are the same independent of the application field of interest
(i.e. copy number, genotyping or expression) or the microarray
platform. Each technology platform has its own recommended
standard pipeline for data processing for their arrays, but the high
level steps are similar. Intensities are captured off the microarray
using a scanner and algorithms provided by the technology providers.
These intensities are then subjected to signal processing algorithms
including but not limited to normalisation, background correction and
outlier removal. Following this signal processing the intensities are
summarised together providing a signal for each probe/probe set/
feature on the array. At this point the data processing diverges for the
different application spaces utilising application specific algorithms.
Gene expression arrays are now ready to be subjected to standard
statistical tests like t-tests and ANOVA to identify differentially
expressed genes. More sophisticated next generation expression
arrays can also be further analysed for splice variants using specific
algorithms provided in software from the manufacturer.
For genotyping microarrays the data is transformed into a space
with properties more suitable for evaluating genotypes, usually
utilising a clustering algorithm such as BRLMM-P or GenCall.
Analysing microarray data for copy number variation is a multi-step
process involving creating log2 ratios against a reference and then
calculating the corresponding copy number for each of the probe
sets based on the ratio. These individual copy number calls are then
processed by a segmentation algorithm to identify stretches of identical
copy number calls corresponding to stretches of the chromosome,
commonly referred to as segments, with that copy number state.
Some oligonucleotide microarrays also contain SNP probes on the
arrays that can be processed to identify stretches of AOH. In general,
these application specific analysis pipelines are packaged in easy to
use desktop computer applications provided by either the microarray
suppliers themselves or third party software providers.
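As a deliberately naive sketch of the log2-ratio and segmentation steps described above (not any vendor's algorithm; the probe intensities and rounding rule are invented for illustration):

import math

def copy_number_calls(sample_intensity, reference_intensity):
    """Per-probe log2 ratios against a reference, converted to integer copy-number
    calls (two copies expected when the log2 ratio is 0)."""
    calls = []
    for s, r in zip(sample_intensity, reference_intensity):
        log2_ratio = math.log2(s / r)
        calls.append(round(2 * (2 ** log2_ratio)))
    return calls

def segment(calls):
    """Group consecutive probes with identical calls into (start, end, copy_number) segments."""
    segments, start = [], 0
    for i in range(1, len(calls) + 1):
        if i == len(calls) or calls[i] != calls[start]:
            segments.append((start, i - 1, calls[start]))
            start = i
    return segments

sample = [1.0, 1.1, 0.52, 0.48, 1.0, 1.55]
reference = [1.0] * 6
print(segment(copy_number_calls(sample, reference)))
# [(0, 1, 2), (2, 3, 1), (4, 4, 2), (5, 5, 3)]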
In our opening chapter, we learned that DNA Microarrays often
offer a simpler and more cost effective way to generate certain
types of data. This is also true of the analysis of microarray data.
While most bench scientists will be capable of carrying out the
analysis outlined above, NGS analysis is a little more complicated.

NGS ANALYSIS
INTRODUCTION
In the last decade the genomics industry has seen the cost of next
generation sequencing (NGS) drop faster than the slope of Moore's
law, from about US$10 million to now approximately $1,000 per
genome. The drop in cost and the increase in speed of sequencing


technologies have led to widespread adoption of NGS by the
research community and increasing use in the clinic for diagnosis
and treatment of disease.
With NGS data being generated at an ever-increasing rate, the
need for standardised genome informatics and data management
practices has become critical. Gaining knowledge from genomic
data (the stream of As, Cs, Ts, and Gs that makes up the output of the
DNA sequencer) involves three broad steps:
Primary analysis: conversion of raw instrument output into base calls
(ACTG code) and assignment of quality scores
Secondary analysis: QA filtering, alignment and assembly of
reads, and variant calling
Tertiary analysis: annotation and filtering of variants for study-specific
investigations
PRIMARY ANALYSIS
Primary analysis involves the steps required to convert data from
the sequencer into base calls and to compute quality scores for
each base. As NGS typically works by breaking the genome
into fragments which are then read, the sequencer generates
unordered and unaligned raw reads, along with a quality score for
each of the bases, known as the Phred score (denoted by Q).
A Phred quality score is a measure of the quality of the
identification of the nucleotide bases generated by automated
DNA sequencing. It was originally developed for Phred base
calling to help in the automation of DNA sequencing in the
Human Genome Project. Phred quality scores are assigned to
each nucleotide base call in automated sequencer traces. They
have become widely accepted to characterise the quality of DNA
sequences, and can be used to compare the efficacy of different
sequencing methods. Perhaps the most important use of Phred
quality scores is the automatic determination of accurate, quality-based consensus sequences.
Typically, the instrument manufacturers provide the software for
primary analysis, and the task is handled within the sequencing
instrument itself.
The output from the sequencer is typically a FASTQ file: ASCII
text data containing sequence identifiers and the called nucleotides (A, G, T, or
C), along with the corresponding Phred scores. Once the FASTQ file has been
generated, it is ready for processing in a secondary analysis pipeline.
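To illustrate how the quality string in a FASTQ record relates to Phred scores, here is a minimal sketch; the read shown is invented and the standard Phred+33 ASCII encoding is assumed:

# A FASTQ record is four lines: identifier, sequence, separator, quality string.
record = """@read_001
GATTACAGATTACA
+
IIIIIHHHHFFFCC""".splitlines()

sequence, quality = record[1], record[3]

# Phred+33 encoding: score = ASCII value of the character minus 33.
# A Phred score Q corresponds to an error probability of 10^(-Q/10),
# so Q30 means roughly a 1-in-1000 chance the base call is wrong.
phred_scores = [ord(ch) - 33 for ch in quality]
error_probs = [10 ** (-q / 10) for q in phred_scores]

for base, q, p in zip(sequence, phred_scores, error_probs):
    print(base, q, f"{p:.4f}")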
SECONDARY ANALYSIS
Dominant NGS technologies are based on the shotgun approach
(see Creating Data and Reading DNA chapters). The sequence
data that comes off the sequencer is composed of small
nucleotide sequences of ACTG code (or reads) that need to be
put back together. Secondary analysis involves the reassembly
and alignment of these reads to construct the original sequence,
a process known as genome assembly. But before the
reassembly, the reads are assessed. They are filtered by length
or the quality reported by the sequencer in order to produce the
best results.

ANALYSING DATA

GENOME ASSEMBLY CAN BE DONE IN THREE WAYS:


Reference genome mapping: Reads are assembled and aligned
against a reference genome, a representative example of a
species' genome, and variant calling is performed, highlighting the
differences between the two.

De novo sequence assembly: The sequence is assembled without the
aid of a reference genome. Typically, this is a complex computational
process which uses different computing techniques, such as
constructing de Bruijn graphs from k-mers (for short reads) or using an
overlap-layout-consensus approach, for genome reassembly (a toy
sketch of k-mer graph construction appears after this list). Producing
long read lengths allows you to span certain repetitive and complex
elements of the genome that short reads are unable to resolve.
Graph-based reference genomes: A third way of assembling
reads involves an advanced form of reference genome based
on a graph that contains not just a representative example,
but instead data from many tens or hundreds of thousands
of individuals in a population. This method holds the potential
to be more accurate, less computationally expensive, and also to
anonymise individuals while still allowing them to be studied. It
also helps to achieve many of the goals of de novo assembly,
including identifying structural variation. This is covered in more
detail in the Seven Bridges article earlier in this chapter.
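The toy sketch below illustrates the k-mer idea behind de Bruijn graph construction mentioned in the de novo item above; it is not a production assembler, and the reads and k value are invented:

from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers,
    edges connect the prefix and suffix of every k-mer seen in the reads."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

reads = ["GATTACA", "TTACAGA"]
for node, successors in de_bruijn_edges(reads, 4).items():
    print(node, "->", successors)
# e.g. GAT -> ['ATT'], ATT -> ['TTA'], TTA -> ['TAC', 'TAC'], ...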
In terms of speed and resource intensiveness, reference genome
mapping is simpler. At the same time, de novo methods produce
sequences that are free of errors associated with alignment
tools and can detect variations, such as structural variations,
that could be overlooked while aligning the sequence using a
reference genome.
In either case, scientists require a defined average depth
and coverage of the sequence reads over the entire genome
or targeted areas of interest. Depth is measured by how many
reads are stacked over a given locus of the genome. For de
novo assembly, a higher average depth is typically required,
which aids the development of large contigs (sets of overlapping
DNA segments that represent a consensus region of DNA) that
can form a physical map of the genome and be used to guide the
assembly of the draft genome. Higher average depth, in the case
of sequence alignment, means more confidence in the consensus
sequence of the sample and more accuracy in detecting variants
from the reference.
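A common back-of-the-envelope relationship for average depth is the number of reads multiplied by read length, divided by the size of the genome or target region. A minimal sketch (the example numbers are illustrative):

def average_depth(num_reads: int, read_length_bp: int, genome_size_bp: float) -> float:
    """Average sequencing depth (coverage) over a genome or target region."""
    return num_reads * read_length_bp / genome_size_bp

# Example: 600 million 150 bp reads over a 3 Gb human genome -> 30x coverage
print(average_depth(600_000_000, 150, 3_000_000_000))  # 30.0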
Post genome assembly, variant calling is carried out to
identify the variants or differences in the assembled genome
in comparison with the reference genome. These differences
could include single nucleotide variants (SNVs); smaller
insertions or deletions (indels); or larger structural variants,
such as translocations, inversions, and copy number
variants (CNVs).
SNV identification is used to identify germline mutations, and
it is therefore an integral part of genetics-based research. An
individual inherits these mutations from his or her parents, and
such mutations can occur with some (though limited) frequency
across a population, in which case they are termed single nucleotide
polymorphisms (SNPs). It is known that certain SNPs are associated
with particular traits, which makes them a widely used biomarker for
drug development and genetic studies. For SNP mining, reads generated
from multiple individuals are assembled against a reference genome to
identify sites with single nucleotide variation.
Variant calling depends on several compounding factors, which include:
Cloning process artifacts
Error rates associated with the sequence reads and mapping
Reliability of the reference genome
The final output of secondary analysis is a set of variant call format files (VCFs)
in which every variant present in the sequenced sample is annotated.
Tertiary analysis aims to make biological sense of the data generated
by secondary analysis, and is covered in more detail in the
following chapter.
COMPUTATIONAL AND STORAGE REQUIREMENTS FOR PRIMARY
AND SECONDARY ANALYSIS
With sequencing time and cost no longer a bottleneck, scientists
are now faced with computational analysis and storage
requirement challenges.
Primary analysis is largely provided by the sequencing instrument
providers; the analysis software is designed to keep pace with
the throughput of the sequencer. In contrast, secondary analysis
requires much more computationally intensive resources.
Secondary analysis performs a given set of algorithms and
bioinformatics steps on a per-sample basis. This repeatable process can
be placed in an analytic pipeline that's entirely automated. You
can fine-tune the pipeline by exploring and optimising your
parameter settings to make sure that steps have been segmented in the
most sensible way.
Often, pipelines are monitored through programs that have error
reporting, quality control, and performance metric logging to keep
improving them.
Over the past several years, the methods and algorithms designed
for primary and secondary analysis have matured and there
are many industry-recognised open source tools that are freely
available to use and modify.
For sequence alignment to a reference genome, the Burrows-Wheeler
Aligner (BWA) is an industry favorite for
its speed and accuracy. Bowtie is another popular software
application commonly used for sequence alignment, promoted
as an ultrafast, memory-efficient sequence aligner. Single
Nucleotide Variant (SNV) detection has gone through a few
generations of algorithmic improvements; GATK, an application
for identifying SNPs and indels in DNA and RNA-seq data,
has become a commonly used tool. Developed by the Broad
Institute, GATK is free for academic use; however, there is a
licensing fee for commercial use. FreeBayes, as its name implies,
is a free, open source tool and a good alternative to GATK, used to
find SNPs, indels, MNPs (multi-nucleotide polymorphisms), and
complex events.


DNAnexus Made Ridiculously Simple


David Shaywitz, MD, PhD
Chief Medical Officer, DNAnexus

In response to questions I receive from friends and colleagues who ask "What does DNAnexus do?",
I thought I might offer a high-level perspective.

WHAT IS DNAnexus?
DNAnexus is a professional grade platform that makes it
easier for users to do three things, each in a secure and
compliant fashion:
1. Analyze large amounts of raw genetic data
2. Collaborate around large amounts of data (including but
not limited to genetics)
3. Integrate genetic data with other types of data, such as
data from electronic medical records to advance science
and improve clinical care

(1) Analysis Of Raw Sequencing Data


The basic idea here is that the machines that are used to read
DNA sequence are incredibly powerful, but don't generate
a book of information that starts at the beginning of the
first chromosome and concludes at the end of the last one.
Rather, most sequencing machines spit out phrases of about
100 letters, phrases randomly located anywhere in the 3
billion letter book that is the human genome. A computer
must figure out where each individual phrase fits in the
book, and must also determine whether there are any typos.
This can be a computationally intensive task, but DNAnexus
provides a way to do this efficiently, by dividing the task into
multiple parallel streams, each of which can be tackled by a
powerful computer.
The computers DNAnexus uses are run by Amazon Web
Services or other providers, and our use of them is an
example of what's known as cloud computing, because the
computers operate from a massive, dedicated central facility,
rather than from a user's own institution. One advantage of
using cloud computing is that it's very much on demand, i.e.
you have essentially unlimited access to as many computers
as you need, and you only pay for the computers that you
actually use, and only when you are actually using them.

One example of the DNAnexus Platform's scalability was the


CHARGE Project, our collaboration with the Human Genome
Sequencing Center (HGSC) at Baylor College of Medicine.
As part of its participation in the CHARGE consortium,
the HGSC utilized the DNAnexus Platform to analyze the
genomes of over 14,000 participants, encompassing 3,751
whole genomes and 10,940 exomes. Over the course of a
four-week period approximately 3.3 million core-hours of
computational time were used, generating 430 TB of results.
This data was made available for worldwide integration and
collaboration to over 300 researchers.

(2) Distributed Collaboration


Progress in both science and medicine can be accelerated
when data can be easily shared. When there are large
volumes of data, as is increasingly the case in research and
clinical realms, this can be a real problem. Remarkably, the
most common method of large-scale data sharing today is
FedExing hard drives between institutions. What DNAnexus
enables for a distributed team of researchers or clinicians is
access to the same data, tools and pipelines at the same time.
By bringing together the data, the experts, and the tools for
analysis, DNAnexus facilitates collaboration and accelerates
understanding.
DNAnexus is ideally suited to power many types of data
sharing, involving:
NIH investigators (as in the case with our work with
CHARGE in the area of cardiovascular disease);
Federal agencies (our work with the FDA on the
precisionFDA platform, building a community to advance
regulatory science in the area of NGS);
Diagnostic companies (our work with Natera and CareDx);
Translational research partnerships (our work with
Regeneron and Geisinger Health System);
Public/private partnership of cancer researchers (our work
with ITOMIC, led by the University of Washington's Tony Blau).


Our ability to support distributed innovation also enables


DNAnexus to provide global support for commercial consortia
which have been created by companies like Natera. DNAnexus
provides a key component of Natera's Constellation
bioinformatics platform which, combined with assay kits and
protocols that Natera distributes, allows global sequencing
labs to access the same analysis pipelines and algorithms that
Natera employs in their central laboratories for applications
such as NIPT and cell-free DNA analysis in oncology.


(3) Integration With Other Data Types

The insights that may be available in genetic data are often
revealed only when the information is considered and
analyzed in the context of other data types, such as data
from electronic health records (EHR). Integrating genetic
and EHR data is fundamental to the drug discovery work of
Regeneron, for example. In the same way our partners can
easily access and efficiently utilize the fundamental tools of
genetic analysis on our platform, so too can they access and
utilize the tools required for integrating genetic data with
other data types. DNAnexus is adding tools constantly, based
on the needs expressed by our partners.

LOOKING AHEAD
Guided by the visionary partners with whom we are privileged
to work, DNAnexus continues to enhance our abilities
within each of these three areas: DNA analysis, distributed
collaboration, and integration with other data types. We are
constantly seeking opportunities to leverage the technology
we've developed through collaborations with innovative
leaders looking to use the power of our platform to approach
compelling scientific and clinical challenges.

END-TO-END WORKFLOW
(Diagram: LIMS, upstream systems and sequencers feed into the platform for analysis and collaboration by
research, government, clinical and pharma users; integrated partner solutions, analysis tools such as GATK,
interpretation/annotation and databases feed a final report.)
A flexible enterprise-grade platform for organizations pursuing genomic-based approaches
to health. Laboratory Information Management Systems (LIMS) and sequencing
instruments easily integrate with DNAnexus, as well as downstream tertiary analysis and
reporting solutions.

@DNAnexus | info@dnanexus.com | www.dnanexus.com


Although the analysis tools listed above are packaged for a single purpose,
they can be linked together to form a secondary analysis pipeline. This
can be done manually on your local cluster, with a considerable amount
of I.T. customisation, or through an open source bioinformatics platform
like Galaxy, or commercial bioinformatics platforms like Seven Bridges
and DNAnexus. If you have a very good grasp of the I.T. required, and
the scope to carry it out, an open source platform might be a good fit.
However, it will require a more continuous I.T. effort than you
would need to commit to a commercial solution.
An example of a typical secondary analysis pipeline (a minimal scripted
sketch of these steps follows the list):
1. FASTQ input from primary analysis, typically conducted on the
sequencer.
2. BWA or Bowtie for mapping to the reference genome, which
generates a BAM file.
3. GATK or FreeBayes takes the BAM file and identifies variants in
the donor relative to the reference genome.
4. The output is a VCF file, which lists all the donor's variants in relation to
the reference.
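Below is a minimal scripted sketch of how these steps might be chained; the file names are invented, the commands reflect common usage of BWA, SAMtools and FreeBayes, and real pipelines add read-group tagging, duplicate marking and further QC steps:

import subprocess

# Illustrative only: file names are invented.
reference = "reference.fa"

# 1. Map reads to the reference with BWA-MEM (FASTQ in, SAM out)
with open("sample.sam", "w") as sam:
    subprocess.run(["bwa", "mem", reference, "sample_R1.fastq", "sample_R2.fastq"],
                   stdout=sam, check=True)

# 2. Convert and sort the alignments into a coordinate-sorted, indexed BAM file
subprocess.run(["samtools", "sort", "-o", "sample.sorted.bam", "sample.sam"], check=True)
subprocess.run(["samtools", "index", "sample.sorted.bam"], check=True)

# 3. Call variants against the reference, producing a VCF
with open("sample.vcf", "w") as vcf:
    subprocess.run(["freebayes", "-f", reference, "sample.sorted.bam"],
                   stdout=vcf, check=True)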
Today, many organisations are participating in global large-scale
sequencing projects to study thousands or even millions of genomes,
making the challenge of storing and managing NGS data more critical.
According to a recently published paper in PLoS Biology, 'Big Data:
Astronomical or Genomical?', between 100 million and 2
billion human genomes are expected to be sequenced by 2025. The
storage capacity required for this alone is pegged at ~2–40
exabytes per year (1 exabyte = 10^18 bytes), which exceeds the projected data
storage requirements of three other major big data generators:
YouTube (a projected 1–2 exabytes per year), Twitter
(estimated to require 1–17 petabytes per year; 1 petabyte = 10^15 bytes)
and the Square Kilometer Array, or SKA (which
might create a demand for 1 exabyte of data storage capacity).
On average, the storage space required for analysing a whole
genome on an Illumina HiSeq is ~200 GB. Considering the variation
across human genomes, the storage requirements for a large-scale
genome sequencing project are huge. For example, the 1000
Genomes Project comprises more than 200 terabytes of data for its
1,700 participants. The analysis costs associated with such a large
project may sometimes exceed reagent costs, given how significantly
the cost of sequencing itself has fallen.
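As a rough back-of-the-envelope sketch of how quickly this adds up, using the ~200 GB per genome figure quoted above (the cohort sizes are illustrative):

GB_PER_GENOME = 200  # approximate working storage per whole genome, as quoted above

for genomes in (1_000, 100_000, 1_000_000):
    total_gb = genomes * GB_PER_GENOME
    print(f"{genomes:>9,} genomes -> {total_gb / 1_000_000:,.1f} PB")
# 1,000 genomes -> 0.2 PB; 100,000 -> 20.0 PB; 1,000,000 -> 200.0 PB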
CLOUD COMPUTING AS A SOLUTION
Converting DNA into meaningful genetic information involves
extensive computational resources dedicated to the application of
bioinformatics for secondary analysis, not to mention considerable data
storage capacity. With research projects involving the sequencing
and analysis of tens of thousands to millions of genomes becoming
the norm, many organisations are finding that their local clusters
can't keep pace with the sequencing volume. The cloud is currently the
technology best able to keep pace with data on this scale.
Accordingly, the genomics industry is turning to cloud approaches to
meet its need for scalable computation and storage.
Cloud service providers like Amazon Web Services or Google Cloud
offer scientists access to powerful computational resources without
the investment in costly on-premise infrastructure. Users are able
to access the resources on an as-needed basis to run big data
genomic pipelines, store petabytes of data, and share results with
collaborators around the world.
While some organisations are capable of building and housing the
data storage and computational resources to analyse large genomic
datasets, it may not be the most cost-efficient way to approach
this. One of the major advantages of cloud-based approaches is
the flexibility they offer. In many cases, your demands on your
computational resources will have peaks and troughs. Depending on
your usage demands and patterns, a pay-as-you-use model may be
more cost-efficient than building everything in-house.
This complexity is why many organisations choose to use a
commercial bioinformatics platform built on top of the cloud: they
get the scalability benefits, without the need to work on software
updates, optimisation, authentication, authorisation, collaboration,
security, or compliance issues.
Many of the popular bioinformatics applications for research in
genomics are parallelisable, which makes them more suitable for
running in a cloud environment. While larger-scale users tend to
have clusters in-house, many of their workloads are erratic and need
integrated ways to push analysis to the cloud when they lack enough
compute resources in-house. Some commercial providers even aid in
building such hybrid clouds. For example, Seven Bridges is participating
in collaborative research with the Precision Medicine Initiative in the
United States by helping the Million Veteran Program more easily
create a hybrid cloud. The cloud is actually making companys existing
infrastructure investments more valuable, they do not need to worry
about overprovisioning, when they need more compute resources they
can burst into the cloud, rather than increase hardware investment.
While cloud service providers, such as Amazon Web Services and
Google Cloud Platform, do support data management, storage,
compute, and security and compliance tools, there are still gaps.
When users DIY on the cloud they still need to implement a formal
Information Security Management System to ensure the highest level of
compliance with clinical regulations.
As big data moves to the cloud, new standards will need to
emerge for discovering and querying datasets, as well as for
authenticating requests, encrypting sensitive information, and
controlling access. The Global Alliance for Genomics and Health
and others are working together to develop approaches that
facilitate interoperability.

SUMMARY
In this chapter we have looked at one of the most important
parts of genomics: turning raw data into something you can use.
For microarrays, there is a wealth of standardised, easy to use
analysis options. For NGS, things are a little bit more complicated
and potentially require considerably more resources.
In the next chapter we take a closer look at what you can do with
your NGS data to add biological context to it. n

ON AVERAGE, THE
STORAGE SPACE
REQUIRED FOR
ANALYSING A WHOLE
GENOME ON AN ILLUMINA
HISEQ IS ~200 GB.

CHAPTER 4:
NGS INTERPRETATION AND DISCOVERY

INTRODUCTION


Interpretation and discovery is where the genome meets medicine.

Effective interpretation and discovery requires managing and


using data on an unprecedented scale, and connecting vast
data collections around the world. Limitations and challenges in
interpretation have become a critical bottleneck in the progression
of personalised medicine. For precision medicine to become a
standard part of global healthcare, clinical diagnoses need to be
made and confirmed at the same speed with which we perform other
complex tasks by accessing large-scale data over the internet.

Up to this point we have explored the history of sequencing, and the


steps required to generate high-quality sequence data. But reading
the genome sequence is just the first step. Turning this information
into meaningful insights into disease involves correlating variation in
the sequence with phenotypes, namely diseases or other traits. This
can be done for a single patient, as clinical interpretation, or at largescale for research and discovery. In this chapter we will explore the
complex challenges associated with mining vast genomic datasets for
actionable information.
Developments in sequencing technology, explored in previous
chapters, have created a tidal wave of genomic data available
to clinicians and scientists. Capturing the full medical benefit of
this information requires the ability to go beyond scanning for
what is already known (regions of the genome that are known
to be linked to disease and are well-documented in reference
databases and panels) and begin to efficiently interrogate whole
exomes and genomes.
For example, research into rare diseases has revealed that the majority
of disease-causing variants have either never been seen before, and
so are not in the literature or existing gene panels, or are found in the
patient but not inherited from either parent (de novo variants).

(Diagram: Data Generation → Genomic Big Data → Interpretation & Discovery → Precision Medicine.)

There are three main challenges associated with the actual process
of genomic interpretation and discovery:
Scale: The vast size and complexity of raw genomic data.
Power: Limited diagnostic and discovery yield when we seek to fully
exploit all of the available data.
Reach: The increasing need to connect data sets and link
interpretive tools worldwide.
Finally, we shall take a look at the regulatory issues surrounding unknown
variants. A single NGS test has the potential to identify thousands of
variants that could be used in a diagnosis, but in order to form the basis
of a diagnosis the test must meet regulatory standards. We shall explore
the regulatory challenges surrounding interpretation and discovery.

The sheer volume of data generated by NGS is creating a bottleneck, slowing
the development of precision medicine.

SNIP SNIP
SNPs, single nucleotide
polymorphisms, or
'snips', are one of the
most common forms
of genetic variation. A
SNP is a single base-pair
mutation at a specific
location in the genome.
In humans SNPs can
be associated with
disease susceptibility.
Conditions such as
sickle-cell anaemia and
cystic fibrosis have been
linked to specific SNPs.

SCALE
Only a few years back, genomics was relatively data poor. SNP
genotyping enabled broad coverage of the genome but was useful
principally for identifying common variants and assigning risk for
common diseases. Finding rare variants was slow, painstaking and
expensive, requiring sequencing of individual genes and the steady
but very slow compilation of disease-linked variant panels.
The good news was that genotyping in this way did not generate
very large quantities of data. What data was produced could be
stored and analysed using standard database technology.
The emergence of NGS changed all of that. Following the advent
of advanced sequencing techniques, the raw data from a single
exome, the coding region of a genome, now can weigh in at more
than 10 gigabytes of data (depending on read depth). An entire
human genome at 30X depth generates a file of approximately 90
gigabytes. In context, a computer with a 1 terabyte hard drive can
store fewer than ten individual genomes, hardly enough for a largescale research project. Data on this scale, particularly querying
thousands of genomes simultaneously, is overwhelming for
standard databases and all but the largest IT systems. Even when
these data can be stored, mining sequences intensively is extremely
slow because the time it takes to complete a computation is limited
by the input/output channel of information. This issue is further
compounded as the number of analysed samples increases.
The answer to this problem is to develop new and more
computationally efficient data architectures. The most widely-used
solution for large-scale research and diagnostics is the
genomically-ordered relational database (GORdb), developed a
decade ago for the world's largest population genomics research
effort, genetics in Iceland. Offering a different approach to data
storage and retrieval, this system is now being further refined and
deployed around the world.


Traditional relational database systems were designed principally for


finance and banking, to perform lots of small operations on relatively
simple datasets of reasonable size. This makes them ill-suited to the
task of dealing with high volumes of sequence reads and variation data,
because they are built for lots of small transactions and have legacy
command and data structures made to perform those original tasks.
By contrast, GOR was designed from the ground up specifically
to address the need to store and interrogate genomic data. GOR
databases resolve issues with input/output lag time and computer
crashes by storing sequence data according to its inherent structure
(its position on human chromosomes), with underlying data structures
and commands created specifically for genomically-ordered tables. This
enables genomic data and annotation data, including reference datasets,
to be stored and updated separately and queried rapidly. Applications
built on GOR databases can retrieve individual reads quickly, correlating
only the relevant bits of data before moving on. Massive amounts of
sequence data can be interrogated in minutes rather than days or weeks.
In a clinical context this allows for the rapid filtering of patient
genomes against gene lists, public and private gene and annotation
databases, all of which can be harmonised to the GOR format.
In research, case-control analyses involving tens of thousands of
genomes can take minutes rather than hours or days.
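As a toy illustration of the general idea of genomic ordering (a sketch of position-sorted lookup, not the actual GOR implementation; the records are invented), keeping variant rows sorted by chromosome and position lets a query jump straight to the relevant region with a binary search rather than scanning an entire table:

import bisect

# Toy table of variant records, kept sorted by (chromosome, position)
variants = [
    ("chr1", 10_177, "A", "AC"),
    ("chr1", 54_712, "T", "C"),
    ("chr2", 1_234_567, "G", "A"),
    ("chr6", 32_007_887, "C", "T"),
]
keys = [(chrom, pos) for chrom, pos, _, _ in variants]

def query(chrom, start, end):
    """Return all records overlapping [start, end] on a chromosome via binary search."""
    lo = bisect.bisect_left(keys, (chrom, start))
    hi = bisect.bisect_right(keys, (chrom, end))
    return variants[lo:hi]

print(query("chr1", 1, 100_000))  # both chr1 records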

POWER
The ability to mine sequence datasets is only useful if potentially
pathogenic variants can be efficiently identified. The diagnostic yield
of a system is directly related to its ability to compare samples to
databases of existing genomic data, access to extensive reference
libraries, and an ability to predict deleterious variants, even if they are
novel and not previously annotated. Creating a seamless link between
this information and the clinic is a critical step in advancing both
precision medicine and genomic research.


Sequence information
Looking for disease-causing variation within the genome typically begins
with checking the sequence against lists of known disease-associated
genes and reference sources. In clinical diagnostics, for example, this
could involve using a gene panel test that examines specific regions of
the genome looking for known alterations that are linked to disease.
For example, the TaGSCAN (Targeted Gene Sequencing and Custom
Analysis) screening panel examines 514 genetic regions that have been
associated with childhood diseases. There are gene panels available
from a wide array of companies that can be used for the identification of
carrier status, assessment of disease risk, and diagnoses.
While this approach is a valuable start in interrogating a genome, detailed knowledge of individual variants (the 'known knowns') is not always extensive enough to obtain an answer. In fact, this approach will only solve 20-25% of rare disease cases.
If comparative methods fail to generate a result, the next step is typically a systematic search of the genome that filters the information for a range of different genetic features (a minimal filtering sketch follows this list). These include:
Population Allele Frequency: Tools like the Exome Aggregation Consortium (ExAC) allow researchers to identify rare, potentially disease-linked variants within a cohort of over 60,000 individuals.
Variant Impact: Tools like the Variant Effect Predictor (VEP) can predict the impact that identified variants will have on genes, transcripts and proteins. This analysis is based upon the location of a variant within a gene and the expected effect that a mutation will have on its product (most often a protein).
Inheritance Model: Such as autosomal dominant or recessive. This also includes de novo mutation, when a variant has arisen spontaneously and has not been inherited.
Paralogs: These are usually silent second copies or versions of genes that have been kept in the genome over the course of evolution, but can still be or become functional.
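The sketch below is a minimal illustration of that filtering step, using hypothetical variant records and thresholds rather than any particular vendor's pipeline; real workflows draw the allele frequencies and impact predictions from resources such as ExAC and VEP.

# Hypothetical variant records; in practice these fields come from annotation tools.
variants = [
    {"gene": "CFTR", "allele_freq": 0.0001, "impact": "missense", "genotype": "hom"},
    {"gene": "BRCA2", "allele_freq": 0.12, "impact": "synonymous", "genotype": "het"},
]

DAMAGING_IMPACTS = {"frameshift", "stop_gained", "splice_donor", "missense"}

def passes_filters(variant, max_freq=0.001):
    rare = variant["allele_freq"] <= max_freq          # population allele frequency
    damaging = variant["impact"] in DAMAGING_IMPACTS   # predicted variant impact
    recessive = variant["genotype"] == "hom"           # simple recessive inheritance model
    return rare and damaging and recessive

# Only the rare, damaging, homozygous CFTR variant survives the filters.
print([v for v in variants if passes_filters(v)])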

A crucial component of the effective use of this form of genome exploration is the ability to instantly visualise raw genomic data reads, showing what is happening at each base in a sample. This enables users to visually confirm statistical or summary findings and to check the quality of the sequencing. If data is normalised, any variants can be viewed in the context of standard and customised or proprietary reference sets, such as the Human Genome Project, all in the same format.
Once the exploration of the genome has revealed an interesting
variant, its effect on human phenotype can be indicated or confirmed
in another genome with the same disease, usually through a search
of the published medical literature. There currently exist over 60
million articles and guidelines, and new information is constantly
becoming available, once again presenting a significant problem for
data access and handling. Initiatives are underway to centralize this
information, such as the government-funded ClinGen and ClinVar
systems, which aim to create an open-access resource that defines
the clinical relevance of genes and variants respectively.
Another approach for replicating or confirming variants that may not yet be in the literature or databases is the development of data-sharing beacons. One such beacon has been developed by the NIH, and another by the Global Alliance for Genomics and Health (GA4GH). Beacons are web servers housing genome data, submitted by contributing institutions, which can answer specific questions about the presence or absence of particular alleles at a particular genomic location. A researcher can ask a beacon 'Do you have any genomes with an A at position X on chromosome 6?' and the beacon responds simply 'Yes' or 'No'.
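A minimal sketch of such a query is shown below. The URL is a placeholder and the parameter and response field names follow the general GA4GH Beacon convention; any real beacon's documentation should be checked before relying on them, and the requests library is assumed to be available.

import requests

BEACON_URL = "https://beacon.example.org/query"  # hypothetical beacon endpoint

params = {
    "assemblyId": "GRCh38",
    "referenceName": "6",      # chromosome 6
    "start": 32578775,         # hypothetical position X
    "referenceBases": "G",
    "alternateBases": "A",     # "do you have any genomes with an A here?"
}

response = requests.get(BEACON_URL, params=params).json()
# The beacon answers only with presence or absence, never with individual genomes.
print("Yes" if response.get("exists") else "No")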
As well as interrogating the sequence data, there are databases
available for exploring the biology of genetic diseases. Archives such
as the Genetic Association Database (GAD) and The Online Mendelian
Inheritance in Man (OMIM) contain detailed information on genetic
diseases, their biology, their relationship to relevant genes and the
complexity of gene(s) associated with the disease.

Visualisation of raw genomic data reads shows what is happening at each base in the sequence, and how that compares to reference sequences.

EXPLORING A QUADRILLION BASES TO FIND THE ONE THAT MATTERS

THE OS OF THE GENOME

NO ONE PUTS THE FULL POWER OF THE GENOME AT YOUR FINGERTIPS LIKE WUXI NEXTCODE.
INTERPRET CLINICAL CASES WITH UNRIVALLED POWER; MINE POPULATION WGS IN MINUTES;
JOIN COHORTS ON MANY CONTINENTS. ALL YOUR DATA AND RESULTS BACKED BY ALL
KEY GLOBAL REFERENCE SETS AND COLLECTIONS IN ONE SYSTEM, ONE FORMAT, AT RAW
RESOLUTION AND IN REAL TIME.

FIND OUT WHY IN HEAD-TO-HEAD COMPARISONS AGAINST EVERY PLATFORM ON THE PLANET, WE TAKE OUR PARTNERS FARTHER AND FASTER.
GENOMIC BIG DATA SOLVED
Speed. Our Genomically Ordered Relational Data (GOR)
model does what others can't: it enables on-the-fly queries
and joins of massive sequence data, wherever it resides. By
relying on the structure of the genome itself, it can perform
complex analyses in minutes, not weeks, and provides
always-on raw sequence visualization at a click.
Scale. GOR technology manages, mines, interprets and
connects more genomes than any other. It is the only
system built and proven at population scale, underpinning
the power of our tests and scans, as well as the largest and
leading precision medicine efforts on three continents.
The global standard. This scalability, coupled with our
pioneering Deep Learning capabilities, creates an insight
engine to benefit patients everywhere. Our unparalleled
dynamic knowledge base includes curated data from all key public as well as the world's largest proprietary genomics datasets.
SEAMLESS DIAGNOSTICS AND DISCOVERY
Broad. Our expertise and tests span rare disease,
cancer, public health and wellness.
Powerful. We don't rely solely on the known annotations
that panel tests do. By systematically scanning the entire
genome, our system has solved years-long diagnostic
odysseys in hours and points straight into the biology to
find actionable results.

Efficient. We let you instantly filter variants by frequency, impact, mode of inheritance, scan for de novos, and validate immediately by viewing raw reads - shaving days from once tedious workflows.
Integrated. Since so many causative variants are novel, we
use GOR to bring diagnostics and discovery together.
Toggle between clinical and research datasets, increasing
diagnostic yields, uncovering new targets, and following them
up in fast and customized case-control studies.
A GLOBAL ECOSYSTEM FOR PRECISION MEDICINE
Soup to nuts, worldwide. Our menu of products and services includes CLIA-certified sequencing, secondary analysis, data storage, diagnostics, discovery and product development. Choose a la carte or get a complete turnkey workflow.
Let the cloud do the heavy lifting. As well as onsite
installation, you can also run our entire system in the cloud.
You'll get elastic scalability, compliance and the best-in-class
security that our partnership with DNAnexus brings.
The internet of DNA. Powered by GOR, the largest global
network of genomes makes it possible to collaborate
instantly with colleagues and institutions around the world,
using full-resolution sequence data without moving big files.
This capability accelerates diagnostics and discovery, to the
benefit of patients and populations everywhere.

sales@wuxinextcode.com
Shanghai | Cambridge | Reykjavik


On the basis of these analyses, gene variants are typically classified as pathogenic, benign, or variant of unknown significance (VUS).
This information can be used to support a clinical genetic report,
which we will explore in more detail in the following chapter.
Case-control studies
Comparing disease and control cases across large groups of patients
presents a very different interpretive challenge. Here the genomes of
hundreds or even thousands of patients who have a particular disease
may be compared to the genomes of many times more control
subjects. Genome-wide association studies (GWAS) have been used to
compare many common genetic variants to identify loci that may be
linked to disease (typically using genotyping arrays). With the falling
costs of whole genome sequencing (WGS), however, many researchers
are now turning to WGS analysis to achieve base-level granularity and
detect disease associations even with low-frequency alleles.
The growing number of national genome projects, such as the
UKs 100,000 Genomes Project and the Qatar Genome Project,
are conducting this research as part of the wider integration of
genomic medicine into their national health programmes. This
process requires enormous quantities of data, and is an area where
improved database architecture has enormously improved the
speed of genome interrogation.
In rare disease cases, particular variants are rare but many may cluster in a particular gene. To increase the power of a case-control analysis, variant aggregation analysis tools can be used to collapse all variants in a gene associated with a disease into one pseudo-variant. This gene-based collapsing can substantially increase the power of association studies.
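The sketch below shows the shape of such a gene-based burden test, with invented carrier counts and scipy assumed to be available; production tools use more sophisticated collapsing and covariate adjustment.

from scipy.stats import fisher_exact

# Hypothetical counts: individuals carrying any rare, qualifying variant in one gene.
case_carriers, case_total = 18, 1000
control_carriers, control_total = 21, 10000

table = [
    [case_carriers, case_total - case_carriers],
    [control_carriers, control_total - control_carriers],
]

# One test per gene on collapsed carrier status, instead of one per individual variant.
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio {odds_ratio:.1f}, p = {p_value:.2e}")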
One such utility is the identification of rare variants that point to
potential drug pathways in common diseases: that is, to use rare
disease genetics not only for the diagnosis of individuals but to develop
drugs to treat common diseases or phenotypes. An example of this is
PCSK9, which was identified as a potent modulator of LDL cholesterol
levels through gene discovery in families with rare variants. Subsequent
discovery work on the pathway established that compounds inhibiting
PCSK9 could lower LDL cholesterol, an important public health
impact especially for those resistant to or intolerant of statins, the
most common cholesterol-lowering drugs. Two such inhibitors were
approved by the FDA in 2015 and are now on the market.

REACH
Storing, accessing and mining data in situ is the first part of the
challenge surrounding interpretation and discovery. The second
part, which is set to be a crucial game-changer, is the ability to
work with these datasets online, from anywhere in the world. The
beacons being developed by Global Alliance are a simple example
of how this can work, but for the future it will become critical for
researchers and clinicians to go beyond asking basic questions.
Given the scale of genomic data, the standard approach is to
hold the genomic information in one central database and allow
researchers and clinicians remote access, sometimes accompanied

by a suite of cloud-based analytical tools. Genomics England and the 100,000 Genomes Project provide a good example of this. De-identified patient data is placed in a central data centre, accessible by approved researchers, doctors, nurses and other healthcare professionals. Research users have restricted, remote access to datasets that contain only the information needed for their specific study. Genomics England are able to provide IT support, computing infrastructure, genome analysis tools and other technical services through partnerships with a range of companies and organisations.
One of the critical future challenges will be to enable researchers to interrogate multiple large data stores at once, rather than one by one. Given the amount of sequencing going on around the world, and the number of large-scale projects aimed at discovery for improving diagnostics and therapies linked to them, this is an area of great potential.
In the spring of 2016 the Simons Simplex Collection in autism, comprising
some 10,000 whole exomes, was made available in GOR format on the
WuXi NextCODE Exchange to the global autism research and clinical
community. This represents the first online use of large-scale genome
data at full resolution, and points to the potential and likely rapid
development of this capability in the near future.

REGULATORY CHALLENGES
As we have already discussed, NGS produces enormous quantities
of data, and has the potential to identify thousands of variants
that may be disease-linked. This creates a significant challenge for
regulators. For a diagnostic genetic test to gain regulatory approval,
and so be clinically useful, the U.S. Food and Drug Administration
typically requires that the variant identified by the test has been previously reported, and is known to be associated with a disease. Test developers must
show clinical significance before approval can be given.
However, to date a relatively small number of disease-linked
variants have been identified, and often NGS tests are used
precisely because they can routinely detect rare variants that may
not be identified by established tests.
At present, discussions around how to effectively regulate NGS tests are ongoing, and one focus is to evaluate the methodology of NGS interpretation as well as previously observed variant-disease links, combined with an ongoing assessment of the clinical outcomes.

ON TO THE CLINIC
As we have outlined in this chapter, the challenges involved in
interpreting the tidal wave of NGS data are considerable, but not
insurmountable. There are numerous initiatives underway to ease
the strain in the bottleneck and speed the development of precision
medicine systems.
In the next chapter we will look at genomics in the clinic, where the
results of interpretation and discovery are turned into actionable
results for clinicians and patients.


CHAPTER 5:

NGS IN THE
CLINIC
SPONSORED BY


INTRODUCTION
In the preceding chapters we have explored the process of
collecting and analysing genomic sequence data, in a manner that
could be applied to both research and to clinical diagnostics. For
this chapter we will be focussing entirely on genomics in the clinic,
and how the outcomes of genomic tests are communicated to
clinicians and patients by the clinical laboratories conducting them.
Creating an accessible, useful report based on NGS information
and analysis for a physician is one of the most challenging areas
of clinical genomics. High quality patient care is dependent on a
written report that is easy to understand and easy for physicians
and genetic counsellors to act upon.
Traditionally, laboratory tests have looked for specific genetic variants with known disease outcomes. But with the tidal wave of information generated by NGS tests, clinical laboratories are faced with the thorny issue of how to present information on thousands of genetic variants, many of which will have inconclusive clinical outcomes.

A diagram of gene structure, showing exons (coding) and introns (non-coding) between the start and end of the gene.

The balance between what to include in a molecular genetics report is essentially a Goldilocks problem: a report should aim to provide
just enough information, but not too much; just the right level of
detail, but not too complicated; and to be succinct, but still include
all the relevant information.
We have previously examined the different steps involved in
evaluating the clinical relevance of different variants. Here we will
take a look at what happens in the clinic, starting with selecting
the right test for a patient, and how to present the test outcomes
in an effective clinical report. We will look at the approaches and
challenges associated with clinical reporting, and explore what
current best practice looks like.

CHOOSING THE RIGHT TEST


There are four main types of NGS tests available to patients, which
provide varying degrees of genome coverage and varying levels
of diagnostic detail. The type of test that a clinician will choose is
generally determined by a patients symptoms and medical history.

Exon

Intron

Start of gene

Targeted gene panel testing


A targeted gene panel test is typically used when a patients
symptoms and medical or family history strongly indicate a
particular genetic condition associated with a small number of
specific genes. Targeted panels explore the exons (coding regions)
of between 20 and over 100 genes known to be disease-linked,
and while the test examines several different genes the analysis
will be targeted on a specific genetic condition. Diseases that
targeted gene panels have been developed for include epilepsy
and hearing impairment.
Medical exome sequencing
Whereas a targeted gene panel looks for known or specific variants
in a few disease-linked genes, a medical exome test takes a broader
diagnostic approach, exploring the exons in up to 4,600 disease-linked genes.
Whole exome sequencing
Exploring the entire genome sequence is still relatively expensive, making whole exome sequencing a more cost-effective alternative for patients and healthcare providers. A clinician is only likely to order a whole exome or a whole genome test when a panel of genes is not available for a particular condition, or when the diagnosis is very unclear.
The human exome contains approximately 20,000 genes, so in the first instance the whole exome analysis will focus on genes believed to be linked to the patient's condition. However, the analysis can be extended to cover more exome genes if the first results are inconclusive.
Typically whole exome sequencing is conducted as a trio analysis,
meaning that both the patient and their parents will be sequenced
and analysed. This way genetic variants in the genes of the patient
can be compared with variants in their parents.
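As a rough illustration of the trio comparison, the sketch below (with made-up variant coordinates) treats a variant seen in the child but in neither parent as a candidate de novo event.

# Each set holds (chromosome, position, alternate allele) calls for one individual.
child = {("chr7", 117559590, "T"), ("chr2", 47403000, "A")}
mother = {("chr7", 117559590, "T")}
father = set()

# A candidate de novo variant is present in the child but absent from both parents.
de_novo_candidates = child - mother - father
print(de_novo_candidates)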
Whole genome sequencing
This approach sequences and analyses the patient's entire genome,
and is typically only used when a patient is very ill and previous tests
have proved inconclusive. At present, sequencing and interpreting an
entire genome is extremely costly, and the clinical usefulness of this
test can be impaired by the sheer quantity of data generated.


TO KNOW OR NOT TO KNOW?
During a whole exome test to diagnose a patient's rare condition, the clinical laboratory conducting the test makes a secondary finding, namely a mutation in the patient's BRCA1 gene that massively increases their lifetime likelihood of developing breast cancer. However, the patient has specifically asked not to be notified of secondary findings. What is the right course of action?
This debate is ongoing in the clinical genomics community. On the one hand it seems morally wrong not to inform the patient about a potentially lethal gene mutation that they are otherwise unaware of, even if they have not consented to receive that information. However, patients do and should have the right to autonomy over their medical information and how it is used.
One solution, recommended by the ACMG, is to have a minimum list of known conditions, such as BRCA1 mutations, which are routinely evaluated and reported on as part of a genetic test. These results would be reported without seeking preferences from the patient.
What do you think?

REPORTING THE RESULTS


As yet there is no agreed industry standard for reporting genomic test results, but several professional organisations have laid out guidelines for the sector. In the US, the American College of Medical Genetics and Genomics produced a document entitled Standards and Guidelines for Clinical Genetics Laboratories, while in the UK the Association for Clinical Genetic Science has produced a series of guidelines for genomic reporting.
The style and content of the report varies depending on the type of test ordered, but broadly speaking the majority of clinical genomics reports contain:
The patient's results and the interpretation of the results;
The location of the identified variant(s);
The type of test used, the methodology, and its limitations;
Any secondary findings, if requested by the patient.
The results of a genetic test show one of two things: either there is a deviation in the usual sequence of a gene found in the sample provided by the patient, or there is not.


During the interpretation and discovery process, a patient's genes (or entire exome/genome) are compared to those from other people with a similar condition, to identify possible disease-linked variants. As a result, the test may identify a variant that is definitely known to be the cause of the patient's condition, or to contribute to it in some way. This is a pathogenic or disease-causing variant.
Our knowledge of genetic variants and their associated diseases is still relatively small, and as a result a genetic test may well uncover a variant with an unknown or uncertain link to the patient's condition. This is a variant of unknown significance.
No two humans have identical genomes; genetic variation is a natural part of the genome. These variations may well be picked up by a genetic test, but because they are not linked to a disease condition they are called benign variants.
Finally, there is a chance that any genetic test may pick up a known disease-linked variant that is not associated with the patient's condition. This is called a secondary finding, and there is an ongoing and extensive ethical debate within the genomics community about how best to communicate these findings to patients.


THE ETHICS OF...


NGS tests are a relatively new clinical tool, and as a result there
is no one established industry standard for handling the array
of issues that can arise from NGS data. Here we consider three
of the major issues facing clinical laboratories and healthcare
practitioners: secondary findings, data re-analysis, and patient
access to raw data. While guidelines exist for how to approach all
three topics, the final decision is at the discretion of the individual
clinical laboratory.
Secondary findings
The more comprehensive a genetic test, the more genes sequenced
and explored, the higher the likelihood that the analysis will throw
up unexpected results or secondary findings. How to report
secondary findings, whether to report them, and issues around
patient consent are an ongoing area for discussion within the
research and clinical communities.
Secondary findings are genetic variants not linked to the
condition being tested for, which may be linked to other
disease conditions that could either affect the patient in later
life, or could affect the health of future offspring. During
genetic diagnostics many laboratories may conduct a separate
secondary finding analysis that specifically looks for 56 gene
alterations recommended by ACMG. Patients can choose to opt
out of receiving this information if they wish. Conditions on the
ACMG list include familial cancer predisposition, such as mutations in the BRCA1 or BRCA2 genes.
The likelihood of generating secondary findings is relatively low,
particularly in targeted gene panels that only look at a narrow
subset of specific genes. Even in whole exome sequencing the
chance is low if the analysis is restricted to a specific condition.
During a targeted analysis of a whole exome sequence it has been
estimated that an unexpected variant would occur once in every
100 tests. If the whole exome is searched that likelihood increases
to 3 in every 100 tests.
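In practice such a secondary-findings pass can be as simple as intersecting the genes carrying reportable variants with the laboratory's chosen list; the sketch below uses a tiny illustrative subset of the ACMG-recommended genes and respects an opt-out.

# Illustrative subset only; the real ACMG list contains dozens of genes.
ACMG_SECONDARY_GENES = {"BRCA1", "BRCA2", "MLH1"}

def secondary_findings(variant_genes, patient_opted_out=False):
    if patient_opted_out:
        return set()  # respect the patient's choice not to receive these results
    return set(variant_genes) & ACMG_SECONDARY_GENES

# A BRCA2 hit would be flagged for separate evaluation and reporting.
print(secondary_findings(["CFTR", "BRCA2", "TTN"]))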

Patient data access
Patients may well request access to the raw data from their NGS
tests, often because they are interested in looking for variants that
may be relevant to their condition that were not included in the
laboratory report. This enables patients to take control of their
information, allowing them to follow up on unreported findings
with their clinicians or to monitor the scientific literature for new
information about the function or disease association of these
variants. And as with secondary findings and data re-analysis,
the decision to provide this data to the patient, and the format
of the data (raw sequence data or VCF) is at the discretion of the
individual laboratory.
The key concerns around providing raw data to patients focus on
expertise, and the limitations of the tests themselves. A certain
degree of training is required to evaluate and interpret raw data files,
and the risk is that patients or clinicians may over-interpret the
findings, ascribing significance to certain variants whose effects
may be relatively benign or harmless.
NGS tests are also not completely accurate; there is a risk of false positives (identifying variants that are in fact not present) and false negatives (variants reported as absent that are actually present). In most cases clinical laboratories will use a second sequencing method, such as Sanger sequencing (see chapter 2), to confirm the validity of any findings and so address this problem. A patient looking at their raw data would not be able to confirm if a change is necessarily real.

Data re-analysis
Over time, as our knowledge of the genome increases, patients whose tests were previously unsuccessful may find themselves in a position to obtain a diagnosis. A crucial part of developing the clinical reporting system will be future-proofing, ensuring that patients can come back for a diagnosis as the science advances. Again, how to handle data re-analysis currently comes down to the discretion and capabilities of the individual clinical laboratory.
For example, how much patient data should a clinical laboratory store in order to support a re-analysis? As with the formatting and content of a clinical report, there is no hard and fast industry standard. One option is to store only the list of variants discovered by the test, in the variant call format (VCF), rather than the complete exome or genome sequence. This solution places less strain on a laboratory's data infrastructure, but there is the risk that the VCF file may not contain the relevant variant. The solution to this problem is to store the entire sequence data collected during the test, but as discussed in the previous chapter this can place a significant strain on a facility's data storage capacity.
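For readers unfamiliar with the format, the sketch below parses a single, purely illustrative VCF-style line to show what a stored variant list actually contains: coordinates and alleles, not the underlying reads.

# An illustrative tab-separated VCF record (not taken from any real sample).
vcf_line = "17\t41245466\t.\tG\tA\t99\tPASS\tGENE=BRCA1"

chrom, pos, var_id, ref, alt, qual, filt, info = vcf_line.split("\t")
record = {
    "chrom": chrom,
    "pos": int(pos),
    "id": var_id,
    "ref": ref,
    "alt": alt,
    "qual": qual,
    "filter": filt,
    "info": info,
}
print(record)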

ACMG RECOMMENDATIONS ON SEQUENCE VARIANT INTERPRETATION: IMPLEMENTATION ON THE BENCH NGS PLATFORM
Berivan Baskin(1), Ph.D, FACMG, FCCMG; Steven Van Vooren(2), Ph.D
1. MOLECULAR GENETICS LABORATORY, UPPSALA UNIVERSITY HOSPITAL, SWEDEN
2. CARTAGENIA INC (A PART OF AGILENT TECHNOLOGIES), 485 MASSACHUSETTS AVENUE, SUITE 300, CAMBRIDGE, MA 02139, USA

INTRODUCTION
In a recent paper, the American College of Medical Genetics and Genomics published standards and guidelines for the interpretation of sequence variants(1). The College made these available as an educational resource for clinical laboratory geneticists to help them provide quality clinical laboratory services. Although adherence to these standards and guidelines is voluntary and cannot replace the clinical laboratory geneticist's professional judgment, the recommendations represent a broad consensus of the clinical genetics community. With increasing volumes and the use of large gene panels (clinical, full exomes and even full genomes) in routine clinical genetics practice, labs need strong informatics tools that support them in the automation and standardization of variant assessment and reporting, in order to benefit from community standards and to keep up with the best standard of care. In this case study, we showcase how Cartagenia Bench Lab NGS enables labs to implement their take on the ACMG recommendations. The Molecular Genetics department at Uppsala University Hospital illustrates how it has implemented the recommendations in their specific routine diagnostic setting, using a flexible, drag-and-drop interface to build and store the lab's variant triage protocol.

KEY REQUIREMENTS
The standards and guidelines describe an evidence-based approach
for the assessment of variants of clinically validated genes. The
recommendations use literature and database-based criteria to classify
variants in five different categories: benign, likely benign, uncertain
significance, likely pathogenic and pathogenic. Evidence levels are
weighted (e.g. Strong, Moderate). To allow labs to automate their
implementation of this evidence-based approach, a number of specific
tools are required.
ANNOTATION SOURCES, SUCH AS POPULATION, DISEASE-SPECIFIC,
AND SEQUENCE DATABASES
The guidelines recommend the use of a wide range of criteria.
Examples include: population databases such as the Exome
Aggregation Consortium (ExAC, http://exac.broadinstitute.org/); disease
databases such as ClinVar (http://www.ncbi.nlm.nih.gov/clinvar), and
sequence databases such as RefSeq (http://www.ncbi.nlm.nih.gov/refseq/rsg). With Cartagenia's Bench NGS platform, labs can integrate
and use a wide range of community-accepted resources, including the tools and data sources recommended in the ACMG guidelines. Moreover, with Bench NGS, labs can benefit from full version control and traceability on these resources.

Figure 1. Partial view of the Uppsala University Hospital decision tree representing their filtration strategy, investigating public and in-house variant databases, modes of inheritance, population frequency statistics databases, and variant coding effect. Top: decision tree. Middle: currently selected ACMG category PP5. Bottom: variants matching selected criteria. (Courtesy of Dr. Berivan Baskin)
RULES FOR COMBINING CRITERIA TO CLASSIFY SEQUENCE
VARIANTS: DECISION TREES AND SCORING
The guidelines recommend a broad set of informative criteria for assessing the clinical impact of a sequence variant. With each criterion, they also provide a level of evidence strength. For example, for a de novo variant in a patient with the disease, no family history, and with both maternity and paternity confirmed, the evidence to classify the variant as Pathogenic is suggested to be Strong. Other levels are Very Strong, Moderate and Supporting. The guidelines also propose a scheme of rules by which labs can combine different criteria, with different levels of evidence, to classify variants. In order to automate such a scheme, a tool set is required to represent the rules as a workflow and to associate scores with variants accordingly. Bench NGS elegantly provides such a system by means of classification trees.
The user can choose from a library of filter components that each represent filter criteria, such as population frequency, and can drag-and-drop these into a decision tree.
Then, classifications (e.g. likely benign) can be assigned to the outcome of a particular branch, and labels can be used to annotate statuses, review actions or levels of evidence (e.g. PVS1, with which the guidelines represent very strong evidence of pathogenicity, or review, which prompts the lab to investigate a variant further).
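As a rough illustration of how weighted criteria can be combined automatically, the sketch below encodes a simplified subset of the ACMG combining rules (the full scheme in Richards et al. has many more branches, and the benign side is omitted here).

from collections import Counter

def classify(criteria):
    """criteria: list of evidence codes, e.g. ["PVS1", "PS3", "PM2"]."""
    weights = Counter(code[:2] for code in criteria)  # PV, PS, PM, PP prefixes
    very_strong, strong = weights["PV"], weights["PS"]
    moderate = weights["PM"]

    # Simplified subset of the combining rules for the pathogenic side only.
    if (very_strong >= 1 and (strong >= 1 or moderate >= 2)) or strong >= 2:
        return "Pathogenic"
    if (very_strong >= 1 and moderate == 1) or (strong == 1 and moderate >= 1):
        return "Likely pathogenic"
    return "Uncertain significance"

print(classify(["PVS1", "PS3"]))         # Pathogenic
print(classify(["PS1", "PM2", "PP3"]))   # Likely pathogenic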

IMPLEMENTATION
The molecular genetics laboratory at the Uppsala University Hospital has
implemented the ACMG guidelines on the Cartagenia Bench NGS platform
and validated their approach on a set of clinical cases. The lab has
implemented different criteria as well as levels of evidence in a decision
tree, partially shown in Figure 1. In this view, a validated pipeline is run on
a Connective Tissue Panel sample, showcasing a variant in the COL1A2
gene that is reported as clinically relevant. The protocol represented
by the tree has checked all variants in the assay, and highlighted the
p.Gly949Ser variant for review. The clinical geneticist then verifies
relevant sources in this case: ESP, 1000 Genomes, ExAC, HGMD, in silico
score annotations from ACMG-recommended SIFT, Mutation Taster and
PolyPhen, and a confirmed spectrum of missense mutations in the gene
at hand. Parental samples tested negative for this variant.
CONCLUSION
With this case study, the lab has illustrated how various features of the
Cartagenia Bench Lab NGS platform were used to implement an automated
Standard Operating Procedure that reflects how the lab performs variant
filtration. This case illustrates strong advantages in lab efficiency - whereas a
manual process of variant filtration is time-consuming and error-prone, the
lab benefits from automation of these manual protocols, freeing up time for
genetic specialists to focus on variant interpretation and reporting.
Notes
1. Richards et al., Genetics in Medicine, advance online publication 5 March 2015.
doi:10.1038/gim.2015.30
This article is adapted from Agilent Publication 5991-6387EN.
Cartagenia Bench Lab is marketed in the USA as exempt Class I Medical Device and in
Europe and Canada as a Class I Medical Device.

Cartagenia Bench Lab
Discover unique use cases and best practices on NGS for clinical genetics and pathology labs at www.cartagenia.com.
Efficient variant assessment
A clinical-grade solution
Access relevant content
Draft lab reports with ease


ELECTRONIC MEDICAL RECORDS


Many health services across the world have migrated, or are in the process of migrating, their patient health information into digital storage in the form of electronic medical records, or EMRs. Across healthcare in general this is seen as a vital step in creating a more integrated system, and for genomic information in particular, being able to append sequence information to a patient's EMR is a vital step in the development of precision medicine.
In an ideal scenario, following a genetic test a patient's report and all the associated raw data would be uploaded to their EMR. This information could then be accessed, with the patient's permission, by any clinician who might need to re-analyse the data or make a fresh diagnosis based on new research. Patients may also be able to access their data remotely and securely. And researchers and clinical laboratories would be able to access de-identified data from thousands of patients as part of genomic interpretation and discovery (see the previous chapter), enabling the uptake of genomic diagnostics into routine clinical care.
In reality, while a 1-3 page test report can easily be incorporated into an electronic medical record, there are significant difficulties associated with adding complete genomic data to a patient's record. Nevertheless, there are several large-scale projects underway aimed at creating a successful electronic database for genomic data that will benefit both patients and research.

One example is the Electronic Medical Records and Genomics (eMERGE) Network, a National Institutes of Health-organised and funded consortium of US medical research institutions that is developing research processes combining information from DNA biorepositories with clinical information stored in EMRs, in order to conduct large-scale genomic research.
Now entering phase III as of September 2015, the project participants are exploring the best avenues for incorporating genetic variants into EMRs for use in clinical care and diagnostics.

CONCLUSION
The use of NGS tests for clinical diagnostics looks set to become
part of routine healthcare practice, and with the development of
EMR systems the long-term benefits for medical research could
be significant.
However, there are many challenges that have yet to be solved, and
many tools and processes that need to be developed in order to fully
realise the benefit. Clinical reporting is set to evolve rapidly in the future,
as the cost of sequencing decreases and our knowledge of the genome
increases. Consequently best practices for analysis, interpreting variants
and clinical reporting will also continue to evolve.


CHAPTER 6:

EDITING THE
GENOME
SPONSORED BY


Painting from ancient Egypt depicting early domestication

INTRODUCTION
It is impossible to go to a conference in the genomics, molecular
biology, or synthetic biology space without hearing the terms
genome editing or CRISPR. Precise, specific, and controlled
genome editing is arguably the trendiest application in these spaces
at the moment, with active momentum building due to its bold
promises. It is hard to ignore a field that could change the face of
personalised medicine, with the potential to treat thousands of
currently untreatable diseases.
Today there are five papers published every day on CRISPR alone, an astounding number for a technology barely 3 years old! But before we get to that, we want to go back to humble beginnings, to where the genome-engineering journey began.
Human-influenced genomic modification of organisms is as old as selective breeding, which has existed, whether intentionally selecting for beneficial traits or unintentionally through domestication, for millennia. The direct modification of organisms using targeted methods has existed for around four decades.

HOMOLOGOUS RECOMBINATION-MEDIATED GENE TARGETING
During the early microinjection studies of the late 70s it was discovered that the success rates of exogenous DNA expression in mammalian cells could be greatly increased if the DNA being introduced also contained a viral DNA sequence on either end. In the early 80s scientists discovered these same viral sequences within cell line genomes, with the exogenous DNA inserted at those sites. Using this paradigm, foreign DNA could be inserted anywhere into the genome so long as there existed regions of homology. By the mid to late 80s researchers were designing constructs to integrate foreign DNA into the genomes of a number of organisms, aiming to disrupt genes and thereby discover gene or pathway form and function.
Over 7,000 genes and regulatory elements have had their function inferred since the late 80s thanks to this method of gene targeting. Knockout mutants or point mutation variants were created and, depending on the resulting phenotype, function could be inferred. The importance of this technology led to Drs Capecchi, Evans and Smithies co-winning the 2007 Nobel Prize in Physiology or Medicine.
Homologous recombination-mediated gene targeting occurs via a process called strand invasion, part of the homology-directed repair of double-stranded DNA breaks. A successful insertion event therefore relies on a randomly occurring double-stranded break existing at the desired position, which makes experimental efficiency low.
Successful modifications occur at best in 0.1% of cells. Site-specific
recombinase enzymes were developed as an adjunct to traditional
homologous recombination, increasing the integration efficiency,
though at least one recombinase-free recombination event was
still required for success. Genome editing protocols usually
required months of in vitro fertilisation and crossbreeding to find
a double mutant for a desired allele. Add that to the challenge of
uncontrollable integration of plasmid DNA throughout the genome
through the non-homologous end joining repair machinery, and the
stage was set for a more efficient technology to be developed.


TECHNOLOGY SHOWCASE: DEVELOPING MOUSE MODELS FOR CYSTIC FIBROSIS
Cystic Fibrosis patients carry a mutation in a chloride ion channel, causing mucosal tissues to function incorrectly, leading to impaired mucosal secretion and damage to the intestinal tract. Patients also often suffer severe and chronic infection from over 50 pathogenic or opportunistic species.
Much of the pathophysiology of this life-shortening disease was learned through early homologous recombination-generated mouse models. One important area has been the study of highly complex bacterial biofilms within the CF-model mouse lung. Importantly, co-infection with multiple pathogens, e.g. Pseudomonas aeruginosa and Burkholderia cenocepacia, led to both increased inflammatory responses and the establishment of chronic infection in mice. In the last 5 years a number of potential anti-biofilm drugs have been tested on the CF-model mice in the hope that they could extend patients' lives.


ZINC FINGER NUCLEASES (ZFNs)


Targeted, site-specific enzymes that facilitate
directed cuts at any point in the genome
arrived on the scene with the introduction
of zinc finger nucleases, commonly known
as ZFNs, around 20 years ago, promising to
dramatically improve genome-editing efficacy.
ZFNs are synthetic, modular proteins. They
consist of DNA binding domains sourced
from the active site of transcription factor
proteins, each approximately 30 amino
acids in length, stabilised by a zinc ion. Each
ZF is engineered to bind a specific triplet of
nucleotides, and ZFs can be engineered to
bind almost any triplet sequence.
Modularly adding triplet-binders to an endonuclease allows sequence-directed cuts to be induced anywhere in the genome. In order to maximise double-stranded break efficiency and minimise off-target ZF binding (pushing specificity several orders of magnitude beyond the size of a human genome), a duplex approach is adopted. This requires two zinc finger complexes, each binding to opposite strands of the DNA and each fused with one half of the bipartite FokI endonuclease. In order to produce a site-specific cut, both zinc finger complexes were engineered to bind the perfect distance from each other, bringing both FokI subunits into close enough proximity to induce a cut.
Fine control over the cut site position gave researchers tweezer-like control over homologous recombination events. Additionally, by making more than one specific cut, sections of the genome could now be completely deleted, taking advantage of the cell's non-homologous end joining DNA repair machinery. Although a major breakthrough, the processes used to make each DNA-binding unit highly specific were arduous and ultimately expensive.
Each zinc finger sub-unit also has finely nuanced binding affinity, further complicating the use of this technology. To use the analogy of a hand, a zinc finger in the pinky position of the entire ZFN structure could have extremely high efficiency. However, in the ring position the same zinc finger will display an entirely different, often worse, binding efficiency. Therefore binding efficiency had to be engineered in the context of the entire protein.

Microinjection of a cell to introduce new genetic material


Transcription Activator-Like Effector Nucleases (TALENs)


TALENs are topologically similar to ZFNs; however, instead of zinc fingers, it is transcription activator-like effector (TALE) proteins that are modularised to facilitate specific DNA binding. Native to Xanthomonas plant pathogens, TALEs are important virulence factors that bind to promoter sequences in the host genome to increase the expression of plant proteins that will facilitate cell colonisation by the pathogen.

A cartoon depicting how Zinc Finger Nucleases (ZFNs) bind to 3 specific nucleic acid bases

In 2009 researchers deciphered their astoundingly simple DNA binding mechanism, and by 2010 TALENs were being engineered to direct double-stranded breaks in DNA. An individual TAL is a small 33-35 amino acid protein, with two adjacent amino acids (in positions 12 and 13) controlling DNA binding. Therefore only four TAL variants (one for each A, T, C and G nucleotide) have to be organised to provide sequence-specific DNA binding.

When designed properly, cleavage efficiency between ZFNs and TALENs is actually somewhat similar. What differentiates TALENs is their absolute ease of engineering. ZFNs required a deep understanding of ZF binding modalities, as well as an E. coli-based screening system that used libraries of sequences. TALENs require only the shuffling of the four module variants.

A cartoon depicting how TAL Effector Nucleases (TALENs) bind to individual nucleic acid bases
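The sketch below shows why TALEN design reduces to shuffling four modules: each target base maps to one repeat variant, identified here by the commonly used repeat-variable di-residue (RVD) code, a detail not spelled out in the text above.

# One repeat module per nucleotide: NI binds A, NG binds T, HD binds C, NN binds G.
RVD_FOR_BASE = {"A": "NI", "T": "NG", "C": "HD", "G": "NN"}

def design_tale_array(target):
    """Return the ordered TAL repeat modules needed to bind `target`."""
    return [RVD_FOR_BASE[base] for base in target.upper()]

# A hypothetical 10 bp target maps directly to ten modules in the same order.
print(design_tale_array("TCCCTTTATC"))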

TECHNOLOGY SHOWCASE: 12 PEOPLE TREATED FOR HIV
In 1995, individuals were found to be naturally resistant to HIV. Each carried a 32 base pair deletion in an immune receptor called CCR5, ablating its function. In 2008, using zinc finger nucleases, CCR5-null cells were produced in the lab. Between 2011 and 2013, 12 HIV+ patients had their immune cells cultivated, modified with zinc finger nucleases to carry the CCR5-null phenotype, and re-introduced into the patients. Patients saw a reduction in viral load, and the persistence of an HIV-resistant population of T-cells. It is important to note that patients are not cured, but their disease outlook is improved.
Current research aims to combine the same procedure with stem cell therapy to provide a one-shot HIV cure. Sangamo are currently in FDA-approved phase 2 clinical trials for the T cell technology and phase 1 for the stem cell technology.

The ease of engineering led to a number of new, novel applications that can be brought about by an easily engineered double-stranded break, including very large-scale (1.5 million base) deletion and inversion event models.

CRISPR-CAS9
In 2012, the entire field of genome editing was shaken up again with the adaptation of the CRISPR-Cas9 system to genome editing. In nature, the CRISPR-Cas9 system is found in over 40% of all sequenced bacteria, and almost every sequenced archaeon. It affords immunity to invading DNA elements (from viruses or other pathogens) by site-specific, RNA-guided cleavage. Due to its simplicity of use, its accuracy, and the ease of further modifying Cas9 to shuttle other DNA-acting enzymes to specific genomic regions, it has become the current gold standard in genome editing machinery.


Reimagine Genome Scale Research
Massively Parallel DNA Synthesis rewrites CRISPR workflows

SOFTWARE TOOLS ALLOW OFF-TARGET-FREE gRNA TO BE DESIGNED
Desktop Genetics offer a platform for the easy and efficient design of gRNA libraries that are accurate for a genome of interest, and have minimal off-target effects. a) Snapshot of the Desktop Genetics interface showing BRCA1 introns and exons at its position in the human genome. b) Several scoring algorithms are used to define whether a specific gRNA will possess off-target activity throughout the genome (activity scores of MUC4 guides plotted against distance from the AUG, distinguishing sgRNAs with low, moderate, or MUC4-only off-target activity).

AT 1:500, TWIST BIOSCIENCE'S OLIGO SYNTHESIS ERROR RATE IS INDUSTRY LEADING
29,040 unique 80mer oligo sequences were designed to contain a central 40bp variable region (25% representation of each base) with identical 20mer flanking regions. These designs were synthesized simultaneously on the Twist Bioscience silicon DNA writing platform in 240 clusters (121 oligos per cluster), and sequenced with an Illumina MiSeq. Alignments with design sequences showed that around 1 in 500 nucleotides per cluster was erroneous - an industry-leading synthesis accuracy.

OLIGO POOLS ARE UNIFORM IN THEIR REPRESENTATION
The same 29,040 oligonucleotides that were designed above had their NGS data assessed for abundance and oligonucleotide representation. 100% of the designed oligonucleotides were present in the NGS analysis. Additionally, 90% of all sequences were synthesized at a density within 4x the mean density. This data confirms that what you design is exactly what will be synthesized on Twist Bioscience's platform.

gRNA FROM TWIST BIOSCIENCE CUTS WITH HIGH SPECIFICITY
Two 70mer oligonucleotides were synthesized on Twist Bioscience's silicon DNA writing platform. These oligonucleotides were assembled to make the full 120mer gRNA template (peak denoted by blue arrow in i) that was complementary to a sequence of interest. In vitro transcription was used to convert the template into gRNA (peak in ii). This gRNA was used to guide Cas9 to the DNA sequence of interest. The 760bp sequence (blue arrow in iii) was cleaved successfully into two pieces (blue arrows in iv), 321 and 439 bp in length. No remaining full-length target or non-specific events were detectable.

WORKFLOW FOR PATHWAY COMPONENT DISCOVERY
DESIGN an accurate, off-target-activity-free gRNA library.
MANUFACTURE the whole gRNA template library using massively parallel DNA synthesis. Twist Bioscience offers gRNA as either unamplified or amplified pools: 1. pooled gRNA libraries; 2. cloning-ready gRNA pools.
CLONE the gRNA template library into vector(s) of choice (pooled plasmid sgRNA or in vitro transcription, ready for transfection).
PACKAGE the library into a lentiviral delivery system.
TRANSFORM the lentiviral library into a cell line to produce a cell line library.
SCREEN the library with multiple rounds of selection that acts on a particular pathway.
ANALYZE results to identify all genes involved in the selection pathway.

Tell us what Twist Bioscience can do for you
www.twistbioscience.com
@TwistBioscience


TECHNOLOGY
SHOWCASE:
TALENS TO
MODIFY TALES


While Xanthomonas have been a useful source of TALE proteins for genome engineering, they also use the same tool to cause crop-destroying rice blight. In true 'fighting fire with fire' fashion, researchers designed TALENs to modify the natural TALE binding regions of the rice crop by inducing either deletions or mutations in an effector binding region of the plant genome.
Using a modified DNA-injecting plant pathogen, Agrobacterium tumefaciens, plasmids encoding TALENs were injected into rice embryonic cells, which were then screened for double knockout mutants. These mutants were found to show no impairment in growth or development, alongside a resistance to the 32 rice-infecting strains that target the now modified (unrecognisable) target site. Due to the simplicity of this experiment it can easily be used in any plant to confer resistance to many blighting pathogens that use a TALE (or similar) infection system.

BRIEF HISTORY OF CRISPR-CAS9
Scientists had previously noted that many bacterial and single-celled (archaeal) genomes contained distinct repeated motifs of <50bp that were clearly, neatly and consistently ordered.

PROKARYOTIC GENOMES CONTAIN WELL-ORGANISED REPEATS
Jansen et al., 2002
Order implies function, but their function was baffling: they were non-coding, for one, and their pattern kept showing up in different species, each with its own unique, repetitive sequence that was often highly diverged from other species with the same pattern.
Efforts to explain their function came in the paper above, kickstarting the field by naming the repeat regions Clustered Regularly Interspaced Short Palindromic Repeats, or CRISPR for short, and documenting the existence of a number of CRISPR-associated genes adjacent to these repeats, named the Cas family. The CRISPR-Cas paradigm was born.
SEQUENCES IN BETWEEN THESE REPEATS ARE
SURPRISINGLY FOREIGN
Mojica et al., 2005
CRISPRs were always interspersed
with what seemed like totally arbitrary
sequence, also <50bp in length. A CRISPR
locus looks as follows:
CRISPR-random-CRISPR-random-CRISPR-random-CRISPR-random
In 2005 Mojica et al. sequenced 4,500
CRISPR sequences from 67 strains
representing both bacteria and archaea,
and compared these sequences against
repositories within GenBank.
Their astounding finding revealed that
the sequences matched a mixture of
bacteriophage (viruses that infect bacteria)
sequences, invasive plasmid (weapons
used by bacteria to destroy other bacteria)
sequences and personal genomic
sequences that had been sequestered into
the CRISPR interspacing regions.

EUKARYOTES ARE NOT THE ONLY ORGANISMS TO HAVE AN ADAPTIVE IMMUNE SYSTEM
Barrangou et al., 2007
The authors of the previous study noticed
that a single-celled organism that grows
primarily in hot springs, Sulfolobus
solfataricus, was naturally immune to a
virus called SIRV. It also had SIRV DNA in
its CRISPR spacers, noteworthy in that
viruses use DNA as their weapon to infect
hosts. It was hypothesised that CRISPR
was a form of bacterial adaptive immunity
against viral attack.
In 2007 Barrangou et al., showed that
subjecting bacteria to viral attack until it
became resistant caused that virus DNA to
be introduced into the CRISPR interspacing
regions. To account for false positives, they
then took out the spacers containing viral
DNA from the resistant strain, and subjected
it to viral attack once again. Resistance was
instantly lost.
CAS USES CRISPR TO BECOME A GUIDED
MISSILE
Garneau et al., 2010 and Deltcheva et al., 2011
Given that CRISPR and spacer DNA provide
immunity to viruses, researchers dug
for the precise mechanism that leads to
one genetic element destroying another.
Between 2010 and 2011, CRISPR/Cas was
shown to use the information in CRISPR
spacers as coordinates for cutting up
invading DNA sequences.
Viral DNA challenged with CRISPR/Cas of a
resistant strain was always cleaved within
the sequence that matched the spacer.
This cleavage always occurred at a specific
distance from a recognised CRISPR sequence
motif that was always consistent between
spacers of any particular species (the
Protospacer Adjacent Motif, or PAM).
Virus-resistant bacteria produce an
abundance of RNA from two distinct regions
in the CRISPR/Cas system. One was the
CRISPR spacer itself (crRNA), with the other
existing just outside of the CRISPR repeats,
near where the Cas genes are found
(tracrRNA). Both RNA fragments together
were necessary to cleave viral DNA alongside
the endonuclease protein Cas9.


TIMELINE

LATE 80s: HOMOLOGOUS RECOMBINATION
The homologous recombination machinery that allows genetic recombination during meiosis was hijacked to afford the directed homologous recombination of exogenous DNA into precise positions in mouse genomes, with an efficiency of one recombination in every 10^6 cells.

LATE 90s: SYNTHETIC ZINC-FINGER NUCLEASE PROTEINS (ZFNs)

2002: TERM CLUSTERED REGULARLY INTERSPACED PALINDROMIC REPEATS (CRISPR) IS COINED
Identification of genes that are associated with DNA repeats in prokaryotes. Jansen et al., Molecular Microbiology.

2005: NON-REPEATING SPACERS IN CRISPR FOUND TO CONTAIN VIRAL DNA
Intervening sequences of regularly spaced prokaryotic elements derive from foreign genetic elements. Mojica et al., Journal of Molecular Evolution.

2007: CRISPR IS A PROKARYOTE'S ADAPTIVE IMMUNE SYSTEM
CRISPR provides acquired resistance against viruses in prokaryotes. Barrangou et al., Science.

LATE 00s: SYNTHETIC TAL EFFECTOR NUCLEASE PROTEINS (TALENs)
Plant pathogenic Xanthomonas were found to use DNA-binding TAL effectors as virulence factors. TAL effectors bind specific nucleotides and, added together along with a FokI nuclease, can introduce site-directed double-stranded breaks. When properly designed, efficiency can be above 1 in every 2 cells.

2010: CRISPR AFFORDS IMMUNITY BY CUTTING UP FOREIGN DNA
The CRISPR/Cas bacterial immune system cleaves bacteriophage and plasmid DNA. Garneau et al., Nature.

2011: NUCLEASE PROTEIN CSN1 (NOW NAMED CAS9) IS GUIDED BY TWO CRISPR-ENCODED RNA STRUCTURES (CRRNA AND TRACRRNA) TO A SPECIFIC DNA SEQUENCE
CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Deltcheva et al., Nature.

2012: CRISPR/CAS9 NUCLEASES: LINKAGE OF CRRNA AND TRACRRNA INTO A SINGLE GRNA COULD ENABLE GUIDED GENE EDITING
Some bacteria have adaptive immune systems that protect them from viral DNA. A Cas9 protein is guided by RNA which has been transcribed from learned information about the virus. Guide RNA is complementary to a viral strand, and can be engineered to be complementary to any strand of choice, allowing Cas9 to introduce double-stranded breaks in up to 9 in every 10 cells. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Jinek et al., Science.

2015: CRISPR/CPF1 NUCLEASES
A new, relatively non-validated technology. Proteins similar to Cas9 were screened, and from this Cpf1 was discovered. It was found to require smaller guide RNAs, and to induce an overhanging break instead of a blunt double-stranded break.

TODAY: CRISPR IS ONE OF THE FASTEST-EVOLVING FIELDS IN BIOLOGY
According to PubMed, in 2015 alone there were 1,266 CRISPR publications. In January 2016 alone there were 207 papers, fitting the trend of exponential growth.


IF CAS9 IS CRISPR RNA GUIDED, AND WE CAN ENGINEER DNA


SEQUENCES...
Jinek et al., 2012
The 2012 study by Jinek et al. is a good candidate for the greatest biological advance in the last five years. It was a one-two punch knockout: an incredible leap that revolutionised not just synthetic biology, but genetic engineering, personalised medicine, agricultural science, genetics and cell biology, to name a few.
First, the authors demonstrated that because crRNA and tracrRNA are complementary sequences, they bind into a double strand. This double strand then guides the Cas9 to the complementary strand in the invasive DNA, starting at the PAM. The Cas9, which contains two different DNA-cutting domains, unwinds the DNA into two strands, and then creates a blunt break in the invading DNA.
Next, Jinek et al. showed that by linking the crRNA and the tracrRNA into a new molecule they named guide RNA (gRNA), they could simplify the system. Any gRNA sequence could be synthesised to facilitate Cas9-mediated blunt-ended cleavage of any DNA strand. This gives researchers precise control to cut anywhere in an organism's genome, allowing genes to be engineered in or out of selected organisms with relative ease.
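To make that base pairing concrete, here is a small illustrative Python sketch (our own minimal example, not taken from any published protocol or tool) that derives the RNA sequence able to base pair with a chosen 20bp target strand; the target sequence itself is entirely hypothetical.

# Minimal sketch: derive the RNA that would base-pair with one strand of a DNA target.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence, written 5'->3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def complementary_rna(target_strand):
    """RNA (5'->3') that base-pairs with the given DNA strand, as a gRNA spacer would."""
    return reverse_complement(target_strand).replace("T", "U")

print(complementary_rna("ATGCATGCATGCATGCATGC"))  # hypothetical 20bp target -> GCAUGCAUGCAUGCAUGCAU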

A CLOSER LOOK AT CRISPR/CAS9 FOR GENOME EDITING
Very few parts have to come together to facilitate CRISPR mediated genome editing. A Cas9 protein is expressed with a correctly designed gRNA complementary to a DNA sequence of interest, which targets the Cas9 to the genome if the sequence complementary to the gRNA is adjacent to a PAM.
Where homologous recombination-mediating recombinases were locked into an unchangeable 30+ base recognition sequence, Cas9 is almost unrestricted in sequence recognition. As the 20bp guide RNA sequence can be any string of nucleotides complementary to a genome sequence, the only constraining factor on Cas9 binding is the existence of a PAM. S. pyogenes Cas9 effectively recognises a two base pair PAM (NGG, where N can be anything), so any sequence preceding two guanines can be cut.

A cartoon depicting how Cas9, gRNA and the PAM come together for DNA cleavage to facilitate genome engineering
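As a rough illustration of how loose that constraint is, the short Python sketch below (our own illustrative example, not a production gRNA design tool) scans one strand of a DNA string for NGG PAM sites and reports the 20bp protospacer immediately upstream of each one. The sequence is hypothetical, and a real design tool would also scan the reverse strand.

import re

def find_cas9_targets(dna, guide_len=20):
    """Yield (PAM position, 20bp protospacer, PAM) for every NGG PAM in one strand."""
    dna = dna.upper()
    for m in re.finditer(r"(?=([ACGT]GG))", dna):   # lookahead allows overlapping PAMs
        pam_start = m.start()
        if pam_start >= guide_len:                  # need a full protospacer upstream
            yield pam_start, dna[pam_start - guide_len:pam_start], m.group(1)

for pos, spacer, pam in find_cas9_targets("TTGACCTAGGCATTCGATACGGAATTCCGGAATTGCCATGG"):
    print(pos, spacer, pam)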
Such simplicity is extremely powerful, as it allows almost any genomic position to be modified in any organism that can express exogenous DNA. In order to fully take advantage of this system, it has been vital to understand exactly how these parts interplay at a molecular level, eventually leading to further improvements to the system.

IMPORTANT STRUCTURAL ELEMENTS OF CAS9
Cas9 mediated DNA cleavage can be considered as a three-step process: destabilisation, invasion, and cleavage. Cas9 is a considerably large protein, with S. pyogenes Cas9 weighing in at 1368 amino acids. Smaller Cas9 proteins do exist; however, due to its simple PAM site, S. pyogenes Cas9 sees the most use. Within Cas9 are two lobes with four active sites, split into the recognition lobe, which directs the recognition between gRNA and DNA, and the nuclease lobe, which cleaves each strand of the DNA independently and recognises the PAM site.
DNA cleavage relies on all four active sites, but is performed by the two endonuclease-like domains called RuvC and HNH.
Destabilisation: First, gRNA is integrated into the recognition lobe of Cas9, forming a stable structure that has all of the active sites required for DNA cleavage. Once this complex is formed, the cleavage lobe facilitates a DNA scan for PAM sites. If one is found, the PAM recognition site in the cleavage lobe of the protein is thought to locally destabilise the chemical interactions that hold both complementary strands of the PAM site DNA together. Further protein-DNA interactions then stabilise the now unwound DNA immediately upstream of the PAM site.
Invasion: At this point, if there is no match between the gRNA and the unwinding DNA, the binding energy within the entire system is too low to maintain the overall structure, and the Cas9-gRNA complex moves on to another PAM site. If there is a match, each gRNA residue will subsequently displace its homologous DNA residue, binding complementarily to the target sequence. A number of complex sequential interactions between the growing target DNA-gRNA dimer and the recognition lobe facilitate and stabilise the entire process.
Cleavage: Once the whole Cas9-gRNA-DNA complex is formed, the DNA is held within the protein in a conformation accessible to the HNH and RuvC cleavage domains. HNH is flexible in its movement; once in a favourable conformation with the target DNA-gRNA dimer, it will cleave the DNA strand. Simultaneously, the non-complementary strand is positioned for RuvC to cleave. The entire protein complex then disassociates, the cleaved DNA strands re-wind, and a targeted double stranded break is left in the DNA.


IMPROVING CAS9
Cas9, when expressed or transfected in cells alongside a gRNA, allows for the targeted introduction or deletion of genetic information. This approach was used to produce knock-out mutant mice carrying a mutation in both alleles in a total of only around four weeks from start to finish. It has been deemed a fantastic success, and is often referred to as one of the greatest breakthrough technologies in recent years.
CRISPR has already shown incredible promise in the development of personalised gene therapies for rare diseases in human cell lines and in mouse models. Pre-clinical, proof of concept treatments already exist for β-thalassemia, rheumatoid arthritis, Duchenne muscular dystrophy, cystic fibrosis and tyrosinemia.

Once the full structure and mechanism of CRISPR-Cas9 mediated DNA cleavage was identified, researchers could set about improving the technology even further, increasing its efficacy and expanding the potential applications. While many mutants and fusion proteins have been produced, three stand out: Cas9n, dCas9 and hfCas9.

Cas9n: Cas9n, or nicking Cas9, has either its RuvC or its HNH cleavage domain modified to be inactive. This inactivation leaves Cas9 able to produce only a single stranded break in the DNA (a nick), not a double stranded break. This is significant for two applications.

First, there is concern over the effects that off target Cas9 cutting events will have on any cell that is engineered with this system. Research has shown that off target effects are often few and far between, but their impact cannot be ignored. For this reason, two Cas9n enzymes, one for each strand, can be used to produce the double stranded break. As they would have to recognise both the upstream and downstream regions of the cut site, off target effects are almost always ablated.

There are a number of fates a DNA sequence may undergo following a double stranded break. The most common is homologous recombination-directed repair of the sequence. Alternatively, the non-homologous end joining machinery could rejoin the two strands back together, potentially integrating an exogenous sequence in any orientation in the space. The non-homologous end joining machinery often causes the loss of a few nucleotides, so in each case it can lead to the introduction of a faulty version of the target gene. Recently, it was discovered that homologous recombination can occur following an individual nick event in place of the canonical single stranded repair mechanisms, although the efficiency of this process is reduced 24-fold. Regardless, this allows researchers to introduce a homologous recombination event without worrying about inducing a double stranded break based error.

dCas9: Short for dead Cas9, it has had both its RuvC and its HNH nuclease domains inactivated. This turns Cas9 into a shuttle for other enzymes that can act upon the DNA. dCas9 has been used as a fusion product with transcription factors in order to tightly control the activation or repression of particular proteins outside of their usual activity. It has also been fused to FokI, and used as a dual strand cleavage system that belongs to the same paradigm as ZFNs and TALENs.

hfCas9: Instead of using dual Cas9n proteins to generate an off-target-effect-free Cas9 cut, researchers took to modifying the Cas9 enzyme itself to reduce off target effects and keep the Cas9 system as simple as possible.

As mentioned earlier, there are a number of complex interactions that maintain the DNA in its unwound position to facilitate gRNA base pairing. It was thought that the net effect of all of these interactions is a binding energy above that required for the reaction to be successfully carried out. This allows a relaxed specificity, meaning that the gRNA and target DNA do not have to be a perfect match. This is great for the bacteria, as it can endure viral DNA that has undergone one or more mutations since it was last encountered; however, it can be detrimental to genome editing due to the off-target effects caused.

Therefore, by mutating four of these DNA interacting domains, the DNA binding energy of the whole system was reduced to a point at which the gRNA had to be exactly correct in order to induce a cut in the DNA. When the target organism's genome was sequenced and analysed for off target effects, not a single one was found. So long as gRNA is designed with off target effects in mind (with a tool like the genome search algorithms offered by Desktop Genetics), hfCas9 allows genome editing that is specific only to the site of interest.
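The idea behind that kind of design check can be sketched very simply. The Python snippet below (a minimal illustration of ours, not the algorithm used by Desktop Genetics or any other tool) compares a 20bp spacer against every PAM adjacent site on one strand of a genome string and reports near matches as potential off target sites; real tools also search the reverse strand and weight mismatches by position.

def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def potential_off_targets(genome, spacer, max_mismatches=3):
    """List (position, site, mismatches) for sites upstream of an NGG PAM that
    match the spacer within max_mismatches; perfect matches are intended targets."""
    genome, spacer = genome.upper(), spacer.upper()
    k = len(spacer)
    hits = []
    for i in range(k, len(genome) - 2):
        if genome[i + 1:i + 3] == "GG":            # NGG PAM starting at position i
            site = genome[i - k:i]                 # the 20bp immediately upstream
            mm = count_mismatches(site, spacer)
            if mm <= max_mismatches:
                hits.append((i - k, site, mm))
    return hits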

A crystal structure of the Cas9 protein (blue and cyan) using gRNA (green) to interact with unwound DNA (magenta)

Where next? It was not covered in detail in this chapter, but the CRISPR associated protein Cpf1 could extend the reach of the CRISPR/Cas system, and maybe there is something even more powerful looming on the horizon. The future of genetic engineering continues to evolve.
For a fully referenced version see the digital version at frontlinegenomics.com



THE CRISPR JARGON BUSTER:


ACTIVATOR:
A transcription factor protein that controls the transcription of DNA by binding to specific regions of the genome. Activators specifically encourage gene expression.

ADAPTIVE IMMUNE RESPONSE:
An organism's defence against pathogens, in which the end goal is destruction of the pathogen through a process that uses learned information about the pathogen. This learned information can be retained for the organism to call upon when it next encounters the pathogen. Mammals have antibodies and lymphocytes. Bacteria have the CRISPR/Cas system.

BLUNT-CUT:
An enzyme induced break in the DNA, in which both DNA strands are cut in the same place:
5'-atgcatgca v tgcatgcatgc-3'
3'-tacgtacgt ^ acgtacgtacg-5'

Cas:
Short for CRISPR associated protein. Cas is a protein family whose members are found adjacent to a CRISPR motif, and are required for the bacterial adaptive immune response.

Cas9:
(also Csn1) A Cas enzyme that forms a complex with both crRNA and tracrRNA, or with synthetically engineered gRNA, in order to perform sequence guided blunt ended cuts in double stranded DNA.

Cas9n:
Cas9n, or nicking Cas9, has either its RuvC or its HNH cleavage domain modified to be inactive. This inactivation leaves Cas9 able to produce only a single stranded break in the DNA (a nick), not a double stranded break.

dCas9:
An engineered mutant of Cas9 named dead Cas9, which has had its endonuclease sites rendered catalytically inactive. This enzyme is often fused to other DNA relevant enzymes to afford precisely targeted genetic control.

hfCas9:
An engineered mutant of Cas9 named high fidelity Cas9, which has had its DNA interacting domains modified. This makes the Cas9 bind to DNA weakly, so the gRNA has to be an exact match to its target DNA in order for a cut to occur.

DOUBLE STRANDED BREAK:
When both strands of a DNA molecule are broken at the same point, leaving two free fragments. This can occur naturally through DNA damaging radiation, or enzymatically through nucleases.

DUAL NICK:
A reaction that breaks both strands of DNA using two nickase enzymes, one cutting each strand.

ENDONUCLEASE:
An enzyme that is able to produce either a blunt-cut or an overhanging-cut in both strands of a DNA sequence.

OFF-TARGET EFFECTS:
An unintentional, unforeseen genomic modification or manipulation caused by a targeted modification tool also being able to bind to genomic sequences elsewhere in the genome.

FokI:
An endonuclease commonly used in genome editing. It requires a dimer of FokI enzymes to produce a blunt cut in the DNA, so it can facilitate precision DNA cutting activity when used as a fusion with TAL effector proteins, Zinc Finger proteins or dCas9.

EXOGENOUS DNA:
A length of DNA that is either taken up by the cell from its surroundings, or synthetically introduced into the cell. Both cases can lead to new genetic information being inserted into the cell's genome. Exogenous DNA of a specific, desired sequence is used in genome editing to introduce new genes, or mutant versions of genes, into cells in order to study their effects.

FRAME SHIFT:
A type of mutation that changes the amino acid composition of a protein. This occurs because the information in RNA is translated into amino acids as triplets of nucleic acids. A single deletion or insertion will cause the reading frame to shift, often rendering a protein inactive:

RNA:   AUG UCU UGU UCU GGU A
Amino: Met Ser Cys Ser Gly

A deletion event at the second guanine causes a frame shift, changing the amino acid composition throughout the remainder of the protein:

RNA:   AUG UCU UUU CUG GUA
Amino: Met Ser Phe Leu Val
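The same shift can be reproduced programmatically. The short Python sketch below (an illustrative example of ours, using a minimal codon table limited to the codons shown above rather than the full genetic code) translates the RNA before and after deleting the second guanine.

CODON_TABLE = {"AUG": "Met", "UCU": "Ser", "UGU": "Cys",
               "GGU": "Gly", "UUU": "Phe", "CUG": "Leu", "GUA": "Val"}

def translate(rna):
    """Translate complete codons only; any trailing bases are ignored."""
    return [CODON_TABLE.get(rna[i:i + 3], "???") for i in range(0, len(rna) - 2, 3)]

original = "AUGUCUUGUUCUGGUA"
shifted = original[:7] + original[8:]   # delete the second guanine
print(translate(original))              # ['Met', 'Ser', 'Cys', 'Ser', 'Gly']
print(translate(shifted))               # ['Met', 'Ser', 'Phe', 'Leu', 'Val']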

HOMOLOGOUS:
A nucleic acid sequence with significant similarity to an existing nucleic acid sequence. In molecular biology this applies to a sequence with sufficient similarity to facilitate homologous recombination.

HOMOLOGOUS RECOMBINATION:
Nucleic acids are copied from one strand of DNA into an identical DNA sequence that has suffered a double stranded break, accurately stitching together the two broken ends. Can also occur without damage to produce sequence variation during meiosis. Used by bacteria and viruses to mediate sequence invasion during horizontal gene transfer. Utilised in genome editing to accurately introduce foreign sequences into specific genome positions.

Csn1:
see Cas9

NON-HOMOLOGOUS RECOMBINATION:
Nucleic acids that have undergone a double stranded break can undergo a process called non-homologous end joining, in which both free ends of DNA are enzymatically stitched together regardless of their sequence.

CRISPR:
An acronym for Clustered Regularly Interspaced Palindromic Repeats. CRISPR are repetitive motifs in prokaryotic genomes that are interspersed with foreign sequences. These interspersing sequences are learned from invasive nucleic acids and become the crRNA and tracrRNA that guide Cas9 toward invading nucleic acids; together with the Cas enzymes they form the prokaryotic adaptive immune system. The system has been repurposed to facilitate directed genome editing.

DELETION:
The removal of one or more nucleic acids from a genome sequence. Can cause frame shift mutations.

INSERTION:
The introduction of new genetic material into a nucleic acid sequence. Insertions can cause protein silencing and frame shift mutations if introduced into a protein encoding DNA sequence.

KNOCK-IN:
The introduction of new genetic material into a specific point in an organism's genome.

KNOCK-OUT:
The removal of genetic information from a specific point in an organism's genome.


NON-HOMOLOGOUS REPAIR:
see non-homologous recombination

Cpf1:
An enzyme that was discovered due to its relatedness to Cas9. Like Cas9 it is an RNA guided nuclease. Unlike Cas9 it forms a complex with only a crRNA strand, making it simpler to engineer. It also produces an overhanging-cut.

NICKASE:
A type of nuclease enzyme that performs a cut on a single DNA strand:
5'-atgcatgca v tgcatgcatgc-3'
3'-tacgtacgt acgtacgtacg-5'

OVERHANGING-CUT:
An enzyme induced break in DNA which occurs at separate sites on the two strands, leaving an overhanging single strand on each product. Overhang length depends on the enzyme:
5'-atgcatgca v tgcatgcatgc-3'
3'-tacgtacgta ^ cgtacgtacg-5'

PAM:
An acronym for Protospacer Adjacent Motif. This motif is the recognition site for Cas9 and is specific to each Cas9 containing species. The Streptococcus pyogenes PAM is NGG. Cas9 will not successfully bind and cleave the target DNA if there is no PAM immediately following the region of homology.

REPRESSOR:
A transcription factor protein that controls the transcription of DNA by binding to specific regions of the genome. Repressors specifically inhibit gene expression.

RESIDUE:
Another term for a single nucleotide, usually used in the context of a whole DNA molecule.

RNA:
Double stranded DNA in the genome is transcribed into single stranded RNA. RNA either provides the code to make proteins, is independently catalytically functional (a ribozyme), or forms a complex with other RNA or a protein to provide a catalytic function.

crRNA:
The cr stands for CRISPR. crRNA is RNA encoded within the CRISPR motif, and is one of the two RNA sequences required for Cas9 to target a specific DNA sequence.

gRNA:
The g stands for guide. gRNA is a synthetic linkage of crRNA and tracrRNA which makes the CRISPR system easier to engineer for use in genome engineering. The crRNA and tracrRNA are joined by a linker sequence.

tracrRNA:
The tracr stands for trans-activating cr. tracrRNA is RNA that is encoded outside of the CRISPR motif, but is complementary to the repeat regions of crRNA. It is one of the two RNA sequences required for Cas9 to target a specific DNA sequence.

TALEN:
An acronym for TAL Effector Nuclease. TAL effectors are proteins secreted by Xanthomonas species as part of their infection process, which can directly bind to individual nucleotides. Multiple TALs can be combined to target a specific DNA sequence. They are often fused to FokI to produce targeted blunt ended cuts in DNA during genome editing.

ZFN:
An acronym for Zinc Finger Nuclease. Zinc fingers are motifs that allow transcription factors to bind to specific nucleic acid triplets. Multiple zinc fingers can be combined to target a specific DNA sequence. They are often fused to FokI to produce targeted blunt ended cuts in DNA during genome editing.
