101
An Introduction To The Genomic Workflow
INTRODUCTION
We have come a very long way since DNA was first isolated back in 1869. As Friedrich Miescher
was investigating nuclein, I doubt he could have imagined the advances that took place in
the subsequent century and a half. The latter half of the 20th century in particular saw
tremendous intellects drive forward what would eventually develop into the field of
genomics. 1953 saw James Watson and Francis Crick describe the structure of the DNA helix, and as
technology advanced the 21st century began with the publication of the human genome in 2001.
Today, genomics not only represents the pinnacle of our understanding of human biology, but also an
industry of extraordinary potential set to impact several aspects of our lives.
The raw requirements to generate, manage, analyse and interpret genomic data have become far
more accessible in recent years. This has led to a phenomenal boom not just in data creation, but in our
understanding and leveraging of that data. Simply put, there has never been a better time to adopt
genomic technology. And that is exactly what so many of you are already doing.
This is where this handbook comes in. Genomics is moving at such a rapid pace that easy-to-understand
information explaining how it all works is hard to come by. Here at Front Line Genomics,
we want to do what we can to help lower the barrier to adoption of genomic technology. With
the help of some of the leading technology companies (our Strategic Partners, Agilent Technologies,
Seven Bridges Genomics, and Twist Bioscience, and partners Affymetrix, DNAnexus, New England
Biolabs and WuXi NextCODE), we've put together the Genomics 101 as a guided tour through the world
of human genomics.
The 101 is not intended to offer detailed protocols to take into the lab. Our intention is to help you
understand the kinds of questions you can use genomic approaches to ask, the kinds of platforms
available to you, and how they work. We'll help you explore DNA microarrays and Next Generation
Sequencing. We'll familiarise you with the basic chemistries involved in producing sequence
information. We'll then explain how all that data gets turned into something you can use to help
improve patients' lives.
There is a lot that we could have included in this edition, but we tried to focus on the core technology
areas that are at the heart of genomics today and shaping the future of the field. We will endeavour
not only to keep these chapters up to date, but also to add new content as technology and applications progress.
For now, we hope you find the handbook an interesting read, and above all useful.
GENOMICS 101
CONTENTS
1 INTRODUCTION
3 GLOSSARY
GLOSSARY
ADAPTORS
A short nucleotide molecule that binds to each end of a DNA
fragment prior to sequencing.
ALLELE
One of two or more forms of a gene, or other portion of DNA, located at the
same place on a chromosome
MUTATION
A DNA sequence variation that differs from the reference
sequence. This can be a SNP, an insertion, or a deletion of
base pairs in the sequence
GENE EXPRESSION
The process by which the information in a gene is used to create a
functional protein product
GENE PANEL
A selection of genes relevant to a particular condition that can be
sequenced in order to make a clinical diagnosis
SOMATIC CELLS
Cells that are not destined to become reproductive cells.
Mutations in somatic cells are not passed on from parent to
offspring
GENOME
The full genetic sequence of an organism, including both coding and
non-coding regions
GENOME-WIDE ASSOCIATION STUDY (GWAS)
A study that evaluates the genomes of a large number of
participants, looking for correlations between genetic variation
and particular traits or diseases.
GENOTYPE
The complete genetic make-up of an organism
GERMLINE CELLS
Cells that will go on to become sperm and ova. Mutations in the
germline can be transmitted from parent to offspring.
TRANSCRIPTION
The process of creating a messenger RNA (mRNA) from a DNA
sequence
TRANSLATION
The process of creating a protein chain, composed of amino
acids, from a strand of mRNA
WHOLE EXOME SEQUENCING (WES)
The process of sequencing the entire coding portion of an
individual's genome
WHOLE GENOME SEQUENCING (WGS)
The process of sequencing all of an individual's DNA
Genomics 101 / 3
CHAPTER 1:
DESIGNING
GENOMICS
EXPERIMENTS
INTRODUCTION
In this first chapter of the Genomics 101, we take a look at the broad
range of options available to anyone looking to generate, or make use of,
genomic data. Genomic data can range from a whole genome to just the
exome, or from a subset of genes down to a single gene. It can take the
form of DNA sequence, single nucleotide polymorphisms, copy number
variations, or structural variations. In addition to reading the genome,
we can now also generate profiles of the genes expressed, gaining another
layer of information that can help us understand disease at a cellular
level by exposing novel transcripts, splice variants, and non-coding
RNAs, which can become valuable biomarkers for diagnostic tests. Before
we look into the different methods and platforms available to you, it is
important to take a step back and consider what it is that you are
trying to achieve.
Since the Human Genome Project was completed more than a
decade ago, different whole genome analysis technologies have
become available. Whole genome analysis using microarrays has
been the traditional workhorse for gene expression profiling
as well as genotyping applications. Microarrays are the perfect tool
for scientists in and out of the clinic, due to their affordability,
consistency, and quick, easy, and standardised data analysis.
Relatively new next generation sequencing (NGS) methods have
improved dramatically over the past few years. You may have seen
several graphs showing how sequence output has risen, and thus the cost
of sequencing has dropped, faster than the rate described by Moore's Law
(which describes a long-term trend in the computer industry whereby
compute power doubles every two years). The dramatic increases
in sequence output over the past 10 years have now made it possible to
consider sequencing projects that were impossible, or at least completely
unaffordable, prior to these advances. January 2014 brought the news
that Illumina had delivered the first commercially available $1,000
genome with their HiSeq X Ten sequencer. Although that figure includes
the cost of reagents and sample preparation, there is still an argument
that, to truly break the $1,000 barrier, the cost must also include
interpretation of the data produced, as well as storage of the resulting data.
The improvements in sequencing technology have led to a flood
of genomic information. This has greatly increased the level of
understanding of the genome and roles of specific genes. As well
as advancing research, this is also leading to the development of
more powerful DNA microarrays leveraging the growing wealth of
identified gene variants.
Before you get too excited about NGS, consider whether it is really
the best option for what you want to do. Although much cheaper
than it used to be, NGS is still relatively expensive, and requires
considerable IT capabilities (as we'll see in the Analysis chapter).
If you are undertaking discovery, or hypothesis-free, research,
have sufficient funds, and have the necessary infrastructure to perform
and analyse the data, NGS may well be the right option for you.
However, if you are undertaking a hypothesis-driven study or a large
sample size epidemiology study, or are working with difficult samples
such as FFPE or limited sample amounts such as fine needle
biopsies, you can leverage genomic data much more cost effectively
by using well-designed microarrays.
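That decision logic can be summarised in a toy sketch (Python used purely for illustration; the flags and the outcomes simply restate the guidance above and are not formal recommendations):

```python
def choose_platform(hypothesis_free: bool, funded_with_it_infrastructure: bool,
                    difficult_or_scarce_samples: bool) -> str:
    """Toy encoding of the platform guidance above.

    A real decision weighs many more factors: throughput, turnaround
    time, existing equipment, and the exact assay required.
    """
    if difficult_or_scarce_samples:
        return "microarray"   # e.g. FFPE material or fine needle biopsies
    if hypothesis_free and funded_with_it_infrastructure:
        return "NGS"          # discovery research with funds and IT capacity
    return "microarray"       # hypothesis-driven or large epidemiology studies

print(choose_platform(True, True, False))    # discovery study -> NGS
print(choose_platform(False, True, False))   # hypothesis-driven -> microarray
```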
As we guide you through your options, try to keep your end result
in mind. What kind of data do you need to answer your question
most efficiently? You can then begin to make a judgement on
platforms based on your operating restrictions.
NGS PLATFORMS
These platforms all have their advantages and disadvantages. How
heavily those sway your potential decision should come down to
what your own set of parameters are. So do investigate them all,
and try to find first-hand testimonials from existing users.
The following are the most common NGS platforms available today:
454 Life Sciences: This Roche company produced high throughput
sequencing machines based on their pyrosequencing technology.
While the cost per run is relatively expensive, the machines are
quite fast and produce longer reads than most, at around 700 base
pairs. However, Roche is no longer supporting this platform, as
other technologies have superseded it.
Illumina: Sequencing by synthesis is by far the most popular way to
sequence today. Illumina have a dominant market share due to the cost
effectiveness of sequencing through their platforms, and the potential
for particularly high yields. This company produces a wide range of NGS
machines, from the smallest machine, the MiniSeq (capable of 7.5 Gb of
sequence per run), all the way up to the HiSeq X Ten (the platform
on which the cost of WGS has dropped below $1,000 if you sequence at a
high enough volume). The main drawbacks here are the initial cost of
the equipment itself, the potentially short lifespan of the instrumentation,
and the uneven coverage associated with short reads.
Ion Torrent Sequencing: The ion semiconductor method of sequencing
proved very popular when it first hit the market. The sequencer itself
tends to be very competitively priced and is exceptionally fast. This
helped it find a home in several diagnostic laboratories, where quick
turnaround times are crucial, and absolute base-pair accuracy less so.
This platform is not viable for WGS as its output is significantly
below what is needed for complex genomes. However, it is ideally
suited for small gene panels up to exome-based analysis that we
discussed in the previous section.
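Whether a platform's output is sufficient comes down to simple arithmetic: the raw sequence you need is roughly the size of your genome (or target region) multiplied by the desired depth of coverage. A quick sketch (the 30x human figure is a common rule of thumb, not a platform specification):

```python
def required_output_gb(target_size_gb: float, target_coverage: int) -> float:
    """Back-of-the-envelope sequencing output needed for a target depth.

    Ignores duplicates, off-target reads and uneven coverage, so real
    runs need headroom on top of this figure.
    """
    return target_size_gb * target_coverage

# A ~3.1 Gb human genome at 30x coverage needs roughly 93 Gb of raw
# sequence, which is why low-output platforms suit panels and exomes
# rather than whole genomes:
print(required_output_gb(3.1, 30))

# A ~0.05 Gb exome at 100x needs only ~5 Gb:
print(required_output_gb(0.05, 100))
```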
Pacific Biosciences: Using single-molecule real-time sequencing
(SMRT), this platform is known for producing long reads (up to
60,000 base pairs with their latest machine). This gives you a
considerable advantage if you want to identify structural variations,
and increase your coverage in difficult to amplify areas of the
genome. However, the Pacific Biosciences equipment does come
in at a higher cost than most, and doesn't quite have the same
throughput as some of the other platforms available.
Another advantage of this platform is its potential ability to
sequence modified bases (such as 5-methyl-cytosine). Currently
this is not a viable platform for whole genome sequencing simply
because of its limited output. However, if cost is no object, it is a
good option for examining even the most complex genomes. What
it is better at, however, is scaffolding genomes together from other
short-read technologies.
Looking Ahead
Developing sequencing technology could offer alternatives to the existing
NGS platforms and potentially (down the road) even replace some of
those platforms. Most popular amongst these is nanopore sequencing.
DNA MICROARRAYS
A DNA microarray is a technology by which known DNA sequences
are either deposited, or synthesised, onto a surface. This allows us
to detect the presence, and concentration, of sequences of interest.
The turn of the century saw a dramatic increase in our
understanding of the human genome. New production methods,
and fluorescent detection, were adapted to build modern
microarrays. While DNA arrays have existed in early forms
since the 1970s, it was only in the 1990s that microarrays started to
become the invaluable tool we know them as today.
If you are assaying multiple samples at the same time, you can
begin to tease apart meaningful information, such as differential
gene expression levels between different sample types. Calculating
similarities in gene expression across samples allows you to put
them into hierarchical clusters. Clustering genes and samples
can help build up an interesting picture of the genetics and
biology of your indication of interest. With the advancement of
microarray technology, such as the whole transcriptome arrays
from Affymetrix, it is now possible to measure not only gene-level
differences, but also exon-level differences and alternative splice
variants. With the standardisation of microarray gene expression
data analysis, one can easily derive meaningful biological
information in weeks rather than months.
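As an illustration of the clustering idea, the toy Python sketch below groups genes whose expression profiles correlate across samples. The values are hypothetical, and production analyses use dedicated packages with proper hierarchical linkage and dendrograms; this only shows the core principle that co-expressed genes cluster together:

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression values: rows = genes, columns = samples
expression = {
    "geneA": [8.1, 8.0, 2.1, 2.3],
    "geneB": [7.9, 8.2, 2.0, 2.2],   # co-expressed with geneA
    "geneC": [1.2, 1.0, 9.1, 8.8],   # opposite pattern
}

# Single-linkage grouping: a gene joins a cluster when its correlation
# distance (1 - r) to any existing member falls below a cutoff
clusters = []
for gene, profile in expression.items():
    for cluster in clusters:
        if any(1 - pearson(profile, expression[m]) < 0.5 for m in cluster):
            cluster.append(gene)
            break
    else:
        clusters.append([gene])

print(clusters)  # [['geneA', 'geneB'], ['geneC']]
```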
Genotyping: Even as NGS costs continue to fall, this is still one area
in which microarrays remain the dominant, and much
more cost-effective, technology. The most common methods
of detecting single-nucleotide polymorphisms (SNPs) are allele
discrimination by hybridisation, allele-specific extension and
ligation to a bar-code, or extending arrayed DNA across the
SNP in a single nucleotide extension reaction. Affymetrix and
Illumina both produce highly effective SNP genotyping arrays
that have been used extensively around the world. As well as
being able to detect over 1 million different human SNPs with
high degrees of accuracy and reproducibility, the arrays can also
be used to detect copy number variations. While SNPs are crucial
biomarkers, copy number variations (a structural variation
within a cell's DNA that gives it fewer, the correct number of, or
more copies of a certain section of DNA) have also been associated
with susceptibility and resistance to some diseases. Large biobank
studies, such as UK Biobank and the Million Veteran Program in
the US, are utilising microarrays to generate genotyping
data to understand the relationship between genes, lifestyle,
environment, and medical history from 500,000 to 1,000,000
volunteers, respectively. The resulting databases are being used
by researchers to understand diseases, with the goal of better
diagnosis and treatment. Genotyping arrays are also what
direct-to-consumer companies such as 23andMe use to generate
customer profiles.
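The allele-discrimination idea can be caricatured in a few lines: each SNP assay yields two channel intensities, and the fraction of B-allele signal separates the three genotype clusters. The thresholds below are illustrative assumptions; real array software clusters many samples jointly and models noise:

```python
def call_genotype(intensity_a: float, intensity_b: float) -> str:
    """Call a genotype from two-channel SNP array intensities.

    Simplified illustration of allele discrimination: the B-allele
    fraction of the total signal places a sample in one of three
    clusters. Thresholds here are arbitrary examples.
    """
    theta = intensity_b / (intensity_a + intensity_b)  # B-allele fraction
    if theta < 0.25:
        return "AA"   # mostly A-channel signal: homozygous A
    if theta > 0.75:
        return "BB"   # mostly B-channel signal: homozygous B
    return "AB"       # balanced signal: heterozygous

print(call_genotype(950, 60))    # -> AA
print(call_genotype(480, 510))   # -> AB
print(call_genotype(40, 880))    # -> BB
```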
1996
A small-scale genotyping study in HIV patients.1
2002
A study of advanced Philadelphia-positive lymphoblastic leukemia.2
2005
Collaboration between Wellcome Trust and Affymetrix to identify genetic associations in 7 common diseases
The Wellcome Trust Case Control Consortium (WTCCC) genotypes 17,000 samples to identify genetic variants associated with
type 1 and 2 diabetes, coronary heart disease, hypertension, bipolar disorder, rheumatoid arthritis, and Crohn's disease.
2009
Collaboration between UCSF, Kaiser Permanente, and Affymetrix to enable genomic studies of common diseases
100,000 samples are genotyped in 15 months and the database made available to qualified scientists. Among recent studies published
are the identification of potential biomarkers to improve prostate cancer screening5 and the finding of a genetic susceptibility to
Staphylococcus aureus infection that could pave the way to new treatments and prevention of antibiotic-resistant infections like MRSA.6
Empowering biobanks to discover the interplay of genes, environment, and lifestyle
2013
Collaboration between UK Biobank and Affymetrix to genotype 500,000 volunteers
for prospective study
The resulting biomedical database is made available to scientists worldwide by UK Biobank.
Many scientists have already published their discoveries on lung function, smoking behavior,
neurobiological disorders, and many other conditions.
2015
Million Veteran Program builds huge database of genotyping data using Affymetrix's platform
The US Department of Veterans Affairs' Office of Research and Development funds the Million Veteran
Program, resulting in one of the world's largest medical databases that includes genotyping data.
The data collected from one million veteran volunteers furthers scientists' understanding of how
genes affect health, especially military-related illnesses.
Affymetrix launches first test to help diagnose postnatal developmental delay
CytoScan Dx Assay is the first and only FDA-cleared, whole-genome, microarray-based genetic
test to help increase diagnostic yield for postnatal developmental delay and intellectual disability.
Translating biomarkers from lab to clinic
Affymetrix collaborates with diagnostic companies, such as Almac, GenomeDx, Lineagen,
PathGEN Dx, SkylineDx, and Veracyte, turning their biomarker signatures into
microarray-based tests for improved diagnosis and treatment.
References
1. Kozal M. J. et al. Nat Med 2(7):753-59 (1996). 2. Hofmann W. K. et al. Lancet 359(9305):481-86 (2002). 3. Puffenberger E. G. et al. Proc Natl Acad Sci USA 101(32):11689-94 (2004).
4. Caldwell M. D. et al. Blood 111(8):4106-12 (2008). 5. Hoffmann T. J. et al. Cancer Discov 5(8):878-91 (2015). 6. DeLorenze G. N. et al. J Infect Dis 213(5):816-23 (2016).
© 2016 Affymetrix, Inc. All rights reserved. Unless otherwise noted, Affymetrix products are For Research Use Only. Not for use in diagnostic procedures.
SUMMARY
This chapter is not intended to explain how sequencing or
microarrays work. It is intended to show you that you have a range
of options available to you. At the start of the chapter we asked
you to keep a few questions in mind. Principally, what kind of data
do you need to be able to answer your question most efficiently? If
you are looking for novel or rare variants in an individual, whole
genome sequencing will be the way to go. If you want
to explore known regions of interest, then just take a look at the
exome or get a bit more specific with a targeted approach. Maybe
you want to genotype a large population to identify associations?
A genotyping array is going to be much cheaper and easier to
manage than NGS. Pick the technology that works for you and your
operating restrictions, and which will produce the data you need.
In the next chapter we take a look at the chemistry involved in
turning your DNA into data. Now that you've decided what you
want to do with your sample, you'll need to know how to prepare it
for the right platform.
CHAPTER 2:
TURNING DNA
INTO DATA
INTRODUCTION
Over the last twenty years, fundamental advances in sequencing
technology have brought us a long way from the 10 years and
10 billion dollars spent on the Human Genome Project. Next
Generation Sequencing (NGS) and microarray techniques
have dramatically reduced the time and the cost associated
with large scale genome exploration. Today, whether you are
analysing a panel of genes, exploring an exome, or shooting for
an entire genome, there is an extraordinary wealth of different
techniques available.
We will explore the basic science behind these different techniques,
taking a look at how genetic sequencing actually works and how
we generate high quality sequence data for research and clinical
application. Successful analysis is critically dependent on accurate
sample preparation, so we'll take a look at the basics of the
sample preparation process for a range of sequencing techniques,
including NGS and microarrays.
There are numerous kits and methods available for NGS sample
preparation, but several of the basic steps needed to prepare
DNA for sequencing are conserved across different sequencing
techniques. So, for example, preparing DNA for Illumina
sequencing, Ion Torrent sequencing or a DNA microarray all
requires DNA fragmentation. Given the widespread use of the
Illumina platform, for this chapter we will largely focus on the
crucial steps in preparing DNA for Illumina sequencing.
As well as understanding genomic sequences, there are also
sequencing methods that enable us to explore gene expression:
which genes or gene regions are active at a particular point in
time. During this chapter we will look at how these methods
work, when they are used, and how sample preparation differs
for these protocols.
CONCLUSIONS
These highly sensitive and quantitative sorting assays provide
pure and objectively defined populations of neoplastic cells prior
to analysis. The deep and unbiased clonal profiling of sorted FFPE
samples by aCGH and NGS provides a valuable methodology with
broad application for cancer research, which can advance the
development of personalized patient therapies.
Agilent offers a wide range of resources on CGH microarrays and NGS,
including application notes, featured articles, how-to videos and
much more.
Notes
1. T. Holley et al., Deep Clonal Profiling of Formalin Fixed Paraffin Embedded Clinical Samples. PLoS ONE 7(11): e50586. doi:10.1371/journal.pone.0050586.
This article was adapted from Agilent Publication 5991-3333EN.
For Research Use Only. Not for use in diagnostic procedures.
EXON-LEVEL COVERAGE
Two Catalog Arrays for Postnatal and Cancer Research
Designed for exon-level coverage of disease-associated regions recommended
by ClinGen/ISCA or COSMIC and Cancer Genetics Consortium databases
Enhanced loss of heterozygosity detection with a resolution validated to 2.5Mb
Easily customize your microarray at no additional cost
www.agilent.com/genomics/GenetiSureCGH+SNP
For Research Use Only. Not for use in diagnostic procedures.
ILLUMINA
Illumina sequencing all takes place
on a specialised flow cell coated in a
lawn of primers. Fragments of DNA are
hybridised (or attached) to the two-dimensional
surface, forming localised
clusters of about 2,000 identical DNA
fragments. This step is called cluster
generation.
During sequencing these clusters
are bathed in fluorescently labelled
nucleotides, along with a DNA polymerase
enzyme that attaches each fluorescent
nucleotide to its non-fluorescent
complement, so fluorescent A binds with
non-fluorescent T, and so on.
As with Sanger sequencing and other
fluorescent methods, the surface of
the flow cell is then imaged using laser
excitation and the resulting colours used to
record the DNA sequence of each cluster.
WHAT IS PCR?
Developed in 1983, the
polymerase chain reaction, or
PCR, has become an essential
part of any genetics toolkit.
PCR is a molecular
photocopier, which amplifies,
or makes multiple copies of,
small segments of DNA. Genetic
sequencing requires large
amounts of sample DNA, making
the process almost impossible
without PCR.
The DNA sample is heated,
so that the two DNA strands
denature and pull apart into two
separate strands.
The mixture is then cooled so
that short primers can anneal to
each separated strand. Next, an
enzyme called Taq polymerase
builds two new strands of DNA
using the original strands as a
template. This process creates
two identical versions of the
original molecule, which can
then be used to create two new
copies, and so on.
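Because each cycle can at best double the template, copy number grows geometrically. A quick sketch of that arithmetic (the efficiency term is an illustrative simplification; real reactions plateau as reagents deplete):

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Ideal PCR doubles the template each cycle.

    Real reactions fall short of perfect doubling, so an efficiency
    term (0 to 1) is included. Numbers are purely illustrative.
    """
    return initial_copies * (1 + efficiency) ** cycles

print(pcr_copies(10, 30))        # 10 molecules after 30 ideal cycles: ~1e10 copies
print(pcr_copies(10, 30, 0.9))   # the same run at 90% efficiency yields far fewer
```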
Figure 1. Library yields for the NEBNext Ultra II, Kapa Hyper and TruSeq Nano kits across a range of DNA input amounts.
Higher library yields are obtained with the NEBNext Ultra II
Kit compared to other commercially available kits (Figure 1). Even when
using very low input amounts (e.g. 500 pg of human DNA), high yields
of high quality libraries can be obtained using fewer PCR cycles.
The efficiency of the end repair, dA-tailing and adaptor ligation steps
during library construction can be measured separately from the PCR
step by qPCR quantitation of adaptor-ligated fragments prior to library
amplification. This enables determination of the rate of conversion of
input DNA to adaptor-ligated fragments, i.e. sequenceable molecules.
Measuring conversion rates is therefore another way to assess the
efficiency of library construction, and also provides information on the
diversity of the library. Again, NEBNext Ultra II enables substantially
higher rates of conversion as compared to other commercially available
kits (Figure 2).
Figure 2. Libraries were prepared from Human NA19240 genomic DNA using the input amounts (100 ng to 500 pg)
and library prep kits shown, without an amplification step, and following manufacturers' recommendations. qPCR
was used to quantitate adaptor-ligated molecules, and quantitation values were then normalized to the conversion
rate for Ultra II. The Ultra II kit produces the highest rate of conversion to adaptor-ligated molecules for a
broad range of input amounts.
Libraries were prepared from Human NA19240 genomic DNA using the input amounts and numbers of PCR cycles shown
(100 ng, 5 cycles; 10 ng, 8 cycles; 1 ng, 11 cycles; 500 pg, 14 cycles). Manufacturers' recommendations were
followed, with the exception that size selection was omitted.
UNIFORM GC COVERAGE
Libraries from varying input amounts of three microbial genomic DNAs
with low, medium and high GC content (H. influenzae, E. coli and
R. palustris) were prepared using the NEBNext Ultra II Kit. In all cases,
uniform coverage was obtained, regardless of GC content and input amount
(Figure 4A). GC coverage of libraries prepared using other commercially
available kits was also analyzed using the same trio of genomic DNAs.
Again, NEBNext Ultra II provided good GC coverage (Figure 4B).
Figure 3. Ultra II libraries were prepared from Human NA19240 genomic DNA using NEBNext Ultra II and the
input amounts shown (1 µg, 100 ng and 10 ng). Yields were measured after each PCR cycle and the number of
cycles required to generate at least 1 µg of amplified library determined. Cycle numbers for Kapa Hyper were
obtained from Kapa Biosystems' website and plotted alongside the cycle numbers obtained experimentally for Ultra II.
Figure 4. NEBNext Ultra II provides uniform GC coverage for microbial genomic DNA over a broad range of GC
composition and input amounts. Libraries were made using 500 pg, 1 ng and 100 ng of the genomic DNAs shown
and the Ultra II DNA Library Prep Kit (A), or using 100 ng of the genomic DNAs and the library prep kits shown,
i.e. Ultra II, Kapa Hyper and TruSeq Nano (B), and sequenced on an Illumina MiSeq. Reads were mapped using
Bowtie 2.2.4 and GC coverage information was calculated using Picard's CollectGCBiasMetrics (v1.117). Expected
normalized coverage of 1.0 is indicated by the horizontal grey line, the number of 100 bp regions at each GC%
is indicated by the vertical grey bars, and the colored lines represent the normalized coverage for each library.
Even more from less.
NEBNext Ultra II DNA Library Prep Kit for NGS
Visit NEBNextUltraII.com to request a sample.
dA-tailing
During Illumina library preparation there is an additional step called
dA-tailing of the 3′ end of the repaired fragment. An A nucleotide
overhang is attached to the 3′ end of each DNA strand, which will
enable the right adaptors to ligate, or attach, to the DNA strand in
the next step.
3. ADAPTER LIGATION
Quite simply, this involves attaching known sequences
(adaptors) to the ends of the prepared DNA fragments whose
sequence is unknown. Adaptors are needed further downstream
in the sequencing process, and are essential for sequencing to
work properly.
For example, during Illumina sequencing the adaptors are
needed to hybridise the DNA strands to the flow cell. The flow
cell itself is covered in a dense lawn of primers to which the
DNA fragments attach. Adaptors can also contain an index
sequence, allowing for multiple different samples to be studied
in a single flow cell.
During SOLiD or 454 sequencing protocols, the adaptors are
required to bind the DNA fragments to the agarose beads on which
the sequencing reaction takes place.
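The dA-tailing and ligation steps can be caricatured on a single strand. The adaptor sequences below are examples for illustration only, not any kit's actual designs, and real library molecules are double-stranded with distinct adaptors at each end:

```python
# Example adaptor sequences (illustrative, not a real kit's designs)
P5_ADAPTER = "AATGATACGGCGACCACCGA"
P7_ADAPTER = "CAAGCAGAAGACGGCATACGA"

def prepare_fragment(insert: str) -> str:
    """Mimic dA-tailing and adaptor ligation on one repaired fragment.

    A toy, single-stranded view: add the 3' A overhang, then flank the
    fragment with adaptors so it can hybridise to the flow cell.
    """
    da_tailed = insert + "A"                    # dA-tailing: 3' A overhang
    return P5_ADAPTER + da_tailed + P7_ADAPTER  # adaptor ligation

molecule = prepare_fragment("ACGTACGTTAGC")
# Only molecules carrying an adaptor at each end are sequenceable:
print(molecule.startswith(P5_ADAPTER) and molecule.endswith(P7_ADAPTER))  # True
```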
4. AMPLIFY
Finally, a PCR amplification is performed to create a robust library
of DNA fragments that is suitable for sequencing. This step
increases the amount of library, and ensures that only molecules
with an adaptor at each end are selected for sequencing.
5. CLEAN-UP AND QUANTIFY
For Illumina sequencing, a final round of gel electrophoresis is
often used to purify the final product, and conclude the library
preparation process.
Before sequencing it is important to determine that the library
contains a suitable number of molecules that are ready to be
sequenced: that the right number of DNA fragments, with attached
adaptors, is present in the sample.
Another reason to quantitate a library is if more than one is
due to be sequenced at the same time, as is possible with
Illumina sequencing.
There are several different methods for library quantitation, but
they all broadly work in the same way, detecting the presence of
the right sized fragments. For example:
Spectrophotometry: this method detects the absorption of UV
light by macromolecules in the sample. The larger the DNA
molecule, the greater the UV absorption.
Fluorimetry: this method involves binding a fluorescent dye to
the DNA molecules and measuring the fluorescence. Larger
molecules fluoresce more brightly than smaller ones.
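Quantitation results are usually converted to a molar concentration before loading a sequencer. A common back-of-the-envelope conversion uses roughly 660 g/mol per base pair of double-stranded DNA; the input values below are illustrative:

```python
def library_molarity_nM(conc_ng_per_ul: float, mean_fragment_bp: int) -> float:
    """Convert a measured concentration (ng/µL) and mean fragment size
    into molarity (nM), assuming ~660 g/mol per base pair of dsDNA."""
    return conc_ng_per_ul / (660.0 * mean_fragment_bp) * 1e6

# e.g. a 10 ng/µL library with 400 bp average fragments is ~38 nM:
print(round(library_molarity_nM(10, 400), 1))
```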
MAKING A LIBRARY
DNA FRAGMENTATION: generating a series of DNA sequence fragments is the first step in generating high quality sequence data
END REPAIR
ADAPTER LIGATION
PCR ENRICHMENT
CHAPTER 3:
ANALYSING DATA
INTRODUCTION
In the previous two chapters we have considered the kinds of
genomic data we can generate, and the chemistry that makes it
possible. At the heart of genomics is data analysis. Once you have
digitised your DNA, you can start to explore it, understand it, and
query it. In this chapter we will look at how to analyse microarray
and NGS data, and how to turn it into useful information.
DISCOVERY IN MILLIONS
OF GENOMES
Julia Fan Li, Senior Vice President, Seven Bridges
STUDIES THAT ANALYZE MILLIONS OF GENOMES AT ONCE WON'T JUST BE TECHNICAL FEATS;
THEY WILL LEAD US TO TARGETED TREATMENTS FOR SUFFERERS OF MANY DISEASES,
INCLUDING CANCER.
This is what we mean when we say the graph genome is self-improving: it gets better the more you use it.
(Diagram: a graph genome, in which alternative bases at each position are weighted by their observed population frequencies.)
PORTABLE, REPRODUCIBLE WORKFLOWS
The second trend we've identified with our partners is the need for
completely portable, and thus reproducible, workflows and pipelines.
The more large-scale data analysis enters the everyday practice
of science and medicine, the clearer it is that algorithms and
the software used to implement them have become an integral
and important part of research methods.3 But the complexity
of tracking, let alone sharing, software methods increases in
lockstep with the complexity of the tools themselves. For example,
a typical TCGA marker paper uses more than 50 bioinformatic tools,
each of which comes in multiple versions and with many different
parameters. Standardising the way we document tools, parameters,
and their dependencies is crucial to making the process of repeating
methods easier.
These issues were the impetus for the bioinformatics community
to develop the Common Workflow Language (CWL) <www.
commonwl.org>. CWL is a specification, much like HTML, that uses
plain text to describe every piece of a complex computational workflow.
Better still, it was defined with Docker <www.docker.com> in mind,
meaning CWL-compliant software can reproduce a given workflow
perfectly in the future by re-downloading the exact version of
any given application. And, because CWL is an open specification, it
prevents lock-in: researchers can use any analysis tool they prefer, or
even write their own.
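To give a flavour of what the specification looks like, here is a minimal, hypothetical CWL tool description; the wrapped command and the pinned Docker image are placeholders, but the structure is what a CWL-compliant platform consumes:

```yaml
#!/usr/bin/env cwl-runner
# Minimal, hypothetical CWL v1.0 tool description: wraps `echo` and pins
# a Docker image so the exact run environment can be re-downloaded later.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
requirements:
  - class: DockerRequirement
    dockerPull: ubuntu:20.04
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  out:
    type: stdout
stdout: output.txt
```

Because the tool, its parameters, and its container image are all captured in one plain-text file, re-running the workflow years later pulls exactly the same software.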
We've become big believers in CWL, and have built it into both the
CGC and the Seven Bridges Platform. Other industry partners are
doing the same, including the Institute for Systems Biology, the
Sanger Institute, the Galaxy Project, and the Broad Institute.
Making reproducibility a copy-and-paste affair is not just good
for science; it accelerates the pace at which we can build on the
discoveries of the entire community.
ADVANCED DATA STRUCTURES
The final trend we see across all our projects, from large
pharmaceutical companies to national projects like Genomics England, is a
need to bring genomic data structures into the 21st century.
The linear data formats of traditional genomics tools can't scale to
the number of samples we need to analyze simultaneously.
Today, when we want to understand an individual patient, we align
their reads to a static reference and store the results in static, flat
files. And we repeat this process for each new patient. Worse still, the
static reference is updated only once every three years, and only
represents a small collection of individuals, leading to inherent bias.
Instead, we need a reference that can be updated immediately
with new evidence. We need a reference that learns. We need a
reference that contains knowledge of an entire population.
We do this through a new technology we call the Graph Genome,
which advances genetic analysis in two key ways. First, it helps us
create an ever-more accurate view of both an individual's genetic
makeup and that of the population as a whole. Second, it is a more
efficient way to store and analyze vast quantities of genetic data.
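A toy sketch of the idea, illustrative only and not Seven Bridges' actual implementation: each variable site keeps every allele observed in the population, with frequencies that update as new samples arrive, so the reference "learns" rather than staying static.

```python
# Toy sketch of a "graph genome" site: instead of one static reference
# base, each variable site stores every allele observed in the population
# with its frequency. (Illustrative only, not a real implementation.)

class GraphSite:
    def __init__(self, ref_allele):
        self.counts = {ref_allele: 1}          # allele -> times observed

    def add_observation(self, allele):
        """Update the site with evidence from a newly sequenced sample."""
        self.counts[allele] = self.counts.get(allele, 0) + 1

    def frequencies(self):
        """Current population allele frequencies at this site."""
        total = sum(self.counts.values())
        return {a: n / total for a, n in self.counts.items()}

# The reference improves with use: every new sample updates the site.
site = GraphSite("G")
for observed in ["G", "A", "G", "A", "A"]:
    site.add_observation(observed)
freqs = site.frequencies()
print(freqs)   # {'G': 0.5, 'A': 0.5}
```

Aligning a new sample against allele frequencies like these, rather than a single fixed base, is what lets the graph reference reduce the bias inherent in a static reference.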
Notes
1. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature.
2014 Jan 23;505(7484):495-501. doi: 10.1038/nature12912. Epub 2014 Jan 5.
2. http://www.economist.com/news/business/21648685-cloud-computing-prices-keepfalling-whole-it-business-will-change-cheap-convenient
3. Software with Impact. Nature Methods 11, 211 (2014). doi:10.1038/nmeth.2880
NGS ANALYSIS
INTRODUCTION
In the last decade the genomics industry has seen the cost of next
generation sequencing (NGS) drop faster than the slope of Moore's
law, from about US$10 million to approximately $1,000 per genome.
In response to questions I receive from friends and colleagues who ask "What does DNAnexus do?",
I thought I might offer a high-level perspective.
WHAT IS DNAnexus?
DNAnexus is a professional grade platform that makes it
easier for users to do three things, each in a secure and
compliant fashion:
1. Analyze large amounts of raw genetic data
2. Collaborate around large amounts of data (including but
not limited to genetics)
3. Integrate genetic data with other types of data, such as
data from electronic medical records to advance science
and improve clinical care
LOOKING AHEAD
Guided by the visionary partners with whom we are privileged
to work, DNAnexus continues to enhance our capabilities
within each of these three areas: DNA analysis, distributed
collaboration, and integration with other data types. We are
constantly seeking opportunities to leverage the technology
we've developed through collaborations with innovative
leaders looking to use the power of our platform to approach
compelling scientific and clinical challenges.
END-TO-END WORKFLOW
[Figure: the DNAnexus end-to-end workflow, connecting integrated partner solutions (clinical, pharma) and interpretation/annotation databases to produce a report.]
Although the analysis tools listed above are packaged for a single purpose,
they can be linked together to form a secondary analysis pipeline. This
can be done manually on your local cluster, with a considerable amount
of IT customisation, or through an open-source bioinformatics platform
like Galaxy, or commercial bioinformatics platforms like Seven Bridges
and DNAnexus. If you have a very good grasp of the IT required, and
the scope to carry it out, an open-source platform might be a good fit.
However, it will require a more continuous IT effort than you
would need to commit to a commercial solution.
Example of a typical secondary analysis pipeline:
1. FASTQ input from primary analysis typically conducted on the
sequencer
2. BWA or Bowtie maps the reads to the reference genome,
generating a BAM file.
3. GATK or FreeBayes takes the BAM file and identifies variants in
the donor relative to the reference genome.
4. The output is a VCF file, which lists all the donor's variants in
relation to the reference.
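The VCF produced in step 4 is plain tab-separated text whose first eight columns are fixed by the VCF specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). A minimal sketch of pulling the key fields out of one made-up record:

```python
# Parse one made-up VCF data line. The leading columns are fixed by the
# VCF spec: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
record = "chr1\t12345\t.\tA\tG\t99\tPASS\tDP=30"

fields = record.split("\t")
variant = {
    "chrom":  fields[0],
    "pos":    int(fields[1]),
    "id":     fields[2],                # "." means no database ID
    "ref":    fields[3],                # allele in the reference genome
    "alt":    fields[4].split(","),     # donor's alternate allele(s)
    "filter": fields[6],
}
print(variant["chrom"], variant["pos"], variant["ref"], "->", variant["alt"])
```

Real VCF files carry a header block and per-sample genotype columns as well, but every variant line follows this column layout.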
Today, many organisations are participating in global large-scale
sequencing projects to study thousands or even millions of genomes,
making the challenge of storing and managing NGS data ever more critical.
According to a recent paper in PLoS Biology, "Big Data: Astronomical
or Genomical?", between 100 million and 2 billion human genomes are
expected to be sequenced by 2025. The storage capacity required for this
alone is pegged at ~2-40 exabytes (1 exabyte = 10^18 bytes), which exceeds
the projected data storage requirements of three other major big data
generators: YouTube (a projection of 1-2 exabytes per year), Twitter
(estimated to require 1-17 petabytes per year; 1 petabyte = 10^15 bytes)
and the Square Kilometre Array or SKA (which might create a demand for
1 exabyte of data storage capacity).
On average, the storage space required for analysing a whole
genome on an Illumina HiSeq is ~200 GB. Given the variation among
human genomes, the storage requirements for a large-scale genome
sequencing project are huge. For example, the 1000 Genomes Project
comprises more than 200 terabytes of data for its 1,700 participants.
The analysis costs associated with such a large project may sometimes
exceed reagent costs, given how far sequencing costs have fallen.
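The arithmetic behind these projections is straightforward. A sketch using the ~200 GB per analysed genome figure from the text and a hypothetical million-genome project:

```python
# Back-of-envelope storage estimate using the figures quoted above.
GB_PER_GENOME = 200            # ~200 GB per Illumina whole-genome analysis
genomes = 1_000_000            # a hypothetical million-genome project

total_gb = genomes * GB_PER_GENOME
total_pb = total_gb / 1e6      # 1 petabyte = 10**6 gigabytes (decimal units)
total_eb = total_gb / 1e9      # 1 exabyte  = 10**9 gigabytes

print(f"{total_pb:.0f} PB ({total_eb:.1f} EB)")   # 200 PB (0.2 EB)
```

Scaling the same sum to the hundreds of millions of genomes projected for 2025 lands squarely in the exabyte range quoted above.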
CLOUD COMPUTING AS A SOLUTION
Converting DNA into meaningful genetic information involves
extensive computational resources dedicated to the application of
bioinformatics for secondary analysis, let alone considerable data
storage capacity. With research projects involving the sequencing
and analysis of tens of thousands to millions of genomes becoming
the norm, many organisations are finding that their local clusters
can't keep pace with the sequencing volume. The cloud is currently
the only technology capable of keeping pace with big data.
Accordingly, the genomics industry is turning to cloud approaches to
meet its scalable computation and storage requirements.
Cloud service providers like Amazon Web Services and Google Cloud
offer scientists access to powerful computational resources without
the cost of building and maintaining their own infrastructure.
SUMMARY
In this chapter we have looked at one of the most important
parts of genomics: turning raw data into something you can use.
For microarrays, there is a wealth of standardised, easy to use
analysis options. For NGS, things are a little bit more complicated
and potentially require considerably more resources.
In the next chapter we take a closer look at what you can do with
your NGS data to add biological context to it.
CHAPTER 4:
NGS
INTERPRETATION
AND DISCOVERY
SPONSORED BY
INTRODUCTION
so are not in the literature or existing gene panels, or are found in the
patient but not inherited from either parent (de novo variants).
There are three main challenges associated with the actual process
of genomic interpretation and discovery:
Scale: The vast size and complexity of raw genomic data.
Power: Limited diagnostic and discovery yield when we seek to fully
exploit all of the available data.
Reach: The increasing need to connect data sets and link
interpretive tools worldwide.
Finally, we shall take a look at the regulatory issues surrounding unknown
variants. A single NGS test has the potential to identify thousands of
variants that could be used in a diagnosis, but in order to form the basis
of a diagnosis the test must meet regulatory standards. We shall explore
the regulatory challenges surrounding interpretation and discovery.
SNIP SNIP
SNPs, single nucleotide polymorphisms, or "snips", are one of the most common forms of genetic variation. A SNP is a single base-pair mutation at a specific location in the genome. In humans, SNPs can be associated with disease susceptibility. Conditions such as sickle-cell anaemia and cystic fibrosis have been linked to specific SNPs.
SCALE
Only a few years back, genomics was relatively data poor. SNP
genotyping enabled broad coverage of the genome but was useful
principally for identifying common variants and assigning risk for
common diseases. Finding rare variants was slow, painstaking and
expensive, requiring sequencing of individual genes and the steady
but very slow compilation of disease-linked variant panels.
The good news was that genotyping in this way did not generate
very large quantities of data. What data was produced could be
stored and analysed using standard database technology.
The emergence of NGS changed all of that. Following the advent
of advanced sequencing techniques, the raw data from a single
exome, the coding region of a genome, now can weigh in at more
than 10 gigabytes of data (depending on read depth). An entire
human genome at 30X depth generates a file of approximately 90
gigabytes. In context, a computer with a 1 terabyte hard drive can
store fewer than ten individual genomes, hardly enough for a
large-scale research project. Data on this scale, particularly querying
thousands of genomes simultaneously, is overwhelming for
standard databases and all but the largest IT systems. Even when
these data can be stored, mining sequences intensively is extremely
slow, because the time it takes to complete a computation is limited
by the input/output channel. This issue is further
compounded as the number of analysed samples increases.
The answer to this problem is to develop new, more
computationally efficient data architectures. The most widely-used
solution for large-scale research and diagnostics is the
genomically-ordered relational database (GORdb), developed a
decade ago for the world's largest population genomics research
effort, in Iceland. Offering a different approach to data
storage and retrieval, this system is now being further refined and
deployed around the world.
POWER
The ability to mine sequence datasets is only useful if potentially
pathogenic variants can be efficiently identified. The diagnostic yield
of a system is directly related to its ability to compare samples to
databases of existing genomic data, access to extensive reference
libraries, and an ability to predict deleterious variants, even if they are
novel and not previously annotated. Creating a seamless link between
this information and the clinic is a critical step in advancing both
precision medicine and genomic research.
Sequence information
Looking for disease-causing variation within the genome typically begins
with checking the sequence against lists of known disease-associated
genes and reference sources. In clinical diagnostics, for example, this
could involve using a gene panel test that examines specific regions of
the genome looking for known alterations that are linked to disease.
For example, the TaGSCAN (Targeted Gene Sequencing and Custom
Analysis) screening panel examines 514 genetic regions that have been
associated with childhood diseases. There are gene panels available
from a wide array of companies that can be used for the identification of
carrier status, assessment of disease risk, and diagnoses.
While this approach is a valuable start in interrogating a genome,
detailed knowledge of individual variants (the "known knowns")
is not always extensive enough to obtain an answer. In fact, this
approach will only solve 20-25% of rare disease cases.
If comparative methods fail to generate a result, the next step
is typically a systematic search of the genome that filters the
information for a range of different genetic features. These include:
Population Allele Frequency: Tools like the Exome Aggregation
Consortium (ExAC) allow researchers to identify rare, potentially
disease-linked variants within a cohort of over 60,000 individuals.
Variant Impact: Tools like the Variant Effect Predictor (VEP) can
predict the impact that identified variants will have on genes,
transcripts and proteins. This analysis is based upon the location
of a variant within a gene and the expected effect that a mutation
will have on its product (most often a protein).
Inheritance Model: Such as autosomal dominant or recessive.
This also includes de novo mutation, where a variant has
arisen spontaneously and has not been inherited.
Paralogs: These are usually silent second copies or versions
of genes that have been kept in the genome over the course of
evolution, but can still be, or become, functional.
NO ONE PUTS THE FULL POWER OF THE GENOME AT YOUR FINGERTIPS LIKE WUXI NEXTCODE.
INTERPRET CLINICAL CASES WITH UNRIVALLED POWER; MINE POPULATION WGS IN MINUTES;
JOIN COHORTS ON MANY CONTINENTS. ALL YOUR DATA AND RESULTS BACKED BY ALL
KEY GLOBAL REFERENCE SETS AND COLLECTIONS IN ONE SYSTEM, ONE FORMAT, AT RAW
RESOLUTION AND IN REAL TIME.
sales@wuxinextcode.com
Shanghai | Cambridge | Reykjavik
IN RARE DISEASE
CASES, PARTICULAR
VARIANTS ARE
RARE BUT MANY
MAY CLUSTER IN A
PARTICULAR GENE.
REACH
Storing, accessing and mining data in situ is the first part of the
challenge surrounding interpretation and discovery. The second
part, which is set to be a crucial game-changer, is the ability to
work with these datasets online, from anywhere in the world. The
beacons being developed by Global Alliance are a simple example
of how this can work, but for the future it will become critical for
researchers and clinicians to go beyond asking basic questions.
Given the scale of genomic data, the standard approach is to
hold the genomic information in one central database and allow
researchers and clinicians remote access, sometimes accompanied
REGULATORY CHALLENGES
As we have already discussed, NGS produces enormous quantities
of data, and has the potential to identify thousands of variants
that may be disease-linked. This creates a significant challenge for
regulators. For a diagnostic genetic test to gain regulatory approval,
and so be clinically useful, the U.S. Food and Drug Administration
typically requires that the variant identified by the test is reported,
and is known to be associated with a disease. Test developers must
show clinical significance before approval can be given.
However, to date a relatively small number of disease-linked
variants have been identified, and often NGS tests are used
precisely because they can routinely detect rare variants that may
not be identified by established tests.
At present, discussions around how to effectively regulate NGS
tests are ongoing, and one focus is to evaluate the methodology of
NGS interpretation, as well as previously seen links, combined with
a continuing assessment of clinical outcomes.
ON TO THE CLINIC
As we have outlined in this chapter, the challenges involved in
interpreting the tidal wave of NGS data are considerable, but not
insurmountable. There are numerous initiatives underway to ease
the strain in the bottleneck and speed the development of precision
medicine systems.
In the next chapter we will look at genomics in the clinic, where the
results of interpretation and discovery are turned into actionable
results for clinicians and patients.
CHAPTER 5:
NGS IN THE
CLINIC
SPONSORED BY
INTRODUCTION
In the preceding chapters we have explored the process of
collecting and analysing genomic sequence data, in a manner that
could be applied to both research and to clinical diagnostics. For
this chapter we will be focussing entirely on genomics in the clinic,
and how the outcomes of genomic tests are communicated to
clinicians and patients by the clinical laboratories conducting them.
Creating an accessible, useful report based on NGS information
and analysis for a physician is one of the most challenging areas
of clinical genomics. High quality patient care is dependent on a
written report that is easy to understand and easy for physicians
and genetic counsellors to act upon.
Traditionally, laboratory tests have looked for specific genetic
variants with known disease outcomes. But with the tidal wave
of information generated by NGS tests, clinical laboratories
are faced with the thorny issue of how to present information
on thousands of genetic variants, many of which will have
inconclusive clinical outcomes.
[Figure: the structure of a gene, alternating exons (coding) and introns (non-coding) between the start and end of the gene.]
TO KNOW OR
NOT TO KNOW?
During a whole exome test to diagnose a patient's rare condition, the clinical laboratory conducting the test makes a secondary finding, namely a mutation in the patient's BRCA1 gene that massively increases their lifetime likelihood of developing breast cancer. However, the patient has specifically asked not to be notified of secondary findings. What is the right course of action?
This debate is ongoing in the clinical genomics community. On the one hand, it seems morally wrong not to inform the patient about a potentially lethal gene mutation that they are otherwise unaware of, even if they have not consented to receive that information. However, patients do, and should, have the right to autonomy over their medical information and how it is used.
One solution, recommended by the ACMG, is to have a minimum list of known conditions, such as BRCA1 mutations, which are routinely evaluated and reported on as part of a genetic test. These results would be reported without seeking preferences from the patient.
What do you think?
Data re-analysis
Over time, as our knowledge of the genome increases, patients
whose tests were previously unsuccessful may find themselves
in a position to obtain a diagnosis. A crucial part of developing
the clinical reporting system will be future-proofing: ensuring
that patients can come back for a diagnosis as
the science advances. Again, how to handle data re-analysis
currently comes down to the discretion and capabilities of the
individual clinical laboratory.
For example, how much patient data should a clinical
laboratory store in order to support a re-analysis? As with the
formatting and content of a clinical report, there is no hard and
fast industry standard. One option is to store the list of variants
discovered by the test, the VCF or variant call format file, rather
than the complete exome or genome sequence. This solution
places less strain on a laboratory's data infrastructure, but
there is the risk that the VCF file may not contain the relevant
INTRODUCTION
In a recent paper, the American College of Medical Genetics and Genomics
published standards and guidelines for the interpretation of sequence
variants1. The College made these available as an educational resource
for clinical laboratory geneticists, to help them provide quality clinical
laboratory services. Although adherence to these standards and guidelines
is voluntary and cannot replace the clinical laboratory geneticist's
professional judgment, the recommendations represent a broad consensus
of the clinical genetics community. With increasing volumes and the use
of large gene panels (clinical, full exomes and even full genomes) in routine
clinical genetics practice, labs need strong informatics tools that support
them in the automation and standardization of variant assessment and
reporting, in order to benefit from community standards and to keep
up with the best standard of care. In this case study, we showcase how
Cartagenia Bench Lab NGS enables labs to implement their take on
the ACMG recommendations. The Molecular Genetics department at
Uppsala University Hospital illustrates how it has implemented the
recommendations in its specific routine diagnostic setting, using a flexible,
drag-and-drop interface to build and store the lab's variant triage protocol.
KEY REQUIREMENTS
The standards and guidelines describe an evidence-based approach
for the assessment of variants of clinically validated genes. The
recommendations use literature and database-based criteria to classify
variants in five different categories: benign, likely benign, uncertain
significance, likely pathogenic and pathogenic. Evidence levels are
weighted (e.g. Strong, Moderate). To allow labs to automate their
implementation of this evidence-based approach, a number of specific
tools are required.
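To give a flavour of how such automation works, here is a deliberately simplified evidence-combining sketch in the spirit of the five-tier classification. The real ACMG combining rules (PVS1, PS1-4, PM1-6, PP1-5, and so on) are considerably more involved; the thresholds below are made up for illustration:

```python
# Deliberately simplified sketch of combining weighted evidence into the
# five ACMG-style categories. The real combining rules are more involved;
# these thresholds are illustrative only.
def classify(pathogenic_strong, pathogenic_moderate, benign_strong):
    """Map counts of evidence items to a five-tier classification."""
    if benign_strong >= 2:
        return "benign"
    if benign_strong == 1:
        return "likely benign"
    if pathogenic_strong >= 2:
        return "pathogenic"
    if pathogenic_strong == 1 and pathogenic_moderate >= 2:
        return "likely pathogenic"
    return "uncertain significance"

print(classify(pathogenic_strong=2, pathogenic_moderate=0, benign_strong=0))
# -> pathogenic
```

Platforms like the one described here encode the lab's chosen rules once, in a decision tree, so every variant in an assay is triaged the same way.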
ANNOTATION SOURCES, SUCH AS POPULATION, DISEASE-SPECIFIC,
AND SEQUENCE DATABASES
The guidelines recommend the use of a wide range of criteria.
Examples include: population databases such as the Exome
Aggregation Consortium (ExAC, http://exac.broadinstitute.org/); disease
databases such as ClinVar (http://www.ncbi.nlm.nih.gov/clinvar), and
sequence databases such as RefSeq (http://www.ncbi.nlm.nih.gov/
refseq/rsg). With Cartagenia's Bench NGS platform, labs can integrate
and use a wide range of community-accepted resources, including
Figure 1.
Partial view of the Uppsala
University Hospital decision
tree representing their filtration
strategy, investigating public
and in-house variant databases,
modes of inheritance, population
frequency statistics databases, and
variant coding effect. Top: decision
tree. Middle: currently selected
ACMG category PP5. Bottom:
variants matching selected criteria.
(Courtesy of Dr. Berivan Baskin)
IMPLEMENTATION
The molecular genetics laboratory at the Uppsala University Hospital has
implemented the ACMG guidelines on the Cartagenia Bench NGS platform
and validated their approach on a set of clinical cases. The lab has
implemented different criteria as well as levels of evidence in a decision
tree, partially shown in Figure 1. In this view, a validated pipeline is run on
a Connective Tissue Panel sample, showcasing a variant in the COL1A2
gene that is reported as clinically relevant. The protocol represented
by the tree has checked all variants in the assay, and highlighted the
p.Gly949Ser variant for review. The clinical geneticist then verifies
the relevant sources, in this case: ESP, 1000 Genomes, ExAC, HGMD, in silico
score annotations from the ACMG-recommended SIFT, Mutation Taster and
PolyPhen, and a confirmed spectrum of missense mutations in the gene
at hand. Parental samples tested negative for this variant.
CONCLUSION
With this case study, the lab has illustrated how various features of the
Cartagenia Bench Lab NGS platform were used to implement an automated
Standard Operating Procedure that reflects how the lab performs variant
filtration. This case illustrates strong advantages in lab efficiency:
whereas a manual process of variant filtration is time-consuming and
error-prone, the lab benefits from the automation of these manual
protocols, freeing up time for genetic specialists to focus on variant
interpretation and reporting.
Notes
1. Richards et al., Genetics in Medicine, advance online publication 5 March 2015.
doi:10.1038/gim.2015.30
This article is adapted from Agilent Publication 5991-6387EN.
Cartagenia Bench Lab is marketed in the USA as exempt Class I Medical Device and in
Europe and Canada as a Class I Medical Device.
IN AN IDEAL SCENARIO,
FOLLOWING A GENETIC
TEST A PATIENT'S
REPORT AND ALL THE
ASSOCIATED RAW DATA
WOULD BE UPLOADED
TO THEIR EMR.
CONCLUSION
The use of NGS tests for clinical diagnostics looks set to become
part of routine healthcare practice, and with the development of
EMR systems the long-term benefits for medical research could
be significant.
However, there are many challenges that have yet to be solved, and
many tools and processes that need to be developed in order to fully
realise the benefits. Clinical reporting is set to evolve rapidly,
as the cost of sequencing decreases and our knowledge of the genome
increases. Consequently, best practices for analysis, variant interpretation
and clinical reporting will also continue to evolve.
CHAPTER 6:
EDITING THE
GENOME
SPONSORED BY
INTRODUCTION
It is impossible to go to a conference in the genomics, molecular
biology, or synthetic biology space without hearing the terms
"genome editing" or "CRISPR". Precise, specific, and controlled
genome editing is arguably the trendiest application in these spaces
at the moment, with momentum building on the back of its bold
promises. It is hard to ignore a field that could change the face of
personalised medicine, with the potential to treat thousands of
currently untreatable diseases.
Today there are five papers published every day on CRISPR
alone, an astounding number for a technology barely three years
old! But before we get to that, we want to go back to humble
beginnings, to where the genome-engineering journey began.
Human-influenced genomic modification of organisms is as old
as selective breeding, which has existed for millennia, whether through
the intentional selection of beneficial traits or unintentionally through
domestication. The direct modification of organisms using targeted
methods has existed for around four decades.
TECHNOLOGY
SHOWCASE:
DEVELOPING
MOUSE MODELS
FOR CYSTIC
FIBROSIS
Cystic fibrosis patients carry a mutation in a chloride ion channel, causing mucosal tissues to function incorrectly, leading to impaired mucosal secretion and damage to the intestinal tract. Patients also often suffer severe and chronic infection from over 50 pathogenic or opportunistic species.
Much of the pathophysiology of this life-shortening disease was learned through early mouse models generated by homologous recombination. One important area has been the study of highly complex bacterial biofilms within the lungs of CF-model mice. Importantly, co-infection with multiple pathogens, e.g. Pseudomonas aeruginosa and Burkholderia cenocepacia, led to both increased inflammatory responses and the establishment of chronic infection in mice. In the last five years a number of potential anti-biofilm drugs have been tested on the CF-model mice, with the hope that they could extend patients' lives.
[Figure: microinjection of a cell to introduce new genetic material.]
[Figure: a cartoon depicting how zinc finger nucleases (ZFNs) bind to three specific nucleic acid bases.]
In 2009 researchers deciphered their astoundingly simple DNA-binding mechanism, and by 2010 TALENs were being engineered to direct double-stranded breaks in DNA. An individual TAL is a small 33-35 amino acid protein, with two adjacent amino acids (in positions 12 and 13) controlling DNA binding. Therefore only four TAL variants (one for each A, T, C, G nucleotide) have to be organised to provide sequence-specific DNA binding.
[Figure: a cartoon depicting how TAL effector nucleases (TALENs) bind to individual nucleic acid bases.]
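That one-repeat-per-base logic makes TALEN design essentially a lookup. A sketch using the commonly cited repeat-variable di-residue (RVD) codes (NI for A, HD for C, NG for T, NN for G):

```python
# Each TAL repeat reads one DNA base via its repeat-variable di-residue
# (RVD) at positions 12-13. Commonly cited codes: NI->A, HD->C, NG->T, NN->G.
RVD_FOR_BASE = {"A": "NI", "C": "HD", "T": "NG", "G": "NN"}

def design_tal_array(target):
    """Return the ordered RVD repeats needed to bind `target` (5'->3')."""
    return [RVD_FOR_BASE[base] for base in target.upper()]

print(design_tal_array("GATC"))   # ['NN', 'NI', 'NG', 'HD']
```

Stringing the chosen repeats together, with a nuclease domain attached, yields a protein that binds, and cuts near, exactly the programmed sequence.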
TECHNOLOGY
SHOWCASE:
12 PEOPLE TREATED
FOR HIV
In 1995, individuals were found to be naturally resistant to HIV. Each carried a 32 base pair deletion in an immune receptor called CCR5, ablating its function. In 2008, using zinc finger nucleases, CCR5-null cells were produced in the lab. Between 2011 and 2013, 12 HIV-positive patients had their immune cells cultivated, modified with zinc finger nucleases to carry the CCR5-null phenotype, and re-introduced into the patients. Patients saw a reduction in viral load, and the persistence of an HIV-resistant population of T-cells. It is important to note that the patients are not cured, but their disease outlook is improved.
Current research aims to combine the same procedure with stem cell therapy to provide a one-shot HIV cure. Sangamo are currently in FDA-approved Phase 2 clinical trials for the T-cell technology and Phase 1 for the stem cell technology.
The ease of engineering has led to a number of novel applications that can be brought
about by an easily engineered double-stranded break, including very large-scale
(1.5 million base) deletion and inversion event models.
CRISPR-CAS9
In 2012, the entire field of genome editing was shaken up again with the adaptation of the
CRISPR-Cas9 system to genome editing. In nature, the CRISPR-Cas9 system is found in over
40% of all sequenced bacteria, and almost every archaeon. It affords immunity to invading
DNA elements (from viruses or other pathogens) by site-specific, RNA-guided cleavage. Due
to its simplicity of use, its accuracy, and the ease with which Cas9 can be further modified to
shuttle other DNA-acting enzymes to specific genomic regions, it has become the current
gold standard in genome editing machinery.
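Guide design starts by finding a protospacer adjacent motif (PAM; 5'-NGG for the commonly used S. pyogenes Cas9) immediately downstream of a 20-nucleotide target. A minimal single-strand scan, on a toy sequence; real designs also check the reverse strand and score off-targets:

```python
# Scan a DNA sequence for SpCas9 target sites: a 20 nt protospacer
# immediately followed by an NGG PAM. (One strand only, for brevity.)
def find_cas9_sites(seq, guide_len=20):
    sites = []
    for i in range(len(seq) - guide_len - 2):
        pam = seq[i + guide_len : i + guide_len + 3]
        if pam[1:] == "GG":                    # NGG: any base, then two Gs
            sites.append((i, seq[i : i + guide_len], pam))
    return sites

seq = "ATGCGT" * 5 + "TGG" + "ACGT"            # toy sequence with one PAM
for pos, protospacer, pam in find_cas9_sites(seq):
    print(pos, protospacer, pam)
```

A gRNA matching the reported protospacer would direct Cas9 to cut a few bases upstream of the PAM at that site.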
Reimagine Genome
[Figure: a) Desktop Genetics interface view; b) activity-score plot for MUC4 guides against control clusters and protein-coding exons.]
Desktop Genetics offer a platform for the easy and efficient design of gRNA libraries that
are accurate for a genome of interest and have minimal off-target effects. a) Snapshot
of the Desktop Genetics interface showing BRCA1 introns and exons at their position in
the human genome. b) Several scoring algorithms are used to define whether a specific
gRNA will possess off-target activity throughout the genome.
[Figure: gRNA library workflow. Design; manufacture amplified or unamplified oligo pools; clone the pooled sgRNA template library into vector(s) of choice, package the library into a lentiviral delivery system, or perform in vitro transcription; ready for transfection.]
Scale Research
The same 29,040 oligonucleotides that were designed above had their NGS data assessed for
abundance and oligonucleotide representation. 100% of the designed oligonucleotides were
present in the NGS analysis. Additionally, 90% of all sequences were synthesized at a density
within 4x the mean density. This data confirms that what you design is exactly what will be
synthesized on Twist Bioscience's platform.
Two 70mer oligonucleotides were synthesized on Twist Bioscience's silicon DNA writing platform.
These oligonucleotides were assembled to make the full 120mer gRNA template (peak denoted
by the blue arrow in i) that was complementary to a sequence of interest. In vitro transcription was
used to convert the template into gRNA (peak in ii). This gRNA was used to guide Cas9 to the DNA
sequence of interest. The 760 bp sequence (blue arrow in iii) was cleaved successfully into two
pieces (blue arrows in iv), 321 and 439 bp in length. No remaining full-length target or non-specific
events were detectable.
TECHNOLOGY
SHOWCASE:
TALENS TO
MODIFY TALES
While Xanthomonas species have been a useful source of TALE proteins for genome engineering, they also use the same tool to cause crop-destroying rice blight. In true "fighting fire with fire" fashion, researchers designed TALENs to modify the natural TALE binding regions of the rice crop, by inducing either deletions or mutations in an effector binding region of the plant genome.
Using a modified DNA-injecting plant pathogen, Agrobacterium tumefaciens, plasmids encoding TALENs were injected into rice embryonic cells, which were then screened for double knockout mutants. These mutants showed no impairment in growth or development, alongside resistance to the 32 rice-infecting strains that target the now modified (unrecognisable) target site. Due to the simplicity of this experiment it can easily be used in any plant to confer resistance to the many blighting pathogens that use a TALE (or similar) infection system.
TIMELINE
LATE 80s — HOMOLOGOUS RECOMBINATION.
The homologous recombination machinery that allows genetic recombination during meiosis was hijacked to afford the directed homologous recombination of exogenous DNA into precise positions in mouse genomes, with an efficiency of one recombination in every 10⁶ cells.
2002 · 2005 · 2007 · LATE 00s · 2010 · 2011
2012 — CRISPR/CAS9 NUCLEASES.
Some bacteria have adaptive immune systems that protect them from viral DNA. A Cas9 protein is guided by RNA which has been transcribed from learned information about the virus. Guide RNA is complementary to a viral strand, and can be engineered to be complementary to any strand of choice, allowing Cas9 to introduce double-stranded breaks in up to 9 in every 10 cells.
2015 — CRISPR/CPF1 NUCLEASES.
TODAY — CRISPR IS ONE OF THE FASTEST-EVOLVING FIELDS IN BIOLOGY.
According to PubMed, in 2015 alone there were 1,266 CRISPR publications. In January 2016 alone there were 207 papers, in keeping with the trend of exponential growth.
[Figure: a cartoon depicting how Cas9, gRNA and the PAM come together for DNA cleavage to facilitate genome engineering]
IMPROVING CAS9
Cas9, when expressed or transfected in cells alongside a gRNA, allows the targeted introduction or deletion of genetic information. This process was used to produce knock-out mutant mice carrying a mutation in both alleles, in a process that took only around four weeks from start to finish. It has been deemed a fantastic success, and is often cited as one of the greatest breakthrough technologies of recent years.
CRISPR has already shown incredible promise in the development of personalised gene therapies for rare diseases, in human cell lines and in mouse models. Pre-clinical, proof-of-concept treatments already exist for β-thalassemia, rheumatoid arthritis, Duchenne muscular dystrophy, cystic fibrosis and tyrosinemia.
dCas9: Short for dead Cas9, it has had both its RuvC and its HNH nuclease domains inactivated. This turns Cas9 into a shuttle for other enzymes that can act upon the DNA. dCas9 has been used as a fusion product with transcription factors in order to tightly control the activation or repression of particular genes outside of their usual regulation. It has also been fused to FokI and used as a dual-strand cleavage system that belongs to the same paradigm as ZFNs and TALENs.
Cas9n: Cas9n, or nicking Cas9, has either its RuvC or its HNH cleavage domain modified to be inactive. This inactivation leaves Cas9 able to produce only a single-stranded break in the DNA (a nick), not a double-stranded break. This is significant for two applications.
First, there is concern over the effects that off-target Cas9 cutting events will have on any cell engineered with this system. Research has shown that off-target effects are often few and far between, but their impact cannot be ignored. For this reason, two Cas9n enzymes, one for each strand, can be used to produce the double-stranded break. As they would have to recognise both the upstream and downstream regions of the cut site, off-target effects are almost always ablated.
hfCas9: Instead of using dual Cas9n proteins to generate an off-target-free Cas9 cut, researchers took to modifying the Cas9 enzyme itself to reduce off-target effects and keep the Cas9 system as simple as possible. Cas9 naturally tolerates a degree of mismatch between the gRNA and its target. This is great for the bacteria, as they can endure viral DNA that has undergone one or more mutations since it was last encountered; for genome editing, however, it is detrimental, because it is the source of off-target effects.
Therefore, by mutating four of Cas9's DNA-interacting residues, the DNA-binding energy of the whole system was reduced to the point at which the gRNA had to be exactly correct in order to induce a cut in the DNA. When the target organism's genome was sequenced and analysed for off-target effects, not a single one was found. So long as the gRNA is designed with off-target effects in mind (with a tool like the genome search algorithms offered by Desktop Genetics), hfCas9 allows genome editing that is specific only to the site of interest.
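The off-target design step described above can be pictured as a brute-force genome scan: slide the guide sequence along the genome and count the sites that match within a small mismatch budget. Real tools use indexed search over whole genomes, but a naive sketch over a made-up sequence (both the "genome" and the short "guide" here are invented for illustration) shows the idea:

```python
# Naive off-target scan: find genomic windows matching a guide within
# a mismatch budget. Illustrative only; real genomes need indexed search.

def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def off_target_sites(genome, guide, max_mismatches=3):
    n = len(guide)
    hits = []
    for i in range(len(genome) - n + 1):
        mm = hamming(genome[i:i + n], guide)
        if mm <= max_mismatches:
            hits.append((i, mm))  # (position, mismatch count)
    return hits

genome = "AACCGGTTACGTACGTTTGACCAGTACGTACGTTAGACC"  # made-up sequence
guide = "ACGTACGTTTGA"                              # made-up short 'guide'
print(off_target_sites(genome, guide, max_mismatches=2))
```

A guide is a good candidate when its intended site is the only zero-mismatch hit and there are few or no low-mismatch neighbours elsewhere in the genome.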
Where next? Only touched on in this chapter, the CRISPR-associated protein Cpf1 could extend the reach of the CRISPR/Cas system. Maybe there is something even more powerful looming on the horizon. The future of genetic engineering continues to evolve.
For a fully referenced version see the digital version at frontlinegenomics.com
GLOSSARY
BLUNT-CUT: A double-stranded break in which both strands are cleaved at the same position, leaving no overhang (v and ^ mark the cut sites):
5'-atgcatgca v tgcatgcatgc-3'
3'-tacgtacgt ^ actacgtacg-5'
Cas: CRISPR-associated protein; any member of the protein family encoded alongside a CRISPR array.
Cas9: The CRISPR-associated endonuclease most widely used for genome engineering; guided by a gRNA to a target adjacent to a PAM, where it makes a blunt double-stranded cut.
Cas9n: A version of Cas9 with either its RuvC or its HNH nuclease domain inactivated, so that it nicks only one strand.
dCas9: "Dead" Cas9; both nuclease domains are inactivated, so it binds DNA without cutting and can deliver fused effector proteins to a target site.
hfCas9: High-fidelity Cas9, engineered to reduce off-target cutting.
DUAL NICK: The use of two Cas9n enzymes, one per strand, to create a double-stranded break with far fewer off-target effects.
ENDONUCLEASE: An enzyme that cuts within a nucleic acid strand, rather than from its ends.
OFF-TARGET EFFECTS: Cuts or edits introduced at sites other than the intended target.
FokI: A restriction endonuclease whose DNA-cleaving domain is used by ZFNs and TALENs; it must dimerise to cut.
EXOGENOUS DNA: DNA introduced into a cell from an outside source.
FRAME SHIFT: An insertion or deletion of a number of bases that is not a multiple of three, changing how every downstream codon of the RNA is read into amino acids.
HOMOLOGOUS: Having the same, or very similar, sequence.
HOMOLOGOUS RECOMBINATION: The exchange of genetic information between two homologous sequences; exploited to place exogenous DNA at precise positions.
Csn1: see Cas9.
NON-HOMOLOGOUS RECOMBINATION: The joining of DNA ends that share no homology; an error-prone repair route that genome editing exploits to produce small insertions and deletions.
CRISPR: Clustered Regularly Interspaced Short Palindromic Repeats; the bacterial adaptive immune system from which Cas nucleases are derived.
DELETION: The loss of one or more bases from a sequence.
INSERTION: The gain of one or more bases within a sequence.
KNOCK-IN: The deliberate insertion of new genetic material at a chosen locus.
KNOCK-OUT: The deliberate disruption of a gene so that it no longer produces a functional product.
NON-HOMOLOGOUS REPAIR: see NON-HOMOLOGOUS RECOMBINATION.
Cpf1: A CRISPR-associated nuclease, distinct from Cas9, that leaves overhanging rather than blunt cuts.
NICKASE: An enzyme that cuts only one strand of the DNA double helix; see Cas9n.
OVERHANGING-CUT: A double-stranded break in which the two strands are cut at different positions, leaving short single-stranded "sticky" ends.
PAM: Protospacer Adjacent Motif; the short sequence next to the target site that Cas9 must recognise before it will cut.
REPRESSOR: A protein that reduces or blocks transcription of a gene.
RESIDUE: A single amino acid within a protein.
RNA: Ribonucleic acid; the single-stranded molecule transcribed from DNA.
crRNA: CRISPR RNA; the portion of the guide that carries complementarity to the target.
gRNA: Guide RNA; the engineered fusion of crRNA and tracrRNA that directs Cas9 to its target.
tracrRNA: Trans-activating crRNA; the scaffold portion of the guide that binds Cas9.
TALEN: Transcription Activator-Like Effector Nuclease; a programmable nuclease built from TALE DNA-binding repeats fused to FokI.
ZFN: Zinc Finger Nuclease; a programmable nuclease built from zinc-finger DNA-binding domains fused to FokI.