
REVIEWS

Machine learning applications in genetics and genomics
Maxwell W. Libbrecht1 and William Stafford Noble1,2
Abstract | The field of machine learning, which aims to develop computer algorithms
that improve with experience, holds promise to enable computers to assist humans in
the analysis of large, complex data sets. Here, we provide an overview of machine
learning applications for the analysis of genome sequencing data sets, including the
annotation of sequence elements and epigenetic, proteomic or metabolomic data. We
present considerations and recurrent challenges in the application of supervised,
semi-supervised and unsupervised machine learning methods, as well as of generative
and discriminative modelling approaches. We provide general guidelines to assist in
the selection of these machine learning methods and their practical application for the
analysis of genetic and genomic data sets.

1Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195-2350, USA.
2Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, Washington 98195-5065, USA.
Correspondence to W.S.N. e-mail: william-noble@uw.edu
doi:10.1038/nrg3920
Published online 7 May 2015

Machine learning: A field concerned with the development and application of computer algorithms that improve with experience.

The field of machine learning is concerned with the development and application of computer algorithms that improve with experience1. Machine learning methods have been applied to a broad range of areas within genetics and genomics. Machine learning is perhaps most useful for the interpretation of large genomic data sets and has been used to annotate a wide variety of genomic sequence elements. For example, machine learning methods can be used to learn how to recognize the locations of transcription start sites (TSSs) in a genome sequence2. Algorithms can similarly be trained to identify splice sites3, promoters4, enhancers5 or positioned nucleosomes6. In general, if one can compile a list of sequence elements of a given type, then a machine learning method can probably be trained to recognize those elements. Furthermore, models that each recognize an individual type of genomic element can be combined, along with (learned) logic about their relative locations, to build machine learning systems that are capable of annotating genes, including their untranslated regions (UTRs), introns and exons, along entire eukaryotic chromosomes7.

As well as learning to recognize patterns in DNA sequences, machine learning algorithms can use input data generated by other genomic assays: for example, microarray or RNA sequencing (RNA-seq) expression data; data from chromatin accessibility assays such as DNase I hypersensitive site sequencing (DNase-seq), micrococcal nuclease digestion followed by sequencing (MNase-seq) and formaldehyde-assisted isolation of regulatory elements followed by sequencing (FAIRE-seq); or chromatin immunoprecipitation followed by sequencing (ChIP-seq) data of histone modification or transcription factor binding. Gene expression data can be used to learn to distinguish between different disease phenotypes and, in the process, to identify potentially valuable disease biomarkers. Chromatin data can be used, for example, to annotate the genome in an unsupervised manner, thereby potentially enabling the identification of new classes of functional elements.

Machine learning applications have also been extensively used to assign functional annotations to genes. Such annotations most frequently take the form of Gene Ontology term assignments8. The input to predictive algorithms can be any one or more of a wide variety of data types, including the genomic sequence; gene expression profiles across various experimental conditions or phenotypes; protein-protein interaction data; synthetic lethality data; open chromatin data; and ChIP-seq data of histone modification or transcription factor binding. As an alternative to Gene Ontology term prediction, some predictors instead identify co-functional relationships, in which the machine learning method outputs a network in which genes are represented as nodes and an edge between two genes indicates that they have a common function9.

Finally, a wide variety of machine learning methods have been developed to help to understand the mechanisms underlying gene expression. Some techniques aim to predict the expression of a gene on the basis of the DNA sequence alone10, whereas others take into account ChIP-seq profiles of histone modification11 or transcription factor binding12 at the gene promoter region.


More sophisticated methods attempt to jointly model the expression of all of the genes in a cell by training a network model13. Like a co-functional network, each node in a gene expression network denotes a gene; however, edges in this case represent, for example, the regulatory relationships between transcription factors and their targets.

Many of the problems listed above can also be solved using techniques from the field of statistics. Indeed, the line between machine learning and statistics is at best blurry, and some prefer the term statistical learning over machine learning14. Historically, the field of machine learning grew out of the artificial intelligence community, in which the term machine learning became popular in the late 1990s. In general, machine learning researchers have tended to focus on a subset of problems within statistics, emphasizing in particular the analysis of large heterogeneous data sets. Accordingly, many core statistical concepts, such as the calibration of likelihood estimates, statistical confidence estimations and power calculations, are essentially absent from the machine learning literature.

Here, we provide an overview of machine learning applications in genetics and genomics. We discuss the main categories of machine learning methods and the key considerations that must be made when applying these methods to genomics. We do not attempt to catalogue all machine learning methods or all reported applications of machine learning to genomics, nor do we discuss any particular method in great detail. Instead, we begin by explaining several key distinctions in the main types of machine learning and then outlining some of the major challenges in applying machine learning methods to practical problems in genomics. We provide an overview of the type of research questions for which machine learning approaches are appropriate and advise on how to select the type of methods that are likely to be effective. A more detailed discussion of machine learning applied to particular subfields of genetics and genomics is available elsewhere15-18.

Artificial intelligence: A field concerned with the development of computer algorithms that replicate human skills, including learning, visual perception and natural language understanding.
Heterogeneous data sets: A collection of data sets from multiple sources or experimental methodologies. Artefactual differences between data sets can confound analysis.
Likelihood: The probability of a data set given a particular model.
Label: The target of a prediction task. In classification, the label is discrete (for example, expressed or not expressed); in regression, the label is of real value (for example, a gene expression value).
Examples: Data instances used in a machine learning task.
Supervised learning: Machine learning based on an algorithm that is trained on labelled examples and used to predict the label of unlabelled examples.

Stages of machine learning
Machine learning methods proceed through three stages (FIG. 1). As an example, we consider an application to identify the locations of TSSs within a whole-genome sequence2. First, a machine learning researcher develops an algorithm that he or she believes will lead to successful learning. Second, the algorithm is provided with a large collection of TSS sequences as well as, optionally, a list of sequences that are known not to be TSSs. The annotation indicating whether a sequence is a TSS is known as the label. The algorithm processes these labelled sequences and stores a model. Third, new unlabelled sequences are given to the algorithm, and it uses the model to predict labels (in this case, TSS or not TSS) for each sequence. If the learning was successful, then all or most of the predicted labels will be correct. If the labels associated with test set examples are known (that is, if these examples were excluded from the training set because they were intended to be used to test the performance of the learning system), then the performance of the machine learning algorithm can be assessed immediately. Otherwise, in a prospective validation setting, the TSS predictions produced by the machine learning system must be tested independently in the laboratory. Note that this is an example of a subtype of machine learning called supervised learning, which is described in more detail in the next section.

This process of algorithm design, learning and testing is simultaneously analogous to the scientific method on two different levels. First, the design-learn-test process provides a principled way to test a hypothesis about machine learning: for example, algorithm X can successfully learn to recognize TSSs. Second, the algorithm itself can be used to generate hypotheses: for example, sequence Y is a TSS. In the latter setting, the resulting scientific theory is instantiated in the model produced by the learning algorithm. In this case, a key question, which we return to below, is whether and how easily a human can interpret this model.

Figure 1 | A canonical example of a machine learning application. A training set of DNA sequences is provided as input to a learning procedure, along with binary labels indicating whether each sequence is centred on a transcription start site (TSS) or not. The learning algorithm produces a model that can then be subsequently used, in conjunction with a prediction algorithm, to assign predicted labels (such as TSS or not TSS) to unlabelled test sequences. In the figure, the red-blue gradient might represent, for example, the scores of various motif models (one per column) against the DNA sequence.
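To make the three stages concrete, the following minimal Python sketch (using scikit-learn, with toy sequences and labels that merely stand in for a real TSS training set) runs the train-then-predict cycle described above; the held-out accuracy plays the role of the immediate assessment on a labelled test set.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    def one_hot(seq):
        # Encode a DNA sequence as a flat binary vector (4 bits per base).
        index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
        vec = np.zeros((len(seq), 4))
        for i, base in enumerate(seq):
            vec[i, index[base]] = 1.0
        return vec.ravel()

    # Toy stand-ins for sequences centred (label 1) or not centred (label 0) on a TSS.
    sequences = ['TATAAAGC', 'TATAAGGC', 'TATAATCC', 'TATAACGT',
                 'GCCGCGTA', 'ACGTGCAT', 'CCGGTTAA', 'GGGCCCTT']
    labels = [1, 1, 1, 1, 0, 0, 0, 0]

    X = np.array([one_hot(s) for s in sequences])
    y = np.array(labels)

    # Hold out half of the labelled examples to assess the learned model.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)    # learning stage
    predicted = model.predict(X_test)                     # prediction stage
    print('held-out accuracy:', accuracy_score(y_test, predicted))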
Supervised versus unsupervised learning
Machine learning methods can usefully be segregated into two primary categories: supervised or unsupervised learning methods. Supervised methods are trained on labelled examples and then used to make predictions about unlabelled examples, whereas unsupervised methods find structure in a data set without using labels. To illustrate the difference, consider again gene-finding algorithms, which use the DNA sequence of a chromosome as input to predict the locations and detailed intron-exon structure of all of the protein-coding genes on the chromosome (FIG. 2).


The most straightforward way to do so is to use what is already known about the genome to help to build a predictive model. In particular, a supervised learning algorithm for gene finding requires as input a training set of labelled DNA sequences specifying the locations of the start and end of the gene (that is, the TSS and the transcription termination site, respectively), as well as all of the splice sites in between these sites. The model then uses this training data to learn the general properties of genes, such as the DNA sequence patterns that typically occur near a donor or an acceptor splice site; the fact that in-frame stop codons should not occur within coding exons; and the expected length distributions of 5′ and 3′ UTRs and of initial, internal and final introns. The trained model can then use these learned properties to identify additional genes that resemble the genes in the training set.

Figure 2 | A gene-finding model. A simplified gene-finding model that captures the basic properties of a protein-coding gene is shown. The model takes the DNA sequence of a chromosome, or a portion thereof, as input and produces detailed gene annotations as output. Note that this simplified model is incapable of identifying overlapping genes or multiple isoforms of the same gene. UTR, untranslated region.

Unsupervised learning: Machine learning based on an algorithm that does not require labels, such as a clustering algorithm.
Semi-supervised learning: A machine learning method that requires labels but that also makes use of unlabelled examples.

When a labelled training set is not available, unsupervised learning is required. For example, consider the interpretation of a heterogeneous collection of epigenomic data sets, such as those generated by the Encyclopedia of DNA Elements (ENCODE) Consortium and the Roadmap Epigenomics Project. A priori, we expect that the patterns of chromatin accessibility, histone modifications and transcription factor binding along the genome should be able to provide a detailed picture of the biochemical and functional activity of the genome. We may also expect that these activities could be accurately summarized using a fairly small set of labels. If we are interested in discovering what types of label best explain the data, rather than imposing a pre-determined set of labels on the data, then we must use unsupervised rather than supervised learning. In this type of approach, the machine learning algorithm uses only the unlabelled data and the desired number of different labels to assign as input19-21; it then automatically partitions the genome into segments and assigns a label to each segment, with the goal of assigning the same label to segments that have similar data. The unsupervised approach requires an additional step in which semantics must be manually assigned to each label, but it provides the benefit of enabling training when labelled examples are unavailable and has the ability to identify potentially novel types of genomic elements.

Semi-supervised learning. The intermediate between supervised and unsupervised learning is semi-supervised learning22. In supervised learning, the algorithm receives as input a collection of data points, each with an associated label, whereas in unsupervised learning the algorithm receives the data but no labels. The semi-supervised setting is a mixture of these two approaches: the algorithm receives a collection of data points, but only a subset of these data points have associated labels. In practice, gene-finding systems are often trained using a semi-supervised approach, in which the input is a collection of annotated genes and an unlabelled whole-genome sequence. The learning procedure begins by constructing an initial gene-finding model on the basis of the labelled subset of the training data alone. Next, the model is used to scan the genome, and tentative labels are assigned throughout the genome. These tentative labels can then be used to improve the learned model, and the procedure iterates until no new genes are found. The semi-supervised approach can work much better than a fully supervised approach because the model is able to learn from a much larger set of genes (all of the genes in the genome) rather than only the subset of genes that have been identified with high confidence.
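The iterative procedure just described can be written down generically as a self-training loop. The sketch below is schematic rather than a specific gene-finding implementation: fit_model, predict_labels and the data variables are hypothetical placeholders supplied by the user.

    def self_train(labelled_X, labelled_y, unlabelled_X, fit_model, predict_labels,
                   confidence_threshold=0.99, max_iterations=20):
        # Semi-supervised self-training: start from the labelled subset, then
        # repeatedly add confidently predicted examples to the training set.
        X, y = list(labelled_X), list(labelled_y)
        remaining = list(unlabelled_X)
        for _ in range(max_iterations):
            model = fit_model(X, y)                       # 1. train on current labels
            scored = predict_labels(model, remaining)     # 2. scan the unlabelled data
            # 3. keep only confident tentative labels: (example, label, probability)
            confident = [(x, lab) for x, lab, p in scored if p >= confidence_threshold]
            if not confident:                             # 4. stop when nothing new is found
                break
            for x, lab in confident:
                X.append(x)
                y.append(lab)
                remaining.remove(x)
        return model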
Which type of method to use. When faced with a new machine learning task, the first question to consider is often whether to use a supervised, unsupervised or semi-supervised approach. In some cases, the choice is limited; for example, if no labels are available, then only unsupervised learning is possible. However, when labels are available, a supervised approach is not always the best choice because every supervised learning method rests on the implicit assumption that the distribution responsible for generating the training data set is the same as the distribution responsible for generating the test data set. This assumption will be respected if, for example, one takes a single labelled data set and randomly subdivides it into a training set and a testing set. However, it is often the case that the plan is to train an algorithm on a training set that is generated differently from the testing data to which the trained model will eventually be applied. A gene finder trained using a set of human genes will probably not perform very well at finding genes in the mouse genome. Often, the divergence between training and testing is less obvious.


For example, a TSS data set generated by cap analysis of gene expression (CAGE) will not contain non-polyadenylated genes23. If such genes also exhibit differences around the TSSs, then the resulting TSS predictor will be biased. In general, supervised learning should be used only when the training set and test set are expected to exhibit similar statistical properties.

When supervised learning is feasible and additional unlabelled data points are easy to obtain, one may consider whether to use a supervised or semi-supervised approach. Semi-supervised learning requires making certain assumptions about the data set22 and, in practice, assessing these assumptions can often be very difficult. Therefore, a good rule of thumb is to use semi-supervised learning only when there are a small amount of labelled data and a large amount of unlabelled data.

Generative versus discriminative modelling
Applications of machine learning methods generally have one of two goals: prediction or interpretation. Consider the problem of predicting, on the basis of a ChIP-seq experiment, the locations at which a given transcription factor will bind to genomic DNA. This task is analogous to the TSS prediction task (FIG. 1), except that the labels are derived from ChIP-seq peaks. A researcher applying a machine learning method to this problem may either want to understand what properties of a sequence are the most important for determining whether a transcription factor will bind (that is, interpretation), or simply want to predict the locations of transcription factor binding as accurately as possible (that is, prediction). There are trade-offs between accomplishing these two goals: methods that optimize prediction accuracy often do so at the cost of interpretability.

Prediction accuracy: The fraction of predictions that are correct. It is calculated by dividing the number of correct predictions by the total number of predictions.
Generative models: Machine learning models that build a full model of the distribution of features.
Discriminative models: Machine learning approaches that model only the distribution of a label when given the features.
Features: Single measurements or descriptors of examples used in a machine learning task.

The distinction between generative models and discriminative models plays a large part in the trade-off between interpretability and performance. The generative approach builds a full model of the distribution of features in each of the two classes and then compares how the two distributions differ from one another. By contrast, the discriminative approach focuses on accurately modelling only the boundary between the two classes. From a probabilistic perspective, the discriminative approach involves modelling only the conditional distribution of the label given the input feature data sets, as opposed to the joint distribution of the labels and features. Schematically, if we were to separate two groups of points in a 2D space (FIG. 3a), the generative approach builds a full model of each class, whereas the discriminative approach focuses only on separating the two classes.

A researcher applying a generative approach to the transcription factor binding problem begins by considering what procedure could be used to generate the observed data. A widely used generative model of transcription factor binding uses a position-specific frequency matrix (PSFM) (FIG. 3b), in which a collection of aligned binding sites of width w are summarized in a 4 × w matrix (M) of frequencies, where the entry at position i,j represents the empirical frequency of observing the ith DNA base at position j. We can generate a random bound sequence according to this PSFM model by drawing w random numbers, each in the range [0,1]. For the jth random number, we select the corresponding DNA base according to the frequencies in the jth column of the matrix. Conversely, scoring a candidate binding site using the model corresponds to computing the product of the corresponding frequencies from the PSFM. This value is called the likelihood. Training a PSFM is simple: the empirical frequency of each nucleotide at each position simply needs to be computed.

A simple example of a discriminative algorithm is the support vector machine (SVM)24,25 (FIG. 3c), the goal of which is to learn to output a value of 1 whenever it is given a positive training example and a value of −1 whenever it is given a negative training example. In the transcription factor binding prediction problem, the input sequence of length w is encoded as a binary string of length 4w, and each bit corresponds to the presence or absence of a particular nucleotide at a particular position.
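As a concrete illustration of this encoding, the short sketch below (a simplified example rather than the implementation used in the cited studies) converts each length-w sequence into a 4w-bit vector and trains a linear SVM with scikit-learn; the sequences and labels are illustrative placeholders.

    import numpy as np
    from sklearn.svm import SVC

    def encode_4w(seq):
        # Map a length-w DNA sequence to a binary vector of length 4w.
        bits = {'A': (1, 0, 0, 0), 'C': (0, 1, 0, 0),
                'G': (0, 0, 1, 0), 'T': (0, 0, 0, 1)}
        return np.array([b for base in seq for b in bits[base]])

    positives = ['AAGTGT', 'TAATGT', 'AATTGT']    # stand-ins for bound sites (+1)
    negatives = ['CCGCGA', 'GGCCTA', 'CGATCC']    # stand-ins for unbound sites (-1)

    X = np.array([encode_4w(s) for s in positives + negatives])
    y = np.array([+1] * len(positives) + [-1] * len(negatives))

    svm = SVC(kernel='linear').fit(X, y)
    print(svm.predict([encode_4w('AATTGA')]))     # +1 or -1 for a new candidate site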
This generative modelling approach offers several compelling benefits. First, the generative description of the data implies that the model parameters have well-defined semantics relative to the generative process. Accordingly, as shown in the example above, the model not only predicts the locations to which a given transcription factor binds but also explains why the transcription factor binds there. If we compare two different potential binding sites, we can see that the model prefers one site over another and also that the reason is, for example, the preference for an adenine rather than a thymine at position 7 of the motif. Second, generative models are frequently stated in terms of probabilities, and the probabilistic framework provides a principled way to handle problems like missing data. For example, it is still possible for a PSFM to make a prediction for a binding site where one or more of the bound residues is unknown. This is accomplished by probabilistically averaging over the missing bases. The output of the probabilistic framework has well-defined, probabilistic semantics, and this can be helpful when making downstream decisions about how much to trust a given prediction.

Probabilistic framework: A machine learning approach based on a probability distribution over the labels and features.
Missing data: An experimental condition in which some features are available for some, but not all, examples.

In many cases, including the example of transcription factor binding, the training data set contains a mixture of positive and negative examples. In a generative setting, these two groups of examples are modelled separately, and each has its own generative process. For instance, for the PSFM model, the negative (or background) model is often a single set (B) of nucleotide frequencies that represents the overall mean frequency of each nucleotide in the negative training examples. To generate a sequence of length w according to this model, we again generate w random numbers, but now each base is selected according to the frequencies in B. To use the foreground PSFM model together with the background model B, we compute a likelihood ratio that is simply the ratio of the likelihoods computed with respect to the PSFM and with respect to B.
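The PSFM computations described in this section are simple enough to write out directly. The sketch below is an illustrative implementation with toy binding sites (not the authors' code): it trains a PSFM from aligned sites, scores a candidate site by multiplying the matching frequencies, and compares that likelihood against a uniform background model B via a likelihood ratio.

    import numpy as np

    BASES = 'ACGT'

    def train_psfm(sites):
        # Empirical frequency of each base at each position of the aligned sites.
        w = len(sites[0])
        counts = np.zeros((4, w))
        for site in sites:
            for j, base in enumerate(site):
                counts[BASES.index(base), j] += 1
        return counts / len(sites)

    def likelihood(psfm, site):
        # Probability of the site under the PSFM: the product over positions.
        return float(np.prod([psfm[BASES.index(b), j] for j, b in enumerate(site)]))

    def likelihood_ratio(psfm, background, site):
        # Foreground (PSFM) likelihood divided by the background likelihood.
        bg = float(np.prod([background[BASES.index(b)] for b in site]))
        return likelihood(psfm, site) / bg

    bound_sites = ['AAGTGT', 'TAATGT', 'AATTGT', 'AATTGA', 'ATCTGT']   # toy training set
    psfm = train_psfm(bound_sites)
    background = np.array([0.25, 0.25, 0.25, 0.25])                    # uniform model B
    print(likelihood(psfm, 'AAGTGT'), likelihood_ratio(psfm, background, 'AAGTGT'))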


Figure 3 | Two models of transcription factor binding. a | Generative and discriminative models are different in their interpretability and prediction accuracy. If we were to separate two groups of points, the generative model characterizes both classes completely, whereas the discriminative model focuses on the boundary between the classes. b | In a position-specific frequency matrix (PSFM) model, the entry in row i and column j represents the frequency of the ith base occurring at position j in the training set. Assuming independence, the probability of the entire sequence is the product of the probabilities associated with each base. c | In a linear support vector machine (SVM) model of transcription factor binding, labelled positive and negative training examples (red and blue, respectively) are provided as input, and a learning procedure adjusts the weights on the edges to predict the given label. d | The graph plots the mean accuracy (with 95% confidence intervals) of PSFM and SVM, on a set of 500 simulated test sets, of predicting transcription factor binding as a function of the number of training examples.

The primary benefit of the discriminative modelling approach is that it probably achieves better performance than the generative modelling approach with infinite training data26,27. In practice, analogous generative and discriminative approaches often converge to the same solution, and generative approaches can sometimes perform better with limited training data. However, when the amount of labelled training data is reasonably large, the discriminative approach will tend to find a better solution, in the sense that it will predict the desired outcome more accurately when tested on previously unseen data (assuming that the data are from the same underlying distribution as the training data). To illustrate this phenomenon, we simulated data according to a simple PSFM model and trained a PSFM and an SVM, varying the number of training examples. To train a model with a width of 19 nucleotides to discriminate between bound and unbound sites with 90% accuracy requires 8 training examples for a PSFM model and only 4 examples for an SVM model (FIG. 3d). This improvement in performance is achieved because, by not attempting to accurately characterize the simpler parts of the 2D space, the discriminative model performs better at solving the discrimination task at hand. Thus, empirically, the discriminative approach will tend to give predictions that are more accurate.

However, the flip side of this accuracy is that, by solving a single problem well, the discriminative approach fails to solve other problems at all. Specifically, because the internal parameters of a generatively trained model have well-defined semantics, we can use the model to ask various related questions: for example, not only whether CCCTC-binding factor (CTCF) binds to a particular sequence but also why it binds to this sequence more tightly than to some other sequence. By contrast, the discriminative model only enables us to answer the single question for which it was designed. Thus, choosing between a generative and a discriminative model involves a trade-off between predictive accuracy and interpretability of the model. Although the distinction between generative and discriminative models plays a large part in determining the interpretability of a model, the model's complexity (that is, the number of parameters it has) can be just as important. Models of either type that have a large number of parameters tend to be difficult to interpret but can generally achieve higher accuracy than models with few parameters if they are provided with enough data.


The complexity of a model can be limited either by choosing a simple model or by using a feature selection strategy to restrict the complexity of the learned model.

Incorporating prior knowledge
In many applications, the success or failure of a machine learning method depends on the extent to which the method accurately encodes various types of prior knowledge about the problem at hand. Indeed, much of the practical application of machine learning involves understanding the details of a particular problem and then selecting an algorithmic approach that enables those details to be accurately encoded. There is no optimal machine learning algorithm that works best for all problems28, so the selection of an approach that matches the researcher's prior knowledge about the problem is crucial to the success of the analysis.

Often, the encoding of prior knowledge is implicit in the framing of the machine learning problem. As an example, consider the prediction of nucleosome positioning from primary DNA sequence6. The labels for this prediction task can be derived, for example, from an MNase-seq assay. In this case, after appropriate processing, each base is assigned an integer count of the number of nucleosome-sized fragments that cover the base. Therefore, it might seem natural to frame the problem as a regression, in which the input is, for example, a sequence of 201 bp and the output is the predicted coverage at the centre of the sequence. However, in practice, we may be particularly interested in identifying nucleosome-free regions. Hence, rather than asking the algorithm to solve the problem of predicting exact coverage at each base, we might instead opt to predict whether each base occurs in a nucleosome-free region. Switching from regression to classification makes the problem easier but, more importantly, this switch also encodes the prior knowledge that regions of high MNase accessibility are of particular biological interest.

Implicit prior knowledge. In other cases, prior knowledge is encoded implicitly in the choice of data that we provide as input to the machine learning algorithm. For example, Yip et al.29 trained a collection of machine learning methods to distinguish among various types of genomic elements. Using chromatin data (DNase I accessibility and ChIP-seq profiles of histone modifications and transcription factor binding), one classifier distinguished regulatory regions that are close to a gene from regulatory regions that are far away from any gene. In this context, design of the input space (that is, the features that are provided as input to the classifier) is of crucial importance. In the work carried out by Yip et al.29, the features were computed by averaging over a 100-bp window. The choice of a 100-bp window is likely to reflect the prior knowledge that, at least for histone modification data, the data are arguably only meaningful at approximately the scale of a single nucleosome (147 bp) or larger. Moreover, replacing a single averaged feature with 100 separate features may be problematic. By contrast, for DNase accessibility data, it is plausible that averaging over 100 bp may remove some useful signal. Alternatively, prior knowledge may be implicitly encoded in the learning algorithm itself, in which some types of solutions are preferred over others30. Therefore, in general, the choice of input data sets, their representations and any pre-processing must be guided by prior knowledge about data and application.

Feature selection: The process of choosing a smaller set of features from a larger set, either before applying a machine learning method or as part of training.
Input space: A set of features chosen to be used as input for a machine learning method.
Uniform prior: A prior distribution for a Bayesian model that assigns equal probabilities to all models.

Probabilistic priors. In a probabilistic framework, some forms of prior knowledge can be represented explicitly by specifying a prior distribution over the data. A common prior distribution is the uniform prior (also known as an uninformative prior) which, despite the name, can be useful in some contexts. Consider, for example, a scenario in which we have gathered a collection of ten validated binding sites for a particular transcription factor (FIG. 4), and we observe that the sequences cluster around a clear consensus motif. If we represent these sequences using a pure PSFM then, because our data set is fairly small, a substantial number of the entries in the PSFM will be zero. Consequently, the model will assign any sequence that contains one of these zero entries an overall probability of zero, even if the sequence otherwise exactly matches the motif. This is counter-intuitive. The solution to this problem is to encode the prior knowledge that every possible DNA sequence has the potential to be bound by a given transcription factor. The result is that even sequences containing nucleotides that we have never observed in a given position can still be assigned a non-zero probability by the model.

Figure 4 | Incorporating a probabilistic prior into a position-specific frequency matrix. A simple, principled method for putting a probabilistic prior on a position-specific frequency matrix involves augmenting the observed nucleotide counts with pseudocounts and then computing frequencies with respect to the sum. The magnitude of the pseudocount corresponds to the weight assigned to the prior.
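The pseudocount recipe summarized in the Figure 4 caption can be written in a few lines. The sketch below is an illustrative implementation using the example sites from FIG. 4 and a pseudocount of 0.1: the prior weight is added to the observed counts before each column is normalized.

    import numpy as np

    BASES = 'ACGT'

    def psfm_with_pseudocounts(sites, pseudocount=0.1):
        w = len(sites[0])
        counts = np.full((4, w), pseudocount)            # prior weight for every base
        for site in sites:
            for j, base in enumerate(site):
                counts[BASES.index(base), j] += 1        # add the observed counts
        return counts / counts.sum(axis=0)               # normalize each column

    sites = ['AAGTGT', 'TAATGT', 'AATTGT', 'AATTGA',
             'ATCTGT', 'AATTGT', 'TGTTGT', 'AAATGA']
    print(psfm_with_pseudocounts(sites).round(2))
    # A base never observed at a given position now receives a small non-zero
    # probability rather than exactly zero.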


In many probability models, much more sophisticated priors have been developed to capture more-complex prior knowledge. A particularly successful example is the use of Dirichlet mixture priors in protein modelling31. In this example, the idea is that if we are examining an alignment of protein sequences and if we see many aligned leucines in a particular column, then, because we know that leucine and valine are biochemically similar, we may want to assign a high probability to sequences that contain a valine in that same column, even if we have never seen a valine there in our training data. The name Dirichlet mixture refers to the fact that the prior distribution is represented as a Dirichlet distribution, and that the distribution is a mixture in which each component corresponds to a group of biochemically similar amino acids. Such priors lead to substantially improved performance both in the modelling of evolutionarily related families of proteins31 and in the discovery of protein motifs32.

Dirichlet mixture priors: Prior distributions for a Bayesian model over the relative frequencies of, for example, amino acids.
Kernel methods: A class of machine learning methods (for example, support vector machine) that use a type of similarity measure (called a kernel) between feature vectors.

Prior information in non-probabilistic models. By contrast, the incorporation of prior knowledge into non-probabilistic methods can be more challenging. For example, discriminative classifiers, such as artificial neural networks (ANNs) or random forests, do not provide any explicit mechanism for representing prior knowledge. The topology of a multilayer ANN can represent information about dependencies between input features in the input space, but more general priors cannot be represented.

One class of discriminative methods does provide a more general mechanism for representing prior knowledge, if that knowledge can be encoded into a generalized notion of similarity. Kernel methods are algorithms that use a general class of mathematical functions called kernels in place of a simple similarity function (specifically, the cosine of the angle between two input vectors)33. The flagship kernel method is the SVM classifier, which has been widely used in many fields, including biological applications ranging from DNA and protein sequence classification to mass spectrometry analysis25. Other methods that can use kernels include support vector regression as well as classical algorithms such as k-means clustering, principal component analysis and hierarchical clustering. Prior knowledge can be provided to a kernel method by selecting or designing an appropriate kernel function. For example, a wide variety of kernels can be defined between pairs of DNA sequences, in which the similarity can be based on shared k-mers irrespective of their positions within the sequences34, on nucleotides occurring at particular positions35 or on a mixture of the two36. A kernel between DNA sequences can even encode a simple model of molecular evolution, using algorithms that are similar to the Smith-Waterman alignment algorithm37. Furthermore, the Fisher kernel provides a general framework for deriving a kernel function from any probability model38. In this way, formal probabilistic priors can be used in conjunction with any kernel method. Kernel methods have a rich literature, which is reviewed in more detail in REF. 39.
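As an illustration of the position-independent k-mer similarity cited above, the sketch below computes a simple spectrum-style kernel between two DNA sequences; the sequences and the choice of k are illustrative placeholders, and real applications typically normalize and precompute such kernels over whole data sets.

    from collections import Counter

    def kmer_counts(seq, k=3):
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    def spectrum_kernel(seq_a, seq_b, k=3):
        # Inner product between the k-mer count vectors of the two sequences.
        a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
        return sum(a[kmer] * b[kmer] for kmer in a if kmer in b)

    print(spectrum_kernel('ACGTACGTAC', 'TTACGTAAGG'))   # shared 3-mers raise the score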
Handling heterogeneous data
Another common challenge in learning from real biological data is that the data themselves are heterogeneous. For example, consider the problem of learning to assign Gene Ontology terms to genes. For a given term, such as cytoskeleton-dependent intracellular transport, a wide variety of data types might be relevant, including the amino acid sequence of the protein encoded by the gene; the inferred evolutionary relationships of that protein to other proteins across various species; the microarray or RNA-seq expression profile of the gene across a variety of phenotypic or environmental conditions; and the number and identity of neighbouring proteins identified using yeast two-hybrid or tandem affinity purification tagging experiments, or GFP-tagged microscopy images (FIG. 5). Such data sets are difficult to analyse jointly because of their heterogeneity: an expression profile is a fixed-length vector of real values; protein-protein interaction information is a binary network; and protein sequences are of variable lengths and are made up of a discrete alphabet. Many statistical and machine learning methods for classification assume that all of the data can be represented as fixed-length vectors of real numbers. Such methods cannot be directly applied to heterogeneous data.

Figure 5 | Three ways to accommodate heterogeneous data in machine learning. The task of predicting gene function labels requires methods that take as input data such as gene expression profiles, genetic interaction networks and amino acid sequences. These diverse data types can be encoded into fixed-length features, represented using pairwise similarities (that is, kernels) or directly accommodated by a probability model.


The most straightforward way to solve this problem is to transform each type of data into vector format before processing (FIG. 5a). For example, this was the approach taken by Peña-Castillo et al.40 in an assessment of methods for predicting gene function in mice. Participating research groups were provided with separate matrices, each representing a single type of data: gene expression, sequence patterns, protein-protein interactions, phenotypes, conservation profiles or disease associations. All matrices shared a common set of rows (one per gene), but the number and meanings of the columns differed from one matrix to the next. For example, gene expression data were represented directly, whereas protein sequence data were represented indirectly through annotations from repositories such as Pfam41 and InterPro42, and protein-protein interactions were represented as square matrices in which entries were the shortest-path lengths between genes.

Alternatively, each type of data can be encoded using a kernel function (FIG. 5b), with one kernel for each data type. Mathematically, the use of kernels is formally equivalent to transforming each data type into a fixed-length vector; however, in the kernel approach the vectors themselves are not always represented explicitly. Instead, the similarity between two data elements, such as amino acid sequences or nodes in a protein-protein interaction network, is encoded in the kernel function. The kernel framework enables more-complex kernels to be defined from combinations of simple kernels. For example, a simple summation of all the kernels for a given pair of genes is itself a kernel function. Furthermore, the kernels themselves can encode prior knowledge by, for example, allowing for statistical dependencies within a given data type but not between data types43. Kernels also provide a general framework for automatically learning the relative importance of each type of data relative to a given classification task44.
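The fact that a sum of kernels is itself a kernel suggests a simple recipe for combining data types. The sketch below is schematic: the kernel matrices are random or identity stand-ins rather than real expression, network or sequence kernels, and the unit weights could instead be learned, as in multiple kernel learning.

    import numpy as np
    from sklearn.svm import SVC

    n_genes = 6
    rng = np.random.default_rng(0)

    # One precomputed kernel (gene-by-gene similarity matrix) per data type.
    K_expression  = np.corrcoef(rng.normal(size=(n_genes, 20)))   # expression similarity
    K_interaction = np.eye(n_genes)                               # stand-in network kernel
    K_sequence    = np.eye(n_genes)                               # stand-in sequence kernel

    # A (weighted) sum of kernels is itself a kernel.
    K_combined = 1.0 * K_expression + 1.0 * K_interaction + 1.0 * K_sequence

    labels = np.array([1, 0, 1, 0, 1, 0])                         # toy functional labels
    clf = SVC(kernel='precomputed').fit(K_combined, labels)
    print(clf.predict(K_combined))                                # predictions for the same genes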
Finally, probability models provide a very different method for handling heterogeneous data. Rather than forcing all the data into a vector representation or requiring that all data be represented using pairwise similarities, a probability model explicitly represents diverse data types in the model itself (FIG. 5c). An early example of this type of approach assigned gene functional labels to yeast genes on the basis of gene expression profiles; physical associations from affinity purification, two-hybrid and direct binding measurements; and genetic associations from synthetic lethality experiments45. The authors used a Bayesian network, which is a formal graphical language for representing the joint probability distribution over a set of random variables46. Querying the network with respect to the variable representing a particular Gene Ontology label yields the probability that a given gene is assigned that label. In principle, the probabilistic framework enables a Bayesian network to represent any arbitrary data type and carry out joint inference over heterogeneous data types using a single model.

Bayesian network: A representation of a probability distribution that specifies the structure of dependencies between variables as a network.

In practice, such inference can be challenging because a joint probability model over heterogeneous data may contain a very large number of trainable parameters. Therefore, an alternative method for handling heterogeneous data in a probability model is to make use of the general probabilistic mechanism for handling prior knowledge by treating one type of data before another. For example, as discussed above, predicting the location on a genome sequence to which a particular transcription factor will bind can be framed as a classification problem and solved using a PSFM. However, in vivo, the binding of a transcription factor depends not only on the native affinity of the transcription factor for a given DNA sequence but also on the competitive binding of other molecules. Accordingly, measurements of chromatin accessibility, as provided by assays such as DNase-seq47, can offer valuable information about the overall accessibility of a given genomic locus to transcription factor binding. A joint probability model can take this accessibility into account48,49, but training such a model can be challenging. The alternative approach uses the DNase-seq data to create a probabilistic prior and applies this prior during the scanning of the PSFM50.

Feature selection
In any application of machine learning methods, the researcher must decide what data to provide as input to the algorithm. As noted above, this choice provides a method for incorporating prior knowledge into the procedure because the researcher can decide which features of the data are likely to be relevant or irrelevant. For example, consider the problem of training a multiclass classifier to distinguish, on the basis of gene expression measurements, among different types of cancers51. Such a classifier could be valuable in two ways. First, the classifier itself could help to establish accurate diagnoses in cases of atypical presentation or histopathology. Second, the model produced during the learning phase could perform feature selection, thus identifying subsets of genes with expression patterns that contribute specifically to different types of cancer. In general, feature selection can be carried out within any supervised learning algorithm, in which the algorithm is given a large set of features (or input variables) and then automatically makes a decision to ignore some or all of the features, focusing on the subset of features that are most relevant to the task at hand.

In practice, it is important to distinguish among three distinct motivations for carrying out feature selection. First, in some cases, we want to identify a very small set of features that yield the best possible classifier. For example, we may want to produce an inexpensive way to identify a disease phenotype on the basis of the measured expression levels of a handful of genes. Such a classifier, if it is accurate enough, might form the basis of an inexpensive clinical assay. Second, we may want to use the classifier to understand the underlying biology52-54. In this case, we want the feature selection procedure to identify only the genes with expression levels that are actually relevant to the task at hand in the hope that the corresponding functional annotations or biological pathways might provide insights into the aetiology of disease.


Third, we often simply want to train the most accurate possible classifier55. In this case, we hope that the feature selection enables the classifier to identify and eliminate noisy or redundant features. Researchers are often disappointed to find that feature selection cannot optimally perform more than one of these three tasks simultaneously.

Curse of dimensionality: The observation that analysis can sometimes become more difficult as the number of features increases, particularly because overfitting becomes more likely.
Overfitting: A common pitfall in machine learning analysis that occurs when a complex model is trained on too few data points and becomes specific to the training data, resulting in poor performance on other data.

Feature selection is especially important in the third case because the analysis of high-dimensional data sets, including genomic, epigenomic, proteomic or metabolomic data sets, suffers from the curse of dimensionality56: the general observation that many types of analysis become more difficult as the number of input dimensions (that is, data measurements) grows very large. For example, as the number of data features that are provided as input to a machine learning classifier grows, it is increasingly likely that, by chance, one feature perfectly separates the training examples into positive and negative classes. This phenomenon leads to good performance on the training data but poor generalization to data that were not used in training, owing to overfitting of the model to the training data. Feature selection methods and dimensionality reduction techniques, such as principal component analysis or multidimensional scaling, aim to solve this problem by projecting the data from higher to lower dimensions.
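Both strategies mentioned above are available off the shelf. The sketch below is illustrative only (a random matrix stands in for an expression data set): it contrasts univariate feature selection with projection onto a small number of principal components.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 5000))        # 40 samples, 5,000 expression features
    y = rng.integers(0, 2, size=40)        # toy class labels (for example, two cancer types)

    X_selected  = SelectKBest(f_classif, k=20).fit_transform(X, y)   # keep the 20 best features
    X_projected = PCA(n_components=10).fit_transform(X)              # project to 10 dimensions

    print(X_selected.shape, X_projected.shape)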
Imbalanced class sizes
A common stumbling block in many applications of machine learning to genomics is the large imbalance (or label skew) in the relative sizes of the groups being classified. For example, suppose one is trying to use a discriminative machine learning method to predict the locations of enhancers in the genome. Starting with a set of 641 known enhancers, the genome can be broken up into 1,000-bp segments and each segment assigned a label (enhancer or not enhancer) on the basis of whether it overlaps with a known enhancer. This procedure produces 1,711 positive examples and around 3,000,000 negative examples, that is, roughly 2,000 times as many negative examples as positive examples. Unfortunately, most software cannot handle 3,000,000 examples.

Label skew: A phenomenon in which two labels in a supervised learning problem are present at different frequencies.

The most straightforward solution to this problem is to select a random, smaller subset of the data. However, in the case of enhancer prediction, selecting 50,000 examples at random results in 49,971 negative examples and only 28 positive examples. This number of positive examples is far too small to train an accurate classifier. To demonstrate this problem, we simulated 93 noisy ChIP-seq assays using a Gaussian model for enhancers and background positions based on data produced by the ENCODE Consortium57. We trained a logistic regression classifier to distinguish between enhancers and non-enhancers on the basis of these data. The overall accuracy of the predictions (that is, the percentage of predictions that were correct) was 99.9%. Although this seems good, accuracy is not an appropriate measure with which to evaluate performance in this setting because a null classifier that simply predicts everything to be non-enhancers achieves nearly the same accuracy.

Sensitivity (also known as recall): The fraction of positive examples identified; it is given by the number of positive predictions that are correct divided by the total number of positive examples.
Precision: The fraction of positive predictions that are correct; it is given by the number of positive predictions that are correct divided by the total number of positive predictions.

In this context, it is more appropriate to separately evaluate sensitivity (that is, the fraction of enhancers detected) and precision (that is, the percentage of predicted enhancers that are truly enhancers). The classifier described above has a high precision (>99.9%) but a very low sensitivity of 0.5%. The behaviour of the classifier can be improved by using all of the enhancers for training and then picking a random set of 49,000 non-enhancer positions as negative training examples. However, balancing the classes in this way results in the classifier learning to reproduce this artificially balanced ratio. The resulting classifier achieves much higher sensitivity (81%) but very poor precision (40%); thus, this classifier is not useful for finding enhancers that can be validated experimentally.

It is possible to trade off sensitivity and precision while retaining the training power of a balanced training set by placing weights on the training examples. In the case of enhancer prediction, we used the balanced training set, but during training we weighted each negative example 36 times more than a positive example. Doing so results in an excellent sensitivity of 53% with a precision of 95%.
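The weighting scheme described above can be reproduced with standard tools. The sketch below is illustrative: it uses simulated features and a weight of 36 on negative examples, in the spirit of the enhancer experiment but not with its data, and reports precision and sensitivity (recall) rather than overall accuracy.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score

    rng = np.random.default_rng(2)
    n = 2000
    y = np.concatenate([np.ones(n // 2), np.zeros(n // 2)])   # class-balanced training labels
    X = rng.normal(size=(n, 10)) + y[:, None]                 # positives shifted slightly

    weights = np.where(y == 0, 36.0, 1.0)                     # each negative weighted 36x
    clf = LogisticRegression().fit(X, y, sample_weight=weights)

    pred = clf.predict(X)
    print('precision:', precision_score(y, pred),
          'sensitivity (recall):', recall_score(y, pred))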
However, in the case of enhancer prediction, selecting missing data values. Missing values can come from vari-
identified; it is given by the
number of positive predictions 50,000 examples at random results in 49,971 negative ous sources, such as defective cells in a gene expression
that are correct divided by the examples and only 28 positive examples. This number microarray, unmappable genome positions in a func-
total number of positive of positive examples is far too small to train an accurate tional genomic assay or measurements that are unre-
examples. classifier. To demonstrate this problem, we simulated liable because they saturate the detection limits of an
Precision
93 noisy ChIPseq assays using a Gaussian model for instrument. Missing data values can be divided into two
The fraction of positive enhancers and background positions based on data types: values that are missing at random or for reasons
predictions that are correct; produced by the ENCODE Consortium57. We trained that are unrelated to the task at hand (such as defec-
it is given by the number of a logistic regression classifier to distinguish between tive microarray cells), and values that, when absent,
positive predictions that are
enhancers and non-enhancers on the basis of these provide information about the task at hand (such as
correct divided by the total
number of positive predictions. data. The overall accuracy of the predictions (that is, saturated detectors). The presence or absence of values
the percentage of predictions that were correct) was of the latter type is usually best incorporated directly
Precision-recall curve 99.9%. Although this seems good, accuracy is not an into themodel.
For a binary classifier applied appropriate measure with which to evaluate perfor- The simplest way to deal with data that are miss-
to a given data set, a curve that
plots precision (y axis) versus
mance in this setting because a null classifier that sim- ing at random is to impute the missing values61. This
recall (x axis) for a variety of ply predicts everything to be non-enhancers achieves can be done either with a very simple strategy, such as
classification thresholds. nearly the same accuracy. replacing all of the missing values with zero, or with a

NATURE REVIEWS | GENETICS VOLUME 16 | JUNE 2015 | 329

2015 Macmillan Publishers Limited. All rights reserved


REVIEWS

True regulatory Correlation Inferred probability of observing a certain value given this label.
network network network Missing data points are handled by summing over all
possibilities for that random variable in the model. This
approach, called marginalization, represents the case in
which a particular variable is unobserved. However,
marginalization is only appropriate when data points
are missing for reasons that are unrelated to the task at
hand. When the presence or absence of a data point is
Regulation Correlation likely to be correlated with the values themselves, incor-
porating presence or absence explicitly into the model
Figure 6 | Inferring network structure. Methods that
Nature Reviews | Genetics is more appropriate.
infer each relationship in a network separately, such as by
computing the correlation between each pair, can be Modelling dependence among examples
confounded by indirect relationships. Methods that infer Sofar, we have focused on machine learning tasks that
the network as a whole can identify only direct
involve sets of independent instances of a common
relationships. Inferring the direction of causality inherent
in networks is generally more challenging than inferring
pattern. However, in some domains, individual enti-
the network structure68; as a result, many network ties, such as genes, are not independent, and the rela-
inference methods, such as Gaussian graphical model tionships among them are important. By inferring such
learning, infer only the network. relationships, it is possible to integrate many examples
Finally, probability models can explicitly model missing data by considering all the potential missing values. For example, this approach was used by Hoffman et al.21 to analyse functional genomic data, which contain missing values due to mappability issues from short-read sequencing data. The probability model provides an annotation by assigning a label to each position across the genome and modelling the probability of observing a certain value given this label. Missing data points are handled by summing over all possibilities for that random variable in the model. This approach, called marginalization, represents the case in which a particular variable is unobserved. However, marginalization is only appropriate when data points are missing for reasons that are unrelated to the task at hand. When the presence or absence of a data point is likely to be correlated with the values themselves, incorporating presence or absence explicitly into the model is more appropriate.

Marginalization
A method for handling missing data points by summing over all possibilities for that random variable in the model.
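A toy illustration of marginalization is given below, assuming a label-conditional emission model in which each assay is an independent Gaussian; the labels, means and standard deviations are invented for the example and are not those of any published model. Because the assays are conditionally independent given the label, summing (integrating) over all values of a missing assay simply removes that factor from the product.

import numpy as np
from scipy.stats import norm

# Invented label-conditional emission model: given a chromatin-state label,
# each of three assays is modelled as an independent Gaussian.
means  = {"promoter": np.array([2.0, 1.5, 0.2]),
          "quiescent": np.array([0.1, 0.2, 0.1])}
sigmas = {"promoter": np.array([0.5, 0.5, 0.3]),
          "quiescent": np.array([0.3, 0.3, 0.3])}

def log_likelihood(x, label):
    # Log P(x | label) with NaNs marginalized out: conditional independence means
    # integrating over a missing assay simply drops that factor from the product.
    observed = ~np.isnan(x)
    return norm.logpdf(x[observed], means[label][observed], sigmas[label][observed]).sum()

x = np.array([1.8, np.nan, 0.3])   # the middle assay is unmappable at this position
print(log_likelihood(x, "promoter"), log_likelihood(x, "quiescent"))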
Modelling dependence among examples
So far, we have focused on machine learning tasks that involve sets of independent instances of a common pattern. However, in some domains, individual entities, such as genes, are not independent, and the relationships among them are important. By inferring such relationships, it is possible to integrate many examples into a meaningful network. Such a network might represent physical interactions among proteins, regulatory interactions between genes or symbiosis between microorganisms. Networks are useful both for understanding the biological relationships between entities and as input into a downstream analysis that makes use of these relationships.

The most straightforward way to infer the relationships among examples is to consider each pair independently. In this case, the problem of network learning is reduced to a normal machine learning problem, defined on pairs of individuals rather than individual examples. Qiu et al.64 used an SVM classifier to predict, using data such as protein sequences and cellular localization, whether a given pair of proteins physically interact.
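In code, this reduction amounts to building one feature vector per pair of entities and training any standard classifier on labelled interacting and non-interacting pairs. The sketch below is only in the spirit of such pairwise approaches: the per-protein feature table, the symmetric pair encoding and the SVM settings are all assumptions.

import numpy as np
from sklearn.svm import SVC

def pair_features(f_a, f_b):
    # Combine two proteins' feature vectors symmetrically, so that the
    # representation of (a, b) equals that of (b, a).
    return np.concatenate([f_a + f_b, np.abs(f_a - f_b)])

def train_pair_classifier(features, labelled_pairs):
    # features: dict mapping protein name -> per-protein feature vector.
    # labelled_pairs: list of (protein_a, protein_b, interacts) training triples.
    X = np.array([pair_features(features[a], features[b]) for a, b, _ in labelled_pairs])
    y = np.array([label for _, _, label in labelled_pairs])
    return SVC(kernel="rbf", C=1.0).fit(X, y)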
A downside of any approach that considers each relationship independently is that such methods cannot take into account the confounding effects of indirect relationships (FIG. 6). For example, in the case of gene regulation, an independent model cannot infer whether a pair of genes directly regulate each other or whether they are both regulated by a third gene. Such spurious inferred relationships, called transitive relationships, can be removed by methods that infer the graph as a whole. For example, Friedman et al.65 inferred a Bayesian network on gene expression data that models which genes regulate each other. Such a network includes only direct effects and models indirect correlations through multiple-hop paths in the network. Therefore, methods that infer a network as a whole are more biologically interpretable because they remove these indirect correlations; a large number of such methods have been described and reviewed elsewhere66,67.

Transitive relationships
An observed correlation between two features that is caused by direct relationships between these two features and a third feature.

Figure 6 | Inferring network structure. (Panels: true regulatory network, correlation network and inferred network; edge types: regulation and correlation.) Methods that infer each relationship in a network separately, such as by computing the correlation between each pair, can be confounded by indirect relationships. Methods that infer the network as a whole can identify only direct relationships. Inferring the direction of causality inherent in networks is generally more challenging than inferring the network structure68; as a result, many network inference methods, such as Gaussian graphical model learning, infer only the network.
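As a contrast to pair-by-pair inference, the sketch below estimates a sparse Gaussian graphical model with the graphical lasso, one of the whole-network approaches mentioned in the figure legend. Non-zero entries of the estimated precision (inverse covariance) matrix indicate direct dependencies after conditioning on all other genes, so correlations explained by a third gene tend to be removed. The expression matrix and the sparsity parameter alpha are placeholders.

import numpy as np
from sklearn.covariance import GraphicalLasso

def direct_dependency_network(expression, alpha=0.1):
    # expression: samples x genes matrix. Returns a Boolean adjacency matrix in which
    # an edge denotes a non-zero partial correlation, that is, a dependency that
    # remains after conditioning on all other genes.
    model = GraphicalLasso(alpha=alpha).fit(expression)
    adjacency = np.abs(model.precision_) > 1e-8
    np.fill_diagonal(adjacency, False)   # ignore self-edges
    return adjacency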
Conclusions
On the one hand, machine learning methods, which are most effective in the analysis of large, complex data sets, are likely to become ever more important to genomics as more large data sets become available through international collaborative projects, such as the 1000 Genomes Project, the 100,000 Genomes Project, ENCODE, the Roadmap Epigenomics Project and the US National Institutes of Health's 4D Nucleome Initiative.


On the other hand, even in the presence of massive amounts of data, machine learning techniques are not generally useful when applied in an arbitrary manner. In practice, achieving good performance from a machine learning method usually requires theoretical and practical knowledge of both machine learning methodology and the particular research application area. As new technologies for generating large genomic and proteomic data sets emerge, pushing beyond DNA sequencing to mass spectrometry, flow cytometry and high-resolution imaging methods, demand will increase not only for new machine learning methods but also for experts that are capable of applying and adapting them to big data sets. In this sense, both machine learning itself and scientists proficient in these applications are likely to become increasingly important to advancing genetics and genomics.

1. Mitchell, T. Machine Learning (McGraw-Hill, 1997).
This book provides a general introduction to machine learning that is suitable for undergraduate or graduate students.
2. Ohler, W., Liao, C., Niemann, H. & Rubin, G. M. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 3, RESEARCH0087 (2002).
3. Degroeve, S., Baets, B. D., de Peer, Y. V. & Rouzé, P. Feature subset selection for splice site prediction. Bioinformatics 18, S75–S83 (2002).
4. Bucher, P. Weight matrix description of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol. 212, 563–578 (1990).
5. Heintzman, N. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature Genet. 39, 311–318 (2007).
6. Segal, E. et al. A genomic code for nucleosome positioning. Nature 442, 772–778 (2006).
7. Picardi, E. & Pesole, G. Computational methods for ab initio and comparative gene finding. Methods Mol. Biol. 609, 269–284 (2010).
8. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nature Genet. 25, 25–29 (2000).
9. Fraser, A. G. & Marcotte, E. M. A probabilistic view of gene function. Nature Genet. 36, 559–564 (2004).
10. Beer, M. A. & Tavazoie, S. Predicting gene expression from sequence. Cell 117, 185–198 (2004).
11. Karlić, R., Chung, H. R., Lasserre, J., Vlahoviček, K. & Vingron, M. Histone modification levels are predictive for gene expression. Proc. Natl Acad. Sci. USA 107, 2926–2931 (2010).
12. Ouyang, Z., Zhou, Q. & Wong, H. W. ChIP-seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc. Natl Acad. Sci. USA 106, 21521–21526 (2009).
13. Friedman, N. Inferring cellular networks using probabilistic graphical models. Science 303, 799–805 (2004).
14. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction (Springer, 2001).
This book provides an overview of machine learning that is suitable for students with a strong background in statistics.
15. Hamelryck, T. Probabilistic models and machine learning in structural bioinformatics. Stat. Methods Med. Res. 18, 505–526 (2009).
16. Swan, A. L., Mobasheri, A., Allaway, D., Liddell, S. & Bacardit, J. Application of machine learning to proteomics data: classification and biomarker identification in postgenomics biology. OMICS 17, 595–610 (2013).
17. Upstill-Goddard, R., Eccles, D., Fliege, J. & Collins, A. Machine learning approaches for the discovery of gene–gene interactions in disease data. Brief. Bioinform. 14, 251–260 (2013).
18. Yip, K. Y., Cheng, C. & Gerstein, M. Machine learning and genome annotation: a match meant to be? Genome Biol. 14, 205 (2013).
19. Day, N., Hemmaplardh, A., Thurman, R. E., Stamatoyannopoulos, J. A. & Noble, W. S. Unsupervised segmentation of continuous genomic data. Bioinformatics 23, 1424–1426 (2007).
20. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nature Methods 9, 215–216 (2012).
This study applies an unsupervised hidden Markov model algorithm to analyse genomic assays such as ChIP-seq and DNase-seq in order to identify new classes of functional elements and new instances of existing functional element types.
21. Hoffman, M. M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods 9, 473–476 (2012).
22. Chapelle, O., Schölkopf, B. & Zien, A. (eds) Semi-supervised Learning (MIT Press, 2006).
23. Stamatoyannopoulos, J. A. Illuminating eukaryotic transcription start sites. Nature Methods 7, 501–503 (2010).
24. Boser, B. E., Guyon, I. M. & Vapnik, V. N. in A Training Algorithm for Optimal Margin Classifiers (ed. Haussler, D.) 144–152 (ACM Press, 1992).
This paper was the first to describe the SVM, a type of discriminative classification algorithm.
25. Noble, W. S. What is a support vector machine? Nature Biotech. 24, 1565–1567 (2006).
This paper describes a non-mathematical introduction to SVMs and their applications to life science research.
26. Ng, A. Y. & Jordan, M. I. Advances in Neural Information Processing Systems (eds Dietterich, T. et al.) (MIT Press, 2002).
27. Jordan, M. I. Why the logistic function? A tutorial discussion on probabilities and neural networks. Computational Cognitive Science Technical Report 9503 [online], http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/homework/jordan_logistic.pdf (1995).
28. Wolpert, D. H. & Macready, W. G. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1, 67–82 (1997).
This paper provides a mathematical proof that no single machine learning method can perform best on all possible learning problems.
29. Yip, K. Y. et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 13, R48 (2012).
30. Urbanowicz, R. J., Granizo-Mackenzie, D. & Moore, J. H. in Proceedings of the Parallel Problem Solving From Nature 266–275 (Springer, 2012).
31. Brown, M. et al. in Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology (ed. Rawlings, C.) 47–55 (AAAI Press, 1993).
32. Bailey, T. L. & Elkan, C. P. in Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology (eds Rawlings, C. et al.) 21–29 (AAAI Press, 1995).
33. Schölkopf, B. & Smola, A. Learning with Kernels (MIT Press, 2002).
34. Leslie, C. et al. (eds) Proceedings of the Pacific Symposium on Biocomputing (World Scientific, 2002).
35. Rätsch, G. & Sonnenburg, S. in Kernel Methods in Computational Biology (eds Schölkopf, B. et al.) 277–298 (MIT Press, 2004).
36. Zien, A. et al. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16, 799–807 (2000).
37. Saigo, H., Vert, J. P. & Akutsu, T. Optimizing amino acid substitution matrices with a local alignment kernel. BMC Bioinformatics 7, 246 (2006).
38. Jaakkola, T. & Haussler, D. Advances in Neural Information Processing Systems 11 (Morgan Kaufmann, 1998).
39. Shawe-Taylor, J. & Cristianini, N. Kernel Methods for Pattern Analysis (Cambridge Univ. Press, 2004).
This textbook describes kernel methods, including a detailed mathematical treatment that is suitable for quantitatively inclined graduate students.
40. Peña-Castillo, L. et al. A critical assessment of M. musculus gene function prediction using integrated genomic evidence. Genome Biol. 9, S2 (2008).
41. Sonnhammer, E., Eddy, S. & Durbin, R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28, 405–420 (1997).
42. Apweiler, R. et al. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29, 37–40 (2001).
43. Pavlidis, P., Weston, J., Cai, J. & Noble, W. S. Learning gene functional classifications from multiple data types. J. Comput. Biol. 9, 401–411 (2002).
44. Lanckriet, G. R. G., Bie, T. D., Cristianini, N., Jordan, M. I. & Noble, W. S. A statistical framework for genomic data fusion. Bioinformatics 20, 2626–2635 (2004).
45. Troyanskaya, O. G., Dolinski, K., Owen, A. B., Altman, R. B. & Botstein, D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl Acad. Sci. USA 100, 8348–8353 (2003).
46. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, 1998).
This textbook on probability models for machine learning is suitable for undergraduates or graduate students.
47. Song, L. & Crawford, G. E. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harbor Protoc. 2, pdb.prot5384 (2010).
48. Wasson, T. & Hartemink, A. J. An ensemble model of competitive multi-factor binding of the genome. Genome Res. 19, 2102–2112 (2009).
49. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 21, 447–455 (2011).
50. Cuellar-Partida, G. et al. Epigenetic priors for identifying active transcription factor binding sites. Bioinformatics 28, 56–62 (2011).
51. Ramaswamy, S. et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl Acad. Sci. USA 98, 15149–15154 (2001).
52. Glaab, E., Bacardit, J., Garibaldi, J. M. & Krasnogor, N. Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PLoS ONE 7, e39932 (2012).
53. Tibshirani, R. J. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–288 (1996).
This paper was the first to describe the technique known as lasso (or L1 regularization), which performs feature selection in conjunction with learning.
54. Urbanowicz, R. J., Granizo-Mackenzie, A. & Moore, J. H. An analysis pipeline with statistical and visualization-guided knowledge discovery for Michigan-style learning classifier systems. IEEE Comput. Intell. Mag. 7, 35–45 (2012).
55. Tikhonov, A. N. On the stability of inverse problems. Dokl. Akad. Nauk SSSR 39, 195–198 (1943).
This paper was the first to describe the now-ubiquitous method known as L2 regularization or ridge regression.
56. Keogh, E. & Mueen, A. Encyclopedia of Machine Learning (Springer, 2011).
57. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
58. Manning, C. D. & Schütze, H. Foundations of Statistical Natural Language Processing (MIT Press, 1999).
59. Davis, J. & Goadrich, M. Proceedings of the International Conference on Machine Learning (ACM, 2006).
This paper provides a succinct introduction to precision-recall and receiver operating characteristic curves, and details under which scenarios these approaches should be used.


60. Cohen, J. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70, 213 (1968).
61. Luengo, J., García, S. & Herrera, F. On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowl. Inf. Syst. 32, 77–108 (2012).
62. Troyanskaya, O. et al. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525 (2001).
This study uses an imputation-based approach to handle missing values in microarray data. The method was widely used in subsequent studies to address this common problem.
63. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genet. 46, 310–315 (2014).
This study uses a machine learning approach to estimate the pathogenicity of genetic variants using a framework that takes advantage of the fact that natural selection removes deleterious variation.
64. Qiu, J. & Noble, W. S. Predicting co-complexed protein pairs from heterogeneous data. PLoS Comput. Biol. 4, e1000054 (2008).
65. Friedman, N., Linial, M., Nachman, I. & Pe'er, D. Using Bayesian networks to analyze expression data. J. Comput. Biol. 7, 601–620 (2000).
66. Bacardit, J. & Llorà, X. Large-scale data mining using genetics-based machine learning. Wiley Interdiscip. Rev. 3, 37–61 (2013).
67. Koski, T. J. & Noble, J. A review of Bayesian networks and structure learning. Math. Applicanda 40, 51–103 (2012).
68. Pearl, J. Causality: Models, Reasoning and Inference (Cambridge Univ. Press, 2000).

Competing interests statement
The authors declare no competing interests.

FURTHER INFORMATION
1000 Genomes Project: http://www.1000genomes.org
100,000 Genomes Project: http://www.genomicsengland.co.uk
4D Nucleome: https://commonfund.nih.gov/4Dnucleome/index
ENCODE: http://www.encodeproject.org
InterPro: http://www.ebi.ac.uk/interpro/
Machine learning on Coursera: https://www.coursera.org/course/ml
Machine learning open source software: https://mloss.org/software/
Pfam: http://pfam.xfam.org/
PyML: http://pyml.sourceforge.net/
Roadmap Epigenomics Project: http://www.roadmapepigenomics.org
Weka 3: http://www.cs.waikato.ac.nz/ml/weka/

