
Does Machine Learning Really Work?

Tom M. Mitchell

Does machine learning really work? Yes. Over the past decade, machine learning has evolved from a field of laboratory demonstrations to a field of significant commercial value. Machine-learning algorithms have now learned to detect credit card fraud by mining data on past transactions, learned to steer vehicles driving autonomously on public highways at 70 miles an hour, and learned the reading interests of many individuals to assemble personally customized electronic newspapers. A new computational theory of learning is beginning to shed light on fundamental issues, such as the trade-off among the number of training examples available, the number of hypotheses considered, and the likely accuracy of the learned hypothesis. Newer research is beginning to explore issues such as long-term learning of new representations, the integration of Bayesian inference and induction, and life-long cumulative learning. This article, based on the keynote talk presented at the Thirteenth National Conference on Artificial Intelligence, samples a number of recent accomplishments in machine learning and looks at where the field might be headed.

Ever since computers were invented, it has been natural to wonder whether they might be made to learn. Imagine computers learning from medical records to discover emerging trends in the spread and treatment of new diseases, houses learning from experience to optimize energy costs based on the particular usage patterns of their occupants, or personal software assistants learning the evolving interests of their users to highlight especially relevant stories from the online morning newspaper. A successful understanding of how to make computers learn would open up many new uses for computers. And a detailed understanding of information-processing algorithms for machine learning might lead to a better understanding of human learning abilities (and disabilities) as well.

Although we do not yet know how to make computers learn nearly as well as people learn, many successful machine-learning applications have been developed in recent years, ranging from data-mining programs that learn to detect fraudulent credit card transactions to information-filtering systems that learn users' reading preferences to autonomous vehicles that learn to drive on public highways. Along with this growing list of applications, there have been important advances in the theory and algorithms that form the foundations of this field.

The Niche for Machine Learning

One way to understand machine learning is to consider its role within the broader field of computer science. Given the broad range of software paradigms within computer science, what is the niche to be filled by software that learns from experience? Obviously, for computer applications such as matrix multiplication, machine learning is neither necessary nor desirable. However, for other types of problems, machine-learning methods are already emerging as the software development method of choice. In particular, machine learning is beginning to play an essential role within the following three niches in the software world: (1) data mining, (2) difficult-to-program applications, and (3) customized software applications.

Data mining: Data mining is the process of using historical databases to improve subsequent decision making. For example, banks now analyze historical data to help decide which future loan applicants are likely to be creditworthy. Hospitals now analyze historical data to help decide which new patients are likely to respond best to which treatments. As the volume of online data grows each year, this niche for machine learning is bound to grow in importance. See www.kdnuggets.com/ for a variety of information on this topic.

(Portions of this article first appeared in Machine Learning by Tom M. Mitchell, 1997, McGraw Hill Book Company, and have been reprinted here with permission from the publisher. All new material in this article is Copyright 1997, American Association for Artificial Intelligence. All rights reserved.)
Data:

                           time = 1    time = 2    time = n
Patient103
  Age:                     23          23          23
  FirstPregnancy:          no          no          no
  Anemia:                  no          no          no
  Diabetes:                no          YES         no
  PreviousPrematureBirth:  no          no          no
  Ultrasound:              ?           abnormal    ?
  Elective C-Section:      ?           no          no
  Emergency C-Section:     ?           ?           Yes

Learned Rule:

  If   no previous vaginal delivery, and
       abnormal second-trimester ultrasound, and
       malpresentation at admission,
  Then the probability of emergency C-section is 0.6.

  Training-set accuracy: 26/41 = .63
  Test-set accuracy:     12/20 = .60

Figure 1. A Typical Data-Mining Application.
A historical set of 9714 medical records describes pregnant women over time. The top portion of the figure illustrates a typical patient record, where ? indicates that the feature value is unknown. The task here is to identify classes of patients at high risk of receiving an emergency Cesarean section (C-section). The bottom portion shows one of many rules discovered from these data using Schenley Park Research's RULER system. Whereas 7 percent of all pregnant women received emergency C-sections, this rule identifies a subclass of high-risk patients, 60 percent of whom received emergency C-sections.

Difficult-to-program applications: Machine-learning algorithms can play an essential role in applications that have proven too difficult for traditional manual programming, such as face recognition and speech understanding. The most accurate current programs for face recognition, for example, were developed using training examples of face images together with machine-learning algorithms. In a variety of applications where complex sensor data must be interpreted, machine-learning algorithms are already the method of choice for developing software.

Customized software applications: In many computer applications, such as online news browsers and personal calendars, one would like a system that automatically customizes to the needs of individual users after it has been fielded. For example, we would like a news browser that customizes to the individual user's reading interests or an online calendar that customizes to the individual user's scheduling preferences. Because it is unrealistic to manually develop a separate system for each user, machine learning offers an attractive option for enabling software to automatically customize itself to individual users.

The following sections present examples of successful applications of machine-learning algorithms within each of these three niches. As in any short summary of a large field, the role of this article is to sample a few of the many research projects and research directions in the field to provide a feel for the current state of the art and current research issues. More lengthy treatments of the current state of machine learning can be found in Dietterich's article (also in this issue) and in Mitchell (1997). Treatments of various subfields of machine learning are also available for topics such as statistical learning methods (Bishop 1996), neural networks (Chauvin and Rumelhart 1995; Hecht-Nielsen 1990), instance-based learning (Aha, Kibler, and Albert 1991), genetic algorithms (Mitchell 1996; Forrest 1993), computational learning theory (Kearns and Vazirani 1994), decision tree learning (Quinlan 1993), inductive logic programming (Lavrac and Dzeroski 1994), and reinforcement learning (Kaelbling, Littman, and Moore 1996; Barto, Bradtke, and Singh 1995).

Data Mining: Predicting Medical Outcomes from Historical Data

Data mining involves applying machine-learning methods to historical data to improve future decisions. One prototypical example is illustrated in figure 1. In this example, historical data describing 9714 different pregnant women have been provided, with each woman described by a set of 215 attributes over time. These attributes describe the woman's health history (for example, number of previous pregnancies), measurements taken at various times during pregnancy (for example, ultrasound results), the type of delivery (for example, normal, elective Cesarean section [C-section], emergency C-section), and the final health of mother and baby.

Given such time-series data, one is often interested in learning to predict features that occur late in the time series based on features that are known earlier. For example, given the pregnancy data, one interesting problem is to predict which future patients are at exceptionally high risk of requiring an emergency C-section. This problem is clinically significant because most physicians agree that if they know in advance that a particular patient is at high risk of an emergency C-section, they will seriously consider an elective C-section instead to improve the expected clinical outcome for mother and child. The bottom half of figure 1 provides one typical rule learned from these historical data. Whereas the risk of emergency C-section for the entire population of patients in this case was 7 percent, this rule identifies a class of patients for which the risk is 60 percent. The rule was learned by the RULER program developed by Schenley Park Research, Inc. The program used two-thirds of the available data as a training set, holding out the remaining one-third as a test set to verify rule accuracy. As summarized in the figure, a total of 41 patients in the training set match the rule conditions, of which 26 received an emergency C-section. A similar accuracy is found over the test set (that is, of the 20 patients in the test set who satisfy the rule preconditions, 12 received emergency C-sections). Thus, this rule successfully identifies a class of high-risk patients, both over the training data and over a separate set of test data.
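To make the train/test bookkeeping concrete, here is a minimal sketch of how such a learned rule might be scored against held-out patient records. The record format and attribute names are hypothetical, invented for illustration; the article does not describe RULER's internal representation.

    # Sketch: scoring a figure 1 style rule on training and test records.
    # Each record is assumed to be a dict with hypothetical attribute names.

    def rule_matches(patient):
        # The rule's preconditions, paraphrased from figure 1.
        return (not patient["previous_vaginal_delivery"]
                and patient["second_trimester_ultrasound"] == "abnormal"
                and patient["malpresentation_at_admission"])

    def rule_accuracy(records):
        # Count rule-matching records whose outcome was an emergency C-section.
        matched = [r for r in records if rule_matches(r)]
        hits = sum(1 for r in matched if r["emergency_c_section"])
        return hits, len(matched)

    # RULER used two-thirds of the records for training and held out one-third:
    # hits, n = rule_accuracy(test_records)   ->  12, 20, that is, 12/20 = .60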
This medical example illustrates a prototypical use of machine-learning algorithms for data mining. In this example, a set of time-series data was used to learn rules that predict later features in the series from earlier features. Time-series prediction problems such as this occur in many problem domains and data sets. For example, many banks have time-series data indicating which credit card transactions were later found to be fraudulent. Many businesses have customer databases that indicate which customers were later found to purchase certain items. Universities have data on which students were later found to successfully graduate. Manufacturers have time-series data on which process parameters later produced flawed or optimal products. Figure 2 illustrates a number of time-series prediction problems with the same abstract structure as the pregnancy example. In all these cases, the problem of discovering predictive relations between earlier and later features can be of great value.

What is the current state of the art for such problems? The answer is that we now have a robust set of first-generation machine-learning algorithms that are appropriate for many problems of this form. These include decision tree learning algorithms such as C4.5 (Quinlan 1993), rule-learning algorithms such as CN2 (Clark and Niblett 1989) and FOIL (Quinlan 1990), and neural network learning algorithms such as back propagation (Rumelhart and McClelland 1986). Commercial implementations of many of these algorithms are now available. At the same time, researchers are actively pursuing new algorithms that learn more accurately than these first-generation algorithms and that can use types of data that go beyond the current commercial state of the art. To give a typical example of this style of research, Caruana, Baluja, and Mitchell (1995) describe a multitask learning algorithm that achieves higher prediction accuracy than these other methods. The key to their success is that their system is trained to simultaneously predict multiple features rather than just the feature of interest. For example, their system achieves improved accuracy for predicting mortality risk in pneumonia patients by simultaneously predicting various lab results for the patient.

What are good topics for new thesis research in this area? There are many, such as developing algorithms for active experimentation to acquire new data, integrating learning algorithms with interactive data visualization, and learning across multiple databases. Two of my current favorite research topics in this area are (1) learning from mixed-media data and (2) learning using private data plus internet-accessible data.


Customer Purchase Behavior:

                      time = t0    time = t1    time = tn
Customer103
  Sex:                M            M            M
  Age:                53           53           53
  Income:             $50k         $50k         $50k
  Own House:          Yes          Yes          Yes
  MS Products:        Word         Word         Word
  Computer:           386 PC       Pentium      Pentium
  Purchase Excel?:    ?            ?            Yes

Customer Retention:

                      time = t0    time = t1    time = tn
Customer103
  Sex:                M            M            M
  Age:                53           53           53
  Income:             $50k         $50k         $50k
  Own House:          Yes          Yes          Yes
  Checking:           $5k          $20k         $0
  Savings:            $15k         $0           $0
  Current-customer?:  yes          yes          No

Process Optimization:

                          time = t0    time = t1    time = tn
Product72
  Stage:                  mix          cook         cool
  Mixing-speed:           60 rpm       --           --
  Temperature:            --           325          --
  Fan-speed:              --           --           medium
  Viscosity:              1.3          3.2          1.3
  Fat content:            15%          12%          12%
  Density:                2.8          1.1          1.2
  Spectral peak:          2800         3200         3100
  Product underweight?:   ?            ?            Yes

Figure 2. Additional Examples of Time-Series Prediction Problems.

Learning from mixed-media data: Whereas rule-learning methods such as the one discussed in figure 1 can be applied to data described by numeric features (for example, age) and symbolic features (for example, gender), current medical records include a rich mixture of additional media, such as images (for example, sonogram images, x-rays), signals from other instruments (for example, EKG signals), text (for example, notes made on the patient chart), and even speech (for example, in some hospitals, physicians speak into handheld recorders as they make hospital rounds). Although we have machine-learning methods that can learn over numeric and symbolic data, other methods that can learn over images, and other methods that can aid learning with speech data, we have no principled methods that can take advantage of the union of these types of data. Put briefly, we already have available a rich, mixed-media set of historical medical data, but we lack the machine-learning methods to take advantage of these data. Research on new algorithms that use such mixed-media data sets in medicine and other domains could lead to substantially more accurate methods for learning from available historical data and to substantially more accurate decision-making aids for physicians and other decision makers.

Learning using private data plus internet-accessible data: To illustrate this topic, consider a university interested in using historical data to identify current students who are at high risk of dropping out because of academic or other difficulties. The university might collect data on students over several years and apply a rule-learning method similar to that illustrated in figure 1. Such first-generation learning methods might work well, provided there is a sufficiently large set of training data and a data set that captures the features that are actually relevant to predicting student attrition.

Now imagine a learning method that could augment the data collected for this study by also using the rich set of information available over the World Wide Web. For example, the web might contain much more complete information about the particular academic courses taken by the student, the extracurricular clubs the student joined, the student's living quarters, and so on. The point here is that the internet can provide a vast source of data that goes far beyond what will be available in any given survey. If this internet information can be used effectively, it can provide important additional features that can improve the accuracy of learned rules. Of course, much of this web-based information appears in text form, and it contains a very large number of irrelevant facts in addition to the small number of relevant facts. Research on new methods that utilize the web and other public online sources (for example, news feeds, electronic discussion groups) to augment available private data sets holds the potential to significantly increase the accuracy of learning in many applications.

Learning to Perceive: Face Recognition

A second application niche for machine learning lies in constructing computer programs that are simply too difficult to program by hand. Many sensor-interpretation problems fall into this category. For example, today's top speech-recognition systems all use machine-learning methods to some degree to improve their accuracy. A second example is image-classification problems, such as the face-recognition task illustrated in figure 3. The face images shown here have been used to train neural networks to classify new face images according to the identity of the person, his/her gender, and the direction in which he/she is looking. These images are part of a larger collection containing 624 images of 20 different people (selected students from Carnegie Mellon University's machine-learning class). For each person, approximately 32 different grey-scale images were collected, varying the person's expression, the direction in which he/she is looking, and his/her appearance with or without sunglasses. Given these data, students in the machine-learning class were asked to train a computer program to correctly recognize the face in a new image. Students succeeded in training a neural network to achieve this task with an accuracy of 90 percent at recognizing these faces, compared to the 5-percent accuracy that would be obtained by randomly guessing which of the 20 people is in the image. In contrast, it is extremely difficult to manually construct a face-recognition program of comparable accuracy.

To illustrate the use of neural networks for image classification, consider the slightly simpler task of classifying these same images according to the direction in which the person is looking (for example, to his/her right or left or up or straight ahead). This task can also be accomplished with approximately 90-percent accuracy by training the neural network shown at the top left of figure 3. The inputs to the neural network encode the image, and the outputs classify the input image according to the direction in which the person is looking. More specifically, the image is represented by a 30 × 32 coarse-grained grid of pixel values, and each of these 960 pixel values is used as a distinct input to the network. The first layer of the network shown in the figure contains 3 units, each of which receives the same set of 960 inputs. The activation of each such unit is computed by calculating a linear-weighted sum of the 960 inputs, then passing the result through a sigmoid-squashing function that maps this linear sum into the interval [0,1]. The activations of these three units are then input to the final layer of four output units. As shown in the figure, each output unit represents one of the four possible face directions, with the highest-valued output taken as the network prediction. This network is trained using the back-propagation algorithm (Rumelhart and McClelland 1986) to adjust the weights of each network unit so that the difference between the global network output and the known correct value for each training example is minimized.
tion task illustrated in figure 3. The face images One intriguing property of such multilayer
difficult to
shown here have been used to train neural net- neural networks is that during training, the program by
works to classify new face images according to internal hidden units develop an internal rep-
the identity of the person, his/her gender, and resentation of the input image that captures
hand.
the direction in which he/she is looking. These features relevant to the problem at hand. In a
images are part of a larger collection contain- sense, the learning algorithm discovers a new
ing 624 images of 20 different people (selected representation of the input image that captures
students from Carnegie Mellon Universitys the most important features. We can examine
machine-learning class). For each person, this discovered representation in the current
approximately 32 different grey-scale images example by looking at the 960 input weights to
were collected, varying the persons expres- each of the 3 internal units in the network of
sion, the direction in which he/she is looking, figure 3. Suppose we display the 960 weight
and his/her appearance with or without sun- values (one for weighting each input pixel) in a
glasses. Given these data, students in the grid, with each weight displayed in the posi-
machine-learning class were asked to train a tion of the corresponding pixel in the image
computer program to correctly recognize the grid. This visualization is exactly the visualiza-
face in a new image. Students succeeded in tion of network weights shown to the right of
training a neural network to achieve this task the network in the figure. Here, large positive
with an accuracy of 90 percent at recognizing weights are displayed in bright white, large
these faces compared to the 5-percent accuracy negative weights in dark black, and near-zero
that would be obtained by randomly guessing weights in grey. The topmost four blocks depict
which of the 20 people is in the image. In con- the weights for each of the four output units,
trast, it is extremely difficult to manually con- and the bottom three large grids depict the


[Figure 3: the network (top left), with output units labeled Left, Straight, Right, and Up; the learned weight diagrams (top right); the 30 × 32 network inputs; and typical input images (bottom).]

Figure 3. Face Recognition.
A neural network (top left) is trained to recognize whether the person is looking to his/her left or right or straight ahead or up. Each of the 30 × 32 pixels in the coarse-grained image is a distinct input to the network. After training on 260 such images, the network achieves an accuracy of 90 percent at classifying future images. Similar accuracy is obtained when training to recognize the individual person. The diagrams on the top right depict the weight values of the learned network. These data and the learning code can be obtained at www.cs.cmu.edu/~tom/faces.html.

To understand these weight diagrams, consider first the four blocks at the top right, which describe the weights for the four output units of the network. Let us look at the third block, which encodes the weights for the third output unit, the unit representing that the person is looking to his/her right. This block shows four weight values. The first (leftmost) of these is the weight that determines the unit's threshold, and the remaining three are the weights connecting the three hidden units to this output unit. Note that this right output unit has a large positive (bright-white) weight connecting to the middle hidden unit and a large negative (dark-black) weight connecting to the third hidden unit. Therefore, this unit will predict that the person is facing to his/her right if the second hidden unit has a high activation and the third hidden unit does not.

Now consider the second hidden unit, whose activation will cause the correct output unit to activate. The weights for this unit are depicted in the middle diagram of the large weight diagrams directly below the output-unit diagrams. Which pixels in the input image is this middle unit most sensitive to? Note first that the largest weight values (bright white and dark black) for this unit are associated with pixels that cover the face or body of the person, and the smallest (light grey) weights correspond to the image background. Thus, the face classification will depend primarily on this section of the image. Second, notice the cluster of bright (positive) weight values near the left side of the face and the cluster of dark (negative) weight values near the right side. Thus, this particular unit will produce a large weighted sum when it is given an input image with bright pixels overlapping these positive weights and dark pixels overlapping these negative weights.


                          Predicted Rating
True Rating      1     2     3     4     5   Skip   Total
    1:           0     1     0     0     0      1       2
    2:           1    15     6     4     0     15      41
    3:           0     6    31    20     0     15      72
    4:           0     6     8    42     0     20      76
    5:           0     0     0     4     0      1       5
    skip:        0     8     4     5     1    141     159

Table 1. Results of NEWS WEEDER for One User.


After observing the user read approximately 1300 news articles, the system predicted ratings for approximately 600 new articles. This confusion matrix shows the relationship between NEWS WEEDER's predictions and the actual user ratings. For example, the 20 in the table indicates there were 20 articles for which NEWS WEEDER predicted a rating of 4 but that the user rated a 3. Note that the large numbers along the diagonal indicate correct predictions.

One example of such an image is the third face image shown at the bottom of the figure, in which the person is looking to his/her right. This unit will therefore have a high output when given the image of the person looking to his/her right, whereas it will have a low value when given an image of a person looking to his/her left (for example, the first image shown in the figure).

Neural network learning algorithms such as the back-propagation algorithm given here are used in many sensor-interpretation problems. If you would like to experiment for yourself with this face-recognition data and learning algorithm, the data, code, and documentation are available at www.cs.cmu.edu/~tom/faces.html.

What is the state of the art in applying machine learning to such perception problems? In many perception problems, learning methods have been applied successfully and outperform nonlearning approaches (for example, in many image-classification tasks and other signal-interpretation problems). Still, there remain substantial limits to the current generation of learning methods. For example, the straightforward application of back propagation illustrated in figure 3 depends on the face occurring reliably in the same portion of the image. This network cannot easily learn to recognize faces independent of translation, rotation, or scaling of the face within the image. For an example of how such neural network face-recognition systems can be extended to accommodate translation and scaling, see Rowley, Baluja, and Kanade (1996).

There is a great deal of active research on learning in perception and on neural network learning in general. One active area of new research involves developing computational models that more closely fit what is known about biological perceptual systems. Another active area involves attempts to unify neural network learning methods with Bayesian learning methods (for example, learning of Bayesian belief networks). One of my favorite areas for new thesis topics is long-term learning of new representations. Because of the ability of neural networks to learn hidden-layer representations, as illustrated in figure 3, they form one good starting point for new research on long-term learning of representations.

A Self-Customizing News Reader

A third niche for machine learning is computer software that automatically customizes to its users after it has been fielded. One prototypical example is the NEWS WEEDER system (Lang 1995), an interface to electronic news groups that learns the reading interests of its users. Here, we briefly summarize NEWS WEEDER and its learning method.

The purpose of NEWS WEEDER is to learn to filter the flood of online news articles on behalf of its user. Each time the user reads an article using NEWS WEEDER, he/she can assign it a rating from 1 (very interesting) to 5 (uninteresting) to indicate interest in the article. NEWS WEEDER uses these training examples to learn a general model of its user's interests. It can then use this learned interest model to automatically examine new articles, or articles from other news groups, to assemble a top 20 list for the user.


Given that there are thousands of news groups available online and that most users read only a few, the potential benefit is that the agent can perform the task of sorting through the vast bulk of uninteresting articles to bring the few interesting ones to the attention of its user.

The key machine-learning question underlying NEWS WEEDER is, How can we devise algorithms capable of learning a general model of user interests from a collection of specific text documents? NEWS WEEDER takes an approach that is based both on earlier work in information retrieval and on Bayesian learning methods. First, each text article is re-represented by a very long feature vector, with one feature for each possible word that can occur in the text. Given that there are approximately 50,000 words in standard English, this feature vector contains approximately 50,000 features. The value for each feature (for example, the feature corresponding to the word windsurf) is the number of times the word occurs in the document. This feature-vector representation, based on earlier work in information retrieval, is sometimes called a bag-of-words representation for the document because it omits information about word sequence, retaining only information about word frequency.
Once the document is represented by such a feature vector, NEWS WEEDER applies a variant of a naive Bayes classification learning method to learn to classify documents according to their interestingness rating. In the naive Bayes approach, the system assigns a rating to each new article based on several probability terms that it estimates from the training data, as described in the following paragraph.

During training, the naive Bayes classifier determines several probabilities that are used to classify new articles. First, it determines the prior probability P(Ci) for each possible rating Ci by calculating the frequency with which this rating occurs over the training data. For example, if the user assigned a top rating of 1 to just 9 of the 100 articles he/she read, then P(Rating = 1) = 0.09. Second, it estimates the conditional probability P(wj | Ci) of encountering each possible word wj given that the document has a true rating of Ci. For example, if the word windsurf occurs 11 times among the 10,000 words appearing in articles rated 1 by the user, then P(windsurf | Rating = 1) = .0011. A new text document d is then assigned a rating by choosing the rating Ci that maximizes

    P(Ci) ∏_{wj ∈ d} P(wj | Ci)

where ∏_{wj ∈ d} denotes the product of the various P(wj | Ci) over all words wj found in the document d. As we might expect, the higher the conditional probabilities P(wj | Ci) for the words wj that are actually encountered in d, the greater the chance that Ci is the correct rating for d.
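A minimal sketch of this basic naive Bayes scheme follows. Two standard practical refinements appear here that the text above does not spell out: add-one smoothing, so that a word unseen for some rating does not force a zero probability, and summing log probabilities rather than multiplying raw probabilities, to avoid numeric underflow on long documents.

    import math
    from collections import Counter, defaultdict

    def train(documents):
        # documents: list of (list_of_words, rating) pairs.
        rating_counts = Counter(rating for _, rating in documents)
        word_counts = defaultdict(Counter)  # word_counts[rating][word]
        total_words = Counter()             # number of words seen per rating
        vocab = set()
        for words, rating in documents:
            word_counts[rating].update(words)
            total_words[rating] += len(words)
            vocab.update(words)
        priors = {r: n / len(documents) for r, n in rating_counts.items()}  # P(Ci)
        return priors, word_counts, total_words, len(vocab)

    def classify(words, priors, word_counts, total_words, vocab_size):
        # Choose the rating Ci maximizing P(Ci) * product over words of P(wj | Ci).
        def log_score(r):
            score = math.log(priors[r])
            for w in words:
                # Add-one smoothed estimate of P(wj | Ci).
                p = (word_counts[r][w] + 1) / (total_words[r] + vocab_size)
                score += math.log(p)  # log of the running product
            return score
        return max(priors, key=log_score)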
In fact, NEWS WEEDER's learning method uses a variant of this algorithm, described in Lang (1995). One difference is that the document length is taken into account in estimating the probability of word occurrence. A second is that the system combines the strength of its predictions for each rating level, using a linear combination of these predictions to produce its final rating.

How well does NEWS WEEDER learn the user's interests? In one set of experiments, the system was trained for one user using a set of approximately 1900 articles. For each article the user read, a rating of 1 to 5 was assigned. Articles that the user chose not to read were labeled skip. A subset of approximately 1300 of these articles was used to train the system, and its performance was then evaluated over the remaining articles. NEWS WEEDER's predicted ratings are compared to the actual user ratings in table 1. Note that numbers along the diagonal of this confusion matrix indicate cases where the predicted and true ratings were identical, whereas off-diagonal entries indicate differences between the two. As is apparent from table 1, NEWS WEEDER does a fairly good job of predicting user interest in subsequent documents. Note that in many cases, its errors represent small differences between the predicted and actual ratings (for example, many of its errors when predicting a rating of 4 were articles that were actually rated 3 by the user).
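For readers who want to tabulate results the same way, a confusion matrix such as table 1 can be assembled from (true, predicted) rating pairs. This small helper is a generic sketch, not part of NEWS WEEDER:

    from collections import Counter

    def confusion_matrix(pairs, labels=("1", "2", "3", "4", "5", "skip")):
        # pairs: iterable of (true_rating, predicted_rating) tuples.
        counts = Counter(pairs)
        return {t: {p: counts[(t, p)] for p in labels} for t in labels}

    # Diagonal entries (true == predicted) are correct predictions; entries just
    # off the diagonal, such as true 3 predicted 4, are near-miss errors.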
NEWS WEEDER illustrates the potential role of machine learning in automatically customizing software to users' needs. It also illustrates how machine learning can be applied to the growing volume of text data found on electronic news feeds, the World Wide Web, and other online sources. Because the initial experiments with NEWS WEEDER were promising, a second-generation system called WISE WIRE has now been developed at WiseWire Corporation. This system is available free of charge at www.wisewire.com.

The current state of the art for learning to classify text documents is fairly well represented by the naive Bayes text classifier described here. An implementation of this algorithm for text classification can be obtained at www.cs.cmu.edu/~tom/ml-examples.html. Machine learning over text documents is an active area of current research. One interesting research topic involves going beyond the bag-of-words approach, representing the text by more linguistic features such as noun phrases and partial interpretations of the meaning of the text.


In addition to classifying text documents, other important text-learning problems include learning to extract information from text (for example, learning to extract the title, speaker, room, and subject, given online seminar announcements). Given the rapidly growing importance of the World Wide Web and other online text such as news feeds, it seems likely that machine learning over text will become a major research thrust over the coming years.

Where Next?

As the previous three examples illustrate, machine learning is already beginning to have a significant impact in several niches within computer science. My personal belief is that the impact of machine learning is certain to grow over the coming years, as more and more data come online, as we develop more effective algorithms and underlying theories for machine learning, as the futility of hand coding increasingly complex systems becomes apparent, and as human organizations themselves learn the value of capturing and using historical data.

Put in a long-term historical perspective, computer science is a young field still in its first 50 years. Perhaps we will look back after the first hundred years to see that the evolution of computer science followed a predictable course: During its first few decades, computer science focused on fairly simple algorithms and computer programs and developed a basic understanding of data structures and computation. During the next several decades, it struggled with the problem of how to construct increasingly complex software systems but continued to follow its early approach of manually hand crafting each line of code. During the second half of its first century (hopefully!), a shift is seen to take place in which software developers eventually abandon the goal of manually hand crafting increasingly complex software systems. Instead, they shift to a software development methodology of manually designing the global structure of the system but using machine-learning methods to automatically train individual software components that cannot be found in a software library and to optimize global performance. Of course, we will have to wait to find out how the second 50 years pans out, but at least for the three niches discussed here, it is already the case that machine learning competes favorably with manually hand-crafted code.

As for us researchers, many exciting opportunities are available right now, regardless of how the next 50 years pan out. In addition to some of the specific research directions noted earlier, I believe there are several areas where significant breakthroughs are possible. Below are some of my speculations on which research areas might produce dramatic improvements in the state of the art of machine learning.

Incorporation of prior knowledge with training data: One striking feature of the learning methods described earlier is that they are all tabula rasa methods; that is, they assume the learner has no initial knowledge except for the representation of hypotheses and that it must learn solely from large quantities of training data. It is well understood that prior knowledge can reduce sample complexity and improve learning accuracy. In fact, machine-learning algorithms such as explanation-based learning (DeJong 1997) and, more recently, Bayesian belief networks provide mechanisms for combining prior knowledge with training data to learn more effectively. Nevertheless, it remains a fact that almost all the successful uses of machine learning to date rely only on tabula rasa approaches. More research is needed to develop effective methods of combining prior knowledge with data to improve the effectiveness of learning.

Lifelong learning: Most current learning systems address just one task. However, humans learn many tasks over a long lifetime and seem to use the accumulating knowledge to guide subsequent learning. One interesting direction in current research is to develop long-life agents that accumulate knowledge and reuse it to guide learning in a similar fashion. These agents might learn new representations, as well as facts, so that they can more effectively represent and reason about their environment as they age. Several researchers are already examining this topic in the context of robot learning and autonomous software agents.

Machine learning embedded in programming languages: If one assumes that machine learning will play a useful and routine role within certain types of computer software, then it is interesting to ask how future software developers will incorporate learning into their software. Perhaps it is time for the development of new programming languages or programming environments that allow developers to incorporate machine learning into their software as easily as current environments allow incorporating abstract data types and other standard software constructs.

Machine learning for natural language: One fact that seems certain to influence machine learning is that the majority of the world's online data is now text.

Given approximately 200,000,000 pages on the World Wide Web and approximately 1,000,000,000 hyperlinks, it might be time to consider machine learning for natural language problems. Consider that each of these billion hyperlinks is a kind of semisupervised training example that indicates the meaning of the words underlined by the hyperlink. For example, the meaning of a hyperlink such as my favorite vacation spot is indicated by the web page to which it points. Given that there are only 50,000 words in everyday English, the 1,000,000,000 hyperlinks produce, on average, 20,000 training examples of each word. If we could invent learning algorithms that use this kind of training data, we might be able to make significant new progress on natural language processing.

Acknowledgments

The research described here is the result of the efforts of a number of people and organizations. Schenley Park Research, Inc., provided the medical data-mining example. Jeff Shufelt, Matt Glickman, and Joseph O'Sullivan helped create the code and collect the data for the face-recognition example. This work has been supported in part by the Defense Advanced Research Projects Agency under research grant F33615-93-1-1330.

Parts of this article are excerpted from the textbook Machine Learning (Mitchell 1997).

References

Aha, D.; Kibler, D.; and Albert, M. 1991. Instance-Based Learning Algorithms. In Machine Learning 6, 37-66. New York: Kluwer Academic.

Barto, A.; Bradtke, S.; and Singh, S. 1995. Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence (Special Issue on Computational Research on Interaction and Agency) 72(1): 81-138.

Bishop, C. M. 1996. Neural Networks for Pattern Recognition. Oxford, U.K.: Oxford University Press.

Caruana, R.; Baluja, S.; and Mitchell, T. 1995. Using the Future to Sort Out the Present: RANKPROP and Multitask Learning for Medical Risk Analysis. In Neural Information Processing 7, eds. G. Tesauro, D. S. Touretzky, and T. K. Leen. Cambridge, Mass.: MIT Press.

Chauvin, Y., and Rumelhart, D. 1995. Backpropagation: Theory, Architectures, and Applications. Hillsdale, N.J.: Lawrence Erlbaum.

Clark, P., and Niblett, T. 1989. The CN2 Induction Algorithm. In Machine Learning 3, 261-284. New York: Kluwer Academic.

DeJong, G. 1997. Explanation-Based Learning. In The Computer Science and Engineering Handbook, ed. A. Tucker, 499-520. Boca Raton, Fla.: CRC Press.

Forrest, S. 1993. Genetic Algorithms: Principles of Natural Selection Applied to Computation. Science 261: 872-878.

Hecht-Nielsen, R. 1990. Neurocomputing. Reading, Mass.: Addison Wesley.

Joachims, T.; Freitag, D.; and Mitchell, T. 1997. WEBWATCHER: A Tour Guide for the World Wide Web. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence. Forthcoming.

Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996. Reinforcement Learning: A Survey. Journal of AI Research 4: 237-285. Available at www.cs.washington.edu/research/jair/home.html.

Kearns, M. J., and Vazirani, U. V. 1994. An Introduction to Computational Learning Theory. Cambridge, Mass.: MIT Press.

Lang, K. 1995. NEWS WEEDER: Learning to Filter NETNEWS. In Proceedings of the Twelfth International Conference on Machine Learning, eds. A. Prieditis and S. Russell, 331-339. San Francisco, Calif.: Morgan Kaufmann.

Lavrac, N., and Dzeroski, S. 1994. Inductive Logic Programming: Techniques and Applications. New York: Ellis Horwood.

Mitchell, T. M. 1997. Machine Learning. New York: McGraw Hill.

Mitchell, M. 1996. An Introduction to Genetic Algorithms. Cambridge, Mass.: MIT Press.

Mitchell, T. M.; Caruana, R.; Freitag, D.; McDermott, J.; and Zabowski, D. 1994. Experience with a Learning Personal Assistant. Communications of the ACM 37(7): 81-91.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. San Francisco, Calif.: Morgan Kaufmann.

Quinlan, J. R. 1990. Learning Logical Definitions from Relations. In Machine Learning 5, 239-266. New York: Kluwer Academic.

Rowley, H.; Baluja, S.; and Kanade, T. 1996. Human Face Detection in Visual Scenes. In Advances in Neural Information Processing Systems 8, eds. D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo. Cambridge, Mass.: MIT Press.

Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Exploration in the Microstructure of Cognition, Volumes 1 and 2. Cambridge, Mass.: MIT Press.

Tom Mitchell is currently professor of computer science and robotics at Carnegie Mellon University (CMU) and director of the CMU Center for Automated Learning and Discovery. He recently completed a comprehensive textbook entitled Machine Learning (McGraw Hill, 1997). Mitchell's current research involves applying machine learning to the vast store of undisciplined information found on the World Wide Web and other internet sources. His e-mail address is tom.mitchell@cmu.edu.