

Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
by Ian H. Witten and Eibe Frank
Morgan Kaufmann Publishers, 2000
416 pages, Paper, $49.95
ISBN 1-55860-552-5

Review by:
James Geller, New Jersey Institute of Technology
CS Department, 323 Dr. King Blvd., Newark, NJ 07102
geller@oak.njit.edu
http://web.njit.edu/~geller/
Story of the book

Witten and Frank's textbook was one of two books that I used for a data mining class in the Fall of 2001. The book covers all major methods of data mining that produce a knowledge representation as output. Knowledge representation is hereby understood as a representation that can be studied, understood, and interpreted by human beings, at least in principle. Thus, neural networks and genetic algorithms are excluded from the topics of this textbook. We need to say "can be understood in principle" because a large decision tree or a large rule set may be as hard to interpret as a neural network.
The book first develops the basic machine learning and data mining methods. These include decision trees, classification and association rules, support vector machines, instance-based learning, Naive Bayes classifiers, clustering, and numeric prediction based on linear regression, regression trees, and model trees. It then goes deeper into evaluation and implementation issues. Next it moves on to deeper coverage of issues such as attribute selection, discretization, data cleansing, and combinations of multiple models (bagging, boosting, and stacking). The final chapter deals with advanced topics such as visual machine learning, text mining, and Web mining.


A walk through the contents

The greatest strength of this Data Mining book lies outside of the book itself. All the algorithms described in this book are implemented and freely available through the WEKA (Waikato Environment for Knowledge Analysis) Web site (www.cs.waikato.ac.nz/ml/weka). Chapter 8 of the book is a tutorial to the implemented algorithms. The integration between the book and the Web site is excellent, and the Web site is alive, thriving and growing. Thus, the number of data mining algorithms available on the Web site goes far beyond what is described in the book. Indeed, even Neural Networks have been added to the Web site since the book was first published. While many books offer an associated Web site by now, the close linkage between book and Web site and the rapid growth of the Web site are highly commendable.
Another pleasant feature of the WEKA implementation is that it is done in Java. This makes it possible to construct systems, based on Java, that capitalize on the other strengths of Java, such as access to relational databases through JDBC and easy access to Web pages from within Java programs.

Target audience

The book is written for academics and practitioners and I believe it can be well understood, even by undergraduate students. In fact, it is probably the most accessible survey of data mining in print, without sacrificing too much of precision and rigor. The book is written in a highly redundant style, which I would like to describe as an exercise in iterative deepening. Basic concepts are repeated in several chapters, but covered to a deeper level in the later chapters. This should make it easy for students to keep reading it, without having to refer back to earlier chapters at every step of the way. On the other hand, for a person that is already familiar with the basics of data mining, this makes boring reading at some places. However, I do not recommend a streamlining of the book. Instead, I recommend that readers with some knowledge of the topic may skip paragraphs that sound familiar without any guilty feelings.

SIGMOD Record, Vol. 31, No. 1, March 2002

The book goes to great lengths to avoid "formula shock". Formulas are developed step-by-step and well explained. Only absolutely necessary formulas are included. In many cases, where the derivation of a complex result is irrelevant to the actual data mining issues, the authors defer to statistics textbooks. While I am greatly in favor of both these approaches in writing textbooks, I feel that they have gone too far at a few places. At a number of places, the authors avoid introducing "one more letter" to keep the text readable. However, the price they pay for that is that many of their formulas have no equal signs. Thus, a sentence is terminated with a colon and followed with a formula, which is presumably equal to the quantity described by the sentence. This is done on many pages, e.g., 132-135, 137, 196, 207, 222, etc. Not in my wildest dreams would I have thought that I could ever criticize a book author for having too few formulas and too few variables. But this is exactly what I need to do here. While I do not recommend eliminating the previously mentioned redundancy of description, I do recommend for the next edition (which this book will undoubtedly have) to strengthen the formulas, without necessarily adding new ones.

At a few places, the book could also be improved by adding more explanations to figures. Figure 3.6 is a prime example for this issue. I found myself spending time verifying that instance counts in two subfigures truly add to the same total (of 209). They do. The reader could be spared this effort by a better caption or a better description in the body of the text. Similarly, the Apriori algorithm is introduced in a figure, but only in the "Further Reading" subsection (following much later) is the name of the algorithm mentioned. A better figure caption would help the scholarly advancement of students who might not take the "Further Reading" section that seriously.

Reviewer's appreciation

In America we say "Actions speak louder than words". Thus, instead of summarizing the book I will describe some actions that I intend to take (or that I am already taking).
(1) I am using WEKA for my research.
(2) If I teach the same course again, I will use Witten and Frank's book again.
(3) If the book appears in a second edition, I will acquire it.
