Machine Learning and Data Mining
By Igor Kononenko and Matjaz Kukar
About this ebook
- Provides an introduction to the main issues associated with the basics of machine learning and the algorithms used in data mining
- A valuable addition to the libraries and bookshelves of companies using the principles of data mining (or KDD) to effectively deliver solid business and industry solutions
Igor Kononenko
Igor Kononenko studied computer science at the University of Ljubljana, Slovenia, receiving his BSc in 1982, MSc in 1985 and PhD in 1990. He is now professor at the Faculty of Computer and Information Science there, teaching courses in Programming Languages, Algorithms and Data Structures; Introduction to Algorithms and Data Structures; Knowledge Engineering, Machine Learning and Knowledge Discovery in Databases. He is the head of the Laboratory for Cognitive Modelling and a member of the Artificial Intelligence Department at the same faculty. His research interests include artificial intelligence, machine learning, neural networks and cognitive modelling. He is the (co-)author of 170 scientific papers in these fields and 10 textbooks. Professor Kononenko is a member of the editorial board of Applied Intelligence and Informatica journals and was also twice chair of the programme committee of the International Cognitive Conference in Ljubljana.
Machine Learning and Data Mining
Introduction to Principles and Algorithms
First Edition
Igor Kononenko
Professor of Computer and Information Science, University of Ljubljana, Slovenia
Matjaž Kukar
Assistant Professor of Computer and Information Science, University of Ljubljana, Slovenia
Horwood Publishing
Chichester, UK
Table of Contents
Cover image
Title page
Copyright page
Foreword
Preface
Acknowledgements
Chapter 1: Introduction
1.1 THE NAME OF THE GAME
1.2 OVERVIEW OF MACHINE LEARNING METHODS
1.3 HISTORY OF MACHINE LEARNING
1.4 SOME EARLY SUCCESSES
1.5 APPLICATIONS OF MACHINE LEARNING
1.6 DATA MINING TOOLS AND STANDARDS
1.7 SUMMARY AND FURTHER READING
Chapter 2: Learning and Intelligence
2.1 WHAT IS LEARNING
2.2 NATURAL LEARNING
2.3 LEARNING, INTELLIGENCE, CONSCIOUSNESS
2.4 WHY MACHINE LEARNING
2.5 SUMMARY AND FURTHER READING
Chapter 3: Machine Learning Basics
3.1 BASIC PRINCIPLES
3.2 MEASURES FOR PERFORMANCE EVALUATION
3.3 ESTIMATING PERFORMANCE
3.4 *COMPARING PERFORMANCE OF MACHINE LEARNING ALGORITHMS
3.5 COMBINING SEVERAL MACHINE LEARNING ALGORITHMS
3.6 SUMMARY AND FURTHER READING
Chapter 4: Knowledge Representation
4.1 PROPOSITIONAL CALCULUS
4.2 *FIRST ORDER PREDICATE CALCULUS
4.3 DISCRIMINANT AND REGRESSION FUNCTIONS
4.4 PROBABILITY DISTRIBUTIONS
4.5 SUMMARY AND FURTHER READING
Chapter 5: Learning as Search
5.1 EXHAUSTIVE SEARCH
5.2 BOUNDED EXHAUSTIVE SEARCH (BRANCH AND BOUND)
5.3 BEST-FIRST SEARCH
5.4 GREEDY SEARCH
5.5 BEAM SEARCH
5.6 LOCAL OPTIMIZATION
5.7 GRADIENT SEARCH
5.8 SIMULATED ANNEALING
5.9 GENETIC ALGORITHMS
5.10 SUMMARY AND FURTHER READING
Chapter 6: Measures for Evaluating the Quality of Attributes
6.1 MEASURES FOR CLASSIFICATION AND RELATIONAL PROBLEMS
6.2 MEASURES FOR REGRESSION
6.3 **FORMAL DERIVATIONS AND PROOFS
6.4 SUMMARY AND FURTHER READING
Chapter 7: Data Preprocessing
7.1 REPRESENTATION OF COMPLEX STRUCTURES
7.2 DISCRETIZATION OF CONTINUOUS ATTRIBUTES
7.3 ATTRIBUTE BINARIZATION
7.4 TRANSFORMING DISCRETE ATTRIBUTES INTO CONTINUOUS
7.5 DEALING WITH MISSING VALUES
7.6 VISUALIZATION
7.7 DIMENSIONALITY REDUCTION
7.8 **FORMAL DERIVATIONS AND PROOFS
7.9 SUMMARY AND FURTHER READING
Chapter 8: *Constructive Induction
8.1 DEPENDENCE OF ATTRIBUTES
8.2 CONSTRUCTIVE INDUCTION WITH PRE-DEFINED OPERATORS
8.3 CONSTRUCTIVE INDUCTION WITHOUT PRE-DEFINED OPERATORS
8.4 SUMMARY AND FURTHER READING
Chapter 9: Symbolic Learning
9.1 LEARNING OF DECISION TREES
9.2 LEARNING OF DECISION RULES
9.3 LEARNING OF ASSOCIATION RULES
9.4 LEARNING OF REGRESSION TREES
9.5 *INDUCTIVE LOGIC PROGRAMMING
9.6 NAIVE AND SEMI-NAIVE BAYESIAN CLASSIFIER
9.7 BAYESIAN BELIEF NETWORKS
9.8 SUMMARY AND FURTHER READING
Chapter 10: Statistical Learning
10.1 NEAREST NEIGHBORS
10.2 DISCRIMINANT ANALYSIS
10.3 LINEAR REGRESSION
10.4 LOGISTIC REGRESSION
10.5 *SUPPORT VECTOR MACHINES
10.6 SUMMARY AND FURTHER READING
Chapter 11: Artificial Neural Networks
11.1 INTRODUCTION
11.2 TYPES OF ARTIFICIAL NEURAL NETWORKS
11.3 *HOPFIELD’S NEURAL NETWORK
11.4 *BAYESIAN NEURAL NETWORK
11.5 PERCEPTRON
11.6 RADIAL BASIS FUNCTION NETWORKS
11.7 **FORMAL DERIVATIONS AND PROOFS
11.8 SUMMARY AND FURTHER READING
Chapter 12: Cluster Analysis
12.1 INTRODUCTION
12.2 MEASURES OF DISSIMILARITY
12.3 HIERARCHICAL CLUSTERING
12.4 PARTITIONAL CLUSTERING
12.5 MODEL-BASED CLUSTERING
12.6 OTHER CLUSTERING METHODS
12.7 SUMMARY AND FURTHER READING
Chapter 13: **Learning Theory
13.1 COMPUTABILITY THEORY AND RECURSIVE FUNCTIONS
13.2 FORMAL LEARNING THEORY
13.3 PROPERTIES OF LEARNING FUNCTIONS
13.4 PROPERTIES OF INPUT DATA
13.5 CONVERGENCE CRITERIA
13.6 IMPLICATIONS FOR MACHINE LEARNING
13.7 SUMMARY AND FURTHER READING
Chapter 14: **Computational Learning Theory
14.1 INTRODUCTION
14.2 GENERAL FRAMEWORK FOR CONCEPT LEARNING
14.3 PAC LEARNING MODEL
14.4 VAPNIK-CHERVONENKIS DIMENSION
14.5 LEARNING IN THE PRESENCE OF NOISE
14.6 EXACT AND MISTAKE BOUNDED LEARNING MODELS
14.7 INHERENT UNPREDICTABILITY AND PAC-REDUCTIONS
14.8 WEAK AND STRONG LEARNING
14.9 SUMMARY AND FURTHER READING
Appendix A: *Definitions of some lesser known terms
A.1 COMPUTATIONAL COMPLEXITY CLASSES
A.2 ASYMPTOTIC NOTATION
A.3 SOME BOUNDS FOR PROBABILISTIC ANALYSIS
A.4 COVARIANCE MATRIX
References
Index
Copyright
HORWOOD PUBLISHING LIMITED
International Publishers in Science and Technology
Coll House, Westergate, Chichester, West Sussex, PO20 3QL, UK
First published in 2007
COPYRIGHT NOTICE
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the permission of Horwood Publishing, Coll House, Westergate, Chichester, West Sussex, PO20 3QL, UK.
© Horwood Publishing Limited, 2007.
British Library Cataloguing in Publication Data
A catalogue record of this book is available from the British Library
ISBN-10: 1-904275-21-4
ISBN-13: 978-1-904275-21-3
Cover design by Jim Wilkie.
Printed and bound in the UK by Antony Rowe Limited.
Foreword
Ivan Bratko
In the past two decades, Machine Learning (ML) has made, among the many areas of Artificial Intelligence (AI), probably the most significant progress, and has grown into a field of its own. In this period, many of its methods found their way into practical applications, and in many application areas are now routinely used. The wide spread and visibility of ML was accelerated by powerful new techniques, and also by the appearance of effective and user-friendly implementations of ML tools, including some excellent freely available platforms, such as Weka and Orange. All this has resulted in ML, together with its related areas of Data Mining, Knowledge Discovery in Databases, and Intelligent Data Analysis, becoming an effective tool for many other disciplines as diverse as engineering, medicine, biology, ecology, biochemistry and finance. In science, it is becoming part of general scientific methodology and is now used by scientists in interpreting experimental data and in formulating new scientific theories. ML is beginning to be taught outside Computer Science and is on the way to attaining a similar status as some other subjects of general importance, in the same way as it is, for example, useful to students of medicine and biology to learn the basics of statistics.
This book is written by two scientists who have made their own original contributions to the methods and applications of ML, one of the co-authors (I. K.) having been involved since its early stages. In my view, the main strength of this book is its breadth of coverage and wealth of material presented, without loss of depth. In this respect, the book is rather unique among the books on machine learning. The book covers a large majority of key methods, concepts and issues. The style of writing is terse and directly to the point.
In comparison with other books on ML, the following chapters seem to be particularly strong. Chapter 3 systematically introduces a repertoire of basic elements that are universally used as components of ML methodology. Chapter 6 gives an outstanding review of measures for attribute selection. Chapter 8 introduces constructive induction, a topic that is rarely covered in books. In Chapters 13 and 14 on the theory of ML, the book addresses deep theoretical questions of learnability.
This book will be useful reading both for beginners to the field, and even more so to people who have already done some work in ML, but would like to broaden their knowledge and are looking for reference to learn about yet unfamiliar topics.
Preface
Igor Kononenko; Matjaž Kukar, Ljubljana, December 2006
In learning, each day something is gained. In following the Tao, each day something is lost.
— Lao Tse
The book describes the basics of machine learning principles and algorithms that are used in data mining and is intended for students of computer science, researchers who need to adapt various algorithms for particular data mining tasks, and for education of sophisticated users of machine learning and data mining tools.
Machine learning has exhibited tremendous progress in the last two decades. Numerous systems for machine learning are nowadays available and the number of their applications in various fields, such as industry, medicine, economics, ecology, finance, and many others, is rapidly growing. Machine learning is used for data analysis and knowledge discovery in databases, data mining, automatic generation of knowledge bases for expert systems, game playing, text classification and text mining, automatic recognition of speech, handwriting, images, etc.
The basic principle of machine learning is modeling of available data. Results of various machine learning systems may be rules, functions, relations, equation systems, probability distributions and other knowledge representations. The obtained knowledge aims to explain the data and can be used for forecasting, diagnostics, control, validation, and simulations.
Several excellent books have already been written on machine learning and data mining, most notably (Mitchell, 1997; Witten and Frank, 2000, 2005; Hand et al., 2001; Alpaydin, 2005). The purpose of our book is to provide a thorough yet comprehensible general overview of the field. Our approach differs from other textbooks by providing several different points of view on machine learning and data mining, as reflected in the structure of the book:
• Besides defining the basic notions and overviewing the machine learning methods, Chapter 1 provides the historical overview of the field and an overview of the state-of-the-art data mining tools as well as established and emerging standards.
• Chapter 2 is the least formal and introduces the psychological and philosophical issues related to machine and human learning, intelligence, and consciousness.
• Basic principles of machine learning, described in Chapter 3, are the central issue and can be considered as the core of the book.
• The next three chapters divide the machine learning process into three basic components: the representation of knowledge (Chapter 4) defines the hypothesis space, Chapter 5 describes basic search algorithms that can be used to search the hypothesis space, and the attribute quality measures (Chapter 6) are used to guide the search.
• Chapters 7 and 8 add to the whole story formal and practical guidelines for preparing, cleansing and transforming the input data in order to get the most out of it with the available machine learning algorithms. Constructive induction, described in Chapter 8, has great potential; however, in the past few researchers or implementations considered it worthwhile. We believe that in the future, when more research is devoted to this area, the advantages of constructive induction will be revealed and implemented in useful practical tools.
• The following four chapters use the principles and notions, described in preceding chapters, as building blocks in order to describe particular learning algorithms. Four chapters divide the field into four main approaches. Symbolic learning (Chapter 9) includes all approaches that result in symbolic and transparent knowledge representations. Statistical learning (Chapter 10), on the other hand, results in knowledge representations that are typically hard to understand and interpret by the end user. Artificial neural networks (Chapter 11) are inspired by biological neural networks. With respect to transparency of knowledge representation, artificial neural networks are much closer to statistical learning than to symbolic learning. Cluster analysis deserves a separate chapter (Chapter 12) as the task of clustering algorithms differs a lot from other machine learning tasks.
• The last two chapters introduce formal approaches to machine learning. Chapter 13 describes a mathematically and philosophically inspired problem of identification in the limit. The results of this formal theory provide the ultimate answers to philosophical questions which, however, are of modest practical use. On the other hand, the computational learning theory, described in Chapter 14, aims to provide theoretical answers to questions of more practical value.
In order to improve readability, we provide all the references in separate sections on further reading, which also contain chapter summaries. Obvious exceptions that naturally include references are, besides this preface, also historical […] in the text indicates that the proof of the assertion is omitted from the text and can be found at the end of the chapter in the corresponding section on formal derivations and proofs. Obvious exceptions are the last two chapters, where proofs can be found in the literature indicated in further reading. Those sections and chapters are intended for advanced readers. In the table of contents, advanced topics are indicated with * (moderately advanced) or ** (advanced). Appendix A defines some lesser-known formal notions, which are occasionally used in the text.
Both authors have experience in teaching various parts of the material presented in this book, to undergraduate and postgraduate students. For the undergraduate course we suggest excluding sections and chapters marked with *: Section 3.4 (comparing performance of machine learning algorithms), sections related to inductive logic programming (Sections 4.2 and 9.5), Chapter 8 (constructive induction), Section 10.5 (support vector machines), Sections 11.3 and 11.4 (Hopfield’s and Bayesian neural networks), as well as all sections and chapters marked with **. Optionally, Sections 9.3 (association rules) and 9.7 (Bayesian belief networks), and/or Chapter 12 (cluster analysis) can also be kept for the postgraduate course.
The book web page can be found at the address mldmbook.fri.uni-lj.si, where all information and post-press corrections will be available. Readers' comments, suggestions and indications about errors in the text are welcome, and should be sent to either author's e-mail address: igor.kononenko@fri.uni-lj.si or matjaz.kukar@fri.uni-lj.si.
Acknowledgements
The book is the result of numerous years of active research in the field of machine learning. We are grateful to Professor Ivan Bratko for his pioneering efforts in establishing an outstanding machine learning research group in Ljubljana. We are grateful to the Faculty of Computer and Information Science for support, both during research and teaching. For long term collaboration in machine learning research we are particularly grateful to Matjaž Bevk, Zoran Bosnić, Ivan Bratko, Bojan Cestnik, Janez Demšar, Aleks Jakulin, Matevž Kovačič, Nada Lavrač, Uroš Pompe, Marko Robnik Šikonja, Luka Šajn, and Blaž Zupan. We thank Luka Šajn and Alan McConnell-Duff for proofreading the text, Jožica Robnik Šikonja for proofreading Chapter 2, Ivan Bratko and the anonymous reviewer for their comments and suggestions, and Bogdan Filipič, Marko Robnik Šikonja, and Blaž Zupan for their comments on earlier drafts of the book. Useful comments on various parts of the book were provided by several people. In particular, we thank Janez Demšar for his comments on Chapter 3, Aleks Jakulin for his comments on Chapter 8, Aristidis Likas and Eduardo Hruschka for their comments on Chapter 12, and France Dacar, Matevž Kovačič and Boštjan Vilfan for their comments on Chapter 13. The author of the spider net graphics from the book cover is a young artist and Igor’s daughter Nina Kononenko. We thank also Francesca Bonner and Ellis Horwood from Horwood Publishing for their kind assistance and patience. Finally, we are grateful to our wives and children for their loving support and patience; to them, with all our love, we dedicate this book.
Chapter 1
Introduction
The time will come when we will have to forget everything we have learnt.
— Ramana Maharshi
Machine learning, a subfield of artificial intelligence, has shown tremendous improvements in the last 20 years. This is reflected in numerous commercial and open source systems for machine learning and their applications in industry, medicine, economics, natural and technical sciences, ecology, finance, and many others. Machine learning is used for data analysis and knowledge discovery in databases (data mining), automatic generation of knowledge bases for expert systems, learning to plan, game playing, construction of numerical and qualitative models, text classification and text mining (e.g., on the world wide web), for automatic knowledge acquisition to control dynamic processes, automatic recognition of speech, handwriting, and images, etc.
The basic principle of machine learning is the automatic modeling of underlying processes that have generated the collected data. Learning from data results in rules, functions, relations, equation systems, probability distributions and other knowledge representations such as decision rules, decision and regression trees, Bayesian nets, neural nets, etc. Models explain the data and can be used for supporting decisions concerning the same underlying process (e.g., forecasting, diagnostics, control, validation, and simulations).
We start with an overview and definitions of overlapping and frequently (mis)used buzzwords and acronyms, such as machine learning (ML), data mining (DM), knowledge discovery in databases (KDD), and intelligent data analysis (IDA). We continue with a short overview of machine learning methods and historical development of different directions in machine learning research: symbolic rule learning, neural nets and reinforcement learning, numerical learning methods, and formal learning theory. We describe a few interesting early machine learning systems, review typical applications and finish with an overview of data mining standards and tools.
1.1 THE NAME OF THE GAME
Why do we not, since the phenomena are well known, build a knowledge refinery as the basis of a new industry, comparable in some ways to the industry of petroleum refining, only more important in the long run? The product to be refined is codified human knowledge.
— Donald Michie
The main problem of contemporary data analysis is the huge amount of data. In the last decade, computers have provided inexpensive and capable means to collect and store the data. The increase in data volume causes difficulties in extracting useful information. The traditional manual data analysis has become insufficient, and methods for efficient computer-based analysis are now needed. Therefore, a new interdisciplinary field has emerged, that encompasses statistical, pattern recognition, machine learning, and visualization tools to support the analysis of data and discovery of principles hidden within the data. In different communities this field is known by different names: knowledge discovery in databases (KDD), data mining (DM), or intelligent data analysis (IDA).
The terms data mining (in the database community) and knowledge discovery in databases (in machine learning and artificial intelligence communities) appeared around 1990, but the term data mining became more popular in the business community and in the press. Frequently, data mining and knowledge discovery are used interchangeably.
KDD focuses on the knowledge discovery from data (Figure 1.1), including how the data are stored and accessed; how algorithms can be scaled to massive data sets and still run efficiently; how results can be interpreted and visualized; and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any particular discipline such as machine learning. Note that the KDD process uses data mining as one of its steps.
Figure 1.1 The process of knowledge discovery in databases.
Data mining (DM) is an interactive and iterative process in which many steps need to be repeatedly refined in order to provide for an appropriate solution to the data analysis problem. Data mining standard CRISP-DM defines a process model that reflects the life cycle of a data mining project. It contains the phases (see Figure 1.2) of a data mining project and identifies relationships between them. The phases are cyclically iterated until some desired goal is reached. By comparing Figures 1.1 and 1.2 it is clear that basically they describe the same thing.
Figure 1.2 The cycle of a data mining process in the CRISP-DM standard.
The ambiguity whether KDD = DM arises from the fact that, strictly speaking, data mining (DM) covers the methods used in the data modeling step. This accounts both for visual methods as well as for methods that (semi) automatically deal with the extraction of knowledge from large masses of data, thus describing the data in terms of the interesting discovered regularities. Most automatic data mining methods have their origins in machine learning (ML). However, ML cannot be seen as a true subset of data mining, as it also encompasses other fields, not utilized for data mining (e.g. theory of learning, computational learning theory, and reinforcement learning).
A closely related field of intelligent data analysis (IDA) is similar to KDD and DM. It also refers to the interactive and iterative process of data analysis, with the distinguishing feature that the architectures, methodologies and techniques used in this process are those of artificial intelligence (AI) and machine learning. Figure 1.3 shows the relation of IDA to KDD, DM and ML in the strictest sense. In practice (and in the following text) KDD and DM are treated as synonyms.
Figure 1.3 Relation between ML, DM, KDD, and IDA in the stricter sense.
There is a large intersection between KDD/DM and IDA. Both fields investigate data analysis, and they share many common methods. The main difference is that IDA uses AI methods and tools, while KDD/DM employ both AI and non-AI methods. Therefore, ML methods are in the intersection of the two fields, whereas classical statistical methods and on-line analytical processing (OLAP) belong to KDD/DM but not to IDA. Typically, KDD and DM are concerned with the extraction of knowledge from large datasets (databases), whereas in IDA the datasets are smaller.
1.2 OVERVIEW OF MACHINE LEARNING METHODS
A learning machine, broadly defined, is any device whose actions are influenced by past experiences.
— Nils J. Nilsson
This overview is intended to give an impression of the field of machine learning. Therefore detailed explanations are omitted. The meaning of various terms mentioned in this section will become more familiar later in the book.
We differentiate machine learning methods with respect to how the obtained (induced) knowledge is used (see Figure 1.4): classification, regression, clustering, learning of associations, relations, and (differential) equations. Besides, machine learning is used in reinforcement learning within an iterative approach for optimization of value function approximation. Equation and reinforcement learning are not thoroughly discussed in this book, and are only described here for completeness.
Figure 1.4 A taxonomy of machine learning methods. The methods not thoroughly discussed in the book are grayed.
1.2.1 Classification
Machine learning methods are most frequently used for classification. Let us suppose that we have an object, described with many attributes (features, properties). Each object can be assigned to exactly one class from the finite set of possible classes. Attributes are independent observable variables, either continuous or discrete. The class is a dependent unobservable discrete variable and its value is determined from values of respective independent variables. Machine learning methods are, among others, used for creation of classifiers. The task of the classifier is to determine the class to which the object in question should be assigned.
A typical classification task is medical diagnosis: a patient is described with continuous (e.g. age, height, weight, body temperature, heart rate, blood pressure) and discrete attributes (e.g. sex, skin discoloration, location of pain). The classifier’s task is to produce a diagnosis – to label the patient with one of several possible diagnoses (e.g. healthy, influenza, pneumonia).
To determine the class, a classifier needs to describe a discrete function, a mapping from the attribute space to the class space. This function may be given in advance, or can be learnt from data. Data consists of learning examples (sometimes also called training instances) that describe solved past problems. For illustration let us again consider medical diagnostics. Here, solved past problems are medical records, including diagnoses, for all patients that had been treated in a hospital. The learning algorithm’s task is therefore to determine the mapping by learning from the set of patients with known diagnoses. This mapping (represented as a function, a rule, … ) can later be used for diagnosing new patients.
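The idea of learning a mapping from solved past problems can be illustrated with a deliberately simple sketch. It is not a method from the book: the learner below merely memorises the most frequent class for each attribute combination it has seen and falls back to the overall majority class otherwise. The attribute names, values and the function name `learn_lookup_classifier` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical learning examples: (attribute values, known diagnosis).
examples = [
    ({"pain": "chest", "fever": "high"}, "pneumonia"),
    ({"pain": "head", "fever": "high"}, "influenza"),
    ({"pain": "head", "fever": "high"}, "influenza"),
    ({"pain": "none", "fever": "none"}, "healthy"),
]


def learn_lookup_classifier(examples):
    """Learn a mapping from attribute vectors to classes by memorising the
    most frequent class for every attribute combination in the data."""
    counts = defaultdict(Counter)   # attribute combination -> class counts
    overall = Counter()             # class counts over all examples
    for attributes, label in examples:
        key = tuple(sorted(attributes.items()))
        counts[key][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]

    def classify(attributes):
        key = tuple(sorted(attributes.items()))
        if key in counts:
            return counts[key].most_common(1)[0][0]
        return default  # unseen combination: fall back to the majority class

    return classify


classify = learn_lookup_classifier(examples)
print(classify({"pain": "chest", "fever": "high"}))
print(classify({"pain": "leg", "fever": "none"}))
```

Such a table cannot generalise to unseen attribute combinations except through its default class, which is one reason the classifiers listed in the text use more compact representations of the mapping.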
Different classifiers represent mapping functions in many different ways. Most common classifiers are decision trees, decision rules, naive Bayesian classifiers, Bayesian belief networks, nearest neighbor classifiers, linear discriminant functions, logistic regression, support vector machines, and artificial neural networks.
Decision trees and rules. Algorithms for building decision trees and rules can select attributes and suitable subsets of their values according to their quality, and use them as building blocks of a conjunctive rule’s antecedent. A consequent consists of one or more classes. When a new example arrives it is classified by applying a suitable rule.
For the given attribute A and its values V1 … Vm, decision trees and rules form conjunctive terms such as A = Vj. Continuous attributes are either discretized in advance or, even better, terms such as A > x, A ≤ x, or x ≥ A > y, where x and y are real-valued thresholds, are formed during learning.
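How a term of the form A > x might be formed during learning can be sketched as follows. This is an illustration under stated assumptions, not the book's procedure: candidate thresholds are taken as midpoints between consecutive distinct attribute values, and the quality of a threshold is measured simply by the number of learning examples misclassified when each side predicts its majority class. The function name `best_threshold` and the data are hypothetical.

```python
from collections import Counter


def best_threshold(values, labels):
    """Return (x, errors) for the term `A > x` with the fewest errors,
    or None if the attribute has fewer than two distinct values."""
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for x in candidates:
        left = [c for v, c in zip(values, labels) if v <= x]
        right = [c for v, c in zip(values, labels) if v > x]
        # Each side predicts its majority class; everything else is an error.
        errors = sum(
            len(side) - Counter(side).most_common(1)[0][1]
            for side in (left, right)
        )
        if best is None or errors < best[1]:
            best = (x, errors)
    return best


# Hypothetical body temperatures and diagnoses.
temperatures = [36.5, 36.8, 37.0, 38.2, 38.9, 39.4]
diagnoses = ["healthy", "healthy", "healthy",
             "influenza", "influenza", "influenza"]
print(best_threshold(temperatures, diagnoses))
```

In practice the error count would typically be replaced by one of the attribute quality measures discussed in Chapter 6.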
Bayesian classifier. Given the attribute values of a new example, the task of the Bayesian classifier is to calculate conditional probabilities for all possible classes. The exact Bayesian classifier minimizes the expected error rate and is therefore optimal in this sense. As exact Bayesian classifiers are hard to come by (unless the learning data cover the complete attribute space and there is no noise), certain approximations and assumptions have to be used when calculating conditional probabilities.
The naive Bayesian classifier assumes conditional independence of attributes given the class. This (often very strong) assumption usually ensures reliable conditional probability estimations needed for classification, even from very small datasets. Implementations of the naive Bayesian classifier often assume that only discrete attributes are used; therefore continuous attributes need to be discretized in advance.
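A minimal sketch of a naive Bayesian classifier for discrete attributes is given below. Under the independence assumption, the probability of a class is proportional to its prior probability multiplied by the conditional probabilities of the individual attribute values. The add-one (Laplace) smoothing of the estimates, the function names, and the toy data are assumptions of this example rather than details taken from the text.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Count classes and attribute values per class from discrete learning examples."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value -> count
    values_seen = defaultdict(set)       # attribute index -> set of observed values
    for x, c in zip(examples, labels):
        for i, v in enumerate(x):
            value_counts[(i, c)][v] += 1
            values_seen[i].add(v)
    return class_counts, value_counts, values_seen

def predict_proba(model, x):
    """Return class probabilities for x, assuming independence of attributes given the class."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / total  # prior probability of the class
        for i, v in enumerate(x):
            # Add-one smoothing (an assumption) avoids zero probabilities.
            p *= (value_counts[(i, c)][v] + 1) / (n_c + len(values_seen[i]))
        scores[c] = p
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}

# Hypothetical records: (fever, temperature) with a known diagnosis.
examples = [("yes", "high"), ("yes", "normal"), ("no", "normal"),
            ("no", "normal"), ("yes", "high"), ("no", "high")]
labels = ["flu", "flu", "healthy", "healthy", "flu", "healthy"]
model = train_naive_bayes(examples, labels)
probs = predict_proba(model, ("yes", "high"))
print(probs)
```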
Bayesian belief networks are a generalization of the naive Bayesian classifier. They use acyclic graphs for modeling dependencies between attributes and the class. Conditional independence is implicitly assumed between nodes (attributes) that are not directly connected. A Bayesian network’s topology can be given in advance (as a part of background knowledge for the problem), or can be learnt from data. Conditional dependencies are also learnt from data. Acyclic graph topology ensures efficient calculation of conditional probabilities for all classes. Bayesian belief networks can also be used for modeling probability distributions.
Nearest neighbors classifiers (instance-based learning). In its simplest form, the k-nearest neighbors classifier (k-NN) generates its knowledge base by storing all learning data. When a new example is presented for classification, the stored data are used to find a certain number (k) of most similar learning examples (nearest neighbors), according to some distance metric. The new example is labelled with the class that is prevalent among its nearest neighbors. The number k is typically a small odd number (e.g. five or seven).
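The procedure can be sketched in a few lines. The example below assumes continuous attributes and the Euclidean distance as the metric; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, new_example, k=3):
    """Label new_example with the class prevalent among its k nearest neighbours."""
    # "Learning" is just storing the data; all the work happens at classification time.
    neighbours = sorted(train, key=lambda ex: math.dist(ex[0], new_example))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical learning examples: (attribute vector, class).
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]
print(knn_classify(train, (1.1, 1.0), k=3))
print(knn_classify(train, (4.9, 5.0), k=3))
```

An odd k avoids tied votes in two-class problems, which is one reason small odd values are commonly chosen.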
Discriminant functions. The task of the learning algorithm is to calculate coefficients for a discriminant function whose structure is fixed in advance. A discriminant function is actually a hypersurface, dichotomizing between two classes in the attribute space. As a hypersurface can be defined only in a continuous hyperspace, all attributes need to be continuous. If we have more than two classes, a separate hypersurface is necessary for each pair of classes. Discriminant functions can be linear, quadratic, polynomial, etc. In case of a linear discriminant function the corresponding hypersurface is a hyperplane dichotomizing between two classes. The hyperplane’s (hypersurface’s) coefficients are determined so as to minimize the classification error rate.
Frequently, Fisher’s linear discriminant function is used. It assumes normal distribution of learning examples within each class. It maximizes the (Euclidean) distance between averaged examples from both classes. It accounts for potentially different class variances and finds an optimal classification boundary between classes.
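To make the idea concrete, the sketch below computes a Fisher-style discriminant for two classes described with two continuous attributes. It uses the commonly stated form in which the direction of the hyperplane's normal is the inverse of the pooled within-class scatter matrix multiplied by the difference of the class means. Placing the threshold at the projected midpoint of the two means is a simplifying assumption (it corresponds to equal class priors), and all names and data are hypothetical.

```python
def mean(points):
    """Mean vector of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    """2x2 scatter matrix: sum of (p - m)(p - m)^T over the points of one class."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_discriminant(class_a, class_b):
    """Return weights w and threshold t; a point p is assigned to class a if w.p > t."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]  # pooled scatter
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    midpoint = [(ma[0] + mb[0]) / 2, (ma[1] + mb[1]) / 2]
    t = w[0] * midpoint[0] + w[1] * midpoint[1]  # assumed threshold: projected midpoint
    return w, t

def classify(w, t, p):
    return "a" if w[0] * p[0] + w[1] * p[1] > t else "b"

class_a = [(1, 2), (2, 3), (3, 3), (2, 1)]
class_b = [(6, 5), (7, 8), (8, 6), (7, 7)]
w, t = fisher_discriminant(class_a, class_b)
print(classify(w, t, (2, 2)), classify(w, t, (7, 6)))
```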
Support Vector Machines (SVMs). SVMs were developed in the 1990s and are among the most accurate approaches to discriminant function classification. Due to their importance we discuss them separately. SVM is a multi-criterion optimization method:
• It maximizes the distance between support vectors and a dichotomizing hyperplane. Support vectors are learning examples that lie closest to the hyperplane and therefore define its margin. The method thus maximizes the margin as well.
• Since many classification problems are not linearly separable, SVMs implicitly transform the original attribute set to the much larger (possibly even infinite) attribute set by using kernel functions. The success depends mostly on a suitably chosen kernel function.
• Since every classification problem can be exactly solved by using arbitrarily complex discriminant functions, SVMs attempt to minimize the size of solution (the sum of attributes’ weights).
• Although the transformation of attributes is nonlinear, some examples still get misclassified. SVMs therefore tend to minimize the number of misclassifications.
Completely satisfying all the above criteria is usually impossible, so a compromise needs to be found. Although SVMs use several optimization algorithms to optimize all criteria simultaneously, one still needs to appropriately weight the above criteria in order to obtain the best possible classification.
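One widely used way of writing this compromise is the so-called soft-margin formulation, shown here as an illustration rather than as the exact formulation intended above. The symbols are assumptions of this sketch: w and b are the weights and offset of the hyperplane, φ is the attribute transformation implied by the kernel, y_i ∈ {−1, +1} is the class of the i-th learning example, ξ_i measures by how much that example violates the margin, and C is the user-chosen weight of the violations.

```latex
\begin{aligned}
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \quad
  & \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad
  & y_i \left( \mathbf{w}^{\top} \phi(\mathbf{x}_i) + b \right) \ge 1 - \xi_i, \\
  & \xi_i \ge 0, \qquad i = 1, \dots, n .
\end{aligned}
```

In this form the distance between the hyperplane and the closest examples is 1/‖w‖, so minimizing ‖w‖² both keeps the solution small and maximizes the margin, while the sum of ξ_i penalizes margin violations, including misclassified examples. A large C favours few violations on the learning set; a small C favours a wider margin.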
Artificial neural networks. Algorithms for artificial neural networks (ANNs) mimic biological neural networks by abstracting neural functions to a simple element (neuron), only able to sum its inputs and normalize its output. Neurons are interconnected into arbitrarily complex artificial neural networks. For classification, most frequently feedforward multi-layered neural networks are used. Neurons are organized in a layered cascade: an input layer (corresponding to attributes), one or more hidden layers, and an output layer (corresponding to classes). The task of the learning algorithm is to determine weights on connections between neurons (used for calculating weighted sums in each neuron) in order to minimize the classification error rate.
For classification of a new example its attribute values are presented to the ANN’s input neurons. These values are weighted according to the connections between neurons, and their weighted sums are calculated in each neuron in the subsequent layer of neurons. Normalized results in the output neurons determine the classification outcome (class).
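The classification step of a feedforward network can be sketched as follows. The logistic (sigmoid) function used for normalization, the bias terms, and the hand-set weights are assumptions made for this illustration; in practice the weights would be determined by the learning algorithm.

```python
import math

def sigmoid(z):
    """Normalize a weighted sum to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Each neuron computes a weighted sum of its inputs and normalizes it."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def classify(x, network, classes):
    """Propagate attribute values through all layers and pick the strongest output."""
    activations = list(x)
    for weights, biases in network:
        activations = layer(activations, weights, biases)
    best = max(range(len(activations)), key=activations.__getitem__)
    return classes[best], activations

# Hypothetical network: 2 input attributes, 2 hidden neurons, 2 output neurons.
network = [
    ([[4.0, -4.0], [-4.0, 4.0]], [0.0, 0.0]),  # hidden layer (weights, biases)
    ([[5.0, -5.0], [-5.0, 5.0]], [0.0, 0.0]),  # output layer (weights, biases)
]
label, outputs = classify((1.0, 0.0), network, ["class 1", "class 2"])
print(label, outputs)
```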
Hybrid algorithms. Hybrid algorithms or their parts combine several different approaches by using their advantages and thus creating possibly superior learning algorithms. An example is the naive Bayesian classifier that has many nice features (speed, robustness, comprehensibility), and a sole problem – its naivety. It is often combined with other algorithms in order to – at least partially – avoid the effects of naivety:
• A locally naive (k-NN) Bayesian classifier is based on learning a naive Bayesian classifier on a set of k nearest neighbors of a new example. This reduces the naivety with respect to the original globally naive Bayesian classifier. Local classifiers are adapted to local distributions which implicitly account for attribute dependencies. The parameter k must be selected much larger than for ordinary nearest neighbor classifiers (e.g. k > 30) in order to provide sufficiently reliable conditional probability estimations.
• Clustering and the naive Bayesian classifier. For each class the learning examples are clustered (described later in this section) into several clusters. Each of them serves as a learning set for a separate new class. The number of obtained clusters is larger than the number of classes. The clustering algorithm should account for attribute dependencies and detect locally coherent subspaces, such that the naive Bayesian classifier can discriminate between them.
• The naive Bayesian classifier in the leaves of a decision tree. We may assume that the decision tree will account for dependencies among attributes, and split the original attribute space into subspaces where the remaining attributes are relatively independent. Such decision trees should be made smaller (by pruning) in order to obtain sufficiently reliable conditional probability estimations for the final naive Bayesian classification.
• The naive Bayesian classifier in internal nodes of decision trees. Instead of using original attributes to form splitting criteria, we can use a simple classifier in each internal node. Each node has for each possible class a corresponding subtree. A classifier learns from the training examples that belong to its node and assigns them to respective subtrees. The task of the subtree is to correctly classify learning examples that were misclassified in the previous node. A decision tree is built recursively until all examples are correctly classified or some other stopping criterion is fulfilled (e.g., number of examples becomes too small).
1.2.2 Regression
As in classification problems, in regression we have a set of objects (learning examples), described with several attributes (features, properties). Attributes are independent observable variables (either continuous or discrete). The dependent (target, regression) variable is continuous and its value is determined as a function of independent variables. The task of the regressional predictor is to determine the value of the dependent unobservable continuous variable for the object in question.
Similarly to classifiers, a regressional predictor implements a continuous function mapping from the attribute space to prediction values. This function can be given in advance or learnt from previously solved problems (a learning set). The task of the learning algorithm is therefore to determine a continuous function by learning from the learning set. This function can later be used for predicting values for new, previously unseen examples.
Regressional predictors differ with respect to the representation of regressional function. Most common regressional predictors (regressors) are linear regression, regression trees, locally weighted regression, support vector machines for regression, and multi-layered feedforward neural networks for regression.
Regression trees. Algorithms for building regression trees are similar to algorithms for constructing decision trees. According to their quality, they choose attributes and subsets of their values for building the node-splitting criteria. Tree nodes stand for attributes, tree branches (edges) for attribute values (conjunctively combined), and leaves for consequents of the prediction rules that map from the remaining (non-used) attributes to the prediction value. Discrete and continuous attributes are handled within internal nodes much in the same way as in decision (classification) trees. In terminal nodes (leaves) we have a continuous function. It is usually assumed that all the remaining attributes are continuous. Most frequently, single point functions (the mean value of dependent variable of the corresponding subset of learning examples), or linear regression functions are used.
Linear regression. A linear regression function assumes all attributes to be continuously-valued and suitably normalized. It also assumes the existence of a linear relationship between dependent and independent variables. The learning algorithm needs to determine the coefficients of the linear function in order to minimize the sum of squared (or absolute) errors of regressional predictions evaluated on learning examples. Either analytical solution or iterative optimization may be used. The result of learning is a hyperplane that for each example determines the value of the dependent variable. The error for a given example is its distance from the hyperplane in the dependent variable’s axis.
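For a single attribute, the analytical solution that minimizes the sum of squared errors can be written in a few lines. The sketch below is a minimal illustration with hypothetical names and data; with several attributes the same idea leads to a system of linear equations.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one continuous attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def squared_error(xs, ys, slope, intercept):
    """Sum of squared errors, measured along the dependent variable's axis."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, squared_error(xs, ys, slope, intercept))
```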
For non-linear problems a series of linear functions or nonlinear parts included in the functional expression can be used.
Locally weighted regression. Locally weighted regression is the nearest neighbors algorithm adapted for regression, such that linear regression function is calculated from the new example’s nearest neighbors. Linear regression is thus adapted to the local properties. Besides linear regression, the average value of nearest neighbors or some other simple function can be used – we cannot afford to use complex functions due to the small number of neighbors.
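A minimal sketch of this idea for one attribute follows. It fits an ordinary least-squares line to the k nearest neighbors of the new example and evaluates it there. All neighbors are given equal weight, which is a simplification; weighting neighbors by their distance is a common refinement. The names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line through the given points (one attribute)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def local_linear_predict(train, x_new, k=4):
    """Predict at x_new from a line fitted only to its k nearest neighbours."""
    neighbours = sorted(train, key=lambda ex: abs(ex[0] - x_new))[:k]
    xs = [x for x, _ in neighbours]
    ys = [y for _, y in neighbours]
    slope, intercept = fit_line(xs, ys)
    return slope * x_new + intercept

# Hypothetical nonlinear problem: y = x^2 sampled at x = 0, 1, ..., 10.
train = [(float(x), float(x * x)) for x in range(11)]
prediction = local_linear_predict(train, 4.5, k=4)
print(prediction)
```

A single global line would fit this data poorly, whereas the local fit stays close to the true value of 20.25 at x = 4.5.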
Support vector machines for regression. Since SVMs perform implicit transformation of the attribute space by utilizing kernel functions, there is no need to introduce nonlinearities for description of the hypersurface. SVMs can be easily adapted for solving regressional problems. A margin around the regressional hyperplane is defined such that correctly predicted learning examples lie inside it. Support vectors determine the margin. SVMs aim to minimize the margin as well as minimize the prediction error on learning examples. As is the case with SVMs for classification, the criterion function needs to be optimized with respect to the prediction error of the regressional variable as well as with respect to the complexity of the criterion function (measured with magnitudes of weights for implicitly generated attributes).
Artificial neural networks. For regression, as well as for classification, multilayered feedforward networks are commonly used. Simple calculation elements (neurons) are organized in a layered cascade: an input layer (corresponding to attributes), one or more hidden layers, and an output layer with a single neuron (corresponding to the regressional variable). The task of the learning algorithm is to determine weights on connections between neurons (used for calculating weighted sums in each neuron) such that the regressional error is minimal.
Hybrid algorithms. As for classification, hybrid algorithms combine several different approaches by using their advantages and thus creating possibly superior learning algorithms. Strictly speaking, locally weighted regression is of this kind, combining linear regression with nearest neighbors. The same holds for the use of linear regression (or other simple methods) in the leaves of regression trees. The rationale here is that regression trees should partition the attribute space finely enough for a simple regressor to work well. Such a regression tree should be made smaller (by pruning) in order to allow enough examples for simple regressors.
1.2.3 Logical relations
Logical relations can be considered a generalization of discrete functions. As opposed to functions, there is no single dependent discrete variable (class), and all variables (attributes) are treated equivalently. Sometimes, some variables have unknown values, and we want to predict them. In other cases, all the variables have known values and we want to check if the relation holds for this particular set of values. According to the expressive power of the language used for describing learning examples and logical relations, we differentiate between learning associations and inductive logic programming.
Associations. We speak of associations when the propositional (with attributes) data description is used, such as in classification and regression, the only difference being the absence of the dependent variable. Each variable may in turn be considered as independent (its value is given), or dependent (its value is not given). Because of the absence of an explicitly defined dependent variable, association learning is often considered unsupervised, together with clustering (to be described later).
A straightforward approach to association learning is to break the problem into a separate learning problems, where a is the number of attributes. In each new problem, one attribute is taken as a target (dependent) variable, whereas others are treated as independent variables. The type of the learning problem (classification or regression) depends on the type of the attribute (discrete or continuous). Two approaches to association learning are associative neural networks and association rules.
• Associative neural networks are artificial neural networks where every neuron can serve both as input and output. Each neuron represents one attribute. Neurons are mutually interconnected with two-way connections, so that the signal can travel in either direction. The signal value is multiplied by the weight of the connection. Each neuron calculates its own weighted sum of the incoming connections. This (normalized) sum represents the attribute value. In an associative neural network the initial state is determined by the given attribute values. After the initial set-up, all neurons start calculating their outputs asynchronously and in parallel. This process goes on until the outputs of all neurons cease changing. Thus the unknown values of the attributes are calculated.
The task of the learning algorithm is to set the weights on connections between each pair of neurons. According to the utilized learning algorithm we speak of Hopfield and Bayesian neural networks. Hopfield neural networks are based on correlation of the state activity of neurons. Bayesian neural networks use in every neuron the naive Bayesian learning rule for calculation of the weighted sum. Similarly, its learning rule calculates the conditional probability of activity of each neuron given the states of other neurons.
Associative neural networks are frequently used for pattern recognition when only partial data are known, and the missing part needs to be reconstructed. For example, in face recognition only a part of the face can be seen. Another use is in combinatorial optimization problems, where parameters (attributes) compete with each other in the sense that higher values of one may result in lower values of others. Associative neural networks tend to find a compromise – a locally optimal combination of parameter values.
• Association rules are similar to classification rules. They differ in the consequent, where instead of class any attribute may occur. The antecedent is made of conjunctively connected terms, built from other attributes. The aim of learning association rules is to find new interesting relations between attributes. They may contribute to (human) knowledge of a certain problem and give insight about regularities in data. The task of the learning algorithm is to find all association rules that are both accurate and interesting.
Association rules were originally used in market basket analysis. With their help, merchants can plan the arrangement of items on the shelves in their shops. For example, a rule may say that there is a high probability that a customer buying flour, sugar, and nuts will also buy eggs.
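The market basket example can be made concrete with two measures that are commonly used for evaluating association rules: support (the share of baskets containing all items of the rule) and confidence (the share of baskets containing the antecedent that also contain the consequent). Reading "accurate" as confidence is an assumption of this sketch, and the baskets are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"flour", "sugar", "nuts", "eggs"},
    {"flour", "sugar", "nuts", "eggs", "milk"},
    {"flour", "sugar", "nuts"},
    {"milk", "bread"},
    {"flour", "eggs"},
]
rule_support = support(baskets, {"flour", "sugar", "nuts", "eggs"})
rule_confidence = confidence(baskets, {"flour", "sugar", "nuts"}, {"eggs"})
print(rule_support, rule_confidence)
```

Here the rule "flour, sugar, nuts → eggs" holds in two of the three baskets that contain the antecedent. A learning algorithm would typically report only rules whose support and confidence exceed user-chosen thresholds.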
Inductive logic programming. In inductive logic programming (ILP) the first-order predicate calculus is used as a description language. It is usually limited to Horn clauses with negation, or equivalently, to Prolog expressions. A learning problem is described with the background knowledge (a set of relations, or Prolog programs) and with sets of positive (true) and negative (false) facts (examples) for a target relation. The task of the learning algorithm is to derive the target relation (a Prolog program) covering as many positive and as few negative examples as possible. ILP can be considered as a process for discovering systems of logical equations.
In practice, inductive logic programming is used for solving geometrical and spatial problems, such as structure prediction of chemical compounds, gene expression determination in functional genomics, spatial planning and subspace recognition, and static and modeling problems in machine engineering.
1.2.4 Equations
When modeling real-world problems, we often have to deal with measurements of different, related and mutually dependent processes. By learning systems of equations we aim to explicitly describe these dependencies. The same system of equations may as well be used for simulation and forecasting. We usually deal with real-valued (continuous) variables, numerical relations, and dependencies. In contrast to this, ILP is concerned mostly with logical variables, relations, and dependencies.
When learning systems of equations, their structure may either be given in advance in the form of background knowledge, or the learning algorithm needs to derive from data the structure of equations as well as their corresponding coefficients. For modeling dynamic systems that change continuously through time, systems of partial differential equations need to be derived. The task of the learning algorithm is to derive a system of equations that is both simple (in terms of equations) and accurate (with respect to input data).
Equation discovery is frequently used in meteorology for modeling weather systems, in ecology for modeling biological subsystems, and in machine engineering for modeling technical and dynamical processes.
1.2.5 Clustering
Unsupervised learning is different from classification and regression since only description of examples with attributes is given, without the supervising (target) variable. Clustering is the most popular of unsupervised learning methods. The task of the learning algorithm is to determine coherent subsets (clusters) of learning examples. Clustering is frequently used in natural sciences, econometrics, process analysis, psychological and sociological research.
The number of clusters may either be given in advance as a part of the background knowledge, or determined by the learning algorithm. Its task is therefore to determine a relatively small number of coherent clusters, that is, subsets of similar examples. The choice of the (dis)similarity measure is the most important part of the background knowledge, and of utmost importance for successful and meaningful clustering. Most popular approaches to clustering are:
• Hierarchical clustering. In agglomerative (bottom-up) hierarchical clustering, each example initially forms a separate cluster. At each step of the algorithm, the most similar clusters are amalgamated, thus forming a tree of clusterings (dendrogram). Frequently, pairwise amalgamation is used, leading to binary trees. Amalgamation continues until all examples belong to a single cluster. Finally, the learning algorithm or the end user selects the most suitable clustering level from the constructed tree. In divisive (top-down) hierarchical clustering, all examples initially belong to a single cluster. The number of clusters is incremented at each step of the algorithm by dividing an existing cluster into (usually, two) sub-clusters. The process continues until a suitable number of clusters is obtained.
• Partitional clustering. Initially, the number (c) of required disjoint clusters needs to be known. Given a criterion for measuring adequacy of a partition of examples into c clusters, the algorithm searches for optimal partitions of examples. The process starts with an initial clustering (frequently obtained by randomly selecting c examples), and a set of transformations for changing a partition into another partition. The learning algorithm (e.g., iterative relocation algorithm) modifies a partition until no allowable transformation would improve the given criterion of partition adequacy.
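As one illustration of partitional clustering, the sketch below implements a k-means style iterative relocation: each example is moved to the cluster with the nearest centre, the centres are recomputed, and the process stops when no relocation changes the partition. Choosing k-means, the Euclidean distance, and fixed (rather than randomly selected) initial centres are assumptions made to keep the example small and reproducible.

```python
import math

def kmeans(points, centres, max_iter=100):
    """Relocate points between clusters until the partition stops changing."""
    centres = [tuple(c) for c in centres]
    assignment = None
    for _ in range(max_iter):
        # Assign every example to the cluster with the nearest centre.
        new_assignment = [
            min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no relocation happened, so the partition is stable
        assignment = new_assignment
        # Recompute each centre as the mean of its current members.
        for j in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centres

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
# c = 2 initial centres; in practice these are often c randomly selected examples.
assignment, centres = kmeans(points, [points[0], points[3]])
print(assignment, centres)
```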
1.2.6 Reinforcement learning
Reinforcement learning deals with the problem of teaching an autonomous agent that acts and senses in its environment to choose optimal actions for achieving its goals. The agent receives information about the current state in the environment and performs actions to change it. Each time the agent performs an action, it may also get a positive or a negative reward (punishment). A reward is not necessarily a result of the latest action, but may be an effect of a (sometimes very long) series of actions. Since rewards are often indirect and delayed, learning is difficult and slow. A learner may be conservative and use only the knowledge it already has, or it might experiment with new, rarely used actions to obtain the knowledge of new situations.
The learner’s task is to learn from indirect, delayed rewards, how to choose sequences of actions that maximize the cumulative reward. Usually, the fastest solution (in terms of the number of actions) is preferable. Therefore, the weighting of accumulated rewards is often chosen so that immediate rewards are most important.
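One common way of weighting accumulated rewards so that immediate rewards count most is to multiply a reward received t steps in the future by a factor gamma raised to the power t, with gamma between 0 and 1. The sketch below assumes this discounting scheme; the name gamma and the reward sequences are illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward received t steps ahead is weighted by gamma**t."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# The same reward of 1 is worth more when it is reached in fewer actions.
fast = discounted_return([1, 0, 0], gamma=0.5)
slow = discounted_return([0, 0, 1], gamma=0.5)
print(fast, slow)
```

With such a weighting, a sequence of actions that reaches the same reward sooner has a higher cumulative value, which is how the preference for fast solutions is expressed.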
A commonly used approach to reinforcement learning is Q learning. This is a form of reinforcement learning in which the agent iteratively learns an evaluation function over states and actions. It can be employed even when the learner has no prior knowledge of how its