
Data Mining Models and Evaluation Techniques

Shubham Pachori
12BCE055

DEPARTMENT OF COMPUTER ENGINEERING


AHMEDABAD - 382424
November 2014

Data Mining Models And Evaluation Techniques

Seminar
Submitted in partial fulfillment of the requirements
for the degree of
Bachelor of Technology in
Computer Science and Engineering

Shubham Pachori
12BCE055

DEPARTMENT OF COMPUTER ENGINEERING


AHMEDABAD - 382424
November 2014


CERTIFICATE
This is to certify that the seminar entitled Data Mining Models and Evaluation Techniques, submitted
by Shubham Pachori (12BCE055) towards the partial fulfillment of the requirements for the degree
of Bachelor of Technology in Computer Science and Engineering of Nirma University, Ahmedabad, is the record of work carried
out by him under my supervision and guidance. In my opinion, the submitted work has reached a
level required for being accepted for examination. The results embodied in this seminar, to the best
of my knowledge, haven't been submitted to any other university or institution for the award of any degree
or diploma.

Prof. K.P.Agarwal
Associate Professor,
CSE Department,
Institute Of Technology,
Nirma University, Ahmedabad.

Prof. Anuja Nair


Assistant Professor,
CSE Department,
Institute Of Technology,
Nirma University, Ahmedabad.

Dr. Sanjay Garg


Prof & Head Of Department,
CSE Department,
Institute Of Technology,
Nirma University, Ahmedabad.


Acknowledgements
I am profoundly grateful to Prof. K P AGARWAL for his expert guidance
throughout the project. His continuous encouragement has fetched us golden
results, and his elixir of knowledge in the field has made this project achieve its zenith
and credibility.
I would like to express my deepest appreciation towards Prof. SANJAY GARG,
Head of the Department of Computer Engineering, and Prof. ANUJA NAIR, whose
invaluable guidance supported us in completing this project.
At last, I must express my sincere heartfelt gratitude to all the staff members of
the Computer Engineering Department who helped me directly or indirectly during this
course of work.
SHUBHAM PACHORI
12BCE055


Abstract
Databases are rich with hidden information that can be used for intelligent decision making. Classification and prediction are two forms of data analysis that
can be used to extract models describing important data classes or to predict future
data trends. Such analysis can help provide us with a better understanding of the
data at large. Classification models predict categorical (discrete, unordered) class
labels. For example, we can build a classification model to categorize bank loan
applications as either safe or risky.
As predictions always have an implicit cost involved, it is important to evaluate a
classifier's generalization performance in order to determine whether to employ the
classifier (for example, when learning the effectiveness of medical treatments from
limited-size data, it is important to estimate the accuracy of the classifier) and
to optimize the classifier (for example, when post-pruning decision trees we must
evaluate the accuracy of the decision tree at each pruning step).
This seminar report gives an in-depth explanation of classifier models (viz.
Naive Bayesian and Decision Trees) and how these classifier models are evaluated
for the accuracy of their predictions. The later part of the report also deals with how to
improve the accuracy of these classifier models, and it includes an exploratory study
comparing the various model evaluation techniques, carried out in Weka (a GUI-based
data mining tool) on representative data sets.


Contents

Certificate
Acknowledgements
Abstract

1 Introduction

2 Classification Using Decision Tree
  2.1 Understanding Decision Trees
  2.2 Divide and Conquer
  2.3 C5.0 Decision Tree Algorithm
  2.4 How To Choose The Best Split?
  2.5 Pruning The Decision Tree

3 Probabilistic Learning - Naive Bayesian Classification
  3.1 Understanding Naive Bayesian Classification
  3.2 Bayes' Theorem
  3.3 The Naive Bayes Algorithm
  3.4 Naive Bayesian Classification

4 Model Evaluation Techniques
  4.1 Prediction Accuracy
  4.2 Confusion Matrix and Model Evaluation Metrics
  4.3 How To Estimate These Metrics?
    4.3.1 Training and Independent Test Data
    4.3.2 Holdout Method
    4.3.3 K-Cross-Validation
    4.3.4 Bootstrap
    4.3.5 Comparing Two Classifier Models
  4.4 ROC Curves
  4.5 Ensemble Methods
    4.5.1 Why Ensemble Works?
    4.5.2 Ensemble Works in Two Ways
    4.5.3 Learn To Combine
    4.5.4 Learn By Consensus
    4.5.5 Bagging
    4.5.6 Boosting

5 Conclusion and Future Scope
  5.1 Comparative Study
  5.2 Conclusion
  5.3 Future Scope

References


Chapter 1
Introduction
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the high-level application of
particular data mining methods. It is of interest to researchers in machine learning,
pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data visualization. The unifying goal of the KDD process
is to extract knowledge from data in the context of large databases. It does this by
using data mining methods (algorithms) to extract (identify) what is deemed knowledge, according to the specifications of measures and thresholds, using a database
along with any required preprocessing, subsampling, and transformations of that
database.
The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:

Figure 1.1: KDD Process

1. Developing an understanding of
(a) the application domain
CSE Department,Institute of Technology, Nirma University

Data Mining Models And Evaluation Technique

(b) the relevant prior knowledge


(c) the goals of the end-user

2. Creating a target data set: selecting a data set, or focusing on a subset of variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
(a) Removal of noise or outliers.
(b) Strategies for handling missing data fields.
(c) Accounting for time sequence information and known changes.

4. Data reduction and projection.


(a) Finding useful features to represent the data, depending on the goal of the task.
(b) Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
5. Choosing the data mining task. Deciding whether the goal of the KDD process
is classification, regression, clustering, etc.
6. Choosing the data mining algorithm
(a) Selecting the method(s) to be used for searching for patterns in the data.
(b) Deciding which models and parameters may be appropriate.
(c) Matching a particular data mining method with the overall criteria of the KDD process.
7. Data mining.
Searching for patterns of interest in a particular representational form or a set
of such representations as classification rules or trees, regression, clustering,
and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge
In the following chapters, we will be exploring data mining models and evaluation
techniques in depth.

Chapter 2
Classification Using Decision Tree
This chapter introduces the most widely used learning method, which applies a
strategy of dividing data into smaller and smaller portions to identify
patterns that can be used for prediction. The knowledge is then presented in the
form of logical structures that can be understood without any statistical knowledge.
This aspect makes these models particularly useful for business strategy and process
improvement.
1. Understanding Decision Trees
2. Divide and Conquer
3. Unique identifiers
4. C5.0 Decision Tree Algorithm
5. Choosing The Best Split
6. Pruning The Decision Tree

2.1 Understanding Decision Trees

As we might intuit from the name itself, decision tree learners build a model in
the form of a tree structure. The model itself comprises a series of logical decisions,
similar to a flowchart, with decision nodes that indicate a decision to be made on
an attribute. These split into branches that indicate the decision's choices. The tree
is terminated by leaf nodes (also known as terminal nodes) that denote the result of
following a combination of decisions.
Data that is to be classified begins at the root node, where it is passed through
the various decisions in the tree according to the values of its features. The path
that the data takes funnels each record into a leaf node, which assigns it a predicted
class. As the decision tree is essentially a flowchart, it is particularly appropriate for


applications in which the classification mechanism needs to be transparent for legal
reasons or the results need to be shared in order to facilitate decision making. Some
potential uses include:
1. Credit scoring models in which the criteria that cause an applicant to be rejected need to be well specified
2. Marketing studies of customer churn or customer satisfaction that will be shared
with management or advertising agencies
3. Diagnosis of medical conditions based on laboratory measurements, symptoms,
or rate of disease progression
In spite of their wide applicability, it is worth noting some scenarios where trees
may not be an ideal fit. One such case might be a task where the data has a large
number of nominal features with many levels or if the data has a large number of
numeric features. These cases may result in a very large number of decisions and an
overly complex tree.

2.2 Divide and Conquer

Decision trees are built using a heuristic called recursive partitioning. This approach is generally known as divide and conquer because it uses the feature values
to split the data into smaller and smaller subsets of similar classes. Beginning at the
root node, which represents the entire dataset, the algorithm chooses a feature that is
the most predictive of the target class. The examples are then partitioned into groups
of distinct values of this feature; this decision forms the first set of tree branches.
The algorithm continues to divide-and-conquer the nodes, choosing the best candidate feature each time until a stopping criterion is reached. This might occur at a
node if:
1. All (or nearly all) of the examples at the node have the same class
2. There are no remaining features to distinguish among examples
3. The tree has grown to a predefined size limit
To illustrate the tree-building process, let's consider a simple example. Imagine
that we are working for a Hollywood film studio, and our desk is piled high with
screenplays. Rather than read each one cover-to-cover, you decide to develop a
decision tree algorithm to predict whether a potential movie would fall into one of
three categories: mainstream hit, critics' choice, or box office bust. To gather data for
our model, we turn to the studio archives to examine the previous ten years of movie
releases. After reviewing the data for 30 different movie scripts, a pattern emerges.
There seems to be a relationship between the film's proposed shooting budget, the
number of A-list celebrities lined up for starring roles, and the categories of success.
A scatter plot of this data might look something like figure 2.1 (Reference [2]):

Figure 2.1: Scatter Plot of Budget vs. A-List Celebrities (Ref [2])

To build a simple decision tree using this data, we can apply a divide-and-conquer
strategy. Let's first split on the feature indicating the number of celebrities, partitioning the movies into groups with and without a significant number of A-list stars (fig 2.2,
Reference [2]).

Figure 2.2: Split 1: Scatter Plot of Budget vs. A-List Celebrities (Ref [2])


Next, among the group of movies with a larger number of celebrities, we can
make another split between movies with and without a high budget (fig 2.3). At this
point we have partitioned the data into three groups. The group at the top-left corner
of the diagram is composed entirely of critically-acclaimed films. This group is
distinguished by a high number of celebrities and a relatively low budget. At the
top-right corner, the majority of movies are box office hits, with high budgets and a
large number of celebrities. The final group, which has little star power but budgets
ranging from small to large, contains the flops.

Figure 2.3: Split 2: Scatter Plot of Budget vs. A-List Celebrities (Ref [2])

If we wanted, we could continue to divide the data by splitting it based on increasingly specific ranges of budget and celebrity counts until each of the incorrectly classified values resides in its own, perhaps tiny partition. Since the data can continue
to be split until there are no distinguishing features within a partition, a decision tree
can be prone to overfitting the training data with overly specific decisions.
We'll avoid this by stopping the algorithm here, since more than 80 percent of the
examples in each group are from a single class.
Our model for predicting the future success of movies can be represented in a simple tree, as shown in fig 2.4 (Ref [2]). To evaluate a script, follow the branches through
each decision until its success or failure has been predicted. In no time, you will
be able to classify the backlog of scripts and get back to more important work such
as writing an awards acceptance speech. Since real-world data contains more than
two features, decision trees quickly become far more complex than this, with many
more nodes, branches, and leaves. In the next section we will throw some light on a
popular algorithm for building decision tree models automatically.


Figure 2.4: Decision Tree Model (Reference [2])
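
To make the divide-and-conquer process concrete, here is a small Python sketch (my own illustration, not code from this report or from C5.0) that recursively partitions a toy version of the movie data until each node is at least 80 percent pure. The data, the function names, and the simple purity-based split choice are all assumptions made for this example.

from collections import Counter

def purity(labels):
    # Fraction of examples at a node that belong to the most common class
    return Counter(labels).most_common(1)[0][1] / len(labels)

def build_tree(examples, labels, features, stop_purity=0.8):
    # Stopping criteria: the node is (nearly) pure, or no features remain to split on
    if purity(labels) >= stop_purity or not features:
        return Counter(labels).most_common(1)[0][0]        # leaf: majority class
    def avg_purity(f):
        # Weighted purity of the partitions produced by splitting on feature f
        groups = {}
        for ex, y in zip(examples, labels):
            groups.setdefault(ex[f], []).append(y)
        return sum(len(g) / len(labels) * purity(g) for g in groups.values())
    best = max(features, key=avg_purity)                   # most predictive feature
    branches = {}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        branches[value] = build_tree([examples[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [f for f in features if f != best], stop_purity)
    return {'split on': best, 'branches': branches}

movies = [{'stars': 'low', 'budget': 'low'},  {'stars': 'low', 'budget': 'high'},
          {'stars': 'high', 'budget': 'low'}, {'stars': 'high', 'budget': 'high'}]
outcomes = ['box office bust', 'box office bust', 'critics choice', 'mainstream hit']
print(build_tree(movies, outcomes, ['stars', 'budget']))

On this toy data the sketch splits first on the number of A-list stars and then on the budget, mirroring the splits shown in figures 2.2 and 2.3.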

2.3 C5.0 Decision Tree Algorithm

There are numerous implementations of decision trees, but one of the most well
known is the C5.0 algorithm. This algorithm was developed by computer scientist J.
Ross Quinlan as an improved version of his prior algorithm, C4.5, which itself is an
improvement over his ID3 (Iterative Dichotomiser 3) algorithm.
Strengths of C5.0 Algorithm
1. An all-purpose classifier that does well on most problems
2. Highly automatic learning process that can handle numeric or nominal features,
as well as missing data
3. Uses only the most important features
4. Can be used on data with relatively few training examples or a very large
number
5. Results in a model that can be interpreted without a mathematical background
(for relatively small trees)
6. More efficient than other complex models
Weaknesses of C5.0 Algorithm


1. Decision tree models are often biased toward splits on features having a large
number of levels
2. It is easy to overfit or underfit the model
3. Can have trouble modeling some relationships due to reliance on axis-parallel
splits
4. Small changes in training data can result in large changes to decision logic
5. Large trees can be difficult to interpret and the decisions they make may seem
counterintuitive

2.4 How To Choose The Best Split?

The first challenge that a decision tree will face is to identify which feature to
split upon. In the previous example, we looked for feature values that split the data
in such a way that partitions contained examples primarily of a single class. If the
segments of data contain only a single class, they are considered pure. There are
many different measurements of purity for identifying splitting criteria; C5.0 uses
entropy for measuring purity. The entropy of a sample of data indicates how mixed
the class values are; the minimum value of 0 indicates that the sample is completely
homogeneous, while 1 indicates the maximum amount of disorder (for a two-class sample). The definition of
entropy is specified by:
Entropy(S) = − Σ_{i=1..c} p_i · log2(p_i)    (2.1)

In the entropy formula, for a given segment of data (S), the term c refers to the
number of different class levels, and pi refers to the proportion of values falling into
class level i. For example, suppose we have a partition of data with two classes: red
(60 percent), and white (40 percent). We can calculate the entropy as:
Entropy(S) = −0.60 · log2(0.60) − 0.40 · log2(0.40) = 0.9709506    (2.2)
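
As a quick check of Eq. (2.2), the following Python sketch (not part of the original report) evaluates Eq. (2.1) for the two-class red/white segment:

import math

def entropy(proportions):
    # Entropy of a data segment given the class proportions p_i (Eq. 2.1)
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.60, 0.40]))   # ~0.9709506, matching Eq. (2.2)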

Given this measure of purity, the algorithm must still decide which feature to split
upon. For this, the algorithm uses entropy to calculate the change in homogeneity
resulting from a split on each possible feature. The calculation is known as information gain. The information gain for a feature F is calculated as the difference
between the entropy in the segment before the split (S1 ), and the partitions resulting
from the split (S2):

InfoGain(F) = Entropy(S1) − Entropy(S2)    (2.3)

The one complication is that after a split, the data is divided into more than one
partition. Therefore, the function to calculate Entropy(S2 ) needs to consider the
total entropy across all of the partitions. It does this by weighing each partitions
entropy by the proportion of records falling into that partition. This can be stated in
a formula as:
Entropy(S) = Σ_{i=1..n} w_i · Entropy(P_i)    (2.4)

In simple terms, the total entropy resulting from a split is the sum of entropy
of each of the n partitions weighted by the proportion of examples falling in that
partition wi . The higher the information gain, the better a feature is at creating
homogeneous groups after a split on that feature. If the information gain is zero,
there is no reduction in entropy for splitting on this feature. On the other hand, the
maximum information gain is equal to the entropy prior to the split. This would
imply the entropy after the split is zero, which means that the decision results in
completely homogeneous groups.
The previous formulae assume nominal features, but decision trees use information gain for splitting on numeric features as well. A common practice is testing
various splits that divide the values into groups greater than or less than a threshold.
This reduces the numeric feature into a two-level categorical feature and information
gain can be calculated easily. The numeric threshold yielding the largest information
gain is chosen for the split.
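
The following Python sketch (my own illustration, with made-up budget values and labels) applies Eqs. (2.3) and (2.4) to pick the numeric threshold with the largest information gain, as described above:

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels (Eq. 2.1)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, threshold):
    # InfoGain(F) = Entropy(S1) minus the weighted entropy of the partitions (Eqs. 2.3-2.4)
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

budgets = [10, 20, 30, 80, 90, 120]                  # hypothetical shooting budgets
success = ['flop', 'flop', 'hit', 'hit', 'hit', 'hit']
best = max(set(budgets), key=lambda t: info_gain(budgets, success, t))
print(best, info_gain(budgets, success, best))       # threshold 20 gives the largest gain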

2.5 Pruning The Decision Tree


A decision tree can continue to grow indefinitely, choosing splitting features and
dividing into smaller and smaller partitions until each example is perfectly classified
or the algorithm runs out of features to split on. However, if the tree grows overly
large, many of the decisions it makes will be overly specific and the model will have
been overfitted to the training data. The process of pruning a decision tree involves
reducing its size such that it generalizes better to unseen data.


One solution to this problem is to stop the tree from growing once it reaches a
certain number of decisions or if the decision nodes contain only a small number of
examples. This is called early stopping or pre-pruning the decision tree. As the tree
avoids doing needless work, this is an appealing strategy. However, one downside is
that there is no way to know whether the tree will miss subtle, but important patterns
that it would have learned had it grown to a larger size.
An alternative, called post-pruning involves growing a tree that is too large, then
using pruning criteria based on the error rates at the nodes to reduce the size of the
tree to a more appropriate level. This is often a more effective approach than pre-pruning because it is quite difficult to determine the optimal depth of a decision tree
without growing it first. Pruning the tree later on allows the algorithm to be certain
that all important data structures were discovered.
One of the benefits of the C5.0 algorithm is that it is opinionated about pruning: it
takes care of many of the decisions automatically, using fairly reasonable defaults.
Its overall strategy is to post-prune the tree. It first grows a large tree that overfits the
training data. Later, nodes and branches that have little effect on the classification
errors are removed. In some cases, entire branches are moved further up the tree
or replaced by simpler decisions. These processes of grafting branches are known
as subtree raising and subtree replacement, respectively. Balancing overfitting and
underfitting a decision tree is a bit of an art, but if model accuracy is vital it may
be worth investing some time with various pruning options to see if it improves
performance on the test data. As you will soon see, one of the strengths of the C5.0
algorithm is that it is very easy to adjust the training options.


Chapter 3
Probabilistic Learning - Naive Bayesian
Classification
When a meteorologist provides a weather forecast, precipitation is typically predicted using terms such as 70 percent chance of rain. These forecasts are known
as probability of precipitation reports. Have you ever considered how they are calculated? It is a puzzling question, because in reality, it will either rain or it will
not. This chapter covers a machine learning algorithm called naive Bayes, which
also uses principles of probability for classification. Just as meteorologists forecast
weather, naive Bayes uses data about prior events to estimate the probability of future events. For instance, a common application of naive Bayes uses the frequency
of words in past junk email messages to identify new junk mail.

3.1 Understanding Naive Bayesian Classification

The basic statistical ideas necessary to understand the naive Bayes algorithm have
been around for centuries. The technique descended from the work of the 18th century mathematician Thomas Bayes, who developed foundational mathematical principles (now known as Bayesian methods) for describing the probability of events,
and how probabilities should be revised in light of additional information. Classifiers based on Bayesian methods utilize training data to calculate an observed probability of each class based on feature values. When the classifier is used later on
unlabeled data, it uses the observed probabilities to predict the most likely class for
the new features. It's a simple idea, but it results in a method that often has results
on par with more sophisticated algorithms. In fact, Bayesian classifiers have been
used for:
1. Text classification, such as junk email (spam) filtering, author identification,
or topic categorization

2. Intrusion detection or anomaly detection in computer networks


3. Diagnosing medical conditions, when given a set of observed symptoms
Typically, Bayesian classifiers are best applied to problems in which the information
from numerous attributes should be considered simultaneously in order to estimate
the probability of an outcome. While many algorithms ignore features that have
weak effects, Bayesian methods utilize all available evidence to subtly change the
predictions. If a large number of features have relatively minor effects, taken together their combined impact could be quite large.

3.2 Bayes' Theorem

Bayes theorem is named after Thomas Bayes, a nonconformist English clergyman


who did early work in probability and decision theory during the 18th century. Let X
be a data tuple. In Bayesian terms, X is considered evidence. As usual, it is described
by measurements made on a set of n attributes. Let H be some hypothesis, such as
that the data tuple X belongs to a specified class C. For classification problems,
we want to determine P(H|X), the probability that the hypothesis H holds given the
evidence or observed data tuple X. In other words, we are looking for the probability
that tuple X belongs to class C, given that we know the attribute description of X.
P(H|X) is the posterior probability, or a posterior probability, of H conditioned on
X. For example, suppose our world of data tuples is confined to customers described
by the attributes age and income, respectively, and that X is a 35-year-old customer
with an income of Rs40,000. Suppose that H is the hypothesis that our customer
will buy a computer. Then P(H|X) reflects the probability that customer X will
buy a computer given that we know the customer's age and income. In contrast,
P(H) is the prior probability, or a prior probability, of H. For our example, this
is the probability that any given customer will buy a computer, regardless of age,
income, or any other information, for that matter. The posterior probability, P(H|X),
is based on more information (e.g., customer information) than the prior probability,
P(H), which is independent of X. Similarly, P(X|H) is the posterior probability of
X conditioned on H. That is, it is the probability that a customer, X, is 35 years old
and earns Rs40,000, given that we know the customer will buy a computer. P(X) is
the prior probability of X. Using our example, it is the probability that a person from
our set of customers is 35 years old and earns Rs40,000. How are these probabilities
estimated? P(H), P(X|H), and P(X) may be estimated from the given data, as we
shall see below. Bayes theorem is useful in that it provides a way of calculating the
posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes' theorem is

P(H|X) = P(X|H) · P(H) / P(X)    (3.1)
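
As a hedged numeric illustration of Eq. (3.1), suppose (these figures are invented) that 2 percent of customers buy a computer, that 40 percent of buyers are 35-year-olds earning Rs40,000, and that 5 percent of all customers match that profile; the posterior then follows directly:

p_h = 0.02          # P(H): prior probability that a customer buys a computer
p_x_given_h = 0.40  # P(X|H): probability a buyer is 35 years old and earns Rs40,000
p_x = 0.05          # P(X): probability any customer matches that profile
p_h_given_x = p_x_given_h * p_h / p_x   # Eq. (3.1)
print(p_h_given_x)                      # ~0.16, the posterior probability P(H|X)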

3.3 The Naive Bayes Algorithm

The naive Bayes (NB) algorithm describes a simple application using Bayes theorem for classification. Although it is not the only machine learning method utilizing
Bayesian methods, it is the most common, particularly for text classification where
it has become the de facto standard. Strengths and weaknesses of this algorithm are
as follows
Strengths

1. Simple, fast, and very effective


2. Does well with noisy and missing data
3. Requires relatively few examples for training, but also works well with very
large numbers of examples
4. Easy to obtain the estimated probability for a prediction
Weaknesses

1. Relies on an often-faulty assumption of equally important and independent


features
2. Not ideal for datasets with large numbers of numeric features
3. Estimated probabilities are less reliable than the predicted classes
The naive Bayes algorithm is named as such because it makes a couple of naive
assumptions about the data. In particular, naive Bayes assumes that all of the features
in the dataset are equally important and independent. These assumptions are rarely
true in most of the real-world applications.
For example, if you were attempting to identify spam by monitoring email messages, it is almost certainly true that some features will be more important than
others. For example, the sender of the email may be a more important indicator
of spam than the message text. Additionally, the words that appear in the message
body are not independent from one another, since the appearance of some words is a
very good indication that other words are also likely to appear. A message with the
word Viagra is likely to also contain the words prescription or drugs. However, in most cases
when these assumptions are violated, naive Bayes still performs fairly well. This
is true even in extreme circumstances where strong dependencies are found among
the features. Due to the algorithms versatility and accuracy across many types of
conditions, naive Bayes is often a strong first candidate for classification learning
tasks.

3.4 Naive Bayesian Classification

The naive Bayesian classifier, or simple Bayesian classifier, works as follows:


1. Let D be a training set of tuples and their associated class labels. As usual, each
tuple is represented by an n-dimensional attribute vector, X = (x1 , x2 , ....., xn ),
depicting n measurements made on the tuple from n attributes, respectively
A1 , A2 , ....., An .
2. Suppose that there are m classes, C1 ,C2 , ...,Cm . Given a tuple, X, the classifier
will predict that X belongs to the class having the highest posterior probability,
conditioned on X. That is, the naive Bayesian classifier predicts that tuple X
belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i    (3.2)

Thus we maximize P(Ci |X). The class for which P(Ci |X) is maximized is
called the maximum posterior hypothesis. By Bayes theorem
P(Ci|X) = P(X|Ci) · P(Ci) / P(X)    (3.3)

3. As P(X) is constant for all classes, only P(X|Ci )P(Ci ) need be maximized. If
the class prior probabilities are not known, then it is commonly assumed that
the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would
therefore maximize P(X|Ci ). Otherwise, we maximize P(X|Ci )P(Ci ). Note
that the class prior probabilities may be estimated by P(Ci ) = |Ci,D |/|D|,where
|Ci,D | is the number of training tuples of class Ci in D.
4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci ). In order to reduce computation in evaluating


P(X|Ci ), the naive assumption of class conditional independence is made. This
presumes that the values of the attributes are conditionally independent of one
another, given the class label of the tuple (i.e., that there are no dependence
relationships among the attributes). Thus,
P(X|Ci) = Π_{k=1..n} P(x_k|Ci)    (3.4)

5. In order to predict the class label of X, P(X|Ci )P(Ci ) is evaluated for each class
Ci . The classifier predicts that the class label of tuple X is the class Ci if and
only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj)   for 1 ≤ j ≤ m, j ≠ i    (3.5)
In other words, the predicted class label is the class Ci for which P(X|Ci )P(Ci )
is the maximum.
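
A compact Python sketch of steps 1-5 above is given below. It is illustrative only: the toy data are invented, and it omits refinements such as Laplace smoothing for zero counts and the handling of continuous attributes.

from collections import Counter, defaultdict

def train(tuples, labels):
    priors = Counter(labels)                 # class counts |C_i,D|
    cond = defaultdict(Counter)              # (class, attribute index) -> value counts
    for x, c in zip(tuples, labels):
        for k, value in enumerate(x):
            cond[(c, k)][value] += 1
    return priors, cond, len(labels)

def predict(x, priors, cond, n):
    def score(c):
        # P(C_i) * product over k of P(x_k | C_i), per Eqs. (3.3)-(3.4)
        p = priors[c] / n
        for k, value in enumerate(x):
            p *= cond[(c, k)][value] / priors[c]
        return p
    return max(priors, key=score)

X = [('sunny', 'hot'), ('sunny', 'mild'), ('rain', 'mild'), ('rain', 'hot')]
y = ['no', 'no', 'yes', 'no']
priors, cond, n = train(X, y)
print(predict(('rain', 'mild'), priors, cond, n))   # 'yes' for this toy data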


Chapter 4
Model Evaluation Techniques
Now that we have explored in depth the two most widely used classifier models, the
question we face is how accurately these classifiers can predict future
trends based on the data used to build them, e.g., how accurately a customer recommender system
of a company can predict the future purchasing behavior of a customer based on the
previously recorded sales data of its customers.
Given the significant role these classifiers play, their accuracy becomes of prime
importance to companies, especially those in e-commerce. Thus, model evaluation
techniques are employed to evaluate the accuracy of the predictions made by a
classifier model. As different classifier models have varying strengths and
weaknesses, it is necessary to use tests that reveal distinctions among the learners
when measuring how a model will perform on future data. The succeeding sections
in this chapter will primarily focus on the following points:
1. The reason why predictive accuracy is not sufficient to measure performance,
and what the other alternatives for measuring accuracy are
2. Methods to ensure that the performance measures reasonably reflect a model's
ability to predict or forecast unseen data

4.1 Prediction Accuracy

The prediction accuracy of a classifier model is defined as the proportion of correct predictions out of the total number of predictions. This number indicates the percentage of
cases in which the learner is right or wrong. For instance, suppose a classifier correctly identified whether or not 99,990 out of 100,000 newborn babies are carriers
of a treatable but potentially-fatal genetic defect. This would imply an accuracy of
99.99 percent and an error rate of only 0.01 percent.

Although this would appear to indicate an extremely accurate classifier, it would


be wise to collect additional information before trusting your child's life to the test.
What if the genetic defect is found in only 10 out of every 100,000 babies? A test
that predicts no defect regardless of circumstances will still be correct for 99.99
percent of all cases. In this case, even though the predictions are correct for the large
majority of data, the classifier is not very useful for its intended purpose, which is to
identify children with birth defects.
The best measure of classifier performance is whether the classifier is successful
at its intended purpose. For this reason, it is crucial to have measures of model
performance that measure utility rather than raw accuracy.

4.2 Confusion Matrix and Model Evaluation Metrics

A confusion matrix is a matrix that categorizes predictions according to whether


they match the actual value in the data. One of the table's dimensions indicates the
possible categories of predicted values, while the other dimension indicates the same
for actual values. It can be an n-by-n matrix, depending on the number of values the
predicted class can take. Figure 4.1 (Reference [2]) depicts a 2x2 and a 3x3
confusion matrix. There are four important terms that are considered the building

Figure 4.1: Confusion Matrix (Ref [2])

blocks used in computing many evaluation measures. The class of interest is known


as the positive class, while all others are known as negative.
1. True Positives(T P):Correctly classified as the class of interest.
2. True Negatives (T N):Correctly classified as not the class of interest.
3. False Positives(FP):Incorrectly classified as the class of interest.
4. False Negatives(FN):Incorrectly classified as not the class of interest.

The confusion matrix is a useful tool for analysing how well our classifier can recognize tuples of different classes. TP and TN tell us when the classifier is getting things
right, while FP and FN tell us when the classifier is getting things wrong. Given m
classes, a confusion matrix is a matrix of at least m by m size. An entry, CM_{i,j}, in
the first m rows and m columns indicates the number of tuples of class i that were
labeled by the classifier as class j. For a classifier to have good accuracy, ideally
most of the tuples would be represented along the diagonal of the confusion matrix
from the entry CM1,1 to entry CMm,m , with the rest of the entries being zero or close
to zero. That is, ideally, FP and FN are around zero.
Accuracy: The accuracy of a classifier on a given test set is the percentage of test
tuples that are correctly classified by the classifier.
accuracy = (TP + TN) / (P + N)    (4.1)

Error Rate: The error rate, or misclassification rate, of a classifier, M, is simply
1 − accuracy(M), where accuracy(M) is the accuracy of M.
error rate = (FP + FN) / (P + N)    (4.2)

If we use the training set instead of test set to estimate the error rate of a model, this
quantity is known as the resubstitution error. This error estimate is optimistic about the
true error rate because the model is not tested on any samples that it has not already
seen.
The Class Imbalance Problem: this arises in datasets where the main class of interest is
rare; that is, the data set distribution reflects a significant majority of the negative
class and a minority positive class. For example, in fraud detection applications, the
class of interest (the fraudulent class) is rare or less frequently occurring in comparison
to the negative (non-fraudulent) class. In medical data there may be a rare class, such
as cancer. Suppose that we have trained a classifier to classify medical data tuples, where the class label attribute is cancer and the possible class values are yes
and no. An accuracy rate of, say, 97% may make the classifier seem quite accurate,
but what if only, say, 3% of the training tuples are actually cancer? Clearly an accuracy rate of 97% may not be acceptable: the classifier could be correctly labeling
only the non-cancer tuples, for instance, and misclassifying all the cancer tuples.
Instead we need other measures which assess how well the classifier can recognize
the positive tuples and how well it can recognize the negative tuples.
Sensitivity and Specificity: Classification often involves a balance between being overly conservative and overly aggressive in decision making. For example, an
e-mail filter could guarantee to eliminate every spam message by aggressively eliminating nearly every ham message at the same time. On the other hand, a guarantee
that no ham messages will be inadvertently filtered might allow an unacceptable
amount of spam to pass through the filter. This tradeoff is captured by a pair of
measures: sensitivity and specificity.
The sensitivity of a model (also called the true positive rate), measures the proportion of positive examples that were correctly classified. Therefore, as shown in
the following formula, it is calculated as the number of true positives divided by the
total number of positives in the data: those correctly classified (the true positives), as
well as those incorrectly classified (the false negatives).
sensitivity = TP / (TP + FN)    (4.3)

The specificity of a model (also called the true negative rate), measures the proportion of negative examples that were correctly classified. As with sensitivity, this
is computed as the number of true negatives divided by the total number of negatives: the true negatives plus the false positives.
specificity = TN / (TN + FP)    (4.4)

Precision and recall: Closely related to sensitivity and specificity are two other
performance measures, related to compromises made in classification: precision and
recall. Used primarily in the context of information retrieval, these statistics are
intended to provide an indication of how interesting and relevant a model's results
are, or whether the predictions are diluted by meaningless noise.
The precision (also known as the positive predictive value) is defined as the proportion of positive examples that are truly positive; in other words, when a model
predicts the positive class, how often is it correct? A precise model will only predict
the positive class in cases very likely to be positive. It will be very trustworthy.


Consider what would happen if the model was very imprecise. Over time, the
results would be less likely to be trusted. In the context of information retrieval,
this would be similar to a search engine such as Google returning unrelated results.
Eventually users would switch to a competitor such as Bing. In the case of the SMS
spam filter, high precision means that the model is able to carefully target only the
spam while ignoring the ham.
precision = TP / (TP + FP)    (4.5)

On the other hand, recall is a measure of how complete the results are. As shown
in the following formula, this is defined as the number of true positives over the
total number of positives. We may recognize that this is the same as sensitivity, only
the interpretation differs. A model with high recall captures a large portion of the
positive examples, meaning that it has wide breadth. For example, a search engine
with high recall returns a large number of documents pertinent to the search query.
Similarly, the SMS spam filter has high recall if the majority of spam messages are
correctly identified.
recall = TP / (TP + FN)    (4.6)
The F-Measure: A measure of model performance that combines precision and
recall into a single number is known as the F-measure (also sometimes called the
F1 score or the F-score). The F-measure combines precision and recall using the
harmonic mean. The harmonic mean is used rather than the more common arithmetic
mean since both precision and recall are expressed as proportions between zero and
one. The following is the formula for F-measure:
F-Measure = (2 × precision × recall) / (recall + precision)    (4.7)

More generally, the weighted F_β measure is

F_β = ((1 + β²) × precision × recall) / (β² × precision + recall)    (4.8)

In addition to accuracy-based measures, classifiers can also be compared with respect to the following additional aspects:

1. Speed: This refers to the computational costs involved in generating and using
the given classifier
2. Robustness: This is the ability of the classifier to make correct predictions
given noisy data or data with missing values. Robustness is typically assessed
with a series of synthetic data sets representing increasing degrees of noise and
missing values.
3. Scalability: This refers to the ability to construct the classifier efficiently given
large amounts of data. Scalability is typically assessed with a series of data sets
of increasing size.
4. Interpretability: This refers to the level of understanding and insight that is
provided by the classifier or predictor. Interpretability is subjective and therefore more difficult to assess.
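
The metrics defined in Eqs. (4.1)-(4.7) can be computed directly from the four cells of a 2x2 confusion matrix. The following Python sketch (with invented counts) is one way to do so:

def metrics(tp, tn, fp, fn):
    p, n = tp + fn, tn + fp                  # actual positives and negatives
    accuracy    = (tp + tn) / (p + n)        # Eq. (4.1)
    error_rate  = (fp + fn) / (p + n)        # Eq. (4.2)
    sensitivity = tp / (tp + fn)             # Eq. (4.3), also the recall of Eq. (4.6)
    specificity = tn / (tn + fp)             # Eq. (4.4)
    precision   = tp / (tp + fp)             # Eq. (4.5)
    f_measure   = 2 * precision * sensitivity / (precision + sensitivity)   # Eq. (4.7)
    return accuracy, error_rate, sensitivity, specificity, precision, f_measure

print(metrics(tp=90, tn=9850, fp=50, fn=10))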


4.3 How To Estimate These Metrics?

We can use the following methods to estimate the evaluation metrics explained in depth in the preceding sections:
a. Training data
b. Independent test data
c. Hold-out method
d. k-fold cross-validation method
e. Leave-one-out method
f. Bootstrap method
g. Comparing Two Models
4.3.1 Training and Independent Test Data

The accuracy/error estimates on the training data are not good indicators of
performance on future data, because new data will probably not be exactly the same
as the training data. The accuracy/error estimates on the training data measure the
degree of the classifier's over-fitting. Fig 4.2 depicts the use of the training set. Estimation with

Figure 4.2: Training Set

independent test data (figure 4.3) is used when we have plenty of data and there
is a natural way of forming training and test data. For example, Quinlan in 1987
reported experiments in a medical domain for which the classifiers were trained on
data from 1985 and tested on data from 1986.

Figure 4.3: Training and Test Set


Figure 4.4: Classification: Train, Validation, Test Split Reference[3]

4.3.2 Holdout Method

The holdout method (fig 4.5) is what we have alluded to so far in our discussions


about accuracy. In this method, the given data are randomly partitioned into two
independent sets, a training set and a test set. Typically, two-thirds of the data are
allocated to the training set, and the remaining one-third is allocated to the test set.
The training set is used to derive the model, whose accuracy is estimated with the
test set . The estimate is pessimistic because only a portion of the initial data is used
to derive the model. The hold-out method is usually used when we have thousands

Figure 4.5: Holdout Method


of instances, including several hundred instances from each class.


For unbalanced data-sets, the samples might not be representative. Few or no instances
of some classes will be present in the case of class-imbalanced data where one class is in
the majority, e.g., in fraudulent transaction detection and medical diagnostic tests. To
make the holdout sample representative, we use the concept of stratification: we
ensure that each class gets representation according to its proportion in the actual
data-set.
Random sub-sampling is a variation of the holdout method in which the holdout method is repeated k times. In each iteration, a certain proportion is randomly
selected for training (possibly with stratification). The error rates on the different
iterations are averaged to yield an overall error rate. It is also known as the repeated
holdout method.
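
A small Python sketch of a stratified holdout split (roughly two-thirds training, one-third test) is shown below; the function name and the (record, label) representation are assumptions made for this example:

import random
from collections import defaultdict

def stratified_holdout(records, labels, train_fraction=2/3, seed=1):
    by_class = defaultdict(list)
    for rec, lab in zip(records, labels):
        by_class[lab].append((rec, lab))
    train, test = [], []
    for lab, items in by_class.items():
        random.Random(seed).shuffle(items)
        cut = int(round(train_fraction * len(items)))
        train.extend(items[:cut])            # each class keeps roughly its original proportion
        test.extend(items[cut:])
    return train, test

train, test = stratified_holdout(list(range(12)), ['a'] * 8 + ['b'] * 4)
print(len(train), len(test))                 # 8 4
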
4.3.3 K-Cross-Validation

In k-fold cross-validation, (fig 4.6) the initial data are randomly partitioned into k
mutually exclusive subsets or folds, D1 , D2 , ..., Dk each of approximately equal size.
Training and testing is performed k times. In iteration i, partition Di is reserved as
the test set, and the remaining partitions are collectively used to train the model.
That is, in the first iteration, subsets D2 , .., Dk collectively serve as the training set in
order to obtain a first model, which is tested on D1 ; the second iteration is trained on
subsets D1 , D3 , ...., Dk and tested on D2 ; and so on. Unlike the holdout and random
subsampling methods above, here, each sample is used the same number of times for
training and once for testing. For classification, the accuracy estimate is the overall
number of correct classifications from the k iterations, divided by the total number
of tuples in the initial data. For prediction, the error estimate can be computed as the
total loss from the k iterations, divided by the total number of initial tuples.
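
The following Python sketch illustrates the k-fold procedure just described; train_and_test is a hypothetical placeholder for any learner that is fit on the training folds and returns the number of correct predictions on the test fold:

def k_fold_accuracy(data, k, train_and_test):
    folds = [data[i::k] for i in range(k)]            # k roughly equal partitions
    correct = 0
    for i in range(k):
        test = folds[i]                               # partition D_i is held out
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        correct += train_and_test(train, test)
    return correct / len(data)                        # overall accuracy estimate
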
Leave one out CV
Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples. That is, only one sample is left out at a time for the test set.
Some features of leave-one-out CV are:
1. Makes best use of the data.
2. Involves no random sub-sampling.

Figure 4.6: k-cross-validation

Disadvantages of leave-one-out CV:

1. Stratification is not possible.
2. It is very computationally expensive.
4.3.4 Bootstrap

Cross-validation uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set. The bootstrap
uses sampling with replacement to form the training set:
1. Sample a dataset of n instances n times with replacement to form a new dataset
of n instances.
2. Use this data as the training set.
3. Use the instances from the original dataset that don't occur in the new training
set for testing.
4. A particular instance has a probability of 1 − 1/n of not being picked. Thus its probability of ending up in the test data (as n tends to infinity) is:

(1 − 1/n)^n ≈ e^{−1} ≈ 0.368    (4.9)

5. This means the training data will contain approximately 63.2% of the instances
and the test data will contain approximately 36.8% of the instances.
6. The error estimate on the test data will be very pessimistic because the classifier
is trained on just 63% of the instances.
7. Therefore, combine it with the training (resubstitution) error:

err = 0.632 × e_test + 0.368 × e_train    (4.10)

8. The training error gets less weight than the error on the test data.
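
A minimal Python sketch of one bootstrap iteration and the 0.632 estimate of Eq. (4.10) is given below; build_model and error_on are hypothetical placeholders for a learner and an error-rate function:

import random

def bootstrap_632(data, build_model, error_on, seed=1):
    rng = random.Random(seed)
    train = [rng.choice(data) for _ in data]     # sample n instances with replacement
    test = [x for x in data if x not in train]   # ~36.8% of instances, on average
    model = build_model(train)
    # weight the test error by 0.632 and the training error by 0.368, per Eq. (4.10)
    return 0.632 * error_on(model, test) + 0.368 * error_on(model, train)
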
4.3.5 Comparing Two Classifier Models

Suppose that we have generated two models, M1 and M2 (for either classification or
prediction), from our data. We have performed 10-fold cross-validation to obtain a
mean error rate for each. How can we determine which model is best? It may seem
intuitive to select the model with the lowest error rate; however, the mean error rates
are just estimates of error on the true population of future data cases. There can be
considerable variance between error rates within any given 10-fold cross-validation
experiment. Although the mean error rates obtained for M1 and M2 may appear
different, that difference may not be statistically significant. What if any difference
between the two can just be attributed to chance? The following points explain in detail
how to assess whether their difference is statistically significant.
1. Assume that we have two classifiers, M1 and M2 , and we would like to know
which one is better for a classification problem.
2. We test the classifiers on n test data sets D1, D2, ..., Dn and we receive error rate
estimates e11, e12, ..., e1n for classifier M1 and error rate estimates e21, e22, ..., e2n
for classifier M2.
3. Using these error rate estimates, we can compute the mean error rate ē1 for classifier M1
and the mean error rate ē2 for classifier M2.
4. These mean error rates are just estimates of error on the true population of
future data cases.
5. We note that the error rate estimates e11, e12, ..., e1n for classifier M1 and the error rate
estimates e21, e22, ..., e2n for classifier M2 are paired. Thus, we consider the differences d1, d2, ..., dn, where dj = e1j − e2j.
6. The differences d1, d2, ..., dn are instantiations of n random variables D1, D2, ..., Dn
with mean μ_D and standard deviation σ_D.
7. We need to establish confidence intervals for μ_D in order to decide whether the
difference in the generalization performance of the classifiers M1 and M2 is
statistically significant or not.
8. Since the standard deviation σ_D is unknown, we approximate it using the sample
standard deviation s_d:

s_d = sqrt( (1/n) Σ_{i=1..n} [ (e1i − e2i) − (ē1 − ē2) ]² )    (4.11)
9. The t-statistic is

t = (d̄ − μ_D) / (s_d / √n)    (4.12)

where d̄ is the observed mean of the differences and μ_D is the hypothesized mean difference (zero under the null hypothesis).

10. The t-statistic is governed by a t-distribution with n − 1 degrees of freedom.
Figure 4.7 shows the t-distribution curve (Reference [4]).

Figure 4.7: t-distribution curve (Reference [4])

11. If d̄ and s_d are the mean and standard deviation of the normally distributed
differences of n random pairs of errors, a (1 − α)·100% confidence interval for μ_D =
μ_1 − μ_2 is:

d̄ − t_{α/2} · s_d/√n  <  μ_D  <  d̄ + t_{α/2} · s_d/√n    (4.13)

where t_{α/2} is the t-value with v = n − 1 degrees of freedom, leaving an area of
α/2 to the right.


12. If t > z or t < −z, where z = t_{α/2} is the tabulated critical value, then t lies in the rejection region, within the tails of the distribution.
This means that we can reject the null hypothesis that the means of M1 and M2 are
the same and conclude that there is a statistically significant difference between the
two models. Otherwise, if we cannot reject the null hypothesis, we then conclude
that any difference between M1 and M2 can be attributed to chance.
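
The following Python sketch (with invented per-fold error rates) carries out the paired comparison of steps 1-12: it computes the mean difference, s_d from Eq. (4.11), and the t-statistic of Eq. (4.12), which would then be compared against a t-table value with n − 1 degrees of freedom:

import math

def paired_t(errors_m1, errors_m2):
    n = len(errors_m1)
    d = [e1 - e2 for e1, e2 in zip(errors_m1, errors_m2)]   # paired differences
    d_bar = sum(d) / n                                      # mean difference
    s_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / n) # Eq. (4.11)
    t = d_bar / (s_d / math.sqrt(n))                        # Eq. (4.12), null mu_D = 0
    return d_bar, s_d, t

e1 = [0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.10, 0.15, 0.11, 0.12]  # invented error rates
e2 = [0.14, 0.13, 0.15, 0.12, 0.14, 0.15, 0.12, 0.16, 0.13, 0.14]
print(paired_t(e1, e2))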

4.4 ROC Curves

The ROC curve (Receiver Operating Characteristic) is commonly used to examine the tradeoff between the detection of true positives, while avoiding the false
positives. As you might suspect from the name, ROC curves were developed by
engineers in the field of communications around the time of World War II; receivers
of radar and radio signals needed a method to discriminate between true signals
and false alarms. The same technique is useful today for visualizing the efficacy of
machine learning models.
The characteristics of a typical ROC diagram are depicted in the following plot(figure
4.8 Reference[2]). Curves are defined on a plot with the proportion of true positives
on the vertical axis, and the proportion of false positives on the horizontal axis. Because these values are equivalent to sensitivity and (1 − specificity), respectively, the
diagram is also known as a sensitivity/specificity plot:

Figure 4.8: ROC curves (Reference[2])


The points comprising ROC curves indicate the true positive rate at varying false
positive thresholds. To create the curves, a classifier's predictions are sorted by
the model's estimated probability of the positive class, with the largest values first.
Beginning at the origin, each prediction's impact on the true positive rate and false
positive rate will result in a curve tracing vertically (for a correct prediction), or
horizontally (for an incorrect prediction).
To illustrate this concept, three hypothetical classifiers are contrasted in the previous plot. First, the diagonal line from the bottom-left to the top-right corner of
the diagram represents a classifier with no predictive value. This type of classifier
detects true positives and false positives at exactly the same rate, implying that the
classifier cannot discriminate between the two. This is the baseline by which other
classifiers may be judged; ROC curves falling close to this line indicate models that
are not very useful. Similarly, the perfect classifier has a curve that passes through
the point at 100 percent true positive rate and 0 percent false positive rate. It is
able to correctly identify all of the true positives before it incorrectly classifies any
negative result. Most real-world classifiers are similar to the test classifier; they fall
somewhere in the zone between perfect and useless.
The closer the curve is to the perfect classifier, the better it is at identifying positive values. This can be measured using a statistic known as the area under the ROC
curve (abbreviated AUC). The AUC, as you might expect, treats the ROC diagram
as a two-dimensional square and measures the total area under the ROC curve. AUC
ranges from 0.5 (for a classifier with no predictive value), to 1.0 (for a perfect classifier). A convention for interpreting AUC scores uses a system similar to academic
letter grades:
1. 0.9 to 1.0 = A (outstanding)
2. 0.8 to 0.9 = B (excellent/good)
3. 0.7 to 0.8 = C (acceptable/fair)
4. 0.6 to 0.7 = D (poor)
5. 0.5 to 0.6 = F (no discrimination)
As with most scales similar to this, the levels may work better for some tasks
than others; the categorization is somewhat subjective.
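
A short Python sketch of the construction described above is given below (the scores and labels are invented): predictions are sorted by estimated probability of the positive class, the curve moves up for each true positive and right for each false positive, and the AUC is accumulated with the trapezoidal rule.

def roc_points(scores, labels):
    ranked = sorted(zip(scores, labels), key=lambda s: -s[0])
    pos = sum(1 for _, y in ranked if y == 1)
    neg = len(ranked) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1          # correct prediction: trace vertically
        else:
            fp += 1          # incorrect prediction: trace horizontally
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    # Trapezoidal rule over the (false positive rate, true positive rate) points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]   # hypothetical model outputs
labels = [1,   1,   0,   1,   0,    1,   0,   0]
print(auc(roc_points(scores, labels)))               # 0.8125 for this toy data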


4.5 Ensemble Methods

Motivation
1. Ensemble models improve accuracy and robustness over single-model methods
2. Applications:
(a) distributed computing
(b) privacy-preserving applications
(c) large-scale data with reusable models
(d) multiple sources of data
3. Efficiency: a complex problem can be decomposed into multiple sub-problems
that are easier to understand and solve (divide-and-conquer approach)
4.5.1 Why Ensembles Work

1. Intuition: combining diverse, independent opinions in human decision-making acts as a protective mechanism (e.g., a stock portfolio).
2. Overcoming the limitations of a single hypothesis: the target function may not be implementable with individual classifiers, but may be approximated by model averaging.
3. Gives a global picture of the data.

Figure 4.9: Ensemble Gives Global picture


4.5.2 Ensembles Work in Two Ways

1. Learn to Combine

Figure 4.10: Learn to Combine (Reference[3])

2. Learn By Consensus

Figure 4.11: Learn By Consensus (Reference[3])


4.5.3 Learn to Combine

Pros
1. Gets useful feedback from the labeled data.
2. Can potentially improve accuracy (see the sketch after this list).
Cons
1. Needs to keep the labeled data around to train the ensemble.
2. May overfit the labeled data.
3. Cannot work when no labels are available.
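A minimal sketch of the learn-to-combine idea, assuming scikit-learn is available: the base learners' predictions are merged by a meta-learner that is itself trained on labeled data, which is exactly why this approach needs labels and can overfit them. The particular estimators chosen here are illustrative, not prescribed by this report.

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Base models make predictions; the final estimator learns how to combine them
# using the labeled training data.
combiner = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=5,
)
# Usage: combiner.fit(X_train, y_train); combiner.predict(X_test)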
4.5.4 Learn by Consensus

Pros
1. Does not need labeled data.
2. Can improve generalization performance.
Cons
1. No feedback from the labeled data.
2. Requires the assumption that a consensus is better.
4.5.5 Bagging

Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set, Di, of d tuples is sampled with replacement from the original set of tuples, D. The term bagging stands for bootstrap aggregation; each training set is a bootstrap sample. Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model, Mi, is learned for each training set, Di. To classify an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote. The bagged classifier, M, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average of the values predicted for a given test tuple. The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data. The increased accuracy occurs because the composite model reduces
the variance of the individual classifiers. For prediction, it was theoretically proven that a bagged predictor will always have improved accuracy over a single predictor derived from D.
Algorithm: The bagging algorithm creates an ensemble of models (classifiers or predictors) for a learning scheme, where each model is given an equally weighted vote (a Python sketch follows the pseudocode).
Input: D, a set of d training tuples; k, the number of models in the ensemble; a learning scheme (e.g., decision tree algorithm, back-propagation, etc.)
Output: A composite model, M.
Method:
(1) for i = 1 to k do // create k models:
(2) create a bootstrap sample, Di, by sampling D with replacement;
(3) use Di to derive a model, Mi;
(4) end for
To use the composite model on a tuple, X:
(1) if classification then
(2) let each of the k models classify X and return the majority vote;
(3) if prediction then
(4) let each of the k models predict a value for X and return the average predicted value;
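To make the pseudocode concrete, here is a small Python sketch of bagging with decision trees as the base learners. It assumes NumPy arrays X and y with integer-coded class labels and uses scikit-learn only for the base model; the function names bagging_fit and bagging_predict are invented for this illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=10, seed=0):
    # Learn k models, each from a bootstrap sample Di of the d training tuples.
    rng = np.random.default_rng(seed)
    d = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, d, size=d)      # sample d tuples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Each model casts one equally weighted vote; the most frequent class wins.
    votes = np.array([m.predict(X) for m in models])      # shape (k, n_tuples)
    return np.array([np.bincount(col).argmax() for col in votes.T])

For continuous targets, the same scheme applies with a regressor as the base model and the majority vote replaced by the average of the k predictions.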

4.5.6 Boosting

Principles
1. Boosting combines a set of weak learners into a strong learner.
2. It is an iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.
3. Initially, all N records are assigned equal weights; unlike bagging, the weights may change at the end of each boosting round.
4. Records that are wrongly classified will have their weights increased.
5. Records that are classified correctly will have their weights decreased.
6. Equal weights are assigned to each training tuple (1/d for round 1).
7. After a classifier Mi is learned, the weights are adjusted to allow the subsequent classifier, Mi+1, to pay more attention to the tuples that were misclassified by Mi.
8. The final boosted classifier, M, combines the votes of all the individual classifiers, where the weight of each classifier's vote is a function of its accuracy.
9. AdaBoost is a popular boosting algorithm.
AdaBoost Boosting Algorithm
Input:
1) A training set D containing d tuples
2) k, the number of rounds
3) A classification learning scheme
Output:
A composite model
Method:
1. The data set D contains d class-labeled tuples (X1, y1), (X2, y2), ..., (Xd, yd).
2. Initially, assign an equal weight of 1/d to each tuple.
3. To generate k base classifiers, we need k rounds (iterations).
4. In round i, tuples from D are sampled with replacement to form Di (of size d).
5. Each tuple's chance of being selected depends on its weight.
6. A base classifier, Mi, is derived from the training tuples of Di.
7. The error of Mi is tested using Di, and the weights of the training tuples are adjusted according to how they were classified:
Correctly classified: decrease weight
Incorrectly classified: increase weight
8. The weight of a tuple indicates how hard it is to classify (the harder the tuple, the higher its weight).
9. Some classifiers may be better at classifying some hard tuples than others.
10. We finally have a series of classifiers that complement each other.
11. Error estimate:

error(M_i) = \sum_{j} w_j \cdot err(X_j)    (4.14)

where err(X_j) is the misclassification error of tuple X_j (it equals 1 if X_j is misclassified and 0 otherwise).
12. If the classifier's error exceeds 0.5, we abandon it and try again with a new Di and a new Mi derived from it.
13. error(Mi) determines how the weights of the training tuples are updated:
14. If a tuple is correctly classified in round i, its weight is multiplied by

\frac{error(M_i)}{1 - error(M_i)}    (4.15)

15. The weights of all correctly classified tuples are adjusted in this way.
16. The weights of all tuples (including the misclassified ones) are then normalized by multiplying each weight by the factor

nf = \frac{\text{sum of old weights}}{\text{sum of new weights}}    (4.16)

17. The weight assigned to classifier Mi is

\log \frac{1 - error(M_i)}{error(M_i)}

18. The lower a classifier's error rate, the more accurate it is, and therefore the higher its weight for voting should be.
19. This same quantity is used as the weight of Mi's vote when classifying a new tuple.
20. For each class c, sum the weights of every classifier that assigned class c to X (the unseen tuple).
21. The class with the highest sum is the winner. (A Python sketch of this procedure follows.)
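The sketch below (an illustration added here, not the report's own code) implements the boosting procedure just described for a two-class problem with labels 0 and 1, using decision stumps from scikit-learn as the weak learners; the lines realizing equations (4.14) to (4.16) are marked in the comments, and the function names are assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=10, seed=0):
    # AdaBoost-style training for binary labels y in {0, 1}.
    rng = np.random.default_rng(seed)
    d = len(X)
    w = np.full(d, 1.0 / d)                       # equal initial weights 1/d
    models, alphas = [], []
    attempts = 0
    while len(models) < k and attempts < 5 * k:   # bounded number of retries
        attempts += 1
        idx = rng.choice(d, size=d, replace=True, p=w)       # sample Di by weight
        m = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        miss = m.predict(X) != y                  # err(Xj) = 1 if Xj is misclassified
        error = float(np.sum(w[miss]))            # eq. (4.14)
        if error > 0.5:                           # abandon this round, try a new Di
            continue
        error = max(error, 1e-10)                 # guard against a perfect round
        w = np.where(miss, w, w * error / (1.0 - error))     # eq. (4.15)
        w = w / w.sum()                           # eq. (4.16): normalize the weights
        models.append(m)
        alphas.append(np.log((1.0 - error) / error))         # weight of Mi's vote
    return models, alphas

def adaboost_predict(models, alphas, X):
    # For each class, sum the vote weights of the classifiers that chose it.
    scores = np.zeros((len(X), 2))
    for m, a in zip(models, alphas):
        scores[np.arange(len(X)), m.predict(X)] += a
    return scores.argmax(axis=1)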


Chapter 5
Conclusion and Future Scope

5.1 Comparative Study

To practically explore the theoretical aspects of the data mining models and the techniques used to evaluate them, we conducted a small-scale exploratory study in the data mining tool Weka, developed by the University of Waikato, New Zealand. The following tables summarize the results of this exploratory study.

Figure 5.1: Weka Screen Shots


[Results tables: for each dataset studied (Breast Cancer.arff, 286 instances with 10 attributes; Diabetes.arff, 768 instances; Iris.arff, 150 instances), the J48 decision tree and the Naive Bayesian classifier were evaluated under the schemes discussed earlier: evaluation on the training set only, 10-fold cross-validation with random seeds 0 and 20, and single or repeated hold-out splits (66% training / 34% test). For every combination the tables report the numbers of correctly and incorrectly classified instances and, for each value of the class attribute, the TP rate, FP rate, precision, recall, F-measure, ROC area, and the confusion matrix.]


5.2 Conclusion

From the exploratory tests carried out on the datasets in Weka, we can confirm some of the theoretical aspects explored in depth in the preceding sections (a small scikit-learn sketch of these evaluation schemes follows the list):
1. Evaluating a classifier only on the training set yields highly optimistic, and therefore biased, results.
2. Increasing the value of k increases the credibility of the results; the best results are obtained when k = 10.
3. Repeating k-fold cross-validation over several iterations yields more credible results; the best results are obtained when it is repeated 10 times.
4. The hold-out method, when repeated iteratively, yields more accurate results; the best results are obtained when it is repeated 10 times.
5. Naive Bayesian and decision tree induction (J48) work especially well on datasets that contain more nominal data than numeric data.
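As a rough sketch of how these evaluation schemes can be reproduced outside Weka (assuming scikit-learn; the built-in Iris data stands in for Iris.arff and Gaussian Naive Bayes for Weka's Naive Bayesian classifier), the snippet below contrasts training-set accuracy, 10-fold cross-validation under two different random seeds, and a repeated hold-out split:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB()

# 1. Training-set evaluation (optimistically biased)
train_acc = clf.fit(X, y).score(X, y)

# 2.-3. 10-fold cross-validation with two different random seeds
cv_seed0 = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
cv_seed20 = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=20))

# 4. Hold-out split (66% train / 34% test) repeated 20 times
holdout = cross_val_score(clf, X, y, cv=ShuffleSplit(n_splits=20, test_size=0.34, random_state=0))

print(train_acc, cv_seed0.mean(), cv_seed20.mean(), holdout.mean())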

5.3 Future Scope

The comparative study can be extended to new levels by incorporating the R caret package and carrying out these comparative tests on more complex data sets with 1000+ entries. A cost-sensitive comparative study can also be seen as an extension of this seminar, which again can be carried out in R using the ROCR package. The comparative study conducted here, and the ones proposed as future scope, can be very helpful in designing machine learning systems and evaluating their accuracy.


References
[1] Data Mining Concepts and Techniques: Jiawei Han, Micheline Kamber, Jian Pei
[2] Machine Learning with R: Brett Lantz
[3] Statistical Learning: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
[4] Statistics: David Freedman, Robert Pisani
[5] Inferential Statistics: Course Track, Udacity
[6] Descriptive Statistics: Course Track, Udacity
[7] Data Mining with Weka: Course Track, University of Waikato, New Zealand
