
Classification

Classification: Definition
Given a collection of records (the training set), where each record is labeled with a class, learn a model that predicts the class from the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.
Illustrating Classification Task

Training Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

A learning algorithm induces a model from the training set (induction); the model is then applied to the test set to deduce the class labels (deduction).
Classification Process (1): Model Construction

Training Data:
  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

A classification algorithm learns a classifier (the model) from the training data, e.g.:
  IF rank = professor OR years > 6 THEN tenured = yes
Classification Process (2): Use the Model in Prediction

Testing Data:
  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) -> Tenured? (the model predicts yes, since rank = professor)
The classifier is applied to the testing data to estimate its accuracy, and to unseen data to predict the class label.
General approach to classification
The training set consists of records with known class labels.
The training set is used to build a classification model.
A labeled test set of previously unseen records is used to evaluate the quality of the model.
The classification model is applied to new records with unknown class labels.
Model Evaluation
Metrics for Performance Evaluation
  How to evaluate the performance of a model?
Methods for Performance Evaluation
  How to obtain reliable estimates?
Methods for Model Comparison
  How to compare the relative performance among competing models?
Predictor Error Measures
Measure predictor accuracy: how far off the predicted value is from the actual known value.
Loss function: measures the error between the actual value y_i and the predicted value y_i'.
  Absolute error: |y_i - y_i'|
  Squared error: (y_i - y_i')^2
Test error (generalization error): the average loss over the test set.
  Mean absolute error:  (1/d) Σ_{i=1}^{d} |y_i - y_i'|
  Mean squared error:   (1/d) Σ_{i=1}^{d} (y_i - y_i')^2
  Relative absolute error:  Σ_{i=1}^{d} |y_i - y_i'|  /  Σ_{i=1}^{d} |y_i - ȳ|
  Relative squared error:   Σ_{i=1}^{d} (y_i - y_i')^2  /  Σ_{i=1}^{d} (y_i - ȳ)^2    (ȳ is the mean of the actual values)
The mean squared error exaggerates the presence of outliers.
The (square) root mean squared error and, similarly, the root relative squared error are also popularly used.
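To make the error measures above concrete, here is a minimal Python sketch (not part of the original slides); the function name and the example values are illustrative.

```python
# Illustrative sketch of the predictor error measures defined above.
def error_measures(y, y_pred):
    d = len(y)
    y_mean = sum(y) / d
    mae = sum(abs(a - p) for a, p in zip(y, y_pred)) / d        # mean absolute error
    mse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / d      # mean squared error
    rmse = mse ** 0.5                                           # root mean squared error
    rae = sum(abs(a - p) for a, p in zip(y, y_pred)) / \
          sum(abs(a - y_mean) for a in y)                       # relative absolute error
    rse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / \
          sum((a - y_mean) ** 2 for a in y)                     # relative squared error
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "RAE": rae, "RSE": rse}

print(error_measures([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0]))
```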
Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc.
Confusion matrix:

                            PREDICTED CLASS
                            Class=Yes   Class=No
  ACTUAL CLASS  Class=Yes   a (TP)      b (FN)
                Class=No    c (FP)      d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
True Positive (TP): the number of positive tuples that were correctly labeled by the classifier.
True Negative (TN): the number of negative tuples that were correctly labeled by the classifier.
False Positive (FP): the number of negative tuples that were incorrectly labeled as positive by the classifier.
False Negative (FN): the number of positive tuples that were incorrectly labeled as negative by the classifier.
Metrics for Performance Evaluation

                            PREDICTED CLASS
                            Class=Yes   Class=No
  ACTUAL CLASS  Class=Yes   a (TP)      b (FN)
                Class=No    c (FP)      d (TN)

Most widely-used metric:
  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
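As an illustration of the confusion matrix and accuracy above, here is a small Python sketch; the label lists and names are made up for the example.

```python
# Tally the confusion matrix cells for a binary (Yes/No) problem and compute accuracy.
def confusion_counts(actual, predicted, positive="Yes"):
    a = b = c = d = 0
    for act, pred in zip(actual, predicted):
        if act == positive and pred == positive:
            a += 1   # TP
        elif act == positive:
            b += 1   # FN
        elif pred == positive:
            c += 1   # FP
        else:
            d += 1   # TN
    return a, b, c, d

actual    = ["Yes", "No", "Yes", "No", "No",  "Yes"]
predicted = ["Yes", "No", "No",  "No", "Yes", "Yes"]
a, b, c, d = confusion_counts(actual, predicted)
accuracy = (a + d) / (a + b + c + d)   # (TP + TN) / (TP + TN + FP + FN)
print(a, b, c, d, accuracy)
```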
Cost-Sensitive Measures
  Precision (p) = a / (a + c)
  Recall (r)    = a / (a + b)
  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

Precision is biased towards C(Yes|Yes) & C(Yes|No).
Recall is biased towards C(Yes|Yes) & C(No|Yes).
F-measure is biased towards all except C(No|No).

  Weighted Accuracy = (w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d)
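Continuing the illustration, a hedged sketch of the cost-sensitive measures computed from the cells a, b, c, d; the default weights and the example counts are assumptions.

```python
# Precision, recall, F-measure and weighted accuracy from the confusion matrix
# cells a (TP), b (FN), c (FP), d (TN); the weights w1..w4 are illustrative.
def cost_sensitive_measures(a, b, c, d, w1=1.0, w2=1.0, w3=1.0, w4=1.0):
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)   # = 2a / (2a + b + c)
    weighted_acc = (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)
    return precision, recall, f_measure, weighted_acc

print(cost_sensitive_measures(a=40, b=10, c=5, d=45))
```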
Model Evaluation (recap)
Metrics for Performance Evaluation
  How to evaluate the performance of a model?
Methods for Performance Evaluation
  How to obtain reliable estimates?
Methods for Model Comparison
  How to compare the relative performance among competing models?
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on factors other than the learning algorithm:
  Class distribution
  Cost of misclassification
  Size of training and test sets
Methods of Estimation
  Holdout: reserve 2/3 for training and 1/3 for testing
  Random subsampling: repeated holdout
  Cross validation: partition the data into k disjoint subsets
    k-fold: train on k-1 partitions, test on the remaining one
    Leave-one-out: k = n
  Bootstrap: sampling with replacement
Evaluating the Accuracy of a Classifier or Predictor
1. Holdout method
  The given data is randomly partitioned into two independent sets:
    Training set (e.g., 2/3) for model construction
    Test set (e.g., 1/3) for performance estimation
  A classification model is then induced from the training set and its performance is evaluated on the test set.
  Limitations:
    Fewer labeled examples are available for training because some of the records are withheld for testing.
    The model may be highly dependent on the composition of the training and test sets: the smaller the training set, the larger the variance of the model; conversely, if the training set is too large, the test set becomes small and the accuracy estimate is less reliable.
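A minimal sketch of the holdout split described above, using the 2/3 vs 1/3 split from the slide; the names, the seed, and the toy data are illustrative.

```python
import random

# Minimal holdout split: reserve 2/3 of the records for training, 1/3 for testing.
def holdout_split(records, train_fraction=2/3, seed=0):
    shuffled = records[:]                  # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]  # (training set, test set)

data = list(range(15))                     # stand-in for 15 labeled records
train, test = holdout_split(data)
print(len(train), len(test))               # 10 training records, 5 test records
```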
2. Random Subsampling: a variation of holdout
  Repeat the holdout k times to improve the estimate of a classifier's performance, where k is the number of iterations.
  In each iteration, select part of the dataset for training and the rest for testing.
  Overall accuracy = average of the accuracies obtained over the k iterations.
  Limitation:
    There is no control over the number of times each record is used for testing and training; consequently, some records might be used for training more often than others.
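Random subsampling is then just the holdout repeated k times, sketched below (reusing holdout_split from the previous block; train_and_score is a hypothetical stand-in for building a model on the training part and returning its test accuracy).

```python
# Random subsampling: repeat the holdout split k times and average the accuracies.
def random_subsampling(records, train_and_score, k=10):
    scores = [train_and_score(*holdout_split(records, seed=i)) for i in range(k)]
    return sum(scores) / k   # overall accuracy = average over the k iterations
```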
3. Cross-validation (k-fold, where k = 10 is most popular)
  In this approach, each record is used the same number of times for training and exactly once for testing.
  Two-fold cross-validation: the dataset is partitioned into two equal-sized subsets. First, one subset is chosen for training and the other for testing; then the roles are swapped, so the training set becomes the test set and vice versa. The total error is obtained by summing up the errors of both runs. In this example, each record is used exactly once for training and once for testing.
  k-fold cross-validation: randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size.
    At the i-th iteration, use Di as the test set and the remaining subsets as the training set.
  Leave-one-out: a special case of k-fold with k = N, the size of the dataset, so each test set contains only one record.
    Advantage: the training set uses as much data as possible, and the test sets are mutually exclusive and effectively cover the entire dataset.
    Drawback: the computational cost is very high, as the procedure is run N times and each test set is very small compared to the training set.
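A minimal k-fold cross-validation sketch along the lines described above; the names and the toy data are illustrative.

```python
import random

# k-fold cross-validation sketch: randomly partition the records into k disjoint
# folds; each record is used exactly once for testing and k-1 times for training.
def k_fold_splits(records, k=10, seed=0):
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]          # k roughly equal, disjoint subsets
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

data = list(range(20))
for train, test in k_fold_splits(data, k=4):
    print(len(train), len(test))                        # 15 train / 5 test in each of the 4 iterations
```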
Evaluating the Accuracy of a Classifier or Predictor

4. Bootstrap
  The methods presented so far assume that the training records are sampled without replacement; as a result, there are no duplicate records in the training and test sets.
  The bootstrap samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
  There are several bootstrap methods; a common one is the .632 bootstrap:
    Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^(-1) ≈ 0.368).
  Repeat the sampling procedure k times; the overall accuracy of the model is
    acc(M) = Σ_{i=1}^{k} (0.632 × acc(M_i)_test_set + 0.368 × acc(M_i)_train_set)
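An illustrative sketch of one .632 bootstrap round and of the weighted overall accuracy above; the helper names are assumptions, and acc_pairs stands in for the measured accuracies of the k models.

```python
import random

# One .632 bootstrap round: sample d records with replacement for training;
# records that are never drawn (about 36.8% of them) form the test set.
def bootstrap_sample(records, seed=0):
    rng = random.Random(seed)
    d = len(records)
    chosen = [rng.randrange(d) for _ in range(d)]        # indices drawn with replacement
    train = [records[i] for i in chosen]
    test = [records[i] for i in range(d) if i not in set(chosen)]
    return train, test

# Overall accuracy over k rounds, weighting test and training accuracy as above;
# acc_pairs is a list of (acc on test set, acc on training set) for each model M_i.
def bootstrap_632_accuracy(acc_pairs):
    return sum(0.632 * acc_test + 0.368 * acc_train for acc_test, acc_train in acc_pairs)
```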
Multiclass Problem
In a multiclass problem, the input data is divided into more than two classes.
One-against-rest (1-r): decomposes the multiclass problem into K binary problems. For each class y, all instances that belong to y are considered positive, while the remaining instances are considered negative.
A binary classifier is then used to separate the instances of class y from the rest of the classes.
If an instance is classified as positive, class y receives a vote; if it is classified as negative, all classes except the positive class receive a vote.
Multiclass Problem, cont.
One-against-one (1-1): constructs K(K-1)/2 binary classifiers, where each classifier is used to distinguish between a pair of classes (y_i and y_j).
In both the 1-r and 1-1 approaches, a test instance is classified by combining the predictions made by the binary classifiers. A voting scheme is used to combine the predictions, and the class that receives the highest number of votes is assigned to the test instance.
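A hedged sketch of the one-against-rest voting scheme described above; binary_classifiers is a hypothetical mapping from each class y to an already-trained binary classifier that returns True when it predicts "positive".

```python
# One-against-rest voting: collect a vote per binary classifier and pick the
# class with the highest number of votes.
def one_vs_rest_predict(x, binary_classifiers):
    votes = {y: 0 for y in binary_classifiers}
    for y, clf in binary_classifiers.items():
        if clf(x):                       # classified as positive: class y gets a vote
            votes[y] += 1
        else:                            # classified as negative: every other class gets a vote
            for other in votes:
                if other != y:
                    votes[other] += 1
    return max(votes, key=votes.get)     # class that receives the highest number of votes
```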
Overfitting
Overfitting refers to a model that models the training data too well. It happens when a model learns the detail and noise in the training data to the extent that this negatively impacts the performance of the model on new data: the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data, which harms the model's ability to generalize.
In machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting is normally observed when a model is excessively complex, i.e., when it has too many parameters with respect to the number of training observations. A model that has been overfit exhibits poor predictive performance.
How to avoid Overfitting?
1. Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.
2. Use cross-validation techniques such as k-fold cross-validation.
3. Use a resampling technique to estimate model accuracy.
4. Get more training data.
5. Regularization: penalize certain parts of the parameter space or introduce additional constraints to deal with a potentially ill-posed problem.
Ensemble Methods
Construct a set of classifiers from the training data.
Predicting the class label of previously unseen records by aggregating the predictions made by multiple classifiers to increase the classification accuracy is called an ensemble (or classifier combination) method.
An ensemble method constructs a set of base classifiers from the training data and performs classification by taking a vote on the predictions made by each classifier.
General Idea
  Original training data D
  Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt
  Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct
  Step 3: Combine the classifiers into C*
Ensemble Classifiers (EC)
An ensemble classifier constructs a set of base classifiers from the training data.
Methods for constructing an EC:
  Manipulating the training set
  Manipulating the input features
  Manipulating the class labels
  Manipulating the learning algorithm
Ensemble Classifiers (EC)
Manipulating the training set
  Multiple training sets are created by re-sampling the data according to some sampling distribution.
  The sampling distribution determines how likely it is that an example will be selected for training, and it may vary from one trial to another.
  A classifier is built from each training set using a particular learning algorithm.
  Examples: Bagging and Boosting
    Majority voting - bagging
    More weight to the opinion of the good (accurate) classifiers - boosting
  In bagging, you give equal weight to all classifiers, whereas in boosting you weight each classifier according to its accuracy.
Ensemble Classifiers (EC)
Manipulating the input features
  A subset of the input features is chosen to form each training set.
  The subset can be chosen randomly or based on inputs given by domain experts.
  Good for data that has redundant features.
  Random Forest is an example; it uses decision trees as its base classifiers.
Ensemble Classifiers (EC)
Manipulating the class labels
  Useful when the number of classes is sufficiently large.
  The training data is transformed into a binary class problem by randomly partitioning the class labels into two disjoint subsets, A0 and A1. Training examples whose class labels belong to A0 are assigned to class 0, while those whose labels belong to A1 are assigned to class 1.
  The re-labeled examples are used to train a base classifier.
  By repeating the class re-labeling and model building steps several times, an ensemble of base classifiers is obtained.
  When a test example is presented, each base classifier Ci is used to predict its class label. If the prediction is class 0, then all the classes belonging to A0 receive a vote, and similarly for class 1. Finally, the votes are tallied and the class that receives the highest number of votes is assigned to the test example.
Ensemble Classifiers (EC)
Manipulating the learning algorithm
  Learning algorithms can be manipulated in such a way that applying the algorithm several times to the same training data results in different models.
  Example 1: an ANN can produce different models by changing the network topology or the initial weights of the links between neurons.
  Example 2: an ensemble of decision trees can be constructed by introducing randomness into the tree-growing procedure; instead of choosing the best split attribute at each node, we randomly choose one of the top k attributes.
Bias and Variance Tradeoff
The prediction error for any machine learning algorithm can be broken down into three parts:
  Bias error
  Variance error
  Irreducible error
The irreducible error cannot be reduced regardless of which algorithm is used. It is the error introduced by the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.
Bias error quantifies how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model that keeps missing important trends. Variance, on the other side, quantifies how much the predictions made for the same observation differ from each other. A high-variance model will over-fit on your training population and perform badly on any observation beyond training.
Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm you're using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.
Variance is error due to too much complexity in the learning algorithm you're using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You'll be carrying too much noise from your training data for your model to be very useful for your test data.
Bagging
Also known as bootstrap aggregation: a technique that repeatedly samples (with replacement) from a data set according to a uniform probability distribution. After training the k classifiers on the bootstrap samples, a test instance is assigned to the class that receives the highest number of votes.
  Sampling is uniform, with replacement.
  A classifier is built on each bootstrap sample.
  .632 bootstrap:
    Each bootstrap sample Di contains approximately 63.2% of the original training data.
    The remaining 36.8% is used as the test set.
Bagging
Accuracy of bagging:
  Acc(M) = Σ_{i=1}^{k} (0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set)
Works well for small data sets.
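A minimal bagging sketch along the lines above: bootstrap samples, one base classifier per sample, and a majority vote at prediction time; train_base is a hypothetical stand-in for any base learning algorithm that takes a list of (x, y) records and returns a predict(x) function.

```python
import random

# Bagging sketch: train k base classifiers on bootstrap samples, then predict
# by majority vote over the ensemble.
def bagging_fit(records, train_base, k=10, seed=0):
    rng = random.Random(seed)
    d = len(records)
    models = []
    for _ in range(k):
        sample = [records[rng.randrange(d)] for _ in range(d)]  # uniform, with replacement
        models.append(train_base(sample))
    return models

def bagging_predict(models, x):
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)   # class with the highest number of votes
```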
Bagging - Final Points
Works well if the base classifiers are unstable.
Does not focus on any particular instance of the training data and is therefore less susceptible to model over-fitting when applied to noisy data, since every sample has an equal probability of being selected.
What if we want to focus on particular instances of the training data?
Boosting
An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.
  Initially, all N records are assigned equal weights.
  Unlike bagging, boosting assigns a weight to each training example, and the weights may change at the end of a boosting round.
Boosting
Records that are wrongly classified will have their weights increased.
Records that are classified correctly will have their weights decreased.

  Original Data:       1  2  3  4  5  6  7  8  9  10
  Boosting (Round 1):  7  3  2  8  7  9  4  10 6  3
  Boosting (Round 2):  5  4  9  4  2  5  1  7  4  2
  Boosting (Round 3):  4  4  8  10 4  5  4  6  3  4

Example 4 is hard to classify: its weight is increased, therefore it is more likely to be chosen again in subsequent rounds.
Boosting
Equal weights are assigned to each training tuple (1/d for round 1).
After a classifier Mi is learned, the weights are adjusted to allow the subsequent classifier Mi+1 to pay more attention to the tuples that were misclassified by Mi.
The final boosted classifier M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy.
AdaBoost is a popular boosting algorithm.
Learning Algorithm
(figure: boosting learning-algorithm pseudocode, not reproduced here)
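Since the slide's pseudocode figure is not reproduced, here is a hedged, simplified AdaBoost-style sketch of the weight-update and weighted-vote ideas described above; train_weak is a hypothetical weak learner that returns a predict(x) function, and labels are assumed to be +1 / -1.

```python
import math

# Simplified AdaBoost-style sketch (illustrative reconstruction, not the slide's figure).
def adaboost_fit(records, train_weak, rounds=10):
    d = len(records)
    weights = [1.0 / d] * d                      # equal weights in round 1
    ensemble = []                                # list of (alpha, model) pairs
    for _ in range(rounds):
        model = train_weak(records, weights)
        err = sum(w for w, (x, y) in zip(weights, records) if model(x) != y)
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)  # classifier's vote is a function of its accuracy
        # increase weights of misclassified records, decrease the rest, then renormalise
        weights = [w * math.exp(alpha if model(x) != y else -alpha)
                   for w, (x, y) in zip(weights, records)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((alpha, model))
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * model(x) for alpha, model in ensemble)
    return 1 if score >= 0 else -1
```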
Decision Trees
A decision tree is a flow-chart-like tree structure:
  An internal node denotes a test on an attribute.
  A branch represents an outcome of the test.
  Leaf nodes represent class labels or class distributions.
Classification Techniques
  Decision Tree based methods
  Rule-based methods
  Memory based reasoning
  Neural Networks
  Naïve Bayes and Bayesian Belief Networks
  Support Vector Machines
Example of a Decision Tree

Training Data:
  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc):
  Refund = Yes -> NO
  Refund = No  -> test MarSt
    MarSt = Married            -> NO
    MarSt = Single or Divorced -> test TaxInc
      TaxInc < 80K -> NO
      TaxInc > 80K -> YES
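The tree above, hand-coded as nested conditionals (a sketch of the model, not of tree induction); incomes are in thousands, and since the tree's < 80K / > 80K labels leave 80K itself ambiguous, this sketch assigns the boundary to the YES branch as an assumption.

```python
# The example decision tree written as nested conditionals.
def predict_cheat(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on Taxable Income
    return "No" if taxable_income < 80 else "Yes"

print(predict_cheat("No", "Married", 80))   # -> "No"
print(predict_cheat("No", "Single", 95))    # -> "Yes"
```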
Another Example of Decision Tree

Model (built from the same training data as on the previous slide):
  MarSt = Married            -> NO
  MarSt = Single or Divorced -> test Refund
    Refund = Yes -> NO
    Refund = No  -> test TaxInc
      TaxInc < 80K -> NO
      TaxInc > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

The same framework as in the earlier "Illustrating Classification Task" slide, with a tree induction algorithm as the learning algorithm: a decision tree model is learned from the training set (induction) and then applied to the test set to deduce the class labels of records 11-15 (deduction).
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branches that match the record:
  Refund = No, so take the "No" branch to the MarSt node.
  Marital Status = Married, so take the "Married" branch, which leads to the leaf NO.
Assign Cheat to No.
Decision Tree Induction
Many algorithms:
  Hunt's Algorithm (one of the earliest)
  CART
  ID3, C4.5
  SLIQ, SPRINT
General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t.
General procedure:
  If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
(The training data is the Refund / Marital Status / Taxable Income / Cheat table shown earlier.)
Hunt's Algorithm
Applied to the Refund / Marital Status / Taxable Income training data:
  Step 1: a single node predicting the default class: Don't Cheat.
  Step 2: split on Refund:
    Refund = Yes -> Don't Cheat
    Refund = No  -> Don't Cheat
  Step 3: refine the Refund = No branch by splitting on Marital Status:
    Married            -> Don't Cheat
    Single or Divorced -> Cheat
  Step 4: refine the Single/Divorced branch by splitting on Taxable Income:
    Taxable Income < 80K  -> Don't Cheat
    Taxable Income >= 80K -> Cheat
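A compact Python sketch of Hunt's algorithm (illustrative only): the split at each node is simply taken from a fixed list of attribute tests mirroring the walkthrough above, since this section does not cover split-selection measures; all names and the tiny data sample are assumptions.

```python
# Minimal Hunt's algorithm sketch. Records are (attributes, label) pairs; the
# result is a nested dict of attribute tests ending in class labels.
def hunts(records, tests, default="Don't Cheat"):
    if not records:
        return default                                   # empty D_t: leaf with the default class
    labels = {label for _, label in records}
    if len(labels) == 1:
        return labels.pop()                              # all records in one class: leaf
    if not tests:
        # no tests left: label with the majority class
        return max(labels, key=lambda l: sum(1 for _, lab in records if lab == l))
    name, test = tests[0]
    split = {}
    for rec in records:
        split.setdefault(test(rec[0]), []).append(rec)   # partition D_t by the test outcome
    return {name: {outcome: hunts(subset, tests[1:], default)
                   for outcome, subset in split.items()}}

# Attribute tests mirroring the walkthrough above (names/thresholds as on the slides).
tests = [
    ("Refund", lambda a: a["Refund"]),
    ("MarSt", lambda a: "Married" if a["MarSt"] == "Married" else "Single/Divorced"),
    ("TaxInc", lambda a: "<80K" if a["TaxInc"] < 80 else ">=80K"),
]
data = [({"Refund": "Yes", "MarSt": "Single", "TaxInc": 125}, "Don't Cheat"),
        ({"Refund": "No", "MarSt": "Divorced", "TaxInc": 95}, "Cheat"),
        ({"Refund": "No", "MarSt": "Married", "TaxInc": 100}, "Don't Cheat"),
        ({"Refund": "No", "MarSt": "Single", "TaxInc": 70}, "Don't Cheat")]
print(hunts(data, tests))
```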
How to Specify the Test Condition?
Depends on the attribute type:
  Nominal
  Ordinal
  Continuous
Depends on the number of ways to split:
  2-way split
  Multi-way split
Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as there are distinct values.
  CarType -> {Family}, {Sports}, {Luxury}
Binary split: divides the values into two subsets; need to find the optimal partitioning.
  CarType -> {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}
Splitting Based on Ordinal Attributes
Multi-way split: use as many partitions as there are distinct values.
  Size -> {Small}, {Medium}, {Large}
Binary split: divides the values into two subsets that respect the order; need to find the optimal partitioning.
  Size -> {Small, Medium} vs {Large}, or {Small} vs {Medium, Large}
What about the split Size -> {Small, Large} vs {Medium}? It does not respect the order, so it is not a valid binary split for an ordinal attribute.
Splitting Based on Continuous Attributes
Different ways of handling:
  Discretization to form an ordinal categorical attribute
    Static - discretize once at the beginning
    Dynamic - ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  Binary decision: (A < v) or (A >= v)
    Consider all possible splits and find the best cut
    Can be more compute intensive
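A sketch of the "consider all possible splits and find the best cut" idea for a binary decision (A < v) vs (A >= v). Since this section ends before defining impurity measures such as GINI, split quality is measured here simply by misclassification counts when each side predicts its majority class; the example reuses the Taxable Income column from the earlier training data.

```python
# Find the best binary cut for a continuous attribute by scanning the midpoints
# between adjacent distinct values.
def best_binary_split(values, labels):
    def errors(side_labels):
        # misclassifications if this side predicts its majority class
        return len(side_labels) - max(side_labels.count(l) for l in set(side_labels)) if side_labels else 0

    candidates = sorted(set(values))
    best = None
    for lo, hi in zip(candidates, candidates[1:]):
        v = (lo + hi) / 2                                    # candidate cut point
        left = [l for x, l in zip(values, labels) if x < v]
        right = [l for x, l in zip(values, labels) if x >= v]
        err = errors(left) + errors(right)
        if best is None or err < best[1]:
            best = (v, err)
    return best                                              # (cut point v, misclassifications)

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]        # Taxable Income (in K) from the example
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
# For this particular data several cuts tie, so the first best cut found is returned.
print(best_binary_split(income, cheat))
```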
Splitting Based on Continuous Attributes
  (i) Binary split: Taxable Income > 80K? -> Yes / No
  (ii) Multi-way split: Taxable Income? -> < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
END
