Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class: find a model for the class attribute as a function of the values of the other attributes, so that previously unseen records can be assigned a class as accurately as possible.
Illustrating Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No

A learning algorithm induces a model from the Training Set.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?

The induced model is then applied to the Test Set to predict the missing class labels.
Classification Process (1): Model Construction

[Figure: classification is a two-step process. Step 1 (model construction): classification algorithms are run over the training data to build a classifier. Step 2 (model usage): the classifier is applied to testing data and then to unseen data, e.g. the record (Jeff, Professor, 4).]
Predictor Error Measures

Measure predictor accuracy: measure how far off the predicted value is from the actual known value.

Loss function: measures the error between the actual value $y_i$ and the predicted value $y_i'$:
  Absolute error: $|y_i - y_i'|$
  Squared error: $(y_i - y_i')^2$

Test error (generalization error): the average loss over the test set:
  Mean absolute error: $\frac{1}{d} \sum_{i=1}^{d} |y_i - y_i'|$
  Mean squared error: $\frac{1}{d} \sum_{i=1}^{d} (y_i - y_i')^2$
  Relative absolute error: $\frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}$
  Relative squared error: $\frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$

The mean squared error exaggerates the presence of outliers.
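To make these formulas concrete, here is a minimal Python sketch (assuming NumPy is available; the function name error_measures and the sample values are illustrative only) that computes all four test-error measures:

```python
import numpy as np

def error_measures(y_true, y_pred):
    """Compute MAE, MSE, relative absolute error, relative squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    abs_err = np.abs(y_true - y_pred)
    sq_err = (y_true - y_pred) ** 2
    y_bar = y_true.mean()                     # mean of the actual values
    return {
        "MAE": abs_err.mean(),
        "MSE": sq_err.mean(),
        "RAE": abs_err.sum() / np.abs(y_true - y_bar).sum(),
        "RSE": sq_err.sum() / ((y_true - y_bar) ** 2).sum(),
    }

print(error_measures([10, 12, 8, 11], [9, 14, 8, 10]))
```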
Metrics for Performance Evaluation

Focus on the predictive capability of a model, rather than how fast it classifies or builds models, its scalability, etc.

Confusion Matrix:

                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes   a (TP)      b (FN)
              Class=No    c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative).

True Positive refers to the number of positive tuples that were correctly labeled by the classifier.
Metrics for Performance Evaluation

                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes   a (TP)      b (FN)
              Class=No    c (FP)      d (TN)

Most widely used metric: $\text{Accuracy} = \frac{a + d}{a + b + c + d}$

A cost-sensitive variant attaches a weight to each cell:

$\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$

(setting $w_1 = w_2 = w_3 = w_4 = 1$ recovers plain accuracy).
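As a quick worked example, a few lines of Python with made-up counts and weights (a, b, c, d and w1..w4 are illustrative values only):

```python
# a, b, c, d follow the confusion-matrix cells above: TP, FN, FP, TN.
a, b, c, d = 40, 10, 5, 45                  # hypothetical counts

accuracy = (a + d) / (a + b + c + d)

# Weighted accuracy with illustrative weights; all weights = 1 gives accuracy.
w1, w2, w3, w4 = 2, 1, 1, 2
weighted_accuracy = (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

print(f"accuracy={accuracy:.3f}, weighted accuracy={weighted_accuracy:.3f}")
```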
Model Evaluation

Metrics for Performance Evaluation: how to evaluate the performance of a model?
Methods for Performance Evaluation: how to obtain a reliable estimate of performance?
Methods of Estimation

Holdout: reserve 2/3 for training and 1/3 for testing
Random subsampling: repeated holdout
Cross validation: partition data into k disjoint subsets
  k-fold: train on k-1 partitions, test on the remaining one
  Leave-one-out: k = n
Bootstrap: sampling with replacement
Evaluating the Accuracy of a Classifier or Predictor

1. Holdout method
The given data is randomly partitioned into two independent sets:
  Training set (e.g., 2/3) for model construction
  Test set (e.g., 1/3) for performance estimation
A classification model is induced from the training set and its performance is evaluated on the test set.
Limitations:
  Fewer labeled examples are available for training because some of the records are withheld for testing.
  The model may be highly dependent on the composition of the training and test sets. The smaller the training set, the larger the variance of the model. Conversely, if the training set is too large, the test set becomes small and the estimated accuracy is less reliable.
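A minimal holdout sketch, assuming scikit-learn and its bundled Iris data (any classifier could stand in for the decision tree):

```python
# Holdout: 2/3 of the records for training, 1/3 withheld for testing.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```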
2. Random Subsampling: a variation of holdout
  Repeat holdout k times to improve the estimate of a classifier's performance.
  In each iteration, select part of the dataset for training and the rest for testing.
  Overall accuracy = average of the accuracies obtained.
Limitation:
  It has no control over the number of times each record is used for testing and training. Consequently, some records might be used for training more often than others.
3. Cross-validation (k-fold, where k = 10 is most popular)
In this approach, each record is used the same number of times for training and exactly once for testing.
Two-fold cross-validation: the dataset is partitioned into two equal-sized subsets. First, one subset is chosen for training and the other for testing. Next, the roles are swapped: the training set becomes the test set and vice versa. The total error is obtained by summing up the errors from both runs. In this scheme, each record is used exactly once for training and once for testing.
k-fold cross-validation: randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size.
  At the i-th iteration, use Di as the test set and the remaining subsets as the training set.
A special case of k-fold sets k = N, the size of the dataset:
  Leave-one-out: each test set contains only one record.
  The advantage is that the training set uses as much data as possible, and the test sets are mutually exclusive and effectively cover the entire dataset.
  The computational cost is very high, since we run the procedure k times and each test set is very small compared to the training set.
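A minimal 10-fold cross-validation sketch under the same assumptions (scikit-learn, Iris data); note each record lands in the test fold exactly once:

```python
# k-fold CV: train on k-1 folds, test on the held-out fold, average the scores.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("10-fold accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
```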
Evaluating the Accuracy of a Classifier or Predictor

4. Bootstrap
The methods presented so far assume that the training records are sampled without replacement; as a result, there are no duplicate records in the training and test sets.
The bootstrap samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
There are several bootstrap methods; a common one is the .632 bootstrap:
  Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since $(1 - 1/d)^d \approx e^{-1} = 0.368$ for large d).
Repeat the sampling procedure k times; the overall accuracy of the model is

$$acc(M) = \sum_{i=1}^{k} \left( 0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set} \right)$$
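A minimal .632 bootstrap sketch with NumPy and scikit-learn; it reports the average over the k rounds, which is one reasonable reading of the summation above:

```python
# .632 bootstrap: sample d records with replacement; out-of-bag records test.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
d = len(X)
rng = np.random.default_rng(0)
k = 10
acc = 0.0
for _ in range(k):
    train_idx = rng.integers(0, d, size=d)       # sample d times with replacement
    test_mask = np.ones(d, bool)
    test_mask[train_idx] = False                 # records never drawn: the test set
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    acc += (0.632 * model.score(X[test_mask], y[test_mask])
            + 0.368 * model.score(X[train_idx], y[train_idx]))
print(".632 bootstrap accuracy:", acc / k)
```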
Multiclass Problem

In a multiclass problem, the input data is divided into more than two classes.

One-against-rest (1-r): decomposes the multiclass problem into K binary problems. For each class y, all instances that belong to y are considered positive while the remaining instances are considered negative. A binary classifier is then used to separate the instances of class y from the rest of the classes. If an instance is classified as negative, then all classes except the positive class receive a vote.

One-against-one (1-1): constructs K(K-1)/2 binary classifiers, where each classifier is used to distinguish between a pair of classes ($y_i$ and $y_j$).
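A minimal hand-rolled one-against-rest sketch, assuming scikit-learn for the base binary classifier (for real use, scikit-learn also ships OneVsRestClassifier):

```python
# One-against-rest: K binary "class y vs. everything else" classifiers.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Train one binary classifier per class: positives are y, negatives the rest.
binary_models = {c: LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
                 for c in classes}

# Predict by taking the class whose binary model is most confident.
scores = np.column_stack([binary_models[c].predict_proba(X)[:, 1] for c in classes])
y_pred = classes[scores.argmax(axis=1)]
print("training accuracy:", (y_pred == y).mean())
```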
Overfitting

Overfitting refers to a model that models the training data too well. It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data: the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data, and they hurt the model's ability to generalize.

In machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. It is normally observed when a model is excessively complex, i.e., has too many parameters relative to the amount of training data. A model that has been overfit exhibits poor predictive performance.
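The effect is easy to see numerically. Below is a minimal NumPy sketch (the data and degrees are made up for illustration): as the polynomial degree, a proxy for model complexity, grows, training error keeps falling while test error rises once the model starts fitting noise.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(0, 0.3, 30)    # noisy underlying relationship
x_tr, y_tr, x_te, y_te = x[:20], y[:20], x[20:], y[20:]

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_tr, y_tr, degree)   # fit polynomial of given degree
    mse = lambda xs, ys: np.mean((np.polyval(coeffs, xs) - ys) ** 2)
    print(f"degree {degree:2d}: train MSE {mse(x_tr, y_tr):.3f}, "
          f"test MSE {mse(x_te, y_te):.3f}")
```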
How to avoid Overfitting?

1. Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.
2. Use cross-validation techniques such as k-fold cross-validation.
3. Use a resampling technique to estimate model accuracy.
4. Get more training data.
5. Regularization: penalize certain parts of the parameter space or introduce additional constraints to deal with a potentially ill-posed problem.
Ensemble Methods

Construct a set of classifiers from the training data.
Predicting the class label of previously unseen records by aggregating the predictions of multiple classifiers to increase classification accuracy is called an ensemble or classifier-combination method.
An ensemble method constructs a set of base classifiers from the training data and performs classification by taking a vote on the predictions made by each classifier.
General Idea

[Figure:
Step 1 (create multiple data sets): derive data sets D1, D2, ..., Dt-1, Dt from the original training data D.
Step 2 (build multiple classifiers): build one classifier per data set, giving C1, C2, ..., Ct-1, Ct.
Step 3 (combine classifiers): combine them into a single ensemble classifier C*.]
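A minimal sketch of the three steps in Python, assuming scikit-learn and its bundled Iris data: bootstrap t data sets from D, build one decision tree per data set, and combine them by majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
t = 11                                      # number of base classifiers

# Steps 1 and 2: create t resampled data sets, build one classifier per set.
classifiers = []
for _ in range(t):
    idx = rng.integers(0, len(X), size=len(X))
    classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Step 3: combine by majority vote over the base classifiers' predictions.
votes = np.stack([c.predict(X) for c in classifiers])          # shape (t, n)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("ensemble training accuracy:", (majority == y).mean())
```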
Ensemble Classifiers (EC)
An ensemble classifier constructs a set of
base classifiers from the training data
Methods for constructing an EC
Manipulating training set
Manipulating input features
Manipulating class labels
Manipulating learning algorithms
Ensemble Classifiers (EC)

Manipulating the training set
  Multiple training sets are created by re-sampling the data according to some sampling distribution.
  The sampling distribution determines how likely it is that an example will be selected for training, and it may vary from one trial to another.
  A classifier is built from each training set using a particular learning algorithm.
  Examples: Bagging & Boosting
    Majority voting - bagging
    More weightage to the opinion of some good (accurate) classifiers - boosting
  In bagging, you give equal weightage to all classifiers, whereas in boosting you give weightage according to the accuracy of the classifier.
Ensemble Classifiers (EC)

Manipulating input features
  A subset of the input features is chosen to form each training set.
  The subset can be chosen randomly or based on inputs given by domain experts.
  Good for data that has redundant features.
  Random Forest is an example, which uses Decision Trees as its base classifiers.
Ensemble Classifiers (EC)

Manipulating class labels
  Useful when the number of classes is sufficiently large.
  The training data is transformed into a binary class problem by randomly partitioning the class labels into 2 disjoint subsets, A0 & A1. Training examples whose class labels belong to A0 are assigned to class 0, while those belonging to A1 are assigned to class 1.
  The re-labeled examples are used to train a base classifier.
  By repeating the class re-labeling and model building steps several times, an ensemble of base classifiers is obtained.
  When a test example is presented, each base classifier Ci is used to predict its class label. If the label is predicted as class 0, then all the classes belonging to A0 receive a vote, and similarly for class 1. Finally the votes are tallied and the class that receives the highest number of votes is assigned to the test example.
Ensemble Classifiers (EC)

Manipulating the learning algorithm
  Learning algorithms can be manipulated in such a way that applying the algorithm several times to the same training data may result in different models.
  Example 1: an ANN can produce different models by changing the network topology or the initial weights of the links between neurons.
  Example 2: an ensemble of Decision Trees can be constructed by introducing randomness into the tree-growing procedure: instead of choosing the best split attribute at each node, we randomly choose one of the top k attributes.
Bias and Variance tradeoff

The prediction error for any machine learning algorithm can be broken down into three parts:
  Bias Error
  Variance Error
  Irreducible Error
The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced by the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.
Bias error quantifies how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model that keeps missing important trends. Variance, on the other hand, quantifies how much predictions made on the same observation differ from each other. A high variance model will overfit your training population and perform badly on any observation beyond training.
Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm you're using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.

Variance is error due to too much complexity in the learning algorithm you're using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You'll be carrying too much noise from your training data for your model to be very useful for your test data.
Accuracy of bagging:

$$Acc(M) = \sum_{i=1}^{k} \left( 0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set} \right)$$
Bagging - Final Points

Works well if the base classifiers are unstable.
Does not focus on any particular instance of the training data, and is therefore less susceptible to model over-fitting when applied to noisy data, since every sample has an equal probability of being selected.
What if we want to focus on particular instances of the training data?
Boosting

An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records.
  Initially, all N records are assigned equal weights.
  Unlike bagging, boosting assigns a weight to each training example, and the weights may change at the end of each boosting round.
Boosting

Records that are wrongly classified will have their weights increased; records that are classified correctly will have their weights decreased.

Example of records sampled across boosting rounds:

Original Data:       1   2   3   4   5   6   7   8   9   10
Boosting (Round 1):  7   3   2   8   7   9   4   10  6   3
Boosting (Round 2):  5   4   9   4   2   5   1   7   4   2
Boosting (Round 3):  4   4   8   10  4   5   4   6   3   4
Boosting

Equal weights are assigned to each training tuple (1/d for round 1).
After a classifier Mi is learned, the weights are adjusted to allow the subsequent classifier Mi+1 to pay more attention to tuples that were misclassified by Mi.
The final boosted classifier M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy.
AdaBoost is a popular boosting algorithm.
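A minimal AdaBoost-style sketch (assuming scikit-learn decision stumps on the bundled two-class breast-cancer data; the exact update rule varies by AdaBoost variant):

```python
# Reweight misclassified tuples up and correct ones down, round after round.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = np.where(y == 1, 1, -1)                 # labels in {-1, +1}
d = len(X)
w = np.full(d, 1.0 / d)                     # round 1: equal weights, 1/d

models, alphas = [], []
for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()                # weighted error of this round's model
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # vote weight ~ accuracy
    w *= np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight hits
    w /= w.sum()
    models.append(stump)
    alphas.append(alpha)

# M*: weighted vote of the individual classifiers.
F = sum(a * m.predict(X) for a, m in zip(alphas, models))
print("training accuracy:", (np.sign(F) == y).mean())
```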
Decision Trees
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class
distribution
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Example of a Decision Tree

Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

One tree that fits this data (splitting attributes: MarSt, Refund, TaxInc):

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

[Figure: the same two-step task as before, specialized to trees. A tree induction algorithm learns a decision tree model from the Training Set (Tids 1, 2, 3, 6 above); the model is then applied to the Test Set (Tids 11 and 15) to fill in their class labels.]
Apply Model to Test Data

Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and, at each node, follow the branch that matches the test record:

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married          -> NO

Refund = No sends the record to the MarSt node, and Marital Status = Married reaches the leaf NO, so assign Cheat to No.
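The same walk can be written as a few nested conditionals; a minimal sketch where the function name classify is just for illustration:

```python
# The tree above, hand-coded: each if-branch mirrors one edge of the tree.
def classify(refund: str, marital_status: str, taxable_income: float) -> str:
    """Walk the decision tree from the root and return the Cheat label."""
    if refund == "Yes":
        return "No"
    if marital_status in ("Single", "Divorced"):
        return "Yes" if taxable_income > 80_000 else "No"
    return "No"  # Married

# The test record from the slide: Refund=No, Married, 80K -> Cheat = No.
print(classify("No", "Married", 80_000))
```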
Decision Tree Induction

Many algorithms:
  Hunt's Algorithm (one of the earliest)
  CART
  ID3, C4.5
  SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t. The general procedure:
  If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
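A minimal recursive sketch of this structure under simplifying assumptions (categorical attributes only, a fixed attribute order rather than a chosen best split, majority-class leaves when no tests remain); the helper name hunts and the toy records are illustrative:

```python
from collections import Counter

def hunts(records, labels, attributes):
    """Grow a tree: records are dicts, labels class values, attributes names."""
    if len(set(labels)) == 1:              # all records in one class -> leaf
        return labels[0]
    if not attributes:                     # no tests left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]                   # attribute selection is unspecified here
    tree = {"test": attr, "branches": {}}
    for value in set(r[attr] for r in records):
        subset = [(r, l) for r, l in zip(records, labels) if r[attr] == value]
        sub_r, sub_l = [r for r, _ in subset], [l for _, l in subset]
        tree["branches"][value] = hunts(sub_r, sub_l, attributes[1:])
    return tree

records = [{"Refund": "Yes", "MarSt": "Single"},
           {"Refund": "No",  "MarSt": "Married"},
           {"Refund": "No",  "MarSt": "Single"}]
labels = ["No", "No", "Yes"]
print(hunts(records, labels, ["Refund", "MarSt"]))
```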
Hunt's Algorithm

[Figure: step-by-step growth of the tree on the Refund/Marital Status/Taxable Income training data, splitting on Refund and then on Marital Status (Single, Divorced vs. Married), with leaves labeled Don't Cheat or Cheat.]
How to Specify Test Condition?
Depends on attribute types
Nominal
Ordinal
Continuous
Splitting Based on Nominal Attributes

Multi-way split: use as many partitions as distinct values.

CarType?
  Family | Sports | Luxury
Splitting Based on Ordinal Attributes

Multi-way split: use as many partitions as distinct values.

Size?
  Small | Medium | Large

Binary split: divides values into two subsets that respect the order; need to find the optimal partitioning.

Size?                            Size?
  {Small, Medium} | {Large}   OR   {Medium, Large} | {Small}
Splitting Based on Continuous Attributes

Binary split:     Taxable Income > 80K?  ->  Yes / No
Multi-way split:  Taxable Income?  ->  < 10K, ..., > 80K
END