Bayes Classification

Classification
Contd…
Bayes Classification
• Bayesian classifiers are statistical classifiers
based on Bayes’ theorem
• Predict class membership probabilities
• Naive Bayesian classifier
– Assumes effect of an attribute value on a given
class is independent of the values of the other
attributes – class conditional independence
– Simplifies the computations
– Has comparable performance with decision tree
and selected neural network classifiers
• Let X be a data sample (“evidence”): n attributes, class
label unknown
• H: some hypothesis such as that the data tuple X
belongs to a specified class C
• Find P(H|X): probability that the hypothesis H holds
given the observed data tuple X
• Probability that tuple X belongs to class C, given that
we know the attribute description of X
• P(H|X) a posteriori probability of H conditioned on X
• Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X)
• Example: computer purchase problem

– customers described by age and income
– X is a 35-year-old customer with an income of
$40,000
– H: hypothesis that customer will buy a computer
– P(H|X) = ?
Example
• P(H|X): posterior probability that customer X will buy
a computer given his age and income
• P(H) is the prior probability that any given customer
will buy a computer, regardless of age, income
• P(X|H): posterior probability of X conditioned on H (also
called the likelihood): probability that customer X is 35
years old and earns $40,000, given that he buys a computer
• P(X): prior probability that a person from the set of
customers is 35 years old and earns $40,000
• P(H), P(X|H), P(X) may be estimated from given data
Naive Bayesian Classification
• D: training set of tuples. Each tuple represented by n-
dimensional attribute vector, X=(x1, x2,…, xn).
Attributes are A1, A2,…, An. m classes - C1, C2,…, Cm
• Given a tuple, X, the classifier will predict that X
belongs to the class having the highest posterior
probability, conditioned on X. Tuple X belongs to class
Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
• By Bayes’ theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs
to be maximized
• If the class prior probabilities are not known, assume
P(C1)=P(C2)=…=P(Cm), and therefore maximize P(X|Ci)
• Class prior probabilities may be estimated by P(Ci) =
|Ci,D|/|D|
• Given data sets with many attributes, extremely
computationally expensive to compute P(X|Ci)
• Assumption: class-conditional independence, i.e., no
dependence relation between attributes:
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)$$
• Attributes may be categorical or continuous
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci
having value xk for Ak divided by |Ci, D| (# of tuples of
Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed
based on a Gaussian distribution with mean μ and
standard deviation σ:
$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
so that
$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$
• Compute P(X|Ci)P(Ci) for each class Ci
• Predicted class label is the class Ci for which
P(X|Ci)P(Ci) is maximum
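The counting scheme for categorical attributes can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `train_naive_bayes` and `predict` and the toy data are hypothetical, and continuous attributes and zero-count smoothing are deliberately left out.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(Ci) and the counts behind P(xk | Ci) (categorical attributes only)."""
    n = len(rows)
    class_counts = Counter(labels)
    priors = {c: count / n for c, count in class_counts.items()}
    # value_counts[(class, attribute index)][value] = number of tuples
    value_counts = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[(c, k)][value] += 1
    return priors, value_counts, class_counts

def predict(x, priors, value_counts, class_counts):
    """Return the class maximizing P(X | Ci) * P(Ci), plus the score of every class."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for k, value in enumerate(x):
            # P(xk | Ci) = (# tuples of Ci with value xk for Ak) / |Ci,D|
            score *= value_counts[(c, k)][value] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: (age, student) -> buys_computer
rows = [("youth", "no"), ("youth", "yes"), ("senior", "yes"),
        ("senior", "no"), ("youth", "yes"), ("senior", "no")]
labels = ["no", "yes", "yes", "no", "yes", "no"]

model = train_naive_bayes(rows, labels)
label, scores = predict(("youth", "yes"), *model)
print(label, scores)
```

Because P(X) is the same for every class, the scores are left unnormalized; dividing each by their sum would give the posterior probabilities.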
Example
Avoiding the Zero-Probability Problem
• Naive Bayesian prediction requires each conditional
probability to be non-zero. Otherwise, the predicted
probability will be zero:
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$
• Ex: A dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their
“uncorrected” counterparts
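The income example can be checked with a short sketch. The helper name `laplace_estimates` and its `alpha` parameter are assumptions made for illustration; `alpha=1` corresponds to the Laplacian correction described here.

```python
def laplace_estimates(counts, alpha=1):
    """Add `alpha` to every count before normalizing (alpha=1: Laplacian correction)."""
    total = sum(counts.values()) + alpha * len(counts)
    return {value: (count + alpha) / total for value, count in counts.items()}

# Counts from the 1000-tuple income example
income_counts = {"low": 0, "medium": 990, "high": 10}
corrected = laplace_estimates(income_counts)

for value, p in corrected.items():
    print(f"P(income = {value}) = {p:.4f}")
```

The denominator grows by one for each distinct value (1000 + 3 = 1003), so the corrected estimates still sum to 1.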
Naive Bayes Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss
of accuracy
– Practically, dependencies exist among variables
• E.g., Patients: Profile - age, family history, etc. Symptoms
- fever, cough etc., Disease - lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve
Bayes Classifier
Bayesian Belief Networks
• Unlike naive Bayesian classifiers, do not assume class
conditional independence
• Allow the representation of dependencies among
subsets of attributes
• Graphical model of causal relationships
• Two components
– a directed acyclic graph
– set of conditional probability tables (CPT)
• Node = random variable (discrete- or continuous-valued)
• Arc = probabilistic dependence
• Arc Y -> Z implies Y is a parent of Z, and Z is a descendant of Y
• Each variable is conditionally independent of its non-
descendants in the graph, given its parents
• One CPT for each variable
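• Let X = (x1,…, xn) be a data tuple described by the
variables or attributes Y1,…, Yn
• The network gives a complete representation of the
existing joint probability distribution:
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathit{Parents}(Y_i))$$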
• A node within the network can be selected as an
“output” node, representing a class label attribute
• May be more than one output node
• Can return probability of each class
• Various learning algorithms – gradient descent
• Some applications
– genetic linkage analysis
– computer vision
– document and text analysis
– decision support systems
– sensitivity analysis
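As a rough sketch of how a belief network represents the joint distribution, the snippet below multiplies one CPT entry per variable, each conditioned on that variable's parents. The three-node network and every probability in it are made up for illustration, and all variables are assumed to be Boolean.

```python
from itertools import product

# Hypothetical network: FamilyHistory -> LungCancer <- Smoker
parents = {
    "FamilyHistory": [],
    "Smoker": [],
    "LungCancer": ["FamilyHistory", "Smoker"],
}

# cpt[variable][tuple of parent values] = P(variable = True | parents)
# All numbers are invented for this example.
cpt = {
    "FamilyHistory": {(): 0.1},
    "Smoker": {(): 0.3},
    "LungCancer": {
        (True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1,
    },
}

def joint_probability(assignment):
    """Product over all variables of P(value | values of its parents)."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p_true = cpt[var][parent_values]
        p *= p_true if value else 1.0 - p_true
    return p

example = joint_probability(
    {"FamilyHistory": True, "Smoker": False, "LungCancer": True})
print(example)  # 0.1 * (1 - 0.3) * 0.5

# Sanity check: the joint probabilities of all 2^3 assignments sum to 1.
total = sum(joint_probability(dict(zip(parents, values)))
            for values in product([True, False], repeat=len(parents)))
print(total)
```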
Rule-Based Classification
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
• R: IF age = youth AND student = yes THEN buys_computer =
yes
• Can be generated either from a decision tree or directly from
the training data using a sequential covering algorithm
• The “IF” part (or left side) of a rule is known as the rule
antecedent or precondition
• The “THEN” part (or right side) is the rule consequent
• If the rule antecedent holds true for a given tuple, we say that
the rule is satisfied and that the rule covers the tuple
• Assessment of a rule R: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
– coverage(R) = ncovers / |D|, where D is the training data set
– accuracy(R) = ncorrect / ncovers
• If more than one rule is triggered, need conflict resolution
– Size ordering: assigns highest priority to the
triggering rule that has the “toughest”
requirement (i.e., with the most attribute tests)
– Rule ordering: rules prioritized beforehand
• class-based, rule-based
• Class-based ordering: classes are sorted in
decreasing order of prevalence or misclassification
cost per class
• Rule-based ordering (decision list): rules are
organized into one long priority list, according to
some measure of rule quality or by experts
• What if no rule satisfied by X?
– Set up a default rule to specify a default class,
based on a training set
– May be the class in majority or the majority class
of the tuples that were not covered by any rule
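The rule assessment and conflict-resolution ideas above can be sketched as follows. The rule representation (a dict with `"if"` and `"then"` keys), the helper names, the sample tuples, and the choice of default class are all assumptions made for illustration; only size ordering is shown.

```python
def covers(rule, tuple_):
    """A rule covers a tuple when every attribute test in its antecedent holds."""
    return all(tuple_.get(attr) == val for attr, val in rule["if"].items())

def coverage_and_accuracy(rule, data, class_attr):
    covered = [t for t in data if covers(rule, t)]
    n_covers = len(covered)
    n_correct = sum(t[class_attr] == rule["then"] for t in covered)
    coverage = n_covers / len(data)                       # ncovers / |D|
    accuracy = n_correct / n_covers if n_covers else 0.0  # ncorrect / ncovers
    return coverage, accuracy

def classify(tuple_, rules, default_class):
    """Size ordering: among triggered rules, fire the one with the most tests."""
    triggered = [r for r in rules if covers(r, tuple_)]
    if not triggered:
        return default_class  # no rule satisfied: fall back to the default rule
    return max(triggered, key=lambda r: len(r["if"]))["then"]

# Hypothetical training tuples
data = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "yes", "buys_computer": "no"},
    {"age": "youth", "student": "no", "buys_computer": "no"},
    {"age": "senior", "student": "no", "buys_computer": "yes"},
]
r1 = {"if": {"age": "youth", "student": "yes"}, "then": "yes"}
r2 = {"if": {"age": "youth"}, "then": "no"}

coverage, accuracy = coverage_and_accuracy(r1, data, "buys_computer")
print(coverage, accuracy)

# Both rules are triggered; r1 wins because it has more attribute tests.
winner = classify({"age": "youth", "student": "yes"}, [r2, r1], default_class="yes")
# No rule is triggered, so the default class is returned.
fallback = classify({"age": "senior", "student": "no"}, [r2, r1], default_class="yes")
print(winner, fallback)
```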
Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root
to a leaf
• Each splitting criterion along a given path is
logically ANDed to form the rule antecedent
(“IF” part)
• The leaf node holds the class prediction,
forming the rule consequent (“THEN” part)
• Example
IF age = youth AND student = no THEN buys_computer = no

IF age = youth AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = senior AND credit_rating = fair THEN buys_computer = no
IF age = senior AND credit_rating = excellent THEN buys_computer = yes
• Rules extracted are mutually exclusive and
exhaustive
• Mutually exclusive: cannot have rule conflicts
here as no two rules will be triggered for the
same tuple
• Exhaustive: one rule for each possible attribute–
value combination
• Since one rule extracted per leaf, the set of rules
is not much simpler than the corresponding
decision tree
• Rule pruning required
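A minimal sketch of path-by-path rule extraction, assuming the tree is stored as nested dictionaries (an illustrative representation, not a standard one). The tree below is built to mirror the five example rules listed above.

```python
# Internal node: {"attr": name, "branches": {value: subtree}}; leaf: class label string.
tree = {"attr": "age", "branches": {
    "youth": {"attr": "student", "branches": {"no": "no", "yes": "yes"}},
    "mid-age": "yes",
    "senior": {"attr": "credit_rating",
               "branches": {"fair": "no", "excellent": "yes"}},
}}

def extract_rules(node, path=()):
    """Yield one (antecedent tests, class label) pair per root-to-leaf path."""
    if isinstance(node, str):  # leaf: the accumulated tests form the antecedent
        yield list(path), node
        return
    for value, child in node["branches"].items():
        # AND the splitting criterion of this node onto the path so far
        yield from extract_rules(child, path + ((node["attr"], value),))

rules = list(extract_rules(tree))
for tests, label in rules:
    antecedent = " AND ".join(f"{attr} = {value}" for attr, value in tests)
    print(f"IF {antecedent} THEN buys_computer = {label}")
```

One rule comes out per leaf, which is why the extracted rule set is mutually exclusive and exhaustive but no simpler than the tree itself.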
Rule Induction: Sequential Covering Algorithm
• Extracts rules directly from training data

• Typical sequential covering algorithms: FOIL, AQ, CN2,
RIPPER
• Rules are learned sequentially; each rule for a given class Ci
should ideally cover many tuples of Ci but none (or few) of the
tuples of other classes
• Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rule are
removed
– Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
Basic Sequential Covering Algorithm
How are rules learned?
• Start with the most general rule possible:
– IF THEN loan_decision = accept
• Add new attribute tests by adopting a greedy
depth-first strategy
– Pick the test that improves the rule quality most
• The resulting rule should cover relatively more
of the “accept” tuples
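The covering loop and the greedy rule growth can be sketched together. This is a simplified illustration and not FOIL, AQ, CN2, or RIPPER: it uses plain rule accuracy as the quality measure, handles only categorical attributes, and the function names, the `min_accuracy` threshold, and the loan data are all hypothetical.

```python
def covers(antecedent, row):
    """True when every attribute test in the antecedent holds for the row."""
    return all(row[a] == v for a, v in antecedent.items())

def accuracy(antecedent, rows, target, cls):
    covered = [r for r in rows if covers(antecedent, r)]
    if not covered:
        return 0.0
    return sum(r[target] == cls for r in covered) / len(covered)

def learn_one_rule(rows, attributes, target, cls):
    """Start from the empty antecedent; greedily add the best attribute test."""
    antecedent = {}
    best = accuracy(antecedent, rows, target, cls)
    while best < 1.0:
        candidates = {(a, r[a]) for r in rows for a in attributes
                      if a not in antecedent}
        scored = [(accuracy({**antecedent, a: v}, rows, target, cls), a, v)
                  for a, v in candidates]
        if not scored:
            break
        acc, a, v = max(scored)
        if acc <= best:  # no test improves the rule any further
            break
        antecedent[a] = v
        best = acc
    return antecedent, best

def sequential_covering(rows, attributes, target, cls, min_accuracy=0.8):
    """Learn rules one at a time, removing the tuples each new rule covers.
    min_accuracy is assumed positive, so every accepted rule covers a tuple."""
    rules, remaining = [], list(rows)
    while any(r[target] == cls for r in remaining):
        antecedent, acc = learn_one_rule(remaining, attributes, target, cls)
        if acc < min_accuracy:  # rule quality below the user-specified threshold
            break
        rules.append(antecedent)
        remaining = [r for r in remaining if not covers(antecedent, r)]
    return rules

# Hypothetical loan data
rows = [
    {"income": "high", "credit": "good", "loan_decision": "accept"},
    {"income": "high", "credit": "bad", "loan_decision": "accept"},
    {"income": "low", "credit": "good", "loan_decision": "accept"},
    {"income": "low", "credit": "bad", "loan_decision": "reject"},
    {"income": "low", "credit": "bad", "loan_decision": "reject"},
]
rules = sequential_covering(rows, ["income", "credit"], "loan_decision", "accept")
print(rules)
```

Using accuracy alone tends to favor rules that cover very few tuples, which is why measures that also account for coverage are preferred in practice.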
Rule Learning
Rule-Quality measures
• Need to consider both coverage and accuracy
• Entropy - prefers rules that cover a large number of tuples of a
single class and few tuples of other classes
• Tuples of the class for which rules are learned are called
positive tuples, while the remaining tuples are negative
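• FOIL gain (in FOIL & RIPPER): assesses information gained by
extending the antecedent
$$\text{FOIL\_Gain} = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)$$
where pos (neg) is the # of positive (negative) tuples covered by
rule R, and pos' (neg') is the # covered by the extended rule R'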
• Favors rules that have high accuracy and cover many positive
tuples
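A small numeric sketch of FOIL gain, written as pos' × (log2(pos'/(pos' + neg')) − log2(pos/(pos + neg))). The function name and the counts are hypothetical; the rule before extension is assumed to cover at least one positive tuple, and a gain of 0 is returned by convention when the extended rule covers no positives.

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """pos/neg: tuples covered by R; pos_new/neg_new: tuples covered by R'."""
    if pos_new == 0:
        return 0.0  # convention chosen here: no positives left, no gain
    return pos_new * (log2(pos_new / (pos_new + neg_new))
                      - log2(pos / (pos + neg)))

# R covers 10 positive and 10 negative tuples; the extended rule R'
# covers 8 positive and 2 negative tuples (made-up counts).
gain = foil_gain(pos=10, neg=10, pos_new=8, neg_new=2)
print(round(gain, 3))
```

The factor pos' rewards extensions that keep many positive tuples, while the log difference rewards an increase in accuracy.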
Rule Pruning
• Prune a rule, R, if the pruned version of R has greater
quality, as assessed on an independent set of tuples
$$\text{FOIL\_Prune}(R) = \frac{pos - neg}{pos + neg}$$
• If FOIL_Prune is higher for the pruned version of R,
prune R
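The pruning test can be sketched with FOIL_Prune(R) = (pos − neg)/(pos + neg), where pos and neg are the positive and negative tuples that R covers in the independent pruning set. The counts below are invented for illustration.

```python
def foil_prune(pos, neg):
    """(pos - neg) / (pos + neg), computed on an independent pruning set."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts on the pruning set
full_rule = foil_prune(pos=20, neg=10)    # R with all of its attribute tests
pruned_rule = foil_prune(pos=28, neg=12)  # R with its last test removed

should_prune = pruned_rule > full_rule
print(full_rule, pruned_rule, should_prune)
```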
Classifier Evaluation Metrics
Confusion Matrix:
Actual class \ Predicted class | C1                   | ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)
Example of Confusion Matrix:

Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             | 6954               | 46                | 7000
buy_computer = no              | 412                | 2588              | 3000
Total                          | 7366               | 2634              | 10000
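Confusion matrix with totals (A = actual class, P = predicted class):
A\P   | C  | ¬C | Total
C     | TP | FN | P
¬C    | FP | TN | N
Total | P’ | N’ | All
• Classifier accuracy, or recognition rate:
percentage of test set tuples that are
correctly classified
Accuracy = (TP + TN)/All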
• Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All
• Sensitivity: True Positive recognition rate
• Sensitivity = TP/P
• Specificity: True Negative recognition rate
• Specificity = TN/N
• Precision: exactness – what % of tuples that the classifier labeled
as positive are actually positive?
Precision = TP/(TP + FP)
• Recall: completeness – what % of positive tuples did the classifier
label as positive?
Recall = TP/(TP + FN) = TP/P
• F measure (F1 or F-score): harmonic mean of precision and recall
F = 2 × precision × recall/(precision + recall)
• Fβ: weighted measure of precision and recall
Fβ = (1 + β²) × precision × recall/(β² × precision + recall)
– assigns β times as much weight to recall as to precision
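The metrics can be computed directly from the four confusion-matrix cells. The sketch below applies them to the buy_computer example matrix (TP = 6954, FN = 46, FP = 412, TN = 2588); the function name `classification_metrics` is an assumption, and the formulas used are the standard definitions of precision, recall, and Fβ.

```python
def classification_metrics(tp, fn, fp, tn, beta=1.0):
    p, n = tp + fn, fp + tn  # actual positives and actual negatives
    total = p + n
    precision = tp / (tp + fp)
    recall = tp / p
    f_beta = ((1 + beta**2) * precision * recall
              / (beta**2 * precision + recall))
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": recall,   # true positive recognition rate, TP/P
        "specificity": tn / n,   # true negative recognition rate, TN/N
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,        # equals F1 when beta = 1
    }

m = classification_metrics(tp=6954, fn=46, fp=412, tn=2588)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

On this matrix the accuracy is 95.42%, while specificity is noticeably lower than sensitivity, which a single accuracy figure would hide.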
