Raymond J. Mooney
University of Texas at Austin
Generalization
• Hypotheses must generalize to correctly classify instances not in the training data.
• Simply memorizing training examples is a consistent hypothesis that does not generalize.
• Occam's razor:
  – Finding a simple hypothesis helps ensure generalization.

Hypothesis Space
• Restrict learned functions a priori to a given hypothesis space, H, of functions h(x) that can be considered as definitions of c(x).
• For learning concepts on instances described by n discrete-valued features, consider the space of conjunctive hypotheses represented by a vector of n constraints <c1, c2, …, cn> where each ci is either:
  – ?, a wild card indicating no constraint on the ith feature
  – A specific value from the domain of the ith feature
  – Ø, indicating no value is acceptable
• Sample conjunctive hypotheses:
  – <big, red, ?>
  – <?, ?, ?> (most general hypothesis)
  – <Ø, Ø, Ø> (most specific hypothesis)
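As an aside not on the original slides, a minimal Python sketch may make the constraint-vector representation concrete; the names WILDCARD, EMPTY, and matches are assumptions introduced here for illustration.

# Sketch of the constraint-vector representation of conjunctive hypotheses.
# "?" is the wild card (no constraint); None stands in for Ø (no value acceptable).
WILDCARD = "?"
EMPTY = None

def matches(hypothesis, instance):
    """Return True if the conjunctive hypothesis covers the instance.

    hypothesis: tuple of constraints, e.g. ("big", "red", WILDCARD)
    instance:   tuple of feature values, e.g. ("big", "red", "circle")
    """
    for constraint, value in zip(hypothesis, instance):
        if constraint is EMPTY:                      # Ø: no value is acceptable
            return False
        if constraint != WILDCARD and constraint != value:
            return False
    return True

# <big, red, ?> covers <big, red, circle> but not <small, red, circle>
print(matches(("big", "red", WILDCARD), ("big", "red", "circle")))    # True
print(matches(("big", "red", WILDCARD), ("small", "red", "circle")))  # False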
Efficient Learning
• Is there a way to learn conjunctive concepts without enumerating them?
• How do human subjects learn conjunctive concepts?
• Is there a way to efficiently find an unconstrained boolean function consistent with a set of discrete-valued training instances?
• If so, is it a useful/practical algorithm?

Conjunctive Rule Learning
• Conjunctive descriptions are easily learned by finding all commonalities shared by all positive examples.

Example  Size   Color  Shape     Category
1        small  red    circle    positive
2        large  red    circle    positive
3        small  red    triangle  negative
4        large  blue   circle    negative

Learned rule: red & circle → positive

• Must check consistency with negative examples. If inconsistent, no conjunctive rule exists.
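A minimal Python sketch of this procedure (added here for illustration, not from the slides; learn_conjunctive_rule and its argument names are assumptions): intersect the positive examples feature by feature, replacing any disagreeing value with the ? wild card, then verify that the result covers no negative example.

# Sketch: learn a conjunctive rule from positives, then check the negatives.
WILDCARD = "?"

def learn_conjunctive_rule(positives, negatives):
    """positives/negatives: lists of feature tuples, e.g. ("small", "red", "circle")."""
    # Start from the first positive and generalize: any feature whose value
    # differs across positives becomes a wild card.
    rule = list(positives[0])
    for example in positives[1:]:
        for i, value in enumerate(example):
            if rule[i] != value:
                rule[i] = WILDCARD
    # Consistency check: the rule must not cover any negative example.
    def covers(example):
        return all(c == WILDCARD or c == v for c, v in zip(rule, example))
    if any(covers(neg) for neg in negatives):
        return None                      # no consistent conjunctive rule exists
    return tuple(rule)

# Data set from the slide above.
positives = [("small", "red", "circle"), ("large", "red", "circle")]
negatives = [("small", "red", "triangle"), ("large", "blue", "circle")]
print(learn_conjunctive_rule(positives, negatives))   # ('?', 'red', 'circle')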
Top-Down Decision Tree Induction
• Recursively build a tree top-down by divide and conquer.

Training examples:
  <big, red, circle>: +      <small, red, circle>: +
  <small, red, square>: −    <big, blue, circle>: −

Induced tree (split on color, then shape):
  color
  ├─ red → shape
  │        ├─ circle → pos    <big, red, circle>: +, <small, red, circle>: +
  │        ├─ square → neg    <small, red, square>: −
  │        └─ triangle → pos
  ├─ blue → neg    <big, blue, circle>: −
  └─ green → neg

Decision Tree Induction Pseudocode

DTree(examples, features) returns a tree
    If all examples are in one category, return a leaf node with that category label.
    Else if the set of features is empty, return a leaf node with the category label that
        is the most common in examples.
    Else pick a feature F and create a node R for it.
        For each possible value vi of F:
            Let examplesi be the subset of examples that have value vi for F.
            Add an out-going edge E to node R labeled with the value vi.
            If examplesi is empty
                then attach a leaf node to edge E labeled with the category that
                    is the most common in examples
                else call DTree(examplesi, features – {F}) and attach the resulting
                    tree as the subtree under edge E.
        Return the subtree rooted at R.
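The following Python sketch (added here, not part of the slides) mirrors the pseudocode above; the dictionary-based tree representation and the naive choice of splitting on the first remaining feature, rather than an information-based criterion, are assumptions of this sketch.

from collections import Counter

def most_common_category(examples):
    """examples: list of (feature_dict, category) pairs."""
    return Counter(cat for _, cat in examples).most_common(1)[0][0]

def dtree(examples, features, values):
    """Recursive divide-and-conquer tree induction, following the pseudocode above.

    examples: list of (feature_dict, category) pairs
    features: list of feature names still available for splitting
    values:   dict mapping each feature name to its list of possible values
    """
    categories = {cat for _, cat in examples}
    if len(categories) == 1:              # all examples in one category
        return categories.pop()
    if not features:                       # no features left to split on
        return most_common_category(examples)
    F = features[0]                        # pick a feature (naive choice here)
    node = {"feature": F, "branches": {}}
    for v in values[F]:
        subset = [(x, c) for x, c in examples if x[F] == v]
        if not subset:                     # empty branch: use the majority class
            node["branches"][v] = most_common_category(examples)
        else:
            node["branches"][v] = dtree(subset, [f for f in features if f != F], values)
    return node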
Hypothesis Space Search
• Performs batch learning that processes all training instances at once rather than incremental learning that updates a hypothesis after each example.
• Performs hill-climbing (greedy search) that may only find a locally-optimal solution. Guaranteed to find a tree consistent with any conflict-free training set (i.e. identical feature vectors are always assigned the same class), but not necessarily the simplest tree.
• Finds a single discrete hypothesis, so there is no way to provide confidences or create useful queries.

Another Red-Circle Data Set

Example  Size   Color  Shape   Category
1        small  red    circle  positive
2        large  red    circle  positive
3        small  red    square  negative
4        large  blue   circle  negative
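As a usage illustration (again added here, not from the slides), this data set can be handed to the dtree sketch given after the pseudocode slide; the feature ordering and value lists below are arbitrary assumptions.

# Feeding the second red-circle data set (table above) to the dtree sketch
# defined earlier.  Feature order and value lists are arbitrary choices here.
examples = [
    ({"size": "small", "color": "red",  "shape": "circle"}, "positive"),
    ({"size": "large", "color": "red",  "shape": "circle"}, "positive"),
    ({"size": "small", "color": "red",  "shape": "square"}, "negative"),
    ({"size": "large", "color": "blue", "shape": "circle"}, "negative"),
]
values = {"size": ["small", "large"],
          "color": ["red", "blue", "green"],
          "shape": ["circle", "square", "triangle"]}
tree = dtree(examples, ["color", "shape", "size"], values)
print(tree)    # nested dict describing the induced tree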
Evaluating Inductive Hypotheses
• Accuracy of hypotheses on training data is obviously biased since the hypothesis was constructed to fit this data.
• Accuracy must be evaluated on an independent (usually disjoint) test set.
• Average over multiple train/test splits to get an accurate measure of accuracy.
• K-fold cross validation averages over K trials, using each example exactly once as a test case.

K-Fold Cross Validation

Randomly partition data D into k disjoint equal-sized subsets P1…Pk
For i from 1 to k do:
    Use Pi for the test set and the remaining data for training:
        Si = (D – Pi)
        hA = LA(Si)
        hB = LB(Si)
        δi = errorPi(hA) – errorPi(hB)
Return the average difference in error:
    δ = (1/k) ∑_{i=1}^{k} δi
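A minimal Python sketch of this comparison procedure (added here as an illustration; k_fold_difference, learner_a, and learner_b are assumed names standing in for the learning algorithms LA and LB):

import random

def k_fold_difference(data, learner_a, learner_b, k=10, seed=0):
    """Estimate the average error difference between two learners via k-fold CV.

    data: list of (instance, label) pairs
    learner_a, learner_b: functions that take a training list and return a
        classifier, i.e. a function mapping an instance to a label
    """
    data = list(data)
    random.Random(seed).shuffle(data)          # random partition of D
    folds = [data[i::k] for i in range(k)]     # k disjoint subsets P1..Pk
    deltas = []
    for i in range(k):
        test = folds[i]                        # Pi is the test set
        train = [x for j, f in enumerate(folds) if j != i for x in f]   # Si = D - Pi
        h_a, h_b = learner_a(train), learner_b(train)
        def error(h):
            return sum(h(x) != y for x, y in test) / len(test)
        deltas.append(error(h_a) - error(h_b))  # delta_i
    return sum(deltas) / k                      # average difference in error

Slicing data[i::k] yields folds whose sizes differ by at most one, which is the usual practical reading of "equal-sized".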