Lizhen Xu
I. INTRODUCTION
A. The improvement
The C4.5 algorithm [3] [4] generates a decision tree by
learning from a training set in which each example is
described by attribute-value pairs. At each node, the
attribute with the maximum information gain ratio is selected
as the test attribute, and the root node of the decision tree
is obtained in the same way. Selecting the test attribute at
every node requires logarithmic calculations, and many of
these calculations repeat ones already performed at earlier
nodes. When the dataset is large, this reduces the efficiency
of decision tree generation. On examining the calculation
process, we find that the quantities inside the logarithms are
usually small, so the process can be simplified by using
L'Hôpital's rule, as follows.
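If f(x) and g(x) satisfy:
(1) $\lim_{x \to x_0} f(x)$ and $\lim_{x \to x_0} g(x)$ are both zero or are both infinite;
(2) $f'(x)$ and $g'(x)$ exist in a deleted neighborhood of $x_0$, and $g'(x) \neq 0$;
(3) $\lim_{x \to x_0} \frac{f'(x)}{g'(x)}$ exists or is infinite.
Then
$$\lim_{x \to x_0} \frac{f(x)}{g(x)} = \lim_{x \to x_0} \frac{f'(x)}{g'(x)}.$$
Taking $f(x) = \ln(1-x)$ and $g(x) = -x$:
$$\lim_{x \to 0} \frac{\ln(1-x)}{-x} = \lim_{x \to 0} \frac{[\ln(1-x)]'}{(-x)'} = \lim_{x \to 0} \frac{-\frac{1}{1-x}}{-1} = \lim_{x \to 0} \frac{1}{1-x} = 1,$$
viz.
$$\ln(1-x) \sim -x \quad (x \text{ approaches } 0) \tag{1}$$
$$\ln(1-x) \approx -x \quad (\text{when } x \text{ is quite small}) \tag{2}$$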
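The quality of approximation (2) can be checked numerically. The following is a minimal sketch, not part of the original method; the helper name `approx_error` and the sample values of x are chosen here purely for illustration.

```python
import math

def approx_error(x: float) -> float:
    """Absolute error of replacing ln(1 - x) by -x, for 0 <= x < 1."""
    # math.log1p(-x) evaluates ln(1 - x) accurately for small x.
    return abs(math.log1p(-x) - (-x))

# Illustrative values only: the error shrinks quickly as x gets smaller.
for x in (0.001, 0.01, 0.1, 0.5):
    print(f"x={x:<6} ln(1-x)={math.log1p(-x):+.6f}  -x={-x:+.6f}  "
          f"error={approx_error(x):.6f}")
```

Since the next term of the series for ln(1 - x) is -x²/2, the error grows roughly like x²/2, so the approximation is tightest when the proportions involved are small.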
(2)
Suppose c = 2, that is, there are only two classes in the
basic definition of the C4.5 algorithm. The information gain
of each candidate attribute is calculated, and the attribute
with the largest information gain is selected as the root.
Suppose that the sample set S contains p positive examples
and n negative examples.
So we can get the equation:
$$E(S, A) = \sum_{j=1} \frac{p_j + n_j}{p + n} \, I(S_{1j}, S_{2j}),$$
in which $p_j$ and $n_j$ are respectively the numbers of
positive examples and negative examples in the j-th subset of
the sample set. So Gain-Ratio(A) can be simplified as:
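$$\text{Gain-Ratio}(A) = \frac{\text{Gain}(A)}{I(A)} = \frac{E(S) - E(S, A)}{I(A)} = \frac{I(p, n) - \left\{ \frac{S_1}{N} I(S_{11}, S_{12}) + \frac{S_2}{N} I(S_{21}, S_{22}) \right\}}{I(S_1, S_2)}$$

With N = p + n, each proportion can be written as one minus its complement (for example, p/N = 1 - n/N). The common factor then cancels between numerator and denominator, which gives:

$$\text{Gain-Ratio}(S, A) = \frac{p \ln\left(1 - \frac{n}{N}\right) + n \ln\left(1 - \frac{p}{N}\right) - \left\{ \left[ S_{11} \ln\left(1 - \frac{S_{12}}{S_1}\right) + S_{12} \ln\left(1 - \frac{S_{11}}{S_1}\right) \right] + \left[ S_{21} \ln\left(1 - \frac{S_{22}}{S_2}\right) + S_{22} \ln\left(1 - \frac{S_{21}}{S_2}\right) \right] \right\}}{S_1 \ln\left(1 - \frac{S_2}{N}\right) + S_2 \ln\left(1 - \frac{S_1}{N}\right)}$$

Replacing every logarithm with the approximation $\ln(1-x) \approx -x$ gives:

$$\text{Gain-Ratio}(S, A) = \frac{\frac{p \, n}{N} - \left\{ \left[ \frac{S_{11} S_{12}}{S_1} \right] + \left[ \frac{S_{21} S_{22}}{S_2} \right] \right\}}{\frac{S_1 S_2}{N}}$$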
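The two forms of the gain ratio can be compared directly. The sketch below assumes a binary attribute and two classes, and reads S11, S12 as the positive and negative counts in the first subset and S21, S22 as those in the second. The function names and the sample counts are hypothetical and chosen only for illustration.

```python
import math

def entropy(a: float, b: float) -> float:
    """Two-class information I(a, b) in bits; a and b are counts."""
    total = a + b
    return -sum(c / total * math.log2(c / total) for c in (a, b) if c > 0)

def gain_ratio_exact(s11, s12, s21, s22):
    """Gain ratio computed with logarithms (standard definition)."""
    s1, s2 = s11 + s12, s21 + s22          # subset sizes
    n_total = s1 + s2                      # N
    p, n = s11 + s21, s12 + s22            # class totals
    expected = (s1 / n_total * entropy(s11, s12)
                + s2 / n_total * entropy(s21, s22))
    return (entropy(p, n) - expected) / entropy(s1, s2)

def gain_ratio_simplified(s11, s12, s21, s22):
    """Log-free form obtained by replacing ln(1 - x) with -x."""
    s1, s2 = s11 + s12, s21 + s22
    n_total = s1 + s2
    p, n = s11 + s21, s12 + s22
    numerator = p * n / n_total - (s11 * s12 / s1 + s21 * s22 / s2)
    return numerator / (s1 * s2 / n_total)

# Hypothetical counts: subset 1 holds 30 positive / 10 negative examples,
# subset 2 holds 5 positive / 25 negative examples.
counts = (30, 10, 5, 25)
print("exact:     ", gain_ratio_exact(*counts))
print("simplified:", gain_ratio_simplified(*counts))
```

The simplified form uses only multiplication, division and subtraction. The two values generally differ because the approximation is applied to proportions that are not always small, so the simplified value is best read as a ranking score rather than as an exact gain ratio.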
IV.
EXPERIMENTS
A. Data Preparation
According to the requirements of the analysis of cigarette
sales, a analysis data table is obtained from cigarette table
and cigarette sales table connected with one connection and
Attributes named in the data table: price (A), production place (D), sale (E).

Sample counts by sales class:

Sale (E)          Low-grade   High-grade   Hardboard box   No hardboard box
High sales (95)   82          13           65              30
Low sales (75)    43          32           62              13

B. Discussion

Figure 1. Comparison of the amount of calculation between C4.5 and the improved C4.5
Figure 2. Efficiency comparison between C4.5 and the improved C4.5
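As a rough illustration of how the two formulas behave on the counts listed in the data table above, the sketch below ranks two binary attributes with both the logarithmic gain ratio and the simplified form. It assumes that the counts summing to 95 belong to high sales and those summing to 75 to low sales. The attribute names `grade` and `packaging` and all function names are hypothetical labels introduced here.

```python
import math

def info(a, b):
    """Two-class information in bits for counts a and b."""
    t = a + b
    return -sum(c / t * math.log2(c / t) for c in (a, b) if c > 0)

def exact_ratio(s11, s12, s21, s22):
    """Gain ratio with logarithms for a binary attribute, two classes."""
    s1, s2, p, n = s11 + s12, s21 + s22, s11 + s21, s12 + s22
    total = s1 + s2
    gain = info(p, n) - (s1 * info(s11, s12) + s2 * info(s21, s22)) / total
    return gain / info(s1, s2)

def simplified_ratio(s11, s12, s21, s22):
    """Log-free score from the ln(1 - x) ~ -x approximation."""
    s1, s2, p, n = s11 + s12, s21 + s22, s11 + s21, s12 + s22
    total = s1 + s2
    return (p * n / total - s11 * s12 / s1 - s21 * s22 / s2) / (s1 * s2 / total)

# (high sales, low sales) counts for each of the two attribute values.
# Assumed grouping; attribute names are hypothetical.
splits = {
    "grade": (82, 43, 13, 32),       # low-grade, then high-grade
    "packaging": (65, 62, 30, 13),   # hardboard box, then no hardboard box
}

best_exact = max(splits, key=lambda a: exact_ratio(*splits[a]))
best_simplified = max(splits, key=lambda a: simplified_ratio(*splits[a]))
print("exact choice:     ", best_exact)
print("simplified choice:", best_simplified)
```

Under these assumptions both formulas rank the same attribute first, even though the ratio values they produce are not equal. This shows only that the ranking agrees on this one small table, not that it does so in general.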
CONCLUSION