
2009 Sixth Web Information Systems and Applications Conference

Improved C4.5 Algorithm for the Analysis of Sales


Rong Cao
School of Computer Science and Engineering
Southeast University
Nanjing 211189, China
Email: caorong_1986@126.com

Lizhen Xu
School of Computer Science and Engineering
Southeast University
Nanjing 211189, China
Email: lzxu@seu.edu.cn

Abstract: A decision tree is an important means of data mining and inductive learning, usually used to build classifiers and prediction models. C4.5 is one of the most classic classification algorithms in data mining, but its efficiency is very low when it is applied to mass calculations. In this paper, the rule of C4.5 is improved by the use of L'Hospital's Rule, which simplifies the calculation process and improves the efficiency of the decision-making algorithm. The same principle is applied when calculating the rate of information gain, which improves the algorithm considerably. The application at the end of the paper shows that the improved algorithm is efficient and better suited to large amounts of data, and that its efficiency is greatly improved in practical applications.

Keywords: decision tree, algorithm, C4.5, the rate of information gain, large data sets

I. INTRODUCTION

With the rapid development of information technology, the amount of data grows at an amazing rate. People in various fields urgently need to extract useful information from these large data sets. Classification [1] [2] is one of the most important and widely used techniques; its purpose is to generate an accurate classifier by analyzing the characteristics of a training data set, so as to decide the category of unknown data samples.

In this paper, we analyze several decision tree classification algorithms currently in use, including ID3 [3] and C4.5 [4] as well as some of the improved algorithms [5] [6] [7] derived from them. When these classification algorithms are used to process commodity sales data, we find that their efficiency is very low and that they cause excessive memory consumption. On this basis, combining with a large quantity of goods sales data, we put forward an improvement of the efficiency of the C4.5 algorithm, using L'Hospital's Rule to simplify the calculation process by approximation. Although accuracy is slightly reduced, the application to commodity sales analysis (taking tobacco sales analysis as an example) indicates that the improved C4.5 algorithm is efficient. The improved algorithm has no essential impact on the outcome of decision-making, but greatly improves efficiency and reduces memory use, so it is better suited to processing large data collections.
II. RELATED RESEARCH SUMMARIES

Quinlan put forward the ID3 decision tree algorithm, based on information gain, and later an improved algorithm, C4.5, in 1993. In the following years, many scholars made various improvements to the decision tree algorithm. The problem is that these decision tree algorithms need to scan and sort the data collection multiple times during the construction of the decision tree, so processing speed drops greatly when the data set is too large to fit in memory.

At present, literature on improving the efficiency of decision tree classification algorithms is rare, and some of it makes merely simple improvements. For example, Wei Zhao and Jianming Su [8] proposed improvements to the ID3 algorithm that simplify the information gain by means of Taylor's formula. But this improvement is more suitable for small amounts of data, so it is not particularly effective on large data sets.

Since large data sets must be handled, a variety of decision tree classification algorithms have been considered. The advantages of the C4.5 algorithm are significant, so it was chosen; but its efficiency must be improved to meet the dramatic increase in the amount of data to be processed.

III. THE IMPROVEMENT OF THE C4.5 ALGORITHM

A. The improvement
The C4.5 algorithm [3] [4] generates a decision tree through learning from a training set, in which each example is structured in terms of attribute-value pairs. The attribute with the maximum rate of information gain is chosen as the current node, and the root node of the decision tree is obtained in this way. Studying the process carefully, we find that selecting the test attribute for each node requires logarithmic calculations, and that these calculations are repeated at every node. This hurts the efficiency of decision tree generation when the data set is large. We also find that the antilogarithm in each logarithmic calculation is usually small, so the process can be simplified by using L'Hospital's Rule, as follows.
If $f(x)$ and $g(x)$ satisfy:

(1) $\lim_{x \to x_0} f(x)$ and $\lim_{x \to x_0} g(x)$ are both zero or both infinite;

(2) in the deleted neighborhood of the point $x_0$, both $f'(x)$ and $g'(x)$ exist and $g'(x) \neq 0$;

(3) $\lim_{x \to x_0} \frac{f'(x)}{g'(x)}$ exists or is infinite;

then $\lim_{x \to x_0} \frac{f(x)}{g(x)} = \lim_{x \to x_0} \frac{f'(x)}{g'(x)}$. So

$$\lim_{x \to 0} \frac{\ln(1-x)}{-x} = \lim_{x \to 0} \frac{[\ln(1-x)]'}{(-x)'} = \lim_{x \to 0} \frac{1}{1-x} = 1,$$

viz.

$$\ln(1-x) = -x \quad (x \to 0) \qquad (1)$$

$$\ln(1-x) \approx -x \quad \text{(when } x \text{ is quite small)} \qquad (2)$$
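To get a feel for how tight approximation (2) is, a quick numeric check, written as a small Python sketch of our own (not part of the original paper), compares ln(1-x) with -x:

import math

# Compare ln(1 - x) with its approximation -x for a few values of x.
for x in [0.01, 0.05, 0.1, 0.2, 0.5]:
    exact = math.log(1 - x)
    approx = -x
    rel_err = abs(exact - approx) / abs(exact)
    print(f"x={x:<4}  ln(1-x)={exact:+.5f}  -x={approx:+.5f}  rel. error={rel_err:.1%}")

The relative error is about 0.5% at x = 0.01 and about 5% at x = 0.1, and it keeps growing as x approaches 1, which is why the paper restricts the claim to cases where the antilogarithm is small.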
(2)
Suppose $c = 2$, that is, there are only two categories in the basic definition of the C4.5 algorithm. Each candidate attribute's information gain is calculated, and the one with the largest information gain is selected as the root. Suppose that in the sample set $S$ the number of positive examples is $p$ and the number of negative examples is $n$. So we can get the equation:

$$E(S, A) = \sum_{j=1}^{v} \frac{p_j + n_j}{p + n} \, I(S_{1j}, S_{2j}),$$

in which $v$ is the number of values of attribute $A$, and $p_j$ and $n_j$ are respectively the number of positive examples and negative examples in the $j$-th subset of the sample set. So Gain-Ratio(A) can be simplified as:

$$\text{Gain-Ratio}(A) = \frac{Gain(A)}{I(A)} = \frac{E(S) - E(S, A)}{I(A)}$$
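Before simplifying, it helps to pin the quantities down in code. The following Python sketch (our own illustration of these textbook formulas, not code from the paper) computes Gain-Ratio(S, A) exactly as defined above:

import math

def entropy2(pos: int, neg: int) -> float:
    # I(p, n): binary entropy of a (positive, negative) split, in bits.
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            q = count / total
            h -= q * math.log2(q)
    return h

def gain_ratio(splits):
    # Gain-Ratio(S, A) for a two-class sample set S split by attribute A;
    # splits holds one (positives, negatives) pair per value of A.
    p = sum(pj for pj, _ in splits)               # total positive examples
    n = sum(nj for _, nj in splits)               # total negative examples
    N = p + n
    # E(S, A): entropy of each subset, weighted by subset size.
    e_sa = sum((pj + nj) / N * entropy2(pj, nj) for pj, nj in splits)
    gain = entropy2(p, n) - e_sa                  # Gain(A) = E(S) - E(S, A)
    # I(A): split information, i.e. the entropy of the partition sizes.
    i_a = -sum((pj + nj) / N * math.log2((pj + nj) / N)
               for pj, nj in splits if pj + nj > 0)
    return gain / i_a if i_a > 0 else 0.0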
In the two-class case (note that $E(S) = I(p, n)$, where $I(\cdot, \cdot)$ denotes the entropy of a two-way split), this becomes:

$$\text{Gain-Ratio}(S, A) = \frac{I(p, n) - \left[ \frac{S_1}{N} I(S_{11}, S_{12}) + \frac{S_2}{N} I(S_{21}, S_{22}) \right]}{I(S_1, S_2)}$$

where:

S1: the number of examples for which A is positive;
S2: the number of examples for which A is negative;
S11: the number of examples for which A is positive and the class value is positive;
S12: the number of examples for which A is positive and the class value is negative;
S21: the number of examples for which A is negative and the class value is positive;
S22: the number of examples for which A is negative and the class value is negative.

Going on with the simplification, we can get:

$$\text{Gain-Ratio}(S,A) = \frac{ -\frac{p}{N}\log_2\frac{p}{N} - \frac{n}{N}\log_2\frac{n}{N} - \frac{S_1}{N}\left[ -\frac{S_{11}}{S_1}\log_2\frac{S_{11}}{S_1} - \frac{S_{12}}{S_1}\log_2\frac{S_{12}}{S_1} \right] - \frac{S_2}{N}\left[ -\frac{S_{21}}{S_2}\log_2\frac{S_{21}}{S_2} - \frac{S_{22}}{S_2}\log_2\frac{S_{22}}{S_2} \right] }{ -\frac{S_1}{N}\log_2\frac{S_1}{N} - \frac{S_2}{N}\log_2\frac{S_2}{N} }$$

In the equation above, every term in both the numerator and the denominator contains a logarithmic calculation and a factor $\frac{1}{N}$. Dividing numerator and denominator by $\log_2 e$, multiplying both by $N$, and flipping the signs of both at the same time (which leaves the quotient unchanged), we get:

$$\text{Gain-Ratio}(S,A) = \frac{ p\ln\frac{p}{N} + n\ln\frac{n}{N} - \left[ S_{11}\ln\frac{S_{11}}{S_1} + S_{12}\ln\frac{S_{12}}{S_1} + S_{21}\ln\frac{S_{21}}{S_2} + S_{22}\ln\frac{S_{22}}{S_2} \right] }{ S_1\ln\frac{S_1}{N} + S_2\ln\frac{S_2}{N} }$$

Because $N = p + n$, we have $\frac{p}{N} + \frac{n}{N} = 1$, so $\frac{p}{N}$ and $\frac{n}{N}$ can be replaced with $1 - \frac{n}{N}$ and $1 - \frac{p}{N}$ respectively; treating the other ratios in the same way, we get the equation:

$$\text{Gain-Ratio}(S,A) = \frac{ p\ln\left(1-\frac{n}{N}\right) + n\ln\left(1-\frac{p}{N}\right) - \left[ S_{11}\ln\left(1-\frac{S_{12}}{S_1}\right) + S_{12}\ln\left(1-\frac{S_{11}}{S_1}\right) + S_{21}\ln\left(1-\frac{S_{22}}{S_2}\right) + S_{22}\ln\left(1-\frac{S_{21}}{S_2}\right) \right] }{ S_1\ln\left(1-\frac{S_2}{N}\right) + S_2\ln\left(1-\frac{S_1}{N}\right) }$$

Because we already have equation (2), every logarithmic term can be approximated, e.g. $p\ln(1-\frac{n}{N}) \approx -\frac{pn}{N}$; each product then appears twice, and the common factor $-2$ cancels between numerator and denominator, so we get:

$$\text{Gain-Ratio}(S, A) = \frac{ \frac{pn}{N} - \left[ \frac{S_{11} S_{12}}{S_1} + \frac{S_{21} S_{22}}{S_2} \right] }{ \frac{S_1 S_2}{N} }$$

In the expression above, Gain-Ratio(S, A) involves only addition, subtraction, multiplication and division, with no logarithmic calculation, so the computing time is much shorter than for the original expression. What's more, the simplification can be extended to the multi-class case.
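The simplified formula maps directly onto a handful of counter arithmetic operations. As an illustration (our own sketch, using the paper's S11 ... S22 notation):

def gain_ratio_approx(S11, S12, S21, S22):
    # Approximate Gain-Ratio(S, A) from the simplified formula above.
    # S11/S12: positive/negative examples among the S1 examples where A is positive;
    # S21/S22: positive/negative examples among the S2 examples where A is negative.
    S1 = S11 + S12
    S2 = S21 + S22
    N = S1 + S2
    if S1 == 0 or S2 == 0:
        return 0.0                                # degenerate split, no information
    p = S11 + S21                                 # total positive examples
    n = S12 + S22                                 # total negative examples
    numerator = p * n / N - (S11 * S12 / S1 + S21 * S22 / S2)
    denominator = S1 * S2 / N
    return numerator / denominator

Only the four counters are needed, and they can be collected in a single scan of the data; this is where the O(n) total cost claimed in Section III.C comes from.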

B. Reasonable arguments for the improvement

In the improvement of C4.5 above, no item is added or removed; only approximate calculation is used when computing the information gain ratio, and the antilogarithm in each logarithmic calculation is a probability, which is less than 1. To keep the derivation simple, only two categories are used in this article, and in that case the probabilities are somewhat larger than in the multi-class case. The probabilities become smaller as the number of categories grows, which makes the approximation even more accurate. Furthermore, the approximate calculation is backed by the guarantee of L'Hospital's Rule, so the improvement is reasonable.
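As a quick sanity check of this argument (our own, reusing the gain_ratio and gain_ratio_approx sketches above, with counts loosely patterned on Table 1 in Section IV), the exact and approximate values can be compared directly:

# Compare exact and approximate gain ratio on a few concrete splits.
for S11, S12, S21, S22 in [(82, 43, 13, 32), (62, 65, 13, 30), (500, 40, 60, 400)]:
    exact = gain_ratio([(S11, S12), (S21, S22)])
    approx = gain_ratio_approx(S11, S12, S21, S22)
    print(f"exact={exact:.3f}  approx={approx:.3f}")

The absolute values differ noticeably, which matches the contrast ratios reported in Section IV; the paper's point is that the resulting decision trees stay close, not that the numbers coincide.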


C. Comparison of the complexity

To calculate Gain-Ratio(S, A), the C4.5 algorithm's complexity is mainly concentrated in E(S) and E(S, A). When we compute E(S), each probability value needs to be calculated first, which takes O(n) time; then each one is multiplied by its logarithm and accumulated, which takes O(log2 n) time, so the complexity is O(log2 n). Again, in the calculation of E(S, A) the complexity is O(n(log2 n)^2), so the total complexity of Gain-Ratio(S, A) is O(n(log2 n)^2).

The improved C4.5 algorithm operates only on the original counts and uses only addition, subtraction, multiplication and division. It needs just one scan to obtain the totals, followed by a few simple calculations, so the total complexity is O(n).
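To make the comparison tangible, a rough micro-benchmark (our own, on synthetic counts; it assumes the gain_ratio and gain_ratio_approx sketches defined earlier in this section) times the two formulas on many random splits:

import random
import time

random.seed(0)
# Synthetic two-way splits: (S11, S12, S21, S22) counter tuples.
splits = [tuple(random.randint(1, 10_000) for _ in range(4)) for _ in range(100_000)]

start = time.perf_counter()
for S11, S12, S21, S22 in splits:
    gain_ratio([(S11, S12), (S21, S22)])        # logarithmic version
t_log = time.perf_counter() - start

start = time.perf_counter()
for S11, S12, S21, S22 in splits:
    gain_ratio_approx(S11, S12, S21, S22)       # arithmetic-only version
t_arith = time.perf_counter() - start

print(f"log-based: {t_log:.2f} s   arithmetic-only: {t_arith:.2f} s")

In interpreted Python the gap between the two loops is far smaller than a low-level implementation would show, so this sketch is only meant to make visible where the logarithms are avoided, not to reproduce the paper's reported speedup.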


IV. EXPERIMENTS

The improved algorithm is aimed at large amounts of data, so we implemented it on the real sales data of a tobacco company in a certain year and compared the result with that of the original algorithm. The concrete implementation is as follows.

A. Data Preparation

According to the requirements of the analysis of cigarette sales, an analysis data table is obtained from the cigarette table and the cigarette sales table, connected with one join and one query statement. The following 3 steps are then performed to obtain the data set of targets relevant to cigarette sales: data cleaning (to reduce noisy values and deal with null values), correlation analysis (feature selection) and data conversion (to generalize or normalize data); a preprocessing sketch is given below. The resulting data set has 4 attributes: price category (A), cigarette packing specification (B), cigarette specification (C) and cigarette production place (D). The examples are divided into high sales volume (positive examples) and low sales volume (negative examples).
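The three preparation steps can be expressed compactly with pandas. The sketch below is purely illustrative: the file names, column names and cut-off are hypothetical stand-ins of our own, since the paper does not give its actual schema:

import pandas as pd

# Hypothetical source tables and column names; the paper's real schema is not given.
cigarette = pd.read_csv("cigarette.csv")        # one row per cigarette product
sales = pd.read_csv("cigarette_sales.csv")      # yearly sales per product

# The "one connection and query statement": join the two tables into one analysis table.
data = cigarette.merge(sales, on="product_id")

# Step 1, data cleaning: drop noisy rows and deal with null values.
data = data.dropna(subset=["price", "packing", "spec", "place"])
data["sales_volume"] = data["sales_volume"].fillna(0)

# Step 2, correlation analysis (feature selection): keep the 4 chosen attributes.
features = data[["price", "packing", "spec", "place", "sales_volume"]].copy()

# Step 3, data conversion: generalize sales volume into the two classes.
threshold = features["sales_volume"].median()   # hypothetical cut-off
features["label"] = (features["sales_volume"] > threshold).map(
    {True: "high sales volume", False: "low sales volume"})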
TABLE 1: THE SALE OF CIGARETTE

sale (E)                 price (A)         cigarette packing (B)    cigarette (C)        production place (D)
low sales volume (75)    Low-grade (43)    Hardboard box (62)       cured tobacco (17)   Outside of the province (67)
                         High-grade (32)   No hardboard box (13)    General Cigar (58)   Inside of the province (8)
high sales volume (95)   Low-grade (82)    Hardboard box (65)       cured tobacco (6)    Outside of the province (68)
                         High-grade (13)   No hardboard box (30)    General Cigar (89)   Inside of the province (27)

Each row group gives, for one sales class, how many of its examples take each attribute value; the counts in every column of a group sum to the size of that class (75 and 95 respectively).

The information gain ratio of each attribute is calculated with both the improved and the original C4.5 algorithm, and the contrast ratio between the two values is calculated too.

              C4.5     improved C4.5    contrast ratio
Attribute A   0.096    0.135            0.289
Attribute B   0.024    0.034            0.257
Attribute C   0.052    0.081            0.358
Attribute D   0.055    0.072            0.236

Figure 1. The calculation contrast between C4.5 and the improved C4.5.

We can find that although the information gain ratio changes a little and the precision is worse than that of C4.5, the contrast ratio is not very large, so the improvement does not have a profound influence on the decision tree, while the efficiency is improved a lot.

B. Discussion

Figure 2. The efficiency comparison between C4.5 and the improved C4.5.

Time is saved because the complexity is changed from O(n(log2 n)^2) to O(n). Moreover, the improved C4.5 does not need to scan the data several times, so memory is saved too.

Figure 3. The decision tree generated by C4.5.

Figure 4. The decision tree generated by the improved C4.5.

The attribute with the largest information gain ratio is selected as the root to build the decision tree. Comparing the rules extracted by the two decision tree algorithms, we find that there is little difference between the decision trees produced by C4.5 and improved C4.5, and there are also some differences between the classification rules of the two algorithms. The improved algorithm reduces the influence of unimportant attributes on commodity sales while increasing the influence of important attributes, for the important attributes end up closer to the root. Obviously the improved algorithm is not only concise and practical but also more efficient. It should be pointed out that all attributes are divided into two classes in order to make the formulas more concise. In practice, attributes can be divided into more than two classes according to the requirements, and the improvement is still applicable; the more classes are divided, the more precise the result. The improved C4.5 algorithm speeds up the processing phase and improves efficiency, especially on large data sets.
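To show how the simplified criterion slots into tree construction, here is a minimal recursive builder; it is our own sketch rather than the paper's implementation, and it assumes the gain_ratio_approx function from Section III together with binary attributes and classes throughout:

def build_tree(rows, attributes, target="label"):
    # rows: records mapping attribute name -> 0/1, plus a 0/1 target label.
    # Returns either a class label (leaf) or (attribute, subtree_0, subtree_1).
    labels = [r[target] for r in rows]
    if len(set(labels)) <= 1 or not attributes:
        # Pure node or no attributes left: majority-class leaf.
        return max(set(labels), key=labels.count)

    def counts(attr):
        # S11, S12: class 1/0 counts where attr == 1; S21, S22: where attr == 0.
        S11 = sum(1 for r in rows if r[attr] == 1 and r[target] == 1)
        S12 = sum(1 for r in rows if r[attr] == 1 and r[target] == 0)
        S21 = sum(1 for r in rows if r[attr] == 0 and r[target] == 1)
        S22 = sum(1 for r in rows if r[attr] == 0 and r[target] == 0)
        return S11, S12, S21, S22

    # Pick the attribute with the largest approximate gain ratio.
    best = max(attributes, key=lambda a: gain_ratio_approx(*counts(a)))
    rest = [a for a in attributes if a != best]
    side0 = [r for r in rows if r[best] == 0]
    side1 = [r for r in rows if r[best] == 1]
    if not side0 or not side1:
        return max(set(labels), key=labels.count)
    return (best, build_tree(side0, rest, target), build_tree(side1, rest, target))

Swapping gain_ratio_approx for the exact gain_ratio reproduces the original C4.5 split choice, so the two trees of Figures 3 and 4 differ only through the attribute ranking at each node.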
V. CONCLUSION

In this paper, the C4.5 algorithm was improved. Although approximate calculation is used in computing Gain-Ratio(S, A), the experiment proved that it has minimal impact on classification accuracy while the efficiency is increased a lot. We can not only speed up the growth of the decision tree, but also get a better-structured decision tree, so that better rule information can be mined. In this paper, the algorithm was verified by the analysis of tobacco sales. With the improved algorithm we can get faster and more effective results without any change in the final decision. Therefore, the efficiency of classification is greatly improved, and the disadvantages of C4.5, namely low efficiency and high memory consumption when dealing with large amounts of data, are overcome. If the amount of data is not very large, the original C4.5 is recommended because of its higher accuracy.

REFERENCES

[1] Mehta, M., Agrawal, R., Rissanen, J. SLIQ: a fast scalable classifier for data mining. In: Apers, P., Bouzeghoub, M., Gardarin, G., eds. Proceedings of the 5th International Conference on Extending Database Technology. Berlin: Springer-Verlag, 1996. 18-32.
[2] Wang, M., Iyer, B., Vitter, J.S. Scalable mining for classification rules in relational databases. In: Eaglestone, B., Desai, B.C., Shao, Jian-hua, eds. Proceedings of the 1998 International Database Engineering and Applications Symposium. Wales: IEEE Computer Society, 1998. 58-67.
[3] Quinlan, J.R. Induction of decision trees [J]. Machine Learning, 1986.
[4] Quinlan, J.R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[5] Liu, B., Hsu, W., Ma, Y. Integrating classification and association rule mining. In: Agrawal, R., ed. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York: AAAI Press, 1998. 80-86.
[6] Quinlan, J.R. Improved use of continuous attributes in C4.5 [J]. Journal of Artificial Intelligence Research, 1996, 4: 77-90.
[7] UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html
[8] Wei Zhao, Jianming Su. Based on the research of the decision tree ID3 algorithms and improvement. Computer Applications and Software, 2003.
