
Global Congress on Intelligent Systems

A New Algorithm For Frequent Itemsets Mining Based On Apriori And FP-Tree


Qihua Lan, Defu Zhang, Bo Wu
Department of Computer Science, Xiamen University, Xiamen 361005, China
Email: langel826@sina.com

Abstract

Frequent itemset mining plays an important role in association rule mining. The Apriori algorithm and the FP-growth algorithm are the most famous algorithms; most existing frequent itemset mining algorithms are improvements of one of these two, and they suffer from many problems when mining massive transactional datasets. In this paper, a new algorithm named APFT is proposed. It combines the Apriori algorithm with the FP-tree structure proposed in the FP-growth algorithm. The advantage of APFT is that it does not need to generate conditional pattern bases and sub-conditional pattern trees recursively. The results of the experiments show that it works faster than Apriori and almost as fast as FP-growth.

1. Introduction

Association rule mining is one of the most important data mining problems. Its purpose is the discovery of association relationships among a set of items. Mining association rules involves two subproblems: (1) finding all frequent itemsets that appear more often than a minimum support threshold, and (2) generating association rules from these frequent itemsets. The first subproblem plays the key role in association rule mining.

A number of algorithms for mining frequent itemsets have been proposed since Agrawal first introduced the problem of deriving categorical association rules from transactional databases in [1]. The existing algorithms can be categorized into two classes: the candidate generate-and-test approach and the pattern-growth approach. The first class contains algorithms such as Apriori [1, 2] and many subsequent studies [3, 4, 5]. In each iteration of the candidate generate-and-test approach, pairs of frequent k-itemsets are joined to form candidate (k+1)-itemsets, and the database is then scanned to verify their supports. The Apriori algorithm achieves a good reduction in the size of the candidate sets; however, it scans the database as many times as the length of the longest pattern, so when there are a large number of frequent patterns and/or long patterns, the candidate generate-and-test approach may suffer a large I/O overhead. The second class comprises the pattern-growth methods; over the past few years, several pattern-growth algorithms have been proposed, such as FP-growth [6], Tree-projection [7], H-Mine [8] and COFI [9]. A pattern-growth algorithm uses the FP-tree to store the database; instead of generating candidates, it mines the FP-tree recursively, building conditional trees whose number is of the same order of magnitude as the number of frequent patterns. Compared with the first class, the pattern-growth approach is more efficient but needs more memory to store the intermediate data structures. This massive creation of conditional trees makes the approach unable to scale to large datasets beyond a few million transactions [6].

To overcome the limitations of the two approaches mentioned above, we propose a new method named APFT, which combines the Apriori algorithm with the FP-tree structure. It first constructs an FP-tree, just as FP-growth does; then, for each item in the header table, it finds all branches that include the item, generates the candidate itemsets containing that item using the same candidate-generation method as Apriori, and finally scans the corresponding branches to count the supports of those candidates. The results of our experiments show that it works faster than Apriori and almost as fast as FP-growth.

The rest of the paper is organized as follows. Section 2 formally introduces the problem and briefly reviews the Apriori method and the FP-tree structure. Section 3 proposes the new algorithm. Experimental results are presented in Section 4. Section 5 concludes this paper.

2. Problem Statement

In this section, we first introduce the problem, and then review the Apriori method and the FP-tree structure.

2.1 Mining frequent itemsets

Let I = {x1, x2, ..., xn} be a set of items. An itemset X, also called a pattern, is a subset of I, denoted by X ⊆ I. A transaction TX = (TID, X) is a pair, where X is a pattern and TID is its unique identifier. A transaction TX is said to contain TY if and only if Y ⊆ X. A transaction database, TDB, is a set of transactions. The number of transactions in TDB that contain X is called the support of X. A pattern X is a frequent pattern if and only if its support is larger than or equal to s, where s is a threshold called the minimum support.

Given a transaction database TDB and a minimum support threshold s, the problem of finding the complete set of frequent itemsets is called the frequent itemset mining problem.
2.2 Apriori algorithm

Agrawal [1, 2] first proposed the Apriori algorithm. It is the most well-known association rule algorithm and is used in most commercial products. The use of support for pruning candidate itemsets is guided by the following principles.

Property 1: If an itemset is frequent, then all of its subsets must also be frequent.

Property 2: If an itemset is infrequent, then all of its supersets must also be infrequent.

The algorithm initially scans the database to count the support of each item. Upon completion of this step, the set of all frequent 1-itemsets, F1, is known. Next, the algorithm iteratively generates new candidate k-itemsets from the frequent (k-1)-itemsets found in the previous iteration. Candidate generation is implemented by a function called apriori-gen. To count the supports of the candidates, the algorithm makes an additional scan over the database; the subset function determines the candidate itemsets in Ck that are contained in each transaction t. After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than minsup. The algorithm terminates when no new frequent itemsets are generated. The Apriori algorithm can be written in pseudocode as follows.

Algorithm 1. Apriori [2]
Input: data set D, minimum support minsup
Output: frequent itemsets L
1. k = 1;
2. Fk = {i | i ∈ I ∧ σ({i}) ≥ N × minsup}
3. repeat
4.   k = k + 1;
5.   Ck = apriori-gen(Fk-1);
6.   for each transaction t ∈ D do
7.     Ct = subset(Ck, t);
8.     for each candidate itemset c ∈ Ct do
9.       σ(c) = σ(c) + 1;
10.  Fk = {c | c ∈ Ck ∧ σ(c) ≥ N × minsup};
11. until Fk = ∅
12. return L = ∪k Fk;

Procedure apriori-gen
Input: (k-1)-frequent itemsets Fk-1
Output: k-candidate itemsets Ck
1. for each itemset p ∈ Fk-1 do
2.   for each itemset q ∈ Fk-1 do
3.     if p.item1 = q.item1 ∧ p.item2 = q.item2 ∧ ... ∧ p.item(k-2) = q.item(k-2) and p.item(k-1) < q.item(k-1)
4.       c = p ⋈ q;
5.       if has_infrequent_subset(c, Fk-1)
6.         delete c;
7.       else add c to Ck;
8. return Ck;
commercial products. The use of support for pruning
candidate itemsets is guided by the following 2.3. FP-tree
principles.
Property 1: If an itemset is frequent, then all of its Han et al. developed an efficient algorithm, FP-
subsets must also be frequent. growth, bases on FP-tree. It mining frequent itemsets
Property 2: If an itemset is infrequent, then all of its without generating candidates, this approach scans the
supersets must also be infrequent. database only twice[6]. The first scan is to find 1-
The algorithm initially scans the database to count frequent itemset, the second scan is to construct the
the support of each item. Upon completion of this step, FP-tree. The FP-tree has sufficient information to mine
the set of all frequent 1-itemsets, F1, will be known. complete frequent patterns, it consists of a prefix-tree
Next, the algorithm will iteratively generate new of frequent 1-itemset and a frequent-item header table
candidate k-itemsets using the frequent (k-1)-itemsets in which the items are arranged in order of decreasing
found in the previous iteration. Candidate generation is support value.
implemented using a function called Apriori-gen. To Each node in the prefix-tree has three fields: item-
count the support of the candidates, the algorithm name, count, and node-link.
needs to make an additional scan over the database. item-name is the name of the item.
The subset function is used to determine all the count is the number of transactions that consist of
candidate itemsets in Ck that are contained in each the frequent 1-items on the path from the root to this
transaction t. After counting their supports, the node.
algorithm eliminates all candidate itemsets whose node-link is the link to the next same item-name
support counts are less than minsup. The algorithm node in the FP-tree.
terminates when there are no new frequent itemsets

Each entry in the frequent-item header table has two fields: item-name and head of node-link.

item-name is the name of the item.

head of node-link is the link to the first node in the prefix tree with the same item-name.

Algorithm 2. FP-tree construction [6]
Input: a transaction database TDB and a minimum support threshold ξ
Output: its frequent pattern tree, FP-tree
1. Scan the transaction database TDB once. Collect the set of frequent items F and their supports. Sort F in support-descending order as L.
2. Create the root of an FP-tree, T, and label it "null". Then, for each transaction in TDB, do the following: select and sort the frequent items in the transaction according to the order of L; let the sorted frequent-item list of the transaction be [p|P], where p is the first element and P is the remaining list; call insert-tree([p|P], T).

Function insert-tree([p|P], T)
1. if T has a child N such that N.item-name = p.item-name
2.   then increment N's count by 1;
3. else do
4.   create a new node N;
5.   set N's count to 1;
6.   link N's parent link to T;
7.   link N's node-link to the nodes with the same item-name via the node-link structure;
8. if P is nonempty, call insert-tree(P, N);
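As a concrete illustration of the node fields and the insert-tree procedure, here is a minimal C++ sketch (our own; the class layout and names are assumptions, not code from the paper):

#include <map>
#include <memory>
#include <string>
#include <vector>

// One FP-tree node: item-name, count, node-link (the next node with the
// same item-name), plus a parent link and child map for walking the tree.
struct FPNode {
    std::string item;
    int count = 0;
    FPNode* parent = nullptr;
    FPNode* nodeLink = nullptr;                    // next same-item node
    std::map<std::string, std::unique_ptr<FPNode>> children;
};

struct FPTree {
    FPNode root;                                   // labelled "null"
    std::map<std::string, FPNode*> header;         // item-name -> node chain

    // insert-tree([p|P], T): insert one sorted frequent-item list.
    void insert(const std::vector<std::string>& items, size_t i = 0,
                FPNode* t = nullptr) {
        if (!t) t = &root;
        if (i == items.size()) return;             // P is empty: done
        auto& child = t->children[items[i]];
        if (!child) {                              // no child named p yet
            child = std::make_unique<FPNode>();
            child->item = items[i];
            child->parent = t;
            child->nodeLink = header[items[i]];    // thread the node-link chain
            header[items[i]] = child.get();
        }
        child->count += 1;                         // shared prefix: bump count
        insert(items, i + 1, child.get());         // recurse on the rest, P
    }
};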
An example of an FP-tree is shown in Figure 1. This FP-tree is constructed from the TDB shown in Table 1 with minsup = 3. In Figure 1, every node is represented by (item-name : count). Links to the next node with the same item-name are represented by dotted arrows.

Figure 1. Example of FP-tree

Table 1. Sample TDB
TID   Transaction        Frequent Items
001   f,a,c,d,g,i,m,p    f,c,a,m,p
002   a,b,c,f,l,m,o      f,c,a,b,m
003   b,f,h,j,o          f,b
004   b,c,k,s,p          c,b,p
005   a,f,c,e,l,p,m,n    f,c,a,m,p

3. APFT Algorithm

To avoid generating candidates, the FP-growth algorithm has to construct a large number of conditional FP-trees, which may make the mining process fail when the database is sparse or there are a lot of frequent patterns, because the memory requirement exceeds the main memory. Therefore, we consider using the Apriori method to mine the frequent itemsets on top of the FP-tree, while the divide-and-conquer strategy is still adopted by the mining process. That is to say, the compressed FP-tree is partitioned into a set of conditional subtrees, each associated with a frequent item. If there are n frequent 1-items Ii (i = 1, 2, ..., n), then the FP-tree can be divided into n conditional subtrees FPTi (i = 1, 2, ..., n), where FPTi is the conditional subtree associated with the frequent item Ii. The Apriori algorithm is then used to mine each conditional subtree and obtain all the frequent itemsets whose first prefix item is Ii. The APFT algorithm consists of two steps: the first step constructs the FP-tree as FP-growth does, and the second step uses the Apriori algorithm to mine the FP-tree. In the second step, an additional node table, named NTable, is needed; each entry in the NTable has two fields, Item-name and Item-support.

Item-name: the name of a node that appears in FPTi.

Item-support: the number of times the node appears together with Ii.

The pseudocode of the APFT algorithm is described below.

Algorithm 3. APFT
Input: FP-tree, minimum support threshold ξ
Output: all frequent itemsets L
1. L = L1;
2. for each item Ii in the header table, in top-down order
3.   LIi = Apriori-mining(Ii);
4. return L = L ∪ LI1 ∪ LI2 ∪ ... ∪ LIn;
Procedure Apriori-mining(Ii)
1. find the item p in the header table that has the same name as Ii;
2. q = p.tablelink;
3. while q is not null
4.   for each node qi ≠ root on the prefix path of q
5.     if NTable has an entry N such that N.Item-name = qi.item-name
6.       N.Item-support = N.Item-support + q.count;
7.     else
8.       add an entry N to the NTable;
9.       N.Item-name = qi.item-name;
10.      N.Item-support = q.count;
11.  q = q.tablelink;
12. k = 1;
13. Fk = {j | j ∈ NTable ∧ j.Item-support ≥ minsup}
14. repeat
15.   k = k + 1;
16.   Ck = apriori-gen(Fk-1);
17.   q = p.tablelink;
18.   while q is not null
19.     find the prefix path t of q;
20.     Ct = subset(Ck, t);
21.     for each c ∈ Ct
22.       c.support = c.support + q.count;
23.     q = q.tablelink;
24.   Fk = {c | c ∈ Ck ∧ c.support ≥ minsup}
25. until Fk = ∅
26. return LIi = {Ii} ∪ F1 ∪ F2 ∪ ... ∪ Fk  // generate all frequent itemsets with Ii as the first prefix item
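Lines 1-11 of Apriori-mining only walk the node-link chain of Ii and accumulate, in the NTable, how often every other item occurs on a prefix path of Ii. The following C++ sketch of this counting phase builds on the FPNode/FPTree sketch from Section 2.3 (the function name and layout are ours, not the paper's):

#include <map>
#include <string>

// Build the NTable for item Ii: Item-name -> Item-support, where the
// support of an item is the sum of q.count over all nodes q of Ii,
// taken along the prefix path from q's parent up to, but excluding, the root.
std::map<std::string, int> buildNTable(const FPTree& tree,
                                       const std::string& ii) {
    std::map<std::string, int> ntable;
    auto it = tree.header.find(ii);
    FPNode* q = (it == tree.header.end()) ? nullptr : it->second;
    while (q) {                                    // follow the node-link chain
        for (FPNode* n = q->parent; n && n->parent; n = n->parent)
            ntable[n->item] += q->count;           // each non-root prefix node
        q = q->nodeLink;
    }
    return ntable;
}

// F1 is then {j in NTable | NTable[j] >= minsup}; the higher levels F2, F3,
// ... are produced with apriori-gen and counted by re-walking the same
// prefix paths (lines 14-25 above).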
To explain the algorithm, we use an example with the transaction database shown in Table 1; the FP-tree of this database is shown in Figure 1. The mining process begins at the top of the header table and moves toward the bottom. The f-node has only one prefix path, and that path contains no node other than the root, so there is no frequent itemset with f as the first prefix item. For the next item in the header table, c, there are two prefix paths, {f} and {root}; f.support = c.count = 3 = minsup, and there is no other frequent item in c's prefixes, so we gain the frequent itemset {c, f} with support 3, and the mining of frequent itemsets with first prefix item c terminates. Next, the a-node has one prefix path, cf, and c.support = f.support = a.count = 3, so items c and f are frequent 1-items; the 2-candidate cf is generated by joining c and f, the 2-subsets of a's prefix path (here just cf) are generated, and (cf).support = a.count = 3 = minsup, so the itemset cf is frequent. Finally, we get the frequent itemsets {ac:3, af:3, acf:3}. The remaining mining processes are similar.

4. Experimental Results

To verify the performance of the APFT algorithm, we compare it with Apriori and FP-growth. The three algorithms were run on a computer with a 1.41 GHz processor and 512 MB of memory, running Windows XP. The programs were developed in Visual C++ 6.0. We present experimental results on three databases. The database bankcard consists of bank card transactions obtained from the Bank of China. The databases mushroom and T10I4D100K can be found at http://fimi.cs.helsinki.fi/data/. Some characteristics of these databases are shown in Table 2. The experimental results are shown in Figure 2, Figure 3 and Figure 4, respectively. As these figures show, APFT outperforms Apriori because it does not need to generate 2-candidate itemsets and it reduces the search space. APFT performs almost as fast as FP-growth, and when minsup is low, APFT runs faster than FP-growth: in that case FP-growth needs to construct a large number of conditional subtrees, which is both time-consuming and memory-intensive, while APFT does not need much extra space during the mining process, so APFT has better space scalability.

Table 2. Database characteristics
Database     Items  Records  Max|T|  Avg|T|
Bankcard     27     50905    13      6
Mushroom     119    8124     23      23
T10I4D100K   870    100000   29      10

Figure 2. Bankcard

Figure 3. Mushroom

Figure 4. T10I4D100K

5. Conclusion

In this paper, a new algorithm is proposed that combines the Apriori algorithm and the FP-tree structure. The experimental results show that the new algorithm works much faster than Apriori and almost as fast as FP-growth, and it works a little faster than FP-growth when the support threshold is small. Future work is to optimize the technique for counting the supports of the candidates and to extend the algorithm to mining even larger databases.

6. Acknowledgments

This work was supported by the National Nature Science Foundation of China (Grant No. 60773126), the Province Nature Science Foundation of Fujian (Grant No. A0710023), an academician start-up fund (Grant No. X01109), and the 985 information technology fund (Grant No. 0000-X07204) of Xiamen University.

7. References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, Washington, D.C., May 1993, pp. 207-216.

[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, pp. 487-499.

[3] J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD 1995, pp. 175-186.

[4] J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. Proc. ACM SIGMOD Int. Conf. on Management of Data, San Jose, CA, 1995, pp. 175-186.

[5] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. Proc. 21st Int. Conf. on Very Large Data Bases, 1995.

[6] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data, May 2000.

[7] R. Agarwal, C. Aggarwal, and V.V.V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.

[8] J. Pei, J. Han, and H. Lu. H-Mine: hyper-structure mining of frequent patterns in large databases. In ICDM, 2001, pp. 441-448.

[9] M. El-Hajj and O.R. Zaïane. COFI approach for mining frequent itemsets revisited. In 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD-04), Paris, France, June 2004.
