9780769535715/09 $25.00 © 2009 IEEE 360
DOI 10.1109/GCIS.2009.387
Authorized licensed use limited to: GOVERNMENT COLLEGE OF TECHNOLOGY. Downloaded on June 07,2010 at 12:03:31 UTC from IEEE Xplore. Restrictions apply.
Section 3 proposes a new algorithm. Experimental results are presented in Section 4. Section 5 concludes this paper.

2. Problem Statement

In this section, we first introduce the problem, then review the Apriori method and the FP-tree structure.

2.1 Mining frequent itemsets

Let I = {x1, x2, x3, …, xn} be a set of items. An itemset X, also called a pattern, is a subset of I, denoted by X ⊆ I. A transaction TX = (TID, X) is a pair, where X is a pattern and TID is its unique identifier. A transaction TX is said to contain TY if and only if Y ⊆ X. A transaction database, named TDB, is a set of transactions. The number of transactions in TDB that contain X is called the support of X. A pattern X is a frequent pattern if and only if its support is larger than or equal to s, where s is a threshold called the minimum support.

Given a transaction database TDB and a minimum support threshold s, the problem of finding the complete set of frequent itemsets is called the frequent itemset mining problem.

2.2 Apriori algorithm

Agrawal [1,2] first proposed the Apriori algorithm. The Apriori algorithm is the most well-known association rule algorithm and is used in most commercial products. The use of support for pruning candidate itemsets is guided by the following principles.

Property 1: If an itemset is frequent, then all of its subsets must also be frequent.

Property 2: If an itemset is infrequent, then all of its supersets must also be infrequent.

The algorithm initially scans the database to count the support of each item. Upon completion of this step, the set of all frequent 1-itemsets, F1, is known. Next, the algorithm iteratively generates new candidate k-itemsets from the frequent (k-1)-itemsets found in the previous iteration. Candidate generation is implemented by a function called apriori-gen. To count the supports of the candidates, the algorithm makes an additional scan over the database; the subset function determines which of the candidate itemsets in Ck are contained in each transaction t. After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than minsup. The algorithm terminates when no new frequent itemsets are generated. The Apriori algorithm can be written in pseudocode as follows.

Algorithm 1. Apriori [2]
Input: data set D, minimum support minsup
Output: frequent itemsets L
1. k = 1;
2. Fk = {i | i ∈ I ∧ σ({i}) ≥ N × minsup};
3. repeat
4. k = k + 1;
5. Ck = apriori-gen(Fk-1);
6. for each transaction t ∈ D do
7. Ct = subset(Ck, t);
8. for each candidate itemset c ∈ Ct do
9. σ(c) = σ(c) + 1;
10. Fk = {c | c ∈ Ck ∧ σ(c) ≥ N × minsup};
11. until Fk = ∅
12. return L = ∪ Fk;

Procedure apriori-gen
Input: (k-1)-frequent itemsets Fk-1
Output: k-candidate itemsets Ck
1. for each itemset p ∈ Fk-1 do
2. for each itemset q ∈ Fk-1 do
3. if p.item1 = q.item1 ∧ p.item2 = q.item2 ∧ … ∧ p.item(k-2) = q.item(k-2) and p.item(k-1) < q.item(k-1)
4. c = p ⋈ q;
5. if has_infrequent_subset(c, Fk-1)
6. delete c;
7. else add c to Ck;
8. return Ck;

2.3 FP-tree

Han et al. developed an efficient algorithm, FP-growth, based on the FP-tree. It mines frequent itemsets without generating candidates and scans the database only twice [6]. The first scan finds the frequent 1-itemsets; the second scan constructs the FP-tree. The FP-tree holds sufficient information to mine the complete set of frequent patterns. It consists of a prefix tree of frequent 1-items and a frequent-item header table in which the items are arranged in order of decreasing support.

Each node in the prefix tree has three fields: item-name, count, and node-link.
item-name is the name of the item.
count is the number of transactions that contain the frequent 1-items on the path from the root to this node.
node-link is the link to the next node in the FP-tree with the same item-name.
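The three node fields just listed map naturally onto a small record type. As a minimal illustration (the class and attribute names here are ours; the parent and children links are implementation conveniences implied by the prefix-tree structure rather than fields named in the text):

```python
class FPNode:
    """One FP-tree node: the three fields described above, plus links
    implied by the prefix-tree structure."""

    def __init__(self, item_name):
        self.item_name = item_name  # item-name: the name of the item
        self.count = 0              # count: transactions sharing this path
        self.node_link = None       # node-link: next node with same item-name
        self.parent = None          # parent node in the prefix tree (implied)
        self.children = {}          # item-name -> child FPNode (implied)
```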
Each entry in the frequent-item header table has two fields: item-name and head of node-link.
item-name is the name of the item.
head of node-link is the link to the first node in the prefix tree with the same item-name.

Algorithm 2. FP-tree construction [3]
Input: a transaction database TDB and a minimum support threshold ξ.
Output: its frequent pattern tree, FP-tree.
1. Scan the transaction database TDB once. Collect the set of frequent items F and their supports. Sort F in support-descending order as L.
2. Create the root of an FP-tree, T, and label it "null". For each transaction in TDB, select and sort the frequent items in the transaction according to the order of L. Let the sorted frequent item list in the transaction be [p|P], where p is the first element and P is the remaining list. Call insert-tree([p|P], T).

Function insert-tree([p|P], T)
1. if T has a child N such that N.item-name = p.item-name
2. then increment N's count by 1;
3. else do
4. create a new node N;
5. set N's count to 1;
6. link N's parent to T;
7. link N's node-link to the nodes with the same item-name via the node-link structure;
8. if P is nonempty, call insert-tree(P, N);

An example of an FP-tree is shown in Figure 1. This FP-tree is constructed from the TDB shown in Table 1 with minsup = 3. In Figure 1, every node is represented as (item-name : count), and links to the next node with the same item-name are represented by dotted arrows.

Table 1. Sample TDB
TID | Transaction | Frequent items
001 | f,a,c,d,g,i,m,p | f,c,a,m,p
002 | a,b,c,f,l,m,o | f,c,a,b,m
003 | b,f,h,j,o | f,b
004 | b,c,k,s,p | c,b,p
005 | a,f,c,e,l,p,m,n | f,c,a,m,p

Figure 1. Example of FP-tree

3. APFT Algorithm

To avoid generating candidates, the FP-growth algorithm has to construct a large number of conditional FP-trees; when the database is sparse or there are many frequent patterns, the memory requirement can exceed main memory and cause the mining process to fail. Therefore, we consider using the Apriori method to mine the frequent itemsets based on the FP-tree, while the mining process still adopts the divide-and-conquer strategy. That is, the compressed FP-tree is partitioned into a set of conditional subtrees, each associated with one frequent item. If there are n frequent 1-items Ii (i = 1, 2, …, n), then the FP-tree can be divided into n conditional subtrees FPTi (i = 1, 2, …, n), where FPTi is the conditional subtree associated with frequent item Ii. The Apriori algorithm is then used to mine each conditional subtree, yielding all frequent itemsets whose first prefix item is Ii. The APFT algorithm consists of two steps: the first constructs the FP-tree as FP-growth does, and the second uses the Apriori algorithm to mine the FP-tree. The second step needs an additional node table, named NTable; each entry in the NTable has two fields, Item-name and Item-support.
Item-name: the name of a node that appears in FPTi.
Item-support: the number of times the node appears together with Ii.
The pseudocode of the APFT algorithm is described below.

Algorithm 3. APFT
Input: FP-tree, minimum support threshold ξ
Output: all frequent itemsets L
1. L = L1;
2. for each item Ii in the header table, in top-down order
3. LIi = Apriori-mining(Ii);
4. return L = L ∪ LI1 ∪ LI2 ∪ … ∪ LIn;
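APFT's first step builds the FP-tree exactly as FP-growth does (Algorithm 2). Before turning to the Apriori-mining procedure, here is a minimal Python sketch of that construction; it is an illustration under our own naming, nodes are modeled as plain dicts, and ties in the support ordering are broken arbitrarily:

```python
def build_fptree(transactions, minsup):
    """Sketch of FP-tree construction (Algorithm 2). `minsup` is an
    absolute support count; `header` maps each frequent item to the
    head of its node-link chain."""
    # First scan: count supports and fix the support-descending order L.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    order = sorted((i for i in counts if counts[i] >= minsup),
                   key=lambda i: -counts[i])  # ties broken arbitrarily
    rank = {item: r for r, item in enumerate(order)}
    root = {"item": None, "count": 0, "children": {}, "parent": None, "link": None}
    header = {}
    # Second scan: insert each transaction's sorted frequent items (insert-tree).
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=lambda i: rank[i]):
            child = node["children"].get(item)
            if child is None:
                child = {"item": item, "count": 0, "children": {},
                         "parent": node, "link": header.get(item)}
                header[item] = child  # newest node becomes the chain head
                node["children"][item] = child
            child["count"] += 1
            node = child
    return root, header
```

For the transactions of Table 1 with minsup = 3, this produces a tree whose root has an f-branch of count 4 and a c-branch of count 1, as in Figure 1. Note one deliberate simplification: the header here points at the most recently inserted node of each item, whereas the paper links to the first one; either convention suffices for traversing the chain.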
Procedure Apriori-mining(Ii)
1. find the item p in the header table that has the same name as Ii;
2. q = p.tablelink;
3. while q is not null
4. for each node qi ≠ root on the prefix path of q
5. if NTable has an entry N such that N.Item-name = qi.item-name
6. N.Item-support = N.Item-support + q.count;
7. else
8. add an entry N to the NTable;
9. N.Item-name = qi.item-name;
10. N.Item-support = q.count;
11. q = q.tablelink;
12. k = 1;
13. Fk = {j | j ∈ NTable ∧ j.Item-support ≥ minsup};
14. repeat
15. k = k + 1;
16. Ck = apriori-gen(Fk-1);
17. q = p.tablelink;
18. while q is not null
19. find the prefix path t of q;
20. Ct = subset(Ck, t);
21. for each c ∈ Ct
22. c.support = c.support + q.count;
23. q = q.tablelink;
24. Fk = {c | c ∈ Ck ∧ c.support ≥ minsup};
25. until Fk = ∅
26. return LIi = Ii ∪ F1 ∪ F2 ∪ … ∪ Fk; // generate all frequent itemsets that have Ii as the prefix item

To explain the algorithm, we use as an example the transaction database shown in Table 1; the FP-tree of this database is shown in Figure 1. The mining process begins at the top of the header table and moves toward the bottom. The f-node has only one prefix path, and that path contains no node other than the root, so there is no frequent itemset whose first prefix item is f. The next entry in the header table, the c-node, has two prefix paths, {f} and {root}; f.support = c.count = 3 = minsup, and there is no other frequent item in c's prefixes, so we obtain the frequent itemset {cf} with support 3, and the mining of frequent itemsets with first prefix item c terminates. Next, the a-node has one prefix path, cf, with c.support = f.support = a.count = 3, so items c and f are frequent 1-items. The 2-candidate cf is generated by joining c and f; the only 2-subset of a's prefix path is cf itself, and (cf).support = a.count = 3 = minsup, so the itemset cf is frequent. Finally, we obtain the frequent itemsets {ac:3, af:3, acf:3}. The remaining mining processes are similar.

4. Experimental Results

To verify the performance of the APFT algorithm, we compare it with Apriori and FP-growth. The three algorithms were run on a computer with a 1.41 GHz processor and 512 MB of memory, running Windows XP; the programs were developed in Visual C++ 6.0. We present experimental results on three databases. Database bankcard consists of bank card transactions obtained from the Bank of China. Databases mushroom and T10I4D100K can be found at http://fimi.cs.helsinki.fi/data/. Some characteristics of these databases are shown in Table 2, and the experimental results are shown in Figures 2, 3 and 4, respectively. As the figures show, APFT is faster than Apriori because it does not need to generate 2-candidate itemsets, which reduces the search space. APFT performs almost as fast as FP-growth, and when minsup is low, APFT runs faster: in that case FP-growth has to construct a large number of conditional subtrees, which is both time-consuming and memory-intensive, while APFT needs little extra space during the mining process and therefore has better space scalability.

Table 2. Database characteristics
Database | Items | Records | Max|T| | Avg|T|
Bankcard | 27 | 50905 | 13 | 6
Mushroom | 119 | 8124 | 23 | 23
T10I4D100K | 870 | 100000 | 29 | 10

Figure 2. Bankcard
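To make the worked example above concrete, the core of Apriori-mining can be sketched in Python. For self-containment, the sketch takes Ii's prefix paths (each with its node count) as an explicit list rather than walking node-links in a real FP-tree; `minsup` is an absolute count, and all names are ours:

```python
from itertools import combinations

def apriori_mining(item, prefix_paths, minsup):
    """Sketch of Apriori-mining(Ii). `prefix_paths` is a list of
    (items-on-prefix-path, count) pairs, i.e. what steps 1-11 gather
    by following the node-links of `item`."""
    paths = [(frozenset(t), cnt) for t, cnt in prefix_paths]
    # Steps 1-11: the NTable maps item-name -> item-support.
    ntable = {}
    for t, cnt in paths:
        for i in t:
            ntable[i] = ntable.get(i, 0) + cnt
    # Steps 12-13: frequent 1-itemsets among the prefix-path items.
    Fk = {frozenset([i]) for i, s in ntable.items() if s >= minsup}
    found = set(Fk)
    k = 1
    # Steps 14-25: Apriori passes, counting candidates against the paths.
    while Fk:
        k += 1
        Ck = {p | q for p in Fk for q in Fk if len(p | q) == k}  # join
        Ck = {c for c in Ck                                      # prune
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        supp = {c: sum(cnt for t, cnt in paths if c <= t) for c in Ck}
        Fk = {c for c in Ck if supp[c] >= minsup}
        found |= Fk
    # Step 26: every result carries Ii as its prefix item.
    return {fs | {item} for fs in found} | {frozenset([item])}
```

For the a-node of the example, the only prefix path is cf with count 3, and `apriori_mining("a", [({"c", "f"}, 3)], 3)` returns the four itemsets {a}, {ac}, {af}, {acf}, matching the result derived above.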
6. Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 60773126), the Province Natural Science Foundation of Fujian (Grant No. A0710023), the academician start-up fund (Grant No. X01109), and the 985 information technology fund (Grant No. 0000-X07204) of Xiamen University.

7. References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, Washington, D.C., May 1993, pp. 207-216.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB '94, pp. 487-499.
[3] J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD 1995, pp. 175-186.