
Expert Systems With Applications 73 (2017) 178–186


A novel approach for mining maximal frequent patterns


Bay Vo a,b, Sang Pham c, Tuong Le d,e,∗, Zhi-Hong Deng f

a Faculty of Information Technology, Ho Chi Minh City University of Technology, Vietnam
b College of Electronics and Information Engineering, Sejong University, Seoul, Republic of Korea
c University of Economics & Finance, Ho Chi Minh City, Vietnam
d Division of Data Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam
e Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
f Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

Article history:
Received 25 July 2016
Revised 16 December 2016
Accepted 17 December 2016
Available online 30 December 2016

Keywords:
Data mining
Pattern mining
Maximal frequent patterns
N-list structure
Pruning technique

Abstract: Mining maximal frequent patterns (MFPs) is an approach that limits the number of frequent patterns (FPs) to help intelligent systems operate efficiently. Many approaches have been proposed for mining MFPs, but the complexity of the problem is enormous. Therefore, the run time and memory usage are still large. Recently, the N-list structure has been proposed and verified to be very effective for mining FPs, frequent closed patterns, and top-rank-k FPs. Therefore, this paper uses the N-list structure for mining MFPs. A pruning technique is also proposed to prune branches to reduce the search space. This technique is applied to an algorithm called INLA-MFP (improved N-list-based algorithm for mining maximal frequent patterns) for mining MFPs. Experiments were conducted to evaluate the effectiveness of the proposed algorithm. The experimental results show that INLA-MFP outperforms two state-of-the-art algorithms for mining MFPs.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Data mining is the computational process of discovering patterns in large datasets. Data mining problems include frequent pattern (FP) mining (Agrawal, Imielinski, & Swami, 1993; Deng, 2016; Deng & Lv, 2014, 2015; Deng, Wang, & Jiang, 2012; Dong & Han, 2007; Fan et al., 2008; Grahne & Zhu, 2005; Han, Pei, & Yin, 2000; Pyun & Yun, 2014; Vo, Le, Coenen, & Hong, 2016; Zaki & Hsiao, 2005), clustering (Le, 2015; Xiao, Jing, Bettina, Son, & Plant, 2014), and classification (Liu, Hsu, & Ma, 1998; Nguyen & Nguyen, 2015). The problem of mining association rules (Agrawal & Srikant, 1994) is one of the most important and most popular problems. Association rules are used widely in data analysis to support sales strategies, e-commerce, and medicine. Mining FPs is the first step of mining association rules. However, the number of obtained FPs is usually large, and there are a lot of redundant FPs. Some approaches have been developed to create representative patterns of FPs, such as closed patterns, maximal patterns, and top-k and top-rank-k FPs. Several approaches have been proposed for mining maximal frequent patterns (MFPs) (Agarwal, Aggarwal, & Prasad, 2000; Burdick, Calimlim, Flannick, Gehrke, & Yiu, 2005; Gouda & Zaki, 2005; Liu, Zhai, & Pedrycz, 2012). However, due to the large complexity, they require a lot of system resources. Therefore, the present study proposes an effective algorithm for mining MFPs. Compared with the problem of mining FPs, the problem of mining MFPs eliminates many FPs without losing generality. Although many methods have been proposed for mining MFPs, their runtime and memory usage are large.

The N-list structure, proposed by Deng et al. (2012), is based on the PPC-tree (pre-post code tree). This concept is used in the PrePost algorithm for mining FPs. The N-list helps the algorithm reduce the mining time and memory usage since its structure is more compact than previous vertical structures. The N-list structure has been applied to the problems of mining erasable itemsets (Deng & Xu, 2012; Le, Vo, & Coenen, 2013), frequent closed patterns (Le & Vo, 2015), and top-rank-k FPs (Deng, 2014; Huynh, Le, Vo, & Le, 2015). These studies show that the N-list is an effective structure for mining patterns. In 2014, Deng proposed a structure named Nodeset, which is also generated from the PPC-tree. The Nodeset structure stores only one of the pre-order or post-order codes, whereas the N-list stores both pre-order and post-order information. This structure was applied in the FIN algorithm for mining FPs. An experiment by Deng (2014) showed that FIN outperforms PrePost (Deng et al., 2012) in terms of runtime and memory usage. In 2015, Deng proposed the PrePost+ algorithm based on the N-list concept and

∗ Corresponding author.
E-mail addresses: bayvodinh@gmail.com (B. Vo), sangppt@gmail.com (S. Pham), lecungtuong@tdt.edu.vn, tuonglecung@gmail.com (T. Le), zhdeng@cis.pku.edu.cn (Z.-H. Deng).

http://dx.doi.org/10.1016/j.eswa.2016.12.023
0957-4174/© 2016 Elsevier Ltd. All rights reserved.

a child-parent equivalence pruning technique. With this pruning technique, PrePost+ is better than the FIN (using Nodeset) and PrePost (using N-list) algorithms for mining FPs (Deng & Lv, 2015).

The present work proposes a novel approach for mining MFPs that uses the N-list structure and a divide-and-conquer strategy. The main contributions of this study are as follows: (i) an N-list structure is used to compress the data and improve runtime; (ii) a pruning technique for reducing the search space is proposed for quickly mining MFPs based on N-lists. Experiments were conducted to show the effectiveness of the proposed approach compared with two state-of-the-art algorithms for mining MFPs, namely the dGenMax (GenMax algorithm that uses the diffset strategy) (Gouda & Zaki, 2005) and TDM-MFI (Liu et al., 2012) algorithms, in terms of runtime and memory usage.

The rest of this paper is organized as follows. Section 2 presents related work. Definitions of MFPs are presented in Section 3. This section also summarizes the definitions and construction algorithms of the PPC-tree and N-list structures. Several theorems concerning the N-list structure and MFPs are also summarized in this section. A pruning technique for reducing the search space and the INLA-MFP (improved N-list-based algorithm for mining maximal frequent patterns) algorithm are proposed in Section 4. Section 5 shows the results of experiments, comparing the runtime and memory usage of INLA-MFP with those of the dGenMax and TDM-MFI algorithms. Finally, Section 6 summarizes the results and offers some future research topics.

Table 1
Example dataset.

Transaction   Items
1             A, C, D
2             B, C, E
3             A, B, C, E
4             B, E
5             A, B, C, E

2. Related work

2.1. Mining maximal frequent patterns

Many algorithms have been proposed for mining MFPs, such as DepthProject (Agarwal et al., 2000), MAFIA (Burdick et al., 2005), GenMax (Gouda & Zaki, 2005), and TDM-MFI (Liu et al., 2012). The DepthProject algorithm mines long patterns using a depth-first search of a lexicographic tree of patterns, and uses a counting method based on transaction projections along its branches. This algorithm also uses the look-ahead pruning technique with item reordering to reduce runtime and memory usage. The MAFIA algorithm uses three pruning strategies, namely look-ahead pruning, checking whether a new set is subsumed by an existing maximal set, and checking whether t(X) ⊆ t(Y) (if so, X is considered together with Y for extension), to eliminate non-maximal patterns. MAFIA also uses the vertical bit-vector data format and the compression and projection of bitmaps to improve performance. This algorithm mines a superset of MFPs, and requires a post-pruning step to eliminate non-maximal patterns. The GenMax algorithm stores datasets in the vertical tidset format, integrates pruning with mining, and returns all MFPs. First, this algorithm computes the set of frequent 1- and 2-patterns using a vertical-to-horizontal recovery method. This information is used to reorder the items in the initial combined list to limit the search space. Then, GenMax uses the progressive focusing technique of local maximal frequent itemset backtracking, combined with diffset propagation, to produce the exact set of all MFPs. Liu et al. (2012) presented a form of directed itemset graph to store the information of frequent itemsets, and used the trifurcate linked list storage structure of the directed itemset graph. Then, an algorithm for mining MFPs based on this structure, named TDM-MFI, was proposed. The performance of TDM-MFI is better than that of some well-known algorithms such as MAFIA and GenMax (Liu et al., 2012) on three particular datasets (Congressional Voting Records, Adult, and Steel Sales) which have a small number of attributes. The first dataset, Congressional Voting Records, has 435 transactions and 17 attributes. The second dataset, Adult, has 48,842 transactions and 15 attributes. Steel Sales (not publicly available) has 0.5 million transactions and 35 attributes.

2.2. N-list structure

N-list (Deng et al., 2012) structures, generated from the PPC-tree, store information related to 1-items and are used to calculate the support of a pattern for mining FPs and frequent closed patterns. PrePost, based on the N-list, is faster than state-of-the-art algorithms such as dEclat and eclat_goethals (Deng et al., 2012). The N-list is more compact than previous vertical structures. Hence, it helps to reduce mining time and memory usage. Deng and Lv (2014) proposed the Nodeset structure, where a node encodes only one value (pre- or post-order code) in the PPC-tree, for mining FPs. A Nodeset-based algorithm called FIN was proven to be more efficient than the PrePost algorithm (an N-list-based algorithm) using extensive experiments. Recently, Deng and Lv (2015) proposed the PrePost+ algorithm for mining FPs based on the N-list structure and children-parent equivalence pruning. PrePost+ with the new pruning technique was shown to be better than the PrePost and FIN algorithms for mining FPs.

The NC_set structure (Deng & Xu, 2012; Le et al., 2013), a structure generated from the PPC-tree, was proposed for mining erasable itemsets. Moreover, the N-list was used to improve the performance of mining top-rank-k FPs by Deng (2014) and Huynh et al. (2015). In addition, the N-list structure was applied for mining frequent closed patterns by Le and Vo (2015).

In this paper, we propose a novel approach for mining MFPs using the N-list structure, named the INLA-MFP algorithm. A pruning technique based on the N-list structure is also proposed for reducing the search space. It is applied in the INLA-MFP algorithm for mining MFPs.

3. Basic principles

3.1. Maximal frequent patterns

Consider a dataset DB consisting of n transactions, where each transaction includes a number of items. Let I be a finite set of items. An example dataset with n = 5 is shown in Table 1. This dataset is used throughout the paper. We have: I = {A, B, C, D, E}.

A pattern is a subset X ⊆ I that contains a number of items. The support of pattern X, denoted by sup(X), is the number of transactions that contain all the items of X. Pattern X is said to be an FP if and only if sup(X) ≥ minSup (minSup is a user-defined support threshold). An FP X is said to be an MFP if X is frequent and no super-set of X is frequent:

{(sup(X) ≥ minSup) ∧ (∀Y ⊆ I: X ⊂ Y ⟹ sup(Y) < minSup)}   (1)

The problem of mining MFPs requires discovering all MFPs that satisfy the given user-defined minimum support (minSup).

Example 1. With the example dataset shown in Table 1 and minSup = 60%, the whole list of FPs is shown in Table 2.

Table 2
Frequent patterns in
example dataset with
minSup = 60%.

Patterns Support

B 4
C 4
E 4
AC 3
BC 3
BE 4
CE 3
BCE 3

As shown in Table 2, A, B, C, E, AC, BC, BE, CE, and BCE are FPs. Of these FPs, only AC and BCE are MFPs because they are not subsets of any other FP.
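Before moving to the N-list machinery, the MFP definition can be sanity-checked by brute force on Table 1: enumerate all itemsets, keep the frequent ones, and retain those without a frequent proper superset. This naive Python sketch is purely for illustration (avoiding this exponential enumeration is the point of the algorithms below); it reproduces Table 2 and the two MFPs.

```python
from itertools import combinations

# Table 1, with minSup = 60% of 5 transactions = 3
DB = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'},
      {'B', 'E'}, {'A', 'B', 'C', 'E'}]
MIN_SUP = 3

items = sorted({i for t in DB for i in t})

def sup(x):
    # support = number of transactions containing all items of x
    return sum(x <= t for t in DB)

# all frequent patterns (Table 2 lists the 9 of them with sup >= 3)
fps = [set(c) for n in range(1, len(items) + 1)
       for c in combinations(items, n) if sup(set(c)) >= MIN_SUP]

# maximal frequent patterns: no frequent proper superset
mfps = [x for x in fps if not any(x < y for y in fps)]
```

Running this yields exactly the FPs of Table 2 and the MFPs {AC, BCE} of Example 1.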

3.2. PPC-tree

Deng et al. (2012) defined the PPC-tree as follows:

Definition 1 (PPC-tree). A PPC-tree, R, is a tree structure in which each node consists of five values: name, frequency, childnodes, pre, and post, which are the frequent 1-pattern in I, the frequency of this node, the set of child nodes associated with this node, the order number of the node when traversing the tree in pre-order form, and the order number when traversing the tree in post-order form, respectively.

The PPC-tree construction algorithm is shown in Fig. 1.
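Since Fig. 1 is reproduced as an image only, a small runnable sketch of the same construction may be useful: count supports, fix a descending-frequency order, insert each filtered and reordered transaction into a prefix tree, and finally assign pre/post codes. The tie-break among the equally frequent items B, C, and E, and the order in which children are visited, are assumptions chosen here so that the generated codes match Example 3; this is an illustrative Python sketch, not the authors' implementation.

```python
from collections import defaultdict

DB = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'],
      ['B', 'E'], ['A', 'B', 'C', 'E']]
MIN_SUP = 3  # 60% of 5 transactions

# count supports and fix a global order: support descending,
# ties broken alphabetically (assumption)
count = defaultdict(int)
for t in DB:
    for i in t:
        count[i] += 1
order = sorted((i for i in count if count[i] >= MIN_SUP),
               key=lambda i: (-count[i], i))        # B, C, E, A
rank = {i: r for r, i in enumerate(order)}

class Node:
    def __init__(self, name):
        self.name, self.freq, self.children = name, 0, {}

# filter infrequent items from each transaction, reorder it, insert it
root = Node(None)
for t in DB:
    cur = root
    for i in sorted((i for i in t if i in rank), key=rank.get):
        cur = cur.children.setdefault(i, Node(i))
        cur.freq += 1

# assign pre-/post-order codes and collect the N-lists of the 1-patterns;
# children are visited in the global item order (assumption)
n_list = defaultdict(list)
counter = {'pre': 0, 'post': 0}

def traverse(node):
    pre = counter['pre']
    counter['pre'] += 1
    for child in sorted(node.children.values(), key=lambda c: rank[c.name]):
        traverse(child)
    post = counter['post']
    counter['post'] += 1
    if node.name is not None:
        n_list[node.name].append((pre, post, node.freq))

traverse(root)
```

With these conventions the sketch produces N(E) = {⟨3, 1, 3⟩, ⟨5, 3, 1⟩} and N(B) = {⟨1, 4, 4⟩}, matching Example 3.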


Example 2. To illustrate the PPC-tree construction algorithm, we use the example dataset with threshold = 60%. First, the PPC-tree construction algorithm scans the dataset to determine I1 = {A, B, C, D, E} with their supports (line 1). Then, the algorithm sorts I1 in descending order of support (line 2), creates the root of the PPC-tree, R (line 3), and determines minSup = 60% × 5 = 3 (line 4). For the first transaction (T1), item D is removed because sup(D) = 1 < minSup = 3 (line 7). The new order of T1 after sorting the remaining items in descending order of frequency is {C, A}. Each item in T1 is inserted into the PPC-tree. The tree for T1 is shown in Fig. 2 (column A). For the second transaction (T2), no items are removed and the order of T2 is {B, C, E}. Each item in T2 is inserted into the PPC-tree. The resulting tree is shown in Fig. 2 (column B). T3, T4, and T5 are processed similarly (Fig. 2, columns C, D, and E).

Fig. 1. PPC-tree construction algorithm.

Next, the PPC-tree construction algorithm scans the PPC-tree and generates pre- and post-order values (line 11). Fig. 3 shows the PPC-tree generated for the example dataset. The letter/number pair in each rectangle is the name of the item and its support, sup(α), respectively. The pre- and post-order of the corresponding node (α) are represented by a pair of numbers in brackets, denoted by pre(α) and post(α). For example, node N1 in Fig. 3 has pre(N1) = 1, post(N1) = 4, and sup(N1) = 4.

3.3. N-list structure

Definition 2 (PP-code). The PP-code, denoted by α, of each node N comprises a tuple ⟨pre(α), post(α), sup(α)⟩, where pre(α) is the pre-order of N, post(α) is the post-order of N, and sup(α) is the frequency of N.

Definition 3 (N-list of a 1-pattern). The N-list associated with an item, A, is N(A), the set of PP-codes associated with nodes in the PPC-tree whose name is A.

Example 3. The N-list of E in the PPC-tree (Fig. 3) is {⟨3, 1, 3⟩, ⟨5, 3, 1⟩} and that of B is {⟨1, 4, 4⟩}.

Given two PP-codes α and β, α is an ancestor of β if and only if pre(α) < pre(β) and post(α) > post(β). For example, in Fig. 3, because pre(N1) < pre(N2) (1 < 3) and post(N1) > post(N2) (4 > 1), N1 is an ancestor of N2.

Let XA and XB be two (k−1)-patterns with a given prefix X. N(XA) and N(XB) are the two N-lists associated with XA and XB, respectively. To create N(XAB), the following steps are used. For each pair of PP-codes α ∈ N(XA) and β ∈ N(XB), if α is an ancestor of β and there exists γ ∈ N(XAB) such that pre(γ) = pre(α) and post(γ) = post(α), then the frequency count of γ is updated as sup(γ) = sup(γ) + sup(β). Otherwise, the algorithm adds ⟨pre(α), post(α), sup(β)⟩ to N(XAB). The support of XAB is calculated as follows:

sup(XAB) = Σ_{γ ∈ N(XAB)} sup(γ)   (2)

Definition 4 (Subset relation of two N-lists). N(XB) ⊆ N(XA) if ∀β ∈ N(XB), ∃α ∈ N(XA) such that α is an ancestor of β.

4. Improved algorithm for mining MFPs with pruning technique

4.1. INLA-MFP algorithm

Theorem 1. Let XA and XB be two FPs with prefix X. If N(XB) ⊆ N(XA) or N(XA) ⊆ N(XB), then XB and XA are not MFPs.

Proof. Based on the MFP definition, if XAB is an FP, XA ⊂ XAB, and XB ⊂ XAB, then XA and XB are not MFPs. Therefore, to assert that XB and XA are not MFPs, we must prove that XAB is an FP.

Fig. 2. PPC-tree without pre- and post-order created for example dataset with minSup = 60%.

According to the method used for determining the N-list associated with a k-pattern and N(XB) ⊆ N(XA), we have:

N(XAB) = {⟨pre(α), post(α), sup(β)⟩}   (3)

where α ∈ N(XA), β ∈ N(XB), and α is an ancestor of β. Therefore, sup(XAB) = Σ_{γ ∈ N(XAB)} sup(γ) = Σ_{β ∈ N(XB)} sup(β) = sup(XB). Because sup(XAB) = sup(XB), XAB is an FP. Therefore, XA and XB cannot be MFPs. The same in the case of N(XA) ⊆ N(XB) can be proven. Therefore, Theorem 1 is proven.

Due to the characteristics of the Generate_NList function, only the case of N(XB) ⊆ N(XA) appears. Using Theorem 1, when the algorithm calls the function to intersect the two N-lists of XA and XB, if N(XB) ⊆ N(XA), then the algorithm can conclude that XA and XB are not MFPs without checking. The algorithm replaces XB by XAB without creating a new node on the tree.

Example 4. We have N(E) = {⟨3, 1, 3⟩, ⟨5, 3, 1⟩} and N(B) = {⟨1, 4, 4⟩}. Based on Definition 4, N(E) ⊆ N(B) because all PP-codes in N(E) have an ancestor in N(B). Therefore, E and B are not MFPs.

Fig. 3. PPC-tree created from example dataset with minSup = 60%.

In addition, we also use the property named the children-parent equivalence pruning technique (Deng & Lv, 2015; Le & Vo, 2015) to reduce the search space for mining MFPs. When the algorithm calls the function to intersect the two N-lists of XA and XB, if N(XB) ⊆ N(XA) and sup(XB) = sup(XA), the algorithm will:

1. Remove XA from the search space.
2. Replace XB by XAB and update the N-list of XAB. Then, the algorithm will update all sub-nodes that were created earlier by adding A at the end of the pattern.

First, the INLA-MFP algorithm constructs the PPC-tree (line 1) and generates the N-lists from this tree (line 2). Then, the algorithm deletes all information in the PPC-tree (line 3). The algorithm calls the Find_LMFPs function to expand the search tree to find all MFPs. This function is the main idea of the INLA-MFP algorithm; it uses the list of frequent k-itemsets (Lk) and the results (Zk) as parameters. If Lk has only one element and it is not a subset of any set in Zk, the algorithm adds this element to the results, Zk (lines 13 and 14). For each element Lk[i] in Lk (line 16), the algorithm executes the following processes.

1. Determine itemset P as the union of all elements (from Lk[i] to the end) (lines 17–19). If P is a subset of any itemset in Zk, then Lk[i] cannot create any MFPs. The algorithm thus prunes this branch.
2. Combine Lk[i] with all remaining elements Lk[j] in Lk. For each Lk[j], the algorithm determines X = Lk[j] ∪ Lk[i] (line 24), the N-list of X, the support of X, and iS (iS is used to show whether N(Lk[i]) ⊆ N(Lk[j]) or not) (line 25). On lines 26–31, if iS is true (the children-parent equivalence pruning technique (Deng & Lv, 2015) is satisfied), the algorithm removes Lk[j] from the search space, replaces Lk[i] by Lk[i] ∪ Lk[j], and updates all sub-nodes that were created earlier by union with Lk[j]. If the support of X satisfies minSup, the algorithm inserts X at the head of Lk+1 (lines 33 and 34). If Lk+1 has no element and Lk[i] is not a subset of any set in Zk, the algorithm adds Lk[i] to the results (lines 35 and 36). Otherwise, the algorithm recursively calls the Find_LMFPs function with Lk+1 and Zk+1 as parameters (line 39).

The algorithm stops when all elements in Lk have been scanned. All MFPs are found.
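The branch-pruning idea of the search, giving up on a candidate as soon as its union with all of its remaining frequent extensions is covered by an already-found MFP, can be sketched independently of N-lists. For brevity, the hypothetical sketch below computes supports directly from the transactions of Table 1, so it shows only the divide-and-conquer skeleton and the subset-based pruning of step 1, not the N-list intersection or the equivalence pruning.

```python
DB = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'},
      {'B', 'E'}, {'A', 'B', 'C', 'E'}]
MIN_SUP = 3

def sup(x):
    return sum(x <= t for t in DB)

def find_mfps(prefix, tail, results):
    # frequent extensions of the current prefix
    ext = [i for i in tail if sup(prefix | {i}) >= MIN_SUP]
    # pruning: if prefix plus all of its extensions is inside a known MFP,
    # no itemset in this branch can be maximal
    if any(prefix | set(ext) <= m for m in results):
        return
    if not ext:
        results.append(frozenset(prefix))
        return
    for k, i in enumerate(ext):
        find_mfps(prefix | {i}, ext[k + 1:], results)

# frequent 1-patterns, least frequent first (the demonstration in
# Section 4.2 also extends A first)
items = sorted((i for i in {i for t in DB for i in t} if sup({i}) >= MIN_SUP),
               key=lambda i: (sup({i}), i))
results = []
for k, i in enumerate(items):
    find_mfps({i}, items[k + 1:], results)
```

On the example dataset this search visits only a handful of nodes (the branch rooted at C, for instance, is pruned because {C, E} is covered by BCE) and returns {AC, BCE}.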


Fig. 5. Frequent pattern tree of INLA-MFP when expanding node A.

Fig. 6. Frequent pattern tree of INLA-MFP when expanding node E.

4.2. Demonstration of INLA-MFP

Fig. 5 shows all 1-patterns with their N-lists. INLA-MFP calls the function Find_LMFPs to find all MFPs in the dataset. First, A is combined with the 1-patterns B, C, and E. Because N(A) ⊆ N(C), C is inserted before A with N(AC) = {⟨2, 2, 2⟩, ⟨6, 6, 1⟩} (see Fig. 5). Then, AC has sup(AC) = 3 ≥ minSup and has no pattern in the current results as its superset. Therefore, AC is an MFP and is inserted into the results.

Next, E is combined with C to create CE with sup(CE) = 3 and N(CE) = {⟨2, 2, 3⟩}. Then, E is combined with B to create BE. In this step, the algorithm finds that N(E) ⊆ N(B) and sup(E) = sup(B), and so it replaces E by EB with a new N-list N(EB) = {⟨1, 4, 4⟩}, removes B from the search space, and updates all created 2-level nodes (inserts B into node CE). Only CEB is not a subset of any set in the results; therefore, BCE is added to the results. Fig. 6 shows the results of this step.

Finally, C cannot be an MFP because BCE is an MFP in the results. The algorithm stops here. The result of INLA-MFP (the set of MFPs) for the example dataset with minSup = 60% is {AC, BCE}.
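The N-list manipulations in this demonstration are small enough to replay in code. The sketch below implements the ancestor test, the N-list intersection with frequency accumulation (Section 3.3), the support of Eq. (2), and the subset test of Definition 4. The PP-codes for B, C, and E are those given in Example 3; the codes for A are not printed in the text and are inferred from the tree structure, so treat them as an assumption. The list-based representation is an illustration, not the authors' C# implementation.

```python
def is_ancestor(a, b):
    # a is an ancestor of b iff pre(a) < pre(b) and post(a) > post(b)
    return a[0] < b[0] and a[1] > b[1]

def intersect(na, nb):
    # N(XAB): keep <pre(a), post(a), sup(b)> for each b in N(XB) whose
    # ancestor a lies in N(XA), accumulating frequencies of a repeated a
    out = []
    for b in nb:
        for a in na:
            if is_ancestor(a, b):
                if out and out[-1][:2] == a[:2]:
                    out[-1] = (a[0], a[1], out[-1][2] + b[2])
                else:
                    out.append((a[0], a[1], b[2]))
                break
    return out

def support(nl):               # Eq. (2)
    return sum(c[2] for c in nl)

def n_list_subset(nb, na):     # Definition 4
    return all(any(is_ancestor(a, b) for a in na) for b in nb)

# PP-codes of the 1-patterns (B, C, E from Example 3; A inferred)
N = {'A': [(4, 0, 2), (7, 5, 1)], 'B': [(1, 4, 4)],
     'C': [(2, 2, 3), (6, 6, 1)], 'E': [(3, 1, 3), (5, 3, 1)]}

n_ac = intersect(N['C'], N['A'])   # N(AC)
n_eb = intersect(N['B'], N['E'])   # N(EB)
# children-parent equivalence condition: N(E) subset of N(B), equal supports
equiv = n_list_subset(N['E'], N['B']) and support(N['E']) == support(N['B'])
```

This reproduces the demonstration's values: N(AC) = {⟨2, 2, 2⟩, ⟨6, 6, 1⟩} with sup(AC) = 3, N(EB) = {⟨1, 4, 4⟩} with sup(EB) = 4, and the equivalence condition that triggers the replacement of E by EB.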

5. Performance studies

All experiments presented in this section were performed on a laptop with an Intel Core i3-370M CPU (2.4 GHz) and 4 GB of RAM. All programs were coded in C# in Microsoft Visual Studio 2012 on Windows 8.1 (64-bit), and run on the Microsoft .NET Framework (version 4.5). The experiments were conducted using the following UCI databases1: Pumsb, Connect, Pumsb_star, Accident,

1 http://fimi.cs.helsinki.fi/data/.

Table 3
Statistical summary of experimental datasets.

Datasets      Number of transactions   Number of items   Average length of transactions   Max length of transactions
Pumsb         49,046                   2113              74                               74
Connect       67,557                   129               43                               43
Pumsb_star    49,046                   2088              50                               63
Accidents     340,183                  468               33.8                             51
T40I10D100K   100,000                  942               40                               77
T10I4D100K    100,000                  870               10                               30
Retail        88,162                   16,470            10.3                             76
Chainstore    1,112,949                46,086            7.26                             170
BMS-POS       515,597                  1657              6.53                             165

Fig. 7. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Pumsb dataset.
Fig. 9. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Pumsb_star dataset.
Fig. 8. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Connect dataset.
Fig. 10. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Accident dataset.

T10I4D100K, T40I10D100K, Retail, Chainstore, and BMS-POS. A statistical summary of these datasets is shown in Table 3.

To prove the effectiveness of the proposed algorithm, we compared its runtime and memory usage with those of the dGenMax (GenMax algorithm that uses the diffset strategy) and TDM-MFI algorithms on the experimental datasets. dGenMax is a highly efficient method for mining the exact set of MFPs (Gouda & Zaki, 2005). TDM-MFI is the newest algorithm for mining MFPs (Liu et al., 2012). Note that the mining time of TDM-MFI is better than that of dGenMax (Liu et al., 2012) on three particular datasets (Congressional Voting Records, Adult, and Steel Sales) which are not usually used for benchmarking pattern mining algorithms. However, for the experimental datasets in this article, dGenMax outperforms TDM-MFI (see Section 5.1).

Fig. 11. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for T40I10D100K dataset.

5.1. Runtime

INLA-MFP spends a lot of time building the PPC-tree, and thus with a large minSup, it is not faster than dGenMax and TDM-MFI. However, with a small minSup, INLA-MFP is much faster than dGenMax and faster than TDM-MFI in most cases.

In Fig. 7, for the Pumsb dataset, the runtimes of INLA-MFP are almost constant, whereas those of dGenMax and TDM-MFI increase rapidly when the threshold is decreased from 70% to 50%. TDM-MFI cannot even execute with threshold = 60% or 50%.

In Fig. 8, for the Connect dataset, the runtimes of INLA-MFP are consistently lower than those of dGenMax and TDM-MFI.

In Figs. 9–15, the runtimes of INLA-MFP for the Pumsb_star, Accident, T40I10D100K, T10I4D100K, Retail, Chainstore, and BMS-POS datasets are better than those of dGenMax and TDM-MFI in most cases, especially with small thresholds.
184 B. Vo et al. / Expert Systems With Applications 73 (2017) 178186

Fig. 12. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for T10I4D100K dataset.
Fig. 16. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for Pumsb dataset.

Fig. 13. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Retail dataset.
Fig. 17. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for Connect dataset.

Fig. 14. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Chainstore dataset.
Fig. 18. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for Pumsb_star
dataset.

Fig. 15. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for BMS-POS dataset.

Fig. 19. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for Accident dataset.
5.2. Memory usage

For all datasets, INLA-MFP is always better than dGenMax and TDM-MFI in terms of memory usage, as shown in Figs. 16–24. This can be explained as follows. dGenMax uses the diffset structure, which contains a list of transaction identifiers. TDM-MFI uses a bit-vector structure for storing tidset information. INLA-MFP uses the N-list structure. The number of transactions is often greater than the number of nodes in the PPC-tree. Therefore, INLA-MFP generally requires less memory than do dGenMax and TDM-MFI. In general, these experiments show that INLA-MFP is the best algorithm for mining MFPs in terms of memory usage.

Fig. 20. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for T40I10D100K dataset.
Fig. 24. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for BMS-POS dataset.
Fig. 21. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for T10I4D100K dataset.
Fig. 25. Scalability of dGenMax, TDM-MFI, and INLA-MFP for Chainstore dataset with threshold = 0.08% and various dataset sizes.
Fig. 22. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for Retail dataset.
Fig. 26. Scalability of dGenMax, TDM-MFI, and INLA-MFP for Chainstore dataset with threshold = 0.06% and various dataset sizes.
Fig. 23. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for Chainstore dataset.

5.3. Scalability comparison

In this section, we performed scalability experiments on various numbers of transactions for the Chainstore dataset, which is the largest one among the experimental datasets. The goal of this experiment is to observe the influence of the number of transactions on execution time. Note that TDM-MFI cannot run on the Chainstore dataset with thresholds ≤ 0.08%. The results in Figs. 25–28 show that INLA-MFP has the best scalability. Especially with smaller thresholds (Figs. 27–28), the scalability of INLA-MFP is much better than that of dGenMax.

5.4. Discussion

With dense datasets, which have a large average length of transactions (the Pumsb, Connect, and Pumsb_star datasets), INLA-MFP is more effective than dGenMax and TDM-MFI in terms of runtime and memory usage. In contrast, with sparse datasets (the T10I4D100K and Retail datasets), which have a small average length of transactions, the runtime and memory usage of INLA-MFP and dGenMax are similar. In addition, Section 5.3 shows that INLA-MFP has the best scalability. Generally, INLA-MFP outperforms dGenMax and TDM-MFI in terms of runtime and memory usage with small thresholds.

6. Conclusion and future work

This article proposed the INLA-MFP algorithm for mining MFPs that uses the N-list structure and a pruning technique. The pruning technique, based on the N-list structure, reduces the search space. To show the effectiveness of the proposed algorithm, experiments were conducted on several datasets for mining MFPs. The experimental results show that INLA-MFP runs faster and uses less memory compared to dGenMax and TDM-MFI in most cases.

In future work, we will focus on applying the N-list structure and several pruning techniques for mining top-rank-k frequent closed patterns and top-rank-k MFPs. We will also apply our approach to the mining of patterns from quantitative and hierarchical databases to reduce runtime and memory usage.

Acknowledgments

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2015.10.

References

Agarwal, R. C., Aggarwal, C. C., & Prasad, V. V. V. (2000). Depth first generation of long patterns. In KDD'00 (pp. 108–118).
Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases. In SIGMOD'93 (pp. 207–216).
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In VLDB'94 (pp. 487–499).
Burdick, D., Calimlim, M., Flannick, J., Gehrke, J., & Yiu, T. (2005). MAFIA: A maximal frequent itemset algorithm. IEEE Transactions on Knowledge and Data Engineering, 17(11), 1490–1504.
Deng, Z. H., & Xu, X. (2012). Fast mining erasable itemsets using NC_sets. Expert Systems with Applications, 39(4), 4453–4463.
Deng, Z. H. (2014). Fast mining top-rank-k frequent patterns by using Node-lists. Expert Systems with Applications, 41(4), 1763–1768.
Deng, Z. H. (2016). DiffNodesets: An efficient structure for fast mining frequent itemsets. Applied Soft Computing, 41, 214–223.
Deng, Z. H., & Lv, S. L. (2014). Fast mining frequent itemsets using Nodesets. Expert Systems with Applications, 41(10), 4505–4512.
Deng, Z. H., & Lv, S. L. (2015). PrePost+: An efficient N-lists-based algorithm for mining frequent itemsets via children-parent equivalence pruning. Expert Systems with Applications, 42(13), 5424–5432.
Deng, Z. H., Wang, Z., & Jiang, J. J. (2012). A new algorithm for fast mining frequent itemsets using N-lists. SCIENCE CHINA Information Sciences, 55(9), 2008–2030.
Dong, J., & Han, M. (2007). BitTableFI: An efficient mining frequent itemsets algorithm. Knowledge-Based Systems, 20, 329–335.
Fan, W., Zhang, K., Cheng, H., Gao, J., Yan, X., Han, J., et al. (2008). Direct mining of discriminative and essential frequent patterns via model-based search tree. In SIGKDD'08 (pp. 230–238).
Gouda, K., & Zaki, M. J. (2005). GenMax: An efficient algorithm for mining maximal frequent itemsets. Data Mining and Knowledge Discovery, 11(3), 223–242.
Grahne, G., & Zhu, J. (2005). Fast algorithms for frequent itemset mining using FP-trees. IEEE Transactions on Knowledge and Data Engineering, 17, 1347–1362.
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In SIGMOD'00 (pp. 1–12).
Huynh, Q., Le, T., Vo, B., & Le, B. (2015). An efficient and effective algorithm for mining top-rank-k frequent patterns. Expert Systems with Applications, 42(1), 156–164.
Le, H. S. (2015). A novel kernel fuzzy clustering algorithm for Geo-Demographic Analysis. Information Sciences, 317, 202–223.
Le, T., & Vo, B. (2015). An N-list-based algorithm for mining frequent closed patterns. Expert Systems with Applications, 42(19), 6648–6657.
Le, T., Vo, B., & Coenen, F. (2013). An efficient algorithm for mining erasable itemsets using the difference of NC-Sets. In IEEE SMC'13 (pp. 2270–2274).
Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classification and association rule mining. In SIGKDD'98 (pp. 80–86).
Liu, X. B., Zhai, K., & Pedrycz, W. (2012). An improved association rules mining method. Expert Systems with Applications, 39(1), 1362–1374.
Nguyen, T. T. L., & Nguyen, N. T. (2015). Updating mined class association rules for record insertion. Applied Intelligence, 42(4), 707–721.
Pyun, G., & Yun, U. (2014). Mining top-k frequent patterns with combination reducing techniques. Applied Intelligence, 41(1), 76–98.
Vo, B., Le, T., Coenen, F., & Hong, T. P. (2016). Mining frequent itemsets using the N-list and subsume concepts. International Journal of Machine Learning and Cybernetics, 7(2), 253–265.
Xiao, H., Jing, F., Bettina, K., Son, M. T., & Plant, C. (2014). Relevant overlapping subspace clusters on categorical data. In KDD'14 (pp. 213–222).
Zaki, M. J., & Hsiao, C. J. (2005). Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering, 17(4), 462–478.
