Article info

Article history:
Received 25 July 2016
Revised 16 December 2016
Accepted 17 December 2016
Available online 30 December 2016

Keywords:
Data mining
Pattern mining
Maximal frequent patterns
N-list structure
Pruning technique

Abstract

Mining maximal frequent patterns (MFPs) is an approach that limits the number of frequent patterns (FPs) to help intelligent systems operate efficiently. Many approaches have been proposed for mining MFPs, but the complexity of the problem is enormous. Therefore, the run time and memory usage are still large. Recently, the N-list structure has been proposed and verified to be very effective for mining FPs, frequent closed patterns, and top-rank-k FPs. Therefore, this paper uses the N-list structure for mining MFPs. A pruning technique is also proposed to prune branches to reduce the search space. This technique is applied to an algorithm called INLA-MFP (improved N-list-based algorithm for mining maximal frequent patterns) for mining MFPs. Experiments were conducted to evaluate the effectiveness of the proposed algorithm. The experimental results show that INLA-MFP outperforms two state-of-the-art algorithms for mining MFPs.

© 2016 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2016.12.023
0957-4174/© 2016 Elsevier Ltd. All rights reserved.
B. Vo et al. / Expert Systems With Applications 73 (2017) 178–186
Table 2
Frequent patterns in example dataset with minSup = 60%.

Patterns   Support
B          4
C          4
E          4
AC         3
BC         3
BE         4
CE         3
BCE        3
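As an illustration of what "maximal" means for the patterns in Table 2, the following minimal sketch keeps only those frequent patterns that have no frequent proper superset. It is a brute-force filter over an already-mined list, not the INLA-MFP algorithm; the function name `maximal_patterns` is hypothetical.

```python
# Frequent patterns and their supports, copied from Table 2.
frequent = {
    "B": 4, "C": 4, "E": 4,
    "AC": 3, "BC": 3, "BE": 4, "CE": 3,
    "BCE": 3,
}


def maximal_patterns(frequent):
    """Return the frequent patterns that have no frequent proper superset."""
    itemsets = [frozenset(pattern) for pattern in frequent]
    maximal = [
        s for s in itemsets
        if not any(s < other for other in itemsets)  # '<' tests proper subset
    ]
    return sorted("".join(sorted(s)) for s in maximal)


print(maximal_patterns(frequent))  # ['AC', 'BCE']
```

Every other pattern in the table is contained in either AC or BCE, which is why reporting only the maximal patterns shrinks the output.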
3.2. PPC-tree
Definition 3. (N-list of a 1-pattern). The N-list associated with an item, A, is N(A), the set of PP-codes associated with nodes in the PPC-tree whose name is A.

Example 3. The N-list of E in the PPC-tree (Fig. 3) is {⟨3, 1, 3⟩, ⟨5, 3, 1⟩} and that of B is {⟨1, 4, 4⟩}.

Theorem 1. Let XA and XB be two FPs with prefix X. If N(XB) ⊆ N(XA) or N(XA) ⊆ N(XB), then XB and XA are not MFPs.

Proof. Based on the MFP definition, if XAB is an FP, XA ⊂ XAB, and XB ⊂ XAB, then XA and XB are not MFPs. Therefore, to assert that XB and XA are not MFPs, we must prove that XAB is an FP.
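Definition 3 can be made concrete with a small sketch that builds a PPC-tree and reads off the N-list of each item. Several things here are assumptions rather than the authors' implementation: the toy transactions are hypothetical (chosen so that their supports agree with Table 2 and the PP-codes of E and B agree with Example 3, but not taken from the paper), ties in support are broken alphabetically, and a PP-code is modelled as a (pre-order, post-order, frequency) tuple. The names `Node`, `build_ppc_tree`, and `n_lists` are illustrative.

```python
from collections import Counter


class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}   # item -> Node, kept in insertion order
        self.pre = None
        self.post = None


def build_ppc_tree(transactions, min_count):
    support = Counter(item for t in transactions for item in set(t))
    # Keep frequent items, ordered by descending support.
    # Alphabetical tie-breaking is an assumption of this sketch.
    order = sorted((i for i in support if support[i] >= min_count),
                   key=lambda i: (-support[i], i))
    rank = {item: r for r, item in enumerate(order)}

    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item)
            child.count += 1
            node = child

    # Assign pre-order and post-order ranks in a single depth-first pass.
    counters = {"pre": 0, "post": 0}

    def visit(node):
        node.pre = counters["pre"]
        counters["pre"] += 1
        for child in node.children.values():
            visit(child)
        node.post = counters["post"]
        counters["post"] += 1

    visit(root)
    return root


def n_lists(root):
    """Map each item to its PP-codes (pre, post, count), in pre-order."""
    result = {}

    def visit(node):
        if node.item is not None:
            result.setdefault(node.item, []).append(
                (node.pre, node.post, node.count))
        for child in node.children.values():
            visit(child)

    visit(root)
    return result


# Hypothetical transactions; minSup = 60% of 5 transactions gives min_count = 3.
transactions = [set("ABCE"), set("ABCE"), set("BCE"), set("BE"), set("AC")]
root = build_ppc_tree(transactions, min_count=3)
lists = n_lists(root)

print(lists["E"])  # [(3, 1, 3), (5, 3, 1)]
print(lists["B"])  # [(1, 4, 4)]
# The support of an item is the sum of the frequencies in its N-list.
print(sum(count for _, _, count in lists["E"]))  # 4
```

Because each PP-code summarises a whole subtree of transactions sharing a prefix, an N-list can be much shorter than the list of transactions that contain the item.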
Fig. 2. PPC-tree without pre- and post-order created for example dataset with minSup = 60%.
Fig. 4. Continued
5. Performance studies
Table 3
Statistical summary of experimental datasets.
Datasets Number of transactions Number of items Average length of transactions Max length of transactions
Fig. 7. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Pumsb dataset.
Fig. 8. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Connect dataset.
Fig. 9. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Pumsb_star dataset.
Fig. 10. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Accident dataset.
5.1. Runtime

INLA-MFP spends a lot of time building the PPC-tree, and thus with a large minSup, it is not faster than dGenMax and TDM-MFI. However, with a small minSup, INLA-MFP is much faster than dGenMax and faster than TDM-MFI in most cases.

In Fig. 7, for the Pumsb dataset, the runtimes of INLA-MFP are almost constant, whereas those of dGenMax and TDM-MFI increase rapidly when the threshold is decreased from 70% to 50%. TDM-MFI cannot even run with threshold = 60% and 50%.

In Fig. 8, for the Connect dataset, the runtimes of INLA-MFP are consistently lower than those of dGenMax and TDM-MFI.

In Figs. 9–15, the runtimes of INLA-MFP for the Pumsb_star, Accident, T40I10D100K, T10I4D100K, Retail, Chainstore, and BMS-POS datasets are better than those of dGenMax and TDM-MFI in most cases, especially with small thresholds.
Fig. 12. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for T10I4D100K dataset.
Fig. 13. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Retail dataset.
Fig. 14. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for Chainstore dataset.
Fig. 15. Runtimes of dGenMax, TDM-MFI, and INLA-MFP for BMS-POS dataset.
Fig. 16. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for Pumsb dataset.
Fig. 17. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for Connect dataset.
Fig. 18. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for Pumsb_star dataset.
Fig. 19. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for Accident dataset.
5.2. Memory usage
For all datasets, INLA-MFP is always better than dGenMax and TDM-MFI in terms of memory usage, as shown in Figs. 16–24. This can be explained as follows. dGenMax uses the diffset structure, which contains a list of transaction identifiers. TDM-MFI uses a bit vector structure for storing tidset information. INLA-MFP uses the N-list structure. The number of transactions is often greater than the number of nodes in the PPC-tree. Therefore, INLA-MFP generally requires less memory than do dGenMax and TDM-MFI. In general, these experiments show that INLA-MFP is the best algorithm for mining MFPs in terms of memory usage.
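To make the transaction-based representations mentioned above more tangible, here is a minimal sketch of a tidset, a diffset, and a bit vector for the pattern BC. It shows only the generic ideas (a diffset stores the transaction identifiers that contain the prefix but not the extending item), not the internals of dGenMax or TDM-MFI; the toy transactions and helper names are hypothetical.

```python
# Hypothetical toy data: transaction id -> items (not taken from the paper).
transactions = {1: "ABCE", 2: "ABCE", 3: "BCE", 4: "BE", 5: "AC"}


def tidset(item):
    """Identifiers of the transactions that contain the item."""
    return {tid for tid, items in transactions.items() if item in items}


def bit_vector(item):
    """Integer whose bit (tid - 1) is set when transaction tid has the item."""
    return sum(1 << (tid - 1) for tid in tidset(item))


# Tidsets: intersecting gives the tidset of the longer pattern.
t_b, t_c = tidset("B"), tidset("C")
t_bc = t_b & t_c

# Diffset of BC relative to prefix B: transactions with B but without C.
d_bc = t_b - t_c
support_bc = len(t_b) - len(d_bc)

# Bit vectors: bitwise AND plays the role of intersection,
# and the support is the number of set bits.
v_bc = bit_vector("B") & bit_vector("C")

print(sorted(t_bc))            # [1, 2, 3]
print(sorted(d_bc))            # [4]
print(support_bc)              # 3
print(bin(v_bc).count("1"))    # 3
```

All three representations are indexed by transaction, so their size grows with the number of transactions, which is the contrast the paragraph above draws with the node-based N-list.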
Fig. 20. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for T40I10D100K dataset.
Fig. 21. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for T10I4D100K dataset.
Fig. 22. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for Retail dataset.
Fig. 24. Memory usage of dGenMax, TDM-MFI, and INLA-MFP for BMS-POS dataset.
Fig. 25. Scalability of dGenMax, TDM-MFI, and INLA-MFP for Chainstore dataset with threshold = 0.08% and various dataset sizes.
Fig. 26. Scalability of dGenMax, TDM-MFI, and INLA-MFP for Chainstore dataset with threshold = 0.06% and various dataset sizes.
time. Note that TDM-MFI cannot run for the Chainstore dataset with thresholds 0.08%. The results in Figs. 25–28 show that INLA-MFP has the best scalability. In particular, with smaller thresholds (Figs. 27–28), the scalability of INLA-MFP is much better than that of dGenMax.
5.4. Discussion
Fig. 27. Scalability of dGenMax, TDM-MFI, and INLA-MFP for Chainstore dataset with threshold = 0.04% and various dataset sizes.
Fig. 28. Scalability of dGenMax, TDM-MFI, and INLA-MFP for Chainstore dataset with threshold = 0.02% and various dataset sizes.

6. Conclusion and future work

This article proposed the INLA-MFP algorithm for mining MFPs that uses the N-list structure and a pruning technique. The pruning technique, based on the N-list structure, reduces the search space. To show the effectiveness of the proposed algorithm, experiments were conducted on several datasets for mining MFPs. The experimental results show that INLA-MFP runs faster and uses less memory compared to dGenMax and TDM-MFI in most cases.

In future work, we will focus on applying the N-list structure and several pruning techniques for mining top-rank-k frequent closed patterns and top-rank-k MFPs. We will apply our approach to the mining of patterns from quantitative and hierarchical databases to reduce runtime and memory usage.
Acknowledgments