
Probabilistic sequence mining - evaluation and extension of ProMFS algorithm

K. Hryniów - Institute of Control and Industrial Electronics, Warsaw University of Technology, ul. Koszykowa 75, 00-662 Warsaw, Poland, hryniowk@isep.pw.edu.pl

Sequential pattern mining is an extensively studied method for data mining. One of the newer and less documented approaches is to estimate the statistical characteristics of a sequence in order to create model sequences that can be used to speed up the sequence mining process. This article evaluates one such algorithm, ProMFS, on both real-life and artificial data and proposes modifications that make the algorithm faster and more accurate. It is shown that the modified ProMFS algorithm can be competitive for large and frequently updated data sets.

Keywords: ProMFS, sequential pattern mining, probabilistic mining

1. Introduction

Sequential pattern mining was first defined by Agrawal and Srikant in [1] as follows: "Given a set of sequences, where each sequence consists of a list of itemsets, and given a user-specified minimum support threshold (min support), sequential pattern mining is to find all frequent subsequences whose frequency is no less than min support."

One of the algorithms used for sequential pattern mining is the ProMFS algorithm (probabilistic algorithm for mining frequent sequences) proposed by Tumasonis and Dzemyda [2]. This two-step procedure is based on estimating the statistical characteristics of the main sequence. Based on these characteristics the algorithm creates a shorter model sequence, which is analysed with GSP (Generalized Sequential Patterns) [3] or a similar algorithm. The frequency of subsequences in the main sequence is then estimated from the results of the GSP algorithm run on the model sequence. For very large data sets, the ProMFS algorithm is meant to produce slightly less accurate results in a shorter time than classical algorithms [2].

2. ProMFS algorithm overview

The ProMFS algorithm is based on three statistical characteristics of a sequence: the probability of an element in the sequence, the probability of an element appearing after another, and the average distance between different elements in the sequence. It requires two parameters: the length of the model sequence $l$ and the minimum support $g$ for the algorithm used in the second step.

In the first step the algorithm creates matrices with the characteristics of the sequence. The probabilities of the elements in the sequence are stored in a matrix $P$, where $P(i_j) = V(i_j)/VS$, $VS$ is the length of the main sequence $S$, and $V(i_j)$ is the number of occurrences of element $i_j$ in the sequence. The probability of an element occurring after another element is denoted $P(i_j|i_v)$, and $D(i_j|i_v)$ is the distance between elements $i_j$ and $i_v$. Matrix $A$ stores the average distances between elements. These are used by a complementary function of $(C_r, A_{r,j})$ that, after each step $r$ of the algorithm, modifies matrix $Q$, which starts with zero values.

After creating all matrices with the statistical characteristics of the sequence, the algorithm places the element with maximum $P(i_j)$ in the first position of the model sequence $C$, updates matrix $Q$ with the complementary function, and chooses the next elements according to the rule below until the model sequence $C$ has length $l$:

$C(r) = \max(Q(i_j, C_r))$ (1)

If $Q(i_j, C_r)$ is equal for two or more elements, the next element of the model sequence is chosen by the second criterion:

$C(r) = \max(P(C_{r-1}|i_j), P(C_{r-1}|i_v))$ (2)

If these values are equal as well, the third criterion is used:

$C(r) = \max(P(i_j), P(i_v))$ (3)
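The characteristics of the first step and the three-level selection rule of equations (1)-(3) can be sketched as follows. This is a minimal sketch under stated assumptions, not the original implementation: the names `characteristics` and `pick_next` are hypothetical, $P(C_{r-1}|i_j)$ is read as the share of occurrences of $C_{r-1}$ that are immediately followed by $i_j$, and the values of $Q$ are passed in as a plain dictionary because the update performed by the complementary function is not specified here.

```python
from collections import Counter


def characteristics(seq):
    """Estimate element and follow-up probabilities from the main sequence."""
    counts = Counter(seq)
    # P(i_j) = V(i_j) / VS: occurrences divided by the sequence length.
    p = {elem: count / len(seq) for elem, count in counts.items()}
    # Assumption: the conditional probability is estimated from adjacent pairs,
    # cond[(a, b)] = share of occurrences of a immediately followed by b.
    pairs = Counter(zip(seq, seq[1:]))
    cond = {(a, b): count / counts[a] for (a, b), count in pairs.items()}
    return p, cond


def pick_next(prev, q, cond, p):
    """Pick the next model element: highest Q first, ties broken by the
    follow-up probability after `prev`, then by the element probability."""
    # sorted() makes remaining ties fall to the lowest element, which mirrors
    # the "lower index" behaviour of the original algorithm.
    return max(
        sorted(p),
        key=lambda elem: (q.get(elem, 0.0), cond.get((prev, elem), 0.0), p[elem]),
    )


p, cond = characteristics("ABABAC")
print(pick_next("A", {}, cond, p))          # B: all Q equal, B follows A most often
print(pick_next("A", {"C": 1.0}, cond, p))  # C: Q decides before any tie-break
```

Comparing tuples with `max` applies the criteria in order: a later criterion is consulted only when all earlier ones are equal.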

Once the model sequence $C$ reaches length $l$, the first step of the algorithm stops and another algorithm (such as GSP) is run on the model sequence $C$ instead of the longer sequence $S$.

2.1. ProMFS algorithm evaluation

Work with real-life and artificial data has shown that the originally proposed ProMFS algorithm can generate inadequate results in a time comparable to, or even longer than, that of the GSP algorithm. Experiments on many types of commonly available data have shown that the original ProMFS algorithm is slower than basic GSP or similar algorithms. Moreover, the accuracy of the generated results was low for many tested types of data, as shown in Table 1 and Table 2. This holds especially for the longest patterns found, which are a good measure of accuracy because they reflect the hardest to find and most interesting patterns.

Table 1
Comparison between ProMFS and GSP algorithm for real-life data

Sequence   Working time (s)      Number of frequent    Longest sequence
size                             sequences found
           GSP       ProMFS      GSP       ProMFS      GSP     ProMFS
35         0.015     0.029       31        34          8       10
6864       8.485     33.41       4375      1902        44      24
34340      88.953    744.948     26240     4492        58      37

The above results are for a real-life data set. The ProMFS and GSP parameters used for the comparison were the optimal ones found for each test.

Table 2
Comparison between ProMFS and GSP algorithm for artificial data

Sequence   Working time (s)      Number of frequent    Longest sequence
size                             sequences found
           GSP       ProMFS      GSP       ProMFS      GSP     ProMFS
8124       4.848     31.442      1099      214         23      7
32000      51.017    761.953     21132     3310        32      11

The ProMFS parameters used for the comparison were the optimal ones found for each test. For GSP the minimum support was set to 100.

During testing it became clear that some data structures produce improper results because of certain features of the probabilistic approach of the ProMFS algorithm. Three examples of such results are shown below.

For sequences with a dominating element, the algorithm can create model sequences that are not representative of the main sequence. For example, the main sequence ABACADABACADAAAEAFAGAHAIAJAKALAMAPARASATTATSTQTWTETEVSZRL yields the model sequence AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT.

For sequences with elements that are common but not positioned near each other, such as sequences with dominating elements only at both ends, the algorithm finds improper frequent sequences. For example, for the sequence AAAAAAAAAABCBCBCBCTTTTTT the algorithm finds the non-existing subsequence AT.

In some cases the algorithm finds frequent sequences that do not exist even in sequences without dominating elements. For example, in the sequence ABACADABACADAAAEAFAGAHAIAJAKALAMAPARASATTATSTQTWTETEVSSVCSR the algorithm is able to find the non-existing frequent subsequence DAD.

All these features are due to the novel, probabilistic approach and to some correlations of the data in the sequence. During testing it was found that some modifications to the ProMFS algorithm are capable of fixing the encountered problems.

3. Modifications

The originally proposed ProMFS algorithm has no selection rule for the case when the probabilities of two elements in matrix $P$ are equal in criterion 3. In that case the element with the lower index in matrix $P$ is chosen, which can lead directly to the first of the problems shown above. In criterion 1, when $P(C_{r-1}|i_j) = P(C_{r-1}|i_v)$, it is proposed to take $C(r) = \max(P(i_j), P(i_v))$, which does not take the relative position of elements in the sequence into consideration. Both of these criteria were modified.

An additional condition was added to the algorithm for the situation when elements in $P(i_j|i_v)$ are equal. With this modification the chosen element is no longer the one with the lower index but the one that occurs less often in the model sequence. This change improves the algorithm's accuracy and eliminates some of the improper results, as the model sequence is given greater diversity. The second criterion was modified from $C(r) = \max(P(i_j), P(i_v))$ into $C(r) = \max(P(i_j|i_v))$, so that the positions of elements in the sequence are taken into consideration when checking the conditions for putting an element into the model sequence. This change also improved the algorithm's accuracy, and with it most of the non-existing patterns were eliminated from the results.

In addition, the ProMFS algorithm was optimised to work with the FUP (Fast Update) [4] algorithm, which resulted in better working time for frequently updated databases. A comparison of the original and modified algorithms is shown in Table 3.

Table 3
Effects of modifications on ProMFS accuracy and comparison of modified ProMFS and GSP algorithms working time

Sequence   Working time (s)      Number of frequent        Longest sequence
size       used for update       sequences found
           GSP       ProMFS      ProMFS    ProMFS (mod)    ProMFS  ProMFS (mod)
35         0.005     0.014       36        34              10      10
6864       0.062     4.981       1902      2014            24      28
34340      70.133    92.472      4492      4751            37      45
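The modified tie-breaking rule from Section 3, which prefers the tied element that occurs less often in the model sequence built so far, can be sketched as follows. This is a minimal sketch rather than the author's implementation: the function name `break_tie`, the representation of the model sequence as a string, and the fallback to the lowest element when the usage counts are also equal are assumptions made for illustration.

```python
from collections import Counter


def break_tie(tied, model):
    """Among tied candidates, return the one used least often in the model."""
    used = Counter(model)  # missing elements count as zero
    # sorted() keeps the result deterministic when usage counts are equal too.
    return min(sorted(tied), key=lambda elem: used[elem])


# With a model dominated by A, the tie between A and T now goes to T.
print(break_tie(["A", "T"], "AAAA"))  # T
```

Under the original rule the same tie would go to the element with the lower index, so a dominating element keeps being appended; counting usage in the model sequence is what gives the model its greater diversity.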

4. Conclusions

The ProMFS algorithm has been evaluated and tested on both artificial and real-life data. The problems found in the working algorithm were solved, modifications increasing both speed and accuracy were added, and it was shown that ProMFS can be used successfully for large and frequently updated data sets. With further modifications and additional tests, the ProMFS algorithm is expected to become able to compete with the most commonly used frequent pattern mining algorithms for large and very large real-life data sets.

REFERENCES

1. R. Agrawal and R. Srikant, Mining sequential patterns, ICDE'95, pages 3-14, Taipei, Taiwan, 1995.
2. R. Tumasonis and G. Dzemyda, A probabilistic algorithm for mining frequent sequences, 2004, http://www.sztaki.hu/conferences/ADBIS/8-Tumasonis.pdf
3. R. Srikant and R. Agrawal, Mining Sequential Patterns: Generalizations and Performance Improvements, EDBT'96, pages 3-17, Avignon, France, 1996.
4. D.W. Cheung, J. Han, V. Ng and C.Y. Wong, Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique, 1996.
