INTRODUCTION
Business transactions generate enormous amounts of data in daily operations. As technology advances, firms are able to trace transactions back to individual customers through identifiers such as shopper cards, credit card numbers and cell phone numbers. Firms have already realized the importance of leveraging customer-level data to support decision making. For example, amazon.com offers distinct home pages and recommends new products to customers based on a personalization model built from such data, and most credit card and cellular phone fraud alerts are issued based on analysis of customer-level data. Expectations for customer-level data analysis are high, but several issues remain with existing analysis models: the demand for speed and performance, the short time in which success or failure is decided, and the need for technologies that bring speed and intelligence to bear in real time. To build more successful personalized systems and model consumer behavior accurately, firms need to understand their customers better. This is why we need a model for clustering web transaction data. Clustering is the process of grouping data into classes or clusters so that the objects within a cluster are similar to one another; the two major clustering techniques are hierarchical and partitional.
RELATED WORK
Mining web transaction patterns, as in [2] and [3], captures both the traveling patterns and the purchasing patterns of customers; this is needed because web transactions interleave navigation and purchases by nature. The purpose of mining web transaction patterns is to capture information on consumer behavior and assess the quality of business strategies.
ISBN 978-979-16338-0-2
Proceedings of the International Conference on Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia, June 17-19, 2007
However, that model did not take processing time and data accuracy into account. It concentrates on mining the traveling paths and the items purchased by customers, while our model concentrates on clustering user behavior patterns on the web. It shares our purpose of capturing information from web transactions to understand customers better. In clustering by pattern similarity [4], two objects are considered similar if they exhibit a coherent pattern on a subset of dimensions. The definition of similarity differs from one clustering model to another, but most models use a distance-based approach. Their definition of a pattern is based on the correlation between attributes of the objects to be clustered, whereas in our model pattern similarity is defined within a cluster: similarity is the number of frequent item sets occurring in a cluster. They use only pattern similarity in clustering, while our model uses both pattern difference and similarity. Moreover, the definition of a pattern used in [4] makes it suitable for numeric data only, whereas web transaction data contain both numeric and categorical data. Because the clustering objective depends on it, the definition of the similarity measurement is critical. El-Sonbaty et al. [9] introduced an online hierarchical clustering algorithm based on single-linkage measurement for symbolic and numeric data, while Oh et al. [8] define similarity based on the number of common items and the identical order of items to cluster sequence data. However, that approach considers only two items per sequence element due to computational complexity, and a web transaction may contain hundreds or thousands of items. Thus, our approach is more effective for clustering web transactions than [8] and can handle up to thirty thousand items. The time complexity of [8] is O(n² log n), whereas the time complexity of our approach is O(n).
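To make the similarity notion concrete, here is a minimal Python sketch of counting frequent item sets within a cluster. The brute-force enumeration, itemset sizes, threshold and grocery items are illustrative assumptions, not values from this paper (the paper itself uses EAA [6] for frequent item set discovery):

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s.issubset(t))

def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force enumeration of frequent item sets (illustration only).
    `min_support` is a fraction of the transactions."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    return [frozenset(cand)
            for size in range(1, max_size + 1)
            for cand in combinations(items, size)
            if support_count(cand, transactions) / n >= min_support]

# Similarity(Ci) = number of frequent item sets occurring in cluster Ci
cluster = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk", "jam"}]
fis = frequent_itemsets(cluster, min_support=0.6)
print(len(fis))  # → 3: {bread}, {milk}, {bread, milk}
```

A cluster whose transactions share many patterns thus scores a high similarity, which is what the model rewards.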
The hierarchical pattern-based clustering algorithms in [11] and [12] maximize the difference between clusters and the similarity of transactions within clusters. This enables firms and organizations to understand their customers better and build more accurate models. However, processing time is also important: the shorter the processing time at a given clustering accuracy, the better the model. The algorithms proposed in [11] and [12] use the support of each pattern in a cluster to obtain the difference between two clusters. The model can be improved by using the summations of support in each cluster, which is more efficient in terms of time complexity. In this research, we introduce Hierarchical Pattern-Based Clustering (HPBC), which uses the summations of support in each cluster to obtain the difference between clusters.
Consider T = {T1, T2, ..., Tn}, a collection of transactions to be clustered. Each transaction Ti contains a subset of a list of candidate items {i1, i2, ..., ik}. A clustering C is a partition {C1, C2, ..., Cm} of {T1, T2, ..., Tn}, and each Ci is a cluster. The goal of clustering is to maximize the difference between clusters and the similarity of transactions within clusters, measured by M, defined as follows:

M(C1, C2, ..., Ck) = Difference(C1, C2, ..., Ck) + Similarity(C1, C2, ..., Ck)

The intuition behind the difference term is that the support of any pattern in one cluster should differ from its support in the other cluster; we therefore compute the difference between support values based on Lemma 1 and Lemma 2 below. Let D be a set of transactions to be clustered, and let I = {i1, i2, ..., im} be a set of literals, called items. Each transaction T is a set of items such that T ⊆ I. We say that a transaction T contains B, a set of some items in I, if B ⊆ T. Let U be the set of all non-empty subsets of I, U = {B1, B2, ..., Bq}, where q = 2^m − 1. The probability of a transaction containing Bk (1 ≤ k ≤ q) is P(Bk ⊆ T). Assuming that there are two latent states generating all transactions in a probabilistic manner, we will show that our difference definition is better because it is maximized when the clusters are pure. Under each state, transactions are generated according to a set of probabilities; Table 1 shows the two sets.

Table 1: Probabilities under two states

State G (g)              State H (h)
P1g = P(B1 ⊆ T | g)      P1h = P(B1 ⊆ T | h)
P2g = P(B2 ⊆ T | g)      P2h = P(B2 ⊆ T | h)
...                      ...
Pqg = P(Bq ⊆ T | g)      Pqh = P(Bq ⊆ T | h)

For each Bk (1 ≤ k ≤ q), the probabilities of Bk occurring in a transaction under g and under h are Pkg and Pkh, respectively. Assume that D is divided into two clusters, C1 and C2. Within C1, the percentage of transactions generated under g is Px and the percentage generated under h is (1 − Px).
Within C2, the percentage of transactions generated under g is Py and the percentage generated under h is (1 − Py). The distribution is illustrated in Table 2.

Table 2: Distribution of transactions
        From g      From h
C1      Px          1 − Px
C2      Py          1 − Py

Let Sak denote the support of pattern Bk in cluster C1 and Sbk its support in cluster C2, so that Sak = Px·Pkg + (1 − Px)·Pkh and Sbk = Py·Pkg + (1 − Py)·Pkh.

Lemma 1: |Sak − Sbk| is maximized when the transactions generated from g and h are correctly separated.

Proof:
|Sak − Sbk| = |[Px·Pkg + (1 − Px)·Pkh] − [Py·Pkg + (1 − Py)·Pkh]|
            = |(Px − Py)·(Pkg − Pkh)|
            = |Px − Py|·|Pkg − Pkh|

This difference is clearly maximized when |Px − Py| = 1, which is the case when the clusters are pure (i.e., either Px = 1, Py = 0 or Px = 0, Py = 1).

Lemma 2: The difference between two clusters can be computed from the summations of support: if Sak is the support of pattern Bk in cluster a and Sbk is the support of pattern Bk in cluster b, then Difference(a, b) = |Σk Sak − Σk Sbk|.

From Lemma 2, if x is the summation of pattern supports in cluster a and y is the summation of pattern supports in cluster b, then the difference between clusters a and b can be computed as |x − y|. Similarity is based on the number of frequent item sets (that is, similar transactions) in a cluster; thus Similarity(Ci) = the number of frequent item sets in cluster Ci. These definitions permit any set of patterns to be used in computing the difference. In this paper we use only the patterns of the unclustered data; from a computational point of view this is efficient because the large item sets are computed only once. A frequent item set is a set of items that has a high support value in the data [11]. For example, consider an item set isr = {ia1, ia2, ..., iaf}, where ia1, ia2, ..., iaf are items from the list of candidate items {i1, i2, ..., im}. Let T(isr) = {Tk | {ia1, ia2, ..., iaf} ⊆ Tk, Tk ∈ {T1, T2, ..., Tn}}; T(isr) is the set of transactions that contain item set isr, and |T(isr)| is the number of transactions in T(isr). Item set isr is a frequent item set if |T(isr)| > σ·n, where σ is a user-defined threshold and n is the total number of transactions to be clustered. Since our model uses item sets to represent patterns, we use an association rule discovery algorithm to obtain the frequent item sets. The Enhanced Apriori Algorithm (EAA) [6] is an association rule discovery algorithm that uses a second support and confidence; it has been shown to decrease processing time by 76.6% compared with the current Apriori algorithm. We use EAA in this research: it is more efficient than existing association rule discovery algorithms such as Apriori and the Relative Support Apriori Algorithm (RSAA), so our model also gains an improvement in processing time.
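The difference and similarity definitions above can be sketched in Python. This is a minimal illustration under our own assumptions: support is taken as the fraction of a cluster's transactions containing the pattern, and the function names are ours, not from the paper:

```python
def support(pattern, cluster):
    """Fraction of transactions in `cluster` that contain `pattern`."""
    p = set(pattern)
    return sum(1 for t in cluster if p.issubset(t)) / len(cluster)

def difference(c1, c2, patterns):
    """HPBC-style difference: compare the summations of support in
    each cluster (Lemma 2), one summation per cluster."""
    return abs(sum(support(p, c1) for p in patterns) -
               sum(support(p, c2) for p in patterns))

def similarity(cluster, patterns, min_support):
    """Number of candidate patterns that are frequent within `cluster`."""
    return sum(1 for p in patterns if support(p, cluster) >= min_support)

def M(c1, c2, patterns, min_support=0.5):
    """Objective to maximize: between-cluster difference plus
    within-cluster similarity."""
    return (difference(c1, c2, patterns) +
            similarity(c1, patterns, min_support) +
            similarity(c2, patterns, min_support))
```

For pure clusters the difference term reaches its maximum, consistent with Lemma 1, and each cluster needs only one summation of supports rather than a per-pattern comparison.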
For each division in the hierarchy, the dataset is split into two groups according to each pattern in the frequent item set (FIS): one group consists of all transactions containing that pattern, and the other group contains the remaining transactions. From all the possible groupings, we choose the one that maximizes M.
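The division step just described can be sketched as follows. This is a self-contained illustration, not the paper's implementation: the objective is recomputed from the earlier definitions, and all names and thresholds are assumptions:

```python
def _support(pattern, cluster):
    """Fraction of transactions in `cluster` containing `pattern`."""
    p = set(pattern)
    return sum(1 for t in cluster if p.issubset(t)) / len(cluster)

def _M(c1, c2, patterns, min_sup):
    # Difference: gap between the summations of support (Lemma 2);
    # Similarity: number of patterns frequent within each cluster.
    diff = abs(sum(_support(p, c1) for p in patterns) -
               sum(_support(p, c2) for p in patterns))
    sim = sum(1 for c in (c1, c2)
              for p in patterns if _support(p, c) >= min_sup)
    return diff + sim

def best_split(transactions, fis, min_sup=0.5):
    """One division step of the hierarchy: try splitting on each
    frequent pattern and keep the split that maximizes M."""
    best_score, best_pair = float("-inf"), None
    for pattern in fis:
        p = set(pattern)
        c1 = [t for t in transactions if p.issubset(t)]
        c2 = [t for t in transactions if not p.issubset(t)]
        if not c1 or not c2:  # both groups must be non-empty
            continue
        score = _M(c1, c2, fis, min_sup)
        if score > best_score:
            best_score, best_pair = score, (c1, c2)
    return best_pair
```

Applying best_split recursively to each resulting group yields the divisive hierarchy.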
OUTCOME

The following experiment tests the execution performance of our model on synthetic web transaction data. The data were split into sets of 5K, 10K, 15K, 20K, 25K and 30K transactions.

[Figure 2: Processing time (seconds) of GHIC and HPBC against input data size, from 5K to 30K transactions.]

From Figure 2, our proposed model, HPBC, takes less processing time than the GHIC model.

CONCLUSION

In this paper, we proposed the Hierarchical Pattern-Based Clustering (HPBC) model for grouping web transactions. The experimental results show that our model reduces processing time by up to 50% compared with GHIC. HPBC is therefore more efficient than the GHIC model in terms of processing time.

REFERENCES

[1] A.K. Jain, M.N. Murty and P.J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, Vol. 31, No. 3, September 1999.
[2] C.H. Yun and M.S. Chen, "Mining Web Transaction Patterns in an Electronic Commerce Environment," Proc. 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2000), April 2000.
[3] C.H. Yun and M.S. Chen, "Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment," Proc. 24th Annual International Computer Software and Applications Conference, 2000.
[4] H. Wang, J. Wang, W. Wang and P.S. Yu, "Clustering by Pattern Similarity in Large Data Sets," Proc. ACM SIGMOD Conference, Madison, WI, June 2002.
[5] I.K. Ravichandra Rao, "Data Mining and Clustering Techniques," DRTC Workshop on Semantic Web, December 8-10, 2003, Bangalore.
[6] M.D. Mustafa, N.F. Nabila, D.J. Evans, M.Y. Saman and A. Mamat, "Association rules on significant rare data using second support," International Journal of Computer Mathematics, Vol. 83, No. 1, January 2006, pp. 69-80.
[7] M.H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, New Jersey, 2002.
[8] S.J. Oh and J.Y. Kim, "A Hierarchical Clustering Algorithm for Categorical Sequence Data," Information Processing Letters, Vol. 91, Issue 3, August 2004, pp. 135-140.
[9] Y. El-Sonbaty and M.A. Ismail, "On-Line Hierarchical Clustering," Pattern Recognition Letters, Vol. 19, 1998, pp. 1285-1291.
[10] Y.K. Woon, W.K. Ng and E.P. Lim, "A Support-Ordered Trie for Fast Frequent Itemset Discovery," IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 7, July 2004, pp. 875-879.
[11] Y. Yang and B. Padmanabhan, "Data Mining for Customer Segmentation: A Behavioral Pattern-Based Approach," Proc. Third IEEE International Conference on Data Mining (ICDM '03), Melbourne, Florida, November 19-22, 2003, pp. 411-418.
[12] Y. Yang and B. Padmanabhan, "GHIC: A Hierarchical Pattern-Based Clustering Algorithm for Grouping Web Transactions," IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 9, September 2005, pp. 1300-1304.