
Proceedings of the International Conference on Electrical Engineering and Informatics Institut Teknologi Bandung, Indonesia June 17-19, 2007

C-10

Hierarchical Pattern-Based Clustering for Grouping Web Transactions


Darenna Syahida Suib
Faculty of Information Technology and Multimedia, Universiti Tun Hussein Onn Malaysia (UTHM), Batu Pahat, Johor. Email: daren_6010@yahoo.com.sg. Telephone: +60127469566

Abstract: Grouping customer transactions into segments is important in order to obtain a better understanding of customer patterns. Currently, hierarchical pattern-based clustering is used to group customer transactions into segments. However, the processing time is still high due to the difference parameter used between two clusters. In this paper, the difference is instead based on the difference between the summations of support in each cluster. Simulations involving several sets of web data reveal that the proposed model improves on the greedy hierarchical pattern-based clustering model by up to fifty percent.

Mustafa Mat Deris
Faculty of Information Technology and Multimedia, Universiti Tun Hussein Onn Malaysia (UTHM), Batu Pahat, Johor. Email: mmustafa@uthm.edu.my. Telephone: +6074538001

Keywords: Clustering, Hierarchical Pattern-Based Clustering, Web Transactions.

INTRODUCTION

Business transactions generate enormous amounts of data in daily operations. As technology advances, firms are able to track the origin of transactions through customer identifiers such as shopper cards, credit card numbers, cell phone numbers, etc. Firms have already realized the importance of leveraging customer-level data to support the decisions made by decision makers. For example, amazon.com offers distinct home pages and recommends new products to customers based on a personalization model built from data. As another example, most credit card and cellular phone fraud alerts are issued based on analysis of customer-level data. The expectations for customer-level data analysis are high, but some issues still arise with existing analysis models: the expectation of speed and performance, the short time to success or failure, and the high demand for technologies that bring speed and intelligence to bear in real time. To build more successful personalized systems and model consumer behavior accurately, firms need to understand their customers better. That is why we need a model for clustering web transaction data.

Clustering is the process of grouping data into classes or clusters so that the data objects are similar to one another within the same cluster and dissimilar to the objects in other clusters. Two major clustering techniques are hierarchical and partitional. The Hierarchical Pattern-Based Clustering Algorithm was recently introduced by [11] and [12] in order to generate an effective clustering of web transactions. Distance-based clustering techniques and parametric mixture models are the two main approaches to customer segmentation in the data mining literature. Using distance-based clustering for grouping customers is convenient, but it is not obviously an appropriate method. For mixture models, using changing model parameters to represent the difference between segments can often oversimplify the differences and ignore variables and patterns that are not captured by the parametric models. We propose an approach using pattern-based clustering, based on the idea that there are natural behavior patterns in different groups of customer transactions. One way of grouping customer transactions that generates a highly effective clustering is the Hierarchical Pattern-Based Clustering Algorithm for Grouping Web Transactions [11] and [12]. Pattern-based clustering of customer transactions can generate natural clusters; these clusters correspond to natural categories that are not directly observable from the data, and we obtain them by analyzing customer transactions with pattern-based clustering. How can patterns be represented? There are several ways, e.g. IF-THEN rules and item sets, and different pattern representations suit different domains. For web transactions, item sets are a good representation. In this paper, we study the Greedy Hierarchical Item set-based Clustering (GHIC).

RELATED WORK

In mining web transaction patterns, [2] and [3] consider both the traveling patterns and the purchasing patterns of customers, which is needed because of the nature of web transaction mining. The purpose of mining web transaction patterns is to capture information on consumer behavior and improve the quality of business strategies.

1
ISBN 978-979-16338-0-2 687


However, that model did not take processing time and data accuracy into account. It concentrates more on mining the traveling paths and the items purchased by customers, while our model concentrates more on clustering user behavior patterns on the web. The model shares our purpose of capturing information from web transactions to understand customers better. In clustering by pattern similarity [4], two objects are considered similar if they exhibit a coherent pattern on a subset of dimensions. The definition of similarity may differ from one clustering model to another, but most models use a distance-based approach to define similarity. Their definition of a pattern is based on the correlation between attributes of the objects to be clustered, while in our model pattern similarity is defined within a cluster: similarity is the number of frequent item sets occurring in a cluster. They use only pattern similarity in clustering, whereas our model uses both pattern difference and similarity. Moreover, the definition of a pattern used in [4] makes it suitable for numeric data only, whereas web transaction data contain both numeric and categorical data. Because the clustering objective matters, the definition of the similarity measure is important. El-Sonbaty et al. [9] introduced a new online hierarchical clustering algorithm based on single-linkage measurement for symbolic and numeric data, while Oh et al. [8] define similarity based on the number of common items and the identical order of items to cluster sequence data. However, that approach considers only two items per sequence element due to computational complexity, whereas a web transaction may have hundreds or thousands of items. Thus, our approach is more effective for clustering web transactions than [8] and can handle up to thirty thousand items. The time complexity of [8] is O(n² log n), whereas the time complexity of our approach is O(n).
The hierarchical pattern-based clustering algorithm of [11] and [12] maximizes the difference between clusters and the similarity of transactions within clusters. This enables firms and organizations to understand their customers better and build more accurate models. However, processing time is also important when maximizing the difference and similarity between clusters: the lower the processing time at a given clustered-data accuracy, the better the model. The hierarchical pattern-based clustering algorithm proposed in [11] and [12] uses the support of patterns in a cluster to obtain the difference between two clusters. However, the model can be improved by using the summations of support in each cluster, which is more efficient in terms of time complexity. In this research, we introduce hierarchical pattern-based clustering (HPBC), which takes into account the summations of support in each cluster in order to obtain the difference between clusters.
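The contrast between the two ways of measuring cluster difference discussed above can be sketched as follows. This is a minimal illustration of our own (not code from [11] or [12]); the function names and the toy support lists are hypothetical.

```python
def difference_per_pattern(support_a, support_b):
    """GHIC-style measure: accumulate the per-pattern support differences
    between two clusters (one comparison per pattern)."""
    return sum(abs(a - b) for a, b in zip(support_a, support_b))

def difference_of_summations(support_a, support_b):
    """HPBC-style measure: compare only the summations of support in each
    cluster, a lower bound on the per-pattern difference (Lemma 2)."""
    return abs(sum(support_a) - sum(support_b))

# Hypothetical support counts of three patterns in two clusters.
c1 = [5, 3, 2]
c2 = [1, 4, 1]
print(difference_per_pattern(c1, c2))    # per-pattern: |5-1|+|3-4|+|2-1| = 6
print(difference_of_summations(c1, c2))  # summations:  |10-6| = 4
```

The summation-based measure needs only one running total per cluster rather than one comparison per pattern, which is the source of the claimed processing-time improvement.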

PATTERN-BASED CLUSTERING OF WEB TRANSACTIONS

Consider T = {T1, T2, T3, ..., Tn}, a collection of transactions to be clustered. Each transaction Ti contains a subset of a list of candidate items {i1, i2, ..., ik}. A clustering C is a partition {C1, C2, C3, ..., Cm} of {T1, T2, T3, ..., Tn}, and each Ci is a cluster. The goal of clustering is to maximize the difference between clusters and the similarity of transactions within clusters, measured by M, which is defined as follows:

    M(C1, C2, ..., Ck) = Difference(C1, C2, ..., Ck) + Similarity(C1, C2, ..., Ck)

The definition of difference is that the support of any pattern in one cluster should be different from its support in the other cluster; we therefore compute the difference between the support values based on Lemma 1 and Lemma 2 below. Let D be a set of transactions to be clustered and let I = {i1, i2, ..., im} be a set of literals, called items. Each transaction T is a set of items such that T ⊆ I. We say that a transaction T contains B, a set of some items in I, if B ⊆ T. Let U be the set of all non-empty subsets of I, U = {B1, B2, ..., Bq}, where q = 2^m − 1. The probability of a transaction containing Bk (0 < k ≤ q) is P(Bk ⊆ T). Assuming that there are two latent states generating all transactions in a probabilistic manner, we will show that our difference definition is better because it is maximized when the clusters are pure. Under each state, transactions are assumed to be generated according to a set of probabilities; Table 1 shows the two sets.

Table 1: Probabilities under two states

    State G (g)              State H (h)
    P1g = P(B1 ⊆ T | g)      P1h = P(B1 ⊆ T | h)
    P2g = P(B2 ⊆ T | g)      P2h = P(B2 ⊆ T | h)
    ...                      ...
    Pqg = P(Bq ⊆ T | g)      Pqh = P(Bq ⊆ T | h)

For each Bk (0 < k ≤ q), the probabilities of Bk occurring in a transaction under g and under h are Pkg and Pkh, respectively. Assume that D is divided into two clusters, C1 and C2. Within C1, the percentage of transactions generated under g is Px and the percentage of transactions generated under h is (1 − Px).
Within C2, the percentage of transactions generated under g is Py and the percentage of transactions generated under h is (1 − Py). The full distribution is shown in Table 2.

Table 2: Distribution of transactions

                C1          C2
    From g      Px          Py
    From h      (1 − Px)    (1 − Py)

where 0 ≤ Px, Py ≤ 1.

In C1, the expected percentage of transactions containing Bk (0 < k ≤ q) is

    Sak = Px · Pkg + (1 − Px) · Pkh

and in C2 the expected percentage of transactions containing Bk is

    Sbk = Py · Pkg + (1 − Py) · Pkh

Lemma 1: |Sak − Sbk| is maximized when the transactions generated from g and h are correctly separated.

Proof:
    |Sak − Sbk| = |[Px · Pkg + (1 − Px) · Pkh] − [Py · Pkg + (1 − Py) · Pkh]|
                = |(Px − Py) · (Pkg − Pkh)|
                = |Px − Py| · |Pkg − Pkh|

This difference is clearly maximized when |Px − Py| = 1, which is the case when the clusters are pure (i.e. either (Px = 1, Py = 0) or (Px = 0, Py = 1)).

Lemma 2: |x| − |y| ≤ |x − y|

Proof: |x| = |x − y + y| ≤ |x − y| + |y|, hence |x| − |y| ≤ |x − y|.

From Lemma 2, if x is the support of a pattern in cluster a and y is the support of that pattern in cluster b, then the difference between clusters a and b can be computed as the difference of the summations of support between clusters a and b. Similarity is based on the number of frequent item sets (which represent similar transactions) in a cluster; thus Similarity(Ci) = the number of frequent item sets in cluster Ci. These definitions permit any set of patterns to be used in computing the difference. In this paper, we use only the unclustered data; from a computational point of view, using only the original data is efficient since it involves the large item sets only once.

A frequent item set is a set of items that has a high support value in the data [11]. For example, let isr = {ia1, ia2, ..., iaf} be an item set, where ia1, ia2, ..., iaf are items from the list of candidate items {i1, i2, ..., im}. Let T(isr) = {Tk | {ia1, ia2, ..., iaf} ⊆ Tk, Tk ∈ {T1, T2, ..., Tn}}; T(isr) is the set of transactions that contain item set isr, and |T(isr)| is the number of transactions in T(isr). Item set isr is a frequent item set if |T(isr)| > σ·n, where σ is a user-defined threshold and n is the total number of transactions to be clustered. As our model uses item sets to represent patterns, we use an association rule discovery algorithm to obtain the frequent item sets. The Enhanced Apriori Algorithm (EAA) [6] is a model of association rule discovery that uses a second support and confidence; it has been shown to decrease processing time by 76.6% compared with the current Apriori algorithm. We will use EAA in this research. Since EAA is more efficient than current association rule discovery algorithms such as Apriori and the Relative Support Apriori Algorithm (RSAA), our model will also gain an improvement in processing time.

3.2 Clustering Algorithm

The algorithm proposed by Yang et al. [11] and [12] will be adopted; its goal is to maximize M. However, if there are n transactions and two clusters, the number of possible clustering schemes to examine is 2^n, so we proceed greedily. First, we obtain the frequent item sets (FIS) using EAA. We then convert the initial transactions into binary format, where rows represent the original transactions to be clustered and columns represent item sets; T = {T1, T2, ..., Tn} is used to represent this new transaction set in binary format. Since we use a hierarchical algorithm, the entire transaction set is first divided into two clusters, and the process is repeated until no cluster is big enough to be divided further. We assume that the stopping conditions are user-specified and may contain additional criteria, such as the cluster size being larger than a threshold.

Input:  C = {C0}, C0 = {T1, T2, ..., Tn}, FIS = {is1, is2, ..., isp}
Output: Clusters C = {C1, C2, ..., Cf}

C = {C0}
Repeat {
    Choose any X = {T1, T2, ..., Ts} from C such that the stopping
    condition is not satisfied for X;
    For each isr in FIS {
        Let Ca(r) = {Tk | Tk ∈ X and isr ⊆ Tk}
        Let Cb(r) = X − Ca(r)
        Calculate Mr = M(Ca(r), Cb(r))
    }
    r* = arg max_r (Mr);
    C = (C − X) ∪ {Ca(r*)} ∪ {Cb(r*)}
} Until the stopping condition is satisfied for every X in C.

Figure 1: The main proposed algorithm.
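As an illustration, the objective M for two clusters, with difference taken as the summation-based measure justified by Lemma 2 and similarity as a count of frequent item sets present in each cluster, can be sketched as follows. This is a minimal sketch of our own, not the paper's implementation; the function names, the `min_count` presence threshold, and the toy data are hypothetical.

```python
def support(itemset, cluster):
    """Number of transactions in the cluster that contain the item set."""
    return sum(1 for t in cluster if itemset <= t)

def M(cluster_a, cluster_b, frequent_itemsets, min_count=1):
    """M = Difference + Similarity for a candidate two-way split.
    Difference: |sum of supports in A - sum of supports in B| (Lemma 2).
    Similarity: number of frequent item sets occurring in each cluster."""
    diff = abs(sum(support(s, cluster_a) for s in frequent_itemsets)
               - sum(support(s, cluster_b) for s in frequent_itemsets))
    sim = sum(1 for s in frequent_itemsets if support(s, cluster_a) >= min_count) \
        + sum(1 for s in frequent_itemsets if support(s, cluster_b) >= min_count)
    return diff + sim

# Hypothetical frequent item sets and clusters (transactions as frozensets).
fis = [frozenset({"a"}), frozenset({"a", "b"})]
ca = [frozenset({"a", "b"}), frozenset({"a"})]
cb = [frozenset({"c"})]
print(M(ca, cb, fis))  # prints 5 (diff = |3 - 0| = 3, sim = 2 + 0 = 2)
```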



For each division in the hierarchy, the dataset is split into two groups according to each pattern in the FIS: one group consists of all transactions containing that pattern, and the other group contains the rest of the transactions. From all the possible groupings, we choose the one that maximizes M.
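The division step above, repeated hierarchically as in Figure 1, can be sketched as follows. This is a runnable illustration under our own simplifying assumptions (transactions as frozensets, a precomputed FIS list, a summation-based M as defined earlier, and a hypothetical `min_size` stopping condition); the names are ours, not the paper's.

```python
def score(ca, cb, fis):
    """M for a candidate split: difference of support summations plus the
    number of (item set, cluster) pairs where the item set occurs."""
    sup = lambda s, c: sum(1 for t in c if s <= t)
    diff = abs(sum(sup(s, ca) for s in fis) - sum(sup(s, cb) for s in fis))
    sim = sum(1 for s in fis for c in (ca, cb) if sup(s, c) > 0)
    return diff + sim

def split_once(x, fis):
    """Try every pattern in FIS; return (M, Ca, Cb) for the best split."""
    best = None
    for s in fis:
        ca = [t for t in x if s <= t]      # transactions containing pattern s
        cb = [t for t in x if not s <= t]  # the rest of the transactions
        if not ca or not cb:
            continue  # degenerate split, skip
        m = score(ca, cb, fis)
        if best is None or m > best[0]:
            best = (m, ca, cb)
    return best

def hpbc(transactions, fis, min_size=2):
    """Recursively divide until no cluster is big enough to split further."""
    clusters, queue = [], [transactions]
    while queue:
        x = queue.pop()
        best = split_once(x, fis) if len(x) > min_size else None
        if best is None:
            clusters.append(x)
        else:
            queue.extend([best[1], best[2]])
    return clusters

data = [frozenset(t) for t in ({"a", "b"}, {"a"}, {"c", "d"}, {"c"})]
fis = [frozenset({"a"}), frozenset({"c"})]
print(hpbc(data, fis))  # two clusters: the "a" transactions and the "c" ones
```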

3.3 Computational Complexity

In this paper, time complexity is used to evaluate the complexity of the algorithms; it is expressed in terms of the number of operations used by an algorithm for an input of a particular size. In the HPBC algorithm, at each step of the loop two totals are computed, one for C1 and one for C2, and one more operation is performed outside the loop: the difference between the totals of C1 and C2. Thus 2n + 1 operations are used. For the GHIC algorithm, at each step of the loop four operations are performed (a, b, distance and difference), and one more operation is performed outside the loop: the total of the differences between C1 and C2. Thus 4n + 1 operations are used. Comparing the two algorithms, the operation count of GHIC is twice that of HPBC, even though both algorithms are O(n). HPBC has been shown to improve on GHIC by up to fifty percent on synthetic data.

OUTCOME

The following experiment uses synthetic web transaction data to test the execution performance of our model. The web transaction data were split into 5K, 10K, 15K, 20K, 25K and 30K data sets.

Figure 2: The experimental result for the execution performance (processing time in seconds) of GHIC and HPBC on input data of 5K to 30K transactions. From Figure 2, our proposed model HPBC takes less processing time than the GHIC model.

CONCLUSION

In this paper, we propose the Hierarchical Pattern-Based Clustering (HPBC) model for grouping web transactions. The experimental results show that our model improves on GHIC by up to 50% in processing time. Therefore, HPBC is more efficient than the GHIC model in terms of processing time.

REFERENCES

[1] A.K. Jain, M.N. Murty and P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, September 1999.
[2] C.H. Yun and M.S. Chen, "Mining Web Transaction Patterns in an Electronic Commerce Environment", Proc. 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2000), April 2000.
[3] C.H. Yun and M.S. Chen, "Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment", Proc. 24th Annual International Computer Software and Applications Conference, 2000.
[4] H. Wang, J. Wang, W. Wang and P.S. Yu, "Clustering by Pattern Similarity in Large Data Sets", Proc. ACM SIGMOD Conference, Madison, WI, June 2002.
[5] I.K. Ravinchandra Rao, "Data Mining and Clustering Techniques", DRTC Workshop on Semantic Web, 8-10 December 2003, Bangalore.
[6] M.D. Mustafa, N.F. Nabila, D.J. Evans, M.Y. Saman and A. Mamat, "Association Rules on Significant Rare Data Using Second Support", International Journal of Computer Mathematics, Vol. 83, No. 1, January 2006, pp. 69-80.
[7] M.H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, New Jersey, 2002.
[8] S.J. Oh and J.Y. Kim, "A Hierarchical Clustering Algorithm for Categorical Sequence Data", Information Processing Letters, Vol. 91, Issue 3, August 2004, pp. 135-140.
[9] Y. El-Sonbaty and M.A. Ismail, "On-Line Hierarchical Clustering", Pattern Recognition Letters, Vol. 19, 1998, pp. 1285-1291.
[10] Y.K. Woon, W.K. Ng and E.P. Lim, "A Support-Ordered Trie for Fast Frequent Itemset Discovery", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 7, July 2004, pp. 875-879.
[11] Y. Yang and B. Padmanabhan, "Data Mining for Customer Segmentation: A Behavioral Pattern-Based Approach", Proc. Third IEEE International Conference on Data Mining (ICDM '03), Melbourne, Florida, November 19-22, 2003, pp. 411-418.
[12] Y. Yang and B. Padmanabhan, "GHIC: A Hierarchical Pattern-Based Clustering Algorithm for Grouping Web Transactions", IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 9, September 2005, pp. 1300-1304.
