
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 2, March-April 2013 ISSN 2278-6856

Improving Webpage Clustering Using Multiview Learning


Lalit A. Patil 1, Prof. S. M. Kamalapur 2

1,2 Dept. of Computer Engg., KKWIEER, Nasik, India

Abstract: Traditional clustering algorithms are usually based on the bag-of-words (BOW) approach. A notorious disadvantage of the BOW model is that it ignores the semantic relationships among words. As a result, if two documents use different collections of core words to represent the same topic, they may be assigned to different clusters, even though the core words they use are probably synonyms or semantically associated in some other form. Conventional web page clustering techniques are often utilized to reveal the functional similarity of web pages. Exploiting word and tag correlations, also called multi-view learning, can be beneficial for improving clustering performance. To our knowledge, all the existing approaches exploiting tag information for webpage clustering assume that all the web pages are tagged, which is a somewhat restrictive assumption. In a more realistic setting, one can expect tags to be available for only a small number of web pages. In our approach, we consider multiview learning based on page text and tags. We propose a web page clustering approach based on the Probabilistic Latent Semantic Analysis (PLSA) model, fitted by an iterative algorithm based on the maximum likelihood principle.

A document can be modeled in the following five ways:
i. Word only: uses only the bag of words and discards the tag information.
ii. Tag only: uses only the bag of tags and discards the word information.
iii. Word and tag correlation: forms a correlation between the bags of words and tags, and uses it to derive feature vectors for each document.
iv. Word + tag append: appends the bags of words and tags, and uses the result to derive feature vectors for each document.
v. Joint PLSA: finds the probability of word and document co-occurrence with the hidden concepts of the documents.

Considering the views of the documents, clustering techniques are divided into the following types:
i. Web page clustering using a single view: only a single view of the page is considered, i.e., the words of the page.
ii. Web page clustering using multiple views: a number of views are considered, e.g., the words and tags of the pages.

In Section 2, we describe multiview web page clustering. Section 3 briefly describes the proposed system. We describe the system architecture in Section 4. Section 5 specifies the dataset required for the system. Our system experiment is described in Section 6, and we conclude in Section 7.

Keywords: Clustering, correlations, learning, multiview, semantically, subspace, tagged

1. INTRODUCTION
Traditional web page clustering algorithms use only features extracted from the page text, which limits clustering quality. A major trend in information retrieval is to make more and better use of the increasingly relevant user-provided data in order to improve the quality of clustering. Social bookmarking websites such as del.icio.us and StumbleUpon enable users to tag any web page with short free-form text strings, collecting hundreds of thousands of keyword annotations per day. The set of tags applied to a document is an explicit set of keywords that users have found appropriate for categorizing that document within their own filing systems. Tags therefore promise a uniquely well-suited source of information on the similarity between web documents, and high-quality clustering based on user-contributed tags has the potential to improve clustering quality.
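The five document-modelling choices listed in the abstract can be sketched as follows. This is a minimal illustration: the function name and the "tag:" prefix used to keep word and tag features in disjoint namespaces are our own conventions, not the authors' code.

```python
from collections import Counter

def document_features(words, tags, mode="word+tag"):
    """Return a sparse feature vector (a Counter) for one web page."""
    if mode == "word":          # i. word only: discard the tag information
        return Counter(words)
    if mode == "tag":           # ii. tag only: discard the word information
        return Counter(tags)
    if mode == "word+tag":      # iv. append both bags of features
        # prefix tags so a tag and a word with the same string stay distinct
        return Counter(words) + Counter("tag:" + t for t in tags)
    raise ValueError("unknown mode: " + mode)

# toy page with three word tokens and two user-assigned tags
vec = document_features(["python", "clustering", "clustering"],
                        ["programming", "python"])
```

The correlation (iii) and joint-PLSA (v) variants operate on the co-occurrence matrices built from these bags rather than on single-page vectors.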

2. MULTIVIEW WEB PAGE CLUSTERING


In multi-view learning, the features can be split into two subsets such that each subset alone is sufficient for learning. By exploiting both views of the data, multi-view learning can result in improved performance on various learning tasks [13]. Correlation Analysis (CA) is a technique for modeling the relationships between two (or more) sets of variables. CA computes a low-dimensional shared embedding of both sets of variables such that the correlations among the variables between the two sets are maximized in the embedded space. CA has been applied with great success in the past on a variety of learning problems dealing with multi-modal data. In this system, the words and tags of web pages are considered as two separate views, and correlation analysis is used to find the correlation between words and tags.

The aspect model underlying PLSA can be described as a generative model: select a document d with probability P(d); pick a latent class z with probability P(z|d); generate a word w with probability P(w|z). As a result, one obtains an observed pair (d, w), while the latent class variable z is discarded. Translating this process into a joint probability model results in the expression
P(d, w) = P(d) P(w|d)          (i)

where the document-specific word distribution is a mixture over the latent classes,

P(w|d) = Σz P(w|z) P(z|d)          (ii)

Figure 1: Multi View (Correlation Analysis)

In Figure 1, A represents the features extracted from the page-text subspace, B represents the features extracted from the tag subspace, and C represents the common subspace. Tagging improves clustering quality to some extent, but it does not work efficiently for pages that carry only a minimal number of tags.

Equation (ii) has to sum over the possible latent classes z which could have generated the observation. The aspect model is a statistical mixture model based on two independence assumptions: first, observation pairs (d, w) are assumed to be generated independently, which essentially corresponds to the bag-of-words approach; second, the conditional independence assumption that, conditioned on the latent class z, words w are generated independently of the specific document identity d. Given that the number of states is smaller than the number of documents (K ≪ N), z acts as a bottleneck variable in predicting w conditioned on d. Notice that, in contrast to document clustering models, the document-specific word distributions P(w|d) are obtained by a convex combination of the aspects or factors P(w|z). Documents are not assigned to clusters; they are characterized by a specific mixture of factors with weights P(z|d). These mixing weights offer more modeling power and are conceptually very different from the posterior probabilities in clustering models and (unsupervised) naive Bayes models. Following the likelihood principle, one determines P(d), P(z|d) and P(w|z) by maximization of the log-likelihood function

L = Σd Σw n(d, w) log P(d, w)          (iii)

where n(d, w) denotes the term frequency, i.e., the number of times w occurred in d. An equivalent, symmetric version of the model is obtained with the help of Bayes' rule:

P(d, w) = Σz P(z) P(d|z) P(w|z)          (iv)

Eqn. (iv) is a re-parameterized version of the generative model described by (i) and (ii). The following describes how this probabilistic form of LSA is actually used in the project. Assume that we are given two sets of web pages: one set T is tagged and the other set U is non-tagged. Further, |T| ≪ |U|, and N = |T| + |U| is the total number of web pages.
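The maximum-likelihood fitting is done iteratively with the EM algorithm [1]. A minimal sketch of EM for the symmetric parameterization (iv) might look like the following; the dense matrix shapes, random initialisation and fixed iteration count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def plsa(n_dw, K, iters=50, seed=0):
    """EM for the symmetric PLSA model P(d,w) = sum_z P(z) P(d|z) P(w|z).

    n_dw : (N, M) term-frequency matrix n(d, w)
    K    : number of latent aspects z
    Returns P(z) of shape (K,), P(d|z) of shape (K, N), P(w|z) of shape (K, M).
    """
    rng = np.random.default_rng(seed)
    N, M = n_dw.shape
    p_z = np.full(K, 1.0 / K)
    p_dz = rng.random((K, N)); p_dz /= p_dz.sum(axis=1, keepdims=True)
    p_wz = rng.random((K, M)); p_wz /= p_wz.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: posterior P(z | d, w) for every (d, w) pair
        joint = p_z[:, None, None] * p_dz[:, :, None] * p_wz[:, None, :]
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate the parameters from expected counts
        weighted = n_dw[None, :, :] * post
        p_dz = weighted.sum(axis=2); p_dz /= p_dz.sum(axis=1, keepdims=True)
        p_wz = weighted.sum(axis=1); p_wz /= p_wz.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2)); p_z /= p_z.sum()
    return p_z, p_dz, p_wz
```

Each M-step re-normalises the expected counts, so all three factors remain proper probability distributions throughout.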

3. LATENT SEMANTIC ANALYSIS


Probabilistic Latent Semantic Analysis (PLSA) is a statistical model used to improve the quality of clustering. Its core is a statistical model which has been called the aspect model: a latent variable model for general co-occurrence data which associates an unobserved class variable z ∈ Z = {z1, ..., zK} with each observation, i.e., with each occurrence of a word w ∈ W = {w1, ..., wM} in a document d ∈ D = {d1, ..., dN}; equations (i) and (ii) define it as a generative model. Now consider the class-conditional multinomial distributions P(·|z) over the vocabulary in the aspect model, which can be represented as points on the (M-1)-dimensional simplex of all possible multinomials. Via its convex hull, this set of K points defines a (K-1)-dimensional sub-simplex. The modeling assumption expressed by the aspect model is that all conditional distributions P(w|d) are approximated by a multinomial representable as a convex combination of the class-conditionals P(w|z). In this geometrical view, the mixing weights P(z|d) correspond exactly to the coordinates of a document in that sub-simplex. This demonstrates that, despite the discreteness of the latent variables introduced in the aspect model, a continuous latent space is obtained within the space of all multinomial distributions. Since the dimensionality of the sub-simplex is K-1, as opposed to M-1 for the complete probability simplex, this can also be thought of in terms of dimensionality reduction, and the sub-simplex can be identified with a probabilistic latent semantic space.

The goal is to obtain a clustering of all N web pages, for which we define the following:

A = document-word co-occurrence matrix of size N × M, where N is the number of documents in the corpus and M is the page-text vocabulary size. Aij denotes the frequency of word j appearing in document i. Note that the document-word co-occurrence matrix is constructed using both tagged and non-tagged web pages.

B = tag-word co-occurrence matrix (bag-of-words representation) of size T × M, where T is the total number of tags in the corpus and M is the page-text vocabulary size. Bij denotes the number of times tag i is associated with word j. Note that the tag-word matrix is constructed using only the tagged web pages. Also note that this is a more fine-granular association: we do not look for associations between a tag and a web page, but go a level further and consider the co-occurrences of tags with the actual words appearing in the web pages. We assume the same page-text vocabulary while constructing the matrices A and B; to do this, we pool all web pages (with and without tag information) and construct a common vocabulary of size M. Having constructed the document-word and tag-word co-occurrence matrices A and B, joint PLSA can be applied using them.

a) Pre-processing
i. Delete HTML tags and non-letter characters such as "$", "%" or "#".
ii. Transform the words into their root words; for example, connected, connecting and interconnection should be transformed to the word connect.
iii. Remove the stop words. Natural candidates for stop words are articles (e.g. "the"), prepositions (e.g. "for", "of") and pronouns (e.g. "I", "his").

b) Vector space model
The vector space model represents each document as a vector recording how many times each term occurs in that document, for example:

Terms        Document 1   Document 2
Data              1            0
Mining            5            0
Knowledge         4            1
Course            3            5
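Constructing A and B over a pooled common vocabulary can be sketched as follows. This is a toy illustration under assumed data structures (token lists and (tags, tokens) pairs), not the authors' code.

```python
import numpy as np

def cooccurrence_matrices(docs, tagged):
    """Build A (document-word) and B (tag-word) over a common vocabulary.

    docs   : list of token lists, tagged and non-tagged pages pooled together
    tagged : list of (tags, tokens) pairs for the tagged subset only
    """
    vocab = sorted({w for d in docs for w in d})       # common vocabulary, size M
    widx = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d:
            A[i, widx[w]] += 1            # Aij = frequency of word j in document i
    tags = sorted({t for ts, _ in tagged for t in ts})
    tidx = {t: i for i, t in enumerate(tags)}
    B = np.zeros((len(tags), len(vocab)))
    for ts, d in tagged:
        for t in ts:
            for w in d:
                B[tidx[t], widx[w]] += 1  # Bij = co-occurrences of tag i with word j
    return A, B, vocab, tags
```

A uses every page; B is filled only from the tagged subset, mirroring the |T| ≪ |U| setting above.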

c) Calculate TFIDF
The term count in a given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of the term in the document), giving a measure of the importance of the term t within the particular document d; this is the term frequency tf(t, d). The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
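A compact sketch of the tf-idf computation just described (length-normalised term count times log inverse document frequency); the unsmoothed idf form shown here is one common variant and an assumption on our part:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} dicts."""
    n_docs = len(docs)
    df = Counter()                              # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    out = []
    for doc in docs:
        counts = Counter(doc)
        out.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                    for t, c in counts.items()})
    return out
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the document vectors.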

Mathematically,

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

where |D| denotes the total number of documents in the corpus and |{d ∈ D : t ∈ d}| is the number of documents where the term t appears. The base of the log function does not matter, since it constitutes only a constant multiplicative factor in the overall result.

4. SYSTEM ARCHITECTURE

Figure 2: System Architecture diagram (input web pages → pre-processing → vector space model → calculate TFIDF → K-means → web pages in cluster form)

d) K-means clustering algorithm
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations cause different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early clustering is done. At this point we recalculate k new centroids, and a new binding is done between the same data-set points and the nearest new centroid. A loop has been generated; as a result of this loop, the k centroids change their location step by step until no more changes occur. The algorithm is composed of the following steps:
i. Choose cluster centroids to coincide with k randomly selected documents from the document set.
ii. Assign each document to its closest cluster.
iii. Re-compute the cluster centroids using the current cluster memberships.
iv. If there is a reassignment of documents to a new cluster, go to step ii; otherwise stop.

e) Web pages in cluster form
After applying K-means, we get the web pages in cluster form. This is used to calculate results such as the precision, recall and F-measure of the system.
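The four steps above can be sketched as follows; using Euclidean distance on the document vectors is our assumption, since the paper does not state the metric.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Cluster the rows of X into k clusters (steps i-iv above)."""
    rng = np.random.default_rng(seed)
    # i. choose centroids to coincide with k randomly selected documents
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # ii. assign each document to its closest cluster centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # iv. stop when no document is reassigned
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # iii. re-compute the centroids from the current cluster memberships
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

Seeding the centroids with actual documents (step i) guarantees every initial centroid lies in a populated region of the feature space.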

A. Experiment Results
Input pages: the downloaded web pages, identified by their hashed filenames (07190d9c7e88e6987d26af0f7a46fdbb, 16855a6d45f3ae9c4429f91e7ac51663, ...).

5. DATABASE
The dataset used in this system was taken from the Delicious social bookmarking website. For evaluation purposes, the required directory structure is available on the ODP (Open Directory Project) site. Our dataset consists of a collection of 117 tagged web pages used for our webpage clustering task. All web pages in our collection were downloaded from URLs that are present both in the ODP web directory (so that their ground-truth clusterings are available) and on the Delicious social bookmarking website (so that their tag information is available). The dataset of tags is available here: http://kmi.tugraz.at/staff/markus/datasets/. Each webpage that we crawled and downloaded was tagged by a number of users on Delicious; therefore, for each webpage, we combine the tags assigned to it by all users who tagged it. ODP is a free, user-maintained hierarchical web directory. Each node in the ODP hierarchy has a label (e.g., Arts or Python) and a set of associated documents.

In information retrieval contexts, precision and recall are defined in terms of a set of retrieved documents and a set of relevant documents.
i) Precision: In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the search.
ii) Recall: Recall in information retrieval is the fraction of the documents relevant to the query that are successfully retrieved.
F-measure: In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

F1 = 2 · p · r / (p + r)
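Given the counts just defined, the three measures can be computed as follows (the argument names are illustrative):

```python
def precision_recall_f1(n_correct, n_returned, n_relevant):
    """p = correct/returned, r = correct/relevant, F1 = their harmonic mean."""
    p = n_correct / n_returned
    r = n_correct / n_relevant
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

For example, 7 correct results out of 15 returned and 25 relevant gives p ≈ 0.467, r = 0.28 and F1 = 0.35, matching the scale of the figures reported in Table 1.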

6. EXPERIMENTAL SETUP
A. Software used
   a. Java Development Kit 1.6 or later
B. Hardware specification
   a. RAM: 1 GB
   b. Pentium 4 or later
C. Platform
   a. Windows
D. Tools
   a. NetBeans 6.8 IDE


In this system, four techniques are used. In the first technique, clustering is done with the help of words; in the second, tags are used; in the third, the correlation of tags and words is used; and in the fourth, PLSA is used.

The F-measure is used in information retrieval for measuring document classification performance. In pattern recognition and information retrieval, precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved; both precision and recall are therefore based on an understanding and measure of relevance. The results given below were obtained on 25 documents and 5 clusters with 5 different clustering methods.
No. of pages   No. of clusters   Clustering method      Precision   Recall   F-measure (%)
25             5                 Word only              0.4666      0.2800   34.99
25             5                 Tag only               0.4000      0.2400   30.00
25             5                 Word-Tag correlation   0.5300      0.3200   39.90
25             5                 Word + Tag             0.4666      0.2800   34.99
25             5                 PLSA                   0.4000      0.2400   30.00

Table 1: Results of the different clustering methods

REFERENCES
[1] Dempster, A., Laird, N., and Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. B 39 (1977), pp. 1-38.
[2] Dumais, S. T. Latent semantic indexing, TREC-3 report. In Proceedings of the Text Retrieval Conference (TREC-3) (1995), D. Harman, Ed., pp. 219.
[3] Gildea, D., and Hofmann, T. Topic-based language models using EM. In Proceedings of the 6th European Conference on Speech Communication and Technology (EUROSPEECH) (1999).
[4] Hofmann, T. Latent class models for collaborative filtering. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI) (1999).
[5] Hofmann, T. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in AI (1999).
[6] Hofmann, T., Puzicha, J., and Jordan, M. I. Unsupervised learning from dyadic data. In Advances in Neural Information Processing Systems (1999), vol. 11.
[7] Tipping, M., and Bishop, C. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999.
[8] Bach, F. R., and Jordan, M. I. Kernel independent component analysis. Journal of Machine Learning Research 3 (2003).
[9] Bickel, S., and Scheffer, T. Multi-view clustering. In ICDM '04 (Washington, DC, USA, 2004), IEEE Computer Society.
[10] Kakade, S. M., and Foster, D. P. Multi-view regression via canonical correlation analysis. In COLT '07 (2007).
[11] Ando, R. K., and Zhang, T. Two-view feature generation model for semi-supervised learning. In ICML '07 (2007).
[12] Bao, S., Xue, G., Wu, X., Yu, Y., Fei, B., and Su, Z. Optimizing web search using social annotations. In WWW '07 (2007).
[13] Blaschko, M. B., and Lampert, C. H. Correlational spectral clustering. In CVPR (2008).
[14] Lu, C., Chen, X., and Park, E. K. Exploit the tripartite network of social tagging for web clustering. In CIKM '09 (2009), pp. 1545-1548.

B. Graph of result

Figure 3: Graph of result

The graph above plots F-measure versus the number of pages. These results were taken on 15 pages with 3, 4 and 5 clusters using the different web page clustering methods. The graph shows that the word plus tag correlation method performs better than the remaining methods.

7. CONCLUSIONS
User-generated contents are small but structured in nature (e.g. tags). These structured contents can be used alongside the huge, unstructured information (e.g. page text): using multiview web page clustering, the web pages were clustered and the results were compared with the ODP (Open Directory Project) clusters.


[15] Ramage, D., Heymann, P., Manning, C. D., and Garcia-Molina, H. Clustering the tagged web. In WSDM '09 (2009).
[16] Trivedi, A., Rai, P., and DuVall, S. L. Exploiting tag and word correlations for improved webpage clustering. In SMUC '10, October 30, 2010, Toronto, Ontario, Canada. ACM, 2010.
[17] Poomagal, S., and Hamsapriya, T. K-means for search results clustering using URL and tag contents. IEEE, 2011.
[18] http://www.google.com
[19] http://www.stumbleupon.com
[20] http://www.delicious.com
[21] http://www.dmoz.org
[22] http://www.Wikipedia.com
[23] Oikonomakou, N., and Vazirgiannis, M. A review of web document clustering approaches. Dept. of Informatics, Athens University of Economics & Business, Patision 76, 10434, Greece.


