International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 3, Issue 2, March – April 2014 ISSN 2278-6856
Performance Evaluation of Web Search Result Clustering and Labeling

B R Prakash¹ and Hanumanthappa M²

¹Research Scholar, Department of Computer Science & Applications, Bangalore University, Bangalore, India
²Professor, Department of Computer Science & Applications, Bangalore University, Bangalore, India
Abstract: Web search results clustering is an increasingly popular technique for providing a useful grouping of web search results. This paper introduces a prototype web search results clustering engine. The proposed clustering algorithm, M-FPF, is compared against three other established web document clustering algorithms: Suffix Tree Clustering (STC), Lingo, and k-means. The major goal of the experiment is to determine the quality of the results given by the system from the common user's point of view, so we compare the results of the different algorithms against the ground truth of a user survey. We measure cluster quality in terms of precision, recall and F-measure. Results from testing on different datasets show considerable clustering quality.
Keywords: Clustering, document clustering, cluster labeling, information retrieval, FPF algorithm.
1. INTRODUCTION
With the increase in information on the World Wide Web, it has become difficult to find desired information using search engines. The low precision of web search engines, coupled with the long ranked-list presentation, makes it hard for users to find the information they are looking for, and doing so takes a lot of time. Typical queries retrieve hundreds of documents, most of which have no relation to what the user is looking for. One reason for this is that users often fail to formulate a suitable or sufficiently specific query; efforts have been made to apply some form of natural language processing to search queries to try to understand the underlying concept the user is trying to express. One solution to this problem is to enable more efficient navigation of search results by clustering similar documents together [1][2]. By clustering the web search results generated by a conventional search engine, the results can be organized in a manner that reduces user stress and makes searching more efficient, while leveraging the existing search capability and indexes of existing search engines [9].
2. OVERVIEW OF M-FPF AND IMPROVING THE FPF ALGORITHM
In this paper we improve the Furthest Point First (FPF) algorithm from both the computational-cost point of view and in terms of output clustering quality. Since the FPF algorithm as proposed by Gonzalez [12] is theoretically optimal (unless P = NP), only heuristics can be used to obtain better results and, in the worst case, it is not possible to go beyond the theoretical bounds. We profiled FPF, analyzed the most computationally expensive parts of the algorithm, and found that most of the distance computations are devoted to finding the next furthest point. FPF clustering quality can be improved by modifying part of the clustering schema: we describe an approach that uses random sampling to improve the clustering output quality, and we call this algorithm M-FPF [10][11]. Another crucial shortcoming of FPF is that it selects a set of centers that is not representative of the clusters. This must be imputed to the fact that, when FPF creates a new center, it selects the furthest point from the previously selected centers, so the new center is likely to be close to a boundary of the subspace containing the data set. To overcome this, we modify M-FPF to use medoids instead of centers.
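To make the two modifications concrete, the following Python sketch combines FPF with the random-sampling step and the medoid substitution. It is a minimal illustration under stated assumptions: the sample size sqrt(n*k), the generic dist function, and the assumption k <= len(points) are illustrative choices, not the exact constants of our implementation.

    import math
    import random

    def fpf(points, k, dist):
        """Gonzalez's Furthest Point First: greedily pick k centers so that
        the maximum point-to-center distance is (approximately) minimized.
        Assumes k <= len(points)."""
        centers = [random.choice(points)]
        # nearest[i] = distance from points[i] to its closest center so far
        nearest = [dist(p, centers[0]) for p in points]
        while len(centers) < k:
            # the next center is the point furthest from all current centers
            i = max(range(len(points)), key=nearest.__getitem__)
            centers.append(points[i])
            nearest = [min(d, dist(p, points[i])) for d, p in zip(nearest, points)]
        return centers

    def m_fpf(points, k, dist):
        """M-FPF sketch: run FPF on a random sample, assign the remaining
        points to their nearest center, then replace each center with the
        medoid of its cluster (the member minimizing total intra-cluster
        distance), which is less likely to sit on a cluster boundary."""
        sample_size = min(len(points), max(k, int(math.sqrt(len(points) * k))))
        sample = random.sample(points, sample_size)
        centers = fpf(sample, k, dist)
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            clusters[j].append(p)
        medoids = [min(c, key=lambda a: sum(dist(a, b) for b in c))
                   for c in clusters if c]
        return medoids, [c for c in clusters if c]

For web snippets, dist would typically be a cosine-derived distance over term-weight vectors such as those discussed in Section 5.4.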
3. ARCHITECTURE OF THE PROPOSED WEB SEARCH RESULT CLUSTERING SYSTEM
Clustering and labeling are both essential operations for a web snippet clustering system [6][7]. The proposed Web Search Result Clustering system comprises:
i. Data collection from search engine results
ii. Web snippet cleaning and filtering
iii. First-level clustering of web snippets
iv. Candidate word extraction from web snippets
v. Second-level clustering of weighted terms
vi. Cluster labeling based on candidate words
Figure 1: Architecture of the proposed web search result clustering system
(i) Data collection from search engine results and cleaning: The user of the Web Search Result Clustering
system (WSRC) issues a query string that is redirected by WSRC to the selected search engines (the user can select search engines such as Google and/or Yahoo, etc.). A search engine returns a list of snippets describing web pages that are deemed relevant to the query. An important system design issue is deciding the type and number of snippet sources to be used as auxiliary search engines. It is well known that the probability of a snippet being relevant to the user's information need decays quickly with the rank of the snippet in the list returned by the search engine. The need to avoid low-quality snippets therefore suggests using many sources, each supplying a small number of high-quality snippets [8]. On the other hand, increasing the number of snippet sources raises the practical issues of handling several concurrent threads, detecting and handling more duplicates, and a more complex handling of the composite ranking obtained by merging several snippet lists. In the experiment 200 snippets are collected, a number that keeps the total waiting time reasonable. The initial composite ranking of the merged snippet list is produced by a very simple method, i.e. by alternately picking snippets from each source list, as sketched below.
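A minimal sketch of this round-robin merge follows; the assumption that each snippet is a dict carrying at least a url field (used for duplicate detection) is a hypothetical representation for illustration.

    from itertools import zip_longest

    def merge_round_robin(snippet_lists):
        """Build the initial composite ranking by alternately taking the next
        snippet from each source list, skipping duplicates across engines."""
        merged, seen = [], set()
        for round_ in zip_longest(*snippet_lists):   # one pick per source
            for snippet in round_:
                if snippet is not None and snippet["url"] not in seen:
                    seen.add(snippet["url"])
                    merged.append(snippet)
        return merged

    # e.g. merge_round_robin([google_snippets, yahoo_snippets])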
(ii) Web snippet cleaning and filtering: Snippets that are too short, have little informative content, or are in non-English alphabets are filtered out. The input is then filtered by removing non-alphabetic symbols, HTML tags, stop words, and the query terms. The query terms are removed because they are likely to be present in every snippet, and are thus useless for the purpose of discriminating different contexts; on the other hand, they are very significant for the user, and hence are considered in the label generation phase described below. The prevalent language of each snippet is then identified, which allows one to choose the appropriate stop-word list and stemming algorithm. Currently, a ccTLD-based language identification heuristic and Porter's stemming algorithm are used.
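The cleaning step can be sketched as follows; the tiny stop-word list and the minimum-length threshold are illustrative assumptions, and language identification and stemming are left out for brevity.

    import re

    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is", "for"}

    def clean_snippet(text, query_terms, min_tokens=3):
        """Strip HTML tags and non-alphabetic symbols, drop stop words and
        query terms, and discard snippets that end up too short."""
        text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
        text = re.sub(r"[^A-Za-z\s]", " ", text)    # keep alphabetic text only
        tokens = [t for t in text.lower().split()
                  if t not in STOP_WORDS and t not in query_terms]
        return tokens if len(tokens) >= min_tokens else None

    clean_snippet("<b>Data mining</b> tools, surveys and trends",
                  query_terms={"data", "mining"})
    # -> ['tools', 'surveys', 'trends']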
(iii) First-level clustering of web snippets: A flat clustering representing the first level of the cluster hierarchy is built using the FPF algorithm. An important issue is deciding the number of clusters to create. Currently this number is fixed by default at 30, but it is clear that the number of clusters should depend on the query and on the number of snippets found. A general rule is difficult to find, also because the optimal number of clusters to be displayed is a function of the goals and preferences of the user. Moreover, while techniques for automatically determining such an optimal number do exist [4], their computational cost is incompatible with the real-time nature of the web application. Therefore, besides providing a default value, the user is allowed to increase or decrease the number of clusters to his/her liking. Clusters that contain only one snippet are probably outliers of some sort, and are merged under a single cluster labeled "Other topics".
(iv) Candidate word extraction from web snippets: For each cluster we need to determine a set of candidate words for appearing in its label; these will hereafter be referred to as candidates. For each of the selected terms we compute a weight, using a proposed term weighting scheme for document clustering based on the traditional TF-IDF. The motivation is the idea that the terms of the documents in the same cluster have similar importance, compared to the terms of the documents in a different cluster [3]. The emphasis is on the terms that are important within a cluster, treating the other terms as irrelevant and redundant; accordingly, more weight is given to the terms that are common within a cluster but uncommon in other clusters. The terms in each cluster with the highest scores are chosen as candidates, so the weight of a term in a cluster depends on the contents of the other clusters (see the sketch below). At the end of this procedure, if two clusters have the same label, they are merged, since this is an indication that the target number of clusters k may have been too high for this particular query.
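A sketch of this candidate selection follows. The concrete score tf(t, c) * log(k / cf(t)), where cf(t) is the number of clusters containing t, is an illustrative cluster-based variant of TF-IDF, not necessarily the exact formula of our scheme.

    import math
    from collections import Counter

    def candidate_words(clusters, top_m=5):
        """clusters: list of clusters, each a list of tokenized snippets.
        A term scores high when frequent inside its own cluster but present
        in few other clusters; terms occurring in every cluster score zero."""
        k = len(clusters)
        tf = [Counter(t for snippet in c for t in snippet) for c in clusters]
        cf = Counter(t for counts in tf for t in counts)  # cluster frequency
        return [sorted(counts, key=lambda t: counts[t] * math.log(k / cf[t]),
                       reverse=True)[:top_m]
                for counts in tf]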
(v) Second-level clustering of weighted terms: Although the clustering algorithm could in principle be iterated recursively in each cluster up to an arbitrary number of levels, we limit our algorithm to two levels, since this is likely to be the maximum depth a user is willing to explore when searching for an answer to an information need. Second-level clustering is applied to the first-level clusters whose size exceeds a predetermined threshold (at the moment fixed at 10 snippets, excluding duplicates). For second-level clustering we adopt a different approach, since metric-based clustering applied at the second level tends to detect a single large cluster. The second-level part of the hierarchy is instead generated from the candidate words found for each cluster during first-level candidate word selection, as in the sketch below.
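A sketch of the second-level step under the same assumed snippet representation (dicts with url and tokens fields); grouping snippets by the first candidate word they share is a simplified stand-in for the candidate-based clustering described above.

    MIN_CLUSTER_SIZE = 10   # threshold from the text, duplicates excluded

    def second_level(clusters, candidates):
        """Sub-cluster each first-level cluster that holds more than
        MIN_CLUSTER_SIZE distinct snippets, using its candidate words."""
        result = {}
        for i, cluster in enumerate(clusters):
            distinct = list({s["url"]: s for s in cluster}.values())
            if len(distinct) > MIN_CLUSTER_SIZE:
                groups = {}
                for s in distinct:
                    key = next((t for t in candidates[i] if t in s["tokens"]),
                               "other")
                    groups.setdefault(key, []).append(s)
                result[i] = groups
        return result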
(vi) Cluster labeling based on candidate words: The semantics of cluster descriptions is often very poor, and frequent phrases with no meaning creep into cluster labels [5]. In our cluster labels, candidate keywords are included, which are the frequent terms obtained from the modified TF-IDF term weighting scheme. We use the candidate keywords only as a basis for generating well-formed phrases that are shown to the user as the real cluster labels. To label a cluster, among all of its snippets we select the shortest substring of a snippet title among those having the highest score. Once a short phrase is selected, a simple natural language processing technique is applied to adjust possible grammatical imperfections due to the fact that the selected phrase came from a title, prefixing and suffixing the resulting label.
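The title-based label selection can be sketched as below; scoring whole titles rather than all of their substrings, and skipping the grammatical clean-up, are simplifications made for illustration.

    def choose_label(snippet_titles, candidate_scores):
        """Score each title by the summed weights of the candidate words it
        contains; return the shortest title among the highest-scoring ones."""
        def score(title):
            return sum(candidate_scores.get(w, 0.0) for w in title.lower().split())
        best = max(score(t) for t in snippet_titles)
        return min((t for t in snippet_titles if score(t) == best), key=len)

    choose_label(["Data Mining Tools", "Free Data Mining Tools and Tutorials"],
                 {"data": 2.0, "mining": 1.8, "tools": 0.7})
    # -> 'Data Mining Tools'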
User Interface: The user interface is important for the success of a web-based service. We have adopted the open
source source software Carrot2, in which the data are shown in ranked-list format in the main frame, while the list of cluster labels is presented in the left frame as a navigation tree. The interface also allows the user to select the number of clusters by increasing or decreasing the default value.
4. PERFORMANCE EVALUATION OF WEB SEARCH RESULT CLUSTERING AND LABELING
Our experiments can be divided into two classes according to the way the results are evaluated [10]: user evaluation and expert evaluation.
The major goal of the first type of experiment was to determine the quality of the results given by the system from the common user's point of view. To achieve this goal, we performed a survey on a group of users, asking them to judge simple aspects of the algorithm's results. We also created a few simple metrics that give exact numerical values based on these user assessments. Experiments of the second kind, on the other hand, consisted of comparing the results of different algorithms (or of one algorithm for different parameters or input data) by a single person (an expert) in order to find differences between them. In this part of the evaluation we concentrated on finding and describing the properties of the algorithm's results rather than on attempting to calculate exact metrics. This part of the evaluation experiment was therefore more research- than user-oriented, trying to find the reasons behind the final results rather than judging them definitively.
4.1 Methods of Evaluating Clustering Results
It turns out that the evaluation of clustering is quite a difficult task. It is particularly hard to create objective, numerical measures of quality. Here we point out the most important reasons for this:
Subjectivity of evaluation. The primary criterion in clustering evaluation should be the usefulness of the given results. However, for such a definition of quality it is hard to find objective measures. Different end users may have different preferences (e.g. some may be interested in small, precisely described clusters while others prefer large, more general groups), so there may exist many equivalent ways of partitioning the input document set. This is a common problem in Web mining, as this domain is very user-oriented.
Multi-criteria nature of the problem. Apart from the major goal of our evaluation, the quality of results is influenced by a number of factors, such as the conciseness of the created clusters or the intelligibility of their descriptions, and it is hard to aggregate them all into one factor.
Difficulty in comparing results given by different algorithms. The results of different clustering algorithms may have different properties (i.e. be hierarchical or flat, overlapping or not), so it is difficult to create quality measures that can be applied equally well to different algorithms and therefore fairly compare their results.
4.2 Clustering Results Quality Measures
The survey was created taking into account key requirements for web search results clustering, such as the need for browsable summaries of the clusters. Users taking part in the evaluation were asked to answer the following questions:
Usefulness of created clusters. For each of the created groups (except the "Other Topics" group), the users were asked to assess its usefulness, defined as the user's ability to deduce the topics of the documents contained in it on the basis of its description. The most important factor influencing this criterion is clearly the algorithm's ability to create meaningful, accurate cluster labels.
We should add here that the useful clusters are not only those corresponding to topics in which the user is interested. Provided their descriptions are concise and accurate, clusters "unexpected" in the context of the query, containing documents irrelevant to the user's interests, are also helpful, as they enable the user to quickly separate these documents from the interesting ones. This is exactly the way in which clustering improves search results.
Precision of assignment of documents to their respective clusters. For each assignment of a document to a group, the users were asked to judge whether the assignment is correct (i.e. whether the document's snippet fits the group's description). Here we decided to use a scale finer than a binary one, as documents may fit their clusters to some degree (e.g. be more or less specific than the respective clusters, but still somehow related to their descriptions). Precision of assignment is assessed only if the cluster to which the document has been assigned has been judged as useful: if the user cannot understand a cluster's title, he will probably discard it and will not browse the documents contained in it. Moreover, the "Other Topics" group and its documents are also excluded from this process. An important remark: in the case of overlapping results, one document may be contained in many clusters, so its assignment to each of the clusters containing it should be judged independently. Scale: well matching | moderately matching | not matching
After collecting the forms from the users we are able to calculate the following statistics:
- total number of clusters in the results
- total number of useful clusters
- total number of snippets in the results
- total number of snippet assignments in the results
- total number of snippets well matching their clusters
- total number of snippets moderately matching their clusters
- total number of snippets not matching their clusters
- total number of snippets in the "Other Topics" group
On the basis of these statistics, the following quality measures have been defined:
Cluster label quality. This is the proportion of the number of useful clusters to the number of all clusters (without the "Other Topics" group):

Cluster label quality = (number of useful clusters) / (number of all clusters, excluding "Other Topics")

Snippet assignment distribution. This measure in fact consists of three, describing the proportions of the numbers of snippets well / moderately / not matching their clusters to the total number of snippet assignments (excluding assignments to clusters judged useless and to the "Other Topics" group):

Fraction of well matching snippets = (number of snippets well matching their clusters) / (total number of judged snippet assignments)

Fraction of moderately matching snippets = (number of snippets moderately matching their clusters) / (total number of judged snippet assignments)

Fraction of not matching snippets = (number of snippets not matching their clusters) / (total number of judged snippet assignments)

Assignment coverage. Assignment coverage is the fraction of snippets assigned to clusters other than the "Other Topics" group (including clusters judged useless) in relation to the total number of input snippets:

Assignment coverage = (number of snippets assigned outside "Other Topics") / (total number of input snippets)
4.3 New Technique Developed: Cluster Label Quality Including the Size of Assessed Clusters
We propose to calculate the cluster label quality factor on the basis not of the number but of the size of the useless clusters (i.e. the number of documents contained in them), as the more documents are assigned to useless groups, the more information is lost for the user. This factor may be calculated from the numbers of snippet assignments in the following manner:

Cluster label quality = (number of snippet assignments to useful clusters) / (total number of snippet assignments, excluding "Other Topics")
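All four measures can be computed directly from the collected counts. In this sketch the dictionary keys mirror the statistics listed above; all_clusters and all_assignments are assumed to already exclude the "Other Topics" group, and the sample numbers are invented purely for illustration.

    def survey_metrics(s):
        """Quality measures of section 4.2 plus the size-aware label
        quality of section 4.3, computed from the survey statistics."""
        label_quality = s["useful_clusters"] / s["all_clusters"]
        judged = s["well"] + s["moderate"] + s["not"]
        distribution = {m: s[m] / judged for m in ("well", "moderate", "not")}
        coverage = (s["snippets"] - s["other_topics"]) / s["snippets"]
        # size-aware variant: weight useless clusters by the number of
        # assignments they absorb rather than counting each cluster once
        size_aware_quality = s["assignments_to_useful"] / s["all_assignments"]
        return label_quality, distribution, coverage, size_aware_quality

    survey_metrics({"useful_clusters": 18, "all_clusters": 24,
                    "well": 110, "moderate": 40, "not": 20,
                    "snippets": 200, "other_topics": 25,
                    "assignments_to_useful": 170, "all_assignments": 215})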
Calculating cluster label quality and snippet assignment distribution separately for the top-level groups. Apart from calculating these coefficients for all clusters, when evaluating clustering results one may calculate them only for the top-level clusters (i.e. ones that have no parent groups). This should be done because of the fact, mentioned earlier, that the top groups are the ones most important to the user.
4.4 The Experiment
Our user evaluation experiment was conducted with 30 users (both postgraduate students and research scholars), who were asked to fill in survey forms. Such a small number of participants may of course seem insufficient for the results to be recognized as statistically significant; unfortunately, given our lack of resources and time, it was hard to find a larger number of people to help carry out the evaluation. We should also mention that all persons engaged in this process are experienced Internet users professionally connected with computer science, and most of them are also experts in the domain of clustering algorithms. This fact is also likely to have influenced the obtained results.
For the purpose of evaluation we used five queries. As we also wanted to study the influence of the breadth of the topic scope of the input document set on result quality, two of the queries used in the experiment were general, while the other three were more specific.
Figure 2: Evaluation results (snippet distribution, cluster label quality and assignment coverage) across queries.
The snippets used as input data for the experiment come from the Google search engine. The results handed to the persons involved in the evaluation were obtained using the Carrot2 module with its default parameter values.
5. EXPERT EVALUATION OF THE FPF ALGORITHM: INFLUENCE OF PARAMETERS AND INPUT DATA ON THE RESULTS
The major goal of the experiments performed is to show the influence of changes in the algorithm's parameters and input data on its results. Owing to the already mentioned variety of parameters that may be used in our implementation, it was impossible to repeat a user evaluation similar to the one presented previously for every aspect of the algorithm. We have therefore decided to present the results of these experiments with a simple verbal description, pointing out the most important differences. During the expert evaluation we used the default values of the algorithm's parameters, and the top 100 snippets returned by different search engines (Google, Yahoo) as the input data.
5.1 Influence of Query Breadth
The purpose of this experiment was to compare the clusters created by the algorithm for two different queries, one general and one more specific, covering a small subset of the documents corresponding to the first one. We chose two queries: "data" and "data mining".
Figure 3: Comparison of SRC results for the queries "data" (general, on the left) and "data mining" (much more specific, on the right).
The result of this comparison shows that the algorithm performs better with general queries than with more specific ones. The main reason for this is of course that it is much more difficult to find distinct topics in a set of documents relevant to a specific query, and therefore to create good, heterogeneous clusters. On the other hand, group descriptions are much better when clustering the results of specific queries.
5.2 Influence of the Number of Input Snippets
The natural effect of an increase in input data size should be an increase in the number of clusters in the results and in their size. It should also be expected to improve the quality of cluster labels, as a small document set may contain proportionally many over-specialized phrases that get chosen as cluster labels; with a larger input collection we may expect these phrases to be replaced by ones a little more general, but much more informative to the user. Another important aspect of increasing the input data size, not considered here, is its influence on the algorithm's processing time.
Figure 4: Comparison of SRC results for the query "unix" using 100 (on the left) and 300 (on the right) input snippets.
The most important differences caused by increasing the number of input snippets are the increase in the number of created top-level clusters and the growth of the hierarchy, which has become much more complex. It is clearly visible that while with 300 snippets the top-level groups have more subclusters, with 100 snippets the structure is flatter, with hierarchy present only in a few of the biggest clusters. However, the increase in the number of groups was much smaller than the increase in the number of documents, due to repeating topics in the input set.
It turns out that our assumption about the influence of input data size on cluster label quality is only partially fulfilled. Although some of the groups in the right tree have shorter descriptions, it is hard to say whether their quality has really improved. On the other hand, it is clearly visible that in the hierarchy constructed by clustering a larger number of documents, the first clusters (i.e. the biggest ones) correspond better to the most important subtopics (due to the fact that the algorithm operates on a more representative sample).
5.3 Influence of the Use of Phrases
The purpose of this experiment is to show the benefits of using phrases in the proposed SRC, and in the vector space model in general. According to our hypothesis formulated earlier, the use of phrases shouldn't have a strong influence on the structure of the hierarchy itself, but rather on the quality of the cluster labels.
Figure 5: Comparison of SRC results for the query "search result clustering" with the maximum length of extracted phrases equal to 1 (on the left) and 5 (on the right).
Closer examination shows that, despite the use of phrases, the structure has changed only slightly. This is due to the fact that if a group of documents shares a phrase, it also shares all of the words included in it, so the introduction of phrases should have no great impact on inter-document similarities. On the other hand, the cluster labels have become much more coherent: when we use only single terms, their sequence in the descriptions often differs from that in the documents, and this may confuse the user.
5.4 Influence of Term Weighting Methods
In this experiment we compared four methods of term weighting:
- standard tf term weighting
- a version assigning the square root of the tf coefficient as the weight of a term
- standard tf-idf term weighting
- cluster-based term (CBT) weighting
One may easily notice that it is hard to find any significant differences between the hierarchies created using these methods (with the sole exception of standard tf-idf weighting). In particular, the hierarchies created using tf weighting and the sqrt(tf) formula are very similar. The most important difference in the case of the standard tf-idf term weighting method is frequently its inability to form larger clusters. The four variants are summarized in code after the discussion below.
Figure 6: Comparison of SRC results for the query "new jaguar" with different term weighting methods (from left to right: tf, sqrt(tf), tf-idf and CBT).
In this example, this property manifests itself in the case of the group "new jaguar" (the largest of all created for this query): when using this method, the cluster is only half the size it has with the other methods. This supports our earlier hypothesis that the idf factor used in this method decreases too quickly with the increase in the number of documents in which a term appears. It is hard to determine unambiguously which of the remaining methods produces the best results. In our opinion, making such judgments would require conducting an experiment similar to the previous one, with at least several different queries and exact quality measures; in fact there would have to be four such experiments, one for each of the presented methods.
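For reference, the four weighting variants can be written as a single helper; the cluster-based (CBT) formulation tf * log(k / cf) is an assumption consistent with the candidate-word weighting sketched earlier, not a definitive statement of the scheme.

    import math

    def term_weight(tf, df=1, n_docs=1, cf=1, k=1, scheme="tf"):
        """tf: frequency of the term in the document; df: number of
        documents containing it; cf: number of clusters containing it;
        k: total number of clusters."""
        if scheme == "tf":
            return tf
        if scheme == "sqrt_tf":
            return math.sqrt(tf)
        if scheme == "tf_idf":
            return tf * math.log(n_docs / df)   # decays fast as df grows
        if scheme == "cbt":
            return tf * math.log(k / cf)        # rare across clusters scores high
        raise ValueError(f"unknown scheme: {scheme}")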
6. CONCLUSION
Web mining, and in particular search results clustering, are relatively new and quickly developing domains of computer science. These domains are promising, and certainly a lot of work remains to be done before clustering algorithms are widely used. Current methods usually have a great advantage over older algorithms derived from the areas of statistics and text databases, as they pay more attention to the meaning of the texts contained in the input data, and not only to simple co-occurrences of particular words. Good results are, to a large extent, an effect of employing pre- and post-processing methods that make it possible to fit the algorithm to our application domain. Yet these pre- and post-processing phases, which in our view are very important, are considerably less studied. Moreover, we have found these phases, where not numerical but textual data is processed, much
more interesting. Finally, we would like to emphasize that this work is part of a much greater whole, and the module may be developed further in the future in order to study other aspects of the algorithm.
References
[1] Carrot2 web search results clustering interface, http://www.cs.put.poznan.pl/dweiss/carrot. CHIP, 118(3): 128-132, March 2003.
[2] B. E. Dom, "An information-theoretic external cluster validity measure," IBM Research Report RJ 10219, 2001.
[3] S. Osinski and D. Weiss, "A concept-driven algorithm for clustering search results," IEEE Intelligent Systems, vol. 20, no. 3, pp. 48-54, 2005.
[4] M. Kaki, "Findex: Search result categories help users when document ranking fails," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2005.
[5] G. Zhang, Y. Liu, S. Tan, and X. Cheng, "A novel method for hierarchical clustering of search results," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Workshops, 2007.
[6] Y. Tzitzikas, N. Armenatzoglou, and P. Papadakos, "FleXplorer: A framework for providing faceted and dynamic taxonomy-based information exploration," in Database and Expert Systems Application (FIND '08 at DEXA '08), pp. 392-396, Torino, Italy, 2008.
[7] J. Wang, Y. Mo, B. Huang, J. Wen, and L. He, "Web search results clustering based on a novel suffix tree structure," in Proceedings of the 5th International Conference on Autonomic and Trusted Computing (ATC '08), vol. 5060, pp. 540-554, Oslo, Norway, June 2008.
[8] H.-T. Zheng, B.-Y. Kang, and H.-G. Kim, "Exploiting noun phrases and semantic relationships for text document clustering," Information Sciences, 179(13): 2249-2262, 2009.
[9] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed., Morgan Kaufmann, San Francisco, CA, USA, 2011.
[10] M. Wróblewski, "A hierarchical WWW pages clustering algorithm based on the vector space model," Master's thesis, Poznań University of Technology, Poland, July 2003.
[11] D. Lewandowski, "Search engine user behavior: How can users be guided to quality content?" Information Services & Use, vol. 28, no. 3/4, pp. 261-268, 2008.
[12] C. C. Chen and M. C. Chen, "TSCAN: A content anatomy approach to temporal topic summarization," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, pp. 170-183, January 2012.