International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 3, Issue 2, March – April 2014 ISSN 2278-6856
Performance Evaluation of Web Search Result Clustering and Labeling

B R Prakash¹ and Hanumanthappa M²

¹Research Scholar, Department of Computer Science & Applications, Bangalore University, Bangalore, India
²Professor, Department of Computer Science & Applications, Bangalore University, Bangalore, India
Abstract: Web search results clustering is an increasingly popular technique for providing a useful grouping of web search results. This paper introduces a prototype web search results clustering engine. The proposed clustering algorithm, M-FPF, is compared against three other established web document clustering algorithms: Suffix Tree Clustering (STC), Lingo, and k-means. The major goal of the experiment is to determine the quality of the results given by the system from the common user's point of view, so we compare the results of the different algorithms against the ground truth of a user survey. We measure cluster quality in terms of precision, recall and F-measure. Results from testing on different datasets show considerable clustering quality.
Keywords: Clustering, document clustering, cluster labeling, information retrieval, FPF algorithm.
1. INTRODUCTION
With the increase in information on the World Wide Web, it has become difficult to find desired information using search engines. The low precision of web search engines, coupled with the long ranked-list presentation, makes it hard for users to find the information they are looking for, and doing so takes a lot of time. Typical queries retrieve hundreds of documents, most of which have no relation to what the user is looking for. One reason for this is that users often fail to formulate a suitable or sufficiently specific query; efforts have been made to apply some form of natural language processing to search queries to try to understand the underlying concept the user is trying to express. One solution to this problem is to enable more efficient navigation of search results by clustering similar documents together [1][2]. By clustering the web search results generated by a conventional search engine, the results can be organized in a manner that reduces user stress and makes searching more efficient, while leveraging the existing search capability and indexes of existing search engines [9].
2. OVERVIEW OF M-FPF AND IMPROVING THE FPF ALGORITHM
In this paper we improve the Furthest Point First (FPF) algorithm from both the computational-cost point of view and in terms of output clustering quality. Since the FPF algorithm as proposed by Gonzalez [12] is theoretically optimal (unless P = NP), only heuristics can be used to obtain better results and, in the worst case, it is not possible to go beyond the theoretical bounds. We profiled FPF, analyzed the most computationally expensive parts of the algorithm, and found that most of the distance computations are devoted to finding the next furthest point. FPF clustering quality can be improved by modifying part of the clustering schema: we describe an approach that uses random sampling to improve the clustering output quality, and we call this algorithm M-FPF [10][11]. Another crucial shortcoming of FPF is that it selects a set of centers that is not representative of the clusters. This must be imputed to the fact that, when FPF creates a new center, it selects the furthest point from the previously selected centers, so the new center is likely to be close to a boundary of the subspace containing the data set. To overcome this, we modify M-FPF to use medoids instead of centers.
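To make the two modifications concrete, the following Python sketch combines FPF with the random-sampling step and the medoid substitution. It is a minimal illustration under stated assumptions: the sample size sqrt(n*k), the generic dist function, and the assumption k <= len(points) are illustrative choices, not the exact constants of our implementation.

    import math
    import random

    def fpf(points, k, dist):
        """Gonzalez's Furthest Point First: greedily pick k centers so that
        the maximum point-to-center distance is (approximately) minimized.
        Assumes k <= len(points)."""
        centers = [random.choice(points)]
        # nearest[i] = distance from points[i] to its closest center so far
        nearest = [dist(p, centers[0]) for p in points]
        while len(centers) < k:
            # the next center is the point furthest from all current centers
            i = max(range(len(points)), key=nearest.__getitem__)
            centers.append(points[i])
            nearest = [min(d, dist(p, points[i])) for d, p in zip(nearest, points)]
        return centers

    def m_fpf(points, k, dist):
        """M-FPF sketch: run FPF on a random sample, assign the remaining
        points to their nearest center, then replace each center with the
        medoid of its cluster (the member minimizing total intra-cluster
        distance), which is less likely to sit on a cluster boundary."""
        sample_size = min(len(points), max(k, int(math.sqrt(len(points) * k))))
        sample = random.sample(points, sample_size)
        centers = fpf(sample, k, dist)
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            clusters[j].append(p)
        medoids = [min(c, key=lambda a: sum(dist(a, b) for b in c))
                   for c in clusters if c]
        return medoids, [c for c in clusters if c]

For web snippets, dist would typically be a cosine-derived distance over term-weight vectors such as those discussed in Section 5.4.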
3. ARCHITECTURE OF THE PROPOSED WEB SEARCH RESULT CLUSTERING SYSTEM
Clustering and labeling are both essential operations for a web snippet clustering system [6][7]. The proposed Web Search Result Clustering system comprises:
i. Data collection from search engine results
ii. Web snippet cleaning and filtering
iii. First-level clustering of web snippets
iv. Candidate word extraction from web snippets
v. Second-level clustering of weighted terms
vi. Cluster labeling based on candidate words
Figure 1: Architecture of the proposed web search result clustering system
(i) Data collection from search engine results and cleaning: The user of the Web Search Result Clustering
system (WSRC) issues a query string that is redirected by WSRC to the selected search engines (the user can select search engines such as Google and/or Yahoo, etc.). A search engine returns a list of snippets describing web pages that are deemed relevant to the query. An important system design issue is deciding the type and number of snippet sources to be used as auxiliary search engines. It is well known that the probability of a snippet being relevant to the user's information need decays quickly with the rank of the snippet in the list returned by the search engine. The need to avoid low-quality snippets therefore suggests using many sources, each supplying a small number of high-quality snippets [8]. On the other hand, increasing the number of snippet sources raises the practical issues of handling several concurrent threads, detecting and handling more duplicates, and a more complex handling of the composite ranking obtained by merging several snippet lists. In the experiment 200 snippets are collected, a number that keeps the total waiting time reasonable. The initial composite ranking of the merged snippet list is produced by a very simple method, i.e. by alternately picking snippets from each source list, as sketched below.
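A minimal sketch of this round-robin merge follows; the assumption that each snippet is a dict carrying at least a url field (used for duplicate detection) is a hypothetical representation for illustration.

    from itertools import zip_longest

    def merge_round_robin(snippet_lists):
        """Build the initial composite ranking by alternately taking the next
        snippet from each source list, skipping duplicates across engines."""
        merged, seen = [], set()
        for round_ in zip_longest(*snippet_lists):   # one pick per source
            for snippet in round_:
                if snippet is not None and snippet["url"] not in seen:
                    seen.add(snippet["url"])
                    merged.append(snippet)
        return merged

    # e.g. merge_round_robin([google_snippets, yahoo_snippets])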
(ii) Web snippet cleaning and filtering: Snippets that are too short, have little informative content, or are in non-English alphabets are filtered out. The input is then filtered by removing non-alphabetic symbols, HTML tags, stop words, and the query terms. The query terms are removed because they are likely to be present in every snippet, and are thus useless for the purpose of discriminating different contexts; on the other hand, they are very significant for the user, and hence are considered in the label generation phase described below. The prevalent language of each snippet is then identified, which allows one to choose the appropriate stop-word list and stemming algorithm. Currently, a ccTLD-based language identification heuristic and Porter's stemming algorithm are used.
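The cleaning step can be sketched as follows; the tiny stop-word list and the minimum-length threshold are illustrative assumptions, and language identification and stemming are left out for brevity.

    import re

    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is", "for"}

    def clean_snippet(text, query_terms, min_tokens=3):
        """Strip HTML tags and non-alphabetic symbols, drop stop words and
        query terms, and discard snippets that end up too short."""
        text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
        text = re.sub(r"[^A-Za-z\s]", " ", text)    # keep alphabetic text only
        tokens = [t for t in text.lower().split()
                  if t not in STOP_WORDS and t not in query_terms]
        return tokens if len(tokens) >= min_tokens else None

    clean_snippet("<b>Data mining</b> tools, surveys and trends",
                  query_terms={"data", "mining"})
    # -> ['tools', 'surveys', 'trends']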
(iii) First-level clustering of web snippets: A flat clustering representing the first level of the cluster hierarchy is built using the FPF algorithm. An important issue is deciding the number of clusters to create. Currently this number is fixed by default at 30, but it is clear that the number of clusters should depend on the query and on the number of snippets found. A general rule is difficult to find, also because the optimal number of clusters to be displayed is a function of the goals and preferences of the user. Moreover, while techniques for automatically determining such an optimal number do exist [4], their computational cost is incompatible with the real-time nature of the web application. Therefore, besides providing a default value, the user is allowed to increase or decrease the number of clusters to his/her liking. Clusters that contain only one snippet are probably outliers of some sort, and are merged under a single cluster labeled "Other topics".
(iv) Candidate word extraction from web snippets: For each cluster we need to determine a set of candidate words for appearing in its label; these will hereafter be referred to as candidates. For each of the selected terms we compute a weight, using a proposed term weighting scheme for document clustering based on the traditional TF-IDF. The motivation is the idea that the terms of the documents in the same cluster have similar importance, compared to the terms of the documents in a different cluster [3]. The emphasis is on the terms that are important within a cluster, treating the other terms as irrelevant and redundant; accordingly, more weight is given to the terms that are common within a cluster but uncommon in other clusters. The terms in each cluster with the highest scores are chosen as candidates, so the weight of a term in a cluster depends on the contents of the other clusters (see the sketch below). At the end of this procedure, if two clusters have the same label, they are merged, since this is an indication that the target number of clusters k may have been too high for this particular query.
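A sketch of this candidate selection follows. The concrete score tf(t, c) * log(k / cf(t)), where cf(t) is the number of clusters containing t, is an illustrative cluster-based variant of TF-IDF, not necessarily the exact formula of our scheme.

    import math
    from collections import Counter

    def candidate_words(clusters, top_m=5):
        """clusters: list of clusters, each a list of tokenized snippets.
        A term scores high when frequent inside its own cluster but present
        in few other clusters; terms occurring in every cluster score zero."""
        k = len(clusters)
        tf = [Counter(t for snippet in c for t in snippet) for c in clusters]
        cf = Counter(t for counts in tf for t in counts)  # cluster frequency
        return [sorted(counts, key=lambda t: counts[t] * math.log(k / cf[t]),
                       reverse=True)[:top_m]
                for counts in tf]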
(v) Second-level clustering of weighted terms: Although the clustering algorithm could in principle be iterated recursively in each cluster up to an arbitrary number of levels, we limit our algorithm to two levels, since this is likely to be the maximum depth a user is willing to explore when searching for an answer to an information need. Second-level clustering is applied to the first-level clusters whose size exceeds a predetermined threshold (at the moment fixed at 10 snippets, excluding duplicates). For second-level clustering we adopt a different approach, since metric-based clustering applied at the second level tends to detect a single large cluster. The second-level part of the hierarchy is instead generated from the candidate words found for each cluster during first-level candidate word selection, as in the sketch below.
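A sketch of the second-level step under the same assumed snippet representation (dicts with url and tokens fields); grouping snippets by the first candidate word they share is a simplified stand-in for the candidate-based clustering described above.

    MIN_CLUSTER_SIZE = 10   # threshold from the text, duplicates excluded

    def second_level(clusters, candidates):
        """Sub-cluster each first-level cluster that holds more than
        MIN_CLUSTER_SIZE distinct snippets, using its candidate words."""
        result = {}
        for i, cluster in enumerate(clusters):
            distinct = list({s["url"]: s for s in cluster}.values())
            if len(distinct) > MIN_CLUSTER_SIZE:
                groups = {}
                for s in distinct:
                    key = next((t for t in candidates[i] if t in s["tokens"]),
                               "other")
                    groups.setdefault(key, []).append(s)
                result[i] = groups
        return result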
(vi) Cluster labeling based on candidate words: The semantics of cluster descriptions is often very poor, and frequent phrases with no meaning creep into cluster labels [5]. In our cluster labels, candidate keywords are included, which are the frequent terms obtained from the modified TF-IDF term weighting scheme. We use the candidate keywords only as a basis for generating well-formed phrases that are shown to the user as the real cluster labels. To label a cluster, among all of its snippets we select the shortest substring of a snippet title among those having the highest score. Once a short phrase is selected, a simple natural language processing technique is applied to adjust possible grammatical imperfections due to the fact that the selected phrase came from a title, prefixing and suffixing the resulting label.
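The title-based label selection can be sketched as below; scoring whole titles rather than all of their substrings, and skipping the grammatical clean-up, are simplifications made for illustration.

    def choose_label(snippet_titles, candidate_scores):
        """Score each title by the summed weights of the candidate words it
        contains; return the shortest title among the highest-scoring ones."""
        def score(title):
            return sum(candidate_scores.get(w, 0.0) for w in title.lower().split())
        best = max(score(t) for t in snippet_titles)
        return min((t for t in snippet_titles if score(t) == best), key=len)

    choose_label(["Data Mining Tools", "Free Data Mining Tools and Tutorials"],
                 {"data": 2.0, "mining": 1.8, "tools": 0.7})
    # -> 'Data Mining Tools'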
User Interface: The user interface is important for the success of a web-based service. We have adopted the open
source source software Carrot2, in which the data are shown in ranked-list format in the main frame, while the list of cluster labels is presented in the left frame as a navigation tree. The interface also allows the user to select the number of clusters by increasing or decreasing the default value.
4. PERFORMANCE EVALUATION OF WEB SEARCH RESULT CLUSTERING AND LABELING
Our experiments can be divided into two classes according to the way the results are evaluated [10]: user evaluation and expert evaluation.
The major goal of the first type of experiment was to determine the quality of the results given by the system from the common user's point of view. To achieve this goal, we performed a survey on a group of users, asking them to judge simple aspects of the algorithm's results. We also created a few simple metrics that give exact numerical values based on these user assessments. Experiments of the second kind, on the other hand, consisted of comparing the results of different algorithms (or of one algorithm for different parameters or input data) by a single person (an expert) in order to find differences between them. In this part of the evaluation we concentrated on finding and describing the properties of the algorithm's results rather than on attempting to calculate exact metrics. This part of the evaluation experiment was therefore more research- than user-oriented, trying to find the reasons behind the final results rather than judging them definitively.
4.1 Methods of Evaluating Clustering Results
It turns out that the evaluation of clustering is quite a difficult task. It is particularly hard to create objective, numerical measures of quality. Here we point out the most important reasons for this:
Subjectivity of evaluation. The primary criterion in clustering evaluation should be the usefulness of the given results. However, for such a definition of quality it is hard to find objective measures. Different end users may have different preferences (e.g. some may be interested in small, precisely described clusters while others prefer large, more general groups), so there may exist many equivalent ways of partitioning the input document set. This is a common problem in Web mining, as this domain is very user-oriented.
Multi-criteria nature of the problem. Apart from the major goal of our evaluation, the quality of results is influenced by a number of factors, such as the conciseness of the created clusters or the intelligibility of their descriptions, and it is hard to aggregate them all into one factor.
Difficulty in comparing results given by different algorithms. The results of different clustering algorithms may have different properties (i.e. be hierarchical or flat, overlapping or not), so it is difficult to create quality measures that can be applied equally well to different algorithms and therefore fairly compare their results.
4.2 Clustering Results Quality Measures
The survey was created taking into account key requirements for web search results clustering, such as the need for browsable summaries of the clusters. Users taking part in the evaluation were asked to answer the following questions:
Usefulness of created clusters. For each of the created groups (except the "Other Topics" group), the users were asked to assess its usefulness, defined as the user's ability to deduce the topics of the documents contained in it on the basis of its description. The most important factor influencing this criterion is clearly the algorithm's ability to create meaningful, accurate cluster labels.
We should add here that the useful clusters are not only those corresponding to topics in which the user is interested. Provided their descriptions are concise and accurate, clusters "unexpected" in the context of the query, containing documents irrelevant to the user's interests, are also helpful, as they enable the user to quickly separate these documents from the interesting ones. This is exactly the way in which clustering improves search results.
Precision of assignment of documents to their respective clusters. For each assignment of a document to a group, the users were asked to judge whether the assignment is correct (i.e. whether the document's snippet fits the group's description). Here we decided to use a scale finer than a binary one, as documents may fit their clusters to some degree (e.g. be more or less specific than the respective clusters, but still somehow related to their descriptions). Precision of assignment is assessed only if the cluster to which the document has been assigned has been judged as useful: if the user cannot understand a cluster's title, he will probably discard it and will not browse the documents contained in it. Moreover, the "Other Topics" group and its documents are also excluded from this process. An important remark: in the case of overlapping results, one document may be contained in many clusters, so its assignment to each of the clusters containing it should be judged independently. Scale: well matching | moderately matching | not matching
After collecting the forms from the users we are able to calculate the following statistics:
- total number of clusters in the results
- total number of useful clusters
- total number of snippets in the results
- total number of snippet assignments in the results
- total number of snippets well matching their clusters
- total number of snippets moderately matching their clusters
- total number of snippets not matching their clusters
- total number of snippets in the "Other Topics" group
On the basis of these statistics, the following quality measures have been defined:
Cluster label quality. This is the proportion of the number of useful clusters to the number of all clusters (without the "Other Topics" group):

Cluster label quality = (number of useful clusters) / (number of all clusters, excluding "Other Topics")

Snippet assignment distribution. This measure in fact consists of three, describing the proportions of the numbers of snippets well / moderately / not matching their clusters to the total number of snippet assignments (excluding assignments to clusters judged useless and to the "Other Topics" group):

Fraction of well matching snippets = (number of snippets well matching their clusters) / (total number of judged snippet assignments)

Fraction of moderately matching snippets = (number of snippets moderately matching their clusters) / (total number of judged snippet assignments)

Fraction of not matching snippets = (number of snippets not matching their clusters) / (total number of judged snippet assignments)

Assignment coverage. Assignment coverage is the fraction of snippets assigned to clusters other than the "Other Topics" group (including clusters judged useless) in relation to the total number of input snippets:

Assignment coverage = (number of snippets assigned outside "Other Topics") / (total number of input snippets)
4.3 New Technique Developed: Cluster Label Quality Including the Size of Assessed Clusters
We propose to calculate the cluster label quality factor on the basis not of the number but of the size of the useless clusters (i.e. the number of documents contained in them), as the more documents are assigned to useless groups, the more information is lost for the user. This factor may be calculated from the numbers of snippet assignments in the following manner:

Cluster label quality = (number of snippet assignments to useful clusters) / (total number of snippet assignments, excluding "Other Topics")
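All four measures can be computed directly from the collected counts. In this sketch the dictionary keys mirror the statistics listed above; all_clusters and all_assignments are assumed to already exclude the "Other Topics" group, and the sample numbers are invented purely for illustration.

    def survey_metrics(s):
        """Quality measures of section 4.2 plus the size-aware label
        quality of section 4.3, computed from the survey statistics."""
        label_quality = s["useful_clusters"] / s["all_clusters"]
        judged = s["well"] + s["moderate"] + s["not"]
        distribution = {m: s[m] / judged for m in ("well", "moderate", "not")}
        coverage = (s["snippets"] - s["other_topics"]) / s["snippets"]
        # size-aware variant: weight useless clusters by the number of
        # assignments they absorb rather than counting each cluster once
        size_aware_quality = s["assignments_to_useful"] / s["all_assignments"]
        return label_quality, distribution, coverage, size_aware_quality

    survey_metrics({"useful_clusters": 18, "all_clusters": 24,
                    "well": 110, "moderate": 40, "not": 20,
                    "snippets": 200, "other_topics": 25,
                    "assignments_to_useful": 170, "all_assignments": 215})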
Calculating cluster label quality and snippet assignment distribution separately for the top-level groups. Apart from calculating these coefficients for all clusters, when evaluating clustering results one may calculate them only for the top-level clusters (i.e. ones that have no parent groups). This should be done because of the fact, mentioned earlier, that the top groups are the ones most important to the user.
4.4 The Experiment
Our user evaluation experiment was conducted with 30 users (both postgraduate students and research scholars), who were asked to fill in survey forms. Such a small number of participants may of course seem insufficient for the results to be recognized as statistically significant; unfortunately, given our lack of resources and time, it was hard to find a larger number of people to help carry out the evaluation. We should also mention that all persons engaged in this process are experienced Internet users professionally connected with computer science, and most of them are also experts in the domain of clustering algorithms. This fact is also likely to have influenced the obtained results.
For the purpose of evaluation we used five queries. As we also wanted to study the influence of the breadth of the topic scope of the input document set on result quality, two of the queries used in the experiment were general, while the other three were more specific.
Figure 2: Evaluation results (snippet distribution, cluster label quality and assignment coverage) across queries.
The snippets used as input data for the experiment come from the Google search engine. The results handed to the persons involved in the evaluation were obtained using the Carrot2 module with its default parameter values.
5. EXPERT EVALUATION OF THE FPF ALGORITHM: INFLUENCE OF PARAMETERS AND INPUT DATA ON THE RESULTS
The major goal of the experiments performed is to show the influence of changes in the algorithm's parameters and input data on its results. Owing to the already mentioned variety of parameters that may be used in our implementation, it was impossible to repeat a user evaluation similar to the one presented previously for every aspect of the algorithm. We have therefore decided to present the results of these experiments with a simple verbal description, pointing out the most important differences. During the expert evaluation we used the default values of the algorithm's parameters, and the top 100 snippets returned by different search engines (Google, Yahoo) as the input data.
5.1 Influence of Query Breadth
The purpose of this experiment was to compare the clusters created by the algorithm for two different queries, one general and one more specific, covering a small subset of the documents corresponding to the first one. We chose two queries: "data" and "data mining".
Figure 3: Comparison of SRC results for the queries "data" (general, on the left) and "data mining" (much more specific, on the right).
The result of this comparison shows that the algorithm performs better with general queries than with more specific ones. The main reason for this is of course that it is much more difficult to find distinct topics in a set of documents relevant to a specific query, and therefore to create good, heterogeneous clusters. On the other hand, group descriptions are much better when clustering the results of specific queries.
5.2 Influence of the Number of Input Snippets
The natural effect of an increase in input data size should be an increase in the number of clusters in the results and in their size. It should also be expected to improve the quality of cluster labels, as a small document set may contain proportionally many over-specialized phrases that get chosen as cluster labels; with a larger input collection we may expect these phrases to be replaced by ones a little more general, but much more informative to the user. Another important aspect of increasing the input data size, not considered here, is its influence on the algorithm's processing time.
Figure 4: Comparison of SRC results for the query "unix" using 100 (on the left) and 300 (on the right) input snippets.
The most important differences caused by increasing the number of input snippets are the increase in the number of created top-level clusters and the growth of the hierarchy, which has become much more complex. It is clearly visible that while with 300 snippets the top-level groups have more subclusters, with 100 snippets the structure is flatter, with hierarchy present only in a few of the biggest clusters. However, the increase in the number of groups was much smaller than the increase in the number of documents, due to repeating topics in the input set.
It turns out that our assumption about the influence of input data size on cluster label quality is only partially fulfilled. Although some of the groups in the right tree have shorter descriptions, it is hard to say whether their quality has really improved. On the other hand, it is clearly visible that in the hierarchy constructed by clustering a larger number of documents, the first clusters (i.e. the biggest ones) correspond better to the most important subtopics (due to the fact that the algorithm operates on a more representative sample).
5.3 Influence of the Use of Phrases
The purpose of this experiment is to show the benefits of using phrases in the proposed SRC, and in the vector space model in general. According to our hypothesis formulated earlier, the use of phrases shouldn't have a strong influence on the structure of the hierarchy itself, but rather on the quality of the cluster labels.
Figure 5: Comparison of SRC results for the query "search result clustering" with the maximum length of extracted phrases equal to 1 (on the left) and 5 (on the right).
Closer examination shows that, despite the use of phrases, the structure has changed only slightly. This is due to the fact that if a group of documents shares a phrase, it also shares all of the words included in it, so the introduction of phrases should have no great impact on inter-document similarities. On the other hand, the cluster labels have become much more coherent: when we use only single terms, their sequence in the descriptions often differs from that in the documents, and this may confuse the user.
5.4 Influence of Term Weighting Methods
In this experiment we compared four methods of term weighting:
- standard tf term weighting
- a version assigning the square root of the tf coefficient as the weight of a term
- standard tf-idf term weighting
- cluster-based term (CBT) weighting
One may easily notice that it is hard to find any significant differences between the hierarchies created using these methods (with the sole exception of standard tf-idf weighting). In particular, the hierarchies created using tf weighting and the sqrt(tf) formula are very similar. The most important difference in the case of the standard tf-idf term weighting method is frequently its inability to form larger clusters. The four variants are summarized in code after the discussion below.
Figure 6: Comparison of SRC results for the query "new jaguar" with different term weighting methods (from left to right: tf, sqrt(tf), tf-idf and CBT).
In this example, this property manifests itself in the case of the group "new jaguar" (the largest of all created for this query): when using this method, the cluster is only half the size it has with the other methods. This supports our earlier hypothesis that the idf factor used in this method decreases too quickly with the increase in the number of documents in which a term appears. It is hard to determine unambiguously which of the remaining methods produces the best results. In our opinion, making such judgments would require conducting an experiment similar to the previous one, with at least several different queries and exact quality measures; in fact there would have to be four such experiments, one for each of the presented methods.
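For reference, the four weighting variants can be written as a single helper; the cluster-based (CBT) formulation tf * log(k / cf) is an assumption consistent with the candidate-word weighting sketched earlier, not a definitive statement of the scheme.

    import math

    def term_weight(tf, df=1, n_docs=1, cf=1, k=1, scheme="tf"):
        """tf: frequency of the term in the document; df: number of
        documents containing it; cf: number of clusters containing it;
        k: total number of clusters."""
        if scheme == "tf":
            return tf
        if scheme == "sqrt_tf":
            return math.sqrt(tf)
        if scheme == "tf_idf":
            return tf * math.log(n_docs / df)   # decays fast as df grows
        if scheme == "cbt":
            return tf * math.log(k / cf)        # rare across clusters scores high
        raise ValueError(f"unknown scheme: {scheme}")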
6. CONCLUSION
Web mining, and in particular search results clustering, are relatively new and quickly developing domains of computer science. These domains are promising, and certainly a lot of work remains to be done before clustering algorithms are widely used. Current methods usually have a great advantage over older algorithms derived from the areas of statistics and text databases, as they pay more attention to the meaning of the texts contained in the input data, and not only to simple co-occurrences of particular words. Good results are, to a large extent, an effect of employing pre- and post-processing methods that make it possible to fit the algorithm to our application domain. Yet these pre- and post-processing phases, which in our view are very important, are considerably less studied. Moreover, we have found these phases, where not numerical but textual data is processed, much
more interesting. Finally, we would like to emphasize that this work is part of a much greater whole, and the module may be developed further in the future in order to study other aspects of the algorithm.
References
[1] Carrot2 web search results clustering interface, http://www.cs.put.poznan.pl/dweiss/carrot. CHIP, 118(3): 128-132, March 2003.
[2] B. E. Dom, "An information-theoretic external cluster validity measure," IBM Research Report RJ 10219, 2001.
[3] S. Osinski and D. Weiss, "A concept-driven algorithm for clustering search results," IEEE Intelligent Systems, vol. 20, no. 3, pp. 48-54, 2005.
[4] M. Kaki, "Findex: Search result categories help users when document ranking fails," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2005.
[5] G. Zhang, Y. Liu, S. Tan, and X. Cheng, "A novel method for hierarchical clustering of search results," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Workshops, 2007.
[6] Y. Tzitzikas, N. Armenatzoglou, and P. Papadakos, "FleXplorer: A framework for providing faceted and dynamic taxonomy-based information exploration," in Database and Expert Systems Application (FIND '08 at DEXA '08), pp. 392-396, Torino, Italy, 2008.
[7] J. Wang, Y. Mo, B. Huang, J. Wen, and L. He, "Web search results clustering based on a novel suffix tree structure," in Proceedings of the 5th International Conference on Autonomic and Trusted Computing (ATC '08), vol. 5060, pp. 540-554, Oslo, Norway, June 2008.
[8] H.-T. Zheng, B.-Y. Kang, and H.-G. Kim, "Exploiting noun phrases and semantic relationships for text document clustering," Information Sciences, 179(13): 2249-2262, 2009.
[9] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed., Morgan Kaufmann, San Francisco, CA, USA, 2011.
[10] M. Wróblewski, "A hierarchical WWW pages clustering algorithm based on the vector space model," Master's thesis, Poznań University of Technology, Poland, July 2003.
[11] D. Lewandowski, "Search engine user behavior: How can users be guided to quality content?" Information Services & Use, vol. 28, no. 3/4, pp. 261-268, 2008.
[12] C. C. Chen and M. C. Chen, "TSCAN: A content anatomy approach to temporal topic summarization," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, pp. 170-183, January 2012.