

A Unified Relevance Feedback Framework for Web Image Retrieval

En Cheng, Feng Jing, and Lei Zhang

Abstract—Although relevance feedback (RF) has been extensively studied in the content-based image retrieval community, no commercial Web image search engine supports RF because of scalability, efficiency, and effectiveness issues. In this paper, we propose a unified relevance feedback framework for Web image retrieval. Our framework shows advantages over traditional RF mechanisms in the following three aspects. First, during the RF process, both textual and visual features are used in a sequential way; to seamlessly combine textual feature-based RF and visual feature-based RF, a query concept-dependent fusion strategy is automatically learned. Second, the textual feature-based RF mechanism employs an effective search result clustering (SRC) algorithm to obtain salient phrases, based on which we can construct an accurate and low-dimensional textual space for the resulting Web images; thus, we can integrate RF into Web image retrieval in a practical way. Last, a new user interface (UI) is proposed to support implicit RF. On the one hand, unlike a traditional RF UI, which forces users to make explicit judgments on the results, the new UI regards the users' click-through data as implicit relevance feedback in order to relieve the burden on the users. On the other hand, unlike a traditional RF UI, which abruptly substitutes subsequent results for previous ones, a recommendation scheme is used to help the users better understand the feedback process and to mitigate the possible waiting caused by RF. Experimental results on a database consisting of nearly three million Web images show that the proposed framework is usable, scalable, and effective.

Index Terms—Implicit feedback, relevance feedback (RF), search result clustering, Web image retrieval.

Manuscript received May 22, 2007; revised February 10, 2009. First published April 07, 2009; current version published May 13, 2009. This work was performed at Microsoft Research Asia, Beijing, China. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Eli Saber.
E. Cheng is with Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106-7071 USA (e-mail: en.cheng@case.edu).
F. Jing is with Tencent Research Center, Beijing 100080, China (e-mail: scenery.jf@gmail.com).
L. Zhang is with Microsoft Research Asia, Beijing 100080, China (e-mail: leizhang@microsoft.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2009.2017128

I. INTRODUCTION

WITH the explosive growth of both the World Wide Web and the number of digital images, there is an increasingly urgent need for effective Web image retrieval systems. Most of the popular commercial search engines, such as Google [1], Yahoo! [2], and AltaVista [3], support image retrieval by keywords. There are also commercial search engines dedicated to image retrieval, e.g., Picsearch [4]. A common limitation of most existing Web image retrieval systems is that their search process is passive, i.e., it disregards the informative interactions between users and retrieval systems. An active system should bring the user into the loop so that personalized results can be provided for the specific user. To be active, the system can take advantage of relevance feedback techniques.

Relevance feedback, originally developed for information retrieval [5], is an online learning technique that aims to improve the effectiveness of an information retrieval system. The main idea of relevance feedback is to let the user guide the system. During the retrieval process, the user interacts with the system and rates the relevance of the retrieved documents according to his/her subjective judgment. With this additional information, the system dynamically learns the user's intention and gradually presents better results. Since its introduction to image retrieval in the mid-1990s, relevance feedback has attracted tremendous attention in the content-based image retrieval (CBIR) community and has been shown to provide dramatic performance improvements [6]. However, no commercial Web image search engine supports relevance feedback, because of usability, scalability, and efficiency issues.

Note that the textual features on which most commercial search engines depend are extracted from the file name, ALT text, URL, and surrounding text of the images. The usefulness of textual features is demonstrated by the popularity of the currently available Web image search engines. However, directly using this textual information to construct the textual space leads to time-consuming computation, and the performance suffers from noisy terms. Since the user interacts with the search engine in real time, the relevance feedback mechanism should be sufficiently fast and, if possible, avoid heavy computations over millions of retrieved images. To integrate relevance feedback into Web image retrieval in a practical way, an efficient and effective mechanism is required for constructing an accurate and low-dimensional textual space with respect to the resulting Web images.

Although all existing commercial Web image retrieval systems depend solely on textual information, Web images are characterized by both textual and visual features. With effective utilization of textual features, image retrieval greatly benefits from leveraging mature techniques from text retrieval. However, just as the proverb "a picture is worth a thousand words" suggests, the textual representation of an image is always insufficient compared to the visual content of the image itself. Therefore, visual features are required for a finer granularity of image description. Considering the characteristics of both textual and visual features, it is reasonable to conclude that RF in the textual space can guarantee relevance, while RF in the visual space can meet the need for finer granularity. Thus, it is meaningful to introduce a unified relevance feedback framework for Web image retrieval that seamlessly combines textual feature-based RF and visual feature-based RF in a sequential way.



To strengthen our proposed framework, we employ implicit feedback to overcome the limitation of explicit feedback techniques, which place an increased cognitive burden on the users. Unlike explicit feedback, implicit feedback can be collected at much lower cost, in much larger quantities, and without burdening the users. As one of the most effective kinds of implicit feedback information, click-through data has been used as either absolute relevance judgments [7] or relative relevance judgments [8] in text retrieval. Fortunately, image retrieval has the following two characteristics when compared with text retrieval. First, the thumbnail of an image reflects more information than the title and snippet of a Web page, so click-through information in image retrieval tends to be less noisy than in text retrieval. Second, unlike a textual document, the content of an image can be taken in at a glance. As a result, the user will possibly click more results in image retrieval than in text retrieval. Both characteristics imply that click-through data can be helpful for image retrieval.

In this paper, we propose a unified relevance feedback framework for Web image retrieval. There are three main contributions of the paper.
• A dynamic multimodal fusion scheme is proposed to seamlessly combine textual feature-based RF (TBRF) and visual feature-based RF (VBRF). More specifically, a TBRF algorithm is first used to quickly select a possibly relevant image set. Then, a VBRF algorithm is combined with the TBRF algorithm to further re-rank the resulting Web images. The fusion of VBRF and TBRF is query concept-dependent and automatically learned.
• The textual feature-based RF mechanism employs an effective search result clustering (SRC) algorithm to obtain salient phrases, based on which we can construct an accurate and low-dimensional textual space for the resulting Web images. As a result, RF can be integrated into Web image retrieval in a practical way.
• A new UI is proposed to support implicit RF. On the one hand, unlike a traditional RF UI, which forces the users to make explicit judgments on the results, the new UI regards the user's click-through data as implicit relevance feedback in order to relieve the burden on the user. On the other hand, unlike a traditional RF UI, which abruptly substitutes subsequent results for previous ones, a recommendation scheme is used to help the user better understand the feedback process and to mitigate the possible waiting caused by RF.

The remainder of this paper is organized as follows. In Section II, we describe the dynamic multimodal fusion mechanism. SRC-based textual space construction is illustrated in Section III. The proposed user interface is presented in Section IV. Experimental results are presented and analyzed in Section V. Finally, we conclude and discuss future work in Section VI.

II. DYNAMIC MULTIMODAL FUSION

A. Image Representation

The images collected from several photo forum sites, e.g., photosig [9], have rich metadata such as the title, category, photographer's comment, and other people's critiques. These images constitute the evaluation dataset for the proposed relevance feedback framework. For example, a photo on photosig¹ has the following metadata; to facilitate later reference, we denote this photo by P.
• Title: early morning.
• Category: landscape, nature, rural.
• Comment: I found this special light one early morning in Pyrenees along the Vicdessos river near our house.
• One of the critiques: wow I like this picture very much I guess the light has to do with everything the light is great on the snow and on the sky (strange looking sky by the way) greatly composed nice crafted border a beauty.

¹http://www.photosig.com/go/photos/view?id=733881

All of the aforementioned metadata is used as the textual source for the textual space construction. To build the textual space, there are two available approaches in our work. One straightforward approach directly uses the above metadata to obtain the textual feature. The other is based on the Search Result Clustering (SRC) algorithm; the detailed description of SRC-based textual space construction is given in Section III.

To represent the textual feature, the vector space model [10] with the TF-IDF weighting scheme is adopted. More specifically, the textual feature of an image I is an m-dimensional vector given by

T_I = (w_1, w_2, …, w_m)   (1.1)
w_i = tf_i · log(N / n_i)   (1.2)

where:
• T_I is the textual feature of image I;
• w_i is the weight of the ith term in I's textual space;
• m is the number of distinct terms over all images' textual spaces;
• tf_i is the frequency of the ith term in I's textual space;
• N is the total number of images;
• n_i is the number of images whose metadata contains the ith term.

To illustrate the straightforward approach, in which all metadata is utilized to construct the textual space, we use the photo P introduced at the beginning of this section as an example. Given the query "early morning," we have 151 resulting images, including photo P. From those resulting images, we collect all distinct terms in the metadata, which yields 358 distinct terms in total. P itself has 48 distinct terms: early, morning, landscape, nature, rural, I, found, this, special, light, one, in, Pyrenees, along, the, Vicdessos, river, near, our, house, wow, like, picture, very, much, guess, has, to, do, with, everything, is, great, on, snow, and, sky, strange, looking, by, way, greatly, composed, nice, crafted, border, a, and beauty.

Given N = 151, m = 358, and the 48 distinct terms of P, we can calculate tf_i and n_i for each distinct term with respect to P. We then obtain each w_i according to (1.2) and, finally, the textual feature of P according to (1.1).
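To make the construction above concrete, the following is a minimal sketch of the TF-IDF feature of (1.1) and (1.2); the function name and the toy metadata strings are illustrative, not part of the original system.

```python
import math
from collections import Counter

def tfidf_features(metadata_texts):
    """Build the m-dimensional TF-IDF vector of (1.1)-(1.2) for each image.

    metadata_texts: one string per image, concatenating its title, category,
    comment, and critiques (the textual sources described above).
    """
    docs = [text.lower().split() for text in metadata_texts]
    vocab = sorted({term for doc in docs for term in doc})     # m distinct terms
    n_images = len(docs)                                       # N: total images
    df = Counter(term for doc in docs for term in set(doc))    # n_i per term
    features = []
    for doc in docs:
        tf = Counter(doc)                                      # tf_i per term
        # w_i = tf_i * log(N / n_i), one weight per vocabulary term (1.2)
        features.append([tf[t] * math.log(n_images / df[t]) for t in vocab])
    return vocab, features

# Toy usage with two fabricated metadata strings:
vocab, feats = tfidf_features(["early morning landscape nature rural",
                               "early morning light on the snow"])
```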


To represent an image visually, a 64-dimensional feature [11] was extracted. It is a combination of three features: six-dimensional color moments [12], a 44-dimensional banded auto-correlogram [13], and 14-dimensional color texture moments [14]. For the color moments, the first two moments of each channel of the CIE-LUV color space were extracted. For the correlogram, the HSV color space with inhomogeneous quantization into 44 colors is adopted [11]. For the color texture moments, we operate on the original image with templates derived from the local Fourier transform and obtain characteristic maps, each of which characterizes some information about a certain aspect of the original image. Similarly to the color moments, we calculate the first and second moments of the characteristic maps, which represent the color texture information of the original image. The resulting visual feature of an image is a 64-dimensional vector V_I. Each feature dimension is normalized to [0, 1] using Gaussian normalization for the convenience of further computation.
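The paper names Gaussian normalization without spelling out the formula; the sketch below assumes the common 3σ variant, which maps most values of each dimension into [0, 1] using its mean and standard deviation.

```python
import numpy as np

def gaussian_normalize(features):
    """Normalize each of the 64 feature dimensions to [0, 1].

    Assumes 3-sigma Gaussian normalization; the paper only names the
    technique. features: (num_images, 64) array of raw visual features.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-12           # guard against zero variance
    z = (features - mu) / (3.0 * sigma)            # most values fall in [-1, 1]
    return np.clip((z + 1.0) / 2.0, 0.0, 1.0)      # shift and scale to [0, 1]
```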
B. RF in Textual Space

To perform RF in the textual space, Rocchio's algorithm [5] is used. The algorithm was developed in the mid-1960s and has proven to be one of the most effective RF algorithms in information retrieval. The key idea of Rocchio's algorithm is to construct a so-called optimal query so that the difference between the average score of a relevant document and the average score of a nonrelevant document is maximized. Cosine similarity is used to calculate the similarity between an image and the optimal query. Since only clicked images are available to our proposed framework, we assume the clicked images to be relevant and define the feature of the optimal query as follows:

q_opt = q_0 + β · (1/N_R) Σ_{T_j ∈ Rel} T_j − γ · (1/N_N) Σ_{T_k ∈ Non-Rel} T_k   (1.3)

where:
• q_0 is the vector of the initial query;
• T_j is the vector of a relevant image;
• T_k is the vector of a nonrelevant image;
• Rel is the relevant image set;
• Non-Rel is the nonrelevant image set;
• N_R is the number of relevant images;
• N_N is the number of nonrelevant images;
• β is the parameter controlling the relative contribution of the relevant images and the initial query;
• γ is the parameter controlling the relative contribution of the nonrelevant images and the initial query.

In our case, only relevant images are available to the proposed mechanism, so we set β to 1 and γ to 0 in our experiments. Although Rocchio's algorithm is used currently, any vector-based RF algorithm could be used in the unified framework.
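A minimal sketch of (1.3) under the paper's setting β = 1 and γ = 0, where the optimal query reduces to the initial query plus the mean of the clicked (relevant) image vectors; the helper names are illustrative.

```python
import numpy as np

def rocchio_optimal_query(q0, relevant, beta=1.0, gamma=0.0, nonrelevant=None):
    """Optimal query of (1.3): q0 + beta * mean(Rel) - gamma * mean(Non-Rel).

    With only click-through data available, beta = 1 and gamma = 0, so the
    nonrelevant term vanishes. All inputs are TF-IDF vectors (numpy arrays).
    """
    q_opt = np.asarray(q0, dtype=float) + beta * np.mean(relevant, axis=0)
    if gamma > 0.0 and nonrelevant is not None and len(nonrelevant) > 0:
        q_opt -= gamma * np.mean(nonrelevant, axis=0)
    return q_opt

def cosine_similarity(x, y):
    """Similarity between an image's textual feature and the optimal query."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))
```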
C. RF in Visual Space

To perform RF in the visual space, Rui's algorithm [15] is used. Assuming the clicked images to be relevant, both an optimal query and feature weights are learned from them. More specifically, the feature vector of the optimal query is the mean of the features of all clicked images, and the weight of a feature dimension is proportional to the inverse of the standard deviation of that dimension's values over the clicked images [15]. Weighted Euclidean distance is used to calculate the distance between an image and the optimal query. Although Rui's algorithm is used currently, any RF algorithm that uses only relevant images could be used in the unified framework.
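A sketch of the visual-space RF just described: the optimal query is the mean of the clicked images' features, the per-dimension weights are inversely proportional to the standard deviation [15], and weighted Euclidean distance ranks candidates. Normalizing the weights to sum to one is an assumption, not a detail given in the paper.

```python
import numpy as np

def visual_rf(clicked_visual):
    """Learn the optimal visual query and feature weights from clicked images.

    clicked_visual: (n, 64) array of visual features of the clicked images.
    Returns (q_v, w): the mean feature vector and per-dimension weights
    proportional to the inverse standard deviation.
    """
    q_v = clicked_visual.mean(axis=0)
    w = 1.0 / (clicked_visual.std(axis=0) + 1e-12)
    return q_v, w / w.sum()                 # normalization is an assumption

def weighted_euclidean(x, q_v, w):
    """Weighted distance between an image's visual feature and the query."""
    return float(np.sqrt(np.sum(w * (x - q_v) ** 2)))
```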


D. Dynamic Multimodal Fusion

There has been some work on the fusion of relevance feedback in different feature spaces [16]–[18]. A straightforward and widely used strategy is linear combination [16], [17]. Nonlinear combination using a support vector machine (SVM) was proposed in [18]. However, since the super-kernel fusion algorithm [18] needs irrelevant images, it is inapplicable to systems that offer only relevant images.

Since textual features are more semantic-oriented and efficient than visual features, while visual features have finer descriptive granularity than textual features, we combine the RF in the two feature spaces in a sequential way. The flowchart of the RF in our unified framework is shown in Fig. 1. First, RF in the textual space is performed to rank the initial resulting images using the optimal query learned in (1.3). Then, RF in the visual space is performed to re-rank the top K images. The re-ranking process is based on a dynamic linear combination of the RF in the visual and textual spaces.

Fig. 1. Flowchart of the RF of the unified framework.

Note that restricting the re-ranking to the top K images has two advantages. First, the relevance of the top images can be guaranteed by the preceding RF in the textual space. Second, the efficiency of the RF process can be ensured, since RF in the visual space could be inefficient on a very large image set. The number K of top images, which affects both the efficiency and the effectiveness of the RF process, is determined experimentally. The combination weights, which reflect the relative contributions of the two spaces, are automatically learned and query concept-dependent. Assume there are n clicked images. The similarity metric used to re-rank a top image I using RF in both the visual and textual spaces is defined as follows:

Sim(I) = λ · Sim_V(I) + (1 − λ) · Sim_T(I)   (1.4)
λ = α · exp(−β · σ)   (1.5)
σ = (1/n) Σ_{i=1..n} d(v_i, q_v)   (1.6)
q_v = (1/n) Σ_{i=1..n} v_i   (1.7)
Sim_V(I) = 1 / (1 + d(V_I, q_v))   (1.8)

where:
• Sim(I) is the similarity metric over both the visual and textual spaces;
• Sim_V(I) is the similarity between I's visual feature and q_v;
• Sim_T(I) is the cosine similarity between I's textual feature and the optimal query in the textual space;
• λ is the dynamic linear combination parameter for the similarity metric;
• α and β are parameters that control the relative contribution of RF in the visual space;
• σ is the deviation of the clicked images in the visual space;
• v_i is the visual feature vector of the ith clicked image;
• q_v is the feature vector of the optimal query in the visual space;
• d(V_I, q_v) is the weighted Euclidean distance between I's visual feature V_I and q_v.

Note that λ in (1.4) tunes the visual feature's contribution to the overall similarity metric according to the query concept. According to (1.5), α controls the overall contribution of RF in the visual space, while β fine-tunes that contribution. If the query concept is well characterized by visual features, the clicked images should be visually consistent and σ will be small (near 0); by (1.5), λ will then be large, and the visual feature will be important. This is consistent with our intuition. Since σ is query concept-dependent, the resulting combination parameter λ is query concept-dependent as well. This property results in a query concept-dependent fusion strategy for relevance feedback in both the textual and visual spaces.

Fig. 2. Function curves of exp(−β·x) for different β.
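Putting the pieces together, the sketch below re-ranks the top-K images as in (1.4)–(1.8), reusing the helpers from the visual RF sketch above. The forms of σ (the mean weighted distance of the clicked images to q_v) and of the distance-to-similarity conversion 1/(1 + d) follow the reconstruction above and are assumptions.

```python
import numpy as np

def rerank_top_k(top_k_visual, top_k_textual_sim, clicked_visual,
                 alpha=0.25, beta=64.0):
    """Dynamic multimodal fusion of (1.4)-(1.8), as reconstructed above.

    top_k_visual: (K, 64) visual features of the top-K images from textual RF.
    top_k_textual_sim: (K,) cosine similarities from RF in the textual space.
    clicked_visual: (n, 64) visual features of the clicked images.
    Returns indices of the top-K images in re-ranked order.
    """
    q_v, w = visual_rf(clicked_visual)                  # see visual RF sketch
    # sigma: deviation of the clicked images around the optimal visual query
    sigma = np.mean([weighted_euclidean(v, q_v, w) for v in clicked_visual])
    lam = alpha * np.exp(-beta * sigma)                 # (1.5): small sigma -> large lambda
    d = np.array([weighted_euclidean(v, q_v, w) for v in top_k_visual])
    sim_v = 1.0 / (1.0 + d)                             # distance -> similarity (assumption)
    sim = lam * sim_v + (1.0 - lam) * np.asarray(top_k_textual_sim)  # (1.4)
    return np.argsort(-sim)
```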
III. SRC-BASED TEXTUAL SPACE CONSTRUCTION

To construct an accurate and low-dimensional textual space for the resulting Web images, we use the SRC algorithm proposed in [19], which re-formalizes the clustering problem as a salient phrase ranking problem. Given a query and the ranked list of search results, the algorithm first parses the whole list of titles and snippets, extracts all possible phrases (n-grams) from the contents, and calculates five properties for each phrase: phrase frequency/inverted document frequency (TFIDF), phrase length (LEN), intra-cluster similarity (ICS), cluster entropy (CE), and phrase independence (IND). The five properties are assumed to be correlated with the salience score of a phrase. In our case, the comments and critiques are regarded as snippets. In the following, the current phrase (an n-gram) is denoted by w, and the set of documents that contains w by D(w). The five properties are given by (1.9)–(1.15), following [19]; in brief, with f representing frequency calculation: TFIDF(w) combines the phrase frequency f(w) with the inverted document frequency log(N/|D(w)|); LEN(w) is the number of words in w; ICS(w) is the average cosine similarity between the documents in D(w) and their centroid; CE(w) is the entropy of the overlap between D(w) and the document sets of other phrases; and IND(w) is the average entropy of the words immediately adjacent to w, which measures the independence of the phrase from its context.

Given the above five properties, we use a single formula to combine them and calculate a single salience score for each phrase. In our case, each phrase can be represented by the vector x = (TFIDF, LEN, ICS, CE, IND). A regression model learned from previous training data is then applied to combine the five properties into a single salience score y. According to [19], among linear regression, logistic regression, and support vector regression, linear regression performs best. Therefore, in our experiments, we choose the linear regression model, which postulates that

y = b_0 + Σ_{i=1..5} b_i · x_i + ε   (1.16)

where:
• ε is a random variable with mean zero;
• the coefficients b_i are determined by the condition that the sum of the squared residuals is as small as possible.

The phrases are ranked according to the salience score y, and the top-ranked phrases are taken as salient phrases. The resulting salient phrases are used to construct the textual space, based on which the textual feature is computed using (1.1) and (1.2).
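A minimal sketch of the salience scoring of (1.16): each candidate phrase is a five-dimensional property vector, and a least-squares linear model maps it to a score. The fitting routine stands in for the regression training on previous data described above; no real training data is shown, and the cutoff of 20 phrases is illustrative.

```python
import numpy as np

def fit_salience_model(X, y):
    """Fit (1.16) by least squares: y = b_0 + sum_i b_i * x_i + eps.

    X: (num_phrases, 5) matrix of (TFIDF, LEN, ICS, CE, IND) per phrase.
    y: (num_phrases,) salience labels from previous training data.
    """
    A = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend intercept column
    b, *_ = np.linalg.lstsq(A, y, rcond=None)       # minimizes squared residuals
    return b

def salient_phrases(phrases, X, b, top=20):
    """Rank phrases by salience score; the top-ranked ones span the space."""
    scores = b[0] + X @ b[1:]
    order = np.argsort(-scores)
    return [phrases[i] for i in order[:top]]
```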


IV. FRIENDLY USER INTERFACE

To make the best use of the implicit feedback information, a new Web image search UI named MindTracer is proposed. MindTracer consists of two types of pages: a main page and a detail page. The main page is shown in Fig. 3 and the detail page in Fig. 4. The main page has three frames: a search frame, a recommendation frame, and a result frame. The search frame contains an edit box in which users type a query phrase. Only text-based queries are supported by MindTracer, since they are friendly and familiar to the typical Web surfer. After a user submits a query to MindTracer, the thumbnails of the result images are shown in the result frame in five rows and four columns. Initially, no images are shown in the recommendation frame. When the user clicks an image in the result frame out of interest, the recommendation function is activated and the dynamic multimodal RF is carried out. As a result, a finer ranking of the initial results is obtained, and the top 20 recommended images are shown in the recommendation frame. The images iteratively roll in the recommendation window with a scroll-bar that can be manually controlled by the user.

Fig. 3. Main page of MindTracer.

Accompanying the user's click-through, the corresponding original image is shown in a detail page. The detail page has two frames: an image frame and a snapshot frame. If the user clicks another image in the result frame or the recommendation frame, then besides the aforementioned system reactions, the previously recommended images are shown in the snapshot frame of the detail page, in case the user wants more images from the earlier recommendation list. If the user clicks an image in the snapshot frame, the corresponding original image is shown in the image frame. Once the user is satisfied with the recommended results, he/she can click the refine button to move all the recommended images from the recommendation frame to the result frame. With an asynchronous scheme for refreshing the detail page and the recommendation frame of the main page, no extra waiting time is required to support the recommendation scheme.

Fig. 4. Detail page of MindTracer.

The available functions of MindTracer include query-based search, result recommendation, and result refinement. The query-based search is similar to that of currently available search engines. The result recommendation and refinement functions are the contributions of MindTracer. The recommendation function is activated by the user's click-through, since MindTracer regards the user's click-through as implicit relevance feedback. Besides result recommendation, result refinement is another useful function: it displays the whole result set obtained from the multimodal RF procedure when the user is satisfied with the recommendation results and clicks the refine button. Since the user is satisfied and clicks the refine button on his/her own initiative, the relevance of the refined results can be guaranteed. The flowchart of MindTracer is shown in Fig. 5.

Fig. 5. Flowchart of the UI.

V. EXPERIMENTAL RESULTS

A. Evaluation Dataset

To construct the evaluation dataset, approximately three million images were crawled from several photo forum sites, e.g., photosig [9]. To automatically evaluate our proposed SRC-based RF mechanism, an image subset was selected and manually labeled as follows. First, ten representative queries were chosen.


TABLE I
QUERIES AND CORRESPONDING KEY TERMS. THE NUMBER WITHIN PARENTHESES IS THE NUMBER OF RESULT IMAGES.

Then, for each query, the key terms related to the top 20 images were identified. Finally, all resulting images of each query were manually annotated with the corresponding key terms. The key terms and the number of resulting images for each query are shown in Table I. In total, there are 160 key terms.

To simulate the interactions between a user and a Web image retrieval system, for each query q, each related key term t was selected in turn to represent the user's search intention. Images annotated with the term t were considered to be relevant to t. For each t, five iterations of user-and-system interaction were carried out. The system first ranked the initial resulting images using the optimal query learned in (1.3) and brought out the top K images for RF in both the textual and visual spaces. After re-ranking the top K images using (1.4), the system examined the top 20 images to collect the relevant images, which were regarded as click-through data. Relevant images labeled in previous iterations were placed directly in the top ranks and excluded from the examination process. Precision is used as the basic evaluation measure: when the top 20 images are examined and r of them are relevant, the precision within the top 20 images is defined to be r/20.
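The evaluation loop just described can be summarized as follows; rank_fn is a hypothetical stand-in for the full TVRF pipeline, and the handling of previously labeled images follows the protocol above.

```python
def precision_at_20(ranked_ids, relevant_ids, already_labeled):
    """Precision within the examined top 20: previously labeled relevant
    images sit at the top ranks and are excluded from examination."""
    examined = [i for i in ranked_ids if i not in already_labeled][:20]
    clicked = [i for i in examined if i in relevant_ids]
    return len(clicked) / 20.0, clicked

def simulate_interactions(rank_fn, relevant_ids, iterations=5):
    """Five user-and-system iterations; the relevant images found in the
    examined top 20 act as the click-through data for the next round."""
    labeled, feedback, precision = set(), [], 0.0
    for _ in range(iterations):
        ranked = rank_fn(feedback)           # re-rank using clicks so far
        precision, clicked = precision_at_20(ranked, relevant_ids, labeled)
        feedback.extend(clicked)
        labeled.update(clicked)
    return precision
```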
B. Evaluation of RF Fusion

The proposed RF fusion strategy (TVRF) has three parameters that need to be determined: α controls the overall contribution of RF in the visual space, β fine-tunes that contribution, and K is the scope within which the resulting images are re-ranked by the combination of the textual and visual similarities. Because K is less correlated with α and β, we first chose K based on a simplified version of (1.4) obtained by constraining β to 0, i.e., λ = α. We conducted a series of experiments varying α from 0 to 1 (in steps of 0.05) and K from 100 to 1000 (in steps of 100). Fig. 6 shows the detailed performance of TVRF under different α and K; K was finally set to 200, which corresponds to the best result. Then, with K fixed at 200, we chose α and β simultaneously. We conducted another series of experiments varying α from 0 to 1 (in steps of 0.05) and β from 1 to 256 (by iteratively multiplying by 2). Fig. 7 shows the detailed performance of TVRF under different α and β; we chose α = 0.25 and β = 64 as the best parameters. To further validate whether the scope K = 200 is the best, we fixed α at 0.25 and β at 64 and varied K from 100 to 1000 again. The validation results showed that K = 200 still corresponds to the best performance, which confirms the aforementioned assumption that K is less correlated with α and β.

Fig. 6. Performance of TVRF under different α and K.

Fig. 7. Performance of TVRF under different α and β.


Fig. 8. Performance of the four strategies.

Four RF strategies were evaluated and compared: RF using the textual feature only (TBRF), RF using the visual feature only (VBRF), a linear combination of the RF in the two feature spaces (LTVRF), and the proposed RF fusion strategy (TVRF). Fig. 8 shows the detailed RF performance of the four strategies for the ten representative queries and the average. The average precision of the four RF strategies is 0.5481, 0.3905, 0.6705, and 0.883, respectively. From these results, it can be seen that TVRF performs best among the four strategies because it is capable of effectively combining textual and visual features. Although LTVRF also combines both features, it performs even worse than TBRF in the cases of Eiffel Tower, Pear, and Rainbow, because it is not query-dependent and lacks the fine-tuning capability. This shows that an inappropriate combination of textual and visual features can seriously deteriorate RF performance. The results also show that VBRF performs worst, except in the case of Tulip, because visual features are still ineffective at capturing most textual query concepts.

C. Evaluation of SRC-Based RF

In our experiments, two RF strategies were evaluated and compared: traditional RF and the proposed SRC-based RF. Both use Rocchio's algorithm to construct a so-called optimal query; the difference lies in how the textual space for the resulting images is constructed. Traditional RF uses all terms present in the metadata to construct the textual space, while the SRC-based RF uses the SRC algorithm to obtain the salient phrases, based on which the textual space is constructed. Fig. 9 shows the detailed RF performance of the two strategies for the ten representative queries and the average. The average precision of the traditional RF and the SRC-based RF is 0.5481 and 0.6478, respectively. From these results, it can be seen that the SRC-based RF clearly outperforms the traditional RF strategy. The main reason is that SRC can effectively detect and remove unimportant or noisy words, so that the resulting feature reflects the user's search intention more precisely.

Fig. 9. Precision comparison of the two RF strategies.

Besides the performance comparison, the time cost of the two strategies is another factor worth analyzing. Given a query q and a term t, the time cost of completing five iterations of user-and-system interaction is recorded. Based on the sum of each term's time cost, we obtain the average time cost for each query q. Fig. 10 shows the time cost of the two strategies for the ten representative queries and the average. According to Fig. 10, the average time cost of the traditional RF and the SRC-based RF is 1.87 and 0.59 s, respectively. From these results, it can be seen that the SRC-based RF mechanism is more efficient than the traditional RF.

Fig. 10. Efficiency comparison of the two RF strategies.

D. Efficiency of TVRF

In order to evaluate the real-time performance of the proposed technique, the efficiency of the proposed RF fusion strategy (TVRF) is worth discussing as well. Since there are two textual RF mechanisms available in our work, we refer to the SRC-based TVRF as SRC-TVRF and compare TVRF with SRC-TVRF.

Given a query q and a term t, the time cost of completing five iterations of user-and-system interaction is recorded. Based on the sum of each term's time cost, we obtain the average time cost for each query q. Note that each query needs to accomplish only one SRC procedure, and the resulting textual space is suitable for all the related terms. Fig. 11 shows the time cost of TVRF and SRC-TVRF for the ten representative queries and the average.

According to Fig. 11, the average time cost of TVRF and SRC-TVRF is 3.02 and 0.994 s, respectively. From these results, it can be seen that the SRC-TVRF mechanism is more efficient than TVRF. Therefore, SRC-TVRF is more practical for a real Web image retrieval system.

Fig. 11. Efficiency comparison of TVRF and SRC-TVRF.


VI. CONCLUSION

In this paper, we have presented a unified relevance feedback framework for Web image retrieval. During the RF process, both textual and visual features are used in a sequential way, and a dynamic multimodal fusion strategy is proposed to seamlessly combine the RF in the textual space with that in the visual space. To integrate RF into Web image retrieval in a practical way, the textual feature-based RF mechanism employs an effective search result clustering (SRC) algorithm to construct an accurate and low-dimensional textual space for the resulting Web images. Besides explicit relevance feedback, implicit relevance feedback, e.g., click-through data, can also be integrated into the proposed mechanism; accordingly, a new user interface (UI) is proposed to support implicit RF. Experimental results on a database consisting of nearly three million Web images show that the proposed mechanism is usable, scalable, and effective.

REFERENCES

[1] Google Image Search. [Online]. Available: http://images.google.com
[2] Yahoo Image Search. [Online]. Available: http://images.search.yahoo.com/
[3] AltaVista Image Search. [Online]. Available: http://www.altavista.com/image/
[4] Picsearch Image Search. [Online]. Available: http://www.picsearch.com
[5] J. Rocchio, Relevance Feedback in Information Retrieval. Upper Saddle River, NJ: Prentice-Hall, 1971.
[6] X. S. Zhou and T. S. Huang, "Relevance feedback in image retrieval: A comprehensive review," ACM Multimedia Syst., vol. 8, no. 6, pp. 536–544, 2003.
[7] Q. K. Zhao, S. C. H. Hoi, T. Y. Liu, S. S. Bhowmick, M. R. Lyu, and W. Y. Ma, "Time-dependent semantic similarity measure of queries using historical click-through data," in Proc. 15th Int. Conf. World Wide Web, 2006, pp. 543–552.
[8] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay, "Accurately interpreting clickthrough data as implicit feedback," in Proc. 28th Annu. Int. ACM SIGIR Conf. Research and Development in Information Retrieval, 2005, pp. 154–161.
[9] Photosig. [Online]. Available: http://www.photosig.com
[10] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Reading, MA: Addison-Wesley, 1999.
[11] L. Zhang, Y. X. Hu, M. J. Li, W. Y. Ma, and H. J. Zhang, "Efficient propagation for face annotation in family albums," in Proc. 12th Annu. ACM Int. Conf. Multimedia, 2004, pp. 716–723.
[12] M. Stricker and M. Orengo, "Similarity of color images," in Proc. SPIE Storage and Retrieval for Image and Video Databases, 1995, pp. 381–392.
[13] J. Huang, S. R. Kumar, M. Mitra, W. J. Zhu, and R. Zabih, "Image indexing using color correlograms," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, 1997, pp. 762–768.
[14] H. Yu, M. J. Li, H. J. Zhang, and J. F. Feng, "Color texture moments for content-based image retrieval," in Proc. Int. Conf. Image Processing, 2002, pp. 929–932.
[15] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, "Relevance feedback: A power tool for interactive content-based image retrieval," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 644–655, May 1998.
[16] F. Jing, M. J. Li, H. J. Zhang, and B. Zhang, "A unified framework for image retrieval using keyword and visual features," IEEE Trans. Image Process., vol. 14, no. 7, pp. 979–989, Jul. 2005.
[17] Y. Lu, C. Hu, X. Zhu, H. J. Zhang, and Q. Yang, "A unified framework for semantics and feature based relevance feedback in image retrieval systems," in Proc. 8th Annu. ACM Int. Conf. Multimedia, 2000, pp. 31–38.
[18] Y. Wu, E. Y. Chang, K. C. C. Chang, and J. R. Smith, "Optimal multimodal fusion for multimedia data analysis," in Proc. 12th Annu. ACM Int. Conf. Multimedia, 2004, pp. 572–579.
[19] H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma, and J. W. Ma, "Learning to cluster web search results," in Proc. 27th Annu. Int. ACM SIGIR Conf. Research and Development in Information Retrieval, 2004, pp. 210–217.

En Cheng received the B.S. and M.S. degrees in computer science from Huazhong University of Science and Technology, Wuhan, China, in 2003 and 2006, respectively. She is currently pursuing the Ph.D. degree in the Electrical Engineering and Computer Science Department, Case Western Reserve University, Cleveland, OH. From 2005 to 2006, she was with Microsoft Research Asia, Beijing, China, as a visiting student. Her research interests include knowledge management, the semantic web, information retrieval, image processing, pattern recognition, and bioinformatics.

Feng Jing received the B.S. and Ph.D. degrees in computer science from Tsinghua University, Beijing, China, in 2000 and 2005, respectively. He was with Microsoft Research Asia, Beijing, from 2005 to 2008, and then joined Tencent Research Center, Beijing. His research interests include image retrieval, image annotation, web mining, text understanding, and machine learning.

Lei Zhang received the B.S., M.S., and Ph.D. degrees in computer science from Tsinghua University, Beijing, China, in 1993, 1995, and 2001, respectively. After two years of working in industry, he returned to Tsinghua University and received his Ph.D. degree. He then joined Microsoft Research Asia, Beijing. His research interests include machine learning, web image search, information retrieval, and text mining.

