IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 18, NO. 6, JUNE 2009
Abstract—Although relevance feedback (RF) has been extensively studied in the content-based image retrieval community, no commercial Web image search engines support RF because of scalability, efficiency, and effectiveness issues. In this paper, we propose a unified relevance feedback framework for Web image retrieval. Our framework shows advantages over traditional RF mechanisms in the following three aspects. First, during the RF process, both textual features and visual features are used in a sequential way. To seamlessly combine textual feature-based RF and visual feature-based RF, a query concept-dependent fusion strategy is automatically learned. Second, the textual feature-based RF mechanism employs an effective search result clustering (SRC) algorithm to obtain salient phrases, based on which we can construct an accurate and low-dimensional textual space for the resulting Web images. Thus, we can integrate RF into Web image retrieval in a practical way. Last, a new user interface (UI) is proposed to support implicit RF. On the one hand, unlike traditional RF UIs, which force users to make explicit judgments on the results, the new UI regards the users' click-through data as implicit relevance feedback in order to relieve the burden on the users. On the other hand, unlike traditional RF UIs, which abruptly substitute subsequent results for previous ones, a recommendation scheme is used to help the users better understand the feedback process and to mitigate the possible waiting caused by RF. Experimental results on a database consisting of nearly three million Web images show that the proposed framework is wieldy, scalable, and effective.

Index Terms—Implicit feedback, relevance feedback (RF), search result clustering, web image retrieval.

I. INTRODUCTION

WITH the explosive growth of both the World Wide Web and the number of digital images, there is an increasingly urgent need for effective Web image retrieval systems. Most of the popular commercial search engines, such as Google [1], Yahoo! [2], and AltaVista [3], support image retrieval by keywords. There are also commercial search engines dedicated to image retrieval, e.g., Picsearch [4]. A common limitation of most of the existing Web image retrieval systems is that their search process is passive, i.e., it disregards the informative interactions between users and retrieval systems. An active system should bring the user into the loop so that personalized results can be provided for the specific user. To be active, the system can take advantage of relevance feedback techniques.

Relevance feedback, originally developed for information retrieval [5], is an online learning technique aimed at improving the effectiveness of information retrieval systems. The main idea of relevance feedback is to let the user guide the system. During the retrieval process, the user interacts with the system and rates the relevance of the retrieved documents according to his/her subjective judgment. With this additional information, the system dynamically learns the user's intention and gradually presents better results. Since the introduction of relevance feedback to image retrieval in the mid-1990s, it has attracted tremendous attention in the content-based image retrieval (CBIR) community and has been shown to provide dramatic performance improvement [6]. However, no commercial Web image search engines support relevance feedback because of usability, scalability, and efficiency issues.

Note that the textual features, on which most of the commercial search engines depend, are extracted from the file name, ALT text, URL, and surrounding text of the images. The usefulness of the textual features is demonstrated by the popularity of the currently available Web image search engines. However, directly using the textual information to construct the textual space leads to time-consuming computation, and the performance suffers from noisy terms. Since the user interacts with the search engine in real time, the relevance feedback mechanism should be sufficiently fast and, if possible, should avoid heavy computations over millions of retrieved images. To integrate relevance feedback into Web image retrieval in a practical way, an efficient and effective mechanism is required for constructing an accurate and low-dimensional textual space with respect to the resulting Web images.

Although all existing commercial Web image retrieval systems depend solely on textual information, Web images are characterized by both textual and visual features. With effective utilization of textual features, image retrieval greatly benefits from leveraging mature techniques from text retrieval. However, just as the proverb says, "a picture is worth one thousand words": the textual representation of an image is always insufficient compared to the visual content of the image itself. Therefore, visual features are required for finer granularity of image description. Considering the characteristics of both textual and visual features, it is reasonable to conclude that RF in the textual space can guarantee relevance, while RF in the visual space can meet the need for finer granularity. Thus, it is meaningful to introduce a unified relevance feedback framework for Web image retrieval which seamlessly combines textual feature-based RF and visual feature-based RF in a sequential way.

Manuscript received May 22, 2007; revised February 10, 2009. First published April 07, 2009; current version published May 13, 2009. This work was performed at Microsoft Research Asia, Beijing, China. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Eli Saber.
E. Cheng is with Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106-7071 USA (e-mail: en.cheng@case.edu).
F. Jing is with Tencent Research Center, Beijing, 100080, China (e-mail: scenery.jf@gmail.com).
L. Zhang is with Microsoft Research Asia, Beijing, 100080, China (e-mail: leizhang@microsoft.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2009.2017128
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on July 20,2010 at 06:36:40 UTC from IEEE Xplore. Restrictions apply.
CHENG et al.: UNIFIED RELEVANCE FEEDBACK FRAMEWORK FOR WEB IMAGE RETRIEVAL 1351
To strengthen our proposed framework, we employ implicit feedback to overcome the limitation of explicit feedback techniques, in which an increased cognitive burden is placed on the users. Unlike explicit feedback, implicit feedback can be collected at much lower cost, in much larger quantities, and without burdening the users. As one of the most effective forms of implicit feedback information, click-through data has been used either as absolute relevance judgments [7] or relative relevance judgments [8] in text retrieval. Fortunately, image retrieval has the following two characteristics when compared with text retrieval. First, the thumbnail of an image reflects more information than the title and snippet of a Web page, so click-through information in image retrieval tends to be less noisy than in text retrieval. Second, unlike a textual document, the content of an image can be taken in at a glance. As a result, the user will possibly click more results in image retrieval than in text retrieval. Both characteristics imply that click-through data can be helpful for image retrieval.

In this paper, we propose a unified relevance feedback framework for Web image retrieval. There are three main contributions of the paper.
• A dynamic multimodal fusion scheme is proposed to seamlessly combine textual feature-based RF (TBRF) and visual feature-based RF (VBRF). More specifically, a TBRF algorithm is first used to quickly select a possibly relevant image set. Then, a VBRF algorithm is combined with the TBRF algorithm to further re-rank the resulting Web images. The fusion of VBRF and TBRF is query concept-dependent and automatically learned.
• The textual feature-based RF mechanism employs an effective search result clustering (SRC) algorithm to obtain salient phrases, based on which we can construct an accurate and low-dimensional textual space for the resulting Web images. As a result, we can integrate RF into Web image retrieval in a practical way.
• A new UI is proposed to support implicit RF. On the one hand, unlike traditional RF UIs, which force the users to make explicit judgments on the results, the new UI regards the user's click-through data as implicit relevance feedback in order to relieve the burden on the user. On the other hand, unlike traditional RF UIs, which abruptly substitute subsequent results for previous ones, a recommendation scheme is used to help the user better understand the feedback process and to mitigate the possible waiting caused by RF.

The remainder of this paper is organized as follows. In Section II, we describe the dynamic multimodal fusion mechanism. SRC-based textual space construction is illustrated in Section III. The user interface is presented in Section IV. Experimental results are presented and analyzed in Section V. Finally, we conclude and discuss future work in Section VI.

II. DYNAMIC MULTIMODAL FUSION

A. Image Representation

The images collected from several photo forum sites, e.g., photosig [9], have rich metadata such as title, category, photographer's comment, and other people's critiques. These images constitute the evaluation dataset for the proposed relevance feedback framework. For example, a photo from photosig¹ has the following metadata. In order to facilitate later citation of this photo, we denote it by P.
• Title: early morning.
• Category: landscape, nature, rural.
• Comment: I found this special light one early morning in Pyrenees along the Vicdessos river near our house.
• One of the critiques: wow I like this picture very much I guess the light has to do with everything the light is great on the snow and on the sky (strange looking sky by the way) greatly composed nice crafted border a beauty.

All the aforementioned metadata is used as the textual source for the textual space construction. To build the textual space, there are two available approaches in our work. One straightforward approach is to directly use the above metadata to obtain the textual feature. The other is based on the Search Result Clustering (SRC) algorithm to construct the textual space. The detailed description of the SRC-based textual space construction is given in Section III.

To represent the textual feature, the vector space model [10] with the TF-IDF weighting scheme is adopted. More specifically, the textual feature of an image I_i is an N-dimensional vector and can be given by

T_i = (w_{i1}, w_{i2}, ..., w_{iN})    (1.1)
w_{ij} = tf_{ij} · log(M / M_j)    (1.2)

where:
• T_i is the textual feature of an image I_i;
• w_{ij} is the weight of the jth term in I_i's textual space;
• N is the number of all distinct terms of all images' textual space;
• tf_{ij} is the frequency of the jth term in I_i's textual space;
• M is the total number of images;
• M_j is the number of images whose metadata contains the jth term.

To illustrate the straightforward approach, in which all metadata is utilized to construct the textual space, we use the photo P introduced at the beginning of this section as an example. Given the query "early morning," we have 151 resulting images, including photo P. Based on those resulting images, we collect all distinct terms from the metadata, which results in 358 distinct terms in total. Photo P itself has 48 distinct terms, which consist of early, morning, landscape, nature, rural, I, found, this, special, light, one, in, Pyrenees, along, the, Vicdessos, river, near, our, house, wow, like, picture, very, much, guess, has, to, do, with, everything, is, great, on, snow, and, sky, strange, looking, by, way, greatly, composed, nice, crafted, border, a, and beauty.

Given M = 151, N = 358, and the 48 distinct terms of P, we can calculate tf_{ij} and M_j for each distinct term with respect to P. As a result, we can obtain each weight w_{ij} according to (1.2). In the end, according to (1.1), the textual feature of P is obtained.

¹http://www.photosig.com/go/photos/view?id=733881
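To make the straightforward approach concrete, the TF-IDF computation of (1.1) and (1.2) can be sketched as follows. The function and the toy metadata terms are our own illustration, not part of the described system; in the paper's example, the corpus would be the 151 images returned for "early morning" and the vocabulary the 358 distinct terms.

```python
import math
from collections import Counter

def tfidf_vector(image_terms, all_images_terms):
    """Textual feature of one image, cf. (1.1)-(1.2): w_ij = tf_ij * log(M / M_j)."""
    M = len(all_images_terms)                          # total number of images
    vocab = sorted({t for terms in all_images_terms for t in terms})
    df = Counter()                                     # M_j: images whose metadata contains term j
    for terms in all_images_terms:
        df.update(set(terms))
    tf = Counter(image_terms)                          # tf_ij: term frequency in this image's metadata
    return [tf[t] * math.log(M / df[t]) for t in vocab]

# Toy corpus of three images' metadata terms (invented for illustration):
corpus = [["early", "morning", "light"],
          ["early", "snow"],
          ["sky", "light", "snow"]]
feature = tfidf_vector(corpus[0], corpus)  # N-dimensional vector over the vocabulary
```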
To visually represent an image, a 64-dimensional feature [11] was extracted. It is a combination of three features: six-dimensional color moments [12], 44-dimensional banded auto-correlogram [12], and 14-dimensional color texture moments [14]. For color moments, the first two moments from each channel of the CIE-LUV color space were extracted. For the correlogram, the HSV color space with inhomogeneous quantization into 44 colors is adopted [11]. For color texture moments, we operate on the original image with templates derived from the local Fourier transform and obtain characteristic maps, each of which characterizes some information on a certain aspect of the original image. Similar to color moments, we calculate the first and second moments of the characteristic maps, which represent the color texture information of the original image. The resulting visual feature of an image is a 64-dimensional vector v = (v_1, ..., v_64). Each feature dimension is normalized to [0, 1] using Gaussian normalization for the convenience of further computation.

B. RF in Textual Space

To perform RF in textual space, Rocchio's algorithm [1] is used. The algorithm was developed in the mid-1960s and has proven to be one of the most effective RF algorithms in information retrieval. The key idea of Rocchio's algorithm is to construct a so-called optimal query so that the difference between the average score of a relevant document and the average score of a nonrelevant document is maximized. Cosine similarity is used to calculate the similarity between an image and the optimal query. Since only clicked images are available to our proposed framework, we assume the clicked images to be relevant and define the feature of the optimal query as follows:

T_opt = T_q + β · (1/|Rel|) · Σ_{T_i ∈ Rel} T_i − γ · (1/|Non-Rel|) · Σ_{T_j ∈ Non-Rel} T_j    (1.3)

where:
• T_q is the vector of the initial query;
• T_i is the vector of a relevant image;
• T_j is the vector of a nonrelevant image;
• Rel is the relevant image set;
• Non-Rel is the nonrelevant image set;
• |Rel| is the number of relevant images;
• |Non-Rel| is the number of nonrelevant images;
• β is the parameter controlling the relative contribution of the relevant images and the initial query;
• γ is the parameter controlling the relative contribution of the nonrelevant images and the initial query.

In our case, only relevant images are available to our proposed mechanism, so we set β to 1 and γ to 0 in our experiments. Although Rocchio's algorithm is currently used, any vector-based RF algorithm could be used in the unified
framework.
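Since only clicked (relevant) images are available, the optimal query of (1.3) with β = 1 and γ = 0 reduces to the initial query plus the centroid of the clicked images' textual features; cosine similarity then scores each image against it. A minimal sketch (the function names are ours):

```python
import math

def rocchio_optimal_query(initial_query, clicked, beta=1.0):
    """Optimal query of (1.3) with gamma = 0: T_opt = T_q + beta * centroid(clicked)."""
    n = len(clicked)
    return [q + beta * sum(v[k] for v in clicked) / n
            for k, q in enumerate(initial_query)]

def cosine_similarity(a, b):
    """Cosine similarity used to score an image against the optimal query."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

# Two clicked images pull the initial query toward their centroid:
t_opt = rocchio_optimal_query([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
# t_opt == [1.0, 0.5, 0.5]
```

Images are then ranked by `cosine_similarity(t_opt, image_feature)` in decreasing order.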
C. RF in Visual Space

To perform RF in visual space, Rui's algorithm [15] is used. Assuming the clicked images to be relevant, both an optimal query and feature weights are learned from the clicked images. More specifically, the feature vector of the optimal query is the mean of all features of the clicked images. The weight of a feature dimension is proportional to the inverse of the standard deviation of the feature values of all clicked images [15]. Weighted Euclidean distance is used to calculate the distance between an image and the optimal query. Although Rui's algorithm is currently used, any RF algorithm using only relevant images could be used in the unified framework.

D. Dynamic Multimodal Fusion

There has been some work on the fusion of relevance feedback in different feature spaces [16]–[18]. A straightforward and widely used strategy is linear combination [16], [17]. Nonlinear combination using a support vector machine (SVM) was proposed in [18]. Since the super-kernel fusion algorithm [18] needs irrelevant images, it is unsuitable for systems offering only relevant images.

Since textual features are more semantic-oriented and efficient than visual features, while visual features have finer descriptive granularity than textual features, we combine the RF in both feature spaces in a sequential way. The flowchart of the RF of our unified framework is shown in Fig. 1. First, RF in textual space is performed to rank the initial resulting images using the optimal query learned in (1.3). Then, RF in visual space is performed to re-rank the top K images. The re-ranking process is based on a dynamic linear combination of the RF in both the visual and textual spaces.

Note that restricting the re-ranking to only the top K images has two advantages. First, the relevance of the top images can be guaranteed by the preceding RF in textual space. Second, the efficiency of the RF process can be ensured, for RF in visual space could be inefficient on a very large image set. The number K of top images, which affects both the efficiency and effectiveness of the RF process, is predetermined experimentally. The combination weights that reflect the relative contribution of both spaces are automatically learned and query concept-dependent. Assume there are n clicked images. The similarity metric used to re-rank a top image I using RF in both visual and textual spaces is defined as follows:

S(I) = λ · S_v(I) + (1 − λ) · S_t(I)    (1.4)
λ = a · exp(−b · σ)    (1.5)
σ = (1/n) · Σ_{i=1}^{n} d(v_i, q_v)    (1.6)
S_v(I) = 1 / (1 + d(v_I, q_v))    (1.7)
d(v, q_v) = ( Σ_{k=1}^{64} w_k · (v_k − q_{v,k})² )^{1/2}    (1.8)

where:
• S is the similarity metric in both the visual and textual spaces;
• S_v is the similarity between I's visual feature and q_v;
• S_t is the cosine similarity between I's textual feature and T_opt;
• λ is the dynamic linear combination parameter for the similarity metric in both the visual and textual spaces;
• a and b are parameters which control the relative contribution of RF in visual space;
• σ is the deviation of the clicked images in visual space;
• v_i is the visual feature vector of the clicked image i;
• q_v is the feature vector of the optimal query in visual space;
• d is the weighted Euclidean distance between an image's visual feature and q_v.
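The visual-space re-ranking and the dynamic fusion can be sketched together. One plausible reading of (1.4)-(1.8) is assumed here, with λ decaying exponentially in the visual deviation σ of the clicked images; the functional forms and all names below are our assumptions, not the authors' published implementation:

```python
import math

def fused_rerank(clicked, candidates, s_text, a=1.0, b=1.0):
    """Re-rank top-K images by S = lam * S_v + (1 - lam) * S_t, cf. (1.4).

    clicked:    visual feature vectors of the clicked (relevant) images
    candidates: visual feature vectors of the top-K images from the textual RF step
    s_text:     textual similarity S_t of each candidate
    Returns candidate indices sorted by decreasing fused score S."""
    n, dim = len(clicked), len(clicked[0])
    # Rui's algorithm: the optimal query q_v is the mean of the clicked features...
    q_v = [sum(v[k] for v in clicked) / n for k in range(dim)]
    # ...and each dimension's weight is the inverse of its standard deviation.
    std = [math.sqrt(sum((v[k] - q_v[k]) ** 2 for v in clicked) / n) for k in range(dim)]
    w = [1.0 / s if s > 1e-9 else 1.0 for s in std]

    def dist(x):  # weighted Euclidean distance d(x, q_v)
        return math.sqrt(sum(wk * (xk - qk) ** 2 for wk, xk, qk in zip(w, x, q_v)))

    # Visually consistent clicks give a small deviation sigma, hence a large lam:
    # the query concept is well captured visually, so visual RF dominates.
    sigma = sum(dist(v) for v in clicked) / n
    lam = a * math.exp(-b * sigma)

    scores = [lam / (1.0 + dist(x)) + (1.0 - lam) * st
              for x, st in zip(candidates, s_text)]
    return sorted(range(len(candidates)), key=lambda i: -scores[i])
```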
Note that λ in (1.4) tunes the visual feature's contribution to the overall similarity metric according to the query concept. In (1.5), a controls the overall contribution of RF in visual space, while b fine-tunes the contribution. If the query concept is well characterized by the visual feature, the clicked images should be visually consistent, and σ will be small (near 0). According to (1.5), λ should then be large, and the visual feature will thus be important. This is consistent with our intuition. Since σ is query concept-dependent, the resulting combination parameter λ is query concept-dependent as well. This property of the parameter results in a query concept-dependent fusion strategy for relevance feedback in both the textual and visual spaces.

Fig. 1. Flowchart of the RF of the unified framework.

III. SRC-BASED TEXTUAL SPACE CONSTRUCTION

To construct an accurate and low-dimensional textual space for the resulting Web images, we use the SRC algorithm proposed in [19]. The author re-formalizes the clustering problem as a salient phrase ranking problem. Given a query and the ranked list of search results, the algorithm first parses the whole list of titles and snippets, extracts all possible phrases (n-grams) from the contents, and calculates five properties for each phrase. The five properties consist of phrase frequency/inverted document frequency (TFIDF), phrase length (LEN), intra-cluster similarity (ICS), cluster entropy (CE), and phrase independence (IND). The five properties are supposed to be related to the salience score of phrases. In our case, the comment and critiques are regarded as snippets. In the following, the current phrase (an n-gram) is denoted by w:

TFIDF(w) = f(w) · log(N / df(w))    (1.9)
LEN(w) = |w|    (1.10)
o(w) = (1/|D(w)|) · Σ_{d ∈ D(w)} d    (1.11)
ICS(w) = (1/|D(w)|) · Σ_{d ∈ D(w)} cos(d, o(w))    (1.12)
CE(w) = − Σ_t (|D(w) ∩ D(t)| / |D(w)|) · log(|D(w) ∩ D(t)| / |D(w)|)    (1.13)
IND_l(w) = − Σ_l (f(l, w) / f(w)) · log(f(l, w) / f(w))    (1.14)
IND(w) = (IND_l(w) + IND_r(w)) / 2    (1.15)

where f(·) represents frequency calculation, D(w) is the set of documents containing w, o(w) is their centroid, f(l, w) is the frequency with which word l immediately precedes w, and IND_r is defined symmetrically for the right context.

Given the above five properties, we use a single formula to combine them and calculate a single salience score for each phrase. In our case, each phrase can be represented as a vector x = (x_1, ..., x_5) of the five properties. A regression model learned from previous training data is then applied to combine the five properties into a single salience score y. According to [19], when comparing the performance of linear regression, logistic regression, and support vector regression, linear regression performs best. Therefore, in our experiments, we choose the linear regression model. The linear regression model postulates that

y = b_0 + Σ_{i=1}^{5} b_i · x_i + ε    (1.16)

where:
• ε is a random variable with mean zero;
• b_i is a coefficient determined by the condition that the sum of the squared residuals is as small as possible.

The phrases are ranked according to the salience score y, and the top-ranked phrases are taken as salient phrases. The resulting salient phrases are utilized to construct the textual space, based on which we use (1.1) and (1.2) to compute the textual feature.
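The least-squares fit behind (1.16) is simple enough to sketch directly: stack the phrase property vectors as rows, solve the normal equations for the coefficients, and score new phrases with them. The functions are generic in the number of properties; the one-dimensional demo data is invented for illustration.

```python
def fit_linear_regression(X, y):
    """Least-squares fit y ~ b_0 + b . x via the normal equations, cf. (1.16)."""
    A = [[1.0] + list(row) for row in X]          # constant column for the intercept b_0
    m = len(A[0])
    # Normal equations (A^T A) b = A^T y, solved by Gaussian elimination with pivoting.
    ata = [[sum(r[i] * r[j] for r in A) for j in range(m)] for i in range(m)]
    aty = [sum(r[i] * yv for r, yv in zip(A, y)) for i in range(m)]
    for col in range(m):                          # forward elimination
        piv = max(range(col, m), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(col + 1, m):
            f = ata[r][col] / ata[col][col]
            for c in range(col, m):
                ata[r][c] -= f * ata[col][c]
            aty[r] -= f * aty[col]
    b = [0.0] * m
    for r in range(m - 1, -1, -1):                # back substitution
        b[r] = (aty[r] - sum(ata[r][c] * b[c] for c in range(r + 1, m))) / ata[r][r]
    return b

def salience(b, props):
    """Score a phrase from its property vector, e.g. (TFIDF, LEN, ICS, CE, IND)."""
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], props))

# Demo on a 1-D toy problem (y = 2x): the fit recovers b ~ [0.0, 2.0].
b = fit_linear_regression([[0.0], [1.0], [2.0]], [0.0, 2.0, 4.0])
```

In use, `X` would hold the five properties of previously labeled phrases, `y` their training salience scores, and candidate phrases would be ranked by `salience(b, props)`.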
IV. FRIENDLY USER INTERFACE
To make the best of the implicit feedback information, a new Web image search UI named MindTracer is proposed. MindTracer consists of two types of pages: a main page and a detail page. The main page is shown in Fig. 3 and the detail page in Fig. 4. The main page has three frames: the search frame, the recommendation frame, and the result frame. The search frame contains an edit box for users to type a query phrase. Only text-based queries are supported by MindTracer, since they are friendly and familiar to the typical Web surfer. After a user submits a query to MindTracer, the thumbnails of the result images are shown in the result frame in five rows and four columns. Initially, no images are shown in the recommendation frame. When the user clicks an image in the result frame out of his/her interest, the recommendation function is activated, so that the dynamic
multimodal RF is carried out. As a result, a finer ranking of the initial results is obtained, and the top 20 recommended images are shown in the recommendation frame. The images iteratively roll in the recommendation window with a scroll bar that can be manually controlled by the user.

Accompanying the user's click-through, the corresponding original image is shown in a detail page. The detail page has two frames: the image frame and the snapshot frame. If the user clicks another image in the result frame or the recommendation frame, then besides the aforementioned system reactions, the formerly recommended images are shown in the snapshot frame of the detail page, in case the user wants more images from the former recommended image list. If the user clicks an image in the snapshot frame, the corresponding original image is shown in the image frame. Once the user is satisfied with the recommended results, he/she can click the refine button to move all the recommended images from the recommendation frame to the result frame. With the asynchronous scheme for refreshing the detail page and the recommendation frame of the main page, no extra waiting time is required to support the recommendation scheme.

Fig. 3. Main page of MindTracer.

Fig. 4. Detail page of MindTracer.

The available functions of MindTracer include query-based search, result recommendation, and result refinement. The query-based search is similar to that of the currently available search engines. The result recommendation and refinement functions are the contributions of MindTracer. The recommendation function is activated by the user's click-through, for MindTracer regards the user's click-through as implicit relevance feedback. Besides result recommendation, result refinement is another useful function, which displays the whole set of results obtained from the multimodal RF procedure when the user is satisfied with the recommendation results and clicks the refine button. Considering that the user is satisfied and clicks the refine button on his/her own initiative, the relevance of the refined results can be guaranteed. The flowchart of MindTracer is shown in Fig. 5.

V. EXPERIMENTAL RESULTS

A. Evaluation Dataset

To construct the evaluation dataset, approximately three million images were crawled from several photo forum sites, e.g., photosig [9]. To automatically evaluate our proposed SRC-based RF mechanism, an image subset was selected and manually labeled as follows. First, ten representative queries were chosen. Then, for each query, the key terms related to the
TABLE I
QUERIES AND CORRESPONDING KEY TERMS. THE NUMBER WITHIN PARENTHESES IS THE NUMBER OF RESULT IMAGES
the ten representative queries and the average. The average precision of the four RF strategies is 0.5481, 0.3905, 0.6705, and 0.883, respectively. From the results, it can be seen that TVRF performs the best among the four strategies because it is capable of effectively combining textual and visual features. Though LTVRF also combines both features, it performs even worse than TBRF in the cases of Eiffel Tower, Pear, and Rainbow, because it is not query-dependent and lacks the fine-tuning capability. This shows that an inappropriate combination of textual and visual features will seriously deteriorate RF performance. The results also show that VBRF performs the worst, except in the case of Tulip, because visual features are still ineffective in capturing most of the textual query concepts.

Fig. 9. Precision comparison of two RF strategies.
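The averages above are per-query precision scores averaged over the ten queries. One common realization, precision at a cutoff k over binary relevance labels, can be sketched as follows; this is a generic illustration, not the authors' evaluation code, and the cutoff is an assumption.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k ranked results labeled relevant (1) versus not (0)."""
    top = relevance[:k]
    return sum(top) / len(top) if top else 0.0

def average_precision_over_queries(per_query_relevance, k):
    """Precision@k averaged over all evaluation queries."""
    return sum(precision_at_k(r, k) for r in per_query_relevance) / len(per_query_relevance)

# e.g. precision_at_k([1, 0, 1, 1], 4) -> 0.75
```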
C. Evaluation of SRC-Based RF