You are on page 1of 13

This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1

Manifold Ranking-Based Matrix Factorization


for Saliency Detection
Dapeng Tao, Jun Cheng, Mingli Song, Senior Member, IEEE, and Xu Lin

Abstract— Saliency detection is used to identify the most most important information contained within a tremendous
important and informative area in a scene, and it is widely amount of visual data. It also has a role in determining the
used in various vision tasks, including image quality assessment, regions that are more attractive to humans, and the method
image matching, and object recognition. Manifold ranking (MR)
has been used to great effect for the saliency detection, since can be considered as a selection process that locates salient
it not only incorporates the local spatial information but regions in a scene. In tandem with the rapid development of
also utilizes the labeling information from background queries. computer vision techniques, saliency detection has become
However, MR completely ignores the feature information popular in a number of vision tasks, such as image quality
extracted from each superpixel. In this paper, we propose an assessment [1], [2], image matching [3]–[5], object
MR-based matrix factorization (MRMF) method to overcome
this limitation. MRMF models the ranking problem in the matrix recognition [6]–[9], [10], [11], image superresolution [12],
factorization framework and embeds query sample labels in the and visual tracking [13]–[16].
coefficients. By incorporating spatial information and embedding Saliency detection methods can be divided into two main
labels, MRMF enforces similar saliency values on neighboring categories: 1) the bottom-up methods and 2) the top-down
superpixels and ranks superpixels according to the learned methods. Most existing models fall into the bottom-up
coefficients. We prove that the MRMF has good generalizability,
and develops an efficient optimization algorithm based on the category, in which subjects’ freeview, a scene and salient
Nesterov method. Experiments using popular benchmark data regions attract attention [17]–[20]. The bottom-up mechanism
sets illustrate the promise of MRMF compared with the other involves low-level processes, and, is driven by the intrinsic
state-of-the-art saliency detection methods. attributes of the stimuli [21]–[23], adopts the different
Index Terms— Manifold ranking (MR), matrix factorization, properties of the low-level visual information to compute
optimization algorithm, saliency detection. saliency maps. In contrast, the top-down models are driven
by high-level tasks, such as looking for a specific object
I. I NTRODUCTION category in a scene [3], [24]. The bottom-up and top-down
information can be combined into a unified framework for
S ALIENT object detection in biological vision systems
allows the allocation of processing resources to the the general visual attention analysis [25].
Although most saliency detection methods aim at detecting
Manuscript received September 29, 2014; revised July 21, 2015; accepted the center-surround contrast and have good performance, there
July 23, 2015. This work was supported in part by the Chinese
Academy of Sciences (CAS) and Locality Cooperation Projects under remains room for efficiency and stability improvements. The
Grant ZNGZ-2011-012, in part by the Guangdong–CAS Strategic Cooperation tradition bottom-up models often pursue the object boundaries
Program under Grant 2012-B090400044, in part by the National Natural and ignore the target region uniformly, and thus these models
Science Foundation of China under Grant 6140051238, in part by the
Guangdong Natural Science Funds under Grant 2014A030310252, in part fail to some real applications, such as visual tracking and
by the Science and Technology Service Network Initiative through CAS image retrieval. Recently, Wei et al. [26] considered the
under Grant KFJ-EW-STS-035, in part by the Shenzhen Technology Project contribution of background priors in the saliency detection and
under Grant JCYJ20130402113127502, Grant JCYJ20140901003939001,
Grant JSGG20130624154940238, and Grant JCYJ20140417113430736, developed a new scheme for the saliency detection. In addition,
in part by the Guangdong Innovative Research Team Program under manifold ranking (MR) has successfully been applied to
Grant 201001D0104648280, and in part by the Hubei Key Laboratory of saliency detection, such as the effective two-stage saliency
Intelligent Vision Based Monitoring for Hydroelectric Engineering Program
under Grant 2014KLA01. detection framework proposed in [27]. Note that the models
D. Tao is with the School of Information Science and Engineering, Yunnan utilized the background priors can obtain a better performance
University, Kunming 650091, China (e-mail: dapeng.tao@gmail.com). in calculating the precise saliency boundary patches.
J. Cheng is with the Laboratory for Human Machine Control,
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, MR is an effective saliency detection technique [27],
Shenzhen 518055, China, and also with The Chinese University of because it exploits the intrinsic graph structure and incorpo-
Hong Kong, Hong Kong (e-mail: jun.cheng@siat.ac.cn). rates local grouping cues in graph labeling. Further analysis
M. Song is with the College of Computer Science, Zhejiang University,
Hangzhou 310027, China (e-mail: brooksong@ieee.org). shows that the MR intrinsically propagates query labels along
X. Lin is with the Laboratory for Human Machine Control, Shenzhen an adjacent graph, and thus the constructed graph significantly
Institutes of Advanced Technology, Chinese Academy of Sciences, influences its performance. Since features extracted from each
Shenzhen 518055, China, and also with the Hubei Key Laboratory
of Intelligent Vision Based Monitoring for Hydroelectric Engineering, superpixel determine the weights between two nodes in the
China Three Gorges University, Yichang 443002, China (e-mail: xlin.scut@ constructed graph, they are important cues in the saliency
gmail.com). detection.
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org. However, MR completely ignores such cues, and therefore,
Digital Object Identifier 10.1109/TNNLS.2015.2461554 a great deal of information is not involved. In this paper,
2162-237X © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

2 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

we propose a novel bottom-up model, i.e., the MR-based visual saliency can be obtained by combining the saliencies
matrix factorization (MRMF), to overcome this problem. computed from different features. These methods are also
In particular, MRMF concatenates the features of all the considered to be biologically plausible, because they are
superpixels into a data matrix, and then decomposes the inspired by the behavior and the neuronal architecture of
concatenated matrix into the product of two low-rank the early primate visual system. Itti et al. [36] proposed a
matrices, i.e., the basis of the lower dimensional space and saliency model (an extension of [56]) by integrating multiscale
the coefficients of all samples [28], [29]. By simultaneously contrasts in intensity, color, and orientation features. This
incorporating the manifold regularization [30] over coefficients method has three main stages: 1) different simple features are
and constraining the coefficients of query samples to be in the extracted at locations over the image plane; 2) activation maps
unit ball, MRMF can rank the remaining samples according are obtained using the center-surround operators at multiple
to their coefficients [31]. Technically, the motivation of scales, resulting in the multiscale feature contrasts; and 3) the
MRMF is that the comprehensive ranking result can be used activation maps are normalized and combined into a single
to compute the saliency map according to the MR framework. saliency map. The method was later extended by adding
Note that the technique of matrix factorization [32], [33] motion and flicker contrasts for the video domain [57].
has received a lot of attentions and is widely used in Following [36], a large number of contrast-based models
image classification [34] and image processing recently. Our with different features, contrast functions, and combination
theoretical analysis shows that the MRMF is generalizable. rules have been developed. Le Meur et al. [38] proposed a
However, since MRMF is nonconvex, its optimization is not bottom-up saliency approach based on the structure of the
efficient. To overcome this problem, we develop an alternating human visual system, which includes the contrast sensitivity
algorithm based on Nesterov’s method [35] to solve MRMF. functions, perceptual decomposition, and center-surround
Experiments on popular benchmark data sets demonstrate the interactions. In [39], the center-surround histogram contrast,
effectiveness of MRMF for the saliency detection. which is the distance between the color histograms in a region
The main contributions of this paper include: 1) we develop and its surrounding regions, was used as the bottom-up cue to
a novel MRMF approach, which incorporates the spatial detect salient objects. A lot of progress has been made, since
information and embeds labels; 2) we prove that the MRMF these early efforts and many off-the-shelf toolboxes exist that
has good generalizability and develop an efficient optimization can be applied to many visual tasks.
algorithm based on Nesterov’s method; and 3) to demonstrate The information theoretical methods define bottom-up
the effectiveness of MRMF, we provide extensive experi- saliency based on maximum information sampling. The most
mentations on the benchmark data sets to verify that the informative parts of a scene are selected as salient parts. For
newly developed MRMF can improve the saliency detection example, Burce and Tsotsos [40] proposed the attention based
performance. Given space constraints, those parts that are on information maximization (AIM) model. AIM uses Shan-
easy to implement based on the given references are not non’s self-information, which is inversely proportional to the
detailed [27]. likelihood of observing the feature vector at a certain position,
The rest of this paper is organized as follows. In Section II, to measure the saliency. To estimate the probability density
we briefly discuss related works of the new proposed function of local patches, they used independent component
algorithm. The newly proposed MRMF is detailed analysis to reduce the problem to estimating several 1-D den-
in Section III. A theoretical analysis of the generalization sity functions. Hou and Zhang [41] proposed the incremental
error bound for MRMF is provided in Section IV. Section V coding length method to measure the perspective entropy of
presents the experimental results on four representative data each feature: the features with large coding length increments
sets. Finally, the conclusion is drawn in Section VI. are selected as salient features. Similar to the AIM method,
rare features have the highest energy and become salient. Seo
II. R ELATED W ORK and Milanfar [58] proposed the self-resemblance approach,
In recent years, a number of salience detection methods which measures the resemblance of the local image structure
have been proposed, which, as noted above, can be grouped at each pixel to its surroundings and takes the statistical
into the bottom-up and top-down models. In this section, we likelihood of its feature (given the features in surrounding
briefly review the most popular techniques. pixels) as the saliency measure.
Bayesian models have the advantage of combining
different sensory evidence and prior constraints into a unified
A. Bottom-Up Models framework. Oliva et al. [43] proposed a Bayesian model
Based on the low-level visual characteristics used to deter- that determines the joint probability of the presence and
mine the saliency measure, the bottom-up attention models location of a pixel in the image given the observed features.
can be further divided into six categories: 1) contrast-based The bottom-up saliency in the model is the probability of
models [36]–[39]; 2) information theoretical models [40]–[42]; the local feature at a pixel given the global image features,
3) Bayesian models [43]–[45]; 4) graphical models [46]–[48]; which is similar to the self-information method in [40].
5) frequency domain-based methods [1], [49]–[52]; and Zhang et al. [45] proposed the saliency using natural
6) supervised-learning-based models [53], [54]. statistics (SUN) model, which defines saliency by considering
In general, the contrast-based models have been inspired what the visual system is trying to optimize when directing
by the feature integration theory [55], which states that attention. The proposed model is a Bayesian framework,
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

TAO et al.: MRMF FOR SALIENCY DETECTION 3

from which bottom-up saliency naturally emerges as the were playing video games. The supervised-learning-based
self-information of visual features, and the overall saliency methods learn the saliency mapping function from the data,
emerges as the pointwise mutual information between the with few a priori assumptions. However, the learned model
features and the target. The Bayesian models are very similar might be overfitted to the data used for training, and there-
to the information theoretical methods but have the additional fore, a large number of training examples must be provided.
advantage that they can easily incorporate the top-down In addition, it is possible to use deep-learning techniques [61]
information. to learn cortex-like neural networks to tackle attention
In most of these methods, the saliency values at different problems.
locations are considered independently. With graphical
models, dependence between saliencies over different spatial
and temporal locations can be considered, which results B. Top-Down Models
in more powerful models. Harel et al. [48] introduced Unlike bottom-up methods, which only consider the
graph-based visual saliency, in which Markov chains are low-level information from the images themselves, the top-
defined over various image maps and the equilibrium down models are driven by high-level tasks, e.g., searching
distribution over map locations are treated as saliency values. for particular objects or object classes. The top-down models
Avraham and Lindenbaum [47] introduced the extended can also incorporate bottom-up attention mechanisms to
saliency model, which uses a graphical model approximation process the data more efficiently. Gao and Vasconcelos [24]
to efficiently reveal the segments that are more likely to be used classification as the specific goal for saliency, and
salient. The model quantifies several intuitive observations, called the saliency discriminant saliency. In this approach,
such as the greater likelihood that visually similar image the regions that are more discriminant for classification
regions will correspond and only a few interesting objects will are deemed the salient regions, and saliency detection is
be present in the scene. In general, graphical models can model considered as a feature selection problem by maximizing the
complex dependence involved in the saliency computation mutual information between the features and the class labels.
process and can easily incorporate spatial and temporal Similarly, the SUN model [45] uses a Bayesian framework
constraints and complex priors. However, computational that combines bottom-up and top-down saliency and obtains
complexity is much higher than with the other methods. the overall saliency as the pointwise mutual information
Frequency domain-based methods model saliency in between local features and the search target’s features when
the frequency domain rather than the spatial domain. searching for a target. Li et al. [62] presented a Bayesian
Hou and Zhang [51] introduced the spectral residual model, multitask-learning framework for visual attention in video.
which assumes that the statistical singularities in the spectrum In this paper, bottom-up saliency modeled by multiscale
are responsible for anomalous regions in the image spatial wavelet decomposition was fused with different top-down
domain. The salient map is derived from the inverse Fourier components trained by a multitask-learning algorithm.
transform of the residual between the Fourier amplitude
spectrum of the down-sampled image and the smoothed
version of the spectrum. Yang et al. [33] and Guo et al. [49] III. M ANIFOLD R ANKING -BASED M ATRIX
incorporated the phase spectrum of the Fourier transform to FACTORIZATION
improve saliency predictions. Achanta et al. [52] introduced Given a point set V = {v1 , . . . , vq , vq+1 , . . . , vn } ⊂ Rm , the
a frequency-tuned approach for the saliency detection using first q points are the queries, and the remainder are the points
low-level features of color and luminance; the saliency map to be ranked. Let d : V × V → R denote a metric on V that
is computed as the difference between the algorithmic mean assigns each pair of points, e.g., vi and v j , a distance d(vi , v j ).
of the image feature vector and the Gaussian-blurred version Since the first q points are queries, we fix their ranking scores
of the original image. The Fourier domain-based methods to one. Let h : V → R denote a ranking function that assigns
provide new insights into saliency modeling and are generally each nonquery point vi a ranking score h i . We can view h as
fast to compute. n−q
a vector, i.e., h = [h q+1 , . . . , h n ]T ∈ R+ .
While most of the existing methods use criteria based In this paper, we propose the MRMF method to compute
on natural priors to define saliency, the supervised-learning- the ranking scores h. In particular, we concatenate all data
based methods attempt to learn the salient mapping function points into a matrix, i.e., V = [v1 , . . . , vn ], and decompose V
from labeled salient regions. Kienzle et al. [54] introduced an into the product of two lower rank matrices
support vector machine (SVM)-based method [59] to directly
learn attention from the human eye-tracking data. This model min V − W H 2F (1)
H ≥0,W
considers saliency detection as a binary classification problem
and is trained on the positive and negative samples, which are where W denotes the basis and H signifies the coefficients.
fixations and randomly sampled patches, respectively. Similar To incorporate both the label and the feature information, we
to [54], Judd et al. [53] trained a linear SVM from human divide H and W into two components
fixation data using a set of low-, mid-, and high-level image  
features. Peters and Itti [60] trained a simple regression classi- Hf
H = (2)
fier to capture the task-dependent association between a given 1T h T
scene, and the preferred locations to gaze at while subjects W = [W f , w] (3)
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

4 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

TABLE I
I MPORTANT N OTATIONS U SED IN T HIS PAPER AND T HEIR D ESCRIPTION

where W f represents the feature information, H f are the where L 11 ∈ Rq×q , and L 12 = L 21 , because L is symmetric.
coefficients, and w signifies the axis that represents the ranking According to (7), we have
information.
tr (H L H T ) = tr (H f L H Tf ) + tr (1T L 11 1) + 2tr (h T L 12 1)
Since the ranking scores of queries are equal to one, the
ranking scores of relevant points are expected to be close to + tr (h T L 22 h) (8)
one and those of irrelevant points close to zero. To this end, where H f and h can now be optimized separately.
a closed-loop adjacent graph is constructed following [27],
and the labels of queries are then propagated on this graph A. Optimize Hf
following the regularization theory. To achieve the closed-
loop graph, the image is segmented into superpixels using the Let R f = V − w[1T , h T ]. The objective function of (6) can
simple linear iterative cluster (SLIC) algorithm [46], and a be rewritten as

k-regular graph G is constructed to exploit the spatial rela- 1 
min F(H f ) = R f − W f H f 2F + γ tr(H f L H Tf ) (9)
tionship between the superpixels and the fact that neighboring H f ≥0 2
superpixels and superpixels sharing common boundaries are
and the gradient of F f with respect to H f can be calculated
likely to have a similar appearance and share saliency values.
as follows:
A manifold regularization is then incorporated over H and a
Tikhonov regularization over h into (1). ∇ F (H f ) = W Tf W f H f − W Tf R f + γ H f L. (10)
In summary, the objective of MRMF is
Equation (9) is usually minimized using the projected
1  gradient descent (PGD) method. PGD advances the search
min V − W H 2F + γ tr (H L H T ) + βh22 (4)
H ≥0,W 2 point based on the previous one and performs step-by-step iter-
where W = [W f w], L is the graph Laplacian of G, γ > 0 ation toward the optimum. However, PGD is slow to converge,
Hf because it can become trapped in local solutions. To overcome
and β > 0 are the tradeoff parameters, and H = . this limitation, Nesterov [35] proposed the optimal gradient
1T h T
By minimizing tr (H L H T ), MRMF preserves the local method (OGM) for convex optimization. Since the F(H f )
smoothness of graph G, i.e., two superpixels close on the is convex and the three terms in ∇ F (H f ) are Lipschitz
graph are likely to have similar saliency values. This enforces continuous (with Lipschitz constants W Tf W f 2 , 0, and L2 ,
the ranking scores of relevant superpixels to be close to one. respectively), the Nesterov method can naturally be adopted
By minimizing h22 , MRMF enforces the ranking scores of to optimize H f .
irrelevant superpixels to be close to zero. OGM advances the search point based on the auxiliary
Since (4) is nonconvex, we usually alternately optimize point constructed by combing two neighboring search points.
W and H to search its local solution. W can intuitively be At the kth iteration, given two previous search points,
directly optimized using least squares i.e., H k−1
f and H kf , OGM constructs an auxiliary point, i.e., Y k ,
as follows:
W∗ = V H † (5) αk − 1 k−1
Y k = H kf + Hf (11)
αk+1
where † denotes the pseudoinverse operator. To optimize H ,
we rewrite the objective (4) as follows: where the combination coefficient αk is smartly chosen and
 carefully updated in each iteration [35]
1 

min F(H ) = V − W H 2F + βh22 + γ tr (H L H T ) .


H ≥0 2 1 + 4αk2 + 1
(6) αk+1 = (12)
2
Since H is a block matrix and contains a constant block, and
it is difficult to deal with the third term. We therefore partition
L into four blocks α0 = 1. (13)
 
L 11 L 12 Note that k ≥ 1 in (8) and Y 0 = H 0f when k = 0. At the
L= (7)
L 21 L 22 auxiliary point Y k , OGM constructs the proximal function
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

TAO et al.: MRMF FOR SALIENCY DETECTION 5

Algorithm 1 OGM-Based Algorithm for MRMF Let h ∗ denote the minimum of (11) then, according to [63],
Input: Data matrix: V ∈ Rm×n , graph Laplacian the K.K.T. conditions of (17) are

matrix: L ∈ Rn×n , ⎨ h∗ ≥ 0
Parameter: γ ∈ (0, ∞), Parameter: β∈ (0, ∞), ∇ f (h ∗ ) ≥ 0 (18)
Output: Basis matrix: W ∈ Rm×r , Coefficient matrix: ⎩
h ∗ ◦ ∇ f (h ∗ ) = 0
H ∈ Rr×n .
Step 1: Repeat where ∇ f (h) = hwT w− R2T w+γ L 21 1+γ L 22 h +βh signifies
Step 2: Optimize W with (5). the gradient of f at h. According to (18)
Step 3: Optimize H f with Nesterov’s Method.  −1  T 
h ∗ = + γ L 22 + w22 I + β I R2 w − γ L 21 1 (19)
Step 3.1: Repeat
Step 3.2: Construct an auxiliary point Y k by using (11); where + (·) projects variables to the positive orthant.
Step 3.3: Calculate H k+1
f by using (16); Based on the above discussions, MRMF is summarized
Step 3.4: Until {Convergence}. in Algorithm 1.
Step 4: Optimize h with (19). For convenience, Table I lists frequently used notations and
Step 5: Until {Convergence}. descriptions in this paper.
Return: Ranking scores are Hr .
C. MRMF for Saliency Detection
The procedure for MRMF-based saliency detection can
of F and advances the search point to the minimum of the
be summarized as two stages. In the first stage, MRMF
proximal function segments the input image into several nonoverlapping regions

(or superpixels) using the SLIC algorithm [46] and construct
H k+1
f = argmin P(Y k , H f ) = F(Y k ) +
 f (Y k ), H f − Y k
H f ≥0
a closed-loop graph by connecting neighboring superpixels
and superpixels sharing common image boundaries. Since
L k 2
+ H f − Y  F (14) superpixels on different image boundaries are dissimilar, this
2 method separately computes specific labeled maps for each
where L = W Tf W f 2 + γ L2 is the Lipschitz constant of the four sides and combines them to generate the initial
of ∇ F (H f ). Equation (14) is a constrained optimization saliency map. In particular, MRMF uses the superpixels
problem that can be solved using the Lagrange multiplier on each side of image as labeled background queries and
method. According to [63], H f k+1 satisfies the following computes the saliencies of the remaining superpixels based on
Karush–Kuhn–Tucker (K.K.T.) conditions: their relevance to those queries by ranking on the previously
 constructed graph. A saliency map is generated by multiplying
∇ P (H k+1
f ≥0 the four-labeled maps obtained. In the second stage, MRMF
H k+1
f ≥0 considers the labeled foreground superpixels as salient queries
  and computes the saliency of each superpixel based on its
∇ P H k+1
f ◦ H k+1
f =0 (15) relevance to foreground queries to produce the final map. The
architecture of the proposed MRMF-based saliency detection
where ∇ P (H k+1
f ) = ∇ F (Y k ) + L(H k+1f − Y k ) is the
k+1 technique is shown in Fig. 1.
gradient of P at H f , and ◦ denotes the Hadamard product.
Using (10), we have
IV. T HEORETICAL A NALYSIS
1
H k+1
f = + Y k
− ∇ F (Y k
) (16) Let us first analyze the reconstruction error of the learned
L bases and representations using (4). Let us replace the soft
where ∇ F (Y k ) is defined by (10) and + projects all the constraints of tr (H L H T ) and h2F with the hard constraints
negative entries to zeros. of tr (H L H T ) ≤ λ1 and h2F ≤ λ2 , where λ1 and λ2 are the
By iterating (11) and (16), OGM can obtain the optimal positive constants
solution of (9). According to [64], it is easy to verify that the 1
OGM converges optimally at the rate O((1/k 2 )). Here, we min V − W H 2F
W ≥0,H ≥0 2
omit the proof due to space constraints.
s.t. tr (H L H T ) ≤ λ1 , h2F ≤ λ2 . (20)
B. Optimize h We further restrict H to be in the unit ball of Rk×n , which
Similarly, let R denote R = V − W f H f , and partition it is easily achieved, because there is a tradeoff between the
into two parts, i.e., R = [R1 , R2 ], where R1 ∈ Rm×q . The bounds of the columns of W and the entries of H ; for example,
optimization problem with respect to h can be formulated as W H = WQQ−1 1 H and Q can be set to normalize H . The
follows: empirical reconstruction error with respect to basis W can,
1 therefore, be defined as
min f (h) = R2 − wh2F + βh22 + 2γ tr (h T L 12 1)
h≥0 2 1
 Rn (W ) = min V − W H 2F (21)
+ γ tr (h T L 22 h) . (17) n H ∈H
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

6 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Fig. 1. Architecture of the proposed MRMF-based saliency detection.

where H = {H : tr (H L H T ) ≤ λ1 , h2F ≤ λ2 , Hi  ≤ 1, the expected reconstruction error is upper bounded by the
and i = 1, . . . , n}. empirical reconstruction error and is a decreasing function
The expected reconstruction error with respect to basis W with respect to n, which indicates that the proposed model
is defined as can be extrapolated to score a point drawn from the same
distribution.
R(W ) = E v Rn . (22) Theorem 1: Let the columns of V be upper bounded
MRMF is convex with respect to either W or H, but not by r (that is, v ≤ r, v ∈ Rm ), the columns of W are upper
to both. However, given basis W, the representatives H are bounded by c, and H is in the unit ball of Rk×n . For any
fixed due to convexity. Thus, the reconstruction error can W and H learned by (4) and any δ > 0, with probability at
be analyzed by discussing the choice of basis W. The loss least 1 − δ, we have
function class can be defined as √
⎧ ⎫ 1 1 4 π C(c, λ1 , λ2 )r k
E v V − W H  F ≤ V − W H  F +
2 2

⎨ k ⎬ n n n
FW = f W (v) = min v − W j Hi j 2 : W ∈ Rm×k . (23)   
⎩ ⎭ √ 2
H ∈H
j =1 2 π C (c, λ1 , λ2 )k 2 ln 1δ
+ √ + r2
n 2n
Then, we have
where C(c, λ1 , λ2 ) is a constant depending only on the con-
1
n
Rn (W ) = f W (vi ) (24) stants of c, λ1 , and λ2 .
n The following theorem plays an important role in proving
i=1
Theorem 1.
and
Theorem 2 [65]: Let F be an [0, a]-valued function class
R(W ) = E v f W (v). (25) on Rm , and V = {v1 , . . . , vn } ⊂ Rm×n . For any δ > 0, with
probability at least 1 − δ, we have
We next discuss the reconstruction error bound of MRMF.   
 
According to (4), if a point shares the same bases of 1
n
ln 1δ
 f and r with queries and, furthermore, the new represen- sup E v f (v) − f (vi ) ≤ R(F) + a
f ∈F n 2n
tation of the point is a small distance away from those of i=1
the queries, such a point should have a high ranking score. where R(F) denotes the Rademacher complexity
When the basis W is given, H will be fixed due to convexity.
2
n
As a result, the proposed MR crucially depends on the learned
basis W. Bases that fit the query points well and the other R(F) = E σ sup σi f (vi )
f ∈F n
points in the set V are needed. Assume the point set v1 , . . . , vn i=1

are independent identically distributed We can prove that and σ1 , . . . , σn are independent Rademacher variables.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

TAO et al.: MRMF FOR SALIENCY DETECTION 7

To provide a proof sketch to Theorem 2, the following where G(F) denotes the Gaussian complexity of function
McDiarmid’s inequality is needed. class F and
Theorem 3 (McDiarmid’s Inequality): Let V = {v1 , . . . , vn }
1
n
be a sample set of independent random variables. If there exists G(F) = E γ sup γi f (vi )
a > 0, such that the following condition satisfies: f ∈F n
i=1
| f (V) − f (V )| ≤ a ∀i ∈ {1, . . . , n}
i
where γ1 , . . . , γn are independent standard normal variables.
where Vi represents the sample set V with the i th example Next, we are going to upper bound R(FW ). For any
replaced by an independent one. Then, for any > 0, the W learnt by (3), we have
following inequality holds: √ √
  4 πcr k 2 π c2 k 2
1
n
−2n 2 R(FW ) ≤ √ + √ .
Pr E v f (v) − f (vi ) ≥ ≤ exp n n
n a2
i=1 Proof of Lemma 1: Let
where Pr{A} denotes the probability that event A occurs. 
Proof Sketch of Theorem 2: Let W = γi min vi − W Hi 2
  Hi
1
n i
(V) = sup E v f (v) − f (vi ) and
f W ∈FW n
i=1
√  √ 
and W = 8 γis
vi , W es + 2 γils
W el , W es ,

−2n 2 is ils
exp = δ. l, s ∈ {1, . . . , k}
a2
It can be proven that where e1 , . . . , ek are the natural bases. We have
a
| (V) − (V )| ≤ . i
n E(W1 − W2 )2

Using McDiarmid’s inequality, we have, with probability at = (min vi − W1 Hi 2 − min vi − W2 Hi 2 )2
least 1 − δ i
Hi Hi
   
ln 1δ ≤ (max vi − W1 Hi 2 − vi − W2 Hi 2 )2
Hi
(V) ≤ E V (V) + a . i
2n ⎛  2
 
It can also be proven that ≤ ⎝8 max His
vi , (W1 − W2 )es
Hi
E V (V) ≤ R(F). i s
 2 ⎞
Hence, with probability at 1 − δ, we have 
 +2 max Hls Hil
el , (W1 W2 − W2 W1 )es ⎠
  Hi
1  n
ln( 1δ ) ls
sup E v f (v) − f (vi ) ≤ R(F) + a .
f ∈F n 2n For simplicity, we use H ∈ {H : Hi j ≤ 1, i = 1, . . . , k,
i=1
j = 1, . . . , n.} instead of H ∈ H
To prove Theorem 1, we first need to upper bound the 
Rademacher complexity R(FW ). However, this is difficult, ≤8 (
vi , W1 es −
vi , W2 es )2
because the reconstruction error function f W (v) has a is

minimum operation. We will, therefore, use the following +2 (
W1 el , W1 es −
W1 el , W2 es )2
two lemmas to upper bound R(FW ) by finding a proper ils
Gaussian process. = E( W1 − W2 )2 .
Lemma 1 (Slepian’s Lemma): Let  and be mean zero,
separable Gaussian processes indexed by a common set S, The second inequality holds (a + b)2 ≤ 2a 2 + 2b2 . Note
such that that, for simplicity, we have used H ∈ {H : Hi j ≤ 1,
i = 1, . . . , k, j = 1, . . . , n.} instead of H ∈ H to find a
E(s1 − s2 )2 ≤ E( s1 − s2 )2 ∀s1 , s2 ∈ S.
Gaussian process. However, according to the hard constraints
Then that tr (H L H T ) ≤ λ1 and h2F ≤ λ2 in (4), we should have
used the condition Hi j ≤ C(λ1 , λ2 ) ≤ 1, where C(λ1 , λ2 ) is
E sup s ≤ E sup s .
s∈S s∈S a constant depending on the constants of λ1 and λ2 . Since
there is a tradeoff between the bounds of columns of W and
Lemma 2 [66]: The Gaussian complexity is related to the
the entries of H , we can still use H ∈ {H : Hi j ≤ 1,
Rademacher complexity as follows:
 i = 1, . . . , k, j = 1, . . . , n.} by allocating a proper upper
π bound for the columns of W , such that Wi  ≤ C(c, λ1 , λ2 ),
R(F) ≤ G(F)
2 where C(c, λ1 , λ2 ) is a constant depending on c, λ1 , and λ2 .
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

8 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Fig. 2. Example images in the MSRA-1000 data set. Top: color image. Bottom: corresponding hand-annotation image.

Using Slepian’s Lemma, we have Since the loss function f (v) ≤ v2 ≤ r 2 , using Theorem 2,
we have
E sup W ≤ E sup W  
1
W n
√ W 
≤ 8E sup γis
vi , W es sup E v f W (v) − f W (vi )
W f W ∈FW n
i=1
√ is
   
√ √
+ 2E sup γils
W el , W es 4 πC(c, λ1 , λ2 )r k 2 π C 2 (c, λ1 , λ2 )k 2 ln 1δ
W ≤ √ + √ + r2 .
ils   n n 2n
√  


≤ 8C(c, λ1 , λ2 )E  γis vi  Thus, we have
 
s i  1 1
√ 2     E v V − W H 2F ≤ V − W H 2F
 n n √
+ 2C (c, λ1 , λ2 )E  γils 
  4 πC(c, λ1 , λ2 )r k
√ √
l,s
√ 2
i
√ + √
n
≤ 8C(c, λ1 , λ2 )r k n + 2C (c, λ1 , λ2 )k 2 n.   

The third inequality holds, since the Cauchy–Schwartz 2 πC 2 (c, λ1 , λ2 )k 2 ln 1δ
+ √ + r2 .
inequality is used, and the last inequality holds, since Jansen’s n 2n
inequality is used and the Gaussian variables are orthogonal. This concludes the proof.
Using Lemma 2, we have
R(FW ) V. E XPERIMENTAL R ESULTS

2π √ √ √ √  We next conducted saliency detection experiments on
≤ 8C(c, λ1 , λ2 )r k n + 2C 2 (c, λ1 , λ2 )k 2 n
√n √ four widely used data sets to validate MRMF, namely
4 πC(c, λ1 , λ2 )r k 2 πC 2 (c, λ1 , λ2 )k 2 Microsoft Research Asia (MSRA)-1000 [1], Dalian
= √ + √ .
n n University of Technology - OMRON (DUT-OMRON) [27],
This concludes the proof. Complex Scene Saliency Dataset (CSSD) [67], and
We now can prove Theorem 1 as follows. extended complex scene saliency dataset (ECSSD) [68].
Proof of Theorem 1: MSRA-1000 contains 1000 images and is a subset of
We have proven that MSRA [69]; example images are shown in Fig. 2. The
√ √ DUT-OMRON data set was collected by Yang et al. [27]
4 πC(c, λ1 , λ2 )r k 2 πC 2 (c, λ1 , λ2 )k 2
R(FW ) ≤ √ + √ . and contains 5168 images. The CSSD data set contains
n n 200 scene images collected from BSD300, the visual object
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

TAO et al.: MRMF FOR SALIENCY DETECTION 9

Fig. 3. Experiments on the MSRA-1000 data set. (a)The precision-recall curves of all seven methods. (b) The precision, recall, and F-measure using an
adaptive threshold [1] for each of the datasets

Fig. 4. Experiments on the DUT-OMRON data set. (a)The precision-recall curves of all seven methods. (b) The precision, recall, and F-measure using an
adaptive threshold [1] for each of the datasets.

TABLE II thus normalized scanpath saliency (NSS) is used to compare


P ERFORMANCE C OMPARISONS OF S IX M ETHODS six methods in Table II. Note that NSS can describe the
ON DUT-OMRON D ATA S ET deviation of predicted fixation patterns from the actual fixation
map.
The details of experimental setup are shown in the
following.

A. Baseline Methods
classes data set, and the Internet. The ECSSD data set The effectiveness of MRMF was validated against
further extends CSSD to 1000 images and contains many six other representative algorithms, namely MR [27],
semantically meaningful, but structurally complex, images. Frequency-tuned (FT) [1], spectral residual (SR) [51],
Performance was evaluated by assessing precision and boolean map (BM) [70], and regional contrast
recall. Precision is the ratio of correctly detected salient pixels (RC) [71]. Each of these methods has its own
to the total detected salient pixels, while recall is the ratio of merits and limitations. MR and MRMF are based on
correctly detected salient pixels to the ground truth salient background priors, FT and SR are the frequency domain-
pixels. In addition, we introduced the F-measure to evaluate based methods, while BM computes saliency maps by
the overall performance. The F-measure value is defined as analyzing the topological structure of Boolean maps. RC is
the weighted harmonic mean of precision value and recall a regional contrast-based salient object extraction algorithm
that simultaneously evaluates global contrast differences and
(1 + α)Precision × Recall
Fα = (26) spatially weighted coherence scores. The original parameter
αPrecision + Recall settings for methods of comparison were used in each
where α = 0.3 following [1] and [27]. In addition, the case [71]. For MRMF, we have some important parameters
DUT-OMRON data set also provide the eye fixations, and to set, such as β and γ. For fair comparison, we randomly
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

10 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Fig. 5. Experiments on the CSSD data set. (a)The precision-recall curves of all seven methods. (b) The precision, recall, and F-measure using an adaptive
threshold [1] for each of the datasets.

Fig. 6. Experiments on the ECSSD data set. (a)The precision-recall curves of all seven methods. (b) The precision, recall, and F-measure using an adaptive
threshold [1] for each of the datasets.

Fig. 7. Some results from the evaluated methods.

selected 500 samples from an external data set to form the are shown in Figs. 3(a), 4(a), 5(a), and 6(a), respectively.
training set. The precision, recall, and F-measure using an adaptive
threshold [1] for each of the data sets are shown
B. Experimental Results and Analysis in Figs. 3(b), 4(b), 5(b), and 6(b). Overall, MRMF outperforms
The precision–recall curves of all seven methods on the the others in terms of precision, recall, and NSS. In addition,
MSRA-1000, DUT-OMRON, CSSD, and ECSSD data sets some results from the evaluated methods are shown in Fig. 7.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

TAO et al.: MRMF FOR SALIENCY DETECTION 11

It can be seen that MRMF has superior performance that [14] V. Mahadevan and N. Vasconcelos, “Spatiotemporal saliency in dynamic
preserves the object boundaries and highlights salient scenes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1,
pp. 171–177, Jan. 2010.
pixels. [15] S. Mitri, S. Frintrop, K. Pervolz, H. Surmann, and A. Nuchter, “Robust
object detection at regions of interest with an application in ball recog-
nition,” in Proc. Int. Conf. Robot. Antom., Barcelona, Spain, Apr. 2005,
VI. C ONCLUSION pp. 125–130.
Here, we propose the MRMF method for saliency detection. [16] X. Xu, I. W. Tsang, and D. Xu, “Soft margin multiple kernel learning,”
MRMF incorporates three types of information in a matrix IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 749–761,
May 2013.
factorization framework: 1) local spatial relationships;
[17] J. Han et al., “Representing and retrieving video shots in human-centric
2) labeled background queries; and 3) features from each brain imaging space,” IEEE Trans. Image Process., vol. 22, no. 7,
region extracted from the input image. Since MRMF utilizes pp. 2723–2736, Jul. 2013.
the feature information from image regions, more accurate [18] J. Han, D. Wang, L. Shao, X. Qian, G. Cheng, and J. Han, “Image
visual attention computation and application via the learning of object
ranking scores can be learned than using MR alone, and attributes,” Mach. Vis. Appl., vol. 25, no. 7, pp. 1671–1683, 2014.
the saliency detection is improved. Experimental results on [19] T. Kadir and M. Brady, “Saliency, scale and image description,” Int.
popular data sets illustrate the promise of MRMF. In the future, J. Comput. Vis., vol. 45, no. 2, pp. 83–105, 2001.
[20] J. Li, D. Xu, and W. Gao, “Removing label ambiguity in learning-based
we will apply the proposed MRMF to other applications, visual saliency estimation,” IEEE Trans. Image Process., vol. 21, no. 4,
e.g., visual tracking, image retrieval, and image classification. pp. 1513–1525, Apr. 2012.
In addition, the MRMF relies on the OGM that is an iterative [21] H. E. Egeth and S. Yantis, “VISUAL ATTENTION: Control, representa-
tion, and time course,” Annu. Rev. Psychol., vol. 48, no. 1, pp. 269–297,
optimization procedure. Thus, it is necessary that considering 1997.
parallelized MRMF for real applications. [22] L. Itti, G. Rees, and J. K. Tsotsos, Eds., Neurobiology of Attention.
San Diego, CA, USA: Elsevier, 2005.
[23] J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, and F. Wu, “Background
ACKNOWLEDGMENT prior based salient object detection via deep reconstruction residual,”
We would like to thank all anonymous reviewers for IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 8, pp. 1309–
1321, Aug. 2015.
their valuable suggestions and Mr. Tongliang Liu for helping [24] D. Gao and N. Vasconcelos, “Bottom-up saliency is a discriminant
improve the theoretical study of this paper. process,” in Proc. IEEE Int. Conf. Comput. Vis., Rio de Janeiro, Brazil,
Oct. 2007, pp. 1–6.
[25] D. Gao and N. Vasconcelos, “Integrated learning of saliency, complex
R EFERENCES features, and object detectors from cluttered scenes,” in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., San Diego, CA, USA, Jun. 2005,
[1] Q. Ma, L. Zhang, and B. Wang, “New strategy for image and
pp. 282–287.
video quality assessment,” J. Electron. Imag., vol. 19, no. 1,
[26] Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using
pp. 011019-1–011019-14, 2010.
background priors,” in Proc. Eur. Conf. Comput. Vis., Florence, Italy,
[2] A. Ninassi, O. Le Meur, P. Le Callet, and D. Barbba, “Does where you
Oct. 2012, pp. 29–42.
gaze on an image affect your perception of quality? Applying visual
attention to image quality metric,” in Proc. Int. Conf. Image Process., [27] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection
vol. 2. San Antonio, TX, USA, Sep. 2007, pp. 169–172. via graph-based manifold ranking,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit., Portland, OR, USA, Jun. 2013, pp. 3166–3173.
[3] D. Gao and N. Vasconcelos, “Discriminant saliency for visual recog-
nition from cluttered scenes,” in Proc. Adv. Neural Inf. Process. Syst., [28] J. Yu, Y. Rui, and D. Tao, “Click prediction for Web image reranking
Vancouver, BC, Canada, Dec. 2004, pp. 481–488. using multimodal sparse coding,” IEEE Trans. Image Process., vol. 23,
[4] C. Siagian and L. Itti, “Biologically inspired mobile robot vision local- no. 5, pp. 2019–2032, May 2014.
ization,” IEEE Trans. Robot., vol. 25, no. 4, pp. 861–873, Aug. 2009. [29] J. Yu, Y. Rui, and B. Chen, “Exploiting click constraints and multi-view
[5] D. Walther and C. Koch, “Modeling attention to salient proto-objects,” features for image re-ranking,” IEEE Trans. Multimedia, vol. 16, no. 1,
Neural Netw., vol. 19, no. 9, pp. 1395–1407, 2006. pp. 159–168, Jan. 2014.
[6] S. Frintrop and P. Jensfelt, “Attentional landmarks and active gaze [30] J. Yu, M. Wang, and D. Tao, “Semisupervised multiview distance metric
control for visual SLAM,” IEEE Trans. Robot., vol. 24, no. 5, learning for cartoon synthesis,” IEEE Trans. Image Process., vol. 21,
pp. 1054–1065, Oct. 2008. no. 11, pp. 4636–4648, Nov. 2012.
[7] J. Han, L. Shao, D. Xu, and J. Shotton, “Enhanced computer vision [31] J. Yu, D. Tao, M. Wang, and Y. Rui, “Learning to rank using user clicks
with Microsoft Kinect sensor: A review,” IEEE Trans. Cybern., vol. 43, and visual features for image retrieval,” IEEE Trans. Cybern., vol. 45,
no. 5, pp. 1318–1334, Oct. 2013. no. 4, pp. 767–779, Apr. 2015.
[8] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, “Object detection in [32] Y. Hu, D. Zhang, J. Ye, X. Li, and X. He, “Fast and accurate matrix
optical remote sensing images based on weakly supervised learning and completion via truncated nuclear norm regularization,” IEEE Trans.
high-level feature learning,” IEEE Trans. Geosci. Remote Sens., vol. 53, Pattern Anal. Mach. Intell., vol. 35, no. 9, pp. 2117–2130, Sep. 2013.
no. 6, pp. 3325–3337, Jun. 2015. [33] S. Yang, Z. Yi, M. Ye, and X. He, “Convergence analysis of graph
[9] F. Zhu and L. Shao, “Weakly-supervised cross-domain dictionary learn- regularized non-negative matrix factorization,” IEEE Trans. Knowl. Data
ing for visual recognition,” Int. J. Comput. Vis., vol. 109, nos. 1–2, Eng., vol. 26, no. 9, pp. 2151–2165, Sep. 2014.
pp. 42–59, Aug. 2014. [34] D. Cai and X. He, “Manifold adaptive experimental design for text cate-
[10] J. Han, S. He, X. Qian, D. Wang, L. Guo, and T. Liu, “An object- gorization,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 4, pp. 707–719,
oriented visual saliency detection framework based on sparse coding Apr. 2012.
representations,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, [35] Y. Nesterov, “A method of solving a convex programming problem
no. 12, pp. 2009–2021, Dec. 2013. with convergence rate O(1/k 2 ),” Soviet Math. Doklady, vol. 27, no. 2,
[11] D. Tao, X. Lin, L. Jin, and X. Li, “Principal component 2- pp. 372–376, 1983.
dimensional long short-term memory for font recognition on sin- [36] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual
gle chinese characters,” IEEE Trans. Cybern., Mar. 2015, doi: attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach.
10.1109/TCYB.2015.2414920. Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998.
[12] N. Sadaka and L. J. Karam, “Efficient perceptual attentive super- [37] G. Kootstra, A. Nederveen, and B. de Boer, “Paying attention to
resolution,” in Proc. Int. Conf. Image Process., Cairo, Egypt, Nov. 2009, symmetry,” in Proc. Brit. Mach. Vis. Conf., Sep. 2008, pp. 1115–1125.
pp. 3113–3116. [38] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau, “A coherent
[13] L. I. Kuncheva and W. J. Faithfull, “PCA feature extraction for change computational approach to model bottom-up visual attention,” IEEE
detection in multidimensional unlabeled data,” IEEE Trans. Neural Netw. Trans. Pattern Anal. Mach. Intell., vol. 28, no. 5, pp. 802–817,
Learn. Syst., vol. 25, no. 1, pp. 69–80, Jan. 2014. May 2006.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

12 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

[39] O. Le Meur, P. Le Callet, and D. Barba, “Predicting visual fixations [64] N. Guan, D. Tao, Z. Luo, and B. Yuan, “NeNMF: An optimal gradi-
on video based on low-level visual features,” Vis. Res., vol. 47, no. 19, ent method for nonnegative matrix factorization,” IEEE Trans. Signal
pp. 2483–2498, 2007. Process., vol. 60, no. 6, pp. 2882–2898, Jun. 2012.
[40] N. D. B. Bruce and J. K. Tsotsos, “Saliency based on information [65] P. L. Bartlett and S. Mendelson, “Rademacher and Gaussian complex-
maximization,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, ities: Risk bounds and structural results,” J. Mach. Learn. Res., vol. 3,
Canada, Jun. 2005, pp. 155–162. pp. 463–482, Mar. 2003.
[41] X. Hou and L. Zhang, “Dynamic visual attention: Searching for coding [66] M. Ledoux and M. Talagrand, Probability in Banach Spaces:
length increments,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, Isoperimetry and Processes. New York, NY, USA: Springer-Verlag,
BC, Canada, Dec. 2008, pp. 681–688. 1991.
[42] M. Mancas, “Computational attention: Modelisation and application [67] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in
to audio and image processing,” Ph.D. dissertation, Numediart Inst. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Portland, OR, USA,
Creative Technol., Univ. Mons, Mons, Belgium, 2007. Jun. 2013, pp. 1155–1162.
[43] A. Oliva, A. Torralba, M. S. Castelhano, and J. M. Henderson, “Top- [68] Q. Yan, J. Shi, L. Xu, and J. Jia. (Aug. 2014). “Hierarchical
down control of visual attention in object detection,” in Proc. Int. Conf. saliency detection on extended CSSD.” [Online]. Available: http://arxiv.
Image Process., Barcelona, Spain, Sep. 2003, pp. I-253–I-256. org/abs/1408.5418
[44] A. Torralba, “Modeling global scene factors in attention,” J. Opt. Soc. [69] T. Liu et al., “Learning to detect a salient object,” IEEE Trans. Pattern
Amer., vol. 20, no. 7, pp. 1407–1418, 2003. Anal. Mach. Intell., vol. 33, no. 2, pp. 353–367, Feb. 2011.
[45] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell, “SUN: [70] J. Zhang and S. Sclaroff, “Saliency detection: A Boolean map approach,”
A Bayesian framework for saliency using natural statistics,” J. Vis., in Proc. IEEE Int. Conf. Comput. Vis., Sydney, Australia, Dec. 2013,
vol. 8, no. 7, pp. 1–20, 2008. pp. 153–160.
[46] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, [71] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu,
“SLIC superpixels compared to state-of-the-art superpixel methods,” “Global contrast based salient region detection,” in Proc. IEEE Conf.
IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Comput. Vis. Pattern Recognit., Providence, RI, USA, Jun. 2011,
Nov. 2012. pp. 409–416.
[47] T. Avraham and M. Lindenbaum, “Esaliency (extended saliency): Mean-
ingful attention using stochastic image modeling,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 32, no. 4, pp. 693–708, Apr. 2010.
[48] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in Proc.
Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2006,
pp. 545–552.
[49] C. Guo, Q. Ma, and L. Zhang, “Spatio-temporal saliency detection using
phase spectrum of quaternion Fourier transform,” in Proc. IEEE Int.
Conf. Comput. Vis. Pattern Recognit., Anchorage, AK, USA, Jun. 2008,
pp. 1–8.
[50] C. Guo and L. Zhang, “A novel multiresolution spatiotemporal saliency
detection model and its applications in image and video compression,”
IEEE Trans. Image Process., vol. 19, no. 1, pp. 185–198,
Jan. 2010. Dapeng Tao received the B.E. degree from
[51] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” Northwestern Polytechnical University, Xi’an,
in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Minneapolis, China, and the Ph.D. degree from the South China
MN, USA, Jun. 2007, pp. 1–8. University of Technology, Guangzhou, China.
[52] R. Achanta, S. Hemami, F. J. Estrada, and S. Süsstrunk, “Frequency- He is currently with the School of Information
tuned salient region detection,” in Proc. IEEE Int. Conf. Comput. Vis. Science and Engineering, Yunnan University,
Pattern Recognit., Miami, FL, USA, Jun. 2009, pp. 1597–1604. Kunming, China, as an Engineer. He has authored
[53] T. Judd, K. Ehinger, F. Durand, and A. Torralba, “Learning to predict or co-authored over 30 scientific articles. His
where humans look,” in Proc. IEEE Int. Conf. Comput. Vis., Kyoto, current research interests include machine learning,
Japan, Sep. 2009, pp. 2106–2113. computer vision, and cloud computing.
[54] W. Kienzle, M. O. Franz, B. Scholkopf, and F. A. Wichmann, “Center- Dr. Tao has served more than ten international
surround patterns emerge as optimal predictors for human saccade journals, including the IEEE T RANSACTIONS ON N EURAL N ETWORKS AND
targets,” J. Vis., vol. 9, pp. 1–15, May 2009. L EARNING S YSTEMS , the IEEE T RANSACTIONS ON M ULTIMEDIA, the
[55] A. M. Treisman and G. Gelade, “A feature-integration theory of atten- IEEE S IGNAL P ROCESSING L ETTERS , and PLOS-ONE.
tion,” Cognit. Psychol., vol. 12, no. 1, pp. 97–136, 1980.
[56] C. Koch and S. Ullman, “Shifts in selective visual attention: Towards
the underlying neural circuitry,” Human Neurobiol., vol. 4, no. 4,
pp. 219–227, 1985.
[57] L. Itti, “Automatic foveation for video compression using a neurobio-
logical model of visual attention,” IEEE Trans. Image Process., vol. 13,
no. 10, pp. 1304–1318, Oct. 2004.
[58] H. J. Seo and P. Milanfar, “Static and space-time visual saliency
detection by self-resemblance,” J. Vis., vol. 9, no. 12, pp. 1–27,
2009.
[59] V. Bloom, I. Griva, B. Kwon, and A.-R. Wolff, “Exterior-point method
for support vector machines,” IEEE Trans. Neural Netw. Learn. Syst.,
vol. 25, no. 7, pp. 1390–1393, Jul. 2014.
[60] R. J. Peters and L. Itti, “Beyond bottom-up: Incorporating task- Jun Cheng received the B.Eng. and M.Eng. degrees
dependent influences into a computational model of spatial attention,” from the University of Science and Technology of
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Minneapolis, MN, China, Hefei, China, in 1999 and 2002, respectively,
USA, Jun. 2007, pp. 1–8. and the Ph.D. degree from the Chinese University of
[61] L. Shao, D. Wu, and X. Li, “Learning deep and wide: A spectral method Hong Kong, Hong Kong, in 2006.
for learning deep networks,” IEEE Trans. Neural Netw. Learn. Syst., He is currently with the Shenzhen Institutes
vol. 25, no. 12, pp. 2303–2308, Dec. 2014. of Advanced Technology, Chinese Academy of
[62] J. Li, Y. Tian, T. Huang, and W. Gao, “Probabilistic multi-task learning Sciences, Shenzhen, China, as a Professor, and the
for visual saliency estimation in video,” Int. J. Comput. Vis., vol. 90, Director of the Laboratory for Human Machine
no. 2, pp. 150–165, 2010. Control. His current research interests include com-
[63] D. P. Bertsekas, Nonlinear Programming, 2nd ed. Belmont, MA, USA: puter vision, robotics, machine intelligence, and
Athena Scientific, 1999. control.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

TAO et al.: MRMF FOR SALIENCY DETECTION 13

Mingli Song (M’06–SM’13) received the Xu Lin received the B.Eng. degree from the South
B.Eng. degree from Northwestern Polytechnical China University of Technology, Guangzhou, China.
University, Xi’an, China, and the Ph.D. degree He is currently with the Shenzhen Institutes
in computer science and technology from the of Advanced Technology, Chinese Academy of
College of Computer Science, Zhejiang University, Sciences, Shenzhen, China, as a Visiting Student
Hangzhou, China. of the Laboratory for Human Machine Control. His
He is currently a Professor with the Microsoft current research interests include machine learning
Visual Perception Laboratory and the Zhejiang and computer vision.
Provincial Key Laboratory of Service Robot,
Zhejiang University. He has authored or
co-authored over 90 scientific articles at top
venues, including the IEEE T RANSACTIONS ON PATTERN A NALYSIS
AND M ACHINE I NTELLIGENCE, the IEEE T RANSACTIONS ON I MAGE
P ROCESSING, the IEEE T RANSACTIONS ON M ULTIMEDIA, the IEEE
T RANSACTIONS ON S YSTEMS , M AN , AND C YBERNETICS —PART B:
C YBERNETICS , Information Sciences, Pattern Recognition, the Computer
Vision and Pattern Recognition Conference, the European Conference on
Computer Vision, and the ACM Multimedia (MM) Conference. His current
research interests include computational vision and computer graphics, and
applications of machine learning in vision and graphics.
Prof. Song is a Professional Member of the Association for Computing
Machinery. He received the Microsoft Research Fellowship in 2004. He is
an Associate Editor of Information Sciences and Neurocomputing, and an
Editorial Advisory Board Member of Recent Patent on Signal Processing.
He has served at more than ten major international conferences, including the
IEEE International Conference on Data Mining, ACM MM, the International
Conference on Image Processing, the International Conference on Acoustics,
Speech and Signal Processing, the International Conference on Multimedia &
Expo, the Pacific-Rim Conference on Multimedia, the Pacific-Rim Symposium
on Image and Video Technology, and the Computer Analysis of Images and
Patterns, and more than ten prestigious international journals, including the
IEEE T RANSACTIONS ON I MAGE P ROCESSING, the IEEE T RANSACTIONS
ON V ISUALIZATION AND C OMPUTER G RAPHICS , the IEEE T RANSACTIONS
ON K NOWLEDGE AND D ATA E NGINEERING, the IEEE T RANSACTIONS ON
M ULTIMEDIA, the IEEE T RANSACTIONS ON C IRCUITS AND S YSTEMS FOR
V IDEO T ECHNOLOGY, and the IEEE T RANSACTIONS ON S YSTEMS , M AN ,
AND C YBERNETICS —PART B: C YBERNETICS .

You might also like