
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 22, NO. 2, FEBRUARY 2013


Linear Distance Coding for Image Classification


Zilei Wang, Jiashi Feng, Shuicheng Yan, Senior Member, IEEE, and Hongsheng Xi

Abstract: The feature coding-pooling framework has been shown to perform well in image classification tasks because it generates discriminative and robust image representations. However, the unavoidable information loss incurred by feature quantization in the coding process and the undesired dependence of pooling on the image spatial layout may severely limit classification performance. In this paper, we propose a linear distance coding (LDC) method to capture the discriminative information lost in traditional coding methods while simultaneously alleviating the dependence of pooling on the image spatial layout. The core of LDC lies in transforming the local features of an image into more discriminative distance vectors, where the robust image-to-class distance is employed. These distance vectors are further encoded into sparse codes to capture the salient features of the image. LDC is theoretically and experimentally shown to be complementary to the traditional coding methods, and thus their combination can achieve higher classification accuracy. We demonstrate the effectiveness of LDC on six data sets, two for each of three types (specific object, scene, and general object): Flower 102 and PFID 61, Scene 15 and Indoor 67, and Caltech 101 and Caltech 256. The results show that our method generally outperforms the traditional coding methods and achieves performance that matches or is comparable to the state of the art on these data sets.

Index Terms: Image classification, image-to-class distance, linear distance coding (LDC).

I. INTRODUCTION

Generating compact, discriminative, and robust image representations is undoubtedly critical to image classification [1], [2]. Recently, several local features, e.g., SIFT [3] and HOG [4], have become popular for representing images due to their ability to capture distinctive details of the images. However, local features are rarely fed directly into image classifiers because of the computational complexity and their sensitivity to noise. A common strategy is to first integrate the local features into a global image representation. To this end, various methods [1], [2], [5], [6] have been proposed,
Manuscript received February 16, 2012; revised August 30, 2012; accepted August 30, 2012. Date of publication September 13, 2012; date of current version January 10, 2013. This work was supported in part by the National Natural Science Foundation of China under Grant 61203256 and the Singapore Ministry of Education under Grant MOE2010-T2-1-087. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Erhardt Barth. Z. Wang is with the Department of Automation, University of Science and Technology of China (USTC), Hefei 230027, China, and also with the Department of Electrical and Computer Engineering, National University of Singapore, 117576 Singapore (e-mail: zlwang@ustc.edu.cn). J. Feng and S. Yan are with the Department of Electrical and Computer Engineering, National University of Singapore, 117576 Singapore (e-mail: a0066331@nus.edu.sg; eleyans@nus.edu.sg). H. Xi is with the School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China (e-mail: xihs@ustc.edu.cn). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2012.2218826

among which the Bag of Words (BoW) based ones [1], [2], [5] present outstanding simplicity and effectiveness. A BoW image representation is typically generated via the following three steps: 1) extract local features of an image at the interest points; 2) generate a dictionary/codebook and then quantize/encode the local features into codes accordingly; and 3) pool all the codes together to generate the global image representation. Such a process can be summarized as a feature extraction-coding-pooling pipeline. It has been widely used in recent image classification methods and achieves impressive performance [1], [2], [7].

Within the above framework, the coding process inevitably introduces information loss due to the feature quantization. Such undesirable information loss severely damages the discriminative power of the generated image representation and thus decreases the image classification performance. Therefore, various coding methods have been proposed to encode local features more accurately with less information loss. Most of these methods are developed from Vector Quantization (VQ), which conducts hard assignment in the coding process [5]. In spite of its great simplicity, its inherently large coding error¹ often leads to unrecoverable loss of discriminative information and severely limits the classification performance [8]. To alleviate this issue, several refinements have been proposed. For example, soft-assignment [6], [9], [10] estimates the memberships of each local feature to multiple visual words instead of a single one. Another modified method is Super Vector (SV) coding [11], which additionally incorporates the difference between the local feature and the selected visual word. Thus SV captures higher-order information and shows improved performance.

Though many coding methods [1], [2], [10], [11] have been proposed to represent the input features accurately, the information loss in the feature quantization for coding is still inevitable. In fact, Boiman et al.
[8] have pointed out that local features following a long-tail distribution are inherently inappropriate for quantization, and that the information lost in feature quantization is quite important for good image classification performance. To tackle this issue, the Naive Bayes Nearest Neighbor (NBNN) method was proposed, which avoids the feature coding process by employing the image-to-class distance for image classification [8]. Benefiting from the reduced information loss, NBNN achieves classification performance competitive with coding-based methods on multiple datasets. Motivated by its success, several methods [12]-[14] have been developed to further improve NBNN. However, all variants of NBNN in practice employ uniform summation to aggregate the image-to-class distances calculated based on local
¹ Also called the coding residual, which refers to the difference between the original local feature and the feature reconstructed from the produced codes.

1057-7149/$31.00 © 2012 IEEE


features. This introduces two inherent drawbacks: they are sensitive to noisy features and are easily dominated by outlier features.

In essence, the BoW-based methods and the NBNN-based methods use different statistics of visual characteristics to perform image classification. The former depend on the salient features of an image, while the latter treat all local features equally. In addition, the NBNN-based methods replace image-level similarities with the image-to-class distance when performing classification in order to generate more robust results. Therefore, the BoW-based and NBNN-based methods may be suitable for different types of images. For example, for images with cluttered background, the BoW-based methods show better classification performance due to their ability to capture the salient features. It is therefore reasonable to expect that if we can combine the advantages of both, namely capturing the saliency of images without information loss, the classification performance can be improved further.

Besides reducing the information loss of feature coding, exploring the spatial context more effectively is also crucial for achieving good classification performance. In most of the coding-pooling based methods, Spatial Pyramid Matching (SPM) [7] has been widely adopted in the pooling procedure due to its effectiveness and simplicity. However, SPM strictly requires the involved images to present similar spatial layouts so that the generated image representations match well in an element-wise manner [15]. This requirement originates from the fact that the local features used often represent object-specific visual patterns. Such a requirement has a negative effect on classification accuracy because realistic images usually show various spatial layouts even within the same category. Alternatively, if the elements of the adopted features can be transformed to bear class-specific semantics, this requirement would be greatly relieved.
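The layout sensitivity of SPM described above can be illustrated with a minimal sketch. It assumes a single pyramid level with a 2 x 2 grid, non-negative codes, and max-pooling; the function name `spm_max_pool` and the toy codes are hypothetical and are not taken from [7] or [15].

```python
def spm_max_pool(points, grid=2):
    """Max-pool codes inside each cell of a grid x grid partition.

    points: list of (x, y, code), with x and y in [0, 1) and code a list of
    non-negative floats (assumption: codes are non-negative, so 0 is a safe
    initial value for the running maximum).
    Returns the per-cell pooled codes concatenated in row-major order.
    """
    dim = len(points[0][2])
    cells = [[0.0] * dim for _ in range(grid * grid)]
    for x, y, code in points:
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        idx = row * grid + col
        cells[idx] = [max(a, b) for a, b in zip(cells[idx], code)]
    return [value for cell in cells for value in cell]


# Two toy images containing the same two codes at mirrored positions.
image_1 = [(0.2, 0.2, [1.0, 0.0]), (0.8, 0.8, [0.0, 1.0])]
image_2 = [(0.8, 0.8, [1.0, 0.0]), (0.2, 0.2, [0.0, 1.0])]

rep_1 = spm_max_pool(image_1)
rep_2 = spm_max_pool(image_2)

# Element-wise matching finds no overlap between the two layouts ...
overlap = sum(a * b for a, b in zip(rep_1, rep_2))
# ... although the representations coincide once the grid is removed.
same_without_grid = spm_max_pool(image_1, grid=1) == spm_max_pool(image_2, grid=1)
```

In this toy case the two images contain identical content, yet their gridded representations share no active element, which is the mismatch that class-specific feature elements are meant to reduce.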
In this paper, we propose a novel Linear Distance Coding (LDC) method that inherits the nice properties of both BoW and NBNN while relieving the image spatial alignment requirement of SPM. LDC also works under the feature extraction-coding-pooling framework, i.e., it generates the image representations for classification from the salient characteristic local features, as shown in Figure 1. The proposed LDC particularly focuses on utilizing the discriminative information lost by the traditional coding methods and on exploiting the spatial information more effectively. In practice, LDC transforms each local feature into a distance vector in the class-manifold coordinate system, which is an alternative discriminative pattern of the local feature. In contrast to the original local features, each element of a distance vector carries a class-specific semantic, namely the distance of the local feature to one class-specific manifold. Thus the strict requirement of image layout similarity in the original SPM can be effectively relieved, since the class semantics embedded in each feature element make the similarity calculation between differently posed objects more robust, as detailed later. Comprehensive experiments on various types of datasets consistently show that the image representation produced by LDC achieves better or competitive performance compared with the state of the art. Furthermore, the image representations

[Figure 1 diagram: Images -> Local features -> Distance Transformation (distance to the class manifolds of Class 1, Class 2, ..., Class K in the manifold coordinate system) -> Coding & Pooling (linear coding + SPM max-pooling) -> Image Representation]

Fig. 1. Illustration of linear distance coding. The local features extracted from various classes of training images are first used to generate a manifold for each class that is represented by a set of local features (i.e., anchor points). Based on the obtained class manifolds, the local feature $x_i$ is transformed into a more discriminative distance vector $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,K}]^T$, where $K$ denotes the class number. On these transformed distance vectors, the linear coding and max-pooling are performed to produce the final image representation. The principle of the distance transformation from original local feature $x_i$ to distance feature $d_i$ is to form a class-manifold coordinate system with the $K$ obtained class manifolds, where each class corresponds to one axis. For the $k$th class manifold $M^k$, the coordinate value $d_{i,k}$ of local feature $x_i$ corresponds to the distance between $x_i$ and this class manifold. Image best viewed in color.

produced by LDC are shown to be complementary to the ones from the original coding methods. Thus their combination, even a direct concatenation of the resulting image representations, can yield a remarkable performance improvement, as expected.

The main contributions of this work can be summarized as follows:

1) We propose a novel distance pattern of local features by constructing the class-manifold coordinate system. The produced distance vectors are quite discriminative and are able to relieve the strict requirement of SPM on image spatial layout, benefiting from the more robust image-to-class distance adopted.

2) We propose a linear distance coding (LDC) method, which conducts linear coding and max-pooling on the transformed distance vectors to aggregate the salient features of images. Compared with the NBNN methods, such a process avoids the undesired case where the discriminative features are dominated by outlier or noisy features, especially for images with cluttered background.

3) Both theoretical analysis and experimental verification show that the image representations produced by LDC are complementary to the ones from the traditional coding methods. Their combination is shown to outperform each of them individually and to achieve state-of-the-art performance on various benchmark datasets.

This paper is organized as follows. Section II introduces the related works, including the linear coding models


and the NBNN methods. Section III proposes the distance pattern by introducing the class-manifold coordinate system. Section IV applies linear coding and max-pooling to the transformed distance vectors and discusses the combination of LDC and the original coding method. The experiments on three types of datasets are presented in Section V, where the sensitivity of the classification performance to the key parameters is also discussed. Finally, Section VI concludes this work.

II. RELATED WORKS

The proposed Linear Distance Coding (LDC) simultaneously utilizes the linear coding methods and the image-to-class distance adopted in NBNN [8]. In this section, we briefly discuss the conventional coding methods and the NBNN methods.

1) Linear Coding Models: Linear coding approximates the input feature by a linear combination of the bases in a given dictionary. Through the coding process, input features are transformed into more discriminative codes. The popular linear coding models include Vector Quantization (VQ) [5], Soft-assignment Coding [6], Sparse Coding (SC) [1], Locality-constrained Linear Coding (LLC) [2], and their variants [16]. Given a dictionary $B = [b_1, b_2, \ldots, b_p] \in \mathbb{R}^{d \times p}$ consisting of $p$ basis features with dimensionality $d$, linear coding computes a reconstruction coefficient vector $v \in \mathbb{R}^p$ to represent the input feature $x \in \mathbb{R}^d$ by minimizing the following loss function:

$$L(v) = \frac{1}{2}\|x - Bv\|_2^2 + R(v) \tag{1}$$

where the first term measures the approximation error and the second one serves as regularization. In fact, existing coding models mainly differ from each other in imposing different prior structures on the generated code $v$ via a specific regularization $R(\cdot)$. In particular, LLC [2] considers that locality is more essential than sparsity for feature coding. It adopts a locality adaptor in the regularization $R(\cdot)$ to replace the $\ell_1$-norm used in SC.
The locality regularization takes into account the underlying manifold structure of local features and thus ensures a good approximation. Inspired by LLC, Liu et al. [10] propose to inject locality into soft-assignment coding and devise the Localized Soft-Assignment (LSA) coding method. For any local feature $x$, its membership estimation is restricted to a certain number of nearest bases in the dictionary. LSA discards the possibly unreliable interpretations from distant bases and obtains a more accurate posterior probability estimation. However, the accuracy of such posterior estimation (i.e., the coding result) heavily depends on the size of the adopted dictionary and on the underlying distribution of local features, which in turn determine the performance of image classification.

Inspecting the feature coding in (1), the information loss may originate from two aspects. The first one is the inaccurate linear approximation and the imperfection of the dictionary $B$. The second one is that the structure enforced by $R(\cdot)$ can only be achieved by sacrificing some approximation accuracy. In linear coding models that operate on the original local features, such information loss is inevitable.
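As a minimal sketch of this information loss, the simplest instance of (1) is hard-assignment VQ, where the code is constrained to a single nonzero entry equal to one. The helper names `sq_dist` and `vq_encode` and the two-word toy dictionary are assumptions made for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def vq_encode(x, dictionary):
    """Hard-assignment coding: one-hot code of the nearest basis.

    Returns the code v and the residual x - B v that the code discards.
    """
    dists = [sq_dist(x, b) for b in dictionary]
    j = dists.index(min(dists))
    code = [0.0] * len(dictionary)
    code[j] = 1.0
    residual = [xi - bi for xi, bi in zip(x, dictionary[j])]
    return code, residual


# Toy dictionary with two visual words (hypothetical values).
dictionary = [[1.0, 0.0], [0.0, 1.0]]
code_a, res_a = vq_encode([0.9, 0.1], dictionary)
code_b, res_b = vq_encode([1.2, -0.3], dictionary)
```

Both features receive the same code, so whatever distinguishes them survives only in the residuals, which a purely code-based representation never sees.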

However, the lost information is probably quite important for accurate image classification [8].

2) NBNN Methods: The Naive Bayes Nearest Neighbor (NBNN) [8] is essentially a non-parametric classification method without a training phase, where the classification is performed based on the summation of Euclidean distances between the local features of the test image and the reference classes (i.e., the image-to-class distance) [8], [12]-[14]. By avoiding feature coding, NBNN effectively reduces the information loss and thus achieves competitive classification performance on multiple benchmark datasets. In the NBNN methods, all local features from the same class are assumed to be i.i.d. samples from a certain class-specific distribution, and thus image classification is equivalent to a maximum likelihood estimation problem [8]:

$$\hat{c} = \arg\max_{c} p(c \mid Q) = \arg\max_{c} \prod_{x \in Q} p(x \mid c) \tag{2}$$

where $c$ denotes the class, and $Q$ denotes all the descriptors of the query image. In particular, NBNN estimates the likelihood through a set of Parzen kernel functions (typically Gaussian kernel functions):

$$p(x \mid c) = \frac{1}{L} \sum_{j=1}^{r} \exp\left(-\frac{1}{2\sigma^2}\left\|x - x_j^c\right\|_2^2\right) \tag{3}$$

where $x_j^c$ is the $j$-th nearest neighbor of $x$ in class $c$, $\sigma$ is the bandwidth of the kernel function, $L$ is a normalization factor, and $r$ denotes the number of nearest neighbors. In NBNN, the case $r = 1$ is typically used due to its simplicity and interpretability. In this case, the resulting NBNN criterion simplifies to:

$$\hat{c} = \arg\min_{c} \sum_{i=1}^{N} \left\|x_i - x_i^c\right\|_2^2 \tag{4}$$

where $x_i^c$ is the nearest neighbor of $x_i$ in class $c$, and $N$ is the number of local features. The original NBNN method [8] treats local features and classes equally and independently via the summation in (4), which causes sensitivity to noisy features and outliers. Consequently, the classification performance cannot be greatly improved even though the more robust image-to-class distance is adopted. More specifically, the original NBNN algorithm suffers from the following three drawbacks: 1) the spatial information [7] is not fully exploited, although it has been shown to be quite useful for image classification; 2) the computational complexity increases rapidly with the number of local features, and thus the scalability is severely limited. In particular, the time complexity for one query image with $N$ features is $O(N N_D \log N_D)$, where $N_D$ is the number of all local features of the training images [8]; and 3) it treats all classes equally for any local feature of the test images, and consequently cannot adapt to the dataset involved or capture the image saliency well, as discussed above. To alleviate these issues, various modified methods have been proposed, such as using a class-specific Mahalanobis metric instead of the Euclidean distance [13], associating class-specific parameters with each class [12], and kernelizing the


NBNN [14]. These modified NBNN methods [12]-[14] share two features although they seem to be quite different. First, all of them use the same strategy to improve the classification performance, namely enhancing the adaptiveness of the resulting metrics by learning some key parameters. In fact, such a learning process is an alternative to training parametric models on the training samples. Second, the final classification criterion is always reduced to the summation of a certain distance over all local features within each image, no matter which distance metric is adopted. Such a uniform summation usually renders the generated metric sensitive to noisy points, as mentioned above. Consequently, NBNN on its own cannot outperform the feature coding based methods in image classification tasks.

III. DISTANCE PATTERN

In this work, we focus on solving the image classification problem formally stated as follows: given a set of local features $X_i$ and the class label $y_i$ of the $i$-th image $I_i$, we want to learn a classifier from local features to image label, $C: X_i \rightarrow y_i$, such that the classification error is minimized w.r.t. both the training and test images. In particular, we aim at a method that generates more discriminative image representations from $X_i$ for better classification performance. Here we propose a novel coding method that preserves the superior discriminative capability and robustness of the feature coding based methods [2] while effectively capturing the information lost in the previous coding methods. In the following, we first introduce the proposed distance pattern, which is more discriminative and robust.

A. Class-Specific Distance

Using the distance between a local feature and a certain class to estimate image membership can provide better generalization capability. Such a class-specific distance is fundamental to the NBNN methods and crucial for achieving outstanding classification performance [8].
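The NBNN criterion in (4), on which this class-specific distance builds, can be sketched as follows. This is a brute-force illustration rather than the implementation of [8]; the function names and the toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nbnn_classify(query_features, class_features):
    """Criterion (4): sum nearest-neighbor distances per class, take the minimum.

    class_features maps a class label to the list of training local features
    of that class. Returns the predicted label and the per-class totals.
    """
    totals = {}
    for label, refs in class_features.items():
        totals[label] = sum(
            min(sq_dist(x, ref) for ref in refs) for x in query_features
        )
    return min(totals, key=totals.get), totals


class_features = {
    "a": [[0.0, 0.0], [1.0, 0.0]],
    "b": [[5.0, 5.0], [6.0, 5.0]],
}

# Two features match class "a" exactly; the third is a far-away outlier.
clean_query = [[0.0, 0.0], [1.0, 0.0]]
noisy_query = clean_query + [[100.0, 100.0]]

clean_label, _ = nbnn_classify(clean_query, class_features)
noisy_label, noisy_totals = nbnn_classify(noisy_query, class_features)
```

In this toy example a single outlier feature flips the decision from "a" to "b", because the uniform summation lets its large distance dominate the two exact matches.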
In particular, all of the existing NBNN methods approximate the class-specific distance by the distance between the local feature and its nearest neighbor retrieved from the reference images [8]. Formally, let $d(x_i, c)$ denote the distance between a local feature $x_i$ and the class $c$. Here the class $c$ consists of a set of local features $\{x_j^c\}$, all of which are extracted from the training images of $c$. Then $d(x_i, c)$ is computed as

$$d(x_i, c) = \min_{x \in \{x_j^c\}} \left\|x_i - x\right\|_2^2 = \left\|x_i - x_i^c\right\|_2^2 \tag{5}$$

where $x_i^c$ denotes the mapped point of $x_i$ in class $c$, which reduces to the nearest neighbor of $x_i$ in the NBNN methods. However, the distance derived in (5) suffers from the following drawbacks:

1) It is quite sensitive to noisy features in the training set $\{x_j^c\}$. A local feature is prone to change significantly even under slight appearance variation, and this causes ubiquitous noisy features. In the presence of noisy features or outliers in $\{x_j^c\}$, the estimated distance of local features in the test image may severely deviate from the correct one because of the fragile quadratic criterion. This may lead to a quite unreliable distance pattern and consequently degrade the performance of the classification criterion based on such a distance pattern.

2) It is computationally expensive to find the nearest neighbor for each query feature, as mentioned above. The computational complexity $O(N N_D \log N_D)$ increases proportionally with the number of local features in the training set. In practice, many works extract a huge number of local features, which heavily limits the efficiency of NBNN-based methods. Although there are some accelerated algorithms [17], [18], the low efficiency is still a bottleneck of such distance calculation.

To alleviate these issues, we propose a novel algorithm to calculate the distance $d(x_i, c)$. The essential idea is to calculate a more appropriate mapping point $x_i^c$ rather than simply finding the nearest neighbor as in NBNN. The new $x_i^c$ is allowed to be a virtual local feature in the class $c$. In particular, we assume that the local features of each class are sampled from a class-specific manifold $M^c$, which is completely determined by the available local features $\{m_i^c\}_{i=1}^{n_c}$ of the corresponding class. Such features are called anchor points [19] and can be obtained by clustering the local features from class $c$. Here the manifold of class $c$ is denoted as $M^c = [m_1^c, m_2^c, \ldots, m_{n_c}^c]$. Then the computational complexity for a single input image with $N$ features becomes $O(N n_c \log n_c)$ with $n_c \ll N_D$, where $N_D$ is the number of all training local features. For example, in our following experiments, there are about 60 000 local features for each class, with 2000 features per image and 30 training images. After the clustering preprocessing, only $n_c = 1024 \ll 60\,000$ anchor points are used to describe the manifold.

In addition to reducing the complexity, using the cluster centers as anchor points can effectively reduce the influence of noisy features and thus produce a more robust description of the manifold. This holds under the reasonable assumption that the fraction of outliers is small, so that the resulting centers are mainly determined by the dominant inlier features.

Now we present an efficient algorithm to determine a good mapping point $x_i^c$, even when relatively few anchor points are provided. By utilizing the locally linear structure of the manifold, $x_i^c$ can be calculated through locally linear regression. More specifically, $x_i^c = M^c v_i$ is computed as a linear combination of the neighboring anchors of $x_i$ in the manifold $M^c$. Here we apply the approximate fast solution of LLC [2] to our problem, which selects only a fixed number of nearest neighbors and can be formulated as follows:

$$\min_{v_i} \left\|x_i - M^c v_i\right\|_2^2 \quad \text{subject to: } v_{i,j} = 0 \ \text{if } m_j^c \notin \mathcal{N}_i^k, \quad \mathbf{1}^T v_i = 1, \ \forall i \tag{6}$$

where $v_i = [v_{i,1}, v_{i,2}, \ldots, v_{i,n_c}]^T$ contains the linear representation coefficients of $x_i$ on the manifold $M^c$, and $\mathcal{N}_i^k$ is the set of $k$ nearest neighbors of $x_i$. Substituting the resulting $x_i^c$ derived from (6) into (5) finally yields the distance $d(x_i, c)$, which is denoted as $d_{i,c}$. Such class-specific distance


is motivated by capturing the underlying manifold structure of the local features and is computed by robust linear regression. Thus it gains stronger discriminative power and more robustness to noisy and outlier features.

B. What Is a Good Distance Pattern?

Let $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,K}]^T \in \mathbb{R}^K$ denote the distance vector of the local feature $x_i$, which aggregates its distance relationship to all $K$ classes. In contrast to the original local features (e.g., SIFT), which describe the appearance patterns of characteristic objects, the distance vector represents a relative pattern that captures the discriminative part of local features w.r.t. the specified classes, i.e., it is more class-specific, as desired. In fact, the distance vector is the projection residue of local features onto the class manifolds, as shown in Figure 1. Note that in the figure each axis denotes one class manifold. Through such a residue-pursuit feature transformation, the distance vector gains the following advantages compared with the original local features:

1) The distance vector preserves the discriminative information of local features lost in the traditional feature coding process.

2) The distance vector coordinates better with additional operations that explore useful spatial information, e.g., SPM. The spatial pooling of traditional local features requires the involved images to have similar object layouts so that the resulting representations of different images can be matched well element by element. This overly strict requirement is significantly relieved by the distance vector because of the class-specific characteristic of the adopted image-to-class distance, as shown in Figure 2.

[Figure 2 diagram: Image 1 and Image 2 (examples in Class 1) shown in the feature space (Class 1, Class 2) and in the image-level representation space, for the original local features and for the distance features]

Fig. 2. Schematic diagram of the distance pattern relieving the requirement of layout similarity. In the original feature space, each class has multiple clusters of characteristic features. When the images involved have different layouts, the resulting image representations may be quite different because the features contained in the same SPM grid cell differ across images. This has a negative impact on the accuracy of the usual methods based on element-wise matching. Such an undesired situation can be significantly mitigated by our proposed distance transformation, as all distance vectors within the same class turn out to be more similar in the distance feature space, benefiting from the class-specific characteristic of the adopted image-to-class distance. Consequently, image representations of the same class become closer to each other in the image-level representation space, even though they show totally different layouts (e.g., the distance image representations $v_{I_1}^d$ and $v_{I_2}^d$ in class 1). Different shapes represent different classes in certain feature spaces and different colors indicate different features (e.g., the pink rectangles represent the indistinctive features in class 1, lying close to class 2). Image best viewed in color.

Compared with previous NBNN methods, which directly sum up the image-to-class distances for classification, here we propose to use the distance vector as a new kind of local feature. Thus, any classification model used on the original local features is directly applicable to the distance vector. Before providing a more robust and discriminative distance pattern, we first recall the original NBNN strategy for image classification. Given an image $I$ with $N$ local features $x_i$, the distance vectors $d_i \in \mathbb{R}^K$ are calculated as in (5). Then the estimated category $\hat{c}$ of $I$ is determined by the following criterion:

$$\hat{c} = \arg\min_{k} \sum_{i=1}^{N} d_i = \arg\min_{k} \left[\sum_{i=1}^{N} d_{i,1}, \sum_{i=1}^{N} d_{i,2}, \ldots, \sum_{i=1}^{N} d_{i,K}\right]^T \tag{7}$$

where $k$ is the index of the element corresponding to the category. Namely, the original NBNN method considers the element-wise semantics of the obtained distance vector separately and completely ignores the intrinsic pattern described by the distance vector as a whole. Different from the previous methods, we regard each distance vector as an integral feature and then apply a well-performing coding model to these transformed features. In particular, the final distance pattern used in our method admits the following form:

$$\tilde{d}_i = d_i - \min(d_i), \qquad \hat{d}_i = f_n(\tilde{d}_i) = \frac{1}{\|\tilde{d}_i\|_2}\left[\tilde{d}_{i,1}, \tilde{d}_{i,2}, \ldots, \tilde{d}_{i,K}\right]^T \tag{8}$$

where $f_n(\cdot)$ is the normalization function with the $\ell_2$-norm. From (8), the $\hat{d}_i$ used mainly represents the distance pattern, with $\|\hat{d}_i\|_2 = 1$. In practice, compared with the direct normalization $f_n(d_i)$ without the minimum subtraction, the normalization in (8) has been experimentally shown to produce a slightly higher classification accuracy [14], which may benefit from the increased gap between elements that describes features more discriminatively. For simplicity, we use $d_i$ to refer to $\hat{d}_i$ in the following sections wherever this causes no ambiguity. Finally, we summarize the procedure to compute the adopted distance pattern in Algorithm 1.

IV. LINEAR DISTANCE CODING

Here we explore how to utilize the obtained distance vectors to produce a discriminative and robust image representation. Different from the previous NBNN-like methods, we aggregate the obtained distance pattern under the coding-pooling framework, which provided state-of-the-art performance in previous works. The overview of the image classification flowchart is shown in Figure 3. The distance vectors are transformed from the local features one by one; then the distance vectors and the original local features are separately encoded and pooled to generate two image representations $v_I^d$ and $v_I$.


Algorithm 1: Distance Pattern
Data: N local features {x_i}, i = 1, ..., N, of image I; the class-specific manifolds M^c, c = 1, 2, ..., K.
Result: The desired distance vectors \hat{d}_i, i = 1, 2, ..., N.
for i = 1 to N do
    for k = 1 to K do
        calculate v_i using (6), then d_{i,k} = ||x_i - M^k v_i||_2^2.
    end
    Construct the distance vector d_i = [d_{i,1}, d_{i,2}, ..., d_{i,K}]^T.
    Obtain the normalized distance vector \hat{d}_i from (8).
end
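Algorithm 1 can be sketched in plain Python as below. This is a hedged illustration, not the authors' implementation: the constrained problem (6) is solved through the standard closed form for a sum-to-one least-squares fit (a Gram matrix of shifted anchors, a small ridge term, then renormalization), and the ridge value `reg`, the default `k = 2`, and all function names are assumptions.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def class_distance(x, anchors, k=2, reg=1e-8):
    """d = ||x - M v||^2, with v solving (6) on the k nearest anchors."""
    near = sorted(anchors, key=lambda m: sq_dist(x, m))[:k]
    k = len(near)
    # With sum(v) = 1, x - sum_j v_j m_j = -sum_j v_j (m_j - x).
    shifted = [[mj - xj for mj, xj in zip(m, x)] for m in near]
    gram = [[sum(p * q for p, q in zip(shifted[a], shifted[b]))
             for b in range(k)] for a in range(k)]
    # Small ridge term (an assumption) keeps the system well conditioned.
    ridge = reg * sum(gram[j][j] for j in range(k)) + 1e-12
    for j in range(k):
        gram[j][j] += ridge
    w = solve_linear(gram, [1.0] * k)
    total = sum(w)
    v = [wj / total for wj in w]  # enforce 1^T v = 1
    mapped = [sum(v[j] * near[j][t] for j in range(k)) for t in range(len(x))]
    return sq_dist(x, mapped)


def distance_pattern(x, manifolds, k=2):
    """Distance vector over all class manifolds, normalized as in (8)."""
    d = [class_distance(x, anchors, k) for anchors in manifolds]
    lowest = min(d)
    shifted = [value - lowest for value in d]
    norm = math.sqrt(sum(value ** 2 for value in shifted))
    if norm == 0.0:  # all classes equally distant: nothing to normalize
        return shifted
    return [value / norm for value in shifted]


# Two toy class manifolds, each described by two anchor points.
manifolds = [[[0.0, 0.0], [1.0, 0.0]], [[0.0, 3.0], [1.0, 3.0]]]
pattern = distance_pattern([0.5, 1.0], manifolds)
```

With two anchors per class, the mapped point is the projection of the feature onto the line through the anchors, so the toy feature lies at squared distance 1 from the first manifold and 4 from the second before normalization.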
[Figure 3 diagram: (a) Local Feature Extraction -> Linear Coding -> Spatial Max-pooling; (b) Distance Transformation -> LDC -> Spatial Max-pooling; both branches -> Concatenated Representation -> Linear SVM]

Fig. 3. Overview of the image classification flowchart. This architecture has been proven to achieve state-of-the-art performance on the basis of a single type of feature, e.g., LLC [2]. (a) Linear coding and max-pooling are sequentially performed on the original extracted local features, resulting in an original image representation. (b) All local features are transformed into distance vectors, on which the linear coding and max-pooling are sequentially performed. This coding process is called LDC in this paper, and it results in a distance image representation. Finally, the original image representation and the distance image representation are simply concatenated so that they complement each other, where linear SVM is adopted for the final classification.
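The last two stages of the flowchart in Fig. 3 can be sketched as follows. The sketch assumes global max-pooling only (no spatial pyramid) and leaves out the coding step and the linear SVM; the function names and the toy codes are hypothetical.

```python
def max_pool(codes):
    """Element-wise maximum over a list of equally long code vectors."""
    return [max(column) for column in zip(*codes)]


def concatenated_representation(original_codes, distance_codes):
    """Pool each branch separately, then concatenate the two results."""
    return max_pool(original_codes) + max_pool(distance_codes)


# Toy codes for two local features of one image (hypothetical values).
original_codes = [[0.2, 0.0, 0.8], [0.6, 0.4, 0.0]]
distance_codes = [[0.1, 0.9], [0.7, 0.3]]

representation = concatenated_representation(original_codes, distance_codes)
```

The two branches may use dictionaries of different sizes, since concatenation only requires that each branch yields a fixed-length vector per image.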

Fig. 4. Illustration of the complementarity between the image representations produced by LLC-like coding methods and by our LDC method. In the coding-pooling framework, the original local feature $x$ is approximated by the fixed visual words (anchor points) and the corresponding code $v$. Here we specifically suppose that the anchor points of all classes form a fixed global dictionary $B = [M^1, M^2, \ldots, M^K]$ by concatenation. Then the information of the original feature $x$ can be completely expressed by the generated codes $v = [v_1, v_2, \ldots, v_K]^T$ and the residue errors $[n_1, n_2, \ldots, n_K]^T$. In fact, the proposed LDC utilizes the residue error information by compressing $n_k$ into $d_k$ with $d_k = \|n_k\|_2^2$. Therefore, the image representations $v_I$ and $v_I^d$ are complementary to each other due to their complementary perspectives on utilizing the original information.
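The decomposition described in the caption of Fig. 4 can be written out per class as a short worked form. The notation $n_k$, $v_k$ and $M^k$ follows the caption; pairing it with the residual used in Algorithm 1 is our reading rather than a formula stated in the figure.

```latex
x = M^{k} v_{k} + n_{k},
\qquad
d_{k} = \lVert n_{k} \rVert_2^2 = \lVert x - M^{k} v_{k} \rVert_2^2,
\qquad k = 1, 2, \ldots, K .
```

Under this reading, traditional coding keeps $v_k$ (the part of $x$ explained by the anchor points), while LDC keeps the magnitude of what is left over, which is why the two representations look at different parts of the same feature.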

Finally, a linear SVM is adopted to classify the images based on either individual image representation or their concatenation. To verify the effectiveness and generalization ability of the distance transformation, we apply two different coding models independently to encode the distance vectors, namely LLC [2] and Localized Soft-Assignment coding (LSA) [10], chosen for the high efficiency of their approximate fast solutions. We illustrate the procedure via LLC; see [10] for the details of the LSA counterpart. Let $B \in \mathbb{R}^{K \times P}$ be the distance dictionary consisting of $P$ distance vectors $b_1, b_2, \ldots, b_P$, which can be obtained by $k$-means clustering of the distance vectors of the training images. For an input distance vector $d_i$, the corresponding code $y_i$ is calculated as follows [2]:

$$\min_{y_i} \; \|d_i - B y_i\|_2^2 + \lambda \|e_i \odot y_i\|_2^2, \quad \text{subject to: } \mathbf{1}^T y_i = 1, \qquad (9)$$

where $\odot$ denotes element-wise multiplication, $\mathbf{1}$ is a $P$-dimensional all-ones vector, $\lambda$ weights the locality term, and $e_i \in \mathbb{R}^P$ is the locality adaptor that gives each visual word a different degree of freedom proportional to its similarity to the input distance feature $d_i$. After linear coding of the distance vectors, max-pooling is performed on the obtained sparse codes $\{y_i\}$ to produce the distance image representation $v_I^d$ for image $I$, namely,

$$v_I^d = \max(y_1, y_2, \ldots, y_N), \qquad (10)$$

where the max is performed element-wise over the involved vectors. In addition, SPM with three levels is adopted for the spatial pooling. Thus, the distance image representation $v_I^d$ is as compact, salient, and discriminative as the original image representation $v_I$.

Here we provide a brief analysis of the relationship between the original image representation $v_I$ and the distance image representation $v_I^d$. The most intuitive difference is that they are derived from two different local features: the original local features $\{x_i\}$ and the distance vectors $\{d_i\}$, respectively. For an individual point within an image, the coding quantization of the original local feature inevitably loses some important information because it preserves only the principal information, while the distance vector captures the discriminative information in the residue part and thus compensates for this loss, as shown in Figure 4. It is therefore credible that the resulting image representations $v_I$ and $v_I^d$ are complementary to each other. In practice, we simply concatenate $v_I$ and $v_I^d$ to form a longer vector $v_I^c$, which is expected to achieve better performance. The benefit of such complementarity is verified by the following experiments on multiple types of benchmark datasets.

V. EXPERIMENTS

In this section, we evaluate the performance of the proposed method on three groups of benchmark datasets: specific objects (e.g., flower, food), scenes, and general objects. In particular, the specific object datasets include Flower 102 [20] and PFID 61 [21], in which the images are relatively clean without cluttered background. The scene datasets include Scene 15 [7] and Indoor 67 [22]. The general object datasets include Caltech 101 [23] and Caltech 256 [24]. Among various feature coding models producing relatively compact image representations, Locality-constrained Linear Coding (LLC) and Localized Soft-Assignment Coding (LSA)


almost always achieve state-of-the-art classification performance [2], [10]. In addition, compared with ScSPM and other similar methods, they have much lower computational complexity owing to the existing fast solutions [2]. Thus we adopt LLC and LSA individually as the coding model in our method, and max-pooling is always employed. Of course, similar coding models can also be naturally applied to the transformed distance features, e.g., Laplacian Sparse Coding (LScSPM) proposed by Gao et al. [16]. The main goal of the following experiments is to verify that the proposed distance pattern uniformly improves classification performance. Moreover, we adopt as baselines the best performance of comparable methods reported on each dataset and the accuracies achieved by LLC and LSA. Before reporting the detailed classification results on these datasets, we first give the experimental settings.

A. Experimental Settings

For fair comparison with previously reported results, local features of a single type, dense SIFT [3], are used throughout the experiments. In all of our experiments, SIFT features are extracted at a single scale from densely located patches of gray images. The patches are centered at every 4 pixels and have a fixed size of 16 × 16 pixels; the VLFeat library [25] is used. Before feature extraction, all images are resized, preserving the aspect ratio, to no more than 300 × 300 pixels. The anchor points $\{m_i^c\}$ of each class manifold $M^c$ are learned from the training images of that class, and their number is fixed as $K_c = 1024$ for all classes throughout our experiments. For the original dense SIFT features and the corresponding distance vectors, the global dictionaries containing $P$ visual words are learned individually from all training samples via $k$-means clustering. In particular, $P = 2048$ is fixed for all datasets.
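The coding step of (9) and the pooling step of (10) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the code uses the approximated LLC solution (solve a small least-squares system over the `knn` nearest dictionary words, then rescale so the weights sum to one), which replaces the explicit locality term of (9) by a hard neighbor selection, and the pyramid uses 1 × 1, 2 × 2 and 4 × 4 grids as the three SPM levels. The function names, the ridge term `reg`, the normalized patch positions `xy`, and the zero vector for empty cells are all hypothetical choices.

```python
import numpy as np


def llc_code(d, B, knn=10, reg=1e-4):
    """Approximate LLC code of one vector d over dictionary B (K x P, one word per column)."""
    words = B.T                                     # (P, K): one word per row
    nearest = np.argsort(np.sum((words - d) ** 2, axis=1))[:knn]
    shifted = words[nearest] - d                    # local basis centred on d
    gram = shifted @ shifted.T
    gram += reg * max(np.trace(gram), 1e-12) * np.eye(len(nearest))
    w = np.linalg.solve(gram, np.ones(len(nearest)))
    y = np.zeros(words.shape[0])
    y[nearest] = w / w.sum()                        # enforce 1^T y = 1
    return y


def max_pool(codes):
    """Element-wise max over the codes of one image, as in (10)."""
    return codes.max(axis=0)


def spm_max_pool(codes, xy, levels=(1, 2, 4)):
    """Max-pool codes inside each cell of a spatial pyramid and concatenate.

    codes: (N, P) array of codes; xy: (N, 2) patch centres scaled to [0, 1).
    """
    n_words = codes.shape[1]
    pooled = []
    for g in levels:
        cell = np.minimum((xy * g).astype(int), g - 1)
        cell_id = cell[:, 1] * g + cell[:, 0]
        for c in range(g * g):
            mask = cell_id == c
            # Empty cells contribute a zero vector (an assumption).
            pooled.append(codes[mask].max(axis=0) if mask.any() else np.zeros(n_words))
    return np.concatenate(pooled)
```

With three levels the pooled vector has 21 × P entries (1 + 4 + 16 cells). Applying the same two functions to the SIFT codes and to the distance codes, and concatenating the two pooled vectors, gives the combined representation that is fed to the linear SVM.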
Each SIFT feature $x_i$ or distance vector $d_i$ is normalized by its $\ell_2$-norm and then encoded into a $P$-dimensional vector. An important parameter of LLC and LSA is the number of nearest neighbors $k_{nn}^c$ used when encoding local features. In our method, the distance vector is similarly calculated based on $k_{nn}^d$ neighbors in the specified class manifold. To reduce their influence on classification performance, four different values are used individually for these parameters, i.e., $k_{nn}^d \in \{1, 2, 3, 4\}$, and $k_{nn}^c \in \{2, 5, 10, 20\}$ as suggested in LLC [2]. In the experiments, we report the best result for each method under these parameters, and the influence of these parameters is discussed in Section V-E. In addition, the bandwidth parameter of LSA is fixed as 10, following the authors' setting in [10]. In the experiments, SPM is used by hierarchically partitioning each image into 1 × 1, 2 × 2, and 4 × 4 blocks on 3 levels, whose cumulative concatenations are denoted by SPM0, SPM1 and SPM2, respectively. In particular, SPM2 means that all three levels (from 0 to 2) are used by concatenating their pooling vectors. All obtained image-level representations are fed into the linear SVM in the training and testing phases (the LIBLINEAR package [26]), where the penalty parameter of the SVM is fixed as $C = 1$. Actually we found the classification


Fig. 5. Example images of Flower 102 dataset, where each row represents one category. (a) Original images. (b) Corresponding segmented images. Limited by the performance of the segmentation algorithm, the segmented images may contain part of the background, lose part of the object, or even lose the whole object. Image best viewed in color.

performance is quite stable for different penalty parameter values. The number of repetitions and the numbers of training and testing samples follow the configuration provided with each dataset. The performance is measured by the average classification accuracy over all classes. For multiple runs, both the mean and the standard deviation of the classification accuracy are reported. To evaluate the proposed methods, we report the results of three different image-level representations: the original feature representation $v_I$, the distance image representation $v_I^d$, and their direct concatenation $v_I^c$. In the experimental results, LLC and LSA are each combined with the different input features. For example, LLC-SIFT refers to applying LLC to the original SIFT features to produce the image-level representation, and LLC-Combine refers to the concatenation of the image representations from LLC-SIFT and LLC-Distance.

B. Specific Object Datasets

We first evaluate the proposed method on the Flower 102 [20] and PFID 61 [21] datasets, whose images are relatively clean and whose backgrounds are less cluttered.

1) Flower 102: Flower 102 is a 102-category flower dataset [20] containing 8189 images, and each class consists of 40 to 258 images. Some examples are shown in Figure 5. In particular, the images possess small inter-class differences and large intra-class variance. Here we focus on classifying the segmented images available from the dataset. Because the segmentation algorithm is imperfect, the segmented foreground may contain part of the background or lose part of the object. Therefore, classification on such segmented images is still challenging. The dataset has been divided into a training set, a validation set and a testing set in the provided protocol. The training set and the validation set each consist of 10 images per class, and the testing set consists of the remaining 6149 images (minimum 20 per class).
2) PFID 61: The Pittsburgh Fast-Food Image Dataset is a collection of fast food images from 13 chain restaurants (e.g., McDonald's, Pizza Hut, KFC) acquired under lab and


TABLE I
CLASSIFICATION ACCURACY (%) COMPARISON ON TWO OBJECT DATASETS: Flower 102 AND PFID 61

Methods                        Flower 102    PFID 61
SVM (SIFTint) [20] (a)         55.10         -
KMTJSRC-CG (SIFTint) [27]      55.20         -
Bag of SIFT [21] (b)           -             9.20
OM [28] (c)                    -             28.20
LLC-SIFT                       57.75         44.63 ± 4.00
LLC-distance                   59.76         48.45 ± 3.58
LLC-combine                    61.45         48.27 ± 3.59
LSA-SIFT                       57.80         43.35 ± 3.36
LSA-distance                   58.78         46.90 ± 3.47
LSA-combine                    60.38         46.54 ± 3.08

Fig. 6. Example images of PFID 61 dataset, where each row of the left and right part represents one category. Each category contains three instances and each instance has six images from different views. Two images of each instance are shown here. Image best viewed in color.

realistic settings [21]. It contains 61 categories of food items selected from 101 categories. There are 3 instances of each food item, bought from different branches on different days, and 6 images taken from 6 viewpoints (60 degrees apart) for each food instance. Figure 6 shows 14 of the categories with two example images per category. Notably, the appearance of different instances within a category varies greatly, and some different categories (e.g., hamburgers) are too similar to be distinguished even by the human eye. Such large instance variance and tiny differences between classes make the classification quite challenging.

For Flower 102, most previous classification methods employing a single feature are based on the $\chi^2$ kernel function of the clustered SIFTint and SIFTbdy features [27]. In stark contrast, we directly use a much simpler and more efficient linear SVM to classify the segmented images. We train the classifier on the training and validation images, as done by the baseline method provided in [20]. Namely, 20 images per class are used for training, and the remaining images are used for testing. For PFID 61, we follow the experimental protocol proposed in previous work [21], [28], and use 3-fold cross-validation to evaluate the performance. In each iteration, the 12 images of two instances are used for training and the 6 images of the third instance are used for testing. We repeat the training and testing process 3 times, with a different instance serving as the test set each time.

Table I gives the classification performance of the different methods on Flower 102 and PFID 61. Here KMTJSRC-CG is the method proposed by Yuan et al. [27], which uses multi-task joint sparse coding and achieves the state-of-the-art performance of 55.20% on Flower 102. As for PFID 61, the state-of-the-art performance is 28.20%, achieved by Yang et al. [28] by utilizing the spatial relationship of local features.
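The instance-wise 3-fold protocol for PFID 61 and the class-averaged accuracy used as the performance measure can be sketched in plain Python as follows. The sample layout, a list of `(category, instance, view)` tuples with instances numbered 0 to 2, and both function names are assumptions made for illustration only.

```python
from statistics import mean


def instance_folds(samples, n_instances=3):
    """Yield (train, test) index lists, holding out one instance per fold.

    samples: list of (category, instance, view) tuples.
    """
    for held_out in range(n_instances):
        train = [i for i, (_, inst, _) in enumerate(samples) if inst != held_out]
        test = [i for i, (_, inst, _) in enumerate(samples) if inst == held_out]
        yield train, test


def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, so every class counts equally."""
    hits, totals = {}, {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == pred)
    return mean(hits[c] / totals[c] for c in totals)
```

With 61 categories, 3 instances and 6 views, each fold trains on 12 images and tests on 6 images per category, matching the split described above; the reported figure is then the mean of the per-fold accuracies together with their standard deviation.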
Besides these methods, we run the adopted coding methods LLC and LSA on both datasets to demonstrate the effectiveness of the proposed LDC in improving classification performance. From Table I, it can be observed that the proposed method significantly outperforms LLC and LSA with SIFT features and generally achieves state-of-the-art performance. This verifies that the proposed distance pattern of local features captures more effectively the discriminative

(a) The best baseline accuracy provided by the authors of Flower 102 for the single feature, which is based on SVM.
(b) One of the baseline accuracies on the 61 categories provided by the authors of PFID 61.
(c) The Orientation and Midpoint (OM) method, one of a set of methods based on the statistics of pairwise local features proposed by Yang et al., yields the best accuracy, where the $\chi^2$ kernel is adopted with SVM.

information among multiple classes. According to our analysis, the combination of the distance vector and the original SIFT features should yield better classification accuracy than using either of them individually, because the combination can compensate for the information loss and provide more useful information. This is clearly shown on Flower 102, where the combination achieves the best accuracy of 61.45%. However, the combination is not as effective on PFID 61, where the individual distance vector achieves the best performance of 48.45% rather than the combination. The reason is that different instances in PFID 61 vary too much, and thus the consistency of the local feature distribution between the training and testing images is not well guaranteed. This is experimentally demonstrated by the larger accuracy deviations of both the LLC and LSA methods in Table I. In this case, the combination may slightly overfit the training data and lead to a negligible decrease in classification accuracy, e.g., the average accuracy decreases from 48.45% to 48.27% when LLC-Distance is combined with LLC-SIFT.

C. Scene Datasets

Now we evaluate the proposed method on the scene datasets Scene 15 and Indoor 67. Scene recognition is a challenging open problem in high-level vision because each image contains not only the undeterminable characterizing objects but also a complex background [22]. Compared with object classification, the variations of images in scene classification are more severe, especially in lighting conditions, scale, and spatial layout.

1) Scene 15: This dataset consists of 15 scene categories, among which 8 categories were originally collected by Oliva et al. [29], 5 were added by Li et al. [5] and 2 were adopted from Lazebnik et al. [7]. Each class contains 200 to 400 images, and the average image size is around


TABLE II
CLASSIFICATION ACCURACY (%) COMPARISON ON TWO SCENE DATASETS: Scene 15 AND Indoor 67

Methods                          Scene 15        Indoor 67
ROI + gist-annotation [22] (a)   -               26.50
Object Bank [30] (b)             80.90           37.60
KSPM [7]                         81.40 ± 0.50    -
ScSPM [1]                        80.28 ± 0.93    -
SC + linear kernel [31] (c)      84.10 ± 0.50    -
NBNN [13] (d)                    77.00           -
LLC-SIFT                         79.81 ± 0.35    43.78
LLC-distance                     80.30 ± 0.62    43.53
LLC-combine                      82.40 ± 0.35    46.28
LSA-SIFT                         80.12 ± 0.60    44.19
LSA-distance                     79.73 ± 0.70    42.04
LSA-combine                      82.50 ± 0.47    46.69

Fig. 7. Example images of Scene 15 dataset containing all 15 categories with two images per category.

(a) The baseline result provided by the authors of Indoor 67, where region of interest (ROI) detection is employed to reduce the interference of cluttered background and the RBF-kernel SVM is adopted.
(b) Object Bank pre-trains one object detector for each class.
(c) For comparison, the result of basic features is shown here, but it adopts the intersection kernel rather than our adopted linear SVM.
(d) This is the optimized version of NBNN, where the image-to-class distance is learned by employing the Mahalanobis metric.

Fig. 8. Example images of Indoor 67 data set containing 67 categories. All categories are organized into five big groups: Store, Home, Public spaces, Leisure, and Working. Four categories with two images per category are shown for each group. Due to the complex background, images within each category vary widely. Image best viewed in color.

300 × 250 pixels. Figure 7 shows example images of all 15 categories, with two images per category.

2) Indoor 67: This dataset contains 67 indoor scene categories and a total of 15620 images [22]. The images in the dataset were collected from three different sources: online image search tools (Google and Altavista), online photo sharing sites (Flickr) and the LabelMe dataset. All images have a minimum resolution of 200 pixels along the smaller axis. The number of images varies across categories, but there are at least 100 images per category. To make the variety of scene categories easier to see, they are organized into 5 big scene groups (Store, Home, Public spaces, Leisure, and Working places), as shown in Figure 8.

For Scene 15, we follow the setting in [7] and randomly choose 100 images per class for training and test on the rest. In particular, we repeat the evaluation three times and report the average results and the standard deviation. As for Indoor 67, we follow the settings of the baseline method provided in [22]: 80 images of each class are used for training and 20 images for testing, with the partition provided on the dataset website.

Table II provides the classification results on Scene 15 and Indoor 67, together with several baseline results on these two scene datasets. The baseline methods include the detection-based methods, the linear coding methods, and the NBNN method. For these two datasets, the distance vectors

yield classification performance close to that of the original local features, due to the relatively poor consistency of the feature distributions of the training and testing images. As expected, the combination achieves the best performance for both LLC and LSA, as the spatial robustness of the transformed distance vectors strengthens the robustness of the final combined image-level representation.

D. General Object Datasets

Here we conduct experiments on Caltech 101 and Caltech 256, in which each image contains a certain object and cluttered background. The Caltech 101 dataset [23] contains 9144 images in 101 object categories including animals, vehicles, flowers, buildings, etc. The number of images per category varies from 31 to 800. The Caltech 256 dataset [24] contains 30,607 images from 256 object categories, and each category contains at least 80 images. Besides the object categories, each dataset has an extra background class, i.e., BACKGROUND_Google and clutter, respectively. Figure 9 gives some example images. Compared with Caltech 101, Caltech 256 presents much greater variation in object size, location, pose, etc.

For both datasets, we randomly select 30 images per category for training and test on the rest. In particular, we repeat this three times and report the average classification accuracy and the corresponding standard deviation. Table III provides the resulting classification performance on these two datasets. Here we compare our method mainly with the linear coding methods and the NBNN method. In particular, LLC in [2] adopted three-scale SIFT features, while our work uses only single-scale SIFT features. For Caltech 256, LLC [2] adopted a dictionary of 4096 visual words to further improve


Fig. 9. Example images of Caltech 101 and Caltech 256 data sets containing 102 and 257 categories, respectively. Besides object categories, each of the two data sets contains one extra background category, namely, BACKGROUND_Google for Caltech 101 and clutter for Caltech 256. All categories in the two datasets have large object variations with cluttered background. Compared with Caltech 101, Caltech 256 has a more irregular object layout, which may degrade the classification performance due to the imperfect matching of spatial pooling. Image best viewed in color.

TABLE III
CLASSIFICATION ACCURACY (%) COMPARISON ON Caltech 101 AND Caltech 256

Methods                        Caltech 101     Caltech 256
SVM-KNN [32]                   66.20 ± 0.50    -
KSPM [7], [24]                 64.60 ± 0.80    34.10
ScSPM [1]                      73.20 ± 0.54    34.02 ± 0.35
SC + linear kernel [31] (a)    71.50 ± 1.10    -
LScSPM [16]                    -               35.74 ± 0.10
NBNN [2], [8] (b)              70.40           37.00
LLC [2] (c)                    73.44           41.19
LSA [10]                       74.21 ± 0.81    -
LLC-SIFT                       72.65 ± 0.33    36.27 ± 0.27
LLC-distance                   73.34 ± 0.95    37.40 ± 0.07
LLC-combine                    74.59 ± 0.54    38.41 ± 0.11
LSA-SIFT                       72.86 ± 0.33    36.52 ± 0.26
LSA-distance                   71.45 ± 0.87    36.30 ± 0.06
LSA-combine                    74.47 ± 0.46    38.25 ± 0.08

always yields better performance than either one individually, as expected. Compared with the previous methods, our method achieves satisfactory performance and outperforms similar methods that use a linear SVM and a single feature. Actually, the classification accuracy could be further increased by adopting an advanced learning-based model [15] or a graph-matching kernel [33], if their added complexity is disregarded.

From the above experimental results on several different types of image datasets, we can summarize the effectiveness of the proposed method as follows:
1) The distance vectors are quite discriminative under the mild condition that the distributions of the training data and the testing data are consistent to some extent, e.g., when the involved images suffer less interference from cluttered background.
2) The transformation to the distance vector relaxes the requirement for similarity of the object spatial layout, because it is independent of the spatial position of distinctive objects. This is one of the critical differences from the original local features.
3) Under the coding-pooling framework, the distance vector and the original feature are complementary to each other. Consequently, their combination can capture the useful classification information more comprehensively and generally achieves higher classification performance, which is uniformly effective on all used datasets.

E. Discussion

We have proposed the linear distance coding method and verified its effectiveness on multiple types of benchmark datasets. Here we evaluate the influence of the number of nearest neighbors on calculating distance and on coding separately. Specifically, we select Flower 102, Indoor 67 and Caltech 101, one dataset per type, to investigate the performance under different values, where LLC is employed.
1) Neighbor Number $k_{nn}^d$ on Calculating Distance: In Section III, we introduce the class manifolds to calculate the distance of a local feature to a certain class, with the aim of reducing the complexity and the interference of noisy features. To investigate how $k_{nn}^d$ affects the final classification performance, we provide the average classification accuracy under different values $k_{nn}^d \in \{1, 2, 3, 4\}$; the plot is shown in Figure 10. From these results, we have the following observations. First, the combined representation is more robust to $k_{nn}^d$ than the individual distance vector, since the combination also encapsulates the information from the original features, which is not affected by this parameter. Second, the influence of this parameter varies considerably across datasets, especially when only the distance vector is adopted. For example, the classification accuracy on Flower 102 keeps increasing when $k_{nn}^d$ increases from 1 to 4. In fact, the performance fluctuates only slightly once the results under $k_{nn}^d = 1$ are discarded. Based on the observed influence on the different datasets, we suggest $k_{nn}^d = 3$ as a good trade-off.

2) Neighbor Number $k_{nn}^c$ on Coding: Now we investigate the effect of $k_{nn}^c$ on the final classification performance, where

(a) For fair comparison, the result of basic features with linear kernel is shown here. Higher accuracy is also reported in [31], but with the intersection kernel.
(b) Performance of the original NBNN [8] provided in [2].
(c) LLC adopts three-scale SIFT features and a global dictionary of size 4096, which can yield higher accuracy than single-scale features, especially for Caltech 256 with its larger scale variation.

the performance, whereas our dictionary size is fixed as 2048. However, even when following the same setting for the Caltech 101 dataset, our results are slightly worse than those reported in the previous literature. The same holds for LSA. Such a decrease may be caused by implementation details. For fair comparison, here we only compare the results from our own implementation. Comparing the results in Table III, we can observe that the combination of the distance vector and the original features


[Figure 10 plot. Vertical axis: Classification Accuracy (35% to 80%). Horizontal axis: $k_{nn}^d$ (1 to 4). Curves: Caltech 101 - Distance, Caltech 101 - Combine, Flower 102 - Distance, Flower 102 - Combine, Indoor 67 - Distance, Indoor 67 - Combine.]

Fig. 10. Classification accuracy of the proposed methods under different $k_{nn}^d \in \{1, 2, 3, 4\}$, where three types of data sets, Flower 102, Indoor 67, and Caltech 101, are adopted. Compared to the individual distance vector, the combination is more robust to the parameter $k_{nn}^d$, as it provides more complete information. Image best viewed in color.

$k_{nn}^d = 3$ is universally used for calculating the distance vector. Similarly, we show the classification performance under different values in Figure 11. In particular, the results of LLC on the SIFT features are provided besides those of the distance vector and the combination, where four values $k_{nn}^c \in \{2, 5, 10, 20\}$ are explored, as suggested in [2]. For fair comparison, all results here are produced by our own implementations. From Figure 11, the optimal parameter of each method depends heavily on the characteristics of the involved dataset, e.g., the variations of the images, the degree of background clutter, etc. We summarize the observations from Figure 11 for the different representations individually as follows.

1) SIFT: For the three selected datasets, the optimal parameter is quite different, e.g., $k_{nn}^c = 2$ for Flower 102, while $k_{nn}^c = 5$ for Indoor 67 and Caltech 101. This may be caused by the dependence of the optimal parameter value on the interference of cluttered background. In particular, the images in Flower 102 are all segmented, which significantly reduces the influence of the background, so a small neighborhood is sufficient.
2) Distance: The distance vector possesses different semantics from the original local feature, introduced by our proposed transformation. Compared with SIFT, the performance of the distance vector is relatively stable on different datasets. For example, the optimal accuracy is almost always achieved at $k_{nn}^c = 10$.
3) Combine: By taking advantage of both the stable SIFT and the discriminative Distance, the combination is the most robust to the value of $k_{nn}^c$ across all datasets. For example, it achieves almost the same accuracy on Flower 102 at the different values $k_{nn}^c = 1, 2, 3, 4$.

From the above analysis, the parameter $k_{nn}^c$ is very influential on performance when the original SIFT features are used to perform LLC, but this dependence is relaxed for the transformed distance vector. In particular, $k_{nn}^c = 10$ is suggested for both the individual distance vector and the combination in this work.

Fig. 11. Classification accuracy curves of LLC (Original), LDC (Distance), and their combination (Combine) for different $k_{nn}^c \in \{2, 5, 10, 20\}$, where three types of data sets, Flower 102, Indoor 67, and Caltech 101, are adopted. The three methods show different trends as $k_{nn}^c$ varies. In particular, the combination varies the least, i.e., the combination can be considered insensitive to the parameter $k_{nn}^c$. Image best viewed in color.

VI. CONCLUSION

In this paper, we propose the linear distance coding method to capture the discriminative information of local features and to relieve the dependence of spatial pooling on the similarity of object layout across images. Consequently, the proposed method can effectively improve classification performance, which is verified on various types of datasets. The distance vector extracts discriminative information based on the image-to-class distance, a motivation quite different from that of the traditional coding models. The analysis and the experiments show that the distance vector and the original features are complementary to each other. Thus the combination of the two image representations can generally yield higher classification performance. By comparing the classification results of the proposed method on different types of benchmark datasets, we conclude that cluttered background significantly degrades the final classification performance because of its influence on the salient features of different classes. Inspired by this observation, we plan to design a new model that reduces the interference of background in order to improve classification performance, e.g., by embedding segmentation results into the classification framework, which forms one of our future directions.

REFERENCES

[1] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1794–1801.
[2] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality-constrained linear coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3360–3367.
[3] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[4] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1. Jun. 2005, pp. 886–893.
[5] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2. Jun. 2005, pp. 524–531.
[6] J. van Gemert, C. Veenman, A. Smeulders, and J. Geusebroek, "Visual word ambiguity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7, pp. 1271–1283, Jul. 2010.
[7] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2. Jun. 2006, pp. 2169–2178.


[8] O. Boiman, E. Shechtman, and M. Irani, "In defense of nearest-neighbor based image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[9] J. van Gemert, J. Geusebroek, C. Veenman, and A. Smeulders, "Kernel codebooks for scene categorization," in Proc. Eur. Conf. Comput. Vis., Oct. 2008, pp. 696–709.
[10] L. Liu, L. Wang, and X. Liu, "In defense of soft-assignment coding," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2486–2493.
[11] X. Zhou, K. Yu, T. Zhang, and T. Huang, "Image classification using super-vector coding of local image descriptors," in Proc. Eur. Conf. Comput. Vis., vol. 5. Sep. 2010, pp. 141–154.
[12] R. Behmo, P. Marcombes, A. S. Dalalyan, and V. Prinet, "Toward optimal naive Bayes nearest neighbor," in Proc. Eur. Conf. Comput. Vis., vol. 4. Sep. 2010, pp. 171–184.
[13] Z. Wang, Y. Hu, and L.-T. Chia, "Image-to-class distance metric learning for image classification," in Proc. Eur. Conf. Comput. Vis., vol. 1. Sep. 2010, pp. 706–719.
[14] T. Tuytelaars, M. Fritz, K. Saenko, and T. Darrell, "The NBNN kernel," in Proc. Int. Conf. Comput. Vis., vol. 1. Nov. 2011, pp. 1824–1831.
[15] J. Feng, B. Ni, Q. Tian, and S. Yan, "Geometric ℓp-norm feature pooling for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 2609–2704.
[16] S. Gao, I. Tsang, L. Chia, and P. Zhao, "Local features are not lonely – Laplacian sparse coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., San Francisco, CA, Jun. 2010, pp. 3555–3561.
[17] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in Proc. Int. Joint Conf. Comput. Vis. Theory Appl., vol. 1. Lisboa, Portugal, Feb. 2009, pp. 331–340.
[18] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.
[19] K. Yu and T. Zhang, "Improved local coordinate coding using local tangents," in Proc. Int. Conf. Mach. Learn., Jun. 2010, pp. 1215–1222.
[20] M.-E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in Proc. Indian Conf. Comput. Vis., Graph. Image Process., Dec. 2008, pp. 722–729.
[21] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, "PFID: Pittsburgh fast-food image dataset," in Proc. Int. Conf. Image Process., Nov. 2009, pp. 289–292.
[22] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 413–420.
[23] F.-F. Li, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Comput. Vis. Image Understand., vol. 106, no. 1, pp. 59–70, 2007.
[24] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," Dept. Comput. Sci., California Inst. Technology, Tech. Rep. 7694, Apr. 2007.
[25] A. Vedaldi and B. Fulkerson. (2008). VLfeat: An Open and Portable Library of Computer Vision Algorithms [Online]. Available: http://www.vlfeat.org/
[26] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, "Liblinear: A library for large linear classification," J. Mach. Learn. Res., vol. 9, pp. 1871–1874, May 2008.
[27] X. Yuan and S. Yan, "Visual classification with multi-task joint sparse representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3493–3500.
[28] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, "Food recognition using statistics of pairwise local features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2249–2256.
[29] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001.
[30] L.-J. Li, H. Su, E. P. Xing, and F.-F. Li, "Object bank: A high-level image representation for scene classification & semantic feature sparsification," in Proc. Adv. Neural Inf. Process. Syst., Dec. 2010, pp. 1378–1386.
[31] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce, "Learning mid-level features for recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2559–2566.
[32] H. Zhang, A. C. Berg, M. Maire, and J. Malik, "SVM-KNN: Discriminative nearest neighbor classification for visual category recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2. Jun. 2006, pp. 2126–2136.
[33] O. Duchenne, A. Joulin, and J. Ponce, "A graph-matching kernel for object categorization," in Proc. Int. Conf. Comput. Vis., vol. 5. Barcelona, Spain, Nov. 2011, pp. 1792–1799.

Zilei Wang received the B.S. and Ph.D. degrees in control theory and control engineering from the University of Science and Technology of China (USTC), Hefei, China, in 2002 and 2007, respectively. He is currently an Associate Professor with the Department of Automation, USTC, and is also with the Vision and Machine Learning Laboratory, National University of Singapore, Singapore, as a Research Fellow. His current research interests include computer vision and media streaming techniques.

Jiashi Feng received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 2007. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. His current research interests include computer vision and machine learning.

Shuicheng Yan (M'06–SM'09) is currently an Assistant Professor with the Department of Electrical and Computer Engineering, National University of Singapore, where he is the Founding Lead of the Learning and Vision Research Group (http://www.lv-nus.org). His current research interests include computer vision, multimedia, and machine learning. He has authored or co-authored over 200 technical papers. He was a recipient of the Best Paper Award from ICIMCS in 2009, ACMMM in 2010, and ICME in 2010, the Winner Prize of the Classification Task in PASCAL VOC in 2010, the Honorable Mention Prize of the Detection Task in PASCAL VOC in 2010, the TCSVT Best Associate Editor (BAE) Award in 2010, and the co-author of the Best Student Paper Award of PREMIA in 2009 and PREMIA in 2011. He is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, and the Guest Editor of the special issues for TMM and CVIU.

Hongsheng Xi received the B.S. and M.S. degrees in applied mathematics from the University of Science and Technology of China (USTC), Hefei, China, in 1980 and 1985, respectively. He is currently a Professor with the Department of Automation, USTC, where he also directs the Laboratory of Network Communication Systems and Control. His current research interests include stochastic control systems, network performance analysis and optimization, wireless communications, and signal processing.
