Professional Documents
Culture Documents
Jinjun Wang† , Jianchao Yang‡ , Kai Yu§ , Fengjun Lv§ , Thomas Huang‡ , and Yihong Gong†
†
Akiira Media System, Palo Alto, California
‡
Beckman Institute, University of Illinois at Urbana-Champaign
§
NEC Laboratories America, Inc., Cupertino, California
† ‡ §
{jjwang, ygong}@akiira.com, {jyang29, huang}@ifp.uiuc.edu, {kyu, flv}@sv.nec-labs.com
tional SPM. LLC utilizes the locality constraints to project c* = argmin || xi – ciTBi ||2
c
K
erate the final representation. With linear classifier, the pro- Descriptor
input: xi
posed approach performs remarkably better than the tra-
ditional nonlinear SPM, achieving state-of-the-art perfor-
mance on several benchmarks. Feature Extraction
Compared with the sparse coding strategy [22], the ob- codebook: B={bj} j=1,…,M
Image
Step 1:
jective function used by LLC has an analytical solution. Find K-Nearest Neighbors of xi, denoted
as Bi
In addition, the paper proposes a fast approximated LLC
method by first performing a K-nearest-neighbor search Figure 1. Left: flowchart of the spatial pyramid structure for pool-
and then solving a constrained least square fitting problem, ing features for image classification. Right: the proposed LLC
bearing computational complexity of O(M + K 2 ). Hence coding process.
even with very large codebooks, our system can still process
multiple frames per second. This efficiency significantly
adds to the practical values of LLC for real applications. by using SPM [15]. Motivated by [10], the SPM method
partitions the image into increasingly finer spatial sub-
regions and computes histograms of local features from
1. Introduction each sub-region. Typically, 2l × 2l subregions, l = 0, 1, 2
The recent state-of-the-art image classification systems are used. Other partitions such as 3 × 1 has also been at-
consist of two major parts: bag-of-features (BoF) [19, 4] tempted to incorporate domain knowledge for images with
and spatial pyramid matching (SPM) [15]. The BoF method “sky” on top and/or “ground” on bottom. The resulted “spa-
represents an image as a histogram of its local features. It tial pyramid” is a computationally efficient extension of the
is especially robust against spatial translations of features, orderless BoF representation, and has shown very promis-
and demonstrates decent performance in whole-image cat- ing performance on many image classification tasks.
egorization tasks. However, the BoF method disregards A typical flowchart of the SPM approach based on BoF
the information about the spatial layout of features, hence is illustrated on the left of Figure 1. First, feature points are
it is incapable of capturing shapes or locating an object. detected or densely located on the input image, and descrip-
Of the many extensions of the BoF method, including tors such as “SIFT” or “color moment” are extracted from
the generative part models [7, 3, 2], geometric correspon- each feature point (highlighted in blue circle in Figure 1).
dence search [1, 14] and discriminative codebook learn- This obtains the “Descriptor” layer. Then, a codebook with
ing [13, 17, 23], the most successful results were reported M entries is applied to quantize each descriptor and gen-
1
erate the “Code” layer, where each descriptor is converted and SPM pooling to get the final representation). This effi-
into an RM code (highlighted in green circle). If hard vector ciency significantly adds to the practical values of LLC for
quantization (VQ) is used, each code has only one non-zero many real applications.
element, while for soft-VQ, a small group of elements can The remainder of the paper is organized as follows: Sec-
be non-zero. Next in the “SPM” layer, multiple codes from tion 2 introduces the basic idea of the LLC coding and lists
inside each sub-region are pooled together by averaging and its attractive properties for image classification; Section 3
normalizing into a histogram. Finally, the histograms from presents an approximation method to allow fast LLC com-
all sub-regions are concatenated together to generate the fi- putation; In Section 4, an incremental training algorithms
nal representation of the image for classification. is proposed to construct the codebook used by LLC; In
Although the traditional SPM approach works well for Section 5, image classification results on three widely used
image classification, people empirically found that, to datasets are reported; and Finally in Section 6, conclusions
achieve good performance, traditional SPM has to use clas- are made, and some future research issues are discussed.
sifiers with nonlinear Mercer kernels, e.g., Chi-square ker-
nel. Accordingly, the nonlinear classifier has to afford addi- 2. Locality-constrained Linear Coding
tional computational complexity, bearing O(n3 ) in training
and O(n) for testing in SVM, where n is the number of Let X be a set of D-dimensional local descriptors ex-
support vectors. This implies a poor scalability of the SPM tracted from an image, i.e. X = [x1 , x2 , ..., xN ] ∈ RD×N .
approach for real applications. Given a codebook with M entries, B = [b1 , b2 , ..., bM ] ∈
RD×M , different coding schemes convert each descriptor
To improve the scalability, researchers aim at obtaining
into a M -dimensional code to generate the final image
nonlinear feature representations that work better with lin-
representation. This section reviews two existing coding
ear classifiers, e.g. [26, 22]. In particular, Yang et al. [22]
schemes and introduces our proposed LLC method.
proposed the ScSPM method where sparse coding (SC) was
used instead of VQ to obtain nonlinear codes. The method
2.1. Coding descriptors in VQ
achieved state-of-the-art performances on several bench-
marks. Yu et al. [24] empirically observed that SC results Traditional SPM uses VQ coding which solves the fol-
tend to be local – nonzero coefficients are often assigned to lowing constrained least square fitting problem:
bases nearby to the encoded data. They suggested a modifi-
N
cation to SC, called Local Coordinate Coding (LCC), which
explicitly encourages the coding to be local, and theoreti- arg min ||xi − Bci ||2 (1)
C
i=1
cally pointed out that under certain assumptions locality is
more essential than sparsity, for successful nonlinear func- s.t.ci 0 = 1, ci 1 = 1, ci 0, ∀i
tion learning using the obtained codes. Similar to SC, LCC
requires to solve L1-norm optimization problem, which is where C = [c1 , c2 , ..., cN ] is the set of codes for X. The
however computationally expensive. cardinality constraint ci 0 = 1 means that there will be
only one non-zero element in each code ci , corresponding
In this paper, we present a novel and practical coding
to the quantization id of xi . The non-negative, 1 constraint
scheme called Locality-constrained Linear Coding (LLC),
ci 1 = 1, ci ≥ 0 means that the coding weight for x
which can be seem as a fast implementation of LCC that
is 1. In practice, the single non-zero element is found by
utilizes the locality constraint to project each descriptor into
searching the nearest neighbor.
its local-coordinate system. Experimental results show that,
the final representation (Figure 1) generated by using LLC
2.2. Coding descriptors in ScSPM [22]
code can achieve an impressive image classification accu-
racy even with a linear SVM classifier. In addition, the op- To ameliorate the quantization loss of VQ, the restrictive
timization problem used by LLC has an analytical solution, cardinality constraint ci 0 = 1 in Eq.(1) can be relaxed
where the computational complexity is only O(M + M ) by using a sparsity regularization term. In ScSPM [22],
for each descriptor. We further propose an approximated such a sparsity regularization term is selected to be the 1
LLC method by performing a K-nearest-neighbor (K-NN) norm of ci , and coding each local descriptor xi thus be-
search and then solving a constrained least square fitting comes a standard sparse coding (SC) [16] problem1 :
problem. This further reduces the computational complex-
N
ity to O(M + K), where K is the number of nearest neigh-
bors, and usually K < 0.1 × M . As observed from our ex- arg min xi − Bci 2 + λci 1 (2)
C
i=1
periment, using a codebook with 2048 entries, a 300 × 300
image requires only 0.24 second on average for processing 1 The non-negative constraint in Eq.(1) is also dropped out, because a
(including dense local descriptors extraction, LLC coding negative ci can be absorbed by flipping the corresponding basis.
The sparsity regularization term plays several important input: xi input: xi input: xi
french-horn, acc: 91% golden-gate, acc: 95% guitar-pick, acc: 91% hawksbill, acc: 91% ketch, acc: 96%
leopards, acc: 100% mars, acc: 88% motorbikes, acc: 94% lawn-mower, acc: 100% sunflower, acc: 95%
tower-pisa, acc: 100% trilobite, acc: 100% zebra, acc: 94% airplanes, acc: 100% faces, acc: 99%
Figure 3. Example images from classes with highest classification accuracy from the Caltech-256 dataset
in comparison with the best performance of the 2007 chal- and the cardinality ranges between 4 ∼ 34 during training.
lenge [6], as well as another recent results in [20]. To We generated two codebooks with 1024 and 2048 entries
make fair comparison, two noteworthy comments need to for each method respectively. Then the same experimen-
be made: tal setup as that reported in Subsection 5.1 were used, and
1. In [6], multiple descriptors were used with nonlinear the results are plotted in Figure 5. As shown, under all dif-
classifier in the winner system. ferent numbers of training samples per class, the learned
2. In [20], multiple decision systems were combined to codebook by Alg.4.1 improved the classification accuracy
get the final results. by 0.3 ∼ 1.4 percent over the codebook by K-Means.
3. For our LLC algorithm, we only used single descriptor
average classification accuracy
75
(HoG) and simple linear SVM as the classifier.
70
Even though, as seen from Table 3, our LLC method can 65
still achieve similar performance as the best system in [6] 60
and [20]. In fact, if only dense SIFT (similar to dense HOG 55 kmean (1024)
learned (1024)
in our framework) was used, the best result on the validation 50 kmean (2048)
set only scored 51.8% in [6], while our method achieved 45
learned (2048)
5 10 15 20 25 30
55.1%, winning a remarkable margin of 3.3%. number of training images per class
75 75
70 70
65
65
60
60
2−NN 55
55 5−NN unconstrained
10−NN 50 shift−invariant
50 20−NN 45 non−negative
40−NN both
45 40
5 10 15 20 25 30 5 10 15 20 25 30
number of training images per class number of training images per class
Figure 6. Performance under different neighbors (Caltech-101) Figure 7. Performance under different constraint (Caltech-101)
average classification accuracy
70
since K is small, solving Eq.(7) with different constraints
60
doesn’t require significantly different computation. As il-
50
lustrated in Figure.7, the shift-invariant constraint leads to
40 sum/norm−2
the best performance. The non-negative constraint, either max/norm−2
30
used alone or together with the shift-invariant constraint, sum/sum
max/sum
20
decreases the performance dramatically. 5 10 15 20 25 30
number of training images per class
Finally, we evaluated the types of pooling and normal- Figure 8. Performance with different SPM setting (Caltech-101)
ization used in the “SPM” layer. As can be observed from
Figure.8, the best performance is achieved by the “max
pooling” and “l2 normalization” combination. The good 6. Conclusion and Future Work
performance of “max pooling” is in consistence with [22].
For the “l2 normalization”, it makes the inner product of This paper presents a promising image representation
any vector with itself to be one, which is desirable for linear method called Locality-constrained Linear Coding (LLC).
kernels. LLC is easy to compute and gives superior image classi-
fication performance than many existing approaches. LLC [12] P. Jain, B. Kullis, and K. Grauman. Fast image search for
applies locality constraint to select similar basis of local im- learned metrics. Proc. of CVPR’08, 2008.
age descriptors from a codebook, and learns a linear com- [13] F. Jurie and B. Triggs. Creating efficient codebooks for vi-
bination weight of these basis to reconstruct each descrip- sual recognition. Proc. of ICCV’05, 2005.
tor. The paper also introduces an approximation method [14] S. Lazebnik, C. Schmid, and J. Ponce. A maximum en-
to further speed-up the LLC computation, and an optimiza- tropy framework for part-based texture and object recogni-
tion method to incrementally learn the LLC codebook us- tion. Proc. of ICCV’05, 2005.
ing large-scale training descriptors. Experimental results [15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of
features: Spatial pyramid matching for recognizing natural
based on several well-known dataset validate the good per-
scene categories. Proc. of CVPR’06, 2006.
formance of LLC.
[16] H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse cod-
The future work includes the following issues: First,
ing algorithms. Advances in Neural Information Processing
additional codebook training methods, such as super- Systems, MIT Press, pages 801–808, 2007.
vised training, will be investigated; Second, besides exact- [17] F. Moosmann, B. Triggs, and F. Jurie. Randomized cluster-
nearest-neighbor search applied in the paper, approximated- ing forests for building fast and discriminative visual vocab-
nearest-neighbor search algorithms will be evaluated to fur- ularies. Proc. of NIPS’07, 2007.
ther improve the computational efficiency of approximated [18] S. Roweis and L. Saul. Nonlinear dimensionality reduction
LLC; And third, integration of the LLC technique into prac- by locally linear embedding. Science, pages 2323–2326,
tical image/video search/retrieval/indexing/management 2000.
applications is on-going. [19] J. Sivic and A. Zisserman. Video google: A text retrieval
approach to object matching in videos. Proc. of ICCV’03,
References pages 1470–1477, 2003.
[20] J. Uijlings, A. Smeulders, and R. Scha. What is the spatial
[1] A. Berg, T. Berg, and J. Malik. Shape matching and object extent of an object? Proc. of CVPR’09, 2009.
recognition using low distortion correspondences. Proc. of [21] J. Wang, S. Zhu, and Y. Gong. Resolution en-
CVPR’05, pages 26–33. hancement based on learning the sparse association
[2] O. Boiman, E. Shechtman, and M. Irani. In defense of image patches. PatternRecognition Lett. (2009)
of nearest-neighbor based image classification. Proc. of doi:10.1016/j.patrec.2009.09.004.
CVPR’08, 2008. [22] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyra-
[3] A. Bosch, A. Zisserman, and X. Munoz. Scene classifica- mid matching using sparse coding for image classification.
tion using a hybrid generative/dicriminative approach. IEEE Proc. of CVPR’09, 2009.
Trans. on Pattern Analysis and Machine Intelligence (PAMI), [23] L. Yang, R. Jin, R. Sukthankar, and F. Jurie. Unifying dis-
2008. criminative visual codebook generation with classifier train-
[4] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. ing for object category recognition. Proc. of CVPR’08, 2008.
Visual categorization with bags of keypoints. Workshop on [24] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local
Statistical Learning in Computer Vision, ECCV, pages 1–22, coordinate coding. Proc. of NIPS’09, 2009.
2004. [25] H. Zhang, A. Berg, M. Maire, and J. Malik. Svm-knn: Dis-
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for criminative nearest heighbor classification for visual cate-
human detection. Proc. of CVPR’05, pages 886–893, 2005. gory recognition. Proc. of CVPR’06, 2006.
[6] M. Everingham, L. Gool, C. Williams, J. Winn, and A. Zis- [26] X. Zhou, N. Cui, Z. Li, F. Liang, and T. Huang. Hierarchical
serman. The PASCAL Visual Object Classes Challenge 2007 gaussianization for image classification. Proc. of ICCV’09,
(VOC2007) Results. 2009.
[7] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative
visual models from few training examples: an incremental
bayesian approach tested on 101 object categories. In IEEE
CVPR Workshop on Generative-Model Based Vision, 2004.
[8] J. Gemert, J. Geusebroek, C. Veenman, and A. Smeul-
ders. Kernel codebooks for scene categorization. Proc. of
ECCV’08, pages 696–709, 2008.
[9] D. Goldfarb and A. Idnani. A numerically stable dual method
for solving strictly convex quadratic programs. Mathemati-
cal Programming, pages 1–33, 1983.
[10] K. Grauman and T. Darrell. Pyramid match kernels: Dis-
criminative classification with sets of image features. Proc.
of ICCV’05, 2005.
[11] G. Griffin, A. Holub, and P. Perona. Caltech-256 object cat-
egory dataset. (7694), 2007.