A Fast Sparse Coding Method for Image Classification

Image classification is an important problem in computer vision. The sparse coding spatial pyramid matching (ScSPM) framework is widely used in this field. However, the sparse coding cannot effectively handle very large training sets because of its high computational complexity, and ignoring the mutual dependence among local features results in highly variable sparse codes even for similar features. To overcome the shortcomings of previous sparse coding algorithm, we present an image classification method, which replaces the sparse dictionary with a stable dictionary learned via low computational complexity clustering, more specifically, a k-medoids cluster method optimized by k-means++. The proposed method can reduce the learning complexity and improve the feature’s stability. In the experiments, we compared the effectiveness of our method with the existing ScSPM method and its improved versions. We evaluated our approach on two diverse datasets: Caltech-101 and UIUC-Sports. The results show that our method can increase the accuracy of spatial pyramid matching, which suggests that our method is capable of improving performance of sparse coding features.


Introduction
Image classification is an important problem in computer vision.Pattern recognition and machine learning techniques have been widely applied in this field, which usually extract image features and then classify images according to the features.The setting of image features often significantly affects the performance of classification.Feature encoding methods, especially vector quantization (VQ) and sparse coding (SC), have attracted more and more attention because their well demonstrated performances.
The bag-of-features (BoF) representation methods [1][2][3] are important application of VQ and have been used in image classification and 1-D signal recognition [4].This kind of representation is regarded as high-level feature because it represents an image as a vector of occurrence counts of a vocabulary built on the low-level feature extracted from subregions.The high-level feature concerns with the interpretation or classification of a scene, thus may achieve higher accuracy on most images.However, the BoF model disregards all information about the spatial layout of the features, thus may decrease the accuracy when classifying spatially sensitive images.To overcome this disadvantage, two extensions of the BoF model have been developed in the task of image classification: generative models [5][6][7][8] and spatial pyramid matching (SPM) [9].
Generative models are constructed by the latent variables learned from BoF, and classify images according to their intermediate semantics.They consider an image not only as a collection of codes [5,7] but also as a more complicated structure [10,11], which contains abundant image information and is close to the way that human beings see the image.This can explain why generative models have gained growing attention of the research community.
Compared with generative models, SPM can achieve higher accuracy [12].The SPM method [9] mines the spatial information through partitioning the image into increasingly finer spatial subregions and employs pyramid Match Kernel [1] to compare corresponding subregions.Typically, it partitions an image into 4 l subregions in different level (l = 0, 1, 2 ) and computes the BoF histogram within each of the 21 subregions.The BoF is a special case of SPM because SPM reduces to a standard BoF when l = 0.Many image classification tasks such as Caltech-101 [6] and 15-Scenes [9] have shown very promising performance of SPM.
Although the traditional SPM method works well for image classification, classifiers with nonlinear Mercer kernels, e.g., Chi-square kernel, are usually used to boost performance.Accordingly, the nonlinear SVM has a complexity O(n 2 ∼ n 3 ) in training and O(n) in testing, where n is the number of training set images [13,14].When the training set contains many samples, this complexity will cause many calculations, which implies a poor scalability for SPM's real-world applications.To overcome this drawback, Yang et al. [13] proposed an extension of SPM by using Sparse Coding (ScSPM) and achieved state-of-the-art performance on several benchmarks.The ScSPM computes a spatial-pyramid image representation by sparse coding (SC) instead of vector quantization (VQ).This method works better with employing linear classifiers, and thus reduces the computation of classification algorithm [14][15][16].
The ScSPM method represents an image through the following three modules: (1) SIFT extraction; (2) sparse coding; and (3) spatial pooling.The sparse coding contains dictionary learning and feature quantization, which are the most important and govern the quality of image presentation.The ScSPM employs efficient sparse coding method [17] that adopts Lagrange dual to learn dictionary whose components are linear combinations of low-level features.The method has been shown experimentally to be much faster than first-order gradient descent methods [17], and is capable of generating state-of-the-art features in image classification.However, to further improve its performance, some problems need to be solved.First, it cannot effectively handle large training sets because the dictionary learning algorithm has high computational complexity [18,19].Second, local features are dealt with separately, thus the mutual dependence among local features is ignored, which results in highly variable sparse codes even for similar features [16].
To solve these problems, we propose a self-representation sparse coding strategy that separates dictionary learning algorithm from sparse coding.Under this strategy, sparse dictionary is built with low-level feature extracted from some important subregions.This is different from existing sparse coding method, which learns a dictionary by employing linear combination of low level feature extracted from all subregions in training dataset.We employ cluster centers of k-medoids cluster to build our dictionary, because this cluster builds cluster centers by stably selecting local feature extracted from important subregions and can reduce the learning complexity of Lagrange dual.Therefore, the proposed method can solve the sparse coding's problem of high computational complexity and variable sparse codes for similar features.
Based on this strategy, we propose an image classification framework, as shown in Figure 1.Firstly, SIFT descriptors are extracted from each of subregions.Secondly, k-medoids cluster is applied to learn a dictionary from SIFT descriptors.Then, ScSPM method is employed to generate image feature.Finally, classification is performed to classify images according to image feature.

SIFT Feature Extraction
The scale-invariant feature transform (SIFT) is employed as low-level feature in our framework.It is an algorithm in computer vision to detect and describe local regions in images.Four different ways of extracting local regions have been tested [5], which proved that the region descriptors of 128-dimensional SIFT [20] have more useful information and better robustness.These local features have been widely used in image classification [7, 21,22].Therefore, we employed a common setting of regions descriptors that adopt 128-dimensional gray level intensity of SIFT.

k-Medoids Dictionary Learning
The cluster samples are 128-dimensional SIFT descriptors that are extracted from locals of each image, and the k-means++ optimized k-medoids cluster [4] is employed as dictionary learning algorithm, which chooses data points as centers (medoids or exemplars) and works with a generalization of the Manhattan Norm to define distance between data points.
The k-medoids dictionary learning algorithm is shown in Table 1.Firstly, k-means++ algorithm is adopted to initialize medoids.The first medoid is set by random extraction from all cluster samples.Then, iterations until k initialize medoids are selected; in each iteration, a new medoid is selected based on selecting probability of each sample.Secondly, selected medoids are employ as initialized medoids to apply k-medoids cluster algorithm.Finally, convergent cluster centers are adopted as dictionary for sparse coding.

Sparse Coding
An efficient sparse coding algorithm [17] is employed as mid-level feature, which represents images as a set of codes according to its 128-dimensional SIFT descriptors that are extracted from locals of each image.This sparse coding algorithm has a dictionary learning step and an encoding step, and is based on iteratively solving two convex optimization problems: an L1-regularized least squares problem and an L2-constrained least squares problem.In our framework, the dictionary learning step is replaced by the proposed k-medoids dictionary learning method, and existing encoding step is employed to generate the image coding based on dictionary learned by our method.

ScSPM Feature Building and Classification
The ScSPM model [13] is employ for generalizing vector quantization to sparse coding followed by multi-scale spatial max pooling and employed a linear SPM kernel based on SIFT sparse codes.For any image represented by a set of descriptors, we can compute a single feature vector based on the descriptors' codes generated by sparse coding algorithm.Let U be the code obtained via sparse coding algorithm; ScSPM feature is computed by the histogram pooling method: In ScSPM, an approach of using linear SVMs based SC of SIFT is advocated.Let U be the result of applying the sparse coding Equation (1) to a descriptor set; assuming the dictionary to be pre-learned and fixed, image feature is computed by a pre-chosen pooling function: where the pooling function F is defined on each column of U, and each column of U corresponds to the responses of all the local descriptors to one specific item in dictionary.The pooling function F is a max pooling function on the absolute sparse codes: where z j is the jth element of z, u ij is the matrix element at ith row and jth column of U, and M is the number of local descriptors in the region.
Let image I i be represented by z i , a simple linear SPM kernel, as shown in Equation ( 4), are employed in SVM, so that image can be labeled by SVM training and classification algorithm.
where z i , z j = z T i z j , and z l i (s, t) is the max pooling statistics of the descriptor sparse codes in the (s, t)th segment of image I i in the scale level l.

Results
We evaluated our approach on two diverse datasets: Caltech-101 [6] and UIUC-Sports [23].All experiments were repeated ten times with different, randomly selected training and testing images, and the final results were averaged for classification accuracy.Multi-class classification was done with a simple linear SVM classifier trained using the one-versus-all rule: a classifier was learned to separate each class from the rest, and a test image was assigned the label of the classifier with the highest response.

Caltech-101 Dataset
The Caltech-101 dataset contains 101 classes (including animals, vehicles, flowers, etc.) and totals 9144 images with high shape variability, which are provided by Fei-Fei Li [6].The number of images per category varies from 31 to 800.Most images are medium resolution, i.e., about 300 × 300 pixels.We followed the common experiment setup for Caltech-101, i.e., training on 15 and 30 images per category, denoted as 15 training experiment and 30 training experiment, and testing on the rest.
Table 2 shows the comparison between our approach and other SPM based features.Our approach could classify images at the highest accuracy on both 15 training experiment and 30 training experiment.Especially, the accuracy of our method is much higher than that methods based on SPM technology, such as SPM [1], ScSPM [13], and LLC [14].Different from existing SPM based methods, our method adopts the proposed dictionary learning strategy; the results prove that this strategy yields an observable performance boost for SPM.Table 3 shows the comparison between our approach and the state-of-the-art methods.Our approach could classify images at the highest accuracy.Although the computational complexity of our approach is much lower than deep learning methods, the accuracy of our method is higher than deep learning method reported in [24] and equal to deep learning method reported in [25], which shows advantage of our method in both reducing computational cost and increasing accuracy.

UIUC-Sports Dataset
This dataset is composed of eight complex event classes provided by Li-Jia Li and Fei-Fei Li [23], and contains 1579 color images of different sizes.There are 194 rock climbing, 200 badminton, 137 bocce, 236 croquet, 182 polo, 250 rowing, 190 sailing, and 190 snowboarding images.We followed their experimental setting in Object Bank [32] by using 70 randomly drawn images from each class for training and 60 for testing.
Table 4 shows the comparison between our approach and other SPM based features.Our approach could classify images at the highest accuracy.Our method could increase the accuracy by 1.4% compared to LScSPM and 4.4% Equation Local Soft Assignment based on SPM technology, which proves that the proposed strategy boosts the performance of SPM.Table 5 shows the comparison between our approach and state-of-the-art methods.Our approach could increase the accuracy compared to feature extraction models, such as LLKc [33], N 3 SC encoder [29], and CLGC(RGB-RGB) [31].Although the accuracy of our approach was lower than deep learning models, such as Hybrid-CNN [25], TPN-FS [34], and DeepSCNet [35], it could reduce computational cost compared to deep learning methods, which suggest that it is still useful in real-time processing systems.

Discussion
Image classification is an important research field in computer vision.The ScSPM method, which represents an image through SIFT extraction, sparse coding, and spatial pooling, has shown its advantage in this field.However, the dictionary learning involved in sparse coding demands high computational complexity and is very unstable.To solve these problems, in this paper, we propose an image classification framework, whose dictionary learning and feature quantization are developed as two independent parts of sparse coding.Firstly, SIFT descriptors are extracted from each of subregions.Secondly, k-medoids cluster is applied to learn a dictionary from SIFT descriptors.Then, ScSPM method is employed to generate image feature.Finally, classifier is applied to classify images according to image feature.
We evaluated our approach on two diverse datasets: Caltech-101 and UIUC-Sports.The results show that our method can increase accuracy of spatial pyramid matching, which suggests that our method can improve performance of sparse coding features.
Although the proposed method has improved the classification performance, similar categories may be misclassified mutually.To improve this situation, in our future work, we will try to use more efficient low-level features, and apply multiple stage classifiers for boosting the classify performance.

Figure 1 .
Figure 1.Flow chart of our framework.

Table 1 .
The algorithm of fast dictionary learning.// Initialization // Input all samples as X = [x 1 , x 2 , • • • , x N ] // Randomly select a sample to join the medoids set M set as the first medoid M p x j − M (p) set end // Calculating the selecting probability of the candidate samples for j = 1 :N do P j = x j − M k x i − M set = arg min

Table 2 .
Classification accuracy comparison between our approach and other SPM based features on Caltech-101 dataset.

Table 4 .
Classification accuracy comparison between our approach and other SPM based features on UIUC-Sports dataset.