M-SAC-VLADNet: A Multi-Path Deep Feature Coding Model for Visual Classification

Vector of locally aggregated descriptors (VLAD) coding has become an efficient feature coding model for retrieval and classification. Recent work has extended VLAD coding to a deep feature coding model called NetVLAD, which improves significantly over the original VLAD method. Although NetVLAD has shown its potential for retrieval and classification, its discriminative ability has not been fully explored. In this paper, we propose a new end-to-end feature coding network that is more discriminative than the NetVLAD model. First, we propose a sparsely-adaptive and covariance VLAD model. Next, we derive the back propagation models of all the proposed layers and extend the proposed feature coding model to an end-to-end neural network. Finally, we construct a multi-path feature coding network that aggregates multiple newly-designed feature coding networks for visual classification. Experimental results show that our feature coding network is very effective for visual classification.


Introduction
Deep learning models have gained great attention in the field of computer vision, including visual classification [1][2][3][4][5][6][7][8], super resolution [9,10], semantic segmentation [11,12], object detection [13][14][15] and visual tracking [16]. Compared with traditional statistical learning methods, deep learning models have two main advantages: (1) with end-to-end training, the network parameters can be fitted to the final task; and (2) the deep network representation provides a better description. Deep feature methods can significantly improve performance over conventional feature methods, such as the scale invariant feature transform (SIFT) [17] and histograms of gradients (HOG) [18].
Since the end-to-end training model and deep structure representation have great advantages, some recent papers embed the domain knowledge of conventional statistical learning models into the deep neural network and train the entire model in an end-to-end manner. The new neural networks not only inherit the domain expertise but also make all the parameters more suitable for the final application tasks. Representative works include the following. Zuo et al. [19] proposed a novel iteration-wise $l_p$-norm regularizer, derived from the maximum a posteriori (MAP) model, to obtain outstanding blind de-convolution results. Peng et al. [20] proposed a novel deep subspace clustering method with a sparse prior to obtain state-of-the-art clustering results. Wang et al. [21] proposed a novel end-to-end $l_\infty$-norm encoder to obtain state-of-the-art hashing results. Zheng et al. [12] treated the conditional random extends the affine subspace method in [38] to a 1 × 1 convolutional layer which reduces the dimension of the coding.
The second contribution is the proposed Multi-path SAC-VLADNet (M-SAC-VLADNet). Existing feature coding networks only extract the features of the last convolutional layer of a deep convolutional network to compute the feature codings, so these models cannot take full advantage of the convolutional representations for visual classification. To take full advantage of multi-level representations, the proposed M-SAC-VLADNet aggregates multiple SAC-VLAD layers. In the M-SAC-VLADNet, we first extract the convolutional features from multiple layers. Next, we obtain the corresponding SAC-VLAD coding for each convolutional feature. Finally, we aggregate all the SAC-VLAD codings to construct the final multi-path feature coding network, which is also an end-to-end feature coding model. The M-SAC-VLADNet can simultaneously use the low-, middle- and high-level features to train multiple feature coding networks, and is thus expected to be more discriminative than a single-level feature coding network.
The third contribution is that the back propagation function of each new layer is derived. Based on the back propagation algorithm, all the learnable parameters can be obtained. The back propagation models of the affine subspace layer and the covariance VLAD layer are easily obtained. The SASAC layer is a completely new structure layer, thus we discuss the back propagation model of the SASAC layer in detail. Various visual classification experiments show the advantages of the new layers. In addition, the visual recognition results demonstrate that SAC-VLADNet is clearly better than SAC-VLAD, and M-SAC-VLADNet is better than SAC-VLADNet. These results demonstrate the advantages of the end-to-end model and the proposed multi-path feature coding network. We also give detailed experimental results of our network and other state-of-the-art models to show the advantages of our network.
The remainder of this paper is organized as follows. Section 2 briefly introduces the traditional feature coding framework, the CNN feature for feature coding network and the end-to-end NetVLAD model. Section 3 presents the SAC-VLADNet and the M-SAC-VLADNet. Section 4 gives the experimental comparisons between the proposed model and other state-of-the-art models. Finally, Section 5 concludes this paper.

Related Work
This section first introduces the traditional feature coding framework for visual classification, then the CNN features used in feature coding networks, and finally the NetVLAD method.

The Conventional Feature Coding Framework for Image Recognition
The traditional feature coding framework can be divided into five steps: (1) extracting the SIFT [17] features from all the images; (2) solving a minimization problem over all the training SIFT representations to obtain a dictionary; (3) computing the feature codings with a specific feature coding method; (4) pooling the feature codings to get the pooled vectors; and (5) training the final support vector machine (SVM) classifier on the pooled vectors to get the classification result. The block diagram of the traditional feature coding framework for image recognition is shown in Figure 1a.

The CNN Feature for Feature Coding Network
Since the SIFT [17] feature does not have a strong image representation ability, the image classification performance of the traditional feature coding methods is not always satisfactory. Recently, some feature coding models that utilize CNN features have been proposed. Compared with the shallow SIFT feature, the CNN feature is a deeper and more descriptive representation of the original image. In visual classification, the CNN based feature coding networks are clearly better than the SIFT based feature coding methods. Fisher Vector with CNN (FV-CNN [39]) is a representative feature coding network based on the CNN feature. FV-CNN [39] trains a Gaussian mixture model (GMM) dictionary on the CNN feature and obtains the Fisher Vector (FV) codings with the trained GMM dictionary. The block diagram of the FV-CNN [39] for visual classification is shown in Figure 1b.
To obtain the CNN features of a feature coding network, all the images need to pass through a CNN which is pre-trained on the large scale ImageNet [40] dataset. The most useful features extracted from the pre-trained CNN are the feature of the last convolutional layer and the feature of the last fully connected layer [39]. In the proposed model, we extract the feature of a specific convolutional layer to train our feature coding network.
For an RGB image $I \in \mathbb{R}^{S \times S \times 3}$ of size $S$, the extracted feature of a specific convolutional layer of a deep CNN can be expressed as $F \in \mathbb{R}^{O \times O \times D}$, where $D$ represents the number of convolutional kernels of that layer and $O$ represents the spatial size of the convolutional feature. $F$ can also be viewed as a feature set which contains $M = O \times O$ convolutional descriptors, each of which is $D$-dimensional.
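To make the descriptor view concrete, the following minimal sketch flattens a toy feature map into its M = O × O descriptors. It uses plain nested Python lists; the function name `feature_map_to_descriptors` and the toy sizes are illustrative assumptions, not part of the proposed model.

```python
def feature_map_to_descriptors(feature_map):
    """Flatten an O x O x D feature map (nested lists) into M = O*O descriptors.

    Each spatial position contributes one D-dimensional descriptor.
    """
    return [list(cell) for row in feature_map for cell in row]


# Toy example (hypothetical sizes): O = 2, D = 3, so M = 4 descriptors.
O, D = 2, 3
F = [[[float(r * O * D + c * D + d) for d in range(D)]
      for c in range(O)]
     for r in range(O)]
descriptors = feature_map_to_descriptors(F)
print(len(descriptors), len(descriptors[0]))
```

In a real network the same reshaping is applied to the output tensor of the chosen convolutional layer.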

The End-to-End NetVLAD Model
The NetVLAD model uses the last convolutional feature to train the NetVLAD layer, thus the descriptor set $F_i = \{f_{ij}\}_{j=1}^{M}$ represents the last convolutional feature of the $i$th image $I_i$, and the total number of images is $N$. $f_{ij} \in \mathbb{R}^{D \times 1}$ is the $j$th descriptor of $F_i$. Besides, the NetVLAD model uses $K$ visual words $\{c_k\}_{k=1}^{K}$ ($c_k \in \mathbb{R}^{D \times 1}$) as the dictionary. For $F_i$, the final VLAD vector is $K \times D$-dimensional and can be expressed as:

$$\Psi(F_i) = \sum_{j=1}^{M} \Psi(f_{ij}), \qquad (1)$$

where $\Psi(f_{ij}) \in \mathbb{R}^{KD \times 1}$ is the VLAD representation of $f_{ij}$. The expression of $\Psi(f_{ij})$ is:

$$\Psi(f_{ij}) = \left[\phi(f_{ij}^{1})^T, \phi(f_{ij}^{2})^T, \cdots, \phi(f_{ij}^{K})^T\right]^T, \qquad (2)$$

where the sub vector $\phi(f_{ij}^{k}) \in \mathbb{R}^{D \times 1}$ in Equation (2) is written as:

$$\phi(f_{ij}^{k}) = \lambda_{ij}(k)\,(f_{ij} - c_k), \qquad (3)$$

where $\lambda_{ij}(k)$ represents the weight coefficient of $c_k$ and $f_{ij}$. In the traditional VLAD [35] model, hard assignment coding is used as the weight coefficient. In the NetVLAD model, soft assignment coding [31] is used as the weight coefficient, and the soft assignment coding is written as:

$$\lambda_{ij}(k) = \frac{\exp\left(-\|f_{ij} - c_k\|_2^2 / (2\sigma^2)\right)}{\sum_{k'=1}^{K} \exp\left(-\|f_{ij} - c_{k'}\|_2^2 / (2\sigma^2)\right)}, \qquad (4)$$

where $\|\cdot\|_2$ is the $l_2$ norm of a vector and $\sigma^2$ represents the covariance coefficient which controls the decay of the response with the magnitude of the distance. As Equation (4) shows, the soft assignment coding is a normalized weight coefficient which uses the sum of $K$ probabilities as the denominator. After some simple transformations, Equation (4) can be decomposed into a 1 × 1 convolutional layer and a soft-max activation function layer. Based on Equations (1)-(4), the final expression of the NetVLAD model can be written as:

$$\Psi(F_i)(k, d) = \sum_{j=1}^{M} \frac{\exp\left(-\|f_{ij} - c_k\|_2^2 / (2\sigma^2)\right)}{\sum_{k'=1}^{K} \exp\left(-\|f_{ij} - c_{k'}\|_2^2 / (2\sigma^2)\right)} \left(f_{ij}(d) - c_k(d)\right), \qquad (5)$$

where $\Psi(F_i)(k, d)$ represents the $((k-1)D + d)$th element of $\Psi(F_i)$ ($k = 1, 2, \cdots, K$; $d = 1, 2, \cdots, D$). $f_{ij}(d)$ and $c_k(d)$ represent the $d$th ($d = 1, 2, \cdots, D$) element of $f_{ij}$ and $c_k$, respectively. The NetVLAD layer also uses the widely-used L2-normalization method and the intra-normalization [41] method to obtain the final coding representation. The complete NetVLAD model for visual classification is illustrated in Figure 2.
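The aggregation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the reference NetVLAD implementation: the helper names `soft_assign` and `netvlad_code` are hypothetical, the exponent is scaled by 1/(2σ²) as an assumption, and the intra-normalization and L2-normalization steps are omitted.

```python
import math


def soft_assign(f, centers, sigma2=1.0):
    """Soft-assignment weights of descriptor f over the K visual words.

    Assumption: the squared distance is scaled by 1 / (2 * sigma2).
    """
    logits = [-sum((fd - cd) ** 2 for fd, cd in zip(f, c)) / (2.0 * sigma2)
              for c in centers]
    m = max(logits)                      # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def netvlad_code(descriptors, centers, sigma2=1.0):
    """Sum the weighted residuals (f - c_k) over all descriptors.

    Returns a flat list of length K*D; element (k, d) (0-based) is stored
    at index k*D + d.
    """
    K, D = len(centers), len(centers[0])
    code = [[0.0] * D for _ in range(K)]
    for f in descriptors:
        weights = soft_assign(f, centers, sigma2)
        for k in range(K):
            for d in range(D):
                code[k][d] += weights[k] * (f[d] - centers[k][d])
    return [x for row in code for x in row]
```

Because the weights are a softmax over negative scaled distances, the same computation can be expressed as a 1 × 1 convolution followed by a soft-max, which is what makes the layer trainable end to end.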

The Proposed SAC-VLADNet
In this section, the mathematical details of the SASAC layer, the affine subspace layer and the covariance layer in our SAC-VLADNet will be presented. We further propose the multi-path M-SAC-VLADNet which aggregates multiple SAC-VLADNet layers. The proposed SAC-VLADNet layer is shown in Figure 3. The proposed M-SAC-VLADNet for image classification is illustrated in Figure 4.

Figure 3. The network structure of the SAC-VLADNet layer. $F_i$ is the feature of the $i$th image in a specific convolutional layer. The blue arrow represents the feed-forward operation of the SAC-VLADNet layer, and the red arrow represents the back-propagation operation of the SAC-VLADNet layer. $\beta(a_k, b_k, v_k)$, $\gamma(\beta)$ and $\lambda(\gamma)$ are Equations (A1)-(A3), respectively. The Σ layer is the covariance statistic layer in Equation (14). conv(U, µ) is the 1 × 1 convolutional layer with the weight $\{U_k\}$ and the bias $\{\mu_k\}$. $a_k$, $b_k$, $v_k$, $U_k$ and $\mu_k$ ($k = 1, 2, \cdots, K$) are the trainable parameters, which are obtained by the back propagation algorithm.

The Sparsely-Adaptive Soft Assignment Coding (SASAC) Layer
The NetVLAD model uses the soft assignment coding in Equation (4) as the weight coefficient. Equation (4) can be considered as a normalized probability. For each $i$, $j$ and $k$, the probability of $f_{ij}$ and $c_k$ is

$$p_{ij}(k) = \exp\left(-\|f_{ij} - c_k\|_2^2 / (2\sigma^2)\right).$$

In the proposed network, we use the newly designed SASAC layer as the weight coefficient. The SASAC layer uses a multidimensional Gaussian probability density function (MGPDF) to define the probability of $f_{ij}$ and $c_k$. The MGPDF with Euclidean distance is written as:

$$p_{ij}(k) = \frac{1}{(2\pi)^{D/2}\,\sigma_k} \exp\left(-\frac{1}{2}\left\|(f_{ij} - c_k)\,./\,[\sigma_{k1}, \sigma_{k2}, \cdots, \sigma_{kD}]^T\right\|_2^2\right), \qquad (6)$$

where ./ is the element-wise division of two vectors and $\sigma_{k1}, \sigma_{k2}, \cdots, \sigma_{kD}$ are the covariance parameters of $c_k$. Different from the standard MGPDF, which directly computes $\sigma_k = \sigma_{k1}\sigma_{k2}\cdots\sigma_{kD}$, our SASAC layer uses a trainable parameter to replace $\sigma_k$. The trainable probability density function in the SASAC layer is given by Equation (7), where .* is the element-wise multiplication of two vectors, and $a_k \in \mathbb{R}^{D \times 1}$, $b_k \in \mathbb{R}^{D \times 1}$ and $v_k \in \mathbb{R}$ are the trainable parameters. If $a_k$, $b_k$ and $v_k$ are set to the values in Equation (8), Equation (7) is exactly equivalent to Equation (6).
However, in the SASAC layer, $a_k$, $b_k$ and $v_k$ are learned in an end-to-end manner, instead of being directly constructed from the pre-computed expression in Equation (8).
Similar to the soft assignment coding in Equation (4), the SASAC layer also uses a normalized probability to construct the weight coefficient. The normalized expression of Equation (7) is $p_{ij}(k) / \sum_{k'=1}^{K} p_{ij}(k')$ (Equation (9)). For a certain $k$, if the probability $p_{ij}(k)$ is very small, this unreliable probability will harm the classification performance of the model. Besides, many works show that sparse codings help improve image classification performance. To eliminate the adverse impact of unreliable probabilities and obtain a sparse weight coefficient, the SASAC layer only considers the largest $T$ probabilities and forces the other, smaller probabilities to be 0. The final expression of our SASAC layer is Equation (10), where $S_T(f_{ij})$ is the set defined by the conditions in Equation (11). It is easy to see that the soft assignment coding in Equation (4) can be considered as a special case of Equation (10). Our SASAC layer in Equation (10) can adaptively learn all the parameters ($a_k$, $b_k$ and $v_k$) based on a normalized MGPDF and obtains a sparser weight coefficient than the soft assignment coding layer in Equation (4). The SASAC layer is differentiable, thus it can be trained in an end-to-end manner, which yields parameters better suited to image classification. The SASAC layer is a new neural network layer which incorporates the domain knowledge of the sparse MGPDF into the deep learning model. To the best of our knowledge, the end-to-end SASAC layer has not been studied in previous deep neural networks, and this paper is the first to embed it into a deep neural network for image classification.
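The behaviour of the SASAC layer can be illustrated with a small sketch. Several details here are assumptions made only for illustration: the log-probability is taken as −½‖a_k .* f + b_k‖² + v_k (with v_k inside the exponent), the weights that survive the top-T selection keep their normalized values without being renormalized, and the names `sasac_weights` and `gaussian_init` are hypothetical.

```python
import math


def sasac_weights(f, a, b, v, T):
    """Sparse, adaptive soft-assignment weights for one descriptor f.

    a[k], b[k] are length-D lists and v[k] is a scalar for each of the K
    visual words. Only the T largest normalized probabilities are kept.
    """
    K = len(a)
    # Assumed log-probability: -0.5 * ||a_k .* f + b_k||^2 + v_k
    beta = [-0.5 * sum((ad * fd + bd) ** 2
                       for ad, fd, bd in zip(a[k], f, b[k])) + v[k]
            for k in range(K)]
    m = max(beta)                               # numerical stability
    gamma = [math.exp(x - m) for x in beta]
    total = sum(gamma)
    p = [g / total for g in gamma]              # normalized probabilities
    keep = set(sorted(range(K), key=lambda k: p[k], reverse=True)[:T])
    return [p[k] if k in keep else 0.0 for k in range(K)]


def gaussian_init(centers, sigmas):
    """Initial (a, b, v) that reproduce a diagonal Gaussian density.

    sigmas[k][d] plays the role of sigma_kd; this mapping is an assumption.
    """
    a = [[1.0 / s for s in sk] for sk in sigmas]
    b = [[-c / s for c, s in zip(ck, sk)] for ck, sk in zip(centers, sigmas)]
    v = [-math.log((2.0 * math.pi) ** (len(sk) / 2.0) * math.prod(sk))
         for sk in sigmas]
    return a, b, v
```

With T = K and a single shared σ for every word and dimension, the weights reduce to an ordinary soft assignment, which matches the special-case remark above.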

The End-to-End Affine Subspace Layer
The original NetVLAD model exploits the PCA algorithm for dimension reduction. The proposed network instead exploits the affine subspace method in [38], which not only provides a piecewise linear approximation of the data manifold but also keeps the low-dimensional representations strongly discriminative. The affine subspace layer in our SAC-VLADNet is given by Equation (12), where $U_k \in \mathbb{R}^{P \times D}$ represents the projective matrix of a specific subspace [38] and $P$ represents the subspace dimension. In our SAC-VLADNet, $U_k$ and $\mu_k$ are obtained through training, instead of being taken directly from the pre-computed $U_k$. $U_k f_{ij} + \mu_k$ in Equation (12) can be considered as a 1 × 1 convolutional layer which has the weight $\{U_k\}$ and the bias $\{\mu_k\}$, thus the conventional CNN training method can efficiently train the end-to-end affine subspace layer. The first-order statistical information is given by Equation (13).
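As an illustration of the affine subspace projection described in this subsection, the sketch below evaluates U_k f + µ_k for a single descriptor. The function names are hypothetical, and the bias initialization µ_k = −U_k c_k (so that the layer initially projects the residual f − c_k) is an assumption rather than a stated detail of the method.

```python
def affine_project(f, U, mu):
    """Compute U f + mu for one descriptor.

    U is a P x D matrix given as a list of P rows; mu has length P.
    """
    return [sum(u * x for u, x in zip(row, f)) + m for row, m in zip(U, mu)]


def init_bias(U, c):
    """Assumed initialization: mu = -U c, so that U f + mu = U (f - c)."""
    return [-sum(u * x for u, x in zip(row, c)) for row in U]


# Toy example with D = 3 and P = 2 (hypothetical values).
U = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
c = [1.0, 2.0, 3.0]
mu = init_bias(U, c)
print(affine_project([2.0, 4.0, 6.0], U, mu))
```

Applied at every spatial position with shared weights, this is the same operation as a 1 × 1 convolution with P output channels, so standard CNN training can update U_k and µ_k.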

The Covariance Layer
From Equation (5), it is clear that the original NetVLAD model only uses first-order statistical information. The NetVLAD layer and the traditional pooling methods aggregate features over the spatial dimensions without considering the interaction between feature channels. The proposed SAC-VLADNet exploits the covariance matrix to obtain an interactive feature, which can efficiently enhance the representation ability. The final aggregated feature in the proposed network is the concatenation of the first-order and the covariance statistical information. The covariance statistical information of Equation (13) is given by Equation (14), where vec is the vectorization operation which transforms a matrix into the corresponding column vector. Based on Equation (14), we use the covariance matrix of the first-order feature coding to obtain the interactive representation between feature channels. Since Equation (14) is also differentiable, the covariance statistic layer can be learned in an end-to-end manner.
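As a rough illustration of a covariance statistic over first-order codes, the sketch below assumes the first-order coding is arranged as a P × K matrix (one column per visual word) and computes a mean-centered P × P covariance across the K columns, scaled by 1/K. The centering, the 1/K scaling and the helper names are assumptions made for illustration and may differ from Equation (14).

```python
def covariance_of_columns(Psi):
    """Covariance between the P channels of a P x K matrix Psi.

    Psi is a list of P rows, each of length K. Returns a P x P matrix.
    Assumption: mean-centered and divided by K.
    """
    P, K = len(Psi), len(Psi[0])
    means = [sum(row) / K for row in Psi]
    centered = [[x - m for x in row] for row, m in zip(Psi, means)]
    return [[sum(centered[p][k] * centered[q][k] for k in range(K)) / K
             for q in range(P)]
            for p in range(P)]


def vec(matrix):
    """Stack the columns of a matrix into one vector (column-major order)."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]
```

Every step is a sum of products of the inputs, so gradients with respect to the first-order codes exist everywhere, which is what allows this layer to sit inside an end-to-end network.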

The Complete SAC-VLADNet
Based on the back propagation model of the SAC-VLADNet, the proposed network can be trained in an end-to-end manner. The back propagation models of the affine subspace layer in Equation (12) and the covariance statistic layer in Equation (14) can be easily obtained. The SASAC layer is a new structure layer, and we give the back propagation function of the SASAC layer in detail in Appendix A.
For the $i$th convolutional feature $F_i$, the final form of the SAC-VLAD coding, $\xi(F_i) \in \mathbb{R}^{P(K+P) \times 1}$, is a $P(K+P)$-dimensional vector given by Equation (15), where L2norm is the L2 normalization of a vector. From Equation (15), we can see that the final feature representation $\xi(F_i)$ captures both spatial aggregation information and interactive information between feature channels. This design can efficiently improve the final representation ability. Based on the derived back propagation functions, we extend the SAC-VLAD in Equation (15) to an end-to-end deep network (SAC-VLADNet). $a_k$, $b_k$, $v_k$, $U_k$ and $\mu_k$ ($k = 1, 2, \cdots, K$) are the learnable weights in SAC-VLADNet, and these parameters are learned by the back propagation algorithm. In the proposed SAC-VLADNet, the feed-forward procedure first computes the final softmax classification loss. Next, we compute the gradients of all the parameters and use the back propagation algorithm to update each layer in SAC-VLADNet. We use the blue and the red arrows in Figure 3 to represent the end-to-end training procedure of the SAC-VLADNet.
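The dimensionality of the final coding follows directly from concatenating a P × K first-order block with a P × P covariance block. The sketch below checks this bookkeeping; the row-major flattening order, the single L2 normalization applied after concatenation, and the function names are assumptions for illustration.

```python
import math


def l2_normalize(x, eps=1e-12):
    """Scale a vector to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]


def sac_vlad_code(first_order, covariance):
    """Concatenate a P x K first-order block and a P x P covariance block.

    Returns an L2-normalized vector of length P * (K + P).
    """
    flat = [v for row in first_order for v in row]
    flat += [v for row in covariance for v in row]
    return l2_normalize(flat)


# Toy example with P = 2 and K = 3 (hypothetical values).
first_order = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
covariance = [[1.0, 0.0], [0.0, 1.0]]
code = sac_vlad_code(first_order, covariance)
print(len(code))
```

With P = K = 128, the same bookkeeping gives 128 × (128 + 128) = 32768 entries per coding.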

The Proposed M-SAC-VLADNet
Since current feature coding networks (end-to-end feature coding networks [36] and non end-to-end feature coding networks [39,42]) only use the last convolutional features to compute the feature coding, these single-path feature coding networks cannot take full advantage of convolutional features for image classification.
Based on our newly-designed SAC-VLADNet, we further propose a novel M-SAC-VLADNet which aggregates multiple SAC-VLADNet layers for visual classification.
The M-SAC-VLADNet extracts $L$ features from $L$ convolutional layers. The $L$ features are defined as $F_i^{(1)}, F_i^{(2)}, \cdots, F_i^{(L)}$, and $\xi(F_i^{(1)}), \xi(F_i^{(2)}), \cdots, \xi(F_i^{(L)})$ are the corresponding SAC-VLAD representations. The final classification loss of the M-SAC-VLADNet is the standard softmax loss in Equation (16), where $C$ is the number of categories and $H\{x, y\}$ is an indicator function which satisfies $H\{x, y\} = 1$ if $x = y$, otherwise $H\{x, y\} = 0$. $y_i$ represents the label of the $i$th image. $\rho_{ic}$ is the total prediction score:

$$\rho_{ic} = \sum_{l=1}^{L} \left( (g_c^{(l)})^T \xi(F_i^{(l)}) + b_c^{(l)} \right), \qquad (17)$$

where $g_c^{(l)}$ and $b_c^{(l)}$ are the weight and bias of the $l$th ($l = 1, 2, \cdots, L$) fully-connected (FC) layer for class $c$. Equation (17) can be further written as:

$$\rho_{ic} = G_c^T \left[\xi(F_i^{(1)}); \xi(F_i^{(2)}); \cdots; \xi(F_i^{(L)})\right] + B_c, \qquad (18)$$

where $G_c = [g_c^{(1)}; g_c^{(2)}; \cdots; g_c^{(L)}]$ and $B_c = \sum_{l=1}^{L} b_c^{(l)}$. $G = [G_1, G_2, \cdots, G_C]^T$ and $B = [B_1, B_2, \cdots, B_C]^T$ are the weight and bias of the final softmax classifier.
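The equivalence between summing the per-path scores and applying one classifier to the concatenated codings can be checked numerically. The sketch below is illustrative only: the function names and toy values are hypothetical, and the loss is shown for a single sample without any averaging over images.

```python
import math


def summed_path_score(codes, weights, biases):
    """Score for one class as a sum over paths of (g^T xi + b)."""
    return sum(sum(g * x for g, x in zip(g_l, xi_l)) + b_l
               for xi_l, g_l, b_l in zip(codes, weights, biases))


def concatenated_score(codes, weights, biases):
    """Same score from one weight vector applied to the concatenated code."""
    xi = [x for xi_l in codes for x in xi_l]
    G = [g for g_l in weights for g in g_l]
    B = sum(biases)
    return sum(g * x for g, x in zip(G, xi)) + B


def softmax_loss(scores, label):
    """Negative log of the softmax probability of the true class."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[label]
```

Since both score functions agree, the multi-path network can be trained as a single softmax classifier on the concatenated codings, and its gradient reaches every path.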
Compared with the NetVLAD [36] model, which only uses the single level feature coding to train the final classifier, the proposed M-SAC-VLADNet exploits multiple SAC-VLAD codings for image classification, thus the proposed multi-path feature coding network is expected to be more discriminative.
The M-SAC-VLADNet is also an end-to-end feature coding model. We first obtain the initialization parameters in each SAC-VLADNet layer, and then train the entire M-SAC-VLADNet by an end-to-end method. Based on the back propagation algorithm, the gradient information of the softmax classifier can be used to update the parameters in each SAC-VLADNet layer. Because of this, the proposed M-SAC-VLADNet can be trained in a supervised way. We define the feed operation of the M-SAC-VLADNet as the blue arrow in Figure 4 and define the back operation of the M-SAC-VLADNet as the red arrow in Figure 4.

Experimental Results
In this section, the classification performance of the proposed SAC-VLADNet and M-SAC-VLADNet is evaluated on several image benchmarks. For a fair comparison, the parameters in NetVLAD and SAC-VLADNet are set to the same values. For the other compared classification methods, we tune the corresponding parameters to get the best results. The experimental image databases include the MIT [43] indoor scene database, the Stanford cars [44] dataset, the Caltech-UCSD Birds 200 (CUB200) [45] database and the Caltech256 [46] object database. The basic specifications of all the datasets are shown in Table 1. First, the experimental setting of the proposed network is given. Next, we evaluate some important factors that significantly affect the image recognition rates of the proposed SAC-VLADNet. Finally, we give detailed experimental results of our deep network and other state-of-the-art classification models to demonstrate the advantages of SAC-VLADNet and M-SAC-VLADNet.

Experimental Setting
In our experiments, we used the VGG-VD [47] network to extract the single-level feature for SAC-VLADNet and the multi-level features for M-SAC-VLADNet. All the images were resized to 448 × 448 pixels. We used random cropping and random mirroring to augment all the training images. We used the flexible and efficient deep learning library MXNet [48] to extract the deep CNN features and implement the SAC-VLADNet and the M-SAC-VLADNet. To minimize the classification loss, the stochastic gradient descent (SGD) optimization algorithm was used.
For the proposed SAC-VLADNet, we used the VGG-VD [47] network pre-trained on the large scale ImageNet [40] dataset to initialize the front-end deep CNN. Then, we used the last convolutional features to learn the initial dictionary $\{c_k\}_{k=1}^{K}$, which was trained with the K-means algorithm in the VLFeat library [49]. Besides, we used the affine subspace model in [38] to initialize the affine subspace parameters $U_k$ ($k = 1, 2, \cdots, K$). We used the corresponding analytical relationships in Section 3 to initialize $a_k$, $b_k$, $v_k$ and $\mu_k$. Based on Equation (15), we obtained the final SAC-VLAD representations. Finally, based on the obtained SAC-VLAD representations, we obtained the initial weight and bias of the last fully-connected layer by training a softmax classifier. The non end-to-end SAC-VLAD can be viewed as the initial value of the end-to-end SAC-VLADNet. Based on the back propagation algorithm, the SAC-VLADNet model then learns the final parameters for visual classification.
For the proposed M-SAC-VLADNet, we first extracted the convolutional features of L = 4 layers (Relu5_1, Relu5_2, Relu5_3 and Pool5) of the VGG-VD [47] network to obtain four initialized SAC-VLADNet layers, and then concatenated the four SAC-VLAD representations. Finally, based on the concatenated SAC-VLAD representations, we obtained the initial values of G and B in Figure 4 by training a softmax classifier. Starting from these initial parameters, the M-SAC-VLADNet learned the final parameters for visual classification in an end-to-end manner.
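Random cropping and mirroring can be sketched on a toy image stored as nested lists. This is only an illustration of the two augmentation operations: the crop size is not stated in the text, so `crop_h` and `crop_w` are hypothetical parameters, as are the function names and the mirroring probability.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position of the image."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)      # randint includes both endpoints
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def random_mirror(image, rng, p=0.5):
    """Flip the image horizontally with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


# Toy 4 x 4 single-channel image and a seeded generator for repeatability.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
rng = random.Random(0)
augmented = random_mirror(random_crop(image, 3, 3, rng), rng)
print(len(augmented), len(augmented[0]))
```

In practice the deep learning library's own augmentation utilities would be used on image tensors; the sketch only shows what the two operations do.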

Analyses of Some Important Factors
In this subsection, we evaluate some important factors that affect the image recognition rate of the proposed SAC-VLADNet. When we evaluate a specific factor, we set all other factors to fixed values. We evaluate all the factors on Caltech256 [46] dataset. The experimental configuration of the Caltech256 database can be found in Section 4.6.4.
From Equation (10), it is clear that the SASAC layer only considers the largest T probabilities and forces the other, smaller probabilities to zero. T is therefore an important factor that affects the image recognition rates of the SAC-VLAD and the SAC-VLADNet. We compare the image recognition rates on the Caltech256 [46] dataset with different T, setting the dictionary size (K) and the subspace dimension (P) to 128 and 128, respectively. The Caltech256 [46] image recognition rates of the SAC-VLAD and the SAC-VLADNet with different T are shown in Figure 5. As Figure 5 shows, T should be neither too small nor too large. If T is too small, such as T = 1, some contributing probabilities are disregarded, which decreases the discrimination of the SASAC layer. If T is too big, such as T ≥ 32, the unreliable probabilities also reduce the discrimination of the SASAC layer. In this experiment, the SAC-VLAD and the SAC-VLADNet achieve their best image classification performance when T = 7. The optimal T for other datasets can be selected in a similar way, and we report it in each dataset-specific experiment. For our M-SAC-VLADNet, T is set to the same value in each SAC-VLADNet layer.
Dictionary size (K) is another pivotal parameter. The Caltech256 [46] image recognition rates of the SAC-VLAD and the SAC-VLADNet with different K are shown in Figure 6, with T and the subspace dimension (P) set to 7 and 128, respectively.
As Figure 6 shows, the accuracy increases as the dictionary becomes larger. However, once K exceeds a certain value, the accuracy cannot be further improved. In this experiment, the accuracy of the SAC-VLADNet does not show apparent improvement when K is larger than 128. In the experiments on other databases, the SAC-VLADNet also performs well enough when K = 128. In the following experiments, we set K to 128. For the NetVLAD model, we also set the dictionary size to 128 in the following experiments.
The final length of the SAC-VLADNet coding is determined by the subspace dimension (P). The Caltech256 [46] image recognition rates of the SAC-VLAD and the SAC-VLADNet with different P are shown in Figure 7, with T and the dictionary size (K) set to 7 and 128, respectively.
As Figure 7 shows, the SAC-VLADNet already achieves a good enough result when P = 128. In the experiments on other databases, SAC-VLADNet also achieves good enough results when P = 128. To keep the SAC-VLADNet representation relatively short, we set P = 128 for the following experiments. The length of the SAC-VLADNet representation is then P(K + P) = 128 × (128 + 128) = 32768.
The SASAC layer and the covariance statistic layer are two vital layers in the proposed SAC-VLADNet. To demonstrate their effects, we compare several variants of SAC-VLADNet. In this section, a SAC-VLADNet model that does not have the covariance statistic layer is referred to as SA-VLADNet, and the corresponding non end-to-end model is referred to as SA-VLAD. The image recognition rates of the NetVLAD, the proposed models and the other variants on the Caltech256 [46] database are shown in Figure 8. As Figure 8 shows, SAC-VLAD improves 1.5% over SA-VLAD, and SAC-VLADNet improves 1.0% over SA-VLADNet, which demonstrates the effect of the covariance statistic layer. SA-VLADNet achieves a 1.2% improvement over NetVLAD, which demonstrates that the SASAC layer is also an important layer for improving discrimination. Since the SASAC layer and the covariance statistic layer both significantly improve the image recognition rate of the proposed SAC-VLADNet, these two layers are necessary components of the proposed deep network.

Statistical Test of SAC-VLADNet and NetVLAD
In this subsection, we give the statistical test of SAC-VLADNet and NetVLAD. Figure 9 shows the error bars of SAC-VLADNet and NetVLAD on 10 different data duplicates. The error bars show that the proposed SAC-VLADNet increases the recognition rate by 2-4% over NetVLAD. We use the MATLAB t-test function to perform the statistical test of the SAC-VLADNet and the NetVLAD. The test results demonstrate that the differences between the proposed SAC-VLADNet and the NetVLAD are statistically significant at significance level α = 0.05.
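A significance check of this kind can be sketched with the Python standard library. The sketch assumes a paired, two-sided t-test over the 10 runs, which may not match the exact variant used with MATLAB; the function names are hypothetical, and 2.262 is the tabulated two-sided critical value of the t distribution for α = 0.05 with 9 degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(acc_a, acc_b):
    """t statistic of the per-run accuracy differences (paired test)."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Two-sided critical value for alpha = 0.05 and 9 degrees of freedom
# (10 paired runs).
T_CRIT_DF9_ALPHA_05 = 2.262


def is_significant(acc_a, acc_b, t_crit=T_CRIT_DF9_ALPHA_05):
    """True if the paired difference is significant at the given threshold."""
    return abs(paired_t_statistic(acc_a, acc_b)) > t_crit
```

The two arguments would be the per-run accuracies of the two models on the same 10 data duplicates; a different number of runs requires a different critical value.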

Analysis of Coding Results
In this subsection, we give some extended discussion of the coding results. We randomly select one test sample from the Caltech256 database to get the NetVLAD coding and the SAC-VLADNet coding. The coding results of SAC-VLADNet and NetVLAD are shown in Figure 10. As Figure 10 shows, the NetVLAD coding is relatively irregular, whereas the SAC-VLADNet coding shows a regular structure. The first half of the SAC-VLADNet coding is the first-order sparse coding, and the second half is the covariance sparse coding. The sparse and second-order representations make the SAC-VLADNet coding more discriminative than the NetVLAD coding. Besides, the regular SAC-VLADNet coding is easier to distinguish than the irregular NetVLAD coding, which enhances the representation ability of the SAC-VLADNet coding.

Comparisons with Other State-of-the-Art Classification Models
In this subsection, the experimental results of the proposed model and other state-of-the-art models on each dataset will be given.

MIT Indoor Recognition
MIT [43] indoor scene database is a challenging indoor scene dataset. This dataset consists of 15,620 indoor scene samples of 67 classes. The common training/test division in [43] is used to obtain the scene recognition results.
In the MIT [43] indoor scene dataset, the optimal T is 7. The compared models in this dataset include FV-CNN [39], FC-CNN [39], Bilinear CNN (B-CNN) [50], Task driven pooling (TDP) [51], CaffeNet [1], directed acyclic graph CNN (DAG-CNN) [52], Caffe-DAG [52] and NetVLAD [36]. The original FV-CNN [39] coding model uses multi-scale input images to obtain the FV representations, whereas the proposed SAC-VLADNet uses single-scale images of 448 × 448 pixels to obtain the SAC-VLAD representations. For a fair comparison, the FV-CNN model in our experiment also uses single-scale images of 448 × 448 pixels to obtain the FV representation. Table 2 shows the image recognition rates of the proposed model and the other methods on the MIT-indoor [43] database. As Table 2 shows, since the VGG-VD [47] network can extract deeper CNN features than AlexNet [1], VGG-VD [47] based methods are much better than AlexNet [1] based methods. Compared with FC-CNN [39], FV-CNN [39], TDP [51] and DAG-CNN [52], which are VGG-VD methods, our SAC-VLADNet has obvious advantages. Besides, the SAC-VLADNet improves 2.8% over NetVLAD [36] and 2.4% over B-CNN, both of which are end-to-end trained deep networks; this result shows the effectiveness of the proposed deep feature coding network. M-SAC-VLADNet achieves a 0.9% improvement over SAC-VLADNet and has obvious advantages over the other CNN methods, thus the proposed M-SAC-VLADNet is very effective for scene classification.

Caltech-UCSD Birds 200 (CUB200) [45] is a widely used bird image database. The CUB200 dataset consists of 11,788 bird images from 200 bird categories, and the training and test sets in this database are roughly equal in size. Bird images show varied poses and viewpoints, and the background affects the recognition of the birds, thus classifying bird categories is very challenging.
The CUB200 database also provides part annotations and bounding box (bbox) annotations, yet our methods only use the class labels and do not use the part or bounding box annotations.
As Table 3 shows, the traditional FV [34] coding method uses the SIFT feature to compute the FV coding, thus it is significantly worse than the CNN methods. Part R-CNN [53], PS-CNN [54] and Deep LAC [55] are based on AlexNet [1], and these AlexNet methods are usually worse than the VGG-VD [47] methods. Among the VGG-VD [47] methods, our end-to-end SAC-VLADNet achieves a 7.6% improvement over our non end-to-end SAC-VLAD, which shows the clear advantage of end-to-end training in the proposed network. Besides, our SAC-VLADNet is clearly better than FV-CNN [39], ProCRC [56], NAC [57], Multi-grained [59] and WPA [58]. Compared with NetVLAD [36], our SAC-VLADNet achieves a 4.1% improvement, which shows the effect of the new end-to-end layers. B-CNN, CBP-RM [60], CBP-TS [60] and LRBP [61] are state-of-the-art end-to-end models on the CUB200 database, and our end-to-end SAC-VLADNet is comparable to these end-to-end methods. Based on the VGG-VD [47] network, SPDA-CNN [62] learns better part detectors and achieves an 84.6% recognition rate. Table 3 demonstrates that our multi-path feature coding network is very effective for bird classification.

Table 3. The accuracies (%) on the CUB200 dataset.

Car Categorization
The Stanford car database [44] consists of 16,185 car images from 196 classes. The dataset is split into 8144 training images and 8041 test images. We use the widely adopted training and test split of [44] to evaluate car categorization performance.
On the car database, the optimal T is 3. The compared models include FV coding [34], revisiting the Fisher vector (RFV [63]), FV-CNN [39], NetVLAD [36], CBP-RM [60], CBP-TS [60], B-CNN [50], LRBP [61] and boosted CNN (BoostCNN [64]). Table 4 shows the recognition rates of our network and the competing models on Stanford cars. As Table 4 shows, the SIFT feature methods (FV coding [34] and RFV [63]) are significantly worse than the CNN methods. Compared with NetVLAD, our SAC-VLADNet achieves a 2.8% improvement, which demonstrates that the newly designed end-to-end layers improve the image classification performance. Besides, our SAC-VLADNet is comparable to CBP-RM [60], CBP-TS [60], B-CNN [50] and LRBP [61], which are end-to-end deep models. BoostCNN [64] is a state-of-the-art CNN model on the car dataset. Our M-SAC-VLADNet achieves a 0.4% improvement over BoostCNN [64] and is clearly better than the other CNN methods, which shows the advantage of the M-SAC-VLADNet structure in car categorization.

Caltech256 Classification
Caltech256 [46] is a large object image database. It consists of 256 object categories with at least 80 samples per class. The total number of images in this database is 30,680. Following the widely used experimental setting, we randomly select 60 images per class as the training set and use the remaining images as the test set. To obtain a fair result, we run our methods 10 times for each partition and report the average classification accuracies.
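The evaluation protocol above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names `split_per_class` and `average_accuracy`, the `evaluate` callback, and the use of one random seed per run are all assumptions made for this example.

```python
import random
from statistics import mean


def split_per_class(labels, n_train=60, seed=0):
    """Randomly pick n_train samples per class for training; the rest form the test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return train_idx, test_idx


def average_accuracy(labels, evaluate, n_runs=10, n_train=60):
    """Average the accuracy returned by evaluate(train_idx, test_idx) over n_runs random splits.

    `evaluate` is a hypothetical callback that trains a model on train_idx
    and returns its accuracy on test_idx.
    """
    scores = []
    for run in range(n_runs):
        train_idx, test_idx = split_per_class(labels, n_train=n_train, seed=run)
        scores.append(evaluate(train_idx, test_idx))
    return mean(scores)
```

Seeding each run separately keeps every split reproducible while still averaging over different random partitions.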
As Table 5 shows, the traditional SIFT feature coding methods (ScSPM [26] and LLC [30]) are significantly worse than the CNN feature coding methods. Compared with FV-CNN [39], NAC [57], FC-CNN [39], DSP [42] and ProCRC [56], our SAC-VLADNet obtains a significant improvement. Compared with NetVLAD [36], our SAC-VLADNet achieves a 2.2% improvement, which demonstrates the superiority of the newly designed end-to-end layers. Besides, our M-SAC-VLADNet achieves at least a 1.1% improvement over the other methods, which demonstrates the superiority of our multi-path feature coding network in object classification.
Table 6 gives the training and test speeds (samples per second) of SAC-VLADNet, M-SAC-VLADNet, VGG-VD and NetVLAD. In the training stage, since SAC-VLADNet uses the concatenation of the first-order and covariance statistics, it is more time-consuming than NetVLAD [36], which computes only the first-order VLAD coding. On the other hand, since VGG-VD [47] has multiple high-dimensional fully connected layers, SAC-VLADNet is faster than VGG-VD. Since M-SAC-VLADNet aggregates multiple feature coding layers, our multi-path network is slower than SAC-VLADNet, and its running speed is similar to that of VGG-VD. In the test stage, the proposed SAC-VLADNet is slightly slower than NetVLAD and faster than VGG-VD. Although SAC-VLADNet and M-SAC-VLADNet are slower than NetVLAD, they achieve better image classification performance, so the proposed deep networks are still very effective.

Conclusions
In this work, we propose a sparsely-adaptive and covariance VLAD (SAC-VLAD) coding method which is more discriminative than the original VLAD coding method. Based on the back propagation models, the SAC-VLAD coding method is extended to an end-to-end SAC-VLADNet. We further propose an end-to-end multi-path SAC-VLADNet (M-SAC-VLADNet) which aggregates multiple SAC-VLADNet layers for visual classification. Our models can efficiently embed the domain knowledge of feature coding into the deep convolutional neural network. The experimental comparisons demonstrate that our model is very competitive for visual classification.
Author Contributions: Boheng Chen conceived of and designed the study. Jie Li implemented some baseline methods. Boheng Chen and Jie Li made the figures and reformatted the manuscript. All authors revised and polished the manuscript. All authors have read and approved the final manuscript.

Appendix A. The Back Propagation Function of SASAC Layer
For each $k$ ($k = 1, 2, \ldots, K$), Equation (10) is equivalent to the following three expressions:
Equation (A2) can be considered a variant of the max pooling layer. In a max pooling layer, the largest value is kept and the remaining values are ignored. In Equation (A2), the largest T values are kept and the remaining values are set to zero. Equation (A3) is a normalization layer that produces normalized weight coefficients. In this paper, the final classification loss is denoted by $J$. For each $k$ ($k = 1, 2, \ldots, K$), the gradient of the loss $J$ with respect to the output of the SASAC layer is denoted by $\partial J / \partial \lambda_{ij}(k)$. Once $\partial J / \partial \lambda_{ij}(k)$ is obtained, the gradients of $\gamma_{ij}(k)$ and $\beta_{ij}(k)$ are derived by the chain rule as:
Based on $\beta_{ij}(k)$ ($k = 1, 2, \ldots, K$), the gradients of the loss with respect to the layer input and the trainable parameters ($a_k$, $b_k$ and $c_k$) can be obtained. The following presents these back propagation functions.
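The top-T selection and normalization steps described for Equations (A2) and (A3), together with their chain-rule gradients, can be illustrated with a small pure-Python sketch for a single descriptor. It rests on assumptions that the text here does not confirm: the scores $\beta(k)$ are taken to be positive, the normalization in Equation (A3) is taken to be division by the sum of the kept values, and the function names are hypothetical. Under these assumptions, with $S = \sum_m \gamma(m)$, the chain rule gives $\partial J / \partial \gamma(k) = \big(\partial J / \partial \lambda(k) - \sum_m \lambda(m)\, \partial J / \partial \lambda(m)\big) / S$, and the gradient with respect to $\beta(k)$ is that value for the kept entries and zero elsewhere.

```python
def top_t_normalize(beta, t):
    """Forward pass: keep the t largest scores, zero the rest, normalize to sum to one.

    Assumes the kept scores are positive so that their sum is non-zero.
    """
    order = sorted(range(len(beta)), key=lambda k: beta[k], reverse=True)
    keep = set(order[:t])
    mask = [1.0 if k in keep else 0.0 for k in range(len(beta))]
    gamma = [b * m for b, m in zip(beta, mask)]  # assumed form of Equation (A2)
    total = sum(gamma)
    lam = [g / total for g in gamma]             # assumed form of Equation (A3)
    return lam, mask, total


def top_t_normalize_backward(grad_lam, lam, mask, total):
    """Backward pass: map dJ/dlambda to dJ/dbeta through both steps."""
    dot = sum(g * l for g, l in zip(grad_lam, lam))
    # Gradient through the sum-to-one normalization.
    grad_gamma = [(g - dot) / total for g in grad_lam]
    # As in max pooling, only the kept entries receive gradient.
    grad_beta = [g * m for g, m in zip(grad_gamma, mask)]
    return grad_beta
```

As with max pooling, the selection step is treated as a fixed mask during back propagation, so the entries outside the top T receive zero gradient.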
Gradient of $f_{ij}$: The gradient of $J$ with respect to $f_{ij}$ can be obtained by:
Based on Equation (A1), $\partial \beta_{ij}(k) / \partial f_{ij}$ is derived as:
Based on Equations (A6) and (A7), the gradient of $f_{ij}$ is derived as:
Gradients of $a_k$, $b_k$ and $v_k$: The gradients of $J$ with respect to $a_k$, $b_k$ and $v_k$ can be obtained by:
Based on Equations (A9) and (A10), the gradients of $a_k$, $b_k$ and $v_k$ are derived as: