A Cascaded Ensemble of Sparse-and-Dense Dictionaries for Vehicle Detection

Featured Application: The vehicle detection algorithm proposed in this work could be used in autonomous driving systems to understand the environment, or applied in surveillance systems to extract useful transportation information through a camera.
Abstract: Vehicle detection, as a special case of object detection, has practical value but faces challenges, such as the difficulty of detecting vehicles of various orientations, the severe influence of occlusion, and background clutter. In addition, existing effective approaches, such as deep-learning-based ones, demand a large amount of training time and data, which hinders their application. In this work, we propose a dictionary-learning-based vehicle detection approach that explicitly addresses these problems. Specifically, an ensemble of sparse-and-dense dictionaries (ESDD) is learned through supervised low-rank decomposition; each pair of sparse-and-dense dictionaries (SDD) in the ensemble is trained to represent either a subcategory of vehicle (corresponding to a certain orientation range or occlusion level) or a subcategory of background (corresponding to a cluster of background patterns) and gives good reconstructions only to samples of the corresponding subcategory, making the ESDD capable of distinguishing vehicles from background even though they exhibit various appearances. We further organize the ESDD into a two-level cascade (CESDD) to perform coarse-to-fine two-stage classification for better performance and reduced computation. The CESDD is then coupled with a downstream AdaBoost process to generate robust classifications. The proposed CESDD model is used as a window classifier in a sliding-window scan over image pyramids to produce multi-scale detections, and an adapted mean-shift-like non-maximum suppression process is adopted to remove duplicate detections. Our CESDD vehicle detection approach is evaluated on the KITTI dataset and compared with other strong counterparts; the experimental results demonstrate the effectiveness of CESDD-based classification and detection, and training the CESDD demands only a small amount of time and data.


Introduction
Vehicle detection, as a special case of object detection, is a computer vision technique of practical value. For example, it plays an important role in camera-based autonomous systems. However, vehicle detection faces challenges such as recognizing vehicles with various orientations, the severe influence of occlusions, and cluttered background content, to name a few.
Existing vehicle detection approaches, e.g., [1][2][3][4][5][6], can be divided into two categories: non-deep-learning approaches and deep-learning-based approaches. Generally, non-deep-learning detection approaches rely on a traditional machine learning classification model with limited discriminative ability as the window or proposal classifier [1,7], or are further aided by 3D geometrical models at the cost of more detailed labeling [6]. Existing deep-learning-based detection approaches [8,9] exhibit strong discriminative ability, but their training-data-intensive and training-time-intensive characteristics hinder real-world application. In contrast, dictionary learning models have been successfully applied to recognition tasks by learning dictionaries for discriminative coding, attaining good recognition ability with low demand for training data and time [10,11]. In the case of vehicle detection, vehicles and background exhibit various appearances that are hard to represent with a single dictionary but are well represented with an ensemble of multiple dictionaries [12,13]; furthermore, dictionaries can easily cooperate with additional machine learning models to attain strong recognition ability, as in [13,14].
This paper presents a dictionary-learning-based, sliding-window-styled vehicle detection approach. In general, the proposed approach learns an ensemble of dictionaries to represent both vehicle appearances and background patterns. Specifically, the vehicle and background categories are further divided into several subcategories, each of which is easier to represent with a dictionary model; each subcategory of vehicle corresponds to a certain orientation range (a sub-range of 0 to 2π) and a certain occlusion level (fully visible/partly occluded/largely occluded), and each subcategory of background corresponds to a cluster of background patterns (the clusters are obtained by a clustering algorithm such as K-means). For each subcategory, we learn a pair of sparse-and-dense dictionaries (SDD) to represent the corresponding sample appearances; an SDD captures subcategory-specific intrinsic features with its sparse dictionary and absorbs non-subcategory-specific patterns (like noise and shadows) with its dense dictionary, so it gives good reconstructions only to samples of the corresponding subcategory; the desired SDD can be learned efficiently through supervised low-rank (SLR) decomposition [11]. Thus, an ensemble of such SDDs (ESDD) for all subcategories of vehicle and background is capable of representing both well, even though their appearances vary due to the aforementioned challenges, and each constituent SDD's subcategory-specific reconstruction ability enables the ESDD to distinguish vehicles from background.
We further organize the ESDD as a two-level cascade (CESDD) to perform two-stage coarse-to-fine classification, which is more robust and reduces computation; input samples are first classified by the first-level ESDD, corresponding to easy-to-distinguish subcategories, and then, based on the first-level classification results, selectively passed to SDDs within the second level (corresponding to hard-to-distinguish subcategories) for re-classification. To further enhance accuracy, the CESDD is coupled with a downstream AdaBoost process whose weak classifiers take the corresponding SDDs' reconstruction residues as input features and are aggregated to produce the final classifications. The CESDD model is used as a window classifier in a simple-to-implement sliding-window scan over input image pyramids to generate multi-scale detections. These detections usually contain duplicates, so we then apply a non-maximum suppression (NMS) process; specifically, instead of the often-used greedy NMS [2], which crudely trades off accuracy against recall rate through inter-detection IoU thresholding, we adopt the mean-shift-like NMS [7], which can effectively remove duplicate detections while properly handling detections of vehicles very close to each other, and we introduce some adaptations for bounding box refinement and false positive reduction.
To summarize, the main contributions of this work are:
• an SDD-ensemble-based vehicle/background classification approach (ESDD);
• an effective organization of the SDD ensemble into a two-level cascade for sliding-window classification robustness and computation saving (CESDD);
• a cooperation mechanism between CESDD and a downstream AdaBoost process for accurate sliding-window classification;
• an adapted mean-shift-like NMS method for duplicate detection removal.
The remainder of the paper is organized as follows: Section 2 reviews related works; Section 3 introduces the training of CESDD; Section 4 introduces vehicle detection with CESDD; Section 5 presents experiments on CESDD; Section 6 gives a conclusion of this work and further improvement directions of CESDD.

Related Works
In this section, we review previous work on vehicle detection and object detection, grouped by methodology, as well as work on applying detection in autonomous systems.

Non-Deep-Learning Object Detection
As for the domain of non-deep-learning sliding-window-styled object detection, there are several effective schemes with various complexity. The renowned Viola-Jones detector [15] achieved good human face detection performance, employing a set of simple features describing local image intensity patterns and an AdaBoost classifier. Dalal and Triggs [1] designed the effective HOG feature to be used with a linear-SVM classifier, achieving big improvement in pedestrian detection. To adapt to deformable objects, Felzenszwalb et al. [2] proposed a root window filter in connection with several part filters whose positions are adjustable, in combination with a Latent-SVM classifier; this proposal gained considerable enhancement in multi-class object detection. Girshick et al. [3] further generalized this part-based model to grammar models which allow more flexible subdivision of parts and embodied ability to represent partial occlusions. Based on the idea of organizing multiple window filters with graph structure as in the case of grammar models, Wu et al. [16] proposed the And-Or Graph model with more complex graph structure to represent variations of vehicle appearances, and even the appearances of multiple vehicles altogether, being able to handle more various occlusions.
Some other non-deep-learning object detection methods employ 3D geometry models to assist detection. Xiang et al. [17] adopted 3D voxel models to represent several frequently appearing vehicle appearances, being able to perform accurate 2D detections as well as 3D detections. Zia et al. [6] established fine-grained 3D wireframe models for 3D vehicle detection, and thus enabled inference of occlusions to the grain level of wireframe vertex.

Deep-Learning-Based Object Detection
Deep-learning-based object detection methods use deep neural networks to undertake all or a large portion of the detection process. Girshick et al. [4] proposed the region-based convolutional neural network model (R-CNN) for the object detection task, which adopted selective-search-based region proposal generation, CNN-based feature extraction and SVM-based classification in sequence. This R-CNN model was further modified into Fast R-CNN [5] by absorbing the classification and bounding box regression stages into one CNN, achieving better detection performance. Redmon et al. [18] proposed the YOLO object detection model, which executes the complete detection task in one deep neural network, regressing from the input image all the way to bounding boxes in real time. Liu et al. [8] proposed the single-shot multibox detector (SSD) model, which uses a deep CNN to explore many positions on the input image plane simultaneously with multiple types of anchor boxes of various sizes and aspect ratios, and adjusts these anchor boxes to fit the target objects. Murtza et al. [19] noticed the effectiveness of both manually designed features and neural-network-learned features for object detection, and thus proposed concatenating a HOG pyramid with CNN features and learning an ensemble classifier upon the concatenated features; this method proved effective for pedestrian detection. Zhang et al. [20] used a pre-trained CNN model for image classification and performed transaction mining to obtain frequently occurring activation patterns as cues for target object detection in previously unseen images. Rahman et al. [21] performed object detection on RGB-D images and proposed a region proposal generation network that is able to fuse multi-modal information. Wang et al. [22] noticed the inconvenience of manually setting hyperparameters for the region proposal network (RPN), which is an essential component in many two-stage object detection methods, and thus proposed to utilize collective intelligence algorithms, like particle swarm optimization, to optimize these hyperparameters, enhancing the accuracy of RPN-based methods. Xu et al. [23] noticed the inaccuracy of bounding box regression in the R-CNN series of methods, and proposed a particle-searching-based algorithm instead; the particles search for local features pertinent to the target object in the feature map generated by the network, and are then clustered and grouped to induce the possible bounding boxes.

Application Approaches of Detection on Autonomous Systems
As for the special case of application in autonomous systems, various detection approaches have been proposed with special designs to ensure the efficiency and reliability of inference. Fremont et al. [24] proposed to use fish-eye camera input to perform human detection around heavy machines like forklifts; to deal with the distortion in fish-eye images, they used deformable part-based models (DPMs) [2] with different part anchor settings for different places in the camera view; they further introduced LiDAR input, on which initial regions of interest are generated, and performed finer detections within these regions, resulting in enhanced time performance and reliability. Tsiktsiris et al. [25] proposed a hybrid model of convolutional neural networks and recurrent neural networks to perform abnormal event detection in autonomous shuttle mobility infrastructures; they adopted a special auto-encoder comprising both convolution layers and ConvLSTM [26] layers to extract spatiotemporal features from video sequences within shuttle infrastructures, and fed the video features to subsequent stacked LSTMs [27] to produce "normal/abnormal" classification results. Xu et al. [28] proposed to deal with the car rotation problem in detection on images from unmanned aerial vehicles by adjusting the images according to the road direction, which is estimated through straight line detection; they also proposed a switching algorithm to choose between two candidate detectors based on their time performance under different scenes, in order to maintain the detection rate. Table 1 summarizes the strengths and weaknesses of detection approaches of the aforementioned methodologies and our CESDD approach. Compared with these approaches, our CESDD can be trained with a small number of training samples and little training time while still achieving good classification and detection performance.

Training of CESDD
The essence of CESDD training is to learn the ensemble of descriptive sparse-and-dense dictionaries, upon which a downstream AdaBoost classifier is learned for reinforced vehicle/background classification, from a set of training samples that are first processed into features fit for detection. The workflow of CESDD training is illustrated in Figure 1a, and a complete algorithm description is given in Algorithm 1.
Specifically, CESDD training learns a two-level cascade of dictionary ensembles. Level 1 consists of dictionary pairs for subcategories of relatively standard vehicle and background types with clear specificity: fully visible (FV) vehicles and partly occluded (PO) vehicles for the vehicle category, and general background (GB) for the background category. The Level 1 ensemble is responsible for producing initial predictions, which are then handed over to Level 2 of the cascade for refinement. Level 2 consists of dictionary pairs for relatively vague types: largely occluded (LO) vehicles for the vehicle category, and vehicle parts (VP) for the background category. The Level 2 ensemble is used differently from Level 1: each sample tagged as positive in the initial predictions is represented over the dictionary pairs of type VP, producing corresponding reconstruction residues; the minimal one of these residues is compared with the minimal one from the vehicle category in Level 1; if the minimal residue from VP is smaller, the initial prediction is corrected to negative. The refinement proceeds analogously for samples initially predicted as negative: each of these samples is represented over the dictionary pairs of the LO vehicle subcategories; the minimal reconstruction residue from these subcategories is compared with the minimal one from the GB subcategories in Level 1; if the minimal residue from LO vehicles is smaller, the sample is regarded as positive. This cascade-styled classification scheme reduces false positives during detection, as shown in Figure 2; its classification performance is reported in the subsequent experiments.
Through levels in the cascade, ESDD-based window classification is performed: a window patch from sliding-window scan process is passed to each SDD in the ensemble for coding, and the resulting reconstruction residues are then transformed into the input to the downstream AdaBoost classifier to give prediction.
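The two-level refinement rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `cascade_classify`, the dictionary layout, and the `code_level2` callback are hypothetical, and the initial Level 1 decision is assumed to follow the smaller of the minimal vehicle and background residues.

```python
def cascade_classify(level1_residues, code_level2):
    """Coarse-to-fine decision from reconstruction residues.

    level1_residues: dict with keys "FV", "PO", "GB", each a list of residues
        (one per SDD of that type) for a single window sample.
    code_level2: hypothetical callback; code_level2("VP") or code_level2("LO")
        returns the list of residues from the Level 2 SDDs of that type.
        It is only called for the type that is actually needed.
    Returns True for vehicle, False for background.
    """
    min_vehicle = min(level1_residues["FV"] + level1_residues["PO"])
    min_background = min(level1_residues["GB"])

    # Assumed initial rule: the category with the smaller minimal residue wins.
    is_vehicle = min_vehicle < min_background

    if is_vehicle:
        # Initial positive: a vehicle-part (VP) SDD that reconstructs the
        # sample better than any Level 1 vehicle SDD flips it to negative.
        if min(code_level2("VP")) < min_vehicle:
            is_vehicle = False
    else:
        # Initial negative: a largely-occluded (LO) vehicle SDD that beats
        # every general-background SDD flips it to positive.
        if min(code_level2("LO")) < min_background:
            is_vehicle = True
    return is_vehicle
```

Passing the Level 2 coding as a callback mirrors the computational saving of the cascade: each sample is coded over only one of the two Level 2 groups.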

Training Sample Categorization
In this work, samples from vehicle category and background category consist of several deliberately specified types, to take care of the variousness of image contents during vehicle detection. Samples in vehicle category consist of three types with different occlusion levels: FV vehicles, PO vehicles, and LO vehicles; samples in background category consist of two types: GB and VP. The constitution is illustrated in Figure 3. These types constitute almost all sorts of image contents in urban street views in KITTI2017 dataset [29] (the 2D object detection subset included in KITTI's "3D Object Detection Evaluation 2017" channel).
Moreover, to be suitable for subsequent ESDD learning, each type of vehicle samples is further divided into subcategories according to orientation, as illustrated in Figure 4. Background types are further divided into subcategories through a clustering process, for which we adopted K-means.
Figure 3. The vehicle category is divided into three types according to the extent of occlusion; the background category is divided into two types according to whether a vehicle part appears.

Feature Extraction
For in-the-wild vehicle detection, vehicles exhibit various appearances, resulting from geometrical differences, various orientations, tremendous illumination variations, etc. Adoption of descriptive features robust against these influences is critical to effective classification and vehicle detection. Two essential properties of sample appearance should be emphasized here: shape and color. Shape information generally resides in two scales: contours and other geometrical patterns on a macro scale, and textures on a micro scale. In this work, several types of feature extraction schemes are tested and compared regarding their performances in classification and detection, as could be seen in subsequent experiment. Based on the testing results, it is discovered that the combination of features describing all three types of appearance properties (macro shape, micro texture, and color) gives the best classification performance; specifically, we adopted FHOG for description of macro shapes, cell-structured LBP (CSLBP) for micro textures, and color names (CN) for color information.
The concatenated features have high dimensionality, which makes processing time-consuming, and may contain redundant components that are useless for classification. To deal with this issue, ℓ1-based feature selection with the assistance of a linear SVM is utilized and works well with the proposed features, as can be seen in the subsequent experiments.
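For reference, a common form of ℓ1-regularized linear SVM used for feature selection is given below. This is an assumed general formulation, not necessarily the exact objective used in this work; the symbols $f_i$ (feature vector of sample $i$), $y_i \in \{-1, +1\}$ (label), $w$, $b$ (SVM parameters), and $C$ (trade-off weight) are introduced here only for illustration.

```latex
\min_{w,\,b} \; \|w\|_1 + C \sum_{i} \max\bigl(0,\; 1 - y_i (w^{\top} f_i + b)\bigr)^2,
\qquad
\mathcal{S} = \{\, j : w_j \neq 0 \,\}
```

Because the ℓ1 penalty drives many entries of $w$ to exactly zero, only the feature dimensions in $\mathcal{S}$ would be retained for subsequent dictionary learning.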

SDD Ensemble Learning
The effectiveness of sliding-window-styled detection relies on the effectiveness of window-wise classification. Through the scanning process, a great number of window patches are densely sampled from many positions in images of multiple scale levels, consisting usually of far more background patches than well-bounded vehicle patches. Thus, a window-wise classifier is obliged to achieve high enough classification accuracy as well as good recall rate. Classifier's accuracy and recall rate are factually two views of one essence: modeling quality of target data space.
In this work, the proposed classifier, built on a dictionary ensemble, exploits this very essence directly. Specifically, training samples are categorized as vehicle and background, and are further divided into subcategories. For the vehicle category, the dividing criteria are vehicle orientation and occlusion level, which influence vehicle appearance severely. For the background category, sample contents are not associated with useful labels (such as the orientations and occlusion levels available for the vehicle category), so we adopt the K-means clustering algorithm to directly produce clusters of background training samples as subcategories, each containing mutually similar samples. With the constitution of subcategories set, dictionaries can be learned. For each subcategory, with its limited variance of appearance, a dictionary with adequate representation ability can be learned effectively. Specifically, the proposed SDD dictionary learning method is adopted and described in the following subsection.
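As a small illustration of the vehicle subdivision, the sketch below maps a labeled sample to a subcategory key from its occlusion type and orientation. The function name, the equal-width binning, and the default of eight orientation bins are assumptions made for this example; the actual orientation ranges are those shown in Figure 4.

```python
import math


def vehicle_subcategory(occlusion_type, theta, n_bins=8):
    """Return a (type, orientation_bin) key for one vehicle sample.

    occlusion_type: one of "FV", "PO", "LO" (types named in the text).
    theta: vehicle orientation in radians; any real value is wrapped to [0, 2*pi).
    n_bins: hypothetical number of equal-width orientation ranges.
    """
    if occlusion_type not in ("FV", "PO", "LO"):
        raise ValueError("unknown occlusion type: %r" % (occlusion_type,))
    wrapped = theta % (2 * math.pi)
    width = 2 * math.pi / n_bins
    # The final modulo guards against a rounding edge case at the upper bound.
    bin_index = int(wrapped // width) % n_bins
    return (occlusion_type, bin_index)
```

Each distinct key would then receive its own SDD; background samples have no such labels and are grouped by clustering instead.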

Basic SDD Learning and Coding
For the task of learning an SDD for a single subcategory, the SLR-decomposition-based method [11] is used. Given a training sample collection $D \in \mathbb{R}^{k \times m}$ from a certain subcategory, with $m$ samples of dimension $k$ arranged as column vectors, we decompose it into three parts as:
$$D = AX + BY + R, \qquad (1)$$
where the first term $AX$ preserves subcategory-specific intrinsic patterns existing throughout all samples ($A \in \mathbb{R}^{k \times n_A}$ and $X \in \mathbb{R}^{n_A \times m}$ denote the intrinsic sparse dictionary and the corresponding sparse coding, respectively), the second term $BY$ preserves non-intrinsic patterns ($B \in \mathbb{R}^{k \times n_B}$ and $Y \in \mathbb{R}^{n_B \times m}$ denote the non-intrinsic dense dictionary and the corresponding dense coding, respectively), the third term $R \in \mathbb{R}^{k \times m}$ represents residues such as noise, and $n_A$, $n_B$ denote the numbers of atoms in $A$ and $B$, respectively. Considering that samples within a certain subcategory share similarities to some extent, a low-rank property of $D$ and $(A, B)$ can be reasonably expected. In addition, with the assumption that $A$ is responsible for capturing subcategory-intrinsic patterns while $B$ captures non-subcategory-intrinsic patterns, for a sample of the same subcategory, its relatively stable subcategory-intrinsic pattern is expected to be captured by one or a few atoms in $A$, with each atom being a distinct mode, while the relatively unstable non-subcategory-intrinsic part should be allowed more freedom to choose linear combination coefficients over atoms in $B$ to better deal with the variances. Thus, the coding $X$ over $A$ is expected to be sparse, while the coding $Y$ over $B$ is expected to be relatively dense.
With all these considerations, the desired objective (2) for the proposed SDD learning is composed of the following terms. The low-rank property of (A, B) is measured with the nuclear norm; the sparsity of the codings X is measured with the ℓ1 norm; since there is no sparsity requirement for the codings Y, only their magnitude is minimized, measured with the Frobenius norm; the residues R, possibly noise, should exhibit sparsity and are thus measured with the ℓ1 norm. At the same time, reconstruction quality must be maintained, which acts as a constraint.
Once the dictionary pair (A, B) is learned by optimizing the SDD learning objective above, it can be used to represent incoming samples. Given a single input sample $d \in \mathbb{R}^{k}$, its representation with (A, B) can be expressed as:
$$d = Ax + By + r, \qquad (3)$$
where $x \in \mathbb{R}^{n_A}$ is the coding over A, $y \in \mathbb{R}^{n_B}$ is the coding over B, and $r \in \mathbb{R}^{k}$ is the residue. The expected properties of the coding result follow the analysis of the SDD learning objective above. Thus, the coding objective (4) over (A, B) follows directly: with A and B fixed, it minimizes over x, y, and r the same sparsity, magnitude, and residue penalties as in (2), subject to the representation (3). The solution to the SDD learning objective (2) is deferred to Appendix A; the solution to (4) is similar to that of (2), so we skip the details.
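The learning and coding objectives are described above term by term. One form consistent with that description is sketched below; it should be read as an assumption rather than the exact formulation. In particular, the trade-off weights $\lambda_1, \lambda_2, \lambda_3$ are hypothetical symbols introduced here, and applying the nuclear norm to $A$ and $B$ separately is one possible reading of "the low-rank property of (A, B)".

```latex
% Assumed form of the SDD learning objective (2)
\min_{A,B,X,Y,R} \; \|A\|_{*} + \|B\|_{*}
  + \lambda_1 \|X\|_{1} + \lambda_2 \|Y\|_{F}^{2} + \lambda_3 \|R\|_{1}
\quad \text{s.t.} \quad D = AX + BY + R

% Assumed form of the coding objective (4), with A and B fixed
\min_{x,\,y,\,r} \; \lambda_1 \|x\|_{1} + \lambda_2 \|y\|_{2}^{2} + \lambda_3 \|r\|_{1}
\quad \text{s.t.} \quad d = Ax + By + r
```

For a single sample the Frobenius norm of the coding reduces to the Euclidean norm, and the low-rank terms drop out because the dictionaries are no longer variables.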

Constructing SDD Ensemble
With basic SDD learning and coding approaches set up, the strategy of using SDDs to perform vehicle/background classification would be straightforward: we model the data space by representing vehicle and background categories with respective SDDs, and an input sample of certain category should receive distinct reconstruction qualities (in terms of reconstruction residues) from SDDs of different categories. Before coming to the proposed multi-subcategory scheme, a more heuristic scheme is to learn only one SDD for vehicle category and one for background, as illustrated in Figure 5a. However, as simple as it is, this no-subcategory scheme gave unsatisfactory classification performance, as shown in Figure 6a. This behavior could be expected, since both vehicles and background in the real-world dataset (such as KITTI) exhibit various appearances which are hard to represent with only one SDD per category; the great variances of appearances caused trouble for SDD learning to capture patterns intrinsic enough for a whole category. This observation led us to propose the multi-subcategory scheme, which subdivides both vehicle and background categories into multiple subcategories, as described earlier and illustrated in Figure 5b. By introducing more subcategories, the variances of sample appearances within subcategories are within the representation ability of SDDs, and the patterns intrinsic to each subcategory are easier to capture, so the discriminative ability enhancement is expected and proved by experimental results in Figure 6a.

Downstream AdaBoost Classifier Learning
Even though the proposed cascade of SDD ensembles achieves good classification performance, in-the-wild vehicle detection remains very complex in terms of image appearance. In this case, classification accuracy, especially for the vehicle category, deserves more emphasis, since it is critical to the robustness of vehicle detection. The presence of an ensemble of SDDs in the proposed CESDD-based classifier provides a chance to cooperate with ensemble learning models like AdaBoost. Specifically, every SDD from the vehicle category can be coupled with every SDD from the background category to act as a weak vehicle/background classifier (a micro ESDD-based classifier with an ensemble size of two). In this way, many such couples of SDDs can be constructed as weak classifiers, creating the possibility of applying the AdaBoost process. These SDD couples are not independent of each other, since there exist couples sharing common SDDs, while the AdaBoost algorithm requires independent weak classifiers that are learned incrementally. To bridge this gap, instead of directly using the SDD couples as weak classifiers, additional machine learning models can be adopted as the weak classifiers, taking the reconstruction residues from the SDD couples as inputs. With this adaptation, the AdaBoost process can be carried out to learn these weak classifiers and the corresponding aggregation weights. The steps in CESDD training and CESDD-based detection where this downstream AdaBoost mechanism is involved are illustrated as portions of Figure 1a,b; note that each weak classifier is learned only from the training samples' reconstruction residues produced by the corresponding SDD couple.
In the case of cascaded SDD ensemble, weak classifiers learned from SDD couples purely consisting of Level 1 SDDs are evaluated for all incoming test samples to generate initial classifications, while weak classifiers learned from SDD couples consisting of Level 2 VP type SDDs and Level 1 vehicle category (FV or PO) SDDs are only evaluated for initial positive classifications, and weak classifiers learned from SDD couples consisting of Level 2 LO vehicle type SDDs and Level 1 background category SDDs are to be evaluated only for initial negative classifications. For an input sample, classifications from all evaluated weak classifiers are aggregated to form the final classification; the detailed procedure is illustrated in Figure 7. The effectiveness of this downstream AdaBoost process could be observed in subsequent experimental results, and the enhancement in classification accuracy of vehicle category is noticeable.
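The selective evaluation of weak classifiers can be sketched as below. This is a simplified illustration under stated assumptions: aggregation is taken to be the standard AdaBoost weighted vote, the sign of the Level 1 score is used as the initial classification, and the helper `residue_stump` is a hypothetical stand-in for the learned weak classifiers (the detailed procedure is the one in Figure 7).

```python
def residue_stump(vehicle_key, background_key):
    """Hypothetical weak classifier for one SDD couple: votes +1 (vehicle)
    when the vehicle-side SDD gives the smaller residue, else -1."""
    def predict(residues):
        return 1 if residues[vehicle_key] < residues[background_key] else -1
    return predict


def aggregate(residues, level1_weak, vp_weak, lo_weak):
    """Weighted vote over the weak classifiers that apply to this sample.

    residues: dict mapping an SDD name to its reconstruction residue.
    level1_weak, vp_weak, lo_weak: lists of (alpha, predict) pairs, where
        alpha is the aggregation weight learned by AdaBoost.
    Returns +1 for vehicle, -1 for background.
    """
    # Couples made purely of Level 1 SDDs are evaluated for every sample.
    score = sum(alpha * h(residues) for alpha, h in level1_weak)
    # Initial positives consult the VP couples; initial negatives the LO couples.
    extra = vp_weak if score > 0 else lo_weak
    score += sum(alpha * h(residues) for alpha, h in extra)
    return 1 if score > 0 else -1
```

For example, with `level1_weak = [(1.0, residue_stump("FV0", "GB0"))]` and a heavily weighted VP couple, a window that a vehicle-part SDD reconstructs better than the vehicle SDD would end up classified as background.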

Detection with CESDD
This section describes the process of detection with CESDD, which consists of three stages: sliding-window scan, window-wise classification, and non-maximum suppression. The workflow of the proposed CESDD-based vehicle detection is illustrated in Figure 1b, and a complete algorithm description of vehicle detection with CESDD is summarized in Algorithm 2.

Sliding-Window Classification
On receiving an input image, an image pyramid is built by up-sampling and down-sampling the image to multiple scale levels. The sliding-window scan is performed on this pyramid by moving windows of proper aspect ratios (window width over window height) through all positions on each image plane in the pyramid. In the case of vehicle detection from the driver's view on streets, as in the KITTI2017 dataset, the aspect ratios of vehicle bounding boxes vary within a limited range depending on vehicle orientation, from the large aspect ratio of a side view to the small aspect ratio of a front view, as shown in Figure 4. This indicates that a single aspect ratio for the scan window is not enough to capture vehicles of various orientations. Thus, we adopted two distinct aspect ratios for scan windows: one close to the ratios of side views, and another close to those of front views; these two values were derived from statistics of the KITTI2017 training set label information. Window patches obtained using these aspect ratios are then resized to a uniform size for the convenience of subsequent classification; these window patches are further processed into proper features as described in Section 3.2. CESDD-based classification is then carried out for all processed window patches; classification with CESDD is detailed in Section 3.3, and the cooperation of CESDD with the downstream AdaBoost process is described in Lines 5∼13 of Algorithm 2. After window classification, initial detections are obtained.
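The scan over the pyramid with two window shapes can be sketched as a generator of candidate boxes. The window height, stride, scale list, and aspect ratio values used here are placeholders chosen for illustration, not the settings of this work; patch cropping, resizing, and feature extraction are omitted.

```python
def sliding_windows(img_w, img_h, scales, aspect_ratios, win_h=32, stride=8):
    """Yield candidate boxes (x, y, w, h, scale) in original-image coordinates.

    scales: resize factors defining the pyramid levels (>1 up-samples).
    aspect_ratios: window width over window height, e.g. one value for
        side views and one for front views.
    win_h, stride: window height and step in pixels on each pyramid level
        (hypothetical defaults).
    """
    for s in scales:
        level_w, level_h = int(img_w * s), int(img_h * s)
        for ratio in aspect_ratios:
            win_w = int(round(win_h * ratio))
            # range() is empty when the window does not fit on this level.
            for y in range(0, level_h - win_h + 1, stride):
                for x in range(0, level_w - win_w + 1, stride):
                    # Map the window back to the original image scale.
                    yield (x / s, y / s, win_w / s, win_h / s, s)
```

A fixed window size on every level means that large vehicles are found on down-sampled levels and small ones on up-sampled levels.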

Mean-Shift-Based NMS
Directly applying the proposed CESDD-based classifier to all window patches would likely produce highly overlapped detections around target vehicles, as shown in Figure 6a. To deal with this issue, the mean-shift-based NMS is adapted and applied.
The detections from the preceding classification are assigned bounding box information and confidence scores. Mean-shift-based NMS first transforms the detections into points in an x-y-scale 3D space, with x and y representing coordinates in the image plane and scale representing the scale level in the pyramid. Then, a kernel density estimation map is built as a weighted combination of 3D Gaussian distributions centered at these 3D detection points, with the weights being monotonically increasing functions of the detection confidence scores. Then, starting from their initial positions in the 3D space, all detection points are moved in the map according to the gradients; this move is iterated until convergence, at which point all 3D points have reached modes of the map. Finally, the reached modes are regarded as the desired detections. The detailed process can be found in [7]; it amounts to a mean-shift process.
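A minimal sketch of the mode-seeking step is given below, assuming a fixed diagonal Gaussian bandwidth and precomputed positive weights; the function name, bandwidth values, and stopping rule are illustrative choices, and the formulation in [7] should be consulted for the exact kernel and score transformation.

```python
import math


def mean_shift_modes(points, weights, bandwidth=(8.0, 8.0, 0.5),
                     n_iter=50, tol=1e-3):
    """Move each detection point to a mode of the weighted Gaussian KDE.

    points: list of (x, y, scale) tuples, one per initial detection.
    weights: positive weights, assumed to be monotonically increasing
        functions of the detection confidence scores.
    bandwidth: hypothetical per-axis Gaussian standard deviations.
    Returns the converged position for every input point.
    """
    def shift(p):
        num, den = [0.0, 0.0, 0.0], 0.0
        for q, w in zip(points, weights):
            d2 = sum(((p[i] - q[i]) / bandwidth[i]) ** 2 for i in range(3))
            k = w * math.exp(-0.5 * d2)
            den += k
            for i in range(3):
                num[i] += k * q[i]
        # Kernel-weighted mean of all detection points around p.
        return tuple(n / den for n in num)

    modes = []
    for p in points:
        for _ in range(n_iter):
            new_p = shift(p)
            moved = max(abs(new_p[i] - p[i]) for i in range(3))
            p = new_p
            if moved < tol:
                break
        modes.append(p)
    return modes
```

Detections around the same vehicle converge to nearly the same point, while detections of a nearby but distinct vehicle can settle at a separate mode, which is the behavior that motivates this choice over greedy IoU-based suppression.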
Some adaptations are necessary to make the mean-shift-based NMS fit the case of vehicle detection in this work. Firstly, the initial detections do not all have the same aspect ratio. Thus, for each mode found by the mean-shift-based NMS, the aspect ratios of the group of detection points that reached that mode are averaged, and the average aspect ratio is applied to the output detection bounding box inferred from the mode. Secondly, modes that attracted very few detection points, such as one or two, tend to be false positives and should be dropped, because true vehicle targets typically attract more detections around them. This can be observed in Figure 2. Furthermore, a proper threshold should be set to screen out modes whose kernel density estimation values, which correspond to detection confidences, are too low.
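The three adaptations can be sketched as a post-processing step on the converged points. The grouping tolerance, minimum support, and density threshold below are hypothetical values, and taking the maximum density within a group is a simplification made for this example.

```python
def finalize_detections(converged, min_support=3, min_density=0.5,
                        merge_tol=1.0):
    """Turn converged mean-shift points into final detections.

    converged: list of (mode, aspect_ratio, density), one per initial
        detection, where mode is the (x, y, scale) point it converged to,
        aspect_ratio is that detection's window ratio, and density is the
        KDE value at the mode.
    Returns a list of (mode, mean_aspect_ratio) for the kept modes.
    """
    groups = []
    for mode, ratio, density in converged:
        for g in groups:
            # Points within merge_tol on every axis share one mode.
            if all(abs(a - b) <= merge_tol for a, b in zip(mode, g["mode"])):
                g["ratios"].append(ratio)
                g["density"] = max(g["density"], density)
                break
        else:
            groups.append({"mode": mode, "ratios": [ratio],
                           "density": density})

    kept = []
    for g in groups:
        # Drop modes with too few supporters or too low a density.
        if len(g["ratios"]) < min_support or g["density"] < min_density:
            continue
        kept.append((g["mode"], sum(g["ratios"]) / len(g["ratios"])))
    return kept
```

With `min_support=3`, modes reached by only one or two detection points are discarded, matching the observation that such modes tend to be false positives.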

Experiments
In this section, we present the experimental results of our CESDD vehicle detection approach. Specifically, we evaluate CESDD-based classification and CESDD-based detection from multiple aspects. For all evaluations, we set the number of atoms to 10 for the sparse dictionary A and to 20 for the dense dictionary B, for the SDDs of all subcategories; in total, we use 16 subcategories for the vehicle category and 12 subcategories for the background category.

CESDD Classification
Experiments on classification are conducted over image patch samples obtained from KITTI2017. Vehicle samples, including the car and van types, are cropped according to the corresponding label information; each vehicle sample is cropped using a window with one of the two preset aspect ratios used for detection. Background samples, except for VP, are obtained with both aspect ratios by random sampling in non-vehicle areas. To obtain VP samples, cropping windows are sampled around vehicle areas so that they partially overlap the vehicles to a proper extent. In total, we collected 21,556 training samples and 11,710 test samples for CESDD classification evaluation.
We evaluate classification performance with four criteria: accuracy for the vehicle category (veh acc), recall rate for the vehicle category (veh recall), accuracy for the background category (bkg acc), and recall rate for the background category (bkg recall). These criteria are calculated as:
$$\text{veh acc} = \frac{N_{tp}}{N_{tp} + N_{fp}}, \quad \text{veh recall} = \frac{N_{tp}}{N_{veh}}, \quad \text{bkg acc} = \frac{N_{tn}}{N_{tn} + N_{fn}}, \quad \text{bkg recall} = \frac{N_{tn}}{N_{bkg}},$$
where $N_{tp}$, $N_{fp}$, $N_{tn}$, $N_{fn}$ denote the numbers of true positives, false positives, true negatives, and false negatives in the classifier predictions, and $N_{veh}$, $N_{bkg}$ denote the numbers of vehicle and background samples in the test set. Here, classification to the vehicle category is regarded as "positive" and classification to the background category as "negative"; "true" indicates that the classification is correct and "false" that it is incorrect.
First, the classification performance of two types of subcategory division is evaluated, as presented in Figure 6a. Clearly, the scheme in which the SDD ensemble consists of multiple subcategories per category works better, showing that the multi-subcategory scheme indeed gives a better representation of the various vehicle appearances and background patterns.
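As a sketch, the four criteria can be computed from the confusion counts as follows. It assumes that "accuracy" for a category means the fraction of predictions of that category that are correct, and "recall" the fraction of that category's test samples that are recovered. The function name and the 1/0 label encoding are illustrative choices.

```python
def classification_criteria(y_true, y_pred):
    """Compute veh acc, veh recall, bkg acc, bkg recall.

    y_true, y_pred: sequences of labels, 1 = vehicle (positive), 0 = background (negative).
    """
    pairs = list(zip(y_true, y_pred))
    n_tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    n_fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    n_tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n_fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n_veh = n_tp + n_fn  # vehicle samples in the test set
    n_bkg = n_tn + n_fp  # background samples in the test set

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is empty
        return num / den if den else float("nan")

    return {
        "veh_acc": ratio(n_tp, n_tp + n_fp),
        "veh_recall": ratio(n_tp, n_veh),
        "bkg_acc": ratio(n_tn, n_tn + n_fn),
        "bkg_recall": ratio(n_tn, n_bkg),
    }

# Toy example: 4 vehicle and 6 background samples
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
criteria = classification_criteria(truth, preds)
```

In the toy example there are 3 true positives, 2 false positives, 4 true negatives, and 1 false negative, so the vehicle accuracy is 3/5 and the vehicle recall is 3/4.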
The effect of organizing the SDD ensemble into a cascade is also evaluated. The classification performances with and without the cascade are shown in Figure 6b. The test samples here include LO vehicles and are thus more difficult to discriminate. Applying the cascade gives similar performance at a reduced time cost.
Experiments have also been conducted to examine the effect of the downstream AdaBoost. The classification performances with and without the downstream AdaBoost are shown in Figure 6c. It can be seen that the downstream AdaBoost process greatly improves the classification accuracy for the vehicle category, which is very important for the robustness of vehicle detection.
The classification performances of several choices of features are evaluated, as shown in Figure 6d. The effect of feature selection on classification performance is also evaluated, as shown in Figure 6e; it can be observed that linear-SVM-based feature selection reduces dimensionality significantly while retaining almost the same classification performance as the original high-dimensional features.
The CESDD classifier is compared with other strong classifiers based on different methodologies. Comparison results with random forest [30], gradient boosted trees (gbtrees) [31], discrete AdaBoost (boost) [32], linear-SVM, and RBF-SVM [33] are presented in Figure 6f, all using the same training and test sample sets. It can be seen that the CESDD classifier performs better than most counterparts and attains performance similar to that of the RBF-SVM, whose hyperparameters were heavily optimized during our experiment. We also measured the training time of the counterpart approaches and of our CESDD approach, as shown in Table 2. It can be observed that the training of CESDD is efficient, similar to that of the decision-tree-based counterparts (gradient boosted trees, discrete AdaBoost) and much faster than that of the SVMs.

CESDD Vehicle Detection
In this experiment, the CESDD vehicle detector is trained over 18,195 vehicle samples (including FV, PO, and LO), 33,218 background samples of GB, and 18,402 background samples of VP, all obtained from the KITTI2017 training set. For comparison, the DPM detector is the model named CAR_FINAL.MAT obtained from the voc-release5 site [34], which was trained on the TRAINVAL set of Pascal VOC 2007.
Qualitative test results from the proposed CESDD vehicle detector with and without mean-shift-based NMS are presented in Figure 8; qualitative comparison with DPM detector is presented in Figure 9. It is observed that the proposed CESDD vehicle detector is effective in these in-the-wild scenarios.

Conclusions
In this work, we have proposed a dictionary-learning-based vehicle detection approach named CESDD. It explicitly represents the various appearances of vehicles and background with an ensemble of SDDs (ESDD) covering all subcategories, which can be efficiently learned using SLR decomposition, and organizes this ESDD as a two-level cascade for two-stage coarse-to-fine classification. Furthermore, CESDD is coupled with a downstream AdaBoost for robust classification. For vehicle detection, this CESDD model is used as a window classifier in a sliding-window scan over image pyramids to generate multi-scale detections, which are then refined using an adapted mean-shift-based NMS. Experimental evaluations of CESDD-based classification and detection from multiple aspects show that CESDD achieves good classification and detection performance with low demands on training time and data. At present, our CESDD vehicle detection approach is time consuming at inference because coding over the dictionary ensemble is slow, which could possibly be addressed by introducing parallelization in future work.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Solution to SDD Learning Problem
We solve problem (2) using the augmented Lagrange multiplier method. Specifically, we introduce two auxiliary variables $A'$, $B'$ as proxies for $A$, $B$, respectively, and transform (2) into the following optimization objective as an augmented Lagrangian function:
$\min_{A, A', X, B, B', Y, R}$
where $Z_1$, $Z_2$, $Z_3$ are Lagrange multiplier matrices corresponding to the constraints $D = AX + BY + R$, $A = A'$, $B = B'$, respectively. We present the complete algorithm for solving (A1) in Algorithm A1, where SVD(·) is the singular value decomposition operation and Kmeans(·) is the K-means clustering operation over input samples, taking the input-output form [centroids, labels] = Kmeans(D), where centroids is a k × c matrix which stores the centroids of c clusters and labels is a c × m matrix which stores the assigned cluster labels for all samples in D; $S_\lambda(\cdot)$ is the elementwise soft-thresholding operation defined as: $S_\lambda(x) = \operatorname{sign}(x)\max(|x| - \lambda, 0)$. (A3)
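The soft-thresholding operator in (A3) translates directly into NumPy. The sketch below is only an illustration of the operator itself, not of Algorithm A1; the function name `soft_threshold` is an arbitrary choice.

```python
import numpy as np

def soft_threshold(x, lam):
    """Elementwise S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Entries with magnitude at most lambda are set to zero; the rest shrink toward zero by lambda
shrunk = soft_threshold([-3.0, -0.5, 0.0, 0.2, 2.0], lam=1.0)
```

The operator works unchanged on matrices, which is how it is typically applied to coefficient or residual matrices in augmented Lagrangian solvers.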