Adaptive Decision Support System for On-Line Multi-Class Learning and Object Detection

Abstract: A new online multi-class learning algorithm is proposed with three main characteristics. First, to make the feature pool better fit the pattern pool, an adaptive feature pool is proposed that dynamically combines three general features: Haar-like, Histogram of Oriented Gradient (HOG), and Local Binary Patterns (LBP). Second, an external model is integrated into the proposed model without re-training to enhance the efficacy of the model. Third, a new multi-class learning and updating mechanism is proposed that helps to find unsuitable decisions and adjust them automatically. The performance of the proposed model is validated on a multi-class detection and online learning system. The proposed model achieves a better score than other non-deep learning algorithms on public pedestrian and multi-class databases. The multi-class databases contain data for pedestrians, faces, vehicles, motorcycles, bicycles, and aircraft.


Introduction
The importance of computer vision-related technology and research is increasing with the increasing efficiency of hardware computing. The main research areas in computer vision are object detection and classification techniques. Accurate detection and quick identification of objects have always been the most important issues in this area. Among all object detection methods, machine learning is one of the most frequently studied. In recent years, there have been many successful applications such as face detection and recognition [1,2], vehicle detection and tracking [3][4][5], and pedestrian detection [6][7][8]. There are many single-class offline learning algorithms in machine learning that use massive training data to find an approximation function closest to the unknown target function. The most common ones are Adaptive Boosting [9] and Support Vector Machine [10]. These single-class offline learning algorithms usually require a person to collect and classify a large number of positive and negative samples, after which different feature extraction and selection methods are used to find the distinguishable features for the dataset. Finally, these features are combined into a detection model by using a machine learning algorithm. For multiple objects, it is necessary to train several detection models in the same way. Although single-class offline learning has performed well in the field of object detection, substantial training time is necessary for multi-object detection. Thus, it is not recommended to combine several single-object models together. As a result, multi-class offline learning was proposed.
Appl. Sci. 2021, 11, 11268
Multi-class offline learning has many benefits, both in detection and classification. For example, the detection speed and discrimination of a multi-class offline learning model in classifying several objects are usually better than those of a model built from multiple single-class offline learning models, because multi-class offline learning algorithms are obtained by combining or modifying single-class offline learning algorithms in a variety of ways. In addition, an algorithm developed in this way adapts to the multi-category classification problem. Some examples of such algorithms are Multi-class Support Vector Machine [11], Joint Boosting [12], and Random Forests [13]. Although these multi-class offline learning algorithms effectively solve the problem of classifying multiple objects, people must collect and classify a large number of positive and negative samples in advance, which requires more memory space and training time. These problems can mean that the algorithm cannot be trained in real time in some specific application scenarios. To solve these problems, an online learning algorithm is proposed in this paper.
The online learning algorithm does not need to collect a large number of positive and negative samples in advance; instead, data are provided one by one so that the model adapts to the scene gradually. As a result, it does not need to learn the classifier at one time from a large number of samples. This not only reduces the training time of the detection model, but also decreases the memory space required. Most of these online learning algorithms evolved from offline learning algorithms. Like offline learning, some of them also evolved from single-class to multi-class.
Oza and Russell proposed Online Boosting [14], an improved online version of Adaptive Boosting [9] that achieves the effect of boosting by simulating the number of times the data are sampled. Grabner and Bischof proposed another Online Boosting method [15,16], which simulates Adaptive Boosting [9] but executes in a single-step manner. It mainly selects the better classifier from each selector and combines these classifiers into one strong classifier. To update the model, it replaces the weak classifier in the selector according to how well it adapts to the samples. In addition to these improved online learning algorithms inspired by offline learning algorithms, other algorithms were also proposed to combine with these online learning algorithms, e.g., On-line Conservative Learning, which was proposed by Roth [17,18]. In this algorithm, two models are compared, and the higher-confidence sample is taken to automatically update the model. It achieves the goal of completely automatic learning and does not need people to label samples. Saffari [19] proposed On-line Random Forests, which can be applied in multi-class object detection. It trains each decision tree by simulating the number of times the data are sampled, without using pruning. This is an improved version of Random Forest [13].
However, certain problems remain in the existing online learning algorithms. For example, because online learning algorithms are designed for general objects, detection accuracy and learning speed depend heavily on how suitable the selected features are for the samples [20,21]. The online learning algorithms [14][15][16] use a single feature during learning. If they use a feature that does not detect a certain object well, for example, a Haar-like feature [22] to detect a ball-shaped object or LBP [23] to detect a smooth object, the performance suffers. Multi-class classification in particular has this problem because a single feature has insufficient ability to distinguish several objects at the same time. Therefore, if we can dynamically allocate the type of feature according to the pattern pool, the problem can be solved. In addition, there are some online learning algorithms, such as On-line Random Forests [19], whose models are irreversible once a decision has been made. As a result, if the algorithm makes a false decision, the final performance of the detection model declines. If we can adjust the model thoroughly and correct the false decision at this point, the detection model is able to perform well again. Moreover, the classification result is also poor if the feature that is used cannot effectively classify certain special samples. Most of the existing online learning algorithms, such as [14][15][16], have this problem. A good solution would be to combine these learning algorithms with an external model, according to the status of the model after learning, to increase the detection performance of the model. Finally, since the samples of online learning are input one by one, it is also a challenge to preserve the more discriminative samples. On-line Conservative Learning [17,18] addresses this problem by choosing the sample with higher confidence as the updating sample. With this approach, however, we cannot determine the more representative sample. Therefore, On-line Learning with Adaptive Model [24] proposes a solution to sample storage and replacement. This method derives the more representative samples by calculating the importance and combining similar samples.
For these reasons, this paper proposes an online multi-class learning and detection system that can dynamically allocate features based on samples and can combine with external models based on the learning effect to increase the performance of the model. Specifically, the proposed model not only uses the concept of the sharing feature in the Joint Boosting classifier [12] to build the model, but also uses the feature selector in the Online Boosting algorithm [15,16] to update and select features that distinguish better. Then, we use a Back Propagation Neural Network [25] to combine it with the external model. In the pattern pool, the proposed model adopts the same updating method as On-line Learning with Adaptive Model [24]. In the feature part, the proposed model uses Haar-like [22], Histogram of Oriented Gradient (HOG) [26], and Local Binary Patterns (LBP) [23] features. In addition to simplifying the last two features so that they can use the integral image [27] to speed up the model, we also propose an algorithm to dynamically allocate and replace features according to the learning samples. Finally, we propose a new updating method by combining the Joint Boosting classifier [12] with Online Boosting [15,16]. The system is able to adjust the model according to all pattern pools so that it is not affected by previous false decisions.

Proposed System
The flowchart of the proposed system is shown in Figure 1. In the modeling part, we labeled some target patterns in the input images and collected positive and negative pattern pools from these input images. Then, according to the aspect ratio of the target patterns, we generated the Haar-like [22], HOG [26], and LBP [23] initial feature pools. After that, we determined the applicability of a feature to the patterns by calculating the positive and negative average response values. The adaptive feature pool was then formed by dynamically selecting different proportions of Haar-like, HOG, and LBP features from the initial feature pools. Finally, weak classifiers whose distinguishing abilities were stronger were chosen to compose a strong classifier. The model is built from these multiple strong classifiers, trained for each class. After the model is built, we set up multiple feature selectors in each strong classifier to speed up the updating of the model. Each selector includes features selected from the adaptive feature pool. In the detection part, the model calculates the score of each class and decides the classification result according to these scores. In the updating part, as each image is input, the model performs detection, and the misclassified patterns are then added to the pattern pool. From all the candidate features in each feature selector, we selected the weak classifier with better distinguishing ability to replace the weak classifier in the strong classifier that had less distinguishing ability on the new pattern pool. Then, we replaced and updated the candidate features that had less distinguishing ability in the feature selector. Finally, new features were generated to replace the features that were no longer applicable. In this way, we could dynamically adjust the model. The learned model can also use a neural network [25] to combine with an external model for enhancement and expansion.

Modeling
This section introduces how to build the object detection model. It mainly includes feature selection and classifier construction.

Pattern Pool Initialization
For each class of detection target, the user needs to label at least one pattern. The system decides the aspect ratio and scaling range of the detection window according to this pattern and generates positive and negative pattern pools for the algorithm to learn. Figure 2 illustrates the labeling of detection targets. The patterns in the positive and negative pattern pools are determined by the overlapping area of the labeled target and the sliding window in each input image, as given in (1). The size of the sliding window starts from the minimum labeled target and is enlarged by a fixed magnification (1.2 in this research) until it reaches the largest labeled size. The scanning stride used to scan the whole image is the short side of the sliding window times a fixed factor (0.1 in this research). Then, the patterns are generated by calculating the overlapping area. We used the same settings as in On-line Learning with Adaptive Model [24]. If the overlapping area is larger than 0.875, the window is viewed as a positive pattern, as shown in Figure 3. Note that setting the overlap threshold for positive samples below 0.8 is not conducive to the selection of initial features and also affects the correctness of the object classification. If the overlapping area is less than 0.3, the window is viewed as a negative pattern, as shown in Figure 4. The above steps are completed in the first labeled frame. In the other labeled frames, the positive patterns are chosen using the same steps, and the negative patterns are taken where the overlapping area is between 0 and 0.3. At the same time, if the overlapping area between the positive pattern's position in the first labeled frame and the positive pattern of this frame is 0, this area is also treated as a negative pattern to recover the background occluded by the positive pattern in the first labeled frame, as shown in Figure 5.
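The window generation and labeling rules described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: Equation (1) is not reproduced in this text, so intersection-over-union is assumed as the overlap measure, and the function names and the (x, y, w, h) box format are hypothetical.

```python
def overlap_ratio(a, b):
    """Overlap of two boxes (x, y, w, h).

    Assumption: intersection-over-union stands in for the paper's Equation (1).
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def sliding_windows(img_w, img_h, min_size, max_size, scale=1.2, stride_factor=0.1):
    """Yield windows from the smallest to the largest labeled size.

    The window grows by `scale` (1.2 in the paper); the stride is the short
    side of the window times `stride_factor` (0.1 in the paper).
    """
    w, h = min_size
    while w <= max_size[0] and h <= max_size[1]:
        wi, hi = int(round(w)), int(round(h))
        stride = max(1, int(round(min(wi, hi) * stride_factor)))
        for y in range(0, img_h - hi + 1, stride):
            for x in range(0, img_w - wi + 1, stride):
                yield (x, y, wi, hi)
        w, h = w * scale, h * scale


def label_window(window, target, pos_thr=0.875, neg_thr=0.3):
    """Label a window against one labeled target using the paper's thresholds."""
    r = overlap_ratio(window, target)
    if r > pos_thr:
        return "positive"
    if r < neg_thr:
        return "negative"
    return "ignored"  # windows between the two thresholds join neither pool
```

Windows whose overlap falls between 0.3 and 0.875 are simply skipped in this sketch, which matches the text's description of the two pools but is otherwise an assumption about how intermediate windows are handled.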

Generation of Adaptive Feature Pool
Since the focus of this paper is an online detection system for multiple objects, a single feature is insufficient. Therefore, we adopted and combined three different feature descriptors that perform well on different kinds of images. These are Haar-like features, which can detect changes in grey level; HOG features, which are good at analyzing gradient direction; and LBP features, which can analyze image texture. In the initializing part, the proposed system generates 1000 initial features each for Haar-like, HOG, and LBP according to the aspect ratio of the labeled target patterns. We adopted six kinds of basic Haar-like features with random scales and positions. Figure 6 shows the basic Haar-like features that are used and the process of feature extraction. To use HOG and LBP features, we simplified them to increase the processing speed. The simplified method is similar to a Haar-like feature. After randomly generating the position and size of a rectangle in the training pattern, HOG calculates the statistics of the gradient values in this rectangle to derive a nine-dimensional histogram for feature extraction, as shown in Figure 7. LBP calculates the statistics of the LBP codes in this rectangle to derive a sixteen-dimensional histogram for feature extraction, as shown in Figure 8. After calculating the initial features, we used the difference between the average feature response values of the positive patterns and the negative patterns ($P_{mean}$ and $N_{mean}$) as the distinguishing ability of a feature on the existing patterns. In this way, we selected a total of 1000 features from the Haar-like, HOG, and LBP initial features. The total of 1000 features can be adjusted according to the available computing performance. In the experiments, using 1000 initial features for selection and replacement allowed real-time computation while maintaining sufficient detection results. The numbers of dynamically selected features of each type are given in (2)-(4). Finally, we adopted these selected features to compose the adaptive feature pool.
Number of dynamic LBP features:
To evaluate the performance of the proposed adaptive feature pool, we used the CMU PIE (face) [28], VOC 2005 Person (pedestrian) [29], and CarData (vehicle) [30] databases. In the experiment, we used a feature pool of 1000 features and adopted Adaboost [9] for feature selection. The termination condition was when the number of weak classifiers reached 100 or the error rate approached zero (in this research, we set it as 0.00001). We performed the experiments three times for each database and used the average error rate for analysis. The results in Table 1 show that, whether a single type of feature is used or several are combined, these features all achieve good performance on single-class classification. However, the proposed adaptive feature pool shows the best performance in adapting to different kinds of objects.
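To make the feature computation concrete, the following sketch shows how a rectangle sum can be read from an integral image, how one basic two-rectangle Haar-like response is formed, and how the gap between $P_{mean}$ and $N_{mean}$ can serve as a feature's distinguishing ability. It is illustrative only: the function names are hypothetical, only one of the six basic Haar-like types is shown, and taking the absolute value of the difference is an assumption. The allocation formulas (2)-(4) are not reproduced here.

```python
import numpy as np


def integral_image(img):
    """Summed-area table padded with a zero first row and column."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y) and size (w, h)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]


def haar_two_rect(ii, x, y, w, h):
    """Left half minus right half of the rectangle (one basic Haar-like type)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


def distinguishing_ability(pos_responses, neg_responses):
    """Gap between P_mean and N_mean (absolute value is an assumption)."""
    return abs(float(np.mean(pos_responses)) - float(np.mean(neg_responses)))
```

Because every rectangle sum costs four table lookups regardless of its size, the same table can serve the simplified HOG and LBP histograms if one integral image is kept per histogram bin, which is one plausible reading of the simplification described above.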

Weak Classifier Initialization
A weak classifier is composed of a feature and a threshold. The threshold of the weak classifier is calculated from the corresponding feature and pattern pool. We took the training of three-class classification as an example to introduce the calculation method. There are $2^3 - 1 = 7$ combinations of strong classifiers for three-class classification, as shown in Table 2. For all the weak classifiers in each strong classifier, we calculated the average response value of positive patterns ($P_{mean}$), the average response value of negative patterns ($N_{mean}$), and the threshold according to the corresponding pattern pool. Here, we took $S_1(v)$ and $S_2(v)$ as an example, as shown in Figure 9. Since we used Adaboost [9] to build our system, the threshold of the feature was set to the middle value of $P_{mean}$ and $N_{mean}$, that is, $(P_{mean} + N_{mean})/2$, as in (5).
If the response value of the positive patterns is larger than the threshold, the polar is 1; otherwise, it is −1. Since the HOG and LBP features have multiple dimensions, the determination adopted a majority decision. If the number of dimensions that classified a pattern as positive was larger than the number of dimensions that classified it as negative, the pattern was viewed as a positive pattern. Otherwise, it was viewed as a negative pattern. Finally, the polar is calculated as in (6).
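The threshold, polar, and majority decision can be sketched as below. This is a hedged illustration, not the authors' code: Equations (5) and (6) are not reproduced in this text, so the threshold is taken as the midpoint described above, and the polar is assumed to be +1 when $P_{mean}$ lies above the threshold. The names `train_weak` and `predict_weak` are hypothetical.

```python
import numpy as np


def train_weak(pos, neg):
    """Fit one weak classifier.

    pos, neg: feature responses of shape (n_patterns, n_dims); n_dims is 1 for
    a Haar-like feature and 9 or 16 for the simplified HOG or LBP histograms.
    """
    pos = np.asarray(pos, dtype=float).reshape(len(pos), -1)
    neg = np.asarray(neg, dtype=float).reshape(len(neg), -1)
    p_mean, n_mean = pos.mean(axis=0), neg.mean(axis=0)
    threshold = (p_mean + n_mean) / 2.0           # midpoint of P_mean and N_mean
    polar = np.where(p_mean > threshold, 1, -1)   # assumed form of the polar
    return threshold, polar


def predict_weak(x, threshold, polar):
    """Return +1 (positive) or -1 (negative) by majority vote over dimensions."""
    x = np.asarray(x, dtype=float).reshape(-1)
    votes = np.where(polar * (x - threshold) > 0, 1, -1)
    return 1 if votes.sum() > 0 else -1
```

For a one-dimensional Haar-like response the vote reduces to a single threshold comparison; ties in the multi-dimensional case fall to the negative class, in line with the "otherwise" rule in the text.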

Building of Initial Detection Model
After acquiring all the weak classifiers, we used the concept of the sharing feature in Joint Boosting [12] to build the multi-class detection model. We used the same adaptive feature pool to train all the strong classifiers. The training data used the allocation shown in Table 2. Adaboost [9] was adopted as the learning algorithm. The termination condition was when the error rate approached zero (in this research, we set it as 0.00001) or when the number of iterations reached 50. Finally, we combined these strong classifiers to derive a feature-sharing model, as shown in Figure 10. The detection models of the three classes, $H(v,1)$ to $H(v,3)$, are shown in (7).
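As a rough sketch of how a feature-sharing model can produce per-class scores, the code below enumerates the $2^3 - 1$ class subsets and, following the general Joint Boosting formulation, sums for each class the scores of every shared strong classifier whose subset contains that class. Equation (7) and Table 2 are not reproduced in this text, so the summation form, the subset ordering, and the function names are assumptions.

```python
from itertools import combinations


def class_subsets(n_classes):
    """All non-empty subsets of classes 1..n_classes (2**n - 1 of them)."""
    classes = range(1, n_classes + 1)
    return [frozenset(c)
            for k in range(1, n_classes + 1)
            for c in combinations(classes, k)]


def class_scores(subset_scores, n_classes):
    """Combine shared strong-classifier scores into one score per class.

    subset_scores: dict mapping a frozenset of classes to the score S(v) of
    the strong classifier trained for that subset.
    """
    return {c: sum(s for subset, s in subset_scores.items() if c in subset)
            for c in range(1, n_classes + 1)}


def classify(subset_scores, n_classes):
    """Return the class with the highest combined score."""
    scores = class_scores(subset_scores, n_classes)
    return max(scores, key=scores.get)
```

Strong classifiers that are skipped can simply be left out of `subset_scores`, in which case they contribute nothing to any class.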

Feature Selector Initialization
We adopted the concept of the Online Boosting algorithm [15]. While updating the model, we only updated and calculated the features in the feature selector instead of all the features in the adaptive feature pool, to reduce the computation. To preserve the features with better distinguishing ability in the feature selector, a feature updating method is used. First, the error rate of each feature with regard to the pattern pool in the initial detection model was used to determine whether the feature possessed a good distinguishing ability and could be preserved in the feature selector. We preserved 50 features in each feature selector. This method gives the model a better feature basis when updating, as shown in Figure 11. Here, the preserved proportion of Haar-like, HOG, and LBP features was the same as in the adaptive feature pool, as mentioned in Section 2.1.2. Since the termination condition in the training of each strong classifier was when the number of weak classifiers reached 50, each strong classifier was provided with 50 feature selectors.

Detection
During detection, a sliding window scans the whole image to extract candidate targets, and each target is input to the model. Then, we calculate the response value of the target using the strong classifier of each class. The output is the class with the highest response value. The detection process is shown in Figure 12.
Further, to filter out the background quickly, we separated the strong classifiers of each class into three layers. If the scores of the strong classifiers in one layer were not more than half, the target was not calculated in the next layer. Take class 1 as an example: the first layer is $S_1(v)$; the second layer is $S_2(v)$ and $S_3(v)$; and the third layer is $S_5(v)$. If the score of any strong classifier in a layer is more than half, the target is passed to the next layer. If the scores of both $S_2(v)$ and $S_3(v)$ are not more than half, the third-layer strong classifier, $S_5(v)$, is not calculated. The scores of the strong classifiers that are not calculated are ignored to speed up the detection process. If the error rate of a strong classifier on the pattern pool is higher than 0.3, its classes are considered not to share features well. This strong classifier is ignored during detection, and the target is passed to the next layer for calculation. Finally, we merged the detection bounding boxes ($B_{det}$) whose overlapping area was more than 0.5. The class of the merged detection bounding box was decided by majority decision. For example, if we merged two class 1 detection bounding boxes and one class 2 detection bounding box, the output was class 1, and the position and size were the average of these three boxes. The calculation of the overlapping area is shown in (1).
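The merging step can be illustrated with a short sketch. It assumes intersection-over-union as the overlap measure (Equation (1) is not reproduced in this text) and a greedy grouping in which each box is compared with the first box of an existing group; the paper does not specify the grouping order, so this is only one plausible reading, and the function names are hypothetical.

```python
from collections import Counter


def iou(a, b):
    """Intersection-over-union of two boxes (x, y, w, h); assumed overlap measure."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def merge_detections(dets, thr=0.5):
    """Merge overlapping detections given as (x, y, w, h, cls) tuples."""
    groups = []
    for d in dets:
        for g in groups:
            if iou(d[:4], g[0][:4]) > thr:   # compare with the group's first box
                g.append(d)
                break
        else:
            groups.append([d])
    merged = []
    for g in groups:
        cls = Counter(d[4] for d in g).most_common(1)[0][0]   # majority class
        box = tuple(sum(d[i] for d in g) / len(g) for i in range(4))  # average box
        merged.append(box + (cls,))
    return merged
```

With two class 1 boxes and one class 2 box in the same group, the merged output is class 1 and the box is the average of all three, as in the example above.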

Updating

Pattern Pool Updating
This paper followed the pattern updating method in On-line Learning with Adaptive Model [24], since the goal of the proposed method is to perform online multi-class learning and detection in real time. Considering the trade-off between computing speed and accuracy, and based on experience in different experimental environments, a maximum of about 30 patterns per class is a relatively good balance. If there are misclassified patterns that need to be added to the model, the system checks whether there are more than 30 patterns of each class in the pattern pool. If so, we use the Euclidean distance, as in (8), to merge similar patterns.
After merging, if there are still more than 30 patterns, we rank the patterns by importance, calculated as shown in (9). The patterns with lower importance were removed until the number of patterns in each class was less than 30 (in this research, m and n are the numbers of negatives and positives, respectively). Experiments were performed with $\lambda_1$ and $\lambda_2$ each set between 0 and 1; $\lambda_1 = 0.1$ and $\lambda_2 = 0.3$ yielded the best results. Finally, the patterns that needed to be updated were added to the pattern pool. In the process of updating, we adopted Adaboost [9] as the training method, and the weights of merged patterns were added together. In each iteration, the weight of a new pattern was set to the average weight of all the patterns in the pattern pool, so that the distribution of the patterns was balanced in the next updating iteration and the model could be updated starting from any weak classifier. Positive patterns were stored in a single pattern pool, as shown in Figure 13. For negative patterns, as shown in Figure 14, the entire negative pattern pool built during initialization was preserved, and merging and replacement were performed only on the updated negative pattern pool.
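A simplified sketch of the pool maintenance is given below. Several details are assumptions because Equations (8) and (9) are not reproduced in this text: the distance threshold `dist_thr` is a hypothetical parameter, a merged pattern is represented by the weight-averaged vector of its members, and the importance values are supplied by the caller instead of being computed from $\lambda_1$ and $\lambda_2$.

```python
import numpy as np


def merge_similar(patterns, weights, dist_thr):
    """Greedily merge patterns whose Euclidean distance is below dist_thr."""
    merged_p, merged_w = [], []
    for p, w in zip(patterns, weights):
        p = np.asarray(p, dtype=float)
        for i, q in enumerate(merged_p):
            if np.linalg.norm(p - q) < dist_thr:
                # Weighted average of the two patterns (assumption); the
                # weights of merged patterns are added together, as in the text.
                merged_p[i] = (q * merged_w[i] + p * w) / (merged_w[i] + w)
                merged_w[i] += w
                break
        else:
            merged_p.append(p)
            merged_w.append(w)
    return merged_p, merged_w


def prune(patterns, weights, importance, max_size=30):
    """Keep the max_size most important patterns, preserving their order."""
    order = np.argsort(importance)[::-1][:max_size]
    keep = sorted(order.tolist())
    return [patterns[i] for i in keep], [weights[i] for i in keep]
```

A new pattern would then be appended with a weight equal to the mean of the retained weights, which keeps the weight distribution balanced for the next updating iteration as described above.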

Feature Selector Updating
If a feature in the feature selector had an error rate higher than 0.3 on the current pattern pool, the system viewed this feature as not applicable for the classification and removed it right away. Then, it randomly chose a feature from the adaptive feature pool that had not been used by this feature selector and that was of the same type as the removed feature. This was repeated until all the features in the feature selector were updated. The method is shown in Figure 15.
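The replacement rule can be sketched as follows. The data layout (feature identifiers, a dictionary mapping each identifier to its type, and a dictionary of error rates) is hypothetical and chosen only for illustration; keeping a feature when no unused candidate of the same type exists is also an assumption.

```python
import random


def update_selector(selector, pool, error_rate, max_error=0.3, rng=random):
    """Replace poorly performing features in one feature selector.

    selector:   list of feature ids currently held by the selector
    pool:       dict mapping every feature id in the adaptive pool to its type
    error_rate: dict mapping feature id to its error on the current pattern pool
    """
    used = set(selector)
    updated = []
    for fid in selector:
        if error_rate[fid] <= max_error:
            updated.append(fid)          # feature is still applicable
            continue
        # Candidates: same type as the removed feature, unused by this selector.
        candidates = [c for c, t in pool.items() if t == pool[fid] and c not in used]
        if candidates:
            new = rng.choice(candidates)
            used.add(new)
            updated.append(new)
        else:
            updated.append(fid)          # nothing to swap in (assumption)
    return updated
```

Passing a seeded `random.Random` instance as `rng` makes the random replacement reproducible when comparing runs.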

Adaptive Feature Pool Updating
There are two kinds of feature updating procedures in our model: single-feature updating and whole-type feature updating. In single-feature updating, we removed from the adaptive feature pool any feature that had been abandoned by all the feature selectors and generated another feature of the same type. In whole-type feature updating, if a feature selector had chosen all the features of a certain type from the adaptive feature pool, the system removed all the features of this type and generated new features of this type. Finally, we calculated the threshold of each feature with respect to the pattern pool. The calculation method is the same as in Section 2.1.3. The process is shown in Figure 16.

Detection Model Updating
After updating the adaptive feature pool, we updated the strong classifiers in the model, which are generated according to the classification classes. To preserve the weak classifiers that fit most of the patterns and to make the strong classifier capable of adapting to a new pattern pool, we removed weak classifiers one by one, working from the halfway point of the strong classifier's weak classifiers back to the first weak classifier. In this way, we could build the strong classifier by combining weak classifiers whose error rates were lower than the allowable value. The allowable value is defined as the average error rate of all weak classifiers in the strong classifier on the pattern pool, as shown in Figure 17. Then, we used the pattern and weight updating results, derived as in Section 2.3.1, to select the most suitable weak classifiers from the feature selectors until the error rate approached zero (in this study, we set it as 0.00001) or the number of iterations reached 50, as shown in Figure 18. By doing so, we could speed up the process of model updating.

External Model Combination
After the model was completed, if the accuracy of a certain class was not sufficient, we could combine the model with an external model using a Backpropagation Neural Network [25] to enhance the detection system. The external model should be a single-object detection model with higher accuracy for the class whose accuracy is insufficient in the online learning model. We adopted a three-layer neural network with 300 nodes in the hidden layer. The activation function is a sigmoid function. Through the backpropagation neural network, we learned the error rate of each class output by the trained online learning model and by the external model. In this way, we can increase the final accuracy without retraining our online model. The numbers of input and output nodes were based on the number of classes. Here, we took three-class classification as an example of combining our model with an external model, as shown in Figure 19.
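A minimal NumPy sketch of such a fusion network is shown below. It follows the stated design (three layers, 300 hidden nodes, sigmoid activation, backpropagation), but the input layout is an assumption: here the per-class scores of the online model and of the external model are concatenated, giving six inputs for three classes. The class name `FusionNet`, the learning rate, and the weight initialization are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class FusionNet:
    """Three-layer network (input, 300 hidden nodes, output) with sigmoid units."""

    def __init__(self, n_in, n_hidden=300, n_out=3, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.lr = lr

    def forward(self, x):
        self.h = sigmoid(x @ self.w1 + self.b1)
        self.y = sigmoid(self.h @ self.w2 + self.b2)
        return self.y

    def train_step(self, x, target):
        """One backpropagation step; returns the mean square error before the update."""
        y = self.forward(x)
        err = y - target
        d2 = err * y * (1.0 - y)                         # output-layer delta
        d1 = (d2 @ self.w2.T) * self.h * (1.0 - self.h)  # hidden-layer delta
        n = len(x)
        self.w2 -= self.lr * self.h.T @ d2 / n
        self.b2 -= self.lr * d2.mean(axis=0)
        self.w1 -= self.lr * x.T @ d1 / n
        self.b1 -= self.lr * d1.mean(axis=0)
        return float(np.mean(err ** 2))
```

Training would repeat `train_step` until a stopping rule is met, and the fused class is then the index of the largest output; neither the online model nor the external model is retrained in this scheme.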

Experimental Results
The following settings are used in the next sections. The termination condition of the strong classifier is when the number of weak classifiers reaches 50 or the error rate approaches zero (in this paper, we set it as 0.00001). The number of patterns preserved in the pattern pool is 30 per class. In addition, the learning of the external model terminates when the number of iterations exceeds 6000 or the mean square error between the output of the neural network and the expected value is less than one. Our system ran on a Windows 7 operating system. The computer was equipped with an Intel(R) Core(TM) i7-2630QM CPU @ 2.00 GHz and 16 GB DRAM, and the system was implemented in Microsoft Visual C++.

System Evaluation Standard
There are various kinds of system evaluation standards. We adopted three different evaluation methods. The first is the F-measure, a comprehensive evaluation of Recall and Precision. The second is the Equal Error Rate, which can evaluate the applicability of the classifier. The third is the most basic and common one, the Error Rate. Before introducing the first evaluation standard, we describe the standard defined by the PASCAL VOC 2007 Challenge for determining whether a detection target is correctly detected.

PASCAL VOC 2007 Challenge
In the experiment, we adopted the system evaluation of the PASCAL VOC 2007 Challenge. It is an evaluation standard for object detection, recognition, and classification in computer vision. The full name is "The Pattern Analysis, Statistical Modeling and Computational Learning Visual Object Classes 2007 Challenge".
There is a positive correlation between the assessment of the system and the ability to correctly detect the target. The standard uses the overlapping area of the detected bounding box, predicted by the system, and the ground-truth bounding box, provided by the database, to determine whether the target is correctly detected, as given in (10). In Figure 20, the yellow line is the ground-truth bounding box ($B_{gt}$) provided by the database, and the red line is the detected bounding box ($B_{det}$) predicted by the system.
After defining the condition of correct detection, we introduce the F-measure. We take pedestrian detection as an example to define some terms for different detection results. The number of correctly detected pedestrians is the true positives; the number of backgrounds that were wrongly detected as pedestrians is the false positives; the number of pedestrians that were not detected is the false negatives; and the number of real backgrounds that were correctly identified as non-pedestrian is the true negatives. Precision and Recall are two measurements widely used in computer vision. Precision is the measurement based on the detected results. It represents the proportion of real pedestrians among all the detected bounding boxes, as given in (11).

$$\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}} \tag{11}$$

Recall is the measurement based on the ground-truth bounding boxes provided by the database. It represents the proportion of correctly detected pedestrians among all the ground-truth bounding boxes, as given in (12).

$$\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}} \tag{12}$$

However, we cannot evaluate the system by precision or recall alone. If the precision is low and the recall is high, there are few false negatives and many false positives, as shown in Figure 21. If the precision is high and the recall is low, there are few false positives and many false negatives, as shown in Figure 22. Therefore, we adopted the F-measure as our evaluation standard, which can consider both precision and recall. The formula is shown in (13). In the following experiments, we used precision, recall, and the F-measure as the comparison standard.
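As a rough illustration of this evaluation, the following Python sketch first decides whether a detection is correct using the PASCAL VOC overlap criterion (intersection over union of the two boxes greater than 0.5) and then computes precision, recall, and the F-measure from the resulting counts. The box format `(x1, y1, x2, y2)`, the function names, and the balanced (harmonic-mean) form of the F-measure are assumptions made for this sketch, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(b_det, b_gt, threshold=0.5):
    """PASCAL VOC style test: the overlap ratio must exceed the threshold."""
    return iou(b_det, b_gt) > threshold


def precision_recall_f(tp, fp, fn):
    """Precision, recall, and balanced F-measure (assumed form) from counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical boxes: the detection covers 80% of the ground truth.
    print(is_correct_detection((0, 0, 10, 8), (0, 0, 10, 10)))
    # Hypothetical counts: 80 true positives, 20 false positives, 10 misses.
    print(precision_recall_f(80, 20, 10))
```

With the hypothetical counts above, precision is 0.8 and recall is about 0.89, so a single F-measure of about 0.84 summarizes both.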

Equal Error Rate
Before introducing the Equal Error Rate (EER), we have to define some terms first. The proportion of correctly detected positives among all positive samples is the true positive rate (TPR), as in (14). The proportion of misclassified negative samples among all negative samples is the false positive rate (FPR), as in (15).

$$\text{True positive rate} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}} \tag{14}$$

$$\text{False positive rate} = \frac{\text{False positives}}{\text{False positives} + \text{True negatives}} \tag{15}$$

By adjusting the threshold of the detection model, we can use TPR (Y axis) and FPR (X axis) to draw the ROC curve, as shown in Figure 23. The blue line is the result obtained by a classifier that guesses randomly. If the threshold of the classifier is very high, all samples are classified as negative samples. Then, the FPR and TPR are both 0, as in the case of point (a) (0,0) in Figure 23. If the threshold of the classifier is very low, all samples are classified as positive samples, and then the FPR and TPR are both 1, as in the case of point (b) (1,1) in Figure 23. Adjusting the threshold between these extremes draws the ROC curve between (0,0) and (1,1). The closer the ROC curve is to (0,1), the better the classifier; at point (c) in Figure 23, all the predictions are correct. If the ROC curve is closer to (1,0), as in the case of point (d) in Figure 23, all the predictions are wrong. If we draw a diagonal line between (1,0) and (0,1) on the ROC plot, the intersection of this line and the ROC curve is the Equal Error Rate, as in Figure 24. This Equal Error Rate is around 0.73. The setting of this threshold gives the system its most balanced performance. Therefore, the Equal Error Rate is usually used to determine the performance of the system. In the following experiments, we also use it to perform comparisons with different classifiers.
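The threshold sweep and the intersection with the diagonal can be sketched in a few lines of Python. On the line from (0,1) to (1,0), FPR equals 1 − TPR, so the sketch picks the operating point where these two quantities are closest. The function names and the toy scores are hypothetical, and the sketch uses the nearest sampled point rather than interpolating along the curve. It returns both coordinates, since either the error or the corresponding TPR may be reported depending on convention.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs obtained by sweeping the decision threshold.

    labels are 1 for positive and 0 for negative; a sample is predicted
    positive when its score is greater than or equal to the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: all negative
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points  # the last point is (1.0, 1.0): all positive


def equal_error_point(points):
    """Sampled ROC point closest to the diagonal where FPR = 1 - TPR."""
    return min(points, key=lambda p: abs(p[0] - (1.0 - p[1])))


if __name__ == "__main__":
    # Hypothetical classifier scores and ground-truth labels.
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
    labels = [1, 1, 0, 1, 0, 0]
    fpr, tpr = equal_error_point(roc_points(scores, labels))
    print(f"FPR = {fpr:.3f}, TPR = {tpr:.3f}")
```

For real score distributions with many distinct thresholds, the nearest sampled point is usually close enough; linear interpolation between the two neighbouring points is a common refinement.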

Error Rate
The Error Rate is a common and straightforward system evaluation standard. The definition is shown in (16). We used it in the experiment to compare the performance of our system.
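For reference, a minimal Python sketch of the error rate follows. It assumes the usual definition, namely the fraction of misclassified samples among all evaluated samples; the function name and the counts are hypothetical.

```python
def error_rate(tp, tn, fp, fn):
    """Fraction of misclassified samples (assumed standard definition)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples to evaluate")
    return (fp + fn) / total


if __name__ == "__main__":
    # Hypothetical counts: 25 of 200 samples are misclassified.
    print(error_rate(tp=90, tn=85, fp=15, fn=10))
```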

Comparison with Offline Learning Algorithms
In this section, we compare the proposed method with other offline learning algorithms, both those that use a single feature and those that use multiple features, on two different multi-class classification databases to demonstrate the multi-class learning effect of the proposed method.

VOC2005 Multi-Class Classification
In this section, we compare with offline learning algorithms that use a single feature. The database is VOC 2005 [29], which includes four classes for classification: bikes, vehicles, motorcycles, and pedestrians. The training data contain a total of 63 bikes, 159 vehicles, 109 motorcycles, and 81 pedestrians for use as positive samples. The negative samples of one class are the positive samples of the other classes. For example, the negative samples of bikes are the 159 vehicle, 109 motorcycle, and 81 pedestrian images. VOC 2005 provides two test sets, the simple one (test1) and the difficult one (test2). This paper adopts the difficult test2 dataset. It is difficult because the variability in samples is large. For example, there are side views, front views, and fallen-down vehicles. There are different pedestrian poses, and sometimes there is occlusion. The testing data include 342 bikes, 295 vehicles, 202 motorcycles, and 874 pedestrians. The negative samples were chosen in the same way as for the training data.
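The one-versus-rest construction of negative samples can be illustrated with the training counts quoted above. The sketch works on counts only; the dictionary layout and the function name are hypothetical.

```python
# Positive training samples per class, as quoted in the text.
TRAIN_POSITIVES = {
    "bikes": 63,
    "vehicles": 159,
    "motorcycles": 109,
    "pedestrians": 81,
}


def negatives_per_class(positives):
    """Negatives of a class are the positives of all the other classes."""
    total = sum(positives.values())
    return {name: total - count for name, count in positives.items()}


if __name__ == "__main__":
    for name, n_neg in negatives_per_class(TRAIN_POSITIVES).items():
        print(f"{name}: {TRAIN_POSITIVES[name]} positives, {n_neg} negatives")
    # bikes: 63 positives, 349 negatives (159 + 109 + 81)
```

One consequence visible in the counts is that the positive-to-negative ratio differs per class, so small classes such as bikes are trained against a much larger negative set.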
We performed the experiment eight times and report the results in this section. Each experiment used 40% of the training data for initialization, and then 10% of the other 60% of the data was used as a unit for updating the model. The final result used EER for evaluation, as shown in Table 3. The standard deviation of the average was only 0.7%, which indicates that the proposed method is very stable. A comparison with other work is shown in Table 4. In Table 4, the main comparison method, ICBS [31], uses Saliency Driven Nonlinear Diffusion and Multi-scale Information Fusion to blur the background so that the target object is better represented during feature extraction. Finally, it uses the single feature SIFT [32] combined with SVM [10] to perform training and classification. Table 4 shows that although we used online learning for training, the proposed method still achieved good performance because the dynamic allocation of multiple features makes the feature pool more adaptive to the pattern pool. The proposed method outperformed other single-feature offline learning algorithms by at least 3.1% in both average and best results. This shows that the proposed method can obtain a good result in multi-class classification.
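The initialization and update schedule used in these experiments can be sketched as follows. The sketch reads the protocol as: 40% of the shuffled training data initializes the model, and the remaining 60% is divided into ten update units of roughly equal size. That reading, the random shuffling, the seed, and the function name are assumptions made for illustration; the actual model update step is not shown.

```python
import random


def split_for_online_learning(samples, init_fraction=0.4, n_updates=10, seed=0):
    """Split samples into an initialization set and a list of update units."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_init = int(len(shuffled) * init_fraction)
    initial, rest = shuffled[:n_init], shuffled[n_init:]
    # Spread the remainder as evenly as possible over n_updates units.
    base, extra = divmod(len(rest), n_updates)
    units, start = [], 0
    for i in range(n_updates):
        size = base + (1 if i < extra else 0)
        units.append(rest[start:start + size])
        start += size
    return initial, units


if __name__ == "__main__":
    # 412 is the total number of VOC 2005 training positives quoted above.
    initial, units = split_for_online_learning(range(412))
    print(len(initial), [len(u) for u in units])
```

Repeating this split with eight different seeds would correspond to the eight runs whose spread is summarized by the reported standard deviation.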

Caltech Multi-Class Classification
In this section, we perform a comparison with other offline learning algorithms that use multiple features on the Caltech [36] database. This database provides a four-class classification problem: aircraft, motorcycles, vehicles, and faces. There are 1074 aircraft, 826 motorcycles, 1155 vehicles, and 450 faces. The negative samples consist of 900 background images for aircraft, motorcycles, and faces, and 1370 background images, including roads and street scenes, for vehicles.
While using this database, we combined all the background images together and followed the same protocol as the comparison target [37], which used 150 random positive samples and 150 negative samples in each class, totaling 1200 positive and negative samples over the four classes, for training. The remaining positive and negative samples were used for testing. We also performed eight experiments, and the final results are reported as error rates, as indicated in Table 5. We found that the standard deviation was only 0.2%, which again indicates the stability of our method. A comparison with GVC [37], which uses both the Gabor feature [38] and the Moment feature [39] for SVM [10] training, is shown in Table 6. The Gabor feature is usually used for age recognition and texture analysis, and the Moment feature is usually used for aircraft recognition and shape analysis. Combining these two features extracts both detail and contour information, which improves classification.

Further, a Backpropagation Neural Network combines the proposed method with a single-class model, which achieved an error rate of 5.7% for aircraft, 55.5% for motorcycles, 55.8% for vehicles, and 47.8% for faces, and thus the recognition ability for aircraft increased. Training terminated after more than 6000 iterations or when the mean square error between the output of the neural network and the expected output was less than one. We performed the experiments eight times and used the best, worst, and average values for further analysis, as in Table 7. We achieved an error rate of only 2.3%, which shows that even for a difficult classification goal, we can still use an external model to enhance the proposed model and achieve the ideal result.

We used the CAVIAR database [42], which was gathered from surveillance video of a shopping mall in Portugal. In the training stage, 1200 frames were used for online learning, and 34,705 frames were used for testing, the same as for the other comparison targets. In the testing video, each person in every frame was labeled even if partially occluded. Figure 25 shows the definition of the region of interest in this database. We used these settings to perform a comparison with OLAM [24], OC [17], and ORF [19]. OLAM is an online learning system combining bagging and a cascade structure. OC is an online learning algorithm combining online boosting with principal component analysis. Like ORF, it is a competitive online learning algorithm. Compared with the other methods, ORF requires many parameters to be set. In this experiment, we used five trees with a maximum depth of 20 and 900 random features in each node. Before the split of each node, at least 200 pieces of data need to be considered, and the minimum gain of a split should be less than 0.1. The final result is shown in Table 8.

OC uses a shape and appearance model to automatically obtain new patterns and update the model, while ORF, OLAM, and our work obtain new patterns by hand labeling. Here, we found that ORF performed less well than our approach even though it also uses hand labeling, because the split nodes in ORF cannot be changed. If ORF encounters other important patterns, it cannot adjust the model appropriately according to these patterns while updating, which affects its performance. In the final results, although we performed less well in terms of precision, we achieved much better recall than the other methods. The proposed method outperformed OLAM in terms of recall by 6.4% and F-measure by 1.1%. Furthermore, we used fewer frames for learning than the other methods but achieved comparable or even better F-measure results, mainly because we adopted dynamic multiple feature allocation to make the feature pool more adaptive to the pattern pool. The results demonstrate the performance of our online learning algorithm.

Caltech Multi-Class Classification
In this section, we perform a comparison with ORF [19], which among all the online learning algorithms can perform multi-class classification, on the Caltech database [38]. According to the description of ORF, its result approaches that of the offline random forest, as shown in Figure 26. Therefore, we compared our method with Random Forest (RF) [13]. The experiment adopted the RF algorithm provided by Matlab R2015a, with the internal parameters set to their defaults. We provided 400 Haar-like, HOG, and LBP features each for feature extraction and used 100 trees for training, the same as ORF, to indirectly compare our research with ORF. The F-measure is used in this experiment to evaluate the result, as shown in Table 9. Although the average precision fell by around 2%, the proposed method outperformed RF in terms of recall by 7.6% and F-measure by 3%. This again shows that our online learning algorithm can achieve comparable performance.

Online Learning Curve
We recorded the learning curve of pedestrian detection on the CAVIAR database [42], as shown in Figure 27. We analyzed the learning effect of our system according to this online learning curve. After training of the initial model, the positive patterns that the system has seen are not diverse enough, and thus the learned weak classifiers overfit to the positive patterns. Therefore, at the beginning of the learning curve, the precision was high, but the recall was relatively low. As more images were input, the observed positive patterns became more diverse, and the weak classifiers were replaced and updated. This increases the ability to describe positive patterns. Eventually, although some false positives were generated, causing precision to decrease, recall increased substantially. At around 260 frames, recall and precision crossed; the learning of the system converged, although new positive and negative patterns continued to be added to the system and merged. We then analyzed the F-measure alone, as shown in Figure 28. We found that our approach starts to converge in terms of the F-measure after around 260 frames. After that, the F-measure basically oscillates between 80% and 85%. Sometimes, recall and precision decrease somewhat, and thus the F-measure decreases as well. After observing the pattern pool, we found that at these points the pattern pool is performing merging, which temporarily decreases the number of positive and negative patterns that the system can refer to. However, after updating is performed a few more times, the F-measure increases again.
From the whole learning curve, we can see that there is an obvious upward trend in the F-measure.

Conclusions and Future Works
We proposed a new online multi-class learning algorithm. There are three main parts to this algorithm. The first is the adaptive feature pool: we proposed a feature allocation method that allocates the proportion of Haar-like, HOG, and LBP features dynamically according to the types of patterns, making the feature pool more adaptive to the pattern pool. The second part is the combination with the external model: after the model has been learned, we can combine it with the external model according to the learning condition to enhance the adaptive ability of the model. The last part is the model structure: by combining and modifying the sharing feature concept in Joint Boosting [12], the feature selector in Online Boosting [15], and the learning method in Adaptive Boosting [9,43–45], we further proposed a new model learning and updating method that can adjust the model thoroughly according to the pattern pool.
We demonstrated the effect of the adaptive feature pool, of combining the external model, and of multi-class learning. We also verified the updating effect of the whole model and the single-class learning effect. Among the general targets, including pedestrians, vehicles, faces, and motorcycles, our method outperformed other single-feature offline learning algorithms by 3.1% in EER; its error rate is 0.6% lower than that of other multiple-feature offline learning algorithms; in terms of the F-measure, it achieved results 1.1% higher than those of other single-feature online learning algorithms; finally, it also outperformed other multiple-feature online learning algorithms by 3% in terms of the F-measure.
The experimental results showed that the proposed method can effectively solve most of the online learning and multi-class problems and is better than the other methods. However, since we used more feature types and classifiers to build the model, the detection speed is slower. In the CAVIAR database, the resolution is 320 × 240 and the scanning window ranges from 29 × 70 to 50 × 120. The detection speed and the effect of each image are shown in Table 10. This detection speed poses difficulties in practice. In the future, the method can be sped up by program optimization or parallelization. Alternatively, we can use other general and rapidly computed features to make it more adaptive to the feature pool and to decrease the number of weak classifiers required by the model, thus reducing the detection time of the system. Furthermore, our system currently updates the model by hand labeling. In the future, some algorithms can be used to reduce the amount of updating.

Figure 5. Makeup of the background in the first labeled frame.

Figure 7. Feature extraction process of HOG.

Figure 9. Calculation of response value.

Figure 13. Storing method of the positive pattern pool.

Figure 14. Storing method of the negative pattern pool.

Figure 15. Updating method of the feature selector.

Figure 17. Method of deriving the strong classifier, whose error rate is lower than the allowable value.

Figure 18. Selection of a suitable weak classifier again from the feature selector.

Figure 19. Illustration of the neural network.

Figure 20. Ground-truth bounding box and detected bounding box.

Figure 21. Low precision and high recall.

Figure 22. High precision and low recall.

Figure 26. ORF 1 approaches the same error as RF.

Table 1. Comparison result (error rate) of different feature types.

Table 2. Training data of each classifier.

Table 4. Comparison results (EER) with other algorithms in VOC2005.

Table 6. Enhanced result (error rate) of combination with external aircraft model in Caltech.

Table 7. Enhanced result (error rate) of combination with the external aircraft model in Caltech.

Figure 25. Region of interest.

Table 8. Comparison results with other online learning algorithms in CAVIAR.

Table 9. Comparison result of our research with RF 1 in Caltech 1.

Table 10. Detection speed of each frame.