Evaluation of Robust Spatial Pyramid Pooling Based on Convolutional Neural Network for Traffic Sign Recognition System

Traffic sign recognition (TSR) is a noteworthy issue for real-world applications such as autonomous driving systems, as it plays a central role in guiding the driver. This paper focuses on Taiwan's prohibitory signs because no database or recognition system previously existed for Taiwan's traffic signs. We investigate state-of-the-art object detection systems (Yolo V3, Resnet 50, Densenet, and Tiny Yolo V3) combined with spatial pyramid pooling (SPP). We adopt the concept of SPP to improve the backbone networks of Yolo V3, Resnet 50, Densenet, and Tiny Yolo V3 for feature extraction, and we use spatial pyramid pooling to study multiscale object features thoroughly. The evaluation of these models includes vital metrics such as mean average precision (mAP), workspace size, detection time, intersection over union (IoU), and the number of billion floating-point operations (BFLOPS). Our findings show that Yolo V3 SPP achieves the best total BFLOPS (65.69) and mAP (98.88%). Moreover, the highest average accuracy is achieved by Yolo V3 SPP at 99%, followed by Densenet SPP at 87%, Resnet 50 SPP at 70%, and Tiny Yolo V3 SPP at 50%. Hence, SPP improves the performance of all models in the experiment.


Introduction
In all countries, traffic signs carry essential information for drivers on the road, including speed limits, direction indications, stop information, and so on [1]. Traffic sign recognition systems (TSRS) are crucial in numerous real-world applications, such as autonomous driving, traffic surveillance, driver protection and assistance, road network maintenance, and the investigation of traffic disturbances [1]. Two related subjects that are important in TSRS are traffic sign detection (TSD) and traffic sign recognition (TSR). TSD directly affects the safety of drivers, as ignoring a sign can easily cause damage. Automatic systems that support drivers can improve unsafe driving behavior based on the detection and recognition of signs [2]. TSRS is a difficult and complicated task as a consequence of several problems, including occlusion, illumination, color variation, rotation, and skew arising from the camera setup in the surroundings. Further, there can be multiple signs in an image with different colors, sizes, and shapes [3,4].
Traffic signs are deliberately designed to have distinguishable and specific features, such as simple shapes and uniform colors, so their detection and recognition is a constrained problem. In addition, the design of signs can differ between countries, and in certain cases these differences are significant.

CNN for Object Detection
There have been several classic object recognition networks in the last few years [5], for instance AlexNet [6] (2012), VGG [7] (2014), GoogLeNet [8] (2015-2016), ResNet [9,10] (2016), SqueezeNet [11] (2016), Xception [12] (2016), MobileNet [13] (2017-2018), ShuffleNet [14] (2017-2018), SE-Net [15] (2017), DenseNet [16] (2017), and CondenseNet [17] (2017). Initially, convolutional neural networks were developed and enlarged to achieve greater precision; in recent years, however, networks have grown smaller and more efficient. In highly accurate target sensing tasks, new deep learning algorithms, especially those based on CNN such as You Only Look Once (Yolo) V3, show huge potential [18]. The multiscale and sliding window approach that produces bounding boxes and scores via CNN can be implemented efficiently within a ConvNet [19] and R-CNN [20]. However, R-CNN is expensive in time and memory, as it executes a CNN forward pass for every object proposal without sharing computation. To solve this problem, spatial pyramid pooling networks (SPPnets) [21] were introduced to increase the efficiency of R-CNN through computation sharing. SPPnet calculates feature maps from the entire input image only once and then pools features in arbitrary-size sub-images to generate fixed-length representations for detector training. Although SPPnet eliminates the repeated evaluation of convolutional feature maps, it still requires a multi-stage training pipeline, because the fixed-length feature vectors generated by the SPP layer are passed on to fully-connected layers; the whole process is therefore still slow. Certain techniques, including the single shot multibox detector (SSD) [22] and Yolo [23], perform all the processing in a single fully-convolutional neural network rather than building a pipeline of region proposals and object classification. This leads to a significantly faster object detector.
The one-stage method relies on an end-to-end regression approach. Yolo V3 [24] applies Darknet-53 in place of Darknet-19 as the backbone network and employs multiscale prediction [25].

Spatial Pyramid Pooling (SPP)
In object recognition tasks, spatial pyramid pooling (SPP) [26,27] has been remarkably successful. Despite its simplicity, it is competitive with methods that use more complicated spatial models. In the spatial pyramid interpretation, the image is split into a range of finer grids at each level of the pyramid. It is also commonly known as spatial pyramid matching (SPM) [28], a development of the bag-of-words (BoW) model [29], which is one of the most famous and successful methods in computer vision. SPP had long been an important component of the winning systems in classification [30,31] and detection [32] competitions before the recent ascendance of CNN.
Some benefits of SPP [21] can be explained as follows. First, SPP can produce a fixed-length output regardless of the input dimensions. Second, SPP applies multi-level spatial bins, whereas sliding window pooling employs just a single window size. Next, SPP allows us not only to test on images of arbitrary size but also to feed images of different sizes and scales during training; training with variable-size images raises scale invariance and decreases overfitting. In addition, SPP is extremely effective in object detection: in the foremost object detection method R-CNN, the features from candidate windows are obtained through deep convolutional networks, and SPP can combine features derived at variable scales thanks to its flexibility with respect to input scales. Convolutional layers accept arbitrary input sizes, but they generate outputs of variable sizes, while softmax classifiers and fully-connected layers require fixed-length vectors. Such vectors can be generated by the BoW approach [29], which pools the features together. SPP improves on BoW in that it preserves spatial information by pooling in local spatial bins. The spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size; by contrast, in the sliding window pooling of earlier deep networks, the number of windows depends on the input size. Hence, to run the deep network on images of arbitrary sizes, the last pooling layer is replaced with an SPP layer. In each spatial bin, we pool the responses of each filter using max pooling. The outputs of spatial pyramid pooling are kM-dimensional vectors, where M is the number of bins and k is the number of filters in the last convolutional layer. These fixed-dimensional vectors are the input to the fully-connected layer.
By using SPP, the input image can vary in size, which allows not only arbitrary aspect ratios but also arbitrary scales. The input image can be resized to any scale and fed to the same deep network. When the input image is at diverse scales, the network (with identical filter sizes) will extract features at various sizes and scales. A network structure with an SPP layer can be seen in Figure 1. In our work, the SPP block layer is inserted into the Yolo V3, Resnet 50, Densenet, and Tiny Yolo V3 configuration files. Moreover, we use the same SPP block layer in each configuration file with a spatial model. The spatial model uses down sampling in convolutional layers to receive the important features in the max-pooling layers, applying three different max-pool sizes to each image.
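The fixed-length property described above can be illustrated with a minimal NumPy sketch. The pyramid levels (1, 2, 4) below are illustrative assumptions, not the exact configuration of our models: whatever the spatial size of the input feature map, the output vector has length k × M, with M the total number of bins.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (k, H, W) feature map into a fixed-length k*M vector,
    where M = sum(n*n for n in levels). Bin edges scale with H and W,
    so the output length is independent of the input size."""
    k, H, W = feature_map.shape
    pooled = []
    for n in levels:
        # bin boundaries proportional to the feature map size
        hs = np.linspace(0, H, n + 1).astype(int)
        ws = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                region = feature_map[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                pooled.append(region.max(axis=(1, 2)))  # max over each bin
    return np.concatenate(pooled)
```

Feature maps of different spatial sizes (e.g., 13 × 13 and 26 × 19) both yield a vector of length k × (1 + 4 + 16), which is exactly what allows the following fully-connected layer to accept variable-size inputs.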

Object Detection Architecture
The principal features of each architecture (Yolo V3, Densenet, Resnet 50, Tiny Yolo V3) are summarized in this section.

Yolo V3 and Tiny Yolo V3
Yolo V3 was proposed by [24] in 2018. It splits the input image into S × S grid cells [33] of the same size and forecasts bounding boxes and probabilities for each grid cell. Yolo V3 uses multiscale fusion to make predictions and uses a single neural network to process the complete image. Dimension clusters are applied as prior boxes to predict boundary boxes: the k-means method is adopted to carry out dimensional clustering on the target boxes in the dataset and obtain nine prior boxes of various sizes, which are evenly spread over feature graphs of various scales. Further, Yolo V3 assigns an individual bounding box anchor to each ground truth object [34]. If the center point of the object's ground truth falls inside a specific grid cell, that grid cell is responsible for recognizing the object. Figure 2 describes the bounding boxes with the prior dimension and location prediction. As shown in Figure 2, bx, by, bw, and bh are the x, y center coordinates, width, and height of our prediction; tx, ty, tw, and th are the network outputs; cx and cy are the top-left coordinates of the grid cell; and pw and ph are the anchor dimensions for the box [23,35]. The Tiny Yolo V3 model is a reduced version of the Yolo V3 model. Yolo V3 applies the Darknet-53 architecture and employs many 1 × 1 and 3 × 3 convolution kernels to extract features. Tiny Yolo V3 is lighter and faster than Yolo while also outperforming other light models in accuracy. Tiny Yolo V3 shrinks the number of convolutional layers, usually to only seven, and its features are derived by a small number of 1 × 1 and 3 × 3 convolutional layers. In addition, Tiny Yolo V3 uses a pooling layer in place of Yolo V3's convolutional layer with a step size of 2 to attain dimensionality reduction. Nevertheless, its convolutional layers still use the same structure (Convolution2D + BatchNormalization + LeakyRelu) as Yolo V3.
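The location prediction in Figure 2 follows the standard Yolo V3 decoding: bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw·exp(tw), bh = ph·exp(th). A minimal sketch of this decoding follows; the function name and the normalization of the center by the grid size are our own illustrative choices.

```python
import math

def decode_box(t, cell_xy, anchor_wh, grid_size):
    """Decode raw network outputs (tx, ty, tw, th) into a predicted box
    (bx, by, bw, bh) following the Yolo V3 equations. The center is
    offset from the grid cell's top-left corner (cx, cy); the width and
    height scale the anchor (pw, ph)."""
    tx, ty, tw, th = t
    cx, cy = cell_xy
    pw, ph = anchor_wh
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = (sigmoid(tx) + cx) / grid_size  # center x, normalized to [0, 1]
    by = (sigmoid(ty) + cy) / grid_size  # center y, normalized to [0, 1]
    bw = pw * math.exp(tw)               # width as a scaled anchor
    bh = ph * math.exp(th)               # height as a scaled anchor
    return bx, by, bw, bh
```

Because the sigmoid confines the offset to (0, 1), the predicted center always stays inside the responsible grid cell, which is why the cell containing the object's center is the one responsible for detecting it.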
The model is trained and the loss value calculated; the loss function used by Tiny Yolo V3 is the same as that of Yolo V3. The loss function essentially consists of the position of the prediction frame (x, y), the prediction frame size (w, h), the class prediction (class), and the confidence prediction (confidence) [36]. Further, Yolo V3 SPP and Tiny Yolo V3 SPP are implemented by incorporating three SPP modules into Yolo V3 and Tiny Yolo V3, in front of the three detection headers between the 5th and 6th convolutional layers [37]. Yolo V3 SPP and Tiny Yolo V3 SPP are designed to further improve the detection accuracy of the baseline models.

Densenet
Densenet has over 40 layers and a higher convergence speed [38]. Further, Densenet needs to consider additional feature channels, including single-level and cross-level dimensions, to reduce the need for feature replication in the network model and enhance feature retrieval [39]. Densenet has appealing benefits: it assists feature reuse and relieves the vanishing gradient problem. Nevertheless, it also has clear limitations: every layer simply combines the feature maps from previous layers by concatenation, without considering the interdependencies between different channels [40]. The Densenet is principally composed of Dense Blocks, Transition Layers, and the Growth Rate [41]. Dense Block [42]: every Densenet consists of N Dense Blocks. In any Dense Block there exist m layers, where each layer is linked feed-forward to all consecutive layers. If x_l denotes the output of the l-th layer, it is calculated using Equation (5): x_l = H_l([x_0, x_1, ..., x_{l-1}]), where H_l is the composite function operated in this layer and [x_0, x_1, ..., x_{l-1}] is the concatenation of the outputs of the individual layers before it. The concatenated features are treated through a combination function composed of BN, Relu, and Convolution (3 × 3).
The transition layer is a layer between dense blocks that reduces the spatial dimension of the feature maps. It consists of a (1 × 1) convolution layer and (2 × 2) average pooling. Growth Rate: the output of each composite function in Equation (5) is a feature map f. The input size of the l-th layer is k_0 + (l − 1) × G, where k_0 is the number of channels of the original input image. To improve parameter efficiency and to control network growth, f is limited to the growth rate G, a small integer value. This variable controls the amount of information stored in each layer.
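The growth-rate arithmetic above can be made concrete with a small helper (hypothetical, for illustration only): each layer in a dense block sees the channels of the block input plus G new channels from every preceding layer.

```python
def dense_block_channels(k0, num_layers, growth_rate):
    """Input channel count seen by each layer l (1-indexed) in a dense
    block: k0 + (l - 1) * G, because layer l concatenates the block
    input with the growth_rate-channel outputs of all l-1 earlier layers."""
    return [k0 + (l - 1) * growth_rate for l in range(1, num_layers + 1)]
```

For example, with a 64-channel block input and G = 32, the first four layers receive 64, 96, 128, and 160 input channels, showing how the channel count grows linearly rather than replicating features.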

Resnet 50
Residual Networks (Resnet) [9] are deep convolutional networks whose basic idea is to skip blocks of convolutional layers by using shortcut connections. Resnet is characterized by very deep networks, ranging from 34 to 152 layers [43,44]. This architecture, shown in Figure 3, was developed by researchers at Microsoft and won the ILSVRC 2015 classification task [45]. In the Resnet model, a residual network structure is implemented: by using it, the deep CNN model not only avoids model degradation but also achieves better efficiency. Resnet uses skip connections to make convergence more rapid, so even the much deeper variants of Resnet can be trained more quickly than previous networks. This model also uses batch normalization to avoid overfitting [46]. These feature extractors are built with four residual blocks: based on the original paper, the first three blocks (namely conv2_x, conv3_x, and conv4_x) extract Region Proposal Network (RPN) features, while the final layer of conv4_x is applied for predicting region proposals. Moreover, box classifier features are obtained from the last layer of the fourth residual block (conv5_x) [47,48].
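The shortcut-connection idea can be sketched in a toy form (not the full conv-BN-ReLU block): the block's output is F(x) + x, so when the learned residual F is near zero the block reduces to the identity, which is what eases the optimization of very deep networks.

```python
import numpy as np

def residual_block(x, transform):
    """Identity shortcut connection: the block learns a residual F(x)
    and outputs F(x) + x. Gradients flow through the '+ x' path even
    when the transform path contributes little."""
    return transform(x) + x
```

With a zero residual (transform returning all zeros), the input passes through unchanged, illustrating why adding more residual blocks cannot easily degrade an already-trained shallower network.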

Methods
In this section, we explain our proposed methodology to recognize Taiwan's prohibitory signs using spatial pyramid pooling combined with Yolo V3. Figure 4 illustrates our Yolo V3 SPP architecture, and Algorithm 1 explains the Yolo V3 SPP recognition process; its final steps are as follows. 5. Compare the optimal confidence of the K bounding boxes with the threshold. 6. If the confidence exceeds the threshold, the bounding box contains the object; otherwise, the bounding box does not contain the object. 7. Select the category with the greatest predicted probability as the object category. 8. Apply non-maximum suppression (NMS) to conduct a maximum local search that removes redundant boxes. 9. Present the object detection result.
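The NMS step of Algorithm 1 can be sketched as follows. The greedy keep-and-suppress scheme is standard; the iou_thresh value of 0.45 and the box format (xmin, ymin, xmax, ymax) are illustrative assumptions, not the exact settings of our pipeline.

```python
def iou(a, b):
    """Intersection over union of two boxes in (xmin, ymin, xmax, ymax) form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop any remaining box overlapping it above iou_thresh, repeat."""
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

Two heavily overlapping detections of the same sign thus collapse to the single most confident box, while detections of distinct signs survive.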
Taiwan's prohibitory sign image classes P1, P2, P3, and P4 were used as input to the object detection process. The algorithm proceeds in several phases: (1) the detected targets are bounded by bounding boxes; (2) the objects in the image class are associated, and the same target is given the same mark in each image; (3) the same image gives the same target a uniform label; (4) NMS is used to perform a maximum local search to compress redundant boxes, and then the results of object detection are displayed. In our work, Yolo V3 with a spatial model uses down sampling in convolutional layers to receive the important features in the max-pooling layers, applying three different max-pool sizes to each image. The Yolo V3 SPP model is performed in one phase for detecting and recognizing Taiwan's prohibitory signs. This work used the BBox label tool [49] to generate a bounding box for each sign (no entry, no stopping, no parking, speed limit). Further, the labeling process is executed for all class labels P1, P2, P3, and P4. One image can have more than one bounding box, which means one image can have more than one label. In the detection phase, a single-class detector model was used, with one class label belonging to one training model; hence, our experiment uses four training models. Additionally, object coordinates in corner form are the return values of the bounding box labeling tool. These coordinates differ from the Yolo input values, which are the center point of the object and its width and height (x, y, w, h). Therefore, the system must convert the bounding box coordinates into the Yolo input format. The conversion process uses Equations (6)-(11) [50].
where H is the height of the image, h is the absolute height of the object, W is the width of the image, and w is the absolute width of the object. The resulting float values are relative to the width and height of the image and range from 0.0 to 1.0.
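The conversion of Equations (6)-(11) from corner coordinates to Yolo's relative center format can be sketched as follows; the function name is ours, and the arithmetic is the standard corner-to-center conversion implied by the text.

```python
def to_yolo_format(box, img_w, img_h):
    """Convert corner coordinates (xmin, ymin, xmax, ymax) from a
    labeling tool into Yolo's (x, y, w, h): the object's center, width,
    and height, each relative to the image size (floats in [0, 1])."""
    xmin, ymin, xmax, ymax = box
    x = (xmin + xmax) / 2.0 / img_w  # relative center x
    y = (ymin + ymax) / 2.0 / img_h  # relative center y
    w = (xmax - xmin) / img_w        # relative width
    h = (ymax - ymin) / img_h        # relative height
    return x, y, w, h
```

For a 400 × 400 image, a box with corners (100, 100) and (300, 200) becomes (0.5, 0.375, 0.5, 0.25), the format the Yolo V3 training algorithm reads.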

Dataset
Since there is no pre-existing dataset for Taiwan's prohibitory signs, we had to build a database and collect the images ourselves. The dataset was split into 70% for training and 30% for testing, and it contains pictures of multiple scenes. This experiment focused on Taiwan's prohibitory signs, comprising 235 no entry images, 250 no stopping images, 185 speed limit images, and 230 no parking images. Table 1 presents Taiwan's prohibitory signs in detail.

Training Result
The training process obtained additional data from the original images through basic geometric transformations such as random rotations, scale shifts, shears, horizontal flips, and vertical flips. These techniques are commonly used to train large neural networks. The experiment performs data augmentation using the following parameter settings: rotation_range = 20, zoom_range = 0.10, width_shift_range = 0.2, height_shift_range = 0.2, and shear_range = 0.15. We manually annotated the traffic signs using a bounding box labeling tool to give a coordinate location for each object to be detected [51]. The results of the tool are the four position coordinates along with the class label. The next step transforms the labels to Yolo format before training using the Yolo Annotation tool [49], which changes the values to a format that can be read by the Yolo V3 training algorithm. The training environment is an Nvidia GTX 1080 Ti GPU accelerator with 11 GB memory, an i7 central processing unit (CPU), and 16 GB DDR2 memory.
The Yolo loss function is as follows [52][53][54]: x̂, ŷ, ŵ, ĥ, Ĉ, and p̂ are the central coordinates, width, height, confidence, and category probability of the predicted bounding box, and the corresponding symbols without the hat are the real labels. B denotes that each grid cell predicts B bounding boxes; the indicator 1_ij^obj denotes that the object falls within the j-th bounding box of the i-th grid cell, while 1_ij^noobj indicates that there is no target in the bounding box.
Further, IouErr is the IoU error. Grid cells that include an object and those without one have different weights: λnoobj = 0.5 is added to reduce the impact of the large number of grid cells without objects on the loss value. The classification error is ClsErr; cross-entropy is used to calculate its loss, and it applies only to grid cells with a target. Moreover, Yolo V3 employs the sigmoid function as the activation function for class prediction, which handles the case where the same target has two labels more effectively than the softmax function [55]. Furthermore, the coordinate error is CoordErr: the cross-entropy loss is used for the center point coordinates, and the variance loss is applied for the width and height. Our experiment sets λcoord to 0.5, meaning that the errors of width and height have less effect in the calculation. The coordinate error is calculated only when the grid cell predicts an object [53]. Figure 5 shows the reliability of the training process using Yolo V3 (a) and Yolo V3 SPP (b); the training loss values are 0.0141 and 0.0125, respectively. Our work uses max_batches = 8000 iterations, policy = steps, scales = 0.1, 0.1, and steps = 6400, 7200. At the beginning of the training process, the system starts with no information and a high learning rate; as the neural network is presented with growing amounts of data, the weights must change less aggressively, so the learning rate needs to decrease over time. In the configuration file, this decrease is accomplished by specifying a stepwise learning rate policy: the learning rate starts at 0.001 and remains constant for 6400 iterations, and it is then multiplied by the scales to obtain the new learning rate.
If scales = 0.1, 0.1 and the current iteration has passed both steps, then current_learning_rate = learning_rate × scales[0] × scales[1] = 0.001 × 0.1 × 0.1 = 0.00001. From Figure 5, we can conclude that Yolo V3 SPP is more stable than Yolo V3 during the training process. The training loss value, mAP, and AP performance for all classes using Tiny Yolo V3 can be seen in Figure 8. Figure 8a shows the reliability of the training process using Tiny Yolo V3: it uses max_batches = 500,200, and the training loss reaches 0.0185 at 84,300 iterations. Tiny Yolo V3 SPP also uses max_batches = 500,200, and its iteration stops at 72,700 with a loss value of 0.0144 (Figure 8b). The training process is unstable, and it takes a long time to train this model. The complete training mAP and AP performance of all models and classes is shown in Table 2, which reports the training loss value, mAP, AP, precision, recall, F1, IoU performance, and calculation time for classes P1, P2, P3, and P4. The samples are split into three types: true positive (TP) samples, the number of samples correctly identified; false positive (FP) samples, the number of samples incorrectly identified; and false negative (FN) samples, the number of samples not identified.
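The stepwise learning rate policy can be sketched as a minimal re-implementation of Darknet's "steps" schedule (shown for illustration; the real trainer applies this internally from the configuration file):

```python
def step_lr(base_lr, iteration, steps, scales):
    """Darknet-style 'steps' policy: each time the current iteration
    passes a step boundary, multiply the learning rate by the matching
    scale. Before the first step, the base rate is used unchanged."""
    lr = base_lr
    for step, scale in zip(steps, scales):
        if iteration >= step:
            lr *= scale
    return lr
```

With base_lr = 0.001, steps = (6400, 7200), and scales = (0.1, 0.1), the rate stays at 0.001 until iteration 6400, drops to 0.0001, and drops again to 0.00001 after iteration 7200, matching the schedule described above.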
The mean average precision (mAP) is the integral over the precision p(o), as shown in Equation (21): mAP = ∫₀¹ p(o) do,
where p(o) is the precision of the object detection. IoU computes the overlap ratio between the predicted bounding box (pred) and the ground truth (gt) [1].
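The IoU between a predicted box and a ground-truth box can be computed as follows; the corner box format (xmin, ymin, xmax, ymax) is an assumption for the sketch.

```python
def iou(pred, gt):
    """Overlap ratio between a predicted box and a ground-truth box:
    intersection area divided by union area, both in corner form."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2, y2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)  # 0 if boxes are disjoint
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1.0, disjoint boxes score 0.0, and partial overlaps fall in between, which is the quantity reported as IoU in Table 2.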
The trend is that SPP increases the mAP and IoU of each model in the experiment: SPP can be combined with any model and will strengthen it. For instance, the worst model in the experiment was Tiny Yolo V3, with an mAP of 82.69% and IoU of 75.29%; SPP improves its performance, so for Tiny Yolo V3 SPP the mAP becomes 84.79% (a rise of 2.1%) and the IoU 79.23% (a rise of 3.94%).

Discussion
In this stage, we use twenty of Taiwan's prohibitory sign images of different sizes and conditions for testing. The accuracy and time measurements of the experiments are presented in Table 3. In general, Yolo V3 SPP exhibits better accuracy than the other models. The highest average accuracy is Yolo V3 SPP at 99%, followed by Yolo V3 at 92%, Densenet SPP at 87%, Densenet at 82%, Resnet 50 SPP at 70%, Resnet 50 at 50%, Tiny Yolo V3 SPP at 50%, and Tiny Yolo V3 at 40%. The trend is that the accuracy of a model combined with SPP increases along with its detection time, which means that the combined model requires more time to detect the sign. For instance, the average detection time of Yolo V3 SPP is 17.6 milliseconds, while Yolo V3 requires 16.7 milliseconds. The longest detection time belongs to Densenet SPP, at around 40 milliseconds, followed by Densenet at 38.3 milliseconds. On the other hand, the fastest model in the experiment is Tiny Yolo V3: it needs 5.4 milliseconds, while Tiny Yolo V3 SPP requires 8 milliseconds to recognize the sign. Thus, SPP affects both accuracy and detection time. In the experiment, the images are tested one by one to show that SPP improves the detection and recognition of traffic signs compared to models without SPP. For example, there are 5 images that cannot be detected using the Resnet 50 model, but Resnet 50 SPP detects all traffic signs in those images properly, as shown in Table 3. Figure 11 shows the detection effectiveness of the different algorithms. It can be seen that the localization accuracy of Yolo V3 SPP (Figure 11b) was higher than the others: Yolo V3 SPP detects both signs in the image. In Figure 11a and Figure 11c-f, the other algorithms failed to detect all class P1 signs in the image, detecting only a single sign.
However, for the last two images in Figure 11g,h, all algorithms exhibited false detections and missed detections.

Conclusions
This paper presents an experimental comparative analysis of eight traffic sign detection models based on deep neural networks. We investigate the principal aspects of these detectors, such as precision, detection time, workspace size, and the number of floating-point operations within the CNN. This paper applies spatial pyramid pooling (SPP) and modifies the backbone networks of Yolo V3, Resnet 50, Densenet, and Tiny Yolo V3. We employ SPP to pool local regions at diverse scales in the same convolutional layer so as to learn multiscale object features in more detail. The mAP comparison of all models shows that Yolo V3 SPP outperforms the other models in the experiment, exhibiting the highest total BFLOPS (65.69) and mAP (98.88%). SPP increases the total BFLOPS by 0.378, from 65.312 to 65.69, making Yolo V3 SPP more robust for detecting signs. The experimental results show that SPP improves the effectiveness of detecting and recognizing Taiwan's prohibitory signs: SPP improves the performance of the backbone networks of Yolo V3, Resnet 50, Densenet, and Tiny Yolo V3. Although SPP requires a longer time, the combined models are better at detecting multiple signs; as shown in Figure 11b, Yolo V3 SPP can detect all signs in the image while the others cannot. Tiny Yolo V3 and Tiny Yolo V3 SPP load fewer layers (24 layers) compared to the others, whereas Densenet SPP contains the most layers (312 layers) and requires a large workspace size (104.86 MB). Regarding detection time, the fastest model in the experiment is Tiny Yolo V3, and the slowest is Densenet SPP.
In future research, we will extend our dataset to all of Taiwan's traffic signs and add experimental data covering multiple scenarios and different weather conditions for training and testing. We will also expand the dataset using generative adversarial networks (GAN) [62][63][64] to obtain better performance and results.