Deep Learning for Traffic Sign Recognition Based on Spatial Pyramid Pooling with Scale Analysis

Abstract: In the area of traffic sign detection (TSD), deep learning has been implemented and achieves outstanding performance. Detecting traffic signs, which serve the dual function of monitoring and directing the driver, is a major concern for driver support systems, and traffic sign identification is a core feature of autonomous vehicle systems. This article focuses on prohibitory signs. The objective is to detect them in real time while considerably reducing processing time. In this study, we apply the spatial pyramid pooling (SPP) principle to boost the feature-extraction backbone network of Yolo V3. Our work uses SPP for more comprehensive learning of multiscale object features. We then perform a comparative investigation of Yolo V3 and Yolo V3 SPP across various scales for recognizing prohibitory signs. The comparison reveals that the mean average precision (mAP) of the Yolo V3 SPP models is higher than that of Yolo V3. Furthermore, the test accuracy findings indicate that the Yolo V3 SPP model performs better than Yolo V3 across different sizes.


Introduction
Traffic sign recognition (TSR) technologies are an essential feature of numerous real-world applications, including Advanced Driver Assistance Systems (ADAS) [1,2], autonomous driving, traffic control, driver welfare, and road network maintenance. Many researchers are currently working on this problem with popular computer vision algorithms [3]. Recent improvements in deep learning [4] have contributed significant advances in target detection [5][6][7] and identification tasks [8][9][10]. Moreover, most studies have centered on creating deep convolutional neural networks (CNNs) to increase precision [11,12].
Traffic signs are designed to be distinct and recognizable, using basic shapes and standardized colors according to country-specific conventions, which constrains the identification and recognition problem. Even so, a method that generalizes to efficient identification is difficult to find [1], and developing a stable real-time TSR system remains a challenge. At test time, latency is critical for decision-making, which depends on the atmosphere and real-life factors such as partial occlusion, multiple views, illumination, and temperature. Every TSR system needs to address these problems well. This research concentrates on the detection and recognition of prohibitory signs in Taiwan, motivated by the absence of a traffic sign detection database or analysis system there. State-of-the-art object detection algorithms include SSD [13,14], Faster R-CNN [15,16], and R-FCN [17].

Traffic Sign Recognition with You Only Look Once (Yolo) V3
The research work in [29] combines Adaboost and Yolo V2 approaches for traffic sign studies. The system uses real traffic signs collected in the center of Kaohsiung, a large city in southern Taiwan. Additional research on traffic signs in Taiwan is presented in [30]. That work tracks traffic signs in video recordings using its proposed program for obtaining traffic sign images, and a CNN validates the precision of the generated dataset.
The authors of [31] focus on stop sign detection and recognition in Taiwan. They conduct experiments with different settings and analyze the importance of anchor calculation using k-means against the original Yolo V3 for Taiwan stop sign detection and recognition. Their experiments proved that anchor recalculation based on the dataset at hand is very important.
Dewi et al. [28] investigate the state of the art of various object detection systems, including Yolo V3, Resnet 50, Densenet, and Tiny Yolo V3, combined with spatial pyramid pooling (SPP). Their research adopts the concept of SPP to improve the backbone networks of Yolo V3, Resnet 50, Densenet, and Tiny Yolo V3. Their experimental findings show that Yolo V3 SPP achieves the best total BFLOPS (65.69) and mAP (98.88%). The highest average accuracy is Yolo V3 SPP at 99%, followed by Densenet SPP at 87%, Resnet 50 SPP at 70%, and Tiny Yolo V3 SPP at 50%. Hence, SPP can improve the performance of all models in the experiment.
Other research studied the various weights produced by the darknet framework, including the best weight, the final weight, and the last weight [31]. The authors conduct and analyze a comparative experiment of Yolo V3 and Yolo V3 SPP with different weights. Experimental results show that the mean average precision (mAP) of Yolo V3 SPP is better than that of the other models.
Based on the previous research work, we found that no study has focused on the significance of the scale parameter of Yolo in the configuration file. Our research therefore concentrates on the importance of the scale parameters in the Yolo V3 and Yolo V3 SPP configuration files.
Yolo was introduced for the first time by Redmon et al. [32,33] in 2016. A single neural network interprets the entire picture. Yolo V3 separates the image into grid cells and provides bounding boxes and probabilities for each grid cell [34]. Yolo V3 makes predictions using multiscale fusion.
Yolo V3 consists of 53 layers with deep characteristics and was built on Darknet-53. Yolo V3 has demonstrated better performance than ResNet-101, ResNet-152, or Darknet-19 [33]. Figure 1 exhibits the construction of Darknet-53. The Yolo V3 algorithm divides the input image into S×S grids. If the central point of the ground truth of an object falls within a given grid cell, that cell is responsible for detecting the target. Each grid cell outputs B predicted bounding boxes, including bounding box location data consisting of the coordinates of the middle point (x, y), the width (w), the height (h), and a confidence prediction. The 416 × 416 input image is integrated across three scales using up-sampling and FPN fusion [35]; the three scales obtained are 13 × 13, 26 × 26, and 52 × 52, respectively [36]. The Yolo loss function of the bounding box consists of four parts [37], given in Equation (1) [38][39][40].
Here, Coord_Err is the loss of the predicted central coordinates, and BBox_Err is the loss of the width and height of the predicted bounding box. Next, Category_Err is the loss of the predicted category, and Conf_Err is the loss of the predicted confidence. The measurement process is shown in Equations (2)-(5).
Moreover, (x_i, y_i) is the position of the predicted bounding box, and (x̂_i, ŷ_i) is the actual position obtained from the training data. w_i and h_i are the width and height of the predicted bounding box, respectively. λ_coord controls the prediction position loss of the prediction box, and λ_noobj controls the no-target loss in a single grid. c_i is the confidence score, and ĉ_i is the confidence of the intersection of the predicted bounding box and the actual box.
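The referenced equations did not survive extraction. As a reconstruction from the symbols defined above, Equations (1)-(5) presumably take the standard Yolo form below, where 1_{ij}^{obj} indicates that the j-th box of grid cell i is responsible for an object and hatted symbols denote ground-truth values; details such as the square roots in Equation (3) are assumptions based on the original Yolo loss rather than taken from this paper.

```latex
% Reconstructed four-part Yolo loss (standard form, hedged; Eq. (1)-(5) lost in extraction)
\begin{align}
Loss &= Coord\_Err + BBox\_Err + Category\_Err + Conf\_Err \tag{1}\\
Coord\_Err &= \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}
  \left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \tag{2}\\
BBox\_Err &= \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}
  \left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \tag{3}\\
Category\_Err &= \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}
  \sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2 \tag{4}\\
Conf\_Err &= \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(c_i-\hat{c}_i)^2
  +\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(c_i-\hat{c}_i)^2 \tag{5}
\end{align}
```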
Further, Yolo V3 employs the sigmoid function as the activation function for prediction. The sigmoid function handles the case efficiently in which the same target has two labels [39,41,42].

Spatial Pyramid Pooling (SPP) Network
Spatial Pyramid Pooling (SPP) [25,26] is one of the most popular approaches in computer vision. SPP, commonly referred to as spatial pyramid matching (SPM), is a development of the Bag-of-Words (BoW) model [43]. SPP [24] was an essential component of leading and competitive classification [44][45][46] and detection [47] schemes before the current rise of CNNs.
Some advantages of SPP are given in [27]. First, SPP provides a fixed-length output regardless of the input dimensions, which the sliding-window pooling of preceding systems cannot do [48]. Second, SPP applies multi-level spatial bins, whereas sliding-window pooling uses only a single window size. Since input dimensions are flexible, SPP can incorporate features extracted at variable scales. Figure 2 shows the network configuration of an SPP network. This work places the SPP block in the configuration file of Yolo V3. In the SPP layer, the outcome of the final convolutional feature maps is divided into spatial bins with sizes proportional to the image, so the number of bins is fixed regardless of the image dimensions.
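To make the fixed-length property concrete, the following is a minimal sketch (our illustration, not the authors' code) of classic SPP pooling over a convolutional feature map, assuming a three-level pyramid of 1×1, 2×2, and 4×4 bins; the function name and level choice are ours.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Classic SPP: max-pool a (C, H, W) feature map into a fixed-length vector.

    Each level n partitions the map into an n x n grid of spatial bins, so the
    output length is C * sum(n * n for n in levels) regardless of H and W.
    """
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges adapt to the input size; the bin *count* stays fixed.
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                region = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(region.max(axis=(1, 2)))  # max pool per channel
    return np.concatenate(pooled)

# Two different input sizes yield the same output length (fixed-length representation).
print(spatial_pyramid_pool(np.random.rand(256, 13, 13)).shape)  # (5376,)
print(spatial_pyramid_pool(np.random.rand(256, 26, 26)).shape)  # (5376,)
```

Note that the Yolo V3 SPP variant used later in this paper instead concatenates stride-1 max pools of different kernel sizes, which preserves the spatial resolution of the feature map rather than flattening it.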

Yolo V3 SPP Architecture
This section outlines the proposed technique for detecting and identifying road signs in Taiwan using Yolo V3 with SPP. Figure 3 describes the Yolo V3 SPP architecture. Object detection with Yolo V3 SPP proceeds as follows. The initial stage divides the input image into S×S grids. Each grid cell generates K bounding boxes according to the calculated anchor boxes. The framework then applies the CNN to extract all object characteristics from the picture and predicts b = [b_x, b_y, b_w, b_h, b_c]^T and class = [class_1, class_2, ..., class_c]^T. Afterward, it compares the maximum confidence IoU^truth_pred of the K bounding boxes with the threshold IoU_thres. If IoU^truth_pred > IoU_thres, the bounding box contains an object; otherwise, it does not. The system then selects the category with the highest predicted probability as the object category. Finally, to perform a local maximum search, suppress redundant boxes, and output and display the object detection results, this experiment employs Non-Maximum Suppression (NMS).
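For reference, a minimal sketch of the NMS step described above (our illustration, not the authors' implementation), assuming boxes in (x1, y1, x2, y2) form with per-box confidence scores and an IoU threshold corresponding to IoU_thres:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_max_suppression(boxes, scores, iou_thres=0.45):
    """Greedy NMS: keep the highest-scoring box, drop overlapping boxes, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thres]
    return keep  # indices of the retained boxes
```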
In this research, Yolo V3 SPP uses convolutional-layer sampling to obtain the best possible functionality from the max-pool layers. Yolo V3 SPP applies three scales of max pooling to every image using [route] layers. The layer offsets -2, -4 and -1, -3, -5, -6 relative to conv 5 are used in the [route] entries. Moreover, conv 5 is the final convolutional layer, and 256 is the conv 5 layer filter number. The feature maps created this way, called fixed-length representations, are then collected (see Figure 2). This experiment compares the performance of Yolo V3 and Yolo V3 SPP at different scales. The SoftMax classification layers and the bounding box regression are initialized from zero-mean Gaussian distributions with standard deviations of 0.01 and 0.001, respectively. The global learning rate is 0.001, the momentum is 0.9, and the weight decay is 0.0005. The learning rate parameter determines how vigorously the latest batch of data is used for learning.
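For concreteness, the SPP block in the publicly distributed darknet yolov3-spp.cfg file matches the [route] offsets quoted above; we reproduce it here for reference (this is the standard file, not necessarily byte-identical to the authors' configuration):

```
[maxpool]
stride=1
size=5

[route]
layers=-2

[maxpool]
stride=1
size=9

[route]
layers=-4

[maxpool]
stride=1
size=13

[route]
layers=-1,-3,-5,-6
```

The three stride-1 max pools (kernel sizes 5, 9, and 13) keep the 13 × 13 spatial resolution, and the final [route] concatenates their outputs with the original conv 5 features.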
The work arranges six models: Yolo V3 1, Yolo V3 2, Yolo V3 3, Yolo V3 SPP 1, Yolo V3 SPP 2, and Yolo V3 SPP 3, using the different scales (0.1, 0.1), (0.2, 0.2), and (0.3, 0.3) for Yolo V3 and Yolo V3 SPP, respectively. An n-class object detector should run training for at least 2000×n batches. In this experiment, the four classes give 8000 iterations for max_batches, meaning training proceeds until 8000 iterations. For example, with scales = 0.1, 0.1, a base learning rate of 0.001, and a current iteration number of 10,000 (past both steps), the system calculates the current learning rate = learning rate × scales[0] × scales[1] = 0.001 × 0.1 × 0.1 = 0.00001.
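A minimal sketch of this stepped schedule (our illustration of the policy=steps behavior as described in the text, not darknet's actual code), with the step points 6400 and 7200 used later in the training section:

```python
def current_learning_rate(iteration, base_lr=0.001,
                          steps=(6400, 7200), scales=(0.1, 0.1)):
    """Stepped schedule: multiply the base rate by scales[k] once steps[k] is passed."""
    lr = base_lr
    for step, scale in zip(steps, scales):
        if iteration >= step:
            lr *= scale
    return lr

print(current_learning_rate(1000))   # ~0.001  (before any step)
print(current_learning_rate(7000))   # ~0.0001 (after the first step)
print(current_learning_rate(10000))  # ~1e-05  (after both steps, as in the example)
```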

Prohibitory Sign and Object Detection
The Yolo V3 system is used to detect and identify prohibitory signs in Taiwan in one step. The system starts by creating a bounding box for each sign with the BBox label tool for training [49]. The labeling is done with four types of marks, and an image can host more than one bounding box. In this stage, a one-class detector model is used, where each symbol is trained as a single model. Object coordinates in the form (x_1, y_1, x_2, y_2) are the output value of the bounding box marking tool.
This output is not in the Yolo object coordinate format. Yolo's input value is the central point plus the width and height of the object (x, y, w, h). Therefore, the system must transform the bounding box coordinates into Yolo's input format. The conversion process is shown in Equations (6)-(9).
Further, w is the image width, dw is the absolute image width, h is the image height, and dh is the total image height. The float coordinate values produced using (dw, dh) fall in the range 0.0 to 1.0.
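Equations (6)-(9) did not survive extraction. The following reconstruction matches the widely used darknet conversion script (voc_label.py), under the assumption that dw and dh act as the normalizing factors 1/w and 1/h, which is what yields coordinates in [0.0, 1.0]:

```latex
% Reconstructed (x1, y1, x2, y2) -> (x, y, w, h) conversion, assuming dw = 1/w, dh = 1/h
\begin{align}
x &= dw \cdot \frac{x_1 + x_2}{2} \tag{6}\\
y &= dh \cdot \frac{y_1 + y_2}{2} \tag{7}\\
w_{obj} &= dw \cdot (x_2 - x_1) \tag{8}\\
h_{obj} &= dh \cdot (y_2 - y_1) \tag{9}
\end{align}
```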

Dataset
In this work, we collected and processed traffic sign images manually from CarMax dashboard camera footage recorded while driving around Taichung City on a sunny day and at night. The camera images from which the traffic sign images are extracted have a resolution of 1920 × 1080 pixels. We also used an Oppo F5 mobile phone camera to collect traffic sign images with a resolution of 1080 × 2160 pixels. The traffic sign images are cropped and annotated before being used for training. The concentration is on prohibition signs, comprising 235 no entry images, 250 no stopping images, 185 speed limit images, and 230 no parking images. The data collection is separated into 70 percent for training and 30 percent for testing [28]. The 900 images used in this work are summarized in Table 1.

Training Results
Data augmentation is a significant element in the advancement of deep learning models. While data augmentation has been shown to enhance image classification significantly, it has not been extensively studied for object detection [50]. It is nevertheless a popular method widely employed to improve the training of CNNs. The system applies pre-processing steps, including data augmentation, in the training stage. During data augmentation, the system performs several operations, such as rotation with a probability of 0.5 and a maximum rotation of 20 degrees for each image; the zoom range is 10 percent, and the width and height shift ranges are 0.2. Further, the traffic signs are labeled with a bounding box labeling tool [49] that assigns a coordinate position to each object. The output of the tool is the class mark and the four points of the position coordinate system. Before training, the system transforms each label into the Yolo label format. This work applies the Yolo Annotation framework in the Python programming language [51] to convert the values into a format the Yolo V3 training algorithm can read.
The research experiment is carried out on a computer with a Python environment, an Nvidia RTX2080Ti GPU (11 GB memory), and an i7 CPU with 16 GB DDR2 memory. Figure 4 represents the reliability of the training process using Yolo V3 1 (a) and Yolo V3 SPP 1 (b). The work uses 8000 iterations with policy = steps and steps = 6400, 7200. Since training starts with zero knowledge, the learning rate must be high at the beginning of the training phase. However, as the volume of data seen by the neural network grows, the weights should adjust less vigorously, so the learning rate must be lowered over time. This reduction is specified in the configuration file as a step-by-step decrease: the learning rate begins at 0.001, stays constant for 6400 iterations, and is then multiplied by the scale factors to obtain the new learning rate. Figure 4 shows that Yolo V3 SPP 1 is more stable than Yolo V3 1 throughout the training process. The detailed training results are given in Table 2.
Table 2 displays the training loss, mAP, and AP results for all classes after 8000 training cycles. The average training validation loss is about 0.013 for all models; therefore, the training models identify objects extremely reliably. After 7200 iterations, the training models converge and stay consistent for the rest of the training. The validation loss is 0.0141 for Yolo V3 1, 0.015 for Yolo V3 2, 0.0129 for Yolo V3 3, 0.0125 for Yolo V3 SPP 1, 0.0144 for Yolo V3 SPP 2, and 0.0133 for Yolo V3 SPP 3. The mean average precision (mAP) is the average over the precision p(o), as given by Equation (10) [52,53].
Furthermore, p(o) is the precision of Taiwan prohibitory sign detection. Precision and Recall are given by Equations (11) and (12) [40,54]. Moreover, TP represents true positives, FP is a misclassified positive sample, and FN is a misclassified negative sample. The IoU value measures the overlap between the detection result and the ground truth [55]; this projection ratio is shown in Equation (13) [1,56,57].
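Equations (10)-(13) were also lost in extraction. Their standard forms, consistent with the definitions above, are given below; the averaging form in Equation (10) is our assumption based on the text's statement that mAP is averaged over p(o), with the sum running over the N object classes, and B_pred, B_gt denoting the predicted and ground-truth boxes.

```latex
% Standard forms of the evaluation metrics referenced as Eq. (10)-(13)
\begin{align}
mAP &= \frac{1}{N}\sum_{o=1}^{N} p(o) \tag{10}\\
Precision &= \frac{TP}{TP + FP} \tag{11}\\
Recall &= \frac{TP}{TP + FN} \tag{12}\\
IoU &= \frac{area(B_{pred} \cap B_{gt})}{area(B_{pred} \cup B_{gt})} \tag{13}
\end{align}
```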
In Table 2, Yolo V3 SPP (98.88%, 99.12%, 98.93%) obtains a higher maximum mAP than Yolo V3 (98.73%, 98.84%, 98.49%). Furthermore, Yolo V3 loads 107 layers during the mAP calculation with 65.312 BFLOPS, while Yolo V3 SPP loads 114 layers with 65.69 BFLOPS. SPP enhances the overall BFLOPS by 0.378, making Yolo V3 SPP more stable and precise. Table 3 demonstrates the test accuracy for the prohibition signs in Taiwan. In comparison, Class P2 displays the highest mean precision accuracy at around 96.29%, followed by Class P1 at 92.45%, Class P4 at 91.69%, and Class P3 at 90.70%. Yolo V3 SPP 3 obtained the highest accuracy of any of the models tested, around 95.53%, followed by Yolo V3 SPP 1 at 93.59%. Furthermore, Class P2 has the highest number of training images among the classes, amounting to 250, so the accuracy result for this class is the highest.
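To summarize the training setup used throughout this section, a darknet [net] block consistent with the values stated above would look roughly as follows. This is a sketch assembled from the text, not the authors' published configuration file; the batch size, for instance, is not stated in the paper.

```
[net]
# hyperparameters quoted in the text
learning_rate=0.001
momentum=0.9
decay=0.0005
# 2000 x 4 classes
max_batches=8000
policy=steps
steps=6400,7200
# the scale pair (0.1,0.1), (0.2,0.2), or (0.3,0.3) distinguishes the six models
scales=.1,.1
```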

Testing Results
In this section, the experiments use twenty random prohibitory sign images of varying sizes and environments for model checking. The accuracy and time measurements of the experiments are presented in Table 4.
Generally, Yolo V3 SPP demonstrates higher precision than Yolo V3. The leading average accuracy is Yolo V3 SPP 1 at 99.1%, followed by Yolo V3 SPP 3 at 93.33%. The trend is that the accuracy of Yolo V3 SPP grows along with the detection time, indicating that Yolo V3 SPP needs more time to detect the sign: for Yolo V3 SPP 1, the average detection time is 0.458 s, whereas Yolo V3 1 needs 0.448 s. Further, a different scale affects the learning rate and detection time. These results indicate that if the system uses a larger scale value, detection becomes faster, but accuracy decreases compared to the original scale in Yolo V3; the accuracy similarly decreases in Yolo V3 SPP when adopting a different scale. The experimental results thus show that Yolo V3 SPP is more robust than Yolo V3. In this work, we use three different scales and provide a deep analysis of Yolo V3 and Yolo V3 SPP. Based on these experimental results, we can summarize as follows: (1) if the system needs the highest accuracy, use the original scale = 0.1, 0.1; (2) use scale = 0.3, 0.3 to make detection faster.
The previous research [28,31,52] focused only on the basic configuration of Yolo V3, using the best weight provided by darknet with scale = 0.1, 0.1; those works optimize for accuracy but not for detection time. SPP contains more layers than the original method, which is why SPP needs more processing time. In our research, we provide a way to reduce detection time by increasing the scale parameter in the Yolo V3 and Yolo V3 SPP configuration files. Our research proves that with scale = 0.3, 0.3 the detection time is faster than with scale = 0.1, 0.1.
Sub-sampling and max pooling have significant benefits. Convolutional sub-sampling can be more strongly reversed in subsequent sampling layers, while max pooling helps remove some high-frequency noise from the target image by choosing only the maximum values from adjacent areas. By merging them, SPP appears to exploit both benefits to improve Yolo V3's backbone network.
Furthermore, Figure 5a-c gives the test results for the Yolo V3 model, with an average accuracy of around 95.92% and a detection time of 0.4415 s. Figure 5d-f shows the test results for Yolo V3 SPP on the same images; the average accuracy is 97.27%, and the detection time is 0.4548 s. The system identifies the prohibitory sign class P3 well. In Figure 6a-c, Yolo V3 failed to detect all class P1 signs in the image, detecting only a single sign. However, Yolo V3 SPP 1 detects three signs well in Figure 6d, and Yolo V3 SPP 2 detects two signs in Figure 6e,f.


Conclusions
This article draws on SPP and transforms the network structure of Yolo V3. The research uses SPP to pool local regions at different scales from the same convolutional layer, learning multiscale feature characteristics. The experimental findings indicate that SPP increases the performance of prohibitory sign detection and recognition in Taiwan. With a different scale, accuracy decreases compared to the original scale in Yolo V3, whereas accuracy increases when Yolo V3 SPP adopts different scales. Moreover, comparing the mAP of all models reveals that Yolo V3 SPP outperforms Yolo V3, and the mAP findings show that the Yolo V3 SPP model performs better than Yolo V3 over various scales. Further, the scale affects the learning rate and detection time: if we use a large scale value, the detection time decreases, but the accuracy falls. We can conclude from the experimental results that (1) the system can apply the original scale = 0.1, 0.1 if we want the best precision, and (2) scale = 0.3, 0.3 can be used if we want faster detection.
In future studies, we will enlarge the dataset focus from Taiwan prohibitory signs to all Taiwan traffic signs under different conditions, including occlusion, multiple views, illumination, color variation, and multiple weather conditions such as heavy rain and snow. Future studies can also extend the dataset with a generative adversarial network (GAN) to create synthetic images and obtain better results. Furthermore, we will test different scales and learning rates in the Yolo V3 SPP configuration file and in the newest Yolo V4.