# Traffic Sign Recognition based on Synthesised Training Data

## Abstract

## 1. Introduction

## 2. Related Work

#### 2.1. Early Methods

#### 2.2. Deep Learning Methods

## 3. Generation of Synthetic Training Data

#### 3.1. Resources

- Warning signs, characterised by a triangular shape with a red border.
- Regulatory signs, which are usually circular and vary in colour (e.g., direction indicators or vehicle restrictions).
- Speed limits, which are circular with a red border (for maximum speed) or blue (for minimum speed).

#### 3.2. Generating Novel Training Images from Templates

#### 3.3. Dataset Normalisation

## 4. Classification Scheme

#### 4.1. CNNs and Three-Dimensional Image Depth

#### 4.2. Utilising Exponential Linear Units

#### 4.3. Normalisation

#### 4.4. Sample-Based Discretisation

- The size of the filter. Most pooling filters are of size 2 × 2 or 3 × 3; a kernel of this size traverses the entire image matrix. At each position, the system computes and outputs either the mean or the max of the values under the kernel, depending on the pooling type.
- The stride of the kernel, which defines the step used while passing over the image. A larger stride results in a smaller output, since fewer pixels overlap between kernel steps. For example, a 2 × 2 filter with a stride of two covers non-overlapping regions, so each input pixel contributes to exactly one value of the down-scaled output.
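
The effect of the two parameters above can be illustrated with a small sketch. It is written in plain Python for readability and is not the implementation used in this work; the function names `pooled_size` and `pool2d` and the 4 × 4 example matrix are illustrative assumptions.

```python
def pooled_size(n, k, stride):
    """Output length along one axis for a window of size k (no padding)."""
    return (n - k) // stride + 1


def pool2d(image, k=2, stride=2, mode="max"):
    """Pool a 2D list of numbers with a k x k window and the given stride."""
    reduce_window = max if mode == "max" else (lambda v: sum(v) / len(v))
    rows = pooled_size(len(image), k, stride)
    cols = pooled_size(len(image[0]), k, stride)
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Collect the k * k values currently under the kernel.
            window = [
                image[r * stride + i][c * stride + j]
                for i in range(k)
                for j in range(k)
            ]
            row.append(reduce_window(window))
        out.append(row)
    return out


# Hypothetical 4 x 4 single-channel image.
image = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]

non_overlapping = pool2d(image, k=2, stride=2)  # 2 x 2 output
overlapping = pool2d(image, k=2, stride=1)      # 3 x 3 output
print(non_overlapping)  # [[6, 5], [7, 9]]
print(overlapping)      # [[6, 6, 5], [7, 9, 9], [7, 9, 9]]
```

With a stride of two, the 2 × 2 windows tile the image without overlap and the output is half the input size along each axis; with a stride of one, neighbouring windows share pixels and the output shrinks by only one row and one column.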

#### 4.5. Regularisation

## 5. Experimental Results

#### 5.1. Implementation Details

#### 5.2. Classification Results

## 6. Conclusions and Future Work

## Author Contributions

## Conflicts of Interest

Sample Availability: A sample of the code and the results obtained can be found at https://github.com/alexandrosstergiou/Traffic-Sign-Recognition-basd-on-Synthesised-Training-Data.

**Figure 1.** Examples of traffic sign distortions from the validation set created. From left to right: an obstacle in the image, motion blur, colour fading, lighting variations and shade from leaves and environment conditions in combination with low lighting levels.

**Figure 2.** List of all traffic sign classes that were used. These data can be changed; however, for the purposes of this work, the possible classes will be the 50 displayed.

**Figure 3.** Differences in brightness produced by using different values and methods to compare the foreground and background variations. The top row shows the template image, one of the twenty affine transforms (Number 15) and the two new images produced by comparing the average and root mean square values of the background and the sign, using the 1D approach. The bottom row shows the two examples based on the perceived brightness of the foreground and background. A further range of examples from these methods is also included.

**Figure 4.** Examples of synthetically created images from the training dataset, SGTD (top row), and actual photos of real-life traffic signs from the validation set (bottom row).

**Figure 5.** Model structure used. The first two convolutional layers, $\mathrm{Conv}_{1}$ and $\mathrm{Conv}_{2}$, are based on 32 kernels of size 3 × 3. Both convolution layers also include zero padding during the creation of their activation maps; thus, the output volume remains the same spatially. Before the next pair of convolutions, a MaxPooling operation is performed that decreases the spatial dimensions of the activation maps to half of the original. The next two layers, namely $\mathrm{Conv}_{3}$ and $\mathrm{Conv}_{4}$, use zero padding and the same size kernels as before, with their number increased to 64. MaxPooling is also applied after the middle two layers. The last two layers, $\mathrm{Conv}_{5}$ and $\mathrm{Conv}_{6}$, follow the same architecture as the previous layers with 128 filters. The activation maps are then flattened to a vector of length 4608 and passed to the hidden layer, which is composed of 512 neurones. The probabilities are finally passed to the output layer containing the 50 labels.
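
The flattened length of 4608 with 128 filters in the last pair implies a 6 × 6 spatial map, since 128 × 6 × 6 = 4608. The input resolution is not stated in this caption, so the following sketch only traces tensor shapes under two hypothetical configurations that both reproduce that length: a 24 × 24 input with the two pooling steps described, or a 48 × 48 input with an additional pooling step after the last pair. Both input sizes, the three-channel input and the helper names are assumptions for illustration.

```python
def conv_same(shape, filters):
    """3 x 3 convolution with zero padding: spatial size is unchanged."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, factor=2):
    """Max pooling that halves each spatial dimension."""
    height, width, channels = shape
    return (height // factor, width // factor, channels)


def trace(input_shape, pool_after_last_pair):
    """Return the final activation shape and its flattened length."""
    shape = input_shape
    pairs = [32, 64, 128]  # filters per pair of convolutions, as in the caption
    for index, filters in enumerate(pairs):
        shape = conv_same(shape, filters)
        shape = conv_same(shape, filters)
        is_last = index == len(pairs) - 1
        if not is_last or pool_after_last_pair:
            shape = max_pool(shape)
    height, width, channels = shape
    return shape, height * width * channels


# Assumption A: 24 x 24 RGB input, pooling only after the first two pairs.
shape_a, flat_a = trace((24, 24, 3), pool_after_last_pair=False)
# Assumption B: 48 x 48 RGB input, with a third pooling step after the last pair.
shape_b, flat_b = trace((48, 48, 3), pool_after_last_pair=True)

print(shape_a, flat_a)  # (6, 6, 128) 4608
print(shape_b, flat_b)  # (6, 6, 128) 4608
```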

**Figure 6.** An example of what is perceived by the CNN at each layer. The image used is from the validation set with a Gaussian filter applied.

**Figure 7.** Accuracy rates with different activation functions and the improvement in results when introducing batch normalisation. The top right shows the epochs of a model with leaky ReLU, while the one to its left is trained with PReLU. The bottom two models use ELU as a rectifier, with the difference that the bottom left also includes batch normalisation and shows a boost in rates as the model generalises more efficiently.

**Figure 8.** Accuracy rate achieved with a set of 170,000 training examples (3400 per class), while the classifier was validated on a set of 2500 real-world road sign examples.

**Table 1.** Accuracy rates (%) of different methods used for traffic sign classification. MCDNN, Multi-Column Deep Neural Networks.

Method | Accuracy |
---|---|
Spatial transform/inception [16] | 99.98% |
MCDNN [12] | 99.46% |
Human observations (best) | 99.22% |
Human observations (average) | 98.84% |
Multi-scale CNN [3] | 98.31% |
Random forests [10] | 96.14% |
LDA and HOG [4] | 95.68% |

**Table 2.** Accuracy and kappa statistic achieved using synthesised training data for different models, with their respective number of epochs. ELU, Exponential Linear Unit; BN, batch normalisation.

Classifier Type | Accuracy Rates (%) | Kappa Statistic |
---|---|---|
CNN w/ Leaky ReLU, 0 epochs | 87.88% | 0.8788 |
CNN w/ PReLU, 50 epochs | 87.03% | 0.8703 |
CNN w/ ELU, 50 epochs | 87.88% | 0.8788 |
CNN w/ ELU and BN, 50 epochs | 92.20% | 0.9219 |
CNN w/ ELU and BN, 100 epochs, new data | 91.84% | 0.9183 |
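
For readers unfamiliar with the kappa statistic reported in Table 2, the following is a minimal sketch of how Cohen's kappa is commonly computed from a confusion matrix, as the observed agreement corrected for the agreement expected by chance. The three-class confusion matrix is made up for illustration and is unrelated to the results in this work; the function name `cohen_kappa` is likewise an assumption.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of the row and column marginals per class.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Made-up confusion matrix for three classes with 50 samples each.
confusion = [
    [45, 3, 2],
    [4, 44, 2],
    [1, 5, 44],
]

kappa = cohen_kappa(confusion)
print(round(kappa, 2))  # 0.83
```

In this toy example, the accuracy is 133/150 (about 0.887) and the chance agreement is 1/3, giving a kappa of 0.83.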

