In this section, we discuss the sign templates in detail and how they are used to produce synthetic training images under a range of viewing and illumination conditions.
3.1. Resources
For the purposes of this work, the basic templates were acquired from the YouGov website [19]. From these, the fifty most significant and common classes were selected, as shown in Figure 2. The three main categories are:
Warning signs, characterised by their triangular shape and red borders.
Regulatory signs, which are usually circular, with varying colours (direction indicators or vehicle restrictions).
Speed limit signs, which are circular with a red border (maximum speed) or blue (minimum speed).
The second set of resources is composed of 1000 images of British roads (and a small portion of general examples from other countries), which represent the background. There were 500 examples of urban areas and roads, while the other half of the background samples depicted rural environments. This separation was made to account for the different environments in which a traffic sign can be spotted. Additionally, some sign classes are more likely to be observed in rural areas, such as wildlife warning signs, whereas intersection warnings are more typical of urban environments. In our implementation, signs were generated for all 1000 background images without taking into account the probability that a class may be better suited to a specific scenario.
3.2. Generating Novel Training Images from Templates
The initial step of our pipeline is the segmentation of the traffic sign from its background. The equation used to extract the sign is formally written in Equation (2) for the left border and Equation (3) for the right, which are applied to each row of pixels (y_i) in the image. The threshold values were chosen empirically, from tests and experiments aimed at minimising the number of non-foreground pixels obtained during the segmentation of the sign. Programmatically, a pixel is considered one-dimensional (grey) if every pairwise difference between its channels is lower than 15. For many colours this assumed difference would be too great; however, because signs are assigned bright colours, so that they remain distinguishable in a plethora of environments, the assumption of a large channel difference yields satisfying results. The same approach is also applied to the second part of the process, where each pixel is compared with its neighbour to find where a large variation exists: a strong difference in one or more channels between a pixel and its neighbour is a good indication of a border pixel of the sign.
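As a concrete illustration, the following sketch shows how the grey-pixel test and the row-wise border scan described above could be implemented. It assumes NumPy RGB arrays; the function names and the reuse of 15 as the neighbour-comparison threshold are our own illustrative choices, not the exact formulation of Equations (2) and (3).

```python
import numpy as np

GREY_THRESHOLD = 15  # maximum pairwise channel difference for a "grey" pixel

def is_grey(pixel):
    """A pixel counts as one-dimensional (grey) when every pairwise
    difference between its R, G and B values is below the threshold."""
    r, g, b = (int(c) for c in pixel[:3])
    return max(abs(r - g), abs(g - b), abs(r - b)) < GREY_THRESHOLD

def row_borders(row, edge_threshold=15):
    """Scan one row of pixels (y_i) from both ends; the first position
    where a pixel differs strongly from its neighbour in at least one
    channel is taken as the left/right border of the sign."""
    left = right = None
    for x in range(1, row.shape[0]):
        if np.any(np.abs(row[x].astype(int) - row[x - 1].astype(int)) > edge_threshold):
            left = x
            break
    for x in range(row.shape[0] - 2, -1, -1):
        if np.any(np.abs(row[x].astype(int) - row[x + 1].astype(int)) > edge_threshold):
            right = x
            break
    return left, right
```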
Once the template is stored in RGBA format, an affine transformation is applied [20]. This method is used to represent both changes in rotation and different scales of the sign. The affine transformation is a standard technique for reproducing the distortions created by non-ideal camera angles focused on specific objects [21]. Geometric distortions are an essential aspect of real examples, as the position of the camera is very likely to produce perspective irregularities while altering the entire scene. Its three main uses in this system are the scaling, translation and rotation functions it provides; their formal definition can be found in Equation (4). Affine transformations fall into the class of linear 2D geometric transformations, in which variables in the original image (such as the colour space values of a pixel P(x, y)) are mapped to new variables (for pixel P'(x', y')). The pixel grid of the image is therefore deformed and mapped to a destination image. For this work, twenty different transformations were used, all based on three significant points in the image. The first point is located close to the origin, at one tenth of both the height and the width, or alternatively, in some instances, at the right side of the image; it determines the shearing of the final image, based on where the point lands after the transform is performed. The other two points are primarily used for the rotation and scaling of the image: one is placed at the middle of the top of the image, and the second at the middle of the left side or, equivalently in some cases, at the right side. All twenty distinct affine transforms are applied to each image class and are stored in a directory named after the sign to which the transformation was applied.
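The sketch below illustrates how one such three-point affine transform could be realised with OpenCV. It draws a random perturbation of the three control points for brevity (the paper uses twenty fixed transforms), and the jitter magnitude is an illustrative assumption.

```python
import cv2
import numpy as np

def random_affine(sign_rgba, rng, max_shift=0.15):
    """Warp the sign by mapping three control points (near the origin at
    one tenth of width/height, the top middle, and the left middle) to
    jittered positions; the resulting 2x3 matrix scales, rotates and
    shears the image, as described in the text."""
    h, w = sign_rgba.shape[:2]
    src = np.float32([[0.1 * w, 0.1 * h],   # controls shearing
                      [0.5 * w, 0.0],       # top middle: rotation/scale
                      [0.0,     0.5 * h]])  # left middle: rotation/scale
    jitter = rng.uniform(-max_shift, max_shift, src.shape) * np.array([w, h])
    dst = (src + jitter).astype(np.float32)
    M = cv2.getAffineTransform(src, dst)
    return cv2.warpAffine(sign_rgba, M, (w, h), flags=cv2.INTER_LINEAR,
                          borderValue=(0, 0, 0, 0))  # keep transparency
```

With, e.g., `rng = np.random.default_rng(0)`, repeated calls yield distinct transforms of the same template.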
Next, the system computes the average and the Root Mean Square (RMS) brightness of the pixels in the image. By the power-mean inequality, the RMS can only be greater than or equal to the average, so it corresponds to a higher brightness value for the image. The average pixel value, however, may yield a brightness that is not entirely representative, because a minority of pixels can take values at the tails of the distribution [22] and thus skew the average. A possible way to overcome this problem is to use the median value. Nonetheless, it was observed that this difference was beneficial for the method, as it effectively represented the variations seen in natural settings and the distinct material of which the sign is made. Therefore, in order to preserve a portion of the image noise seen in real-world photos, the average and root mean square values are better suited.
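A minimal sketch of the two brightness readings, assuming a NumPy grey-scale array; the RMS ≥ mean property follows from the power-mean inequality.

```python
import numpy as np

def mean_and_rms_brightness(grey):
    """Average and root-mean-square brightness of a grey-scale image.
    By the power-mean inequality, RMS >= mean, so the RMS reading never
    reports a darker image than the average does."""
    grey = grey.astype(np.float64)
    return float(grey.mean()), float(np.sqrt((grey ** 2).mean()))
```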
In addition to the average and RMS values, two different approaches are used to quantify brightness. The first uses a grey-scaled version of the image and computes the average and RMS pixel values. This is a standard method, as the luminance in this case is defined on the single-channel image, without using the information from the other three image channels. The alternative method uses the HSP colour model [23]. Its major difference from colour models such as HSV (where the V component represents brightness as the maximum of the RGB channels) or HSL (where L is the average of the maximum and minimum RGB values) is that it accounts for the degree of influence of each RGB channel. The equation used for the conversion is given in Equation (5), in which the constants represent the extent to which each colour affects the pixel's luminance.
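The perceived-brightness computation could look as follows, assuming the commonly used HSP coefficients (0.299, 0.587, 0.114) correspond to the constants of Equation (5); the per-image average and RMS of P mirror the grey-scale case.

```python
import numpy as np

def hsp_brightness(rgb):
    """Mean and RMS of the per-pixel perceived brightness
    P = sqrt(0.299 R^2 + 0.587 G^2 + 0.114 B^2), where the coefficients
    weight each channel by its assumed influence on luminance."""
    rgb = rgb.astype(np.float64)
    p = np.sqrt(0.299 * rgb[..., 0] ** 2 +
                0.587 * rgb[..., 1] ** 2 +
                0.114 * rgb[..., 2] ** 2)
    return float(p.mean()), float(np.sqrt((p ** 2).mean()))
```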
The same process is also applied to the background image, with the RMS and average values computed for both the 1D image and the perceived brightness. At this stage, all the necessary values have been identified for both the background and foreground; therefore, the margin between the two is calculated as the absolute difference of their respective values. Then, in order to find the brightness that the sign needs to be adapted to, the margin is divided by a ratio chosen to maximise the similarity of the image features while adding colour distortion that represents the background conditions. To measure similarity and test the thresholds, the images were segmented into ten distinct regions and a histogram was plotted for each region. The histograms were then compared with those of the equivalent regions of the sign and of the background, and the best results (histograms that maximised the similarity with the foreground and background images) were obtained with the thresholds selected for each category. The outcome of each approach is depicted in Figure 3.
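The brightness-adaptation step might be sketched as follows with Pillow's ImageEnhance module; the mapping from the margin-over-ratio value to an enhancement factor (acting on a 0–255 scale) is our own assumption, as the paper does not spell it out.

```python
from PIL import ImageEnhance

def adapt_sign_brightness(sign, sign_brightness, bg_brightness, ratio):
    """Shift the sign's brightness towards the background's: the absolute
    margin between the two readings is divided by the per-category ratio
    and turned into an enhancement factor around 1.0."""
    margin = abs(bg_brightness - sign_brightness) / ratio
    if bg_brightness >= sign_brightness:
        factor = 1.0 + margin / 255.0   # brighten towards a brighter background
    else:
        factor = 1.0 - margin / 255.0   # darken towards a darker background
    return ImageEnhance.Brightness(sign).enhance(factor)
```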
As a final step, the system stores the image after applying the newly determined illumination using an enhancer function. There are examples in which this automated procedure proved insufficient and the distortions overshadowed the image features. The generated signs are therefore screened with a set of threshold values to separate usable examples from those that cannot be used. The data separation function considers both the case in which the traffic sign is black and white and the case in which it is coloured. For a grey sign, an additional function stores the file paths of every image that is grey. This is achieved by calculating the differences between the individual channel averages of the image and ensuring that each margin is lower than one. Coloured signs are judged on the differences between the RGB values; however, conditional independence is assumed between the three pairwise relationships: for example, a bright red image will have a large margin when comparing the red and blue channels, but the difference may be less than one between the blue and green channels.
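A sketch of the grey/coloured separation described above, assuming NumPy arrays; the function name is illustrative.

```python
import numpy as np

def classify_sign(img):
    """Separate generated signs by colour content: if every pairwise
    difference between the average R, G and B values is below one, the
    sign is treated as grey; otherwise it is coloured, and the three
    pairwise comparisons are treated independently (a bright red sign
    shows a large red-blue margin even if blue and green differ by < 1)."""
    r, g, b = (img[..., c].astype(np.float64).mean() for c in range(3))
    diffs = (abs(r - g), abs(g - b), abs(r - b))
    return "grey" if max(diffs) < 1.0 else "coloured"
```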
The last step of the synthetic example generator is the blending of the background and foreground. This is performed by a transitional combination of the traffic sign with the background, creating a blending effect at the border of the sign for a realistic representation of environmental effects on the object. The bordering pixels are manipulated so that the opacity is progressively decreased; the final RGB values therefore become an intermediate result between the actual sign and the colours of the background. Final images are resized to 48 × 48 pixels in order to minimise the divergence from actual images.
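Finally, the blending and resizing step could be sketched as below with Pillow; the fade depth and the linear opacity ramp at the border are illustrative assumptions.

```python
import numpy as np
from PIL import Image

def blend_sign(background, sign_rgba, position, fade=4, size=(48, 48)):
    """Alpha-composite the sign onto the background, linearly reducing
    the opacity of the `fade` outermost border pixels so that edge
    colours become an intermediate of sign and background, then resize
    the result to 48 x 48."""
    sign = sign_rgba.copy()
    alpha = np.array(sign.split()[-1], dtype=np.float64)
    h, w = alpha.shape
    for i in range(fade):
        scale = (i + 1) / (fade + 1)     # most transparent at the very edge
        alpha[i, :] *= scale
        alpha[h - 1 - i, :] *= scale
        alpha[:, i] *= scale
        alpha[:, w - 1 - i] *= scale
    sign.putalpha(Image.fromarray(alpha.astype(np.uint8)))
    out = background.copy()
    out.paste(sign, position, sign)      # use the faded alpha as the mask
    return out.resize(size, Image.BILINEAR)
```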