Prototyping a Traffic Light Recognition Device with Expert Knowledge

Abstract: Traffic light detection and recognition (TLR) research has grown every year. In addition, Machine Learning (ML) has been widely used, not only in traffic light research but in every field where it is useful and possible to generalize data and automate human behavior. ML algorithms require a large amount of data to work properly and, thus, a lot of computational power is required to analyze the data. We argue that expert knowledge should be used to decrease the burden of collecting a huge amount of data for ML tasks. In this paper, we show how this kind of knowledge was used to reduce the amount of data and improve the accuracy rate for traffic light detection and recognition. Results show an improvement in the accuracy rate of around 15%. The paper also proposes a TLR device prototype that uses both the camera and the processing unit of a smartphone and can serve as a driver assistance device. To validate the prototype layout, a dataset was built and used to test an ML model based on an adaptive background suppression filter (AdaBSF) and Support Vector Machines (SVMs). Results show a precision rate of 100% and a recall of 65%.


Introduction
According to the Brazilian Federal Highway Police [1], more than 150 thousand traffic accidents were reported in 2014. Traffic lights are clear and intuitive guidance devices for traffic control. However, they are systematically disrespected by Brazilian drivers. It was reported that 7500 of these accidents were related to such disrespect, causing 425 deaths. Some hypotheses are raised as possible motivations for such infractions: (1) poorly located traffic lights; (2) faulty, switched-off, or very dim traffic lights; (3) ambient light that disturbs the vision of the driver; (4) visual impairment of the driver; (5) advancing through a yellow traffic light; and (6) the number of traffic regulation items to be observed.
The first two items can be solved by the traffic regulator through proper placement and maintenance of traffic lights on the streets. A Traffic Light Recognition (TLR) device that assists the driver could deal with the remaining items.
The main task of a TLR is to avoid accidents and save lives by informing the driver of the presence of a red or green traffic light in a non-intrusive way. A more complex TLR can bring richer information, such as pointing out the main traffic light for a specific route (when there is more than one) and how far the traffic light is from the driver. Smarter insights could also be provided, for instance the speed the driver should keep in order to pass the largest number of green signals in a row. A smart TLR is also useful for pedestrians who are visually impaired, especially when it provides sound signaling.
The electronic device used to build a smart TLR and how it is positioned inside the vehicle influence the success of the TLR. If the device has a faulty camera or is not stably attached to the vehicle, for example, images can be blurred or may not accurately reflect the scene.
Devices that are capable of automatically detecting and classifying objects, whether these are images or not, usually require rules-driven programming; the rules guide the device in its perception of the world. Machine learning methods automate the burden of finding the proper set of rules for specific problems, as discussed in Section 2.
A major problem with ML techniques is the need for large datasets to find patterns, i.e., the rules. Large and representative datasets are difficult to build. Another important problem concerns processing costs: the more data, the greater the need for processing power and memory capacity. Machines with high computational power bring acquisition and maintenance costs.
In Section 3, we propose a TLR layout prototype using a detection and recognition method from [2]. The proposed TLR uses the camera and processing unit of an ordinary smartphone. Furthermore, Section 4 evaluates whether the use of Expert Knowledge (EK) within ML models reduces the burden of building up huge datasets. In Section 5, concluding remarks are reported.

Related Works
An object recognition mechanism works in two phases in order to recognize objects from an image: (1) detection of possible targets and (2) classification of the targets.
When working with object detection/recognition, we need to define which object features shall be used to guide the algorithm. In traffic light recognition, features such as light, shape and color are commonly used.
Artificial Neural Networks (NN), Saliency Maps (SM), and Blob Detection (BD) are the most common techniques used to detect traffic lights. In [3][4][5][6], a Convolutional Neural Network (CNN) was used to detect possible traffic lights, whereas in [7] PCANet was used with the same goal. In [8], the authors used a learning algorithm based on image feature channels and Histogram of Oriented Gradients (HOG) for detection and recognition. Saliency maps were used as a detection tool by [5][9][10][11][12]. We also observed fine examples of blob detection in [13][14][15]. Geometric transforms were used in the detection phase by [16][17][18], which applied the Circular Hough Transform, and by [19], which used the Fast Radial Symmetry Transform. Some less common techniques, used alone or in association with the ones cited before, are Adaptive Filters [2], Template Matching [20], Gaussian Distribution [21], Probability Estimation with CNN [3], and Top Hat [22]. Image processing algorithms are also commonly used to detect traffic lights: color or shape segmentation was used by [23,24], whereas thresholding was used by [25,26].
Table 1 lists the rates obtained by the works mentioned above and their main techniques. It is difficult to compare the results and determine which one is best, since different metrics were used and, more importantly, few papers worked with the same datasets. However, it is notable that many papers reported a precision rate above 90% or an accuracy rate above 80%. When a paper presented more than one result, we included the lowest and the highest values in the corresponding column of Table 1.

| Reference | Technique | Precision (%) | Recall (%) | Accuracy (%) |
|---|---|---|---|---|
| (missing) | Color or Shape Segmentation/HOG/SVM | - | - | 89.90 |
| [30] | Color or Shape Segmentation/SVM | 86.20 | 95.50 | - |
| [21] | Gaussian Distribution | - | - | 80.00-85.00 |
| [16] | Geometric Transforms | - | - | 56.00-93.00 |
| [34] | Color or Shape Segmentation/Histograms | - | - | 97.50 |
| [7] | PCANet/SVM | - | - | 97.50 |
| [6] | CNN/Saliency Map | - | - | 96.25 |
| [24] | Color or Shape Segmentation | - | - | 92.00-96.00 |
| [19] | Geometric Transforms | 87.32 | 84.93 | - |
| [17] | Geometric Transforms | - | - | 70.00 |
| [25] | Color or Shape Segmentation/Threshold | - | - | 88.00-96.00 |
| [35] | Color or Shape Segmentation/Histograms | - | - | 50.00-83.33 |
| [32] | Color or Shape Segmentation/SVM | 98.96 | 99.18 | - |
| [20] | Template Matching | 98.00 | 97.00 | - |
| [43] | Hidden Markov Models | - | - | 90.55 |
| [38] | Template Matching | - | - | 90.50 |
| [22] | Top Hat | - | - | 97.00 |
| [39] | Template Matching | - | - | 69.23 |
| [36] | Histograms | - | - | 91.00 |
| [41] | Probability Histograms | - | - | 94.00 |
| [18] | Geometric Transforms/Histograms | - | - | 89.00 |
| [40] | Template Matching | 98.41 | 95.38 | - |
| [44] | Template Matching | 44.00-63.00 | 75.00-94.00 | - |

Data used in the related works are not always made available by the authors and, when available, only a few datasets are complete, i.e., contain both separate traffic light images and whole traffic scene images. In Table 2, the Type column refers to the kind of traffic light the dataset contains. The Traffic light samples column shows how many images containing only a traffic light exist; these images are very useful for training Machine Learning algorithms and are obtained from whole frames containing traffic scenes. The Traffic frame samples column counts the traffic scenes in the dataset that are not the ones used to extract the traffic light images.
One of the most used datasets in academic research is the LARA dataset [40]. The LARA dataset provides video and images of vehicle traffic lights in daylight.
A pedestrian traffic light dataset is the PRIA dataset provided by [45]. A larger dataset with both vehicle and pedestrian traffic lights is the Worcester Polytechnic Institute (WPI) dataset [14]. Another vehicle dataset is provided by [2]. All of these datasets provide only daytime samples of traffic lights.
Recently, the LISA dataset was provided as part of the Visions for Intelligent Vehicles and Applications project (VIVA) [46]. LISA consists of a vehicle traffic light dataset, a traffic sign dataset, a hand detection/gesture/tracking dataset and a face dataset, with samples of traffic lights in both daytime and nighttime.
It is important to notice that the datasets were built in different regions of the planet, with differences in the shape and the type of the traffic lights. All experiments in this paper use the dataset created by [2].
Considering the availability, balance, quality and amount of data, we chose two datasets to work with. In this article, we show the results using the dataset presented by [2]. In future work, we will present new results using the dataset from [14]. Looking at Table 2, we can see that these two datasets are the only ones that have both exclusive traffic light images for machine learning training and whole traffic images.

Traffic Light Recognition Device Prototype
A main question when prototyping a TLR device is where it will be positioned in the vehicle, since it must provide a clear view of the exterior scene and, at the same time, not obstruct the vision of the driver. Another critical observation is that the device must be protected from adverse weather conditions such as rain, or be waterproof. Heat might also cause problems in some electronic devices, so the incidence of sunlight at the device location should be considered as well. In addition, the vibration caused by vehicle motion might have a critical influence on the device's view, so stabilization is an important requirement. The TLR should also be able to emit a warning sound to alert a distracted driver. Smartphones meet these requirements and are therefore an affordable alternative for TLR devices.
In this work, a smartphone was positioned inside a vehicle to capture actual traffic scenes with and without the presence of traffic lights. Two different types of support are commonly used to attach a smartphone to the car panel: air conditioning supports and windshield suction cups. Air conditioning supports cannot be used to position a TLR device because they offer no outside view from the vehicle. Windshield suction cup supports are a possible choice; however, the support may fall down or become unstable if low-quality suction cups are used.
To overcome the smartphone support issue and to meet the requirements specified previously, we designed a stable device support using two-sided tape and part of a windshield suction cup support. We removed the portion of the support that holds the device from the part with the suction cup that attaches to the windshield. Then, we fixed the holder portion at the center of the vehicle panel with the two-sided tape. This design allows the device to capture the traffic scene without a bias to the left or to the right. The proposed layout forces the device to use the camera in landscape mode, reducing the portion of sky captured and maximizing the size of the traffic scene obtained (Figure 1).
Three different smartphones were used to capture traffic videos containing traffic lights: a Motorola G second generation (Motorola Mobility LLC, Chaoyang District, China), an iPhone 6 (Foxconn CMMSG Indústria de Eletrônicos LTDA, Jundiaí-SP, Brazil), and a Galaxy S8+ (Samsung Eletrônica da Amazônia LTDA., Manaus-AM, Brazil). All devices were configured to capture video in HD resolution. Figure 2 shows an example of images obtained with these devices. The images were extracted from the videos at a rate of 5 frames per second (fps).
A main concern when using smartphones in tasks like this is whether such devices have enough processing power. It is important to notice that nowadays most smartphones have as much processing power and memory as some modern notebooks. Besides that, other researchers have been investigating and testing the use of mobile devices to recognize traffic lights in recent years. For example, Roters et al. [45] were able to recognize traffic lights in live video streams using a Nokia N95 mobile phone (manufacturer not specified by the author). The prototype could run at 5 to 10 frames/s and needed only a few seconds to give the user feedback in real field tests. In addition, Sung and Tsai [47] used an Android 4.04 smartphone equipped with a 1.5 GHz dual-core processor and 1 GB of RAM. With this configuration, they were able to process each image in 15.7 ms, which, according to the authors, is half of the time required for real-time processing on a mobile device. More recently, Oniga et al. [21] used a smart mobile device equipped with a quad-core processor at 2.3 GHz. The computation time was in the range of 50-60 ms for images with a resolution of 1280 × 720 pixels, and 30 ms for images with a resolution of 640 × 480. The image resolution used by [42] was 1413 × 1884, and the experiments with a Nexus 5 device (manufacturer not specified by the author) running Android 5 obtained an average computation time of 115.5 ms. These papers show that, although performance can be improved in some cases, today's smartphones have enough processing power to accomplish such a task.

Adaptive Background Suppression Filter
To highlight regions of interest (ROI) in the image, the authors of [2] proposed an Adaptive Background Suppression Filter (AdaBSF). In the algorithm, a 4-channel feature map $W_i$, where $i$ is the channel index, is generated by extracting the R, G and B channels and calculating the normalized gradients of the input image.
To search for vertical and horizontal traffic lights, the window size for $W_i$ is fixed at 16 × 8 pixels and 8 × 16 pixels, respectively. Since each window has four channels, the number of values per window is $D = 16 \times 8 \times 4 = 512$. Each window is therefore represented by a feature vector $x$ of size $D = 512$. The multi-scale problem was solved by down-sampling the original image to different scales while the detection window keeps a fixed size.
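The windowing step can be sketched as follows. This is a minimal illustration, not the implementation of [2]: the feature map is assumed to be stored as nested lists indexed `fmap[channel][row][col]`, the 16 × 8 window is assumed to mean 16 rows by 8 columns, and the function names and the stride of 4 pixels are hypothetical choices made for the example.

```python
def window_vector(fmap, top, left, height=16, width=8):
    """Flatten one height x width window of a multi-channel map into a vector."""
    return [
        fmap[c][top + r][left + k]
        for c in range(len(fmap))
        for r in range(height)
        for k in range(width)
    ]


def sliding_windows(fmap, height=16, width=8, stride=4):
    """Yield (top, left, feature_vector) for every window position.

    The stride is an assumption of this sketch; the text does not state it.
    """
    rows, cols = len(fmap[0]), len(fmap[0][0])
    for top in range(0, rows - height + 1, stride):
        for left in range(0, cols - width + 1, stride):
            yield top, left, window_vector(fmap, top, left, height, width)


# Toy 4-channel map (R, G, B, gradient), 32 rows x 24 columns.
toy_map = [[[float(c) for _ in range(24)] for _ in range(32)] for c in range(4)]
windows = list(sliding_windows(toy_map))
```

Every vector produced this way has 16 × 8 × 4 = 512 entries, matching the size D given in the text. Multi-scale search would repeat the same loop on down-sampled copies of the map.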
The aim of the AdaBSF algorithm is to design a Finite Impulse Response (FIR) filter specified by the vector $w = [w_1, w_2, \ldots, w_D]^T$ such that $y = w^T x$. The output $y$ assigns a score to each detection window, which represents how likely the detection window is to cover a traffic light.
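Once a filter vector is available, scoring reduces to one dot product per window. The sketch below only illustrates that scoring step; how AdaBSF actually designs w is not covered here, and the function names, the toy filter and the toy windows (with D = 4 instead of 512) are hypothetical.

```python
def score(w, x):
    """Linear filter response y = w^T x for one detection window."""
    if len(w) != len(x):
        raise ValueError("filter and window must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x))


def top_candidates(w, windows, k=3):
    """Return the k (score, position) pairs with the highest score, best first."""
    scored = [(score(w, x), pos) for pos, x in windows]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy example with D = 4 so the arithmetic is easy to follow.
w = [1.0, 0.0, -1.0, 0.5]
demo = [
    ((0, 0), [1.0, 1.0, 1.0, 1.0]),
    ((0, 4), [2.0, 0.0, 0.0, 2.0]),
    ((4, 0), [0.0, 0.0, 3.0, 0.0]),
]
best = top_candidates(w, demo, k=2)
```

Windows with the highest scores would then be passed on as ROI to the classification stage.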
To classify the ROI found by AdaBSF, the authors of [2] used Support Vector Machines (SVM). They created a cascade of SVM classifiers that begins by classifying whether the ROI is a traffic light or not. If it is a traffic light, the next SVM classifies the ROI into "red type" or "green type". The traffic light is further classified by the next SVM, which checks whether it has an arrow and the arrow's direction, using a "1-vs.-1" voting method.
In this paper, the method proposed by [2] was applied to images obtained by a smartphone used as a TLR device. The TLR follows a specific layout suitable for real-time use. This layout is specified in the following sections. The SVMs and the AdaBSF algorithm were trained with traffic light samples provided by the authors of [2]. Negative samples, i.e., background samples, were extracted from four random test sequences that were also provided. Since we had no access to the authors' code, the algorithm was implemented in Python and the results were compared to the original ones as a way to verify the correctness of the implementation.
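The control flow of such a cascade can be sketched without committing to a particular SVM library. In this illustration, each stage is a caller-supplied function standing in for a trained classifier; the function name, the dictionary keys and the shape labels are hypothetical and are not taken from [2].

```python
def classify_cascade(roi, is_light, color_of, arrow_votes):
    """Run an ROI through three stages supplied by the caller.

    is_light(roi)    -> bool, stage 1 (traffic light vs. background)
    color_of(roi)    -> "red" or "green", stage 2
    arrow_votes(roi) -> list of labels, one vote per pairwise (1-vs-1) classifier
    """
    if not is_light(roi):
        return {"traffic_light": False}
    result = {"traffic_light": True, "color": color_of(roi)}
    # 1-vs-1 voting: the label that wins the most pairwise duels is chosen.
    votes = {}
    for label in arrow_votes(roi):
        votes[label] = votes.get(label, 0) + 1
    result["shape"] = max(votes, key=votes.get)
    return result


# Stand-in stages for demonstration only; real stages would be trained SVMs.
demo = classify_cascade(
    {"brightness": 0.9, "hue": "red"},
    is_light=lambda r: r["brightness"] > 0.5,
    color_of=lambda r: r["hue"],
    arrow_votes=lambda r: ["circle", "circle", "left_arrow"],
)
```

Rejecting background at the first stage means the later, more specific classifiers only run on the few ROI that survive.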

Prototype Results
Images obtained by the TLR device using this support prototype were classified on a personal computer using the method applied in [2].
The images were obtained using three different smartphone models, as cited before. Images from the Motorola G 2nd Generation did not present good results and were discarded. The iPhone 6 group contains 682 images: 209 negative samples and 473 traffic light samples. The third group is formed by 247 images obtained with a Galaxy S8+: 165 traffic light samples and 82 negative samples. A total of 929 traffic images were analyzed: 638 images of green or red traffic lights and 291 negative samples.
Most images in the dataset of [2] present two equal traffic lights for the same road. In these cases, we count only one result per pair of classifications. If both traffic lights were classified correctly, or one was classified correctly and the other was missed, one true positive is counted. If both traffic lights were missed, or at least one of them was classified in the wrong class, one false negative is counted. This reflects real-life behavior, in which we only need to look at one traffic light to make a choice.
Figure 3 shows detailed results for each image group. The two groups achieved high precision rates, but the iPhone 6 group presented a low recall (60%). This can be explained by the fact that the traffic light samples used in the training dataset are too different from some traffic lights of the iPhone group dataset, as shown in Figure 4. If the training samples do not properly represent the real world, some traffic lights cannot be recognized. In addition, due to geographic/meteorological issues (and possibly the device used to obtain the images), the lighting of the training set images is very different from that of the test set.
The low recall rate of the iPhone image group influenced the final rates of the TLR tests, as observed in Figure 5. Compared with the results obtained by [2], the TLR results are good enough to justify its use in future research.
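The pair-counting rule can be written down directly. The sketch below assumes each traffic light of a pair is labeled with one of three hypothetical outcome strings ("correct", "missed", "wrong"); the counts passed to `precision_recall` are illustrative and are not the paper's raw counts.

```python
def pair_outcome(first, second):
    """Collapse the outcomes of two equal traffic lights into one result.

    Each outcome is "correct", "missed" or "wrong" (wrong class).
    """
    if "wrong" in (first, second):
        return "FN"  # at least one light classified in the wrong class
    if "correct" in (first, second):
        return "TP"  # both correct, or one correct and the other missed
    return "FN"      # both missed


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative counts only: no false positives, 13 of 20 lights found.
precision, recall = precision_recall(tp=13, fp=0, fn=7)
```

With no false positives the precision is 100% regardless of how many lights are missed, which is why precision and recall need to be read together.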
Figure 5. Precision and recall rates on our reproduction of [2], original work from [2], and tests using the images obtained with the TLR device prototype, respectively.

Expert Knowledge
The main problem with ML algorithms is that a considerable amount of data is required, and the data needs to be balanced and categorized. Building such a database from scratch requires time and great effort and, in most cases, the data may not even exist in large numbers.
We attempt to reduce the amount of data required to train an ML algorithm by taking Expert Knowledge (EK) into account. The EK used consists of the traffic light location in a given image obtained from the vehicle interior. The idea behind the chosen EK is that a traffic light is more likely to appear in certain locations, for example in the central and upper portions of the image, since traffic lights are always suspended on poles and must be visible to the driver from a reasonable distance.
To corroborate the idea, traffic light frequency maps were constructed. The method consists of hand-tagging a set of images with the region in which the traffic light(s) appear. As a result, we have a frequency map of the regions where traffic lights appear most often.
Datasets from [2] were used to generate a frequency map. A random sample of 650 images was drawn from the test data and analyzed. Figure 6 shows the graphical representation of the frequency maps produced. Frequency data were smoothed using an averaging filter with a mask of size 7. At the bottom, the scale goes from the lowest value (left) to the highest value (right).
To compare the results of the EK approach with the original or classic approach (without EK), the same data were used in training. Unfortunately, the datasets found in the literature do not provide the coordinates from which the traffic light samples were obtained in the original images, making it impossible to combine the previously calculated frequency with the training samples. To solve this problem, it was assumed that all training samples were found in regions of nonzero value in the frequency map, so the combination could be made with random frequency values. It was found that the data follow a beta distribution, which was used to generate the random frequency values so that they still keep a relation with the previously calculated frequency map.
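A frequency map of this kind can be sketched in a few lines. The example assumes each hand-tagged region is an axis-aligned box `(top, left, bottom, right)` in pixel coordinates with exclusive bottom and right edges; the function names, the box format and the border handling of the averaging filter are assumptions of this sketch rather than details given in the text.

```python
def frequency_map(boxes, rows, cols):
    """Count, per pixel, how many tagged boxes (top, left, bottom, right) cover it."""
    grid = [[0.0] * cols for _ in range(rows)]
    for top, left, bottom, right in boxes:
        for r in range(max(top, 0), min(bottom, rows)):
            for c in range(max(left, 0), min(right, cols)):
                grid[r][c] += 1.0
    return grid


def mean_filter(grid, size=7):
    """Smooth with a size x size averaging mask.

    Near the borders only the pixels that fall inside the image are averaged
    (an assumption; other border policies are possible).
    """
    rows, cols, half = len(grid), len(grid[0]), size // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            vals = [
                grid[rr][cc]
                for rr in range(max(r - half, 0), min(r + half + 1, rows))
                for cc in range(max(c - half, 0), min(c + half + 1, cols))
            ]
            out[r][c] = sum(vals) / len(vals)
    return out


# Two overlapping tags on a tiny 5 x 5 scene, smoothed with a 3 x 3 mask.
counts = frequency_map([(0, 0, 2, 2), (1, 1, 3, 3)], rows=5, cols=5)
smoothed = mean_filter(counts, size=3)
```

Dividing the counts by the number of tagged images would turn them into relative frequencies; the smoothing spreads each tag to neighboring pixels so that small tagging differences matter less.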
In Figure 7, we observe the frequency distribution histogram of the map shown in Figure 6. The x-axis represents the previously calculated traffic light frequency values, and the y-axis represents the frequency of those values. The parameters of the beta distribution found for the map of Figure 6 were (α = 0.0342, β = 12.8985).
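Drawing random frequency values from the fitted distribution can be sketched with the Python standard library. The α and β values are the ones reported above; the function name, the fixed seed and the small positive floor (used here to honor the assumption that training samples come from nonzero regions of the map) are hypothetical choices for this example.

```python
import random

ALPHA, BETA = 0.0342, 12.8985  # beta parameters reported for the map of Figure 6


def sample_frequencies(n, seed=0, floor=1e-6):
    """Draw n frequency values from Beta(ALPHA, BETA).

    Values are clipped from below at `floor`, an assumption of this sketch
    that keeps every sampled frequency strictly positive.
    """
    rng = random.Random(seed)
    return [max(rng.betavariate(ALPHA, BETA), floor) for _ in range(n)]


samples = sample_frequencies(1000)
```

The mean of a beta distribution is α / (α + β), which is about 0.0026 for these parameters, so most sampled values lie very close to zero and only a few are large.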

PCANet/SVM Classifier
Color extraction and blob detection are used to detect possible traffic lights, or ROI. The color extraction is performed in the HSV color space. Blob detection is implemented by combining image processing algorithms such as flooding, contour tracking, and closing. The combination of these techniques allows the identification of both circular traffic lights and arrow lights.
The classification phase consists of a PCA network (PCANet). The PCANet is based on Principal Component Analysis (PCA). According to [48], PCA is a statistical approach that can be used to analyze relations among a large number of variables in order to condense the information from the set of original variables into a smaller set of variables with minimal loss of information. This allows the PCANet to simulate the behavior of a Convolutional Neural Network (CNN).
The creators of PCANet [49] wanted to design a simple deep learning network that is easy to train and to adapt to different data and tasks. The convolution in a PCANet uses PCA filters as kernels; the filters are found by an unsupervised learning method during the training process. The number of filters can vary and, according to [14], more filters mean better network performance. Eight (8) PCA filters were used in the proposed two-layer PCANet.
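As an illustration of color extraction in HSV space, the sketch below labels a single RGB pixel as a red or green candidate. It is not the procedure of [14]: the hue ranges and the saturation and value thresholds are hypothetical values chosen for the example, and the function name is illustrative.

```python
import colorsys


def color_candidate(r, g, b, min_sat=0.5, min_val=0.5):
    """Label an RGB pixel (0-255 per channel) as "red", "green" or None.

    All thresholds are illustrative assumptions, not values from the text.
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    if s < min_sat or v < min_val:
        return None  # too gray or too dark to be a lit lamp
    hue_deg = h * 360.0
    if hue_deg <= 20.0 or hue_deg >= 340.0:
        return "red"  # red wraps around 0 degrees on the hue circle
    if 90.0 <= hue_deg <= 180.0:
        return "green"
    return None
```

Working on hue instead of raw RGB makes the test less sensitive to brightness changes, which is the usual reason for choosing HSV in this kind of segmentation. Pixels that pass the test would then be grouped into blobs.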
The PCA filters are used in PCANet as follows: for each patch of pixels in the input image, there is a filter of the same size. The mean of the pixels in the patch is calculated and subtracted from the patch, an operation called the Patch Mean Removal Phase. After that, the filters are convolved over the image. The combination of the Patch Mean Removal Phase and the PCA Filters Convolution Phase is called a stage or layer of the network.
The output of the network is a binary hashing that is converted to a block-wise histogram of decimal values. The block-wise histogram represents the output features used to feed the SVM, which performs the final classification. For further details, see [49].
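The two phases of a stage can be sketched as follows. In this minimal illustration the image is a nested list of numbers, patches are taken without padding, and the filter is written by hand; in a PCANet the filters would instead be learned with PCA from the mean-removed patches, a step omitted here. The function names are hypothetical.

```python
def extract_patches(image, k):
    """Collect every k x k patch as a flat vector with its own mean removed."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            patch = [image[r + i][c + j] for i in range(k) for j in range(k)]
            mean = sum(patch) / len(patch)
            patches.append([p - mean for p in patch])  # patch mean removal
    return patches


def filter_responses(patches, flt):
    """Response of one filter to every patch (one output value per position)."""
    return [sum(f * p for f, p in zip(flt, patch)) for patch in patches]


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
patches = extract_patches(image, k=2)
# Hand-written 2 x 2 filter (flattened), standing in for a learned PCA filter.
responses = filter_responses(patches, [-0.5, -0.5, 0.5, 0.5])
```

Removing the patch mean makes the responses insensitive to the local brightness level, so the filters react to structure rather than to overall illumination.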
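The output stage can be sketched in the same style. The example binarizes each filter response map (1 where the response is positive), combines the bits of all maps into one integer code per pixel, and counts the codes inside non-overlapping blocks. The function names and the choice of non-overlapping blocks are assumptions of this sketch; see [49] for the actual formulation.

```python
def binary_hash(maps):
    """Combine L response maps into one integer-coded map.

    Map number l contributes bit l, so codes fall in the range 0 .. 2**L - 1.
    """
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(1 << l for l, m in enumerate(maps) if m[r][c] > 0) for c in range(cols)]
        for r in range(rows)
    ]


def block_histograms(codes, n_bins, block):
    """Concatenate one histogram of codes per non-overlapping block x block tile."""
    rows, cols = len(codes), len(codes[0])
    features = []
    for r0 in range(0, rows - block + 1, block):
        for c0 in range(0, cols - block + 1, block):
            hist = [0] * n_bins
            for r in range(r0, r0 + block):
                for c in range(c0, c0 + block):
                    hist[codes[r][c]] += 1
            features.extend(hist)
    return features


# Two toy 2 x 2 response maps, so codes range from 0 to 3.
maps = [[[1.0, -1.0], [1.0, -1.0]], [[1.0, 1.0], [-1.0, -1.0]]]
codes = binary_hash(maps)
features = block_histograms(codes, n_bins=4, block=2)
```

The concatenated histogram is the feature vector that would be handed to the SVM.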

Training with Expert Knowledge
The traffic light recognition method used to test the EK insertion was proposed by [14]. The authors proposed a PCANet to find feature vectors given a set of traffic lights provided for training. The feature vectors are then used to train SVMs that classify the traffic lights (see Section 4.1 for details).
The inclusion of EK in this process can take many different forms. It is possible, for instance, to obtain the coordinates (in the original scene) of the region classified by the method as a traffic light, look up the frequency map at the same coordinates and submit these values to a threshold operation that classifies the traffic light. The problem with this approach is that the algorithm becomes too deterministic, since only the exact regions in which traffic lights appear in the frequency map would achieve positive results.
A different approach is to perform a sum or multiplication operation in one of the internal phases of the recognition method. This could be done between PCANet layers, or after the PCANet output, just before the SVM layer. The problem in this case is that the dimensionality of the data is reduced by the convolutions inside the algorithm, which makes a direct combination with the frequency map values difficult.
We decided to use the EK as the first layer of the recognition method. That is, after the detection phase, the Regions Of Interest (ROI) are multiplied by the frequency value obtained for the equivalent region in the frequency map. Since the frequency map and the original scene have the same dimensions, and provided the ROI coordinates in the original scene are known, this combination will tend to highlight the ROI for the next algorithm.
An important concern at this point is to ensure that the expert instruction does not make the method too deterministic, as it would if it were applied at the end of the method. It is possible that some part of the frequency map has a value of 0 (zero), which means that no traffic lights appeared in that region in the analyzed sample. However, although the odds are small, it is also possible that a traffic light appears in an uncommon region of the scene, and the TLR needs to be able to find it.
To prevent the combination of the expert instruction with the data from completely canceling some of the data, it was necessary to define an increment $inc$ that is added to the multiplication factor $f_{at}$, which consists of the frequency value found in the frequency map. In all the experiments, we set $inc = 0.1$.
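The weighting step can be sketched as a per-pixel multiplication. The value inc = 0.1 comes from the text; applying the factor pixel by pixel (rather than as one scalar per ROI), assuming the frequency map is scaled to [0, 1], and the function name are assumptions of this sketch.

```python
INC = 0.1  # increment used in all experiments


def apply_expert_knowledge(roi, freq_map, top, left, inc=INC):
    """Multiply each ROI pixel by (frequency at the same scene position + inc).

    `top` and `left` are the ROI coordinates in the original scene.
    """
    return [
        [pixel * (freq_map[top + r][left + c] + inc) for c, pixel in enumerate(row)]
        for r, row in enumerate(roi)
    ]


# 4 x 4 frequency map that is zero everywhere except one position.
freq = [[0.0] * 4 for _ in range(4)]
freq[1][1] = 0.5
weighted = apply_expert_knowledge([[10.0, 20.0], [30.0, 40.0]], freq, top=1, left=1)
```

Because of the increment, a pixel in a zero-frequency region keeps 10% of its value instead of being erased, so a traffic light in an unusual position can still reach the classifier.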
We used the same training set defined by [14], consisting of 9977 samples of green traffic lights and 10,975 samples of red traffic lights, amounting to 20,952 training samples in total. In our case, however, EK is added.
After each training run, an accuracy test was performed. The amounts of data used in training were 1000, 4000, 7000, 10,000 and 13,000 samples.
Figure 8 shows the accuracy rates obtained in the successive tests. Training with EK (orange line) using 1000, 4000 and 7000 samples increased the accuracy rate by at least 15% in comparison with training without EK (blue line).
The two rates came close only when 10,000 samples were used to train the model. In this case, training with EK reached 98.38% accuracy and the method without EK obtained 93.02% accuracy. With 13,000 samples, the rates were even closer, with the EK training reaching 85.02% while the no-EK training obtained 86.48%. In this last test, although the test without EK had a higher accuracy, the difference is only 1.46 percentage points, which is not statistically significant (p-value of the test for difference of proportions equal to 0.8092).
Another interesting observation in Figure 8 is the decrease in accuracy at the end of both lines, when the tests are performed with more data. This behavior, known as overfitting, is common when the model becomes too specialized in the training data: it classifies the training data correctly but has not generalized well enough to classify new data correctly.
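A test for the difference of two proportions can be sketched as follows. This section states neither the exact test variant nor the number of test samples, so the sketch assumes the common pooled two-sided z-test, and the sample sizes in the example call are hypothetical; it is not expected to reproduce the reported p-value.

```python
import math


def two_proportion_p_value(p1, n1, p2, n2):
    """Two-sided p-value of the pooled z-test for equality of two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        return 1.0  # both samples are all successes or all failures
    z = abs(p1 - p2) / se
    # Two-sided normal tail: 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# Accuracies from the text; the test set size of 100 per run is hypothetical.
p_value = two_proportion_p_value(0.8502, 100, 0.8648, 100)
```

A p-value well above the usual 0.05 threshold means that a gap of this size could easily arise by chance for the given sample size.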

Conclusions
This work presented a TLR device layout prototype used to capture road scenes in order to validate the usage of a smartphone as a TLR in a real environment. This work also shows a method of EK inclusion to improve the accuracy rates of a machine learning algorithm used to classify traffic light images.
The tests with the TLR device prototype achieved a 100% precision rate and a 65% recall rate. The results demonstrate the feasibility of the prototype. The recall rate can be improved by training the applied algorithm with more representative samples, which will be done in the future along with cross-validation tests. The results also show that the Galaxy S8+ and the iPhone 6, two different mobile platforms, can be successfully used as TLR devices.
The use of Expert Knowledge explored in this paper also showed promising results. Training with 1000, 4000 and 7000 samples with Expert Knowledge always achieved test accuracy rates at least 15% higher than training without Expert Knowledge. This result indicates that Expert Knowledge can reduce the amount of data required to train an algorithm, reducing at the same time the computational effort needed. This experiment also encourages research on similar methods in other areas.
Future work includes testing the Expert Knowledge method on other datasets, tests with the complete TLR flow, i.e., using an automatic traffic light detection method along with the Expert Knowledge classification algorithm, and real-time tests using the prototype presented in this paper with our classification method.

Figure 1. Traffic Light Recognition device support holding an iPhone 6.

Figure 2. Images obtained using the TLR device support prototype with different smartphones. From top to bottom (left to right): (a) Motorola G 2nd Generation; (b) iPhone 6; and (c) Galaxy S8+.

Figure 3. Precision and recall rates by the smartphone used to obtain the images.

Figure 4. From left to right: the red and green traffic light samples used in training, and red and green traffic light samples from the test dataset obtained with the TLR device prototype.

Figure 6. Frequency map of traffic light appearances in a sample of the dataset from [2]. Scale on the bottom.

Figure 8. Test accuracies after training with expert knowledge (orange line) and without expert knowledge (blue line).

The papers not listed in Table 1 used only computational time as a metric, which is out of the scope of this paper. The lowest accuracy rates were achieved with Geometric Transforms, Template Matching, and Color or Shape Segmentation combined with Histograms, while the highest ones were achieved with the PCANet/SVM technique and, again, with Color or Shape Segmentation combined with Histograms. For Color or Shape Segmentation combined with Histograms, this spread of results is understandable, since each paper can approach the task differently. The lowest precision rate was achieved by Template Matching, while all the other approaches obtained precision rates above 80%, including Template Matching in other tests of the same paper that reported the worst result.
Table 1. Rates achieved by related works and techniques.

Table 2. Public traffic light datasets.