Street Sign Recognition Using Histogram of Oriented Gradients and Artificial Neural Networks

Street sign identification is an important problem in applications such as autonomous vehicle navigation and aids for individuals with vision impairments. It can be especially useful in instances where navigation techniques such as global positioning system (GPS) are not available. In this paper, we present a method of detection and interpretation of Malaysian street signs using image processing and machine learning techniques. First, we eliminate the background from an image to segment the region of interest (i.e., the street sign). Then, we extract the text from the segmented image and classify it. Finally, we present the identified text to the user as a voice notification. We also show through experimental results that the system performs well in real-time with a high level of accuracy. To this end, we use a database of Malaysian street sign images captured through an on-board camera.


Introduction
Automatic vehicle navigation is typically performed using the global positioning system (GPS). However, GPS-based navigation can become unreliable for several reasons, such as interference from other systems, issues with signal strength, limitations in receiver sensitivity, and unavailability of maps. Therefore, it is advantageous to have a back-up system to take over the navigation process in case of such issues. Vision-based navigation is an alternative solution, in which street sign recognition plays a major role. Also, due to improper maintenance and the use of non-contrasting colours, people with vision impairments may find it difficult to read street signs. Automatic street sign recognition can be used to aid such individuals as well.
Street signs typically contain alphanumerical characters, the identification of which has received much focus in recent years, specifically in the context of scanned books and documents [1][2][3][4]. Street sign recognition, like other similar problems such as license plate and transport route recognition, typically consists of three stages: (i) extraction of the region of interest (ROI) from the image, (ii) segmentation of characters, and (iii) character recognition [5]. For ROI extraction and character segmentation, researchers use many different methods, the majority of which are based on image thresholding. Classifiers are usually trained for the character recognition task.
A street name detection and recognition method for the urban environment was developed by Parizi et al. [6]. First, they used Adaboost [7] to detect the ROI. Then, they used histogram-based text segmentation and scale-invariant feature transform (SIFT) [8] feature matching for text image classification. An approach for text detection and recognition in road panels was introduced by Gonzalez et al. [9]. For text localisation, they used the maximally stable extremal regions (MSER) algorithm [10]. An algorithm based on hidden Markov models (HMM) was used for word recognition/classification.
In this paper, we discuss the development of an intelligent identification and interpretation system for Malaysian street signs. The system consists of the following steps: (1) real-time capture of images using a digital camera mounted on the dashboard of a vehicle, (2) segmentation of the street sign from each frame captured by the camera, (3) extraction of characters from the segmented image, (4) identification of the street sign, and (5) presentation of the identified street sign as a verbal message. We show through experimental results that the proposed method is effective in real-time applications and comparable with other similar existing methods. The remainder of the paper is organised as follows. The proposed methodology for street sign recognition and interpretation is described in Section 2. The experimental setup and results (including performance comparisons) are discussed in Section 3. Section 4 concludes the paper.

Methodology
This section describes the development of the street sign recognition system for Malaysian street signs. An overview of the system is shown in Figure 1 and the functionality of each step is discussed in the following sub-sections.

Acquisition of Images
Images of Malaysian street signs were captured through a digital camera mounted on the dashboard of a vehicle driving through the streets of Kuala Lumpur. Specifications of the camera used for the image acquisition are shown in Table 1. Over two thousand images of street signs were obtained using this process. Figure 2 shows an example of a Malaysian street sign thus captured.

Extraction of the Region of Interest
We pre-processed the obtained images to extract the region of interest (i.e., the area that contains the street sign). First, histogram equalisation was performed on the image frame to improve contrast. Next, the blue channel of the image was extracted and noise was removed from it (salt and pepper noise removal and median filtering). Then, using a simple thresholding method, the channel was binarised to identify the blue coloured street sign. The resulting image contained all candidate objects for the street sign. The objects were identified using a blob detection algorithm. An object measurement algorithm was then used to find the most likely candidate for the street sign based on each object's height to width ratio. The selected object was used as a mask to extract the region of interest from the histogram equalised image (the output of the first step of this algorithm). Figure 3 illustrates this process, with Figure 3a showing the block diagram of the process and Figure 3b showing the results at each step for an example image.
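The thresholding, blob detection, and ratio-based selection steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the MATLAB implementation used in the paper: the function names, the threshold value, the use of 4-connectivity, and the target height to width ratio of 0.25 are all assumptions chosen for the example.

```python
from collections import deque

def binarise(channel, threshold=128):
    """Binarise one colour channel (threshold value is an assumption)."""
    return [[1 if v > threshold else 0 for v in row] for row in channel]

def find_blobs(mask):
    """Return bounding boxes (top, left, bottom, right) of 4-connected blobs."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one blob
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def pick_sign(boxes, target_ratio=0.25):
    """Pick the box whose height/width ratio is closest to a hypothetical target."""
    def ratio(box):
        top, left, bottom, right = box
        return (bottom - top + 1) / (right - left + 1)
    return min(boxes, key=lambda b: abs(ratio(b) - target_ratio))

# Toy blue channel: a wide bright bar (sign-like) and a bright square (distractor).
blue = [[0] * 12 for _ in range(6)]
for y in range(0, 2):
    for x in range(0, 8):
        blue[y][x] = 200
for y in range(3, 6):
    for x in range(9, 12):
        blue[y][x] = 200

candidates = find_blobs(binarise(blue))
sign_box = pick_sign(candidates)
```

In this toy input the wide bar is selected because its height to width ratio (2/8) matches the assumed target, whereas the square has a ratio of 1.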

Extraction of Text
From the region of interest obtained in the previous step, we extracted the text of the street sign. First, the street sign was converted to greyscale. Then, it was binarised using thresholding. A 3 × 3 median filter was used for smoothing the binary image and removing small objects. Each object that remained in the binary image was considered to represent a character. Next, the region of interest for each object was extracted and normalised to 56 × 56 pixels. Figure 4 illustrates the text extraction process.
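To illustrate how a 3 × 3 median filter removes small objects from a binary image, the following sketch applies the filter in plain Python. It is a simplified illustration rather than the implementation used in the paper; in particular, treating pixels outside the image as zero is an assumption.

```python
from statistics import median

def median_filter_3x3(img):
    """Apply a 3x3 median filter; out-of-image pixels count as 0 (assumption)."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                img[y][x] if 0 <= y < rows and 0 <= x < cols else 0
                for y in range(r - 1, r + 2)
                for x in range(c - 1, c + 2)
            ]
            out[r][c] = int(median(window))
    return out

# An isolated foreground pixel (noise) is removed by the filter.
speck = [[0] * 5 for _ in range(5)]
speck[2][2] = 1
cleaned = median_filter_3x3(speck)

# The centre of a solid 3x3 object survives.
block = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)] for y in range(5)]
smoothed = median_filter_3x3(block)
```

For a binary image, the median of nine values is simply the majority value in the window, so a pixel survives only if at least five pixels in its neighbourhood are foreground.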

Calculation of Text Features
The extracted characters then went through a feature calculation process, so that each character could be represented as a set of feature values. Here, we used histogram of oriented gradients (HOG) [36] for this purpose. First, the image was subdivided into smaller neighbourhood regions (or 'cells') [37]. Then, for each cell, at each pixel, the kernels $[-1, 0, +1]$ and $[-1, 0, +1]^T$ were applied to get the horizontal ($G_x$) and vertical ($G_y$) edge values, respectively. The magnitude and orientation of the gradient were calculated as

$$|G| = \sqrt{G_x^2 + G_y^2} \quad \text{and} \quad \theta = \arctan\left(\frac{G_y}{G_x}\right),$$

respectively. Histograms of the unsigned angle (0° to 180°), weighted by the gradient magnitude, were then generated for each cell. Cells were combined into blocks and block normalisation was performed on the concatenated histograms to account for variations in illumination. The cell size affected the length of the feature vector, as shown in Figure 5.
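As a rough illustration of the per-cell computation described above, the sketch below computes the gradient magnitude and unsigned orientation at each pixel and accumulates a magnitude-weighted histogram for one cell. It is a simplified sketch under stated assumptions, not the toolbox implementation used in the paper: border pixels replicate the edge value, each pixel votes into a single bin (no interpolation between bins), and block normalisation is omitted. The function names are hypothetical.

```python
import math

def gradient(img, y, x):
    """Gradient at (y, x) using the [-1, 0, +1] kernel and its transpose.

    Border handling (replicating the edge pixel) is an assumption.
    """
    rows, cols = len(img), len(img[0])
    gx = img[y][min(x + 1, cols - 1)] - img[y][max(x - 1, 0)]
    gy = img[min(y + 1, rows - 1)][x] - img[max(y - 1, 0)][x]
    magnitude = math.hypot(gx, gy)
    # atan2 avoids division by zero; the modulo folds the angle into [0, 180).
    angle = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, angle

def cell_histogram(img, top, left, cell=4, bins=9):
    """Magnitude-weighted histogram of unsigned orientations for one cell."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(top, top + cell):
        for x in range(left, left + cell):
            magnitude, angle = gradient(img, y, x)
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

# Toy 4x4 cell containing a vertical edge: all gradient energy is horizontal.
edge = [[0, 0, 10, 10] for _ in range(4)]
hist = cell_histogram(edge, 0, 0)
```

For the vertical edge in this toy cell, every non-zero gradient points along the horizontal axis, so all of the weight falls into the first orientation bin.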
In our application, we used a cell size of 4 × 4 as it provided a good balance between encoded spatial information and feature dimensions, which helps speed up training. Since our images were 56 × 56 pixels, this gave us 14 × 14 cells. Block sizes of 2 × 2 cells were used for the block normalisation with a 50% overlap. This resulted in 13 × 13 × 4 normalised histograms. Each histogram had nine bins, and as such, the resulting feature vector for each image had 13 × 13 × 4 × 9 = 6084 elements.
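The arithmetic above can be reproduced with a short helper, which also makes it easy to see how the feature length changes with the cell size. The function name and its parameters are hypothetical; the sketch assumes square images, square cells, and a block stride of half the block size (the 50% overlap stated above).

```python
def hog_feature_length(image_size=56, cell_size=4, block_cells=2, bins=9):
    """Length of a HOG feature vector for a square image.

    Assumes square cells and blocks, and a block stride of half the
    block size (50% overlap).
    """
    cells_per_side = image_size // cell_size               # 56 / 4 = 14
    stride = block_cells // 2                              # 50% overlap
    blocks_per_side = (cells_per_side - block_cells) // stride + 1   # 13
    histograms_per_block = block_cells * block_cells       # 2 x 2 = 4
    return blocks_per_side ** 2 * histograms_per_block * bins

length_4 = hog_feature_length(cell_size=4)   # 13 * 13 * 4 * 9
length_8 = hog_feature_length(cell_size=8)   # 6 * 6 * 4 * 9
```

Doubling the cell size to 8 × 8 shrinks the vector from 6084 to 1296 elements, which illustrates the trade-off between spatial detail and feature dimension discussed above.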

Text to Voice Interpretation
The proposed system can notify the user of the recognised street signs in two ways. The recognised text can be either displayed visually or converted to voice. Figure 8 shows these two forms of output of the street sign recognition system.

Experimental Results
To conduct the experiments, a Dell Latitude E6420 (Dell Inc., Round Rock, TX, USA) computer running Windows 10 Professional (64-bit), powered by an Intel® Core™ i5 2.5 GHz processor (Intel, Santa Clara, CA, USA) and 8 GB of RAM (Dell Inc., Round Rock, TX, USA), was used. The system and related experiments were implemented using the MATLAB (R2016a) (MathWorks, Natick, MA, USA) image processing, computer vision, and neural network toolboxes.

Training Performance of the Neural Network
The training performance of the neural network for several training iterations is shown in Table 2. Training number 16 with 168 iterations provided the best error rate (0%), and as such was considered the optimal training number for the proposed system. Figure 9 illustrates the training performance with respect to parameters such as cross-entropy error. Receiver operating characteristic (ROC) curves and the performance of ANN training are shown in Figure 10. Perfect classification results were seen at 168 iterations.

Performance on Testing Data
Data that had not been used in the training process was used to test the performance of the neural network classifier. The testing dataset was extracted from images of Malaysian street signs using the process discussed in Section 2. Each class contained 100 test samples, resulting in a total of 1600 samples (see Figure 11).
Testing performance is shown in Figure 12. The ROC curves (Figure 12a) give the area under the curve (AUC) for all testing samples (16 classes). Most of the classes achieved a perfect AUC, as seen from the figure. The confusion matrix for the testing data is shown in Figure 12b. The overall percentage of correct classification was 99.4%. The highest levels of misclassification were observed in the class pairs of '7' and '/', '1' and '7', and '2' and 'N'. Figure 13 shows some characters that led to misclassifications. Testing performance with respect to some common metrics, along with how they were calculated based on the number of true negative (TN), true positive (TP), false negative (FN), and false positive (FP) classifications, is shown in Table 3.
Table 3. Testing performance with respect to common performance metrics.
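For readers unfamiliar with how such metrics follow from the TP, TN, FP, and FN counts, the sketch below computes a few standard ones. The selection of metrics here is an assumption based on common practice and does not necessarily match the exact set reported in Table 3; the counts in the example are made up purely for illustration and are not results from this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics derived from confusion counts for one class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # also called recall or true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts, chosen only to demonstrate the formulas.
example = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

In a multi-class setting such as the 16-class problem here, these counts are typically obtained per class in a one-versus-rest manner from the confusion matrix.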

Comparison of ROI Extraction using Different Colour Spaces
To determine whether the choice of colour space used in the ROI extraction played a significant role in the performance of our method, we compared the original method based on the RGB colour space with those using other colour spaces such as HSV, YCbCr, and CIEL*a*b* [38]. First, we manually cropped 20 images to extract the ROI. Then, we created the colour profiles for these extracted regions in each colour space. Figure 14 shows the colour profile histograms for the HSV, YCbCr, and CIEL*a*b* colour spaces. Next, for each channel in a colour space, we calculated the value range using the mean (µ) and standard deviation (σ) as [µ − 2σ, µ + 2σ]. This value range was then used to threshold the ROI from the image. Comparison results with respect to the different ROI extraction methods are shown in Table 4. As seen from the comparison results, the choice of colour space does not significantly affect the performance of the system.
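The range-based thresholding described above can be sketched as follows for a single channel. This is an illustrative sketch only: the function names and sample values are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption.

```python
from statistics import mean, pstdev

def channel_range(samples, k=2.0):
    """Return (mu - k*sigma, mu + k*sigma) for one channel's ROI pixel values.

    Uses the population standard deviation; this choice is an assumption.
    """
    mu = mean(samples)
    sigma = pstdev(samples)
    return mu - k * sigma, mu + k * sigma

def threshold_channel(channel, low, high):
    """Mark pixels whose value lies inside [low, high]."""
    return [[1 if low <= v <= high else 0 for v in row] for row in channel]

# Hypothetical channel values sampled from manually cropped sign regions.
roi_samples = [100, 110, 120, 130, 140]
low, high = channel_range(roi_samples)

# Hypothetical 2x2 channel from a new image.
mask = threshold_channel([[90, 120], [150, 100]], low, high)
```

For a multi-channel colour space, one plausible approach is to compute a range per channel and combine the per-channel masks with a logical AND, although the paper does not spell out this detail.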

Comparison of Different Feature Extraction Methods
To explore the effect of the feature extraction method, we compared the classification performance when using the original HOG features and some others that are often used in the literature: local binary patterns (LBP) [21], Haar-like [22], and bag-of-features (BoF) [23]. We used our ANN as the classifier. We considered accuracy and average time to extract features across all images to be measures of performance in this comparison. Table 5 shows the results. From this comparison, we observe that HOG provides the best accuracy and a comparable level of efficiency.

Comparison with Similar Existing Methods
To evaluate the performance of the proposed method on the Malaysian street sign database, we compared it to some similar existing methods. For example, Kamble et al. [39] discussed handwritten character recognition using rectangular HOG (R-HOG) feature extraction and used a feed-forward neural network (FFANN) and a support vector machine (SVM) for classification. Su et al. [40] also investigated the character recognition task in natural scenes. They used convolutional co-occurrence HOG (CHOG) as their feature extractor and an SVM as their classifier. Tian et al. [41] performed text recognition with CHOG and SVM. Boukharouba et al. [42] classified handwritten digits using a chain code histogram (CCH) [43] for feature extraction and an SVM for classification. Niu et al. [44] introduced a hybrid method for recognition of handwritten digits. They used a convolutional neural network (CNN) to extract the image features and fed them to a hybrid classifier, which contained a CNN and an SVM. For the purpose of comparison, we re-implemented these methods and trained and tested them on our database. The pre-processing procedure discussed above was performed to extract text from the images for all compared methods. Table 6 shows the comparison results.

Comparison of Methods with Respect to the MNIST Database
To investigate the transferability of the proposed method, we compared its performance with the above methods on a publicly available text image database. For this purpose, we used the Modified National Institute of Standards and Technology (MNIST) database [45]. As this database only contains text images, no pre-processing (as discussed in Section 2.2) was required. As such, only feature extraction and classification methods were considered here. Table 7 shows the comparison results. The execution time here refers to the average time (in seconds) for feature extraction and recognition of a single character. We observed that the method discussed in Niu et al. [44] performed best with respect to classification accuracy. The proposed method was slightly less accurate, but showed the best execution time.

Conclusions
In this paper, we presented a system for Malaysian street sign identification and interpretation in real-time. The proposed system consisted of the following steps: image acquisition, extraction of the region of interest (i.e., the street sign) from the image, extraction of text, calculation of features (histogram of oriented gradients) from the text, recognition or classification of the text (using a neural network), and the presentation of the identified text visually and verbally. Experimental results showed high performance levels (including when compared to other similar existing methods), indicating that the proposed system is effective in recognising and interpreting Malaysian street signs.
As such, it can be used as an alternative/backup to GPS-based navigation and as an aid for visually impaired individuals. In future work, we will investigate the use of deep learning techniques in the recognition system. We will also explore how this system can be used for identifying street signs in other countries and under difficult imaging conditions (such as low-lit environments at night), and extend it to other similar applications such as license plate detection and traffic sign detection. Furthermore, we will consider methods to improve detection levels, for example, the removal of deformations (such as those caused by perspective projection) as a pre-processing step.

Figure 1. Overview of the proposed system for automatic street sign identification and interpretation.

Figure 2. An example of a Malaysian street sign.

Figure 3. Extraction of the region of interest: (a) block diagram of the algorithm and (b) results at each step.

Figure 4. Extraction of text from the street sign.

Figure 5. Visualisation of dimension in the histogram of oriented gradients (HOG) feature vector.
Figure 7 shows the architecture of the neural network classifier. The training dataset contained 200 text images per class, resulting in a total of 3200 training samples. Of these samples, 70% (2240 samples) were randomly allocated to the training set. We used 15% (480 samples) each for validation and testing.

Figure 8. Presentation of recognised street signs: (a,b) are street signs, (c,d) are annotated street signs with recognised text, (e,f) are extracted text words, and (g,h) are the voice plots.

Figure 10. Neural network training performance with respect to the receiver operating characteristic (ROC) curve.

Table 1. Camera specification and image acquisition parameters.

Table 2. Neural network training performance.

Table 4. Comparison results for region of interest (ROI) extraction with different colour spaces.

Table 5. Performance comparison with different feature extraction methods.

Table 6. Performance comparison with respect to similar existing methods.

Table 7. Performance comparison on the Modified National Institute of Standards and Technology (MNIST) database.