Real-Time Pattern Recognition of GPR Images with YOLO v3 Implemented by TensorFlow

Artificial intelligence (AI) is widely used in pattern recognition and positioning. Most geological exploration applications need to locate and identify underground objects from the electromagnetic wave characteristics in ground-penetrating radar (GPR) images. Currently, few robust AI approaches can detect targets in GPR images in real time with high precision and automation. This paper proposes an approach based on you only look once (YOLO) v3 that identifies parabolic targets of different sizes as well as voids in underground soil or concrete structures. Using TensorFlow 1.13.0, developed by Google, we construct a YOLO v3 neural network to realize real-time pattern recognition of GPR images. We propose a specific encoding method for GPR image samples in YOLO v3 to improve the prediction accuracy of bounding boxes. In addition, the K-means algorithm is applied to select anchor boxes, which improves the accuracy of positioning the hyperbola vertex. In some instances, electromagnetic vacillating signals may occur, that is, multiple parabolic electromagnetic waves formed by strongly conductive objects in the soil or by overlapping waveforms. To handle them, this paper introduces a vacillating-signal similarity intersection over union (V-IoU) method. Experimental results show that V-IoU combined with non-maximum suppression (NMS) can accurately frame targets in GPR images and also reduce the number of misidentified boxes. Compared with the single shot multi-box detector (SSD), YOLO v2, and Faster-RCNN, V-IoU YOLO v3 shows superior performance even when run on a CPU. It meets real-time output requirements with an average detection speed of 12 fps. In summary, this paper proposes a simple, high-precision, real-time pattern recognition method for GPR imagery and promotes the application of artificial intelligence and deep learning in geophysical science.


Introduction
In ground-penetrating radar (GPR) engineering detection, the following three cases are the most common: (1) inspecting atypical conditions in reinforced concrete structures such as bridges, tunnels, or public roads, or counting the steel bars inside those structures; (2) locating certain underground objects, for example in archaeological research; (3) evaluating and measuring the distribution of hollows, voids, or soil firmness in highways, bridges, and tunnels. Nonetheless, after GPR detection, the location and size of a target are often judged from the operator's experience [1,2]. Actually, these kinds of evaluations using GPR images are not [...] channels. The second block of the YOLO v3 network carries one residual block, which includes zero padding, convolution, and a residual unit, together with a 128 × 3 × 3 convolutional layer; it outputs 104 × 104 feature maps with 128 channels. The third block contains two residual blocks followed by a 256 × 3 × 3 convolutional layer, and the fourth block contains many residual shortcuts to all 256 × 256 feature maps. In addition, this block concatenates the residual shortcut vectors to reduce gradient explosion and outputs 52 × 52 feature maps with 384 channels. After up-sampling, 52 × 52 feature maps are output for YOLO v3 to detect small-scale objects [20]. Similarly, the fifth block outputs 26 × 26 feature maps for detecting medium-scale targets. Finally, the network passes through further residual shortcut blocks, each including zero padding, convolution, and a residual unit, and YOLO v3 uses a 255 × 1 × 1 convolutional layer to output 13 × 13 feature maps with 255 channels for detecting large objects [21]. In general, YOLO v3 detects at three different scales, with the input down-sampled by factors of 32, 16, and 8; the first detection layer is the 82nd layer, whose stride of 32 generates 13 × 13 feature maps.
The second detection layer, which follows up-sampling, is the 94th layer, and the third detection layer is the 106th layer, which produces a feature map with dimensions 52 × 52 × 255. The overall architecture of YOLO v3 is shown in Figure 1. In addition, we used K-means clustering to select the bounding box priors in YOLO v3. Figure 2 shows part of the YOLO v3 graph exported from TensorBoard, the TensorFlow visualization API; it is the neural network connection diagram for YOLO's second up-sampling. TensorBoard shows the input and output tensor variables at each node, and its edges show the dependencies between tensor operations. Here conv2d is the abbreviation for a convolution layer or block in Figure 1. Similarly, Leaky relu corresponds to the ReLU layers in Figure 1, and batch normalization corresponds to the BN layer. All concatenation operations at each stage of YOLO v3 can be visualized through TensorBoard; for example, the attributes listed after a convolution layer indicate the input and output tensors of that layer. The loss represents the value of the current convolutional layer after passing through the optimizer.
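The relation between input size, stride, and grid size described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the choice of 3 anchors and 80 classes is an assumption (it is the configuration of the COCO-trained YOLO v3 that yields the 255 channels quoted in the text, not necessarily the class count used for GPR targets).

```python
# Sketch: grid sizes of the three YOLO v3 detection scales for a square input.
INPUT_SIZE = 416          # input width/height in pixels, as used in the text
STRIDES = (32, 16, 8)     # down-sampling factors of the three detection layers


def grid_sizes(input_size, strides=STRIDES):
    """Return the feature-map side length at each detection scale."""
    if any(input_size % s for s in strides):
        raise ValueError("input size must be divisible by every stride")
    return [input_size // s for s in strides]


def head_channels(num_anchors, num_classes):
    """Channels of one detection head: anchors x (4 box values + 1 objectness + classes)."""
    return num_anchors * (4 + 1 + num_classes)


print(grid_sizes(INPUT_SIZE))   # side lengths for strides 32, 16, 8
print(head_channels(3, 80))     # assumed: 3 anchors per scale, 80 classes
```

With these assumptions the three grids are 13, 26, and 52 cells per side, which matches the feature-map sizes given for the 82nd, 94th, and 106th layers.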

Bounding Box Encoding Strategy
Underground objects in soil are generally sensitive to electromagnetic waves because of their physical properties [22,23]. Most of them appear in the electromagnetic wave image as parabolas opening downward or as obvious energy reflections. The GPR moves along the survey line and continuously collects a series of traces (A-scans) to form an electromagnetic wave B-scan image [24]. The GPR images used for YOLO v3 training were collected in this way. As mentioned in Section 2.1, YOLO v3 outputs feature maps, or cells, at three different stages, and each bounding box is responsible for multiple categories [25]. Suppose that the input GPR image size is again 416 × 416, as shown in Figure 3a; the original picture can then be divided into 13 × 13 cells. The cell that contains the parabola vertex M in the GPR image is responsible for predicting the corresponding target. When annotating GPR image samples, we make the center of the ground truth box (rectangle A_1B_1C_1D_1) correspond to the position of the parabola apex. The red box in Figure 3a contains the midpoint of the target wave; rectangle A_1B_1C_1D_1, drawn as a red solid line, is the ground truth bounding box, and ABCD, drawn as a red dotted line, is the predicted box. Figure 3b shows how the ground truth box is encoded in YOLO v3. Point A_1 is the top-left corner of the box; t_x and t_y are the pixel position of point A_1 in the GPR image. Z_x and Z_y denote the pixel width and height of each cell, respectively. The width B_1D_1 is denoted t_w and the height C_1D_1 is denoted t_h. As shown in Figure 3c, each bounding box is assigned one object score or confidence P_i ∈ {0, 1}, four box coordinates (t_x, t_y, t_w, t_h), and one class score S_i. Here, S_i is encoded in a way similar to one-hot encoding, with S_i ∈ [0, 1]. If S_i equals 0, the current target is not detected in the GPR image; if S_i equals 1, the current target is present.
Finally, the feature map corresponds to 13 × 13 cells, and the encoded bounding box output is a tensor of shape (13 × 13, 6). For n GPR image samples, the encoded bounding boxes form a tensor of shape (n, 13 × 13, 6). Note that for non-hyperbolic targets, such as the voids detected below, we use the original encoding method of YOLO v3.
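As an illustration of this encoding, the following minimal sketch builds the (13 × 13, 6) label for one image and stacks several images into an (n, 13 × 13, 6) array. It is not the authors' implementation: the function names, the order of the six values in each row, and the convention of passing the vertex position explicitly are assumptions made for the example.

```python
import numpy as np

GRID = 13          # cells per side for a 416 x 416 input at stride 32
IMAGE_SIZE = 416   # input size in pixels


def encode_sample(targets, grid=GRID, image_size=IMAGE_SIZE):
    """Encode the targets of one image as a (grid * grid, 6) array.

    Each target is (vertex_x, vertex_y, t_x, t_y, t_w, t_h) in pixels.
    Row layout (an assumed ordering): [P, t_x, t_y, t_w, t_h, S].
    """
    cell = image_size / grid                    # Z_x = Z_y for a square image
    label = np.zeros((grid * grid, 6), dtype=np.float32)
    for vx, vy, tx, ty, tw, th in targets:
        col = min(int(vx // cell), grid - 1)    # cell column holding the vertex
        row = min(int(vy // cell), grid - 1)    # cell row holding the vertex
        label[row * grid + col] = (1.0, tx, ty, tw, th, 1.0)
    return label


def encode_dataset(samples):
    """Stack per-image labels into an (n, grid * grid, 6) array."""
    return np.stack([encode_sample(targets) for targets in samples])


# One hyperbola whose vertex lies at pixel (208, 100).
labels = encode_dataset([[(208, 100, 158, 60, 100, 80)], []])
print(labels.shape)
```

Only the cell that contains the vertex carries a non-zero row, which mirrors the rule that this cell is responsible for predicting the target.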

Anchor Box Selection by K-Means Clustering
K-means is an iterative algorithm that divides data into K predefined clusters and assigns each point to one specific group [26]. When YOLO v3 is trained on GPR sample data, anchor boxes help control over-fitted recognition results for soil targets: in the high-frequency electromagnetic wave reflection signal, when two targets are relatively close together, their parabola vertices are easily assigned to the same bounding box. K-means defines the size of the bounding boxes through cluster analysis. It keeps the clusters as distinct as possible while minimizing the sum of squared distances between the data points and the centroid of their cluster [27,28]. First, we define the value of K and initialize the centroids by shuffling the data; then we keep iterating until the centroids no longer change. This alternating procedure is a form of expectation maximization [29]. Assuming that there are m samples, we introduce an objective function F over the K clusters. If the point t(x, y) belongs to cluster k, then σ_ik = 1; otherwise σ_ik = 0. In the first case, µ_k can be considered the centroid of t(x, y). The minimization alternates two steps: with the centroids fixed, each data point t(x, y) is assigned to its closest cluster; then F is differentiated with respect to µ_k and, according to Equation (3), each cluster centroid is recalculated to reflect the new point allocation.
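For reference, the standard K-means objective and centroid update can be written with the symbols used above. This is the textbook form, given here as an assumption about what the omitted equations express; the index ranges and the use of the squared Euclidean norm are not confirmed by the text.

```latex
F = \sum_{i=1}^{m} \sum_{k=1}^{K} \sigma_{ik} \, \lVert t_i - \mu_k \rVert^2,
\qquad
\sigma_{ik} =
\begin{cases}
1 & \text{if } k = \arg\min_{j} \lVert t_i - \mu_j \rVert^2 \\
0 & \text{otherwise}
\end{cases}

\frac{\partial F}{\partial \mu_k}
  = -2 \sum_{i=1}^{m} \sigma_{ik} \, (t_i - \mu_k) = 0
\quad\Longrightarrow\quad
\mu_k = \frac{\sum_{i=1}^{m} \sigma_{ik} \, t_i}{\sum_{i=1}^{m} \sigma_{ik}}
```

The last expression states that each centroid is the mean of the points currently assigned to it, which is the recalculation step described in the paragraph.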
K-means uses the distance between data points as the criterion for selecting anchor boxes. The iteration is initialized at the beginning. To avoid the F function settling in a local optimum rather than the global optimum, this paper runs the K-means algorithm with a variety of centroid initializations [30,31]. After filtering by K-means, the encoded YOLO v3 label for a GPR image gains a data dimension. As shown in Figure 4, Y_label represents the encoded bounding box without the added dimension; it is the transpose of the data matrix in Figure 3c. Y_K−Means represents the encoded bounding box to which the n anchor boxes output by K-means clustering have been added.
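Running K-means from several initializations and keeping the run with the smallest F can be sketched as follows. This is a plain NumPy illustration under stated assumptions, not the code used in the paper: the function names, the number of restarts, and the convergence test are all hypothetical choices.

```python
import numpy as np


def kmeans_once(points, k, rng, max_iter=200):
    """One K-means run from a random initialisation; returns (centroids, labels, F)."""
    centroids = points[rng.permutation(len(points))[:k]].copy()
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    cost = dists[np.arange(len(points)), labels].sum()
    return centroids, labels, cost


def kmeans_restarts(points, k, n_init=10, seed=0):
    """Run K-means n_init times and keep the solution with the smallest cost F."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])


# Toy box corner data forming two obvious groups.
data = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
centroids, labels, cost = kmeans_restarts(data, k=2)
print(centroids)
```

Keeping the lowest-cost run is a common way to reduce the dependence on a single random initialization; it does not guarantee the global optimum.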

Principle Analysis of V-IoU Processing with NMS
Non-maximum suppression (NMS) is commonly applied in detection algorithms to extract the window with the highest score, for example in sliding-window feature extraction, pedestrian detection for autonomous driving, and vehicle recognition [32]. Similarly, in a GPR image, after the feature maps produced by the YOLO v3 convolutional layers in the third stage have been processed by the classifier, some underground targets have a large number of bounding boxes that cross each other or contain the same parabola midpoint in one cell. The goal of NMS is to remove the redundant detected boxes and keep the best one. This requires the intersection over union (IoU) score, a standard performance metric for image category segmentation problems [33]. For a given set from an image, the IoU defined by Equation (5) gives the ratio of the intersection to the union of the predicted bounding box and the ground truth bounding box [34]. Suppose t represents the probability outputs of the pixel set N after filtering by the activation function in the GPR image, and Y denotes the data set composed of ground truth bounding boxes; Y ∈ {0, 1}^M marks 0 for non-target pixels and 1 for target pixels.
In YOLO v3, NMS first calculates the confidence C of each proposal region and sorts the list of bounding boxes. Second, NMS selects the predicted box with the largest score and calculates the IoU between it and each of the remaining bounding boxes. If the IoU is greater than the predefined threshold, NMS deletes that bounding box [35]. This completes one iteration, in which NMS selects the maximum-score bounding box for one target. In the second iteration, the highest-score box among the remaining boxes is selected and those that exceed the predefined IoU threshold are deleted, and so on until all possible targets in the GPR image have been picked up.
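The iteration described above is the usual greedy form of NMS, which can be sketched in a few lines. The code below is a generic illustration rather than the implementation used in this paper; it assumes boxes given as corner coordinates (x_min, y_min, x_max, y_max) and a hypothetical default threshold of 0.5.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x_min, y_min, x_max, y_max)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]      # highest confidence first
    keep = []
    while order.size:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Drop every remaining box that overlaps the selected one too much.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

In this toy input the second box overlaps the first by more than the threshold and is removed, while the third box does not overlap and is kept as a separate target.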
After the YOLO v3 residual network and the 1 × 1 convolutional layer, a large number of bounding boxes are generated on the region proposal area output by the feature map. As shown in Figure 5a, S_0 denotes the starting position of the GPR; S_1, S_2, and S_3 represent the soil surface positions of the parabolic electromagnetic wave signals generated by three iron cylinders buried at depths of 0.25 m, 0.3 m, and 0.35 m, respectively. The soil dielectric constant is about 6.5 and the electrical conductivity is about 0.002 S/m. It can be seen that there are numerous predicted bounding boxes around each target. Now we focus on one parabola. When the YOLO v3 algorithm recognizes a target in a GPR image, it is easy to misidentify a parabola that belongs to one object as multiple targets because of the oscillating electromagnetic wave signal [36]. The points N, P, and Q in Figure 5b represent three parabolas generated by a strongly conductive target along the depth direction of the soil. The number beside each SOIL label is the probability of being identified as a target, with a maximum of 1 and a minimum of 0. YOLO v3 recognizes and locates these as three adjacent targets, although there is only one target and their IoU values fall within the predefined threshold range. Therefore, this paper proposes the V-IoU principle, which merges the vacillating signals of similar waves in GPR images. Assume that the ground truth box (red box) is located at (t_xn, t_yn, t_wn, t_hn) and that the two other boxes produced by the vacillation of the GPR echo signal are predicted at (t_x1, t_y1, t_w1, t_h1) and (t_x2, t_y2, t_w2, t_h2). Then the coordinate of point N of the ground truth box can be written as (t_xn + t_wn/2, t_yn − t_hn/2). Similarly, the pixel coordinate of P is (t_x1 + t_w1/2, t_y1 − t_h1/2) and that of Q is (t_x2 + t_w2/2, t_y2 − t_h2/2).
First, it is worth noting that we define a horizontal threshold β here, applied to the quantity (t_xn − t_x1) + t_wn. If the parabola midpoints, that is, the points N, P, and Q along the soil depth, satisfy the horizontal and vertical critical values, we lift the limitation of the IoU threshold and merge those predicted boxes. This is the core idea of V-IoU, for example with D_i ∈ [−α, α] and i ∈ R.
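The merging idea can be illustrated with a small sketch. The exact inequality that defines V-IoU is not recoverable from the text, so the rule below is only one plausible reading: two boxes are treated as the same target when their vertex points differ by at most a horizontal tolerance `beta` and a vertical tolerance `alpha`, regardless of their IoU. Both parameters, their units, and the function names are hypothetical.

```python
def vertex(box):
    """Vertex point of a box (t_x, t_y, t_w, t_h), following the expression in the text."""
    t_x, t_y, t_w, t_h = box
    return (t_x + t_w / 2.0, t_y - t_h / 2.0)


def merge_vacillating(boxes, scores, beta, alpha):
    """Keep one box per group of boxes whose vertices lie close together.

    beta  -- hypothetical horizontal tolerance between vertices, in pixels
    alpha -- hypothetical vertical tolerance between vertices, in pixels
    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        xi, yi = vertex(boxes[i])
        absorbed = False
        for j in keep:
            xj, yj = vertex(boxes[j])
            # Merge regardless of IoU when both offsets are inside the tolerances.
            if abs(xi - xj) <= beta and abs(yi - yj) <= alpha:
                absorbed = True
                break
        if not absorbed:
            keep.append(i)
    return keep


# N, P, Q stacked in depth below one target, plus an unrelated box further along the line.
boxes = [(100, 50, 40, 20), (102, 80, 40, 20), (98, 110, 40, 20), (300, 50, 40, 20)]
scores = [0.9, 0.6, 0.5, 0.8]
print(merge_vacillating(boxes, scores, beta=10, alpha=70))
```

In this toy case the three vertically stacked boxes collapse onto the highest-scoring one, while the distant box survives as a separate target. A full pipeline would apply ordinary NMS to the boxes that remain.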

Loss Function and Learning Rate Adaptive Optimizer
The YOLO v3 loss function in this paper is composed of mean-variance and error terms [37]. Specifically, the offset loss is divided into three parts: the prediction error of the parabola midpoint coordinates in the GPR image, gprErr; the V-IoU prediction error, viouErr; and the classification error, clsErr [38].
Here the weight γ_gpr of gprErr is preset to 5 and the weight γ_viou of viouErr to 0.5, to correct for large targets dominating less than small targets during detection. The total loss is given by Equation (6), and after derivation its three parts are given by Equations (7) to (9). In Equation (7), x̂_l, ŷ_l, ŵ_l, and ĥ_l are the values predicted by YOLO v3, and x_i, y_i, w_i, and h_i are the training label values; I^tar_ij indicates whether the object falls into the j-th position of the i-th bounding box lattice, and its value is either 1 or 0.
In Equation (8), Ĉ_i is the predicted value; likewise, in Equation (9), P̂_i is the predicted value and P_i is the training label value. Figure 6 shows a graph of the YOLO loss function node in TensorBoard; its input elements are the loss outputs of conv2d_59, conv2d_67, and conv2d_75, where input_1, input_2, and input_3 correspond to gprErr, viouErr, and clsErr in Equation (6), respectively. When gradient descent is used to optimize the YOLO v3 loss, a large gradient may still exist even when the loss is close to its minimum. In that case, a global learning rate causes serious problems, such as slow gradient convergence or an unstable loss value. To solve this problem, this article uses the Adam algorithm, an adaptive learning-rate algorithm that improves on RMSProp and was proposed by Kingma and Ba in 2014 [39]. First, we set a default learning rate (0.001 in TensorFlow) and two exponential decay rates for the moment estimates (defaults 0.9 and 0.999 in TensorFlow); then we initialize the moment variables and the time step count; finally, we continuously apply bias correction to the moment estimates to update the weights and the effective learning rate. Figure 7 shows two structural diagrams of Adam optimizers in TensorBoard.
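The Adam update described above can be written out for a single parameter to make the role of the two decay rates and the bias correction explicit. This is a generic sketch of the published algorithm, not the TensorFlow implementation; the toy loss and the function name are illustrative, and `eps=1e-8` is an assumed value for the numerical-stability constant.

```python
import numpy as np


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new (w, m, v). t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias correction of the first moment
    v_hat = v / (1.0 - beta2 ** t)               # bias correction of the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


# Toy loss f(w) = (w - 3)^2, used only to show the update moving w toward 3.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 101):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```

Because the step is normalized by the square root of the second-moment estimate, its size stays close to the learning rate while the gradient keeps a consistent sign, which is what makes the method less sensitive to the raw gradient magnitude.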

Experimental Parameters
The GPR used in this paper was the GX750-HDR from the Swedish company Guideline GEO AB (Sundbyberg, Sweden). The number of samples collected per channel was 412, the sampling interval was 0.015 m, the preset coupling distance of the GPR antenna was 0.14 m, and the preset diameter of the ranging wheel was 17 cm. The GPR data preprocessing software was REFLEXW 7.5, copyright K.J. Sandmeier. The training data set adopted the COCO data format [40,41]. The GPR image targets for YOLO v3 training were labeled with the visual object tagging tool (VoTT) 2.1.0. The operating system was Windows 10, and the processor was an Intel [...] remove direct ground wave, (5) remove high- and low-frequency signals, (6) horizontal smoothing. A total of 331 GPR image samples were collected in the experiment, of which 70% formed the training set, 20% the validation set, and 10% the test set [42]. In the YOLO v3 training stage, the batch size and subdivision of the training sets were both preset to 20. The number of epochs in each stage was preset to 51 and the learning rate was predefined as 0.001.
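A 70/20/10 split of the 331 samples can be produced as in the sketch below. The shuffling seed, the rounding rule, and the function name are assumptions for illustration; the exact per-set counts used in the experiment are not stated in the text.

```python
import random


def split_dataset(items, train=0.7, val=0.2, seed=0):
    """Shuffle items and split them into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


# Sample indices stand in for the GPR image files.
train_set, val_set, test_set = split_dataset(range(331))
print(len(train_set), len(val_set), len(test_set))
```

The test set takes whatever remains after the training and validation sets are cut, so the three sets always cover every sample exactly once.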

Anchor Boxes Selection by K-Means Clustering
After all hyperbola targets in the GPR images had been labeled with the VoTT tool, the training data set of ground truth images yielded 386 rectangular boxes containing parabolas. The location of a ground truth box is given by the four corner coordinates of the rectangle: (x_min, y_max), (x_min, y_min), (x_max, y_max), and (x_max, y_min). Obviously, only the four parameters x_min, y_min, x_max, and y_max are needed for the clustering and silhouette coefficient analysis [43]. The silhouette coefficient is an important evaluation index for clustering performance. Its value lies in [−1, 1]; the closer it is to 1, the better the cohesion and separation of the K-means model. In Figure 8a, the number of K-means clusters, or centroids, was set to 2 and the maximum number of iterations was predefined as 200; after normalizing the x_min, y_min data, the cluster groups around the centroids were still relatively distinct. The silhouette coefficient returned by the silhouette_score function of the sklearn module was 0.4839. By comparison, in Figure 8b, when the number of centroids was set to 3, the clustered groups of the standardized data showed high separation and low cohesion. Similarly, Figure 8c,d shows the clustering of the x_max and y_max data when the number of clusters is set to 2 and 3. In this case, the silhouette coefficient returned by the silhouette_score function was 0.4868. After calculation, we finally obtained four anchor box values, consisting of x_min, y_min, x_max, and y_max, for the training configuration parameters.
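To make the silhouette coefficient concrete, the sketch below recomputes it directly from its definition, s = (b − a) / max(a, b), averaged over all points. The paper obtains this value from scikit-learn; the NumPy version here is only an illustrative re-implementation with a hypothetical function name, and it assumes Euclidean distance and at least two clusters.

```python
import numpy as np


def silhouette_coefficient(points, labels):
    """Mean silhouette value s = (b - a) / max(a, b) over all points."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue                     # singleton cluster: s is defined as 0
        # a: mean distance to the other members of the point's own cluster.
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


# Two tight, well-separated groups give a value close to 1.
pts = [[0, 0], [1, 0], [10, 0], [11, 0]]
print(silhouette_coefficient(pts, [0, 0, 1, 1]))
```

Values near 0.48, as reported for the GPR box coordinates, indicate clusters that are separated but much less compact than in this toy example.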

V-IoU and NMS Training Loss Performance
Following the derivation in Section 2.5, the IoU-YOLO v3 loss function contains three parts. The first part is the average error loss of the centroid position (t_x, t_y) of the GPR bounding boxes relative to the ground truth boxes. Here, the x coordinate of the predicted bounding box is denoted b̂_x, which equals sigmoid(t_x) + C_x, and its y coordinate is denoted b̂_y, which equals sigmoid(t_y) + C_y. After weighting, the smaller the loss value, the closer the predicted centroid (b̂_x, b̂_y) is to the true value (b_x, b_y) and the better the prediction performance of the logistic regression function. In the first training phase of YOLO v3 with V-IoU and NMS, when the epoch was less than 10, the loss value decreased very quickly. In the second stage, the convergence of the loss function became steady and slow. Compared with the blue curve without V-IoU in Figure 9, the training performance of the YOLO loss function appeared equivalent in the two stages, but the completion time of the entire 83 epochs was 3 h and 57 min. This is because local optima produced during training reduce the efficiency with which back propagation updates the function weights. For this reason, as can be seen from Figure 10, the loss value of IoU + NMS kept moving back and forth between 22.5 and 40, and three local optimal solutions appear, at the positions indicated by the five green arrows; V-IoU + NMS, however, was relatively stable, so it is easier for gradient descent to find the global optimal solution. To prevent over-fitting, the loss function was considered sufficiently converged and iteration was stopped at epoch 83.
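The centroid decoding used in the first loss term can be sketched as follows. The code assumes, as in the standard YOLO formulation, that C_x and C_y are the offsets of the responsible cell measured in grid units; that interpretation and the function names are assumptions, since the text only gives the expressions sigmoid(t_x) + C_x and sigmoid(t_y) + C_y.

```python
import math


def sigmoid(value):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def decode_center(t_x, t_y, c_x, c_y):
    """Return (b_x, b_y) in grid units for a cell whose offset is (c_x, c_y)."""
    return sigmoid(t_x) + c_x, sigmoid(t_y) + c_y


def center_error(pred, truth):
    """Squared distance between the predicted and true centres."""
    return (pred[0] - truth[0]) ** 2 + (pred[1] - truth[1]) ** 2


pred = decode_center(0.0, 0.0, 6, 3)       # raw outputs of 0 land in the cell middle
print(pred, center_error(pred, (6.4, 3.6)))
```

Because the sigmoid keeps the offset inside (0, 1), the decoded centre cannot leave the cell that is responsible for the target.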

YOLO v3 Detection Effect
As described for the YOLO v3 network architecture in Section 2.1, YOLO v3 detects on three feature maps of different scales, output after the input image has been down-sampled by factors of 32, 16, and 8. The testing data sets contain three scenes for the real-time detection performance test, covering single-class and multi-class pattern recognition with hyperbolic and void features. The evaluation index is the mean average precision (mAP) over the training batches [44,45]. Let P denote the precision, the proportion of predicted targets that are actual targets, and R the recall rate. Then P = TP / (TP + FP) and R = TP / (TP + FN), where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives; the mAP is calculated as mAP = (Σ AP) / N_classes, where AP is the average precision of one class and N_classes is the number of classes. Figure 11 shows the improved detection effect of YOLO v3 with V-IoU on single-class targets. The verification data set shown in Figures 11-13 was collected at the Soils research key Laboratory of South China Agricultural University. First, we detected the object's physical position with the GPR and then marked the hyperbola vertex using the marking button on the MALA GPR controller. Finally, we used the difference between the midpoint of the identified rectangle and the marker's value to determine the ground truth. Here, the V-IoU threshold was preset to 0.50. As can be seen from the figure, although some targets are small in the GPR image, the YOLO v3 detector can still recognize them. This is because, compared with YOLO v2, the v3 version has three detection scales: one down-sampled 13 × 13 feature map and two up-sampled feature maps of 26 × 26 and 52 × 52. In addition, YOLO v3 adds a series of 3 × 3 and 1 × 1 convolutional layers that appropriately increase the number of channels. Overall, in this situation, a total of 132 hyperbolas in GPR images were tested. The number of correct detections was 121, the number of missed targets was 7, and the number of false alarms was 10.
When there were multi-class targets in the detected GPR image, the predicted boxes could distinguish and identify the parabolas and voids. For some parabolas with multiple overlapping signals, the vertex of the curve was well positioned, as shown in Figure 12. Obviously, the less electromagnetic interference or noise there is in the GPR image, the better the recognition and location performance. Targets that are shallow relative to the soil surface had noticeably higher recognition scores. It can be seen that there were no misidentified boxes, and all targets could be identified and located at the parabola midpoint at overlapping positions. Figure 12 shows that a parabola with signal oscillation caused by highly conductive targets can be identified and located by the YOLO v3 detector with V-IoU.
Overall, in this multi-class situation, a total of 82 hyperbolas in GPR images were tested. The number of correct detections was 62, the number of missed targets was 4, and the number of false alarms was 5.
In engineering applications, we often need to count the metal bars in concrete structures. It can be seen from Figure 13 that for single-layer steel bars the predicted bounding boxes are positioned accurately, but for multi-layer reinforced concrete structures some bars are missed. After many experiments and data statistics, taking the number of hyperbolas as the performance index, the YOLO v3 recognition method proposed in this paper predicts the number of ground truth targets in a GPR image with 90% accuracy, and its position error is less than 10% of the length unit. When counting bars in concrete structures, a total of 192 hyperbolas in GPR images were tested. The number of correct detections was 175, the number of missed targets was 11, and the number of false alarms was 8. Overall, YOLO v3 achieves satisfactory performance when recognizing and positioning electromagnetic wave features in GPR images.
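The precision and recall definitions can be applied directly to the detection counts reported for the three scenes. The sketch below treats correct detections as true positives, false alarms as false positives, and missed targets as false negatives; that mapping is an assumption made for illustration, and the resulting ratios are not figures reported in the paper.

```python
def precision(tp, fp):
    """Fraction of predicted targets that are real targets."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of real targets that were detected."""
    return tp / (tp + fn)


def mean_average_precision(ap_per_class):
    """mAP: the average precision (AP) averaged over all classes."""
    return sum(ap_per_class) / len(ap_per_class)


# (correct detections, false alarms, missed targets) as reported for each scene.
scenes = {
    "single class": (121, 10, 7),
    "multi-class": (62, 5, 4),
    "metal bars": (175, 8, 11),
}
for name, (tp, fp, fn) in scenes.items():
    print(f"{name}: precision={precision(tp, fp):.3f}, recall={recall(tp, fn):.3f}")
```

Precision penalizes false alarms and recall penalizes missed targets, so the two values together describe the detector more completely than a single count of correct detections.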

Learning Rate and Mean Average Precision Comparison
The learning rate directly affects the convergence of YOLO v3 training, and the batch size affects the generalization performance. Earlier, we discussed the Adam adaptive algorithm for updating the global learning rate. In TensorFlow, we set the initial learning rate parameters to the same value. Here we evaluate the model optimization of SSD, Faster-RCNN, and VIoU-YOLO v3 through the change of the learning rate over the training epochs. As shown in Figure 14, the learning rates of SSD, Faster-RCNN, and VIoU-YOLO v3 changed between epochs 52 and 72. YOLO v3 had converged to a stable value by epoch 73, which allowed the updated weight of the loss value in TensorFlow to be reduced to the global threshold in a shorter time. The change for SSD is very close to that of VIoU-YOLO v3, but Figure 15 shows the same situation as Figure 9: the loss value of the SSD model easily converges to a local optimum as the number of training iterations increases. After comparing the learning rate and the loss value, the convergence speed of YOLO v3 with V-IoU is the more favorable.
Furthermore, we compared the mAP of SSD, Faster-RCNN, YOLO v2, and YOLO v3 under different V-IoU (or IoU) thresholds and scenes. We used 300 GPR image samples to generate Table 1. Here, mAP_50 means that the IoU threshold is preset to 0.5, and mAP_75 that it is preset to 0.75. Similarly, mAP_sc, mAP_mc, and mAP_metal_bars represent the single-class scene, the multi-class target detection scene, and the scene containing only single-layer metal bars, respectively. As shown in Table 1, when the V-IoU threshold was 0.50, YOLO v3 with darknet-53 as the backbone achieved a maximum mAP of 83.16, while SSD with ResNet-34 as the backbone achieved an mAP of 75.66. The mAP scores of Faster-RCNN, YOLO v2, and YOLO v3 are roughly comparable. When the IoU threshold was 0.75, the mAP scores of YOLO v3 and VIoU YOLO v3 were 77.15 and 75.90, respectively; SSD achieved the maximum mAP score of 79.80.
In the detection of GPR images with multi-class targets, YOLO v3 clearly achieved a favorable mAP score. In the single-class scene, V-IoU YOLO v3 scored 83.17; in addition, when detecting underground metal bars, YOLO v3 achieved the highest mAP score of 79.90, while V-IoU YOLO v3 still scored 76.10. In general, V-IoU YOLO v3 achieves the best performance across the three different real-time scenes.

Real-Time Performance and fps Testing
To test the real-time detection speed of YOLO v3, we randomly selected five batches from the 331 GPR images of size 416 × 416; the batches contained 100, 150, 200, 250, and 300 images, respectively, and the mAP values in Table 1 were taken as reference. The processor was again an Intel(R) Gold 6130 CPU at 2.10 GHz. As shown in Figure 16, when the batch size was 200, the detection speed of SSD reached 11 fps. The detection speed of Faster-RCNN was not ideal in any batch, with a maximum of just 5 fps. It can also be seen from Figure 16 that the average detection speed of YOLO v2 is 5 fps. The fastest detection speed of YOLO v3 and VIoU-YOLO v3 reaches 15 fps, and their average is around 12 fps. In other words, when a vehicle is equipped with the GPR device, its survey speed can reach between 10 km/h and 20 km/h. Consequently, the VIoU-YOLO v3 detection method proposed in this paper can fulfill real-time detection requirements.
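A frame rate of this kind is typically obtained by timing the detector over each batch and dividing the number of images by the elapsed time. The sketch below shows that measurement with a placeholder detector; `fake_detect` and both helper functions are hypothetical and stand in for the trained model, so the printed value says nothing about the speeds reported above.

```python
import time


def measure_fps(detect, images):
    """Run detect on every image and return the number of images processed per second."""
    start = time.perf_counter()
    for image in images:
        detect(image)
    elapsed = time.perf_counter() - start
    return len(images) / elapsed


def mean_fps(detect, batches):
    """Average the per-batch fps over several batches of images."""
    rates = [measure_fps(detect, batch) for batch in batches]
    return sum(rates) / len(rates)


def fake_detect(image):
    """Placeholder for a detector call; sleeps briefly to simulate inference time."""
    time.sleep(0.001)


batches = [[None] * size for size in (10, 15, 20)]
print(round(mean_fps(fake_detect, batches), 1))
```

Averaging per-batch rates, as done here, weights every batch equally; dividing the total image count by the total time would weight larger batches more heavily.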

Conclusions
In this paper, YOLO v3 was used to build a neural network detector that achieves real-time pattern recognition of GPR images. Implemented in TensorFlow, it can be applied to actual underground detection engineering with meaningful accuracy and robustness, although this work is limited by the small number of samples and of target types. Overall, this paper developed an innovative research application of an artificial intelligence algorithm in the field of electromagnetic wave detection. The main conclusions are as follows: (1) We redefined the encoding approach of YOLO v3 and proposed a labeling technique that uses parabola vertices as feature points; this provides a high-precision encoding technique for locating targets in GPR images. (2) We proposed the principle of V-IoU: when the position of a parabola vertex is within a certain range, the IoU threshold limitation is lifted. This method effectively reduces the false recognition rate caused by electromagnetic interference.

Conflicts of Interest:
The authors declare no conflict of interest.