Pavement Distress Detection Using Three-Dimension Ground Penetrating Radar and Deep Learning

Abstract: Three-dimensional ground penetrating radar (3D GPR) is a non-destructive testing technology for pavement distress detection, and its horizontal plane images provide a unique perspective for the task. However, a 3D GPR collects thousands of horizontal plane images per kilometer of investigated pavement, and existing detection methods based on GPR images are time-consuming and prone to subjective judgment. To solve this problem, this study used deep learning methods and 3D GPR horizontal plane images to detect pavement structural distress, including cracks, repairs, voids, poor interlayer bonding, and mixture segregation. Two deep learning models, CP-YOLOX and SViT, were used. A dataset for anomalous waveform localization (3688 images) was first created by pre-processing 3D GPR horizontal plane images. A CP-YOLOX model was then trained to localize anomalous waveforms. Five SViT models with different numbers of encoders were adopted to classify the anomalous waveforms using the localization results from the CP-YOLOX model. The numerical experiments showed that 3D GPR horizontal plane images have the potential to assist pavement structural distress detection. The CP-YOLOX model achieved 87.71% precision, 80.64% mAP, and a detection speed of 33.57 images/s in locating anomalous waveforms. The optimal SViT achieved classification accuracies of 63.63%, 68.12%, and 75.57% for the 5-category, 4-category, and 3-category datasets, respectively. The proposed models outperformed other deep learning methods on distress detection using 3D GPR horizontal plane images. In the future, more radar images should be collected to improve the accuracy of SViT.


Introduction
Three-dimensional ground penetrating radar (3D GPR) is an emerging non-destructive inspection technology that is efficient, accurate, and multi-dimensional [1,2]. It has been a major tool for pavement distress detection and pavement condition evaluation [3][4][5]. Pavement distress detection is defined as the process of classifying and locating instances of pavement distress in images or videos. Compared with 2D GPR, 3D GPR uses stepping frequency and antenna array technologies to collect the full structure data of a pavement section. The resulting 3D data can be used to detect internal pavement distress, such as cracks, repairs, voids, poor interlayer bonding, and mixture segregation [6]. However, a 3D GPR can produce thousands of radar images per kilometer in different dimensions. GPR images have been processed using traditional machine learning algorithms in several studies [7]. Rebecca M.W. et al. [8] combined support vector machines (SVMs) and hidden Markov models (HMMs) for crevasse detection in ice sheets. Zhou et al. [9] combined SVM with H-Alpha decomposition for subsurface target classification in GPR data. Nevertheless, the existing processing methods are error-prone and unreliable for pavement distress detection using 3D GPR data.
In the last decade, machine learning has produced breakthroughs due to the rapid development of deep learning. Deep learning has provided excellent performance in the fields of image recognition, speech recognition, and information security [10][11][12]. In particular, convolutional neural networks (CNNs) have achieved excellent results in object detection thanks to their powerful feature extraction architecture [13,14]. Additionally, the vision transformer (ViT) model, which has emerged in the last two years, has also achieved remarkable success in object detection [15,16]. Such cases provide a new idea for GPR image processing [17]. For example, Liu et al. [18] detected and located reinforced steel bars in concrete using GPR images and a Single Shot MultiBox Detector (SSD) model with good accuracy and detection speed. Li et al. [19] and Liu et al. [20] achieved the automatic detection of concealed cracks and voids, respectively, by using the You Only Look Once (YOLO) model and 3D GPR. Sha et al. [21] proposed three CNN models to classify, localize, and measure structural cracks and potholes in asphalt pavements using GPR images. Hou et al. [22] proposed a data enhancement method based on a convolutional self-encoder structure, which significantly improved the accuracy of crack classification. Yan et al. [23] proposed a pavement distress detection model based on a faster region convolutional neural network (Faster R-ConvNet), which reduced the ratio of missed and false detections. In addition, Sha et al. [24] used cascaded CNN models to overcome the low accuracy of traditional CNNs in identifying low-resolution GPR images. Tong et al. and Gao et al. [25][26][27] used GPR images to identify, locate, measure, and reconstruct 3D models of internal cracks by building CNN models. They also developed a Faster R-ConvNet model to accurately identify internal pavement distress (reflection cracks, water-damage pits, and uneven settlements). Moreover, GPR signals were input directly into a network-in-network architecture for training and testing, and the results showed that the method was effective in detecting cracks, water-damage pits, and uneven settlements. Long [28] used a 3D-to-2D data transformer for reverse-time offset imaging and adopted a single-shot detector to investigate subsurface structures. Wang et al. [29] implemented GPR data enhancement using cycle generative adversarial networks and localized radar hyperbolic waveforms with a Faster R-ConvNet. Kim et al. [30] combined GPR images from different channels into single images for pavement distress detection, and the results showed that the method effectively reduced the error rate. Omwenga M. M. et al. [31] proposed a deep reinforcement learning (DRL)-based autonomous cognitive GPR (AC-GPR) to achieve the automatic detection of subsurface targets with superior accuracy and speed.
The waveform characteristics of pavement distress vary from one channel to another in a 3D GPR. The majority of the reported studies focus on longitudinal radar images, since most distresses are obvious in these images. However, these methods ignore the fact that 3D GPR captures internal pavement information in three dimensions and that horizontal radar images can also provide a unique view for pavement distress detection. Very few studies have addressed this problem.
Motivated by this gap, this study aims to detect pavement distress based on the waveform characteristics of different distresses in horizontal radar images. Anomalous waveforms in horizontal radar images are located by a CP-YOLOX model, which is a modified version of You Only Look Once X (YOLOX) [32]. On this basis, to investigate the capacity of horizontal GPR images to represent different pavement distresses, the localized regions from the CP-YOLOX model were cropped to build classification datasets according to three class-membership strategies. Five SViT models, which are simplified versions of the Vision Transformer [16], classify anomalous waveforms into one of the possible distress categories. The objective of this study is to identify distress types from horizontal radar images using deep learning methods and to support future distress detection that combines horizontal and longitudinal radar images.
The rest of this paper is organized as follows. Section 2 presents the methods, including the collection of 3D GPR images, anomalous waveform localization using the CP-YOLOX model, and distress classification using SViT models. Section 3 presents the numerical experiment results. The conclusions are summarized in Section 4.

Acquisition and Pre-Processing GPR Images
A GeoScope 3D Radar system was used to collect 3D GPR data. It mainly consisted of a GeoScope™ MK IV mainframe, a DXG1212 shallow ground-coupled antenna array, the RT3D acquisition software, and the 3dr-Examiner data processing software, as shown in Figure 1. The antenna matching mode was a conventional mode with 12 channels; the trigger spacing was 5 cm; the time window was 25 ns; and the dwell time was 3 µs. Using a stepping frequency of 100~3000 MHz and antenna array technology, the 3D GPR captured the full range of wavelengths of pavement interior information during data acquisition, achieving a balance between depth and resolution in a single acquisition.
The investigated road sections were on the Zhangshu-Ji'an Highway, Jiangxi, China. Structure types I-IV in Figure 2 show the four main types of pavement structures on the highway. The lengths of the road sections with structure types I, II, III, and IV are 21.9 km, 8.3 km, 175.4 km, and 4.0 km, respectively. These four structures are the most common in China. A total of 1574 original GPR images were collected, each representing a road section with a length of 60 m and a width of 0.8 m. An input resolution of 416 × 416 is common for YOLO models, and the detector can detect anomalous waveforms well at this resolution. Therefore, these images were cropped and resized to obtain 3688 horizontal radar images with a resolution of 416 × 416, as shown in Figure 3. Each cropped radar image covered an actual width of 0.8 m and a length of 20 m.
A crack is a typical type of pavement distress and appears in horizontal radar images as stripes with strong reflections. Pavement repairs are performed to prevent the pavement from deteriorating. The waveform at a repair location is highlighted relative to nearby locations because the repair material differs from the original material. A void is a cavity at the bottom of a pavement structure layer caused by settlement and distortion between the old and new asphalt pavement. It often appears as a blocky highlight in horizontal radar images. Poor interlayer bonding arises from environmental constraints during construction and appears as a highlight anomaly in horizontal radar images. Mixture segregation is the result of an uneven paving process due to poorly mixed materials or uncontrolled temperatures during production, mixing, and paving. It often appears as a messy highlight feature in horizontal radar images because of the large number of voids. Typical pavement structural distresses (cracks, repairs, voids, poor interlayer bonding, and mixture segregation) are shown in Figure 4. Different distresses are represented clearly in the horizontal radar images. Therefore, horizontal radar images provide a unique perspective for pavement distress detection and can be used as an aid for pavement inspection.
However, the horizontal radar images still posed two problems. First, the waveform characteristics were complex because the same pavement distress may show different characteristics at different depths. For example, voids may appear as black or white highlights at different depths. Second, because of the complex and varied internal structures of pavements, the images contained, in addition to typical pavement distresses, a large number of noise waveforms deriving from real-world factors such as antenna vibration and background noise. These anomalous waveforms significantly affect distress detection performance. Therefore, a robust and accurate method was needed to process the horizontal radar images.
To alleviate overfitting and improve the accuracy of the localization model, random data augmentation was performed on the dataset. In this study, random data augmentation consisted of three operations: (1) randomly cropping an image; (2) randomly resizing an image in terms of its length and width; and (3) randomly distorting the color gamut of an image. The three operations were performed simultaneously, as in the examples shown in Figure 5b-d. The remaining positions of the randomly augmented image were filled with zeros.
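The three augmentation operations and the zero filling can be sketched as follows. This is a minimal NumPy illustration, not the pipeline used in the study: the function name `random_augment`, the sampling ranges, the nearest-neighbour resizing, and the per-channel gain used as a stand-in for color gamut distortion are all assumptions.

```python
import numpy as np

def random_augment(image, out_size=416, rng=None):
    """Randomly crop, rescale, and colour-jitter an image, then zero-pad it.

    `image` is an H x W x 3 uint8 array. All sampling ranges below are
    illustrative assumptions, not the values used in the study.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]

    # (1) Random crop: keep between 60% and 100% of each side.
    ch = int(h * rng.uniform(0.6, 1.0))
    cw = int(w * rng.uniform(0.6, 1.0))
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    crop = image[top:top + ch, left:left + cw]

    # (2) Random resize: scale height and width independently,
    #     capped so the result still fits inside the output canvas.
    new_h = min(out_size, max(1, int(ch * rng.uniform(0.5, 1.5))))
    new_w = min(out_size, max(1, int(cw * rng.uniform(0.5, 1.5))))
    rows = np.arange(new_h) * ch // new_h   # nearest-neighbour row indices
    cols = np.arange(new_w) * cw // new_w   # nearest-neighbour column indices
    resized = crop[rows][:, cols]

    # (3) Colour distortion (assumed form): per-channel gain, clipped to [0, 255].
    gain = rng.uniform(0.7, 1.3, size=(1, 1, 3))
    jittered = np.clip(resized.astype(np.float32) * gain, 0, 255).astype(np.uint8)

    # Paste onto a zero canvas so that the remaining positions are zeros.
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    y0 = int(rng.integers(0, out_size - new_h + 1))
    x0 = int(rng.integers(0, out_size - new_w + 1))
    canvas[y0:y0 + new_h, x0:x0 + new_w] = jittered
    return canvas
```

In a detection setting the bounding-box labels would have to be transformed with the same crop, scale, and offset; that bookkeeping is omitted here for brevity.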

Structure of CP-YOLOX
To localize anomalous waveforms, a CSP PAN YOLOX (CP-YOLOX) model is proposed, which is modified from the original YOLOX [32]. The proposed model can be divided into three parts: CSPDarkNet-SPP as the backbone, a Path Aggregation Network (PAN) as the neck, and a decoupled head for prediction [33][34][35]. The architecture of the model is shown in Figure 6.
(1) Backbone
The backbone extracts high-dimensional features from inputs using several convolutional and pooling layers. In this study, the CSPDarkNet-SPP network was used as the backbone of CP-YOLOX. The architecture and parameters of the network are shown in Figure 7 and Table 2, respectively. The CSPDarkNet-SPP network incorporates Focus, Cross Stage Partial (CSP), and Spatial Pyramid Pooling (SPP) structures to reduce floating-point operations, enlarge the receptive field, and enhance feature extraction efficiency.
In the CSPDarkNet-SPP network, an input image first passes through the Focus structure, which moves information from the length and width dimensions into the channel dimension. This operation yields two-fold downsampled features without information loss. The features are then compressed and extracted using a Convolution-BatchNorm-SiLU (CBS) structure and four CSP layers. In the last CSP layer, an SPP operation is added to enlarge the receptive field of the network by applying max pooling layers with different kernel sizes, which improves feature extraction efficiency. A CSP layer divides its input into two parts by performing 1 × 1 convolutions: one part passes through multiple residual modules to extract features, and the resulting features are then concatenated with the other part. A residual module consists of two CBS blocks and one residual edge. The first CBS adjusts the number of channels to reduce the computation cost, while the second performs feature extraction. The CSP layer reduces the computation and information redundancy of the network while ensuring efficient feature extraction. In the proposed model, the number attached to each CSP layer is the number of times its residual module is repeated; the four CSP layers repeat their residual modules 2, 6, 6, and 2 times, respectively.
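The slicing step of the Focus structure can be illustrated with a short NumPy sketch. The function name `focus_slice` is hypothetical, and the convolution that normally follows the slicing is left out; the point is only that every second pixel is moved into the channel dimension, halving the spatial size while keeping every input value.

```python
import numpy as np

def focus_slice(x):
    """Space-to-depth slicing as used at the start of a Focus layer.

    x: array of shape (H, W, C) with even H and W.
    Returns an array of shape (H/2, W/2, 4C) holding the same values.
    """
    return np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )

x = np.arange(416 * 416 * 3).reshape(416, 416, 3)
y = focus_slice(x)
print(y.shape)  # (208, 208, 12)
```

Because the output is only a rearrangement of the input, the two-fold downsampling loses no information, unlike strided convolution or pooling.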
The CBS consists of a convolutional layer, a batch normalization (BN) layer, and a SiLU activation function [36]. Two sizes of convolutional kernels were used in the convolutional layers. The 1 × 1 kernels were used for channel adjustment and to increase the nonlinear fitting ability, while the 3 × 3 kernels were used for feature extraction. The BN layer normalized the outputs of a convolutional layer to accelerate network convergence and alleviate overfitting. The SiLU activation function combines characteristics of the Sigmoid and ReLU functions and is defined as follows:
$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} \tag{2}$$
where $\sigma(x)$ is the Sigmoid function. Its smooth and non-monotonic characteristics work well on deep neural networks. The function is shown in Figure 8.
(2) Neck
The neck is the architecture between the backbone and the prediction head; it fuses the multi-dimension features from the backbone network and passes them to the prediction architecture, as shown in Figure 9. Three features (Feat1, Feat2, and Feat3) from the CSPDarkNet-SPP backbone are fed into the neck. The low-dimensional feature Feat1 contains strong local information, while the high-dimensional feature Feat3 carries distress-semantic information. In this study, a Path Aggregation Network (PAN) was used as the neck architecture to fuse these features; its parameters are shown in Table 3. In PAN, multi-dimension features are fused through successive upsampling and convolutional layers.
(3) Prediction
A decoupled head was used as the prediction architecture, taking the neck features as inputs. The decoupled head separates the classification and regression tasks, as shown in Figure 10. The regression task predicts the bounding box of each anomalous waveform area, while the classification task classifies each anomalous waveform area into one of the distress classes. The decoupled head network uses a CBS layer to generate two feature vectors from the input features. The first vector is passed through two CBS layers to produce a classification prediction vector, and the second vector is also passed through two CBS layers to produce a regression prediction vector. After the two vectors are concatenated, the dimension of the concatenated vector is adjusted using a 1 × 1 convolution layer. The final output of the model is a feature vector of size 1 × 6, where 6 can be divided into 1 + 1 + 4, corresponding to whether the area is an anomalous waveform, the confidence, and the location of the prediction box. The proposed model adopts an anchor-free strategy, which does not require predefined anchor sizes. In addition, this study adopted the simplified optimal transport assignment (SimOTA) [37] strategy to dynamically match positive samples for different distresses. Anchor-free prediction and SimOTA were used together to reduce the complexity of the detection head and to increase the robustness of the model.

Distress Classification Dataset
Even though the proposed localization model can determine anomalous waveform areas, it cannot easily determine the categories of these areas. For example, there was no easy method to determine whether an anomalous waveform area was caused by cracks or uneven settlement. The main reason was that few studies had demonstrated the ability of horizontal GPR images to represent different types of pavement distress. Moreover, the class-membership strategy applied to anomalous waveform images affects distress classification: an over-fine strategy risks misclassification, despite sometimes providing a precise decision, whereas an over-coarse strategy cannot provide an informative decision. In this study, three class-membership strategies, as shown in Table 4, were proposed to demonstrate the capacity of horizontal GPR images to represent different pavement distresses. Among them, HD1 (Horizontal Distress 1) indicated that all anomalous waveforms were considered cracks; HD16 indicated that the anomalous waveforms could represent cracks or noise; HD45 indicated that the anomalous waveforms may be poor interlayer bonding or mixture segregation; and HD6 indicated that the anomalous waveforms were background noise.
As shown in Table 1, 9045 anomalous waveforms were manually labeled and then cropped from the horizontal radar images to create a classification dataset. Figure 11 illustrates how the cropping works. The anomalous waveform areas were cropped from the radar images based on their position coordinates. The cropped anomalous waveform images had different shapes, whereas a classification model requires inputs of a fixed size. Thus, the cropped images were brought to a fixed size by filling the missing parts with zeros. Finally, these filled images were assigned to one of the distress categories using the different class-membership strategies. As a result, three datasets with the same images but different labels were generated, as shown in Table 5.
In all three datasets, the numbers of samples in the different categories were balanced. In the 5-category dataset, all 395 images were selected for HD126, and 400 images were randomly selected for each of the other categories, giving a total of 1995 images. In the 4-category dataset, the numbers of randomly selected images were 1072, 1100, 1100, and 1100, for a total of 4372 images. In the 3-category dataset, the numbers of randomly selected images were 1600, 1366, and 1600, for a total of 4566 images. For the three classification datasets, the ratio of the training set to the validation set was 8:2, and the test set consisted of all the anomalous waveforms in the 600 test horizontal radar images, with a total of 1592 images.
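The cropping and zero-filling step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the helper name `crop_and_pad`, the 224 × 224 output size (chosen because it yields 196 blocks of 16 × 16 pixels), the centring of the patch, and the nearest-neighbour shrinking of oversized patches are not taken from the paper.

```python
import numpy as np

def crop_and_pad(image, box, out_size=224):
    """Cut a labelled box out of a radar image and zero-pad it to a square.

    image: H x W x C array; box: (x_min, y_min, x_max, y_max) in pixels.
    out_size=224 is an assumption (196 blocks of 16 x 16 pixels).
    """
    x_min, y_min, x_max, y_max = box
    patch = image[y_min:y_max, x_min:x_max]
    h, w = patch.shape[:2]

    # Shrink with nearest-neighbour sampling only if the patch is too large,
    # preserving its aspect ratio.
    scale = min(1.0, out_size / max(h, w))
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    patch = patch[rows][:, cols]

    # Centre the patch on a zero canvas; the missing parts stay zero.
    canvas = np.zeros((out_size, out_size) + image.shape[2:], dtype=image.dtype)
    top, left = (out_size - new_h) // 2, (out_size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = patch
    return canvas
```

Padding rather than stretching keeps the aspect ratio of each anomalous waveform, which matters when elongated stripes (cracks) and blocky highlights (voids) must remain distinguishable.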

Structure of SViT
In this study, Simplified Vision Transformers (SViTs), which are simplified versions of the Vision Transformer (ViT), were used to perform the classification task. An SViT can be separated into three modules, namely Embedding, Transformer Encoder, and MLP Head, as shown in Figure 12.
(1) Embedding
As shown in Figure 12, the embedding module converts the input image into the data format required by the transformer encoder module. The input image is split into 196 blocks of 16 × 16 pixels. The blocks are then flattened into one-dimensional vectors by a flattening layer to obtain a sequence of 196 vectors with 256 dimensions. Finally, position information is embedded in each vector.
(2) Transformer Encoder
The transformer encoder module, as the core of the SViT model, extracts features from its input vector sequence. The architecture of the transformer encoder is shown in Figure 13. The transformer encoder consists of two residual structures: the first contains multi-head attention, and the second contains a multi-layer perceptron (MLP). Multi-head attention divides the input data into multiple parts to perform the attention computation. In this study, the number of heads was four, i.e., the vector sequence of 197 × 256 (196 block vectors plus one class token) was divided into four subsequences of 197 × 64 and fed into the attention module. In the attention module, each subsequence is further split into Query (Q), Key (K), and Value (V), and the results of attention and multi-head attention are computed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
where $d_k$ is the dimension of each head (64 here), $h$ is the number of heads (4 here), $\mathrm{head}_i$ is the attention output of the $i$-th head, and $W^{O}$ is the output projection matrix.
The first fully connected layer of the MLP module had 1024 neurons, four times the input size, and the second fully connected layer had 256 neurons, the same as the input size. The dropout rate was 0.1, and the GELU activation function [38] was adopted:
$$\mathrm{GELU}(x) = x\,\Phi(x)$$
where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
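The head splitting described above (a 197 × 256 sequence divided into four 197 × 64 subsequences) can be made concrete with a small NumPy sketch of scaled dot-product multi-head attention. The function names, the random projection matrices, and the random input are placeholders for illustration only; they do not reproduce the trained SViT weights or its exact layer layout.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=4):
    """Scaled dot-product attention over `num_heads` heads.

    x: (n, d) token sequence; w_q, w_k, w_v, w_o: (d, d) projections.
    Returns the (n, d) output and the (heads, n, n) attention weights.
    """
    n, d = x.shape
    d_k = d // num_heads

    def split(t):  # (n, d) -> (heads, n, d_k)
        return t.reshape(n, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                # (heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # concatenate heads
    return concat @ w_o, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(197, 256))  # 196 block vectors + 1 class token
w_q, w_k, w_v, w_o = (rng.normal(scale=0.02, size=(256, 256)) for _ in range(4))
out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape, weights.shape)  # (197, 256) (4, 197, 197)
```

Each head attends over all 197 tokens but only sees a 64-dimensional slice of the projected features, so the cost of four heads matches that of a single 256-dimensional head.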
The GELU function is shown in Figure 14. The transformer encoder module is repeated several times to extract multi-level features. In this study, five models with 3, 6, 9, 12, and 15 transformer encoder repetitions were used; they are called SViT-3, SViT-6, SViT-9, SViT-12, and SViT-15, as shown in Table 6.
(3) MLP Head
Based on the features from the transformer encoder, the MLP head classifies the input image into one of the possible classes. After feature extraction by the transformer encoder module, the output shape is 197 × 256, which includes the class token added in the embedding module. The class token learns the features of anomalous waveforms in the transformer encoder; it alone is extracted and passed to the MLP head, which performs the classification of anomalous waveforms.

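Learning Strategy
(1) Overall
The weights and biases of the localization and classification models were updated by the backpropagation algorithm [14] as follows:
$$W_{ij}^{(q)} \leftarrow W_{ij}^{(q)} - \eta \frac{\partial L}{\partial W_{ij}^{(q)}}$$
$$b_{j}^{(q)} \leftarrow b_{j}^{(q)} - \eta \frac{\partial L}{\partial b_{j}^{(q)}}$$
where $\partial L / \partial W_{ij}^{(q)}$ and $\partial L / \partial b_{j}^{(q)}$ are the gradients of the loss function $L$ with respect to the weight $W_{ij}^{(q)}$ and the bias $b_{j}^{(q)}$, respectively, and $\eta$ is the learning rate.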
The convergence of the model can be accelerated by using different learning rates during different training stages. In this study, an exponential decay function was used to set the learning rate:
$$lr = lr_b \cdot \gamma^{t} \tag{10}$$
where $lr_b$ is the base learning rate, $\gamma$ is the learning rate decay coefficient, and $t$ is the epoch index.
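As a worked illustration of the exponential decay, the sketch below builds a three-stage schedule from the settings reported for CP-YOLOX training (base rates of 1 × 10^-3, 1 × 10^-4, and 5 × 10^-5, γ = 0.97, 100 epochs per stage). It assumes that the decay is applied once per epoch and restarts from the new base rate at the beginning of each stage; the function names are hypothetical.

```python
def learning_rate(epoch, base_lr, gamma=0.97):
    """Exponentially decayed learning rate after `epoch` epochs (0-indexed)."""
    return base_lr * gamma ** epoch

def staged_schedule(stage_base_lrs=(1e-3, 1e-4, 5e-5),
                    epochs_per_stage=100, gamma=0.97):
    """Concatenate one decay curve per stage (assumed to restart each stage)."""
    schedule = []
    for base_lr in stage_base_lrs:
        schedule.extend(
            learning_rate(e, base_lr, gamma) for e in range(epochs_per_stage)
        )
    return schedule

schedule = staged_schedule()
print(len(schedule), schedule[0], schedule[100], schedule[200])
```

With γ = 0.97 the rate falls to roughly 5% of its base value by the end of a 100-epoch stage, so each stage starts with comparatively large steps and finishes with fine adjustments.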
The localization and classification models were trained with the TensorFlow 2.5 framework. The training device was a cloud server with an AMD EPYC 7302 CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The testing device was a laptop with an Intel i7-9750H CPU, 16 GB of RAM, and an NVIDIA GeForce GTX 1650 GPU with 4 GB of memory.
(2) Training of CP-YOLOX
Three loss functions were combined in the CP-YOLOX model to measure the gaps between predicted and target information, covering the decision of whether an area is an anomalous waveform, the confidence, and the location of the prediction boxes. The category and confidence losses were defined by the cross-entropy loss function, and the box position loss was computed by the CIoU loss function [39] as follows:
$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v$$
$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w^{p}}{h^{p}}\right)^{2}$$
$$\alpha = \frac{v}{(1 - IoU) + v}$$
where $IoU$ is the intersection over union between the prediction and truth boxes; $b$ and $b^{gt}$ are the center points of the prediction and truth boxes, respectively; $\rho$ is the Euclidean distance between the two center points; $c$ is the diagonal length of the smallest enclosing box that contains both the prediction and truth boxes; $v$ measures the similarity of the aspect ratios; $w^{gt}$, $h^{gt}$, $w^{p}$, and $h^{p}$ are the widths and heights of the truth and prediction boxes; and $\alpha$ is the weight parameter.
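The CIoU terms can be checked with a small pure-Python sketch that follows the symbol definitions above. It is written for a single pair of boxes in (center x, center y, width, height) form; the function name, the box encoding, and the small `eps` stabiliser are assumptions, and a training implementation would operate on batched tensors instead.

```python
import math

def ciou_loss(pred, truth, eps=1e-9):
    """CIoU loss for one box pair; boxes are (cx, cy, w, h) with w, h > 0."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = truth

    # Corner coordinates of both boxes.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2

    # Intersection over union.
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Squared centre distance (rho^2) and squared diagonal of the
    # smallest enclosing box (c^2).
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    c2 = ((max(p_x2, g_x2) - min(p_x1, g_x1)) ** 2
          + (max(p_y2, g_y2) - min(p_y1, g_y1)) ** 2)

    # Aspect-ratio term v and its weight alpha.
    v = 4 / math.pi ** 2 * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / (c2 + eps) + alpha * v

print(ciou_loss((50, 50, 20, 10), (50, 50, 20, 10)))  # close to 0 for identical boxes
print(ciou_loss((0, 0, 2, 2), (10, 0, 2, 2)))         # above 1 for disjoint boxes
```

Unlike a plain IoU loss, the center-distance term still provides a gradient when the boxes do not overlap, which is why the second example exceeds 1.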
The proposed CP-YOLOX model was trained in three stages of 100 epochs each. The values of $lr_b$ in the three stages were 1 × 10^-3, 1 × 10^-4, and 5 × 10^-5, and γ was 0.97, as shown in Figure 15a. The Adam optimizer [40] was used for training, with its two parameters (β1 and β2) set to 0.9 and 0.999, as shown in Table 7. Training was terminated when the validation loss had not decreased for 20 consecutive epochs.
(3) Training of SViT
SViT models use the cross-entropy loss function to measure the gap between predicted and target classes. The models also adopted Equation (10) for the learning rate. The models were trained for 20 epochs with $lr_b$ of 1 × 10^-4 and γ of 0.97, and the same Adam optimizer was used, as shown in Figure 15b and Table 7. Training was terminated when the validation loss had not decreased for 10 consecutive epochs.
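The termination rule used for both models (stop once the validation loss has not improved for a fixed number of epochs) can be sketched in a few lines. The function name and the toy loss sequence are hypothetical, and how the rule was wired into the actual TensorFlow training loop is not stated in the text.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops at the first epoch that lies `patience` epochs after
    the epoch with the lowest validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Toy curve: the loss improves for three epochs and then plateaus.
losses = [5.0, 4.0, 3.0] + [3.5] * 30
print(early_stopping_epoch(losses, patience=20))  # (23, 3)
```

The weights saved at `best_epoch`, not those at `stop_epoch`, are the ones that would normally be kept for testing.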

Analysis of Localization Results
The CP-YOLOX model achieved its lowest validation loss of 3.060 at the 258th epoch. Training then continued for 20 more epochs without any further decrease in the loss. The loss curves of the model are shown in Figure 16. The trained localization model was evaluated using Precision, Recall, F1 score, and mean average precision (mAP) [14], together with detection speed. The results, obtained at a confidence threshold of 0.5, are shown in Figure 17. On the test set, the CP-YOLOX model achieved 87.71%, 58.73%, and 80.64% for precision, recall, and mAP, respectively. The model required 29.8 ms to process one image, corresponding to a speed of 33.57 images/s. The evaluation results are shown in Table 8. The trained model was used to localize the anomalous waveforms in the horizontal radar images with a confidence threshold of 0.5. Some results are shown in Figure 18. Prediction and truth boxes are plotted on the radar images: truth boxes are blue, and prediction boxes are green when IoU ≥ 0.5 and red otherwise. Prediction results and confidence levels plotted on the original radar images are shown in Figure 19. In general, the proposed model was able to identify the anomalous waveform areas in horizontal radar images.
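The relationships between these metrics can be illustrated with a short sketch using the standard definitions of precision, recall, and F1 score. The function names are hypothetical; the F1 value computed below is the one implied by the reported precision and recall, not a figure quoted from the results table.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_score(precision, recall)
    return precision, recall, f1

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def images_per_second(ms_per_image):
    """Throughput implied by a per-image latency in milliseconds."""
    return 1000.0 / ms_per_image

f1 = f1_score(0.8771, 0.5873)   # reported precision and recall
fps = images_per_second(29.8)   # reported latency per image
print(round(f1, 4), round(fps, 2))
```

The implied F1 score is about 0.70, and the throughput of about 33.6 images/s agrees with the reported 33.57 images/s up to rounding of the latency. The gap between precision and recall indicates that the detector makes few false alarms but misses a substantial share of the labeled anomalous waveforms at this confidence threshold.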

Results of Training and Testing
To reduce error, each of the five SViT models was trained three times on each of the three datasets in Table 5, and the best run was selected using the validation set. A total of 45 training results were obtained, as shown in Figure 20. For each model and dataset, the weights with the smallest validation loss among the three runs were selected, and the corresponding loss curves are shown in Table 9. Figure 21 presents the accuracies of the 15 optimal models on the test sets. The SViT-9 model achieved the highest accuracy on the 5-category and 3-category test sets, with 63.63% and 75.57%, respectively. The SViT-6 model achieved the highest accuracy on the 4-category test set, with 68.12%. The highest accuracy overall, 75.57%, was obtained on the 3-category test set, corresponding to the categories of crack, poor interlayer bonding, and mixture segregation. Based on the results in Figure 21, the SViT-6 model was tested on the 4-category test set, while the SViT-9 model was tested on the 5- and 3-category test sets. The results are shown in Figure 22. Under the three class-membership strategies, the models were most accurate in classifying pavement distresses without background noise, with HD1 accuracies of 73.3%, 76.4%, and 82.1% and HD45 accuracies of 76.1%, 78.5%, and 83.3%, respectively. The models were less accurate for categories containing background noise, for which all accuracies were lower than 70%. In horizontal radar images, the SViT model was capable of detecting cracks, poor interlayer bonding, and mixture segregation. The model performed well on the 3-category dataset, but the best distress classification strategy needs to be determined together with longitudinal radar images. Future studies should focus on how to suppress noise waveforms and on a more detailed analysis of disturbance waveforms.

Prediction Results of Different Models
SViT is a simplified model based on ViT, which differs from the traditional CNN models described in Section 2.3. The proposed model was compared with ViT and with the CNN-based MobileNet and ResNet50 models [41,42]. The floating-point operations (FLOPs) and accuracies of the different models are shown in Figure 23 and Table 10. SViT outperformed the ViT and MobileNet models in accuracy, number of parameters, and FLOPs, although its accuracy was slightly lower than that of the ResNet50 model. The proposed model therefore has fewer parameters and FLOPs while maintaining a high level of accuracy, which enables it to perform distress classification quickly. This demonstrates the effectiveness of the proposed model in identifying pavement distress.

The Influence of Number of Samples on the Model
A ViT model can surpass traditional CNN models in image recognition when given sufficient training samples [16]. During training, the fitting ability of SViT was excellent, and only a few epochs were required to complete training. Figure 24 presents the accuracies of different models on the different datasets. With more samples, the accuracy of the SViT model approached that of ResNet50: the accuracy gap between the two models shrank from 6.06% to 1% as the number of samples per category increased. SViT therefore has the potential to outperform ResNet50 if more samples are available, and more 3D GPR images should be collected to improve the performance of the proposed model.

Conclusions
Pavement distress was detected using deep learning and horizontal GPR images in this study. The anomalous waveform areas in horizontal radar images were located by a CP-YOLOX model. Five SViT models classified anomalous waveforms into one of the possible distress categories. A GPR image dataset collected in China demonstrated the effectiveness of the proposed models. The following conclusions can be drawn.
(1) The proposed CP-YOLOX model could localize anomalous waveforms caused by pavement distresses. With a confidence threshold of 0.5, the CP-YOLOX model localized anomalous waveforms with an mAP of 80.64%, a precision of 87.71%, and a recall of 58.73%. The model processed radar images at a speed of 33.57 images/s.
(2) The proposed SViT model was capable of detecting cracks, poor interlayer bonding, and mixture segregation in horizontal radar images. For the categories without background noise, the model had a high prediction accuracy. Future studies should focus on how to suppress noise waveforms and on a more detailed analysis of disturbance waveforms.
(3) The proposed SViT model had fewer parameters and FLOPs while maintaining a high level of accuracy, which enables it to perform distress classification quickly. As the number of GPR images increased, the accuracy gap between SViT and ResNet50 shrank from 6.06% to 1%, indicating that more data samples have the potential to improve the performance of SViT. This demonstrates the advantage of the proposed model for pavement distress classification.
(4) Among the three classification datasets, the 3-category dataset had the highest accuracy, followed by the 4-category dataset, and the 5-category dataset had the lowest accuracy. However, the model trained on the 5-category dataset provided the most detailed basis for distress classification. In future work, the horizontal detection results need to be combined with longitudinal radar images to determine the best classification method using 3D GPR.

Figure 3. Pre-processing of radar images: (a) processing of horizontal radar images; (b-g) cropped horizontal radar images.

Proposed Localization Model

Distress Localization Dataset
It is necessary to generate a dataset of 3D GPR horizontal images before building a deep learning model for anomalous waveform localization. The collected 3688 horizontal radar images were divided into training, validation, and test sets containing 2470, 618, and 600 images, respectively. Block-level labels were then made for each image. The LabelImg software in a Python environment was used to label the anomalous waveform areas in an image, as in the example shown in Figure 5a. A bounding box indicates the location of an anomalous waveform area.

Figure 5. Data augmentation: (a) original image and (b-d) images after random data augmentation.

Figure 9. Flow chart of the structure of PAN.

Figure 11. Process of cropping anomalous waveform areas from a horizontal radar image. The red boxes are manually labeled.

Figure 13. Schematic diagram of the structure of the transformer encoder.

Figure 15. Learning rate decline curve: (a) three stages of training for the CP-YOLOX; (b) training for the SViT.

Figure 16. Decline curves of loss values in training and validation sets.

Figure 17. Curves of evaluation metrics for the training and test sets: (a-d) the Precision, Recall, F1 score, and AP values of the training set, respectively; (e-h) the Precision, Recall, F1 score, and AP values of the testing set, respectively.

Figure 18. The prediction results of CP-YOLOX on the test set: (a-h) predictions based on eight randomly selected images.

Figure 19. The prediction results of CP-YOLOX on the original image: (a-d) predictions based on four randomly selected images.

Figure 20. Loss values of validation sets on different models under three datasets.

Figure 21. Accuracy of different models on 3 test sets.

Figure 23. Accuracy and FLOPs of each model under different datasets.

Figure 24. Comparison study of sample number and model accuracy.

Table 1 presents the total number of bounding boxes in the training, validation, and test sets.

Table 1. Number of training, validation, and test samples.

Table 3. Structure parameters of PAN.

Figure 10. Flow chart of decoupled head.

Table 4. Class-membership strategies of anomalous waveform in horizontal radar image.

Table 5. Characteristics of three classification datasets.

Table 6. Detailed parameters of the 5 SViT models.

Table 7. The detailed parameters of model training.

Table 8. Results of the test set.

Table 10. Accuracy results of 3 datasets with different models.