Article

Performance Evaluation of Convolutional Neural Network Models for Classification of Highway Hidden Distresses with GPR B-Scan Images

1 Jiangsu Sinoroad Engineering Technology Research Institute Co., Ltd., Nanjing 211800, China
2 School of Mechanical and Electrical Engineering, Chengdu University of Technology, Chengdu 610059, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(10), 4226; https://doi.org/10.3390/app14104226
Submission received: 1 April 2024 / Revised: 27 April 2024 / Accepted: 14 May 2024 / Published: 16 May 2024

Abstract

Despite considerable advances in the automated identification of highway hidden distresses from ground-penetrating radar (GPR) images, challenges remain owing to the quantity, variability, and reliability of distress samples and the diversity of classification models. First, a dataset of 31,640 samples in four categories, interlayer debonding, interlayer loosening, interlayer water seepage, and structural loosening, was collected from nearly 1500 km of highway to obtain a sufficiently large sample set covering the variability of distress samples. Second, the distresses were labeled by experienced experts, and the labels were verified against drilled cores to ensure their reliability. Last, 18 exemplary convolutional neural network (CNN) models from 8 different architectures were evaluated using metrics such as precision, recall, and f1-score, and the confusion matrix and Grad-CAM techniques were further used to analyze these models. The experimental results show that VGG13 performed most prominently and stably, while the lightweight network SqueezeNet1_1 performed particularly well at a batch size of 64. Furthermore, this study indicates that models with fewer layers can achieve comparable or better performance than deeper models.

1. Introduction

The construction of highways is significant for the comfort of travelers, the improvement of transportation efficiency, and the promotion of economic development. To ensure traffic safety, extend the service life of highways, and reduce maintenance costs, it is necessary to detect both overt and hidden distresses of highways in a timely manner. Various methods, including manual inspection, core drill sampling, and non-destructive testing, have been implemented for the detection of highway distress. Manual inspection is effective for detecting surface distress, but it can be costly. Destructive detection methods, such as core drilling, have been gradually replaced due to their destructiveness and inefficiency. At present, as a primary method to detect highway hidden distresses, non-destructive detection technology mainly uses acoustic waves, electromagnetic waves, and infrared radiation to quickly and accurately identify highway distresses and provide a scientific basis for maintenance work. As a non-destructive testing technique, ground-penetrating radar (GPR) has been widely used to detect highway hidden distresses due to its high efficiency, continuous acquisition, and relatively low cost, and this technique has proven to be effective in the identification of subsurface defects and anomalies [1].
To detect highway hidden distresses with GPR, the primary research focus is the analysis and interpretation of GPR data. GPR data are divided into three types: one-trace waveforms (A-scan), two-dimensional profiles (B-scan), and three-dimensional data (C-scan). Todkar et al. conducted extensive experiments on asphalt concrete pavements based on GPR A-scan data using a support vector machine approach, and the results demonstrated its effectiveness in pavement debonding detection [2]. Analysis based on A-scan data is simple and effective, but it is easily affected by local noise and does not consider neighboring signals. Liang et al. used advanced air-coupled three-dimensional ground-penetrating radar (3D-GPR) to detect highway damage, and the results showed a strong correlation between the abnormal signals of the 3D-GPR images and pavement damage, verifying the ability of 3D-GPR to accurately identify highway damage features [3]. However, the analysis and interpretation of 3D data face challenges, such as the management of large amounts of data, the need for high computing power, and the difficulty of labeling 3D data. At present, the analysis and interpretation of GPR data still focus on two-dimensional GPR B-scan data, which comprise a series of successive A-scans acquired as the transceiver antenna moves equidistantly in a given direction. These data reflect the longitudinal section of the highway and have a smaller computational volume than C-scan data. Owing to various factors, such as the noise of each hardware module of the instrument, the inhomogeneity of the underground medium, and the interaction of echoes, some noise remains in GPR B-scan data and affects recognition and detection accuracy. However, research based on GPR B-scan data has developed rapidly with the progress of digital signal processing techniques for noise suppression, such as Kalman filtering, principal component analysis, and wavelet transform [4,5]. At present, various classification methods have been proposed to identify highway distresses from GPR B-scan data; these can be divided into three groups according to the feature extraction method: traditional, machine learning-based, and deep learning-based. All of these methods aim to extract essential features from GPR B-scan data to achieve accurate classification. Traditional feature extraction methods, such as edge detection, least squares, and the Hough transform, and machine learning-based methods, such as the support vector machine (SVM), artificial neural network, and naive Bayes classifier, rely heavily on a priori knowledge and expert experience, which is a significant and unavoidable limitation. The rapid development of deep learning has brought breakthroughs, particularly in image feature extraction [6]. Deep learning-based methods can achieve end-to-end recognition and detection, automatically learning feature representations directly from GPR B-scan images [7].
The recognition and detection of highway distress using GPR have always been challenging tasks. Rasol et al. systematically reviewed typical cases and studies of GPR for highway and subsurface detection, highlighting the many challenges with highway GPR data [8]. Ayman et al. conducted a comprehensive review of a computer vision model and its application to highway hidden distress classification, detection, segmentation, and quantification, demonstrating that in the modern transportation field, researchers and practitioners have been working on the development of methods that can efficiently utilize GPR data for the identification and evaluation of highway hidden distresses [9]. After analyzing and summarizing numerous previous studies, this paper concludes that the three main factors affecting the accuracy of the automatic detection task of highway hidden distresses are the classification accuracy of distress samples, the localization accuracy of distresses, and the recognition accuracy between foreground (distresses) and background. Effective feature extraction and accurate classification of distress samples are crucial for the accuracy of the automatic distress detection task. Previous studies have shown that selecting the appropriate backbone network for extracting features is particularly important for the performance of these detection methods [10]. Numerous studies have demonstrated that deep learning-based image recognition methods, particularly those based on convolutional neural networks (CNNs), have achieved outstanding performance in highway distress recognition with GPR B-scan images. However, it should be noted that current recognition and detection efforts mainly focus on extracting hyperbolic features from GPR B-scan images [11].
Since 2017, this study has conducted highway hidden distress detection work in East China, mainly in Jiangsu and Zhejiang provinces, and has collected measured data from several highways with a total length of nearly 1500 km. By combining the GPR B-scan data with the validation results from on-site core drilling sampling, which were verified by experts in the field, the main categories of highway distress in the region were identified: interlayer debonding, interlayer water seepage, interlayer loosening, and structural loosening. Structural loosening is the most serious and least numerous hidden distress, which is to be expected since China places a high priority on highway maintenance. Interlayer distresses are subdivided owing to maintenance cost considerations in actual projects, but their characteristics are largely similar. Identifying distress categories has become more challenging as the differentiation between categories has become more refined. Analysis of the measured data revealed that, unlike common highway distresses such as cracks [12] and cavities [13], the four categories of distress found in East China do not exhibit typical hyperbolic features in the image data. As a result, accurate detection of these four categories of distress is much more difficult in this region. Several classification models based on CNN architectures have been proposed for large natural image datasets, but GPR B-scan distress image features are relatively simple, lacking the complexity and diversity of natural images [14]. Therefore, for GPR B-scan images of highway hidden distresses in East China, particularly in Jiangsu and Zhejiang provinces, where the category samples are imbalanced, the distresses lack clear hyperbolic features, and the image features are relatively simple, it is unclear which CNN classification model is more effective when jointly considering model performance, parameters, complexity, and size. In this paper, 18 classical CNN models are selected for testing and evaluation based on the distress features in the field data.
This paper is structured as follows: Section 2 provides a detailed description of the dataset, experimental methodology, and other relevant parts. Section 3 presents and illustrates the experimental results. Section 4 provides a comprehensive analysis and discussion of the experimental results. Finally, the work of this paper and future research directions are summarized in Section 5.

2. Methodology

2.1. Dataset Description

2.1.1. Data Acquisition

There are several natural image datasets available for the testing and evaluation of image classification models, such as ImageNet [15]. However, there is currently no publicly available dataset for radar images of highways, making it difficult to evaluate the performance of classification models for highway distress. From 2017 to 2022, this study successfully collected data from several highways in East China, with a total highway mileage of nearly 1500 km, mainly including the Ning-Yang (Nanjing–Yangzhou) Highway, Hu-Ning (Shanghai–Nanjing) Highway, Yanjiang Highway, and Zhendan Highway in Jiangsu Province, and the Taijin Highway, Yong-jin (Ningbo–Jinhua) Highway, and Hangxinjing (Hangzhou–Xin’anjiang–Jingde Town) Highway in Zhejiang Province.
Highways in the region typically have a three-layer structure consisting of a surface layer, a water-stabilized base layer (including an upper and lower base layer), and a subbase layer, as shown in Figure 1.
For the field data collection, the fourth-generation high-dynamic ground-penetrating radar MALA GX750 series, manufactured by the Swedish company GuidelineGEO (Stockholm, Sweden), was used, with the following technical specifications: antenna frequency: 750 MHz; scanning speed: 1290 traces/s; detection depth: 0–1.5 m; antenna coupling mode: air coupling; weight: 3.6 kg; detection vehicle speed: 80 km/h.
The distress information in the collected raw GPR data was difficult to recognize due to a large amount of background noise. To highlight the characteristics of each distress, a series of processing steps were taken, including dewow, move start time, compensation of energy decay, background removal, deconvolution, and bandpass filtering.
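While the full processing chain is specific to the instrument and survey, the general form of two of these steps, background removal and bandpass filtering, can be sketched in a few lines. This is a minimal illustration assuming the B-scan is stored as a 2-D array (time samples × traces); the function names and cut-off frequencies are illustrative, not the exact pipeline used in this study.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def remove_background(bscan: np.ndarray) -> np.ndarray:
    """Subtract the mean trace (horizontal background) from every A-scan.

    bscan: 2-D array of shape (n_time_samples, n_traces).
    """
    return bscan - bscan.mean(axis=1, keepdims=True)

def bandpass(bscan: np.ndarray, fs: float, low: float, high: float, order: int = 4) -> np.ndarray:
    """Zero-phase Butterworth bandpass applied along the time axis of each trace."""
    b, a = butter(order, [low, high], btype="band", fs=fs)
    return filtfilt(b, a, bscan, axis=0)

# Example: a synthetic B-scan sampled at 8 GHz, keeping energy around the
# 750 MHz antenna centre frequency (cut-offs chosen only for illustration).
raw = np.random.randn(512, 1000)
clean = bandpass(remove_background(raw), fs=8e9, low=250e6, high=1.5e9)
```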

2.1.2. Constructing Dataset

At present, most of the research work has only focused on a single type of highway distress, such as cracks [16], debonding [17], and moisture damages [18], which are relatively easy to identify. The task of categorizing interlayer distresses and intralayer distresses (i.e., structural loosening), as well as subdividing interlayer distress into interlayer debonding, interlayer loosening, and interlayer water seepage, posed a significant challenge in this study. From the perspective of road maintenance, the main reasons for subdividing highway distress into multiple categories are as follows:
  • Assessing the severity of distress: different categories of distress affect the highway structure to different degrees, which ultimately affects the repair methods, measures, and costs;
  • Statistics and analysis: statistics on the frequency and distribution of various categories of distress can help formulate more effective maintenance plans and prevention strategies;
  • Data recording and management: establishing a detailed database of distress helps in distress prediction and highway maintenance.
GPR uses electromagnetic waves to detect the internal structure of highways. Electromagnetic waves are reflected at the interface of different media due to changes in their dielectric constants. The amplitude of the reflection is stronger when the difference between the dielectric constants of the two media is greater. The image characteristics of the GPR B-scan reflect the health of highway structures and different categories of highway distress. By combining the radar images with the drilled sample cores, anomalous areas in the GPR B-scan images can be analyzed to achieve qualitative and quantitative interpretations of different categories of distress. The images and sample cores of the four categories of distresses are shown in Figure 2. The following is a brief description of the various distresses included in the field data:
  • Interlayer debonding: There is a continuous separation between the layers of the road structure, and interlayer delamination occurs. This type of distress, known as interlayer debonding (abbreviated as poor_l), typically results from the deterioration of interlayer bonding due to natural aging of the highway, excessive loading, or other factors.
  • Interlayer water seepage: This type of distress (abbreviated as water_l) occurs when a highway that already has interlayer debonding or other less serious distress is not maintained in time and is infiltrated by rain. Unlike other air-containing distresses, this type of distress contains water. The dielectric constants of water and air differ significantly (usually 81 for water and 1 for air). Therefore, Figure 2 shows a clear polarity reversal in the interlayer water seepage compared to the interlayer debonding.
  • Interlayer loosening: This kind of distress is characterized by nibbling and loosening of the interlayer bonding surface. Interlayer loosening (abbreviated as loose_l) occurs when interlayer debonding, interlayer water seepage, or other relatively light distresses are still not repaired in a timely manner and are constantly crushed by vehicles.
  • Structural loosening: This type of distress is characterized by fractures and looseness in highway structural layers. Structural loosening (abbreviated as loose_s) is the most serious class of these four categories of distress.
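The polarity reversal noted above for interlayer water seepage follows directly from the normal-incidence reflection coefficient, which changes sign depending on whether the wave passes into a medium with a lower or higher dielectric constant. Below is a small check using the values quoted above (81 for water, 1 for air) and an assumed typical asphalt permittivity of about 6, included only for illustration.

```python
import math

def reflection_coefficient(eps1: float, eps2: float) -> float:
    """Normal-incidence reflection coefficient between non-magnetic media with
    relative permittivities eps1 (upper medium) and eps2 (lower medium)."""
    n1, n2 = math.sqrt(eps1), math.sqrt(eps2)
    return (n1 - n2) / (n1 + n2)

EPS_ASPHALT = 6.0  # assumed typical value, for illustration only
print(reflection_coefficient(EPS_ASPHALT, 1.0))   # asphalt -> air gap:  about +0.42
print(reflection_coefficient(EPS_ASPHALT, 81.0))  # asphalt -> water:    about -0.57
```

The opposite signs of the two coefficients are what produce the polarity reversal visible in Figure 2.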
To achieve accurate distress classification, experienced domain experts labeled the distress samples and verified the labels with drilled core samples. We obtained a dataset containing 31,640 samples by following the flowchart shown in Figure 3. PCA dimensionality reduction was performed to visualize the distribution of the dataset, as shown in Figure 4. The categories of interlayer debonding, interlayer water seepage, and structural loosening have clear boundaries, with the most pronounced boundaries observed between structural loosening and the other categories. However, interlayer loosening partially overlaps with the sample distribution of the other three categories.
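For reference, a 2-D view such as Figure 4 can be produced by projecting the flattened image samples onto the two leading principal components; the snippet below is a sketch with placeholder data and illustrative variable names, not the authors' exact procedure.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# X: (n_samples, n_features) flattened grayscale images; y: integer labels 0..3
X = np.random.rand(1000, 224 * 224)       # placeholder for the real dataset
y = np.random.randint(0, 4, size=1000)
labels = ["poor_l", "water_l", "loose_l", "loose_s"]

Z = PCA(n_components=2).fit_transform(X)  # project onto the two leading components
for c, name in enumerate(labels):
    m = y == c
    plt.scatter(Z[m, 0], Z[m, 1], s=4, label=name)
plt.legend(); plt.xlabel("PC1"); plt.ylabel("PC2"); plt.show()
```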
The dataset was divided into training, validation, and test sets in a ratio of 6:1:3. Table 1 shows the number of samples in each category for each dataset, revealing a significant sample imbalance problem. In particular, the structural loosening category of distress has only 698 samples. Among the interlayer distresses, interlayer debonding generally occurs first; then, interlayer debonding develops into interlayer water seepage due to rainwater infiltration and interlayer water seepage develops into interlayer loosening as the highway continues to be operated. Table 1 shows that interlayer debonding has the highest number of samples, followed by interlayer water seepage, and interlayer loosening has the lowest number of samples among the interlayer distress categories, which is consistent with the development process of highway distress. In addition, structural loosening, the most severe distress category, has the lowest number of samples. Therefore, the highway distress samples are significantly unbalanced. Although various methods have been proposed to deal with sample imbalance, such as sample augmentation [19] and loss function design [20], this paper concentrates on finding the optimal CNN-based model for highway ground-penetrating radar datasets, which exhibits sample imbalance.
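A 6:1:3 split that preserves the class proportions in each subset can be obtained with two stratified splits, as sketched below (reusing the X and y arrays from the previous sketch). Whether the authors stratified their split is not stated, so the stratification here is an assumption.

```python
from sklearn.model_selection import train_test_split

# First peel off the 30% test set, then split the remaining 70% so that
# 1/7 of it (i.e., 10% of the full data) becomes the validation set.
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=1 / 7, stratify=y_trval, random_state=0)
```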

2.2. Network Models

2.2.1. Description of Convolutional Neural Network Architectures

Classical CNN models were trained and evaluated to obtain an efficient and highly accurate classification model for GPR images. The basic structure of these CNNs consists of convolutional, pooling, and fully connected layers. The convolutional layer extracts the features of the input image through convolutional operations and nonlinear activation functions. Each convolutional layer outputs a set of feature maps that represent different responses of the features, such as edges, textures, or colors. The pooling layer down-samples the feature maps from the convolutional layers to reduce the number of features while retaining important feature information. The fully connected layer nonlinearly maps the output of the pooling layer and inputs it to the classifier. By stacking multiple convolutional and pooling layers, the CNN can extract the high-level semantic features of the image and efficiently classify and recognize it [21]. In addition to these basic components, several related techniques and methods can improve model robustness, accelerate training, and reduce overfitting, including batch normalization, residual connections, dropout, pre-training, and fine-tuning.
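The convolution-pooling-fully-connected stack described above can be written compactly in PyTorch, the framework used in this study. The toy network below only illustrates that structure; it is not one of the 18 evaluated models.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: two conv+pool stages followed by a fully connected classifier."""
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),  # edge/texture maps
            nn.MaxPool2d(2),                                         # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                         # 112 -> 56
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(8, 1, 224, 224))  # shape (8, 4)
```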

2.2.2. Model Selection

Due to variations in structure and number of layers, multiple network models can be derived from the same network architecture. These models differ in their ability to extract image features to varying degrees. Canziani et al. conducted an extensive analysis of multiple state-of-the-art deep neural networks on important metrics such as accuracy, memory usage, and parameters in the ImageNet image categorization challenge task, which not only showed the excellent performance of these networks in the classification task but also provided some insights for other researchers in the selection and design of neural networks [22].
In 2012, AlexNet achieved a Top-1 accuracy of 73.8% in the ImageNet image classification challenge, marking a milestone in deep learning and demonstrating the power of deep CNNs [23]. The VGG architecture, renowned for its deep network structure at the time of its proposal, focuses on feature extraction by increasing the number of convolutional layers [24]. This architecture showed that adding more convolutional layers can indeed improve feature extraction results. The Inception architecture is a fundamental element of GoogLeNet, balancing model efficiency and accuracy by introducing techniques such as multiscale convolution and sparse connectivity [25]. Over time, several versions of the Inception architecture have been developed, including InceptionV2 and InceptionV3. However, for large and deep complex models such as VGG and GoogLeNet, as the number of layers continues to increase, the gradient gradually decreases and converges to zero during training, resulting in inefficient weight updates. He et al. proposed ResNet in 2015, which introduced skip connections to solve the gradient vanishing problem, and this important residual structure enabled proper training of deeper networks [26]. In 2016, DenseNet was proposed, which introduced dense connections to facilitate full feature transfer and sharing [27]. Compared to ResNet, DenseNet has better feature transfer and fewer parameters. However, in scenarios with limited computational resources, large models may be too bulky. Therefore, lightweight networks such as MobileNet [28], SqueezeNet [29], and EfficientNet [30] have been proposed. These networks maintain high accuracy while reducing parameters and computation through structural design and model compression.
In the field of highway distress identification and detection, the above network architectures have been widely used. In the classification task, Wu et al. adopted a DenseNet and deconvolution network framework for pixel-level detection and combined it with a support vector machine classifier to classify cracks, achieving 98% accuracy [31]. Li et al. constructed EfficientNet and MobileNet models, respectively, to identify pavement water seepage problems, and the results showed that EfficientNet performed well and could accurately detect pavement water seepage [32]. Martinez-Ríos et al. used generalized Morse wavelets (GMWs) to perform continuous wavelet transform on vehicle vertical acceleration data for the detection of transverse cracks in the road surface and fine-tuned GoogLeNet, SqueezeNet, and ResNet18 [33]. Experimental results show that SqueezeNet had higher sensitivity in the validation phase. Meftah et al. combined a random forest machine learning classifier with MobileNet, InceptionV3, and Xception to construct an efficient CNN model, confirming the model's effectiveness in identifying road cracks on real concrete surfaces [34]. In the detection task, Ma et al. used VGG16 as the basic convolutional network to propose a road crack detection method based on a multi-feature-layer convolutional neural network, thereby improving the accuracy of road crack identification [35]. Zhang et al. used ResNet50 and YOLO v2 networks for feature extraction and identification, and the results show that this method had excellent performance and advantages in asphalt pavement water damage detection and location [18].
Based on the above analysis, this paper tests and evaluates 18 network models derived from 8 network architectures, namely AlexNet, InceptionV3, VGG (VGG11, VGG13, VGG16, and VGG19), ResNet (ResNet18, ResNet34, ResNet50, and ResNet101), DenseNet (DenseNet121 and DenseNet169), MobileNet (MobileNetV2 and MobileNetV3), SqueezeNet (SqueezeNet1_0 and SqueezeNet1_1), and EfficientNet (EfficientNetB2 and EfficientNetB3).
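All of these architectures ship with torchvision, so a model zoo for such an evaluation can plausibly be assembled from its factory functions, as sketched below for a subset of the models. The exact torchvision variants used in the paper (e.g., the Large vs. Small version of MobileNetV3) are assumptions.

```python
from torchvision import models

# Factory functions for some of the evaluated models; weights="IMAGENET1K_V1"
# loads the ImageNet pre-trained weights used for transfer learning.
MODEL_ZOO = {
    "AlexNet":        lambda: models.alexnet(weights="IMAGENET1K_V1"),
    "VGG13":          lambda: models.vgg13(weights="IMAGENET1K_V1"),
    "ResNet18":       lambda: models.resnet18(weights="IMAGENET1K_V1"),
    "DenseNet121":    lambda: models.densenet121(weights="IMAGENET1K_V1"),
    "MobileNetV2":    lambda: models.mobilenet_v2(weights="IMAGENET1K_V1"),
    "MobileNetV3":    lambda: models.mobilenet_v3_large(weights="IMAGENET1K_V1"),  # variant assumed
    "SqueezeNet1_1":  lambda: models.squeezenet1_1(weights="IMAGENET1K_V1"),
    "EfficientNetB2": lambda: models.efficientnet_b2(weights="IMAGENET1K_V1"),
    "InceptionV3":    lambda: models.inception_v3(weights="IMAGENET1K_V1"),
}
net = MODEL_ZOO["VGG13"]()
```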

2.3. Training Description

All the work was carried out on the Windows 10 operating system, on a workstation configured as follows: CPU: Intel(R) Xeon(R) Gold 6133; RAM: 128 GB; GPU: NVIDIA GeForce RTX 3090; Python 3.8; PyTorch 2.1.1; CUDA 12.1.
Weight files trained on large natural image datasets were used as pre-training weights for each of the 18 network models for transfer learning, and all parameters of every network layer of each model were then trained on the highway distress dataset. The hyperparameters were chosen uniformly. The input image size is 224 × 224, and Figure 5 shows the resized images of the four categories of distress. The number of input channels is set to 1 because the distress sample images are grayscale rather than typical 3-channel RGB images. The number of output categories is set to 4. The optimizer is stochastic gradient descent (SGD), the initial learning rate is 0.01, the learning rate follows an exponential decay strategy, the loss function is the cross-entropy loss, and the number of epochs is 30. In particular, the InceptionV3 input image size is set to 299 × 299. These hyperparameters were chosen based on extensive preliminary experiments. Since these 18 models and their pre-trained weights are designed for natural images with many categories, the input and output layers need to be modified before training the models.
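These adaptations can be sketched for one representative model (ResNet18): the 3-channel input convolution is replaced with a 1-channel one, the classifier head is resized to 4 categories, and SGD is paired with an exponential learning-rate schedule. The momentum and decay factor below are assumptions, as the paper does not report them.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

net = models.resnet18(weights="IMAGENET1K_V1")

# Grayscale input: swap the stock 3-channel stem convolution for a 1-channel one.
net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Four distress categories instead of 1000 ImageNet classes.
net.fc = nn.Linear(net.fc.in_features, 4)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)        # momentum assumed
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)    # decay rate assumed

for epoch in range(30):
    ...  # train on the distress dataset, validate, keep weights with lowest val loss
    scheduler.step()
```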
The batch size of the training samples has an impact on model performance. Owing to limits on memory and graphics memory, batch sizes of 16, 32, and 64 are considered in this paper. The training process is repeated three times to account for the random selection of training samples and other influencing factors.
During the training process, the training loss decreases as the epoch increases, while the validation loss generally shows a trend of decreasing and then slightly increasing due to the overfitting phenomenon that may occur in the later stages of model training. The training is repeated three times for each network under the same batch size, and the weights with the smallest validation loss are retained in each training, and the optimal weights are finally selected.

2.4. Evaluation Metrics

2.4.1. Training Metrics

The accuracy and loss values of the training and validation sets are recorded at the end of each training cycle and are used to plot learning curves and evaluate model learning. Training accuracy represents the proportion of training samples with known labels that are correctly classified, with higher values indicating better model performance on the training set. The training loss measures the difference between the predictions of the model on the training set and the actual labels. Lower values indicate a better fit of the model to the trained data. Validation accuracy is the percentage of correct classifications of the model on the validation set, with higher values indicating better model performance on the validation set. Validation loss measures the difference between the predictions made by the model and the actual labels on the validation dataset. Lower values indicate better model performance on the validation data. Accuracy and loss are interrelated evaluation metrics. Loss typically decreases as accuracy increases.
During the training phase, changes in these metrics are monitored to evaluate whether the model is converging, whether there are problems such as underfitting or overfitting, and to predict in advance the likely predictive ability of the model on unseen data. Additionally, the parameters, computation amount, size, and average epoch duration of the model are recorded to provide a comprehensive evaluation.

2.4.2. Testing Metrics

When evaluating the model using the test set, three well-known evaluation metrics were used in this study: precision, recall, and F1-score. These evaluation metrics are calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN},$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$
where TP is the true positive, FP is the false positive, FN is the false negative, and TN is the true negative.
A visual confusion matrix was used to provide detailed classification results for each category, helping to analyze the performance of the model on different categories. Additionally, due to the imbalance of distress samples in the dataset, three metrics, macro-average precision, macro-average recall, and macro-average F1-score, have been introduced to overcome possible evaluation biases. The metrics are calculated by summing and averaging the values for each category. Macro-averaging assigns equal weight to each category, ensuring a balanced impact on the overall assessment and providing fairer and more comprehensive model assessment results.
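Both the per-class metrics and their macro averages, as well as the confusion matrix, are available directly in scikit-learn; below is a minimal sketch with placeholder labels.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = [0, 1, 2, 3, 2, 1]  # placeholder test labels
y_pred = [0, 1, 2, 2, 2, 1]  # placeholder model predictions

# Per-class precision/recall/F1 (one value per category).
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=None)
# Macro averages: each category weighted equally, regardless of sample count.
p_macro, r_macro, f1_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
cm = confusion_matrix(y_true, y_pred)
```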

2.4.3. Grad-CAM Visualization

Grad-CAM is a visualization technique that can intuitively highlight differences in the model’s distribution of attention to different regions of an image. This study uses Grad-CAM to explore how the model predicts highway distress images.
By combining the gradients and feature maps of a convolutional layer, Grad-CAM calculates a weight for each feature map channel, then multiplies the weights with the feature maps and sums them to obtain the final Class Activation Map (CAM), which is finally up-sampled and superimposed onto the input image [36]. The CAM is displayed as a heat map, which indicates the level of activation of different regions in the image through color changes. Higher heat values are represented by warmer colors, indicating that the model is paying more attention to that region. Using CAM, Grad-CAM can demonstrate the differences in the model's attention regions for different distresses while also providing a basis for evaluating the model's reliability and robustness. Furthermore, the Grad-CAM technique can be used to display the distribution of the model's attention before classification for incorrectly predicted distress samples and to explore the possible reasons for the model's prediction errors.
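The computation described above can be expressed with forward and backward hooks on the chosen convolutional layer. The function below is a compact sketch of the standard Grad-CAM procedure, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Grad-CAM for one image: weight each feature-map channel by the mean
    gradient of the target class score, sum, apply ReLU, and upsample."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

    logits = model(image)                                # image: (1, C, H, W)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    weights = grads["v"].mean(dim=(2, 3), keepdim=True)  # per-channel weights
    cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()   # normalized heat map
```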

3. Results

3.1. Training Results

This paper considers three batch sizes, 16, 32, and 64, and observes the effect of this hyperparameter on each model's performance. To comprehensively evaluate the models, the total parameters, floating-point operations (FLOPs), size, and average duration of one epoch are recorded for each model, as shown in Table 2.
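The total parameter count and the corresponding float32 model size can be recorded with a few lines of PyTorch, as sketched below; FLOPs, by contrast, typically require a profiling utility and are omitted here.

```python
def model_stats(net) -> dict:
    """Total parameter count (millions) and approximate float32 size (MB)."""
    n_params = sum(p.numel() for p in net.parameters())
    return {"params_M": n_params / 1e6,
            "size_MB": n_params * 4 / (1024 ** 2)}

# For the 4-class VGG19 variant used here, this yields roughly the
# 139.59 M parameters / 532 MB reported in Table 2.
```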
Among the 18 models, the VGG architecture has the largest size and total parameters, with VGG19 having as many as 139.59 M parameters and a model size of 532.48 MB. The three lightweight network families discussed have smaller total parameters and sizes. Specifically, the SqueezeNet architecture stands out, with only 0.73 M total parameters and a 2.78 MB model size for SqueezeNet1_0, and 0.72 M total parameters and a 2.76 MB model size for SqueezeNet1_1. Meanwhile, FLOPs are positively correlated with the batch size of the input samples: the FLOPs of a model increase as the batch size increases. The four models of the VGG architecture are also the largest in terms of FLOPs, with VGG19 at a batch size of 64 reaching 1252.49 G FLOPs, the largest of all the models. AlexNet and the three representative lightweight networks have smaller FLOPs; for example, MobileNetV2 requires only 3.70 G FLOPs at a batch size of 64. Regarding training time, the larger the batch size, the faster the model trains and the shorter the average duration of an epoch. The DenseNet architecture has the longest training time, with DenseNet169 taking up to 3 min and 32 s per epoch at a batch size of 16. However, it is worth noting that the EfficientNet architecture, despite being lightweight, has a training time similar to that of larger models such as VGG16 and VGG19.
Training accuracy and loss and validation accuracy and loss are computed and plotted after each epoch. Accuracy and loss curves for CNNs of the eight network architectures at batch sizes of 16, 32, and 64 are shown in Figure 6a–c and Figure 7, respectively. One representative network model is displayed for each architecture, with two shown for both VGG and ResNet. For a given model, multiplying the batch size does not cause a significant change in training and validation accuracy. Figure 7 shows a clear trend of decreasing and then increasing validation loss for the AlexNet and VGG models, indicating overfitting; however, increasing the batch size significantly mitigates this phenomenon, as can be observed from the learning curves, whose fluctuations are relatively smoother at larger batch sizes. The learning curves of the different distress categories show that interlayer debonding has the smallest fluctuations, the highest validation accuracy, and the fastest convergence, while structural loosening has the largest fluctuations, the lowest validation accuracy, and the slowest convergence.

3.2. Testing Results

The training process was repeated three times in total. To compare the generalization effect, the selection of weight files preferred the validation loss over the validation accuracy. Therefore, for different batch sizes, the set of model weights chosen to be retained is the one with the lowest validation loss of the model. Finally, 54 results were obtained. The models were applied to the test set, and their predictions were used as the basis for evaluating their performance. The precision, recall, and f1-score of all models for each type of distress in the test set are organized in Table 3, Table 4 and Table 5. In addition, the last column of each table shows the macro-average precision, macro-average recall, and macro-average f1-score to evaluate the overall performance of the models.
Figure 7. Training loss and validation loss curves for CNNs of 10 network models with batch sizes 16 and 32, respectively. The solid line represents the training loss, while the dashed line represents the validation loss. The curves with different colors represent different batch sizes, which are illustrated in the legend with "val loss" and "train loss", followed by an underscore and the specific values of the batch sizes.
For ease of analysis, the naming rule for the models is to add an underscore to the name of the network model followed by the value of the corresponding batch size, e.g., VGG13_16 and VGG13_64 stand for VGG13 trained at a batch size of 16 and a batch size of 64. For precision, the best models overall are VGG13_16 and VGG13_64 (both 98.426%). The model with the lowest overall performance was MobileNetV3_16 (96.938%). Regarding the interlayer debonding category of distress, the model with the highest precision was VGG13_32 (99.469%). In contrast, the model with the lowest precision was DenseNet169_32 (98.649%). Regarding the interlayer water seepage category of distress, the best model was SqueezeNet1_1_16 (99.138%). In contrast, the model with the lowest precision was VGG19_32 (97.430%). ResNet34_16 performed the best for the interlayer loosening distress category with a precision of 98.755%. In contrast, InceptionV3_32 had the lowest precision of 96.713%. For the structural loosening distress category, both VGG13_16 and VGG13_64 performed the best, with a precision of 98.537%. In contrast, MobileNetV3_16 had the lowest precision of only 91.964%.
For recall, the VGG13_32 model performed the best overall, with a macro-average recall of 98.921%, while the VGG19_64 model had the worst overall performance, with a macro-average recall of only 97.261%. For the interlayer debonding category of distress, the VGG19_64 model had the highest recall at 99.721%, while the InceptionV3_32 model had the lowest recall at 98.802%. The EfficientNetB2_16 model had the highest recall of 99.463% for the interlayer water seepage distress category, while the MobileNetV3_64 model had the lowest recall of 97.711%. For interlayer loosening distress, the InceptionV3_32 model performed best, with a recall of 98.105%, while the worst model was ResNet34_16 (96.211%). Regarding structural loosening distress, ResNet50_16 was the best model, with a recall of 98.578%. In contrast, VGG11_16 had the lowest recall, at 93.365%.
For the f1-score, the overall best model is SqueezeNet1_1_64 (98.120%), while the worst model overall was MobileNetV3_16 (97.519%). For the interlayer debonding category of distress, the best model was AlexNet_64 (99.416%), while the worst model was EfficientNetB2_64 (99.068%). For the interlayer water seepage category of distress, the best model was DenseNet121_64 (98.725%), while the worst model was EfficientNetB2_64 (98.200%). For the interlayer loosening category of distress, the best model was AlexNet_64 (97.966%), while the worst model was DenseNet169_16 (97.275%). For the structural loosening category of distress, the best models were VGG13_16 and VGG13_64 (both 97.115%), while the worst model was MobileNetV3_16 (94.713%).
Combining the performance of each model during the training period and the prediction results on the test set enables a more comprehensive performance evaluation of the 18 network models. To provide a more intuitive analysis of some of the better-performing models and the models of interest, 16 visualized confusion matrices covering eight network architectures are shown in Figure 8, including SqueezeNet1_1_64, VGG13_16, VGG13_32, VGG13_64, and VGG16_64, which performed better in general; AlexNet_64, which performed better in interlayer debonding, interlayer water seepage, and interlayer loosening; InceptionV3_16 and DenseNet121_64, which performed better in interlayer water seepage and interlayer loosening; ResNet18_32, which performed better in residual networks; ResNet18_64, which performed better in structural loosening and interlayer debonding; ResNet50_16, which performed better in interlayer loosening and structural loosening; ResNet50_32, which performed better in interlayer water seepage and interlayer debonding; MobileNetV3_16, EfficientNetB2_64, and DenseNet169_16, which showed relatively poor performance overall.
The confusion matrices in Figure 8 show the detailed classification results of the models on the four categories of distress samples from the test set. The prediction results show that each category of distress was misclassified differently when a model made a classification error. The interlayer loosening category of distress was most likely to be predicted as interlayer water seepage, followed by interlayer debonding. The structural loosening category of distress was usually only misclassified as interlayer loosening. The interlayer debonding category of distress was most likely to be predicted as interlayer loosening and was usually not misclassified as structural loosening. The interlayer water seepage category of distress was most likely to be predicted as interlayer loosening, followed by interlayer debonding.
For each category of distress, ResNet50_16 performed best in classifying interlayer loosening, correctly classifying 2585 samples of this distress. AlexNet_64 and MobileNetV3_16 performed best in classifying structural loosening, both with only five samples misclassified. MobileNetV3_16 performed best in classifying interlayer debonding, with 3577 correctly classified samples. ResNet101_32 performed best in classifying interlayer water seepage, with 3028 correctly classified samples. However, ResNet101_32 performed poorly in classifying interlayer loosening, with 97 samples misclassified. DenseNet169_16 and EfficientNetB2_64 performed poorly in classifying structural loosening, with 11 samples misclassified. VGG13_16 performed poorly in classifying interlayer debonding, with 31 samples misclassified. ResNet50_16 performed worst in classifying interlayer water seepage, with 64 samples misclassified.

3.3. Grad-CAM Visualization Results

Figure 9 displays the Grad-CAM heat maps of the eight network architectures on the correctly classified representative samples of each category. This visualization highlights the differences in the models’ regions of interest for the images of the four categories of distress samples. The models were selected for the same reasons as when generating the confusion matrix. All models selected the final convolutional layer for the Grad-CAM operation. This is because the last convolutional layer is closest to the output layer in the network model, has a higher level of feature representation, and can more accurately reflect the distribution of the model’s attention to the image before classification prediction.
The attention regions of the models correlate with the damage characteristics, and there is some similarity in the attention of different models for the same sample, but there are also significant individual differences, as shown in Figure 9. For interlayer loosening, interlayer debonding, and interlayer water seepage, AlexNet_64, SqueezeNet1_1_64, and the four models of the VGG architecture showed minimal areas of attention. In contrast, InceptionV3_64, DenseNet169_16, MobileNetV3_16, and the five models of the ResNet architecture had larger areas of attention. For structural loosening, most models have a larger region of interest, particularly EfficientNetB2_64, except for SqueezeNet1_1_64.
Some typical and representative misclassified samples were selected to further explore the possible reasons for misclassification by these models by generating Grad-CAM heat maps, as shown in Figure 10. Obviously, when classifying these distress samples, the attention of some models is biased towards other regions compared with when they classify correctly; this shift in attention may be an important reason for misclassification. From the predicted categories and confidence values, sample-13 and sample-14 were misclassified as interlayer water seepage and interlayer debonding, respectively, by most models. Sample-15 and sample-16 were both predicted as interlayer loosening, and the confidence values of the incorrect predictions were relatively high. Most models incorrectly predicted sample-17 as interlayer water seepage and sample-18 as interlayer loosening. Sample-19 and sample-20 were also incorrectly predicted by the majority of models, and the other three distresses were all possible categories of incorrect predictions. These results align with the confusion matrices shown in Figure 8.

4. Analysis and Discussion

4.1. Training Phase

Figure 6 and Figure 7 show that all models have smooth training curves during the training process, while the validation curves exhibit fluctuations. In particular, the models with larger parameters, such as AlexNet and the VGG architecture, show a trend of decreasing and then increasing validation loss curves, which is a typical overfitting phenomenon. Decreasing the batch size exacerbates the overfitting phenomenon. The InceptionV3 and DenseNet architectures exhibit slight overfitting. It is worth noting that residual networks and the three lightweight models (MobileNet, EfficientNet, and SqueezeNet) show minimal overfitting, which is almost negligible, and that batch size has little effect on these models.
Several factors can lead to overfitting. Firstly, increasing model complexity generally enhances representational ability, but overfitting occurs when the dataset is small. Although the amount of data collected in this study is substantial for highway distress recognition, it remains small compared to public natural image datasets, so the lack of training data is one reason for model overfitting. Secondly, the dataset used in this study has relatively simple features, which may result in overfitting because an excessively complicated model can fit even minor variations in the data. ResNet and the three lightweight networks (MobileNet, EfficientNet, and SqueezeNet) exhibit minimal overfitting due to their specific network structures and relatively small sizes. The residual connection mechanism of ResNet allows for better information propagation and improves model performance and generalization by effectively addressing issues such as gradient vanishing and representation bottlenecks. The lightweight networks employ various techniques to decrease model complexity, including depthwise separable convolution, channel attention mechanisms, and network pruning; these strategies effectively reduce the number of parameters and the complexity of the model, mitigating the risk of overfitting.
Meanwhile, batch size is also a factor affecting overfitting. A smaller batch size increases the variance of the gradient estimate and can make model training unstable by introducing more randomness. Conversely, larger batches can speed up training and provide more accurate gradient estimates but require more memory. Choosing a larger batch size can help reduce overfitting, but hardware limitations cap the usable batch size. Data augmentation and optimization algorithms may address the overfitting problem in future work.

4.2. Testing Phase

All models were evaluated on an unseen test set containing 9496 distress samples, of which 2639 are interlayer loosening, 211 are structural loosening, 3588 are interlayer debonding, and 3058 are interlayer water seepage. The evaluation included several common metrics, such as precision, recall, and f1-score, as well as macro-average precision, macro-average recall, macro-average f1-score, and confusion matrices. In this paper, the f1-score and the macro-average f1-score serve as comprehensive evaluation metrics because they balance precision and recall. The evaluation results indicate that all models demonstrate strong recognition capabilities for all four categories of highway distress. To a certain extent, this result also shows that the distress features in the highway distress images from the surveyed region are indeed not as rich as those in natural images. In the evaluation, the best model was the lightweight network SqueezeNet1_1 trained at a batch size of 64. However, compared to VGG13, the best non-lightweight network in the evaluation, SqueezeNet1_1 took longer to converge. Nevertheless, owing to its small number of parameters, low computation, and small size, SqueezeNet1_1 may be more advantageous in real engineering scenarios with limited computational resources.
A qualitative analysis of the distress samples in each category revealed that interlayer debonding was the most readily identifiable distress, followed by interlayer water seepage and, finally, structural loosening. The main reason for this result is the imbalance in the number of samples across distress categories. In the dataset used, the category with the most samples is interlayer debonding, followed by interlayer water seepage, and finally structural loosening, which accounted for only 6% of the number of interlayer debonding samples. Therefore, this qualitative result was expected. However, a quantitative analysis shows that the differences in the evaluation metrics across categories are not large, particularly for structural loosening. The visualization of the PCA dimensionality reduction of the dataset explains this result. As shown in Figure 4, there was a slight overlap between samples of interlayer loosening and the other categories, but the distribution of samples for each distress category exhibited intra-class concentration and inter-class dispersion, particularly for structural loosening. This also explains why the structural loosening category, despite its very small number of samples, was still well identified by all models.
Comprehensively considering the evaluation metrics, it can also be found that some models have higher precision but lower recall, such as ResNet34_16 and VGG11_16, while others have lower precision but higher recall, such as MobileNetV3_16. Combined with the confusion matrix, the models with higher precision but lower recall rarely predict samples from other categories as the current category, but they also miss some samples of the current category. The models with lower precision but higher recall tend to capture most samples of the current category but may also misclassify samples from other categories into it. The main reasons for these results are the imbalance in the number of samples across the four distress categories and the inconsistent difficulty of recognizing them. A typical example is MobileNetV3_16, which has the lowest macro-average precision, relatively high macro-average recall, and the lowest macro-average f1-score of all the models. In Figure 8, MobileNetV3_16 has the highest number of correct identifications of structural loosening and interlayer debonding compared to the other models; however, it also has the highest number of samples from other categories misclassified into these two categories.
In the ILSVRC [37], there is a tendency to improve the model recognition performance by increasing the number of layers of the CNN. However, the experimental results in this paper show that the performance of the network models derived from two network architectures, VGG (VGG11, VGG13, VGG16, and VGG19) and ResNet (ResNet18, ResNet34, ResNet50, and ResNet101) on the GPR B-scan distress image dataset, exhibit different trends, i.e., the performance of the models tends to deteriorate as the number of CNN layers increases, and the best models in the test set are VGG13 and ResNet18. This phenomenon occurs because the GPR B-scan distress images have relatively few features and a small dataset. Therefore, for the dataset in this paper, a network model with relatively few layers should be selected. However, it is important to note that if the number of layers in the model is too small, it may not be able to effectively characterize the distress image features.

4.3. Grad-CAM Visualization

The Grad-CAM technique provides a deeper understanding of the model’s prediction process. By visualizing Grad-CAM heat maps for each layer of the model, it is possible to understand what has been learned and the differences at each stage of the model’s feature extraction. This paper selects the last convolutional layer with high feature representation for Grad-CAM heatmap visualization. Unlike other evaluation metrics, the Grad-CAM technique does not directly reveal the performance of a model, but it can reveal the differences in attention of each model, which can be used to guide model optimization.
By observing the Grad-CAM heat maps of all the models presented in Figure 9, it can be found that the attention paid by different CNN architectures to the same sample has some similarities, but there are also obvious differences among certain architectures. Specifically, the deeper network models, represented by ResNet and DenseNet, have stronger expressive ability and larger receptive fields due to the introduction of residual connections, dense connections, and other techniques, and they pay more attention to the global information of the distress image during the learning process. As a result, they show very similar attention and larger attention areas in the Grad-CAM heat maps. In contrast, network models with shallower layers, such as AlexNet and VGG, tend to focus more on detailed local features of the image. As a result, fewer regions of attention are shown in their Grad-CAM heat maps, and there is some variation in model attention. Similarly, EfficientNetB2 and MobileNetV3 have deeper network structures and more complex connections than SqueezeNet1_1, so they pay more attention to the global information of the distress image, again showing more regions of attention. In general, deeper models tend to display more regions of interest on the Grad-CAM heat map, extract high-level semantic features, and achieve better performance. However, Table 5 shows that some shallower network models, such as AlexNet, VGG13, and SqueezeNet1_1, outperform deeper models in the evaluation results. Possible reasons include the relatively small size of the GPR B-scan distress image dataset used in this paper, which may lead to overfitting of deeper models, and the fact that some of the shallower networks may focus not only on local features of the image but also on global features. Therefore, although the shallow network models show fewer regions of interest in the Grad-CAM heat maps, their overall performance may be relatively better.
Figure 10 shows some typical and representative misclassified samples. Taking sample-13 as an example, the models that are correctly classified with high confidence are MobileNetV3_16 (89.01%) and SqueezeNet1_1_64 (88.98%), while the models that are incorrectly classified with high confidence for the incorrect category are ResNet18_32 (97.84%) and ResNet18_64 (84.35%). The correctly categorized models directed their attention toward the key characteristics of interlayer loosening, whereas the misclassified models focused on features resembling those of interlayer water seepage. This shows that the probable cause of the models’ misclassification is the bias of the models’ attention.

5. Conclusions

This study selected classical CNN models for training and testing to determine a more applicable and efficient classification model for the characteristics of highway distress data collected in East China, particularly in Jiangsu and Zhejiang provinces. Evaluation metrics such as precision, recall, and f1-score were used to evaluate the models, and the confusion matrix and Grad-CAM techniques were used to analyze them and provide ideas for further research.
Four main categories of distress are present on highways in the region: interlayer debonding, interlayer loosening, interlayer water seepage, and structural loosening. The data were labeled by professionals and compiled into a dataset, which is mainly characterized by sample imbalance, no obvious hyperbolic features, similar features among interlayer distresses, and relatively simple features overall. This study aims to find the optimal CNN model for the given dataset by considering both model performance and complexity. Eighteen network models derived from eight different architectures were trained and evaluated, and the conclusions are summarized below.
  • All 18 models perform well. The VGG13 model stands out for its excellent classification performance at batch sizes of 16, 32, and 64. Among the lightweight networks, SqueezeNet1_1 performs particularly well at a batch size of 64;
  • For the identification of the four types of highway hidden distresses, the f1-score of interlayer debonding and interlayer water seepage is the highest, while the f1-score of interlayer loosening and structural loosening is slightly lower due to the similar characteristics of interlayer distress and the limited number of samples, respectively;
  • After a comprehensive analysis of the evaluation results, confusion matrix, and Grad-CAM heat map, this study concludes that the number of layers in a CNN model does not necessarily determine its ability to recognize highway hidden distresses in GPR B-scan images. Networks with fewer layers can achieve similar or even better performance than deeper, more complex networks while requiring fewer resources and being less constrained.
Automatic detection of highway hidden distress based on GPR is still a challenging task, and its detection accuracy is affected by a variety of factors. Among these, the classification accuracy of the distress samples is one of the most critical. Therefore, the findings of this study not only identify a suitable classification model for highway distress in this region but also provide valuable insights for further automated distress detection tasks. In particular, this study provides ample experimental evidence for selecting the backbone network for feature extraction. Moreover, according to China's highway construction standards, highways in different regions have similar structures, and CNN models can extract high-level semantic features, so the models evaluated in this paper have potential application value in other regions. Future work can be described as follows. Firstly, the classification model, a distress localization algorithm, and a foreground and background recognition method will be combined to establish an accurate and efficient highway distress detection scheme. Secondly, the classification models will be tested and analyzed with highway distress samples from other regions to determine their generalization ability. Thirdly, using knowledge distillation and other transfer learning methods, object detection models for natural images can be transferred to the field of highway distress detection. Lastly, distress detection models based on lightweight networks, such as SqueezeNet1_1, may be integrated into GPR acquisition systems for real-time detection and identification.

Author Contributions

Conceptualization, G.J. and C.L.; Data curation, Q.L.; Formal analysis, G.J. and C.L.; Funding acquisition, G.J. and C.L.; Investigation, G.J.; Methodology, Q.L. and W.C.; Project administration, G.J. and Q.L.; Resources, G.J. and W.C.; Software, Q.L.; Supervision, C.L.; Validation, Q.L., M.L. and C.L.; Visualization, Q.L.; Writing—original draft, G.J. and Q.L.; Writing—review and editing, G.J. and C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the project Highway Hidden Distresses Detection and Recognition of Jiangsu Sinoroad Engineering Technology Research Institute Co., Ltd., grant number JSZL-20200021.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to the corresponding author.

Acknowledgments

We sincerely acknowledge the engineering technicians from Jiangsu Sinoroad Engineering Technology Research Institute Co., Ltd. for conducting all the field work and collecting the GPR data.

Conflicts of Interest

Authors Guanglai Jin and Wenlong Cai were employed by the company Jiangsu Sinoroad Engineering Technology Research Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare that this study received funding from Jiangsu Sinoroad Engineering Technology Research Institute Co., Ltd. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

References

  1. Liu, C.; Du, Y.; Yue, G.; Li, Y.; Wu, D.; Li, F. Advances in automatic identification of road subsurface distress using ground penetrating radar: State of the art and future trends. Autom. Constr. 2024, 158, 105185.
  2. Todkar, S.S.; Le Bastard, C.; Ihamouten, A.; Baltazart, V.; Dérobert, X.; Fauchard, C.; Guilbert, D.; Bosc, F. Detection of debondings with ground penetrating radar using a machine learning method. In Proceedings of the 2017 9th International Workshop on Advanced Ground Penetrating Radar (IWAGPR), Edinburgh, UK, 28–30 June 2017; pp. 1–6.
  3. Liang, X.; Yu, X.; Chen, C.; Jin, Y.; Huang, J. Automatic classification of pavement distress using 3D ground-penetrating radar and deep convolutional neural network. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22269–22277.
  4. Chen, G.; Fu, L.; Chen, K.; Boateng, C.D.; Ge, S. Adaptive ground clutter reduction in ground-penetrating radar data based on principal component analysis. IEEE Trans. Geosci. Remote Sens. 2019, 57, 3271–3282.
  5. Ni, S.-H.; Huang, Y.-H.; Lo, K.-F.; Lin, D.-C. Buried pipe detection by ground penetrating radar using the discrete wavelet transform. Comput. Geotech. 2010, 37, 440–448.
  6. Tong, Z.; Gao, J.; Yuan, D. Advances of deep learning applications in ground-penetrating radar: A survey. Constr. Build. Mater. 2020, 258, 120371.
  7. Hou, F.; Rui, X.; Fan, X.; Zhang, H. Review of GPR activities in civil infrastructures: Data analysis and applications. Remote Sens. 2022, 14, 5972.
  8. Rasol, M.; Pais, J.C.; Pérez-Gracia, V.; Solla, M.; Fernandes, F.M.; Fontul, S.; Ayala-Cabrera, D.; Schmidt, F.; Assadollahi, H. GPR monitoring for road transport infrastructure: A systematic review and machine learning insights. Constr. Build. Mater. 2022, 324, 126686.
  9. Ayman, H.; Fakhr, M.W. Recent computer vision applications for pavement distress and condition assessment. Autom. Constr. 2023, 146, 104664.
  10. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
  11. Hou, F.; Shi, R.; Lei, W.; Dong, J.; Xu, M.; Xi, J. A review of target detection algorithm for GPR B-scan processing. J. Electron. Inf. Technol. 2020, 42, 191–200.
  12. Rasol, M.A.; Pérez-Gracia, V.; Fernandes, F.M.; Pais, J.C.; Santos-Assunçao, S.; Santos, C.; Sossa, V. GPR laboratory tests and numerical models to characterize cracks in cement concrete specimens, exemplifying damage in rigid pavement. Measurement 2020, 158, 107662.
  13. Alsharahi, G.; Bouami, M.F.; Faize, A.; Louzazni, M.; Khamlichi, A.; Atounti, M. Contribution of analysis and detection the risks appearing in roads using GPR method: A case study in Morocco. Ain Shams Eng. J. 2021, 12, 1435–1450.
  14. Dikmen, M. Investigating transfer learning performances of deep learning models for classification of GPR B-scan images. Trait. Du Signal 2022, 39, 1761–1766.
  15. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  16. Tong, Z.; Gao, J.; Han, Z.; Wang, Z. Recognition of asphalt pavement crack length using deep convolutional neural networks. Road Mater. Pavement Des. 2018, 19, 1334–1349.
  17. Todkar, S.S.; Baltazart, V.; Ihamouten, A.; Dérobert, X.; Guilbert, D. One-class SVM based outlier detection strategy to detect thin interlayer debondings within pavement structures using ground penetrating radar data. J. Appl. Geophys. 2021, 192, 104392.
  18. Zhang, J.; Yang, X.; Li, W.; Zhang, S.; Jia, Y. Automatic detection of moisture damages in asphalt pavements from GPR data with deep CNN and IRS method. Autom. Constr. 2020, 113, 103119.
  19. Yue, G.; Liu, C.; Li, Y.; Du, Y.; Guo, S. GPR data augmentation methods by incorporating domain knowledge. Appl. Sci. 2022, 12, 10896.
  20. Wang, B.; Chen, P.; Zhang, G. Simulation of GPR B-scan data based on dense generative adversarial network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3938–3944.
  21. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  22. Canziani, A.; Paszke, A.; Culurciello, E. An analysis of deep neural network models for practical applications. arXiv 2016, arXiv:1605.07678.
  23. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 3065386.
  24. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
  25. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  27. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
  28. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
  29. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
  30. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
  31. Wu, Y.; Yang, W.; Pan, J.; Chen, P. Asphalt pavement crack detection based on multi-scale full convolutional network. J. Intell. Fuzzy Syst. 2021, 40, 1495–1508.
  32. Li, J.; He, Z.; Li, D.; Zheng, A. Research on water seepage detection technology of tunnel asphalt pavement based on deep learning and digital image processing. Sci. Rep. 2022, 12, 11519.
  33. Martinez-Ríos, E.A.; Bustamante-Bello, R.; Navarro-Tuch, S.A. Generalized Morse wavelets parameter selection and transfer learning for pavement transverse cracking detection. Eng. Appl. Artif. Intell. 2023, 123, 106355.
  34. Meftah, I.; Hu, J.; Asham, M.A.; Meftah, A.; Zhen, L.; Wu, R. Visual detection of road cracks for autonomous vehicles based on deep learning. Sensors 2024, 24, 1647.
  35. Ma, D.; Fang, H.; Wang, N.; Xue, B.; Dong, J.; Wang, F. A real-time crack detection algorithm for pavement based on CNN with multiple feature layers. Road Mater. Pavement Des. 2022, 23, 2115–2131.
  36. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
  37. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
Figure 1. Map of the data collection area. The highways in Jiangsu Province are used as an example to show the highway structure in the area.
Figure 2. Comparison of GPR B-scan images of four types of highway distresses and core samples. The distresses include interlayer debonding (a,b), interlayer water seepage (c,d), interlayer loosening (e,f), and structural loosening (g,h).
Figure 3. Flowchart of dataset construction. Using GPR to collect raw data on-site (a); processing the raw data with methods such as dewow, start-time adjustment, and energy-decay compensation (b); extracting the B-scan images with more prominent distress characteristics and segmenting the B-scan image every twenty meters of the corresponding highway (c); handing all the data to professionals for distress labeling (d); obtaining images and labels of the four categories of distresses to build the dataset (e).
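As one hedged example of the preprocessing steps named in the caption, a dewow filter can be implemented as running-mean removal along each trace; the window length here is an assumed parameter, not a value from the paper:

```python
import numpy as np

def dewow(bscan: np.ndarray, window: int = 31) -> np.ndarray:
    """Subtract a running mean along each trace (column) to suppress
    low-frequency drift. bscan has shape (time_samples, traces)."""
    kernel = np.ones(window) / window
    out = np.empty_like(bscan, dtype=float)
    for j in range(bscan.shape[1]):
        trend = np.convolve(bscan[:, j], kernel, mode="same")
        out[:, j] = bscan[:, j] - trend
    return out

# Synthetic B-scan with a DC offset; after dewow the mean is close to zero.
data = np.random.randn(512, 200) + 5.0
print(dewow(data).mean())
```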
Figure 4. Three-dimensional distribution plot of the dataset after PCA dimensionality reduction. The plot visualizes the four categories of distress samples; each point represents one distress sample (interlayer debonding—green, interlayer loosening—blue, interlayer water seepage—purple, and structural loosening—yellow) (a). The axes of the plot are the projections of the dataset onto the three new principal components. For ease of observation, the sample points of interlayer loosening are hidden in (b).
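A sketch of the projection step behind Figure 4, with synthetic data standing in for the real distress images and scikit-learn's PCA assumed as the implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 100 flattened image samples (assumed 224x224 pixels).
images = np.random.rand(100, 224 * 224)

pca = PCA(n_components=3)           # keep the first three principal components
coords = pca.fit_transform(images)  # (100, 3) coordinates for the 3D plot
print(coords.shape, pca.explained_variance_ratio_)
```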
Figure 5. Samples of the four categories of distress resized to 224 × 224 from original images.
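The resizing step can be sketched with torchvision transforms (an assumption about tooling; a synthetic image stands in for a real B-scan crop):

```python
from PIL import Image
from torchvision import transforms

# Rescale any input image to 224x224 and convert it to a tensor.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
img = Image.new("L", (400, 300))  # placeholder for a GPR B-scan crop
tensor = preprocess(img)
print(tensor.shape)               # torch.Size([1, 224, 224])
```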
Figure 6. Training and validation accuracy curves for the four distress categories at batch sizes 16, 32, and 64: (a) AlexNet, InceptionV3, VGG13, and VGG16; (b) ResNet18, ResNet50, DenseNet169, and MobileNetV2; (c) EfficientNetB3 and SqueezeNet1_1.
Figure 8. Sixteen representative visualized confusion matrices covering the 8 network architectures. Each row and column of a matrix represents a different distress. The diagonal indicates the number of distress samples that the model correctly classified in the test set, while the off-diagonal positions indicate the number of samples that the model classified as one distress when it was actually another.
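A visualized confusion matrix of this kind can be produced, for example, with scikit-learn; the label and prediction arrays below are placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Placeholder test-set labels (0-3) and a model's predictions.
y_true = [0, 1, 2, 3, 0, 2, 1, 3]
y_pred = [0, 1, 2, 2, 0, 2, 1, 3]

# Render a confusion matrix similar in spirit to Figure 8.
ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, display_labels=["poor_l", "water_l", "loose_l", "loose_s"]
)
plt.show()
```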
Figure 9. Grad-CAM heat maps of correctly categorized representative samples of each category on 8 network architectures. For each distress, three representative samples were selected. The heat map displays the distribution of the model’s attention in the distress images, highlighting the differences in the regions of focus for the different categories. The model’s attention is indicated by warm colors, while cool colors indicate regions that have little impact on the classification prediction.
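A compact Grad-CAM sketch in PyTorch, following the gradient-weighted feature-map idea of Selvaraju et al. [36] rather than reproducing the authors' exact code (ResNet18 with random weights and a dummy input are used purely for illustration):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Grad-CAM: weight the last conv feature maps by the spatial mean of their
# gradients w.r.t. the top class score, then ReLU and upsample to a heat map.
model = models.resnet18(weights=None)  # random weights; illustration only
model.eval()

feats, grads = {}, {}
model.layer4[-1].register_forward_hook(
    lambda mod, inp, out: feats.update(v=out))
model.layer4[-1].register_full_backward_hook(
    lambda mod, gin, gout: grads.update(v=gout[0]))

x = torch.randn(1, 3, 224, 224)  # dummy stand-in for a B-scan image
score = model(x)[0].max()        # score of the predicted class
model.zero_grad()
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)  # pooled gradients
cam = F.relu((weights * feats["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
print(cam.shape)  # [1, 1, 224, 224] heat map to overlay on the input
```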
Figure 10. Grad-CAM heat map of some typical and representative misclassified samples on eight network architectures. The first column displays the images and true labels of the distress samples input to the model. The following columns show the Grad-CAM heat maps for each model for each sample. Below each image, the top two prediction categories and confidence values of the model’s prediction results are also labeled. Correctly classified results are labeled in green, and misclassified results are labeled in orange.
Table 1. Number of distress samples. The total number of samples for each type of distress is included, as well as the number of samples for each type of distress in the training, validation, and test sets.
Distress Category | Train Dataset | Val Dataset | Test Dataset | All Dataset
loose_l | 5275 | 879 | 2639 | 8793
loose_s | 418 | 69 | 211 | 698
poor_l | 7176 | 1196 | 3588 | 11,960
water_l | 6113 | 1018 | 3058 | 10,189
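Table 1 corresponds to roughly a 60/10/30 train/validation/test split per category; a stratified split with these proportions might be produced as follows (the file names are placeholders, and the authors' actual splitting procedure is not restated here):

```python
from sklearn.model_selection import train_test_split

# Placeholder file list with four balanced classes for illustration.
files = [f"img_{i}.png" for i in range(1000)]
labels = ["poor_l", "water_l", "loose_l", "loose_s"] * 250

# First split off 30% for testing, stratified by class...
train_f, test_f, train_y, test_y = train_test_split(
    files, labels, test_size=0.3, stratify=labels, random_state=0)
# ...then take 1/7 of the remaining 70% (i.e., 10% overall) for validation.
train_f, val_f, train_y, val_y = train_test_split(
    train_f, train_y, test_size=1 / 7, stratify=train_y, random_state=0)
print(len(train_f), len(val_f), len(test_f))  # 600 100 300
```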
Table 2. Performance of the models during training. The table records each model’s total parameters (in millions), FLOPs (in billions), size (in megabytes), and the average duration (in seconds) of one epoch. Both FLOPs and the average epoch duration are listed in ascending order of batch size (16/32/64). In each column, the four smallest values are marked in green, and the four largest values are marked in orange.
Models | Parameters (M) | FLOPs (G) | Sizes (MB) | Avg Epoch Time (s)
AlexNet | 57.00 | 10.61/21.22/42.45 | 217.46 | 60/51/47
InceptionV3 | 21.79 | 91.75/183.50/367.00 | 92.90 | 148/111/107
VGG11 | 128.78 | 120.76/241.51/483.02 | 491.26 | 88/81/78
VGG13 | 128.97 | 179.95/359.89/719.78 | 491.97 | 116/106/104
VGG16 | 134.28 | 246.53/493.07/986.14 | 512.22 | 138/129/122
VGG19 | 139.59 | 313.12/626.25/1252.49 | 532.48 | 159/149/142
ResNet18 | 11.17 | 27.92/58.84/111.67 | 42.62 | 65/58/48
ResNet34 | 21.28 | 57.59/115.19/230.37 | 81.18 | 80/61/58
ResNet50 | 23.51 | 64.85/129.70/259.39 | 89.68 | 96/79/76
ResNet101 | 42.50 | 124.57/249.14/498.29 | 162.13 | 145/113/110
DenseNet121 | 6.95 | 45.08/90.15/180.31 | 26.52 | 167/116/97
DenseNet169 | 12.48 | 53.69/107.39/214.77 | 47.63 | 212/142/123
MobileNetV2 | 2.23 | 5.10/10.21/20.42 | 8.50 | 93/64/52
MobileNetV3 | 1.52 | 0.93/1.85/3.70 | 5.80 | 92/63/50
EfficientNetB2 | 7.71 | 11.09/22.19/44.37 | 29.40 | 138/97/74
EfficientNetB3 | 10.70 | 16.15/32.29/64.58 | 40.82 | 150/119/89
SqueezeNet1_0 | 0.73 | 9.94/19.87/39.74 | 2.78 | 61/54/48
SqueezeNet1_1 | 0.72 | 3.98/7.96/15.92 | 2.76 | 63/53/48
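The "Parameters (M)" column can be approximated by counting tensor elements in the torchvision reference models; note that the default ImageNet classification heads make these counts slightly larger than the 4-class variants in the table, and FLOPs would additionally require a profiler such as the third-party thop or fvcore packages:

```python
from torchvision import models

# Count parameters of a few of the evaluated architectures. The default
# ImageNet heads are kept, so values differ slightly from Table 2's
# 4-class models (e.g., ResNet18: ~11.69 M here vs. 11.17 M in the table).
for name, ctor in [("ResNet18", models.resnet18),
                   ("MobileNetV2", models.mobilenet_v2),
                   ("SqueezeNet1_1", models.squeezenet1_1)]:
    m = ctor(weights=None)
    n_params = sum(p.numel() for p in m.parameters())
    print(f"{name}: {n_params / 1e6:.2f} M parameters")
```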
Table 3. Precision and macro-average precision of the prediction results of the 18 network models on the test set. The three values in each cell are listed in ascending order of batch size (16/32/64). In each column, the five highest values are labeled in green, and the five lowest values are labeled in orange. All values are expressed as percentages.
Models | Poor_l | Water_l | Loose_l | Loose_s | Macro-Avg Precision
AlexNet | 99.085/99.139/99.250 | 98.721/98.977/98.691 | 97.904/97.437/98.284 | 95.327/95.305/94.495 | 97.759/97.714/97.680
InceptionV3 | 99.056/99.383/98.784 | 98.819/98.914/98.400 | 97.912/96.713/98.083 | 96.209/94.860/97.596 | 97.999/97.467/98.216
VGG11 | 99.359/99.220/99.304 | 98.591/98.490/98.592 | 97.327/97.171/97.764 | 98.010/96.667/96.209 | 98.322/97.887/97.967
VGG13 | 99.274/99.469/99.221 | 98.211/98.406/98.083 | 97.683/97.800/97.863 | 98.537/97.585/98.537 | 98.426/98.315/98.426
VGG16 | 99.111/99.140/99.386 | 98.053/98.527/98.114 | 98.196/97.912/97.795 | 96.651/97.561/98.058 | 98.003/98.285/98.338
VGG19 | 99.249/99.165/98.840 | 98.152/97.430/98.593 | 98.309/98.376/98.274 | 95.305/97.561/95.755 | 97.754/98.133/97.865
ResNet18 | 99.139/99.277/99.195 | 98.118/98.337/98.466 | 98.155/97.795/98.164 | 95.305/97.578/94.907 | 97.679/98.245/97.683
ResNet34 | 99.139/98.892/99.001 | 97.559/98.021/98.363 | 98.755/98.151/97.494 | 96.172/97.585/98.522 | 97.907/98.162/97.790
ResNet50 | 99.166/99.030/99.001 | 99.008/98.820/98.524 | 97.071/97.982/97.608 | 94.860/96.172/96.667 | 97.526/98.001/97.950
ResNet101 | 99.386/99.029/98.921 | 98.654/97.740/98.654 | 97.508/98.718/97.907 | 93.578/93.578/97.561 | 97.282/97.267/98.261
DenseNet121 | 99.332/99.167/99.138 | 98.401/98.526/98.725 | 97.684/97.723/97.911 | 97.573/97.129/96.651 | 98.247/98.136/98.106
DenseNet169 | 98.890/98.649/99.029 | 97.954/98.780/98.719 | 97.853/97.567/97.462 | 98.039/97.087/97.115 | 98.184/98.021/98.081
MobileNetV2 | 98.675/99.195/99.194 | 98.334/98.713/98.529 | 98.077/97.286/97.838 | 96.209/95.735/97.561 | 97.824/97.732/98.280
MobileNetV3 | 98.758/98.918/98.730 | 98.496/97.706/98.777 | 98.533/98.412/97.390 | 91.964/94.860/97.573 | 96.938/97.474/98.117
EfficientNetB2 | 99.221/98.783/98.944 | 98.527/98.876/98.296 | 97.641/97.428/97.237 | 93.981/94.860/97.561 | 97.343/97.487/98.009
EfficientNetB3 | 99.414/99.166/98.998 | 98.302/98.079/98.556 | 97.497/97.821/97.129 | 94.393/95.327/97.101 | 97.401/97.598/97.946
SqueezeNet1_0 | 99.164/99.030/99.249 | 98.684/98.303/98.816 | 97.069/98.086/97.476 | 97.585/96.190/97.101 | 98.125/97.903/98.161
SqueezeNet1_1 | 99.138/99.002/98.948 | 99.138/98.527/98.849 | 98.089/97.906/97.801 | 98.030/97.087/97.596 | 98.360/98.131/98.298
Table 4. Recall and macro-average recall of the prediction results of the 18 network models on the test set. The three values in each cell are listed in ascending order of batch size (16/32/64). In each column, the five highest values are labeled in green, and the five lowest values are labeled in orange. All values are expressed as percentages.
Models | Poor_l | Water_l | Loose_l | Loose_s | Macro-Avg Recall
AlexNet | 99.638/99.498/99.582 | 98.463/98.038/98.627 | 97.347/97.954/97.651 | 96.682/96.209/97.630 | 98.033/97.925/98.372
InceptionV3 | 99.470/98.802/99.582 | 98.496/98.267/98.561 | 97.726/98.105/96.931 | 96.209/96.209/96.209 | 97.975/97.846/97.821
VGG11 | 99.359/99.220/99.443 | 98.365/98.136/98.430 | 97.954/97.613/97.764 | 93.365/96.209/96.209 | 97.261/97.794/97.961
VGG13 | 99.136/99.220/99.359 | 98.757/98.921/98.692 | 97.461/97.689/97.196 | 97.461/95.735/95.735 | 97.772/98.921/97.745
VGG16 | 99.443/99.554/99.275 | 98.823/98.398/98.692 | 96.931/97.726/97.461 | 95.735/94.787/95.735 | 97.733/97.616/97.791
VGG19 | 99.470/99.275/99.721 | 99.019/99.182/98.561 | 96.931/96.400/97.082 | 96.209/94.787/96.209 | 97.907/97.411/97.893
ResNet18 | 99.443/99.443/99.582 | 98.888/98.594/98.627 | 96.779/97.461/97.272 | 96.209/95.261/97.156 | 97.830/97.690/98.159
ResNet34 | 99.526/99.498/99.470 | 99.346/98.823/98.234 | 96.211/96.552/97.310 | 95.261/95.735/94.787 | 97.586/97.652/97.450
ResNet50 | 99.359/99.638/99.415 | 97.907/98.561/98.234 | 97.954/97.537/97.423 | 98.578/95.261/96.209 | 97.857/97.749/97.820
ResNet101 | 99.220/99.498/99.693 | 98.300/99.019/98.300 | 97.878/96.324/97.499 | 96.209/96.682/94.787 | 98.020/97.881/97.570
DenseNet121 | 99.415/99.498/99.359 | 98.627/98.332/98.725 | 97.499/97.575/97.689 | 95.261/96.209/95.735 | 97.700/97.903/97.877
DenseNet169 | 99.359/99.693/99.470 | 98.627/97.973/98.267 | 96.703/97.272/97.499 | 94.787/94.787/95.735 | 97.369/97.431/97.743
MobileNetV2 | 99.638/99.554/99.415 | 98.463/97.842/98.561 | 96.628/97.802/97.726 | 96.209/95.735/94.787 | 97.734/97.733/97.622
MobileNetV3 | 99.693/99.387/99.638 | 98.528/98.888/97.711 | 96.741/96.286/97.575 | 97.630/96.209/95.261 | 98.148/97.693/97.546
EfficientNetB2 | 99.443/99.498/99.192 | 99.463/97.776/98.103 | 97.234/97.613/97.347 | 96.209/96.209/94.787 | 97.691/97.774/97.357
EfficientNetB3 | 99.247/99.359/99.164 | 98.463/98.496/98.234 | 97.423/96.969/97.423 | 95.735/96.682/95.261 | 97.717/97.876/97.520
SqueezeNet1_0 | 99.192/99.582/99.415 | 98.071/98.528/98.234 | 97.878/97.120/98.067 | 95.735/95.735/95.261 | 97.719/97.741/97.744
SqueezeNet1_1 | 99.331/99.526/99.582 | 98.921/98.463/98.267 | 97.272/97.461/97.726 | 94.313/94.787/96.209 | 97.459/97.559/97.946
Table 5. F1-score and macro-average f1-score of the prediction results of the 18 network models on the test set. The three values in each cell are listed in ascending order of batch size (16/32/64). In each column, the five highest values are labeled in green, and the five lowest values are labeled in orange. All values are expressed as percentages.
Models | Poor_l | Water_l | Loose_l | Loose_s | Macro-Avg F1-Score
AlexNet | 99.361/99.318/99.416 | 98.592/98.505/98.659 | 97.625/97.695/97.966 | 96.000/95.755/96.037 | 97.894/97.818/98.019
InceptionV3 | 99.263/99.092/99.181 | 98.657/98.589/98.481 | 97.819/97.404/97.503 | 96.209/95.529/96.897 | 97.987/97.654/98.016
VGG11 | 99.359/99.220/99.373 | 98.478/98.313/98.511 | 97.639/97.391/97.764 | 95.631/96.437/96.209 | 97.777/97.840/97.964
VGG13 | 99.205/99.344/99.290 | 98.484/98.663/98.386 | 97.572/97.744/97.529 | 97.115/96.651/97.115 | 98.094/98.100/98.080
VGG16 | 99.277/99.346/99.331 | 98.436/98.462/98.402 | 97.559/97.819/97.628 | 96.190/96.154/96.882 | 97.866/97.945/98.061
VGG19 | 99.360/99.220/99.279 | 98.584/98.298/98.577 | 97.615/97.378/97.674 | 95.755/96.154/95.981 | 97.828/97.763/97.878
ResNet18 | 99.290/99.360/99.388 | 98.502/98.465/98.546 | 97.462/97.628/97.716 | 95.755/96.403/96.019 | 97.752/97.964/97.917
ResNet34 | 99.332/99.194/99.235 | 98.445/98.420/98.298 | 97.466/97.345/97.402 | 95.714/96.651/96.618 | 97.739/97.905/97.889
ResNet50 | 99.262/99.333/99.207 | 98.454/98.690/98.379 | 97.510/97.759/97.516 | 95.529/95.714/96.437 | 97.689/97.874/97.885
ResNet101 | 99.303/99.263/99.306 | 98.477/98.376/98.477 | 97.693/97.507/97.703 | 95.105/95.105/96.154 | 97.644/97.563/97.910
DenseNet121 | 99.373/99.332/99.248 | 98.514/98.429/98.725 | 97.592/97.649/97.800 | 96.403/96.667/96.190 | 97.970/98.019/97.991
DenseNet169 | 99.124/99.168/99.249 | 98.289/98.375/98.492 | 97.275/97.419/97.481 | 96.386/95.923/96.420 | 97.768/97.721/97.911
MobileNetV2 | 99.154/99.374/99.304 | 98.399/98.276/98.545 | 97.347/97.543/97.782 | 96.209/95.735/96.154 | 97.777/97.732/97.946
MobileNetV3 | 99.223/99.152/99.182 | 98.512/98.294/98.241 | 97.629/97.338/97.482 | 94.713/95.529/96.403 | 97.519/97.578/97.827
EfficientNetB2 | 99.332/99.139/99.068 | 98.495/98.323/98.200 | 97.437/97.520/97.292 | 95.082/95.529/96.154 | 97.586/97.628/97.678
EfficientNetB3 | 99.331/99.262/99.081 | 98.383/98.287/98.395 | 97.460/97.393/97.276 | 95.059/96.000/96.172 | 97.558/97.735/97.731
SqueezeNet1_0 | 99.178/99.305/99.331 | 98.376/98.416/98.524 | 97.472/97.601/97.771 | 96.651/95.962/96.172 | 97.919/97.821/97.950
SqueezeNet1_1 | 99.234/99.263/99.264 | 98.550/98.495/98.557 | 97.679/97.683/97.763 | 96.135/95.923/96.897 | 97.900/97.841/98.120
