1. Introduction
Under the long-term dynamic loads of trains and the coupled action of water and soil, the railway ballast structure is prone to typical defects such as mud pumping. As shown in Figure 1, mud pumping usually occurs when fine soil particles migrate upward under pore water pressure and accumulate at the ballast–subgrade interface, forming slurry-like deposits [1,2]. This process weakens the bearing capacity of the ballast and subgrade, causes uneven settlement, degrades track geometry, and may even lead to structural damage, posing risks to railway operational safety [3,4]. Therefore, early identification and accurate assessment of mud pumping are important for condition monitoring and maintenance decision-making of railway infrastructure [5]. Traditional inspection of railway defects mainly relies on manual surveys and local excavation with sampling. These methods are inefficient, labor intensive, and highly subjective, and they cannot support continuous and non-contact monitoring [6,7].
Ground penetrating radar (GPR) is a non-destructive subsurface exploration technique based on the principles of electromagnetic wave propagation and reflection. By recording the electromagnetic signals reflected at interfaces between different media, GPR enables continuous imaging of underground structures [8], as illustrated in Figure 2. Owing to its high detection efficiency, fine spatial resolution, and non-invasive nature, GPR has been widely applied in areas such as road and railway structural assessment, underground void detection, pipeline localization, and geological exploration [9,10]. In railway engineering, GPR is commonly employed for evaluating ballast structures, identifying layer interfaces, and detecting subsurface defects, providing a crucial tool for infrastructure health monitoring [11]. Due to the pronounced heterogeneity of subsurface media, electromagnetic waves are susceptible to attenuation, scattering, and multipath effects during propagation. Consequently, defects such as ballast fouling and mud pumping often appear in GPR B-scan profiles as weak reflections with blurred boundaries and irregular structures, posing significant challenges for intelligent identification.
Compared with conventional road structures or general subsurface inspection scenarios, the railway track system constitutes a more complex and highly structured electromagnetic propagation environment. The periodic arrangement formed by rails, sleepers, and ballast materials induces significant electromagnetic scattering and multipath propagation effects. Furthermore, under high-speed operating conditions, vehicle-mounted GPR detection introduces additional factors such as coupling noise, dynamic multipath interference, and signal energy attenuation. This multi-source coupling effect leads to a markedly low signal-to-noise ratio in railway GPR B-scan profiles, where anomalous targets typically manifest as weak reflections, blurred boundaries, and structurally unstable responses. As a result, the effectiveness of traditional detection methods that rely primarily on spatial morphological features is substantially degraded in this context. Early studies approached this problem from the perspective of signal physics and systematically investigated the role of frequency-domain information in characterizing ballast conditions. Shihab et al. conducted time–frequency representation analysis of GPR echoes to achieve quantitative assessment of ballast degradation [12]. Al-Qadi et al. employed the Short-Time Fourier Transform (STFT) to perform joint time–frequency characterization of air-coupled GPR data, enabling effective evaluation of ballast layer thickness and detection of water accumulation [13]. They further compared Discrete Wavelet Transform (DWT) with STFT and demonstrated that wavelet analysis provides superior capability for quantitative prediction of contamination levels [14]. Xiao and Liu proposed a multi-frequency fusion method based on forward and inverse S-transform, improving the resolution of deep subgrade defects [15]. These studies collectively indicate that for subsurface anomalies that are difficult to distinguish based on spatial morphology alone, the key differences are often reflected in spectral composition and energy distribution. Therefore, frequency-domain information provides a more physically meaningful basis for discrimination.
Although the importance of frequency-domain features has been well established, recent deep learning–based methods for railway GPR detection still predominantly focus on spatial structure or temporal information modeling. Teng et al. developed a semi-supervised learning framework that integrates convolutional enhancement operations with the DETR object detection model, enabling defect detection in railway ballast under limited labeled GPR profile data [6]. Liu et al. addressed the high-dimensional redundancy and multi-channel temporal characteristics of vehicle-mounted GPR detection data by combining a convolutional neural network (CNN) with a recurrent neural network (RNN). They proposed a CRNN model to jointly exploit waveform features and temporal dependencies, improving detection efficiency while reducing model complexity [16]. Yang et al. tackled the spatial correlation among multiple A-scans in ballast thickness estimation by integrating a temporal convolutional network with a graph neural network, achieving accurate prediction of the interlayer interface [17]. In another study, Yang et al. employed a generative adversarial network (GAN) for data augmentation and combined it with a semi-supervised YOLO framework to alleviate sample imbalance, thereby improving subgrade defect detection performance [18]. However, these approaches fundamentally rely on learning spatial or temporal patterns and lack explicit modeling of spectral information. This limitation becomes particularly critical in the detection of mud pumping. Such anomalies typically appear in GPR profiles as irregularly distributed regions with blurred boundaries and weak reflection intensity. During deep convolution and downsampling processes, these features are highly susceptible to attenuation or even loss due to feature compression, leading to missed detections. Moreover, their spatial manifestations are highly similar to those of water-rich anomalies or loose structural zones. Existing time–frequency analysis studies have shown that the intrinsic differences among these anomaly types are primarily reflected in their spectral response patterns. Therefore, detection models that rely solely on spatial features struggle to achieve effective discrimination.
Although previous studies have validated the significance of frequency-domain information for urban road GPR signal discrimination from a time–frequency analysis perspective [19], the modeling paradigm based on time–frequency information remains insufficiently explored for the specific task of mud pumping detection in railway scenarios. More importantly, most existing approaches adopt a feature fusion strategy, where frequency-domain information is treated as an independent representation and directly incorporated into target description. However, this fusion paradigm exhibits inherent limitations in railway GPR applications. Frequency-domain features essentially characterize the response intensity of signals across different frequency bands, and their representation is fundamentally different from that of spatial features. Direct concatenation or element-wise fusion of these heterogeneous features at the feature level can introduce conflicts due to inconsistent statistical distributions, thereby interfering with effective spatial structure modeling. In fact, the primary role of frequency-domain information is not to directly describe the target itself, but to capture the differential spectral response patterns among various anomaly types and thereby provide prior guidance on which frequency bands are more discriminative. Therefore, rather than treating frequency-domain information as an equivalent feature for fusion, a more principled approach is to model it as a modulation signal that guides spatial features to focus on responses in critical frequency bands. This strategy preserves the representational capacity of spatial structures while enhancing the model’s ability to discriminate weak-reflection anomalies such as mud pumping.
To address these challenges, this paper proposes a time–frequency collaborative object detection framework, termed YOLO-DGW, for railway mud pumping detection. In railway GPR, the horizontal axis of a B-scan represents the spatial displacement of the antenna along the survey line, while the vertical axis corresponds to the travel time of the electromagnetic echoes. Conventional convolutional detectors operating on such images can therefore capture only spatial–temporal features in the image domain. To address this limitation, YOLO-DGW builds upon YOLOv8 by introducing wavelet decomposition to map signals from the time domain into the frequency domain. This transformation is further utilized to construct modulation priors, thereby enhancing the model’s capability to recognize anomalies under complex conditions. The main contributions of this paper are as follows:
- (1) A cross-group attention–driven spatial feature enhancement method is proposed, which strengthens local anomaly responses while preserving the structural continuity of GPR profiles, thereby improving the representation of weak-reflection targets.
- (2) A wavelet decomposition–based frequency-domain modulation mechanism is developed, embedding spectral information as a modulation prior into the feature learning process to guide frequency-sensitive spatial feature representation.
- (3) A scale-adaptive loss function, A-CIoU, is designed to enhance the localization stability of the model for detecting mud pumping across diverse scales.
2. Principle of YOLO-DGW
To address the challenges of weak reflections, poor structural continuity, and significant scale variations in ballast fouling and mud pumping in GPR B-scans under complex subsurface environments, this study proposes an intelligent detection network, YOLO-DGW, which integrates frequency-domain information with spatial structural modeling, based on the single-stage YOLOv8 framework [20,21]. The overall architecture of YOLO-DGW is illustrated in Figure 3. The network retains the typical Backbone–Neck–Detection Head design of the YOLO series. The Backbone is responsible for multi-scale feature extraction and incorporates Cross-Group Attention Convolution (CGAConv) modules to enhance spatial structure modeling. The Neck fuses features across different scales, while the Detection Head performs object classification and bounding box regression.
2.1. Spatial Feature Modeling Module Based on Cross-Group Attention
In GPR B-scan profiles, mud pumping generally exhibits a certain degree of continuous structural distribution along the ballast, with locally enhanced energy responses. This dual characteristic of structural continuity and localized energy concentration imposes high demands on feature modeling. The network must capture the overall continuity of echo structures while highlighting the localized anomaly responses. Conventional convolutional neural networks (CNNs) typically handle structural modeling and local enhancement within a unified feature channel, causing different types of information to be mixed in the same feature space. As a result, strong local responses can overshadow structural information, especially under low signal-to-noise ratio (SNR) conditions or complex backgrounds, reducing the network’s discriminative capability for anomalous structures.
To overcome this limitation, this study proposes the CGAConv module. CGAConv separates the feature channels into groups, modeling structural information and local enhancement independently, and employs a cross-group attention mechanism to facilitate information interaction. This approach preserves structural continuity while strengthening local anomaly responses, as shown in Figure 4.
Let the input feature map be denoted as $X \in \mathbb{R}^{B \times C \times H \times W}$, where B, C, H and W represent the batch size, number of channels, height, and width, respectively. The input features are first projected through the standard convolution in YOLOv8 to obtain the base feature representation:
$$F = \mathrm{Conv}(X)$$
Conv denotes the standard YOLOv8 convolution operation, consisting of a convolution layer, batch normalization, and an activation function. To enhance local feature representation, depthwise convolution is applied to the base feature, yielding the enhanced feature:
$$F_e = \mathrm{DWConv}(F)$$
DWConv denotes depthwise convolution with the number of groups equal to the number of channels, kernel size k = 5, and stride s = 1. This operation enlarges the receptive field and captures local contextual information. F represents the structure-preserving feature, while $F_e$ denotes the locally enhanced feature. To enable information interaction between the two feature groups, global average pooling (GAP) is applied to each group to construct global feature descriptors, producing outputs in $\mathbb{R}^{B \times C \times 1 \times 1}$. Subsequently, channel-wise mean compression is performed to obtain one scalar global descriptor per group, and the two scalars are concatenated.
A lightweight mapping function, implemented using two 1 × 1 convolutions (which act as fully connected layers on the pooled descriptor) with activation functions and followed by a Sigmoid activation, is then employed to learn the cross-group attention weights. The learned weights are used to recalibrate the two feature groups, and the recalibrated features are concatenated along the channel dimension.
To further enhance feature representation, the concatenated feature is split into two groups along the channel dimension and processed with lightweight convolution and residual fusion.
Here, Split(⋅) denotes equal partitioning along the channel dimension and Concat(⋅) denotes channel-wise concatenation. Through the above cross-group attention mechanism, the model can adaptively adjust the weighting between structure-preserving features and enhanced features according to the input data, achieving a dynamic balance between structural information and local discriminative information under different scenarios. Compared with conventional convolutional structures, CGAConv not only preserves the spatial continuity of GPR echo structures but also strengthens local anomaly responses, thereby improving detection stability and robustness in complex subsurface environments.
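One way to write the cross-group attention steps in symbols is sketched below. Only F, F_e, GAP, Concat and the Sigmoid come from the description above; the names $g$, $g_e$, $z$, $\phi$, $w$, $w_e$, $\tilde{F}$, $\tilde{F}_e$ and $F_c$ are assumptions introduced for illustration, and the final split and residual fusion step is left out because its exact form is not specified here.

```latex
\begin{aligned}
g &= \mathrm{GAP}(F), \qquad g_e = \mathrm{GAP}(F_e), \\
z &= \big[\operatorname{mean}_{c}(g),\ \operatorname{mean}_{c}(g_e)\big], \\
[\,w,\ w_e\,] &= \sigma\big(\phi(z)\big), \\
\tilde{F} &= w \odot F, \qquad \tilde{F}_e = w_e \odot F_e, \\
F_c &= \mathrm{Concat}\big(\tilde{F},\ \tilde{F}_e\big).
\end{aligned}
```

In this reading, $w$ and $w_e$ are one scalar per sample for each group, broadcast over channels and spatial positions, so the module rebalances the two groups as a whole rather than reweighting individual channels.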
2.2. Wavelet Frequency-Domain Modulation Guided Spatial Feature Recalibration
During GPR signal propagation, different types of subsurface media induce varying degrees of attenuation and scattering due to their distinct electromagnetic properties. Physically, this is manifested as differences in spectral energy distribution. Conventional detection methods that rely solely on spatial-domain features often struggle to effectively distinguish anomalies with similar spatial morphology. Motivated by this observation, this paper proposes a Wavelet-Guided Feature Modulation (WGF) module, as illustrated in Figure 5. The module introduces frequency-domain statistical information to modulate spatial features, thereby guiding the network to learn more discriminative feature responses without explicitly altering the structural form of feature representations.
Let the input feature map be $X \in \mathbb{R}^{B \times C \times H \times W}$. The Haar wavelet is adopted as the basis function for the DWT, and a single-level wavelet decomposition is independently applied to each channel. This process is implemented via four fixed 2 × 2 convolution kernels with a stride of s = 2 and no additional padding, producing four sub-band responses, each of size $B \times C \times \frac{H}{2} \times \frac{W}{2}$. The four sub-bands correspond to the low-frequency component (LL) and the high-frequency components (LH, HL, HH), respectively. The transformation is performed independently for each channel, with no cross-channel convolution or mixing. Next, global statistical pooling is applied to the wavelet coefficients to construct a spectral descriptor. By performing average pooling over both spatial and channel dimensions, the global response of each sub-band is obtained as a single value per sample. These four responses are concatenated to form a descriptor vector, which captures the global energy distribution of the input features across different frequency bands. A two-layer fully connected network, acting as a nonlinear mapping function with output dimension 2C, is then employed to generate channel-wise modulation parameters. Its output is split into a scale term and a shift term of C channels each, which are broadcast to the spatial dimensions. The spatial features are subsequently modulated via a channel-wise affine transformation.
During training, the WGF module leverages frequency-domain statistical information extracted via wavelet decomposition to generate the modulation parameters, which are jointly optimized with the network parameters under the supervision signal. Notably, this process does not explicitly model or classify frequency-domain features; instead, it influences the response distribution of spatial features through a modulation mechanism.
In summary, the WGF module establishes a collaborative relationship between the frequency and spatial domains: spatial features are responsible for representing structural information and semantic distributions, while frequency-domain information forms a modulation prior through global statistics, guiding the recalibration of spatial features. This design enhances the robustness and discriminative capability of the model in complex subsurface environments.
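The data flow of the WGF module can be illustrated with a minimal pure-Python sketch for a single sample. It assumes orthonormal Haar kernels scaled by 1/2, one common LH/HL naming convention, and an affine form `gamma * x + beta`; the function names, the kernel scaling, and the affine form are illustrative assumptions, and the learned two-layer mapping from the descriptor to the modulation parameters is deliberately left out.

```python
# Fixed 2x2 Haar analysis kernels, applied with stride 2 and no padding.
# The 1/2 scaling and the LH/HL naming are assumptions for this sketch.
HAAR_KERNELS = {
    "LL": ((0.5, 0.5), (0.5, 0.5)),
    "LH": ((0.5, 0.5), (-0.5, -0.5)),
    "HL": ((0.5, -0.5), (0.5, -0.5)),
    "HH": ((0.5, -0.5), (-0.5, 0.5)),
}
BAND_ORDER = ("LL", "LH", "HL", "HH")


def haar_dwt2(channel):
    """Single-level 2D Haar decomposition of one channel (rows of equal, even length)."""
    height, width = len(channel), len(channel[0])
    bands = {name: [] for name in BAND_ORDER}
    for i in range(0, height - 1, 2):
        row_out = {name: [] for name in BAND_ORDER}
        for j in range(0, width - 1, 2):
            for name in BAND_ORDER:
                k = HAAR_KERNELS[name]
                row_out[name].append(
                    k[0][0] * channel[i][j] + k[0][1] * channel[i][j + 1]
                    + k[1][0] * channel[i + 1][j] + k[1][1] * channel[i + 1][j + 1]
                )
        for name in BAND_ORDER:
            bands[name].append(row_out[name])
    return bands


def spectral_descriptor(feature_map):
    """Average each sub-band over channels and space: [LL, LH, HL, HH] for one sample."""
    totals = {name: 0.0 for name in BAND_ORDER}
    count = 0
    for channel in feature_map:
        bands = haar_dwt2(channel)
        for name in BAND_ORDER:
            totals[name] += sum(sum(row) for row in bands[name])
        count += len(bands["LL"]) * len(bands["LL"][0])
    return [totals[name] / count for name in BAND_ORDER]


def modulate(feature_map, gamma, beta):
    """Channel-wise affine modulation y = gamma[c] * x + beta[c] (assumed form)."""
    return [
        [[gamma[c] * x + beta[c] for x in row] for row in channel]
        for c, channel in enumerate(feature_map)
    ]
```

In a trained network, `gamma` and `beta` would be produced from the four-element descriptor by the two-layer fully connected mapping; here they are passed in directly so that the decomposition and the modulation step can be inspected in isolation.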
2.3. Bounding Box Regression Loss
In practical engineering environments, ballast fouling and mud pumping exhibit significant scale variations across different developmental stages. Early-stage defects are typically small with blurred boundaries, whereas advanced-stage defects cover larger spatial extents. Using a uniform bounding box regression constraint can lead to unstable localization for small targets or insufficient boundary fitting for large targets, thereby degrading detection accuracy. To address this, a scale-adaptive loss function, termed A-CIoU, is proposed. It weights the center distance loss and the aspect ratio consistency loss v, both computed between the predicted bounding box $b_p$ and the true bounding box $b_{gt}$, with two scale-dependent functions. The scale factor s is defined as the ratio between the ground truth box area $A_t$ and the total area of a single B-scan profile $A_{\max}$:
$$s = \frac{A_t}{A_{\max}}$$
The two weighting functions are designed in a linearly complementary manner to dynamically balance the contributions of the localization loss and the shape loss.
The optimization rationale of this design is as follows. For small-scale anomalies (s→0), the weight of the center distance term approaches 1, assigning a higher weight to center point regression. Since small targets occupy only a few pixels in GPR profiles, minor variations in width and height can cause significant fluctuations in IoU. By prioritizing center alignment, the predicted bounding box can rapidly lock onto the core region of the target. For large-scale anomalies (s→1), the weight of the aspect ratio term increases, shifting the optimization focus toward shape fitting. Well-developed anomalies typically exhibit pronounced horizontal or vertical extensions. Strengthening the influence of the aspect ratio constraint term v enables the model to more accurately capture the spatial morphology of anomalies, thereby reducing underfitting issues along the boundaries of large targets.
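To make the scale-dependent weighting concrete, the sketch below implements a CIoU-style loss in which the center distance term and the aspect ratio term are weighted by `1 - s` and `s`. This pair is only the simplest linearly complementary choice that matches the limiting behaviour described above; it is an assumption, as are the function name `a_ciou_sketch` and the `(x1, y1, x2, y2)` box format, and it should not be read as the exact A-CIoU definition.

```python
import math


def a_ciou_sketch(pred, gt, image_area, eps=1e-9):
    """Scale-weighted CIoU-style loss for boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Squared center distance over squared diagonal of the smallest enclosing box.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
        + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 \
        + (max(py2, gy2) - min(py1, gy1)) ** 2 + eps

    # Aspect ratio consistency term v and its trade-off factor, as in standard CIoU.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    # Scale factor s = A_t / A_max, clipped to [0, 1].
    s = min(1.0, (gw * gh) / image_area)
    w_center, w_shape = 1.0 - s, s  # ASSUMED linearly complementary weights

    return 1 - iou + w_center * rho2 / c2 + w_shape * alpha * v
```

With these assumed weights, a small ground-truth box in a 640 × 640 profile gives `s` close to 0, so the loss is dominated by the IoU and center distance terms, while a box covering most of the profile shifts the emphasis to the aspect ratio term.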
3. Experimental Procedure and Results Analysis
To evaluate the detection accuracy and generalization ability of YOLO-DGW under different geological conditions, systematic experiments were conducted using a unified dataset and consistent training settings. Ablation experiments were performed to analyze the contribution of each key module to the detection performance. Comparative experiments with several detection models were carried out to verify the advantages of the proposed method over mainstream object detection algorithms.
3.1. Experimental Setup and Evaluation Metrics
The experimental data were collected from field GPR measurements on three representative railway trunk lines in China: the Shanghai–Kunming (SH-KM) Line, the Xinxiang–Yanzhou (XX-YZ) Line, and the Beijing–Baotou (BJ-BT) Line. These railway lines are located in eastern, central, and northern regions of China, respectively. They cover different geographical environments and subsurface conditions and are therefore regionally representative for railway engineering inspection. The selected railway sections are prone to mud pumping. These areas usually have a relatively high groundwater level. Under long-term train loading and water-related processes, mud pumping is likely to occur.
As shown in Figure 6, the GPR data were acquired using a vehicle-mounted GPR system equipped with an air-coupled antenna with a center frequency of 2 GHz. The relatively high center frequency provides higher vertical resolution, making it suitable for the fine characterization of shallow subsurface structures and defects. Along the track direction, the spatial sampling interval was set to 0.16 m. For each trace, the echo signal was recorded within a time window of 90 ns, yielding a total of 512 sampling points. During data acquisition, the inspection vehicle operated at a speed of 140–160 km/h.
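The acquisition parameters above imply a time sampling interval and a trace rate that can be checked with a few lines of arithmetic. The sketch assumes that the 512 samples span the 90 ns window uniformly (interval taken as window divided by sample count) and that one trace is recorded every 0.16 m at constant speed; the constant and function names are illustrative.

```python
TIME_WINDOW_NS = 90.0      # recording window per trace
SAMPLES_PER_TRACE = 512    # sampling points per trace
TRACE_SPACING_M = 0.16     # spacing between traces along the track


def sampling_interval_ns(window_ns=TIME_WINDOW_NS, n_samples=SAMPLES_PER_TRACE):
    """Time between consecutive samples, assuming uniform sampling over the window."""
    return window_ns / n_samples


def traces_per_second(speed_kmh, spacing_m=TRACE_SPACING_M):
    """Number of traces recorded per second at a given vehicle speed."""
    return (speed_kmh / 3.6) / spacing_m
```

Under these assumptions the sampling interval is about 0.176 ns, and the system records roughly 243 traces per second at 140 km/h and about 278 traces per second at 160 km/h.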
In GPR data, standing wave interference is one of the primary sources of background noise, typically appearing as high-intensity horizontal stripes along the trace direction. To mitigate this interference, a mean background removal method was applied to the radar profiles. Specifically, regions strongly affected by horizontal interference were selected, and their traces were averaged to estimate the background noise, which was then subtracted trace by trace from the original data to suppress stable horizontal interference components [22]. To compensate for signal energy attenuation with increasing depth during electromagnetic wave propagation, gain processing was applied [23]. For 16-bit GPR data, the gain range was controlled within an absolute value of 25,000 to avoid excessive noise amplification. In addition, to suppress both high-frequency and low-frequency noise, a one-dimensional band-pass filtering operation was performed [24], with the frequency range set from 800 MHz to 3 GHz. This preserves the effective reflection signal band while removing irrelevant noise components.
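The mean background removal step can be sketched as follows. The B-scan is represented as a list of traces, each a list of samples, and the indices of the traces used to estimate the background are supplied by the caller; this data layout and the function name are assumptions made for illustration, and the gain and band-pass steps are not covered.

```python
def remove_mean_background(bscan, background_traces):
    """Subtract the average of selected traces from every trace of a B-scan.

    bscan: list of traces, each a list of samples of equal length.
    background_traces: indices of traces dominated by horizontal interference.
    """
    n_samples = len(bscan[0])
    # Average the selected traces sample by sample to estimate the background.
    background = [
        sum(bscan[t][k] for t in background_traces) / len(background_traces)
        for k in range(n_samples)
    ]
    # Subtract the background estimate trace by trace.
    return [[trace[k] - background[k] for k in range(n_samples)] for trace in bscan]
```

A horizontal stripe has the same amplitude at a given sample index in every trace, so it is cancelled by the subtraction, while a response present in only a few traces is left largely intact.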
All collected data were validated through field engineering investigations. The mud pumping regions were confirmed via on-site defect surveys, borehole sampling, and engineering records, ensuring that the annotated anomalous regions in the GPR B-scan profiles accurately reflect real defect conditions. Based on radar characteristics and validated engineering results, the mud pumping areas were manually annotated. The annotation process relied primarily on expert knowledge and was further guided by the typical interpretation criteria for mud pumping in B-scan profiles specified in the “Technical Specification for Vehicle-Mounted GPR Detection of Ballasted Track” issued by the China Academy of Railway Sciences. Specifically, within the target depth range, anomalous regions typically exhibit continuous or locally discontinuous high-amplitude in-phase reflection wave groups or coherent reflectors. These may appear as relatively smooth reflective interfaces or exhibit undulating patterns resembling “peaks” or “hat-like” shapes. The annotation process was carried out by experts with extensive experience in railway engineering and GPR interpretation, and underwent multiple rounds of review to ensure accuracy. The final dataset consists of 3925 mud pumping sample images, covering defect characteristics of varying scales and morphological patterns. Among them, 1065 samples were collected from the SH-KM line, 1446 from the XX-YZ line, and 1414 from the BJ-BT line. To ensure model generalization, the data from each railway line were split into training, validation, and test sets in a ratio of 8:1:1. These subsets were then merged to form unified training, validation, and test sets for model training and performance evaluation.
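The per-line split followed by merging can be sketched as below. The sample counts are those reported above; the random shuffling, the fixed seed, the rounding of subset sizes, and the sample identifiers are assumptions made for illustration.

```python
import random

LINE_COUNTS = {"SH-KM": 1065, "XX-YZ": 1446, "BJ-BT": 1414}


def split_line(sample_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle one line's samples and cut them into train/val/test subsets."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    # The test subset takes the remainder, so no sample is dropped.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def build_splits(line_counts=LINE_COUNTS, seed=0):
    """Split each line 8:1:1 separately, then merge the subsets across lines."""
    merged = {"train": [], "val": [], "test": []}
    for line, count in line_counts.items():
        ids = [f"{line}_{i:04d}" for i in range(count)]  # hypothetical sample IDs
        train, val, test = split_line(ids, seed=seed)
        merged["train"].extend(train)
        merged["val"].extend(val)
        merged["test"].extend(test)
    return merged
```

Splitting each line before merging keeps the 8:1:1 proportion within every line, so no single line dominates the validation or test set.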
To comprehensively evaluate the performance of the proposed model on the single-class object detection task, Precision, Recall, F1-score, and Average Precision (AP) are adopted as the primary evaluation metrics. Precision measures the proportion of correctly detected targets among all detected targets [25], while Recall quantifies the proportion of ground truth targets that are correctly detected [26]. Their definitions are given in Equations (20) and (21).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{20}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{21}$$
TP denotes the number of correctly detected targets, FP the number of false positives, and FN the number of false negatives. The F1-score is used to provide a comprehensive evaluation of the model by jointly considering Precision and Recall [27], and its definition is given in Equation (22).
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{22}$$
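The three count-based metrics can be computed directly from TP, FP and FN, as in the short sketch below. The function names and the convention of returning 0.0 when a denominator is zero are illustrative choices.

```python
def precision(tp, fp):
    """Fraction of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of ground truth targets that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

For example, with hypothetical counts of 8 true positives, 2 false positives and 2 false negatives, precision and recall are both 0.8 and so is the F1-score.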
AP is adopted to evaluate the matching quality between predicted and ground-truth bounding boxes in this single-class detection task. AP is computed as the area under the precision–recall curve at different recall levels, with the IoU threshold set to 0.5 [28]. The definition of IoU is given in Equation (23). To assess detection performance under varying localization strictness, AP is further reported at multiple IoU thresholds, including AP@0.20, AP@0.30, AP@0.40, and AP@0.50, corresponding to IoU thresholds of 0.20, 0.30, 0.40, and 0.50, respectively.
$$\mathrm{IoU} = \frac{|b_p \cap b_{gt}|}{|b_p \cup b_{gt}|} \tag{23}$$
where $b_p$ and $b_{gt}$ denote the predicted bounding box and the ground truth bounding box.
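A minimal sketch of IoU and AP for a single class is given below. It assumes boxes in `(x1, y1, x2, y2)` format, that each detection has already been marked as a true or false positive by matching it to the ground truth at the chosen IoU threshold, and that the area under the precision–recall curve is taken with the all-point interpolated (monotone envelope) precision; the interpolation scheme and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for detections ranked by confidence."""
    if num_ground_truth == 0 or not scores:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum precision times recall increment.
    ap, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap
```

Reporting AP@0.20 through AP@0.50 then amounts to repeating the true/false positive assignment with a different IoU threshold and calling `average_precision` on each resulting set of flags.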
3.2. Model Training Performance
This study is implemented using the PyTorch 2.1.0 deep learning framework. The hardware configuration includes an Intel(R) Core(TM) i7-14650HX processor, an NVIDIA GeForce RTX 4060 GPU, and 128 GB of RAM. The software environment consists of Python 3.10 and CUDA Toolkit 12.1. During training, the number of epochs is set to 300, the batch size is 16, and the input image size is 640 × 640. The AdamW optimizer is adopted for parameter optimization, with an initial learning rate of 0.01. In addition, an early stopping strategy is introduced, where training is automatically terminated if no improvement is observed on the validation set for 100 consecutive epochs.
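The early stopping rule can be sketched as a small counter around the validation metric. The sketch assumes that "improvement" means a strictly higher validation score than the best seen so far; the class name `EarlyStopping` and the choice of metric are illustrative, while the values in `TRAIN_CONFIG` are those stated above.

```python
TRAIN_CONFIG = {
    "epochs": 300,
    "batch_size": 16,
    "image_size": 640,
    "optimizer": "AdamW",
    "initial_lr": 0.01,
    "patience": 100,
}


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=TRAIN_CONFIG["patience"]):
        self.patience = patience
        self.best = float("-inf")
        self.epochs_without_improvement = 0

    def step(self, validation_metric):
        """Record one epoch's validation metric; return True if training should stop."""
        if validation_metric > self.best:
            self.best = validation_metric
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

With a patience of 100 and a budget of 300 epochs, training can only stop early if the best validation score was reached at or before epoch 200.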
Figure 7 illustrates the variation trend of the loss function during training of the YOLO-DGW model. The loss decreases rapidly within the first 50 epochs, followed by a steady convergence phase. After approximately 250 epochs, the curve becomes stable with no obvious oscillation or overfitting behavior. Ultimately, the model loss converges robustly to a low level.
3.3. Comparative Experiments and Case Validation
3.3.1. Experimental Setup and Quantitative Evaluation
To ensure fairness and scientific rigor in the evaluation process, a comprehensive benchmark system covering multiple network architectures is constructed. The comparison methods comprise the single-stage detectors YOLOv5 [29], YOLOv10 [30], and YOLOv11 [31], the classical two-stage detector Faster R-CNN [32], and an improved YOLOv8-based model with channel attention (YOLOv8+SE) [33]. All models are trained under a unified hardware platform, and identical training strategies are adopted to eliminate the influence of hyperparameter settings on experimental results. To reduce the impact of randomness introduced by weight initialization during deep learning training, all experiments are repeated independently five times, and the results are reported in the form of “mean ± standard deviation” to improve statistical reliability.
To further evaluate detection performance under different geological conditions, three major railway trunk lines, the SH-KM Line, XX-YZ Line, and BJ-BT Line, are selected to construct the test sets. Precision, Recall, F1-score, and AP@0.20, AP@0.30, AP@0.40, and AP@0.50 are used as evaluation metrics for quantitative assessment. The detection results of all models on the three test sets are presented in Table 1.
The results demonstrate that YOLO-DGW achieves the best performance among the compared models on all three test sets. Taking the SH-KM Line as an example, the proposed method achieves a Precision of 55.57%, Recall of 70.97%, and F1-score of 62.07%, with an AP@0.50 of 61.87%. Compared with the second-best model YOLOv11, improvements of 7.13% and 2.90% are obtained in terms of F1-score and AP@0.50, respectively. On both the XX-YZ Line and BJ-BT Line, YOLO-DGW maintains consistently high performance, with F1-score values exceeding 62.8%. In addition, the proposed model exhibits relatively lower standard deviations across all comparison groups, indicating strong robustness under heterogeneous geological conditions.
To further verify the statistical significance of performance improvements, significance tests are conducted between YOLO-DGW and all comparison models. Considering potential differences in means and variances across models, Welch’s t-test (two-tailed) is adopted to evaluate whether the null hypothesis, “no significant difference among models”, can be rejected. The significance level is set to p < 0.05.
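Welch’s t-test compares two sets of repeated runs without assuming equal variances. The sketch below computes the test statistic and the Welch–Satterthwaite degrees of freedom using only the standard library; the function name is illustrative, and the two-tailed p-value is not computed here because it requires the Student t distribution.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-sample variance of the mean, using the unbiased sample variance.
    var_mean_a = statistics.variance(sample_a) / n_a
    var_mean_b = statistics.variance(sample_b) / n_b
    standard_error_sq = var_mean_a + var_mean_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(standard_error_sq)
    df = standard_error_sq ** 2 / (
        var_mean_a ** 2 / (n_a - 1) + var_mean_b ** 2 / (n_b - 1)
    )
    return t, df
```

With five runs per model, each sample passed to `welch_t` would hold the five values of one metric for one model. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the two-tailed p-value.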
The results, shown in Table 2, Table 3 and Table 4, indicate that on all evaluation metrics (Precision, Recall, F1-score, and AP@0.50) across the three datasets, YOLO-DGW achieves highly significant differences (p < 0.001) compared with Faster R-CNN. Compared with YOLOv5, YOLOv8+SE, and YOLOv10, statistically significant improvements (p < 0.01) are consistently observed. Even when compared with the strongest baseline, YOLOv11, all key metrics still pass the significance test (p < 0.05), confirming the robustness of the improvements.
Figure 8 and Figure 9 further characterize the detection behavior of the model from the perspectives of dynamic threshold response and PR curves, respectively. From the Precision variation with respect to AP thresholds, it can be observed that YOLO-DGW consistently achieves the highest precision across different detection requirements. From the PR curve trajectories, the curve of YOLO-DGW encloses those of all comparison models. Notably, even in high-recall regions, the proposed model maintains a relatively high precision level, demonstrating a strong balance between precision and recall.
3.3.2. Typical Case Comparison and Visualization
To intuitively validate the practical detection performance of different models on railway GPR profiles, two representative mud pumping scenarios are selected for comparative analysis. The visualization results are shown in Figure 10. In the figure, bounding boxes of different colors are used to mark the defect regions detected by each model, enabling evaluation of their localization accuracy and recognition capability in complex radar images.
In Case A (Figure 10a1–a7), the target region exhibits a typical weak-reflection anomaly. The internal echo energy is significantly lower than that of the surrounding stratified structures, while the boundaries show irregular diffusion patterns, resulting in overall low signal contrast. The comparison results indicate that YOLOv5 produces obvious missed detections in this region, accompanied by noticeable bounding box shifts. Faster R-CNN is able to detect the anomalous area; however, the predicted bounding box is overly large and fails to tightly fit the actual defect region. YOLOv8+SE shows partial response to the anomaly, but its localization accuracy remains suboptimal. YOLOv10 also suffers from missed detection, while YOLOv11 exhibits bounding box displacement and fails to accurately cover the defect boundary. In contrast, YOLO-DGW accurately localizes the weak-reflection anomaly, with detection results highly consistent with the actual defect boundary.
In Case B (Figure 10b1–b7), the mud pumping defect is physically superimposed with surrounding stratified structures. Its echo response is severely obscured by a strong overlying reflective interface, manifesting as severe local waveform distortion and disruption of coherent reflector continuity, which significantly increases detection difficulty. Under such complex background interference, most baseline models perform poorly. YOLOv5 completely misses the target in this scenario. Faster R-CNN only identifies partial anomalous regions without full coverage. YOLOv8+SE produces detection responses but with unstable bounding boxes. YOLOv10 generates false detections, while YOLOv11 shows bounding box offset and fails to accurately align with the defect boundary. In comparison, YOLO-DGW is capable of identifying multiple anomalous regions under complex background interference, producing complete and accurately localized detection results.
3.3.3. Field Excavation Validation
To provide a preliminary validation of the model’s detection results from a physical perspective, field excavation verification experiments were conducted on representative railway sections. By comparing the model inference results derived from GPR data with the actual distribution of anomalies revealed through on-site excavation, a qualitative analysis was carried out in terms of spatial localization consistency and morphological correspondence of the detected anomalies.
In this field verification, three representative sections with mud pumping anomalies were selected, with actual defect lengths of approximately 0.8 m, 0.6 m, and 1.9 m, respectively.
Figure 11 presents a full comparison for each section, including the original B-scan images, expert manual annotations, model predictions, and corresponding field excavation results. The proposed model localizes the defect regions in all three sections, and its predictions are highly consistent with both manual annotations and field observations in terms of spatial position and morphological distribution. The mileage errors for the detection results in
Figure 11c–f are both 0.3 m, while the error for
Figure 11g,h is 0.4 m, so all mileage errors are within 0.4 m.
3.3.4. Cross-Line Generalization Capability Evaluation
To further assess the generalization capability of the proposed method under different railway line conditions, mud pumping samples from the Beijing–Jiujiang Railway (BJ-JJ) were introduced as an independent test set. This dataset contains a total of 498 representative anomaly samples, covering variations in burial depth, anomaly scale, and background noise interference. Unlike the datasets from the SH-KM, XX-YZ, and BJ-BT lines used for model training, the BJ-JJ dataset was not involved in any training or validation process and was used exclusively for final performance evaluation after model training. The data acquisition system, antenna parameters, sampling intervals, and preprocessing pipeline for the BJ-JJ dataset were kept consistent with those of the training datasets, thereby eliminating the influence of system-level discrepancies on the experimental results.
The quantitative evaluation results on the BJ-JJ independent test set are presented in
Table 5. The proposed method achieved a Precision of 57.12%, a Recall of 70.84%, and an F1-score of 63.24%. The average precision at IoU thresholds of 0.20, 0.30, 0.40, and 0.50 reached 74.05% (AP@0.20), 70.17% (AP@0.30), 66.48% (AP@0.40), and 62.31% (AP@0.50), respectively. These results indicate that the model maintains stable detection performance even on previously unseen railway line data.
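As a consistency check on the reported metrics, the F1-score can be recomputed from the reported Precision and Recall. The short sketch below uses only the standard harmonic-mean definition of F1; the function name `f1_score` is introduced here for illustration and is not taken from the original implementation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the BJ-JJ independent test set, in percent.
precision, recall = 57.12, 70.84
f1 = f1_score(precision, recall)
print(round(f1, 2))  # 63.24, matching the reported F1-score
```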
3.4. Ablation Experiments and Module Validation
3.4.1. Ablation Setup and Quantitative Evaluation
To systematically evaluate the contribution and synergistic effects of each improved module in the YOLO-DGW framework for railway mud pumping detection, progressive ablation experiments are conducted on three test sets: the SH-KM Line, XX-YZ Line, and BJ-BT Line. Starting from the original YOLOv8 baseline, five comparative variants are constructed by progressively integrating CGAConv, WGF, and the A-CIoU loss function. All ablation experiments are conducted under identical training strategies and hyperparameter settings. Performance is quantitatively evaluated using Precision, Recall, F1-score, and AP under multiple IoU thresholds, in order to objectively assess the contributions of each module in improving feature extraction capability, localization accuracy, and robustness to interference.
Table 6 provides a detailed record of performance evolution across different ablation stages.
Experimental results show that each proposed module contributes positively to the baseline model. After introducing the A-CIoU loss function, the AP@0.50 improves by approximately 2–3% across all railway lines, demonstrating the effectiveness of the optimized regression strategy in refining defect boundary localization. The incorporation of CGAConv further enhances spatial feature interaction, leading to a steady improvement in F1-score. Meanwhile, the WGF module significantly improves Recall by reweighting features through frequency-domain information, particularly under complex background conditions. Notably, when all modules are integrated into the full YOLO-DGW framework, a significant synergistic improvement is observed. On the SH-KM Line, the F1-score reaches 62.07%, representing an improvement of 12.2% over the baseline model.
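The AP@0.20 to AP@0.50 metrics used in these comparisons depend on whether a predicted box overlaps a ground-truth box by at least the stated IoU threshold. The following minimal sketch shows the standard IoU computation and how a single prediction can count as a match at looser thresholds but not at 0.50. The box coordinates are hypothetical and chosen only for illustration; the sketch does not reproduce the A-CIoU loss or the authors' evaluation code.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes in pixel coordinates (illustration only).
ground_truth = (100, 40, 200, 90)
prediction = (130, 45, 240, 95)

overlap = iou(ground_truth, prediction)
# A prediction is a candidate true positive at threshold t if IoU >= t.
matches = {t: overlap >= t for t in (0.20, 0.30, 0.40, 0.50)}
print(round(overlap, 3), matches)
```

In this example the shifted box still matches at thresholds of 0.20 to 0.40 but not at 0.50, which is why improvements in box regression tend to show up most clearly in the stricter AP metrics.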
Statistical significance tests presented in
Table 7,
Table 8 and
Table 9 confirm that YOLO-DGW exhibits significant differences (
p < 0.01) compared with all intermediate ablation variants across all key metrics, indicating that the gains from the proposed improvements are systematic rather than incidental.
3.4.2. Visual Validation of the WGF Module
To further verify the feature representation capability of the WGF module under complex backgrounds, a comparative visualization of feature evolution before and after module integration is conducted, as shown in
Figure 12. This experiment consists of three components: (1) the original B-scan images with manual annotations as the baseline reference (
Figure 12a1–a3); (2) an ablation comparison group, where the WGF module is replaced with a standard convolutional block of equivalent parameter scale, and both its input feature maps (
Figure 12b1–b3) and output feature maps (
Figure 12c1–c3) during forward propagation are recorded; and (3) the WGF experimental group, where the input features before modulation (
Figure 12d1–d3) and the corresponding output features after WGF transformation (
Figure 12e1–e3) are visualized.
In terms of feature representation, when significant interlayer structural disturbances exist in the B-scan images, the results from the ablation group show that after processing with standard convolution, the reflection signals from stratified layers and mud pumping anomalies are heavily entangled, with no clear separability in energy response. In contrast, the WGF-based results exhibit a marked change in feature distribution. Compared with the pre-processing state, which is dispersed and mixed, the output feature maps show a substantial attenuation of interlayer reflection signals, while the waveform features corresponding to mud pumping regions are preserved and further emphasized, achieving effective separation between defect signals and background interference at the feature level.
In B-scan images containing culvert structures, the ablation model, due to the lack of frequency-domain guidance, produces strong responses at high-intensity reflections on culvert walls in
Figure 12c3, incorrectly identifying them as potential defect regions. In comparison, the WGF model (
Figure 12e3) effectively suppresses strong scattering features generated by the culvert structure through frequency-weight reallocation, thereby avoiding the misclassification of non-defect structures as anomalies.
3.5. Inference Efficiency and Engineering Applicability
To validate the deployability of the proposed algorithm in practical railway GPR inspection systems, a comprehensive evaluation is conducted from three perspectives: inference speed, post-processing overhead, and real-time engineering capability. On the model side, considering only forward inference, YOLO-DGW achieves an average processing time of 6.4 ms per frame, corresponding to a throughput of approximately 156 FPS. It should be clarified that the DWT employed in the WGF module is not implemented via any external signal processing library. Instead, it is realized as convolutions with fixed Haar wavelet filters. This introduces no additional learnable parameters, and its computational cost is significantly lower than that of standard convolutional layers, so its impact on the overall inference overhead is negligible. In practical deployment scenarios, detection outputs must undergo non-maximum suppression (NMS) and defect region aggregation to transform discrete bounding boxes into structured continuous defect segments. Experimental results show that NMS requires an average of 0.9 ms per frame, while region aggregation takes approximately 1.2 ms per frame, resulting in a total post-processing overhead of 2.1 ms per frame.
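To illustrate how a single-level 2D Haar DWT can be expressed as convolutions with fixed filters, the following sketch applies four constant 2×2 kernels with stride 2 to a small array. It is a minimal pure-Python illustration under stated assumptions, not the authors' WGF implementation: the kernel scaling (1/2), the sub-band names, and the toy input are choices made here, and sub-band naming conventions differ between libraries.

```python
# Fixed 2x2 Haar analysis kernels (orthonormal scaling of 1/2); no learnable weights.
HAAR_FILTERS = {
    "LL": [[0.5, 0.5], [0.5, 0.5]],     # local average
    "LH": [[0.5, 0.5], [-0.5, -0.5]],   # difference between adjacent rows
    "HL": [[0.5, -0.5], [0.5, -0.5]],   # difference between adjacent columns
    "HH": [[0.5, -0.5], [-0.5, 0.5]],   # diagonal difference
}


def conv2d_stride2(image, kernel):
    """2x2 cross-correlation with stride 2 on a list-of-lists image."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            acc = 0.0
            for i in range(2):
                for j in range(2):
                    acc += kernel[i][j] * image[r + i][c + j]
            out_row.append(acc)
        out.append(out_row)
    return out


def haar_dwt2(image):
    """Single-level 2D Haar decomposition into four half-resolution sub-bands."""
    return {name: conv2d_stride2(image, k) for name, k in HAAR_FILTERS.items()}


# Toy input with purely horizontal layering (each row is constant).
image = [
    [1, 1, 1, 1],
    [3, 3, 3, 3],
    [1, 1, 1, 1],
    [3, 3, 3, 3],
]
bands = haar_dwt2(image)
for name, band in bands.items():
    print(name, band)
```

For this toy input, all energy falls into the average band and the row-difference band, while the column-difference and diagonal bands are zero. This shows, in a simplified setting, how laterally continuous layering and laterally varying features separate into different sub-bands.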
In summary, the overall end-to-end processing time of the system is approximately 8.5 ms per frame, yielding a total throughput of about 117 FPS. In vehicle-mounted high-speed inspection scenarios, this efficiency can robustly support continuous GPR data stream processing at speeds up to 150 km/h, enabling real-time analysis of approximately 150 km of railway data per hour.
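The region aggregation step described in this section turns discrete bounding boxes into continuous defect segments. One simple way to do this is to merge the along-track extents of the retained boxes whenever they overlap or lie close together. The sketch below is an assumption-laden illustration rather than the authors' procedure: the function name `aggregate_segments`, the representation of boxes as (start, end) positions in metres, the gap tolerance `max_gap_m`, and the example values are all hypothetical.

```python
def aggregate_segments(intervals, max_gap_m=0.0):
    """Merge 1-D intervals (start_m, end_m) that overlap or lie within max_gap_m."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap_m:
            # Extend the current segment to cover this detection.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Start a new segment.
            merged.append([start, end])
    return [tuple(seg) for seg in merged]


# Hypothetical along-track extents of boxes kept after NMS, in metres.
detections = [(12.0, 12.5), (12.4, 13.1), (20.0, 20.6), (13.3, 13.8)]
segments = aggregate_segments(detections, max_gap_m=0.25)
print(segments)  # [(12.0, 13.8), (20.0, 20.6)]
```

Sorting dominates the cost, so the step scales as O(n log n) in the number of boxes per frame, which is consistent with a small per-frame overhead when only a few boxes survive NMS.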
4. Discussion
This section systematically discusses the performance sources, applicability, and potential limitations of the proposed YOLO-DGW model based on the comparative experiments, visualization analysis, and field validation presented in
Section 3, with the aim of revealing its effectiveness and engineering significance in railway GPR defect detection tasks.
From the quantitative results in
Table 1, YOLO-DGW achieves consistently superior performance in Precision, Recall, F1-score, and multi-threshold AP across all three railway trunk lines, while maintaining low standard deviations over five independent runs. These results imply two key points: (1) the model exhibits strong cross-region generalization ability across different geological environments (SH-KM Line, XX-YZ Line, and BJ-BT Line), indicating that its feature extraction mechanism is not overfitted to a specific stratigraphic structure; (2) the relatively small performance variance demonstrates strong robustness to random initialization and training perturbations. Combined with the statistical significance results in
Table 2,
Table 3 and
Table 4, YOLO-DGW shows statistically significant or highly significant improvements over all comparison models. Notably, even when compared with the strongest baseline (YOLOv11), the proposed method consistently maintains significant advantages, indicating that performance gains are not incidental but stem from systematic architectural improvements. The PR curves and dynamic threshold analysis (
Figure 8 and
Figure 9) further demonstrate that the superiority of YOLO-DGW is not limited to single-point metrics but is reflected in overall detection behavior. Its PR curve consistently envelops those of all comparison models across the full recall range, and it maintains relatively high precision even in high-recall regions. This indicates that the model effectively suppresses false negatives while controlling false positives, achieving a superior precision–recall trade-off. This property is particularly critical for mud pumping detection, where minimizing missed detections is often prioritized while keeping false alarms under control.
In the qualitative case studies shown in
Figure 10, YOLO-DGW demonstrates strong capability in modeling complex GPR signals. In weak-reflection scenarios, it successfully extracts meaningful anomalies from low signal-to-noise ratio backgrounds, avoiding the missed detections commonly observed in YOLOv5 and YOLOv10. In strongly interfered scenarios, the model effectively distinguishes mud pumping from layered structures and culverts, demonstrating strong robustness against structural interference. In contrast, although Faster R-CNN provides some detection capability, its region proposal mechanism tends to introduce background redundancy in GPR applications, while standard YOLO-based models suffer from insufficient feature representation under complex waveform conditions.
From the ablation study in
Table 6, the performance contributions of each component in YOLO-DGW can be clearly interpreted. The A-CIoU loss function mainly improves bounding box regression accuracy, leading to consistent gains across AP metrics. CGAConv enhances channel–spatial interaction, improving feature discriminability. The WGF module significantly boosts Recall by reweighting features using frequency-domain information, particularly in complex backgrounds. Importantly, the improvements from each module are not simply additive but exhibit clear synergistic effects after integration, resulting in a substantial increase in F1-score. This observation is also supported by statistical tests, where significant differences are observed between the full model and all intermediate variants.
The visualization results in
Figure 12 further validate the mechanism of the WGF module. Conventional convolution is prone to interference from stable layered reflections in GPR data, leading to feature entanglement between background structures and defect signals. In contrast, the WGF module introduces a frequency-domain weight modulation mechanism to recalibrate responses across different frequency bands. This effectively suppresses feature activations associated with stable background structures at the level of statistical representation, while guiding the network to produce stronger responses in non-stationary signal regions such as mud pumping anomalies. Visualization results show that, after incorporating the WGF module, the feature maps exhibit more pronounced energy concentration patterns. Signals that were previously difficult to disentangle in the spatial domain become more discriminative after frequency-domain modulation, leading to a qualitative enhancement in target separability. This observation indicates that the WGF module leverages multi-band frequency information as modulation weights, enabling the network to selectively attend to different frequency components during feature extraction. From an experimental standpoint, it demonstrates that joint time–frequency modeling can effectively improve the robustness of weak target representations under complex background conditions, providing supportive qualitative evidence for distinguishing structural interference from stochastic anomalies.
As shown in the field excavation validation in
Figure 11, the predictions of YOLO-DGW are generally consistent with the actual anomaly distribution in terms of spatial localization, and exhibit good morphological correspondence in typical regions. This serves as a qualitative proof-of-concept, preliminarily confirming the effectiveness of the proposed method in capturing the physical characteristics of mud pumping in real engineering scenarios. A closer inspection reveals that the predicted regions tend to slightly over-expand compared to the actual excavation boundaries, reflecting the uncertainty in delineating target extents under complex subsurface conditions. This discrepancy mainly arises from the diffraction-induced signal dispersion of electromagnetic waves in heterogeneous ballast, as well as the blurred physical transition zones formed by moisture migration at the anomaly boundaries. At present, this boundary deviation has not been quantitatively modeled and remains a limitation of the study. Future work will address this issue by introducing physics-informed constraints to achieve more accurate boundary convergence. It should also be noted that, due to practical constraints, the number of excavation samples is limited. To mitigate potential missed detections, future efforts will focus on extending continuous inspection mileage and integrating multi-source data, thereby further enhancing model robustness and improving boundary delineation accuracy.
In terms of efficiency, YOLO-DGW achieves a forward inference time of 6.4 ms per frame and an end-to-end processing time of 8.5 ms per frame, corresponding to a throughput of approximately 117 FPS. This is sufficient to support real-time continuous GPR data stream analysis at vehicle speeds of around 150 km/h. This efficiency is related to the optimized feature representation mechanism: spatial enhancement reduces redundant detections, frequency-domain modulation suppresses background activations, and scale-adaptive regression reduces post-processing correction overhead. Together, these components achieve a unified optimization of computational efficiency and detection accuracy, forming a high-efficiency and high-precision framework rather than a simple acceleration strategy.
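The timing figures above can be cross-checked with a short calculation. The sketch below recomputes the end-to-end latency and throughput from the reported per-stage times, and additionally derives the distance travelled at 150 km/h during one frame's processing time. That last quantity is not reported in the text; it is derived here under the assumption that frames are processed sequentially, and it indicates the minimum track length each frame would need to cover for processing to keep pace with acquisition.

```python
FORWARD_MS = 6.4       # forward inference per frame (reported)
NMS_MS = 0.9           # non-maximum suppression per frame (reported)
AGGREGATION_MS = 1.2   # region aggregation per frame (reported)
SPEED_KMH = 150.0      # inspection speed (reported)

post_ms = NMS_MS + AGGREGATION_MS
end_to_end_ms = FORWARD_MS + post_ms
forward_fps = 1000.0 / FORWARD_MS
end_to_end_fps = 1000.0 / end_to_end_ms

# Distance the vehicle travels while one frame is being processed
# (derived here; assumes strictly sequential frame processing).
speed_m_per_s = SPEED_KMH * 1000.0 / 3600.0
metres_per_frame_time = speed_m_per_s * end_to_end_ms / 1000.0

print(f"post-processing: {post_ms:.1f} ms, end-to-end: {end_to_end_ms:.1f} ms")
print(f"forward: {forward_fps:.2f} FPS, end-to-end: {end_to_end_fps:.2f} FPS")
print(f"distance per frame time at {SPEED_KMH:.0f} km/h: {metres_per_frame_time:.3f} m")
```

Under these assumptions the vehicle advances roughly 0.35 m per processed frame, so any frame spanning more track than this leaves headroom for real-time operation.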
Despite the significant performance improvements, the Recall of YOLO-DGW remains at approximately 0.70, indicating that some missed detections persist in complex GPR scenarios and marking an explicit performance boundary. Based on quantitative results and case analysis, failure cases can be categorized into three main situations: (1) in early-stage or low-moisture conditions, mud pumping exhibits low-amplitude and low-contrast signals in B-scans, which are easily mixed with background noise, resulting in insufficient separability; although WGF enhances frequency-domain discrimination to some extent, it still struggles when signal strength approaches the noise floor; (2) in the presence of strong reflective structures such as ballast layering or sleepers, weak defect signals may be masked or distorted by the high-energy background, weakening both spatial and spectral consistency; (3) mud pumping inherently exhibits gradual boundary transitions without clear geometric separation, making it difficult for supervised learning to establish stable decision boundaries and leading to boundary under-detection or localization drift.
In summary, the main failure modes of YOLO-DGW are concentrated in three typical GPR challenges: low signal-to-noise weak targets, strong structural occlusion, and fuzzy boundaries. This indicates that target observability remains a fundamental constraint on performance under extremely complex propagation conditions. Future work may focus on low-SNR feature enhancement, physics-informed modeling, and cross-domain adaptive learning to further improve robustness and generalization capability.