1. Introduction
Ground-Penetrating Radar (GPR), an advanced non-destructive testing (NDT) technology, is renowned for its high-precision localization, rapid scanning, operational flexibility, and superior detection accuracy [1]. The technique finds broad application, from soil contamination assessment [2] and underground pipeline detection [3] to sediment detection [4] and subsurface exploration [5]. By emitting high-frequency electromagnetic waves and analyzing their reflections, GPR enables the accurate identification of subsurface anomalies [6]. Effective target localization enhances the interpretation of GPR data, thereby facilitating the accurate mapping of subsurface features. The advantages of precise localization are substantial: it enables the identification of underground voids, which is crucial for assessing potential hazards such as sinkholes or soil instability [7], and it ensures reliable mapping of utility networks, minimizing excavation risks during construction [8]. Such capabilities make GPR indispensable for infrastructure safety and geotechnical engineering.
Conventional GPR image recognition predominantly depends on template matching and image processing techniques. Liu et al. [9] applied the Sobel operator to detect hyperbolic edges in GPR images, while Li et al. [10] leveraged randomized Hough transforms for automated root identification. Maas et al. [11] adopted the Viola–Jones learning algorithm to constrain regions of interest, thereby reducing computational overhead in Hough transform-based localization. Sharafeldin et al. [12,13] deployed a total of 10 electrical resistivity imaging (ERI), 26 shallow seismic refraction (SSR), and 19 GPR survey lines across the Giza Plateau, and performed integrated inversion to build a three-layer subsurface model that accurately characterizes the groundwater aquifer and its water table depth. Luo et al. [14] introduced a Laser Dynamic Deflectometer method that uses vehicle-mounted laser Doppler sensors to capture pavement deflection velocity anomalies as indicators of subsurface cavities, enabling rapid, non-invasive road network screening. Recently, machine learning has revolutionized hyperbolic feature recognition in GPR data. Dou et al. [15] developed a Connected Component Clustering (C3) algorithm to isolate hyperbolas within target regions, whereas Zhang et al. [16] proposed a symmetry-based method for root detection, enabling radius estimation via multidirectional feature extraction. Nevertheless, these approaches, whether traditional or neural network-based, suffer from inherent limitations: manual parameter tuning, suboptimal training efficiency, and restricted detection accuracy.
Current methodologies primarily leverage deep learning to improve detection accuracy and computational efficiency, and are broadly categorized into two-stage and single-stage approaches. Two-stage detectors initially identify regions of interest through a candidate region generation step, followed by precise classification and bounding box regression on these regions to achieve high-precision object detection [17]. Pham et al. [18] employed Faster Region-based Convolutional Neural Networks (Faster R-CNN) [19], outperforming traditional HOG-based methods on real-world datasets; Cui et al. [20] developed a Faster R-CNN-based framework for highway GPR layer detection, enabling real-time, high-accuracy analysis; Li et al. [21] proposed an optimized Mask R-CNN model that achieves centimeter-scale detection of void morphology and location with high precision and quantitative analysis. However, two-stage detection methods suffer from slow processing speeds and prolonged training times, and insufficient information fusion between stages increases false alarm rates, impacting overall system performance and reliability.
In contrast, single-stage detectors such as YOLO (You Only Look Once) [22] and the Single Shot MultiBox Detector (SSD) [23] directly predict object categories and locations without region proposal networks, offering superior real-time performance for subsurface target detection [24]. Wang et al. [25] augmented SSD with feature pyramid fusion layers and a Generalized Intersection over Union loss to enhance underground target detection; Qui et al. [26] modified YOLOv5's architecture for small-target detection in GPR imagery; Hu et al. [27] incorporated attention mechanisms into YOLOv5 to refine hyperbolic feature recognition; Wang et al. [28] extended the YOLOv7 architecture with an unsupervised domain-adaptive network, training jointly on simulated finite-difference time-domain GPR data and real-world GPR images to robustly detect delamination defects in subsurface pavement layers; Tian et al. [29] proposed a state-space-model (SSM)-based detection method capable of processing arbitrarily long GPR B-scan sequences to robustly localize hyperbolic features; Wang et al. [30] integrated CBAM into YOLOv8 for urban subsurface defect detection. However, existing object detection models can only frame the approximate location of targets and do not provide precise subsurface target localization. To address this challenge, Li et al. [31] extended YOLOv4 with a keypoint detection branch (YOLOv4-hyperbola), enabling joint bounding-box and vertex prediction. Yet because these methods are tailored to specific subsurface targets, issues of limited compatibility and accuracy persist for tasks involving targets of different scales, affecting efficiency and reliability in practical applications.
This study introduces the Dual Attentive YOLOv11-based Keypoint Detector (DAYKD), a lightweight multi-task deep learning framework for precise detection and localization of underground targets in GPR data. As illustrated in Figure 1, the framework consists of three stages: dataset integration and annotation, target detection using an attention-enhanced YOLO-DAFRNet module, and keypoint localization within detected regions. By incorporating dual-task optimization, attention-based feature enhancement, and a cascaded learning structure, DAYKD effectively addresses three core challenges in GPR-based subsurface target interpretation:
(1) The methodology is structured into two distinct tasks: target detection and keypoint detection. In the initial task, the DAYKD model is trained using a target detection dataset to accurately identify and localize potential target regions. In the subsequent task, a portion of the weights from the first task is shared, and the model is further trained on a keypoint detection dataset, thus refining the accuracy of both target detection and localization in underground environments.
(2) Two specialized modules, the Convolution and Attention Fusion Module (CAFM) [32] and the Feature Refinement Network (FRFN) [33], are incorporated to enhance the network's performance during both the target detection and keypoint recognition tasks. These modules optimize the network architecture by improving its global perception capabilities and its ability to extract multi-scale features from images. This, in turn, refines the feature selection process and yields a substantial increase in recognition accuracy.
(3) A notable innovation of this study lies in the task-specific partitioning of the dataset. The complete dataset is divided into two tailored subsets: one dedicated to target detection, focusing on identifying the presence and number of buried objects; and the other oriented toward keypoint detection, aimed at localizing underground targets. This task-aligned dataset design allows the model to be trained more effectively for each objective, leveraging the shared feature representation while optimizing performance for both detection and localization tasks. This dual-purpose dataset structure enhances the modularity and flexibility of the proposed framework.
3. The Proposed DAYKD Model
The development of DAYKD focuses on enhancing feature extraction capability, multi-scale feature fusion, and the convergence speed and stability of the loss function. The proposed method inherits the three-part structure of YOLOv11, namely the backbone, neck, and head, as illustrated in Figure 5.
3.1. Basic Model Architecture
Inspired by the YOLOv11 model, we adopt it as the foundational framework and propose a novel dual-task detection architecture. As the first model in the YOLO series to officially support both object detection and keypoint detection tasks, YOLOv11 inherently possesses the capabilities needed for a two-task approach, dynamically invoking task-specific detection heads. Notably, the model has demonstrated exceptional performance in facial keypoint detection, indicating strong potential for effective transferability to GPR keypoint detection tasks.
Building upon YOLOv8 [41], YOLOv11 introduces optimizations across its backbone network, feature fusion layers, and detection heads, achieving superior inference speed and accuracy. The overall network architecture is illustrated in Figure 5.
The YOLOv11 backbone network is designed to extract hierarchical features from input images, efficiently capturing visual information across low-level to high-level representations. The architecture replaces the original C2F (Cross-Stage Partial with Two Fusions) module with an improved C3K2 (Cross-Stage Partial Network v3 with K2 convolution layers) module, enhancing feature extraction efficiency through a parallel convolutional design and adaptive parameter configuration. Specifically, the C3K2 module employs dual convolutional layers in place of a single large kernel, alongside a channel-splitting strategy to reduce computational complexity. Additionally, variable-sized convolutional kernels are applied to expand the receptive field, significantly improving performance in large-object detection and complex background scenarios. Following the SPPF (Spatial Pyramid Pooling Fast) module, YOLOv11 introduces a C2PSA (Cross-Stage Partial with Pyramid Squeeze Attention) module. This component splits feature maps using a CSP-based approach: one branch propagates features directly, while the other undergoes dynamic spatial refinement via a PSA attention mechanism before feature concatenation. This design not only reduces computational overhead but also enhances the model's ability to focus on occluded objects and critical regions. The backbone progressively extracts multi-scale feature maps (P1 to P5), enriching semantic information at varying depths while minimizing model parameters. This hierarchical structure significantly improves the recognition of complex patterns (e.g., hyperbolic signatures) and enhances the detection of subtle details in GPR images, such as minor subsurface structural variations and intricate hyperbolic-shaped targets.
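To make the split-transform-merge flow concrete, the following is a minimal PyTorch sketch of the CSP-style channel split shared by C3K2 and C2PSA, not the exact YOLOv11 implementation; the channel counts and the inner block are illustrative placeholders.

```python
import torch
import torch.nn as nn

class CSPSplitSketch(nn.Module):
    """Half the channels bypass the inner block; the halves are then re-fused."""
    def __init__(self, c: int, inner: nn.Module):
        super().__init__()
        self.cv_in = nn.Conv2d(c, c, 1)       # channel adjustment
        self.inner = inner                    # e.g. stacked small-kernel convs (C3K2)
                                              # or an attention block (C2PSA)
        self.cv_out = nn.Conv2d(c, c, 1)      # fuse after concatenation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = self.cv_in(x).chunk(2, dim=1)  # split into two c/2 branches
        b = self.inner(b)                     # only one branch is transformed
        return self.cv_out(torch.cat([a, b], dim=1))

# hypothetical inner block: two small-kernel convs instead of one large kernel
inner = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.SiLU())
y = CSPSplitSketch(64, inner)(torch.randn(1, 64, 40, 40))
```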
The feature fusion layer generates three multi-scale convolutional outputs with resolutions corresponding to 1/8, 1/16, and 1/32 of the original input image. In the detection head architecture, YOLOv11 employs a decoupled structure that independently processes classification, localization, and keypoint detection through dedicated convolutional branches. This design significantly enhances the model’s adaptability to GPR image characteristics. A key innovation in the classification detection head is the implementation of Depthwise Convolution (DWConv), which performs channel-wise spatial convolution while eliminating inter-channel interactions. This architectural choice achieves substantial reductions in both parameter count and computational complexity. The detection pipeline processes multi-scale feature maps through task-specific convolutional layers to extract relevant semantic features. These features are then unified through a final convolution operation that transforms the multi-scale representations into the required vector space, generating the model’s final output.
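As a concrete illustration of the depthwise design in the classification branch, the sketch below pairs a 3×3 depthwise convolution (groups equal to channels, so no inter-channel interaction) with a 1×1 pointwise projection; the channel sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DWConvBranch(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        # Depthwise convolution: groups == in_channels, so each channel is
        # filtered independently, eliminating inter-channel interactions.
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
        self.bn = nn.BatchNorm2d(c_in)
        self.act = nn.SiLU()
        # Pointwise 1x1 convolution restores cross-channel interaction and
        # maps the features to the output (class-score) channels.
        self.pw = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw(self.act(self.bn(self.dw(x))))

# e.g. one 1/8-scale feature map with 64 channels and 2 target classes
scores = DWConvBranch(64, 2)(torch.randn(1, 64, 80, 80))
```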
3.2. Enhanced Feature Representation for Object Detection
This section presents the optimization of the attention mechanism in the YOLOv11 framework. Specifically, the attention layer in the C2PSA module within the backbone has been replaced with the CAFM, which integrates the strengths of convolutional neural networks and transformers. By employing convolution operations to extract local features and the attention mechanism to capture global features, the CAFM effectively models both global and local characteristics, thereby enhancing detection performance.
Global and Local Feature Extraction: CAFM in Backbone
The conventional convolution operation, while effective at capturing local features, suffers from an inherently limited receptive field that hinders its ability to model global contextual information effectively. In contrast, transformer architectures excel at capturing long-range dependencies through their attention mechanisms. To bridge this gap, we introduce the CAFM, which synergistically integrates convolutional operations with attention mechanisms to enable comprehensive joint modeling of both local and global features. The proposed module comprises two branches: a local branch and a global branch. In the local branch, convolution operations and channel reordering are employed to extract local features. Meanwhile, the global branch utilizes the attention mechanism to capture long-range feature dependencies.
The proposed method improves the final module of the YOLOv11 backbone by replacing the attention component of the original C2PSA module. Building upon the original PSABlock, which enhances spatial features solely through multi-scale convolution and channel weighting, the CAFM introduces a series of structural improvements to enhance feature extraction and information interaction capabilities.
The CAFM processes features through two complementary branches. In the local feature extraction branch, channel shuffle operations disrupt channel independence to enhance feature diversity. Following channel dimension adjustment via convolution, features are divided into subgroups, within each of which depthwise convolution achieves cross-channel information mixing. This design not only reduces the number of parameters but also effectively enhances feature representation capability. In the global feature extraction branch, the CAFM further incorporates a learnable scaling parameter, which adjusts the magnitude of the Q-K similarity matrix prior to the SoftMax function. This adjustment enhances the model's representational capacity while mitigating the risk of the attention matrix becoming excessively sharp.
The module ultimately combines local and global features through element-wise addition, effectively synthesizing information from different receptive fields. This architecture significantly enhances both feature extraction capability and global dependency modeling. Compared to the original PSABlock, CAFM demonstrates superior performance in capturing complex patterns and long-range spatial relationships while maintaining computational efficiency.
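A minimal sketch of this dual-branch design is given below, assuming single-head attention computed over the channel dimension, a residual connection, and illustrative channel/group counts; it mirrors the described mechanism (channel shuffle, grouped convolution, and a learnable scale on the Q-K similarity) rather than reproducing the authors' exact CAFM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAFMSketch(nn.Module):
    def __init__(self, c: int, groups: int = 4):
        super().__init__()
        self.groups = groups
        # Local branch: 1x1 conv for channel adjustment + grouped 3x3 conv
        self.local_pw = nn.Conv2d(c, c, 1)
        self.local_gw = nn.Conv2d(c, c, 3, padding=1, groups=groups)
        # Global branch: Q/K/V projections and a learnable scale that
        # rescales the Q-K similarity before the softmax
        self.qkv = nn.Conv2d(c, 3 * c, 1)
        self.temperature = nn.Parameter(torch.ones(1))
        self.proj = nn.Conv2d(c, c, 1)

    def channel_shuffle(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = x.view(b, self.groups, c // self.groups, h, w)
        return x.transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local features: shuffle breaks channel independence, grouped conv mixes
        local = self.local_gw(self.channel_shuffle(self.local_pw(x)))
        # Global features: channel-wise attention over flattened spatial positions
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)          # each (b, c, h*w)
        attn = F.softmax(self.temperature * (q @ k.transpose(-2, -1))
                         / (h * w) ** 0.5, dim=-1)                # (b, c, c)
        global_feat = self.proj((attn @ v).view(b, c, h, w))
        # Element-wise fusion of the two receptive fields (residual assumed)
        return x + local + global_feat
```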
3.3. Enhanced Feature Refinement for Keypoint Detection
In this section, the FRFN mechanism is incorporated into the C3k2 module of YOLOv11. Although the C3k2 module is an optimized version of the traditional CSP Bottleneck structure in YOLOv11, keypoint detection tasks demand stronger feature refinement and fusion capabilities. The FRFN mechanism effectively addresses this requirement by transforming and optimizing feature representations. Specifically, it enhances feature information, reduces redundant data, and improves information expression along the channel dimension, thereby strengthening the model’s ability to capture and utilize critical features.
Refined Details: C3K2 with FRFN
While the conventional feed-forward networks (FFNs) that complement self-attention in transformers process information independently at each pixel location to enhance feature representations, their performance in keypoint localization can be significantly degraded by redundant spatial information. To address this limitation, we introduce the FRFN with an "enhance-and-simplify" paradigm. This approach first employs Partial Convolution (PConv) to amplify informative feature elements critical for keypoint detection. Subsequently, a gating mechanism selectively suppresses redundant information propagation, effectively reducing interference from irrelevant features and improving localization accuracy.
Building upon this foundation, we develop an enhanced C3k2_FRFN module by integrating the FRFN mechanism with the C3K2 architecture. This hybrid module combines the computational efficiency of CSPNet’s partial feature flow design with the refined feature processing capabilities of FRFN. The architecture splits input feature maps into two parallel processing streams, maintaining optimal gradient flow while enabling more effective feature fusion. This dual-path design significantly enhances the network’s ability to extract and combine multi-scale features, particularly benefiting keypoint detection tasks that require precise spatial localization.
On this basis, the module applies the FRFN mechanism to further enhance feature refinement and fusion, particularly in capturing spatial and channel information. By combining partial convolution and depthwise separable convolutions, FRFN reduces computational complexity while expanding the receptive field, thereby improving the model's ability to express local features. Moreover, the C3k2_FRFN module uses a gating mechanism to adaptively fuse features from different sources, ensuring efficient feature flow and maximizing information extraction. This improved design not only enhances the network's performance in complex visual tasks but also significantly boosts the robustness and computational efficiency of the model, especially for high-precision tasks such as small object detection and keypoint localization.
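The sketch below illustrates the "enhance and simplify" idea, assuming a 1/4 partial-convolution ratio and a simple convolutional gate; the expansion factor and layer choices are illustrative assumptions rather than the authors' exact FRFN.

```python
import torch
import torch.nn as nn

class FRFNSketch(nn.Module):
    def __init__(self, c: int, expand: int = 2, part: int = 4):
        super().__init__()
        self.cp = c // part
        # Partial convolution: only the first c/part channels are convolved;
        # the remaining channels pass through untouched (cheap local enhancement).
        self.pconv = nn.Conv2d(self.cp, self.cp, 3, padding=1)
        # Two parallel 1x1 projections form a gate: one path carries features,
        # the other (after activation) decides what to let through.
        self.fc1 = nn.Conv2d(c, c * expand, 1)
        self.gate = nn.Conv2d(c, c * expand, 1)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(c * expand, c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Enhance: refine a fraction of the channels with PConv
        x = torch.cat([self.pconv(x[:, : self.cp]), x[:, self.cp :]], dim=1)
        # Simplify: the gated projection suppresses redundant activations
        return self.fc2(self.fc1(x) * self.act(self.gate(x)))

y = FRFNSketch(64)(torch.randn(1, 64, 40, 40))
```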
3.4. Loss Function Strategy
The original loss function of YOLOv11 includes classification loss, object confidence loss, and bounding box regression loss. The total loss function is shown in Equation (7):

$$L_{total} = \lambda_{cls} L_{cls} + \lambda_{obj} L_{obj} + \lambda_{box} L_{box} \tag{7}$$

In this context, $\lambda_{cls}$, $\lambda_{obj}$, and $\lambda_{box}$ denote the weighting factors of the classification loss, the object confidence loss, and the bounding box regression loss, respectively. By designing different weighting combinations, the same DAYKD model can be tailored to emphasize either hyperbolic target detection or keypoint detection tasks in GPR images.
The YOLOv11 architecture employs Binary Cross-Entropy (BCE) as its classification loss function to quantify the discrepancy between predicted probabilities and ground truth labels. Specifically, the loss function calculates the overall loss by averaging the loss for each predicted sample. The complete loss formulation is presented in Equation (8):

$$L_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \tag{8}$$

where $\hat{y}_i$ is the class probability output by the network, $y_i$ is the true class label, and $N$ is the total number of samples. These elements are fundamental in computing the classification loss, which evaluates the model's ability to correctly assign class labels across all samples in the dataset.
The object confidence loss function, defined in Equation (9), employs a binary cross-entropy formulation to assess the agreement between predicted objectness scores and their ground-truth counterparts. It comprises two weighted components: the first penalizes deviations in grid cells containing objects, while the second regulates predictions in background regions, with a balancing coefficient $\lambda_{noobj}$ mitigating class imbalance:

$$L_{obj} = -\sum_{i} \mathbb{1}_i^{obj} \left[ C_i \log \hat{C}_i + (1 - C_i) \log(1 - \hat{C}_i) \right] - \lambda_{noobj} \sum_{i} \mathbb{1}_i^{noobj} \left[ C_i \log \hat{C}_i + (1 - C_i) \log(1 - \hat{C}_i) \right] \tag{9}$$

where $\hat{C}_i$ is the object confidence output by the network and $C_i$ is the actual label indicating the presence of an object (1 or 0). This binary supervision guides the network in distinguishing between object and background regions during training.
The bounding box regression loss function, which quantifies the discrepancy between predicted bounding boxes and ground truth annotations, is formally defined in Equation (10). This loss function is pivotal in guiding the model toward precise object localization by minimizing spatial deviations throughout the training process:

$$L_{box} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{10}$$

where $IoU$ is the intersection over union between the predicted and ground-truth bounding boxes, $\rho(b, b^{gt})$ is the Euclidean distance between the centers of the predicted and ground-truth bounding boxes, $c$ is the diagonal length of the smallest enclosing area that contains both boxes, $v$ is an additional term that measures the consistency of the aspect ratio, and $\alpha$ is the weight used to balance the aspect ratio term.
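As a compact illustration, the snippet below combines the three terms of Equations (7)–(10), assuming logits and matched-pair CIoU values are already available; the default weights are illustrative placeholders, not the paper's tuned values.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the weighted three-term loss of Equation (7); targets are
# float tensors, and `ciou` holds the CIoU of each matched box pair, Eq. (10).
def total_loss(cls_logits, cls_targets,
               obj_logits, obj_targets,
               ciou,
               w_cls=0.5, w_obj=1.0, w_box=7.5):   # illustrative weights
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)  # Eq. (8)
    l_obj = F.binary_cross_entropy_with_logits(obj_logits, obj_targets)  # Eq. (9)
    l_box = (1.0 - ciou).mean()                                          # Eq. (10)
    return w_cls * l_cls + w_obj * l_obj + w_box * l_box
```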
4. Experiment Results and Analysis
4.1. Experimental Settings and Evaluation Metrics
4.1.1. Experimental Settings
The training protocol for the proposed model employed the hyperparameters detailed in Table 1. Model performance was evaluated at the end of each epoch, with the optimal training epoch selected according to the mean Average Precision (mAP) metric on the validation set.
Considering the varying image scales in the dataset, different batch sizes and image dimensions were used for different tasks: a batch size of 32 was employed in the object detection task and a batch size of 64 in the keypoint detection task, each with a task-specific input resolution. All experiments were conducted on a platform with an NVIDIA RTX 3050 GPU and a 12-core Intel Core i5-12500H CPU to satisfy computational demands.
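For concreteness, a run of this protocol might look like the following sketch, assuming the Ultralytics training interface and hypothetical dataset YAML files; the image-size values are placeholders, since the exact resolutions are not reproduced here.

```python
from ultralytics import YOLO

# Task 1: object detection on the bounding-box dataset (hypothetical YAML)
det_model = YOLO("yolo11n.pt")
det_model.train(data="gpr_det.yaml", epochs=300, batch=32, imgsz=640)

# Task 2: keypoint detection with a pose head on the cropped dataset
kpt_model = YOLO("yolo11n-pose.pt")
kpt_model.train(data="gpr_kpt.yaml", epochs=300, batch=64, imgsz=320)
```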
4.1.2. Evaluation Metrics
To conduct a more objective and scientific quantitative evaluation of the algorithm's performance, this study employs commonly used evaluation metrics such as precision, recall, mAP, and F1-score. Precision measures the model's ability to correctly identify positive samples, while recall assesses the model's capability to comprehensively cover positive samples. As a comprehensive metric, mAP further evaluates the model's overall performance by calculating a weighted average across multiple thresholds. F1-score effectively mitigates the weakest-link effect by balancing precision and recall through emphasizing their lower value. This multi-metric evaluation framework ensures robust performance characterization across all critical aspects of detection quality, validating both the effectiveness and practical reliability of the proposed method. The formulas for the aforementioned evaluation metrics are shown in Equations (11)–(16):

$$Precision = \frac{TP}{TP + FP} \tag{11}$$

$$Recall = \frac{TP}{TP + FN} \tag{12}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{13}$$

$$AP = \int_0^1 P(R) \, dR \tag{14}$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{15}$$

$$mAP_{50\text{-}95} = \frac{1}{N \cdot T} \sum_{k=1}^{T} \sum_{i=1}^{N} AP_i(t_k) \tag{16}$$
In the formulas of the above evaluation metrics, $TP$ (True Positives) denotes the number of correctly predicted positive samples, $FP$ (False Positives) refers to the number of incorrectly predicted positive samples, and $FN$ (False Negatives) represents the number of actual positive samples that were not detected by the model. $N$ denotes the total number of object categories involved in the evaluation. For mAP50-95, $t_k$ represents the threshold at the $k$-th level, ranging from 0.50 to 0.95 with a step size of 0.05, and $T = 10$ indicates the total number of such thresholds. The term $AP_i(t_k)$ corresponds to the Average Precision of the $i$-th category under the threshold $t_k$.
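A direct transcription of these formulas into Python, assuming the per-class AP values at each IoU threshold have already been computed:

```python
# Minimal sketch of Equations (11)-(16) from raw counts and precomputed AP.
def precision(tp: int, fp: int) -> float:          # Eq. (11)
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:             # Eq. (12)
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:               # Eq. (13)
    return 2 * p * r / (p + r)

# Eq. (14) is the area under the precision-recall curve per class; here the
# per-class AP values are assumed to be precomputed inputs.
def mean_ap(ap_per_class: list[float]) -> float:   # Eq. (15)
    return sum(ap_per_class) / len(ap_per_class)

def map50_95(ap: list[list[float]]) -> float:      # Eq. (16)
    # ap[k][i]: AP of class i at threshold t_k, for t_k = 0.50, 0.55, ..., 0.95
    return sum(mean_ap(per_t) for per_t in ap) / len(ap)
```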
4.2. GPR Dataset Collection
In this section, a total of 1617 images from both the simulated and real datasets were annotated with bounding boxes using LabelMe [42] and divided into a training set and a test set in a ratio of 8:2. For the training of the keypoint dataset, a total of 6112 images were obtained through cropping, and the dataset was split in the same 8:2 ratio.
4.2.1. Simulated Dataset
To rigorously evaluate the proposed method's performance, this study conducted comprehensive experiments using both simulated and real-world datasets. The simulated dataset consists of 1500 images generated using gprMax [43] to address the issue of limited samples in the real-world dataset. The simulation environment was modeled as a 2.5 m × 0.5 m × 2.5 mm 3D domain, discretized with a spatial resolution of 5 mm (x, y) and 2.5 mm (z). A Ricker wavelet with a center frequency of 0.9 GHz served as the excitation signal. The propagation medium was primarily concrete, with an overlying air layer.
Cylindrical metallic targets (modeled as perfect electric conductors, PEC) were randomly placed within the concrete volume, with their axes aligned along the z-direction. These images are categorized into three classes based on the number of targets: images with 1–9 targets, 10–15 targets, and 16–20 targets, distributed in a ratio of 5:3:2. Each simulation scenario includes a paired modeling script and resulting B-scan image, ensuring one-to-one correspondence for analysis. This stratified design maintains a balanced representation of detection difficulty and reflects real-world complexity distributions.
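A representative gprMax input for one such scenario could look like the sketch below; the domain, discretization, and 0.9 GHz Ricker source follow the settings above, while the time window, material parameters, trace spacing, and target positions are illustrative assumptions.

```python
# Write an illustrative gprMax scene file and note how it would be run.
scene = """
#domain: 2.5 0.5 0.0025
#dx_dy_dz: 0.005 0.005 0.0025
#time_window: 12e-9

#material: 6 0.005 1 0 concrete
#waveform: ricker 1 0.9e9 src_wave
#hertzian_dipole: z 0.05 0.45 0 src_wave
#rx: 0.10 0.45 0
#src_steps: 0.01 0 0
#rx_steps: 0.01 0 0

#box: 0 0 0 2.5 0.40 0.0025 concrete
#cylinder: 0.80 0.20 0 0.80 0.20 0.0025 0.015 pec
#cylinder: 1.60 0.25 0 1.60 0.25 0.0025 0.015 pec
"""
with open("gpr_scene.in", "w") as f:
    f.write(scene)
# Run e.g. `python -m gprMax gpr_scene.in -n 230` to sweep the antenna
# along x and assemble the resulting traces into one B-scan image.
```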
4.2.2. Field Dataset
The real-world dataset comprises two distinct acquisition scenarios, with the data collection process detailed in Figure 6. The final constructed dataset contains a total of 167 GPR images. Field scene 1: the first part is from the Neyland Pedestrian Bridge at the University of Tennessee, Knoxville, as shown in Figure 6a. Data collection was performed using a GSSI SIR-4000 GPR system with a 2 GHz antenna. Field scene 2: the second part is from the laboratory of the School of Earth Sciences at Guilin University of Technology, using a GSSI SIR-4000 GPR system with a 400 MHz antenna. The data acquisition scene and the tested concrete wall diagram are shown in Figure 6b. Steel bars with a diameter of 5 cm were embedded on both the left and right sides of the wall; the left side contained solid steel bars, while the right side featured hollow steel bars. The distances of these steel bars from the outer wall boundary were 10 cm, 20 cm, 30 cm, and 35 cm, respectively.
4.3. Ablation Experiment
4.3.1. Performance of Individual or Combination Improvements
To rigorously evaluate the contribution of each proposed enhancement, we conducted systematic ablation studies assessing each component's impact on model performance. Each improvement was individually tested, with F1-score, mAP50, mAP50-95, precision, and recall selected as evaluation metrics for model accuracy. Parameter count and Giga Floating-point Operations (GFLOPs) were used to assess model complexity, while frames per second (FPS) was adopted as the metric for inference speed.
As evidenced by the ablation study results summarized in Table 2, the DAYKD framework exhibits marked improvements across all evaluation metrics, attaining an F1-score of 0.929, mAP50 of 0.947, and mAP50-95 of 0.825. Comparative analysis reveals significant performance advantages over the baseline model, with the F1-score exhibiting a 13% increment, mAP50 demonstrating a 12% enhancement, and mAP50-95 showing a notable 16% elevation. The YOLOv11 + CAFM configuration achieves respective improvements of 9% and 10% in mAP50 and mAP50-95 metrics relative to the baseline, substantiating the efficacy of the CAFM in augmenting feature extraction capabilities. The FRFN module further optimizes multi-scale feature fusion, allowing YOLOv11 + FRFN to attain an mAP50-95 of 0.788, corresponding to a 15% advancement compared to the baseline model.
The synergistic integration of CAFM and FRFN modules in DAYKD facilitates optimal detection performance. Precision and recall metrics demonstrate notable enhancements of approximately 6% and 7%, respectively, indicating superior capability in minimizing false positive rates while maximizing detection sensitivity. Although exhibiting increased parameter count (1.28 M vs. baseline 1.15 M) and computational complexity (3.2 G FLOPs vs. baseline 2.8 G FLOPs), DAYKD maintains real-time processing efficiency with an inference speed of 65.4 FPS, an 8% improvement over the baseline. While YOLOv11 + CAFM achieves marginally higher frame rates (68.2 FPS), its detection accuracy remains suboptimal compared to the integrated DAYKD architecture.
The experimental findings collectively demonstrate that the CAFM and FRFN modules significantly enhance feature representation capacity and detection precision through distinct yet complementary mechanisms. DAYKD effectively leverages both architectural enhancements, achieving comprehensive performance optimization that balances accuracy metrics with computational efficiency. This integration strategy yields state-of-the-art performance in object detection tasks, particularly in scenarios requiring high-precision recognition across multiple scales.
4.3.2. Convergence Stability Evaluation
The training dynamics and convergence behavior of the DAYKD model were rigorously analyzed through the examination of loss curves across different training tasks. Figure 7 presents the respective loss trajectories for both object detection and keypoint prediction tasks, revealing several important characteristics of the learning process. All loss functions demonstrate consistent monotonic decay throughout the training regimen, with values asymptotically approaching stable minima in later epochs. This smooth convergence profile, devoid of any oscillatory behavior or divergence, indicates robust optimization dynamics and effective learning without evidence of overfitting. The observed stabilization of loss values in the final training task further confirms the model's ability to reach a well-optimized solution state, suggesting both numerical stability and effective capacity utilization of the network architecture.
Figure 7a reveals distinct convergence characteristics in the object detection task, where the bounding box loss exhibits rapid initial descent during early optimization (typically epochs 1–50), followed by asymptotic stabilization. This biphasic convergence pattern demonstrates the model’s capacity for efficient spatial feature acquisition while maintaining stable optimization dynamics throughout extended training. The more gradual reduction in dfl_loss components reflects the inherent complexity of learning precise distribution focal representations for bounding box regression.
Conversely, Figure 7b demonstrates accelerated convergence in keypoint detection compared to traditional pose estimation architectures, with loss values decreasing approximately 40% faster during early training. This enhanced learning efficiency represents a marked improvement over conventional approaches, in which pose-related losses typically exhibit slower convergence due to the intricate nature of structural feature learning. Moreover, the stable decline of the keypoint object confidence loss (kobj_loss) indicates the model's increasing reliability in predicting the presence of keypoints. These results collectively demonstrate that the DAYKD model exhibits strong performance and stability in both object detection and keypoint detection tasks. The observed rapid optimization empirically validates three critical aspects of our approach: the effectiveness of the baseline architecture selection, the design of the parameter-sharing strategy between detection tasks, and the appropriate composition of the keypoint training dataset. Together, these results demonstrate the model's capability for simultaneous spatial and structural feature learning, establishing a strong benchmark for integrated object detection and keypoint estimation.
4.4. Performance Estimation of the Proposed Model
Figure 8 systematically illustrates the complete DAYKD recognition workflow, including the original data, enhanced processing, category probability visualization, object detection results before and after NMS, and final keypoint estimation. The original B-scan image in Figure 8a displays characteristic signal reflections through red-blue waveform patterns, representing unprocessed subsurface radar measurements. Following enhancement in Figure 8b, the processed image exhibits significantly improved contrast between target signals and background noise, facilitating subsequent detection tasks.
The object detection task produces a category probability thermogram, as shown in Figure 8c, with color intensity representing target likelihood: deeper red hues correspond to higher probabilities. These probability estimates, when integrated with the preliminary bounding box detections visible in Figure 8d, undergo Non-Maximum Suppression (NMS) processing to generate the final object localization results displayed in Figure 8e. The system successfully identifies hyperbolic features, marked by blue bounding boxes with confidence scores ranging from 0.65 to 0.81, demonstrating the method's consistent detection reliability.
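The suppression step between Figure 8d and Figure 8e can be reproduced with a few lines of PyTorch, shown here with made-up candidate boxes and an illustrative IoU threshold of 0.5:

```python
import torch
from torchvision.ops import nms

# Two overlapping candidates around the same hyperbola plus one distinct box,
# in xyxy pixel coordinates (values are illustrative only).
boxes = torch.tensor([[120., 40., 180., 90.],
                      [122., 42., 182., 92.],
                      [300., 55., 360., 105.]])
scores = torch.tensor([0.81, 0.74, 0.65])

keep = nms(boxes, scores, iou_threshold=0.5)   # indices of surviving boxes
final_boxes, final_scores = boxes[keep], scores[keep]
```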
Furthermore, Figure 8f illustrates the keypoint detection results. Keypoints are marked on each detected hyperbolic target to more accurately describe the geometric characteristics of the hyperbolic features. The confidence scores associated with keypoint annotations are generally higher, typically ranging from 0.88 to 0.94, suggesting that the method achieves high precision in keypoint localization. The combined results validate DAYKD's capability for integrated object localization and structural feature extraction in complex GPR data.
The detection results of the DAYKD model on simulated data for underground target detection tasks are illustrated in Figure 9 and Figure 10. Figure 11, Figure 12 and Figure 13 further present a comparison between the model's detection outcomes and the corresponding ground truth images for field scenes 1 and 2. Each detected bounding box contains a potential hyperbolic target, five keypoints, and the associated confidence score. As shown in the figures, all targets were accurately identified, with no missed or false detections. The vertex of each hyperbola was precisely located and marked with a red dot, while the other keypoints were highlighted in green.
From a model detection perspective, directly inputting images containing position and travel-time coordinate axes may interfere with the model's feature extraction and affect detection performance. To mitigate this, all images are preprocessed to remove coordinate axes during both the training and validation phases. After detection is completed, the position and travel-time coordinates are re-applied to the output images for subsequent analysis and visualization. From Figure 9 and Figure 10, it can be observed that despite challenging conditions such as horizontal and vertical overlaps and indistinct target boundaries, the DAYKD model is still capable of accurate detection and localization.
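A minimal version of this axis-removal step, assuming fixed-width axis margins (the pixel values and filenames here are placeholders):

```python
import cv2

def strip_axes(img, left=60, top=40, right=10, bottom=50):
    """Crop away the coordinate-axis margins, keeping only the B-scan area."""
    h, w = img.shape[:2]
    return img[top:h - bottom, left:w - right]

bscan = strip_axes(cv2.imread("scan.png"))
cv2.imwrite("scan_noaxes.png", bscan)   # axes are re-applied after detection
```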
Figure 11, Figure 12 and Figure 13 showcase the model's performance on real-world field data from two distinct acquisition scenarios. Despite the substantial background noise and frequent target occlusions characteristic of ground-penetrating radar environments, DAYKD maintains consistently high detection accuracy, with confidence scores predominantly above 0.90. The model exhibits particular robustness in high-density target scenarios with small object sizes, successfully resolving individual targets through precise spatial differentiation.
Quantitative analysis reveals close alignment between the detected targets and ground truth references in both spatial position and hyperbolic shape characteristics. This performance consistency across diverse testing conditions, from controlled simulations to complex field environments, confirms the model’s superior adaptability and reliability for subsurface detection tasks. The combination of high-confidence detections and geometrically accurate keypoint localization establishes DAYKD as a robust solution for challenging underground target identification applications.
4.5. Performance Comparison with Other Algorithms
As evidenced by the quantitative results in Table 3, the DAYKD model establishes new benchmarks in underground object detection, achieving a state-of-the-art mAP50 of 94.7%. This performance surpasses conventional approaches including Faster R-CNN (by 12.3%), RTMDet (by 8.9%), and the baseline model (by 11.5%), while remaining competitive with Cascade R-CNN (a difference of 1.2%). The model's exceptional capability becomes particularly evident in the more rigorous mAP50-95 metric, where it outperforms all comparison methods by substantial margins exceeding 10 percentage points, demonstrating remarkable robustness across diverse object scales and geometric configurations.
Further examination of the recall metric reveals DAYKD's superior detection completeness, achieving 92.2% recall, a 7.5% improvement over the second-best performer. This advantage stems from the model's optimized feature representation and effective handling of challenging subsurface targets. The marginal deficit in mAP50 relative to Cascade R-CNN can be attributed to fundamental architectural differences: Cascade's multi-stage refinement mechanism provides progressive bounding box optimization that proves particularly effective at higher IoU thresholds, while DAYKD's unified architecture prioritizes overall detection quality and computational efficiency. This trade-off reflects the inherent balance between precision refinement and holistic detection performance in deep learning-based object detection systems.
To effectively track the convergence of our proposed model during training, we recorded the box_loss convergence of the various models over 300 epochs as an indicator of target localization performance. Figure 14 illustrates the variation in box_loss for five different models. The results demonstrate that DAYKD achieved the best performance throughout the training process, with the fastest decline in loss value, ultimately stabilizing at 0.35, showcasing superior convergence and accuracy. YOLOv11 and RTMDet also exhibited strong performance, characterized by rapid convergence and low final loss values. The Cascade model presented a slightly higher box_loss, while Faster R-CNN maintained the highest box_loss throughout training, indicating inferior performance in the bounding box regression task.
This study conducts a systematic comparative analysis of five mainstream object detection algorithms under two representative experimental conditions: a simulated data environment and real-world field scene 1. The investigation aims to rigorously evaluate the practical efficacy of these models across varying environmental complexities. The evaluated algorithms comprise Faster R-CNN, RTMDet, Cascade, YOLOv11, and DAYKD. As evidenced by the detection outcomes depicted in Figure 15 and Figure 16, distinct performance differentials emerge among the models regarding detection accuracy and operational robustness. Notably, conventional detection frameworks including Faster R-CNN, RTMDet, and Cascade employ bounding-box-based localization methodologies without keypoint detection integration, thereby constraining their analytical capabilities. In contrast, both YOLOv11 and DAYKD incorporate advanced keypoint detection architectures, demonstrating superior precision and enhanced structural comprehension in target analysis.
Within controlled simulation environments, all algorithms demonstrate competent performance in basic recognition tasks, achieving target identification with negligible error margins. However, performance stratification becomes apparent with increasing target density and intersection scenarios. As quantitatively demonstrated in Figure 15, conventional models including Faster R-CNN, RTMDet, and Cascade manifest significant detection deficiencies in complex configurations, characterized by elevated rates of false negatives and suboptimal localization precision. These limitations indicate inherent constraints in high-density target processing capabilities. Comparative analysis further reveals that traditional architectures exhibit constrained feature representation capabilities and limited structural interpretability relative to their advanced counterparts.
In contrast, YOLOv11 demonstrates significant improvements in detection confidence through its integrated keypoint detection mechanism, thereby effectively mitigating false positive identifications and misdetections in regions devoid of distinct structural features. DAYKD exhibits the most balanced and robust performance metrics across all evaluation parameters, demonstrating exceptional boundary localization precision and superior small-object detection efficacy. The model achieves detection accuracy surpassing the 90% threshold, representing substantial performance enhancement over comparative architectures, which substantiates the effectiveness of its structural design and feature extraction methodology.
The evaluation on field scene 1, characterized by greater environmental complexity, further highlights the robustness of each model. Although Faster R-CNN, RTMDet, and Cascade can detect a majority of hyperbolic structures, their overall recognition performance is markedly inferior to that of YOLOv11 and DAYKD. In particular, DAYKD maintains high detection accuracy and boundary fitting in complex real-world conditions, with virtually no missed detections. This demonstrates its strong generalization ability and adaptability, making it the most robust and reliable model in this evaluation. In summary, DAYKD achieves the best detection performance across both tested environments, excelling in accuracy, robustness, and small-object recognition. It stands out as the most capable model among those evaluated. These results further validate the effectiveness of incorporating keypoint detection in enhancing both the precision and stability of object detection algorithms.
In another example from field scene 1, we conducted a detailed comparative analysis of the model combinations used in the ablation study. Unlike the relatively clear and scale-consistent detection targets in Figure 16, the objects in Figure 17 exhibit significant scale variations, along with complex backgrounds and occlusion effects, introducing substantial interference. These factors impose greater demands on the model, particularly in terms of global perception and multi-scale feature extraction capabilities.
As shown in Figure 17a–d, the performance differences among the models in handling this challenging scenario are evident. The baseline YOLOv11 model suffers from missed and false detections due to object scale variations and background noise. By integrating the CAFM during the object detection task, the model enhances its regional perception of targets, effectively mitigating background interference and significantly reducing missed detections. When the FRFN module is applied solely during the keypoint detection task, the model exhibits stronger feature fusion and scale modeling capabilities, leading to a notable improvement in keypoint confidence scores. Finally, the model incorporating the DAYKD strategy achieves the highest overall detection accuracy, demonstrating that this approach effectively enhances generalization and robustness.
However, it is worth noting that the proposed algorithm struggles to recognize incomplete hyperbolic targets, which may arise due to measurement line constraints or interference from adjacent objects. Although partial hyperbolic signatures often still indicate the presence of true subsurface targets, the model currently lacks the sensitivity to consistently interpret these incomplete patterns.
5. Conclusions
This paper proposes the DAYKD model, a dual-task framework based on YOLOv11, designed for efficient and precise underground target detection and keypoint localization in GPR images. DAYKD employs a two-task training strategy: the first task focuses on high-precision object detection, while the second task refines keypoint recognition accuracy. Ablation experiments demonstrate that DAYKD achieves a precision of 93.7% and an mAP@50 of 94.7% in object detection tasks. Compared to state-of-the-art models such as YOLOv11, Cascade R-CNN, RTMDet, and Faster R-CNN, DAYKD exhibits superior performance, with an mAP@50-95 of 82.5% and a recall of 92.2%.
DAYKD demonstrates strong performance in GPR image interpretation, particularly in distinguishing overlapping subsurface targets and maintaining accuracy in cluttered, noisy environments. Its dual-task architecture and modular design enable efficient, robust inference across diverse imaging conditions and GPR systems. The model exhibits resilience to background noise and interference, with low computational overhead. However, its generalization is limited by the diversity of real-world datasets. DAYKD also struggles to detect incomplete hyperbolic signatures, especially near image boundaries, which may lead to missed detections under complex conditions.
Future work will aim to improve the generalization ability of the DAYKD model by expanding on-site data collection across a wider range of geological conditions and deployment scenarios. A more diverse and representative dataset is expected to enhance the model’s robustness in real-world applications. In addition, we plan to explore the use of partially labeled samples, particularly those with incomplete hyperbolic features, to improve the model’s ability to detect subtle or truncated subsurface targets under challenging conditions. To support deployment in resource-constrained environments, future research will also focus on improving computational efficiency. This includes investigating model compression techniques such as pruning, knowledge distillation, and low-bit quantization. These methods are expected to reduce inference time and memory consumption, enabling real-time performance while maintaining detection accuracy in practical engineering settings.