This section reports empirical evidence for the proposed quasi-static in situ workflow on the UATD dataset, focusing on whether the injected pixel-wise statistical modeling (echo-intensity probability weighting and acoustic attenuation compensation) yields measurably improved FLS imagery and more reliable supervision, and whether these gains translate into stronger and more robust UATD performance.
4.1. Statistical Enhancement and Calibration Results
The statistical parameters estimated from the training split according to Equations (1), (9), and (12) are summarized in Table 6. Overall, the probability weighting parameters fall within a plausible range, whereas the fitted attenuation parameters differ markedly from those of conventional sonar attenuation models. In particular, the fitted spreading loss coefficient k becomes negative, although it is typically constrained to be positive in traditional formulations, where intensity is assumed to decrease monotonically with range.
This behavior is explained by the empirical range–intensity trend shown in Figure 5. In the context of the UATD dataset, near-distance echo intensities initially increase due to near-field effects and measurement artifacts (e.g., speckle and acoustic shadowing) before eventually decaying. Such non-monotonic behavior cannot be captured by traditional attenuation models that enforce a globally decreasing profile. If a conventional monotonic model were used for attenuation compensation, the near-source noise would be erroneously amplified rather than suppressed, which is detrimental for both visualization and downstream detection. Our model accurately captures this initial intensity surge, effectively mitigating near-source artifacts. Notably, the short-range intensity increase discussed here is an observation restricted to the UATD dataset under its particular acquisition geometry and operating conditions, and therefore should not be construed as a universal property of FLS imagery. Rather, it indicates that the proposed approach can adapt to such specific scenarios.
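The exact forms of Equations (1), (9), and (12) are not reproduced in this section, but the qualitative behavior above can be illustrated with a minimal sketch. It assumes a hypothetical spreading-plus-absorption form, I(r) = c · r^(−k) · exp(−β r), in which a negative fitted k produces exactly the non-monotonic rise-then-decay profile described here (the peak falls at r = −k/β); the parameter names and the synthetic data are illustrative, not the paper's actual fit.

```python
import numpy as np
from scipy.optimize import curve_fit

def attenuation_model(r, c, k, beta):
    """Spreading-plus-absorption intensity model: I(r) = c * r**(-k) * exp(-beta*r).
    With k < 0 the profile first rises, peaks at r = -k / beta, then decays."""
    return c * r ** (-k) * np.exp(-beta * r)

# Synthetic mean-intensity-vs-range curve with a near-field rise (ground truth k < 0).
rng = np.random.default_rng(0)
r = np.linspace(0.5, 10.0, 200)                      # range bins (m)
truth = attenuation_model(r, 50.0, -1.2, 0.6)        # peak at r = 2.0 m
obs = truth * (1 + 0.02 * rng.standard_normal(r.size))

# Least-squares fit recovers a negative spreading coefficient from the data alone.
(c_hat, k_hat, beta_hat), _ = curve_fit(
    attenuation_model, r, obs, p0=(1.0, 1.0, 0.1), maxfev=10000
)
print(k_hat < 0)
```

A conventional model constrained to k > 0 cannot reproduce the rising segment and would instead over-amplify the near-source region during compensation.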
To provide an intuitive understanding of the enhancement process, Figure 6 presents a visual comparison between raw FLS images and the outputs of each enhancement component. The proposed enhancement improves perceptual target saliency while suppressing background clutter, which benefits both deep detectors and manual annotation during dataset construction. In Figure 6, the left and bottom axes show pixel height and pixel width, respectively, while the right and top axes indicate range (m) and azimuth (degrees). For all subfigures, the horizontal axis spans pixel coordinates from 0 to 1020 (left to right), corresponding linearly to azimuth angles from −60 degrees to 60 degrees. For subfigures (a) (first row) and (c) (third row), the vertical axis spans pixel coordinates from 0 to 1020 (top to bottom), corresponding linearly to ranges from 0 m to 10 m. In contrast, subfigure (b) (second row) spans pixel coordinates from 0 to 1540 (top to bottom), corresponding linearly to ranges from 0 m to 15 m. The gridlines uniformly partition both axes into 10 equal intervals to facilitate position reading.
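The axis conventions above reduce to two linear maps from pixel coordinates to physical coordinates. The sketch below assumes a symmetric ±60 degree field of view and the per-subfigure spans stated above; the function name and defaults are illustrative.

```python
def pixel_to_physical(px_x, px_y, img_w=1020, img_h=1020,
                      az_min=-60.0, az_max=60.0, r_max=10.0):
    """Map a pixel coordinate to (azimuth in degrees, range in m) using the
    linear axis conventions of Figure 6 (symmetric FOV assumed)."""
    azimuth = az_min + (az_max - az_min) * px_x / img_w   # left -> right
    range_m = r_max * px_y / img_h                        # top -> bottom
    return azimuth, range_m

print(pixel_to_physical(0, 0))          # top-left corner
print(pixel_to_physical(1020, 1020))    # bottom-right corner, subfigures (a)/(c)
print(pixel_to_physical(510, 1540, img_h=1540, r_max=15.0))  # subfigure (b) bottom-center
```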
Applying attenuation compensation alone (Figure 6, column 2) improves target visibility by correcting range-related decay, but it can also amplify distant large-area high-intensity noise, revealing a limitation of using the attenuation model alone.
The probability weighting enhancement (column 3) effectively suppresses uniform background noise. However, it tends to enhance the tails of small, high-intensity noise and to weaken portions of the target signal, creating defective areas of unexpectedly low intensity within targets. Additionally, it sometimes amplifies weak background noise in rows containing targets, resulting in horizontal artifacts.
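The probability weighting equations themselves are not restated in this section; the following sketch only illustrates the general idea under an assumed Rayleigh background model, weighting each pixel by the fitted background CDF so that intensities typical of background speckle are attenuated while improbably bright echoes are preserved. The synthetic frame and the target patch are hypothetical.

```python
import numpy as np
from scipy.stats import rayleigh

rng = np.random.default_rng(1)

# Synthetic FLS-like frame: Rayleigh background speckle plus one bright target patch.
img = rayleigh.rvs(scale=10.0, size=(64, 64), random_state=rng)
img[28:36, 28:36] += 60.0   # hypothetical target

# Fit the background model on the full frame (the target occupies few pixels),
# then weight each pixel by the probability that background stays below its value.
loc, scale = rayleigh.fit(img.ravel(), floc=0)
weight = rayleigh.cdf(img, loc=loc, scale=scale)
enhanced = weight * img     # uniform low-intensity background is suppressed

print(enhanced[28:36, 28:36].mean() > enhanced[:20, :20].mean())
```

Because the weighting is purely pixel-wise, it cannot distinguish a rare bright noise spike from a genuine echo, which is consistent with the side effects noted above.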
The fused result (column 4) effectively integrates the strengths of both channels. The signal degradation (defective areas) introduced by the probability distribution model is compensated by the signal preservation of the attenuation channel. Furthermore, distinct noise patterns generated by the individual filters manifest as unique color artifacts in the fused image, making them easily distinguishable from the consistent color patterns of actual targets.
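One simple way to realize the channel-wise fusion and pseudo-coloring described above is to normalize each enhanced result and the raw image and stack them as RGB planes, so that noise unique to one filter appears as a single-hue tint while true targets, bright in all channels, render in a consistent color. This is a minimal sketch of that idea, not necessarily the paper's exact fusion rule.

```python
import numpy as np

def fuse_channels(att_comp, prob_weighted, raw):
    """Stack attenuation-compensated, probability-weighted, and raw images as an
    RGB composite. Channel-specific artifacts tint toward one primary color;
    real targets stay bright in all three planes and keep a consistent hue."""
    def norm(x):
        x = x.astype(np.float64)
        return (x - x.min()) / (np.ptp(x) + 1e-12)
    return np.dstack([norm(att_comp), norm(prob_weighted), norm(raw)])

# Toy example with surrogate enhanced channels derived from one random frame.
raw = np.random.default_rng(2).random((32, 32))
fused = fuse_channels(raw * 1.5, raw ** 2, raw)
print(fused.shape)   # (32, 32, 3)
```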
The examples in Figure 6 illustrate the complementary effects between attenuation compensation and probability weighting. In Figure 6a (first row), attenuation compensation effectively enhances the target while avoiding noise amplification for this scene, whereas probability weighting substantially boosts background noise, particularly patchy artifacts. In Figure 6b (second row), the overall weak echo signals limit the target enhancement from attenuation compensation, but probability weighting prominently highlights the target. In Figure 6c (third row), attenuation compensation yields a sufficiently clear target but fails to suppress large-area noise, which is effectively mitigated by probability weighting. Channel-wise fusion of these results enables effective complementarity, while pseudo-coloring provides targets with a more pronounced and consistent pattern.
The increased target saliency also enables more accurate dataset annotation and calibration. To demonstrate this, we analyze an instance containing a (spherical) ball and a square cage, summarized in Table 4. According to the original dataset documentation [21], targets were deliberately deployed prior to data collection, ensuring that the ground-truth count and class of objects are correct, while bounding boxes may still suffer from human annotation uncertainty in low-visibility raw imagery.
The physical dimensions of the targets are listed in Table 5. For the ball with radius 0.25 m, the maximum reflective surface is bounded by a circle of diameter 0.5 m, and the maximum depth profile (front-to-back extent along range) is bounded by the 0.25 m radius. Therefore, we expect its actual reflective width to be at most 0.5 m and its depth profile to be at most 0.25 m.
Based on the original annotations in Table 4, namely the recorded range, azimuth, and image dimensions, Equations (17) and (16) allow us to convert pixel-wise bounding-box extents into metric estimates of the reflective width and the depth profile, i.e., the height of the reflective surface. The initial bounding box of the ball yields a reflective width of 0.6238 m and a depth profile of 0.4381 m, which significantly exceed the physical dimensions of the ball. Conversely, the original annotation for the square cage yields a width of 0.0167 m and a depth of 0.0084 m, which is implausibly small. These discrepancies occurred because the targets were indistinguishable in the raw data (Figure 7, upper left, white bounding boxes).
After applying our statistical enhancement, both targets become clearly delineated (Figure 7, right). We calibrated the bounding boxes (red boxes), yielding a corrected width of 0.4715 m and a depth of 0.2239 m for the ball, as well as a width of 0.3537 m and a depth of 0.1460 m for the cage. Both sets of calibrated dimensions closely align with their true physical sizes. Furthermore, a suspicious high-intensity pattern at pixel-wise coordinates (400, 1020) to (500, 1080) was revealed by the enhancement. Calculations indicate a width of 0.6710 m and a depth of 0.5841 m, which does not match any deployed target. This confirms it is a noise artifact, verifying that the original annotators were correct to exclude it.
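Equations (16) and (17) are not reproduced in this section; the sketch below shows one plausible form of such a pixel-to-metric conversion, assuming the reflective width is the arc length subtended by the box's azimuthal extent at the target range and the depth profile is the box's extent along the range axis. All names, the example box, and the frame parameters are illustrative.

```python
import math

def bbox_to_metric(x0, y0, x1, y1, img_w, img_h, r_max, fov_deg, target_range=None):
    """Convert a pixel bounding box to approximate metric extents.

    Width: arc length r * d_theta subtended by the box's azimuth span.
    Depth: the box's extent along the range axis.
    """
    depth = (y1 - y0) / img_h * r_max                    # depth profile (m)
    if target_range is None:
        target_range = (y0 + y1) / 2 / img_h * r_max     # range at box center (m)
    d_theta = (x1 - x0) / img_w * math.radians(fov_deg)  # azimuth span (rad)
    width = target_range * d_theta                       # reflective width (m)
    return width, depth

# Illustrative 100 x 100 px box at mid-range of a 10 m, 120-degree frame.
w, d = bbox_to_metric(460, 460, 560, 560, 1020, 1020, r_max=10.0, fov_deg=120.0)
print(round(w, 3), round(d, 3))
```

Because the width estimate scales with range, the same pixel box corresponds to a larger physical object when it lies farther from the sonar, which is why loose boxes on distant targets inflate the metric estimates so quickly.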
Beyond numerical verification, the first and second columns of Figure 6 visually demonstrate that original FLS images, even after logarithmic transformation or attenuation compensation, exhibit insufficient contrast for effective target identification. To enhance readability, the left panel of Figure 7 has been processed using a contrast enhancement pipeline (contrast stretching with percentile clipping, contrast-limited adaptive histogram equalization, and a mild unsharp mask), making multiple shadows more apparent than in the unprocessed raw view. Nevertheless, this example underscores that our approach delineates target bounding boxes more effectively. Although targets were deliberately deployed to ensure ground-truth accuracy, Xie et al. [21] struggled to track, identify, and distinguish the targets in some images, including this example, during the annotation process; as a result, the two targets were annotated as a single object. As shown in the upper left of Figure 7 (white bounding boxes), only one prominent object is discernible to the naked eye in the raw image, which we believe contributed to the annotation error, and the contrast-enhanced illustration still does not provide sufficiently clear visual cues for accurate delineation of the bounding boxes. Consequently, the contrast enhancement and pseudo-coloring provided by our FLS image enhancement method effectively aid the annotation process.
Overall, this example demonstrates that statistical enhancement improves target interpretability and supports more accurate labeling through physically grounded consistency checks.
4.2. Performance Evaluation and Ablation Studies
To quantify the impact of injecting quasi-static in situ statistical information into the pipeline of UATD tasks, we evaluate the proposed enhanced detector (denoted YOLOv12n-QSIS) against representative one-stage YOLOv12 variants, transformer-based DETR variants, and a two-stage FRCN baseline on the UATD dataset. We further compare our results with prior UATD-related methods reported on public datasets, including FLSD-Net [12], ATTMPConvNet [18], and WBF-ASFFNet [13]. Notably, WBF-ASFFNet [13] does not provide sufficient reproducible details, and its description of the UATD dataset substantially contradicts that of Xie et al. [21]. Although it reports markedly higher performance than other studies, the source of this discrepancy is unclear; we therefore include its numbers in our table for reference only.
Table 7 reports Precision, Recall, F1 score, and computational cost in terms of FLOPs, together with model size.
As shown in Table 7, YOLOv12n-QSIS achieves the best overall detection performance among all compared detectors while retaining a lightweight backbone and low computational cost. Compared with the vanilla YOLOv12n baseline, our proposed hybrid framework improves the F1 score by 8.1% (from 0.800 to 0.865) with nearly the same parameter count and a modest increase in FLOPs (from 4.2 G to 6.3 G), indicating that the gain comes primarily from the proposed quasi-static in situ processing rather than from increased model capacity. Moreover, our method also outperforms substantially larger backbones such as YOLOv12x (F1 score: 0.865 vs. 0.844, a 2.5% improvement), despite requiring 95% fewer FLOPs (6.3 G vs. 131.3 G) and 96% fewer parameters (2.6 MB vs. 59.1 MB), suggesting a more favorable accuracy–efficiency balance than scaling the network size alone. To visualize this trade-off, Figure 8 plots F1 score versus FLOPs for all evaluated models. The results show that, for conventional deep detectors, improved performance is largely coupled with increased computational cost, approximately following a monotonic trend. In contrast, the proposed hybrid approach delivers the strongest accuracy in the low-compute regime, avoiding the steep computational growth required by larger backbones.
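The relative figures quoted above follow directly from the reported numbers; a quick arithmetic check:

```python
# Reported values from Table 7 (F1 scores; FLOPs in G; model sizes in MB).
f1_base, f1_ours, f1_x = 0.800, 0.865, 0.844
flops_ours, flops_x = 6.3, 131.3
size_ours, size_x = 2.6, 59.1

print(round((f1_ours - f1_base) / f1_base * 100, 1))  # % F1 gain over YOLOv12n -> 8.1
print(round((f1_ours - f1_x) / f1_x * 100, 1))        # % F1 gain over YOLOv12x -> 2.5
print(round((1 - flops_ours / flops_x) * 100))        # % fewer FLOPs -> 95
print(round((1 - size_ours / size_x) * 100))          # % fewer parameters -> 96
```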
To further isolate the contributions of individual components in the hybrid framework, we conducted an ablation study by systematically removing key modules from the workflow. The variants are defined as follows: QSIS-nc (no calibration), QSIS-nac (no attenuation compensation), and QSIS-npw (no probability weighting). The results, detailed in Table 8, confirm that the full quasi-static in situ learning pipeline yields the highest overall performance, highlighting the synergistic effect of these modules.
Notably, removing the calibration module causes the largest degradation, particularly in recall. This observation is consistent with a key challenge in UATD: imperfect or inconsistent annotations can distort the target appearance distribution, making some instances difficult to retrieve. As discussed in Section 1, this challenge arises from inherent ambiguities in FLS images, which substantially complicate the manual annotation process. While increasing model scale can reduce certain false positives, thus improving precision, it does not fundamentally resolve missed detections induced by unreliable pattern cues. The proposed calibration step explicitly mitigates this issue by correcting annotation-related inconsistencies, thereby improving recall. Although calibration introduces additional manual effort during the tranche context construction stage, this cost is incurred only offline and does not affect inference-time computation.
Attenuation compensation and probability weighting also yield consistent gains. Although probability weighting contributes more to the overall F1 score, attenuation compensation adds negligible computational cost compared to conventional model scaling, justifying its retention in the final architecture.