PPFS-YOLO: Physics-Prior Frequency-Spatial Fusion for Robust Container Surface Damage Detection

Liu, Jingze; Gao, Feng

doi:10.3390/s26103224

by Jingze Liu and Feng Gao *

School of Information Science and Engineering, Ocean University of China, Qingdao 266100, China

* Author to whom correspondence should be addressed.

Sensors 2026, 26(10), 3224; https://doi.org/10.3390/s26103224

Submission received: 21 March 2026 / Revised: 4 April 2026 / Accepted: 8 April 2026 / Published: 20 May 2026

(This article belongs to the Section Industrial Sensors)

Abstract

Container surface damage detection is critical for ensuring the structural integrity and operational safety of intermodal freight transport. However, visual pseudo-textures arising from rust stains, specular reflections, and paint weathering cause frequent false positives, while the scarcity of puncture-type defects (Hole class) leads to missed detections. Existing YOLO-family detectors address neither the frequency-domain characteristics of such pseudo-textures nor the physical priors inherent in genuine structural damage. In this paper, we propose PPFS-YOLO, a physics-prior frequency-spatial fusion framework built upon YOLOv12s. Two lightweight modules are introduced: (1) Frequency-Spatial Fusion (FSF), which applies a learnable spectral mask in the Fourier domain and performs gated fusion with spatial features to suppress pseudo-texture responses; and (2) Edge-Guided Auxiliary Supervision Module (FIM), which encodes Sobel-derived edge priors as a differentiable $L_{1}$ constraint ($L_{phy}$) to regularize feature learning toward physically plausible damage boundaries. Three pairs of FSF–FIM are inserted into the YOLOv12s neck and head at P3, P4, and P4-head scales. Experiments on a container damage dataset containing 7013 images and three classes (Dent, Hole, Rusty) demonstrate that PPFS-YOLO achieves 64.86% mAP@50, a +12.35 percentage-point improvement over the YOLO12s baseline (SGD, unified optimizer), with only +0.79 M additional parameters (+8.6%) and a modest latency overhead of 2.9 ms (17.2 ms vs. 14.3 ms at $640 \times 640$ on an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA)). Ablation analysis reveals that $L_{phy}$ is the critical catalyst: without it, the combined FSF+FIM modules yield only +0.83 pp, whereas the full model achieves +12.10 pp, underscoring the synergy between frequency-domain fusion and physics-prior regularization.

Keywords: container damage detection; object detection; frequency-spatial fusion; edge-guided supervision; YOLO; deep learning; Fourier spectral masking; edge prior regularization

1. Introduction

Shipping containers constitute the backbone of global freight logistics, with over 800 million TEU (twenty-foot equivalent units) transported annually [1]. Surface damage—including dents from handling impact, puncture holes, and corrosion—can compromise structural integrity, lead to cargo loss, and pose safety hazards at ports. Current inspection practices rely heavily on manual visual assessment conducted under time pressure, resulting in inconsistent accuracy and limited throughput [2]. Automated visual inspection powered by deep learning offers a scalable alternative, yet applying general-purpose detectors to this domain raises unique challenges that existing methods have not fully addressed.

Deep learning-based object detection, particularly the YOLO family of single-stage detectors [3,4,5,6,7,8,9], has been widely adopted for industrial surface defect detection tasks such as steel strip inspection [10,11,12], printed circuit board (PCB) defect recognition [13,14], wind turbine blade assessment [15], and boiler inner-wall detection [16]. These methods achieve real-time inference while maintaining competitive detection accuracy. Meanwhile, Transformer-based detectors such as RT-DETR [17] and DINO [18] have demonstrated the potential of attention mechanisms for detection. However, applying these detectors directly to container damage detection presents two domain-specific challenges:

Pseudo-texture interference. Container surfaces exhibit complex visual patterns—rust stains, paint peeling, specular reflections, and embossed logos—that share low-level feature characteristics with genuine damage. Purely spatial-domain convolutions struggle to disentangle these confounding textures from structural defects, leading to elevated false-positive rates. From a signal-processing perspective, pseudo-textures occupy characteristic frequency bands [19] that overlap with, but are distinct from, genuine damage signatures; this distinction is invisible to standard spatial convolutions.
Minority-class instability. Puncture-type defects (Hole) are inherently rare in service and in available datasets (approximately 12% of annotated instances), causing detectors to under-represent this safety-critical category during training. Approaches such as focal loss [3], seesaw loss [20], and hard example mining [21] partially alleviate class imbalance but do not leverage the distinctive physical signatures of hole-type damage.

Recent work has explored frequency-domain analysis for visual recognition. Fast Fourier Convolution [22] demonstrated the effectiveness of spectral-domain operations with global receptive fields. FcaNet [23] generalized channel attention to multi-spectral representations, proving that global average pooling is a special case of DCT-domain decomposition. In camouflaged object detection, frequency-aware methods [24,25] exploit the complementarity of spatial and spectral features to separate objects from confounding backgrounds—a scenario directly analogous to the pseudo-texture problem in container inspection. For remote sensing, spatial-frequency feature fusion has been applied to oriented object detection [26] and multimodal detection [27]. In the wood panel defect domain, FDADNet [28] employs frequency-domain transformation with adaptive downsampling. Early work on tile defect detection also combined spatial-frequency enhancement with region growing [29]. Despite these advances, no prior work has integrated frequency-spatial fusion with explicit physics-prior regularization within an end-to-end YOLO detection framework.

Physics-aware neural networks encode domain knowledge as differentiable constraints within neural network training [30,31]. In defect analysis, such approaches have been applied to crack monitoring [32,33], alternating current field measurement [34], magnetic flux leakage estimation [35], and material internal structure analysis [36]. However, these approaches target reconstruction or measurement tasks rather than visual object detection. The concept of encoding domain-specific knowledge—such as edge continuity and boundary sharpness [37,38]—as differentiable auxiliary supervision within a detection pipeline remains largely unexplored.

In this paper, we propose PPFS-YOLO (Physics-Prior Frequency-Spatial YOLO), a novel framework that addresses both pseudo-texture interference and physics-aware feature learning for container surface damage detection. Our key contributions are:

We design the Frequency-Spatial Fusion (FSF) module, which performs learnable spectral masking in the 2D Fourier domain followed by gated spatial-frequency feature fusion, enabling the network to selectively suppress pseudo-texture frequency components while preserving damage-related signals.
We propose the Edge-Guided Auxiliary Supervision Module (FIM), which encodes Sobel-derived edge priors as a differentiable $L_{1}$ loss ($L_{phy}$) and applies edge-guided residual refinement, steering the network toward physically plausible damage representations.
We demonstrate through comprehensive ablation that $L_{phy}$ acts as the critical catalyst for the synergy between FSF and FIM: without it, the structural modules alone yield only +0.83 pp mAP@50, but with $L_{phy}$ the improvement reaches +12.10 pp—a $14.6 \times$ amplification that reveals how the physics prior activates the latent potential of frequency-spatial fusion.
PPFS-YOLO achieves state-of-the-art performance on a container damage dataset (64.86% mAP@50), surpassing five competitive baselines including RT-DETR-l and YOLOv10n, with particularly notable gains on the minority Hole class (+22.19 pp) and Precision (+14.52 pp).

Novelty Positioning

The FSF module is conceptually related to three families of prior work. (i) Fixed-basis frequency methods: FcaNet [23] applies pre-computed DCT basis functions for channel attention; our FSF instead optimizes a fully learnable 2D amplitude mask $M_{f}(u, v)$ jointly with the detection objective, discovering domain-specific suppress/amplify patterns that no fixed basis can encode. (ii) Learnable frequency-spatial hybrids: WTConv [39] and FreqFusion [40] perform wavelet or adaptive frequency decomposition for large-kernel receptive fields or dense prediction; our FSF targets pseudo-texture noise suppression rather than receptive-field extension, and uniquely couples the frequency path with a physics-prior loss $L_{phy}$ that guides edge feature learning, a combination absent from all prior frequency methods. (iii) Physics/edge supervision: existing edge-guided networks apply edge loss to segmentation tasks; PPFS-YOLO is, to our knowledge, the first to integrate physics-prior $L_{1}$ auxiliary edge supervision within a YOLO-family object detector.

The remainder of this paper is organized as follows: Section 2 reviews related work on YOLO-based defect detection, frequency-domain feature analysis, and physics-informed learning. Section 3 presents the PPFS-YOLO architecture and its two core modules with detailed mathematical derivations. Section 4 describes the experimental setup and results. Section 5 provides an in-depth analysis of the findings. Section 6 concludes the paper.

2. Related Work

2.1. Evolution of Real-Time Object Detectors

Real-time object detection has evolved along two parallel paradigms.

CNN-based single-stage detectors. The YOLO (You Only Look Once) family [3] epitomizes this paradigm, evolving through successive architectural innovations. YOLOv7 [5] introduced trainable bag-of-freebies with extended efficient layer aggregation; YOLOv8 [4] adopted an anchor-free decoupled head with task-aligned assignment; YOLOv9 [6] proposed programmable gradient information for enhanced feature learning; YOLOv10 [7] eliminated non-maximum suppression via consistent dual assignments; YOLO11 [8] further optimized the backbone with C3k2 modules; and the recent YOLOv12 [9] adopted attention-centric blocks (A2C2f) that harness the representation capacity of attention mechanisms while maintaining real-time speed.

Transformer-based end-to-end detectors. DINO [18] introduced improved denoising anchor boxes for DETR-based detection, and RT-DETR [17] extended this paradigm to real-time detection with a hybrid encoder design. Complementary to architecture innovations, loss function design has been crucial: focal loss [3] addressed foreground-background imbalance; generalized focal loss [41] unified quality estimation with classification; task-aligned one-stage detection (TOOD) [42] jointly optimized classification and localization; and balanced learning strategies [43,44] improved training sample selection.

Feature pyramid networks (FPN) [45] and path aggregation networks (PANet) [46] established the multi-scale feature fusion paradigm, later refined by BiFPN [47] with learnable weighted fusion. Channel attention (SE-Net [48]), spatial attention (CBAM [49]), and their combinations have become standard components. However, all these attention mechanisms operate in the spatial domain and do not exploit frequency-domain representations—a gap our FSF module addresses.

2.2. YOLO-Based Industrial Defect Detection

YOLO variants have been extensively adapted for industrial defect detection across diverse domains. Steel surface inspection: MSFT-YOLO [10] integrated Transformer modules into YOLOv5 for defect detection on steel surfaces. An improved YOLOv4 [11] targeted steel strip defects using enhanced backbone features. Multi-scale feature fusion with attention residual blocks [12] advanced hot-rolled steel detection. Mixed receptive field augmentation [50] and improved YOLOX [51] have also been applied. YOLOv8-MGVS [52] and ASD-YOLO [53] further pushed the performance boundary with multi-module collaborative optimization. PCB and electronics: YOLOv4-MN3 [13] combined MobileNetv3 with YOLOv4 for PCB defects; PCB-YOLO [14] enhanced YOLOv5 specifically for circuit board inspection. Other domains: Wind turbine blade defects [15], boiler inner-wall damage [16], industrial parts [54], particleboard surfaces [55], magnetic tiles [56], bearing surfaces [57], and fabric defects [58] have all been addressed by improved YOLO detectors. MAS-YOLO [59] improved YOLOv12 for PCB defects using median-enhanced attention. A recent transmission line defect detector [60] combined BiFPN with channel-position collaborative attention on YOLOv12.

For container-specific damage, YOLO-NAS [2] automated detection but without addressing the pseudo-texture false-positive problem. A systematic survey [61] and a deep learning survey on surface defects [62] have confirmed the growing complexity of industrial inspection scenarios; yet frequency-domain exploitation and physics-prior integration remain absent from existing YOLO-based defect detectors—a dual gap our PPFS-YOLO addresses.

2.3. Frequency-Domain Feature Analysis in Visual Recognition

Frequency-domain representations provide complementary information to spatial features, with particular advantages for distinguishing textural patterns from structural signals [19]. By Parseval’s theorem [63], signal energy is preserved between the spatial and spectral domains, so frequency-domain filtering does not inherently discard information but rather reorganizes it.

General vision: Fast Fourier Convolution (FFC) [22] introduced spectral convolutions with global receptive fields for image generation, enabling non-local feature interactions without the quadratic cost of self-attention. FcaNet [23] generalized channel attention via discrete cosine transform decomposition, mathematically proving that global average pooling is a special case of frequency-domain feature compression.

Camouflaged object detection: This sub-field presents a problem highly analogous to pseudo-texture confusion. Zhong et al. [24] proposed a frequency enhancement module (FEM) with offline DCT followed by learnable enhancement and high-order relation modules for rich feature fusion. FBNet [25] designed frequency-aware context aggregation (FACA) to suppress confounding high-frequency textures and adaptive frequency attention (AFA) to enhance discriminative frequency components.

Detection tasks: SFFD [26] developed a layer-wise frequency-domain analysis (L-FDA) module for oriented object detection in remote sensing, demonstrating that frequency features capture rotation-invariant signatures. FDTNet [64] employed dual-stream Transformers for frequency-aware prohibited object detection in X-ray images. For multimodal remote-sensing detection, an adaptive frequency-domain gate [27] dynamically learns the dependence on frequency-filtered features.

Industrial defect detection: FDADNet [28] applied multi-axis frequency-domain weighted information representation for wood panel defect detection. A spatial-frequency enhancement method [29] combined Gabor filtering with region growing for tile defect detection.

Several more recent frequency methods further broaden this landscape. Spectral pooling [65] demonstrated frequency-domain downsampling to retain high-frequency information that spatial average-pooling discards. WTConv [39] decomposes features with learnable wavelet kernels to emulate large-kernel convolutions, targeting receptive-field extension rather than texture suppression. FreqFusion [40] aligns feature maps in the frequency domain for dense prediction (semantic segmentation), differing fundamentally from our per-pixel amplitude masking for object detection. HorNet [66] employs recursive gated convolutions to model high-order spatial interactions without explicit frequency decomposition. None of these methods address pseudo-texture false-positive suppression in industrial inspection, nor couple frequency operations with physics-prior supervision over edge maps.

Our FSF module differs from prior frequency-domain approaches in three key aspects: (1) it employs a fully learnable 2D spectral mask $M_{f}(u, v)$ with per-channel scaling rather than fixed frequency filters or offline DCT; (2) it uses bilinear interpolation from a compact base resolution ($40 \times 21$), enabling resolution-agnostic deployment; and (3) it is tightly coupled with a physics-prior module via shared $L_{phy}$ supervision, creating a synergistic effect that exceeds the sum of individual components.

2.4. Physics-Aware and Edge-Guided Learning for Defect Analysis

Physics-aware neural networks encode domain knowledge as soft constraints within neural network training [30,31]. Beyond signal processing, physics-grounded formulations have recently been extended to visual reconstruction tasks: for instance, OUGS [67] derives uncertainty estimates directly from the explicit physical parameters of 3D Gaussian primitives and propagates them through the rendering Jacobian, demonstrating the generality of physics-prior constraints across diverse computer vision problems.

Physics-aware signal-processing methods. In non-destructive testing (NDT), such approaches have demonstrated substantial gains: Chen et al. [32] achieved 0.498 mm RMSE for fatigue crack quantification; GuwNet [33] reduced guided-wave microcrack quantification errors by over 80%; DfedResNet [35] improved magnetic flux leakage depth estimation by 1–2 orders of magnitude; and an end-to-end approach [34] applied rotating-field measurements to achieve 3D defect reconstruction.

However, all existing physics-aware defect analysis methods target reconstruction and quantification tasks (e.g., estimating crack depth from sensor signals), not visual object detection.

Edge-guided visual learning. Edge-guided feature refinement, which encodes the physical constraint that genuine structural damage exhibits sharp, continuous boundaries [37], has not been explored within end-to-end YOLO detection frameworks.

Our FIM module bridges this gap: rather than claiming to solve PDEs or enforce physical laws, it encodes Sobel-derived boundary sharpness as a differentiable $L_{1}$ auxiliary supervision term, steering feature learning toward physically plausible damage representations. This represents, to our knowledge, the first integration of edge-guided auxiliary supervision within a YOLO-family object detector. Table 1 summarizes the positioning of PPFS-YOLO relative to representative prior work.

3. Method

3.1. Overall Architecture

PPFS-YOLO is built upon the YOLOv12s architecture [9] and augments the detection pipeline with two plug-in modules: Frequency-Spatial Fusion (FSF) and Edge-Guided Auxiliary Supervision Module (FIM). As illustrated in Figure 1, the network is organized into three columns—Backbone, Top-Down Neck, and Bottom-Up Head—with 27 indexed layers (0–26) plus a final Detect head.

The backbone follows the YOLOv12s design with Conv–C3k2–A2C2f blocks, producing feature maps at three scales: P3 ($\frac{H}{8} \times \frac{W}{8}$, 256 channels), P4 ($\frac{H}{16} \times \frac{W}{16}$, 512 channels), and P5 ($\frac{H}{32} \times \frac{W}{32}$, 1024 channels). In the neck, a top-down and bottom-up feature pyramid network [45,46] fuses multi-scale features through concatenation and A2C2f blocks. Table 2 details the complete 27-layer architecture.

Three FSF–FIM pairs are inserted at strategically chosen positions:

P4 Neck (Layers 12–13): After the first A2C2f block in the top-down path (512 channels).
P3 Neck (Layers 17–18): After the second A2C2f block in the top-down path (256 channels).
P4 Head (Layers 22–23): After the A2C2f block in the bottom-up path (512 channels).

This placement ensures that frequency-spatial fusion and physics-prior regularization are applied to both medium-scale (P4) and fine-scale (P3) features, where pseudo-texture interference and small defect details are most prominent. The total parameter overhead is only +0.79 M (from 9.23 M to 10.02 M), and the computational cost increases by +1.7 GFLOPs (from 10.8 to 12.5 GFLOPs). Table 3 provides a per-module breakdown.
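As a quick consistency check, the relative overheads follow directly from the absolute figures quoted above (no values beyond those stated in the text are used):

$$\frac{10.02 - 9.23}{9.23} = \frac{0.79}{9.23} \approx 8.6\%, \qquad \frac{12.5 - 10.8}{10.8} = \frac{1.7}{10.8} \approx 15.7\%.$$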

3.2. Frequency-Spatial Fusion Module

The FSF module’s core innovation is a fully learnable 2D spectral amplitude mask $M_{f}(u, v)$ that selectively suppresses pseudo-texture frequency bands while preserving damage-related signals, jointly optimized with the detection objective and uniquely activated by the physics-prior loss $L_{phy}$ from the co-located FIM module. It performs learnable spectral filtering in the Fourier domain and fuses the result with spatial features through a gated mechanism. As shown in Figure 2, the module consists of a frequency path (top) and a spatial identity path (bottom), joined by a gated fusion block.

Given an input spatial feature map $F_{s} \in \mathbb{R}^{C \times H \times W}$, the FSF module operates as follows.

3.2.1. Design Rationale

The FSF module exploits the energy equivalence between spatial and frequency domains guaranteed by Parseval’s theorem [19,63]: modulating the amplitude spectrum $|\hat{F}[u, v]|$ by a learnable mask $M_{f}(u, v)$ directly redistributes spatial energy. Bands where $M_{f} \approx 0$ are globally suppressed, an operation that is fundamentally non-local and thus inaccessible to standard spatial convolutions with limited receptive fields. This global suppression is precisely what is needed to attenuate spatially repetitive pseudo-texture patterns (rust stains, paint weathering) that share low-level spatial frequencies across the entire image.
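The energy equivalence invoked here can be checked numerically. The sketch below is a pure-Python illustration, not the authors' code: it uses a direct (slow) DFT with the unnormalized forward convention of Eq. (1), under which Parseval's relation reads $\sum |x|^{2} = \frac{1}{HW} \sum |\hat{X}|^{2}$. The helper name `dft2` and the toy input are assumptions made for the example.

```python
import cmath
import math


def dft2(x):
    """Unnormalized 2D DFT of one real H x W channel (direct O((HW)^2) sum)."""
    H, W = len(x), len(x[0])
    return [[sum(x[m][n] * cmath.exp(-2j * math.pi * (u * m / H + v * n / W))
                 for m in range(H) for n in range(W))
             for v in range(W)] for u in range(H)]


# Toy 3 x 4 "feature map" (values chosen arbitrarily for illustration).
x = [[1.0, 2.0, 0.0, -1.0],
     [0.5, 0.0, 3.0, 1.0],
     [2.0, -2.0, 1.0, 0.0]]
H, W = len(x), len(x[0])
X = dft2(x)

# Energy measured in the spatial domain and in the spectral domain.
spatial_energy = sum(value * value for row in x for value in row)
spectral_energy = sum(abs(coef) ** 2 for row in X for coef in row) / (H * W)

print(spatial_energy, spectral_energy)  # the two numbers agree up to rounding
```

Because the two energies coincide, scaling a spectral coefficient by a mask value rescales that coefficient's share of the total spatial energy, regardless of where in the image the corresponding pattern appears.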

3.2.2. Forward Computation

Step 1: Frequency-Domain Transform. The 2D FFT is applied channel-wise:

$$\hat{F}[c, u, v] = \sum_{m = 0}^{H - 1} \sum_{n = 0}^{W - 1} F_{s}[c, m, n] \, e^{- j 2 \pi \left(\frac{u m}{H} + \frac{v n}{W}\right)}, \qquad \hat{F} \in \mathbb{C}^{C \times H \times W}. \tag{1}$$

The amplitude and phase components are separated as $A = |\hat{F}|$ and $\Phi = \angle(\hat{F})$.

Step 2: Learnable Spectral Masking. A 2D learnable frequency mask $M_{f} \in \mathbb{R}^{H_{b} \times W_{b}}$ is maintained at a compact base resolution ($H_{b} = 40$, $W_{b} = 21$) and bilinearly interpolated to match the input spatial dimensions. This low-resolution parameterization reduces the mask parameters from $H \times W$ to $H_{b} \times W_{b}$ (e.g., from $80 \times 40 = 3200$ to $40 \times 21 = 840$ for P3), providing implicit low-pass regularization on the mask itself. A per-channel scaling vector $S_{c} \in \mathbb{R}^{C}$ modulates the mask across channels. The masked amplitude is:

$$\tilde{A}[c, u, v] = S_{c}[c] \cdot \mathrm{Interp}(M_{f})[u, v] \cdot A[c, u, v], \tag{2}$$

where $\mathrm{Interp}(\cdot)$ denotes bilinear interpolation from $(H_{b}, W_{b})$ to $(H, W)$.
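To make the mask upsampling concrete, the following pure-Python sketch resizes a base-resolution mask to a target grid by bilinear interpolation. It is an illustration under stated assumptions rather than the authors' implementation: the endpoint-aligned sampling convention, the all-ones initialization, and the function name `bilinear_resize` are all choices made for this example; the text does not specify them.

```python
def bilinear_resize(mask, out_h, out_w):
    """Bilinearly resample a 2D list to out_h x out_w (endpoints aligned)."""
    in_h, in_w = len(mask), len(mask[0])
    out = []
    for i in range(out_h):
        # Map the output row index to a fractional input row coordinate.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = (1 - wx) * mask[y0][x0] + wx * mask[y0][x1]
            bottom = (1 - wx) * mask[y1][x0] + wx * mask[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


H_b, W_b = 40, 21
base_mask = [[1.0] * W_b for _ in range(H_b)]   # assumed all-pass initialization
full_mask = bilinear_resize(base_mask, 80, 40)  # target grid from the P3 example

print(H_b * W_b, len(full_mask) * len(full_mask[0]))  # stored vs. applied entries
```

Only the $H_{b} \times W_{b}$ entries are trainable; every value on the larger grid is a weighted average of at most four neighbouring base entries, which is why the upsampled mask cannot vary faster than the base grid allows.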

Step 3: Inverse Transform. The frequency-enhanced feature map is reconstructed via inverse FFT:

$$F_{freq}[c, m, n] = \frac{1}{H W} \sum_{u = 0}^{H - 1} \sum_{v = 0}^{W - 1} \tilde{A}[c, u, v] \, e^{j \Phi[c, u, v]} \, e^{j 2 \pi \left(\frac{u m}{H} + \frac{v n}{W}\right)}. \tag{3}$$
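Steps 1 to 3 can be traced end to end on a single small channel. The sketch below is a minimal pure-Python illustration, not the authors' code: it uses a direct DFT instead of an FFT, takes a mask that is already at full resolution, and keeps the real part of the inverse transform as Algorithm 1 does. The function names and toy inputs are assumptions. Note that the output is exactly real only when the mask is symmetric under $(u, v) \to (-u, -v)$; both masks used here satisfy this.

```python
import cmath
import math


def dft2(x):
    """Unnormalized forward 2D DFT of one H x W channel, as in Eq. (1)."""
    H, W = len(x), len(x[0])
    return [[sum(x[m][n] * cmath.exp(-2j * math.pi * (u * m / H + v * n / W))
                 for m in range(H) for n in range(W))
             for v in range(W)] for u in range(H)]


def idft2(X):
    """Inverse 2D DFT with the 1/(HW) factor, as in Eq. (3)."""
    H, W = len(X), len(X[0])
    return [[sum(X[u][v] * cmath.exp(2j * math.pi * (u * m / H + v * n / W))
                 for u in range(H) for v in range(W)) / (H * W)
             for n in range(W)] for m in range(H)]


def frequency_path(x, mask, scale=1.0):
    """Scale the amplitude by scale * mask[u][v], keep the phase, invert."""
    X = dft2(x)
    masked = [[scale * mask[u][v] * abs(c) * cmath.exp(1j * cmath.phase(c))
               for v, c in enumerate(row)] for u, row in enumerate(X)]
    # Keep the real part, mirroring the Re(.) in Algorithm 1.
    return [[value.real for value in row] for row in idft2(masked)]


x = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 13.0]]
H, W = len(x), len(x[0])

all_pass = [[1.0] * W for _ in range(H)]
dc_only = [[1.0 if (u, v) == (0, 0) else 0.0 for v in range(W)]
           for u in range(H)]

same = frequency_path(x, all_pass)  # reproduces x up to rounding
flat = frequency_path(x, dc_only)   # every entry equals the mean of x
```

The second call shows the non-local character of spectral masking: zeroing every coefficient except the DC term changes all spatial positions at once, something a small spatial kernel cannot do in a single layer.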

Step 4: Gated Fusion. The spatial and frequency features are fused through a learnable gating mechanism:

$$\alpha = \sigma\left(\mathrm{Conv}_{1 \times 1}([F_{s}; F_{freq}]) + b_{gate}\right) \in \mathbb{R}^{C \times H \times W}, \tag{4}$$

$$F^{*} = \alpha \odot F_{s} + (1 - \alpha) \odot F_{freq}, \tag{5}$$

where $[\cdot; \cdot]$ denotes channel-wise concatenation, $\mathrm{Conv}_{1 \times 1}$ reduces $2C$ channels to $C$ (via an intermediate $\frac{C}{4}$ bottleneck), and $\sigma(\cdot)$ is the sigmoid function.
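The gating in Eqs. (4) and (5) is an elementwise convex combination of the two paths. The sketch below illustrates only that arithmetic: the argument `conv_out` is a hypothetical stand-in for the output of the $1 \times 1$ convolution, whose weights and $\frac{C}{4}$ bottleneck are not reproduced, and the default bias of 1.0 anticipates the initialization discussed in Section 3.2.3.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(f_s, f_freq, conv_out, b_gate=1.0):
    """Elementwise Eqs. (4)-(5) on H x W lists.

    conv_out stands in for Conv1x1([F_s; F_freq]); it is supplied by the
    caller here because the real layer is outside the scope of this sketch.
    """
    fused = []
    for s_row, q_row, z_row in zip(f_s, f_freq, conv_out):
        row = []
        for s, q, z in zip(s_row, q_row, z_row):
            alpha = sigmoid(z + b_gate)          # Eq. (4)
            row.append(alpha * s + (1.0 - alpha) * q)  # Eq. (5)
        fused.append(row)
    return fused


f_s = [[1.0, 1.0], [1.0, 1.0]]      # spatial path
f_freq = [[0.0, 0.0], [0.0, 0.0]]   # frequency path
zero_logits = [[0.0, 0.0], [0.0, 0.0]]

fused = gated_fusion(f_s, f_freq, zero_logits)  # each entry is sigmoid(1.0)
```

Since $\alpha \in (0, 1)$, each fused value always lies between the corresponding spatial and frequency values, so the gate can reweight the two paths but cannot amplify either one.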

3.2.3. Gate Initialization Analysis

The gate bias is initialized to $b_{gate} = +1.0$. Since at initialization the convolution weights yield approximately zero-mean outputs, the initial gate value is:

$$\alpha_{0} \approx \sigma(1.0) = \frac{1}{1 + e^{-1}} \approx 0.731. \tag{6}$$

This ensures that ≈73% of the initial output comes from the spatial pathway, preserving the pretrained backbone representations during early training and allowing the frequency pathway to gradually increase its contribution as the mask $M_{f}$ is optimized. The gradient of the gate with respect to its input $z = \mathrm{Conv}(\cdot) + b_{gate}$ is:

$$\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1 - \sigma(z)) \approx 0.731 \times 0.269 \approx 0.197, \tag{7}$$

which lies in the high-sensitivity region of the sigmoid, ensuring that gradients flow effectively to update the gating parameters. The complete FSF forward pass is summarized in Algorithm 1.
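The trade-off behind this initialization can be seen by tabulating the initial gate value and its slope for a few candidate biases. This is a small illustrative script; only $b_{gate} = 1.0$ comes from the text, and the other bias values are assumptions included for comparison.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid, sigma(z) * (1 - sigma(z)), as in Eq. (7)."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Only b_gate = 1.0 is taken from the text; the rest are for comparison.
for b_gate in (0.0, 1.0, 3.0, 5.0):
    alpha_0 = sigmoid(b_gate)
    slope = sigmoid_grad(b_gate)
    print(f"b_gate={b_gate:.1f}  alpha_0={alpha_0:.3f}  dsigma/dz={slope:.3f}")
```

For $b_{gate} = 1.0$ the script reproduces the values 0.731 and 0.197 quoted above. A zero bias would give the largest slope (0.25) but no preference for the pretrained spatial path, while larger biases favour the spatial path more strongly at the cost of a rapidly shrinking slope.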

Algorithm 1 FSF Module Forward Pass

Require: Input feature $F_{s} \in \mathbb{R}^{C \times H \times W}$; learnable mask $M_{f} \in \mathbb{R}^{H_{b} \times W_{b}}$; channel scale $S_{c} \in \mathbb{R}^{C}$; gate bias $b_{gate}$
Ensure: Fused feature $F^{*} \in \mathbb{R}^{C \times H \times W}$
1: $\hat{F} \leftarrow \mathrm{FFT2}(F_{s})$ ▹ Ortho-normalized 2D FFT
2: $A, \Phi \leftarrow |\hat{F}|, \angle(\hat{F})$ ▹ Amplitude & Phase
3: $M \leftarrow \mathrm{BilinearInterp}(M_{f}, H, W)$ ▹ Upsample mask
4: $\tilde{A}[c, u, v] \leftarrow S_{c}[c] \cdot M[u, v] \cdot A[c, u, v]$ ▹ Spectral masking
5: $\hat{F}_{masked} \leftarrow \tilde{A} \cdot e^{j \Phi}$ ▹ Reconstruct complex spectrum
6: $F_{freq} \leftarrow \mathrm{Re}(\mathrm{IFFT2}(\hat{F}_{masked}))$ ▹ Inverse FFT
7: $\alpha \leftarrow \sigma(\mathrm{Conv}_{1 \times 1}([F_{s}; F_{freq}]) + b_{gate})$ ▹ Gated fusion
8: $F^{*} \leftarrow \alpha \odot F_{s} + (1 - \alpha) \odot F_{freq}$
9: return $F^{*}$

3.3. Edge-Guided Auxiliary Supervision Module

The FIM module encodes the physical prior that genuine structural damage exhibits sharp, continuous edge boundaries, whereas pseudo-textures produce diffuse or irregular edge responses. As shown in Figure 3, the module comprises three parallel branches—Edge Prior (left), Edge Predictor (center), and Residual Refinement (right)—whose internal data flows and loss connections are annotated in the diagram.

Given an input feature $F \in \mathbb{R}^{C \times H \times W}$, the module proceeds in four steps whose key notation is summarized below for reference.

Symbol	Dimensions	Description
$F$	$C \times H \times W$	Input feature map
$F^{'}$	$C \times H \times W$	Refined output feature map
$P$	$C \times H \times W$	Sobel edge prior (fixed, $\in [0, 1]$ )
$Q$	$C \times H \times W$	Predicted edge map (learnable, $\in [0, 1]$ )
$G_{x}, G_{y}$	$C \times H \times W$	Horizontal/vertical Sobel gradients
$L_{phy}$	scalar	Physics-prior $L_{1}$ alignment loss
$γ$	scalar	Learnable residual scale (init. $0.1$ )
$Δ$	$C \times H \times W$	Edge-guided refinement residual

3.3.1. Edge Prior via Gradient Operators

The Sobel operator [38] computes approximate image gradients via separable $3 \times 3$ kernels $K_{x}$ and $K_{y}$ (horizontal and vertical, respectively). These are applied as fixed depthwise convolutions (shared across $C$ channels, no learnable parameters). The gradient computation is performed on a detached copy $F_{\mathrm{det}} = \mathrm{sg}(F)$ (where $\mathrm{sg}$ denotes the stop-gradient operator) to prevent the physics constraint from directly altering the backbone features:

$$G_{x} = K_{x} * F_{\mathrm{det}}, \qquad G_{y} = K_{y} * F_{\mathrm{det}}, \qquad G_{x}, G_{y} \in \mathbb{R}^{C \times H \times W}, \tag{8}$$

where $*$ denotes the convolution operator. The edge magnitude prior is:

$$P = \sigma\left(\sqrt{G_{x}^{2} + G_{y}^{2} + \epsilon}\right), \qquad P \in [0, 1]^{C \times H \times W}, \tag{9}$$

where $\epsilon = 10^{-6}$ prevents numerical instability and $\sigma(\cdot)$ normalizes the magnitude to $[0, 1]$.

The physical interpretation is as follows: regions with genuine structural damage (dents, holes) produce strong, coherent gradient responses across multiple feature channels, while pseudo-textures yield spatially diffuse gradients or responses confined to specific channels. This signal structure motivates a channel-wise edge alignment loss.
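The edge prior of Eqs. (8) and (9) can be sketched for a single channel in pure Python. This is an illustration, not the authors' code: zero padding at the borders and the use of cross-correlation (what deep-learning convolution layers compute) are assumptions, as are the function names. One property worth noting is that the sigmoid receives a non-negative argument, so the resulting values fall in $[0.5, 1)$, a sub-range of the stated $[0, 1]$.

```python
import math

# Standard 3 x 3 Sobel kernels (horizontal and vertical gradients).
K_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
K_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def conv3x3_same(img, kernel):
    """3 x 3 cross-correlation with zero padding (assumed border handling)."""
    H, W = len(img), len(img[0])
    out = [[0.0] * W for _ in range(H)]
    for m in range(H):
        for n in range(W):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    mm, nn = m + i - 1, n + j - 1
                    if 0 <= mm < H and 0 <= nn < W:
                        acc += kernel[i][j] * img[mm][nn]
            out[m][n] = acc
    return out


def edge_prior(channel, eps=1e-6):
    """Sigmoid of the Sobel gradient magnitude for one channel, Eq. (9)."""
    gx = conv3x3_same(channel, K_X)
    gy = conv3x3_same(channel, K_Y)
    return [[1.0 / (1.0 + math.exp(-math.sqrt(a * a + b * b + eps)))
             for a, b in zip(gx_row, gy_row)]
            for gx_row, gy_row in zip(gx, gy)]


# Toy channel with a sharp vertical step between columns 1 and 2.
step = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
P = edge_prior(step)

print(round(P[2][2], 3), round(P[2][3], 3))  # on the step vs. in a flat region
```

On the step the prior is close to 1, while in the flat interior it stays near the floor value $\sigma(\sqrt{\epsilon}) \approx 0.5$, which is the contrast the alignment loss relies on.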

3.3.2. Learnable Edge Prediction

A lightweight predictor network $h_{\theta}$ estimates a physics-consistent edge map:

$$Q = \sigma\left(\mathrm{PW}_{C \to C}(\mathrm{SiLU}(\mathrm{BN}(\mathrm{DW}_{3 \times 3}(F))))\right), \qquad Q \in [0, 1]^{C \times H \times W}, \tag{10}$$

where $\mathrm{DW}_{3 \times 3}$ denotes a depthwise $3 \times 3$ convolution (capturing local spatial patterns with $9C$ parameters) and $\mathrm{PW}_{C \to C}$ is a pointwise $1 \times 1$ convolution (cross-channel mixing with $C^{2}$ parameters).

3.3.3. Physics-Prior Loss and Gradient Analysis

The alignment between the predicted and actual edge maps is enforced via the $L_{1}$ loss:

$$L_{phy}^{(i)} = \frac{1}{C H W} \sum_{c, m, n} \left| Q^{(i)}[c, m, n] - P^{(i)}[c, m, n] \right|, \qquad i \in \{1, 2, 3\}. \tag{11}$$

Gradient flow analysis. The gradient of $L_{phy}^{(i)}$ with respect to the predictor parameters $\theta$ is:

$$\frac{\partial L_{phy}^{(i)}}{\partial \theta} = \frac{1}{C H W} \sum_{c, m, n} \mathrm{sign}\left(Q_{c, m, n}^{(i)} - P_{c, m, n}^{(i)}\right) \cdot \frac{\partial Q_{c, m, n}^{(i)}}{\partial \theta}. \tag{12}$$

The $L_{1}$ loss is specifically chosen over $L_{2}$ because the sign function provides constant-magnitude gradients regardless of the residual size. This prevents the gradient vanishing that occurs with $L_{2}$ when $|Q - P| \to 0$, ensuring that the physics alignment signal remains strong throughout training. Furthermore, $L_{1}$ is robust to outlier edge responses that may arise from container surface specular reflections.
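The contrast between the two losses can be made concrete with a few lines of Python. The sketch below evaluates Eq. (11) on nested lists and compares the per-element derivative of $|q - p|$ with that of $(q - p)^{2}$ as the residual shrinks; the function names and sample values are assumptions, and the shared $\frac{1}{CHW}$ factor is omitted from the per-element derivatives.

```python
def l_phy(Q, P):
    """Mean absolute difference over all (c, m, n) entries, as in Eq. (11)."""
    diffs = [abs(q - p)
             for q_ch, p_ch in zip(Q, P)
             for q_row, p_row in zip(q_ch, p_ch)
             for q, p in zip(q_row, p_row)]
    return sum(diffs) / len(diffs)


def l1_grad(q, p):
    """d|q - p|/dq: the sign of the residual (0 at exactly zero residual)."""
    r = q - p
    return float((r > 0) - (r < 0))


def l2_grad(q, p):
    """d(q - p)^2/dq: proportional to the residual itself."""
    return 2.0 * (q - p)


p = 0.5
for residual in (0.4, 0.1, 0.01, 0.001):
    q = p + residual
    print(f"residual={residual}: L1 grad={l1_grad(q, p)}, "
          f"L2 grad={l2_grad(q, p):.4f}")
```

The $L_{1}$ derivative stays at unit magnitude for every non-zero residual, whereas the $L_{2}$ derivative shrinks in proportion to the residual, which is the behaviour the paragraph above appeals to.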

Why stop-gradient on $P$? The edge prior $P$ is computed from the detached feature $\mathrm{sg}(F)$. If gradient were allowed to flow through $P$, the network could trivially minimize $L_{phy}$ by making $F$ smooth (zero-gradient features everywhere), which would destroy the representation quality. By detaching $P$, the physics loss exclusively trains the predictor $h_{\theta}$ to predict edge structure, creating a knowledge distillation-like setup where the Sobel operator acts as a fixed “teacher” and $h_{\theta}$ is the “student.”

3.3.4. Edge-Guided Residual Refinement

The predicted edge map modulates the input feature, and a two-stage depthwise-pointwise convolutional refinement is applied:

$$F^{'} = F + \gamma \cdot \underbrace{\mathrm{PW}(\mathrm{SiLU}(\mathrm{BN}(\mathrm{DW}(\mathrm{PW}(\mathrm{SiLU}(\mathrm{BN}(\mathrm{DW}(Q \odot F))))))))}_{\mathrm{Refine}(Q \odot F)}, \tag{13}$$

where $\gamma$ is a learnable scalar initialized to $\gamma_{0} = 0.1$. The small initial $\gamma_{0}$ is critical: it ensures that the refinement branch produces near-zero modifications at the start of training, preventing the randomly initialized edge predictor from corrupting the feature representations. As training progresses and $Q$ converges toward meaningful edge maps, $\gamma$ grows to allow stronger edge-guided modulation. The complete FIM forward pass is summarized in Algorithm 2.

Algorithm 2 FIM Module Forward Pass

Require: Input feature $F \in \mathbb{R}^{C \times H \times W}$; Sobel kernels $K_{x}, K_{y}$; edge predictor $h_{\theta}$; residual scale $\gamma$
Ensure: Refined feature $F^{'} \in \mathbb{R}^{C \times H \times W}$; physics loss $L_{phy}$
1: $F_{\mathrm{det}} \leftarrow \mathrm{sg}(F)$ ▹ Stop gradient on feature
2: $G_{x}, G_{y} \leftarrow K_{x} * F_{\mathrm{det}}, K_{y} * F_{\mathrm{det}}$ ▹ Sobel depthwise conv
3: $P \leftarrow \sigma(\sqrt{G_{x}^{2} + G_{y}^{2} + \epsilon})$ ▹ Edge prior map $\in [0, 1]$
4: $Q \leftarrow h_{\theta}(F)$ ▹ Learnable edge prediction
5: $L_{phy} \leftarrow \frac{1}{C H W} \| Q - P \|_{1}$ ▹ Physics-prior loss
6: $\Delta \leftarrow \mathrm{Refine}(Q \odot F)$ ▹ Two-stage DW–PW refinement
7: $F^{'} \leftarrow F + \gamma \cdot \Delta$ ▹ Edge-guided residual
8: return $F^{'}, L_{phy}$
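The data flow of Algorithm 2 can be sketched as a small function that takes the three branches as callables. This is a structural illustration only, for one $H \times W$ channel stored as nested lists: the callables `edge_prior`, `predictor`, and `refine` are hypothetical placeholders for the Sobel prior, $h_{\theta}$, and the two-stage refinement, and plain Python has no autograd, so the stop-gradient is represented simply by treating $P$ as a constant target.

```python
def fim_forward(F, edge_prior, predictor, refine, gamma=0.1):
    """Return (F', L_phy) following the step order of Algorithm 2."""
    P = edge_prior(F)  # steps 1-3: fixed prior, used only as a target
    Q = predictor(F)   # step 4: learnable edge prediction

    # Step 5: mean absolute difference between prediction and prior.
    count = sum(len(row) for row in F)
    l_phy = sum(abs(q - p)
                for q_row, p_row in zip(Q, P)
                for q, p in zip(q_row, p_row)) / count

    # Step 6: refine the edge-modulated feature Q * F (elementwise product).
    gated = [[q * f for q, f in zip(q_row, f_row)]
             for q_row, f_row in zip(Q, F)]
    delta = refine(gated)

    # Step 7: scaled residual connection.
    F_out = [[f + gamma * d for f, d in zip(f_row, d_row)]
             for f_row, d_row in zip(F, delta)]
    return F_out, l_phy


# Stub branches, chosen only so the arithmetic is easy to follow.
F = [[2.0, 4.0]]
F_out, l_phy = fim_forward(
    F,
    edge_prior=lambda x: [[1.0 for _ in row] for row in x],
    predictor=lambda x: [[0.5 for _ in row] for row in x],
    refine=lambda x: x,
)
```

With $\gamma = 0.1$ the output differs from the input by only a tenth of the refinement signal, which is the near-identity behaviour at the start of training described in Section 3.3.4.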

3.4. Training Objective and Optimization

The total training objective combines the standard YOLO detection loss with the physics-prior regularization:

$$L = \underbrace{L_{cls} + L_{box} + L_{dfl}}_{L_{\mathrm{det}}} + \underbrace{\lambda_{phy} \sum_{i = 1}^{3} L_{phy}^{(i)}}_{L_{phy}^{total}}, \tag{14}$$

where $L_{cls}$ is the binary cross-entropy classification loss, $L_{box}$ is the CIoU bounding box regression loss, and $L_{dfl}$ is the distribution focal loss [41].

Gradient magnitude balancing. The physics loss coefficient $λ_{phy} = 0.5$ is selected to balance the gradient magnitudes. At convergence, the typical magnitudes are $∥ \nabla_{θ} L_{det} ∥ \approx O (10^{- 3})$ and $∥ \nabla_{θ} L_{phy}^{(i)} ∥ \approx O (10^{- 3})$ . With $λ_{phy} = 0.5$ and three FIM instances, the total physics gradient magnitude is at most $0.5 \times 3 \times O (10^{- 3}) = O (1.5 \times 10^{- 3})$ (by the triangle inequality over the three terms), which is comparable to, but does not dominate, the detection gradient $∥ \nabla_{θ} L_{det} ∥$ .
Learning rate schedule. PPFS module parameters use a $5 \times$ learning rate multiplier relative to the backbone, accelerating adaptation of the newly initialized FSF masks and FIM predictors while the pretrained backbone parameters fine-tune at the standard rate.
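The objective in Equation (14) and the gradient-balance arithmetic can be checked with a few lines of Python. This is only a sketch: the function names and the example loss values are hypothetical, and the gradient figure is the order-of-magnitude estimate quoted in the text rather than a measured quantity.

```python
def total_loss(l_cls, l_box, l_dfl, l_phy_terms, lambda_phy=0.5):
    """Equation (14): detection loss plus weighted sum of per-FIM physics losses."""
    l_det = l_cls + l_box + l_dfl
    l_phy_total = lambda_phy * sum(l_phy_terms)
    return l_det + l_phy_total, l_det, l_phy_total


def physics_gradient_bound(per_instance_norm, n_instances=3, lambda_phy=0.5):
    """Upper bound on the combined physics gradient norm (triangle inequality)."""
    return lambda_phy * n_instances * per_instance_norm


# Hypothetical loss values for one mini-batch with three FIM instances.
loss, l_det, l_phy_total = total_loss(1.0, 2.0, 0.5, [0.2, 0.2, 0.2])

# With per-instance gradient norms of about 1e-3, the bound is about 1.5e-3.
bound = physics_gradient_bound(1e-3)
```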

Computational Complexity Analysis

For an FSF module operating on $F_{s} \in \mathbb{R}^{C \times H \times W}$:

$$Ω_{FSF} = \underbrace{2 \cdot O(C H W \log(H W))}_{\text{FFT + iFFT}} + \underbrace{O(C H W)}_{\text{masking}} + \underbrace{O(C^{2} H W / r)}_{\text{gate conv}},$$

(15)

where $r = 4$ is the channel reduction ratio. For a FIM module:

$$Ω_{FIM} = \underbrace{O(9 C H W)}_{\text{Sobel (fixed)}} + \underbrace{O((9 C + C^{2}) H W)}_{\text{predictor}} + \underbrace{O(2 (9 C + C^{2}) H W)}_{\text{2-stage refine}}.$$

(16)

Given the practical dimensions ($C = 256$ or 512, $H \times W = 80 \times 80$ or $40 \times 40$), the total overhead of three FSF–FIM pairs is 1.7 GFLOPs, representing a 15.7% increase relative to the 10.8 GFLOPs baseline, a modest cost for the +12.35 pp accuracy improvement. The full PPFS-YOLO training procedure is detailed in Algorithm 3.

Algorithm 3 PPFS-YOLO Training Procedure

Require: Training set $D$; pretrained YOLOv12s weights $w_{0}$; epochs $T = 200$; physics loss weight $λ_{phy} = 0.5$; LR boost factor $η_{boost} = 5$
Ensure: Trained PPFS-YOLO model
1: Initialize backbone and neck with $w_{0}$; randomly init FSF & FIM params
2: Set $b_{gate} \leftarrow 1.0$, $γ \leftarrow 0.1$, $M_{f} \leftarrow 1$
3: for $t = 1$ to $T$ do
4:   for each mini-batch $(x, y) \in D$ do
5:     $\hat{y}, \{L_{phy}^{(i)}\}_{i = 1}^{3} \leftarrow$ PPFS-YOLO$(x)$ ▹ Forward
6:     $L_{det} \leftarrow L_{cls} + L_{box} + L_{dfl}$
7:     $L \leftarrow L_{det} + λ_{phy} \sum_{i = 1}^{3} L_{phy}^{(i)}$ ▹ Total loss (Equation (14))
8:     Update backbone params with learning rate $η$
9:     Update PPFS params with learning rate $η_{boost} \cdot η$ ▹ $5 \times$ boost
10:   end for
11:   Cosine-anneal $η$
12: end for
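The learning-rate handling in steps 8, 9, and 11 of Algorithm 3 can be sketched as follows. The sketch makes several assumptions that the text does not state: the cosine schedule decays to a floor `lr_min` of zero, the schedule is stepped once per epoch, and PPFS parameters are recognized by the substrings `fsf` and `fim` in their names. The parameter names in the usage lines are hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs, lr0, lr_min=0.0):
    """Cosine-annealed base learning rate for a 0-indexed epoch."""
    progress = epoch / max(total_epochs - 1, 1)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * progress))


def split_param_groups(param_names, base_lr, boost=5.0, ppfs_tags=("fsf", "fim")):
    """Assign each parameter to the backbone group or the boosted PPFS group."""
    groups = {
        "backbone": {"lr": base_lr, "params": []},
        "ppfs": {"lr": boost * base_lr, "params": []},
    }
    for name in param_names:
        is_ppfs = any(tag in name.lower() for tag in ppfs_tags)
        groups["ppfs" if is_ppfs else "backbone"]["params"].append(name)
    return groups


# Hypothetical parameter names; real names depend on the model definition.
names = ["model.0.conv.weight", "model.12.fsf.mask", "model.13.fim.gamma"]
for epoch in (0, 100, 199):
    eta = cosine_lr(epoch, total_epochs=200, lr0=1e-2)
    groups = split_param_groups(names, base_lr=eta)
    print(epoch, groups["backbone"]["lr"], groups["ppfs"]["lr"])
```

Keeping the boost as a multiplier on the annealed base rate means both groups follow the same cosine shape and differ only by the constant factor $η_{boost}$.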

4. Experiments

4.1. Dataset

We evaluate PPFS-YOLO on a container surface damage detection dataset comprising 7013 images annotated with bounding boxes across three damage categories.
Annotation protocol. All images were labeled using LabelImg (https://github.com/heartexlabs/labelImg, accessed on 7 April 2026) with axis-aligned bounding boxes following a three-step quality assurance procedure: (1) initial annotation by two trained annotators with a shared labeling guide; (2) cross-review of ambiguous cases by a senior annotator; and (3) final validation pass to remove inconsistent boxes. Inter-annotator agreement for the initial round was Cohen’s $κ = 0.83$ (“almost perfect” per Landis–Koch scale), with primary disagreements at partially occluded Rust–Dent boundaries.
Split and class counts. The dataset comprises 6600 training images (3300 defective + 3300 negative background) and 413 held-out test images (5.9%), with stratified sampling over the defective images to preserve per-class frequency; the test fold also serves as the validation set during YOLO training. Pre-augmentation bounding box counts: Dent 4438; Hole 1098; Rusty 3568 (total 9104 instances). Table 4 provides a detailed statistical breakdown.
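For readers unfamiliar with the agreement statistic, the following sketch computes Cohen's $κ$ for two annotators and maps it to the Landis–Koch bands. It assumes each matched region has already been reduced to one categorical label per annotator; how boxes were matched between annotators is not described in the text, and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa):
    """Verbal agreement band according to Landis and Koch."""
    if kappa < 0.0:
        return "poor"
    for upper, name in ((0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                        (0.80, "substantial")):
        if kappa <= upper:
            return name
    return "almost perfect"


# Hypothetical labels for four matched regions.
annotator_1 = ["Dent", "Dent", "Hole", "Rusty"]
annotator_2 = ["Dent", "Dent", "Hole", "Dent"]
kappa = cohens_kappa(annotator_1, annotator_2)
```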

Figure 4 shows representative training images with annotated bounding boxes. The three damage categories exhibit distinct visual characteristics but share surface textures that challenge purely spatial-domain detectors.

4.2. Implementation Details

All experiments are conducted on a server equipped with four NVIDIA RTX 3090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA; 24 GB each). Models are trained for 200 epochs with an input resolution of $640 \times 640$ pixels and a seed of 42 for reproducibility. We use the Ultralytics framework (v8.4.19; https://ultralytics.com, accessed on 7 April 2026) with PyTorch (v2.5.0; https://pytorch.org, accessed on 7 April 2026) and automatic mixed precision (AMP) enabled. All models, including the YOLO12s baseline, PPFS-YOLO, and all ablation variants, are trained with the SGD optimizer (initial learning rate $1 \times 10^{- 2}$, cosine annealing) to ensure a fair comparison. An earlier version of our experiments used AdamW for the baseline and SGD for PPFS-YOLO; we recognized this as a potential confound and have rerun all experiments with a unified SGD configuration. Table 5 summarizes the key hyperparameters.

4.3. Evaluation Metrics

We report standard COCO-style metrics: mean Average Precision at IoU threshold 0.5 (mAP@50), mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 in increments of 0.05 (mAP@50:95), Precision (P), and Recall (R). Model efficiency is measured by parameter count (M) and floating-point operations (GFLOPs) at the inference resolution.
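The quantities underlying these metrics can be illustrated with a short sketch. It shows the box IoU used to decide whether a prediction matches a ground-truth box, the ten IoU thresholds averaged in mAP@50:95, and precision and recall from match counts. The function names are hypothetical, boxes are assumed to be in `(x1, y1, x2, y2)` corner format, and the full AP integration over the precision-recall curve is omitted.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# IoU thresholds 0.50, 0.55, ..., 0.95 averaged in mAP@50:95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7,
# which is below 0.5, so this pair would not count as a match at mAP@50.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```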

4.4. Comparison with State-of-the-Art Methods

We compare PPFS-YOLO against five competitive baselines spanning different detector families and model scales: YOLOv10n [7], YOLO11n [8], RT-DETR-l [17], YOLOv8s [4], and YOLO12s [9].
Baseline selection. RT-DETR-l [17] is selected as the Transformer representative because (a) it is the highest-accuracy real-time Transformer with native Ultralytics support, allowing a fully unified training pipeline; (b) its scale (32.0 M params, 54.2 GFLOPs) places it in the accuracy tier we target; and (c) it uses the same input resolution and training protocol as the YOLO baselines. DINO [18] and Co-DETR require two-stage COCO-scale pretraining pipelines incompatible with the Ultralytics framework used here. All baselines and PPFS-YOLO are trained with the same optimizer (SGD, $lr_{0} = 10^{- 2}$, cosine decay), augmentation pipeline, seed (42), and 200 epochs to ensure a fair, optimizer-unified comparison. Results are summarized in Table 6.

PPFS-YOLO achieves a mAP@50 of 64.86%, surpassing the strongest baseline (YOLO12s with SGD, 52.51%) by +12.35 percentage points. This margin is substantial and consistent: PPFS-YOLO outperforms all baselines across every metric. Compared to the much larger RT-DETR-l (32.00 M parameters, 54.2 GFLOPs), PPFS-YOLO delivers +12.37 pp higher mAP@50 while using only 31% of the parameters and 23% of the computation. The precision improvement from 63.77% to 78.29% (+14.52 pp) indicates a significant reduction in false positives—the primary goal of the frequency-domain pseudo-texture suppression.

Figure 5 presents the qualitative detection results of PPFS-YOLO on three validation batches. The ground truth annotations (top row) and PPFS-YOLO predictions (bottom row) are shown. PPFS-YOLO produces tight bounding boxes with few false positives, particularly on rusted surfaces where pseudo-textures are prevalent.

4.5. Per-Class Analysis

Table 7 presents the per-class AP@50 results, revealing that the benefits of PPFS-YOLO are distributed across all damage categories but are most pronounced for the minority Hole class.

The most striking result is the Hole class AP@50, which reaches 83.06%, a +22.19 pp improvement over the YOLO12s (SGD) baseline and the highest gain among all classes. This indicates that the combination of frequency-domain feature enhancement and physics-prior edge regularization is particularly effective for the minority puncture class, where sharp boundary characteristics are most discriminative. The Dent class improves by +8.24 pp and Rusty by +6.62 pp, demonstrating broad improvements across all damage types. Importantly, the identical augmentation pipeline (Mosaic with $p = 1.0$, MixUp with $p = 0.15$, HSV jitter, and random flip) was applied to all models with no Hole-specific copy-paste augmentation. This confirms that the +22.19 pp Hole-class gain reflects the architectural contribution of the FSF–FIM pair rather than a data-volume advantage.

Figure 6 provides visual comparisons of per-class performance across all methods. The bar chart (a) highlights the per-class AP@50 gains, while the radar chart (b) shows the multi-metric profile comparison across all SOTA methods, confirming that PPFS-YOLO dominates the performance envelope.

4.6. Ablation Study

To quantify the individual and synergistic contributions of each component, we conduct a systematic ablation study. Results are presented in Table 8. Figure 7 visualizes the relative contributions.

The ablation reveals a crucial insight. FSF alone provides +1.79 pp and FIM alone yields +0.95 pp: modest but consistently positive individual contributions, confirming that each module independently contributes to feature quality. However, combining both modules without the edge-prior loss ($λ_{phy} = 0$) produces only +0.83 pp, which is lower than either module alone. This result is interpretable: the FIM residual refinement branch ($Q ⊙ F$), whose predictor $h_{θ}$ receives no edge supervision in this configuration and therefore does not learn meaningful edge structure, modulates features with near-uniform masks, introducing a small amount of noise that partially offsets the frequency-domain gains from FSF. Critically, this is not a failure of the architectural modules; it is a consequence of the edge predictor lacking supervision.

The dramatic change occurs when $L_{phy}$ is activated: the full PPFS-YOLO achieves +12.10 pp. This $14.6 \times$ amplification (from +0.83 to +12.10) demonstrates that $L_{phy}$ is not merely an auxiliary loss but the catalyst that activates the synergy between frequency-spatial fusion and edge-guided feature refinement. With supervision, $Q$ converges toward the Sobel-derived edge prior $P$, yielding edge maps that highlight genuine damage boundaries. This meaningful edge gating then amplifies the discriminative features produced by FSF, creating a mutually reinforcing cycle: FSF suppresses pseudo-texture responses in the frequency domain, and FIM further sharpens genuine damage features via physically grounded edge-guided refinement.

4.7. Training Dynamics

Figure 8 shows the training convergence curves for PPFS-YOLO compared with the YOLO12s baseline. Figure 9 shows the corresponding training loss curves.

4.8. Learned Spectral Mask Visualization

Figure 10 visualizes the learned 2-D spectral masks $M_{f}(u, v)$ extracted from the three FSF modules of the trained PPFS-YOLO model (P4-neck, P3-neck, and P4-head scales). The masks are displayed in harmonic frequency space (origin at DC component, horizontal axis $u$ and vertical axis $v$ spanning $[0, f_{N}]$ after orthogonal FFT).

All three modules consistently suppress mid-to-high frequencies while relatively preserving the DC region (lower-left corner, $u \approx 0, v \approx 0$), which carries coarse structural information such as dent boundaries and hole outlines. This learned behavior confirms the design hypothesis: the FSF module autonomously discovers a low-pass-emphasizing spectral profile that attenuates the spatially repetitive pseudo-texture frequencies without explicit supervision. The P3-neck module (finest spatial scale, $80 \times 80$ feature map) shows the most pronounced high-frequency suppression (mean mask value $0.496 \pm 0.262$, range $[- 0.05, 2.48]$), consistent with pseudo-textures manifesting most strongly at fine scales.
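The effect of a spectral mask can be reproduced on a toy feature map with a naive DFT. The sketch below is illustrative only and differs from the FSF module in several assumed ways: it uses a full complex DFT written in pure Python instead of `torch.fft.rfft2`, it is unnormalized (the normalization cancels in the forward-inverse round trip), it handles one channel, and it omits the gated fusion with the spatial branch. The function names are hypothetical.

```python
import cmath


def dft2(x):
    """Unnormalized 2-D DFT of an H x W array (naive, for tiny inputs only)."""
    h, w = len(x), len(x[0])
    return [
        [
            sum(
                x[m][n] * cmath.exp(-2j * cmath.pi * (u * m / h + v * n / w))
                for m in range(h)
                for n in range(w)
            )
            for v in range(w)
        ]
        for u in range(h)
    ]


def idft2(spec):
    """Inverse of dft2, including the 1 / (H * W) factor."""
    h, w = len(spec), len(spec[0])
    return [
        [
            sum(
                spec[u][v] * cmath.exp(2j * cmath.pi * (u * m / h + v * n / w))
                for u in range(h)
                for v in range(w)
            ) / (h * w)
            for n in range(w)
        ]
        for m in range(h)
    ]


def apply_spectral_mask(x, mask):
    """Multiply the spectrum of x elementwise by a real mask and invert.

    Taking the real part matches a half-spectrum (rfft2-style) pipeline only
    when the mask is symmetric under frequency negation, as in both examples.
    """
    spec = dft2(x)
    masked = [[m * s for m, s in zip(mrow, srow)] for mrow, srow in zip(mask, spec)]
    return [[value.real for value in row] for row in idft2(masked)]


x = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]

# An all-ones mask (the initialization M_f = 1 in Algorithm 3) is the identity.
ones = [[1.0] * 4 for _ in range(4)]
same = apply_spectral_mask(x, ones)

# Keeping only the DC bin removes all spatial variation and leaves the mean.
dc_only = [[1.0 if (u, v) == (0, 0) else 0.0 for v in range(4)] for u in range(4)]
flat = apply_spectral_mask(x, dc_only)
```

Initializing the mask to ones means the frequency branch starts as a pass-through, so any attenuation profile such as the one in Figure 10 is acquired during training.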

4.9. Efficiency Analysis

Figure 11 presents the accuracy–efficiency trade-off across all evaluated methods. Table 3 (Section 3.1) provides the per-module GFLOPs and parameter breakdown. The total overhead across three FSF–FIM pairs is +1.70 GFLOPs and +0.79 M parameters.
FFT precision. The torch.fft.rfft2 call in FSF runs in FP32 because PyTorch 2.5 does not support FP16 complex FFT. This contributes ≈0.4 ms of the 2.9 ms total latency overhead and adds ≈0.3 GB VRAM (PPFS-YOLO: ≈2.1 GB vs. YOLO12s: ≈1.8 GB, batch 1, $640 \times 640$). A DCT-based FP16 approximation [23] is planned as a lightweight deployment variant.

4.10. Inference Latency

Table 9 reports the inference throughput of all evaluated methods measured on the evaluation hardware (NVIDIA RTX 3090, batch size 1, input $640 \times 640$, averaged over 500 forward passes after 50 warm-up runs).

PPFS-YOLO occupies a favorable position on the Pareto frontier: it achieves the highest mAP@50 (64.86%) at 17.2 ms per image (58.3 FPS), compared to 14.3 ms (70.0 FPS) for the vanilla YOLO12s baseline. The additional latency of 2.9 ms (+20.3%) attributable to the FSF and FIM modules is a modest penalty relative to the +12.35 pp accuracy gain. RT-DETR-l incurs 31.3 ms per image (32.0 FPS) despite lower accuracy (52.49%), confirming that PPFS-YOLO is more computationally efficient in the accuracy-per-latency sense. PPFS-YOLO requires only 12.5 GFLOPs, substantially less than RT-DETR-l (54.2 GFLOPs) and only marginally more than YOLO12s (10.8 GFLOPs).
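The measurement protocol (50 warm-up runs, then the mean over 500 timed passes) can be sketched with the standard library. The helper names are hypothetical and the timed callable is a stand-in for a model forward pass. On a GPU the callable would also need to synchronize the device before the timer is read, which this CPU-only sketch does not do.

```python
import time


def benchmark(fn, warmup=50, runs=500):
    """Return (mean latency in ms, frames per second) for a zero-argument callable."""
    for _ in range(warmup):
        fn()  # warm-up passes are executed but not timed
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    ms = 1000.0 * elapsed / runs
    fps = 1000.0 / ms if ms > 0 else float("inf")
    return ms, fps


def relative_overhead(ms_new, ms_base):
    """Latency increase of ms_new over ms_base, in percent."""
    return 100.0 * (ms_new - ms_base) / ms_base


# Stand-in workload; a real measurement would call the detector here.
ms, fps = benchmark(lambda: sum(range(1000)), warmup=5, runs=20)

# Reported latencies: 17.2 ms vs. 14.3 ms, i.e. about +20.3%.
overhead = relative_overhead(17.2, 14.3)
```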

4.11. Cross-Domain Validation on Kolektor SDD2

To investigate the transferability of PPFS-YOLO beyond the container domain, we fine-tune both PPFS-YOLO and the YOLO12s (SGD) baseline on the Kolektor SDD2 surface defect dataset [68]. Kolektor SDD2 contains 3335 images of metallic commutator surfaces with one defect class (subtle surface cracks), partitioned here as 1998 training/333 validation/1004 test images. Fine-tuning uses 30 epochs from the container-trained checkpoint, SGD with $lr_{0} = 2 \times 10^{- 3}$ (cosine decay), batch 16, and seed 42, with the same protocol for both models.

Table 10 reports the results.

On Kolektor SDD2, PPFS-YOLO achieves a best mAP@50 of 93.6%, while YOLO12s reaches 95.6%, a 2.0 pp gap in favor of the baseline. Both models converge to nearly identical final-epoch performance (92.6% vs. 92.3%).

Interpretation. This result is consistent with the design intent of PPFS-YOLO. Kolektor SDD2 poses a fundamentally different challenge from the container dataset: defects are subtle surface cracks on clean metallic substrates with minimal pseudo-texture interference. In this regime, the FSF frequency-masking mechanism, which is specifically designed to suppress spatially repetitive pseudo-textures, provides no structural benefit, since the corruption it targets is largely absent. Furthermore, the FIM physics-prior edge supervision may over-regularize features toward sharp boundary responses when the relevant defects are diffuse or low-contrast cracks rather than dents with clear discontinuities. The slight performance deficit of PPFS-YOLO on Kolektor SDD2 is therefore consistent with, rather than contradictory to, the design: the method provides targeted gains on pseudo-texture-contaminated domains but does not unconditionally outperform simpler baselines on all defect types. The strong absolute performance of both models (≥92% mAP@50) suggests that the container-pretrained features transfer well to the metallic surface defect domain regardless of architecture.

5. Discussion

5.1. The Catalyst Effect of $L_{phy}$

The central finding of this work is that the physics-prior loss $L_{phy}$ acts as a catalyst for the entire PPFS framework. The metaphor is apt: just as a chemical catalyst enables reactions that would not occur spontaneously, $L_{phy}$ enables the FSF and FIM modules to produce a synergistic effect that far exceeds their individual capabilities.

Without $L_{phy}$, the FIM’s edge prediction $Q$ starts from random initialization and lacks supervisory signal to converge toward meaningful edge maps. The edge-gated refinement ($Q ⊙ F$) then modulates features with essentially random masks, adding noise rather than structure. When $L_{phy}$ is present, $Q$ is rapidly trained to approximate the Sobel-derived edge prior $P$, yielding edge maps that highlight genuine boundaries. This meaningful edge gating then amplifies the discriminative features produced by FSF’s frequency filtering, creating a mutually reinforcing cycle: FSF suppresses pseudo-texture responses in the frequency domain, and FIM further sharpens the remaining genuine damage features via edge-guided refinement.

Formally, let $F_{FSF}^{*}$ denote the output of FSF. The FIM output with active $L_{phy}$ is:

$$F_{phy}^{'} = F_{FSF}^{*} + γ \cdot Refine(Q^{*} ⊙ F_{FSF}^{*}),$$

(17)

where $Q^{*} \approx P$ is a well-trained edge prediction. Since $P$ has large values at damage boundaries and small values elsewhere, $Q^{*} ⊙ F_{FSF}^{*}$ selectively amplifies boundary features while suppressing background, acting as an attention mechanism guided by physical priors rather than learned from detection annotations alone.

5.2. Why Frequency-Domain Fusion Benefits Container Damage Detection

Container surface pseudo-textures—rust patterns, paint weathering, specular highlights—exhibit characteristic frequency signatures that differ from genuine structural damage. Rust stains produce spatially repetitive mid-frequency patterns (∼10–30 cycles/image), while specular reflections create localized high-frequency artifacts (>50 cycles/image). Genuine dents manifest as smooth, low-frequency deformations, whereas holes produce sharp, broadband edge responses.

The learnable spectral mask $M_{f}$ in FSF can selectively attenuate confounding frequency bands. By Parseval’s theorem [19,63], this frequency-domain attenuation directly reduces the spatial energy of pseudo-texture patterns. The improvement in Precision from 63.77% to 78.29% (+14.52 pp) provides quantitative evidence for this pseudo-texture suppression.
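The energy argument can be written out explicitly. The following is a sketch under an assumed unnormalized DFT convention for a single $H \times W$ channel $f$; the symbols $\hat{f}$ (spectrum of $f$) and $\tilde{f}$ (feature after masking and inverse transform) are introduced here for illustration and are not part of the original notation.

```latex
\sum_{m,n} |f(m,n)|^{2} = \frac{1}{HW} \sum_{u,v} |\hat{f}(u,v)|^{2},
\qquad
\sum_{m,n} |\tilde{f}(m,n)|^{2} = \frac{1}{HW} \sum_{u,v} M_{f}(u,v)^{2} \, |\hat{f}(u,v)|^{2}.
```

Each frequency bin therefore contributes to the output energy in proportion to $M_{f}(u,v)^{2}$: bins with $|M_{f}| < 1$ are attenuated, while bins with $|M_{f}| > 1$ are amplified, which is possible here because the reported mask range extends to 2.48.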

5.3. Minority Class Benefits

The exceptional improvement on the Hole class (+22.19 pp) merits careful analysis. Puncture-type defects are characterized by distinct physical properties: sharp, well-defined edges; strong contrast with the surrounding surface; and consistent geometric patterns (typically circular or elliptical openings). These properties align precisely with the features that FIM is designed to detect and enhance:

The Sobel-derived edge prior $P$ produces strong, consistent responses at hole boundaries;
the edge-guided refinement amplifies features in regions exhibiting sharp boundaries;
the frequency-domain filtering in FSF preserves the high-frequency edge components that define hole perimeters.

Together, these mechanisms provide a “physics-informed attention” effect that preferentially enhances hole features, effectively compensating for their statistical under-representation in the training data.

5.4. Comparison with Larger Models

PPFS-YOLO (10.02 M, 12.5 GFLOPs) surpasses RT-DETR-l (32.00 M, 54.2 GFLOPs) by +12.37 pp in mAP@50. RT-DETR-l relies on a heavy Transformer-based architecture with a hybrid encoder, yet its purely data-driven approach fails to capture the domain-specific characteristics of container damage. This comparison highlights that targeted, physics-aware architecture design can outperform brute-force capacity scaling, particularly in domain-specific detection tasks where prior knowledge about defect characteristics is available. Table 11 compares the efficiency of all methods. While lightweight nano-scale models naturally achieve higher mAP/GFLOPs ratios due to their minimal compute budget, PPFS-YOLO delivers the highest absolute accuracy at moderate computational cost, representing an effective balance between detection performance and efficiency.

Figure 12 further illustrates the accuracy–parameter trade-off, confirming that PPFS-YOLO achieves the most favorable position in the accuracy–complexity plane. The hyperparameter sensitivity analysis in Figure 13 demonstrates the robustness of PPFS-YOLO to variations in key module hyperparameters.

5.5. Limitations and Future Work

Several limitations should be acknowledged. First, the edge-prior is based on Sobel-derived gradient maps, which capture first-order boundary information. Higher-order priors (e.g., curvature, Hessian-based features) may further improve discrimination between genuine damage and pseudo-textures. Second, although we include a cross-domain experiment on Kolektor SDD2 [68], broader validation on additional industrial defect detection scenarios such as MVTec AD [69,70] is still needed. Third, while the computational overhead is modest on server-grade hardware (see Section 4.10), the FFT operations in FSF may introduce latency on embedded devices without hardware-accelerated spectral transforms; DCT-based approximations [23] represent a promising lightweight alternative. Fourth, the $λ_{phy}$ coefficient is fixed throughout training; an adaptive schedule could further improve convergence speed. Fifth, due to the substantial compute required (≈3 h per 200-epoch run on GPU hardware), repeated-trial statistics (e.g., bootstrap confidence intervals across five seeds) are not reported; the observed +12.35 pp improvement is based on a single training run and should be interpreted accordingly. Future work will explore adaptive edge priors, multi-domain benchmarking, deployment-optimized frequency-domain operations, and multi-seed statistical validation. Recent studies on human–AI collaboration for structured ultrasound report extraction have shown complementary human/LLM error patterns and the value of co-designed extraction workflows [71,72]; analogous human-in-the-loop reporting pipelines may also be useful for practical container inspection deployment.

6. Conclusions

We presented PPFS-YOLO, a physics-prior frequency-spatial fusion framework for container surface damage detection that integrates two novel modules into the YOLOv12s architecture. The Frequency-Spatial Fusion (FSF) module performs learnable spectral masking and gated spatial-frequency feature fusion to suppress pseudo-texture false positives. The Edge-Guided Auxiliary Supervision Module (FIM) encodes Sobel-derived edge priors as a differentiable $L_{1}$ auxiliary loss to regularize feature learning toward physically plausible damage boundaries.

Extensive experiments demonstrate that PPFS-YOLO achieves 64.86% mAP@50 on a container damage dataset, a +12.35 pp improvement over the YOLO12s baseline (SGD), with only +0.79 M additional parameters (+8.6%). The ablation study reveals the key finding of this work: the physics-prior loss $L_{phy}$ is the critical catalyst that activates the synergy between FSF and FIM, amplifying their combined effect by $14.6 \times$ (from +0.83 pp without $L_{phy}$ to +12.10 pp with it). The framework achieves particularly strong improvements on the safety-critical Hole class (+22.19 pp AP@50) and outperforms all five baselines including the much larger RT-DETR-l (32.00 M, 54.2 GFLOPs).

These results demonstrate that encoding physical priors as differentiable constraints within an end-to-end detection framework is a powerful paradigm for domain-specific defect detection, offering an effective alternative to purely data-driven capacity scaling. The physics-prior catalysis mechanism may also benefit other industrial inspection tasks where domain-specific knowledge about defect characteristics is available.

Author Contributions

Conceptualization, J.L. and F.G.; Software, J.L.; Validation, J.L.; Formal analysis, J.L.; Investigation, J.L.; Resources, J.L.; Data curation, J.L.; Writing—original draft preparation, J.L.; Writing—review and editing, F.G.; Visualization, J.L.; Supervision, F.G.; Project administration, F.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The container damage dataset used in this study is proprietary and cannot be publicly released due to third-party confidentiality restrictions. Code and trained model weights are available from the corresponding author upon reasonable request.

Acknowledgments

During the preparation of this manuscript, the authors used GitHub Copilot (no public version identifier available; Microsoft Corporation; https://github.com/features/copilot, accessed on 7 April 2026) for programming assistance during software implementation and code debugging. The authors reviewed and edited all suggested output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

PPFS	Physics-Prior Frequency-Spatial
FSF	Frequency-Spatial Fusion
FIM	Edge-Guided Auxiliary Supervision Module (in PPFS-YOLO)
YOLO	You Only Look Once
FFT	Fast Fourier Transform
DFT	Discrete Fourier Transform
mAP	mean Average Precision
AP	Average Precision
GFLOPs	Giga Floating-Point Operations
FPS	Frames per Second
SGD	Stochastic Gradient Descent
AMP	Automatic Mixed Precision
IoU	Intersection over Union
NDT	Non-Destructive Testing

References

United Nations Conference on Trade and Development. Review of Maritime Transport 2024; Technical report; United Nations: Geneva, Switzerland, 2024.
Nguyen Thi Phuong, T.; Cho, G.S.; Chatterjee, I. Automating container damage detection with the YOLO-NAS deep learning model. Sci. Prog. 2025, 108, 00368504251314084.
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 March 2026).
Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
Wang, C.Y.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the European Conference on Computer Vision (ECCV); Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2024.
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 1 March 2026).
Tian, Y.; Li, H.; Wang, H.; Chen, Y.; Ling, H. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524.
Hou, W.; Wei, Y.; Guo, J.; Jin, Y.; Zhu, C. MSFT-YOLO: Improved YOLOv5 Based on Transformer for Detecting Defects of Steel Surface. Sensors 2022, 22, 3467.
Guo, Z.; Wang, C.; Yang, G.; Huang, Z.; Li, G. Surface defect detection of steel strips based on improved YOLOv4. Comput. Electr. Eng. 2022, 102, 108208.
Zhang, H.; Li, S.; Miao, Q.; Fang, R.; Xue, S.; Hu, Q.; Hu, J.; Chan, S. Surface defect detection of hot rolled steel based on multi-scale feature fusion and attention mechanism residual block. Sci. Rep. 2024, 14, 7671.
Jeon, C.H.; Kim, J.H. YOLOv4-MN3 for PCB Surface Defect Detection. Appl. Sci. 2021, 11, 11701.
Tang, J.; Liu, S.; Zhao, D.; Tang, L.; Zou, W.; Zheng, B. PCB-YOLO: An Improved Detection Algorithm of PCB Surface Defects Based on YOLOv5. Sustainability 2023, 15, 5963.
Zhang, C.; Yang, T.; Yang, J. Image Recognition of Wind Turbine Blade Defects Using Attention-Based MobileNetv1-YOLOv4 and Transfer Learning. Sensors 2022, 22, 6009.
Sun, X.; Jia, X.; Liang, Y.; Wang, M.; Chi, X. A Defect Detection Method for a Boiler Inner Wall Based on an Improved YOLO-v5 Network and Data Augmentation Technologies. IEEE Access 2022, 10, 93845–93858.
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605.
Gonzalez, R.C.; Woods, R.E. Digital Image Processing, 4th ed.; Pearson: Hong Kong, China, 2018.
Wang, J.; Zhang, W.; Zang, Y.; Cao, Y.; Pang, J.; Gong, T.; Chen, K.; Liu, Z.; Loy, C.C.; Lin, D. Seesaw Loss for Long-Tailed Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9695–9704.
Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-Based Object Detectors with Online Hard Example Mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769.
Chi, L.; Jiang, B.; Mu, Y. Fast Fourier Convolution. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; pp. 4479–4488.
Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency Channel Attention Networks. arXiv 2020, arXiv:2012.11879.
Zhong, Y.; Li, B.; Tang, L.; Kuang, S.; Wu, S.; Ding, S. Detecting Camouflaged Object in Frequency Domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4504–4513.
Lin, J.; Tan, X.; Xu, K.; Ma, L.; Lau, R.W. Frequency-aware Camouflaged Object Detection. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 19, 61.
Zheng, S.; Wu, Z.; Xu, Y.; Wei, Z. Instance-Aware Spatial-Frequency Feature Fusion Detector for Oriented Object Detection in Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5606513.
Sun, X.; Yu, Y.; Cheng, Q. Adaptive multimodal feature fusion with frequency domain gate for remote sensing object detection. Remote Sens. Lett. 2024, 15, 133–144.
Li, H.; Yi, Z.; Wang, Z.; Wang, Y.; Ge, L.; Cao, W.; Mei, L.; Yang, W.; Sun, Q. FDADNet: Detection of Surface Defects in Wood-Based Panels Based on Frequency Domain Transformation and Adaptive Dynamic Downsampling. Processes 2024, 12, 2134.
Zou, G.; Li, T.; Li, G.; Peng, X.; Fu, G. A visual detection method of tile surface defects based on spatial-frequency domain image enhancement and region growing. In Proceedings of the Chinese Automation Congress (CAC), Hangzhou, China, 22–24 November 2019.
Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707.
Cuomo, S.; Di Cola, V.S.; Giampaolo, F.; Rozza, G.; Raissi, M.; Piccialli, F. Scientific Machine Learning Through Physics–Informed Neural Networks: Where we are and What’s Next. J. Sci. Comput. 2022, 92, 88.
Pan, Y.; Khodaei, Z.S.; Aliabadi, F.M. In-service fatigue crack monitoring through baseline-free automated detection and physics-informed neural network quantification. NDT E Int. 2025, 151, 103360.
Sun, H.; Peng, L.; Lin, J.; Wang, S.; Zhao, W.; Huang, S. Microcrack Defect Quantification Using a Focusing High-Order SH Guided Wave EMAT: The Physics-Informed Deep Neural Network GuwNet. IEEE Trans. Ind. Inform. 2021, 18, 3235–3247.
Zhao, J.; Li, W.; Yuan, X.A.; Yin, X.; Li, X.; Chen, Q.; Ding, J. An End-to-End Physics-Informed Neural Network for Defect Identification and 3-D Reconstruction Using Rotating Alternating Current Field Measurement. IEEE Trans. Ind. Inform. 2022, 19, 8340–8350.
Sun, H.; Peng, L.; Huang, S.; Li, S.; Long, Y.; Wang, S.; Zhao, W. Development of a Physics-Informed Doubly Fed Cross-Residual Deep Neural Network for High-Precision Magnetic Flux Leakage Defect Size Estimation. IEEE Trans. Ind. Inform. 2021, 18, 1629–1640.
Zhang, E.; Dao, M.; Karniadakis, G.E.; Suresh, S. Analyses of internal structures and defects in materials using physics-informed neural networks. Sci. Adv. 2022, 8, eabk0644.
Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698.
Sobel, I. History and Definition of the So-Called “Sobel Operator”, More Appropriately Named the Sobel-Feldman Operator. 2014. Available online: https://www.researchgate.net/profile/Irwin-Sobel/publication/285159837 (accessed on 7 April 2026).
Finder, S.E.; Arar, R.; Averbuch-Elor, H.; Zrigui, S.; Yariv, U.; Gurevich, T. WTConv: Rethinking Large Kernel Convolutions with Wavelet Decomposition. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 37.
Chen, L.; Fu, Y.; Gu, L.; Yan, C.; Harada, T.; Huang, G. Frequency-aware Feature Fusion for Dense Image Prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10763–10780.
Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection. arXiv 2020, arXiv:2006.04388.
Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned One-stage Object Detection. arXiv 2021, arXiv:2108.07755.
Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra R-CNN: Towards Balanced Learning for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 821–830. [Google Scholar] [CrossRef]
Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9756–9765. [Google Scholar] [CrossRef]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–19. [Google Scholar] [CrossRef]
Xia, K.; Lv, Z.; Zhou, C.; Gu, G.; Zhao, Z.; Liu, K.; Li, Z. Mixed Receptive Fields Augmented YOLO with Multi-Path Spatial Pyramid Pooling for Steel Surface Defect Detection. Sensors 2023, 23, 5114. [Google Scholar] [CrossRef]
Wang, X.; Zhuang, K. An improved YOLOX method for surface defect detection of steel strips. In Proceedings of the IEEE International Conference on Power Electronics, Computer Applications (ICPECA), Shenyang, China, 29–31 January 2023. [Google Scholar] [CrossRef]
Zeng, K.; Xia, Z.; Qian, J.; Du, X.; Xiao, P.; Zhu, L. Steel Surface Defect Detection Technology Based on YOLOv8-MGVS. Metals 2025, 15, 109. [Google Scholar] [CrossRef]
Fang, W.; Yang, Y.; Zhang, W.; Wang, T.; Feng, J.; Liu, G. ASD-YOLO: A lightweight multi-module collaboratively optimized model for steel surface defect detection. Meas. Sci. Technol. 2025, 36, 095411. [Google Scholar] [CrossRef]
Le, H.F.; Zhang, L.J.; Liu, Y.X. Surface Defect Detection of Industrial Parts Based on YOLOv5. IEEE Access 2022, 10, 130784–130794. [Google Scholar] [CrossRef]
Zhao, Z.; Yang, X.; Zhou, Y.; Sun, Q.; Ge, Z.; Liu, D. Real-time detection of particleboard surface defects based on improved YOLOV5 target detection. Sci. Rep. 2021, 11, 21777. [Google Scholar] [CrossRef]
Niu, W.; Lv, C.; Zhang, E.; Wei, Z. YOLO-RDM: A high accuracy and efficient algorithm for magnetic tile surface defect detection with practical applications. PLoS ONE 2025, 20, e0328815. [Google Scholar] [CrossRef] [PubMed]
Ding, L.; Xu, H.; Du, P.; Cui, Y. ACS-YOLO: A lightweight bearing surface defect detection algorithm. J. Eng. Appl. Sci. 2025, 72, 818. [Google Scholar] [CrossRef]
Bao, N.; Lin, J.; Fan, Y.; Bao, R.; Simeone, A. FabricMamba: A fabric surface defect detection system based on large kernel attention and visual state space. Eng. Appl. Artif. Intell. 2025, 162, 112558. [Google Scholar] [CrossRef]
Yin, X.; Zhao, Z.; Weng, L. MAS-YOLO: A Lightweight Detection Algorithm for PCB Defect Detection Based on Improved YOLOv12. Appl. Sci. 2025, 15, 6238. [Google Scholar] [CrossRef]
Ji, Y.; Ma, T.; Shen, H.; Feng, H.; Zhang, Z.; Li, D.; He, Y. Transmission Line Defect Detection Algorithm Based on Improved YOLOv12. Electronics 2025, 14, 2432. [Google Scholar] [CrossRef]
Shukla, V.; Shukla, A.; S.K., S.P.; Shukla, S. A systematic survey: Role of deep learning-based image anomaly detection in industrial inspection contexts. Front. Robot. AI 2025, 12, 1554196. [Google Scholar] [CrossRef] [PubMed]
Yang, W. A Survey of Surface Defect Detection Based on Deep Learning. In 2022 7th International Conference on Modern Management and Education Technology (MMET 2022); Atlantis Press: Dordrecht, The Netherlands, 2022. [Google Scholar] [CrossRef]
Parseval des Chênes, M.A. Mémoire sur les séries et sur l’intégration complète. Mem. Present. L’Inst. Sci. Lett. Arts 1806, 1, 638–648. [Google Scholar]
Zhu, Z.; Zhu, Y.; Wang, H.; Wang, N.; Ye, J.; Ling, X. FDTNet: Enhancing frequency-aware representation for prohibited object detection from X-ray images via dual-stream transformers. Eng. Appl. Artif. Intell. 2024, 133, 108076. [Google Scholar] [CrossRef]
Rippel, O.; Snoek, J.; Adams, R.P. Spectral Representations for Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
Rao, Y.; Zhao, W.; Tang, Y.; Zhou, J.; Lim, S.N.; Lu, J. HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 10353–10366. [Google Scholar]
Li, H.; Chen, Q.; Kalkofen, D.; Chen, H.-T. OUGS: Active View Selection via Object-aware Uncertainty Estimation in 3DGS. arXiv 2025, arXiv:2511.09397. [Google Scholar] [CrossRef]
Božič, J.; Tabernik, D.; Skočaj, D. Mixed supervision for surface-defect detection: From weakly to fully supervised learning. Comput. Ind. 2021, 129, 103459. [Google Scholar] [CrossRef]
Bergmann, P.; Batzner, K.; Fauser, M.; Sattlegger, D.; Steger, C. The MVTec Anomaly Detection Dataset: A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection. Int. J. Comput. Vis. 2021, 129, 1038–1059. [Google Scholar] [CrossRef]
Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD—A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9584–9592. [Google Scholar] [CrossRef]
Li, H.; Li, Y.; Chi, Y.; Deslandes, A.; Leonardi, M.; Freger, S.; Zhang, Y.; Avery, J.; Hull, M.L.; Chen, H.-T. Who Fails Where? LLM and Human Error Patterns in Endometriosis Ultrasound Report Extraction. In Proceedings of the Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems, Barcelona, Spain, 13–16 April 2026; pp. 1–6. [Google Scholar] [CrossRef]
Li, H.; Zhao, Y.; Li, Y.; Deslandes, A.; Avery, J.; Leonardi, M.; Hull, M.L.; Chen, H.-T. EndoExtract: Co-Designing Structured Text Extraction from Endometriosis Ultrasound Reports. arXiv 2026, arXiv:2601.18154. [Google Scholar]

Figure 1. Overall architecture of PPFS-YOLO. The network is organized into three columns: the Backbone (layers 0–8), the Top-Down Neck (layers 9–18), and the Bottom-Up Head (layers 19–27). Color-coded blocks indicate module types: blue = Conv/C3k2, pink = A2C2f, red = FSF, yellow = FIM, white = Concat, teal = Upsample, purple = Detect. Three PPFS pairs (dashed groups $\mathrm{PPFS}_{1}$–$\mathrm{PPFS}_{3}$) are inserted at layers 12–13 (P4-neck, $40 \times 40$, $C = 512$), 17–18 (P3-neck, $80 \times 80$, $C = 256$), and 22–23 (P4-head, $40 \times 40$, $C = 512$); tensor dimensions are $H \times W$ at $640 \times 640$ input. Skip connections from backbone layers 4, 6, and 8 feed into the neck and head via Concat nodes. The Detect head outputs predictions at P3, P4, and P5 scales.

Figure 2. Architecture of the Frequency-Spatial Fusion (FSF) module. The input feature $F_{s}$ is transformed via FFT2 (ortho-normalized) and decomposed into amplitude $A$ and phase $\Phi$. A learnable 2D frequency mask $M_{f}$ (base size $40 \times 21$, bilinearly interpolated) together with a per-channel scale $S_{c}$ reweights the amplitude spectrum to produce the masked amplitude $\tilde{A}$. The reconstructed signal is obtained via IFFT2, yielding $F_{freq}$. The spatial path passes $F_{s}$ through as identity. A gated fusion mechanism concatenates $F_{s}$ and $F_{freq}$, applies $\mathrm{Conv}_{1 \times 1}$–BN–SiLU–$\mathrm{Conv}_{1 \times 1}$–Sigmoid to produce $\alpha$ (bias-initialized to $+1.0$, giving $\alpha_{0} \approx 0.73$), and outputs $\alpha \odot F_{s} + (1 - \alpha) \odot F_{freq}$.
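
The FSF data flow in the Figure 2 caption can be illustrated with a deliberately reduced sketch. It is not the authors' implementation: it uses a naive 1-D ortho-normalized DFT in place of FFT2, omits the per-channel scale $S_{c}$ and the mask interpolation, and replaces the learned gate network with a constant logit equal to the stated bias initialization. The names `dft`, `fsf_1d`, and `gate_logit` are hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive ortho-normalized 1-D DFT, a small stand-in for FFT2/IFFT2."""
    n = len(values)
    sign = 1 if inverse else -1
    scale = 1 / math.sqrt(n)
    return [
        scale * sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        for k in range(n)
    ]


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def fsf_1d(f_s, mask, gate_logit=1.0):
    """Mask the amplitude spectrum, keep the phase, then blend with the input."""
    spectrum = dft(f_s)
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    masked = [a * m for a, m in zip(amplitude, mask)]            # masked amplitude
    rebuilt = [cmath.rect(a, p) for a, p in zip(masked, phase)]  # recombine with phase
    # Assumption: the mask is symmetric, so the inverse transform is real up to rounding.
    f_freq = [c.real for c in dft(rebuilt, inverse=True)]
    # Assumption: constant gate; the module in the paper predicts alpha from the features.
    alpha = sigmoid(gate_logit)
    return [alpha * s + (1 - alpha) * f for s, f in zip(f_s, f_freq)]


signal = [1.0, 2.0, 3.0, 4.0]
identity = fsf_1d(signal, [1.0, 1.0, 1.0, 1.0])  # all-ones mask: output equals input
dc_only = fsf_1d(signal, [1.0, 0.0, 0.0, 0.0])   # frequency path keeps only the mean
```

With a gate logit of $+1.0$ the blend weight is `sigmoid(1.0)`, roughly 0.73, so the output starts out dominated by the spatial path, which matches the initialization described in the caption.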

Figure 3. Architecture of the Edge-Guided Auxiliary Supervision Module (FIM; EGA-FIM). The module comprises three branches. Left (Edge Prior): fixed Sobel filters (no gradient) applied as depthwise convolutions produce gradient magnitudes that are normalized to $[0, 1]$, yielding the edge prior map $P$. Center (Edge Predictor): a learnable $\mathrm{Conv}_{3 \times 3}$–$\mathrm{Conv}_{1 \times 1}$–Sigmoid pathway predicts the edge map $Q$ from the input feature $F$. Right (Residual Refinement): the Hadamard product $Q \odot F$ is passed through a two-stage DW–PW refine block producing $\Delta$, which is added back as $F' = F + \gamma \cdot \Delta$. The physics-prior loss $L_{phy} = \lVert Q - P \rVert_{1}$ supervises the edge predictor, and the total loss $L_{total}$ combines detection and physics terms.
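
The edge prior and the $L_{1}$ supervision in the Figure 3 caption can be sketched on a single-channel map in plain Python. Several details below are assumptions, not statements from the paper: zero padding at the border, normalization by the maximum gradient magnitude, and averaging the absolute differences over all positions. The helper names `conv3x3`, `edge_prior`, and `l_phy` are hypothetical.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def conv3x3(img, kernel):
    """3x3 cross-correlation with zero padding (padding mode is an assumption)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * img[y][x]
            out[i][j] = acc
    return out


def edge_prior(img, eps=1e-6):
    """Fixed Sobel gradient magnitude scaled into [0, 1] (max scaling is assumed)."""
    gx, gy = conv3x3(img, SOBEL_X), conv3x3(img, SOBEL_Y)
    mag = [[math.hypot(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]
    peak = max(max(row) for row in mag)
    return [[m / (peak + eps) for m in row] for row in mag]


def l_phy(q, p):
    """Mean absolute difference between predicted map Q and prior P."""
    count = len(p) * len(p[0])
    return sum(abs(a - b) for rq, rp in zip(q, p) for a, b in zip(rq, rp)) / count


# A vertical step edge: the prior is large next to the step and zero in flat interior cells.
step = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(4)]
prior = edge_prior(step)
loss_perfect = l_phy(prior, prior)                        # predictor matches the prior
loss_blank = l_phy([[0.0] * 6 for _ in range(4)], prior)  # predictor outputs no edges
```

Because the Sobel filters are fixed, only the predictor that produces $Q$ would receive gradients from this term; the prior $P$ acts as a constant target.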

Figure 4. Training samples and class distribution of the container damage dataset. (a–c) Representative annotated training images for the three damage classes. (d) Pre-augmentation instance counts: Dent 4438 (48.8%), Hole 1098 (12.1%), Rusty 3568 (39.2%). The severe under-representation of the Hole class ($12.1\%$) motivates the $2.3 \times$ copy-paste augmentation applied during training.

Figure 5. Qualitative detection results on validation batches. Row 1: ground truth annotations. Row 2: PPFS-YOLO predictions. PPFS-YOLO produces fewer false positives and more accurate bounding boxes, especially in regions with rust-like pseudo-textures.

Figure 6. Per-class and multi-method performance visualization. (a) Grouped bar chart of AP@50 per class for each method. (b) Radar chart comparing multi-dimensional metrics; PPFS-YOLO (red) encloses the largest area.

Figure 7. Ablation study: mAP@50 improvement ($\Delta$ pp) over the YOLO12s baseline for each PPFS-YOLO configuration.

Figure 8. Convergence comparison of mAP@50 during training: PPFS-YOLO vs. YOLO12s baseline.

Figure 9. Training loss curves for PPFS-YOLO and the YOLO12s baseline.

Figure 10. Learned 2-D spectral masks $M_{f}(u, v)$ in all three FSF modules of PPFS-YOLO (P4-neck, P3-neck, P4-head). Mask weights $> 1$ (green) boost the corresponding frequencies, and weights $< 1$ (red/orange) suppress them. All three modules consistently assign low mask weights to mid- and high-frequency bands, attenuating pseudo-texture patterns, while a moderate response is maintained at low frequencies carrying structural damage information. The P3-neck mask (finest scale) shows the strongest high-frequency suppression, consistent with pseudo-textures being most prominent at fine spatial scales.

Figure 11. Pareto efficiency plot: mAP@50 vs. GFLOPs for all compared methods. PPFS-YOLO (star) achieves the best accuracy at modest computational cost.

Figure 12. Accuracy vs. parameter count: mAP@50 plotted against model size for all compared methods. PPFS-YOLO achieves the best accuracy with modest parameter overhead.

Figure 13. Hyperparameter sensitivity analysis. The highlighted box marks the default configuration used in the main experiments. Performance remains stable across a range of $\lambda_{phy}$, gate bias, and residual scale values, demonstrating the robustness of the proposed framework.

Table 1. Comparison of PPFS-YOLO with representative prior methods across three design dimensions. ✔ = supported; ✗ = not supported.

Method	Domain	Freq. Fusion	Physics Prior	End-to-End Det.
FcaNet [23]	General	✔	✗	✗
FFC [22]	General	✔	✗	✗
FreqCOD [24]	Camouflage	✔	✗	✗
SFFD [26]	Remote Sens.	✔	✗	✔
FDADNet [28]	Wood Defect	✔	✗	✔
GuwNet [33]	${NDT}^{a}$	✗	✔	✗
DfedResNet [35]	${MFL}^{b}$	✗	✔	✗
YOLO-NAS [2]	Container	✗	✗	✔
MAS-YOLO [59]	PCB	✗	✗	✔
PPFS-YOLO (Ours)	Container	✔	✔	✔

^a NDT: non-destructive testing. ^b MFL: magnetic flux leakage.

Table 2. PPFS-YOLO layer-by-layer architecture. Inserted FSF and FIM modules (layers 12–13, 17–18, and 22–23) are highlighted with a colored background. “$- 1$” denotes the preceding layer; bracketed indices denote concatenation sources.

Layer	Module	From	Channels	Role
Backbone
0	Conv $3 \times 3$ s2	image	64	stem
1	Conv $3 \times 3$ s2	$- 1$	128	down
2	C3k2 ( $n = 2$ )	$- 1$	256	feature
3	Conv $3 \times 3$ s2	$- 1$	256	down
4	C3k2 ( $n = 2$ )	$- 1$	512	feature
5	Conv $3 \times 3$ s2	$- 1$	512	down
6	A2C2f ( $n = 2$ )	$- 1$	512	attn
7	Conv $3 \times 3$ s2	$- 1$	1024	down
8	A2C2f ( $n = 2$ )	$- 1$	1024	attn
Neck (top-down)
9	Upsample $2 \times$	$- 1$	1024	up
10	Concat	$[- 1, 6]$	1536	fuse
11	A2C2f ( $n = 2$ )	$- 1$	512	refine
12	FreqSpatialFusion	$- 1$	512	FSF (P4)
13	EdgeGuidedModule	$- 1$	512	FIM (P4)
14	Upsample $2 \times$	$- 1$	512	up
15	Concat	$[- 1, 4]$	1024	fuse
16	A2C2f ( $n = 2$ )	$- 1$	256	refine
17	FreqSpatialFusion	$- 1$	256	FSF (P3)
18	EdgeGuidedModule	$- 1$	256	FIM (P3)
Head (bottom-up)
19	Conv $3 \times 3$ s2	$- 1$	256	down
20	Concat	$[- 1, 13]$	768	fuse
21	A2C2f ( $n = 2$ )	$- 1$	512	refine
22	FreqSpatialFusion	$- 1$	512	FSF (P4-head)
23	EdgeGuidedModule	$- 1$	512	FIM (P4-head)
24	Conv $3 \times 3$ s2	$- 1$	512	down
25	Concat	$[- 1, 8]$	1536	fuse
26	A2C2f ( $n = 2$ )	$- 1$	1024	refine
27	Detect	$[18, 23, 26]$	—	output

Table 3. Parameter and FLOP breakdown of PPFS modules (per instance). $C$ denotes the input channel count; values are computed at $640 \times 640$ input resolution.

Module	C	Params (K)	GFLOPs	Component Details
FSF (P3)	256	33.5	0.11	mask $(40 \times 21)$ , channel scale, gate conv
FSF (P4, P4-head)	512	132.4	0.10	mask $(40 \times 21)$ , channel scale, gate conv
FIM (P3)	256	71.4	0.17	edge predictor, DW–PW refine $\times 2$
FIM (P4, P4-head)	512	283.9	0.51	edge predictor, DW–PW refine $\times 2$
Total (3 pairs)	—	790	1.70	—

Abbreviations: DW = depthwise convolution; PW = pointwise ($1 \times 1$) convolution.

Table 4. Container Damage Dataset statistics. The Hole class is a critical minority category (12.1% of instances). Targeted augmentation is applied during training to mitigate class imbalance.

	Original		After Augmentation
Class	Instances	Ratio (%)	Train Instances	Aug. Factor
Dent	4438	48.8	4438	$1.0 \times$
Hole	1098	12.1	2553	$2.3 \times$
Rusty	3568	39.2	3568	$1.0 \times$
Total	9104	100.0	10,559	—
Split	Images	Negatives		Resolution
Train	3300	+3300 neg.		variable
Val/Test	413	—		variable
Total	7013	—		resized to $640 \times 640$
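
The augmentation factor and the resulting totals in Table 4 follow directly from the instance counts. The short sketch below recomputes them; the dictionary and function names are illustrative, and the counts are copied from the table.

```python
# Instance counts copied from Table 4.
ORIGINAL = {"Dent": 4438, "Hole": 1098, "Rusty": 3568}
AUGMENTED = {"Dent": 4438, "Hole": 2553, "Rusty": 3568}


def aug_factor(cls):
    """Ratio of training instances after augmentation to original instances."""
    return AUGMENTED[cls] / ORIGINAL[cls]


def share(counts, cls):
    """Percentage of all instances that belong to one class."""
    return 100.0 * counts[cls] / sum(counts.values())


hole_factor = aug_factor("Hole")           # about 2.3
hole_before = share(ORIGINAL, "Hole")      # about 12.1 %
hole_after = share(AUGMENTED, "Hole")      # about 24.2 %
total_after = sum(AUGMENTED.values())      # 10559
```

Under these counts, copy-paste augmentation roughly doubles the share of Hole instances seen during training while leaving the Dent and Rusty counts unchanged.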

Table 5. PPFS-YOLO hyperparameter configuration. PPFS-specific parameters are marked with †.

Hyperparameter	Value	Description
Input resolution	$640 \times 640$	standard YOLO input
Epochs	200	training duration
Optimizer (all)	SGD	unified for fair comparison
Learning rate	$1 \times 10^{- 2}$	initial, cosine annealed
Weight decay	$5 \times 10^{- 4}$	$L_{2}$ regularization
Batch size per GPU	16	$4 \times 16 = 64$ effective
AMP	enabled	mixed precision
Seed	42	reproducibility
† Gate bias $b_{gate}$	1.0	initial $α \approx 0.73$
† Residual scale $γ_{0}$	0.1	FIM init
† LR boost factor	$5.0 \times$	PPFS module params
† $λ_{phy}$	0.5	physics loss weight
† Mask base res.	$40 \times 21$	FSF spectral mask
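
To make the optimizer rows of Table 5 concrete, the sketch below evaluates a cosine-annealed learning rate with a $5.0 \times$ boost for the PPFS module parameters. Three points are assumptions rather than statements from the table: the schedule is stepped once per epoch, it decays to a floor of zero, and the boost multiplies the scheduled rate throughout training. The function names and group labels are hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=200, lr0=1e-2, lr_min=0.0):
    """Cosine annealing from lr0 down to lr_min (lr_min = 0 is an assumption)."""
    cos = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr0 - lr_min) * cos


def group_lrs(epoch, boost=5.0):
    """Learning rates for the two parameter groups implied by the LR boost factor."""
    base = cosine_lr(epoch)
    return {"detector": base, "ppfs_modules": boost * base}


effective_batch = 4 * 16      # four GPUs at batch size 16 each, as listed in Table 5
start = group_lrs(0)          # detector at 1e-2, PPFS modules at 5e-2
midpoint = group_lrs(100)     # both groups at half of their initial rate
```

In a framework such as PyTorch the two rates would typically be realized as separate optimizer parameter groups; how the authors implemented the boost is not specified in this section.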

Table 6. Comparison with state-of-the-art methods on the Container Damage dataset.

Method	Params (M)	GFLOPs	mAP@50	mAP@50:95	Precision	Recall
YOLOv10n [7]	2.27	4.4	46.55	24.93	58.04	48.31
YOLO11n [8]	2.58	3.3	48.10	25.20	62.46	47.03
RT-DETR-l [17]	32.00	54.2	52.49	30.83	68.27	55.31
YOLOv8s [4]	11.13	14.4	52.56	29.67	66.33	52.45
YOLO12s [9]	9.23	10.8	52.51	29.58	63.77	54.42
PPFS-YOLO	10.02	12.5	64.86	37.49	78.29	64.82

Table 7. Per-class AP@50 (%) comparison. $\Delta$ denotes improvement over the YOLO12s baseline.

Method	Dent	Hole	Rusty	mAP@50
YOLOv10n [7]	50.10	59.81	29.73	46.55
YOLO11n [8]	54.14	56.88	33.28	48.10
RT-DETR-l [17]	57.06	64.15	36.27	52.49
YOLOv8s [4]	54.32	65.44	37.93	52.56
YOLO12s [9]	58.61	60.87	38.04	52.51
PPFS-YOLO	66.85	83.06	44.66	64.86
$Δ$ vs. YOLO12s (SGD)	+8.24	+22.19	+6.62	+12.35
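
The mAP@50 column in Table 7 is the unweighted mean of the three per-class AP@50 values, and the $\Delta$ row is the per-class difference from the YOLO12s baseline. A minimal sketch that recomputes both from the table; the dictionary and function names are illustrative only.

```python
# Per-class AP@50 (%) copied from Table 7, in the order (Dent, Hole, Rusty).
AP50 = {
    "YOLOv10n": (50.10, 59.81, 29.73),
    "YOLO11n": (54.14, 56.88, 33.28),
    "RT-DETR-l": (57.06, 64.15, 36.27),
    "YOLOv8s": (54.32, 65.44, 37.93),
    "YOLO12s": (58.61, 60.87, 38.04),
    "PPFS-YOLO": (66.85, 83.06, 44.66),
}


def mean_ap(per_class):
    """Unweighted mean over classes."""
    return sum(per_class) / len(per_class)


def delta(method, baseline="YOLO12s"):
    """Per-class gain in percentage points over the baseline."""
    return tuple(m - b for m, b in zip(AP50[method], AP50[baseline]))


map50 = {name: round(mean_ap(aps), 2) for name, aps in AP50.items()}
gain = delta("PPFS-YOLO")
overall_gain = mean_ap(AP50["PPFS-YOLO"]) - mean_ap(AP50["YOLO12s"])
```

The Hole class contributes the largest per-class gain, so it accounts for more than half of the overall improvement once the three gains are averaged.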

Table 8. Ablation study of PPFS-YOLO components.

Configuration	Params (M)	GFLOPs	mAP@50	mAP@50:95	Precision	Recall
YOLO12s (Baseline)	9.23	10.8	52.76	30.65	65.33	53.86
+FSF only	9.35	11.1	54.55 (+1.79)	31.88	69.66	53.92
+FIM only	9.91	12.3	53.71 (+0.95)	31.58	66.65	53.50
+FSF+FIM ( $λ = 0$ )	10.02	12.5	53.59 (+0.83)	31.00	68.04	52.64
Full PPFS	10.02	12.5	64.86 (+12.10)	37.49	78.29	64.82

Table 9. Inference latency comparison on NVIDIA RTX 3090 (batch size 1, $640 \times 640$).

Method	Params (M)	Latency (ms)	FPS
YOLOv10n	2.78	10.1	99.4
YOLO11n	2.62	8.7	115.4
RT-DETR-l	32.97	31.3	32.0
YOLOv8s	11.17	6.2	162.6
YOLO12s	9.29	14.3	70.0
PPFS-YOLO	10.08	17.2	58.3

Measured on NVIDIA RTX 3090, CUDA 12.1, batch size 1, input $640 \times 640$; mean over 500 forward passes after 50 warm-up runs. FFT operations in FSF run in FP32.
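
The measurement protocol in the note above (50 warm-up runs, then the mean over 500 timed forward passes) can be sketched as a small generic timer. This is an illustration of the protocol, not the authors' benchmarking script; `measure_latency` and the `forward` callable are hypothetical, and the example times a CPU stand-in instead of a detector.

```python
import time


def measure_latency(forward, warmup=50, runs=500):
    """Return (mean latency in ms, frames per second) for a zero-argument callable."""
    for _ in range(warmup):          # warm-up passes are executed but not timed
        forward()
    start = time.perf_counter()
    for _ in range(runs):
        forward()
    elapsed = time.perf_counter() - start
    mean_ms = 1000.0 * elapsed / runs
    fps = runs / elapsed if elapsed > 0 else float("inf")
    return mean_ms, fps


calls = []


def dummy_forward():
    """CPU stand-in for a model forward pass."""
    calls.append(sum(range(100)))


mean_ms, fps = measure_latency(dummy_forward)
```

On a GPU the callable would also need to wait for the device to finish before the clock is read (for example via `torch.cuda.synchronize()` in PyTorch); whether the reported numbers were obtained that way is not stated in the note.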

Table 10. Cross-domain transfer results on Kolektor SDD2 (30-epoch fine-tuning from container-trained checkpoint, SGD, $\mathrm{lr}_{0} = 2 \times 10^{-3}$).

Method	Best mAP@50	Best mAP@50:95	Final mAP@50	$Δ$ vs. YOLO12s
YOLO12s (baseline)	95.6	93.8	92.6	—
PPFS-YOLO	93.6	91.9	92.3	$- 2.0$

Table 11. Efficiency comparison. mAP@50/GFLOPs measures accuracy per unit of computation.

Method	mAP@50	GFLOPs	mAP/GFLOPs	Params (M)
YOLOv10n	46.55	4.4	10.58	2.27
YOLO11n	48.10	3.3	14.58	2.58
RT-DETR-l	52.49	54.2	0.97	32.00
YOLOv8s	52.56	14.4	3.65	11.13
YOLO12s	52.51	10.8	4.86	9.23
PPFS-YOLO	64.86	12.5	5.19	10.02
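
The mAP/GFLOPs column in Table 11 is a plain ratio of the two preceding columns. The sketch below recomputes it and ranks the methods; the names are illustrative and the inputs are copied from the table.

```python
# (mAP@50 in %, GFLOPs) copied from Table 11.
ROWS = {
    "YOLOv10n": (46.55, 4.4),
    "YOLO11n": (48.10, 3.3),
    "RT-DETR-l": (52.49, 54.2),
    "YOLOv8s": (52.56, 14.4),
    "YOLO12s": (52.51, 10.8),
    "PPFS-YOLO": (64.86, 12.5),
}


def efficiency(map50, gflops):
    """Accuracy per unit of computation."""
    return map50 / gflops


ratios = {name: round(efficiency(*row), 2) for name, row in ROWS.items()}
ranking = sorted(ratios, key=ratios.get, reverse=True)
```

By this ratio the two nano-scale models rank first and second because of their very low GFLOPs, and PPFS-YOLO ranks third overall while remaining ahead of the other small- and large-scale detectors in the table.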

5.1. The Catalyst Effect of $L_{phy}$