1. Introduction
With the continuous advancement of the Industry 4.0 paradigm, the electronics sector is steadily moving toward higher levels of intelligence and automation. Printed circuit boards (PCBs), as a fundamental element of modern electronic systems, play a pivotal role in ensuring the stability and reliability of final products. However, defects on PCB surfaces are unavoidable during manufacturing, storage, and transportation processes. Even slight imperfections may adversely affect product performance and undermine overall quality, potentially leading to considerable economic losses. Consequently, the development of high-precision and robust PCB surface defect detection techniques has become increasingly critical in contemporary electronic manufacturing.
Traditional PCB surface defect detection methods primarily include manual inspection [1], electrical inspection [2] and Automated Optical Inspection (AOI) [3]. Manual inspection relies heavily on the expertise and concentration of human operators, making it prone to visual fatigue, which in turn leads to low efficiency and relatively high rates of missed detections and false alarms [1]. Compared with manual inspection, electrical inspection improves detection accuracy; however, it requires direct physical contact with the PCB during the inspection process, which may cause secondary damage. Moreover, this approach depends on specialized equipment and fixtures, leading to increased costs and limited flexibility [2]. In recent years, AOI systems have been extensively adopted, offering improved detection accuracy and higher efficiency compared with manual inspection. Despite these benefits, AOI systems entail substantial equipment investment, and their performance is still affected by non-negligible false-positive and missed-detection rates [4].
With the continuous advancement of surface defect detection techniques, numerous studies have employed image processing-based methods, such as threshold segmentation [5] and edge detection [6]. While these methods can be effective in relatively simple scenarios, their performance degrades in PCB inspection tasks due to the presence of complex and cluttered backgrounds. In such cases, defect extraction largely depends on local image cues, which restricts robustness and limits generalization across varying conditions [1]. In addition, traditional machine learning-based methods, including decision trees [7], random forests [8], and support vector machines [9], have also been applied to PCB defect detection. However, their effectiveness is closely tied to the quality of handcrafted features and classifier design. In practice, these models often involve considerable computational overhead and exhibit limited inference speed, which hinders their deployment in real-time industrial environments.
In recent years, deep learning-based detection techniques have achieved remarkable progress across various domains, including industry [10,11], agriculture [12,13], and transportation [14,15]. These approaches benefit from powerful feature representation capabilities and efficient inference processes. With the continued advancement of deep learning in PCB surface defect detection, existing detection algorithms can generally be categorized into two-stage and one-stage frameworks. Two-stage approaches, represented by R-CNN [16], Fast R-CNN [17], and Faster R-CNN [18], are known for their high detection accuracy. However, their high computational complexity and slow inference speed limit their applicability in scenarios with constrained computing resources. In contrast, one-stage methods, such as YOLO [19] and SSD [20], are designed to strike a more favorable balance between accuracy and efficiency. Although SSD improves detection accuracy and reduces false positives, it still has a relatively large number of parameters and a large model size, which negatively affects detection speed [20]. The YOLO family, by comparison, offers a more streamlined detection paradigm by directly learning discriminative features from raw input data [21] and performing object classification and localization simultaneously within a single forward pass. This unified design enables YOLO-based methods to achieve competitive accuracy while maintaining high inference efficiency.
Despite the significant progress made in deep learning-based research, YOLO-based PCB defect detection still faces notable challenges. These difficulties primarily arise from the complex background and the small size of surface defects, which make the detection process highly sensitive to noise. In particular, tiny defects typically present weak visual cues and prominent high-frequency characteristics, which are highly susceptible to degradation or loss during successive downsampling operations. Furthermore, many existing methods are constrained by limited receptive fields, reducing their ability to capture sufficient contextual information. In addition, commonly used feature fusion strategies typically assign equal weights to all channels, resulting in relatively coarse representations with limited discriminative capability. These factors restrict further improvements in both the accuracy and robustness of PCB defect detection.
To address the aforementioned challenges, this study presents a series of targeted enhancements to the feature extraction, feature fusion and loss function within the YOLO11n framework, resulting in the proposed FMW-YOLO model. Unlike existing approaches that primarily emphasize either feature enhancement or lightweight design, the proposed FMW-YOLO integrates frequency-aware representation, noise suppression, expansion of the receptive field and efficient multi-scale feature fusion within a unified framework, thereby achieving a more balanced and effective detection performance. The proposed framework not only improves detection accuracy but also reduces the number of parameters. These characteristics make the proposed method well suited for real-time PCB inspection in resource-constrained industrial environments. The main contributions of this paper are summarized as follows:
In the backbone network, a Frequency-Enhanced Channel-Transposed and Local Feature Network (FCT-LFNet) is proposed, built on two new modules: the Dual-Frequency and Channel Attention Aggregation (DFCAA) module and the Lightweight Edge-Gaussian Block (LEGB). The DFCAA module recovers high-frequency details and enhances the expressive capability of features, while the LEGB suppresses noise interference and enhances boundary representation.
In the neck network, a Multi-Scale Context-Aware Enhancement (MSCAE) network is designed. The Multi-Scale Feature Pyramid Network with Integrated Channel Attention (MSFPNICA) is proposed to enhance cross-scale feature interaction, thereby achieving effective multi-scale feature fusion. In addition, the Dilated Reparam Residual Module (DRRM) is designed to enlarge the receptive field.
Regarding the loss function, Wise-IoU is incorporated into the loss formulation to alleviate the issue that all predicted bounding boxes are treated with identical gradient updates during training. By adaptively down-weighting low-quality samples, this mechanism mitigates their adverse impact on optimization, thereby enhancing training stability and improving overall detection performance.
The effectiveness of the proposed method is validated through extensive experiments on the HRIPCB and DeepPCB datasets, which are widely used benchmarks for PCB surface defect detection.
3. Materials and Methods
3.1. Overview of YOLO11
The YOLO11 algorithm [37], proposed by the Ultralytics team, represents a recent advancement in the YOLO family of object detection models. As an end-to-end one-stage detection algorithm, it achieves a favorable balance between detection accuracy and inference efficiency, while also exhibiting improved generalization capability. Compared with previous versions such as YOLOv8, YOLO11 introduces significant optimizations in network architecture design, feature extraction efficiency and training strategies. YOLO11 provides five model variants: YOLO11n, YOLO11s, YOLO11m, YOLO11l, and YOLO11x. These variants differ in network depth, width, and the maximum number of channels, resulting in progressively increasing parameter sizes and computational costs [38]. The overall architecture of YOLO11 mainly consists of three components: a backbone for feature extraction, a neck for feature fusion, and a head for generating final detection results [39]. Among them, the C3k2 module, illustrated in Figure 1, serves as a key component that improves upon the traditional C3 structure to enhance feature extraction capability, particularly for complex and multi-scale scenarios. To achieve a balance between detection accuracy and computational efficiency, this study adopts YOLO11n as the baseline model.
3.2. Architecture of FMW-YOLO
The overall architecture of FMW-YOLO is illustrated in Figure 2. In the backbone network, a Dual-Frequency and Channel Attention Aggregation (DFCAA) module is proposed to effectively integrate high-frequency information derived from shallow features with the original representations, thereby enhancing the recovery of fine-grained details and improving feature expressiveness. Furthermore, a Lightweight Edge-Gaussian Block (LEGB) is designed to alleviate noise interference and enhance robustness under low-quality and low-contrast imaging conditions. Building upon these components, the Frequency-Enhanced Channel-Transposed and Local Feature Network (FCT-LFNet) is further constructed to achieve more comprehensive multi-scale feature extraction.
In the neck network, the Multi-Scale Feature Pyramid Network with Integrated Channel Attention (MSFPNICA) is constructed to adaptively recalibrate channel-wise fusion weights, enabling more discriminative feature aggregation. Furthermore, to expand the receptive field and capture richer contextual information, a Dilated Reparam Residual Module (DRRM) is designed. By integrating MSFPNICA with DRRM, the novel Multi-Scale Context-Aware Enhancement (MSCAE) network is developed, which promotes effective multi-scale feature integration and strengthens the overall representational capacity of the model.
Finally, the Wise-IoU loss function is introduced to alleviate the excessive competition among high-quality anchor boxes while mitigating the adverse gradients generated by low-quality samples [40].
3.3. Frequency-Enhanced Channel-Transposed and Local Feature Network
In PCB surface defect detection, feature extraction often exhibits limited capability in modeling channel-wise feature relationships, insufficient recovery of high-frequency details, and weak adaptability to defect contexts across multiple scales. Moreover, the complex background of PCB surfaces makes it difficult to accurately delineate defect boundaries under conditions of noise and low contrast, which negatively affects detection robustness. To address these challenges, a Frequency-Enhanced Channel-Transposed and Local Feature Network (FCT-LFNet) is proposed, which integrates multi-frequency channel enhancement with edge-aware Gaussian refinement. Specifically, a Dual-Frequency and Channel Attention Aggregation (DFCAA) module is designed to capture channel dependencies while jointly extracting low- and high-frequency components, thereby improving the representation of fine-grained details. Meanwhile, a Lightweight Edge-Gaussian Block (LEGB) is introduced to adaptively combine shallow edge information with deep Gaussian representations, enabling effective noise suppression and clearer boundary localization. The complementary integration of these two components significantly improves the model’s capability in detail recovery and boundary identification.
3.3.1. Dual-Frequency and Channel Attention Aggregation
While low-frequency components preserve global structure and semantic stability, the effective restoration of high-frequency information remains critical for accurate detail representation. Enhancing high-frequency responses not only strengthens inter-channel dependencies but also improves the fidelity of fine-grained feature reconstruction, thereby benefiting overall representation quality. To this end, a Dual-Frequency and Channel Attention Aggregation (DFCAA) module is proposed to refine selected C3k2 modules within the YOLO11n backbone. The original bottleneck units in C3k2 are replaced with the proposed DFCAA to enhance feature extraction capability. By incorporating channel-level self-attention and explicit frequency-aware modeling, DFCAA effectively addresses the limitations of conventional designs in capturing high-frequency details. The structure of DFCAA is illustrated in Figure 3a, where Channel Transposed Attention (CTA) and Dual-Frequency Feed-Forward Network (DFFN) serve as key components for feature transformation and enhancement.
The Channel Transposed Attention (CTA) is designed to perform self-attention operations along the channel dimension, enabling more effective modeling of inter-channel dependencies and overcoming the limitation of conventional attention mechanisms that predominantly emphasize spatial information. By leveraging channel-wise interactions, CTA facilitates the prioritization of informative features and enhances detail preservation during feature reconstruction. The structure of CTA is illustrated in Figure 3c. First, the input feature map is divided along the channel dimension. During the self-attention computation, the query ($Q_Z$), key ($K_Z$) and value ($V_Z$) representations are generated to capture the relationships across different channels. The attention operation can be formulated as follows:
where α denotes a learnable temperature parameter used to adjust the scale of the dot-product operation, while $F_{C\text{-}A}$ refers to the feature obtained by performing self-attention computation along the channel dimension. Through the aforementioned self-attention calculation, CTA effectively models inter-channel dependencies, allowing the network to emphasize more informative channel responses during feature reconstruction. Subsequently, CTA reorganizes the connections between different attention heads to generate the channel attention feature $F_{CA}$. To reduce computational cost, spatial and channel features are integrated only within the channel attention mechanism. For additional feature dimensions, the same projection strategy is adopted to obtain both attention features $F_{C1}$ and $F_{C2}$, as well as the spatial projection output $Y_S$. These representations are then utilized for feature extraction and cross-domain weighting, and the overall computation can be formulated as follows:
where $f(\cdot)$ represents the sigmoid activation function. Finally, the spatial attention feature weights are used to modulate the CA output features, while the CA features are inversely used to reweight the spatial attention features. This complementary interaction effectively integrates spatial attention features with channel attention features, thereby enhancing the final feature representation.
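The channel-wise attention described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random projection matrices stand in for learned layers, the L2 normalization of queries and keys is an assumption, and `channel_transposed_attention`, `num_heads` and `seed` are hypothetical names. The point it shows is that the attention map has shape (channels per head × channels per head) rather than (pixels × pixels).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_transposed_attention(x, num_heads=2, alpha=1.0, seed=0):
    """x: (C, H, W). Self-attention across channels; returns (C, H, W) and the maps."""
    c, h, w = x.shape
    assert c % num_heads == 0, "channels must split evenly across heads"
    rng = np.random.default_rng(seed)
    tokens = x.reshape(c, h * w).astype(float)          # one row per channel
    # Assumption: random matrices replace the learned Q/K/V projections.
    w_q, w_k, w_v = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))
    q, k, v = w_q @ tokens, w_k @ tokens, w_v @ tokens  # each (C, H*W)
    d = c // num_heads
    out = np.empty_like(tokens)
    attn_maps = []
    for i in range(num_heads):
        rows = slice(i * d, (i + 1) * d)
        # Assumption: rows are L2-normalized before the dot product.
        q_i = q[rows] / (np.linalg.norm(q[rows], axis=1, keepdims=True) + 1e-8)
        k_i = k[rows] / (np.linalg.norm(k[rows], axis=1, keepdims=True) + 1e-8)
        attn = softmax(q_i @ k_i.T / alpha, axis=-1)    # (d, d): channel-to-channel
        out[rows] = attn @ v[rows]                      # reweight channels of this head
        attn_maps.append(attn)
    return out.reshape(c, h, w), attn_maps
```

Because the attention map grows with the number of channels and not with the number of pixels, its cost is independent of the spatial resolution of the feature map, which is one reason channel attention is attractive for high-resolution inputs.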
The Dual-Frequency Feed-Forward Network (DFFN) is designed to enhance high-frequency representations for improved recovery of fine-grained details. Conventional attention mechanisms often exhibit a bias toward low-frequency components, which may result in the attenuation of high-frequency information. To mitigate this issue, DFFN explicitly models frequency decomposition, enabling effective detail enhancement while preserving global structural consistency. The architecture of DFFN is illustrated in Figure 3b. First, the input feature $F_{CA}$ is projected to $Y_{in}$ through a fully connected layer and then activated by the GELU function. Subsequently, a frequency gating mechanism is employed to separate the low- and high-frequency components for independent processing. The low-frequency information is retained to maintain global structural stability, whereas the high-frequency branch is refined using 1 × 1 convolution and depth-wise convolution (DWConv) to enhance local details. The calculation formula for frequency gating is:
where $Y_{fg}$ is the result of element-wise multiplication of the two features. This design enables effective integration of feature representations with high-frequency components from both branches. Through this dual-frequency information aggregation strategy, DFFN preserves the global structural information while preventing the loss of high-frequency components. By jointly aggregating low- and high-frequency information, the module further improves the restoration of fine-grained image details.
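A gating step of this kind can be sketched in NumPy as follows. The sketch assumes that the two branches are obtained by splitting the activated feature into two halves along the channel axis, which the text does not state explicitly; `frequency_gate`, `w_point` and `dw_kernels` are hypothetical names, and the weights are supplied by the caller rather than learned.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def depthwise_conv3x3(x, kernels):
    """x: (C, H, W); kernels: (C, 3, 3). One 3x3 filter per channel, zero padding."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros(x.shape)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def frequency_gate(y_in, w_point, dw_kernels):
    """y_in: (2C, H, W), already activated. Returns the gated feature of shape (C, H, W)."""
    low, high = np.split(y_in, 2, axis=0)                # assumption: channel split
    high = np.einsum("oc,chw->ohw", w_point, high)       # 1 x 1 convolution
    high = depthwise_conv3x3(high, dw_kernels)           # DWConv refines local detail
    return low * high                                    # element-wise gating
```

A typical call would apply `gelu` to the projected feature first and then pass the result to `frequency_gate`; the retained branch is left untouched so that its global structure reaches the output unchanged apart from the gate.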
In summary, the DFCAA module incorporates both CTA and DFFN to strengthen high-frequency feature modeling, thereby improving the fidelity of fine-grained detail reconstruction.
3.3.2. Lightweight Edge-Gaussian Block
Owing to the small size of PCB components and the dense distribution of pads, many defects frequently occur in pad regions where distinguishing defect boundaries from structural patterns becomes challenging. This significantly increases the difficulty of feature extraction. Moreover, noise interference further degrades edge clarity, leading to ambiguous object boundaries. To address these challenges, a Lightweight Edge-Gaussian Block (LEGB) is designed and incorporated into the last two C3k2 stages of the YOLO11n backbone, replacing the original bottleneck units. This proposed block employs an edge-Gaussian aggregation mechanism to effectively mitigate boundary ambiguity, suppress noise interference and enhance boundary representation as well as feature detail preservation, thereby improving the overall robustness of the model.
As shown in Figure 4, the proposed LEGB achieves a favorable balance between edge-aware information and global features, enabling the network to extract more discriminative representations even under noisy conditions and low-contrast scenarios. Specifically, the input feature $F_{in}$ is first processed by the Lightweight Edge-Gaussian Module (LEGM) to enhance low-quality feature representations. The resulting features are then fed into a 1 × 1 convolution layer, followed by an Activation–Normalization (AN) operation, yielding the output feature $F_{mid}$:
$$F_{mid} = \mathrm{AN}\big(\mathrm{Conv}_{1\times1}(\mathrm{LEGM}(F_{in}))\big)$$
where $\mathrm{Conv}_{1\times1}$ denotes a two-dimensional 1 × 1 convolution. Next, a second 1 × 1 convolution is applied to adjust the channel dimension to $C$. This is followed by normalization and a dropout operation with a rate of 0.1. Finally, the processed feature is added to the initial input via a residual connection, producing the output feature $F_{out}$:
$$F_{out} = F_{in} + \mathrm{Dropout}\big(\mathrm{Norm}(\mathrm{Conv}_{1\times1}(F_{mid}))\big)$$
The LEGM first applies an edge-Gaussian aggregation (EGA) module to the input features, producing an intermediate feature $F_{ega}$. To further emphasize the more informative channels, the Efficient Channel Attention (ECA) strategy [41] is introduced. The formulation of this mechanism can be expressed as follows:
$$F_{temp} = \sigma\big(\mathrm{Conv1D}_{z}(\mathrm{GAP}(F_{ega}))\big), \qquad F_{o} = F_{ega} \otimes F_{temp}$$
where $\mathrm{Conv1D}_{z}$ indicates an adaptive convolution along one dimension whose kernel size $z$ is proportionally related to the number of channels $C$; GAP refers to channel-wise global average pooling; σ is the Sigmoid function; ⊗ indicates the element-wise multiplication operation; $F_{temp}$ denotes the output feature after the Sigmoid function; and $F_{o}$ denotes the output feature generated by the LEGM.
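The ECA step can be illustrated with a short NumPy sketch. The kernel-size rule below follows the mapping commonly used with ECA (a log-scaled, odd kernel size with `gamma = 2` and `b = 1`); whether this paper uses the same constants is an assumption, and the convolution weights are passed in by the caller instead of being learned.

```python
import math
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    """Odd 1-D kernel size that grows with log2 of the channel count."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1

def eca(x, weights):
    """x: (C, H, W); weights: 1-D kernel of odd length z. Returns gated x and the gate."""
    c = x.shape[0]
    z = len(weights)
    pooled = x.mean(axis=(1, 2))                       # GAP: one value per channel
    padded = np.pad(pooled, z // 2)                    # zero padding keeps length C
    conv = np.array([padded[i:i + z] @ weights for i in range(c)])
    gate = 1.0 / (1.0 + np.exp(-conv))                 # Sigmoid
    return x * gate[:, None, None], gate               # channel-wise reweighting
```

Since the 1-D convolution only mixes each channel with its `z - 1` nearest neighbours, the attention adds `z` parameters regardless of the channel count, which is what keeps the block lightweight.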
The EGA module introduces an edge-Gaussian aggregation mechanism that adaptively fuses edge cues with Gaussian modeling responses through a weighted integration strategy, thereby enhancing feature representation. Furthermore, the module employs a stage-wise selection strategy for the input features $F_{in}$, where shallow layers are primarily responsible for edge feature extraction, while deeper layers focus on Gaussian modeling. The output feature obtained through EGA is denoted as $A_{ega}$:
where $A_{edge}$ denotes the edge feature extraction operation, and $A_{gauss}$ denotes the Gaussian modeling operation. The obtained $A_{ega}$ is combined with the input $F_{in}$ and subsequently refined through a three-layer convolutional block, resulting in the enhanced feature:
where the convolution is a two-dimensional 3 × 3 convolution. Finally, the convolutional block output $F_a$ is combined with the input $F_{in}$ through element-wise multiplication and addition operations, and then passed through a 3 × 3 convolution to obtain the enhanced feature:
where ⊕ denotes the element-by-element addition operation.
In summary, the LEGB preserves fine-grained boundary information through the edge-Gaussian aggregation mechanism, effectively suppresses the adverse effects of noise, and improves the robustness of detection in complex background scenarios.
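To make the stage-wise edge/Gaussian selection concrete, the following NumPy sketch applies a fixed edge filter in shallow stages and a fixed Gaussian filter in deeper ones. The Sobel operator, the 3 × 3 Gaussian with `sigma = 1.0`, and the `shallow_stages` threshold are all assumptions chosen for illustration; the text does not specify which edge operator or Gaussian parameters are used.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d_same(img, kernel):
    """Single-channel cross-correlation with edge-replicated padding."""
    kh, kw = kernel.shape
    h, w = img.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros((h, w))
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def gaussian_kernel(size=3, sigma=1.0):
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()                                  # normalized to sum to 1

def edge_response(f):
    # Gradient magnitude from horizontal and vertical Sobel responses
    return np.hypot(conv2d_same(f, SOBEL_X), conv2d_same(f, SOBEL_Y))

def gaussian_response(f):
    # Smoothing suppresses pixel-level noise
    return conv2d_same(f, gaussian_kernel())

def ega(f, stage, shallow_stages=(0, 1)):
    """Hypothetical stage-wise selection: edges when shallow, Gaussian when deep."""
    return edge_response(f) if stage in shallow_stages else gaussian_response(f)
```

On a flat region the edge branch returns zero and the Gaussian branch returns the input unchanged, so the two branches respond to complementary parts of the signal: intensity transitions and smooth structure.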
3.4. Multi-Scale Context-Aware Enhancement
During inference, defects often exhibit diverse scales and are sparsely distributed, which imposes higher requirements on contextual modeling capabilities. To this end, a Multi-Scale Context-Aware Enhancement (MSCAE) network is proposed to improve feature interaction efficiency and contextual modeling. The proposed network consists of two key parts: the Multi-Scale Feature Pyramid Network with Integrated Channel Attention (MSFPNICA) and the Dilated Reparam Residual Module (DRRM). Specifically, MSFPNICA enhances cross-scale feature interaction and improves the preservation of small-object information through channel-adaptive weighting and bidirectional selective fusion. Meanwhile, DRRM expands the receptive field using dilated convolution and reparameterization strategies, enabling more effective modeling of sparse features and contextual dependencies. By integrating these two complementary modules, the proposed framework strengthens global semantic modeling while maintaining fine-grained detail recovery, thereby producing more discriminative feature representations for subsequent defect localization and classification in the detection head.
3.4.1. Multi-Scale Feature Pyramid Network with Integrated Channel Attention
In PCB surface defect detection, Feature Pyramid Networks (FPNs) [42] are widely used for multi-scale feature fusion. By employing a top-down pathway with lateral connections, FPNs integrate high-level semantic information with low-level spatial details, thereby improving multi-scale representation capability. However, standard FPN architectures and their variants still suffer from several limitations. First, they lack fine-grained modeling in the channel dimension. Most existing designs rely on simple element-wise addition for feature fusion, implicitly treating all channels as equally important. Under complex industrial scenarios, such an assumption may introduce redundant or less informative features, thereby degrading detection performance. Moreover, the overly simplified fusion strategy makes it difficult to effectively exploit complementary information across different feature scales. To address these issues, we propose a Multi-Scale Feature Pyramid Network with Integrated Channel Attention (MSFPNICA) architecture, which incorporates the channel attention (CA) mechanism to enhance feature modeling. As shown in Figure 5a, the proposed architecture enhances channel-wise feature modeling and improves cross-scale feature interaction, leading to more effective multi-scale feature fusion.
We adopt a dual fusion strategy with bidirectional feature transfer, guided by the CA mechanism. This architecture provides significant advantages in cross-scale feature interaction, channel-selective modeling, and the preservation of small objects. Before feature fusion, CA is employed to adaptively reweight the input features $F$, enabling the network to emphasize more informative channels while suppressing less relevant ones. The resulting output feature is denoted as $F_{ca}$:
where $F_{max}$ and $F_{avg}$ represent global max pooling and global average pooling, respectively. This process highlights significant channels and suppresses redundant information. During the feature selection stage, at each layer (P3, P4, P5), the CA-enhanced features are adaptively reweighted in a channel-wise manner. Subsequently, a 1 × 1 convolution is applied to perform dimensionality reduction:
During the stages of dual-feature fusion and bidirectional feature propagation, the Selective Feature Fusion (SFF) submodule is employed for feature fusion and information transfer. The top-down and bottom-up structures are shown in Figure 5b,c. To avoid information loss caused by simple element-wise addition, a hybrid fusion strategy combining multiplicative modulation and additive compensation is adopted. In the top-down pathway, high-level semantic features are used as guiding signals to modulate the fusion process, enabling the preservation of essential semantic information from low-level representations. Specifically, both high-level and low-level features are taken as inputs $f$. The high-level features are first expanded via transposed convolution and subsequently upsampled using bilinear interpolation to align their spatial resolution with that of the low-level features. Afterwards, the CA mechanism is applied to transform the high-level features into attention weights, thereby enabling more refined feature integration. The final output $f_{out}$ can be expressed as:
where Inter represents bilinear interpolation; TConv represents transposed convolution; the variables $f_{low}$ and $f_{high}$ denote the low-level and high-level feature maps, respectively; and CA indicates the channel attention mechanism. The bottom-up propagation follows a symmetrical process, with its final output $f_{out}$ expressed as:
In summary, the MSFPNICA structure addresses the limitation of the channel-equality assumption through the CA mechanism. The proposed dual fusion strategy, which integrates multiplicative modulation with additive compensation, achieves a balance between feature selection and information preservation, thereby enhancing the efficiency of multi-scale feature interaction within the model.
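The top-down fusion with multiplicative modulation and additive compensation can be sketched in NumPy as follows. Several simplifications are assumed: nearest-neighbour 2× upsampling stands in for the transposed convolution followed by bilinear interpolation, the channel attention uses a shared two-layer perceptron on max- and average-pooled descriptors (a common design that the text does not confirm), and the weights `w1` and `w2` are supplied by the caller. The function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention_weights(f, w1, w2):
    """f: (C, H, W). Returns one weight in (0, 1) per channel."""
    f_max = f.max(axis=(1, 2))                          # global max pooling
    f_avg = f.mean(axis=(1, 2))                         # global average pooling

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)             # shared bottleneck with ReLU

    return sigmoid(mlp(f_max) + mlp(f_avg))

def upsample2x(f):
    """Nearest-neighbour stand-in for TConv + bilinear interpolation."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def sff_top_down(f_low, f_high, w1, w2):
    """f_low: (C, 2H, 2W); f_high: (C, H, W). Returns the fused (C, 2H, 2W) map."""
    f_up = upsample2x(f_high)                           # align spatial resolution
    gate = channel_attention_weights(f_up, w1, w2)      # high-level features guide fusion
    return f_low * gate[:, None, None] + f_up           # modulation plus compensation
```

The additive term keeps the upsampled semantic features in the output even when the gate suppresses a low-level channel, which is the sense in which the compensation avoids the information loss of a purely multiplicative fusion.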
3.4.2. Dilated Reparam Residual Module
Traditional convolutional networks typically rely on small kernel sizes, which lead to limited receptive fields and insufficient contextual information capture, thereby constraining effective feature fusion. Moreover, due to the highly imbalanced distribution of PCB defect patterns, conventional models often struggle to adequately capture sparse structures, leading to increased false positives and missed detections. To tackle these challenges, we further enhance the MSFPNICA and propose the Dilated Reparam Residual Module (DRRM), as shown in Figure 6a. This module replaces the original bottleneck units within the C3k2 in the neck network, thereby expanding the receptive field and strengthening the model’s ability to capture complex spatial features.
In the YOLO11n architecture, the C3k2 module typically employs 3 × 3 convolutions. However, the limited receptive field of small kernels restricts its ability to capture sufficient contextual information. To address this limitation, the Dilation-Wise Residual (DWR) module is introduced to expand the receptive field by incorporating convolutions with multiple dilation rates, thereby enabling more comprehensive contextual modeling over a wider spatial range. The DWR adopts a two-step design comprising Region Residualization (RR)–Semantic Residualization (SR), as shown in Figure 6b. Each layer contains multiple dilated convolution branches with different dilation rates (e.g., 1, 3, and 5). In the RR stage, convolution, ReLU activation, and batch normalization (BN) are sequentially applied to generate feature maps with diverse spatial representations. Subsequently, the SR stage processes these regional feature maps using dilated depth-wise convolutions with different dilation rates, allowing the model to capture multi-scale contextual dependencies. Each dilated branch operates on a distinct receptive field, thereby improving the model’s ability to perceive multi-scale features. The outputs of the dilated convolutions are incorporated into the original input through residual connection, which enhances information propagation and alleviates the gradient vanishing problem. By fusing these multi-scale features, the DWR effectively strengthens the capability of contextual information extraction, which has proven beneficial in dense prediction tasks such as semantic segmentation.
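The effect of the dilation rate on the receptive field follows from simple arithmetic: a kernel of size k with dilation d covers d(k − 1) + 1 positions along each axis. The short sketch below works this out for the dilation rates mentioned above, assuming stride 1 throughout; the function names are illustrative.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1

def stacked_receptive_field(layers):
    """Receptive field of stride-1 layers given as (kernel_size, dilation) pairs."""
    rf = 1
    for kernel_size, dilation in layers:
        rf += effective_kernel_size(kernel_size, dilation) - 1
    return rf

# A 3 x 3 kernel at dilation rates 1, 3 and 5 spans 3, 7 and 11 pixels.
extents = [effective_kernel_size(3, d) for d in (1, 3, 5)]
```

The number of weights stays at nine in every case, so the wider context is obtained without additional parameters per branch.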
Due to the complex spatial characteristics of PCB defects, stronger feature representation capability is required. However, the aforementioned improvements still exhibit limited ability in capturing sparse patterns and may inevitably introduce additional computational overhead. To address these issues, we further enhance the DWR by incorporating the Dilated Reparam Block (DRB), which replaces the original 3 × 3 convolutions with parallel branches using dilation rates of 3 and 5. The architecture of the DRB module is illustrated in Figure 6c. The DRB adopts multiple small convolution kernels arranged in parallel, each operating with a different dilation rate to capture multi-scale spatial information. Dilated convolution introduces spacing between kernel elements to expand the receptive field without changing the number of kernel weights. The outputs of these parallel convolutions are subsequently adaptively weighted and fused, followed by structural reparameterization to merge them into an equivalent large convolutional kernel. This strategy enables DRB to achieve the advantages of large-kernel convolution while avoiding additional computational burden at inference. To evaluate the impact of integrating the DRB module on computational efficiency, frames per second (FPS) are compared in the experimental section. This analysis aims to demonstrate that the proposed structural design reduces the computational cost during the inference stage. Consequently, it improves the modeling of sparse spatial patterns, enhances receptive field expansion efficiency, and facilitates effective global feature extraction.
By integrating the DWR and the DRB, the Dilated Reparam Residual Module (DRRM) is constructed. Specifically, DRRM first applies a standard 3 × 3 convolution for initial feature extraction, after which the feature maps are divided into three groups. Different processing strategies are applied to these groups to enhance the extraction of multi-scale semantic information. One group is processed using a 3 × 3 convolution, while the other two groups are processed by DRB modules with different receptive field sizes. Through multiple parallel dilated convolution branches, the DRB improves the ability of large convolution kernels to capture sparse patterns and obtain richer contextual information without increasing computational complexity. Finally, multi-level feature fusion is performed across different branches and interaction paths, thereby improving the detection capability for small objects and fine-grained details.
Letting the input feature be denoted as $z$, the DRRM can be expressed as follows:
where ReLU denotes the activation function; $\mathrm{Conv}_{3\times3}(z)$ represents the standard 3 × 3 convolution; $\mathrm{DConv}_{3\times3}$ indicates the 3 × 3 dilated convolution; DRB refers to the Dilated Reparam Block; $\mathrm{Conv}_{1\times1}$ corresponds to the 1 × 1 pointwise convolution; the concatenation operator signifies feature map concatenation; $F_{CRB}(z)$ denotes the feature map output after processing by ReLU, batch normalization and convolution operations; $F_{DD}(z)$ represents the feature map generated through DConv and DRB module processing; and $F_{out}$ indicates the final output feature.
In summary, the DRRM effectively expands the receptive field, improving the precision and efficiency of defect detection. It enhances the modeling of sparse patterns and facilitates multi-scale information extraction, thereby strengthening the overall representational capacity of the network. Consequently, the proposed module effectively reduces the rates of missed detections and false positives.
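The structural reparameterization used by the DRB rests on a simple identity: a dilated small kernel is equivalent to a larger dense kernel with zeros inserted between its weights, so parallel branches can be summed into one kernel after training. The NumPy sketch below checks this numerically for a single channel. It omits batch normalization and the adaptive branch weighting mentioned in the text, and the 11 × 11 large kernel is an assumption chosen so that a 3 × 3 branch with dilation 5 fits exactly.

```python
import numpy as np

def conv2d(x, kernel, dilation=1):
    """Single-channel 'same' cross-correlation with zero padding and stride 1."""
    k = kernel.shape[0]
    pad = dilation * (k // 2)
    xp = np.pad(x, pad)
    h, w = x.shape
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            oy, ox = i * dilation, j * dilation
            out += kernel[i, j] * xp[oy:oy + h, ox:ox + w]
    return out

def dilated_to_dense(kernel, dilation):
    """Insert zeros between weights to obtain the equivalent non-dilated kernel."""
    k = kernel.shape[0]
    size = dilation * (k - 1) + 1
    dense = np.zeros((size, size))
    dense[::dilation, ::dilation] = kernel
    return dense

def merge_branches(large, branches):
    """Fold (kernel, dilation) branches into the centre of one large kernel."""
    merged = large.copy()
    centre = large.shape[0] // 2
    for kernel, dilation in branches:
        dense = dilated_to_dense(kernel, dilation)
        r = dense.shape[0] // 2
        merged[centre - r:centre + r + 1, centre - r:centre + r + 1] += dense
    return merged
```

After merging, inference runs one convolution instead of three, while the output matches the sum of the parallel branches up to floating-point error.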
3.5. Wise-IoU Loss Function
Traditional Intersection over Union (IoU)-based loss functions, such as SIoU, GIoU, DIoU and CIoU, exhibit certain limitations when handling low-quality prediction boxes. On the one hand, samples with low IoU values tend to introduce unstable gradient signals during training. On the other hand, these methods apply the same gradient update strategy to all prediction boxes, which prevents the model from emphasizing high-quality predictions and consequently restricts further improvements in detection accuracy. To overcome these issues, the Wise-IoU loss function is introduced as the bounding box regression loss [40]. The parameters are set as ratio = 0.7, d = 0.0, and u = 0.95. This configuration adjusts the contribution of samples according to their IoU quality, thereby alleviating the impact of low-quality predictions and promoting more stable optimization. Wise-IoU incorporates a dynamic nonlinear weighting mechanism that assigns adaptive penalties to predictions of varying quality, reducing the influence of low-quality samples while amplifying the contribution of high-quality ones during optimization. As a result, it improves training stability and enhances the robustness of the model. The traditional IoU loss function can be expressed as follows:
where $q_p$ is the predicted box and $q_t$ is the ground-truth box. The Wise-IoU loss is formulated as follows:
where $R_W(q_p, q_t)$ is the Wise penalty term, which adaptively adjusts the sample weight and is defined as:
where $\alpha$ and $\beta$ are adjustment parameters. When the IoU value is low, the penalty term increases accordingly, thereby weakening the gradient effect of low-quality samples during training. When the IoU value is high, the penalty term gradually approaches zero, enabling the model to place greater emphasis on optimizing high-quality prediction boxes.
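For orientation, the sketch below computes the plain IoU loss together with the Wise-IoU terms as they are commonly stated in the published Wise-IoU formulation (the v1 distance attention and the v3 non-monotonic focusing gain). This is an assumption about the variant: the exact penalty term $R_W$ with the parameters used in this work is not reproduced here, and the function names and the defaults `alpha=1.9`, `delta=3.0` are illustrative. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with positive area.

```python
import math


def iou(box_p, box_t):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_p[0], box_t[0]), max(box_p[1], box_t[1])
    ix2, iy2 = min(box_p[2], box_t[2]), min(box_p[3], box_t[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    union = area_p + area_t - inter
    return inter / union if union > 0 else 0.0


def iou_loss(box_p, box_t):
    """Plain IoU loss: 1 - IoU."""
    return 1.0 - iou(box_p, box_t)


def wiou_v1(box_p, box_t):
    """Distance attention times IoU loss (Wise-IoU v1 form, assumed here)."""
    cx_p, cy_p = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cx_t, cy_t = (box_t[0] + box_t[2]) / 2, (box_t[1] + box_t[3]) / 2
    # Width and height of the smallest box enclosing both boxes.
    wg = max(box_p[2], box_t[2]) - min(box_p[0], box_t[0])
    hg = max(box_p[3], box_t[3]) - min(box_p[1], box_t[1])
    dist2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    # In a framework with autograd, the denominator would be detached.
    r_wiou = math.exp(dist2 / (wg ** 2 + hg ** 2))
    return r_wiou * iou_loss(box_p, box_t)


def focusing_gain(beta, alpha=1.9, delta=3.0):
    """Non-monotonic gain r = beta / (delta * alpha ** (beta - delta))."""
    return beta / (delta * alpha ** (beta - delta))


def wiou_v3(box_p, box_t, mean_iou_loss):
    """v3 form: gain from the outlier degree (loss / running mean) times v1."""
    beta = iou_loss(box_p, box_t) / mean_iou_loss  # mean_iou_loss > 0 assumed
    return focusing_gain(beta) * wiou_v1(box_p, box_t)


pred, target = (0, 0, 2, 2), (1, 1, 3, 3)
print(iou_loss(pred, target), wiou_v1(pred, target))
print([round(focusing_gain(b), 3) for b in (0.1, 1.5, 3.0, 6.0)])
```

Under these assumptions the gain is small for both very low and very high outlier degrees and largest in between, which is one way to realize the emphasis on boxes of moderate quality.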
In summary, the incorporation of the Wise-IoU loss alleviates the adverse influence of low-quality samples, leading to more stable optimization during training. Meanwhile, by placing greater emphasis on anchor boxes of moderate quality, the model can better approximate the overall data distribution.
5. Discussion and Conclusions
This paper proposes FMW-YOLO, a frequency-domain enhanced and multi-scale context-aware detection model based on YOLO11n. The proposed model achieves superior detection performance for PCB images characterized by complex background lines as well as small and irregular defects.
First, FCT-LFNet is designed by integrating DFCAA and LEGB. DFCAA introduces frequency-domain features, while LEGB adopts an edge-Gaussian aggregation mechanism. Together, these two components restore high-frequency details and suppress noise interference, thereby enhancing the model’s feature extraction capability. Second, MSCAE, consisting of MSFPNICA and DRRM, is constructed. MSFPNICA incorporates a channel attention mechanism to enhance feature interaction, while DRRM expands the receptive field through convolutions with different dilation rates. The integration of these two modules improves the efficiency of cross-scale feature fusion and enhances the robustness of the model under complex background conditions. Finally, Wise-IoU is introduced to mitigate the negative impact of low-quality samples through an adaptive gradient balancing mechanism, thereby stabilizing the detection process and improving localization accuracy.
Since Sc and Sp exhibit similar shapes, distinguishing between them is challenging. Specifically, Sc occurs outside the PCB, whereas Sp appears on the PCB. In future work, we will focus on further improving the detection performance for complex defect categories such as Sc and Sp, enabling more accurate discrimination between defects located on the circuit and those occurring outside the circuit. Additionally, we plan to explore the integration of the proposed framework with Transformer-based hybrid architectures to broaden its applicability across diverse manufacturing scenarios.