1. Introduction
With the development of science and technology, modern equipment is becoming increasingly sophisticated. Printed circuit boards (PCBs) are becoming smaller and more complex, and the ever-stricter process requirements can introduce a range of defects during PCB manufacturing, such as missing holes, mouse bites, open circuits, shorts, spurs, and spurious copper. These defects may degrade PCB performance, compromise the quality of downstream products, and cause economic losses. Therefore, it is crucial to detect defects on the PCB surface efficiently. The main inspection methods currently used for PCB surface defect detection are manual visual inspection, electrical-characteristic testing (a destructive method), and non-destructive methods such as X-ray detection, infrared thermography, and ultrasonic detection [1,2]. However, constrained by material and environmental factors, these methods are difficult to adapt to different production environments.
Deep learning-based computer vision technology is advancing rapidly due to its cost-effectiveness, high efficiency, and resilience to variations in material and production environments. Consequently, it has been extensively applied to the detection of surface defects on PCBs. Detection algorithms in this domain are generally categorized into two types: one-stage and two-stage detection algorithms. A prominent example of two-stage detection algorithms is the RCNN series [3,4,5,6], which conducts target detection through a two-step process. Initially, several candidate regions are generated, followed by the selection of the most accurate regions for localization and classification predictions. Despite their effectiveness, these two-stage algorithms require extensive computational resources, leading to slower inference speeds and lower frame rates, particularly on devices with limited processing capabilities. In contrast, one-stage detection algorithms, including the YOLO series [7,8,9,10], SSD [11], and FCOS [12], treat target detection as a single prediction task, generating predictions directly from the input image without relying on candidate regions. These methods not only maintain high levels of accuracy but also preserve the benefit of rapid detection, making them particularly well-suited for real-time applications in intricate environments, and they are especially advantageous for detecting defects on PCB surfaces.
Related research has further improved general-purpose detectors so that the model can recognize multiple defect types efficiently and accurately at the same time, achieving a better trade-off between accuracy and speed. Ref. [13] introduced inverted residual blocks and coordinate attention [14] into the YOLOX framework [15], effectively reducing the network parameters while enhancing the recognition of PCB surface defects. Ref. [16], based on YOLOv7 [17], combined FasterNet [18] and the Convolutional Block Attention Module (CBAM) [19] to effectively extract spatial features, reduce redundant operations, and enhance the discriminative ability of the feature representation, thereby improving detection accuracy. Ref. [20] designed LDD-Net, a lightweight PCB surface defect detection network that focuses on critical defect features through efficient downsampling and an attention module.
Although these methods have achieved relatively good results in PCB surface defect detection, some challenges remain: (1) Making the network lighter may reduce its accuracy without necessarily improving its real-time performance. This is because inference speed is limited by hardware conditions, and a design that targets only low computational cost can hinder efficient deployment on GPUs. (2) The increasingly small size of PCB surface defects and the interference of complex backgrounds pose significant challenges to the detection capability of existing models. Most existing methods struggle to fuse multi-scale features, so large intra-class differences and inter-class similarities remain hard to resolve. (3) Although attention modules such as CBAM and SE can, to a certain extent, reduce the interference of noise with detection, complex production environments with lighting variations, dust, and other interfering factors may blur defect boundaries, which affects the accuracy of defect detection on PCB surfaces.
To make the model lightweight without sacrificing detection accuracy and to further improve its performance, an RG-ELAN module combining the ideas of ELAN and GhostNet [21] is designed, using a re-parameterized module as the computational block. The Attention-based Intra-scale Feature Interaction (AIFI) [22] module is introduced to capture the dependencies between features. It removes unnecessary interactions between shallow features, enabling the model to process and fuse important information more effectively, thus further improving the overall detection performance.
To mitigate the uncertainty caused by smaller defect sizes and complex backgrounds, this paper introduces a multi-branch neck architecture for multi-scale feature fusion, namely the Unified Multi-scale Feature Fusion Pyramid Network (UMSFPN). A key component of this network is the Cross Stage Partial Multi-scale Module (CSPMS), specifically designed to enhance feature representation across different scales. By effectively fusing multi-scale features, this neck improves detection accuracy for small targets. It is particularly well-suited to complex scenes that contain many densely distributed small targets. On a custom dataset, UMSFPN achieves a 2% improvement in AP and a 3.1% improvement in AP50 over the original neck structure. Additionally, it is plug-and-play with other deep learning models and generalizes well across different datasets.
The channel-priority convolutional attention module (CPCA) [23], which can dynamically allocate attention weights in the channel and spatial dimensions, is added to the detection head of the model to address its insufficient detection ability for small targets in complex environments. As a result, the proposed YOLO-UMS model achieves a superior trade-off between speed and accuracy compared to existing detectors.
The main contributions of this paper are summarized as follows:
(1) To address the complex backgrounds, small defect sizes, and diverse features encountered in PCB surface defect detection, the RG-ELAN feature extraction method is proposed. This method effectively reduces the uncertainty in the detection process and improves the detection accuracy for small targets while keeping the model lightweight. The AP50 is improved by 4.1% on the PCB-M dataset while the amount of computation is reduced.
(2) The AIFI module is introduced to replace the traditional SPPF module, which reduces shallow feature interactions, optimizes deep semantic information fusion, and improves the detection accuracy of small targets.
(3) The UMSFPN is proposed. By introducing the weighted bi-directional feature pyramid network (BiFPN) and the efficient up-convolution block (EUCB), and by designing a new feature extraction module, CSPMS, the effectiveness and efficiency of feature fusion and extraction are significantly improved. At the same time, its plug-and-play modular design and lightweight nature make it widely applicable and high-performing in practical applications.
(4) The CPCA is added to the head network to reduce the influence of uncertainty factors on the model and further refine the features of small targets.
3. Methods
3.1. Overall Network Structure
In this paper, YOLO11 [34], a state-of-the-art target detector that offers considerable accuracy and speed, is used as the base model for PCB surface defect detection. On this basis, the YOLO-UMS model is proposed by designing the RG-ELAN feature extraction module and the new UMSFPN neck and by introducing the AIFI module and the CPCA. Its overall structure is shown in Figure 1.
YOLO-UMS consists of three main components: the backbone, neck, and head. In the backbone, features are downsampled and extracted using 3 × 3 convolutions to preserve pixel information. The network is structured into four stages, each employing a 3 × 3 convolution for downsampling and channel expansion. Feature extraction is enhanced through a combination of 3 × 3 convolutional downsampling and RG-ELAN, balancing detail retention and model efficiency, with two RG-ELANs per stage. Following the fourth stage, AIFI is integrated to expand the multi-scale receptive field, while C2PSA refines features. The UMSFPN neck facilitates multi-scale feature fusion across multiple branches, where CSPMS strengthens multi-scale sensing and enhances feature extraction. Additionally, BiFPN’s weighted fusion method effectively processes features of varying scales. For detection, the head predicts small targets on high-resolution feature maps and large targets on low-resolution ones. Building on the YOLO11 detector head, the improved design incorporates the CPCA, leveraging semantic information from the backbone and neck for more precise predictions.
3.2. RG-ELAN
PCB surface defect detection presents challenges due to complex backgrounds, small defect sizes, and diverse feature variations. Traditional network feature extraction methods struggle to capture these tiny defects effectively, leading to poor performance. To solve this problem, the RG-ELAN module is designed to enhance the feature extraction capability of small targets and reduce the computation and storage consumption, considering the lightweight requirement.
This module is built on the ELAN architecture. By integrating RepConv [44] and GhostNet concepts, it extracts features efficiently at low cost. The RepConv layer is re-parameterizable: in the training phase, it enhances the expressive ability of the network through a multi-branch structure; in the inference phase, the branches are merged into a single standard convolutional layer, which reduces the computational complexity and the number of model parameters and improves the inference speed. ELAN is an efficient layer aggregation network architecture that optimizes the fusion of features from different layers and the way they are propagated, improving the overall performance and efficiency of the model. Specifically, RG-ELAN introduces the RepConv layer into ELAN and adopts the design concept of GhostNet to generate additional feature maps with a cheap 1 × 1 convolution operation. RG-ELAN is used for feature extraction in the backbone, which reduces the amount of computation and makes the model lightweight, as shown in Figure 2.
In the RG-ELAN module, a 1 × 1 convolution first expands the input to twice the number of hidden-layer channels, which facilitates the extraction of richer abstract features and prepares for the subsequent feature splitting and processing. A Split operation then divides the feature map along the channel dimension into two branches, each with half the channels of the expanded feature map. One branch passes the feature map directly to the output, and the other is processed by the RepConv layer and multiple 3 × 3 convolutional layers. Following the design concept of GhostNet, which observes that the intermediate feature maps in mainstream CNNs tend to be highly redundant, the branch uses a cheap 1 × 1 convolution after N − 1 3 × 3 convolutional layers to generate a portion of the new feature maps. With this split, the model can process the feature maps of each branch independently. The features of all branches are then merged using a cross-stage hierarchy, so that the gradient information propagated along different network paths differs substantially in correlation, thus enhancing the overall feature extraction capability.
Figure 2b,c show the network structure of RepConv in the training and inference phases. Skip connections shape the information flow and enhance the feature extraction capability. The reparameterization process is described as follows: in the training phase, the input passes through three branches, namely an identity connection, a 3 × 3 convolutional layer, and a 1 × 1 convolutional layer, each followed by batch normalization (BN); the branch outputs are summed element by element and passed through the activation function layer to obtain the result. Let $X$ and $M$ be the input and output feature maps, and let $W^{(1)}$ and $W^{(3)}$ denote the weights of the 1 × 1 and 3 × 3 convolutions, respectively. $\mu^{(3)}$, $\sigma^{(3)}$, $\gamma^{(3)}$, and $\beta^{(3)}$ are the mean, standard deviation, learned scale factor, and bias of the BN layer after the 3 × 3 convolution. Similarly, $\mu^{(1)}$, $\sigma^{(1)}$, $\gamma^{(1)}$, and $\beta^{(1)}$ belong to the 1 × 1 convolution branch, while $\mu^{(0)}$, $\sigma^{(0)}$, $\gamma^{(0)}$, and $\beta^{(0)}$ belong to the identity branch. The identity branch is applicable only when the input and output channel dimensions are identical and the convolution stride is 1. In this case, it functions as a 1 × 1 branch with a fixed weight of 1. A 1 × 1 convolution, in turn, can be viewed as a zero-padded 3 × 3 convolution. Hence, the 3 × 3 branch is used to illustrate the reparameterization process:
$$\mathrm{BN}\left(X * W^{(3)}\right) = \gamma^{(3)} \frac{X * W^{(3)} - \mu^{(3)}}{\sigma^{(3)}} + \beta^{(3)} = X * \left(\frac{\gamma^{(3)}}{\sigma^{(3)}} W^{(3)}\right) + \left(\beta^{(3)} - \frac{\mu^{(3)} \gamma^{(3)}}{\sigma^{(3)}}\right)$$
Thus, the weights and biases of the 3 × 3 branch can be converted into $W'^{(3)} = \frac{\gamma^{(3)}}{\sigma^{(3)}} W^{(3)}$ and $b'^{(3)} = \beta^{(3)} - \frac{\mu^{(3)} \gamma^{(3)}}{\sigma^{(3)}}$. The other two branches can be converted in the same way as the 3 × 3 branch. The final output of RepConv in the inference process is expressed as follows:
$$M = \mathrm{SiLU}\left(X * \left(W'^{(3)} + W'^{(1)} + W'^{(0)}\right) + \left(b'^{(3)} + b'^{(1)} + b'^{(0)}\right)\right)$$
where $M$ represents the output of RepConv with the SiLU activation function, $W'^{(1)}$ and $W'^{(0)}$ are the converted weights of the 1 × 1 convolution and the identity branch (both zero-padded to 3 × 3), respectively, and $b'^{(1)}$ and $b'^{(0)}$ are the converted biases of the 1 × 1 convolution and the identity branch, respectively. The multi-branch architecture allows the model to be viewed as a collection of multiple shallower models, effectively improving the network representation and preventing gradient vanishing.
In the inference phase, an equivalent transformation reduces the three-branch training structure to a single 3 × 3 convolutional layer followed by an activation function layer. Because all branches are merged into one convolutional kernel, the single-branch structure improves parallelism and inference speed and lowers computational complexity, while the representational capability learned with the multi-branch design during training is retained.
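The branch merging described above can be checked numerically. The following is a minimal sketch, not the authors’ implementation: it assumes a single channel, scalar BN statistics in inference mode, and hypothetical helper names (`conv3x3_same`, `fuse_conv_bn`, `reparameterize`). The activation is left out because it is applied identically to both forms.

```python
def conv3x3_same(x, k, bias=0.0):
    """Single-channel 3x3 convolution with zero padding and stride 1."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            s = bias
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        s += k[di + 1][dj + 1] * x[ii][jj]
            row.append(s)
        out.append(row)
    return out


def batch_norm(x, mu, sigma, gamma, beta):
    """Inference-mode BN with fixed statistics (mu, sigma) and affine (gamma, beta)."""
    return [[gamma * (v - mu) / sigma + beta for v in row] for row in x]


def pad_to_3x3(w):
    """View a 1x1 kernel (a scalar here) as a zero-padded 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, w, 0.0], [0.0, 0.0, 0.0]]


def fuse_conv_bn(k, mu, sigma, gamma, beta):
    """Fold BN into the preceding convolution; returns (kernel, bias)."""
    scale = gamma / sigma
    return [[scale * v for v in row] for row in k], beta - mu * scale


def train_forward(x, k3, w1, bn3, bn1, bn0):
    """Sum of the three training-time branches, before the activation."""
    y3 = batch_norm(conv3x3_same(x, k3), *bn3)
    y1 = batch_norm(conv3x3_same(x, pad_to_3x3(w1)), *bn1)
    y0 = batch_norm(x, *bn0)  # identity branch
    return [[a + b + c for a, b, c in zip(r3, r1, r0)]
            for r3, r1, r0 in zip(y3, y1, y0)]


def reparameterize(k3, w1, bn3, bn1, bn0):
    """Merge the three branches into one 3x3 kernel and one bias."""
    fused = [fuse_conv_bn(k3, *bn3),
             fuse_conv_bn(pad_to_3x3(w1), *bn1),
             fuse_conv_bn(pad_to_3x3(1.0), *bn0)]  # identity = 1x1 with weight 1
    kernel = [[sum(f[0][i][j] for f in fused) for j in range(3)] for i in range(3)]
    bias = sum(f[1] for f in fused)
    return kernel, bias


# Illustrative values only; each BN tuple is (mu, sigma, gamma, beta).
x = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0], [2.0, 1.5, -0.5]]
k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
w1 = 0.7
bn3, bn1, bn0 = (0.2, 1.5, 0.9, 0.1), (-0.1, 0.8, 1.1, -0.2), (0.3, 2.0, 1.0, 0.05)

kernel, bias = reparameterize(k3, w1, bn3, bn1, bn0)
y_train = train_forward(x, k3, w1, bn3, bn1, bn0)
y_infer = conv3x3_same(x, kernel, bias)
max_diff = max(abs(a - b) for ra, rb in zip(y_train, y_infer) for a, b in zip(ra, rb))
```

In this sketch the merged kernel reproduces the three-branch output up to floating-point rounding, which is why the inference-time model can drop the extra branches without changing its predictions.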
3.3. AIFI
AIFI reduces unnecessary interactions between feature scales, thus effectively reducing the overhead of cross-scale attention computation. Introducing the AIFI module into YOLO11 strengthens the interaction among intra-scale features, and it works with the UMSFPN neck to further improve multi-scale feature fusion, thus obtaining richer global features. Compared with SPPF, AIFI enhances the feature interaction capability, reduces the computational overhead, and fuses features dynamically, which improves the generalization capability and adaptability of the model and lets it work better with complex network structures (e.g., UMSFPN).
The feature fusion process of the AIFI module is shown in Figure 3. The two-dimensional S5 feature map is first flattened into a sequence of high-dimensional vectors by a linear transformation or convolution operation to suit the subsequent self-attention mechanism. These vectors are then input into the Multi-Head Self-Attention module, where multiple attention heads capture the complex dependencies between features in parallel to fully exploit the information in the feature map. The output of Multi-Head Self-Attention is added to the original input through a residual connection and undergoes Layer Normalization, which effectively alleviates the gradient vanishing problem and stabilizes the training process. Then, the features are passed to the Feed-Forward Neural Network for nonlinear transformation and feature extraction, further enhancing the feature expression ability. After another Layer Normalization, the resulting features reflect each feature’s importance in the global context. Finally, the processed features are reshaped back to a 2D form, denoted as F5, for subsequent multi-scale feature fusion. The AIFI module processes only the S5 feature map because it carries richer semantic information than the shallower S3 and S4 feature maps. This allows different objects to be distinguished more effectively while reducing the amount of computation, with minimal impact on the detection performance.
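The data flow above (flatten, self-attention, residual connection with Layer Normalization, reshape) can be sketched in a few lines. This is only an illustration under strong simplifying assumptions that are not from the paper: a single attention head, identity query/key/value projections, no positional encoding, and no feed-forward network or second normalization. All function names are hypothetical.

```python
import math


def flatten_to_tokens(fmap):
    """C x H x W nested lists -> H*W tokens, each of length C."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    tokens = [[fmap[ch][i][j] for ch in range(c)]
              for i in range(h) for j in range(w)]
    return tokens, (c, h, w)


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V (assumed)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                          for k in tokens])
        out.append([sum(s * v[ch] for s, v in zip(scores, tokens))
                    for ch in range(d)])
    return out


def layer_norm(token, eps=1e-5):
    """Normalize one token across its channels (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def tokens_to_map(tokens, shape):
    """Inverse of flatten_to_tokens: H*W tokens -> C x H x W."""
    c, h, w = shape
    return [[[tokens[i * w + j][ch] for j in range(w)] for i in range(h)]
            for ch in range(c)]


def aifi_sketch(s5):
    tokens, shape = flatten_to_tokens(s5)
    attended = self_attention(tokens)
    # Residual connection followed by layer normalization.
    normed = [layer_norm([a + b for a, b in zip(t, o)])
              for t, o in zip(tokens, attended)]
    return tokens_to_map(normed, shape)


s5 = [[[1.0, 2.0], [3.0, 4.0]],
      [[0.5, -1.0], [2.0, 0.0]],
      [[-2.0, 1.0], [0.0, 3.0]]]  # C=3, H=2, W=2
f5 = aifi_sketch(s5)
```

The output keeps the C × H × W shape of the input, which is what lets the result be handed to the neck as an ordinary feature map.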
In addition, combining the AIFI module with UMSFPN and RG-ELAN further optimizes the fusion of multi-scale features and enhances the ability to capture the dependencies between them, thus significantly improving the overall detection performance. Through this combination, AIFI not only outperforms the traditional SPPF module in terms of feature interaction and computational efficiency but also improves the generalization ability and adaptability of the model by dynamically adjusting the fusion strategy, which helps the model perform better in complex environments.
3.4. UMSFPN
In the target detection task, multi-scale feature fusion can effectively combine different levels of feature information to enhance the detection effect. Classical feature fusion architectures such as Feature Pyramid Networks (FPN) and Path Aggregation Networks (PAN) have been widely used in YOLO detectors. However, when performing multi-scale (especially small target) detection in complex scenes, traditional FPN and PAN have certain limitations regarding efficient computation and high-quality feature fusion.
This paper proposes a plug-and-play neck, UMSFPN, to address these issues. The main features include the following: (1) Multi-scale weighted feature fusion: the weighting mechanism of BiFPN is introduced to dynamically adjust the weights of features at different levels, making multi-scale feature fusion more adaptive. (2) Efficient up-sampling mechanism: a lightweight EUCB module is used to optimize the spatial resolution recovery of shallow features. (3) Feature extraction module optimization: the newly designed CSPMS module improves the efficiency and diversity of feature extraction through multi-scale convolution and a global heterogeneous kernel mechanism. (4) Uniform channel design: the same number of channels is used in the convolutional layers from the output of the backbone network to each branch of the neck, so the channel count of the whole neck network is uniformly 256.
3.4.1. Using BiFPN’s Weighted Feature Fusion Method
A single multi-scale feature representation falls short of capturing the complexity of PCB surface defects, which are diverse in morphology and size and are affected by noise, illumination variations, and background interference. Although traditional shallow and deep feature fusion methods can improve detection accuracy, it is difficult for them to simultaneously meet the demands of coping with complex defects in terms of efficiency and accuracy. The UMSFPN designed in this study uses a BiFPN to assist in fusing shallow and deep feature information.
The introduction of the BiFPN structure significantly enhances the network sensing capability, connecting high-level semantic and low-level spatial information through cross-layer feature fusion. It excels in small-target detection (e.g., tiny defects). The bidirectional information flow design of BiFPN ensures that the low-level fine-grained features (e.g., small-target edges and details) are effectively retained in the multilayer processing. Meanwhile, the deep-layer high-semantic information captures global features of large targets (e.g., more significant defects or anomalous regions) to further optimize the tiny defect detection capability.
The weighted feature fusion mechanism dynamically adjusts the importance of input features to maximize their utility. Specifically, BiFPN assigns a learnable weight to each input feature and normalizes the weights using a method similar to SoftMax, which scales each weight to the range [0, 1] so that the weights sum to approximately 1. This normalization plays a key role in constraining and stabilizing the weights during training. The formula for fast normalized fusion is shown below:
$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i$$
where $w_i$ is a learnable weight obtained from network training, and applying the ReLU activation function to each weight ensures that $w_i \ge 0$. The sum of the weights $\sum_j w_j$ is first calculated, and a small constant $\epsilon$ is added for numerical stability. The weights are then normalized. Finally, the input features $I_i$ are weighted using the normalized weights to obtain the output feature $O$.
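As a small illustration of fast normalized fusion, the sketch below fuses feature vectors represented as plain lists. The function name, the default value of `eps`, and the example weights are assumptions made for the example, not values taken from the paper.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors with ReLU-clipped, normalized weights."""
    clipped = [max(0.0, w) for w in weights]      # ReLU keeps every weight non-negative
    denom = eps + sum(clipped)                    # eps avoids division by zero
    normalized = [w / denom for w in clipped]
    length = len(features[0])
    fused = [sum(n * f[p] for n, f in zip(normalized, features))
             for p in range(length)]
    return fused, normalized


features = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [10.0, 10.0, 10.0]]
# The third weight is negative, so ReLU removes that input from the fusion.
fused, normalized = fast_normalized_fusion(features, [2.0, 2.0, -1.0])
```

Unlike SoftMax, this normalization needs no exponentials, and the weights sum to slightly less than 1 because of the added constant.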
As shown in Figure 4, the structure diagram of BiFPN demonstrates its bidirectional feature fusion path. Through this design, UMSFPN not only improves the network’s multi-scale learning capability but also ensures high accuracy under complex interference and background changes.
3.4.2. Feature Fusion Module CSPMS
A new feature fusion module, CSPMS, has been designed to strengthen the feature extraction and fusion capability of UMSFPN. CSPMS integrates a CSP structure and replaces the traditional C2f residual block with an MSCB block.
The CSP structure splits the input feature maps into two parts, one of which is passed directly to the subsequent layers. In contrast, the other part is merged with the original feature maps after multiple convolution operations. This design effectively reduces redundant computations, maintains feature diversity, and significantly improves computational efficiency. Specifically, the CSP structure ensures the retention of key information by directly transferring some of the features while enhancing the feature expression capability through convolutional processing.
Figure 5a illustrates the detailed structure of the CSPMS, showing how the feature map is segmented and processed by multiple MSCB blocks before merging. This design reduces redundant computations, preserves feature diversity, and improves computational performance.
Since networks with larger receptive fields are suitable for detecting large objects, while small objects benefit from smaller receptive fields, the UMSFPN extends the Global Heterogeneous Kernel Selection (GHKS) mechanism by integrating heterogeneous convolutional kernels. In CSPMS, convolution kernels of different sizes, 1 × 1, 3 × 3, 5 × 5, 7 × 7, and 9 × 9, are used to adapt to different resolutions and to acquire multi-scale perceptual information progressively.
Multi-scale depthwise convolution (MSDC) is the core component of CSPMS, and its structure is shown in Figure 5b. MSDC collects information from multiple receptive fields by applying convolutional kernels of different sizes in parallel to capture rich contextual details. Each convolutional kernel extracts features at a different spatial scale, enabling the network to extract fine-grained features from multiple receptive fields and enhance the perception of multi-scale targets. The output of the MSDC is adaptively fused to reduce redundancy and improve the interactions between features. Subsequently, the feature channels are reorganized by Channel Shuffle to optimize the information flow and make the information fusion at different scales more efficient.
The computational process of MSDC can be described by the following equation:
$$\mathrm{MSDC}(X) = \sum_{k \in KS} \mathrm{DWCB}_k(X)$$
where $KS$ denotes the set of multi-scale convolutional kernel sizes, $\mathrm{DWCB}_k$ denotes the depthwise convolutional block with kernel size $k$, and $X$ is the input feature map. Each convolutional kernel processes the feature maps in parallel, capturing contextual information at various scales.
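The parallel multi-kernel idea can be sketched as follows. This is an illustration under assumptions that are not from the paper: the branch outputs are combined by summation, averaging kernels stand in for learned weights, BN is omitted, and each branch ends with ReLU6. The helper names are hypothetical.

```python
def depthwise_conv(fmap, kernels):
    """Depthwise 'same' convolution: kernels[ch] is the k x k kernel of channel ch."""
    out = []
    for channel, kernel in zip(fmap, kernels):
        h, w, r = len(channel), len(channel[0]), len(kernel) // 2
        rows = []
        for i in range(h):
            row = []
            for j in range(w):
                s = 0.0
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:  # zero padding
                            s += kernel[di + r][dj + r] * channel[ii][jj]
                row.append(s)
            rows.append(row)
        out.append(rows)
    return out


def relu6(fmap):
    return [[[min(max(v, 0.0), 6.0) for v in row] for row in ch] for ch in fmap]


def mean_kernel(k):
    """Placeholder k x k averaging kernel standing in for learned weights."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


def msdc(fmap, kernel_sizes=(1, 3, 5)):
    """Run one depthwise block per kernel size in parallel and sum the results."""
    channels = len(fmap)
    branches = [relu6(depthwise_conv(fmap, [mean_kernel(k)] * channels))
                for k in kernel_sizes]
    return [[[sum(b[ch][i][j] for b in branches)
              for j in range(len(fmap[0][0]))]
             for i in range(len(fmap[0]))]
            for ch in range(channels)]


x = [[[1.0] * 5 for _ in range(5)], [[2.0] * 5 for _ in range(5)]]  # C=2, 5x5
y = msdc(x)
```

Because every branch is depthwise and uses "same" padding, the output has the same channel count and spatial size as the input, so branches with different receptive fields can be combined position by position.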
Figure 5c illustrates the structure of the multi-scale efficient convolution module, a key component of CSPMS. This module combines MSDC and pointwise convolution to improve the network’s feature extraction capability by efficiently processing multi-scale features.
The process first expands the number of channels of the input feature map using a pointwise convolution layer (expansion factor = 2) to provide richer feature information for subsequent convolution operations, followed by batch normalization (BN) and the ReLU6 activation layer. Then, multi-scale depthwise convolution is performed by the MSDC module to capture the contextual information of different receptive fields, enhancing feature diversity and improving the adaptability to targets of different sizes. After multi-scale convolution, the feature map undergoes a channel-shuffling operation to reorganize the feature channels and enhance feature interaction. Next, redundant features are reduced by channel compression with a second pointwise convolution to keep the computational overhead under control while maintaining information richness. Finally, batch normalization (BN) is applied to stabilize the training and enhance the model’s generalization ability. The computational formula for the whole process can be expressed as follows:
$$\mathrm{MSCB}(X) = \mathrm{BN}\left(\mathrm{PWC}_2\left(\mathrm{CS}\left(\mathrm{MSDC}\left(\mathrm{ReLU6}\left(\mathrm{BN}\left(\mathrm{PWC}_1(X)\right)\right)\right)\right)\right)\right)$$
where $\mathrm{PWC}_1$ and $\mathrm{PWC}_2$ are the expanding and compressing pointwise convolutions, respectively, and $\mathrm{CS}$ is the channel-shuffling operation.
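Two details of this pipeline are easy to check in code: how channel shuffling reorders channels, and how the channel count changes along the block. The sketch below is illustrative only. The group count used for shuffling and the function names are assumptions, and the 256-channel example simply follows the uniform neck width stated in Section 3.4.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape (g, n/g) -> transpose -> flatten."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]


def mscb_channel_plan(in_channels, out_channels, expansion=2):
    """Channel count after each step, using the expansion factor of 2 from the text."""
    hidden = in_channels * expansion
    return [("pointwise expand + BN + ReLU6", hidden),
            ("MSDC", hidden),
            ("channel shuffle", hidden),
            ("pointwise compress + BN", out_channels)]


# Channels a0..a2 form one group and b0..b2 the other.
shuffled = channel_shuffle(["a0", "a1", "a2", "b0", "b1", "b2"], groups=2)
plan = mscb_channel_plan(256, 256)
```

After shuffling, neighbouring channels come from different groups, so the following pointwise convolution mixes information that the depthwise branches processed separately.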
This multi-scale efficient convolution module (MSCB) significantly improves the model’s robustness and accuracy in handling objects of different sizes by optimizing the feature flow and efficiently extracting multi-scale features. In particular, it performs well in small target detection against complex backgrounds.
By combining MSDC and pointwise convolution (PWC), the CSPMS module can efficiently process feature information at different scales and optimize feature interaction and information flow through channel shuffling and fast normalized fusion strategies. This design significantly improves the model’s robustness and accuracy in multi-scale target detection tasks, especially in small target detection in complex backgrounds. In addition, CSPMS improves overall computational efficiency by reducing redundant computations and preserving feature diversity, which enables UMSFPN to have better detection performance and real-time performance in real applications.
3.4.3. EUCB
In this study, an EUCB is employed to progressively upsample the feature maps at the current stage so that their size and resolution match the feature maps of the next skip connection. The up-sampling operation plays a crucial role in feature fusion, especially when dealing with multi-scale features, because it recovers fine-grained spatial information and ensures that the feature maps do not lose important visual information as they are passed on. EUCB optimizes the up-sampling process of feature maps through a multi-step design to achieve efficient computation and accurate feature fusion, as shown in Figure 6.
To match the size of the feature maps in the subsequent skip connection, EUCB first uses an up-sampling operation (Up) with a scale factor of 2 to double the size of the feature maps in the current stage. The up-sampled feature maps are then processed by a 3 × 3 Depthwise Convolution (DWConv), which extracts more local information without appreciably raising the computational overhead. Batch Normalization (BN) and the ReLU activation function are applied after the convolution operation to further improve the expressiveness of the feature maps and add nonlinearity, which enhances feature learning.
To ensure the feature maps are consistent in terms of the number of channels, the final step of the EUCB utilizes a 1 × 1 convolution (Pointwise Convolution) to reduce the channel count, aligning the final output feature maps with the number of channels in the feature maps of the network’s next stage. This adjustment guarantees that the dimensions of the feature maps at each stage can be seamlessly connected, allowing for effective feature transfer to the subsequent network layers.
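The operation of the EUCB can be expressed by the following equation:
$$\mathrm{EUCB}(X) = C_{1 \times 1}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{DWConv}\left(\mathrm{Up}(X)\right)\right)\right)\right)$$
where $X$ is the input feature map, $\mathrm{Up}$ denotes the upsampling operation, $\mathrm{DWConv}$ is the depthwise convolution, $\mathrm{BN}$ is the batch normalization operation, and $C_{1 \times 1}$ is the 1 × 1 convolution operation.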
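The sequence of steps (upsample by 2, 3 × 3 depthwise convolution, BN, ReLU, 1 × 1 convolution) can be traced on a tiny feature map. The sketch below is a hedged illustration: nearest-neighbour interpolation is an assumption, since the text does not name the upsampling mode, BN is written as an inference-mode per-channel affine map, and all weights are placeholder values.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling by a factor of 2 (interpolation mode is assumed)."""
    out = []
    for ch in fmap:
        rows = []
        for row in ch:
            wide = [v for v in row for _ in range(2)]
            rows.extend([wide, list(wide)])
        out.append(rows)
    return out


def depthwise3x3(fmap, kernels):
    """One 3x3 kernel per channel, zero padding, stride 1."""
    out = []
    for ch, k in zip(fmap, kernels):
        h, w = len(ch), len(ch[0])
        out.append([[sum(k[di + 1][dj + 1] * ch[i + di][j + dj]
                         for di in (-1, 0, 1) for dj in (-1, 0, 1)
                         if 0 <= i + di < h and 0 <= j + dj < w)
                     for j in range(w)] for i in range(h)])
    return out


def bn_relu(fmap, scale, shift):
    """Inference-mode BN folded to a per-channel affine map, followed by ReLU."""
    return [[[max(0.0, s * v + b) for v in row] for row in ch]
            for ch, s, b in zip(fmap, scale, shift)]


def pointwise(fmap, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output channel o."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [[[sum(wo[c] * fmap[c][i][j] for c in range(len(fmap)))
              for j in range(w)] for i in range(h)] for wo in weights]


def eucb(fmap, dw_kernels, scale, shift, pw_weights):
    x = upsample2x(fmap)
    x = depthwise3x3(x, dw_kernels)
    x = bn_relu(x, scale, shift)
    return pointwise(x, pw_weights)


identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
x = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.0], [1.0, 2.0]]]   # C=2, 2x2
y = eucb(x, [identity, identity], scale=[1.0, 1.0], shift=[0.0, 0.0],
         pw_weights=[[1.0, 1.0]])                            # 2 channels -> 1 channel
```

In this example the spatial size doubles from 2 × 2 to 4 × 4 while the final pointwise convolution reduces two channels to one, mirroring how the block aligns both resolution and channel count with the next stage.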
3.5. CPCA
The CPCA module is added to the YOLO11 head network to further enhance the model’s ability to detect small-target defects. CPCA is a channel-prioritized convolutional attention module that forms spatial attention using multi-scale depthwise-separable convolutional modules. It dynamically allocates weights in both channel and spatial dimensions, enhancing the representation of small-target-related features and improving detection accuracy. Introducing this channel-prior mechanism ensures more reasonable attention allocation, allowing the network to better select feature channels relevant to small targets, thus improving detection performance. Additionally, applying CPCA across different levels of feature maps further refines the small-target features and boosts detection accuracy.
The structure of the channel-prioritized convolutional attention module is shown in Figure 7. Like CBAM, it applies channel attention first and spatial attention second. Given an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$, the channel attention module first infers a one-dimensional channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$. $M_c$ is multiplied element by element with the input feature $F$, with the channel attention values broadcast along the spatial dimensions, to obtain the channel-refined feature $F_c \in \mathbb{R}^{C \times H \times W}$. The spatial attention module then processes $F_c$ to generate a three-dimensional spatial attention map $M_s \in \mathbb{R}^{C \times H \times W}$. The final output feature, $\hat{F} \in \mathbb{R}^{C \times H \times W}$, is obtained by multiplying $M_s$ and $F_c$ element by element. The overall attention process can be summarized as follows:
$$F_c = \mathrm{CA}(F) \otimes F, \qquad \hat{F} = \mathrm{SA}(F_c) \otimes F_c$$
where ⊗ denotes element-by-element multiplication, and $\mathrm{CA}$ and $\mathrm{SA}$ denote the channel and spatial attention modules, respectively.
Global average pooling and global maximum pooling operations are used to aggregate the spatial information in the feature map. This aggregation process produces two separate spatial context descriptors. These descriptors are each fed into a shared multilayer perceptron (MLP); the two MLP outputs are combined by element-by-element summation and passed through a sigmoid activation function to compute the weight of each channel, which yields the channel attention map. To reduce parameter overhead, the shared MLP consists of a single hidden layer whose activation size is set to $\mathbb{R}^{C/r \times 1 \times 1}$, where $r$ denotes the reduction ratio. The computation of channel attention can be summarized as follows:
$$\mathrm{CA}(F) = \sigma\left(\mathrm{MLP}\left(\mathrm{AvgPool}(F)\right) + \mathrm{MLP}\left(\mathrm{MaxPool}(F)\right)\right)$$
where $\sigma$ denotes the sigmoid function.
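The channel attention step can be illustrated with a small example. This is a sketch under stated assumptions: the MLP weights are arbitrary placeholder values, the hidden layer uses ReLU and has no bias terms, and the function names are hypothetical. It also shows how the resulting per-channel weights are broadcast over the spatial dimensions.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def shared_mlp(desc, w1, w2):
    """One hidden layer with ReLU; w1 has shape (C/r, C) and w2 has shape (C, C/r)."""
    hidden = [max(0.0, sum(w * d for w, d in zip(row, desc))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(fmap, w1, w2):
    """Per-channel weights from average- and max-pooled descriptors."""
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(max(row) for row in ch) for ch in fmap]
    a, m = shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2)
    return [sigmoid(x + y) for x, y in zip(a, m)]  # sum first, then sigmoid


def apply_channel_attention(fmap, weights):
    """Broadcast each channel weight over the spatial dimensions."""
    return [[[wc * v for v in row] for row in ch] for ch, wc in zip(fmap, weights)]


fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 0.0]],
        [[2.0, 2.0], [2.0, 2.0]],
        [[-1.0, 1.0], [-1.0, 1.0]]]                   # C=4, H=W=2
w1 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]    # hidden size C/r = 2 (r = 2)
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
weights = channel_attention(fmap, w1, w2)
refined = apply_channel_attention(fmap, weights)
```

Each weight lies strictly between 0 and 1, so the module rescales channels rather than removing them; the spatial attention stage then operates on the rescaled feature map.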
To avoid enforcing the same spatial attention map on every channel, it is considered more realistic to dynamically assign attention weights in both channel and spatial dimensions. Figure 7 illustrates the use of depthwise convolution to capture spatial relationships between features, ensuring that inter-channel relationships are preserved while reducing computational complexity.
Spatial attention captures spatial features at different scales through depthwise-separable convolutional blocks of various sizes. Then, following the spatial convolution, the features are further refined through a 1 × 1 convolution to achieve channel mixing. Finally, the weights of channel and spatial attention are applied to the original features to realize the weighting and reconstruction of the features. The calculation of spatial attention can be described as follows:
$$\mathrm{SA}(F) = \mathrm{Conv}_{1 \times 1}\left(\sum_{i=0}^{3} \mathrm{Branch}_i\left(\mathrm{DwConv}(F)\right)\right)$$
where $\mathrm{DwConv}$ denotes the depthwise convolution, $\mathrm{Branch}_i$ denotes the $i$th branch, and $\mathrm{Branch}_0$ is the identity connection. In each branch, a pair of depthwise strip convolutions approximates a standard depthwise convolution with a large kernel. The size of the convolution kernel is different for each branch to capture multi-scale information.
5. Conclusions
In this paper, a UMSFPN is proposed, which not only improves the accuracy of small target detection but also maintains a low computational overhead, making it suitable for complex practical application environments. Meanwhile, the RG-ELAN feature extraction module is designed to reduce the computational overhead using the reparameterization technique while significantly enhancing the detection of small targets. The introduced AIFI module improves the feature interaction capability and fine-grained feature expression of the model, and the CPCA attention mechanism further enhances the model’s sensitivity to features of targets of different sizes, significantly improving the small target detection accuracy. On this basis, the YOLO-UMS PCB surface defect detector is constructed. Without relying on additional data for pre-training, YOLO-UMS achieves 42.6% AP and 84.0% AP50 with the same amount of computation, striking a better balance between speed and accuracy and making it more suitable for PCB surface defect detection than other detectors.
The experimental results show that YOLO-UMS delivers a significant performance improvement on the self-collected PCB-M dataset, improving AP and AP50 by 3.2% and 6.4%, respectively, compared to the baseline YOLO11. In addition, UMSFPN performs well across different models and datasets: it improves YOLO v5 and YOLO v9t by 1.2% AP and 0.7% AP, respectively, and yields gains of 0.5% AP and 0.9% AP on the PCB-B and PCB-D datasets, respectively. These results validate the superiority and broad applicability of YOLO-UMS in detecting surface defects on PCBs. At the same time, YOLO-UMS has good generalization ability and can provide efficient and accurate inspection results in various complex scenarios.
Although YOLO-UMS has achieved good results on multiple datasets, its robustness under extreme environments (e.g., severe occlusion and intense illumination changes) still needs further verification. Future research could focus on further optimizing the inference speed in higher resolution and complex scenes, especially for applications on low-power devices. In addition, dynamic model tuning strategies based on environment adaptive mechanisms will be explored to ensure the best performance of YOLO-UMS in various application scenarios. Finally, YOLO-UMS can be further optimized to support better real-time detection and adapt to different hardware platforms (e.g., embedded devices and mobile devices).