1. Introduction
In recent years, few-shot object detection (FSOD) has emerged as a research hotspot in the field of computer vision, aiming to address object detection challenges in scenarios with scarce annotated data. Traditional object detection algorithms primarily rely on large amounts of annotated data for supervised learning [1]. However, in specialized domains, acquiring extensive labeled data is often costly and impractical. To tackle this issue, researchers have proposed various few-shot object detection algorithms based on transfer learning [2] or meta-learning [3,4] strategies. In natural scene object detection tasks, few-shot object detection technology has achieved remarkable progress. Through meta-learning-based or metric learning-based approaches, effective knowledge transfer from base classes to novel classes can be achieved, enabling few-shot object detection models to recognize new category targets with only minimal annotations. These few-shot object detection methods typically depend on rich feature diversity and uniformly distributed samples to facilitate knowledge transfer. As a result, they are particularly well suited for natural image scenes where objects exhibit significant variations yet maintain structural consistency. These approaches have demonstrated strong generalization capabilities on natural scene datasets such as COCO [5] and Pascal-VOC [6].
However, in the context of practical industrial applications, a significant research gap persists between the current few-shot object detection techniques and PCB defect detection requirements. Conventional few-shot methods prove inadequate in addressing the characteristic challenges of PCB defects—including low contrast, microscopic scales, and complex background noise—while traditional PCB surface inspection methodologies remain heavily dependent on large annotated datasets, fundamentally failing to overcome the scarcity of defect samples in real-world production environments. Existing research has yet to effectively resolve critical challenges such as domain adaptation needs, real-time processing constraints, and sensitivity to microscopic defects in manufacturing settings. Consequently, our research aims to bridge this gap by advancing innovative applications of few-shot learning theory in industrial vision systems, thereby enhancing quality control standards for electronic products.
Specifically, current few-shot object detection methods face two critical challenges when applied to PCB surface defect detection tasks. The first issue is the insufficient diversity of PCB defect features. Different types of defects on PCB surfaces may exhibit only subtle morphological and textural differences, while defects of the same category demonstrate high morphological consistency across different locations or scales [7]. This results in weak inter-class variations and limited intra-class diversity. In few-shot scenarios, the limited annotated samples cannot adequately cover these fine-grained variations, making it difficult for models to learn discriminative feature representations, which significantly reduces the defect classification performance. The second challenge arises from the high-density circuit patterns and repetitive metal textures on PCB surfaces, which introduce substantial background noise [8]. This noise not only shares confusing similarities with genuine defects in local features but also interferes with the candidate box generation of the region proposal network (RPN), leading to numerous false positives or missed detections and reducing the model’s generalization capabilities. Under few-shot conditions, models struggle to learn background noise suppression from limited samples, which further raises the false detection rate and keeps the detection accuracy below the stringent requirements of industrial quality inspection. Existing general-purpose few-shot object detection algorithms therefore exhibit significant limitations when applied to PCB defect detection. The primary manifestation is that natural image-oriented models lack prior knowledge of the regular texture patterns of PCBs, which substantially reduces the discriminative power of few-shot features under complex background interference.
To address the aforementioned two limitations, researchers have proposed several relevant approaches. Hsiao et al. [9] proposed a ResNet-SE-CBAM Siamese network that integrates residual modules, squeeze-and-excitation modules, and convolutional block attention modules. This architecture enhances feature extraction capabilities through attention mechanisms and employs metric learning to enable category expansion without retraining. Combined with SSIM-based sample selection and high-defect-rate training strategies, their method focuses on optimizing feature representation through attention mechanisms, thereby complementing the multi-scale defect-aware prototype network proposed in the following section. In this paper, we propose a second-order meta-learning-based few-shot object detection algorithm for PCB surface defect inspection, which achieves query–support feature fusion and cross-domain knowledge transfer through a dual-path Transformer architecture. Specifically, the current meta-learning approaches predominantly adopt the Faster R-CNN framework, whose detection performance heavily relies on the accuracy of region proposal generation. However, region proposals tend to overemphasize local texture features, making it difficult for the model to learn comprehensive foreground representations of PCB surface defects when novel classes are few and intra-class variations are limited. Therefore, this paper introduces a deformable Transformer structure as the fundamental framework for object detection, thereby mitigating the issues of excessive local feature dependency and error propagation in target localization that are inherent to region proposal-based approaches. Extending this approach, to address the challenge of insufficient intra-class sample diversity for novel categories in few-shot PCB defect detection tasks, we propose a multi-order meta-enhanced prototype (MOMP) network. This framework employs meta-learning strategies for shallow feature enhancement, dynamically optimizing support features using globally filtered query features.
The method effectively resolves the limitation of conventional single-feature-constructed support prototypes in comprehensively capturing target diversity and complexity, which often restricts their guidance capabilities for query information. Furthermore, to tackle the few-shot defect foreground generalization problem under high-density PCB background noise interference, we present the dual-prototype guided Transformer (DPGT) method. By designing a parallel architecture within the Transformer encoder that generates noise-invariant features through support set prototypes, our approach achieves robust cross-sample knowledge transfer from support to query information while resisting interference. Unlike existing multi-level methods that rely on the simple concatenation of hierarchical features, the second-order meta-learning proposed in this paper achieves deep semantic alignment between support and query sets through gradient-level optimization, overcoming the limitations of traditional methods in representing fine-grained defects in few-shot scenarios. Compared to attention-based approaches that focus solely on feature re-weighting in the spatial or channel dimensions, our method explicitly models cross-sample semantic relationships through task-level meta-optimization strategies.
To validate the effectiveness of the proposed algorithm, we employed two PCB surface defect datasets, DeepPCB [10] and DsPCBSD+ [11], as experimental datasets for training and accuracy evaluation. During the experimental setup phase, we performed few-shot grouping on these two PCB defect datasets. For the DeepPCB dataset, we selected four categories as base classes and another two categories as novel few-shot classes. For the DsPCBSD+ dataset, we chose five categories as base classes and four additional categories as novel few-shot classes. After completing the few-shot dataset partitioning, we conducted model training and evaluation based on this few-shot dataset, comparing the proposed algorithm with meta-learning-based algorithms such as Meta-FRCNN. The results show that our SM-FSOD algorithm achieved a 9.8–14.0% improvement in the AP50 metric on the DeepPCB dataset. Similarly, on the DsPCBSD+ dataset, the AP50 of SM-FSOD still surpassed that of Meta-FRCNN by 1.7–5.1%. Experimental results demonstrate that, compared with the transfer learning-based TFA [12] and meta-learning-based algorithms like Meta-FRCNN [13] and FSOR-SR [14], SM-FSOD exhibits superior detection accuracy for few-shot categories. Moreover, even when the number of few-shot categories increases, SM-FSOD maintains stable accuracy.
Based on the aforementioned research background, the structure of this paper is organized as follows.
Section 2 summarizes prevailing general frameworks for PCB surface defect detection and systematically analyzes the limitations of two mainstream few-shot object detection methods in PCB defect identification tasks.
Section 3 proposes a few-shot object detection algorithm based on a second-order meta-learning strategy.
Section 4 validates the performance of the proposed method through systematic comparative experiments and ablation studies on multiple domain-specific PCB surface defect datasets. Finally, the research achievements are summarized and future research directions are outlined. The core contributions of this research are as follows:
- (1) We conducted few-shot data partitioning on two PCB defect datasets and developed a second-order meta-learning framework specifically tailored for the characteristic features of PCB surface defects.
- (2) We proposed the MOMP network to enhance shallow support features via a meta-learning strategy, significantly improving foreground diversity for novel defect categories.
- (3) We developed the DPGT network utilizing a parallel Transformer to generate noise-resistant features, effectively suppressing background interference and enabling stable few-shot PCB surface defect detection.
3. SM-FSOD: A Meta-Learning Approach for Few-Shot Object Detection
In this section, we present the theoretical foundations of second-order meta-learning algorithms. Building upon this meta-learning framework, we introduce the overall architecture of our proposed Second-Order Meta-Learning for Few-Shot Object Detection (SM-FSOD) algorithm. Subsequently, we provide detailed descriptions of the two novel meta-learners incorporated into our approach, explaining their mechanisms in enabling few-shot category knowledge transfer. This systematic presentation comprehensively elucidates the algorithmic logic underlying our proposed solution.
3.1. Second-Order Meta-Learning Algorithms
Meta-learning algorithms applied to few-shot object detection tasks in natural scenarios are typically constructed only at the shallow feature level extracted by the backbone network, guiding the learning of few-shot query information under simple support feature prototypes. As shown in Part A of Figure 2, this generic first-order meta-learning algorithm exhibits several structural limitations when applied to PCB surface defect detection tasks.
The fundamental challenge in PCB defect detection lies in the characteristic manifestation of defects where inter-category variations appear as only subtle morphological–textural differences, while intra-category defects exhibit remarkable consistency across spatial locations and scales. This intrinsic property creates significant limitations in few-shot learning scenarios: prototype features generated through the simplistic random sampling of support instances inherently lack query-specific adaptability, leading to insufficient correlation between support prototypes and query features. Consequently, the conventional approach of single-level matching confined to shallow convolutional feature spaces proves inadequate in capturing the hierarchical semantic representations of few-shot PCB defects, ultimately failing to model the complex interdependencies between query and support samples in few-shot PCB defect detection tasks.
Furthermore, the high-density PCB circuit background introduces compounded interference issues. At the few-shot feature metric level, shallow feature-based matching mechanisms exhibit excessive sensitivity to background noise, resulting in low confidence scores for certain critical defect features. Meanwhile, conventional region proposal networks generate numerous low-quality candidate boxes, which amplify the class confidence of background noise. This dual interference mechanism makes it particularly challenging to capture fine-grained information about novel few-shot PCB defect foregrounds, ultimately leading to classification confusion between few-shot defect categories and background negative samples.
Figure 2 presents a comparative analysis of failure and success cases in meta-learning using class activation mapping (CAM) heatmaps. Part A displays failure cases of first-order meta-learning along with their corresponding support prototypes. The CAM heatmaps demonstrate that these prototypes fail to effectively capture distinctive features of the few-shot sample categories, leading to inaccurate feature representations and poor generalization to novel classes. In contrast, Part B illustrates successful cases of second-order meta-learning with their respective support prototypes. The accompanying CAM heatmaps reveal that second-order methods can effectively identify and leverage critical features to construct more discriminative prototypes, thereby enabling successful adaptation to new tasks. The side-by-side CAM visualization clearly demonstrates the significant advantage of second-order methods over first-order approaches in generating more effective and meaningful prototype representations for few-shot learning scenarios.
To alleviate the aforementioned issues, this paper designs a second-order meta-learning strategy, as illustrated in the second-order part (Part B) of Figure 2. Through its two-stage learning mechanism, this strategy mitigates the bias caused by imbalanced sample distributions, as shown in Algorithm 1 below:
Algorithm 1 Second-Order Meta-Learning Algorithm
- 1: Phase 1: First-Order Meta-Learning (Adaptive Prototype Generation)
- 2: Extract support set features
- 3: Extract query set features
- 4: for each query feature do
- 5:     Feature modulation
- 6:     Prototype generation function
- 7: end for
- 8: Phase 2: Second-Order Meta-Learning (Semantic Alignment)
- 9: for each prototype–query pair do
- 10:     Semantic alignment
- 11:     Classification prediction
- 12: end for
In the first-order meta-learning phase, the features of query samples dynamically modulate the feature distribution of support samples via a meta-learner, generating more representative support prototypes. This process not only integrates the internal structural relationships of the support samples but also incorporates similarity metrics between query and support samples, enabling prototype generation to adaptively focus on the critical features of minority-class defects. In the second-order meta-learning phase, the refined support prototypes guide the deep semantic mining of query samples. At this stage, a second meta-learner aligns the support prototypes with the deep features of query samples, enhancing the distinction between defects and the background in a higher-dimensional semantic space and thereby improving the recognition of rare defect samples. Furthermore, the strategy applies data augmentation methods such as cropping and random rotation to input samples to increase the diversity of defect samples.
The corresponding experimental results indicate that this two-stage meta-learning algorithm significantly improves the object detection algorithm’s ability to perceive few-shot PCB defect features, thereby enhancing the detection accuracy of the meta-learning algorithm for PCB defect queries.
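The two phases described above can be illustrated with a minimal NumPy sketch. It is not the SM-FSOD implementation: the learned meta-learners are replaced here by parameter-free cosine-similarity weighting, and the function names `adaptive_prototype` and `align_and_classify` are hypothetical, introduced only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def adaptive_prototype(support, query):
    """Phase 1 (assumed form): query-modulated prototype for one class.

    support: (K, C) support feature vectors of the class.
    query:   (M, C) query feature vectors.
    """
    query_summary = query.mean(axis=0)                           # (C,)
    sim = l2_normalize(support) @ l2_normalize(query_summary)    # (K,)
    weights = softmax(sim)                                       # (K,), sums to 1
    return weights @ support                                     # (C,)


def align_and_classify(prototypes, query):
    """Phase 2 (assumed form): score every query vector against each prototype.

    prototypes: (num_classes, C); query: (M, C).
    Returns class probabilities of shape (M, num_classes).
    """
    logits = l2_normalize(query) @ l2_normalize(prototypes).T
    return softmax(logits, axis=-1)
```

In this simplified form, support vectors that resemble the query summary contribute more to the prototype, which mirrors the idea of query features modulating the support distribution before the alignment step.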
3.2. Framework Overview of the SM-FSOD Algorithm
Based on the aforementioned second-order meta-learning idea, this paper constructs a few-shot object detection network based on the Detection Transformer algorithm, as illustrated in Figure 3. By incorporating the second-order meta-learning strategy, it fully establishes deep-level correlations between support samples and query samples, thereby significantly enhancing the algorithm’s performance in few-shot PCB surface defect detection tasks.
Within the overall framework, we employ ResNet-50 as the backbone network. This selection is primarily based on the following reasons. First, the widespread adoption of ResNet in the computer vision field ensures comparability with existing state-of-the-art (SOTA) methods, enabling a fair performance evaluation of our framework. Second, its residual structure provides a powerful representational capacity, allowing the effective extraction and generalization of discriminative features essential for few-shot learning tasks. Third, ResNet has been demonstrated to exhibit excellent compatibility with Transformer encoders, supplying high-semantic-level features suitable for encoder processing.
First, this paper designs a multi-order meta-enhanced prototype (MOMP) network, serving as a first-order meta-learner to achieve the dynamic generation of defect support prototypes. The core of this approach lies in employing multi-dimensional attention filtering to extract the holistic semantic information of few-shot query images; the MOMP network achieves the dynamic semantic aggregation of support prototypes. This process leverages the informative components of the query to guide the generation of support prototypes. Such generation not only focuses on the feature relationships among support samples but also compensates for information gaps caused by insufficient novel class samples through the guidance of global query information. As a result, the generated support prototypes exhibit greater representativeness and adaptability.
To further enhance the guidance of support prototypes for query features, this paper proposes a dual-prototype guided Transformer (DPGT) network based on the Transformer architecture, which serves as a second-order meta-learner. The key innovation lies in the self-attention module, which further aggregates multi-level information from the internal features of support samples to enhance the representation capabilities of support categories. Subsequently, at each encoding layer, support prototypes and query features interact in parallel, with bidirectional feature updating achieved through cross-attention mechanisms, thereby better capturing complex semantic relationships between support and query samples. The DPGT structure not only effectively utilizes support prototypes to guide the representation of query features but also models the similarities and differences between support and query samples in deep feature spaces, improving the algorithm’s precision in target localization and classification.
The second-order meta-learning-based object detection network introduces global information and deep-level feature interactions, making the generation of PCB surface defect prototypes more dynamic and flexible while significantly enhancing the feature representation capabilities for query samples of novel defect categories. Experimental results on two PCB surface defect datasets demonstrate that the network achieves substantial performance improvements across multiple few-shot object detection benchmarks, fully validating its effectiveness and generalizability.
Our SM-FSOD model employs the Hungarian matching loss function, which constructs a matching cost matrix and then determines the optimal assignment between the predicted and ground truth bounding boxes using the Hungarian algorithm. This approach not only considers class matching but also optimizes the localization accuracy through the L1 loss and generalized intersection over union (GIoU) loss. The overall loss function is defined as follows.
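First, let $y = \{y_i\}$ denote the set of ground truth objects, where each $y_i = (c_i, b_i)$ consists of a class label $c_i$ and a bounding box $b_i$. Since the number of actual targets is usually much smaller than $N$, the set $y$ is padded with “no-object” placeholders ($\varnothing$) to match the size of the predicted set $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$. The goal is to find an optimal permutation $\sigma$ over the predicted set that minimizes the total matching cost with the ground truth set. This is formally defined as
$$\hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}\left(y_i, \hat{y}_{\sigma(i)}\right).$$
The matching cost $\mathcal{L}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)})$ for each pair is defined as
$$\mathcal{L}_{\mathrm{match}}\left(y_i, \hat{y}_{\sigma(i)}\right) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\left(b_i, \hat{b}_{\sigma(i)}\right).$$
This cost function consists of two parts: a classification confidence term $\hat{p}_{\sigma(i)}(c_i)$ and a bounding box regression loss $\mathcal{L}_{\mathrm{box}}(b_i, \hat{b}_{\sigma(i)})$. Both terms are applied only when the ground truth target $y_i$ is valid (i.e., $c_i \neq \varnothing$). When $c_i = \varnothing$, the total matching cost becomes zero due to the indicator function, thereby avoiding penalization on background regions and reducing the influence of false positives. Once the optimal assignment $\hat{\sigma}$ is determined, the overall loss function is computed as
$$\mathcal{L}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right) \right].$$
The bounding box loss $\mathcal{L}_{\mathrm{box}}$ combines the intersection over union (IoU) loss and the L1 distance as follows:
$$\mathcal{L}_{\mathrm{box}}\left(b_i, \hat{b}_{\sigma(i)}\right) = \lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{iou}}\left(b_i, \hat{b}_{\sigma(i)}\right) + \lambda_{L1} \left\lVert b_i - \hat{b}_{\sigma(i)} \right\rVert_1.$$
Here, $\lambda_{\mathrm{iou}}$ and $\lambda_{L1}$ are hyperparameters controlling the trade-off between box overlap and coordinate accuracy. The IoU loss encourages better spatial alignment between the predicted and ground truth boxes, while the L1 loss ensures precise localization. To handle class imbalance, especially the abundance of background predictions, we assign a reduced loss weight to samples with $c_i = \varnothing$ when computing the final loss. Additionally, we empirically find that using raw predicted probabilities (instead of log-probabilities) in the matching cost function leads to better convergence and performance.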
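The matching step described above can be made concrete with a small, dependency-free sketch. Several simplifications are assumptions of this example rather than details from the paper: an exhaustive search over permutations stands in for the Hungarian algorithm (practical only for a handful of predictions), plain IoU is used instead of the generalized variant, and the weights `w_iou=2.0` and `w_l1=5.0` are placeholder values. All function names are hypothetical.

```python
from itertools import permutations


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def iou_loss(a, b):
    """1 - IoU for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return 1.0 - (inter / union if union > 0 else 0.0)


def box_loss(gt_box, pred_box, w_iou=2.0, w_l1=5.0):
    """Weighted sum of the overlap loss and the L1 distance (placeholder weights)."""
    return w_iou * iou_loss(gt_box, pred_box) + w_l1 * l1_distance(gt_box, pred_box)


def match_cost(gt, pred):
    """Cost of pairing one target with one prediction.

    gt is None for a padded "no-object" slot, otherwise (class_id, box).
    pred is (class_probabilities, box). Raw probabilities are used, not logs.
    """
    if gt is None:
        return 0.0
    class_id, gt_box = gt
    probs, pred_box = pred
    return -probs[class_id] + box_loss(gt_box, pred_box)


def best_assignment(targets, predictions):
    """Brute-force search for the permutation with minimal total matching cost."""
    padded = list(targets) + [None] * (len(predictions) - len(targets))
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(predictions))):
        cost = sum(match_cost(gt, predictions[j]) for gt, j in zip(padded, perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost
```

The returned tuple lists, for each (padded) target slot, the index of the prediction assigned to it; padded slots contribute zero cost, so leftover predictions are absorbed as background.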
The specific architectures of the proposed MOMP network and DPGT network within the SM-FSOD framework will be introduced in the following two sections.
3.3. MOMP Block: Multi-Order Meta-Enhanced Prototype Network
Existing meta-learning strategies rely on single feature-level support prototype construction, which proves inadequate for PCB surface defect detection tasks where the intra-class similarity of target regions is high. This limitation hinders stable guidance for few-shot defect query image learning. To enhance the support prototype’s perception capabilities for few-shot defect queries, we propose a multi-order meta-enhanced prototype network (MOMP) as a meta-learner. Through base class data training, the MOMP learns to utilize global coupling information from query images to guide support prototype generation. This enables query-aware prototypes that maintain alignment with targets in query images, as illustrated in Figure 4.
The MOMP consists of two key components: the query global coupling information generation module and the prototype-aware enhancement module. Specifically, the query features $F_q \in \mathbb{R}^{B \times C \times H \times W}$ (where $B$ denotes the batch size, $C$ represents the number of channels, and $H \times W$ indicates the spatial dimensions) are first fused with support features $F_s$ through a 2D convolutional coupling structure to generate a global attention mask $M_g$, which is subsequently utilized for prototype-aware enhancement processing.
The 2D convolutional layers separately extract deep semantic features from both the query and support representations. A tensor dot product is then computed to establish spatial correlations between their high-level semantic features. These correlation scores are normalized via Softmax to enhance regions with strong query–support correspondences in the spatial domain. The resulting tensor, termed the global attention mask, is learnable and enables the generalization of prototype attention during meta-learning training. Subsequently, to capture global contextual information from query features, an average pooling operation is applied. This compresses the spatial dimensions of each channel in the query feature tensor into scalar values, producing a channel-wise global representation vector of the query $g_q \in \mathbb{R}^{B \times C}$, formulated as
$$g_q(b, c) = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} F_q(b, c, h, w).$$
Since average pooling preserves smooth spatial contextual information, it is better suited for defect regions with blurry boundaries. Moreover, when handling sparse defect patterns, average pooling avoids the potential gradient sparsity issues associated with max pooling. Thus, in our MOMP module, we adopt the strategy of adaptive average pooling combined with spatial-wise concatenation.
The support spatial mask construction employs the cosine distance metric (CDM) as the similarity measure between vectors. At the channel level, it performs pixel-wise comparison between the spatial support features and the query’s global information to identify regions of strong spatial correspondence between support features and query information. This support spatial mask (denoted as Condition) serves as the basis for subsequent prototype-aware enhancement. In the final prototype-aware enhancement stage, the derived global information mask and support spatial mask undergo pixel-wise multiplication to generate an attention-enhanced mask. This mask acts as a spatial increment, effectively strengthening the support prototypes through the integration of query global information.
We adopt cosine similarity primarily due to its amplitude invariance. In few-shot learning scenarios, the feature magnitudes across different episodes may vary significantly due to the non-stationary distribution of meta-testing tasks. By focusing on the angular direction of feature vectors rather than their absolute magnitudes, cosine similarity more robustly handles intra-class variation and feature scale imbalance.
Through this approach, the meta-learning framework achieves guidance for support prototype generation via extracted query information, enhancing the support features’ perception capabilities for few-shot PCB defect categories.
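The mask construction described in this section can be sketched in NumPy for a single support/query pair. This is only an illustrative reading of the text: the learned 2D convolutions are omitted, the query–support correlation is summarized per support pixel by a simple mean before the softmax, and the names `global_attention_mask`, `cosine_spatial_mask`, and `enhance_support` are hypothetical.

```python
import numpy as np


def global_average_pool(feat):
    """(C, H, W) -> (C,): channel-wise mean over all spatial positions."""
    return feat.mean(axis=(1, 2))


def cosine_spatial_mask(support, query_vec, eps=1e-8):
    """Cosine similarity between each support pixel vector and the query vector.

    support: (C, H, W); query_vec: (C,). Returns an (H, W) map in [-1, 1].
    """
    num = np.einsum("chw,c->hw", support, query_vec)
    den = np.linalg.norm(support, axis=0) * np.linalg.norm(query_vec) + eps
    return num / den


def global_attention_mask(support, query):
    """Softmax-normalized support-query correlation (learned convs omitted)."""
    c, h, w = support.shape
    s = support.reshape(c, h * w)            # (C, HW_support)
    q = query.reshape(c, -1)                 # (C, HW_query)
    score = (s.T @ q).mean(axis=1)           # (HW_support,)
    score = np.exp(score - score.max())
    return (score / score.sum()).reshape(h, w)


def enhance_support(support, query):
    """Multiply both masks and apply the result as a spatial increment."""
    mask = global_attention_mask(support, query) * cosine_spatial_mask(
        support, global_average_pool(query)
    )
    return support * (1.0 + mask[None, :, :])
```

Because the cosine mask depends only on direction, rescaling the support features by a constant leaves it unchanged, which is the amplitude invariance the text appeals to.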
3.4. DPGT Block: Second-Order Meta-Learner with Dual-Prototype Guided Transformer
Conventional few-shot object detection algorithms typically construct support prototypes only from shallow feature semantics in their meta-learners to guide few-shot query learning. While such approaches can handle few-shot problems to some extent, the limited representational capacity of shallow features often captures only intra-class local consistency, failing to effectively model higher-order inter-class relationships. For PCB surface defect detection in few-shot settings, subtle distinctions between positive samples and complex background classes may not be distinguishable through shallow features alone. To address these limitations, we propose a second-order meta-learner with a dual-prototype guided Transformer (DPGT). The architecture is illustrated in Figure 5.
This algorithm is implemented through support category prototype computation and meta-learning training. Specifically, in the support branch, a reconstructed Transformer encoder integrates input features with bounding box information to generate support prototypes for each category. Let the support features be denoted as $X_s \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ represents the number of support images. The deep semantic representation of support information is computed through a residual-connected self-attention mechanism, formulated as
$$\hat{X}_s = X_s + \mathrm{Attn}(X_s),$$
where:
$$\mathrm{Attn}(X_s) = \mathrm{softmax}\left(\frac{(X_s W_Q)(X_s W_K)^{\top}}{\sqrt{d}}\right) X_s W_V.$$
Here, $X_s$ denotes the input support features; $W_Q$, $W_K$, and $W_V$ are the learnable projection matrices for queries, keys, and values, respectively; and $d$ is the feature dimension scaling factor.
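As a minimal sketch of a residual-connected self-attention step of this kind, the following NumPy function operates on support features that are assumed to be already flattened into a sequence of `T` tokens of dimension `d`. The single-head form, the absence of normalization layers, and the function name are assumptions of this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def residual_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with a residual connection.

    x: (T, d) support tokens; w_q, w_k, w_v: (d, d) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (T, T), each row sums to 1
    return x + attn @ v                              # residual connection
```

The residual term means the block can only add context to the input features; with a zero value projection it reduces to the identity.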
Subsequently, the foreground regions of the support features are extracted using the annotated positional information and encoded into corresponding category prototypes. This operation, performed during the support feature processing stage, extracts the class-specific features from each support bounding box and generates the encoded prototypes of the support categories via a Sigmoid function. This effectively captures fine-grained details in the support features and produces discriminative category prototypes (represented as a set of vectors), which are then used in the meta-learner to guide the learning of the query branch.
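One simple way to realize this box-guided prototype extraction is sketched below. The text does not state how features inside a box are aggregated, so mean pooling over the box region is an assumption here, as are the integer feature-map box coordinates and the function names.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def box_prototype(feature_map, box):
    """Encode the features inside one annotated box into a prototype vector.

    feature_map: (C, H, W); box: (x1, y1, x2, y2) in feature-map cells,
    with x2 and y2 exclusive. Mean pooling is an assumed aggregation.
    """
    x1, y1, x2, y2 = box
    region = feature_map[:, y1:y2, x1:x2]
    return sigmoid(region.mean(axis=(1, 2)))


def class_prototypes(feature_maps, boxes, labels):
    """Average the per-box prototypes of each category into one vector per class."""
    grouped = {}
    for fmap, box, label in zip(feature_maps, boxes, labels):
        grouped.setdefault(label, []).append(box_prototype(fmap, box))
    return {label: np.mean(protos, axis=0) for label, protos in grouped.items()}
```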
The query branch follows the original encoder structure, directly encoding the query features into deep semantic information vectors. Finally, a meta-learner is constructed between the query information and the support category prototype vectors. This enables task learning to guide the relevant parameters of few-shot query information learning through the category prototype vectors. To ensure the global significance of the features, the meta-learner retains the structure of attention computation. A learnable tensor
W is introduced to construct a learnable reinforcement metric between the encoded support categories and queries, as described by the following formula:
The final step involves reasoning with the coupled query vectors through a feed-forward neural network (FFN) layer:
By modeling the meta-learner on deep semantic features, the Transformer query encoding can capture more complex contextual information and higher-order relationships between novel class foregrounds and complex backgrounds and adaptively learn support prototypes. This effectively enhances the accuracy of few-shot learning for PCB surface defects and the model’s category generalization capabilities. Additionally, it mitigates the confusion problem between few-shot category foregrounds and complex PCB backgrounds caused by the limited representation capacity of shallow features and the difficulty in capturing deep semantic information of categories.
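To make the coupling between query vectors and category prototypes more tangible, the sketch below shows one plausible form of an attention-style meta-learner with a learnable metric, followed by a feed-forward layer. The exact formulation used in DPGT is not reproduced here: the bilinear score, the residual connections, the ReLU activation, and all names are assumptions of this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def prototype_guided_update(queries, prototypes, w_metric):
    """Couple query vectors with category prototypes through a learnable metric.

    queries: (N, d); prototypes: (K, d); w_metric: (d, d), assumed learnable.
    """
    d = queries.shape[-1]
    scores = queries @ w_metric @ prototypes.T / np.sqrt(d)   # (N, K)
    attn = softmax(scores, axis=-1)                           # rows sum to 1
    return queries + attn @ prototypes                        # residual coupling


def feed_forward(x, w1, b1, w2, b2):
    """Two-layer FFN with ReLU and a residual connection."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return x + hidden @ w2 + b2
```

With a single prototype the attention weights are all one, so every query vector is simply shifted by that prototype; with several prototypes the metric decides which category guides each query vector.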
4. Experiments and Analysis
In this section, we present experiments on few-shot subsets of two PCB defect datasets to train and evaluate the proposed few-shot object detection (FSOD) algorithm. The performance of SM-FSOD is benchmarked against that of several state-of-the-art FSOD methods to assess its detection capabilities for few-shot PCB defects. Additionally, we evaluate the model’s real-time performance and training stability, with visualizations demonstrating its practical application potential.
4.1. Description of the Dataset
We utilized two PCB defect datasets to train the proposed object detection algorithm and evaluate the detection accuracy of the presented SM-FSOD algorithm. These two datasets are the DsPCBSD+ dataset and the DeepPCB dataset.
The DsPCBSD+ dataset is a publicly available benchmark dataset developed specifically for printed circuit board (PCB) defect detection tasks, demonstrating significant research value in industrial inspection applications. This comprehensive dataset consists of 10,259 low-definition industrial field images with an average resolution of 226 × 226 pixels, captured using a multi-angle lighting system to ensure optimal coverage of nine critical PCB defect categories, namely SP (spur), SC (spurious copper), OP (open), MB (mouse bite), HB (hole breakout), CS (conductor scratch), CFO (conductor foreign object), BMFO (base material foreign object), and SH (short), as shown in Figure 6.
Each defect instance is meticulously annotated in standardized COCO-format JSON files, including precise bounding box coordinates, categorical labels, visibility indicators for occlusion cases, and three-tiered difficulty classifications. The dataset is strategically partitioned into 8192 training images, 2048 validation images, and 2606 test images, with the detailed defect size distribution showing 42.7% small targets (<32 × 32 pixels), 38.1% medium targets (32 × 32–96 × 96 pixels), and 19.2% large targets (>96 × 96 pixels).
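A size distribution of this kind can be computed from COCO-style annotations with a few lines of Python. The sketch assumes the usual COCO convention of bucketing by box area with thresholds of 32 × 32 and 96 × 96 pixels and the COCO `bbox` layout `[x, y, width, height]`; the helper names are hypothetical.

```python
from collections import Counter


def size_bucket(width, height):
    """Bucket a box by area using assumed COCO-style thresholds."""
    area = width * height
    if area < 32 * 32:
        return "small"
    if area <= 96 * 96:
        return "medium"
    return "large"


def size_distribution(annotations):
    """Fraction of small/medium/large boxes in a list of COCO-style annotations."""
    counts = Counter(size_bucket(a["bbox"][2], a["bbox"][3]) for a in annotations)
    total = sum(counts.values()) or 1   # avoid division by zero on empty input
    return {name: counts[name] / total for name in ("small", "medium", "large")}
```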
In contrast to the low-definition DsPCBSD+ dataset, the DeepPCB dataset is designed to provide high-quality training data for deep learning models. The dataset consists of 1500 image pairs, each comprising a defect-free template image and a corresponding test image with defects. These images were captured using a linear-scan CCD camera with a resolution of 48 pixels per millimeter. The images were first cropped into 640 × 640-pixel sub-images and aligned using template matching techniques to minimize preprocessing effort. Subsequently, manual annotation was performed to label the locations and types of defects in each test image, covering six common types of PCB defects, as illustrated in Figure 7.
To conduct few-shot object detection experiments, the original dataset needs to be processed and divided into different splits. Each split consists of base classes and few-shot novel classes, where the base classes retain their original sample sizes, while the novel classes undergo few-shot processing. Specifically, the novel classes are adjusted to contain 5-shot, 10-shot, 30-shot, and 50-shot samples for few-shot experiments. For the DsPCBSD+ dataset, 5 classes are retained as base classes, while the remaining 4 classes are treated as novel classes. Similarly, for the DeepPCB dataset, 4 classes are kept as base classes, and the other 2 classes are designated as novel classes. The specific class assignments for different splits are presented in
Table 1 and
Table 2.
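The split construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: K is counted in annotated instances per novel class, the `build_few_shot_split` helper is hypothetical, and the class assignment in the example is arbitrary rather than the assignment of Table 1 or Table 2.

```python
import random
from collections import Counter, defaultdict


def build_few_shot_split(annotations, novel_classes, k, seed=0):
    """Keep every base-class annotation; keep at most k per novel class."""
    rng = random.Random(seed)  # fixed seed so the split can be regenerated
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann["category"]].append(ann)

    kept = []
    for category in sorted(by_class):  # sorted for a deterministic order
        items = by_class[category]
        if category in novel_classes:
            kept.extend(rng.sample(items, min(k, len(items))))
        else:
            kept.extend(items)
    return kept


# Hypothetical annotations: one base class (SP) and two novel classes.
labels = ["SP"] * 20 + ["SH"] * 20 + ["OP"] * 3
annotations = [{"id": i, "category": c} for i, c in enumerate(labels)]

split = build_few_shot_split(annotations, novel_classes={"SH", "OP"}, k=5, seed=5)
print(Counter(a["category"] for a in split))
```

Calling the helper with a different seed for each trial would give one way to obtain the repeated sample selections used when averaging results.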
4.2. Description of Experimental Environment and Parameters
The proposed algorithm was implemented based on PyTorch 1.7.1 and the Detectron2 framework, with all model training and accuracy testing conducted on a parallel computing platform equipped with NVIDIA L40 GPUs using the CUDA parallel computing framework version 12.0. During the model training phase, we employed the AdamW algorithm as the optimizer and implemented a learning rate warm-up period for the first 100 iterations to stabilize initial training. Specific training parameters were configured as follows: in the base training phase, the input batch size was set to 8, while, during the few-shot meta-learning phase, it was increased to 16 to facilitate rapid model generalization to novel classes.
For the two differently scaled datasets, we established distinct model optimization strategies. For base training on the DsPCBSD+ dataset, each input batch was processed as one training step, with a total of 60,000 optimization steps performed; correspondingly, the base class training on the DeepPCB dataset comprised 110,000 optimization steps to ensure coverage of all instances in the dataset. During the base training phase, the initial learning rate was uniformly set to 0.02 with a learning rate scheduling factor of 0.1. Specifically, for the DsPCBSD+ dataset, the learning rate was reduced by a factor of 10 at 40,000 and 50,000 iterations, while similar adjustments were made for the DeepPCB dataset at 80,000 and 100,000 iterations.
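The base-training schedule can be written down as a small function. The sketch below assumes a linear warm-up over the first 100 iterations starting from a factor of 0.001, which is a common choice in Detectron2-style schedulers but is not stated in the text; the `learning_rate` function and its parameter names are illustrative.

```python
def learning_rate(iteration, base_lr=0.02, milestones=(40_000, 50_000),
                  gamma=0.1, warmup_iters=100, warmup_factor=0.001):
    """Step-decay learning rate with a linear warm-up (assumed form)."""
    # Multiply by gamma once for every milestone already reached.
    decay = gamma ** sum(iteration >= m for m in milestones)
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters
        warmup = warmup_factor * (1 - alpha) + alpha
    else:
        warmup = 1.0
    return base_lr * decay * warmup


# DsPCBSD+ base training: decay at 40,000 and 50,000 iterations.
for it in (0, 100, 45_000, 55_000):
    print(it, learning_rate(it))

# DeepPCB base training: decay at 80,000 and 100,000 iterations.
print(learning_rate(90_000, milestones=(80_000, 100_000)))
```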
In the novel class meta-learning stage, optimization on the DsPCBSD+ dataset used an initial learning rate of 0.001 for 40,000 fine-tuning steps, with the learning rate reduced tenfold at the 36,000th step. For the DeepPCB dataset, we adopted an initial learning rate of 0.005 for 80,000 optimization steps, reducing the learning rate at the 70,000th step. At test time, the model was configured to retain the 1000 highest-confidence candidate boxes. All testing experiments were performed on the same hardware configuration as training to keep the results consistent, and multiple runs were executed to verify the reproducibility and statistical significance of the findings. The implementation details, including the versions of all dependencies and the hardware specifications, were documented to facilitate replication and validation by the research community.
4.3. Accuracy Comparison of Few-Shot Object Detection Methods
The evaluation method adopted in this paper follows the N-way K-shot few-shot object detection algorithm evaluation paradigm. Here, N represents the number of novel classes in the dataset—for the DsPCBSD+ dataset, N = 4, and, for the DeepPCB dataset, N = 2. K denotes the number of samples per novel class, with values set to 5, 10, 30, and 50. For instance, five-shot means that only five samples per novel class are provided to the model during the meta-learning phase for training. After completing meta-learning on the novel classes, the model weights are used for inference on the test set, and the results are recorded. Due to the limited number of novel class samples, the experimental outcomes may vary depending on the selection of these samples. To mitigate this variability, the evaluation of the SM-FSOD algorithm is averaged over 10 different novel class sample selections. In the few-shot partitioning experiments, we adopted a stratified random sampling strategy to ensure balanced class distribution, with a fixed random seed (seed = 5) to guarantee partitioning reproducibility. Each shot configuration (5-/10-/30-/50-shot) was independently subjected to 10 randomized trials, with the support and query sets regenerated in each trial. The final results are reported in the form of the mean and standard deviation.
We selected several representative object detection algorithms as control groups for precision comparison experiments. Among them, the Two-Stage Fine-Tuning Approach (TFA) algorithm proposed by Wang et al. [
12] is a transfer learning method based on fine-tuning that achieves generalization to few-shot categories by fine-tuning the last layer of the detector on novel classes. To comprehensively evaluate algorithm performance, it is necessary to compare SM-FSOD with several meta-learning algorithms. These include the Few-Shot Object Detection via Feature Reweighting (FSRW) algorithm proposed by Kang et al. [
30], the Meta-Learning-Based Region-Based Convolutional Network (Meta-RCNN) algorithm by Yan et al. [
31], and the Meta-FRCNN algorithm by Han et al. [
13]. These methods all implement interaction between few-shot query information and support information by designing single-stage meta-learners at the feature level within the Faster-RCNN framework.
Table 3 presents the evaluation results on the DsPCBSD+ dataset, where two different novel class splits (split 1 and split 2) are defined. Performance is reported as AP50, the average precision computed at an intersection over union (IoU) threshold of 0.5: a predicted bounding box counts as correct when its IoU with a ground-truth box of the same class is at least 0.5. This metric assesses the model’s generalization ability for few-shot class detection and further measures the detection accuracy of the proposed algorithm.
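For readers less familiar with the metric, the following minimal sketch shows how an AP value at an IoU threshold of 0.5 can be computed for a single class. It is a simplification under stated assumptions: all boxes are treated as belonging to one image, boxes are given as `(x1, y1, x2, y2)`, and all-point interpolation is used instead of the 101-point interpolation of the COCO toolkit. The function names and example boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_thr=0.5):
    """AP for one class; detections is a list of (score, box)."""
    if not ground_truth:
        return 0.0
    matched, tp_flags = set(), []
    # Visit detections from highest to lowest confidence.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground-truth box can be matched once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        is_tp = best_idx is not None and best_iou >= iou_thr
        if is_tp:
            matched.add(best_idx)
        tp_flags.append(is_tp)

    # Precision and recall after each detection in score order.
    precisions, recalls, tp = [], [], 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        tp += is_tp
        precisions.append(tp / rank)
        recalls.append(tp / len(ground_truth))

    # Make precision non-increasing, then integrate it over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap


# Hypothetical boxes: two true defects, three detections (one false alarm).
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (21, 21, 30, 30))]
print(round(average_precision(dets, gt_boxes), 4))  # 0.8333
```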
As shown in
Table 3, the proposed SM-FSOD algorithm achieves promising evaluation results on few-shot classes compared to the current mainstream few-shot object detection methods. Through the performance comparison with several meta-learning-based few-shot object detection algorithms, it can be observed that the proposed SM-FSOD algorithm achieves accuracy improvements ranging from 1.7% to 5.1% in terms of the AP50 metric when compared to the general Meta-FRCNN algorithm. By averaging the precision across different scenarios, the proposed SM-FSOD method demonstrates superior performance over the current state-of-the-art approaches in comprehensive evaluation settings.
To comprehensively evaluate the performance advantages of the proposed algorithm and validate its industrial applicability, we conducted systematic precision comparison experiments on the DeepPCB dataset. As a benchmark dataset in the field of PCB defect detection, it covers six major types of common industrial defects.
Table 4 presents the evaluation results of the SM-FSOD algorithm on the more challenging DeepPCB dataset, assessing the mean detection accuracy for novel classes under the K = 5, 10, 30, and 50 settings. As on the DsPCBSD+ dataset, we adopted the intersection over union (IoU) threshold of 0.5 (AP50) to calculate the average precision for novel classes. Compared with mainstream transfer learning and meta-learning based methods, the SM-FSOD algorithm achieves superior detection results. Specifically, when compared to the Meta-FRCNN meta-learning algorithm, our method achieves improvements of 9.8 percentage points in the 30-shot setting and 14.0 percentage points in the 5-shot setting. The results show that SM-FSOD achieves the best performance among the compared methods on both benchmarks, indicating that the model generalizes well in few-shot learning tasks.
4.4. Algorithm Ablation Study
To thoroughly analyze the contributions of each module to the overall performance of the SM-FSOD algorithm, we conducted ablation experiments on the DsPCBSD+ dataset under the split 2 data partitioning scheme, covering scenarios from 5-shot to 50-shot. The results are presented in
Table 5.
The first column displays the detection results using a traditional meta-learning algorithm with the Detection Transformer. These results indicate that, without modules specifically improved for few-shot tasks, the original algorithm exhibits some overfitting when the number of novel class samples is limited, leading to relatively low average precision.
The second column shows the performance when employing an MOMP network as a single-stage meta-learner. The introduction of this improved meta-learner brings varying degrees of performance enhancement across different few-shot scenarios, particularly achieving a 5.1% improvement in the 10-shot setting compared to the baseline. This demonstrates that, compared to constructing the meta-learner on shallow features, our approach of building the meta-learner on high-level semantic features can more effectively organize support prototypes and guide the learning of few-shot novel classes. The third column represents the results obtained by further introducing the DPGT network on top of the single-stage meta-learning framework, thereby constructing a complete second-order meta-learning object detection algorithm. The results show that the second-order meta-learning framework, enhanced by the GPIB module, achieves superior performance while keeping the parameter increase within an acceptable range.
Further analysis of the ablation experimental results reveals that, when the sample size is sufficient, traditional baseline methods can already achieve relatively saturated feature representations, as evidenced by the baseline reaching 52.5 mAP under the 50-shot setting. In contrast, the second-order meta-learning strategy introduced in our method demonstrates greater advantages in extreme few-shot scenarios, achieving a 6.7 mAP improvement over the baseline in the 10-shot setting, while the enhancement narrows to 4.6 mAP in the 50-shot setting. This confirms that the proposed method is more suitable for industrial inspection scenarios with extremely scarce data.
4.5. Visualization Results of Object Detection
We conduct a visual analysis of the object detection algorithm to gain deeper insights into its behavior and performance. Specifically, we first visualize the prediction results—including detection box locations, category labels, and confidence distributions—to intuitively demonstrate the algorithm’s detection effectiveness across different scenarios, as shown in
Figure 8. The detailed analysis is as follows.
By visualizing the predicted bounding boxes and the corresponding confidence scores for defect targets, we observe two issues with the unimproved algorithm: the model assigns lower confidence to few-shot novel-class instances, and it misses some novel-class objects. This indicates that the baseline algorithm overfits to the few-shot novel-class targets. After integrating the improved MOMP module into the baseline framework, we observe a significant reduction in missed detections for few-shot novel-class targets, along with a more balanced confidence distribution across novel class instances. Furthermore, in Scenario 2, the localization accuracy for novel class foreground targets improves substantially, with the predictions becoming more concentrated on target instances and with much less background interference within the bounding boxes. Our analysis indicates that the model’s missed detections primarily stem from the limited number of support samples for few-shot novel classes. This scarcity prevents the algorithm from adequately learning discriminative features for these categories, resulting in ambiguous representations in the feature space. The introduction of the MOMP module significantly mitigates this issue.
Figure 8 presents the detection results after incorporating the DPGT model to establish the complete second-order meta-learning framework. The results demonstrate that the DPGT approach achieves further suppression of both false background detections and missed detections. This verifies that the GPIB-enhanced first-order meta-learner resolves the classification confusion between background negative samples and novel class samples by leveraging global query information to pre-guide prototype construction, thereby enabling the object detection algorithm to attain superior generalization capabilities for few-shot categories.
These visualization results empirically validate the reliability of the ablation studies and further elucidate the contributions of individual algorithmic modules. The synergistic interplay among multiple modules not only enhances the overall detection performance but also plays a pivotal role in target discrimination under complex scenarios. Furthermore, this paper incorporates attention heatmap visualizations to reveal the algorithm’s focus regions during feature extraction and target localization, as depicted in
Figure 9 and
Figure 10.
Specifically, as shown in
Figure 9, we visualize the heatmaps on the second few-shot split of the DsPCBSD+ dataset. The first and second columns demonstrate that, for SC category targets, without the DPGT module, the model’s attention is scattered and it fails to concentrate on valid foreground regions. After incorporating the DPGT module, the model achieves effective focus on both the morphology and locations of foreground targets. The third column reveals that, when multiple small targets coexist in a single scene, without DPGT, the model cannot adequately attend to all foreground objects. With DPGT integration, the model successfully focuses on most defective foreground regions.
Meanwhile, we also conducted heatmap visualizations on the first few-shot setting of the DeepPCB dataset to evaluate the model’s performance when multiple categories coexist. The results demonstrate that, across six different background configurations, the incorporation of both the MOMP and DPGT modules enhances the model’s attention to few-shot category foregrounds while maintaining its generalization capabilities for base category foregrounds. This validates that the model can sustain robust performance in relatively complex scenarios.
The gradient-weighted class activation mapping (Grad-CAM) heatmaps display the algorithm’s attention to different feature regions. The results indicate that, during the dynamic integration of support and query information, the algorithm’s response to few-shot novel class foregrounds is significantly enhanced. This confirms that, under the guidance of global features, the support prototypes’ perception of novel class objects is effectively strengthened, providing more reliable support information for the subsequent meta-learner stage. Because Grad-CAM is computed from gradient flow, the heatmap also shows which image regions the model relies on for its decisions. A heatmap that concentrates on the target defect structures rather than on background noise indicates that the model filters out background interference during detection. Likewise, when the input image contains complex background noise, a heatmap that still highlights the actual defects, instead of falsely activating on noise regions, demonstrates the model’s resistance to interference.
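The core Grad-CAM computation can be sketched in a few lines. The example below is a dependency-free illustration of the standard formulation (channel weights obtained by spatially averaging the gradients, followed by a weighted sum of the activation maps and a ReLU). It assumes that the activations and gradients of one layer have already been extracted as nested `[channel][row][column]` lists; the final rescaling to [0, 1] is a display convenience added here, and the toy values are hypothetical.

```python
def grad_cam(activations, gradients):
    """Return a Grad-CAM map from [channel][row][col] activations and gradients."""
    channels = len(activations)
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weights: global average of the gradients over all positions.
    weights = [
        sum(sum(row) for row in gradients[c]) / (height * width)
        for c in range(channels)
    ]

    # Weighted sum of the activation maps, followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[c] * activations[c][y][x] for c in range(channels)))
            for x in range(width)
        ]
        for y in range(height)
    ]

    # Rescale to [0, 1] so the map can be overlaid on the image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy 2-channel, 2x2 example: channel 0 supports the class, channel 1 opposes it.
acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(acts, grads))  # [[1.0, 0.0], [0.0, 0.0]]
```

In the toy example, the location activated by the negatively weighted channel is suppressed by the ReLU, which is why only class-supporting regions appear in the heatmap.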
To deeply analyze the model’s attention mechanism and error patterns,
Figure 11 presents paired visualization results of heatmaps with false positive/false negative cases. Specifically targeting two typical defect types, the figure demonstrates the differences in the model’s attention distribution between successful detection and failure cases.
Comparing the two cases shows the following. In CFO detection, the model with only MOMP added accurately focuses on the inverted color areas at the center of stains but may miss detections when multiple similar stains appear simultaneously. In BMFO detection, when the background becomes complex, the model with only MOMP ignores the PCB texture background and concentrates on edge background areas, indicating that MOMP alone may exhibit false activation on periodic background patterns. These visualization results demonstrate that, while the MOMP module can distinguish between defect features and background patterns in most cases, its attention mechanism can still be affected by periodic background patterns. Further visual analysis shows that, in samples containing defect-mimicking periodic backgrounds, the proportion of cases where the module excessively focuses on the background remains low. This supports the view that our proposed second-order constraint mechanism suppresses the background response and reduces the false activation rate on background patterns.
4.6. Real-Time Performance and Stability Analysis of the Algorithm
To evaluate whether our proposed object detection algorithm can meet real-time requirements in practical applications and assess the reasonableness of its GPU memory consumption, we employ two critical metrics, time efficiency and memory usage, to validate the feasibility of the SM-FSOD few-shot object detection algorithm. Specifically, this section presents a comparative analysis between the proposed algorithm and several few-shot object detection approaches in terms of the frame rate (FPS) and resource consumption (GiB), evaluating their performance characteristics under varying computational resource conditions. To ensure fair comparisons, all experiments were performed under identical hardware configurations. The detailed experimental results are presented in
Table 6.
The experimental results demonstrate that the proposed SM-FSOD method achieves a favorable balance between efficiency and performance. Compared to Meta-FRCNN and other approaches, it delivers superior detection accuracy with only a marginal increase of 1.1 GiB in GPU memory usage and a maximum speed reduction of 1.2 FPS. The method therefore offers a favorable accuracy-speed trade-off: it improves the detection performance at a small computational cost, providing a practical solution for real-world applications.
According to the general technical requirements for production lines, the operating speed of mainstream PCB inspection lines typically ranges from 3 to 4.5 m/min. Taking a common PCB size (length × width: 45 cm × 60 cm) as an example, at a line speed of 4 m/min, each board takes approximately 6.75 s (0.45 m ÷ 4 m/min) to pass through the inspection area. Our method achieves a processing speed of 4.3 FPS, so roughly 29 frames can be processed while a board is in view, which exceeds the minimum number of image samples required for full-coverage inspection. Therefore, based on the relationship between the production line speed, board size, and processing capacity, a theoretical processing speed of 4–5 FPS fully meets the real-time requirements of general precision-oriented PCB inspection lines.
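The arithmetic behind this estimate can be reproduced with a short calculation. The sketch assumes that the board travels along its 45 cm side and that frames are processed continuously while the board is in view; the number of frames actually required for full coverage depends on the camera field of view, which is not specified here. The `frames_per_board` helper is illustrative.

```python
def frames_per_board(board_length_m, line_speed_m_per_min, fps):
    """Return (transit time in seconds, frames processed during transit)."""
    transit_s = board_length_m / line_speed_m_per_min * 60.0
    return transit_s, transit_s * fps


# 45 cm board at 4 m/min, processed at 4.3 FPS.
transit, frames = frames_per_board(0.45, 4.0, 4.3)
print(f"{transit:.2f} s in view, about {frames:.0f} frames")
# 6.75 s in view, about 29 frames

# Fastest line speed quoted in the text (4.5 m/min).
transit_fast, frames_fast = frames_per_board(0.45, 4.5, 4.3)
print(f"{transit_fast:.2f} s in view, about {frames_fast:.0f} frames")
```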
4.7. Analysis of Computational Cost and Convergence Optimization Strategies
To comprehensively evaluate both the efficiency and optimization behavior of our proposed framework, we conduct detailed analyses covering the computational cost, convergence acceleration, and class imbalance mitigation.
We first provide a layer-wise breakdown of model complexity, including the backbone, MOMP, DPGT, matching module, and detection heads. For each component, we report the number of parameters and memory consumption. As shown in
Table 7, the MOMP and DPGT introduce modest overhead compared to the backbone, while still enabling substantial accuracy gains.
To assess the real-time feasibility in practical applications, we benchmark the end-to-end inference throughput on both a mid-range GPU (NVIDIA RTX 3060) and a CPU (Intel i7-12700). We set the batch size to 1, use float32 precision, and fix the top-K proposals to 300. The time breakdown reveals that DPGT accounts for 21% of the inference time and MOMP for 13%, confirming deployment viability under moderate latency constraints. Regarding optimization dynamics, we clarify that the Hungarian loss employs weights , , balancing classification and localization objectives. To address class imbalance, especially the prevalence of background samples, we downweight the no-object class by a factor of during loss computation.
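Downweighting the no-object class can be illustrated with a weighted cross-entropy. The sketch below is a generic, dependency-free formulation and not the exact loss of the proposed method: the weight value `0.1` is a hypothetical placeholder chosen only for the example, and the normalization by the sum of the weights follows a common convention that is assumed here.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-query cross-entropy.

    logits: list of per-class score lists; targets: list of class indices.
    """
    total_loss, total_weight = 0.0, 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        nll = log_norm - scores[target]
        total_loss += class_weights[target] * nll
        total_weight += class_weights[target]
    return total_loss / total_weight


# Two defect classes plus a no-object class at index 2.
# The value 0.1 is a placeholder, not the factor used in the paper.
NO_OBJECT_WEIGHT = 0.1
weights = [1.0, 1.0, NO_OBJECT_WEIGHT]

query_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
query_targets = [0, 2]  # first query matched to class 0, second to no-object

print(weighted_cross_entropy(query_logits, query_targets, weights))
print(weighted_cross_entropy(query_logits, query_targets, [1.0, 1.0, 1.0]))
```

With the reduced weight, the poorly fitted no-object query contributes less to the averaged loss than it does under uniform weights, which is the intended effect when background queries dominate.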
Figure 12 illustrates the training curves for both the classification and box regression losses. Compared to the baseline, our model achieves faster convergence within the first 50 epochs, suggesting that the prototype-guided matching and hierarchical optimization not only improve the final AP but also stabilize the training dynamics. This supports our claim that the proposed guidance mechanism accelerates convergence rather than merely optimizing endpoint metrics.
4.8. Further Evaluation of Robustness and Practical Metrics
To provide a more comprehensive evaluation of the proposed method’s practical utility and robustness, we conducted additional experiments focusing on finer-grained detection metrics and sensitivity to few-shot configurations. In addition to the commonly used AP50 metric, we also included AP75 to assess the method’s precision under stricter localization criteria. Furthermore, we report the log-average miss rate (LAMR) and F1-scores at fixed confidence thresholds of 0.5 and 0.75. These metrics matter in real-world PCB inspection scenarios, where accurate localization and stable behavior across confidence thresholds are essential. The results are reported in
Table 8.
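As an illustration of the thresholded F1-score, the following minimal sketch computes F1 from detections that have already been marked as true or false positives. It assumes that this marking was done beforehand by IoU matching and does not change with the confidence threshold, which is a simplification; the `f1_at_threshold` helper and the example detections are hypothetical.

```python
def f1_at_threshold(scored_matches, num_ground_truth, threshold):
    """F1-score over detections whose confidence is at least `threshold`.

    scored_matches: list of (confidence, is_true_positive) pairs.
    """
    kept = [is_tp for conf, is_tp in scored_matches if conf >= threshold]
    tp = sum(kept)
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / num_ground_truth if num_ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical detections for four ground-truth defects.
matches = [(0.9, True), (0.8, True), (0.6, False), (0.4, True)]
print(round(f1_at_threshold(matches, 4, 0.5), 4))   # 0.5714
print(round(f1_at_threshold(matches, 4, 0.75), 4))  # 0.6667
```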
To investigate the robustness of the proposed method under different few-shot settings, we visualize the AP75 distribution of novel classes from split 2 of the DeepPCB dataset across three different configurations using violin plots. The results show that the method maintains relatively stable performance across categories. The distributions are shown in
Figure 13.