In this section, we first present PCB-Faster-RCNN. Then, we introduce the various improved modules of Faster-RCNN, including ResNet-101 as the backbone, the deformable convolution module employed to enhance the adaptability of the model to geometric variations, and the CBAM, which is integrated to strengthen feature representation and improve discriminative performance.
3.2. Residual Neural Network
In the traditional Faster-RCNN framework, the backbone network is typically VGG-16. As a classical convolutional neural network architecture, VGG-16 achieved remarkable performance in early object detection and image recognition tasks, laying the foundation for subsequent research on deep convolutional networks. However, VGG-16 relies primarily on stacked 3 × 3 convolutional layers followed by fully connected layers and lacks any cross-layer information transmission mechanism. As a result, deeper variants are prone to vanishing or exploding gradients during training, which constrains scalability and hinders further network deepening. In addition, although VGG-16 performs reasonably well in extracting global semantic features, its ability to handle small-scale, morphologically complex, or boundary-ambiguous objects remains insufficient. This limitation is particularly evident in PCB defect detection, where the network struggles to accurately represent small or irregularly shaped defects, leading to a noticeable decline in detection accuracy. To address these issues, we replace the original VGG-16 backbone in Faster-RCNN with ResNet-101 [43]. By introducing residual learning and shortcut connections, ResNet-101 alleviates the degradation problem commonly encountered when training deep neural networks while improving the stability of feature propagation and gradient flow. Moreover, its deeper residual structure facilitates the extraction of multi-scale features and enhances adaptability to complex deformation patterns. As a result, the backbone substitution improves detection precision and robustness, particularly in handling diverse defect categories on PCB surfaces. The residual neural network consists of several residual blocks. The overall structure of the residual block is illustrated in Figure 2.
As shown in Figure 2, each residual block consists of two convolutional layers and a ReLU activation function. After entering the residual block, the input $X$ is processed along two paths. The first path passes through two convolutional layers and a ReLU activation function to produce the output $F(X)$, while the second path transmits the input $X$ unchanged through a residual (shortcut) connection. The outputs of the two paths are added together, and the sum $F(X) + X$ passes through a ReLU activation function to give the final output of the residual block. The additive form is expressed mathematically in Equation (1).

$$Y = F(X, W) + X \tag{1}$$
In Equation (1), $X$ denotes the input to the block, $W$ represents the weights of the learnable layers, $F(X, W)$ is the output obtained after processing through these layers and activation functions, and $Y$ is the final output of the residual block. The core idea of residual networks is the introduction of residual learning through the residual connection. A traditional neural network learns the mapping from input $x$ to output $y$ directly, that is, $y = H(x)$, while ResNet introduces a residual function $F(x) = H(x) - x$, so that the network learns the difference between the output and the input. According to Equation (1), the network does not directly learn the mapping $H(x)$ but instead learns the “residual” $F(x) = y - x$ between the input $x$ and the output $y$.
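The shortcut logic described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this paper: dense (matrix) layers stand in for the two convolutional layers, and the names `residual_block`, `w1`, and `w2` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Return ReLU(F(X, W) + X) for a batch of feature vectors.

    Assumption: two dense layers replace the two convolutional layers so
    that the sketch stays short; the shortcut addition is unchanged.
    """
    f = relu(x @ w1) @ w2      # residual branch F(X, W)
    return relu(f + x)         # shortcut path adds the unchanged input X

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # 4 samples, 8 features
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = np.zeros((8, 8))                  # forces F(X, W) = 0
y = residual_block(x, w1, w2)          # block reduces to ReLU(X)
```

When the residual branch outputs zero, the block simply passes the (rectified) input through, which is why adding such blocks does not make the identity mapping harder to represent.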
There are several variants of residual neural networks with different numbers of layers, including ResNet-50, ResNet-101, ResNet-152, and so on, where the number represents the depth of the network. Although these different versions of residual networks exhibit significant improvements in precision, accuracy, and generalization ability compared to traditional networks, there are still notable differences between them. Specifically, ResNet-50 offers advantages in computational efficiency but has relatively limited precision in object detection tasks due to its shallower architecture. On the other hand, ResNet-152 achieves higher precision with its deeper architecture but comes with increased computational cost and a higher risk of overfitting. Therefore, ResNet-101 was selected as the backbone for Faster-RCNN to balance both accuracy and computational efficiency after a comprehensive evaluation of the dataset used in this paper.
In this paper, we present a network architecture based on five residual modules, aiming at improving the ability of the model to recognize objects at multiple scales through feature maps of varying sizes. Each residual module extracts feature information from objects of different sizes, and the feature maps become more abstract and richer in semantic information as the network deepens. In the lower layers, feature maps capture more low-level details, such as edges and textures, while higher layers focus on capturing semantic information, such as object categories and structures. To enhance the expressiveness of the features, the feature maps from different scales are fused through a feature pyramid network. The feature pyramid network effectively integrates multi-level features, improving recognition accuracy across objects of various sizes. Through this multi-scale feature fusion, the model demonstrates improved robustness and accuracy in detecting objects of different sizes, thereby enhancing overall performance.
Figure 3 illustrates the implementation of this feature fusion process.
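As a rough illustration of the top-down fusion idea, the following NumPy sketch adds each upsampled deeper map to the next shallower one. It assumes that all maps already share the same channel count (that is, any lateral 1 × 1 convolutions have been applied beforehand) and that each level halves the spatial resolution; the function names are hypothetical and the sketch is not the exact structure shown in Figure 3.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fusion(features):
    """Fuse feature maps ordered from shallow (large) to deep (small).

    Assumption: every map has the same number of channels and each level
    has half the height and width of the previous one.
    """
    fused = [features[-1]]                       # start from the deepest map
    for feat in reversed(features[:-1]):
        fused.append(feat + upsample2x(fused[-1]))
    return fused[::-1]                           # back to shallow -> deep order

c3 = np.ones((4, 16, 16))
c4 = np.ones((4, 8, 8))
c5 = np.ones((4, 4, 4))
p3, p4, p5 = top_down_fusion([c3, c4, c5])
```

With all-ones inputs, the shallowest output accumulates contributions from every deeper level, which makes the flow of semantic information toward high-resolution maps easy to see.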
By introducing residual connections, ResNet-101 enables efficient learning and propagation of features at deeper layers. This capability is particularly critical for small objects with limited pixel information, as the network must capture fine-grained shape and texture details from low-level features. Residual blocks play a key role in cross-layer information transmission: they strengthen the representation of detailed features and improve sensitivity to object details, thereby enabling reliable recognition of small objects in complex backgrounds. Complementarily, the feature pyramid network extracts and integrates features across multiple scales, allowing the network to leverage information from different levels and achieve improved accuracy and robustness in object detection. In PCB defect detection, some types of defects are inherently small. By integrating ResNet-101 with a feature pyramid network, the model can extract discriminative features across multiple scales, which improves its sensitivity to objects of varying sizes and strengthens its capability to detect small-scale defects in complex inspection scenes.
3.3. Deformable Convolution Network
Traditional convolution samples the input at fixed positions of the convolution kernel, which restricts its receptive field to a rigid, regular grid. Moreover, pooling operations reduce the size of the feature map, inevitably resulting in the loss of some image information. These issues are particularly prominent in images with object deformation and may even cause severe performance degradation in extreme cases. To address this limitation, we introduce deformable convolution [44] for feature extraction. By incorporating learnable offsets into the sampling locations, it allows the convolution kernel to flexibly select sampling points from neighboring pixels, thereby enhancing its ability to model objects with geometric deformations and scale variations. Compared to conventional convolution, deformable convolution can capture richer structural and semantic information from the image, thus improving the robustness and representational capacity of the model in complex visual tasks. Figure 4 illustrates examples of deformable convolution.
As illustrated in Figure 4, traditional convolution extracts features from fixed sampling locations, and its computational formulation is shown in Equation (2). In contrast, deformable convolution takes each regular sampling point as the center and adds an offset to it, adaptively selecting the sampling position within the surrounding neighborhood to enable more flexible feature extraction. The offsets, and hence the sampling positions, are learned from the input features rather than predefined, which allows the convolution to capture richer geometric structures and semantic information from the image. The corresponding computational process is expressed in Equation (3).

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n) \tag{2}$$

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \tag{3}$$
In Equations (2) and (3), $x$ denotes the input feature map, $y$ the output feature map, $w(p_n)$ represents the convolution kernel weight at position $p_n$, and $p_0$ is the central sampling point. The set of sampling locations of the convolution kernel is denoted as $\mathcal{R}$, while $\Delta p_n$ refers to the offset, which is learned during training and can take non-integer values; the input at such fractional positions is obtained by bilinear interpolation. In conventional convolution, the input features are aggregated through weighted summation at fixed sampling locations. In contrast, deformable convolution introduces the offset $\Delta p_n$ at each sampling position $p_n$, thereby shifting the sampling location to $p_0 + p_n + \Delta p_n$. The convolution kernel is therefore no longer constrained to a fixed regular grid; it can adaptively select sampling locations, enabling more flexible and informative feature extraction.
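The difference between fixed and offset sampling can be made concrete with a small NumPy sketch that evaluates a single output location of a 3 × 3 kernel. It is only illustrative: in an actual deformable convolution the offsets are predicted by an additional convolutional layer, whereas here they are supplied by hand, and the helper names `bilinear` and `deform_conv_at` are hypothetical.

```python
import numpy as np

def bilinear(x, py, px):
    """Sample the 2-D map x at fractional (py, px); outside the map counts as 0."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(py - yy)) * (1 - abs(px - xx))
                value += weight * x[yy, xx]
    return value

def deform_conv_at(x, w, p0, offsets):
    """Weighted sum of x sampled at p0 + p_n + offset_n for a 3x3 kernel."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular grid
    out = 0.0
    for (dy, dx), (oy, ox), wn in zip(grid, offsets, w.ravel()):
        out += wn * bilinear(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)     # x[i, j] = 5 * i + j
w = np.full((3, 3), 1.0 / 9.0)                   # averaging kernel
zero = np.zeros((9, 2))                          # zero offsets: regular convolution
shift = np.tile([0.0, 0.5], (9, 1))              # every sample moved half a pixel right
y_regular = deform_conv_at(x, w, (2, 2), zero)
y_shifted = deform_conv_at(x, w, (2, 2), shift)
```

With zero offsets the function reduces to ordinary convolution at that location; non-zero, fractional offsets move the sampling points off the grid, and bilinear interpolation supplies the input values there.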
The PCB image data used in this paper were collected from real-world manufacturing scenarios, where image quality is inevitably affected by factors such as environmental conditions and camera angles. These factors often cause object deformation, insufficient illumination, background reflections, and blurred details, thereby increasing the difficulty of defect detection. To address these challenges, deformable convolutional networks (DCNs) are embedded as modules into the high-level feature extraction stages of ResNet to enhance the adaptability of the model to geometric deformations of targets and to complex scenes. Specifically, we replace all traditional convolutions with the DCN in Res4, which is responsible for encoding high-level semantic information. We also introduce the DCN into the first traditional convolutional layer of Res5, giving the model adaptive sampling capability in this semantically rich but spatially low-resolution stage. Because DCNs adjust the convolution sampling positions through offsets, the kernels can align features more flexibly with objects whose scale and shape vary. This embedding does not change the overall topology of the backbone but enhances its spatial modeling capability, making it better suited to the complex appearance changes of objects in PCB manufacturing scenarios. To validate its effectiveness, an ablation experiment against conventional convolution was conducted, allowing a systematic comparison of the two.
3.4. Convolutional Block Attention Module
Traditional attention mechanisms are generally restricted to a single dimension, either channel attention or spatial attention, and therefore suffer from inherent limitations in feature representation. On the one hand, channel attention can effectively emphasize informative feature channels but often neglects spatial positional information, making it difficult to accurately localize salient regions in complex visual scenes. On the other hand, spatial attention is capable of highlighting crucial locations, yet it fails to capture semantic interdependencies across channels, resulting in incomplete feature descriptions. In contrast, the Convolutional Block Attention Module (CBAM) [45] integrates channel attention and spatial attention sequentially. By first refining channel-wise feature responses and subsequently enhancing spatially significant regions, CBAM establishes a complementary relationship between global semantic modeling and local spatial localization. This joint mechanism enables a more holistic and discriminative feature refinement process, compensating for the deficiencies of single-dimensional attention mechanisms. Furthermore, unlike computationally intensive self-attention methods, CBAM requires only a small number of additional operations owing to its lightweight, modular design. Its low computational overhead allows it to be embedded into a wide range of convolutional neural network architectures as a plug-and-play module. We therefore incorporate CBAM into the backbone to enhance feature representation. Through the joint exploitation of channel dependencies and spatial importance, the model is able to extract more fine-grained and multidimensional representations from input images, which improves discriminability when dealing with visually similar targets under complex and cluttered conditions. The overall architecture of CBAM is illustrated in Figure 5.
As illustrated in Figure 5, CBAM is composed of two sequential sub-modules: the channel attention module (CAM) and the spatial attention module (SAM). The input feature map is first processed by CAM, where channel-wise attention selectively emphasizes informative feature channels with higher semantic relevance. The channel-refined feature map is then fed into SAM, where spatial attention highlights salient regions across the spatial dimensions. Through this cascaded operation, the final output feature map integrates both channel attention and spatial attention, yielding a more comprehensive and discriminative feature representation.
CAM computes complementary channel descriptors via global average pooling and max pooling, projects them through a shared bottleneck MLP, and fuses the results with a sigmoid function; the feature map is then rescaled channel-wise to emphasize semantically informative channels. Specifically, for the input feature map $F \in \mathbb{R}^{C \times H \times W}$, the CAM first performs two types of pooling operations along the spatial dimensions, thereby generating complementary channel descriptors. The formulations of these pooling operations are presented in Equations (4) and (5).

$$F_{avg}(c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F(c, i, j) \tag{4}$$

$$F_{max}(c) = \max_{i, j} F(c, i, j) \tag{5}$$
Specifically, global average pooling $F_{avg}$ computes the mean activation across all spatial locations within each channel to characterize the overall response strength, while the max pooling $F_{max}$ extracts the maximum activation value along the spatial dimensions to capture the most salient local feature. The combination of these two descriptors provides a complementary global representation for each channel, encompassing both statistical characteristics and salient responses. Subsequently, the two channel descriptors are fed into a shared-weight two-layer multilayer perceptron (MLP) to further model inter-channel dependencies, as shown in Equations (6) and (7).

$$M_{avg} = W_1 \, \delta(W_0 F_{avg}) \tag{6}$$

$$M_{max} = W_1 \, \delta(W_0 F_{max}) \tag{7}$$
where the first weight layer $W_0$ reduces the channel dimension from $C$ to $C/r$, with $r$ denoting the reduction ratio, which not only decreases the number of parameters and the computational cost but also compresses the channel representation to extract more discriminative information. Subsequently, the second weight layer $W_1$ restores the dimension to $C$, thereby generating a weight distribution that matches the original channel number. The nonlinear activation function $\delta$, typically ReLU, is introduced between the two layers to enhance the nonlinearity and expressiveness of the learned features. Finally, two nonlinear mapping vectors, $M_{avg}$ and $M_{max}$, are produced, corresponding to the average pooling and max pooling branches, respectively. These two outputs are then fed into a gating mechanism to integrate complementary information and adaptively assign channel-wise weights, as shown in Equation (8).

$$M_c = \sigma(M_{avg} + M_{max}) \tag{8}$$
In the gating unit, the outputs from the average pooling and max pooling branches are first combined via element-wise addition. Then, the aggregated result is normalized by a sigmoid activation function $\sigma$, producing a channel attention vector $M_c$ of length $C$, which characterizes the importance coefficients of individual channels with values constrained within the range of 0 to 1. Finally, the weights $M_c$ are applied to the input feature map $F$ through channel-wise multiplication, obtaining the feature map $F'$ with channel attention, as shown in Equation (9):

$$F' = M_c \otimes F \tag{9}$$
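The channel attention steps in Equations (4) to (9) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this paper: the weights are random, the reduction ratio $r = 4$ is an arbitrary choice for the example, and the function name `channel_attention` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w0, w1):
    """Return (M_c, F') for a feature map of shape (C, H, W).

    w0 has shape (C // r, C) and w1 has shape (C, C // r); both are shared
    by the average-pooled and max-pooled descriptors.
    """
    f_avg = feat.mean(axis=(1, 2))                 # Eq. (4): shape (C,)
    f_max = feat.max(axis=(1, 2))                  # Eq. (5): shape (C,)

    def mlp(v):
        return w1 @ np.maximum(w0 @ v, 0.0)        # Eqs. (6)-(7): shared MLP, ReLU

    m_c = sigmoid(mlp(f_avg) + mlp(f_max))         # Eq. (8): gating
    return m_c, feat * m_c[:, None, None]          # Eq. (9): channel-wise rescale

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 5, 5))                  # C = 8
w0 = rng.normal(size=(2, 8)) * 0.1                 # assumed reduction ratio r = 4
w1 = rng.normal(size=(8, 2)) * 0.1
m_c, refined = channel_attention(feat, w0, w1)
```

Each channel of `refined` is the corresponding input channel scaled by a single coefficient between 0 and 1, so the spatial layout within a channel is left untouched at this stage.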
After processing by CAM, $F'$ is fed into SAM, which derives complementary spatial descriptors via channel-wise average and max pooling, aggregates them using a single $k \times k$ convolution (the original CBAM design uses $k = 7$) followed by a sigmoid gate to form a spatial attention map, and re-weights the feature map pixel-wise. The SAM thus learns the spatial attention map $M_s \in \mathbb{R}^{H \times W}$, which adaptively recalibrates features along the spatial dimensions to emphasize salient regions and suppress irrelevant background information. Specifically, the feature map $F'$ is subjected to average pooling and max pooling along the channel dimension, resulting in two complementary two-dimensional response maps, as defined in Equations (10) and (11).

$$F'_{avg}(i, j) = \frac{1}{C} \sum_{c=1}^{C} F'(c, i, j) \tag{10}$$

$$F'_{max}(i, j) = \max_{c} F'(c, i, j) \tag{11}$$
where $F'_{avg}(i, j)$ denotes the mean value across all channels at spatial location $(i, j)$, reflecting the overall response intensity at that position, while $F'_{max}(i, j)$ extracts the maximum activation among all channels, highlighting the most salient local pattern. These two descriptors provide complementary spatial representations, with the former preserving global statistical information of the background and the latter emphasizing discriminative local regions. Subsequently, the two response maps are concatenated along the channel dimension to form the composite representation $[F'_{avg}; F'_{max}]$, which is then processed by a single convolutional layer $f^{k \times k}$ with a $k \times k$ kernel for feature fusion. Finally, the output is normalized via a sigmoid function to generate the spatial attention map, as shown in Equation (12).

$$M_s = \sigma\left(f^{k \times k}\left([F'_{avg}; F'_{max}]\right)\right) \tag{12}$$
The convolution operation fuses the two spatial descriptors and extracts contextual features within local neighborhoods, so that the attention map relies not only on pointwise statistical information but also on spatial dependencies across adjacent regions. The output is subsequently normalized by a sigmoid activation function $\sigma$, which compresses the values into the range of 0 to 1, thereby producing the final spatial attention map $M_s$ that reflects the importance of each spatial location. The attention map is then multiplied element-wise with the channel-refined feature map, allowing the network to enhance salient regions while suppressing background and noise interference. This yields the final refined feature map $F''$, as shown in Equation (13).

$$F'' = M_s \otimes F' \tag{13}$$
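The spatial attention steps in Equations (10) to (13) can be sketched in the same style. The sketch below is illustrative only: the convolution is written as an explicit loop with zero padding, the kernel is random, an odd kernel size is assumed, and the function name `spatial_attention` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(feat, kernel):
    """Return (M_s, F'') for feat of shape (C, H, W) and kernel of shape (2, k, k).

    Assumption: k is odd, and zero padding keeps the H x W resolution.
    """
    # Eqs. (10)-(11): pool along the channel axis, then stack to (2, H, W)
    desc = np.stack([feat.mean(axis=0), feat.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(desc, ((0, 0), (pad, pad), (pad, pad)))
    h, w = feat.shape[1:]
    logits = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # single k x k convolution over the two stacked descriptors
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    m_s = sigmoid(logits)                          # Eq. (12)
    return m_s, feat * m_s[None, :, :]             # Eq. (13): pixel-wise rescale

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 6, 6))
kernel = rng.normal(size=(2, 7, 7)) * 0.1          # k = 7, as in the original CBAM design
m_s, refined = spatial_attention(feat, kernel)
```

Applying `spatial_attention` to the output of a channel attention step reproduces the cascaded CAM-then-SAM order described in this section.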
In real-world PCB manufacturing scenes, various types of defects often exhibit visually similar characteristics due to environmental interference and the imaging limitations of cameras, which increases recognition ambiguity and degrades model performance. To address this, CBAM is integrated into the multi-scale feature extraction process of the backbone network in a lightweight, embedded manner. Specifically, we insert a CBAM module at the output of each residual module in the ResNet backbone, ensuring that each level of features undergoes channel and spatial recalibration before entering the next stage of convolutional computation. Within each module, the CAM models global features along the channel dimension and adaptively learns the relative importance of each channel, thereby emphasizing discriminative feature channels that are critical for distinguishing visually similar defects. Meanwhile, the SAM redistributes feature weights across the spatial dimensions, highlighting salient locations while suppressing irrelevant regions, which enables the network to focus on key internal structures of the defects rather than relying solely on overall contours, thus mitigating confusion caused by morphological similarity. The synergy of CAM and SAM gives the network discriminative advantages in both feature selection and spatial localization, improving accuracy and robustness when distinguishing between similar defect categories. This embedding does not change the original structure of the backbone network but enhances the salient regions of the target and suppresses redundant background information in the low and middle layers of feature extraction, thereby improving the feature representation ability of the overall detection network and its robustness in PCB manufacturing scenarios.
3.5. CIoU Loss Function
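The CIoU loss incorporates the aspect ratio of the ground-truth and predicted bounding boxes, in addition to their overlap and center distance, enabling a more comprehensive evaluation of localization accuracy, as shown in Equations (14) and (15).

$$CIoU = IoU - \frac{d^2}{c^2} - \alpha v \tag{14}$$

$$\alpha = \frac{v}{(1 - IoU) + v} \tag{15}$$

In Equations (14) and (15), $IoU$ is the intersection over union of the two boxes, $d$ represents the distance between the center points of the predicted and ground-truth bounding boxes, $c$ represents the diagonal length of the minimum enclosing box that covers both boxes, and $\alpha$ is a trade-off weight. $v$ measures the consistency of the aspect ratios of the two boxes, as shown in Equation (16), where $(w, h)$ and $(w^{gt}, h^{gt})$ are the widths and heights of the predicted and ground-truth boxes. The final CIoU loss is shown in Equation (17).

$$v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2 \tag{16}$$

$$L_{CIoU} = 1 - CIoU \tag{17}$$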
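For reference, the standard CIoU loss can be computed for a single pair of boxes as in the following pure-Python sketch. The corner format `(x1, y1, x2, y2)` and the function name `ciou_loss` are assumptions made for the example, not details taken from this paper.

```python
import math

def ciou_loss(box_p, box_g):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g

    # Intersection over union
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union

    # d^2: squared distance between the two box centres
    d2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
        + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # c^2: squared diagonal of the smallest box enclosing both boxes
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 \
        + (max(py2, gy2) - min(py1, gy1)) ** 2

    # v: aspect-ratio consistency; alpha: its trade-off weight
    v = (4 / math.pi ** 2) * (
        math.atan((gx2 - gx1) / (gy2 - gy1)) - math.atan((px2 - px1) / (py2 - py1))
    ) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0

    return 1 - iou + d2 / c2 + alpha * v

same = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))       # identical boxes
shifted = ciou_loss((0, 0, 2, 2), (1, 1, 3, 3))    # same shape, offset centre
```

For identical boxes every penalty term vanishes and the loss is zero; for two boxes of the same shape, only the overlap and center-distance terms contribute.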
Compared with other loss functions, the CIoU loss introduces the aspect-ratio term $v$, which enhances robustness in object detection tasks; it not only improves detection accuracy but also strengthens the generalization capability of the model under complex conditions. Given that PCB manufacturing environments are often affected by diverse and challenging factors, we adopt the CIoU loss as the primary bounding-box regression loss to ensure stability and reliability in practical applications.
In this paper, the introduction of ResNet-101, DCN, and CBAM represents a systematic architectural design tailored to the characteristics of PCB defect detection tasks. PCB defects are typically extremely small, texture-fragmented, geometrically irregular, and highly similar across categories, which makes conventional convolutional features insufficient for semantic abstraction, deformation modeling, and background suppression. ResNet-101 provides deeper semantic representations that strengthen the ability of the model to capture micro-scale defects. DCN adapts to the geometric variations in defect shapes through learnable sampling offsets, significantly reducing false negatives caused by edge distortion and structural deformation. CBAM effectively highlights key regions and suppresses complex PCB texture backgrounds through a joint channel- and spatial attention mechanism, reducing misclassification between similar-looking categories. These components complement each other functionally. ResNet-101 enhances global semantics, DCN captures local geometric details, and CBAM achieves discriminative region attention. These three correspond to three complementary dimensions: semantic enhancement, deformation modeling, and attention guidance, forming a collaborative optimization mechanism tailored to the characteristics of PCB defects.