Article

A Transformer–CNN Dual-Branch Image Classification Model—Cross-Layer Semantic Interaction and Discriminative Feature Enhancement Algorithm

Mechanical Engineering, Jiangsu University of Science and Technology, Zhenjiang 212003, China
*
Authors to whom correspondence should be addressed.
Symmetry 2026, 18(3), 527; https://doi.org/10.3390/sym18030527
Submission received: 14 February 2026 / Revised: 15 March 2026 / Accepted: 18 March 2026 / Published: 19 March 2026
(This article belongs to the Section Computer)

Abstract

PCB defect images suffer from tiny defects, subtle morphological differences and complex background wiring, making traditional single-feature classification unstable. This paper proposes a dual-branch image classification method combining a Transformer and CNN, which jointly models local anomalies and global semantic relationships. The model uses a convolutional branch and a Transformer branch to extract local defect features and global wiring dependencies, respectively. A cross-layer semantic interaction mechanism is adopted for multi-level information fusion, and a discriminative feature enhancement module is applied to highlight key defect regions and suppress background interference. Experiments show that the model improves overall accuracy by over 2%, with an F1-score of 0.930 and defect identification coverage of 0.927. It performs stably across different defect types and background complexities without obvious bias, providing new insights for hybrid deep model design in industrial defect image classification.

1. Introduction

Within electronic manufacturing, printed circuit boards (PCBs) are distinguished by their high integration density and precise structural configuration. The corresponding defect images commonly manifest challenges such as limited defect scale, nuanced morphological disparities, and intricate background wiring structures. These attributes render the defect identification process distinctly nonlinear and strongly context-dependent [1]. The accurate discernment of defects often depends on correlating local anomalies with comprehensive structural semantics. Analytical methodologies predicated exclusively upon localized features or static rule sets prove insufficient for the stable characterization of discriminative boundaries separating distinct defect types [2]. Therefore, the establishment of an intelligent image classification paradigm, capable of concurrently adapting to localized abnormal perturbations and global structural constraints, carries substantial significance for augmenting the reliability of PCB defect detection.
The extant body of research concerning PCB defect imagery is principally focused on automated identification leveraging deep learning methodologies. Convolutional Neural Networks (CNNs) have attained widespread adoption in industrial defect classification and detection tasks by virtue of their capacity for effective modeling of local texture, edge structure, and fine-grained morphological features, consequently elevating the level of automation and stability in defect identification to a considerable degree [3,4]. Pertinent research commonly augments model capability in representing defects of disparate scales through the deepening of network hierarchies or the introduction of multi-scale feature extraction strategies [5,6]. Concurrently, as the intricacy of background wiring structures within PCB images escalates, the requisite for context-dependent defect identification is progressively intensifying. In response, a nascent line of inquiry has begun to incorporate models predicated on the self-attention mechanism to delineate global interdependencies among disparate image regions, thereby compensating for the deficiencies inherent in exclusive reliance on local feature modeling within complex structural environments [7,8]. Although prevailing methodologies have demonstrated measured progress in the domains of local feature modeling or global semantic characterization, respectively, within the specific scenario of PCB defects—characterized by the concurrent presence of fine-grained anomalies and complex structural constraints—the collaborative modeling between local discriminative information and global structural semantics remains inadequately addressed. Particularly under conditions of complex background, the absence of an efficacious mechanism for cross-layer semantic interaction and discriminative information enhancement may detrimentally impact model stability in discriminating subtle defects and categories of analogous morphology.
To address the aforementioned issues, this paper proposes a dual-branch PCB defect image classification model that integrates Transformer and CNN architectures. Through a parallel structure, the model separately characterizes local defect morphology and the semantic representation of the overall wiring structure. During the feature learning stage, a cross-layer semantic interaction mechanism is introduced, allowing local features to progressively incorporate global structural constraints at intermediate layers, thereby enhancing the model’s discriminative stability for subtle defects. On this basis, a discriminative feature enhancement strategy is further incorporated to explicitly strengthen key channels and spatial regions within the fused features. This design effectively suppresses interference from complex background textures and improves the consistency of defect identification. Compared with conventional CNN–Transformer hybrid frameworks that perform feature fusion only at high feature levels, the proposed architecture enables closer collaboration between local and global semantics during the feature learning process. Consequently, it is better suited to the recognition demands of PCB defect images, where fine-grained anomalies and complex structural constraints coexist.
The paper proceeds as follows. First, the research background, problem definition, and objectives are introduced. Second, a systematic review of image classification and defect detection methods is provided. Third, the data, Transformer–CNN dual-branch model design, and evaluation framework are explained. Fourth, experimental results are presented and analyzed from the perspectives of overall performance, category-level results, and adaptability to different background conditions. Fifth, a discussion is conducted, comparing the findings with existing literature and clarifying the study’s contributions and limitations. Finally, the paper concludes with a summary and suggestions for future research.

2. Literature Review

As industrial manufacturing achieves higher levels of precision and intelligence, techniques for visual defect detection and image classification have become indispensable for maintaining product quality [9]. Industrial images, especially those of printed circuit boards, are commonly characterized by small defect sizes, structurally complex backgrounds, and subtle differences between defect categories. These characteristics make the recognition task inherently dependent on both the expression of local details and the modeling of overall structural meaning [10]. Centering on the challenge of achieving effective multi-level feature representation in industrial defect images, deep learning methods have undergone continuous development. This development is evident in the evolution of strategies for feature modeling and in the design of network architectures, leading to the formation of several distinct and representative research trajectories.
A substantial corpus of scholarly work has been devoted in recent years to the effective representation of local anomalous features within industrial defect imagery, with modeling predicated predominantly upon convolutional neural networks (CNNs) for capturing fine-grained textural and structural mutation information. Taking industrial surface defects as its research object, the work of Zhang et al. proposed a lightweight convolutional network architecture [11]. Through the optimization of convolution units and feature expression methodologies, the model’s discrimination capacity for local defect areas was enhanced while operational efficiency was preserved. Confronting the issue wherein defect features are prone to submersion within complex backgrounds, Zhao et al. introduced a convolutional attention guidance mechanism alongside a multi-scale feature aggregation strategy [12]. This enables the network to preferentially attend to defect-associated regions during the feature extraction phase, consequently augmenting detection stability for diminutive defects. The work of Hao et al., which involved the embedding of an attention mechanism into a convolutional network structure to fortify the detection of local texture anomalies, demonstrated that CNNs can effectively ameliorate categorical discrimination outcomes [13]. Aimed at the challenges posed by noise interference and intricate background textures in complex industrial imaging environments, Zhang et al. constructed a deep convolutional network for surface defect detection. Their emphasis lay in realizing robust modeling of local structural alterations via the sequential extraction of multi-layer convolutional features [14]. For the identification of appearance defects in mechanical components, Fu et al. augmented the perceptual capability for minor defects on bearing surfaces through improvements to the convolutional network structure.
Empirical results from these investigations collectively indicate that CNN-based approaches retain considerable adaptability in the domain of local anomaly modeling [15].
As the importance of structural relationships and contextual constraints in industrial defect images gradually becomes apparent, researchers have begun to introduce Transformer architectures to model cross-regional global semantic information. Shamsabadi et al. first applied the vision Transformer to the crack detection task and used the self-attention mechanism to characterize the long-range dependencies between different image areas [16]. The results show that global semantic modeling helps to improve the generalization ability of the model under complex background conditions. Focusing on the problem of industrial surface defect detection, Shang et al. proposed a defect-aware Transformer network to explicitly model the semantic relationship between the defect area and the background texture through global attention, thereby enhancing the ability to distinguish pseudo defects from real defects [17]. Tao et al. constructed a Transformer-based global semantic modeling framework for the industrial texture anomaly localization task and used self-attention to capture cross-scale contextual information, allowing the model to maintain a more stable anomaly localization performance under complex texture conditions [18]. For unsupervised industrial anomaly detection scenarios, Yang and Guo built an autoencoding model based on the Vision Transformer, which verified the advantages of the Transformer in learning the overall structural semantics and distribution patterns of industrial images [19]. Zhou et al. started from a global-local collaboration perspective and introduced a dual global attention mechanism to enhance the Transformer’s ability to characterize the semantics of the overall structure, pointing out that relying only on local convolutional features makes it difficult to fully reflect the relationship between defects and the overall structure [20].
Following the realization that individual convolutional or Transformer structures face difficulties in simultaneously accommodating both local discriminability and global consistency, scholarly investigation has commenced into hybrid modeling strategies. Research by Li et al. exemplifies this direction, combining Swin Transformer with a convolutional neural network to introduce a windowed global attention mechanism while retaining local texture modeling capability, consequently improving surface defect detection stability under complex backgrounds [21]. Concerning the engineering application of industrial detection frameworks, the work of Guo et al. involves the embedding of Transformer-related modules within a YOLO network, thereby enabling the detection model to acquire enhanced global context modeling capabilities while maintaining the advantages inherent to convolutional features [22]. Furthermore, Wang et al. proposed a hybrid Transformer architecture for defect detection tasks, achieved through the collaborative modeling of local fine-grained features [23]. In a related vein, Üzen et al. constructed a multi-feature fusion network based on Swin Transformer, which verified the efficacy of fusing hierarchical Transformer and convolutional features for the representation of fine defects [24]. Within specific industrial process scenarios, Jeong et al. utilized a hybrid framework comprising ResNet-50 and Vision Transformer to accomplish steel surface defect classification, a result that illustrates the adaptability of the hybrid structure to actual industrial data [25].
While existing work provides a solid theoretical basis for feature modeling in industrial defect analysis, distinct limitations persist. Research based on convolutional neural networks has effectively verified the core contribution of local features (e.g., texture and fine-grained structure) to defect identification, offering a reliable paradigm for describing defect morphology. Conversely, studies incorporating Transformers have proven the value of global semantic modeling and long-range dependencies in supplying structural constraints and improving holistic consistency. However, prevailing methods are largely confined to high-level feature fusion or modular combination, failing to enable profound semantic collaboration between local and global features during feature extraction. They also lack an explicit mechanism to enhance the most discrimination-relevant features. In response to these shortcomings, this paper proposes a parallel dual-branch framework designed to foster deep collaboration between local and global features at intermediate layers through cross-layer semantic interaction. Integrated with a discriminative feature enhancement strategy that prioritizes key channels and spatial regions, the model aims to achieve superior stability and discriminative power for defect image classification amidst complex industrial backgrounds.

3. Research Design

3.1. Data Sources and Preprocessing

The experiments in this study are conducted on a publicly available PCB defect image dataset, which has been widely used in industrial visual inspection and defect classification research and demonstrates strong engineering representativeness in real manufacturing scenarios. The dataset contains a total of 12,000 PCB defect images, covering six typical defect categories: mouse bite, oil stain, open circuit, short circuit, missing copper, and spurious copper. These defect types correspond to different defect manifestations such as conductor edge damage, surface contamination, circuit breakage, abnormal line connections, copper layer loss, and spurious residual copper structures. The category composition of the dataset and representative examples of different defect types are illustrated in Figure 1. The sample counts of the different categories differ only within a narrow range, which helps preserve category representativeness while effectively reducing the potential impact of class imbalance on model training and performance evaluation. In addition, the dataset includes defect images with varying background complexity, defect scales, and diverse texture interference conditions, enabling the model to sufficiently learn defect distribution characteristics under different industrial inspection environments and providing a reliable data foundation for evaluating model generalization performance.
For dataset partitioning, the full sample collection was randomly divided into training, validation, and test sets following a 7:2:1 ratio. The training set was utilized for model parameter learning, the validation set for hyperparameter tuning and performance monitoring during training, and the test set reserved exclusively for final model evaluation. This division approach ensures both the objectivity and reproducibility of experimental outcomes, while also facilitating comprehensive learning of data distribution characteristics and providing a stable assessment of the model’s generalization ability.
Faced with challenges including inconsistent resolution, complex background textures, and large defect scale variations in PCB defect images, this research conducts systematic data preprocessing prior to model training. To ensure input consistency, all images are resized to a fixed resolution of 224 × 224. Additionally, pixel normalization is performed to enhance numerical stability during training. To improve model generalization, training-time augmentations—random rotation (±15°) and random horizontal/vertical flipping (probability 0.5)—are introduced. These augmentations mimic potential posture changes and imaging disturbances in actual industrial settings, thereby expanding the effective sample space and strengthening robustness against varied defect shapes and imaging conditions. Through these controlled preprocessing steps, a reliable data basis is established for the subsequent stable training and experimental analysis of the dual-branch model.

3.2. Model Construction

Given the characteristics commonly observed in PCB defect images—such as small defect regions, subtle morphological differences, complex background textures, and strong constraints imposed by the global wiring structure—single-feature modeling approaches often struggle to simultaneously capture both local discriminative information and overall structural semantics. Convolutional neural networks exhibit clear advantages in extracting fine-grained defect features, including local edges, fractures, and voids. However, their limited receptive fields make it difficult to effectively model the long-range dependencies among PCB wiring structures. In contrast, although Transformers possess strong global modeling capabilities, they are relatively less effective in capturing the subtle local discriminative details associated with small-scale defects.
In recent years, some studies have begun to explore hybrid modeling strategies that combine CNNs and Transformers in order to simultaneously exploit the strengths of local feature representation and global semantic modeling. Nevertheless, most existing hybrid models achieve information integration through simple feature concatenation or fusion at higher network layers. As a result, the interaction between local features and global semantics typically occurs only in the later stages of feature extraction, making it difficult to establish continuous semantic collaboration during the feature learning process. Under complex PCB background conditions, such late-stage fusion strategies may cause local texture noise to be mistakenly identified as defects, thereby affecting the discriminative stability between subtle defects and categories with similar morphological characteristics.
To address this issue, this paper proposes a Transformer–CNN integrated dual-branch model for PCB defect image classification (Figure 2). Through a parallel architecture, the model separately captures local defect morphology and the semantic representation of the overall wiring structure. A cross-layer semantic interaction mechanism is introduced during the feature learning stage, allowing local features to progressively incorporate global structural constraints at intermediate layers. On this basis, a discriminative feature enhancement strategy is further incorporated to reinforce key channels and spatial regions within the fused features. This design enhances the model’s defect recognition capability and classification stability under complex industrial background conditions.

3.2.1. Dual-Branch Parallel Modeling for PCB Defect Characteristics

Given an input PCB defect image $X \in \mathbb{R}^{H \times W \times C}$, where H, W, and C denote its height, width, and number of channels, respectively, the model employs two parallel branches to simultaneously capture distinct feature attributes. Specifically, a CNN branch is utilized to depict local defect morphology, while a Transformer branch models the semantics of the overall wiring structure. This parallel design enables the joint representation of both fine-grained details and global contextual information.
The CNN branch is primarily responsible for extracting local structural features from PCB images, including edge discontinuities and texture anomalies corresponding to defect regions such as open circuits, short circuits, and missing copper. This branch consists of four convolutional modules, each comprising a 3 × 3 convolutional layer, Batch Normalization, and a ReLU activation function. The numbers of channels in the convolutional modules are sequentially set to 32, 64, 128, and 256. In addition, certain convolutional layers employ a stride of 2 to perform feature map downsampling, thereby progressively enlarging the network’s receptive field and enhancing its capability to represent fine-grained defect morphologies. Under this architecture, the convolutional feature at the l-th layer can be expressed as follows:
$$F_c^{(l)} = f_c^{(l)}\!\left(F_c^{(l-1)}\right)$$
where $f_c^{(l)}(\cdot)$ represents the convolution mapping function of the l-th layer and $F_c^{(l)} \in \mathbb{R}^{H_l \times W_l \times C_l}$ denotes the feature map output by this layer, with $H_l$, $W_l$, and $C_l$ the height, width, and number of channels of the feature map at the l-th layer, respectively. This branch primarily strengthens the modeling capability for defect edge discontinuities and local texture anomalies. The Transformer branch is designed to model the global dependency relationships among PCB wiring structures. Specifically, the input image is first divided into 16 × 16 non-overlapping image patches, and each patch is transformed into a token vector through a linear projection. Under the input resolution setting of 224 × 224, the corresponding number of tokens is N = 196. The resulting token sequence is then fed into the Transformer encoder for global feature modeling. To balance model expressiveness and computational efficiency, this study adopts a lightweight Transformer architecture. The encoder consists of six encoding layers, each including a Multi-head Self-Attention module and a feed-forward network. The number of attention heads is set to 4, the feature embedding dimension is set to 256, and the hidden dimension of the feed-forward network is set to 1024. Through stacked self-attention layers, the model is able to capture long-range dependency relationships among different regions within PCB images. The output of the k-th layer can be expressed as follows:
$$F_t^{(k)} = f_t^{(k)}\!\left(F_t^{(k-1)}\right)$$
where $F_t^{(k)} \in \mathbb{R}^{N \times d}$ denotes the feature representation output by the k-th Transformer encoder layer, N represents the number of tokens, and d denotes the feature dimension of each token. This branch facilitates the capture of global structural characteristics in PCB images, such as wiring arrangements and regional layout patterns, thereby providing contextual constraints for defect identification.
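As a concrete illustration, the two branches can be sketched in PyTorch. The channel progression (32/64/128/256), 3 × 3 convolutions with BN and ReLU, the 16 × 16 patch embedding (N = 196 at 224 × 224), 6 encoder layers, 4 heads, embedding dimension 256, and feed-forward dimension 1024 follow the text; which convolutional modules use stride 2, and the learned positional embedding, are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CNNBranch(nn.Module):
    """Four conv modules with 32/64/128/256 channels, 3x3 kernels, BN + ReLU.
    Stride-2 downsampling in the last three modules is an assumption; the
    paper states only that certain layers use stride 2."""
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 256]
        layers = []
        for i in range(4):
            stride = 2 if i >= 1 else 1  # downsample in modules 2-4
            layers += [nn.Conv2d(chans[i], chans[i + 1], 3, stride, 1),
                       nn.BatchNorm2d(chans[i + 1]),
                       nn.ReLU(inplace=True)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # (B, 256, 28, 28) for a 224x224 input


class TransformerBranch(nn.Module):
    """16x16 patch embedding (N = 196 tokens at 224x224), six encoder layers,
    4 attention heads, embedding dim 256, feed-forward dim 1024."""
    def __init__(self, dim=256):
        super().__init__()
        self.patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.pos = nn.Parameter(torch.zeros(1, 196, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, x):
        tokens = self.patch(x).flatten(2).transpose(1, 2)  # (B, 196, dim)
        return self.encoder(tokens + self.pos)


x = torch.randn(2, 3, 224, 224)
f_c = CNNBranch()(x)            # local defect features
f_t = TransformerBranch()(x)    # global wiring-structure tokens
```

With stride-2 downsampling in the last three modules, the CNN branch maps a 224 × 224 input to a 28 × 28 × 256 feature map, while the Transformer branch produces the 196 × 256 token sequence used for global modeling.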

3.2.2. Cross-Layer Semantic Interaction Mechanism for Local–Global Collaboration

In PCB defect detection scenarios, the determination of whether a defect truly exists often depends not only on its local morphology but also on its relative position and semantic relationship within the overall wiring structure. In recent years, several studies have begun to adopt hybrid modeling frameworks that combine CNN and Transformer architectures in order to simultaneously leverage the advantages of convolutional structures in representing local details and the capability of self-attention mechanisms in modeling global dependencies. However, most existing hybrid models perform feature concatenation or simple fusion between CNN and Transformer features at higher network layers. Consequently, the interaction between local features and global semantics typically occurs only in the later stages of feature extraction, making it difficult to establish continuous semantic collaboration at intermediate and lower layers. Under complex PCB background conditions, such late-stage fusion strategies may cause local texture noise to be mistakenly identified as defects, thereby affecting the discriminative stability between subtle defects and categories with similar morphological characteristics. To address this issue, this paper proposes a cross-layer semantic interaction mechanism that enables deep integration of local defect information and global structural semantics during the feature extraction process. The proposed bidirectional cross-layer semantic interaction signal flow is illustrated in Figure 3.
To progressively introduce global semantic constraints during feature learning, two cross-layer semantic interaction stages are incorporated into the network. In the first stage, feature interaction is established between the second convolutional module of the CNN branch and the third encoder layer of the Transformer branch. In the second stage, semantic interaction is performed again between the third convolutional module of the CNN branch and the fifth encoder layer of the Transformer branch. By performing feature interaction twice at intermediate layers, the model progressively introduces the global wiring structural semantics learned by the Transformer branch into the local feature representation of the CNN branch. Consequently, the formation of local defect features remains continuously constrained by structural semantics, thereby reducing the likelihood that background texture noise will be misidentified as defects.
Specifically, the features of the l-th CNN layer are mapped into the Transformer feature space:
$$\tilde{F}_c^{(l)} = \phi\!\left(F_c^{(l)}\right)$$
where $\phi(\cdot)$ denotes spatial flattening followed by a linear projection that aligns the feature dimensions.
Then, $\tilde{F}_c^{(l)}$ serves as the Query and the Transformer branch feature $F_t^{(k)}$ as the Key and Value to build the cross-layer attention interaction:
$$A^{(l,k)} = \mathrm{Softmax}\!\left(\frac{\left(\tilde{F}_c^{(l)} W_Q\right)\left(F_t^{(k)} W_K\right)^{\top}}{\sqrt{d}}\right)$$
$$F_{\mathrm{int}}^{(l,k)} = A^{(l,k)}\, F_t^{(k)} W_V$$
where $W_Q$, $W_K$, and $W_V$ are learnable parameter matrices. This mechanism allows the CNN branch to incorporate global PCB wiring semantics at intermediate layers, reducing the risk that local noise is misjudged as a defect.
Meanwhile, the model establishes a reverse interaction pathway (Transformer ← CNN), enabling the Transformer branch to receive feedback of local discriminative features from the CNN branch. This design allows the Transformer branch to emphasize genuine defect regions during the global modeling process. Specifically, after mapping the CNN features into the token space, they are used as the Key and Value, while the Transformer features serve as the Query:
$$A_{\mathrm{rev}}^{(k,l)} = \mathrm{Softmax}\!\left(\frac{\left(F_t^{(k)} W_Q^{t}\right)\left(\tilde{F}_c^{(l)} W_K^{c}\right)^{\top}}{\sqrt{d}}\right)$$
$$F_{\mathrm{rev}}^{(k,l)} = A_{\mathrm{rev}}^{(k,l)}\left(\tilde{F}_c^{(l)} W_V^{c}\right)$$
The Transformer branch feature representation is finally updated as follows:
$$F_t^{(k+1)} = F_t^{(k)} + F_{\mathrm{rev}}^{(k,l)}$$
where $W_Q^{t}$, $W_K^{c}$, and $W_V^{c}$ denote learnable parameter matrices.
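Under the definitions above, the bidirectional interaction can be sketched as a pair of cross-attention passes, one per direction, with a residual update on the Transformer side. Single-head attention, a shared dimension d for both paths, and realizing φ as flattening plus one linear layer are simplifying assumptions of this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerInteraction(nn.Module):
    """Bidirectional cross-layer attention between a CNN feature map and
    Transformer tokens. Single-head attention and a shared token dimension
    are simplifying assumptions; phi(.) is flattening plus a linear map."""
    def __init__(self, cnn_channels, dim=256):
        super().__init__()
        self.phi = nn.Linear(cnn_channels, dim)   # phi: align feature dims
        # forward path (CNN queries Transformer): W_Q, W_K, W_V
        self.Wq = nn.Linear(dim, dim, bias=False)
        self.Wk = nn.Linear(dim, dim, bias=False)
        self.Wv = nn.Linear(dim, dim, bias=False)
        # reverse path (Transformer queries CNN): W_Q^t, W_K^c, W_V^c
        self.Wq_t = nn.Linear(dim, dim, bias=False)
        self.Wk_c = nn.Linear(dim, dim, bias=False)
        self.Wv_c = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** 0.5

    def forward(self, f_c, f_t):
        # f_c: (B, C, H, W) CNN features; f_t: (B, N, d) Transformer tokens
        fc_tok = self.phi(f_c.flatten(2).transpose(1, 2))        # (B, HW, d)
        # forward: CNN queries attend over Transformer keys/values
        a = F.softmax(self.Wq(fc_tok) @ self.Wk(f_t).transpose(1, 2)
                      / self.scale, dim=-1)
        f_int = a @ self.Wv(f_t)                                  # (B, HW, d)
        # reverse: Transformer queries attend over CNN keys/values,
        # followed by the residual update F_t^(k+1) = F_t^(k) + F_rev
        a_rev = F.softmax(self.Wq_t(f_t) @ self.Wk_c(fc_tok).transpose(1, 2)
                          / self.scale, dim=-1)
        f_t_new = f_t + a_rev @ self.Wv_c(fc_tok)                 # (B, N, d)
        return f_int, f_t_new
```

In the full model this block would be applied twice, at the CNN module-2/encoder-3 and CNN module-3/encoder-5 pairings described above.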

3.2.3. Feature Enhancement Module for Defect Identification

Although cross-layer semantic interaction enhances the completeness of feature representation, PCB images still contain strong background responses from numerous non-defect regions, which may interfere with the final classification decision. To further strengthen the model’s attention to defect regions, a discriminative feature enhancement module is introduced based on the cross-layer fused features, explicitly reinforcing key features associated with defect identification.
The design of this module draws upon the fundamental principle of attention mechanisms in feature recalibration. By jointly learning weights along both the channel and spatial dimensions, the module highlights feature responses that are closely related to defect discrimination. However, its primary purpose is to strengthen critical discriminative information specifically for PCB defect recognition tasks, thereby suppressing interference caused by complex background textures.
Let the feature representation after cross-layer fusion be denoted as Ff. First, feature importance weights are learned along the channel dimension as follows:
$$w_c = \sigma\!\left(\mathrm{AvgPool}(F_f) + \mathrm{MaxPool}(F_f)\right)$$
where σ(⋅) represents the Sigmoid activation function. This weight helps highlight feature channels that contribute significantly to the identification of defects such as open circuits and short circuits.
Furthermore, a spatial attention weight ws is introduced along the spatial dimension to characterize the spatial distribution patterns of defect regions within the image. This design enables the model to focus more effectively on potential defect regions. The final enhanced feature representation can be expressed as follows:
$$F_{\mathrm{enh}} = w_c \odot w_s \odot F_f$$
where $\odot$ denotes element-wise multiplication. Through this module, the model can effectively suppress interference from the PCB background wiring texture and improve the discriminability between different defect categories.
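A minimal sketch of the enhancement module follows. The channel weight implements σ(AvgPool(F_f) + MaxPool(F_f)) directly; realizing the spatial weight w_s with a CBAM-style 7 × 7 convolution over channel-pooled descriptor maps is an illustrative assumption, since the paper does not specify how w_s is computed:

```python
import torch
import torch.nn as nn

class DiscriminativeEnhancement(nn.Module):
    """Channel weight w_c = sigma(AvgPool(F_f) + MaxPool(F_f)), spatial
    weight w_s, and the combined output F_enh = w_c * w_s * F_f."""
    def __init__(self):
        super().__init__()
        # spatial attention from 2 channel-pooled maps (CBAM-style assumption)
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f_f):
        avg = f_f.mean(dim=(2, 3), keepdim=True)   # AvgPool -> (B, C, 1, 1)
        mx = f_f.amax(dim=(2, 3), keepdim=True)    # MaxPool -> (B, C, 1, 1)
        w_c = torch.sigmoid(avg + mx)              # channel weights in (0, 1)
        s = torch.cat([f_f.mean(1, keepdim=True),
                       f_f.amax(1, keepdim=True)], dim=1)
        w_s = torch.sigmoid(self.spatial(s))       # (B, 1, H, W) spatial weights
        return w_c * w_s * f_f                     # element-wise products
```

Because both weights lie in (0, 1), the module can only attenuate feature responses, which is what suppresses strong background activations relative to defect regions.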

3.2.4. Classification Decision-Making and Optimization Goals

The enhanced feature representation is subsequently fed into a fully connected classification layer, which outputs the final predicted probability distribution across all PCB defect categories.
$$\hat{y} = \mathrm{softmax}\!\left(W_f F_{\mathrm{enh}} + b_f\right)$$
During the model training process, the cross-entropy loss function is used as the optimization goal:
$$L = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$
where $y_i$ denotes the one-hot ground-truth indicator of the i-th defect category, $\hat{y}_i$ represents the predicted probability of the i-th class generated by the model, and C denotes the total number of defect categories. Through end-to-end training, the model is able to more accurately accomplish the PCB defect image classification task under the synergistic effects of cross-layer semantic interaction and discriminative feature enhancement.
The model training procedure is configured with a maximum of 150 epochs, employing the Adam optimizer for parameter updates with an initial learning rate of 1 × 10−4. An EarlyStopping mechanism is implemented to halt training if the validation loss fails to improve for 15 consecutive epochs, thereby mitigating the risk of overfitting. Additional hyperparameters include a batch size of 32 and a weight decay coefficient of 1 × 10−5. This training strategy ensures robust convergence and comprehensive learning of PCB defect features, thereby establishing a reliable and reproducible foundation for subsequent experimental validation.
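The training configuration above can be sketched as a simple loop: Adam with learning rate 1 × 10⁻⁴ and weight decay 1 × 10⁻⁵, cross-entropy loss, at most 150 epochs, and early stopping with patience 15 on the validation loss. Keeping the best weights in memory rather than on disk is an implementation choice of this sketch:

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, device="cpu"):
    """Sketch of the stated training strategy: Adam (lr 1e-4, weight decay
    1e-5), up to 150 epochs, early stopping after 15 epochs without
    validation-loss improvement."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
    loss_fn = nn.CrossEntropyLoss()
    best = float("inf")
    best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    patience = 0
    for epoch in range(150):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x.to(device)), y.to(device)).item()
                      for x, y in val_loader) / max(len(val_loader), 1)
        if val < best - 1e-6:
            best, patience = val, 0   # new best: reset patience, keep weights
            best_state = {k: v.detach().clone()
                          for k, v in model.state_dict().items()}
        else:
            patience += 1
            if patience >= 15:        # early stopping
                break
    model.load_state_dict(best_state)  # restore the best checkpoint
    return best
```

The loaders are expected to yield (image batch, label batch) pairs, e.g. from a `DataLoader` with batch size 32 as stated above.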

3.3. Evaluation Indicators

An evaluation system is constructed to thoroughly assess the performance of the proposed Transformer–CNN dual-branch model in classifying PCB defect images. The system is designed along two key dimensions: overall classification accuracy, which measures general recognition performance, and category-level discrimination ability, which evaluates stability across individual defect types. Together, these metrics provide a multi-faceted view of the model’s effectiveness. The specific indicators are listed below.

3.3.1. Accuracy

Accuracy serves as an indicator of the alignment between the model’s predictions and the true defect category labels. It provides a holistic reflection of how effectively the model has learned the overall distributional characteristics present in the PCB defect image dataset. The formula for calculating accuracy is as follows:
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
Here, TP, TN, FP, and FN correspond to the following: TP refers to the number of samples correctly identified as the positive class, TN signifies the number of samples correctly identified as the negative class, FP denotes the number of samples incorrectly identified as the positive class, and FN represents the number of samples incorrectly identified as the negative class.

3.3.2. Precision

To evaluate the reliability of the model's predictions for a specific defect type, precision is calculated as the ratio of correctly predicted instances to all instances predicted as that class.
\text{Precision} = \frac{TP}{TP + FP}
Here, TP denotes the number of target defect samples correctly identified, and FP denotes the number of samples misassigned to that defect category.

3.3.3. Recall

The recall rate represents the proportion of samples from a specific defect category that are correctly identified by the model. This metric reflects the model’s ability to cover or retrieve defective samples within that category. Its calculation formula is defined as follows, where TP denotes true positives and FN denotes false negatives for the category in question:
\text{Recall} = \frac{TP}{TP + FN}
Here, FN denotes the number of samples that actually belong to the target defect category but are not correctly recognized by the model.

3.3.4. F1-Score

The F1-score is the harmonic mean of precision and recall, balancing the two to evaluate the model's overall classification performance. It is calculated as:
\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
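All four indicators follow directly from the confusion counts. The pure-Python sketch below illustrates the definitions for a single defect class treated one-vs-rest; the counts are hypothetical:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall, and F1 (harmonic mean of
    the latter two) from confusion counts for one class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for one defect class out of 1000 test samples:
m = classification_metrics(tp=90, tn=880, fp=12, fn=18)
```

In multi-class settings these per-class values are typically averaged (macro or weighted) to yield the single figures reported in Section 4.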

4. Research Results and Analysis

4.1. Model Performance Evaluation

4.1.1. Convergence Analysis

To analyze the convergence behavior and learning stability of the model during training, Figure 4 presents the loss curves of the CNN baseline model and the Transformer–CNN dual-branch model during the training process. The figure simultaneously illustrates the variation trends of both training loss and validation loss. The two models were trained under identical conditions, including the same data partitioning, optimizer configuration, and training strategy, thereby ensuring the comparability of the results.
From the overall trend, both models exhibit a pronounced decrease in loss during the early stages of training, indicating that the network parameters can be effectively updated under random initialization and that the optimization process does not suffer from gradient divergence or numerical instability. As the number of training epochs increases, both the training loss and validation loss gradually decrease and eventually enter a relatively stable phase in the later stages. This behavior suggests that the models progressively learn the underlying distribution characteristics of the data and reach a stable convergence state. Meanwhile, the validation loss curve follows a trend largely consistent with that of the training loss, without significant divergence or oscillation, indicating that no evident overfitting occurs during the training process and that the model maintains good generalization stability.
A further comparison of the convergence behavior between the two models reveals that the Transformer–CNN dual-branch model demonstrates a faster loss reduction in the early training stage and maintains a smoother convergence trajectory in the later stages. Compared with the single CNN model, its training and validation loss curves remain consistently lower and exhibit smaller fluctuations. This indicates that, after introducing global semantic modeling and cross-layer feature interaction mechanisms, the model is able to learn defect-related features more effectively, thereby improving both optimization stability and learning efficiency. In addition, during the training process, when the validation loss fails to decrease for several consecutive epochs, the Early Stopping mechanism automatically terminates the training procedure, preventing potential overfitting in the later training stages. Overall, the Transformer–CNN dual-branch architecture enhances the model’s representational capacity while still maintaining a stable optimization process and favorable convergence characteristics.

4.1.2. Overall Performance Analysis

To systematically evaluate the comprehensive performance of the proposed Transformer–CNN dual-branch model in PCB defect image classification tasks, this study selects several representative models as baselines for comparison. The selection encompasses distinct architectural paradigms: ResNet-50 serves as a representative of pure convolutional structures, which excel at modeling local texture and fine-grained defect features; ViT-B/16 represents a pure Transformer architecture, relying on self-attention mechanisms to capture long-range dependencies across image regions; Conformer is included as a prominent CNN-Transformer hybrid model that jointly models local discriminative features and global semantic information within a unified framework. Furthermore, considering the prevalent application of hybrid architectures based on hierarchical window attention in industrial defect detection, the SwinTransformer–CNN model is also introduced for comparison. This allows for an assessment of the performance differences among various hybrid modeling strategies in the specific context of PCB defect classification.
As demonstrated in Figure 5, the proposed Transformer–CNN dual-branch model achieves the best overall performance in the PCB defect image classification task. In terms of overall accuracy, it attains a score of 0.938, which exceeds that of ResNet-50, ViT-B/16, and Conformer by approximately 2.6%, 2.0%, and 1.2%, respectively, while also outperforming the SwinTransformer–CNN hybrid model. This outcome indicates that, building upon the parallel modeling of convolutional and Transformer features, the incorporated cross-layer semantic interaction mechanism effectively introduces global structural semantic constraints at intermediate feature stages. Consequently, this design mitigates the interference from complex wiring backgrounds on classification decisions and enhances overall discrimination consistency.
Regarding the Recall metric, as shown in Figure 5, the proposed model achieves a score of 0.927, surpassing ResNet-50 by more than 3% and maintaining the highest level among all Transformer-based and hybrid architectures evaluated. This indicates that the model possesses superior coverage in identifying defect samples and effectively reduces the risk of missing subtle defects. The results for Precision and the comprehensive F1-score further corroborate this finding. The model’s F1-score reaches 0.930, consistently demonstrating a stable advantage over Conformer and SwinTransformer–CNN. These results collectively demonstrate that the model achieves a synergistic optimization between Precision and Recall, enhancing defect identification completeness without compromising classification accuracy, thereby effectively balancing false alarm control and detection integrity in complex PCB defect scenarios.

4.2. Computational Complexity Analysis

In industrial visual inspection tasks, a model is required not only to achieve high classification accuracy but also to satisfy the practical requirements of real production systems in terms of computational efficiency and deployment cost. Therefore, in addition to classification performance metrics, a quantitative evaluation of computational complexity is essential for assessing the engineering feasibility of the model. To this end, this study further analyzes the computational cost of the proposed Transformer–CNN dual-branch model from three aspects: model parameter scale (Parameters), floating-point operations (FLOPs), and inference speed (Frames Per Second, FPS). These metrics are also compared with several representative baseline models. Specifically, the parameter scale reflects the structural complexity of the network, FLOPs represent the computational workload required for a single forward inference, and FPS evaluates the runtime efficiency of the model in practical detection systems. All models were tested under the same hardware environment to ensure the comparability of the results. The comparison results of computational complexity are presented in Table 1.
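The FPS figure can be estimated with a simple timing harness. The sketch below assumes a zero-argument `infer_fn` wrapping one single-image forward pass; the warm-up convention (excluding initial iterations from timing, common when benchmarking on GPU) is an assumption, not a protocol stated in the paper:

```python
import time

def measure_fps(infer_fn, n_warmup=10, n_runs=100):
    """Estimate inference speed in frames per second.

    Warm-up iterations are excluded from the timed window so that
    one-time costs (caching, kernel compilation) do not skew the result.
    """
    for _ in range(n_warmup):
        infer_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer_fn()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

# Parameter scale for a PyTorch model (hypothetical `model`):
# n_params = sum(p.numel() for p in model.parameters())
```

FLOPs are usually obtained from a profiling tool rather than by hand, since they depend on the exact layer configuration.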
The traditional convolutional network ResNet-50 has relatively lower parameter scale and computational cost, which enables higher inference speed. However, such models mainly rely on local convolutional feature modeling, and their capability for capturing global semantic relationships is relatively limited in scenarios involving complex PCB wiring structures. In contrast, pure Transformer architectures (such as ViT-B/16) possess stronger global dependency modeling capabilities, but their parameter scale and computational overhead increase significantly, leading to reduced inference efficiency.
For hybrid architecture models, such as Conformer and SwinTransformer–CNN, convolutional features and self-attention mechanisms are combined to achieve collaborative modeling of local and global information while maintaining relatively high classification performance. Their parameter scale and computational cost therefore remain at a moderate level. The Transformer–CNN dual-branch model proposed in this study introduces a cross-layer semantic interaction mechanism and a discriminative feature enhancement module. Although these structural improvements slightly increase the parameter scale compared with conventional hybrid models, the overall computational cost remains within an acceptable range.
From the perspective of inference efficiency, the proposed model achieves an inference speed of 103.7 FPS in the experimental environment, which satisfies the basic real-time processing requirements of most industrial visual inspection systems. Overall, although the introduction of a dual-branch architecture and cross-layer interaction mechanisms increases the computational complexity to a certain extent, the computational cost remains significantly lower than that of pure Transformer architectures. At the same time, the model achieves notable improvements in classification performance, thereby attaining a balanced trade-off between performance and efficiency and providing practical feasibility for deployment in real-world PCB defect detection scenarios.

4.3. Ablation Experiment

To verify the independent contributions of the proposed cross-layer semantic interaction mechanism and the discriminative feature enhancement strategy to model performance improvement, a series of ablation experiments were conducted based on the complete Transformer–CNN dual-branch model. The corresponding results are presented in Table 2. All model variants were trained under identical conditions, including the same data partitioning, training strategy, and parameter configurations, in order to ensure the fairness and interpretability of the comparisons. To further improve the stability and reproducibility of the experimental results, each model was trained five times under different random seed settings. The mean and standard deviation of each evaluation metric were then calculated, and the final results are reported in the form of “mean ± std”.
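The "mean ± std" reporting protocol can be reproduced with the standard library. Whether the sample or population standard deviation was used is not stated, so the sketch below assumes the sample variant; the five accuracy values are hypothetical:

```python
import statistics

def summarize_runs(scores):
    """Aggregate a metric over repeated runs (e.g., five random seeds)
    into the 'mean ± std' form used in Table 2. Sample standard
    deviation is assumed."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return f"{mean:.3f} ± {std:.3f}"

# Hypothetical accuracies from five seeds:
summary = summarize_runs([0.936, 0.939, 0.938, 0.937, 0.940])
```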
The performance of a dual-branch baseline model, which lacks both cross-layer semantic interaction and discriminative feature enhancement, is relatively limited on key metrics like Accuracy, Recall, and F1-score. This outcome underscores that simple parallel fusion of CNN and Transformer features cannot fully address the challenge of fine-grained defect identification in complex PCB images. Ablation by removing the cross-layer semantic interaction mechanism from the complete model results in a marked performance decrease, with Accuracy and F1-score both falling short of the full model’s standard. This ablation result highlights the mechanism’s essential function: it actively fosters collaboration between local defect information and global structural semantics in the feature extraction phase. By embedding structural constraints into intermediate feature layers, the mechanism significantly boosts the consistency and stability of defect discrimination.
When the discriminative feature enhancement module is individually removed (w/o discriminative feature enhancement), model performance suffers a distinct decline, especially affecting Precision and F1-score. This underscores the module’s utility in mitigating background texture interference and intensifying the focus on defect-discriminative channels and regions, which bolsters classification reliability. By comparison, the complete model—integrating both the cross-layer semantic interaction mechanism and the discriminative feature enhancement strategy—secures the highest scores across all metrics. These findings indicate that the two modules are not simply functionally stacked; instead, they provide complementary contributions. One operates at the level of local–global semantic integration, while the other specializes in discriminative information highlighting. Their joint implementation leads to a concerted improvement in the model’s overall classification efficacy for complex PCB defects.

4.4. Image Classification Results

4.4.1. Defect Category Classification Results

To further analyze the model’s discriminative performance across different defect categories, Figure 6 presents the normalized confusion matrix of the proposed Transformer–CNN dual-branch model for the PCB defect classification task. The matrix is row-normalized according to the ground-truth categories, meaning that each row represents the proportional distribution of samples from a given class across the predicted categories. This normalization allows a more intuitive interpretation of the model’s recognition performance for each defect type. From the overall distribution, all defect categories exhibit a clear diagonal dominance within the confusion matrix. This pattern indicates that the model’s predictions for different categories are highly concentrated and consistent, with no evident category bias or systematic misclassification. These results demonstrate that the dual-branch feature modeling framework, together with the cross-layer semantic interaction mechanism, is able to maintain stable discriminative capability in multi-category PCB defect classification scenarios.
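The row normalization used in Figure 6 can be sketched as follows; the 3-class counts are hypothetical, chosen only to show the diagonal-dominant pattern described above:

```python
def row_normalize(confusion):
    """Row-normalize a confusion matrix (rows = ground-truth classes)
    so each row sums to 1, i.e., each entry becomes the fraction of a
    class's samples assigned to each predicted category."""
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Hypothetical counts for three defect classes (50 samples each):
cm = [[47, 2, 1],
      [3, 44, 3],
      [1, 2, 47]]
cm_norm = row_normalize(cm)
```

The diagonal of the normalized matrix then directly reads as the per-class recall.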
At the level of specific defect types, categories with pronounced geometric structural characteristics—such as mouse bites, open circuits, and short circuits—demonstrate relatively distinct discrimination boundaries. This is because such defects typically involve clear edge discontinuities or local structural anomalies. The convolutional branch of the model effectively captures these local discriminative features, which in turn reduces confusion with other defect types during the classification process.
Conversely, oil stains, missing copper, and spurious copper exhibit some overlap in visual form and texture, causing their identification to rely more heavily on the semantic context between the defect region and the global wiring pattern. Analysis of the confusion matrix reveals residual inter-class confusion within this group. It is noteworthy, however, that erroneous predictions are largely limited to categories with analogous appearances and do not diffuse to structurally dissimilar defects. This observation demonstrates the model’s effectiveness in limiting discrimination errors within a manageable range, even amidst complex backgrounds.
The aforementioned results demonstrate that distinct defect categories impose varying demands on feature modeling approaches during classification. The convolutional branch is pivotal for extracting localized anomalous structures, whereas the Transformer branch contributes effective contextual constraints—particularly for texture-based and structurally subtle defects—by incorporating global wiring semantics. Through the synergistic operation of the cross-layer semantic interaction mechanism and the discriminative feature enhancement strategy, these two branches complement each other. This synergy enables the model to concurrently address both local discriminability and global consistency in multi-category PCB defect classification, thereby enhancing classification stability at the category level.

4.4.2. Sample Condition Hierarchical Classification Results

To further examine the classification stability of the model under different sample conditions, this study groups the test samples according to the background complexity of PCB defect images. Specifically, the samples in the test set are manually annotated and categorized based on wiring density, background texture complexity, and the degree of interference from non-defect regions. When the image background exhibits relatively regular textures, simpler wiring structures, and limited interference, the sample is classified as a simple-background sample. Conversely, when the image contains dense wiring structures or noticeable complex texture interference, it is categorized as a complex-background sample.
As a result, the test set contains 602 simple-background samples and 598 complex-background samples, with the distribution of defect categories remaining approximately balanced between the two groups. A statistical analysis of the predicted category distributions under these two conditions was conducted, and the results are illustrated in Figure 7. From the perspective of overall distribution patterns, under the condition of relatively simple background textures, the prediction results for each defect category exhibit clear and concentrated distribution characteristics. The proportional structure among different categories remains stable, with no obvious category bias or excessive concentration of predictions within a small number of classes. This observation indicates that the model is able to effectively preserve its capability to distinguish among different defect morphologies in imaging environments with relatively weak interference.
As background complexity increases, the overall morphology of the predicted distribution for each defect category retains a high degree of alignment with that observed under simple background conditions. Proportional shifts are minor and primarily confined to defect categories sharing similar texture characteristics, such as oil stains, missing copper, and spurious copper, which exhibit slight fluctuations in their predicted proportions. Importantly, these variations are largely restricted to classes with analogous visual morphology and do not extend to defect types characterized by substantially different structural features. This pattern indicates that the model successfully maintains reasonable discriminative boundaries even in the presence of complex wiring and texture interference.
This study further calculates classification evaluation metrics, including Precision, Recall, and F1-score, separately for the simple-background and complex-background samples, in order to provide a more comprehensive assessment of the model’s recognition capability under different background complexity conditions. The corresponding results are presented in Table 3. It can be observed that under relatively simple background conditions, the model achieves relatively high values for Precision, Recall, and F1-score. When the background complexity increases, these metrics exhibit slight decreases; however, the overall variations remain relatively small. This indicates that the model is able to maintain stable classification performance even in the presence of complex background interference.
The aforementioned results demonstrate that the Transformer–CNN dual-branch model exhibits strong predictive consistency and structural stability across varying levels of background complexity. This robustness stems from the complementary roles of its two branches: the convolutional branch, with its sensitive modeling of local defect morphology, helps preserve the fundamental discriminative structure across defect categories. Concurrently, the Transformer branch, by introducing global wiring semantics, provides effective contextual constraints for local discrimination, enabling the model to suppress irrelevant texture interference even as background complexity escalates. Collectively, these attributes endow the model with considerable robustness under complex industrial imaging conditions, forming a solid foundation for its practical deployment in real-world PCB defect detection scenarios.
To further provide an intuitive analysis of the model’s classification behavior under complex scenarios and its ability to correct erroneous predictions, two representative types of misclassification cases are selected for qualitative comparison (Figure 8). In Case 1 (visually similar defects), oil stain defects exhibit high similarity to spurious copper defects in terms of color distribution and local texture characteristics, leading the baseline model to misclassify them as spurious copper. In contrast, the proposed model correctly identifies the oil stain defect under the same input conditions, thereby avoiding category confusion caused by visual similarity. In Case 2 (complex background interference), the mouse bite defect is located in a region with dense wiring structures, where background textures and noise interfere with defect edge features, resulting in misclassification as an oil stain by the baseline model. Under the same input conditions, the proposed model produces the correct prediction for the mouse bite defect, indicating that it maintains stable discriminative behavior under complex background conditions.

5. Discussion

Against the backdrop of increasing demands for high precision and stability in industrial visual inspection, this study constructs a Transformer–CNN integrated dual-branch classification model. Through a parallel architecture, the model separately characterizes local defect morphology and the semantic relationships of the overall wiring structure. During the feature learning process, a cross-layer semantic interaction mechanism is introduced, allowing local features to be constrained by global information at the early stages of representation formation, thereby improving the consistency of defect discrimination under complex background conditions.
In recent studies on PCB defect recognition, integrating local detail modeling with global dependency modeling has gradually become a major strategy for improving model robustness. An & Zhang (2022) [26] introduced the Vision Transformer into PCB image classification and proposed the LPViT model, demonstrating that the global self-attention mechanism helps capture long-range dependencies among PCB wiring structures and improves contextual consistency in defect identification. This conclusion is consistent with the emphasis in this study on the constraining role of global structural semantics in defect recognition. However, their method mainly relies on a single Transformer pathway for modeling, which is relatively limited in capturing the fine local discriminative details required for small-scale defects. In contrast, the present study introduces a parallel CNN branch and performs semantic interaction at intermediate layers, enabling local features to be guided by global information during the learning process. Chen et al. (2022) [27] integrated a Transformer module into the YOLO architecture and verified that global modeling can effectively suppress false detections under complex background conditions, which also supports the introduction of Transformer mechanisms to mitigate local texture interference. Nevertheless, that study mainly introduces global information at higher feature levels and lacks a detailed characterization of the collaborative relationship between local and global features. In this work, a cross-layer interaction mechanism is employed to achieve bidirectional semantic collaboration, allowing the two types of features to be continuously integrated throughout the entire feature extraction process rather than simply combined at the output stage. Feng & Cai (2023) [28] further indicated that simultaneously utilizing local details and global dependency information is crucial for improving the stability of PCB defect detection. 
Although their findings are highly consistent with the motivation of this study, their approach relies more on parallel modeling and feature concatenation, with relatively limited constraints on semantic consistency across different feature layers. In comparison, the present work further introduces cross-layer semantic interaction and a discriminative feature enhancement strategy, enabling the representation of local anomalies to be effectively constrained by overall structural semantics and thereby forming a more targeted methodological improvement.
From a theoretical perspective, this study expands the research framework of industrial defect image classification from the viewpoint of collaborative feature modeling. Existing research generally focuses on either convolutional feature modeling or global attention mechanisms through a single modeling pathway, or performs feature fusion only at higher network layers, while paying relatively limited attention to multi-level semantic interaction during the feature learning process. By introducing a cross-layer semantic interaction mechanism into the CNN–Transformer hybrid framework, the proposed model allows local defect features to progressively incorporate global structural semantic constraints during learning, thereby overcoming the limitations of traditional late-fusion strategies. Meanwhile, the discriminative feature enhancement strategy designed for defect recognition tasks further strengthens critical discriminative information within fused features, thereby improving the stability of defect identification under complex industrial background conditions.
Although the model demonstrates strong performance in the overall classification task, the confusion matrix still reveals a certain degree of misclassification among some defect categories. For example, texture-related defects such as oil stains, missing copper, and spurious copper exhibit high similarity in color distribution and local texture characteristics, which may still lead to some category confusion under complex wiring background conditions. This phenomenon indicates that when the defect regions are relatively small or the texture differences are subtle, local texture information may still be affected by background structures, thereby influencing the final classification decision. It also suggests that even with the introduction of cross-layer semantic interaction, the model still faces certain challenges in distinguishing highly similar defect categories, reflecting the inherent complexity of defect recognition in industrial visual inspection scenarios.
This study also has several limitations. The research is primarily conducted based on a single publicly available PCB defect dataset. Although this dataset possesses certain engineering representativeness in industrial visual inspection studies and contains multiple defect types as well as samples with varying background complexity, its imaging conditions and defect distributions cannot fully cover the complex situations encountered in real industrial production environments. Therefore, the scope of experimental validation remains somewhat limited. The model is evaluated only on this dataset, and its generalization ability across different datasets or scenarios still requires further systematic verification. The current experiments mainly examine the model’s stability through analyses of different defect categories and stratified experiments under varying background complexity conditions. Although the results indicate that the model maintains relatively stable classification performance across different sample conditions, these findings primarily reflect the model’s behavior under the current data distribution. In addition, the analysis in this work mainly focuses on classification performance, while the adaptability of the model to downstream tasks—such as defect localization or multi-task joint detection—has not been systematically investigated, which to some extent limits the generalizability of the research conclusions.

6. Conclusions

Centering on the typical attributes of PCB defect images, this paper constructs a Transformer–CNN integrated dual-branch image classification model. By implementing cross-layer semantic interaction and discriminative feature enhancement mechanisms, the model’s defect recognition capability in complex industrial environments is enhanced. The principal conclusions are as follows:
  • The Transformer–CNN dual-branch model exhibits superior comprehensive classification performance compared to several mainstream methods. Evaluated under the same settings, it achieves Accuracy gains of about 2.6%, 2.0%, 1.2%, and 0.9% against ResNet-50, ViT-B/16, Conformer, and SwinTransformer–CNN. With an F1-score of 0.930 and a Recall of 0.927, the model substantially lowers the likelihood of missed detections. Importantly, it retains a high Precision rate, demonstrating no significant surge in misjudgments. This outcome suggests an effective equilibrium between precise classification, reliable predictions, and comprehensive defect coverage.
  • From a category-level analysis perspective, the model demonstrates clear and stable discrimination boundaries for structural defects like mouse bites, open circuits, and short circuits. Concurrently, confusion among texture-similar defects—such as oil stains, missing copper, and spurious copper—is effectively contained. Results from the normalized confusion matrix reveal that misclassifications are predominantly confined to categories sharing high visual similarity, with no widespread confusion occurring across structurally distinct types. This indicates that the cross-layer semantic interaction mechanism actively contributes to maintaining the stability of the inter-category discrimination framework.
  • In stratified experiments conducted under varying levels of background complexity, the model’s predicted category distributions remain largely consistent. Only minor fluctuations are observed in the prediction of texture-based defects, with no significant category bias emerging. These results collectively indicate that the proposed model retains strong robustness even as background interference intensifies. Consequently, it demonstrates the adaptability required to meet the application demands of real-world industrial imaging environments, where complex wiring and substantial noise often coexist.
Future work can be pursued along several directions to advance the model’s applicability. First, research should be conducted under conditions more closely aligned with real industrial settings by incorporating multi-source PCB defect data from diverse production lines, imaging devices, or process stages. This would enable a systematic evaluation of the model’s feature transfer capability and discriminative stability across different scenarios, specifically testing the adaptability of the cross-layer semantic interaction mechanism to shifts in complex data distributions. Second, at the architectural level, exploring more lightweight feature interaction methods or parameter-sharing strategies could help reduce computational overhead while preserving the benefits of local–global collaborative modeling, thereby enhancing suitability for resource-constrained or time-sensitive online detection scenarios. Lastly, the current classification-focused feature representation could be extended to downstream tasks such as defect localization or joint classification-detection modeling. Investigating the potential of cross-layer semantic interaction features for improving spatial localization accuracy and facilitating multi-task collaborative learning would further broaden the model’s applicability within practical industrial vision systems.

Author Contributions

Conceptualization, L.Q. and H.B.; methodology, L.Q.; software, L.Q.; validation, L.Q., H.B. and F.L.; formal analysis, L.Q.; investigation, L.Q.; resources, L.Q.; data curation, L.Q.; writing—original draft preparation, L.Q.; writing—review and editing, L.Q.; visualization, L.Q.; supervision, H.B.; project administration, H.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

These data were derived from the following resources available in the public domain: https://codeload.github.com/YMkai/PCB_Datasets/zip/refs/heads/main, accessed on 17 September 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACC     Accuracy
Adam    Adaptive Moment Estimation
CNN     Convolutional Neural Network
FLOPs   Floating Point Operations
FN      False Negative
FP      False Positive
FPS     Frames Per Second
LPViT   Lightweight Pyramid Vision Transformer
PCB     Printed Circuit Board
ResNet  Residual Network
TN      True Negative
TP      True Positive
ViT     Vision Transformer
YOLO    You Only Look Once

Figure 1. Overview of the PCB defect dataset. (a) Category distribution of the PCB defect dataset; (b) Representative image samples of six typical PCB defect types.
Figure 2. Transformer–CNN structure diagram.
Figure 3. Bidirectional cross-layer semantic interaction signal flow between CNN and Transformer branches.
Figure 4. Training and validation loss curves of the compared models.
Figure 5. Model comparison results.
Figure 6. Classification confusion of different PCB defect categories.
Figure 7. PCB defect prediction category distribution under different background complexity conditions.
Figure 8. Comparison of error correction between the baseline and the proposed model.
Table 1. Computational complexity comparison of different models.

Model                   Parameters (M)   FLOPs (G)   FPS
ResNet-50               25.6             4.1         165.3
ViT-B/16                86.6             17.6        78.4
Conformer               27.4             7.9         121.6
Swin Transformer–CNN    28.3             8.7         110.2
Transformer–CNN         31.5             9.6         103.7
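For context, FPS figures such as those in Table 1 are typically obtained by timing repeated forward passes after a warm-up phase. A minimal sketch, with a dummy callable standing in for the model's forward pass (the paper does not publish its benchmarking code):

```python
import time

def measure_fps(infer, n_warmup=10, n_runs=100):
    """Estimate single-image throughput (FPS) of an inference callable
    by timing repeated calls after a warm-up phase that excludes
    one-off costs (cache/model initialization) from the measurement."""
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

# Dummy stand-in: a "forward pass" that sleeps for ~1 ms, so the
# measured FPS is bounded above by 1000 (actual value varies with
# the platform's timer resolution).
fps = measure_fps(lambda: time.sleep(0.001), n_warmup=2, n_runs=20)
print(f"{fps:.1f} FPS")
```

In practice, `infer` would wrap the model's forward pass on a fixed-size input, with any GPU synchronization included inside the timed region.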
Table 2. Ablation experiment results.

Model                                    Accuracy        Precision       Recall          F1-Score
Dual-branch baseline                     0.912 ± 0.003   0.901 ± 0.004   0.894 ± 0.005   0.897 ± 0.004
w/o cross-layer semantic interaction     0.921 ± 0.003   0.912 ± 0.003   0.907 ± 0.004   0.909 ± 0.003
w/o discriminative feature enhancement   0.926 ± 0.002   0.918 ± 0.003   0.914 ± 0.003   0.916 ± 0.003
Full model                               0.938 ± 0.002   0.929 ± 0.002   0.927 ± 0.003   0.930 ± 0.002
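As a quick arithmetic check on the ablation results (mean values copied from Table 2), the full model's accuracy gain over the dual-branch baseline is 0.938 − 0.912 = 0.026, i.e., 2.6 percentage points, consistent with the "over 2%" improvement stated in the abstract:

```python
# Mean accuracies copied from Table 2.
baseline_acc = 0.912   # dual-branch baseline
no_interact  = 0.921   # w/o cross-layer semantic interaction
no_enhance   = 0.926   # w/o discriminative feature enhancement
full_acc     = 0.938   # full model

gain = full_acc - baseline_acc          # overall gain of the full model
interact_contrib = full_acc - no_interact  # accuracy lost without interaction
enhance_contrib  = full_acc - no_enhance   # accuracy lost without enhancement

print(round(gain, 3))              # 0.026
print(round(interact_contrib, 3))  # 0.017
print(round(enhance_contrib, 3))   # 0.012
```

The per-module deltas (1.7 and 1.2 points) indicate both components contribute, with the cross-layer semantic interaction accounting for the larger share.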
Table 3. Performance under different background complexity.

Background Condition   Precision   Recall   F1-Score
Simple background      0.931       0.928    0.929
Complex background     0.925       0.921    0.923
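The F1-scores in Table 3 can be verified as the harmonic mean of the listed precision and recall, F1 = 2PR/(P + R), which both rows satisfy to three decimals:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from Table 3.
print(round(f1(0.931, 0.928), 3))  # 0.929 (simple background)
print(round(f1(0.925, 0.921), 3))  # 0.923 (complex background)
```

The small gap between the two rows (0.006 in F1) quantifies the robustness claim: moving from simple to complex backgrounds costs well under one point.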
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Qin, L.; Bao, H.; Liu, F. A Transformer–CNN Dual-Branch Image Classification Model—Cross-Layer Semantic Interaction and Discriminative Feature Enhancement Algorithm. Symmetry 2026, 18, 527. https://doi.org/10.3390/sym18030527

