Article

A Dual-Structured Convolutional Neural Network with an Attention Mechanism for Image Classification

1 School of Information and Control Engineering, Southwest University of Science and Technology, Mianyang 621010, China
2 CAEA Innovation Center of Nuclear Environmental Safety Technology, Mianyang 621010, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(19), 3943; https://doi.org/10.3390/electronics14193943
Submission received: 8 August 2025 / Revised: 17 September 2025 / Accepted: 30 September 2025 / Published: 5 October 2025
(This article belongs to the Special Issue Advances in Object Tracking and Computer Vision)

Abstract

This paper presents a dual-structured convolutional neural network (CNN) for image classification, which integrates two parallel branches: CNN-A with spatial attention and CNN-B with channel attention. The spatial attention module in CNN-A dynamically emphasizes discriminative regions by aggregating channel-wise information, while the channel attention mechanism in CNN-B adaptively recalibrates feature channel importance. The extracted features from both branches are fused through concatenation, enhancing the model’s representational capacity by capturing complementary spatial and channel-wise dependencies. Extensive experiments on a 12-class image dataset demonstrate the superiority of the proposed model over state-of-the-art methods, achieving 98.06% accuracy, 96.00% precision, and 98.01% F1-score. Despite a marginally longer training time, the model exhibits robust convergence and generalization, as evidenced by stable loss curves and high per-class recognition rates (>90%). The results validate the efficacy of dual attention mechanisms in improving feature discrimination for complex image classification tasks.

1. Introduction

1.1. Background

Image classification stands as one of the most fundamental and widely studied tasks in the field of computer vision, serving as the backbone for numerous advanced applications [1,2] such as object detection, medical imaging, and autonomous driving. The primary objective of image classification is to assign a given input image to one of several predefined categories based on its visual content. Over the past decade, the rapid evolution of deep learning, particularly convolutional neural networks (CNNs), has revolutionized this domain by enabling machines to achieve human-level accuracy in many benchmark datasets [3,4]. Traditional CNNs rely on hierarchical feature extraction, where successive layers progressively capture low-level features (e.g., edges and textures) and high-level semantic representations [5] (e.g., object parts and entire entities). However, despite their remarkable success, conventional CNNs often face limitations in efficiently modeling long-range dependencies and dynamically emphasizing the most discriminative regions or channels within an image [6]. This shortcoming has spurred extensive research into more sophisticated architectures, among which attention mechanisms have emerged as a transformative innovation.
Attention mechanisms, originally popularized in natural language processing (NLP) through models like the Transformer, have been increasingly adapted to computer vision tasks due to their ability to selectively focus on the most relevant parts of the input data [7]. Unlike standard convolutional operations, which apply fixed filters uniformly across an entire image, attention mechanisms enable adaptive weighting of spatial locations or feature channels, thereby enhancing the model’s capacity to prioritize informative regions while suppressing irrelevant or redundant information. For instance, in fine-grained image classification, where subtle differences distinguish categories (e.g., bird species or car models), attention modules can help the network concentrate on distinguishing features [8] such as beak shapes or wing patterns. Similarly, in cluttered or occluded scenes, attention can dynamically adjust the receptive field to emphasize the visible portions of objects, significantly improving robustness. The integration of attention mechanisms into CNNs has thus become a pivotal research direction, bridging the gap between local feature extraction and global contextual understanding.
Recent studies have demonstrated that attention mechanisms can substantially enhance the performance of CNNs by addressing several inherent limitations. One key advantage is their ability to model long-range spatial relationships, which traditional convolutions struggle with due to their localized receptive fields. For example, in large-scale images where critical features may be dispersed across distant regions [9], self-attention or non-local operations can directly capture these dependencies without relying on deep stacking of convolutional layers. Another benefit lies in channel-wise attention, which adaptively recalibrates the significance of different feature channels [10], allowing the network to amplify informative features while suppressing noise. Techniques such as Squeeze-and-Excitation (SE) networks and Convolutional Block Attention Modules (CBAM) have empirically proven that channel and spatial attention can lead to consistent accuracy gains across diverse datasets. Furthermore, attention mechanisms introduce a dynamic and interpretable element into CNNs, enabling researchers to visualize which regions or features the model deems important for decision-making—a valuable property for debugging and real-world deployment.
Despite these advancements, the integration of attention mechanisms into CNNs is not without challenges. One major consideration is the computational overhead introduced by attention operations, particularly for high-resolution images or real-time applications. While vanilla self-attention exhibits quadratic complexity with respect to input size, numerous efficient variants (e.g., axial attention, shifted windows, or local attention blocks) have been proposed to mitigate this issue. Another challenge lies in the optimal combination of attention with convolutional inductive biases, as pure attention-based models (e.g., Vision Transformers) often require large-scale pretraining to match CNN performance. Hybrid architectures, such as CoAtNets or MobileViT, aim to synergize the strengths of both paradigms, offering a balanced trade-off between accuracy and efficiency. Additionally, the effectiveness of attention mechanisms can vary significantly across tasks; for example, in datasets with limited training samples, overly complex attention modules may lead to overfitting. Future research directions may explore adaptive attention mechanisms that automatically adjust their complexity based on input complexity or task requirements, as well as unified frameworks for multi-modal attention (e.g., combining visual and textual cues in vision-language tasks). As the field progresses, attention-enhanced CNNs are poised to remain at the forefront of image classification, continually pushing the boundaries of accuracy, interpretability, and scalability [11].

1.2. Related Work

The domain of image recognition has witnessed transformative developments through groundbreaking innovations in computer vision and artificial intelligence (AI). Over the past decade, the convergence of deep learning, high-performance computing, and large-scale annotated datasets has propelled the field into new frontiers of accuracy and applicability. Modern computational methodologies have evolved sophisticated mechanisms for interpreting visual data, leveraging both spatial and temporal modalities to achieve unprecedented performance across diverse applications, from medical diagnostics to autonomous driving. This section systematically examines contemporary research trends, technical implementations, and unresolved challenges within this interdisciplinary field, providing a comprehensive overview of the current state of the art while identifying critical gaps that warrant further investigation.
State-of-the-art systems predominantly utilize hierarchical neural architectures to distill discriminative features from visual inputs, enabling robust recognition even under varying conditions such as occlusion, illumination changes, and viewpoint shifts. Pioneering work by Ref. [12] established benchmark performance using residual network (ResNet) topologies adapted for action classification, demonstrating that deep networks with skip connections mitigate the vanishing gradient problem while enhancing feature reuse. Subsequent innovations by Ref. [13] demonstrated the efficacy of transformer-based encoders for cross-domain feature transfer, highlighting their superior ability to capture long-range dependencies compared to traditional convolutional neural networks (CNNs). Complementary studies [14] have shown that hybrid architectures combining convolutional operations with self-attention mechanisms yield superior performance in complex scenarios, such as fine-grained recognition and multi-object tracking. These architectures benefit from the local inductive biases of CNNs while leveraging the global contextual awareness of transformers, resulting in a more holistic understanding of visual data.
Advanced segmentation methodologies have become instrumental in isolating salient image patterns from cluttered environments, a critical requirement for applications such as image recognition and image analytics. The cascaded mask-propagation framework introduced in Ref. [15] achieves pixel-level precision by iteratively refining region proposals through multi-stage feature integration, outperforming traditional single-pass segmentation approaches. This method is particularly effective in video-based recognition tasks, where temporal consistency must be maintained across frames to ensure coherent object tracking.
Parallel developments in spatiotemporal modeling, particularly the dual-stream networks presented in Ref. [16], effectively decouple motion and appearance features for independent processing, allowing for more nuanced analysis of dynamic scenes. By processing optical flow and RGB data in separate streams before fusing them at a later stage, these networks achieve superior performance in action recognition benchmarks such as Kinetics and UCF101. Attention-driven systems [17] further enhance this capability through learnable weighting schemes that dynamically prioritize salient regions across video sequences, reducing computational overhead while improving recognition accuracy. These mechanisms mimic human visual attention, focusing computational resources on the most informative parts of an image or video.
Divergent strategies emerge in how systems incorporate prior knowledge about image recognition, particularly in scenarios where labeled training data is limited. The zero-shot learning (ZSL) paradigm explored by Su et al. [18] utilizes ontological knowledge graphs to recognize unseen action classes by transferring knowledge from seen to unseen categories via semantic embeddings. While this approach demonstrates promising results in controlled settings, practical deployment faces challenges in knowledge base construction—a constraint similarly identified in [19,20]. Ensuring that knowledge graphs are both comprehensive and unbiased remains an open research problem, particularly in domains where class definitions are fluid or ambiguous.
Despite significant progress, several unresolved challenges persist in image recognition, including robustness to adversarial attacks, interpretability of deep learning models, and scalability to real-world environments with uncontrolled conditions. Future research directions may explore the integration of neuromorphic computing for energy-efficient recognition, the development of more sophisticated multimodal fusion techniques, and the ethical implications of deploying recognition systems in sensitive domains. By addressing these challenges, the field can move closer to achieving human-level performance in visual understanding while ensuring fairness, transparency, and accountability in AI-driven recognition systems.

2. Materials and Methods

Given an input image $X \in \mathbb{R}^{H \times W \times 3}$, where $H$ is the height, $W$ the width, and 3 the number of RGB channels, the model outputs a class probability vector $y \in \mathbb{R}^{C}$, where $C$ is the number of classes. The proposed method comprises two parallel CNN branches, CNN-A and CNN-B. CNN-A consists of three convolutional layers and introduces a spatial attention mechanism; CNN-B consists of three convolutional layers and utilizes a channel attention mechanism. Features from the two branches are fused via the operation $\Gamma_{\text{fuse}}$ and classified through fully connected layers.
Figure 1 displays the structure of the model. The model processes an input image through two parallel branches. The top branch (CNN-A) utilizes a spatial attention module to emphasize discriminative regions. The bottom branch (CNN-B) employs a channel attention module to adaptively recalibrate feature channel importance. The outputs of both branches are then concatenated and passed to a classifier for final output.

2.1. CNN-A Branch with Spatial Attention

The CNN-A branch has three convolutional layers followed by a spatial attention module. Each convolutional layer is defined as
$$F_A^{(k)} = \Theta\left(W_A^{(k)} * F_A^{(k-1)} + b_A^{(k)}\right)$$
where $F_A^{(0)} = X$ and $k \in \{1, 2, 3\}$. $\Theta$ is the ReLU activation function, and $*$ denotes convolution. $W_A^{(k)} \in \mathbb{R}^{d_k \times d_{k-1} \times f_k \times f_k}$ and $b_A^{(k)}$ are the weights and biases of the $k$-th layer, respectively, where $d_k$ is the number of output channels and $f_k$ is the kernel size. The spatial attention is computed as
$$M_{\text{spatial}} = \sigma\left(\mathrm{Conv}_{7 \times 7}\left(\left[\mathrm{AvgPool}(F_A^{(3)});\, \mathrm{MaxPool}(F_A^{(3)})\right]\right)\right),$$
$$F_A^{\text{out}} = M_{\text{spatial}} \odot F_A^{(3)}$$
AvgPool and MaxPool are average and max pooling operations along the channel dimension, each reducing the feature map to $H \times W \times 1$. The two pooled features are concatenated along the channel dimension, forming a two-channel feature map. A $7 \times 7$ convolution followed by a sigmoid activation $\sigma$ produces the spatial attention mask $M_{\text{spatial}} \in [0, 1]^{H \times W}$. The output feature $F_A^{\text{out}}$ is obtained by element-wise multiplication of the mask (broadcast to all channels) with the original feature $F_A^{(3)}$.
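For concreteness, the following Keras layer is a minimal sketch of this spatial attention module; the class name and defaults are our own choices for illustration, not code released with the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers


class SpatialAttention(layers.Layer):
    """Spatial attention: channel-wise average/max pooling, a 7 x 7 convolution,
    and a sigmoid mask broadcast back over all channels (illustrative sketch)."""

    def __init__(self, kernel_size=7, **kwargs):
        super().__init__(**kwargs)
        self.conv = layers.Conv2D(1, kernel_size, padding="same", activation="sigmoid")

    def call(self, x):
        avg_pool = tf.reduce_mean(x, axis=-1, keepdims=True)        # B x H x W x 1
        max_pool = tf.reduce_max(x, axis=-1, keepdims=True)         # B x H x W x 1
        mask = self.conv(tf.concat([avg_pool, max_pool], axis=-1))  # values in [0, 1]
        return x * mask                                             # broadcast over channels
```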

2.2. CNN-B Branch with Channel Attention

The CNN-B branch also has three convolutional layers followed by a channel attention module. The convolutional layers are defined analogously to those in CNN-A:
$$F_B^{(k)} = \Theta\left(W_B^{(k)} * F_B^{(k-1)} + b_B^{(k)}\right)$$
The channel attention is computed as
$$z = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_B^{(3)}(i, j),$$
$$M_{\text{channel}} = \sigma\left(W_2 \cdot \delta(W_1 \cdot z)\right),$$
$$F_B^{\text{out}} = M_{\text{channel}} \odot F_B^{(3)}$$
Here, $z \in \mathbb{R}^{d_3}$ is the channel descriptor obtained by global average pooling. $W_1 \in \mathbb{R}^{(d_3/\gamma) \times d_3}$ and $W_2 \in \mathbb{R}^{d_3 \times (d_3/\gamma)}$ are the weights of the two fully connected layers, with $\gamma$ being the reduction ratio. $\delta$ denotes the ReLU activation, and $\sigma$ is the sigmoid function. The attention mask $M_{\text{channel}} \in \mathbb{R}^{d_3}$ is applied to the original feature $F_B^{(3)}$ via channel-wise multiplication, with the mask broadcast spatially.
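A matching sketch of the channel attention module, a Squeeze-and-Excitation-style block with reduction ratio $\gamma$, is given below; again, the class and argument names are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers


class ChannelAttention(layers.Layer):
    """Channel attention: global average pooling, a two-layer bottleneck
    (reduction ratio gamma), and a sigmoid mask broadcast spatially."""

    def __init__(self, channels, reduction=16, **kwargs):
        super().__init__(**kwargs)
        self.fc1 = layers.Dense(channels // reduction, activation="relu")  # W1 with ReLU
        self.fc2 = layers.Dense(channels, activation="sigmoid")            # W2 with sigmoid

    def call(self, x):
        z = tf.reduce_mean(x, axis=[1, 2])             # channel descriptor z, shape (B, C)
        mask = self.fc2(self.fc1(z))                   # channel weights in [0, 1]
        return x * mask[:, tf.newaxis, tf.newaxis, :]  # channel-wise rescaling
```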

2.3. Feature Fusion

The outputs of both branches are fused by concatenation along the channel dimension:
$$\Gamma_{\text{fuse}} = \left[F_A^{\text{out}};\, F_B^{\text{out}}\right]$$
The fused feature $\Gamma_{\text{fuse}} \in \mathbb{R}^{H \times W \times 2 d_3}$ is then passed to the classifier.

2.4. The Classifier

The classifier consists of a global average pooling layer and a fully connected layer:
$$f = \mathrm{GAP}\left(\Gamma_{\text{fuse}}\right),$$
$$y = \mathrm{softmax}\left(W_{fc} \cdot f + b_{fc}\right)$$
where $f \in \mathbb{R}^{2 d_3}$ is the vectorized feature, and $W_{fc} \in \mathbb{R}^{C \times 2 d_3}$ and $b_{fc}$ are the weight matrix and bias of the fully connected layer, respectively.
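The fusion and classification stage then reduces to a concatenation, a global average pooling layer, and a softmax fully connected layer, as in the short sketch below (the function name is ours).

```python
from tensorflow.keras import layers


def fuse_and_classify(f_a, f_b, num_classes):
    """Concatenate branch outputs along the channel axis, pool globally,
    and classify with a softmax fully connected layer."""
    fused = layers.Concatenate(axis=-1)([f_a, f_b])    # H x W x 2*d3
    pooled = layers.GlobalAveragePooling2D()(fused)    # vector of length 2*d3
    return layers.Dense(num_classes, activation="softmax")(pooled)
```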
Parameter specifications: Table 1 lists the parameters of convolutional layers. The parameters of attention modules are listed in Table 2. Table 3 lists the parameters of feature fusion and the classifier.

2.5. Algorithm Implementation

The corresponding algorithm is implemented in Algorithm 1. The forward propagation begins by processing the input image X through two parallel branches. CNN-A applies three convolutional layers (ConvA1–ConvA3) with ReLU activations, followed by a spatial attention module that generates a mask via channel-wise pooling and a 7 × 7 convolution, refining features through element-wise multiplication. Simultaneously, CNN-B processes the input through its own three convolutional layers, enhanced by a channel attention mechanism that compresses spatial information via global pooling, applies bottleneck FC layers, and scales channels dynamically. The outputs of both branches are concatenated channel-wise, pooled globally to a vector, and classified through a softmax-activated FC layer. This design ensures complementary spatial-channel feature integration while maintaining computational efficiency through shared convolutional structures and lightweight attention modules.
The model was trained using an early stopping regularization strategy with a patience of 10 epochs, monitoring the validation loss. This mechanism halted training if no improvement was observed, preventing overfitting and ensuring optimal generalization.
Algorithm 1. Dual CNN with attention mechanism
Input: image $X \in \mathbb{R}^{H \times W \times 3}$
Output: $y \in \mathbb{R}^{C}$
CNN-A
   Extract features
       $F_A^{(k)} = \mathrm{ConvBlock}(F_A^{(k-1)})$ for $k = 1, 2, 3$;
   Apply spatial attention
       $F_A^{\text{out}} = \mathrm{SA}(F_A^{(3)})$
CNN-B
   Extract features
       $F_B^{(k)} = \mathrm{ConvBlock}(F_B^{(k-1)})$ for $k = 1, 2, 3$;
   Apply channel attention
       $F_B^{\text{out}} = \mathrm{CA}(F_B^{(3)})$
Fusion
    $\Gamma_{\text{fuse}} = \mathrm{Concat}(F_A^{\text{out}}, F_B^{\text{out}})$
Classification
    Pooling
       $f = \mathrm{GAP}(\Gamma_{\text{fuse}})$
    Prediction
       $y = \mathrm{softmax}(W_{fc} \cdot f)$
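Putting the pieces together, Algorithm 1 can be assembled in Keras roughly as follows, reusing the SpatialAttention, ChannelAttention, and fuse_and_classify sketches above. The input resolution is an assumption, since the paper does not state the image size; the filter counts, kernel size, stride, and reduction ratio follow Tables 1 and 2.

```python
from tensorflow.keras import layers, models


def conv_block(x, filters):
    """3 x 3 convolution, stride 1, 'same' padding, ReLU (per Table 1)."""
    return layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu")(x)


def build_dcnnam(input_shape=(224, 224, 3), num_classes=12):
    """Dual-branch CNN with spatial and channel attention (a sketch of Algorithm 1)."""
    inputs = layers.Input(shape=input_shape)

    # CNN-A: three conv blocks followed by spatial attention
    a = inputs
    for filters in (32, 64, 128):
        a = conv_block(a, filters)
    a = SpatialAttention(kernel_size=7)(a)

    # CNN-B: three conv blocks followed by channel attention
    b = inputs
    for filters in (32, 64, 128):
        b = conv_block(b, filters)
    b = ChannelAttention(channels=128, reduction=16)(b)

    # Fusion and classification
    outputs = fuse_and_classify(a, b, num_classes)
    return models.Model(inputs, outputs, name="DCNNAM")
```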

2.6. Experimental Datasets

The experimental dataset is a custom-collected set of 2015 images spanning 12 categories of common objects: bag, bottle, cloth, glove, hammer, mask, pipe, screwdriver, shield, shoe, wood, and wrench. It is split into a training set of 1550 images and a testing set of 465 images. Each class in the training set contains high-resolution samples, and the testing set holds carefully selected images per category, so the balanced design enables reliable model training and fair performance comparison across different approaches.
The images were manually collected from multiple sources, and the dataset was constructed specifically for this study to evaluate the model's performance on a diverse yet challenging set of classes with varying textures, shapes, and complexities. While it is not a standard public benchmark like CIFAR or ImageNet, it was designed to represent a realistic and practical classification scenario.
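Assuming the images are organized in one sub-folder per class under separate train and test directories, the splits can be loaded in Keras roughly as follows; the directory paths and image size are illustrative, not taken from the paper.

```python
import tensorflow as tf

# Hypothetical layout: data/train/<class_name>/*.jpg and data/test/<class_name>/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=32,
    label_mode="categorical", seed=42)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "data/test", image_size=(224, 224), batch_size=32,
    label_mode="categorical", shuffle=False)
```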
To thoroughly validate the effectiveness of our model DCNNAM (dual CNN with attention module), we conduct extensive comparative experiments against five state-of-the-art baselines: DiffMIC-v2 (diffusion-based network) [2], GNN (Graph Neural Network) [3], CNNs (Convolutional Neural Networks) [21], FFGAN (Fusion and Generative Adversarial Networks) [22], and ResNet50 [23]. The benchmark methods are carefully selected to represent diverse technical approaches, ensuring a comprehensive evaluation of our model’s advantages in terms of accuracy, robustness, and computational efficiency.
All models were implemented in Python 3.9 using the TensorFlow/Keras framework. The experiments were conducted on a single NVIDIA RTX 3080 GPU. To ensure fair comparison and reproducibility, the same set of hyperparameters (detailed in Table 4) was used for all models unless a specific baseline required its own canonical settings (e.g., we used the standard hyperparameters for the ResNet50 baseline from the original publication). The random seed was fixed for weight initialization and data shuffling.
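A training configuration consistent with Table 4 might look like the sketch below, reusing build_dcnnam and the datasets loaded above. Using the test set as the validation stream and restoring the best weights are our assumptions, since the paper does not describe a separate validation split.

```python
import tensorflow as tf

model = build_dcnnam(num_classes=12)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"])

callbacks = [
    # Early stopping: patience 10, monitoring validation loss (Table 4)
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    # ReduceLROnPlateau: factor 0.5, patience 5 (Table 4)
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
]

history = model.fit(train_ds, validation_data=test_ds,
                    epochs=60, callbacks=callbacks)
```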

3. Results

3.1. Classifying Performance

This section evaluates the effectiveness of the proposed DCNNAM framework in classifying 12 distinct categories of images. The experimental results confirm its superior recognition capability compared to the five state-of-the-art baselines DiffMIC-v2, GNN, CNNs, FFGAN and ResNet50.
As illustrated in Table 5, the DCNNAM model attains an impressive classification accuracy of 98.06%, surpassing the alternative approaches DiffMIC-v2, GNN, CNNs, FFGAN, and ResNet50. Notably, ResNet50 exhibits the weakest performance among all baseline methods on the custom dataset, with the lowest scores across multiple evaluation criteria. A detailed comparison based on three further key metrics (Precision, F1-score, and G-mean) validates the robustness of DCNNAM. The model achieves a precision of 96.00%, an F1-score of 98.01%, and a G-mean of 97.47%, demonstrating consistent superiority over the five competing methods.
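For reference, the reported metrics can be computed from class-index predictions roughly as sketched below; macro averaging for precision and F1, and the geometric mean of per-class recalls for the G-mean, are our assumptions, since the paper does not spell out the exact formulas.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score


def evaluate(y_true, y_pred):
    """y_true, y_pred: 1-D arrays of class indices for the test set."""
    per_class_recall = recall_score(y_true, y_pred, average=None)
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
        # G-mean taken here as the geometric mean of per-class recalls
        "g_mean":    float(np.prod(per_class_recall) ** (1.0 / len(per_class_recall))),
    }
```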
In addition to the standard Top-1 accuracy, we further evaluated the models using Top-3 and Top-5 accuracy metrics to gain deeper insight into their classification performance. As illustrated in Table 5, the proposed DCNNAM model achieves a remarkable Top-3 accuracy of 99.78% and a perfect Top-5 accuracy of 100.0%. This means that for every image in the test set, the true class label is contained within the model’s top five predicted classes, and for all but a single image (99.78%) the true label is among the top three predictions.
These results demonstrate the exceptional confidence and ranking capability of our model. The high Top-3 and Top-5 scores indicate that the features learned by the dual attention mechanism are highly discriminative, allowing the model to almost never fail to identify the correct category at a broad level. The performance gap between DCNNAM and the baseline models, which is evident in the Top-1 metric, is even more pronounced in these Top-k metrics, solidifying the superiority of our proposed architecture.
To address concerns regarding the generalizability of our proposed DCNNAM model and to rigorously validate its effectiveness beyond our custom dataset, we conducted additional experiments on two standard public benchmarks, CIFAR-10 and CIFAR-100. These datasets present different challenges, with CIFAR-100 offering a more complex 100-class fine-grained classification task.
As shown in Table 5, the proposed DCNNAM model consistently outperforms the baseline models across both datasets. On CIFAR-10, our model achieves the best accuracy among the compared models at 95.72%, surpassing all other baselines including the strong ResNet50 backbone (94.25%). More notably, on the more challenging CIFAR-100 dataset, DCNNAM attains a top-1 accuracy of 78.89%, a margin of 2.06 percentage points over the next best baseline (ResNet50 at 76.83%) and 3.78 points over DiffMIC-v2 (75.11%).
These results demonstrate that the advantages of our dual-attention architecture are not limited to our specific custom dataset. The model effectively generalizes to other domains and problem scales, excelling particularly in complex, fine-grained scenarios like CIFAR-100. The consistent performance gain across three distinct datasets strongly validates the robustness and general applicability of the proposed method.
Table 5 also includes comparisons with leading models based on the Transformer (ViT [24], DeiT [25], Swin [26]) and Mamba (VMamba [27]) architectures. The results demonstrate that our proposed DCNNAM model consistently outperforms all these SOTA benchmarks, achieving the highest scores across all four metrics. Notably, our model shows a clear advantage over the pure ViT-Base (+2.15% accuracy) and the efficient DeiT-Tiny (+3.22% accuracy). It also surpasses the powerful hierarchical Swin-Transformer-Tiny by +1.29% in accuracy and the emerging VMamba-Tiny by +1.72%.
This superior performance is significant because it highlights the effectiveness of our dedicated dual-attention CNN design, even when compared against the dominant paradigm of Transformers and the latest Mamba architecture. While Transformers excel at capturing global dependencies, our model effectively synergizes channel and spatial attention within a convolutional framework to achieve more discriminative feature learning for this specific task. These comparisons solidly contextualize the performance of our model and robustly demonstrate the relative advantage and novelty of the proposed dual attentional mechanism.
The confusion matrix of the DCNNAM model, illustrated in Figure 2, shows that all twelve categories of images achieve recognition rates exceeding 90%. However, minor performance variations suggest potential areas for further refinement.
Training dynamics, depicted in Figure 3, reveal that the DCNNAM model converges efficiently after 50 epochs, with final training and validation losses stabilizing at 0.021. This indicates effective optimization without signs of overfitting. The stable learning curve underscores the model’s ability to generalize well to unseen data.
Figure 4 shows a steady increase in both training and validation accuracy throughout the training process. The validation accuracy closely tracks the training accuracy, with no significant gap emerging. This indicates that the model is learning effectively without overfitting. Both curves plateau near the final epochs, confirming that the model has reached its peak performance and that further training would yield diminishing returns. This behavior strongly supports the model’s ability to generalize well to unseen data.

3.2. Classifying Efficiency

While the DCNNAM framework demonstrates exceptional classification accuracy, its computational efficiency during training does not exhibit the same level of superiority. Table 6 provides a comparative analysis of the training durations for all six models (our model and the five competitors).
The proposed DCNNAM requires an average training time of 2980.57 s, slightly lower than DiffMIC-v2 (3163.06 s) and FFGAN (3055.39 s) but significantly higher than GNN (2781.96 s), CNNs (2819.60 s), and ResNet50 (2822.39 s). DiffMIC-v2 records the longest training duration at 3163.06 s, indicating inefficiencies in its optimization process. Despite its extended training phase relative to the lighter baselines, DCNNAM’s high recognition accuracy justifies the additional computational cost. The trade-off between training efficiency and classification performance suggests that the model is better suited for deployment in scenarios where precision is prioritized over rapid model updates.
Additionally, Table 6 shows that while our DCNNAM model has a higher number of parameters due to its dual-branch architecture, its inference time (4.75 s) is highly competitive. It is significantly faster than the diffusion-based DiffMIC-v2 (12.31 s), the generative FFGAN (18.54 s), and the deep ResNet50 (7.22 s) architectures. It is also faster than the GNN model (5.82 s), which is a positive result given that GNNs are often efficient for specific data structures. As expected, a simpler CNN baseline (3.10 s) has the shortest inference time, but this comes at a considerable cost to accuracy (~8% lower than our model).
This analysis provides a more complete picture of the model’s efficiency. It demonstrates that the superior accuracy of DCNNAM (98.06%) is achieved with a reasonable and practical computational cost during the inference phase, making it suitable for applications where both high accuracy and near-real-time performance are required.

4. Discussion

Advantages: (i) Enhanced feature representation. The dual-branch architecture synergizes spatial and channel attention, enabling the model to focus on both locally salient regions (e.g., object edges in CNN-A) and globally informative channels (e.g., texture patterns in CNN-B). This dual emphasis addresses limitations of traditional CNNs, which process spatial and channel features uniformly. Feature fusion via concatenation preserves hierarchical information from both branches, mitigating information loss observed in single-attention models.
(ii) Superior classification performance. The proposed model outperforms the five baselines DiffMIC-v2, GNN, CNNs, FFGAN and ResNet50 by 5–10% in accuracy, demonstrating its robustness across diverse categories (e.g., “hammer,” “mask,” “shoe”). High G-mean (97.47%) indicates balanced performance across classes, reducing bias toward dominant categories.
(iii) Interpretability and convergence. The spatial attention maps provide visual explanations for model decisions, e.g., focusing on tool handles for “screwdriver” classification. Stable training curves (Figures 3 and 4) suggest effective optimization without overfitting, even with limited data (1550 training images).
Limitations: (i) Computational overhead. The dual attention mechanisms increase training time compared to lightweight models like GNN (2781.96 s). This trade-off may hinder deployment in real-time applications. The 7 × 7 convolutional kernel in spatial attention introduces additional parameters, potentially limiting scalability to high-resolution images.
(ii) Dataset dependency. While the model excels on the tested 12-class dataset and generalizes well to the CIFAR benchmarks, its performance on heavily occluded images and on larger-scale, domain-specific datasets remains unverified. The current fusion strategy (concatenation) may not optimally resolve feature conflicts between branches, suggesting room for adaptive fusion techniques.

5. Conclusions

In this study, we introduced DCNNAM, a novel dual-structured convolutional neural network enhanced with complementary attention mechanisms for robust image classification. The proposed architecture leverages two parallel branches—CNN-A with spatial attention and CNN-B with channel attention—to capture both region-specific details and global feature interdependencies, addressing key limitations of conventional CNNs. By fusing multi-scale features through concatenation, the model achieves a more discriminative representation, as evidenced by its state-of-the-art performance (98.06% accuracy, 96.00% precision) on a challenging 12-class dataset. The spatial attention module effectively localizes critical object regions, while channel attention adaptively recalibrates feature importance, leading to improved generalization across diverse categories. Despite a moderate increase in training time, the model’s interpretability (via attention visualization) and stability (demonstrated by convergence curves) make it a viable solution for high-precision applications. Future work will focus on optimizing computational efficiency through lightweight attention variants and extending the framework to more complex tasks, such as fine-grained and multi-modal classification. This research underscores the potential of hybrid attention mechanisms to advance CNN-based vision systems, balancing accuracy, interpretability, and scalability.

Author Contributions

Conceptualization, Y.L. and J.Z.; methodology, Y.L.; software, Y.L.; validation, Y.L.; data curation, Y.Z. and H.L.; writing—original draft preparation, Y.L.; writing—review and editing, all authors. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported by Research Project on Nuclear Facility Decommissioning and Radioactive Waste Management TCKY-2024-CICDR-029.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available from the corresponding author on reasonable request.

Conflicts of Interest

All authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Yang, S.; Chen, L. Computer-Aided Pathology Image Classification and Segmentation Joint Analysis Model. In Proceedings of the 2025 International Conference on Multi-Agent Systems for Collaborative Intelligence (ICMSCI), Erode, India, 20–22 January 2025; pp. 1317–1322. [Google Scholar]
  2. Yang, Y.; Fu, H.; Aviles-Rivero, A.I.; Xing, Z.; Zhu, L. DiffMIC-v2: Medical Image Classification via Improved Diffusion Network. IEEE Trans. Med. Imaging 2025, 44, 2244–2255. [Google Scholar] [CrossRef] [PubMed]
  3. Shovon, I.I.; Ahmad, I.; Shin, S. Segmentation Aided Multiclass Tumor Classification in Ultrasound Images using Graph Neural Network. In Proceedings of the 2025 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 18–21 February 2025; pp. 1012–1015. [Google Scholar]
  4. Liu, Y.; Zhang, H.; Su, D. Research on the Application of Artificial Intelligence in the Classification of Artistic Attributes of Photographic Images. In Proceedings of the 2025 Asia-Europe Conference on Cybersecurity, Internet of Things and Soft Computing (CITSC), Rimini, Italy, 10–12 January 2025; pp. 370–374. [Google Scholar]
  5. Rollo, S.D.; Yusiong, J.P.T. A Two-Color Space Input Parallel CNN Model for Food Image Classification. In Proceedings of the 2024 International Conference on Information Technology Research and Innovation (ICITRI), Jakarta, Indonesia, 5–6 September 2024; pp. 141–145. [Google Scholar]
  6. El Amoury, S.; Smili, Y.; Fakhri, Y. CNN Hyper-parameter Optimization Using Simulated Annealing for MRI Brain Tumor Image Classification. In Proceedings of the 2025 5th International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), Fez, Morocco, 15–16 May 2025; pp. 1–5. [Google Scholar]
  7. Li, J.; Liu, H.; Li, K.; Shan, K. Heart Sound Classification Based on Two-channel Feature Fusion and Dual Attention Mechanism. In Proceedings of the 2024 5th International Conference on Computer Engineering and Application (ICCEA), Hangzhou, China, 12–14 April 2024; pp. 1294–1297. [Google Scholar]
  8. Tong, X.; Chen, J.; Shi, J.; Jiang, Y. Improved ResNet50 Galaxy Classification with Multi-Attention Mechanism. In Proceedings of the 2025 5th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA), Xi’an, China, 28–30 March 2025; pp. 725–729. [Google Scholar]
  9. Zheng, Z.; Slam, N. A VGG Fire Image Classification Model with Attention Mechanism. In Proceedings of the 2024 5th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 19–21 April 2024; pp. 873–877. [Google Scholar]
  10. Wu, W. Face Recognition Algorithm Based on ResNet50 and Improved Attention Mechanism. In Proceedings of the 2025 5th International Conference on Neural Networks, Information and Communication Engineering (NNICE), Guangzhou, China, 10–12 January 2025; pp. 98–101. [Google Scholar]
  11. Lei, W.; Liu, X.; Ye, L.; Hu, T.; Gong, L.; Luo, J. Research on Graph Feature Data Aggregation Algorithm Based on Graph Convolution and Attention Mechanism. In Proceedings of the 2024 4th International Conference on Electronic Materials and Information Engineering (EMIE), Guangzhou, China, 24–26 May 2024; pp. 146–150. [Google Scholar]
  12. Zhang, R.; Xie, M. A Multi-task learning model with low-level feature sharing and inter-feature guidance for segmentation and classification of medical images. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisboa, Portugal, 3–6 December 2024; pp. 2894–2899. [Google Scholar]
  13. Shi, J.; Liu, Y.; Yi, W.; Lu, X. Semantic-Guided Cross-Modal Feature Alignment for Cross-Domain Few-Shot Hyperspectral Image Classification. In Proceedings of the 2025 6th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Ningbo, China, 23–25 May 2025; pp. 622–625. [Google Scholar]
  14. Wang, Y.; Liu, G.; Yang, L.; Liu, J.; Wei, L. An Attention-Based Feature Processing Method for Cross-Domain Hyperspectral Image Classification. IEEE Signal Process. Lett. 2025, 32, 196–200. [Google Scholar] [CrossRef]
  15. Sun, K.; Dong, F.; Liu, W.; Wu, Q.; Sun, X.; Wang, W. Hyperspectral Image Classification with Spatial–Spectral–Channel 3-D Attention and Channel Attention. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4412314. [Google Scholar] [CrossRef]
  16. Akilan, T.; Wu, Q.J.; Safaei, A.; Huo, J.; Yang, Y. A 3D CNN-LSTM-Based Image-to-Image Foreground Segmentation. IEEE Trans. Intell. Transp. Syst. 2020, 21, 959–971. [Google Scholar] [CrossRef]
  17. Xu, R.; Wang, C.; Zhang, J.; Xu, S.; Meng, W.; Zhang, X. RSSFormer: Foreground Saliency Enhancement for Remote Sensing Land-Cover Segmentation. IEEE Trans. Image Process. 2023, 32, 1052–1064. [Google Scholar] [CrossRef] [PubMed]
  18. Su, T.; Wang, H.; Qi, Q.; Wang, L.; He, B. Transductive Learning With Prior Knowledge for Generalized Zero-Shot Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 260–273. [Google Scholar] [CrossRef]
  19. Zhang, X.; Xiao, Z.; Ma, J.; Wu, X.; Zhao, J.; Zhang, S.; Li, R.; Pan, Y.; Liu, J. Adaptive Dual-Axis Style-Based Recalibration Network with Class-Wise Statistics Loss for Imbalanced Medical Image Classification. IEEE Trans. Image Process. 2025, 34, 2081–2096. [Google Scholar] [CrossRef] [PubMed]
  20. Zheng, Y.; Liu, S.; Bruzzone, L. An Attention-Enhanced Feature Fusion Network (AeF2N) for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5511005. [Google Scholar] [CrossRef]
  21. Priya, V.V.; Chattu, P.; Sivasankari, K.; Pisal, D.T.; Sai, B.R.; Suganthi, D. Exploring Convolution Neural Networks for Image Classification in Medical Imaging. In Proceedings of the 2024 International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE), Bangalore, India, 24–25 January 2024; pp. 1–4. [Google Scholar]
  22. Hua, R.; Zhang, J.; Xue, J.; Wang, Y.; Liu, Z. FFGAN: Feature Fusion GAN for Few-shot Image Classification. In Proceedings of the 2024 Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), Dalian, China, 12–14 April 2024; pp. 96–102. [Google Scholar]
  23. Brikci, Y.B.; Benazzouz, M.; Benomar, M.L. A Comparative Study of ViT-B16, DeiT, and ResNet50 for Peripheral Blood Cell Image Classification. In Proceedings of the 2024 International Conference of the African Federation of Operational Research Societies (AFROS), Tlemcen, Algeria, 3–5 November 2024; pp. 1–5. [Google Scholar]
  24. Dosovitskiy, A.; Beyer, L. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 2021 International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021; pp. 1–22. [Google Scholar]
  25. Touvron, H.; Cord, M.; Douze, M.; Massa, F. Training data-efficient image transformers & distillation through attention. In Proceedings of the 2021 IEEE International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 1–22. [Google Scholar]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 1–14. [Google Scholar]
  27. Liu, Y.; Tian, Y.; Liang, Y. VMamba: Visual State Space Model. Comput. Vis. Pattern Recognit. 2024, 4, 1–33. [Google Scholar]
Figure 1. An architectural overview of the proposed dual-structured CNN.
Figure 2. Confusion matrix of the DCNNAM model.
Figure 3. Loss curve of the DCNNAM model.
Figure 4. Accuracy curves of the DCNNAM model.
Table 1. Parameters of convolutional layers.
Branch | Layer | Output Channels | Kernel | Stride | Padding
CNN-A | ConvA1 | 32 | 3 × 3 | 1 | 1
CNN-A | ConvA2 | 64 | 3 × 3 | 1 | 1
CNN-A | ConvA3 | 128 | 3 × 3 | 1 | 1
CNN-B | ConvB1 | 32 | 3 × 3 | 1 | 1
CNN-B | ConvB2 | 64 | 3 × 3 | 1 | 1
CNN-B | ConvB3 | 128 | 3 × 3 | 1 | 1
Table 2. Parameters of attention modules.
Module | Parameter | Value | Role
Spatial | Conv kernel | 7 × 7 | Spatial weight
Channel | Reduction ratio $\gamma$ | 16 | Channel compression
Table 3. Parameters of feature fusion and the classifier.
Component | Output Dimension | Parameters
Concatenation | $H \times W \times 256$ |
Global Pooling | $1 \times 1 \times 256$ |
Fully Connected | $C$ | $W_{fc} \in \mathbb{R}^{C \times 256}$
Table 4. Hyperparameter settings for reproducibility.
Hyperparameter | Value/Description | Comment
Optimizer | Adam |
Learning Rate | 0.001 |
Learning Rate Scheduler | ReduceLROnPlateau | Factor: 0.5, Patience: 5
Batch Size | 32 |
Number of Epochs (Max) | 60 | Training halted by early stopping.
Early Stopping Patience | 10 | Monitored validation loss.
Loss Function | Categorical Cross-Entropy |
Weight Initialization | He Normal | For convolutional layers.
Data Augmentation | Horizontal Flip, Random Rotation (±10°) | Applied only to training set.
Train/Test Split | 1550/465 | ~77%/~23% split.
Reduction Ratio ($\gamma$) | 16 | For channel attention module.
Spatial Attention Kernel | 7 × 7 | Convolution kernel size.
Random Seed | 42 | Fixed for reproducibility.
Table 5. Classifying results of different methods.
Method | Accuracy | Precision | F1-Score | G-Mean | Top-1 Accuracy | Top-3 Accuracy | Top-5 Accuracy | CIFAR-10 (Top-1 Acc.) | CIFAR-100 (Top-1 Acc.)
DCNNAM | 98.06% | 96.00% | 98.01% | 97.47% | 98.06% | 99.78% | 100% | 95.72% | 78.89%
DiffMIC-v2 | 92.11% | 91.88% | 95.63% | 95.56% | 92.11% | 98.28% | 99.57% | 93.15% | 75.11%
GNN | 92.07% | 90.44% | 93.03% | 94.14% | 92.07% | 97.42% | 99.35% | 90.88% | 72.43%
CNNs | 90.00% | 90.18% | 91.11% | 92.37% | 90.00% | 96.99% | 98.92% | 89.50% | 70.05%
FFGAN | 88.59% | 88.59% | 90.49% | 90.19% | 88.59% | 95.91% | 98.28% | 88.21% | 68.92%
ResNet50 | 88.11% | 88.11% | 90.08% | 89.44% | 88.11% | 95.27% | 97.85% | 94.25% | 76.83%
ViT-Base [24] | 95.91% | 93.85% | 95.82% | 95.23% | | | | |
DeiT-Tiny [25] | 94.84% | 92.77% | 94.65% | 93.98% | | | | |
Swin-T-Tiny [26] | 96.77% | 95.12% | 96.59% | 96.18% | | | | |
VMamba-Tiny [27] | 96.34% | 94.58% | 96.22% | 95.76% | | | | |
Table 6. Training time and inference time of the six models.
Model | DCNNAM | DiffMIC-v2 | FFGAN | ResNet50 | CNNs | GNN
Training time (s) | 2980.57 | 3163.06 | 3055.39 | 2822.39 | 2819.60 | 2781.96
Inference time (s) | 4.75 | 12.31 | 18.54 | 7.22 | 3.10 | 5.82
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
