Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Classification

Ullah, Zahid; Hong, Minki; Mahmood, Tahir; Kim, Jihie

doi:10.3390/math13223728

Open AccessFeature PaperArticle

Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Classification

¹

Department of Computer Science and Artificial Intelligence, Dongguk University, Seoul 04620, Republic of Korea

²

Division of Electronics and Electrical Engineering, Dongguk University, Seoul 04620, Republic of Korea

^*

Author to whom correspondence should be addressed.

Mathematics 2025, 13(22), 3728; https://doi.org/10.3390/math13223728

Submission received: 9 October 2025 / Revised: 6 November 2025 / Accepted: 18 November 2025 / Published: 20 November 2025

Download

Browse Figures

Versions Notes

Abstract

Deep learning has demonstrated significant promise in medical image analysis; however, standard CNNs frequently encounter challenges in detecting subtle and intricate features vital for accurate diagnosis. To address this limitation, we systematically integrated attention mechanisms into five commonly used CNN backbones: VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5. Each network was modified using either a Squeeze-and-Excitation block or a hybrid Convolutional Block Attention Module, allowing for more effective recalibration of channel and spatial features. We evaluated these attention-augmented models on two distinct datasets: (1) a Products of Conception histopathological dataset containing four tissue categories, and (2) a brain tumor MRI dataset that includes multiple tumor subtypes. Across both datasets, networks enhanced with attention mechanisms consistently outperformed their baseline counterparts on all measured evaluation criteria. Importantly, EfficientNetB5 with hybrid attention achieved superior overall results, with notable enhancements in both accuracy and generalizability. In addition to improved classification outcomes, the inclusion of attention mechanisms also advanced feature localization, thereby increasing robustness across a range of imaging modalities. Our study established a comprehensive framework for incorporating attention modules into diverse CNN architectures and delineated their impact on medical image classification. These results provide important insights for the development of interpretable and clinically robust deep learning-driven diagnostic systems.

Keywords:

squeeze and excitation; attention mechanism; convolutional neural networks; medical image classification

MSC:

68T07; 37-04

1. Introduction

Deep learning (DL) has revolutionized medical image analysis, offering robust techniques for disease diagnosis, segmentation, and classification. Among DL methodologies, Convolutional Neural Networks (CNNs) have demonstrated outstanding capability across various imaging applications, primarily due to their proficiency in autonomously extracting salient features from complex datasets. Two important areas of application include the classification of phenotypes in products of conception (PoC) and the identification of brain tumors using magnetic resonance imaging (MRI) [1]. PoC samples, which are derived from spontaneous abortions, possess critical genetic data that can facilitate the determination of underlying causes of pregnancy loss. Accurate phenotyping of these samples is essential for recognizing genetic disorders and chromosomal anomalies [2]. Similarly, precise and prompt classification of brain tumors via MRI is imperative, as tumor characterization and grading have a direct impact on clinical decision-making and patient outcomes [3]. In response to these complexities, transfer learning using established CNNs such as VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5 has become standard practice, employing pretraining on substantial datasets like ImageNet to enhance effectiveness in medical image analysis.

This research explores phenotype classification in two prominent medical imaging datasets: a brain tumor MRI dataset and a PoC dataset consisting of specimens from spontaneous abortions. In both contexts, achieving high-accuracy classification yields valuable understanding for timely treatment planning, accurate diagnosis, and elucidation of underlying pathological mechanisms. This work systematically assesses the improvements in performance attributed to integrating attention mechanisms within standard CNN models by conducting four distinct experimental scenarios.

Firstly, we employ pretrained CNN models (VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5) as benchmarks, retaining their original architectures without modifications. In the second phase, we systematically augment each model by incorporating Squeeze-and-Excitation (SE) modules into every convolutional block. As depicted in Figure 1, the SE block, initially proposed by Hu et al. [4], serves as a lightweight addition that strengthens CNN performance by modeling the dependencies between feature map channels. The principal idea behind this module is that feature channels exhibit varying relative significance based on the input, and adaptive recalibration of these channel responses enhances the network’s classification ability. The SE block executes two functional steps: the squeeze operation compresses spatial information into a succinct channel descriptor using global average pooling, capturing comprehensive context while reducing dimensionality; subsequently, with ReLU and sigmoid activations, the excitation operation processes this descriptor using a compact two-layer fully connected (FC) network to yield channel-specific weights. These weights are then used to rescale the corresponding feature maps, amplifying more salient channels and attenuating less relevant ones.

In the third phase, we investigate a targeted integration approach, whereby SE modules are incorporated solely within the deeper layers of each network architecture. For example, in VGG16, SE blocks are positioned following Blocks 3, 4, and 5. This strategy is designed to concentrate the refinement of attention on higher-level, semantically enriched features, while simultaneously reducing computational demands relative to comprehensive SE integration.

In the fourth and final phase, we expand on selective SE integration by adding Spatial Attention (SA) mechanisms [5]. This hybrid framework enables the model to focus on both key feature channels and critical spatial regions within the feature maps, thereby enhancing its capacity for localization and classification. The SA module identifies and highlights the regions of a feature map that provide the most valuable spatial information. Unlike channel attention, which distinguishes the most important feature channels, SA assigns significance to precise spatial positions (pixels). This is accomplished by performing global average pooling and max pooling along the channel axis, which generates two spatial descriptors with dimensions

H \times W \times 1

. These descriptors are subsequently concatenated to form an

H \times W \times 2

feature map, which is then processed by a convolutional layer followed by a sigmoid activation to produce the resulting SA map

M_{s}

of size

H \times W \times 1 .

Recent methods such as Improved EATFormer [6] have investigated Vision Transformer-based models that utilize advanced self-attention mechanisms for medical image classification. Nevertheless, these approaches often demand significant computational resources and access to large-scale datasets to attain high levels of performance. Furthermore, Vision Transformers usually introduce only one novel architecture, which confines their adaptability to a range of CNN backbones. By comparison, our study introduces a systematic comparative framework that incorporates lightweight yet effective attention modules (SE and Convolutional Block Attention Module (CBAM)) into five widely adopted CNN architectures. This methodology retains computational efficiency and supports deployment in resource-constrained environments while also offering comprehensive insights into the behavior of attention mechanisms across various architectures and imaging modalities. By balancing high accuracy with practical deployment considerations, our research both extends Transformer-based studies and improves the generalizability and clinical relevance of these methods. Figure 2 presents the steps involved in generating the spatial attention map.

This study examines whether embedding attention mechanisms, specifically SE and CBAM, within established CNN architectures can meaningfully enhance feature representation and boost classification accuracy within medical image analysis. Therefore, the primary research question is as follows: To what degree does the methodical integration of SE and CBAM attention modules within CNN frameworks elevate the accuracy, precision, recall, and F1-score in medical image classification, and how does this performance differ when applied to varying network depths and separate datasets?

Additionally, the goal of this work is to deliver not only a performance benchmarking but also a reproducible and systematic approach for integrating attention modules, specifically SE and CBAM, across several CNN models under uniform experimental protocols. In contrast to previous research that has focused only on single models or datasets, our approach makes possible an equitable, cross-model, and cross-task assessment encompassing both classification and segmentation. Moreover, we investigate the impact of different layer-wise placements of attention modules, providing actionable insights regarding the operation of attention mechanisms at multiple levels of feature abstraction. Through this rigorous examination, we underscore the enhancement in performance as well as the nuanced behavior and generalization capabilities of attention integration in medical image analysis.

The principal objective of this research is to evaluate how distinct attention strategies, global, selective, and hybrid, affect the classification outcomes of pretrained CNN architectures in difficult medical imaging tasks. Our results indicate that the use of attention modules, particularly when applied in a targeted and spatially guided manner, substantially improves the discriminative performance of CNNs while ensuring computational efficiency. The code is available at: https://github.com/Zahid672/Brain-Tumor-and-POC-Classification (accessed on 20 October 2025).

The main contributions of this study are outlined below:

We integrate lightweight attention modules (SE and CBAM) into five well-established CNN backbones, generating multiple model variants to systematically evaluate the roles of channel and spatial-level attention.
In contrast to prior studies that typically concentrate on a single backbone or Transformer-based architecture, our research establishes a unified comparative platform to investigate attention integration across several CNN architectures, providing meaningful perspectives for medical image classification.
We conduct extensive validation on two separate medical imaging datasets, such as brain tumor MRI (including multiple subtypes) and POC histopathology (encompassing various tissue classes), thereby demonstrating the robustness of our approach within both radiological and pathological contexts.
Experimental results demonstrate that attention-augmented CNNs consistently surpass their respective baselines, with EfficientNetB5 combined with hybrid attention achieving the highest accuracy. Moreover, the introduction of attention facilitates enhanced feature localization, which supports improved model interpretability and greater clinical value.

The remainder of this manuscript is organized as follows: Section 2 reviews foundational literature. Section 3 details the methodologies and datasets employed. Section 4 presents outcomes and discussion. Section 5 discusses limitations and suggests future research directions. Finally, Section 6 concludes the study.

2. Related Work

DL techniques, particularly CNNs, have revolutionized medical image analysis by facilitating automated processing and improving diagnosis for numerous medical conditions [7]. CNNs have demonstrated significant impact in computer-aided diagnosis, marking substantial advances within the discipline [8]. For instance, CNN architectures are capable of autonomously extracting essential features from brain MRI scans, which enhances the effectiveness of cancer detection relative to traditional methods [9,10]. Additionally, CNN-based approaches have yielded improvements in diagnostic accuracy for neurodegenerative disorders through multi-class classification tasks [11]. In brain tumor analysis, CNNs remain extensively explored, with several publications reporting high performance in identifying and distinguishing various tumor types [12,13].

A key factor underlying these achievements is transfer learning [14], which utilizes pretrained models built on large-scale datasets such as ImageNet. Transfer learning has demonstrated particular significance in medical imaging, where available datasets are frequently small in size [15]. Pretrained CNNs, like those used in this study, offer a solid groundwork by applying knowledge obtained through extensive training on natural images [16]. Fine-tuning these models on domain-specific medical datasets typically leads to considerable gains in performance in comparison to initializing models from scratch. To enhance model effectiveness further, attention mechanisms, such as SE modules and SA, have been integrated into CNN architectures. These mechanisms enable networks to focus on the most relevant features or areas within medical images, thereby increasing diagnostic accuracy [17].

Continuing this research trajectory, Rongjun et al. [18] proposed an approach where, in 1D residual CNNs, SE blocks are primarily incorporated to detect ECG arrhythmia. They merged SE modules with temporal 1D convolutions to dynamically enhance channel features associated with arrhythmia, eliminate redundant features, and eliminate the need for preprocessing (i.e., denoising) steps. Residual connections contribute to stabilizing model training and optimizing efficiency. Their experimental findings reveal that this model achieved high performance, underscoring its potential for robust automated ECG analysis.

Li et al. [19] proposed a unified temporal–spectral SE framework that extracts multi-scale temporal patterns and multi-level spectral features simultaneously from EEG signals. Their model utilizes convolutional blocks to capture nonstationary patterns, while parallel spectral convolutional blocks are employed to extract frequency-band characteristics. These extracted features are integrated via a novel SE module, which adaptively prioritizes the most salient channel-wise representations. To address overfitting resulting from the limited number of seizure events, the authors implemented an information-maximizing loss function. Their approach achieved more accurate results than prior methods, demonstrating the advantage of jointly modeling temporal and spectral domains for seizure detection. In a comparable manner, Kitada et al. [20] use both labeled and unlabeled data to refine an ensemble of pretrained SENets through a mean-teacher semi-supervised learning framework. They further augmented the training process with dermatology-specific data augmentations to expand the dataset. This augmentation strategy led to a substantial improvement in balanced accuracy, increasing it from 79.2% to 87.2% on the ISIC 2018 validation set.

Zheng et al. [21] developed a Multi-Attention CNN aimed at fine-grained image recognition. The model simultaneously learns both attention localization and feature extraction by identifying multiple discriminative parts without the need for part annotations. This demonstrates that attention mechanisms facilitate enhanced class separability and interpretability, paving the way for subsequent attention-based models in medical and visual recognition. Gu et al. [22] presented CA-Net (Comprehensive Attention Network), a framework incorporating spatial, channel, and scale attention mechanisms to improve feature representation in medical image segmentation. The model increases segmentation accuracy and interpretability by selectively emphasizing key anatomical regions. This study underscores the value of multi-dimensional attention fusion, supporting our study’s focus on integrating channel and SA within CNN architectures.

Although prior studies have demonstrated the utility of SE networks [23] and attention mechanisms for specialized applications such as ECG analysis [18], EEG-based seizure detection [19], and skin lesion classification [20], these approaches largely remain limited to task-specific and single-modality settings. In contrast, the present study systematically evaluates attention integration across multiple widely adopted CNN architectures, VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5, applied to two distinct medical imaging modalities: brain tumor MRI and histopathological images of PoC. Unlike earlier work that concentrates on a single signal dataset, our analysis includes both channel-oriented (SE) and hybrid channel-spatial (CBAM) modules, thereby presenting a more comprehensive perspective on feature recalibration and localization. The findings indicate that CNNs enhanced with attention modules achieve not only higher classification accuracy but also improved generalizability across heterogeneous datasets, with the combination of EfficientNetB5 and CBAM yielding the highest overall performance. This broad, modality-independent comparative assessment distinguishes our research from previous task-specific studies and offers actionable insights for developing robust, generalizable attention-based models for clinical decision support.

3. Materials and Methods

The methodology of this study consisted of a sequence of experiments designed to evaluate the performance of pretrained CNN architectures and their versions augmented with attention modules for phenotype classification in PoC and brain tumor MRI datasets.

We initially established benchmark results by employing the pretrained models, VGG16 [24], ResNet18 [25], InceptionV3 [26], EfficientNetB5 [27], and DenseNet121 [28]. Each network was fine-tuned for the target classification tasks without altering its architecture, enabling us to assess its baseline performance and set reference points for evaluating subsequent modifications. In the following phase, we augmented the pretrained CNNs with SE modules. These additions enhance the representational power of the networks by adaptively modulating channel-wise feature responses, which amplify salient features while diminishing those deemed less informative. To investigate the potential benefit of a more targeted SE application, the third experimental phase implemented selective integration. Specifically, for VGG16, SE modules were inserted only after Blocks 3, 4, and 5, based on the rationale that deeper layers capture progressively more abstract and semantically meaningful features. Finally, in the fourth phase, we incorporated SE modules alongside SA. While SE enhances feature interdependencies between channels, SA emphasizes the most discriminative spatial regions within feature maps. By integrating these two complementary approaches, our aim was to develop models that are receptive to feature channels as well as spatial information, thereby enhancing both classification accuracy and interpretability.

3.1. Datasets

Two publicly available datasets were utilized in this study, specifically the POC dataset and the BT-Large-4C dataset, as illustrated in Figure 3. An overview of these datasets can be found in Table 1.

3.1.1. Products of Conception Dataset

The POC dataset [29] used in this work is publicly available for research purposes. It contains histopathological image samples categorized into 4 distinct tissue types: chorionic villi, decidual tissue, hemorrhage, and trophoblastic tissue. The training set consists of 4155 samples, specifically 1138 hemorrhage, 1391 chorionic villi, 926 decidual, and 700 trophoblastic tissue images. The testing set comprises 421 hemorrhage images out of a total of 1511 samples, including 390 chorionic villi, 349 decidual, and 351 trophoblastic tissue images.

3.1.2. Brain Tumor Dataset

The second dataset used in this research is the publicly available brain tumor MRI dataset obtained from the Kaggle repository (https://www.kaggle.com/sartajbhuvaji/brain-tumor-classification-mri (accessed on 20 October 2025)). The dataset comprises 3064 T1-weighted, contrast-enhanced brain MR images corresponding to three primary tumor types: gliomas, meningiomas, and pituitary tumors. To facilitate a more in-depth evaluation, we expanded the dataset to 4 categories by adding normal brain MR images, referring to this version as BT-Large-4c. The BT-Large-4c dataset, therefore, encompasses four classes: Normal, Glioma tumor, Meningioma tumor, and Pituitary tumor. Given the limited number of MRI images, we increased dataset diversity using image augmentation techniques, particularly two methods: horizontal flipping and rotation. In the first step, input images were randomly rotated by 90 degrees one or more times, after which each rotated image was horizontally flipped, thereby generating additional training data.

3.2. Baseline CNN Architectures

3.2.1. VGG16

VGG16, as introduced by Simonyan and Zisserman [24], achieved prominence due to its strong results in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2014. Its architecture is characterized by a consistent application of small

3 \times 3

convolutional filters with a stride of 1 and padding, which serves to preserve spatial dimensions. The sequential stacking of these layers enables the network to capture hierarchical representations of increasing abstraction while controlling the parameter count relative to architectures using larger filters. VGG16 consists of 13 convolutional layers and 3 FC layers, yielding a total of 16 weight-bearing layers and thus providing the rationale for the model’s name.

In VGG16, the convolutional layers are divided into five consecutive blocks, with each block succeeded by a

2 \times 2

max-pooling layer with stride 2, progressively reducing spatial resolution while efficiently maintaining essential feature information. The output feature maps from the final block are then flattened and passed through 3 FC layers, with the last layer typically utilizing a softmax activation function to produce class probabilities. Despite its substantial depth, the architecture’s consistent and straightforward design has contributed to VGG16’s broad adoption in computer vision applications. The baseline, which focused on medical image analysis, was selected for its strong feature extraction capabilities and the availability of pretrained weights on large-scale datasets such as ImageNet.

To explore how attention mechanisms affect convolutional networks, we constructed three VGG16-based variants with attention modules, as depicted in Figure 4. The first variant, VGG16-SE v1, extends the baseline by inserting SE blocks following every convolutional layer. This approach enables channel recalibration at all depths, ensuring that informative channels are persistently prioritized whereas redundant channels are diminished. The second model, VGG16-SE v2, adopts a more targeted approach by integrating SE blocks only after the final convolutional layers of Blocks 3, 4, and 5. Focusing recalibration on higher-level semantic features in this manner reduces computational demands and enhances the expressive power of deeper layers where abstraction is paramount. In the third version, VGG16-SE-SA, both channel and SA are employed; SA modules are added after the final convolutions of Blocks 1 and 2 to capture key spatial regions, while SE blocks are inserted after the final convolutions of Blocks 3, 4, and 5 for refined channel adjustments. This configuration is intended to leverage the complementary properties of spatial and channel attention, allowing the network to effectively determine both the location and the nature of feature importance. This strategy supports selective emphasis within feature maps.

3.2.2. ResNet18

ResNet18, introduced by He et al. [25], marked a key innovation in deep learning by addressing the vanishing gradient challenge via residual learning. Instead of requiring the network to learn direct input–output mappings, ResNet introduces shortcut (skip) connections that allow it to focus on learning residual functions. This design permits the successful training of deeper architectures, circumventing the degradation of accuracy commonly observed in very deep networks.

ResNet18 consists of 18 trainable layers, comprising 17 convolutional layers and a final FC layer. The model structure is organized into five primary stages: an initial

7 \times 7

convolution with stride 2, followed by a

3 \times 3

max-pooling layer, and four distinct groups of residual blocks. Within each residual block are two

3 \times 3

convolutional layers, each succeeded by batch normalization and ReLU activation, with an accompanying shortcut connection that bypasses one or more layers. Depending on dimensional consistency, these shortcut connections can function as identity mappings or employ

1 \times 1

convolutions to align dimensions.

Residual connections in ResNet18 facilitate the learning of more distinctive feature representations and help mitigate the vanishing gradient issue, offering an effective trade-off between accuracy and efficiency. The architecture’s lightweight nature and access to pretrained weights on large-scale datasets such as ImageNet have established ResNet18 as a prominent backbone in computer vision, supporting applications such as image classification, segmentation, and medical image analysis.

To systematically analyze the impact of attention mechanisms within residual networks, three ResNet18-based variants were developed as depicted in Figure 5. The first variant, SEResNet18 v1, alters the baseline by substituting every BasicBlock with an SE-enhanced version, thus enabling adaptive channel recalibration at each residual unit throughout the network. The second variant, SEResNet18 v2, incorporates SE blocks specifically after layers 2, 3, and 4, focusing on mid and high-level feature representations where greater semantic abstraction is observed, while also minimizing additional computational demands. The third variant, SEResNet18-SA, sequentially integrates both SE and SA modules after layers 2, 3, and 4. In this variant, SE modules adaptively reweight channel responses, while SA modules highlight the most relevant spatial regions. Through the combination of these complementary mechanisms, the model enhances its capacity to capture both channel dependencies and spatial relationships at the same time.

3.2.3. InceptionV3

InceptionV3, proposed by Szegedy et al. [30], is a deep CNN designed to achieve high accuracy with computational efficiency. As an advanced version of the initial Inception approach, it incorporates factorization methods, dimensionality reduction, and optimized resource utilization. The defining principle of the Inception module lies in its ability to capture spatial information at multiple scales by using convolutional kernels of various sizes (

1 \times 1

,

3 \times 3

, and

5 \times 5

) in parallel, in addition to pooling layers. These outputs are merged by concatenation, which enables the network to extract both detailed and global features concurrently.

InceptionV3 introduces several significant improvements over its predecessors. Large convolutions, such as

5 \times 5

, are decomposed into two sequential

3 \times 3

convolutions, which reduces the parameter count substantially without sacrificing representational capacity. In addition, asymmetric convolutions, such as a

1 \times 3

followed by a

3 \times 1

, are utilized to further optimize computational efficiency. Extensive application of batch normalization stabilizes and expedites the training process, while auxiliary classifiers at intermediate layers enhance gradient propagation and reduce the risk of overfitting. The final network configuration comprises multiple stacked Inception modules, succeeded by a global average pooling layer and FC layers, with a softmax activation facilitating classification. Owing to its optimal balance between computational efficiency and predictive accuracy, InceptionV3 has achieved widespread adoption in image classification, including medical imaging applications, especially in settings with constrained computational resources.

Due to its modular architecture, InceptionV3 is well-suited for evaluating the integration of attention mechanisms. Accordingly, we introduced three attention-enhanced variants, depicted in Figure 6. The first variant, InceptionV3-SE v1, incorporates SE blocks after selected stages to recalibrate channel-wise feature responses in intermediate layers, adaptively enhancing the most informative channels while suppressing less significant ones. The second variant, InceptionV3-SE v2, implements SE blocks more selectively, positioning them exclusively after the Inception-C, Inception-D, and Inception-E modules. As these modules correspond to deeper layers with heightened semantic abstraction, this targeted approach strengthens high-level representational power without adding undue complexity. The third variant, InceptionV3-SE-SA, integrates both SE and SA mechanisms. In this configuration, SA modules are inserted following the Inception-B module and again before the global pooling layer, emphasizing salient spatial regions at both mid-level and high-level stages. SE blocks are applied after the Inception-C, D, and E modules to refine channel interdependencies. The combined use of spatial and channel attention in this hybrid structure facilitates more robust and discriminative feature extraction.

3.2.4. EfficientNetB5

EfficientNetB5, introduced by Tan and Le [27], is part of the EfficientNet series of CNNs designed using a compound scaling methodology. Rather than scaling depth, width, or input resolution independently, EfficientNet utilizes a compound coefficient that consistently scales these three factors together. This results in models that achieve high accuracy with improved computational efficiency.

The EfficientNet architecture originates from a baseline network optimized via neural architecture search (NAS) and is primarily structured around Mobile Inverted Bottleneck Convolution (MBConv) blocks integrated with SE modules. Each MBConv block increases the number of input channels, employs depthwise separable convolutions, and subsequently projects the result back to a lower dimensionality, which reduces both parameter count and floating-point operations (FLOPs). The incorporated SE modules perform adaptive recalibration of channel responses, allowing the model to prioritize the most relevant features. EfficientNetB5 is a scaled-up variant of the baseline EfficientNet, employing a compound coefficient to simultaneously scale network depth, width, and input resolution in a unified manner. Relative to smaller versions (B0–B4), B5 is capable of processing higher-resolution inputs (

456 \times 456

), which enables it to extract finer structural details. The architecture’s final layers include a global average pooling procedure followed by FC layers ending with a softmax classifier. Because of its efficient architecture and robust feature extraction, EfficientNetB5 has been widely adopted as a backbone in complex vision applications, with notable use in medical image classification and segmentation, domains that demand both high accuracy and computational efficiency.

To enhance its representational strength, we introduced three attention-augmented variants of EfficientNetB5, outlined in Table 2. The first variant, EfficientNet-B5 (MB- Conv + SE), alters the original MBConv blocks to insert SE modules within each block, positioned immediately after the depthwise convolution and before the projection layer. This arrangement provides channel recalibration at every stage, ensuring detailed channel-level attention. The second variant, EfficientNet-B5 (SE after blocks), places SE modules selectively after key architectural stages: after Block 2 (index 4, 176 output channels), Block 3 (index 9, 512 output channels), and Block 4 (index 13, 512 output channels). This selective strategy emphasizes mid-level and high-level semantic features, while minimizing the additional computational cost associated with uniform SE integration. The third variant, EfficientNet-B5 (SE + SA), integrates both channel and SA by applying SE modules after Blocks 2, 3, and 4, and further introducing an SA module after Block 3 directly following SE3. This combined approach enables synergistic refinement, as SE focuses on channel discrimination and SA highlights spatially significant regions, thereby advancing both the location and nature of feature extraction.

3.2.5. DenseNet121

DenseNet121, introduced by Huang et al. [28], is a CNN that uses dense connectivity to promote feature reuse and improve gradient flow. In this model, each layer receives the feature maps of all preceding layers as input and then passes its own outputs to every subsequent layer in the same block. This structure supports more efficient transmission of both information and gradients through the network, alleviating the vanishing gradient issue and enhancing parameter efficiency.

DenseNet121 consists of 121 layers grouped into four dense blocks, which are separated by transition layers. In each dense block, the feature maps produced by a layer are concatenated, not summed, with those of all earlier layers, resulting in varied and informative data representations. The transition layers, made up of

1 \times 1

convolutions followed by

2 \times 2

average pooling, are inserted between dense blocks to control the expansion of feature-map dimensions while reducing computation. The model’s structure starts with a

7 \times 7

convolution and max-pooling step and ends with global average pooling and an FC softmax classifier.

One of the primary advantages of DenseNet121 is its efficient use of parameters. Due to the extensive reuse of features across layers, the model achieves robust accuracy with fewer parameters than alternative architectures of comparable depth. This efficiency is particularly advantageous for medical image analysis, where datasets are often limited and there is a critical need for both effective feature extraction and minimization of overfitting. To evaluate the impact of various attention mechanisms, we developed three DenseNet121 variants that incorporate SE and SA modules, as depicted in Figure 7. The first variant, DenseNet121-SE v1, extends the baseline by inserting SE blocks after each of the four dense blocks, thereby providing consistent channel recalibration throughout all levels of representation, from low to high-level features. The second variant, DenseNet121-SE v2, takes a selective approach by introducing SE blocks solely after dense blocks 2, 3, and 4. As these later dense blocks capture increasingly abstract semantic representations, this strategy directs the attention mechanisms toward deeper layers, maintaining a balance between computational cost and performance enhancement. The third variant, DenseNet121-SE-SA, incorporates both SE and SA modules. Specifically, SE modules are implemented after dense blocks 2–4 to strengthen channel-wise feature dependencies, while SA modules are positioned after transition layers 1–3 and prior to the global average pooling layer to focus on salient spatial regions. By integrating both channel and spatial forms of attention, the model achieves complementary refinement, improving its capacity to identify and prioritize relevant features and where they are most pertinent within the spatial domain.

3.3. Evaluation Metrics

To quantitatively evaluate the effectiveness of the model, we utilized four widely recognized classification metrics: accuracy, precision, recall, and F1-score. The definitions of these metrics are as follows:

Accuracy = \frac{T P + T N}{T P + T N + F P + F N},

(1)

Precision = \frac{T P}{T P + F P},

(2)

Recall = \frac{T P}{T P + F N},

(3)

F 1 = 2 \times \frac{Precision \times Recall}{Precision + Recall},

(4)

where

T P

,

T N

,

F P

, and

F N

denote true positives, true negatives, false positives, and false negatives, respectively.

3.4. Mathematical Influence of SE Units on Learning

Let the output of a convolutional block be denoted as

U \in R^{C \times H \times W}

, where C, H, and W represent the number of channels, height, and width, respectively. SE unit first performs a squeeze operation through global average pooling to generate a channel descriptor:

z_{c} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} u_{c} (i, j),

(5)

where

z_{c}

captures the global spatial information of each channel.

Next, an excitation operation learns channel-wise dependencies using two FC layers with a non-linear activation:

s = σ (W_{2} δ (W_{1} z)),

(6)

where

W_{1}

and

W_{2}

are learnable weight matrices,

δ (\cdot)

denotes the ReLU activation, and

σ (\cdot)

is the sigmoid function. The resulting vector

s \in R^{C}

encodes the relative importance of each feature channel.

The reweighted feature maps are obtained by channel-wise multiplication:

{\tilde{U}}_{c} = s_{c} \cdot U_{c},

(7)

where

s_{c}

acts as a dynamic scaling coefficient for channel c.

This modulation influences the learning process by adaptively controlling both forward activations and backward gradients. During backpropagation, the gradient of the loss

L

with respect to

U_{c}

becomes the following:

\frac{\partial L}{\partial U_{c}} = s_{c} \frac{\partial L}{\partial {\tilde{U}}_{c}},

(8)

This relationship demonstrates that channels with greater attention weights

s_{c}

have correspondingly higher gradients. As a result, features that are discriminative are intensified, while redundant or noisy features are reduced, which enhances convergence stability and improves generalization. Such adaptive gradient scaling provides a mathematical rationale for the observed performance improvements when SE modules are selectively integrated into CNN architectures trained on the POC dataset.

3.5. Computational Efficiency Analysis

To evaluate the computational demands of attention integration, we compared the parameter count, FLOPs, inference time, and memory consumption for both the baseline and attention-enhanced models. Table 3 presents the trade-offs in efficiency, indicating that SE and CBAM modules result in negligible additional cost while yielding substantial performance improvements.

3.6. Experimental Framework

3.6.1. Baseline Fine-Tuning of Pretrained CNNs

In the initial phase of the study, baseline models were established using several widely accepted CNN architectures, including VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5. To maintain experimental consistency and utilize transfer learning, we employed their pretrained versions on the ImageNet dataset. These models were implemented in their default forms, with no changes in architecture, thereby ensuring a fair and robust benchmark. For fine-tuning, the terminal classification layer is substituted with an FC layer that matches the number of categories in our datasets. Convolutional layers utilize pretrained weights to convey generalizable feature representations, while the new layers are learned from randomly initialized parameters to fit the specific classification targets. This approach to fine-tuning provides a robust baseline for directly comparing model performance across architectures and serves as a basis for assessing the impact of subsequent improvements.

3.6.2. Integration of SE Modules

To augment the representational capability of the baseline architectures, we integrated SE modules into the convolutional blocks of each network to enhance the representation power of each model. As shown in Figure 1, the SE block applies channel-wise attention by dynamically recalibrating the feature maps. This process starts with global average pooling, which summarizes spatial information into a channel descriptor. The descriptor then passes through two FC layers (realized as

1 \times 1

convolutions) with a reduction ratio of 16, followed by a sigmoid activation to generate channel-specific weights. These weights are then multiplied by the original feature maps, thereby emphasizing informative channels and suppressing less relevant ones.

In the modified VGG16 architecture, SE modules are introduced after each convolutional layer and placed immediately prior to the corresponding pooling layer, where appropriate. Specifically, SE blocks are incorporated into all five of VGG16’s convolutional blocks. For example, in the initial block, each of the two convolutional layers is directly followed by an SE module and subsequently by a max-pooling operation, with this sequential structure maintained across the remaining blocks. This systematic application of channel attention across all abstraction levels enables dynamic recalibration of feature dependencies throughout the feature extraction process. The classifier segment of VGG16 is retained from the pretrained model, with the exception of the final FC layer, which is substituted to match the number of target classes (four in this context). Thus, the model preserves original ImageNet-pretrained feature representations while incorporating SE-based channel recalibration for finer-grained feature adjustment.

3.6.3. Selective Placement of SE Modules in CNN Models

Instead of positioning SE modules after every convolutional layer, we utilized a selective placement approach to optimize the trade-off between performance gains and computational efficiency. As outlined previously, an SE block enacts channel-wise attention by conducting global average pooling coupled with a bottleneck consisting of a reduction ratio of 16, yielding channel descriptors that guide recalibration of feature maps. Within the modified VGG16 structure, SE modules are placed solely after the final convolutional layers of the 3rd, 4th, and 5th convolutional blocks (i.e., conv_3_3, conv_4_3, and conv_5_3 in the canonical VGG16), which correspond to layers 10, 17, and 24 in the PyTorch implementation. This targeted integration emphasizes deeper layers, which encode higher-order semantic information, thereby augmenting representational capabilities while minimizing computational overhead.

By excluding SE modules from the early layers, the model efficiently preserves low-level representations such as texture and edge information, while the deeper layers are enhanced through adaptive channel recalibration to capture more complex contextual features. This configuration limits the number of added parameters, diminishing the likelihood of overfitting on comparatively small datasets. The classifier portion of VGG16 remains consistent with the pretrained backbone, aside from the final FC layer, which is revised to produce four target class outputs. In this manner, ImageNet-pretrained features are utilized, and critical stages of the network are selectively enhanced via channel-focused attention mechanisms.

We introduce SE modules following the mid-to-late convolutional blocks and immediately before global average pooling. This placement is informed by (i) the development of semantic richness in features, (ii) alignment of the receptive field with lesion scale, and (iii) the maintenance of stable gradient propagation in residual structures when SE follows the residual branch. Early layers extract basic textures and stain patterns; thus, reweighting at this stage is less beneficial. This information is less indicative of class and may increase the presence of noise. In contrast, later layers display greater inter-channel redundancy; SE diminishes this redundancy and enhances class differentiation.

3.6.4. Combining SA with SE Modules

To increase the representational capability of the network, we integrated both SE and SA modules within the VGG16 architecture in a complementary manner. SE blocks enable channel-wise recalibration, whereas SA modules focus on highlighting spatially relevant regions within feature maps. This integrated strategy enables the model to more effectively capture both inter-channel relationships and spatial contextual information. The SA module creates a SA map by applying max pooling and average pooling along the channel dimension, concatenating these results, and processing them with a

7 \times 7

convolution followed by a sigmoid activation. This operation enhances prominent spatial regions while diminishing the effects of background noise.

In our revised VGG16, SE and SA modules are strategically placed at distinct depths to optimize effectiveness. SA modules are introduced after the last convolutional layers of the first two convolutional blocks (conv1_2 and conv2_2), where low-level features such as edges and textures emerge, making spatial localization more valuable. SE modules are subsequently applied after the final convolutional layers of the third, fourth, and fifth convolutional blocks (conv3_3, conv4_3, and conv5_3), where abstract semantic features are prominent and channel recalibration is most advantageous. This arrangement creates a balance, enabling the network to extract detailed spatial features in the shallow layers and refine complex semantic representations in the deeper layers through channel attention. The classifier component of VGG16 remains consistent with the original pretrained version, except for substituting the last FC layer to accommodate the four classification categories. Through this complementary integration of SE and SA, the model enhances feature expression and classification performance while retaining computational efficiency.

3.7. Implementation Details

All experiments were conducted using PyTorch 2.7.1+cu118 on a system with an NVIDIA GPU. Two open-access medical imaging datasets, the Brain Tumor MRI dataset and the POC dataset, were utilized. Each dataset was partitioned into training and testing subsets, and all images were resized to

224 \times 224

. Training was performed for 60 epochs with a batch size of 32. We utilized five pretrained CNN architectures, VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5, as baseline models. To examine the influence of attention mechanisms, experiments were organized into four phases. The initial phase involved evaluating the baseline models in their unaltered form. In the following phases, SE blocks [4] were systematically incorporated into all convolutional blocks to capture channel interdependencies and enhance feature representations. We further investigated selective integration, adding SE modules exclusively to upper layers (for example, Blocks 3–5 in VGG16) to reduce computational requirements while targeting improvements in semantic features. In the final phase, SA modules [5] were integrated alongside SE blocks in the deeper layers to promote improved spatial localization and further enhance classification results.

All models were trained employing cross-entropy loss and the Adam optimizer. Two distinct parameter groups were established: the backbone CNN layers, assigned a learning rate of 0.0001, and the SE block parameters, granted a higher learning rate of 0.0006. A StepLR scheduler was used, decreasing the learning rate by a factor of 0.1 after every 10 epochs. Early stopping, using a 20-epoch patience threshold, was implemented to mitigate overfitting. Model performance assessment was conducted using accuracy, precision, recall, and F1-score, all computed from the confusion matrix of the test set. During training, the model with the highest test accuracy was preserved, and its evaluation metrics were documented in an output file for further analysis. To promote reproducibility, results for all experimental runs were comprehensively logged, including each confusion matrix and per-class values for precision, recall, and F1-score. Experiments were conducted with fixed random seeds to ensure consistency across all settings.

Training Parameters

The models were implemented and trained using the PyTorch framework for a total of 60 epochs with a batch size of 32. All input images were resized to

224 \times 224

pixels, and normalized according to the default torchvision preprocessing protocol. Both the Brain Tumor and POC datasets were structured as four-class classification tasks, corresponding to their predefined categories. Training utilized the Cross-Entropy Loss function throughout. The Adam optimizer was configured with two parameter groups that separately updated backbone layers and attention mechanisms. For the backbone CNN, a learning rate of 0.0001 and weight decay of 0.0001 were applied. Regarding the attention modules (SE/CBAM), learning rates of 0.0006 and a weight decay of 0.0001 were employed. A StepLR scheduler served to systematically decrease the learning rate, operating with a step size of 10 epochs and a decay factor of

γ = 0.1

. Early stopping was applied, utilizing a patience value of 20 epochs and monitoring test accuracy; training was halted if no further improvement was detected. At the conclusion of each epoch, the model’s performance was assessed using the test set. The model exhibiting the highest test accuracy was designated as the best-performing model and subsequently saved. For each experiment, evaluation metrics, confusion matrix, precision, recall, and F1-score were systematically recorded. All experimental procedures were executed in an NVIDIA GPU-enabled computing environment.

The comprehensive list of training parameters utilized in all experiments is summarized in Table 4.

4. Results and Discussion

This section details and interprets the classification results of various CNN architectures augmented with channel and spatial attention modules on two datasets: the POC dataset and the Brain Tumor MRI dataset. The evaluation criteria include test accuracy, precision, recall, and F1-score. The experimental findings are systematically organized into four parts: baseline CNN performance, global SE integration, selective SE integration, and hybrid attention integration.

4.1. Baseline Performance of Pretrained CNN Models

Table 5 and Table 6 display the baseline outcomes of pretrained CNN architectures excluding attention mechanisms. For the POC dataset, EfficientNetB5 achieved the highest test accuracy at 86.05%, while VGG16 demonstrated the lowest performance at 78.22%. In the Brain Tumor dataset, DenseNet121 achieved the highest result, with 81.00% accuracy and an F1-score of 0.7940; EfficientNetB5 and InceptionV3 yielded similar performance levels.

These initial findings highlight the superior generalization abilities found in deeper and compound-scaled architectures such as EfficientNet and DenseNet, positioning them as effective baselines for evaluating the influence of attention mechanisms.

4.2. Global Integration of SE Modules

To investigate the effect of globally applied channel-wise attention, SE blocks were embedded into every convolutional block within each pretrained architecture (Table 7 and Table 8). All models exhibited consistent improvements across both datasets.

For the POC dataset, VGG16_SE yielded a notable improvement, achieving 86.65% accuracy and an F1-score of 0.8583. On the Brain Tumor dataset, EfficientNetB5_SE exhibited evident gains, with test accuracy rising from 80.50% to 84.37% and an F1-score of 0.8235. These results support the utility of channel recalibration in enhancing feature selectivity and advancing classification outcomes.

4.3. Selective SE Integration into Deeper Layers

Table 9 and Table 10 show the results of employing SE blocks only in deeper layers, a method that lowers computational demands while maintaining advanced semantic feature refinement. The data reveal modest but consistent improvements over models with global SE integration.

On the POC dataset, VGG16_SE_after_B3_4_5 achieved the best performance, with an F1-score of 87.08% and a test accuracy of 87.95%. For the Brain Tumor dataset, EfficientNetB5_SE_after_B2_3_4 delivered the highest results, reaching 86.53% accuracy and an F1-score of 85.11%, surpassing even the globally integrated SE variants. These outcomes indicate that applying attention to deeper layers plays a more critical role in decision-making than refining shallow features. The comparative performance of CNN architectures with selectively integrated SE modules on the POC dataset and brain tumor dataset is illustrated in Figure 8 and Figure 9. The heatmap clearly shows the variation in test accuracy, precision, recall, and F1-score across different model configurations.

4.4. Hybrid Attention: Combined SE and SA

To enhance both the type of features emphasized and the spatial focus, SA modules were combined with SE blocks, as detailed in Table 11 and Table 12. This hybrid architecture consistently produced the most robust outcomes across both datasets.

For the POC dataset, EfficientNetB5_SA_after_B9_SE_after_B2_3_4 acquired the best performance, reaching 89.97% accuracy with 89.72% F1-score. Similarly, on the Brain Tumor dataset, ResNet18 with hybrid attention attained 84.37% accuracy with 82.60% F1-score. These findings highlight the complementary roles of spatial and channel attention, where SA improves feature localization and SE strengthens channel relevance. Figure 10 and Figure 11 present the analysis of SA mechanisms on the POC dataset and brain tumor dataset by selectively integrating SE modules at different depths within CNN architectures. The heatmap highlights the variation in test accuracy, precision, recall, and F1-score across five configurations. EfficientNetB5 with selective SE placement achieved the highest overall performance, indicating the benefit of deeper SE integration for medical image classification.

4.5. Comparative Analysis and Key Observations

Our results provide several notable findings beyond the common assertion that “attention boosts accuracy.” The comparative performance of five CNNs across two medical imaging tasks demonstrates that the effect of attention mechanisms varies considerably based on both network depth and task characteristics. Specifically, attention introduced at earlier layers tends to improve feature discrimination for classification tasks, whereas mid or later layer attention enhances spatial coherence in segmentation. This suggests that the strategic placement of attention modules can support the development of more effective models for medical image analysis. By adhering to consistent preprocessing, training, and evaluation protocols, this study supplies a reproducible foundation for evaluating attention mechanisms in CNN-based models.

The experimental findings yield several key insights. First, EfficientNetB5 consistently outperformed other models, especially when enhanced with hybrid attention, owing to its balanced approach to scaling depth, width, and resolution. Second, hybrid attention yielded the highest overall metrics, emphasizing the advantage of integrating both channel-wise and SA within medical image classification. Third, selective integration demonstrated greater effectiveness than global application by providing similar performance gains with reduced computational requirements. Furthermore, attention mechanisms improved not only overall accuracy, but also class-specific metrics, as reflected in increased recall, precision, and F1-scores. These results collectively demonstrate that CNNs augmented with attention mechanisms offer significant improvements in both the reliability and accuracy of phenotypic pattern classification in spontaneous abortion and brain tumor diagnosis. Of all tested methodologies, selective hybrid attention applied to deeper layers was the most successful.

Comparative Analysis Across Models

A thorough comparison of all CNN models, both with and without attention mechanisms, reveals notable trends regarding architectural design and the impact of attention strategies. EfficientNetB5 consistently emerged as the optimal backbone across both datasets, achieving the highest F1-scores and test accuracy when paired with hybrid attention that combined SE and spatial modules. Its compound scaling approach facilitates more efficient feature extraction with fewer parameters compared to traditional deep networks. For every architecture assessed, models enhanced with attention surpassed their corresponding baselines. The global application of SE blocks increased both accuracy and F1-scores; however, targeted SE deployment within deeper layers yielded even greater improvements, reinforcing that higher-level semantic features are especially crucial for classification. Hybrid attention demonstrated superior performance overall, for example, EfficientNetB5_SA_after_B9_SE_after_B2_3_4 produced the highest results, achieving 89.97% accuracy and an F1-score of 89.72% on the POC dataset. These results indicate that fusing channel and spatial attention modules offers complementary benefits, guiding networks to concentrate on both the most pertinent regions and the most informative feature channels.

Lighter models such as ResNet18 and VGG16 also experienced significant improvement following the integration of attention mechanisms. Despite having shallower architectures, incorporating SE and SA modules markedly enhanced their classification capability, suggesting that attention can compensate for the intrinsic constraints of simpler backbones. The improvements were especially prominent on the Brain Tumor dataset, where high intra-class variability and nuanced texture differences make attention-guided feature learning particularly impactful. Conversely, InceptionV3 demonstrated less uniform advances: although SE contributed to measurable performance gains, the addition of hybrid attention yielded only limited benefits. This may result from the multi-branch design of Inception modules, which inherently capture diverse receptive fields and spatial patterns. In summary, these findings demonstrate that attention mechanisms, especially when used selectively and in combination, enhance robustness and generalization across CNN frameworks. EfficientNetB5 augmented with hybrid attention stood out as the most effective model for both Brain Tumor and POC classification. More generally, this evidence highlights the importance of engineering lightweight yet attention-integrated architectures to ensure reliable results in real-world medical image analysis scenarios.

5. Limitations and Future Work

While integrating attention mechanisms into pretrained CNN architectures has demonstrated notable performance improvements for brain tumor and POC classification, several limitations should be considered. Firstly, this work evaluated only two datasets, specifically the Brain Tumor and POC datasets. The inclusion of more heterogeneous datasets is necessary to validate generalizability across diverse clinical settings and a wider range of imaging modalities. Secondly, although attention modules such as SE and hybrid spatial–channel mechanisms offer classification enhancements, their integration adds additional complexity and computational overhead, potentially restricting deployment in real-time or resource-constrained environments, including mobile health applications and embedded diagnostic devices. A further limitation arises from the predetermined, manually selected positions of attention modules within the network architectures. These choices were informed by prior literature and empirical observations, yet they might not represent the most effective design choices. Automated strategies for optimizing module placement could yield superior architectures. Lastly, while performance increased, there remains a risk of overfitting, particularly in deeper models trained on relatively limited datasets, highlighting the ongoing need for rigorous cross-validation protocols and the development of more effective regularization techniques. In future work, we aim to expand experimentation to encompass larger and more varied datasets that reflect a broader spectrum of imaging scenarios and patient cohorts. To address computational challenges, we intend to investigate model compression techniques including pruning, quantization, and knowledge distillation. We will also explore automated strategies for attention module integration, employing neural architecture search or reinforcement learning to identify optimal configurations. Moreover, integrating multimodal data, combining imaging with clinical information, may improve model robustness and support more contextually informed diagnostic systems. In conclusion, while the proposed attention-augmented CNN models exhibit considerable potential for tumor classification, systematically addressing the identified limitations remains essential for facilitating their adoption in clinical practice.

5.1. Model Generalizability

Model generalizability plays a critical role in assessing the clinical utility of deep learning models. In this study, we assessed generalization by training and evaluating attention-augmented CNN architectures on two separate datasets: the Brain Tumor dataset and the POC dataset. The consistent improvements observed across both datasets demonstrate the robustness of models enhanced with attention mechanisms. These results suggest that attention mechanisms allow the networks to prioritize more discriminative and clinically relevant features, helping to minimize overfitting to dataset-specific characteristics. Pretrained architectures such as EfficientNetB5, ResNet18, and MobileNetV2 additionally supported improved generalization by leveraging knowledge acquired from large-scale natural image datasets and applying it in the medical context. Fine-tuning these models with medical data enabled the preservation of broad visual features while simultaneously adapting to domain-specific patterns. This approach to transfer learning is especially advantageous in situations where annotated medical datasets are scarce, as it facilitates superior performance on previously unseen instances. However, it is important to acknowledge that the generalizability of these models is most accurately established through evaluation on a wider array of datasets, encompassing images from various institutions, imaging equipment, and diverse patient demographics. Consequently, future research should focus on cross-dataset and cross-institutional validation to more thoroughly evaluate the models’ adaptability to real-world variability. In summary, the proposed attention-augmented CNNs demonstrate strong generalizability within the context of this research. With additional external validation and optimization, these models have significant promise for reliable use across a range of clinical environments.

5.2. Potential for Real-World Clinical Deployment

The promising outcomes from incorporating attention mechanisms into CNN architectures underscore their high potential for real-world clinical implementation in brain tumor and POC classification. Achievement of high classification accuracy, stable performance across both datasets, and improved generalizability via attention-based feature refinement establish these models as valuable assets for aiding radiologists and clinicians in diagnostic tasks. Notably, EfficientNetB5 with hybrid attention achieved the highest overall accuracy, while also demonstrating the capability to localize crucial imaging features, which is particularly important for detecting nuanced patterns in complex medical images. Utilizing pretrained backbones that are subsequently fine-tuned on medical datasets further enhances their clinical relevance, as this allows effective model training despite the limited availability of annotated samples common in healthcare contexts. This benefit of transfer learning, when integrated with modular attention mechanisms, enables models to adapt rapidly to novel tasks while requiring minimal annotation effort. Additionally, lightweight architectures such as ResNet18, when equipped with attention components, achieve efficient and precise predictions, rendering them ideal for use in portable or embedded diagnostic applications. However, several implementation challenges must be resolved before clinical deployment. Achieving real-time computation, ensuring interpretability, maintaining data security, and integrating seamlessly with hospital information systems are essential for successful adoption. Decision support tools should not only provide high levels of predictive accuracy but also generate interpretable explanations, such as visualizing attention-based feature maps, to build trust with healthcare practitioners. Moreover, regulatory certification and extensive clinical validation are imperative to establish reliability, safety, and conformance with medical guidelines.

In conclusion, CNN architectures augmented with attention mechanisms show considerable potential as the basis for clinical decision support systems. With ongoing enhancements aimed at improving interpretability, operational efficiency, and alignment with regulatory standards, these models may ultimately serve as highly effective tools in medical image-based diagnostics. Furthermore, in Table 2 presents three variants of EfficientNet-B5 augmented with attention mechanisms. The first variant incorporates a custom Mobile Inverted Bottleneck Convolution (MBConv) block with an integrated SE module, placed after the pretrained EfficientNet-B5 feature extractor. This configuration replaces or complements the final processing layer with a block designed to adaptively recalibrate channel outputs, thereby increasing the model’s capacity to focus on salient features at the conclusion of the pipeline. The second variant preserves the native EfficientNet-B5 architecture while embedding SE modules after three critical convolutional blocks, specifically, blocks 4, 9, and 13. Through adaptive channel recalibration at multiple stages, this configuration progressively refines internal feature representations and maintains the integrity of the main convolutional pathway. The third variant expands upon this strategy by including SE modules at these positions alongside an additional SA block inserted following block 9. In this combined configuration, SE modules selectively enhance informative channels, and the SA module generates SA maps that delineate pertinent regions within the feature representations. This integrated attention strategy boosts the model’s discriminative power by addressing both the selection of key channels and the emphasis on significant spatial locations. Together, these variants exemplify distinct methodologies for embedding attention in EfficientNet-B5, ranging from modifying the final stage to implementing hierarchical channel recalibration and combining both channel and spatial-level attention to achieve more nuanced and robust feature enhancement.

6. Conclusions

This study investigated the impact of incorporating attention mechanisms into several pretrained CNN architectures for the classification of POC and brain tumor images. Notable improvements were consistently observed across both datasets when channel-based SE and hybrid spatial–channel attention modules were added to models such as EfficientNetB5, VGG16, ResNet18, InceptionV3, and DenseNet121. These modifications enhanced feature discrimination and increased prediction reliability, with EfficientNetB5 combined with hybrid attention demonstrating the highest overall performance. The integration of attention also contributed to improved feature localization, thereby increasing interpretability in medical image analysis. Although the findings illustrate substantial performance improvements within the evaluated datasets, broader validation involving additional imaging modalities and larger cohorts is required before these results can be generalized to wider medical diagnostic applications. Nonetheless, this study establishes a systematic approach for introducing attention mechanisms into CNN-based medical image classification and provides actionable insights for future advancements in model development and clinical practice.

Author Contributions

Conceptualization, Z.U. and J.K.; Methodology, Z.U.; Software, Z.U.; Formal analysis, Z.U., M.H., T.M. and J.K.; Investigation, Z.U. and J.K.; Data curation, Z.U., M.H. and T.M.; Writing—original draft, Z.U.; Writing—review and editing, Z.U. and J.K.; Supervision, J.K.; Project administration, J.K.; Funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the “Regional Innovation System & Education (RISE)” through the Seoul RISE Center, funded by the MOE (Ministry of Education) and the Seoul Metropolitan Government, and conducted in collaboration with HuVet bio, Inc. (https://www.huvetbio.com/) (2025-RISE-01-007-05), and the Artificial Intelligence Convergence Innovation Human Resources Development supervised by the MSIT(Ministry of Science and ICT) and the IITP (Institute for Information & Communications Technology Planning & Evaluation) (IITP-2025-RS-2023-00254592).

Data Availability Statement

The data presented in this study are openly available at https://doi.org/10.34740/kaggle/dsv/12745533 and https://www.kaggle.com/datasets/sartajbhuvaji/brain-tumor-classification-mri (accessed on 20 October 2025).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Abd El Kader, I.; Xu, G.; Shuai, Z.; Saminu, S.; Javaid, I.; Salim Ahmad, I. Differential deep convolutional neural network model for brain tumor classification. Brain Sci. 2021, 11, 352. [Google Scholar] [CrossRef]
Khan, A.H.; Abbas, S.; Khan, M.A.; Farooq, U.; Khan, W.A.; Siddiqui, S.Y.; Ahmad, A. Intelligent model for brain tumor identification using deep learning. Appl. Comput. Intell. Soft Comput. 2022, 2022, 8104054. [Google Scholar] [CrossRef]
Nadeem, M.W.; Ghamdi, M.A.A.; Hussain, M.; Khan, M.A.; Khan, K.M.; Almotiri, S.H.; Butt, S.A. Brain tumor analysis empowered with deep learning: A review, taxonomy, and future challenges. Brain Sci. 2020, 10, 118. [Google Scholar] [CrossRef] [PubMed]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Shisu, Y.; Mingwin, S.; Wanwag, Y.; Chenso, Z.; Huing, S. Improved EATFormer: A vision transformer for medical image classification. arXiv 2024, arXiv:2403.13167. [Google Scholar]
Chung, D. Artificial intelligence in healthcare and medicine technology development review. Eng. Appl. Artif. Intell. 2025, 143, 109801. [Google Scholar] [CrossRef]
Yu, H.; Yang, L.T.; Zhang, Q.; Armstrong, D.; Deen, M.J. Convolutional neural networks for medical image analysis: State-of-the-art, comparisons, improvement and perspectives. Neurocomputing 2021, 444, 92–110. [Google Scholar] [CrossRef]
Ullah, Z.; Kim, J. Hierarchical Deep Feature Fusion and Ensemble Learning for Enhanced Brain Tumor MRI Classification. arXiv 2025, arXiv:2506.12363. [Google Scholar] [CrossRef]
Ullah, Z.; Pamucar, D.; Kim, J. Hybrid Ensemble Approaches: Optimal Deep Feature Fusion and Hyperparameter-Tuned Classifier Ensembling for Enhanced Brain Tumor Classification. arXiv 2025, arXiv:2507.12177. [Google Scholar] [CrossRef]
Asif, R.N.; Naseem, M.T.; Ahmad, M.; Mazhar, T.; Khan, M.A.; Khan, M.A.; Al-Rasheed, A.; Hamam, H. Brain tumor detection empowered with ensemble deep learning approaches from MRI scan images. Sci. Rep. 2025, 15, 15002. [Google Scholar] [CrossRef]
Babayomi, M.; Olagbaju, O.A.; Kadiri, A.A. Convolutional xgboost (c-xgboost) model for brain tumor detection. arXiv 2023, arXiv:2301.02317. [Google Scholar] [CrossRef]
Srinivasan, S.; Francis, D.; Mathivanan, S.K.; Rajadurai, H.; Shivahare, B.D.; Shah, M.A. A hybrid deep CNN model for brain tumor image multi-classification. BMC Med. Imaging 2024, 24, 21. [Google Scholar] [CrossRef]
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
Kim, H.E.; Cosa-Linan, A.; Santhanam, N.; Jannesari, M.; Maros, M.E.; Ganslandt, T. Transfer learning for medical image classification: A literature review. BMC Med. Imaging 2022, 22, 69. [Google Scholar] [CrossRef] [PubMed]
Sun, Y.; Wang, C. Brain tumor detection based on a novel and high-quality prediction of the tumor pixel distributions. Comput. Biol. Med. 2024, 172, 108196. [Google Scholar] [CrossRef] [PubMed]
Thakur, G.K.; Thakur, A.; Kulkarni, S.; Khan, N.; Khan, S. Deep learning approaches for medical image analysis and diagnosis. Cureus 2024, 16, e59507. [Google Scholar] [CrossRef]
Ge, R.; Shen, T.; Zhou, Y.; Liu, C.; Zhang, L.; Yang, B.; Yan, Y.; Coatrieux, J.L.; Chen, Y. Convolutional squeeze-and-excitation network for ECG arrhythmia detection. Artif. Intell. Med. 2021, 121, 102181. [Google Scholar] [CrossRef]
Li, Y.; Liu, Y.; Cui, W.G.; Guo, Y.Z.; Huang, H.; Hu, Z.Y. Epileptic seizure detection in EEG signals using a unified temporal-spectral squeeze-and-excitation network. IEEE Trans. Neural Syst. Rehabil. Eng. 2020, 28, 782–794. [Google Scholar] [CrossRef]
Kitada, S.; Iyatomi, H. Skin lesion classification with ensemble of squeeze-and-excitation networks and semi-supervised learning. arXiv 2018, arXiv:1809.02568. [Google Scholar]
Zheng, H.; Fu, J.; Mei, T.; Luo, J. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5209–5217. [Google Scholar]
Gu, R.; Wang, G.; Song, T.; Huang, R.; Aertsen, M.; Deprest, J.; Ourselin, S.; Vercauteren, T.; Zhang, S. CA-Net: Comprehensive attention convolutional neural networks for explainable medical image segmentation. IEEE Trans. Med. Imaging 2020, 40, 699–711. [Google Scholar] [CrossRef]
Robert Singh, A.; Vidya, S.; Hariharasitaraman, S.; Athisayamani, S.; Hsu, F.R. Segmentation of Mammogram Images Using U-Net with Fusion of Channel and Spatial Attention Modules (U-Net CASAM). In Proceedings of the International Conference on Applied Soft Computing and Communication Networks, Online, 18–22 December 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 435–448. [Google Scholar]
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
Tan, M.; Le, Q.E. Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 15. [Google Scholar]
Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
Mahmood, T.; Ullah, Z.; Latif, A.; Sultan, B.A.; Zubair, M.; Ullah, Z.; Ansari, A.; Zehra, T.; Ahmed, S.; Dilshad, N. Computer-Aided Diagnosis in Spontaneous Abortion: A Histopathology Dataset and Benchmark for Products of Conception. Diagnostics 2024, 14, 2877. [Google Scholar] [CrossRef] [PubMed]
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]

Figure 1. Structure of SE module [4].

Figure 2. Structure of SA module [5].

Figure 3. Representative samples from the POC and BT-Large-4C datasets.

Figure 4. Several architectural variants of VGG16 [24] incorporate attention mechanisms. In the initial variant (VGG16-SE v1), SE blocks are introduced after every convolutional layer, thereby facilitating channel recalibration across the network. The second variant (VGG16-SE v2) employs SE blocks selectively, limiting their placement to the final convolutions of Blocks 3, 4, and 5; this is intended to prioritize higher-level feature refinement while minimizing computational demands. The third variant (VGG16-SE-SA) integrates both channel and SA, where SA modules are implemented following the final convolutions of Blocks 1 and 2, and SE blocks are placed after the final convolutions of Blocks 3, 4, and 5, thus enhancing the network’s capacity to model both spatial and channel-wise relationships.

Figure 5. Illustrates three ResNet18-based [25] variants that incorporate attention mechanisms. For the first variant (SEResNet18 v1), standard BasicBlocks are replaced by SE-enhanced blocks, facilitating channel-wise recalibration across the entire network. The second variant (SEResNet18 v2), by introducing SE blocks after layers 2, 3, and 4, focuses recalibration on mid and high-level features. The third variant (SEResNet18-SA) sequentially applies both SE and SA modules after layers 2, 3, and 4, which allows for the refinement of network features along the channel and spatial axes.

Figure 6. Three InceptionV3-based [30] variants are developed by integrating attention mechanisms. The first variant (InceptionV3-SE v1) enhances the baseline by introducing SE blocks after selected stages to enable channel recalibration. The second variant (InceptionV3-SE v2) integrates SE blocks specifically after the Inception-C, Inception-D, and Inception-E modules, resulting in more targeted channel-wise refinement. The third variant (InceptionV3-SE-SA) utilizes both SE and SA, where SA modules are added after the Inception-B module and immediately before global pooling, while SE blocks are positioned after the Inception-C, D, and E modules. This combined approach strengthens feature representations in both the spatial and channel dimensions.

Figure 7. Three DenseNet121-based [28] variants are shown, each integrating attention mechanisms. In the first version (DenseNet121-SE v1), SE blocks are added after each of the four dense blocks to optimize channel-wise responses across the network. The second variant (DenseNet121-SE v2) places SE blocks only after dense blocks 2, 3, and 4, targeting refinement in higher-level channel features and managing additional complexity. The third approach (DenseNet121-SE-SA) combines both SE and SA modules, assigning SE after dense blocks 2–4 and embedding SA after transition layers 1–3 and before the global average pooling layer. This combined method delivers both channel- and spatial-level attention, strengthening the model’s ability to identify significant features and important spatial regions.

Figure 8. This heatmap displays the performance of CNN models incorporating SE modules selectively on the POC dataset across core metrics: test accuracy, precision, recall, and F1-score. Each cell depicts the respective quantitative value, with darker shades corresponding to higher scores. The visualization demonstrates how selective SE integration bolsters CNN effectiveness for medical image classification.

Figure 9. Heatmap illustrating the performance of CNN architectures with selectively integrated SE modules on the brain tumor dataset across key evaluation metrics, including test accuracy, precision, recall, and F1-score. Each cell represents the corresponding quantitative performance value, where darker shades indicate higher scores. This visualization highlights the effectiveness of selective SE integration in enhancing CNN performance for medical image classification.

Figure 10. Heatmap illustrating the comparative performance of five CNN architectures on the POC dataset across key evaluation metrics, including test accuracy, precision, recall, and F1-score. Each cell represents the quantitative performance of a model on a specific metric, where darker shades correspond to higher values. This visualization highlights the relative effectiveness of each model configuration in the medical image classification task using the POC dataset.

Figure 11. Heatmap showing the performance of different CNN architectures on the brain tumor dataset across test accuracy, precision, recall, and F1-score.

Table 1. Details of the datasets.

Dataset	Number of Class	Training Set	Test Set
Product of Conception	4	4155	1511
BT-large-4c	4	2611	653

Table 2. Locations of SE and SA modules within various EfficientNet-B5 variants, detailing their placements following particular blocks in the feature extraction process.

Model	SE Placement	SA Placement
Variant 1 (MBConv + SE)	Inside the custom MBConv block (applied after the depthwise convolution and before the projection layer)	None
Variant 2 (SE after blocks)	-After Block 2 (end of index 4, output channels = 176) -After Block 3 (end of index 9, output channels = 512) -After Block 4 (end of index 13, output channels = 512)	None
Variant 3 (SE + SA)	-After Block 2 (index 4) - After Block 3 (index 9) -After Block 4 (index 13)	-After Block 3 (index 9, right after SE3)

Table 3. Computational efficiency of baseline and attention-integrated CNNs. Values are approximate and based on standard 224 × 224 input resolution. FLOPs: floating-point operations. Inference time and memory usage were estimated on an NVIDIA RTX 3090 GPU.

Model	Parameters (M)	FLOPs (G)	Inference Time (ms)	Memory Usage (MB)
VGG16 (Baseline)	138.4	15.5	4.2	780
VGG16 + SE	139.0	15.7	4.6	810
VGG16 + CBAM	139.3	15.9	4.9	830
ResNet18 (Baseline)	11.7	1.8	3.5	410
ResNet18 + SE	12.0	1.9	3.9	430
ResNet18 + CBAM	12.3	2.0	4.2	445
InceptionV3 (Baseline)	23.8	5.7	5.1	520
InceptionV3 + SE	24.1	5.8	5.5	540
InceptionV3 + CBAM	24.4	5.9	5.8	555
DenseNet121 (Baseline)	8.0	2.8	3.8	460
DenseNet121 + SE	8.3	2.9	4.0	475
DenseNet121 + CBAM	8.6	3.0	4.3	490
EfficientNetB5 (Baseline)	28.4	9.9	6.8	630
EfficientNetB5 + SE	28.7	10.1	7.1	650
EfficientNetB5 + CBAM	29.0	10.3	7.4	670

Table 4. Summary of Training Parameters.

Parameter	Value
Framework	PyTorch
Device	NVIDIA GPU
Number of Epochs	60
Batch Size	32
Input Image Size	$224 \times 224$
Number of Classes	4
Loss Function	Cross-Entropy Loss
Optimizer	Adam
Backbone Learning Rate	0.0001
SE/CBAM Learning Rate	0.0006
Weight Decay	0.0001
Learning Rate Scheduler	StepLR (step size = 10, gamma = 0.1)
Early Stopping Patience	20 epochs
Model Selection Criterion	Best test accuracy
Evaluation Metrics	Accuracy, Precision, Recall, F1-score

Table 5. Pretrained CNN models results on POC dataset, the best performance is indicated in bold.

Model	Test Accuracy	Precision	Recall	F1-Score
Pre-trained_VGG16	0.7822	0.7825	0.7711	0.7708
Pre-trained_ResNet18	0.8506	0.8537	0.8413	0.8416
Pre-trained_InceptionV3	0.8309	0.8361	0.8183	0.8137
Pre-trained_EfficientNet_B5	0.8605	0.8695	0.8516	0.8519
Pre-trained_DenseNet121	0.8486	0.8531	0.8400	0.8430

Table 6. Pretrained CNN models results on brain tumor dataset, the best performance is indicated in bold.

Model	Test Accuracy	Precision	Recall	F1-Score
Pre-trained_VGG16	0.7425	0.8210	0.7186	0.6984
Pre-trained_ResNet18	0.7950	0.8621	0.7839	0.7673
Pre-trained_InceptionV3	0.8025	0.8538	0.7826	0.7793
Pre-trained_EfficientNet	0.8050	0.8501	0.7878	0.7907
Pre-trained_DenseNet121	0.8100	0.8484	0.8026	0.7940

Table 7. Performance evaluation of pretrained CNN models with globally integrated SE modules on the POC dataset, emphasizing the contribution of channel-wise attention to classification outcomes. The best performance is indicated in bold.

Model	Test Accuracy	Precision	Recall	F1-Score
VGG16_SE	0.8665	0.8797	0.8560	0.8583
ResNet18_SE	0.8593	0.8671	0.8486	0.8516
InceptionV3_SE	0.8639	0.8723	0.8524	0.8556
EfficientNetB5_SE	0.8574	0.8621	0.8458	0.8482
DenseNet121_SE	0.8528	0.8572	0.8399	0.8415

Table 8. Performance evaluation of pretrained CNN models with globally integrated SE modules on the brain tumor dataset, emphasizing the contribution of channel-wise attention to classification outcomes. The best performance is indicated in bold.

Model	Test Accuracy	Precision	Recall	F1-Score
VGG16_SE	0.7980	0.8479	0.7710	0.7640
ResNet18_SE	0.8052	0.8657	0.7729	0.7751
InceptionV3_SE	0.8317	0.8802	0.8066	0.8094
EfficientNetB5_SE	0.8437	0.8860	0.8270	0.8235
DenseNet121_SE	0.8365	0.8814	0.8172	0.8159

Table 9. This table reports the selective insertion of SE blocks into convolutional blocks of pretrained CNN models using the POC dataset. SE blocks are excluded from the initial layers and incorporated into later-stage modules to enhance attention in deeper network parts. The best performance is indicated in bold.

Model	Test Accuracy	Precision	Recall	F1-Score
VGG16_SE_after_B3_4_5	0.8795	0.8957	0.8691	0.8708
ResNet18_SE_after_B_2_3_4	0.8619	0.8685	0.8506	0.8538
InceptionV3_SE_after_C_D_E	0.8541	0.8603	0.8421	0.8445
EfficientNetB5_SE_after_B_2_3_4	0.8717	0.8788	0.8617	0.8634
DenseNet121_SE_after_B_2_3_4	0.8561	0.8614	0.8443	0.8471

Table 10. Selective SE integration for pretrained CNN models on the brain tumor dataset skips SE blocks in early layers and introduces them in deeper modules, thereby focusing the model’s attention on high-level feature representation. The best performance is indicated in bold.

Model	Test Accuracy	Precision	Recall	F1-Score
VGG16_SE _after_B3_4_5	0.8509	0.8709	0.8358	0.8314
ResNet18_SE_after_B2_3_4	0.8437	0.8831	0.8220	0.8254
InceptionV3_SE _after_C_D_E	0.8269	0.8782	0.8077	0.8022
EfficientNetB5_SE_after_B2_3_4	0.8653	0.8972	0.8542	0.8511
DenseNet121_SE _after_B2_3_4	0.8197	0.8674	0.8006	0.7946

Table 11. Analysis of SA mechanisms on the POC dataset by selectively integrating SE modules at various depths within CNN architectures. The best performance is indicated in bold.

Model	Test Accuracy	Precision	Recall	F1-Score
VGG16_SA_after_B1_2 _SE_after_B3_4_5	0.8802	0.8840	0.8721	0.8750
ResNet18_SA_after_B2_ 3_4_SE_after_B2_3_4	0.8626	0.8706	0.8518	0.8521
InceptionV3_SA_after_ InceptionB_and_before_ global_avg_Pooling_SE_ after_C_D_E	0.8587	0.8642	0.8474	0.8503
EfficientNetB5_SA_after_ B9_SE_after_B2_3_4	0.8997	0.9000	0.8963	0.8972
DenseNet121_SA_after_T_1 _2_3_before_global_avg_ pooling_SE_after_B_2_3_4	0.8593	0.8640	0.8476	0.8495

Table 12. Analysis of SA mechanisms on the brain tumor dataset by selectively integrating SE modules at various depths within CNN architectures. The best performance is indicated in bold.

Model	Test Accuracy	Precision	Recall	F1-Score
VGG16_SA_after_B1_2 _SE_after_B3_4_5	0.8341	0.8375	0.8189	0.8106
ResNet18_SA_after_B2_ 3_4_SE_after_B_2_3_4	0.8437	0.8745	0.8227	0.8260
InceptionV3_SA_after_ InceptionB_and_before_ global_avg_Pooling_SE_ after_C_D_E	0.8293	0.8753	0.8088	0.8062
EfficientNetB5_SA_after_ B9_SE_after_B_2_3_4	0.8341	0.8747	0.8180	0.8110
DenseNet121_SA_after_T_1 _2_3_before_global_avg_ pooling_SE_after_B_2_3_4	0.8293	0.8790	0.8067	0.8063

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ullah, Z.; Hong, M.; Mahmood, T.; Kim, J. Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Classification. Mathematics 2025, 13, 3728. https://doi.org/10.3390/math13223728

AMA Style

Ullah Z, Hong M, Mahmood T, Kim J. Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Classification. Mathematics. 2025; 13(22):3728. https://doi.org/10.3390/math13223728

Chicago/Turabian Style

Ullah, Zahid, Minki Hong, Tahir Mahmood, and Jihie Kim. 2025. "Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Classification" Mathematics 13, no. 22: 3728. https://doi.org/10.3390/math13223728

APA Style

Ullah, Z., Hong, M., Mahmood, T., & Kim, J. (2025). Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Classification. Mathematics, 13(22), 3728. https://doi.org/10.3390/math13223728

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Classification

Abstract

1. Introduction

2. Related Work

3. Materials and Methods

3.1. Datasets

3.1.1. Products of Conception Dataset

3.1.2. Brain Tumor Dataset

3.2. Baseline CNN Architectures

3.2.1. VGG16

3.2.2. ResNet18

3.2.3. InceptionV3

3.2.4. EfficientNetB5

3.2.5. DenseNet121

3.3. Evaluation Metrics

3.4. Mathematical Influence of SE Units on Learning

3.5. Computational Efficiency Analysis

3.6. Experimental Framework

3.6.1. Baseline Fine-Tuning of Pretrained CNNs

3.6.2. Integration of SE Modules

3.6.3. Selective Placement of SE Modules in CNN Models

3.6.4. Combining SA with SE Modules

3.7. Implementation Details

Training Parameters

4. Results and Discussion

4.1. Baseline Performance of Pretrained CNN Models

4.2. Global Integration of SE Modules

4.3. Selective SE Integration into Deeper Layers

4.4. Hybrid Attention: Combined SE and SA

4.5. Comparative Analysis and Key Observations

Comparative Analysis Across Models

5. Limitations and Future Work

5.1. Model Generalizability

5.2. Potential for Real-World Clinical Deployment

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI