1. Introduction
As social productivity rapidly develops and quality of life continues to improve [
1], there is a growing demand for medical imaging diagnostics among patients. Medical imaging (including X-rays, CT scans, pathological images, and MRI) serves as crucial reference material for imaging physicians’ diagnostic work, as shown in
Figure 1. Traditionally, the task of interpreting medical images has primarily fallen to radiologists and clinicians. However, current deep learning methods still face numerous challenges in medical image classification, such as insufficient model generalization, sensitivity to noisy data, and high computational resource consumption. Addressing these issues is crucial for improving the accuracy and efficiency of medical image classification.
Machine learning algorithms can rapidly detect anomalies through image feature analysis, effectively complementing manual diagnostic limitations and alleviating physicians’ workload. Nevertheless, medical imaging data often suffer from restricted dimensions and poor signal-to-noise ratios, which poses challenges for deep learning models in precisely identifying abnormalities using only visual information [
2]. Thus, the task of accurately and efficiently categorizing medical images to support physicians in diagnosing diseases continues to be a significant challenge within medical image analysis. Consequently, this research seeks to improve medical image classification performance by refining the architecture of the model and attention modules, targeting higher classification accuracy and greater robustness to noisy data.
Although deep learning technology has shown great potential in medical image analysis, its application in clinical practice has raised a series of ethical issues, particularly regarding data privacy. Medical image analysis faces the challenge of small sample sizes, and the traditional solution is to aggregate data from multiple sites. However, due to strict privacy protection policies, such as HIPAA (Health Insurance Portability and Accountability Act) and GDPR (General Data Protection Regulation), directly sharing medical image data is not feasible. Federated learning, as a solution for collaborative training without sharing data, can effectively protect data privacy and demonstrates its effectiveness and potential in addressing data privacy issues in medical image analysis [
3,
4].
The rapid advancement of deep learning in recent years has led to remarkable progress in multiple areas, including image classification, object detection, and semantic segmentation. Computer-assisted diagnostic technology employs deep learning methods to analyze medical images and patient data [
5], enabling both condition assessment and clinical decision support. The AlexNet model’s excellent performance on ImageNet dataset classification tasks marked a breakthrough in deep learning for image classification [
6]. Subsequently, VGG-Net improved classification accuracy by increasing network depth and reducing parameter count [
7]. GoogLeNet reduced parameter count through the Inception module and optimized computational resource usage [
8]. Nonetheless, these methods face limitations in medical image classification, including the neglect of multi-scale feature extraction and a lack of model interpretability. When adapted to medical image classification tasks, their performance typically falls well short of what they achieve on their original target tasks.
Currently, mainstream methods in medical image analysis are still based on ResNet and its variants. Given its modular design and robust feature extraction capabilities, ResNet demonstrates remarkable adaptability to diverse tasks in medical image processing [
9]. Nonetheless, ResNet was primarily conceived for image classification, characterized by restricted receptive fields and a lack of cross-channel and cross-spatial interaction functionalities, which could hinder its effectiveness in direct medical imaging scenarios. Therefore, manual adjustments to ResNet are usually required for specific tasks. Although such manual adjustments can improve performance to some extent, they also limit generalizability and efficiency in broader medical applications. For instance, Res2Net [
10] enhanced ResNet through improved multi-scale feature representation, ResNeXt [
11] achieved higher accuracy using grouped convolutions, and CBAM [
12] strengthened feature representation through the introduction of spatial attention modules. Beyond ResNet and its variants, the emergence of the Vision Transformer (ViT) has opened new avenues for medical image processing, and more research has begun to utilize ViT and its derivatives in this domain. MC-ViT [
13] uses two sub-networks to predict slice-level pathological information and whole-slide-level thymoma subtypes, significantly improving the accuracy of thymoma subtype classification. The MaxCerVixT [
14] model combines the multi-axis vision transformer (MaxViT) architecture, ConvNeXtv2 blocks, and GRN-based MLP layers, effectively improving the accuracy and inference speed of cervical cancer detection.
ResGANet [
15], introduced in 2022, enhances feature representation through grouped attention mechanisms, yet still exhibits limitations in addressing rotation invariance and local feature capture. Therefore, although existing methods have improved in accuracy, they are still insufficient in addressing key challenges in medical image analysis, such as rotation invariance, local feature capture, and cross-channel interaction. To overcome these limitations, we propose the Residual Group Dual-Channel Attention Network (ResGDANet), which substantially enhances model performance in medical classification tasks by incorporating Dual-Channel Attention Fusion (DCAF) and RMT modules [
16], showing stronger adaptability and robustness, especially in terms of rotation invariance, local feature capture, and cross-channel interaction.
This paper introduces ResGDANet, an innovative medical image classification method built upon ResGANet, and our proposed attention module designed for enhanced feature representation. The primary contributions of this work are:
- (1)
Dual-Channel Attention Fusion (DCAF) module: Employs a dual-branch architecture combining global average pooling and max pooling to enhance the network’s capacity for capturing locally significant features, while integrating channel attention and coordinate attention to improve the robustness and accuracy of feature representations.
- (2)
RMT module: By leveraging the retention mechanism of Retentive Networks, the RMT module allows the model to efficiently capture and preserve critical information in the data when facing rotating objects. Simultaneously, it incorporates the global context modeling ability of Vision Transformers to comprehend the connections between different components, leading to more effective feature extraction. Through the Manhattan Self-Attention (MaSA) mechanism, the model performs better in handling rotation invariance issues, significantly improving the performance of medical image classification tasks.
Section 1 emphasizes the research background and significance of image classification while providing an overview of recent developments in convolutional neural networks within the field of medical image analysis. Section 2 highlights current work pertaining to medical image classification tasks, encompassing residual networks, their various derivatives, and attention mechanisms. Section 3 presents ResGDANet, designed for image classification based on ResGANet; it begins by introducing the network model's components, then elaborates on the model's design and implementation methodology, including ResGANet, the Dual-Channel Attention Fusion (DCAF) module, and the RMT module. Section 4 begins with a comprehensive description of the experimental datasets and evaluation metrics; the subsequent experimental validation shows that the ResGDANet model outperforms the original model and other leading image classification models in terms of feature extraction ability and medical image classification accuracy. In conclusion, a comprehensive overview of this research and its findings is presented, highlighting existing limitations and suggesting potential directions and opportunities for future research and enhancements.
3. Methodology
3.1. Dual-Channel Attention Fusion Module
Through learning from data, attention mechanisms can weight multi-dimensional features according to their importance, enhancing critical features while suppressing interference. Hou et al. [
33] proposed embedding positional information into channel attention, using one-dimensional global average pooling in the horizontal and vertical directions to compress information, reducing positional information loss from 2D pooling. Nevertheless, this technique employs average pooling exclusively to derive global region data, potentially overlooking locally salient features in lesion areas, such as small lesions, edge details, or texture features [
34]. In medical image analysis, since many pathological regions share visual characteristics with normal tissues, it is essential to preserve fine local features to support subsequent tasks such as lesion classification, localization, and diagnosis.
To address this challenge, we designed the Dual-Channel Attention Fusion (DCAF) module, introducing a dual-branch structure combining global average pooling and max pooling to enhance the capture of significant local features in medical images, as shown in
Figure 3. Besides maintaining the original global average pooling branch, we incorporated a max pooling branch to capture prominent local features. By fusing feature information from both branches, the model can simultaneously utilize global contextual information and local salient features, thereby further improving the accuracy and robustness of classification tasks.
For input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the number of channels, and $H$ and $W$ are the height and width of the feature map, the spatial information is first aggregated through global average pooling and global max pooling:

$$F_{\mathrm{avg}} = \mathrm{AvgPool}(X), \qquad F_{\mathrm{max}} = \mathrm{MaxPool}(X) \quad (1), (2)$$

$F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$ represent the global average pooling and global max pooling results, respectively, with dimensions $B \times C \times 1 \times 1$. The outputs of these two branches generate channel attention maps through a shared-weight multilayer perceptron (MLP):

$$M_c = \sigma\!\left(W_1\,\delta\!\left(W_0 F_{\mathrm{avg}}\right) + W_1\,\delta\!\left(W_0 F_{\mathrm{max}}\right)\right) \quad (3)$$

where $M_c$ is the channel attention map, $W_0$ and $W_1$ are the weight matrices of the MLP, $\delta$ is the ReLU activation function, and $\sigma$ is the Sigmoid activation function. The reduction ratio, denoted as $r$, signifies that the number of channels in the input feature map is first reduced to $C/r$ via a $1 \times 1$ convolution and subsequently restored to the original channel count.
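A minimal PyTorch sketch of this channel attention branch is given below, assuming a reduction ratio of 16 and a 1 × 1-convolution implementation of the shared MLP; these defaults are illustrative rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionBranch(nn.Module):
    """Channel attention branch of the DCAF module: a shared MLP applied to the
    global average-pooled and global max-pooled descriptors (Equations (1)-(3))."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared-weight MLP realized as two 1x1 convolutions (C -> C/r -> C).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F_avg, F_max: (B, C, 1, 1) descriptors from the two pooling branches.
        f_avg = F.adaptive_avg_pool2d(x, 1)
        f_max = F.adaptive_max_pool2d(x, 1)
        # M_c = sigmoid(MLP(F_avg) + MLP(F_max)), Equation (3).
        m_c = torch.sigmoid(self.mlp(f_avg) + self.mlp(f_max))
        return x * m_c  # re-weight the channels of the input feature map
```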
The other branch is coordinate attention, which first calculates global information for each position in the spatial directions through global average pooling operations on the feature map's spatial dimensions. Specifically, for input feature map $X$ with dimensions $C \times H \times W$, global average pooling is performed in the horizontal and vertical directions to obtain feature maps $z^h$ and $z^w$ in Equation (4):

$$z^h_c(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i), \qquad z^w_c(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \quad (4)$$
This operation maintains global information on the feature map along horizontal and vertical directions, enabling the network to comprehend object distribution in both horizontal and vertical orientations.
To merge information from the height and width directions, we transpose the dimensions of $z^w$ from $C \times 1 \times W$ to $C \times W \times 1$, then concatenate it with $z^h$ along the spatial dimension to obtain feature map $f$:

$$f = \left[z^h,\ (z^w)^{\mathrm{T}}\right] \quad (5)$$
After a $1 \times 1$ convolution, batch normalization, and activation operations on the concatenated feature map, $f$ is split along the spatial dimension into two tensor representations, $f^h$ and $f^w$, as shown in Equations (6) and (7):

$$\left[f^h,\ f^w\right] = \mathrm{Split}\!\left(\delta\!\left(\mathrm{BN}\!\left(F_{1\times 1}(f)\right)\right)\right) \quad (6), (7)$$
Then, we pass $f^h$ and $f^w$ through $1 \times 1$ convolution layers to restore the channel number to $C$, and generate attention weights $g^h$ and $g^w$ through the Sigmoid activation function in Equation (8):

$$g^h = \sigma\!\left(F_h(f^h)\right), \qquad g^w = \sigma\!\left(F_w(f^w)\right) \quad (8)$$

where $F_h$ and $F_w$ denote the $1 \times 1$ convolutions applied along the two directions.
These weights reflect the spatial importance of each position in the feature map. Finally, we multiply the generated attention weights $g^h$ and $g^w$ with the input $X$ to yield the output feature map:

$$Y = X \otimes g^h \otimes g^w \quad (9)$$

The symbol $\otimes$ in the formula denotes element-wise multiplication (with broadcasting along the corresponding spatial dimensions).
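The coordinate attention branch described by Equations (4)-(9) can be sketched in PyTorch as follows; the hidden width max(8, C/r) and the ReLU nonlinearity are common defaults and are assumed here rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordinateAttentionBranch(nn.Module):
    """Coordinate attention branch (Equations (4)-(9)): directional pooling,
    shared 1x1 conv + BN + activation, split, and per-direction sigmoid gates."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Equation (4): directional global average pooling.
        z_h = F.adaptive_avg_pool2d(x, (h, 1))                  # (B, C, H, 1)
        z_w = F.adaptive_avg_pool2d(x, (1, w)).permute(0, 1, 3, 2)  # (B, C, W, 1)
        # Equation (5): concatenate along the spatial dimension, then conv + BN + act.
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        # Equations (6)-(7): split back into the two directions.
        f_h, f_w = torch.split(f, [h, w], dim=2)
        f_w = f_w.permute(0, 1, 3, 2)                            # (B, mid, 1, W)
        # Equation (8): per-direction attention weights.
        g_h = torch.sigmoid(self.conv_h(f_h))                    # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w))                    # (B, C, 1, W)
        # Equation (9): broadcasted element-wise re-weighting of the input.
        return x * g_h * g_w
```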
To effectively integrate channel attention with coordinate attention, the DCAF module combines their outputs using adaptive weights. Initially, the coordinate attention and channel attention branches generate feature maps $F_{\mathrm{coord}}$ and $F_{\mathrm{chan}}$, respectively:

$$F_{\mathrm{coord}} = Y, \qquad F_{\mathrm{chan}} = M_c \otimes X \quad (10)$$

In Equation (10), $F_{\mathrm{coord}}$ and $F_{\mathrm{chan}}$ refer to the outputs of the coordinate attention module and the channel attention module, respectively. Their relationship with Equations (3) and (9) is as follows: $F_{\mathrm{coord}}$ is obtained from the coordinate attention output in Equation (9), while $F_{\mathrm{chan}}$ is obtained by applying the channel attention map from Equation (3) to the input.
Finally, $F_{\mathrm{coord}}$ and $F_{\mathrm{chan}}$ pass through the Gated Fusion Unit in Figure 4. This module combines the feature maps of coordinate attention and channel attention through an adaptive weighting mechanism, effectively enhancing the model's performance in medical image classification. Through this operation, the network can better focus on locally important information and exhibit stronger adaptability in complex medical images. The module fuses the outputs of the two branches using an adaptive weight $\alpha$, resulting in $Y_{\mathrm{out}}$ in Equation (12):

$$Y_{\mathrm{out}} = \alpha \otimes F_{\mathrm{coord}} + (1 - \alpha) \otimes F_{\mathrm{chan}} \quad (12)$$

where $\alpha$ is the adaptive fusion weight and $Y_{\mathrm{out}}$ is the final output feature map.
The detailed pseudocode description of the DCAF module can be found in Algorithm A1 in
Appendix A.
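Since Algorithm A1 is not reproduced in this section, the sketch below shows one plausible realization of the Gated Fusion Unit in Equation (12); the gate construction (a 1 × 1 convolution over the concatenated branch outputs followed by a Sigmoid) is an assumption about Figure 4, not a verbatim transcription of Algorithm A1.

```python
import torch
import torch.nn as nn

class GatedFusionUnit(nn.Module):
    """Adaptive fusion of the coordinate-attention and channel-attention outputs
    (Equation (12)). The gate design is a plausible reading of Figure 4."""
    def __init__(self, channels: int):
        super().__init__()
        # Hypothetical gate: a 1x1 conv over the concatenated branch outputs,
        # followed by a sigmoid, produces the per-position fusion weight alpha.
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_coord: torch.Tensor, f_chan: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.gate(torch.cat([f_coord, f_chan], dim=1)))
        # Y_out = alpha * F_coord + (1 - alpha) * F_chan
        return alpha * f_coord + (1.0 - alpha) * f_chan
```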
3.2. RMT Module
Medical image lesions exhibit significant variations in morphology and orientation, while traditional deep learning models lack rotational invariance learning capabilities, making it difficult to accurately identify lesion features at different angles. In response to this issue, we introduce the RMT module [
16], shown in
Figure 5, which integrates the global context modeling strengths of Vision Transformers and the retention mechanism from Retentive Networks, enabling the effective incorporation of spatial prior knowledge and the extraction of rotation-invariant features.
Through self-attention mechanisms alongside explicit spatial prior information, the RMT module learns stable features under rotational transformations, enhancing its ability to extract features from lesions of various morphologies. The module employs a two-dimensional distance-based decay matrix and utilizes the Manhattan Self-Attention (MaSA) mechanism for computation along both image axes, reducing the computational burden associated with extensive annotations. This design is particularly suitable for capturing morphologically diverse lesion features in medical images.
The RMT module’s output can be expressed as:
$$X_{\mathrm{out}} = \mathrm{MaSA}(X) + \mathrm{LCE}(V) \quad (13)$$

Here, $X_{\mathrm{out}}$ is the output feature of the RMT module, $\mathrm{MaSA}(X)$ is the output of the Manhattan Self-Attention mechanism, $\mathrm{LCE}(V)$ is the output of the local context enhancement module based on depthwise convolution (DWConv), and $V$ is the value matrix of the input feature map. LCE denotes the local context enhancement module employing Depthwise Separable Convolution (DWConv), while MaSA decomposes along the two image axes, as shown in Equation (14). To be precise, the attention scores are derived individually for the horizontal and vertical dimensions, and the attention weights are then modulated by one-dimensional bidirectional decay matrices:

$$\mathrm{Attn}^{H} = \mathrm{Softmax}\!\left(Q^{H} (K^{H})^{\mathrm{T}}\right) \odot D^{H}, \qquad \mathrm{Attn}^{W} = \mathrm{Softmax}\!\left(Q^{W} (K^{W})^{\mathrm{T}}\right) \odot D^{W} \quad (14)$$

In Equation (14), $\mathrm{Attn}^{H}$ and $\mathrm{Attn}^{W}$ represent the attention weights computed along the horizontal and vertical axes, respectively; $Q^{H}$ and $Q^{W}$ are the query matrices along the two axes, $K^{H}$ and $K^{W}$ are the key matrices along the two axes, $D^{H}$ and $D^{W}$ are the decay matrices along the two axes, and $\odot$ denotes element-wise multiplication (Hadamard product).
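A simplified, single-head sketch of the decomposed Manhattan Self-Attention in Equation (14) is given below; the decay value gamma, the softmax-then-decay ordering, and the omission of multi-head splitting and the LCE branch are simplifying assumptions rather than the exact RMT implementation.

```python
import torch
import torch.nn as nn

def decay_matrix(n: int, gamma: float, device=None) -> torch.Tensor:
    """Bidirectional 1-D decay matrix D[i, j] = gamma ** |i - j|."""
    idx = torch.arange(n, device=device)
    return gamma ** (idx[:, None] - idx[None, :]).abs().float()

class DecomposedMaSA(nn.Module):
    """Sketch of decomposed Manhattan Self-Attention (Equation (14)):
    attention is computed separately along the W and H axes, and each
    attention map is modulated by a distance-based decay matrix."""
    def __init__(self, dim: int, gamma: float = 0.9):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.gamma = gamma
        self.scale = dim ** -0.5

    def _axis_attention(self, q, k, v):
        # q, k, v: (..., n, d); attention over the second-to-last (spatial) axis.
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        attn = attn * decay_matrix(q.shape[-2], self.gamma, q.device)
        return attn @ v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Attention along the W axis (each row attends within itself): Attn^W.
        out = self._axis_attention(q, k, v)
        # Attention along the H axis (swap H and W, attend, swap back): Attn^H.
        out = self._axis_attention(
            q.transpose(1, 2), k.transpose(1, 2), out.transpose(1, 2)
        ).transpose(1, 2)
        return self.proj(out)
```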
The RMT module integrates spatial prior information explicitly into the self-attention mechanism, strengthening its ability to interpret spatial relationships. Through the application of spatial priors and the attention allocation mechanism, the module reduces computational overhead while maintaining linear complexity. Its rotational invariance improves the model's robustness to lesions oriented in any direction, significantly enhancing the detection accuracy of abnormal tissues in medical images. This characteristic offers significant advantages when processing lesions of various angles and morphologies, helping to improve the accuracy of medical image classification.
3.3. Residual Group Dual-Attention Network
This study proposes ResGDANet (Residual Group Dual-Attention Network) based on ResGANet, improving the original model through integrated attention modules.
In Figure 6, the network first groups the input image matrix $X$ according to the channel number $C$. For broader feature extraction, a basic feature transformation is performed following channel reorganization. For each coordinate $(i, j)$ in the matrix, the transformation is represented as shown in Equation (15), where $(i, j)$ represents the matrix coordinate points [35].
The transformed feature map passes through ResNet bottleneck blocks, operating with $3 \times 3$ convolution kernels to enlarge the model's receptive field and providing a foundation for subsequent attention weight allocation. Using $K_{3\times 3}$ to denote the $3 \times 3$ convolution operation, the output is represented as $U_s$ in Equation (16):

$$U_s = K_{3\times 3}(X_s) \quad (16)$$

In Equation (16), $X_s$ represents the $s$-th group of the input feature map, and $U_s$ denotes the initial convolution output.
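In PyTorch, the per-group convolution of Equation (16) can be sketched with a grouped convolution, which applies an independent 3 × 3 kernel to each channel group in parallel; the BatchNorm and ReLU wrappers below are assumed rather than taken from the paper.

```python
import torch
import torch.nn as nn

class GroupedConvStem(nn.Module):
    """Sketch of the grouped feature-extraction step (Equation (16)): the input
    is split into G channel groups and each group X_s passes through its own
    3x3 convolution to produce U_s."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                              groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to computing U_s = K_3x3(X_s) for every group s in parallel.
        return self.act(self.bn(self.conv(x)))
```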
Following basic feature extraction, the Dual-Channel Attention Fusion (DCAF) module is incorporated to enhance feature representation. The module utilizes a dual-branch architecture combining max pooling and global average pooling, fusing channel attention and coordinate attention via adaptive weighting as in Equation (12).
The spatial attention module generates spatial feature descriptors through global average pooling (GAP) and global max pooling (GMP), enhancing spatial dimension feature relationships:

$$M_s = \sigma\!\left(\mathrm{Conv}\!\left(\left[\mathrm{GAP}(U),\ \mathrm{GMP}(U)\right]\right)\right) \quad (17)$$

In Equation (17), $M_s$ represents the spatial attention weight, and $\sigma$ denotes the Sigmoid activation function.
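A minimal sketch of this spatial attention step, interpreting GAP/GMP as pooling across the channel dimension (as in CBAM-style spatial attention), is shown below; the 7 × 7 kernel size is an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention step (Equation (17)): average and max
    pooling along the channel axis form a two-channel descriptor, and a
    convolution plus sigmoid produces the spatial weight M_s."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)   # channel-wise average pooling
        max_map = x.amax(dim=1, keepdim=True)   # channel-wise max pooling
        m_s = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * m_s
```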
Following this, the attention scores for the RMT module are computed as per Equation (14), producing the final output feature of the network block.
This design enables the RMT module to extract features at stable resolution while maintaining low computational complexity, enhancing feature rotational invariance and thereby improving medical image classification performance.
4. Experiments
4.1. Experimental Settings
The experiments were carried out on a system running Ubuntu 22.04, equipped with an NVIDIA RTX 4090 GPU with 24 GB of video memory. The experiments were implemented using the PyTorch (1.9.0+cu111) framework, with a cross-entropy loss function, the Adam [28] optimizer, a batch size of 16, a learning rate of 3.5 × 10⁻⁵, and 120 training epochs. The detailed hyperparameter settings are presented in
Table 1.
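For reference, the training configuration described above can be expressed as the following minimal PyTorch sketch; details not stated here (e.g., weight decay or a learning-rate schedule) are omitted rather than assumed.

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in Section 4.1 / Table 1.
BATCH_SIZE = 16
LEARNING_RATE = 3.5e-5
EPOCHS = 120

def build_training_setup(model: nn.Module):
    """Loss and optimizer used for all experiments."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    return criterion, optimizer

def train_one_epoch(model, loader, criterion, optimizer, device="cuda"):
    """Standard epoch loop; `loader` yields (images, labels) batches of size 16."""
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```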
4.2. Dataset and Image Preprocessing
COVID19-CT [
36]: This dataset comprises medical images and COVID-19-related publications collected by He et al. [
37] from medRxiv2 and bioRxiv3. It includes 349 CT scan images confirmed as COVID-19 positive and 397 CT scan images classified as normal or negative for other diseases. The dimensions of the images vary between 143 × 76 and 1637 × 1225. Following the original data division strategy, we divided the dataset into three parts—training set, validation set, and test set—in a ratio of 0.6:0.15:0.25.
ISIC2018 [
38]: In the evaluation of the ResGDANet model, we used the ISIC2018 skin lesion diagnosis dataset. This dataset consists of seven categories, totaling 10,015 images. It includes 6705 melanocytic nevus images, 115 dermatofibroma images, 1113 melanoma images, 327 actinic keratosis images, 1099 benign keratosis images, 514 basal cell carcinoma images, and 142 vascular lesion images. All images in the dataset have a size of 650 × 450 pixels. We divided the dataset into training, validation, and testing sets in a ratio of 0.6:0.15:0.25 using stratified sampling.
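As an illustration of the splitting procedure described above, the following sketch uses scikit-learn's train_test_split to realize the 0.6:0.15:0.25 stratified split; the random seed and the use of scikit-learn are our assumptions, not stated in the paper.

```python
from sklearn.model_selection import train_test_split

def stratified_split(image_paths, labels, seed=42):
    """Split into train/val/test with a 0.6 : 0.15 : 0.25 ratio using
    stratified sampling so that class proportions are preserved."""
    # First carve off the 25% test set.
    x_trainval, x_test, y_trainval, y_test = train_test_split(
        image_paths, labels, test_size=0.25, stratify=labels, random_state=seed)
    # 0.15 of the full set equals 0.15 / 0.75 = 0.2 of the remaining data.
    x_train, x_val, y_train, y_val = train_test_split(
        x_trainval, y_trainval, test_size=0.2, stratify=y_trainval, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```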
Examples of ISIC2018 and COVID19-CT images are shown in
Figure 7.
For image preprocessing, we applied normalization and data augmentation techniques to enhance the model’s generalization ability and avoid overfitting. All preprocessing operations were performed after the dataset was divided to ensure that no information would leak between the training, validation, and test sets.
To improve the model’s generalization ability, various data augmentation methods were applied to the training set during the data preprocessing phase. These methods included random horizontal flipping and random vertical flipping, each with a probability of 50%; random rotation with an angle range of −15° to 15°, to simulate rotation variations of images in natural environments; and random adjustments of the image’s brightness, contrast, and saturation, with adjustment ranges from 0.8 to 1.2, to improve the model’s adaptability to images under different lighting conditions. It is important to emphasize that data augmentation was only applied to the training set, while the validation and test sets retained the original data to ensure the fairness and consistency of performance evaluation.
All images were normalized using the mean and standard deviation of the ImageNet dataset: a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225]. The specific normalization formula is

$$x' = \frac{x - \mu}{\sigma}$$

where $x$ is the original pixel value, $x'$ is the normalized pixel value, and $\mu$ and $\sigma$ are the corresponding channel mean and standard deviation. In addition, for images smaller than 224 × 224, proportional scaling was first applied to adjust the shorter side of the image to 256 pixels, followed by a crop of a 224 × 224 region from the center of the scaled image to ensure the consistency and adaptability of the input image.
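The preprocessing described above maps naturally onto torchvision transforms; the sketch below is one such pipeline, where the exact ordering of operations and the application of the resize/crop to all images (rather than only those smaller than 224 × 224) are our assumptions.

```python
from torchvision import transforms

# ImageNet channel statistics used for normalization.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training pipeline: augmentation is applied only to the training set.
train_transform = transforms.Compose([
    transforms.Resize(256),                 # shorter side -> 256 px
    transforms.CenterCrop(224),             # 224 x 224 center region
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),  # rotations in [-15 deg, 15 deg]
    transforms.ColorJitter(brightness=(0.8, 1.2), contrast=(0.8, 1.2),
                           saturation=(0.8, 1.2)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),  # x' = (x - mean) / std
])

# Validation / test pipeline: no augmentation, same resize, crop, and normalization.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```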
4.3. Evaluation Metrics
This study employs accuracy, precision, recall, and F1 score as evaluation metrics.
Table 2 presents the confusion matrix, which comprises the True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN) outcomes.
We use accuracy as one of the performance metrics for the experiments. Accuracy refers to the percentage of correctly classified samples and reflects overall classification performance. Because the medical image datasets are imbalanced, we also use precision, recall, and F1-score as evaluation metrics, which are well suited to imbalanced deep-learning classification problems. Precision represents the positive predictive value, while recall represents the true positive rate. The F1 score combines precision and recall into a single metric. The evaluation metrics are calculated using the following Equations (18)–(21):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (18)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (19)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (20)$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (21)$$

TP, TN, FP, and FN represent the number of true positive, true negative, false positive, and false negative samples, respectively.
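For concreteness, Equations (18)–(21) correspond to the following straightforward computation from confusion-matrix counts (binary case shown; applying the same formulas per class and averaging for multi-class evaluation is an assumption about the exact protocol).

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Compute Equations (18)-(21) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Purely illustrative counts (not experimental results from this paper):
print(classification_metrics(tp=90, tn=95, fp=8, fn=12))
```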
4.4. Ablation Research and Visualization
4.4.1. Ablation Studies in Different Groups
We conducted ablation experiments on multiple group settings of ResGDANet, employing the COVID19-CT dataset for evaluation. Results are presented for the ResGDANet model with grouping (G) values of 1, 2, and 4.
Table 3 displays the experimental results for accuracy, precision, recall, and F1-score (F1); these results demonstrate superior classification accuracy when using group sizes of 2 or 4.
Empirical studies further indicate that a higher number of groups reduces the network's inference speed. Configurations with two or four groups therefore achieve the best balance among the various metrics and are used in subsequent experiments. These findings confirm that increasing the number of groups within ResGDANet enhances classification performance for medical images.
4.4.2. Ablation Experiment of Attention Module
To more intuitively and visually validate the classification performance of the proposed ResGDANet model in this paper, we used Grad-CAM [
39] to visualize the attention generated by ResGDANet without attention modules, without DCAF, without RMT, and with both attention modules.
Figure 8 demonstrates that without attention modules, the network’s focus is limited to specific portions of the target region. With the addition of the DCAF module, the region of interest expands, though it shows misalignment with the target area. The combination of both attention modules outperforms the individual modules, enabling more precise localization and coverage of target objects. The visualization performance achieved by incorporating only the spatial attention module and RMT module is marginally inferior to that of using solely the DCAF module. The RMT module exhibits a weaker ability to localize boundary regions compared to the DCAF module. However, when compared to visualizations without attention modules, the RMT module shows enhanced attention to the target region.
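For readers who wish to reproduce such visualizations, the sketch below implements a generic Grad-CAM using forward and backward hooks in plain PyTorch; the choice of target_layer (typically the last convolutional stage) and the normalization details are our assumptions and not necessarily the exact tooling used for Figure 8.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM: weight the target layer's activations by the
    spatially averaged gradients of the chosen class score, then apply
    ReLU and upsample the map to the input resolution."""
    activations, gradients = [], []

    fwd = target_layer.register_forward_hook(
        lambda m, i, o: activations.append(o))
    bwd = target_layer.register_full_backward_hook(
        lambda m, gi, go: gradients.append(go[0]))

    model.eval()
    scores = model(image)                        # image: (1, 3, 224, 224)
    if class_idx is None:
        class_idx = scores.argmax(dim=1).item()
    model.zero_grad()
    scores[0, class_idx].backward()

    fwd.remove()
    bwd.remove()

    acts, grads = activations[0], gradients[0]             # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)          # pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
    return cam.squeeze().detach()
```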
To further validate the effectiveness of each strategy and module, we also conducted ablation experiments on the ISIC2018 dataset. The experimental results, presented in
Table 4, primarily validate the effectiveness of various attention modules.
Table 4 demonstrates that the incorporation of both RMT and Dual-Channel Attention Fusion modules yields superior performance compared to the baseline without attention modules.
4.5. Model Comparison Experiment
We examined the performance metrics of various advanced classification techniques on the COVID19-CT dataset. The comparison included VGG-16 [
7], ResNet [
9], ResNeXt [
11], Res2Net [
10], DenseNet [
18], EfficientNet [
40], ShuffleNet 1.0 × (G = 4) [
41], SENet [
20], GvT [
21], and ResGANet [
15]. All models were implemented in their original form, sharing the same operational environment and essential hyperparameters. The experimental results are shown in
Table 5 and
Figure 9.
As shown in
Table 5, the approach introduced in this study surpasses other models in classification tasks, further demonstrating that our method can effectively diagnose COVID19-CT images. The highest values for each metric are indicated in bold.
Similar to ResNet and DenseNet, we use residual connections and dense connections to improve the efficiency of feature reuse and gradient flow. ResNet tackles deep network degradation through residual structures, whereas DenseNet achieves efficient feature reuse through dense connectivity. Additionally, like ResNeXt and Res2Net, we utilize group convolution and multi-scale feature representations to enhance model expressiveness. ResNeXt enhances network width by introducing the concept of cardinality, while Res2Net captures multi-scale information by building hierarchical residual connections. By incorporating these advantages, we achieve superior classification performance.
Like ResGANet, we utilize attention modules to improve the feature representation abilities of convolutional neural networks. ResGANet adaptively recalibrates feature responses using channel and spatial attention mechanisms, whereas EfficientNet balances network depth, width, and resolution through compound scaling. Relative to ResGANet, our network demonstrated a 3.82% improvement in classification accuracy (79.92% vs. 83.74%). This is mainly attributed to the introduction of the DCAF module and RMT module, which enhance the model's ability to capture multi-scale features and model global contextual information.
GvT employs the Talking-Heads mechanism, enhancing model expressiveness through inter-head communication in multi-head attention. However, due to the complex structure of the GvT model, the training process requires more computational resources and time, which may result in its limited performance on small datasets. Our method performs better on small datasets, particularly improving by 0.41% compared to GvT in medical image classification tasks (83.33% vs. 83.74%). Although GvT scores slightly higher than ResGDANet in precision (84.76% vs. 83.46%), ResGDANet performs better in terms of F1 score (83.98% vs. 83.43%), indicating that ResGDANet has an advantage in balancing precision and recall.
To further validate the proposed model, we also compared the classification performance of different models on the ISIC2018 dataset. The experimental results are shown in
Table 6 and
Figure 10.
The results show that ResGDANet has advantages in medical image classification. Like CBAM and SENet, we employ attention modules to enhance the feature representation capability of convolutional neural networks. Our network achieved accuracy improvements of 3.77% over SENet-50 (77.96% vs. 81.73%) and 3.27% over CBAM-50 (78.46% vs. 81.73%), and a 2.06% improvement over ResGANet (79.67% vs. 81.73%). These experimental results confirm that incorporating the DCAF and RMT modules, or refining features into multiple groups, enhances classification performance to varying degrees. ResGDANet incorporates both of these designs and achieved the highest classification accuracy (81.73%) on the ISIC2018 dataset.
Figure 11 displays the confusion matrices of ResGDANet on the ISIC2018 and COVID19-CT datasets. Overall, the classifier demonstrates strong performance across all labels. In the ISIC2018 dataset, the most frequent misclassification is predicting MEL as NV. A reasonable explanation is that the minority categories contain relatively few samples, which leads to poorer performance on those categories. A possible remedy is to increase the number of images in the database by incorporating additional data sources.
Figure 12 shows the model's fitting behavior during training, used to assess the classification performance of ResGDANet. The curves indicate that the model rapidly converges to an optimal state, with no signs of overfitting on the training or validation sets.