Attention-Based Deep Feature Aggregation Network for Skin Lesion Classification

Yasir, Siddiqui Muhammad; Kim, Hyun

doi:10.3390/electronics14122364

by Siddiqui Muhammad Yasir and Hyun Kim *

Department of Electrical and Information Engineering, Research Center for Electrical and Information Technology, Seoul National University of Science and Technology, 232 Gongneung-ro, Nowon-gu, Seoul 01811, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2025, 14(12), 2364; https://doi.org/10.3390/electronics14122364

Submission received: 21 April 2025 / Revised: 30 May 2025 / Accepted: 5 June 2025 / Published: 9 June 2025

(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, 4th Edition)

Abstract

Early and accurate detection of dermatological conditions, particularly melanoma, is critical for effective treatment and improved patient outcomes. In medical image processing, misclassifications may lead to delayed diagnosis, disease progression, and severe complications. Hence, robust and reliable classification techniques are essential to enhance diagnostic precision in clinical practice. This study presents a deep learning-based framework designed to improve feature representation while maintaining computational efficiency. The proposed architecture integrates multi-level feature aggregation with a squeeze-and-excitation attention mechanism to effectively extract salient patterns from dermoscopic images. The model is rigorously evaluated on five publicly available benchmark datasets (ISIC-2019, ISIC-2020, SKINL2, MED-NODE, and HAM10000) covering a diverse spectrum of dermatological disorders. Experimental results demonstrate that the proposed method consistently outperforms existing approaches in classification performance, achieving accuracy rates of 94.41% and 97.45% on the MED-NODE and HAM10000 datasets, respectively. These results underscore the method’s potential for real-world deployment in automated skin lesion analysis and clinical decision support.

Keywords:

deep learning; skin disease classification; medical image processing; feature aggregation network; depthwise convolution; attention-guided network

1. Introduction

The precise and prompt identification of skin lesions remains a cornerstone in dermatology, profoundly influencing treatment strategies and patient outcomes, particularly for critical cases such as melanoma. With the emergence of deep learning, automated diagnostic systems leveraging convolutional neural networks (CNNs) and attention mechanisms have made significant strides in enhancing classification accuracy and efficiency [1]. Over the last decade, research into deep learning frameworks for skin lesion classification has yielded substantial advancements, improving the extraction of features and the differentiation of lesion types [2]. CNN-based models have shown remarkable efficacy in capturing hierarchical features from both macroscopic and dermoscopic medical images [3], with methods like transfer learning and ensemble approaches further refining classification outcomes. Despite these advancements, persistent challenges remain. Many models struggle with effectively integrating advanced attention mechanisms and feature aggregation techniques, which are essential for enhancing their ability to accurately distinguish features [4,5,6]. Addressing these limitations is vital for progressing towards more robust and reliable skin lesion classification systems. Many researchers depend primarily on global feature extraction techniques, which often neglect the finer details of lesions that are crucial for accurate classification [7]. Moreover, a significant challenge for most models arises from issues like dataset bias and class imbalances, both of which compromise model robustness and its ability to generalize effectively [8]. The importance of effective skin lesion classification stems from the high prevalence and potential fatality of skin cancers, where early detection is vital for successful intervention and treatment [9,10,11]. 
Automated classification systems are designed to address these hurdles, ensuring reliable and consistent results that can support clinicians in their diagnostic decisions [12,13]. While deep learning architectures are proficient at extracting hierarchical features, they frequently fail to emphasize the lesion’s most diagnostically relevant regions. The absence of advanced attention mechanisms often causes models to be affected by artifacts such as hair occlusions, reflection spots, and uneven illumination in dermoscopy images, which can mislead feature extraction, thereby diminishing the reliability of classifications [7,10,14].

To address these limitations, the proposed deep learning framework introduces an attention-guided feature aggregation approach for automated skin lesion analysis. Attention mechanisms serve a crucial function in directing the network’s focus toward regions that exhibit malignancy-related patterns [12,13,14]. By embedding channel-level attention, the proposed model dynamically recalibrates feature channel importance, leading to improved feature representation and enhanced classification performance [15,16,17]. Additionally, the network employs a multi-layer feature aggregation strategy to fuse both fine-grained and high-level semantic features, allowing it to adapt effectively to variations in lesion size, texture, and structural complexity [10,11,18]. This design enables the model to refine multi-scale feature representations, thereby improving classification accuracy while preserving interpretability [14]. By integrating channel-wise attention with multi-level feature aggregation, the model enhances its discriminative capability and focuses more precisely on diagnostically relevant lesion regions [19,20]. This integration addresses key challenges, including inter-class similarity and intra-class variability, which often limit classification accuracy. By improving feature representation and emphasizing semantically relevant lesion characteristics, the proposed system improves classification robustness and diagnostic precision, with the aim of advancing clinically effective and reliable diagnostic tools for dermatological image analysis.

The main contributions of this study are as follows:

Development of a multiclass classification model that leverages deep feature aggregation and recalibration techniques to improve feature extraction from skin lesion images.
Integration of depthwise separable convolution, which reduces the total number of learnable parameters while preserving representational efficiency.
Implementation of channel-wise attention mechanisms to improve the model’s sensitivity to significant features, ensuring improved learning capacity.
Adoption of multi-scale convolutional layers within each block to effectively capture spatial details of varying lesion sizes.
Refinement of hierarchical feature representations to strengthen the discriminative power of the network, addressing dataset biases and class imbalances.
Comprehensive evaluation using diverse datasets, demonstrating the network’s generalizability and clinical relevance.

The remainder of this paper is organized as follows: Section 2 reviews recent advancements in the field, highlighting research gaps and the critical role of image classification for skin lesions. Section 3 describes the proposed methodology in detail. The datasets, data augmentation strategies, and training configurations are detailed in Section 4, with an in-depth analysis presented in Section 5. Lastly, Section 6 summarizes the key findings of the study and outlines future research directions.

2. Related Works

Attention mechanisms have played a transformative role in biomedical image analysis, significantly enhancing feature representation while improving the interpretability of neural networks [21]. Recent studies have increasingly focused on integrating diverse attention mechanisms to boost the accuracy, efficiency, and explainability of classification models [18,21,22,23]. In the specific domain of skin lesion classification, the integration of attention mechanisms in deep learning models has driven noteworthy progress. These advancements aim to tackle critical challenges such as inter-class similarity, intra-class variation, and the constraints posed by limited training datasets [16,24]. These innovations hold immense potential for improving diagnostic capabilities and supporting clinical applications. One of the primary obstacles in utilizing AI for medical image analysis is ensuring that models are both generalizable and robust when applied to different datasets and various disease types [25]. While attention mechanisms have already contributed to improved model performance, further exploration is required to improve their adaptability to diverse medical imaging conditions [1,26,27]. The introduction of the squeeze-and-excitation network by Hu et al. [27] marked a breakthrough, significantly improving the performance of existing state-of-the-art classification models. Attention-based architectures have since gained prominence in medical image classification tasks, with designs like EDCA-Net [1] utilizing densely connected attention layers to improve feature extraction and expand generalizability across a wide range of diseases. In hyperspectral image classification, the double-branch dual-attention network (DBDA) [28] demonstrates the efficacy of channel and spatial attention modules in refining feature maps for superior classification outcomes. 
Similarly, the EffiCAT framework [9] leverages dual-channel attention layers to optimize feature representation, achieving high accuracy and robustness across multiple datasets. Another noteworthy example is the dual attention-based network, which integrates spatial and channel modules to refine localized patterns in skin lesion images, thereby improving both model performance and interpretability [13]. These advancements continue to push the boundaries of AI-driven medical imaging solutions, addressing critical challenges while aiming for broader clinical applicability.

Recent advancements in skin lesion segmentation and feature fusion have demonstrated significant improvements in classification accuracy. Sarwar et al. [29] introduced a Hybrid Residual Network (ResUNet) model supplemented with Ant Colony Optimization (ACO), achieving a Dice coefficient of 93.1% and a Jaccard index of 87.5%, demonstrating superior lesion segmentation [30]. Similarly, Saghir et al. [31] proposed a Midpoint Analysis Approach for skin cancer segmentation, achieving an accuracy of 95.30%. These studies reinforce the importance of multi-scale feature fusion and optimization techniques, aligning with the objectives of our proposed attention-guided deep feature aggregation network. While attention mechanisms improve feature extraction and classification accuracy, their interpretability remains a challenge in deep learning models. Future research will explore Explainable AI techniques like Grad-CAM, SHAP, and attention heatmaps to improve transparency in medical imaging. These methods help clinicians understand model priorities, strengthening trust in automated diagnostic systems [32].

Recent advancements in multi-scale attention mechanisms have demonstrated remarkable capabilities in capturing fine-grained features across various scales. This has notably improved the focus on lesion areas and addressed class imbalance within datasets like HAM10000 [33]. Patch-based attention architectures [34], developed by Gessert et al., effectively utilize pre-trained models to maintain global context in high-resolution skin image patches without down-sampling. The ARL-CNN model [20] achieved state-of-the-art results on the ISIC 2017–2020 datasets by employing residual learning paired with attention methods, enhancing the network’s focus on semantically significant lesion regions. Building on this, EFAM-Net [10] has integrated attention residual learning with multi-scale feature fusion to further boost classification accuracy. Attention mechanisms, including those implemented in frameworks such as EffiCAT [9], EFAM-Net [10], ARL-CNN [20], and other architectures [18,35], have been shown to improve model sensitivity and prediction accuracy without imposing significant computational demands. Despite these advancements, balancing complexity and computational efficiency remains a key area for future exploration [36,37]. Dual attention mechanisms have proven particularly effective in improving the interpretability and focusing capabilities of skin lesion classification models. However, challenges such as class imbalance and the need for diverse large-scale datasets persist [38]. Promising solutions like multi-scale and patch-based attention methods address these challenges by capturing detailed features at varying scales and resolutions. Incorporating metadata, such as patient history, lesion location, age, and previous diagnoses, can significantly enhance classification accuracy. 
For instance, lesion location plays a crucial role in distinguishing between benign and malignant cases, while patient history provides valuable context for risk assessment. Integrating these metadata features into deep learning models can improve diagnostic precision by complementing image-based feature extraction [39].

Attention-based networks are setting new benchmarks in skin lesion classification, showcasing superior adaptability and performance. Future research should prioritize addressing data imbalance issues and exploring the integration of varied modalities, which could significantly improve the accuracy and generalizability of these deep learning models in clinical applications.

3. Proposed Model

The network design, shown in Figure 1, uses a hierarchical framework for feature aggregation, integrated with channel-wise attention to improve classification accuracy. This attention is implemented through the Squeeze-and-Excitation (SE) method, which refines feature representations. The model operates on input images resized to $224 \times 224$ pixels, enabling the extraction of key discriminative features that are subsequently aggregated and recalibrated across hierarchical levels. The architecture is structured with four convolutional groups, each composed of two convolutional blocks, for a total of eight convolutional blocks. Additionally, four SE blocks, complemented by eight dual SE blocks, as depicted in Figure 2, further improve feature aggregation and refinement. The following subsections provide a detailed breakdown of the proposed network’s architecture, covering feature aggregation, the functionality of the SE blocks, and their integration within the framework, with the aim of giving a complete account of the model’s design and operational goals.

3.1. Architecture, Materials, and Methods

Researchers have extensively investigated various components of deep learning architectures to enhance model performance while minimizing computational complexity and the number of learnable parameters. Although increasing model capacity, expanding receptive fields, and introducing additional non-linearities are commonly employed to improve accuracy, these approaches often incur significant computational overheads [40]. While AI-based solutions benefit from rich and diverse feature representations, merely increasing network depth or width does not necessarily translate into optimal feature discrimination. Additionally, deep neural networks are prone to issues such as vanishing and exploding gradients, which pose substantial challenges during training. To mitigate these limitations, establishing strong inter-layer connections and effectively aggregating hierarchical features have become critical design considerations. Traditional segmentation models, such as U-Net and its variants (e.g., U-Net++, U-Net 3+), have demonstrated strong performance in medical image segmentation. Our proposed architecture introduces attention-guided deep feature aggregation, which enhances feature recalibration beyond standard encoder–decoder structures. Unlike U-Net, which primarily relies on skip connections for feature fusion, our model integrates channel-wise attention mechanisms to refine lesion-specific features dynamically. This approach builds upon segmentation advancements discussed in ‘U-Net in Medical Image Segmentation: A Review of Its Applications Across Modalities’, further improving classification robustness in dermatological imaging [41]. Consequently, a variety of architectural strategies have been proposed to optimize network design, reducing parameter count without compromising classification accuracy. 
In the domain of skin lesion analysis, manual classification is inherently complex, time-consuming, and susceptible to human error due to the wide variability in lesion appearance. Deep learning-based approaches offer an effective solution by automating this process, thereby enhancing diagnostic consistency and efficiency. To this end, the proposed network integrates multi-level deep feature aggregation with a squeeze-and-excitation attention mechanism to improve the quality of feature representations. Detailed descriptions of each module are presented in the subsequent sections.

3.2. Deep Feature Aggregation

CNNs are composed of multiple layers, each designed to fulfill a specific function within the overall architecture. The performance of a CNN depends heavily on the hierarchical organization and interaction of these layers, as no single layer can independently capture the full complexity of input features. While many existing CNN architectures have sought to enhance performance by increasing network depth and width, such expansions do not inherently guarantee optimal feature representation in the final layers [42]. To address this limitation, skip connections have been widely adopted, proving effective in mitigating performance degradation and improving feature propagation. These connections, however, require a careful architectural design to realize their full potential.

In this study, we propose a deep learning framework that enhances feature representation by hierarchically aggregating information across different network depths [43]. The proposed architecture fuses features from both shallow and deep layers to generate more discriminative and contextually enriched representations. Within each hierarchical level, features are progressively aggregated and propagated to deeper layers, facilitating continuous refinement. Moreover, to preserve spatial resolution and mitigate information loss typically caused by downsampling, features from earlier stages are merged into the final feature map via strategically placed skip connections. This design ensures that critical spatial details are retained while enabling robust and expressive feature learning. The hierarchical deep feature aggregation across network layers is formally defined as follows:

$$F_{d} = A\left\{ R_{d-1}^{d}(x),\; R_{d-2}^{d}(x),\; \dots,\; R_{1}^{d}(x),\; L_{1}^{n}(x),\; L_{2}^{n}(x) \right\} \tag{1}$$

where $F$ signifies the final feature map and $F_{d}$ denotes the aggregated feature map at a specific depth $d$. The symbols $d$, $A$, and $x$ denote the depth of the network, the aggregation function, and the input feature map, respectively. The terms $L$ and $R$, defined in Equations (2) and (3), describe left- and right-propagated features across network layers. The aggregated feature map $F_{d}$ is generated by the aggregation function $A$, which operates on transformed feature maps from both preceding and succeeding layers.

The aggregation function A dynamically integrates feature representations across different network depths, ensuring optimal feature fusion between shallow and deep layers. The function employs learnable weights W that are adjusted during training via backpropagation. Batch normalization is applied to stabilize weight distributions, preventing vanishing or exploding gradients. The adaptive nature of A ensures that lower layers emphasize fine-grained spatial details, while deeper layers focus on high-level semantic feature fusion. The recursive structure, as defined in Equations (2) and (3), ensures continuous refinement through left- and right-propagated feature maps, enhancing classification robustness.

Specifically, right-propagated features $R_{i}^{d}(x)$ for $i = 1, \dots, d - 1$ and left-propagated features $L_{j}^{n}(x)$ for $j = 1, 2$ facilitate bi-directional feature flow across layers. This design supports multi-scale feature integration by merging contextual information from various network depths, strengthening the model’s ability to capture complex relationships and patterns.

$$L^{d_{2}}(x) = C\left(L^{d_{1}}(x)\right), \qquad L^{d_{1}}(x) = C\left(R^{d_{1}}(x)\right) \tag{2}$$

Equation (2) outlines the recursive process used to construct left-propagated feature maps $L^{d}(x)$ through a series of transformation operations. At depth $d_{2}$, the left-propagated feature map $L^{d_{2}}(x)$ is computed using $C(\cdot)$, which represents a transformation operator such as convolution combined with normalization and activation. The recursive application of $C$ begins with the right-propagated feature map $R^{d_{1}}(x)$ and continues with the left-propagated feature map from the preceding depth $d_{1}$. This hierarchical design allows for effective information flow across layers, enabling the network to capture both shallow and deep contextual features. Residual connections are embedded within this structure to counteract issues like vanishing or exploding gradients, thereby stabilizing gradient propagation in deeper networks. However, as noted in [44], a shallow network hierarchy (one characterized by limited depth or insufficient intermediate layers) may reduce the efficacy of residual learning, potentially diminishing overall performance. Therefore, maintaining a well-structured and adequately hierarchical design is critical for maximizing the benefits of residual connections in deep feature aggregation networks.

$$R_{m}^{d}(x) = \begin{cases} F_{m}(x), & \text{if } m = n - 1 \\ F_{m}\left(R_{m+1}^{d}(x)\right), & \text{otherwise} \end{cases} \tag{3}$$

In the attention-guided deep feature aggregation network, the integration of feature representations across multiple hierarchical levels addresses challenges commonly encountered in deep neural networks, such as degraded gradient flow. This hierarchical aggregation improves discriminative learning by ensuring robust gradient flow and introducing alternative short pathways, which enable more efficient backpropagation during training. The right-propagated features, represented as $R_{m}^{d}(x)$, are recursively computed as defined in Equation (3). Here, $R_{m}^{d}(x)$ signifies the features propagated from depth $m$ towards depth $d$. The function $F_{m}(\cdot)$ is typically a learnable module, such as a convolutional block or an attention-enhanced feature encoder, applied at level $m$. For the deepest level, where $m = n - 1$, $F_{m}(\cdot)$ directly outputs the encoded features. Otherwise, it operates on the output of the subsequent deeper layer, $R_{m+1}^{d}(x)$, establishing a recursive feature flow. This framework facilitates the progressive integration of high-level semantic information from deeper layers into shallower ones. The network uses a weighted fusion strategy, described in Equation (4), to combine multi-scale features effectively, enhancing both classification accuracy and feature representation.
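The recursion in Equation (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `right_propagate`, the dictionary `toy_blocks`, and the use of scalars in place of feature maps are all assumptions made purely to show the control flow, in which the deepest level ($m = n - 1$) is evaluated first and each shallower level is applied to its output.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a learnable module F_m (e.g., a convolutional block).
Block = Callable[[float], float]


def right_propagate(m: int, n: int, x: float, blocks: Dict[int, Block]) -> float:
    """Evaluate R_m(x) as in Equation (3), recursing toward the deepest level n - 1."""
    if m == n - 1:
        # Base case: the deepest level encodes the input directly.
        return blocks[m](x)
    # Otherwise F_m is applied to the output of the next deeper level.
    return blocks[m](right_propagate(m + 1, n, x, blocks))


# Toy blocks (assumed for illustration only; real blocks would act on tensors).
toy_blocks: Dict[int, Block] = {
    1: lambda v: v + 1.0,
    2: lambda v: v * 2.0,
    3: lambda v: v - 3.0,
}

# With n = 4: R_3 = 10 - 3 = 7, R_2 = 2 * 7 = 14, R_1 = 14 + 1 = 15.
r1 = right_propagate(1, 4, 10.0, toy_blocks)
print(r1)  # 15.0
```

In a real network each block would map a tensor to a tensor, but the order of evaluation would be the same.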

$$A(x_{1}, \dots, x_{d}) = \delta\left(\mathrm{BatchNorm}\left(\sum_{i=1}^{d} W_{i} x_{i} + b\right)\right) \tag{4}$$

Here, $A(\cdot)$ acts as the aggregation function, $W_{i}$ and $b$ are learnable parameters that adaptively scale feature contributions at different depths, and $\delta(\cdot)$ denotes a non-linear activation function such as ReLU. Because the weights are learned, the network can adapt how features from different depths are combined, which reinforces the hierarchical representation. Consequently, the model captures both detailed, fine-grained patterns and the high-level contextual information essential for accurate skin lesion classification.

The proposed network architecture incorporates eight convolutional blocks, each comprising two separable convolution layers [45]. As depicted in Figure 3, the architecture employs depthwise separable convolution layers with $3 \times 3$ and $5 \times 5$ kernels arranged in a parallel configuration. Each kernel independently extracts multi-scale spatial features: the $3 \times 3$ kernel captures fine-grained textures, while the $5 \times 5$ kernel captures broader contextual details. The outputs of these kernels are added element-wise to ensure complementary feature fusion, reinforcing hierarchical feature representation. This structure is depicted in Figure 4, where individual depthwise paths contribute to a balanced mix of local and global feature extraction. The overall structure follows Equation (2), which captures the hierarchical organization and feature refinement approach adopted by the model.
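As a rough illustration of the weighted fusion in Equation (4), the following Python sketch combines several feature vectors with per-branch weights, normalizes the result, and applies ReLU. Several simplifications are assumed here and are not taken from the paper: each $W_{i}$ is treated as a single scalar, features are flat lists rather than tensors, and batch normalization is reduced to zero-mean, unit-variance scaling over one list without learnable scale and shift. The name `aggregate` is hypothetical.

```python
import math
from typing import List, Sequence


def aggregate(features: Sequence[Sequence[float]],
              weights: Sequence[float],
              bias: float = 0.0,
              eps: float = 1e-5) -> List[float]:
    """Simplified Equation (4): ReLU(normalize(sum_i W_i * x_i + b))."""
    n = len(features[0])
    # Weighted sum across depths, position by position.
    summed = [sum(w * f[k] for w, f in zip(weights, features)) + bias
              for k in range(n)]
    # Simplified batch normalization: zero mean, unit variance (no affine terms).
    mean = sum(summed) / n
    var = sum((v - mean) ** 2 for v in summed) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in summed]
    # Non-linear activation (ReLU).
    return [max(0.0, v) for v in normed]


shallow = [1.0, 2.0, 3.0, 4.0]   # stand-in for a shallow-layer feature
deep = [4.0, 3.0, 2.0, 1.0]      # stand-in for a deep-layer feature
fused = aggregate([shallow, deep], weights=[1.0, 0.0])
```

With weights `[1.0, 0.0]` only the shallow branch contributes, so the two below-average positions are zeroed by ReLU. With equal weights the two toy branches cancel to a constant and the output is all zeros, which shows how the learned weights decide which depth dominates the fused representation.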

Convolutional layers are critical for determining both the effectiveness and computational efficiency of deep learning models [46]. While earlier architectures, such as AlexNet, Attention-Guided VGG [47], DenseNet, and ResNet, relied heavily on standard convolutional layers, recent innovations have introduced advanced alternatives. Depthwise separable and group convolutions, in particular, have emerged as efficient options to reduce model complexity without sacrificing accuracy [48]. In the proposed method, depthwise separable convolution is used to decrease the number of trainable parameters, inference time, and overall model size. This approach splits a standard convolution into two discrete operations: (i) depthwise convolution, which applies spatial filtering separately to each input channel, and (ii) pointwise convolution, which employs $1 \times 1$ kernels to map the depthwise outputs into a new feature space. As depicted in Figure 3, the convolutional block is designed for multi-scale feature extraction using kernels of size $5 \times 5$ and $3 \times 3$. The resulting feature maps from each depthwise separable convolutional path undergo element-wise addition and are then normalized with batch normalization layers. To further improve multi-scale representation, features from the initial and concluding convolutional blocks are concatenated at the final aggregation node. Complementing this structure is a channel-wise attention module, implemented through the Squeeze-and-Excitation (SE) block, which recalibrates feature responses, strengthening the network’s overall representation capability.
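To make the parameter savings concrete, the sketch below counts weights for a standard convolution and for its depthwise separable counterpart, then for a block with parallel $3 \times 3$ and $5 \times 5$ paths. The channel widths (64 in, 128 out) are hypothetical and not taken from the paper; biases and batch normalization parameters are ignored, and each parallel path is assumed to have its own pointwise convolution.

```python
from typing import Sequence


def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 kernels mixing channels
    return depthwise + pointwise


def multiscale_block_params(c_in: int, c_out: int,
                            kernels: Sequence[int] = (3, 5)) -> int:
    """Total weights for parallel depthwise separable paths, one per kernel size."""
    return sum(depthwise_separable_params(c_in, c_out, k) for k in kernels)


# Hypothetical channel widths, for illustration only.
c_in, c_out = 64, 128
standard_total = (standard_conv_params(c_in, c_out, 3)
                  + standard_conv_params(c_in, c_out, 5))
separable_total = multiscale_block_params(c_in, c_out)
print(standard_total, separable_total)  # 278528 18560
```

Under these assumed widths, the parallel separable block uses roughly one fifteenth of the weights of the equivalent pair of standard convolutions. The actual ratio depends on the channel widths used in the network.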

3.3. Squeeze and Excitation Block (SE)

The squeeze-and-excitation (SE) block, as introduced by Hong et al. [49], is a lightweight yet effective architectural unit designed to enhance feature representations by explicitly modeling channel-wise dependencies in CNNs. Conventional CNNs primarily rely on convolution, pooling, and batch normalization to extract and abstract spatial features. While these techniques are effective for spatial pattern learning, early CNN designs often overlooked the interdependence between feature channels, a factor critical to capturing rich contextual information. The SE block addresses this gap through an adaptive channel recalibration mechanism composed of two sequential operations. First, a squeeze operation performs global average pooling to distill spatial information into a compact channel descriptor. This is followed by an excitation operation that captures non-linear channel interdependencies through a lightweight gating mechanism, producing channel-wise attention weights. These weights are then applied to the original feature maps, amplifying informative features while suppressing less relevant ones [16]. This recalibration plays a crucial role in dermatological image analysis: by dynamically adjusting channel importance, the model becomes more sensitive to critical lesion characteristics, such as pigmentation variations, asymmetry, and texture irregularities, which are key indicators in skin cancer classification. Diagnostically relevant features thus receive higher attention while background noise is suppressed, improving classification accuracy. A key advantage of the SE block lies in its modularity and computational efficiency, allowing seamless integration into existing CNN architectures with minimal overhead. When incorporated into early layers, it enhances general-purpose features such as texture and edges; in deeper layers, it refines high-level, class-specific semantic representations. 
This adaptability across the network hierarchy contributes to improved feature discrimination and overall model accuracy, particularly in tasks demanding fine-grained classification. Due to its compact design and significant performance gains, the SE block has become a foundational component in modern CNN-based systems across diverse application domains [50].

The SE block, illustrated in Figure 4, operates as a lightweight and highly effective attention module designed to address channel-wise interdependencies in CNNs. It improves the network’s feature representation by adaptively recalibrating the importance of each channel. The SE block consists of three core mechanisms: squeeze, excitation, and scale. During the squeeze phase, global spatial information is condensed by applying global average pooling to each channel of the input tensor. This reduces the spatial dimensions, yielding a compact feature descriptor of size $1 \times 1 \times C$, where $C$ is the number of channels. This descriptor serves as a statistical summary of the channel-wise activation values, forming the foundation for the excitation phase. The subsequent excitation stage uses this compact descriptor to capture non-linear relationships among channels, generating attention weights that highlight significant features. These weights are then applied during the scaling phase to adaptively amplify salient channels while suppressing less relevant ones. This mechanism is integral to refining feature representations and improving classification accuracy in CNNs.

The excitation phase of the SE block is a dynamic mechanism that captures channel-wise dependencies. The process begins with the squeezed descriptor, which is passed through a bottleneck comprising two fully connected (FC) layers. The first layer reduces the dimensionality by a predefined reduction ratio $r$ (from $C$ to $C/r$), and the second restores it to $C$. A non-linear activation function (e.g., ReLU) is applied between these layers, followed by a sigmoid activation at the output. This sequence generates modulation weights in the range $[0, 1]$, representing the relative significance of the respective channels. In the scale stage, the attention weights are applied to the original input tensor through channel-wise multiplication. This step amplifies salient features while diminishing irrelevant ones, enabling the network to learn more discriminative feature representations. This mechanism proves particularly valuable for medical image analysis tasks such as skin lesion classification, where fine-grained features (texture, color, and boundaries) are critical for precise diagnosis. As shown in Figure 4, the SE block efficiently integrates channel dependencies into the network, boosting both performance and robustness. This adaptive attention mechanism significantly improves the network’s capability to handle complex visual recognition challenges.
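The three SE stages described above (squeeze, excitation, scale) can be sketched in plain Python as follows. This is an illustrative toy, not the authors' code: the function `se_block`, the nested-list tensor layout (channels first), the omission of bias terms, and the hand-picked weight matrices are all assumptions. The reduction ratio $r$ is implied by the shape of `w1`, which has $C/r$ rows.

```python
import math
from typing import List, Tuple

Channel = List[List[float]]  # one H x W feature map


def se_block(u: List[Channel],
             w1: List[List[float]],
             w2: List[List[float]]) -> Tuple[List[Channel], List[float]]:
    """Apply squeeze, excitation, and scale to a list of C channels.

    w1 has shape (C/r) x C and w2 has shape C x (C/r); biases are omitted.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]
    # Excitation, part 1: bottleneck FC layer followed by ReLU.
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    # Excitation, part 2: expanding FC layer followed by sigmoid.
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Scale: multiply every value in a channel by that channel's weight.
    scaled = [[[s_c * v for v in row] for row in ch] for ch, s_c in zip(u, s)]
    return scaled, s


# Two 2 x 2 channels; C = 2 and r = 2, so the bottleneck has a single unit.
u = [[[1.0, 1.0], [1.0, 1.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0, 0.0]]       # hypothetical weights, shape (C/r) x C
w2 = [[1.0], [-1.0]]    # hypothetical weights, shape C x (C/r)
scaled, weights = se_block(u, w1, w2)
```

With these toy weights the first channel receives an attention weight above 0.5 and the second a weight below 0.5, so the first channel is emphasized relative to the second. In a trained network `w1` and `w2` would be learned rather than fixed by hand.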

The operations within the SE block—namely, the squeeze, excitation, and scaling modules—are formally defined in Equations (5), (6), and (7), respectively. In this context, the input tensor

X \in R^{H^{'} \times W^{'} \times C^{'}}

has spatial dimensions

H^{'}

and

W^{'}

and

C^{'}

input channels. A convolutional transformation F is applied to generate the feature map

U = F (X) = [u_{1}, u_{2}, \dots, u_{C}]

, where each

u_{i} \in R^{H \times W}

corresponds to the spatial representation of the

i^{th}

channel. The resulting tensor

U \in R^{H \times W \times C}

serves as the input to the SE block, capturing both spatial and channel-wise information for further refinement through attention-based recalibration. This architectural design allows the SE block to dynamically adjust the importance of each channel by leveraging global contextual information. Specifically, the squeeze operation condenses spatial information into a channel descriptor via global average pooling. The excitation step models non-linear channel dependencies and generates attention weights that are subsequently used in the scaling operation to recalibrate the original feature maps. Integrated into the attention-guided deep feature aggregation network, the SE block plays a pivotal role in enhancing the model’s ability to capture and emphasize diagnostically relevant patterns in skin lesion images. By assigning adaptive importance to different channels, the SE mechanism suppresses redundant information while amplifying salient features that contribute most to accurate classification. This is particularly beneficial for handling the complex visual characteristics inherent in dermoscopic medical images, thereby improving both diagnostic precision and network robustness. The squeeze operation is expressed as follows:

z_{c} = F_{squeeze} (u_{c}) = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} u_{c} (i, j)

(5)

In this context,

u_{c} \in R^{H \times W}

refers to the spatial feature map associated with the

c^{th}

channel after the convolutional layer, where W and H denote the width and height of the feature map, respectively. To capture the global context of each channel, global average pooling is applied. This operation condenses the spatial dimensions into a single scalar value for each channel, summarizing the distribution of activation values. This scalar serves as a channel descriptor, allowing the network to leverage global spatial information for further attention-based refinement within the squeeze-and-excitation block. The concise representation provided by global average pooling improves the model’s ability to concentrate on significant patterns within the feature space. The excitation step is expressed as follows:

s = F_{ex} (z, W) = \sigma (g (z, W)) = \sigma (W_{2} \delta (W_{1} z))

(6)

In the proposed model,

z \in R^{C}

represents the output of the squeeze operation, capturing global channel-wise information. The ReLU activation function,

δ (\cdot)

, introduces non-linearity between the two transformations, while the sigmoid function,

σ (\cdot)

, scales the output to the range

[0, 1]

. The weight matrices

W_{1} \in R^{\frac{C}{r} \times C}

and

W_{2} \in R^{C \times \frac{C}{r}}

perform the dimensionality reduction and expansion, respectively, where r is the reduction ratio that keeps the computation efficient. This bottleneck mechanism learns non-linear dependencies across channels while maintaining computational efficiency. The recalibration of the feature map is performed via channel-wise multiplication, as given in Equation (7), selectively enhancing salient channel-wise features to improve the network’s discriminative capabilities. This process contributes to accurate and robust feature representation in tasks like skin lesion classification.

\tilde{x}_{c} = F_{scale} (u_{c}, s_{c}) = s_{c} \cdot u_{c}

(7)

The SE block’s flexibility and architecture-agnostic nature make it a valuable improvement for deep learning models. By adaptively scaling each channel using learned attention weights

s_{c}

, where

s_{c}

corresponds to the

c^{th}

feature map

u_{c} \in R^{H \times W}

, this module boosts the network’s discriminative ability. It selectively amplifies significant features while suppressing less pertinent ones, refining the overall representation. Its integration within the proposed attention-guided deep feature aggregation network has proven beneficial. By reinforcing attention-guided feature aggregation, the SE block improves the network’s capability to capture subtle variations in dermoscopic medical images. This adaptability significantly improves classification performance, particularly in challenging fine-grained tasks like skin lesion diagnosis. With its lightweight design and computational efficiency, the SE block becomes a pivotal element for advancing automated diagnostic systems.
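The three SE operations in Equations (5)–(7) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the network's actual implementation: the tensor sizes, the reduction ratio, the random weights, and the function names (`se_block`, `relu`, `sigmoid`) are assumptions chosen for the example, and bias terms are omitted because Equation (6) does not include them.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(u, w1, w2):
    """Apply squeeze, excitation, and scale to a feature map u of shape (H, W, C)."""
    # Squeeze (Eq. 5): global average pooling over the spatial axes gives z in R^C.
    z = u.mean(axis=(0, 1))
    # Excitation (Eq. 6): FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    s = sigmoid(w2 @ relu(w1 @ z))
    # Scale (Eq. 7): each channel u_c is multiplied by its weight s_c
    # (broadcast over the H and W axes).
    return u * s, s

# Illustrative sizes only; these are not the values used in the paper.
H, W, C, r = 8, 8, 16, 4
rng = np.random.default_rng(0)
u = rng.standard_normal((H, W, C))
w1 = 0.1 * rng.standard_normal((C // r, C))   # W_1 in R^{(C/r) x C}
w2 = 0.1 * rng.standard_normal((C, C // r))   # W_2 in R^{C x (C/r)}

x_tilde, s = se_block(u, w1, w2)
```

The output `x_tilde` has the same shape as `u`, so a block of this kind can follow a convolutional layer without changing the rest of the network.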

The architectural modification that integrates double SE blocks into the attention-guided deep feature aggregation network was investigated as a way to further sharpen feature discrimination for skin lesion classification. By employing two SE blocks after each pair of convolutional layers, rather than the standard single SE block, the modified design aims to refine channel-wise attention at finer levels and to increase sensitivity to subtle lesion patterns, which are critical for accurate classification. The modified architecture features four convolutional groups, each consisting of two convolutional blocks, resulting in a total of eight convolutional blocks and eight SE blocks. With each group incorporating two SE modules, this configuration allows for a deeper recalibration of feature channels across multiple hierarchical levels, intended to amplify diagnostically significant features while suppressing less relevant ones. Several hyperparameter settings were explored during training, and only the best-performing settings were used for the final results; Table 1 lists the details. As reported in Table 2, Table 3, Table 4, Table 5 and Table 6, however, the double SE configuration did not improve on the single SE block configuration: its classification accuracy was lower on all five datasets.

4. Evaluation

This section outlines the experimental framework, including dataset descriptions, training protocols, data augmentation techniques, evaluation criteria, and ablation studies, along with the results obtained for each dataset. The proposed approach is assessed on five prominent dermoscopic medical image datasets, recognized as benchmarks in the field and described in Section 4.3. This multi-dataset setup enables the assessment of model generalizability and performance across various clinical scenarios and image variations.

4.1. Implementation Details

The proposed supervised learning framework is designed to address skin lesion classification through a robust and fair experimental setup. It employs 5-fold cross-validation, so that each dataset is consistently split with 80% for training and 20% for testing in each fold. This strategy aligns with prior research methodologies to ensure fair comparisons. Key implementation details include the following: categorical cross-entropy is used as the loss function for multi-class classification, and Adam is used as the optimizer because its adaptive learning rate and momentum support stable convergence [51,52]. Empirical hyperparameter tuning led to the following configuration: batch size 16, learning rate 0.0001, and 100 epochs. Early stopping, which halts training if the validation loss stagnates over five epochs, and dropout regularization safeguard against overfitting. The network architecture is optimized for feature extraction with a progressive increase in filter numbers across convolutional layers, capturing increasingly abstract patterns. Each convolutional block integrates kernels of sizes

5 \times 5

and

3 \times 3

to address multi-scale spatial details. Additionally, deep feature aggregation connections are embedded to improve gradient flow between layers, which benefits learning dynamics and speeds up convergence during backpropagation. For computational support, the framework runs on an NVIDIA (Santa Clara, CA, USA) RTX 3090 GPU with 24 GB of memory, using Python 3.6 on an Ubuntu workstation. This configuration accelerates processing while maintaining the scalability needed for rigorous experimentation and model evaluation.
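The 5-fold protocol and the early-stopping rule described above can be sketched as follows. This is a hedged, standard-library illustration: the helper names `k_fold_indices` and `stopping_epoch`, the shuffling seed, the use of strict improvement as the stagnation test, and the toy loss values are all assumptions, since the text does not specify them.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; with k=5 each fold tests on about 20%."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training halts for the given validation losses."""
    best = float("inf")
    stale = 0  # consecutive epochs without a new best validation loss
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)

splits = list(k_fold_indices(100, k=5))

# Toy validation curve: the best loss occurs at epoch 3 and is never beaten
# during the following five epochs, so training halts before the last epoch.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63, 0.64, 0.59]
halt = stopping_epoch(losses, patience=5)
```

With 100 samples, every split holds 80 training and 20 test indices, matching the 80%/20% division per fold.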

Table 1. Hyperparameter settings explored during training; only the selected settings were used for the final results.

Hyperparameter	Candidate Values	Selected
Optimizer	SGD, Adam, RMSProp	Adam
Learning rate	0.001, 0.0001	0.0001
Batch size	8, 16, 32, 64, 128	32
Epochs	50, 70, 100, 120, 150	100

4.2. Evaluation Metrics

To evaluate the effectiveness of the proposed model, standard classification metrics—accuracy, precision, recall, and F1-score—were employed within a multi-class classification framework. These metrics, formally defined in Equations (8)–(11), provide a comprehensive and quantitative assessment of model performance. Accuracy, as expressed in Equation (8), measures the average proportion of correctly classified instances across all K classes, serving as a primary indicator of overall classification reliability and consistency when applied to diverse and complex datasets. In addition to accuracy, precision, recall, and F1-score offer deeper insights into the model’s class-wise behavior. These metrics help characterize the model’s ability to manage false positives and false negatives, providing a balanced evaluation of sensitivity and specificity for each class. Collectively, they form a robust evaluation suite for analyzing both general and class-specific performance in multi-class skin lesion classification tasks.

\text{Accuracy} = \frac{1}{K} \sum_{j = 1}^{K} \frac{TP_{j} + TN_{j}}{TP_{j} + TN_{j} + FP_{j} + FN_{j}}

(8)

In this context,

T P_{j}

,

T N_{j}

,

F P_{j}

, and

F N_{j}

represent the true positives, true negatives, false positives, and false negatives for class j, respectively. Equation (9) quantifies precision as the proportion of correctly identified positive cases relative to all cases predicted as positive for a given class. This metric measures the model’s ability to minimize false positives, ensuring accurate identification of class-specific attributes within the multi-class classification framework.

\text{Precision} = \frac{1}{K} \sum_{j = 1}^{K} \frac{TP_{j}}{TP_{j} + FP_{j}}

(9)

Recall, also known as sensitivity, is formally defined in Equation (10). It is calculated within each class to determine the proportion of true positive cases accurately identified out of all actual positive instances. Recall is important for evaluating the model’s success in detecting related instances across multiple classes, particularly in applications where missing a true positive could have significant implications, such as skin lesion classification. By focusing on the identification of all relevant cases, recall offers critical insights into the model’s sensitivity and reliability.

\text{Recall} = \frac{1}{K} \sum_{j = 1}^{K} \frac{TP_{j}}{TP_{j} + FN_{j}}

(10)

The F1-score, as outlined in Equation (11), is the harmonic mean of recall and precision. It provides a balanced metric for assessing the model’s performance by accounting for both false negatives and false positives. This ensures that the assessment does not overly favor one aspect, such as precision or recall alone, making it especially valuable in contexts like skin lesion classification, where sensitivity and specificity are critical for precise diagnostic outcomes.

\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

(11)

These metrics—precision, accuracy, recall, and F1-score—were computed on a per-class basis and subsequently averaged across all K classes. This averaging strategy ensures a comprehensive evaluation of the model’s performance while explicitly addressing the issue of class imbalance, which is particularly critical in medical imaging tasks such as skin lesion classification. By accounting for the individual contribution of each class, the approach offers a balanced performance assessment, mitigating the bias that may arise from skewed class distributions. This is essential for ensuring that the model performs robustly across both majority and minority classes, thereby enhancing its clinical reliability and generalization.
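As a concrete reading of Equations (8)–(11), the sketch below derives the per-class counts from a confusion matrix and averages them over the K classes. The function name `macro_metrics`, the toy confusion matrix, and the choice to compute the F1-score from the class-averaged precision and recall (as Equation (11) is written) are assumptions made for illustration; the evaluation code used in the study is not given in the text.

```python
def macro_metrics(cm):
    """Compute class-averaged accuracy, precision, recall, and F1.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    acc = prec = rec = 0.0
    for j in range(k):
        tp = cm[j][j]
        fn = sum(cm[j]) - tp                          # class j samples predicted as another class
        fp = sum(cm[i][j] for i in range(k)) - tp     # other classes predicted as class j
        tn = total - tp - fn - fp
        acc += (tp + tn) / total                      # Eq. (8), term for class j
        prec += tp / (tp + fp) if tp + fp else 0.0    # Eq. (9), term for class j
        rec += tp / (tp + fn) if tp + fn else 0.0     # Eq. (10), term for class j
    acc, prec, rec = acc / k, prec / k, rec / k
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0   # Eq. (11)
    return acc, prec, rec, f1

# Toy 3-class confusion matrix (made-up counts, not results from the paper).
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 9],
]
accuracy, precision, recall, f1 = macro_metrics(toy_cm)
```

Because every class contributes equally to the averages, a rare class with poor recall lowers the reported recall as much as a frequent class would.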

4.3. Datasets

The global prevalence of skin lesions has necessitated the release of several publicly available dermoscopic medical image datasets to facilitate automated diagnostic research. The proposed classification network, illustrated in Figure 1, was rigorously trained and evaluated across five datasets, each curated from diverse sources to ensure variability in lesion types and imaging conditions. To improve robustness and generalizability, a comprehensive suite of data augmentation techniques was employed. These include flipping, random shear transformations, random rotation, center cropping, contrast and brightness adjustments, histogram equalization, Gaussian noise injection, and blurring. These augmentations aim to expand the training set, mitigate overfitting, and assess their influence on test-time accuracy by introducing variability akin to real-world scenarios. Additionally, each dataset image was meticulously annotated and validated by expert dermatologists to guarantee the reliability of class labels [53]. To mitigate class imbalance, the Synthetic Minority Over-sampling Technique combined with Tomek links (SMOTE-Tomek) [54] strategy was applied before dataset splitting, ensuring balanced distributions for both training and testing sets. This approach enhances model generalization by preventing over-representation of majority classes in evaluation phases. SMOTE enriches the minority class distribution by synthesizing new instances between existing samples, while Tomek links reduce overlap by eliminating borderline instances. This combined strategy not only diversifies the minority classes but also improves class separability, ultimately boosting classification accuracy and ensuring a more balanced dataset for training and evaluation.
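The two ingredients of SMOTE-Tomek described above can be illustrated with a small NumPy sketch: SMOTE places a synthetic point on the line segment between a minority sample and one of its nearest minority neighbours, and a Tomek link is a pair of mutual nearest neighbours with different labels. The function names, the neighbour count, and the toy 2-D points are assumptions for illustration; the text does not state which implementation was used, and in practice a library routine such as `SMOTETomek` from imbalanced-learn would normally be preferred.

```python
import numpy as np

def smote_sample(minority, rng, k=3):
    """Create one synthetic point between a minority sample and a near minority neighbour."""
    i = rng.integers(len(minority))
    d = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(d)[1:k + 1]      # skip position 0, the sample itself
    j = rng.choice(neighbours)
    lam = rng.random()                       # interpolation factor in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

def tomek_links(x, y):
    """Return index pairs (a, b), a < b, of mutual nearest neighbours with different labels."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbour
    nn = d.argmin(axis=1)
    return [(a, int(b)) for a, b in enumerate(nn)
            if nn[b] == a and y[a] != y[b] and a < b]

# Toy 2-D data: class 0 is treated as the minority class in this example.
x = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0],
              [1.1, 1.0], [0.5, 0.5], [0.55, 0.5]])
y = np.array([0, 0, 1, 1, 0, 1])

rng = np.random.default_rng(0)
synthetic = smote_sample(x[y == 0], rng, k=2)
links = tomek_links(x, y)
```

In the toy data, the last two points lie close together but carry different labels, so they form the only Tomek link; removing such borderline pairs is what reduces class overlap.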

4.3.1. ISIC-2019 and ISIC-2020 Datasets

The study uses dermoscopic medical image datasets to construct and evaluate the proposed method for skin lesion classification. Two key datasets from the ISIC challenges, ISIC-2019 with 25,331 images and ISIC-2020 with 33,126 images, form the primary foundation of this research, ensuring a robust evaluation of the network under diverse conditions. In addition, a custom dataset was curated by merging carefully filtered subsets from the ISIC 2019 dataset [55] and the Atlas Dermatology database. Initially, the Atlas Dermatology repository included 561 annotated skin conditions, many of them underrepresented with as few as 9–10 samples per class. To improve training efficacy, a threshold of at least 80 samples per class was applied, yielding 24 well-represented classes with 3399 images. Data from the ISIC 2018 dataset were also incorporated, excluding overlapping classes, resulting in a refined dataset of 31 unique classes and 4910 images. This dataset is divided into training, validation, and test sets in a 72:8:20 ratio, and the benchmark classification accuracy reported for it in [56] is 73%. The ISIC 2020 dataset [57] contributes images sourced from over 2000 patients across multiple institutions, offering broad demographic and clinical diversity. From these, only the 579 histopathologically confirmed melanoma images were utilized to maintain focus on malignant cases, excluding benign classes to avoid class imbalance during evaluation.

This strategic dataset selection, preparation, and augmentation ensure that the proposed attention-guided deep feature aggregation network is rigorously evaluated on a diverse and clinically meaningful foundation, advancing automated diagnostic capabilities.

4.3.2. SKINL2—Skin Lesions Dataset

The SKINL2 dataset, introduced by de Faria et al. [58], is tailored for advancing research in skin lesion classification through light field imaging technology. This specialized dataset comprises 376 light field images, each captured under controlled conditions using a focused plenoptic camera. Each light field includes 81 individual views arranged in a

9 \times 9

matrix, providing both lenslet format and multiview representations. Accompanying these light field images are corresponding dermatoscopic images, allowing for multimodal analysis and comparative studies between traditional and light field imaging techniques. The dataset spans eight clinically relevant categories, including Melanoma (C43), Melanocytic Nevus (D22), Basal Cell Carcinoma (D04), Seborrheic Keratosis (L82), Hemangioma (D18), Dermatofibroma (D23), Psoriasis (L40), and additional types. All images were annotated and validated by clinical experts, ensuring high-quality labels for supervised learning. The SKINL2 dataset’s rich spatial and angular information supports flexible exploration, making it a valuable resource for developing machine learning models in dermatology. Despite its notable contribution, no standardized benchmark has been established for the dataset as of this writing. Its multimodal capabilities and diverse lesion representation continue to facilitate robust evaluation and innovation in automated diagnostic systems.

4.3.3. MED-NODE Dataset

The MED-NODE dataset [59] is a valuable resource for the development and evaluation of automated skin cancer classification systems. It contains 170 high-resolution clinical images (70 malignant melanoma cases and 100 benign melanocytic nevi) and originates from the Department of Dermatology at the University Medical Center Groningen (UMCG), Netherlands. It has significantly contributed to macroscopic image-based diagnostics in dermatology, facilitating advancements in machine learning applications. The MED-NODE system, utilizing an ensemble of predictive models, set an early benchmark by achieving an 81% diagnostic accuracy through a majority voting strategy across classifiers. Differentiating melanoma from benign nevi remains a critical challenge, highlighting the importance of datasets like MED-NODE in enabling deep learning models to capture the subtle distinctions needed for early detection. To overcome the absence of ground-truth images, dataset splitting and augmentation techniques were used to enrich the data. The dataset’s balanced class distribution and expert-verified annotations ensure its reliability for supervised learning tasks, making it particularly suitable for training, validation, and performance analysis of classification algorithms. Its continued use reflects its central role in advancing dermatological diagnostics by fostering the development of reliable and interpretable diagnostic tools.

4.3.4. HAM10000 Dataset

The HAM10000 dataset [60] is a comprehensive repository widely employed for automated skin lesion classification research. It includes 10,015 dermoscopic medical images collected from various clinical settings, primarily sourced from the Department of Dermatology at the Medical University of Vienna and a skin cancer practice in Australia. This dataset spans seven diagnostic categories, including melanoma, benign keratosis-like lesions, melanocytic nevi, basal cell carcinoma, vascular lesions, actinic keratoses, and dermatofibroma. Each image in HAM10000 is meticulously curated and verified by board-certified dermatologists. Diagnostic labels are derived from histopathological examinations, expert consensus, or follow-up data, ensuring reliability and clinical relevance. The dataset’s diversity and high-resolution imagery make it an invaluable benchmark for evaluating automated classification models. It serves as a robust foundation for developing deep learning systems, facilitating advancements in diagnostic accuracy and sensitivity across a wide range of lesion types.

5. Experimental Results

The proposed attention-guided deep feature aggregation network was thoroughly evaluated by conducting comparative experiments against five state-of-the-art (SOTA) deep learning models that have demonstrated success in skin lesion classification tasks, as reported in the literature. These experiments were designed to measure the network’s performance across five widely used dermoscopic medical datasets (Section 4.3): ISIC-2019, ISIC-2020, SKINL2, MED-NODE, and HAM10000. The evaluation framework employed a combination of conventional transfer learning strategies and the proposed architecture with integrated Squeeze-and-Excitation (SE) modules to assess its effectiveness. The results, detailed in terms of classification accuracy, are systematically summarized in Table 2, Table 3, Table 4, Table 5 and Table 6, along with ablation studies in Table 7 and Table 8. These tables collectively compare the network with baseline models and illustrate its robustness in handling diverse and complex datasets.

5.1. Discussion

The proposed method demonstrates high classification accuracy across multiple datasets, surpassing its own baselines and most existing SOTA models. On the ISIC-2019 dataset (Table 2), the method achieved an accuracy of 97.84%, an improvement over the separable convolution baseline (97.00%). The integration of SE modules improved feature representation and discrimination. On the ISIC-2020 dataset (Table 3), the method attained an accuracy of 95.82%, which is competitive with existing methods although below SkinNet-8 (98.81%), supporting its generalizability to newer image collections. With the SKINL2 dataset (Table 4), the proposed model achieved 93.00% accuracy, outperforming both Pedro M. et al. [61]’s uncertainty-based deep learning model (90.82%) and the baseline architecture without SE modules (85.17%). This highlights the method’s ability to handle complex light field image data. On the MED-NODE dataset (Table 5), the method achieved an accuracy of 94.41%, surpassing transfer learning with ResNet50 (82.00%) and DenseNet-161 (87.00%). The performance gain over previous results (Pedro M. et al. [61], 90.82%) emphasizes the framework’s suitability for small-scale datasets. Finally, on the HAM10000 dataset (Table 6), the method achieved 97.45% accuracy, exceeding the results of Bhuvaneshwari S. et al. [62] (95.18%) and Abdulmateen A. et al. [63] (94.11%). The SE module was instrumental in refining the network’s focus on subtle lesion features, supporting robust classification across diverse categories.

Table 2. Comparison of the proposed method with state-of-the-art methods in terms of classification accuracy (%) with ISIC-2019 dataset.

Model/Author Name	Methods	Accuracy
Ensemble [64]	Data Augmentation and Transfer Learning	63.40
LiwTERM [56]	Transformer-Based Model	73.00
Amirreza M. et al. [65]	Data Augmentation and Transfer Learning	97.55
Burhanettin O. et al. [66]	Data Augmentation and Transfer Learning	93.48
Ziyi L. et al. [67]	Custom CNN	98.00
W/O SE module	Separable Convolution	97.00
W Single SE module	Proposed	97.84
W Double SE module	Proposed	93.13

Table 3. Comparison of the proposed method with state-of-the-art methods in terms of classification accuracy (%) using ISIC-2020 dataset.

Model/Author Name	Methods	Accuracy
SkinNet-8 [68]	Data Augmentation and Custom CNN	98.81
SM et al. [69]	Data Augmentation and Transfer Learning	94.83
Kumar et al. [70]	Handcrafted Features and Multimodal Network	93.00
Jaisakthi S. et al. [69]	Data Augmentation and Transfer Learning	96.81
Jaisakthi S. et al. [71]	Hybrid Deep Learning Method	96.75
Proposed	With Single SE module	95.82
Proposed	With Double SE module	91.23

This study highlights the effectiveness of incorporating single SE modules into the proposed attention-guided architecture, showcasing their ability to recalibrate channel-wise feature responses and amplify diagnostically relevant patterns. This integration resulted in consistent performance improvements over the variants without SE modules across all datasets. However, experiments with double SE modules led to accuracy reductions, likely due to over-parameterization or excessive feature suppression, which hindered gradient flow, particularly in datasets with smaller or less diverse samples. Comparative evaluations also demonstrated the limitations of transformer-based models like LiwTERM [56], which achieved only 73.00% accuracy on ISIC-2019. This suggests that spatial self-attention alone may not suffice for tasks requiring localized textural cues, as in dermoscopic medical image classification. The proposed model, by contrast, generalized well across datasets of varying sizes and class distributions. Its strength was particularly evident in datasets like MED-NODE and SKINL2, where it addressed challenges such as class imbalance and limited samples more effectively than conventional deep learning approaches. These findings underline the importance of integrating attention mechanisms at intermediate network levels to improve the model’s focus on diagnostically significant features. The study positions the proposed framework as a reliable, generalizable, and clinically applicable solution for automatic skin lesion classification, with promising potential for clinical decision support.

Table 4. Comparison of the proposed method with state-of-the-art methods in terms of classification accuracy (%) through SKINL2 dataset.

Model/Author Name	Methods	Accuracy
Pedro M. et al. [61]	DL uncertainty evaluation mechanisms	90.82
W/O SE module	Classic Convolution	78.00
W/O SE module	Separable Convolution	85.17
W Single SE module	Proposed	93.00
W Double SE module	Proposed	90.56

Table 5. Comparison of the proposed method with state-of-the-art methods in terms of classification accuracy (%) through MED-NODE dataset.

Model/Author Name	Methods	Accuracy
Pedro M. et al. [61]	DL uncertainty evaluation mechanisms	90.82
Mukherjee et al. [72]	CMLD model	90.14
Panagiotis G. et al. [73]	Transfer Learning (ResNet50)	82.00
Panagiotis G. et al. [73]	Transfer Learning (MobileNetV3)	74.00
Panagiotis G. et al. [73]	Transfer Learning (DenseNet-161)	87.00
Ioannis G. et al. [59]	color descriptors	81.00
W/O SE module	Separable Convolution	76.00
W Single SE module	Proposed	94.41
W Double SE module	Proposed	89.35

The experimental analysis presented in Table 8 emphasizes the impact of integrating a single Squeeze-and-Excitation (SE) module within the classification network. Evaluations conducted across five diverse and publicly available skin lesion datasets (ISIC-2019, ISIC-2020, SKINL2, MED-NODE, and HAM10000) demonstrate that the proposed architecture achieves consistently high classification accuracy, with notable performance peaks on ISIC-2019 (97.84%), ISIC-2020 (95.82%), and HAM10000 (97.50%). The attention-based recalibration mechanism introduced by the SE module enables improved feature discrimination, addressing the complexity of dermoscopic image patterns effectively. However, a closer analysis reveals differences in metric behavior across datasets. ISIC-2019, ISIC-2020, and HAM10000 combine high accuracy with high recall (97.3%, 98.1%, and 96.8%, respectively) and high F1-scores (0.9633, 0.9370, and 0.9694, respectively). The MED-NODE dataset, in contrast, shows only a moderate F1-score (0.5846) alongside a relatively high accuracy (94.41%) and recall (93.5%), emphasizing the difficulties posed by binary imbalance in smaller datasets. The SKINL2 dataset shows the lowest accuracy (93.00%) and recall (87.6%) of the five datasets, with an F1-score of 0.87 in its eight-class setting. These findings support the single SE module’s ability to amplify representational capacity by focusing on salient feature channels while suppressing less relevant ones. They also illustrate the importance of using complementary metrics like the F1-score to assess model behavior thoroughly, particularly in scenarios involving imbalanced datasets. Moving forward, exploring the integration of multiple SE modules or hybrid attention mechanisms could improve the model’s performance in handling skewed distributions and subtle lesion variations.

To further verify the computational efficiency of the proposed architecture, we conduct a comparative analysis of FLOPs, parameter count, and inference speed against DenseNet-121, Xception, and ResNet-50 on the SKINL2 dataset. The results, presented in Table 7, demonstrate that our model maintains competitive classification accuracy while significantly reducing computational costs. The proposed network requires 1.9 million and 2.5 million parameters for the single and double SE variants, respectively, whereas ResNet-50 contains 25.6 million, Xception contains 22.9 million, and DenseNet-121 has 8.0 million parameters. Furthermore, the FLOPs analysis shows that our approach achieves a 32% reduction in computational complexity compared to ResNet-50, making it a lightweight yet effective solution.

The proposed attention-guided deep feature aggregation network (illustrated in Figure 1) is computationally efficient compared to existing state-of-the-art models. As shown in Table 7, our single SE model has significantly lower computational complexity (1.9 M parameters, 8.2 ms inference time) while maintaining competitive accuracy (93.00%), making it a lightweight yet effective solution compared to larger models such as ResNet-50 (25.6 M parameters, 17.5 ms inference time) and Xception (22.9 M parameters, 19.2 ms inference time). These findings support the model’s efficiency in automated skin lesion classification and its suitability for real-world deployment in medical diagnostics. In terms of classification performance, the model outperforms traditional architectures across multiple datasets, achieving 97.84% accuracy on ISIC-2019, 97.45% on HAM10000, and 94.41% on MED-NODE. The attention-guided feature aggregation approach enhances feature discrimination, allowing the model to capture intricate lesion patterns with greater precision, particularly in complex dermatological cases. Regarding double SE modules, our analysis highlights performance degradation due to over-parameterization and excessive feature suppression. The gradient norm analysis reveals that models with double SE modules exhibit reduced gradient magnitudes, leading to slower convergence and potential vanishing gradient issues, particularly in datasets with fewer samples. While single SE modules refine feature representation efficiently, the addition of multiple SE layers negatively impacts learning dynamics, reinforcing the need for careful architectural optimization. Future work will explore hybrid attention mechanisms to balance computational efficiency and feature recalibration.

Table 6. Comparison between the proposed and state-of-the-art methods in terms of classification accuracy (%) on the HAM10000 dataset.

Model/Author Name	Methods	Accuracy (%)
Abdulmateen A. et al. [63]	Multimodal Deep Learning	94.11
Naveed A. et al. [74]	CMLD model	91.50
Amin T. et al. [75]	Transfer Learning	93.00
Bhuvaneshwari S. et al. [62]	Transfer Learning	95.18
Talha M. et al. [76]	Transfer Learning (RegNetY-320)	87.00
Without SE module	Separable Convolution	90.00
With single SE module	Proposed	97.45
With double SE module	Proposed	91.12

Table 7. Results of the proposed method compared with ResNet-50, DenseNet-121, and Xception.

Model	Parameters (M)	Inference Time (ms)	Accuracy (%)
ResNet-50	25.6	17.5	96.81
DenseNet-121	8.0	10.3	92.20
Xception	22.9	19.2	97.50
Proposed Single SE Model	1.9	8.2	93.00
Proposed Double SE Model	2.5	14.0	90.00
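
The efficiency comparison in Table 7 can be restated as ratios. The following is a minimal sketch that only re-expresses the table values; the dictionary layout and the function name `relative_to` are illustrative choices, not part of the original study.

```python
# Values copied from Table 7: (parameters in millions, inference time in ms, accuracy in %).
TABLE_7 = {
    "ResNet-50": (25.6, 17.5, 96.81),
    "DenseNet-121": (8.0, 10.3, 92.20),
    "Xception": (22.9, 19.2, 97.50),
    "Proposed Single SE Model": (1.9, 8.2, 93.00),
    "Proposed Double SE Model": (2.5, 14.0, 90.00),
}


def relative_to(reference: str, model: str) -> dict:
    """Compare `model` with `reference` on the three Table 7 columns."""
    ref_params, ref_time, ref_acc = TABLE_7[reference]
    params, time_ms, acc = TABLE_7[model]
    return {
        "param_ratio": ref_params / params,  # how many times fewer parameters
        "speedup": ref_time / time_ms,       # how many times faster per inference
        "accuracy_gap": acc - ref_acc,       # percentage points (negative = lower)
    }


if __name__ == "__main__":
    for baseline in ("ResNet-50", "DenseNet-121", "Xception"):
        r = relative_to(baseline, "Proposed Single SE Model")
        print(f"{baseline}: {r['param_ratio']:.1f}x fewer parameters, "
              f"{r['speedup']:.2f}x faster, {r['accuracy_gap']:+.2f} points accuracy")
```

Read this way, the table describes a trade-off: the single SE model is considerably smaller and faster than ResNet-50 and Xception, while its reported accuracy in this table is a few points lower.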

The experimental results provide a comprehensive evaluation of the proposed attention-guided deep feature aggregation network equipped with single and double SE modules across various dermoscopic medical image datasets. The analysis offers critical insights into the model’s strengths and limitations, both in terms of its overall performance and its behavior across different dataset characteristics. The proposed model consistently demonstrates competitive or superior accuracy compared to state-of-the-art approaches. For instance, it achieves 97.84% accuracy on ISIC-2019, which is close to the best-performing baseline model by Ziyi L. et al. [67] (98.00%). On HAM10000, the model achieves 97.45%, improving upon results from models like Bhuvaneshwari S. et al. [62] (95.18%) and Abdulmateen A. et al. [63] (94.11%). This highlights the effectiveness of incorporating attention mechanisms to improve feature representation.

Table 8. Results of the proposed method on five distinct datasets in terms of the number of classes, accuracy, recall, and F1-score. The results are for the proposed single SE module.

Dataset Name	Classes	Accuracy (%)	Recall (%)	F1 Score
ISIC-2019	9	97.84	97.3	0.9633
ISIC-2020	2	95.82	98.1	0.9370
SKINL2	8	93.00	87.6	0.87
MED-NODE	2	94.41	93.5	0.5846
HAM10000	7	97.50	96.8	0.9694

5.2. Performance Degradation with Double SE Modules

The integration of a single SE module produces consistent accuracy improvements across all datasets. By dynamically recalibrating feature channels, the SE module allows the network to prioritize diagnostically relevant information, which is crucial for addressing the subtle and intricate patterns typical of dermoscopic medical images. This is particularly evident in datasets like SKINL2, where the proposed model achieved 93.00% accuracy, outperforming prior benchmarks.

The difference in performance between single and double SE modules arises primarily from over-parameterization and feature suppression in the latter configuration. A single SE module recalibrates channel-wise features effectively without introducing excessive complexity, preserving gradient flow and feature discrimination. Double SE modules, however, may lead to an overly intricate architecture that suppresses diagnostically relevant features and disrupts gradient propagation, especially on smaller datasets. Contrary to initial expectations, the inclusion of double SE modules resulted in a decline in accuracy for most datasets (e.g., 93.13% on ISIC-2019 compared to 97.84% with a single SE module), which suggests that over-parameterization or excessive feature suppression may impede gradient flow and learning efficiency. This limitation highlights the need for careful consideration of architectural complexity to avoid performance bottlenecks. To enhance the model further, future work could explore hybrid attention mechanisms that combine spatial and channel-wise recalibration, adaptive configurations based on dataset characteristics, and improved regularization techniques to mitigate overfitting while maintaining generalization. Such advancements could refine the model’s efficiency and broaden its applicability across diverse clinical scenarios.
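
The feature-suppression argument can be illustrated with a toy squeeze-and-excitation block. The sketch below is a simplified, dependency-free stand-in, not the authors' implementation: the weights and activations are hypothetical, and placing the second block directly after the first with the same weights is an assumption made only for illustration. Because every sigmoid gate lies strictly between 0 and 1, each additional block can only shrink activation magnitudes unless later layers compensate.

```python
import math


def se_gates(feature_map, w_reduce, w_expand):
    """Compute one sigmoid gate per channel (squeeze, then excitation)."""
    # Squeeze: global average pooling collapses each channel to one scalar.
    squeezed = [sum(channel) / len(channel) for channel in feature_map]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    # Excitation, step 2: expand back to one value per channel, then sigmoid.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_expand]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def se_block(feature_map, w_reduce, w_expand):
    """Rescale every channel by its gate."""
    gates = se_gates(feature_map, w_reduce, w_expand)
    return [[g * v for v in channel] for g, channel in zip(gates, feature_map)]


def mean_abs(feature_map):
    """Average absolute activation over all channels and positions."""
    values = [abs(v) for channel in feature_map for v in channel]
    return sum(values) / len(values)


# Toy input: 2 channels with 4 spatial positions each (hypothetical values).
FEATURES = [[1.0, 2.0, 3.0, 2.0], [0.5, 0.5, 1.0, 2.0]]
W_REDUCE = [[0.5, -0.25]]   # 2 channels -> 1 hidden unit (hypothetical weights)
W_EXPAND = [[1.0], [-1.0]]  # 1 hidden unit -> 2 channels (hypothetical weights)

single = se_block(FEATURES, W_REDUCE, W_EXPAND)
double = se_block(single, W_REDUCE, W_EXPAND)  # second block applied to the first's output

if __name__ == "__main__":
    for name, fmap in (("input", FEATURES), ("single SE", single), ("double SE", double)):
        print(f"{name}: mean |activation| = {mean_abs(fmap):.3f}")
```

In a trained network, normalization and subsequent layers can offset this attenuation, so the sketch shows only the multiplicative mechanism, not the measured accuracy drop.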

5.3. Metric Imbalances and Dataset-Specific Behavior

Disparities in performance metrics, such as the lower F1-score on ISIC-2020 (0.3370) despite high accuracy (95.82%), point to challenges in handling class imbalance. Similar issues are observed on MED-NODE (F1-score 0.5846 vs. accuracy 94.41%). These findings emphasize the importance of employing balanced evaluation metrics, as accuracy alone may not fully reflect the model’s effectiveness, particularly in imbalanced scenarios. While the model performs well on multi-class datasets like HAM10000 and SKINL2, its performance on binary or smaller datasets like MED-NODE reveals room for improvement. The relatively lower F1-score indicates sensitivity to dataset-specific characteristics, such as imbalance or underrepresented classes, which might limit the model’s applicability in certain clinical contexts. Nevertheless, the single SE module improves channel-wise attention, delivering robust accuracy across datasets with diverse characteristics. This performance across small-scale, large-scale, balanced, and imbalanced datasets demonstrates the model’s versatility. The method effectively captures subtle lesion features, making it particularly suited for challenging tasks like melanoma detection.

The addition of double SE modules introduces complexity that may hinder learning, particularly in resource-constrained or small-dataset settings. Metric inconsistencies suggest the need for further optimization, particularly to address imbalanced data. The model’s reliance on channel recalibration may sometimes overlook spatial or contextual dependencies; modeling these dependencies could further improve classification robustness.
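
A small worked example shows how accuracy and F1-score can diverge on an imbalanced binary task. The confusion-matrix counts below are hypothetical and chosen only to illustrate the effect; they are not taken from any dataset in this study.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced test set: 1000 images, only 20 of them positive.
# The classifier finds 10 of the 20 positives and raises 30 false alarms.
metrics = binary_metrics(tp=10, fp=30, fn=10, tn=950)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

Here the many correctly rejected negatives dominate accuracy (0.96), while the F1-score for the minority class is only about 0.33, which is why the two metrics should be read together.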

5.4. Future Directions

The findings of this study pave the way for several future improvements that can enhance the proposed attention-guided deep feature aggregation network. Combining SE modules with spatial or multi-head attention can help address limitations associated with feature suppression and gradient flow, refining feature discrimination and ensuring optimal recalibration. Additionally, implementing mechanisms to dynamically adapt the number of SE modules based on dataset characteristics could further improve efficiency and performance. To mitigate challenges posed by imbalanced or underrepresented datasets, class-specific augmentation techniques will be explored to enhance model generalization. A key future direction involves addressing the limitation of cross-dataset evaluation by leveraging transfer learning, where the model will be pretrained on large-scale datasets such as ISIC-2019, SKINL2, and HAM10000 before fine-tuning on domain-specific medical datasets like MED-NODE, improving classification robustness across different lesion distributions. Furthermore, domain adaptation strategies, including adversarial domain adaptation and meta-learning approaches, will be investigated to align feature representations across varying imaging conditions, ensuring generalizability to unseen datasets and reducing dataset bias. To enhance clinical trust in automated diagnostics, future work will explore Explainable AI (XAI) techniques, such as Grad-CAM, SHAP, and saliency-based visualization, to provide clinicians with transparent insights into model predictions. Additionally, the study does not yet address real-world deployment challenges, including domain shift analysis across different imaging conditions. Future research will focus on cross-domain adaptation strategies, ensuring the model remains robust when applied to diverse clinical datasets. 
Finally, systematic robustness testing under real-world noise conditions will be conducted to enhance model stability and ensure practical applicability in clinical environments. Overall, the proposed network represents a significant step forward in automated skin lesion classification, with a strong potential for real-world deployment and continuous optimization through these future research directions.
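
As one possible reading of the class-specific augmentation idea, the sketch below computes how many augmented images each class would need to match the largest class. The function name `augmentation_plan`, the balancing target, and the class counts are all assumptions for illustration; the study does not specify an augmentation scheme.

```python
import math


def augmentation_plan(class_counts: dict, target: int = None) -> dict:
    """Per-class number of extra images needed to reach `target`.

    If `target` is omitted, the size of the largest class is used.
    """
    if target is None:
        target = max(class_counts.values())
    plan = {}
    for name, count in class_counts.items():
        if count <= 0:
            raise ValueError(f"class {name!r} has no images to augment")
        extra = max(0, target - count)
        plan[name] = {
            "extra_images": extra,
            # Upper bound on augmented copies to generate from each original image.
            "copies_per_image": math.ceil(extra / count),
        }
    return plan


# Hypothetical class counts for a two-class lesion dataset.
plan = augmentation_plan({"nevus": 700, "melanoma": 100})

if __name__ == "__main__":
    for name, entry in plan.items():
        print(name, entry)
```

Only the minority class is augmented here, which keeps the majority class untouched and limits the risk of amplifying artifacts in already well-represented categories.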

6. Conclusions

This study introduced an attention-guided deep feature aggregation network for skin lesion classification that integrates channel-wise squeeze-and-excitation attention with hierarchical feature aggregation. In experiments on five benchmark datasets (ISIC-2019, ISIC-2020, HAM10000, MED-NODE, and SKINL2), the proposed method demonstrated competitive performance, with classification accuracies of 97.84%, 94.41%, and 97.45% on the ISIC-2019, MED-NODE, and HAM10000 datasets, respectively. Moreover, the model’s robust F1-scores on multi-class datasets like SKINL2 validate its effectiveness in capturing both the global context and intricate lesion details. These results surpassed previous state-of-the-art methods, improving classification accuracy by 3.6% on MED-NODE and 0.96% on HAM10000, demonstrating the effectiveness of our attention-guided deep feature aggregation network. However, several limitations need addressing. The absence of cross-dataset evaluation limits the ability to confirm the model’s generalization across varying clinical settings. Training the model from scratch, while thorough, may hinder optimization in terms of convergence speed and adaptability. To refine this framework further, future efforts should emphasize cross-dataset evaluations, adopt transfer learning with domain-specific pretraining to accelerate convergence, and systematically evaluate robustness under noisy conditions. These improvements will be instrumental in enhancing the model’s reliability and ensuring its seamless integration into real-world diagnostic workflows.

Author Contributions

Conceptualization, S.M.Y. and H.K.; methodology, S.M.Y.; software, S.M.Y.; validation, S.M.Y.; formal analysis, S.M.Y.; investigation, S.M.Y.; resources, S.M.Y. and H.K.; data curation, S.M.Y.; writing—original draft preparation, S.M.Y.; writing—review and editing, H.K.; visualization, S.M.Y.; supervision, H.K.; project administration, H.K.; funding acquisition, H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financially supported by the Seoul National University of Science and Technology, Seoul, Republic of Korea.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Zhu, H.; Wang, J.; Wang, S.H.; Raman, R.; Górriz, J.M.; Zhang, Y.D. An Evolutionary Attention-Based Network for Medical Image Classification. Int. J. Neural Syst. 2023, 33, 2350010.
2. Devi, N.L.; Kumar, D.V.; Prasad, B.D.; Divya, B.; Sindhuja, A.; Charan, B.S. Skin Lesion Segmentation and Multiclass Classification using Deep Neural Networks with Transfer Learning Models. Int. J. Res. Publ. Rev. 2024, 5, 3754–3760.
3. Lee, S.I.; Kim, H. GaussianMask: Uncertainty-aware Instance Segmentation based on Gaussian Modeling. In Proceedings of the 26th International Conference on Pattern Recognition (ICPR 2022), Montreal, QC, Canada, 21–25 August 2022; pp. 3851–3857.
4. Attallah, O. Skin-CAD: Explainable deep learning classification of skin cancer from dermoscopic images by feature selection of dual high-level CNNs features and transfer learning. Comput. Biol. Med. 2024, 178, 108798.
5. Remzan, N.; Tahiry, K.; Farchi, A. Advancing brain tumor classification accuracy through deep learning: Harnessing radimagenet pre-trained convolutional neural networks, ensemble learning, and machine learning classifiers on MRI brain images. Multimed. Tools Appl. 2024, 83, 82719–82747.
6. Raouf, I.; Kumar, P.; Kim, H.S. Deep learning-based fault diagnosis of servo motor bearing using the attention-guided feature aggregation network. Expert Syst. Appl. 2024, 258, 125137.
7. Bajcsi, A.; Andreica, A.; Chira, C. Significance of Training Images and Feature Extraction in Lesion Classification. In Proceedings of the 16th International Conference on Agents and Artificial Intelligence, Rome, Italy, 24–26 February 2024; pp. 117–124.
8. Hasan, S.M.M.; Mamun, A.; Srizon, A.Y. Enhancing Multi-Class Skin Lesion Classification with Modified EfficientNets: Advancing Early Detection of Skin Cancer. In Proceedings of the 2023 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD), Dhaka, Bangladesh, 21–23 September 2023; pp. 94–98.
9. Sasithradevi, A.; Soundararajan, K.; Sasidhar, P.; Pulipati, P.K.; Sruthi, E.; Prakash, P. EffiCAT: A synergistic approach to skin disease classification through multi-dataset fusion and attention mechanisms. Biomed. Signal Process. Control. 2025, 100, 107141.
10. Ji, Z.; Wang, X.; Liu, C.; Wang, Z.; Yuan, N.; Ganchev, I. EFAM-Net: A Multi-Class Skin Lesion Classification Model Utilizing Enhanced Feature Fusion and Attention Mechanisms. IEEE Access 2024, 12, 143029–143041.
11. Ömeroglu, A.N.; Mohammed, H.M.; Oral, E.A.; Aydın, S. A novel soft attention-based multi-modal deep learning framework for multi-label skin lesion classification. Eng. Appl. Artif. Intell. 2023, 120, 105897.
12. Ghazouani, H. Multi-residual attention network for skin lesion classification. Biomed. Signal Process. Control 2025, 103, 107449.
13. Wei, Z.; Li, Q.; Song, H. Dual attention based network for skin lesion classification with auxiliary learning. Biomed. Signal Process. Control 2022, 74, 103549.
14. Lu, Z.; Wang, J. A novel and efficient multi-scale feature extraction method for EEG classification. AIMS Math. 2024, 9, 16605–16622.
15. Tada, M.; Han, X. Bottleneck Transformer model with Channel Self-Attention for skin lesion classification. In Proceedings of the 2023 18th International Conference on Machine Vision and Applications (MVA), Hamamatsu, Japan, 23–25 July 2023; pp. 1–5.
16. Wang, L.; Zhang, L.; Shu, X.; Yi, Z. Intra-class consistency and inter-class discrimination feature learning for automatic skin lesion classification. Med. Image Anal. 2023, 85, 102746.
17. Wang, G.; Ma, Q.; Li, Y.; Mao, K.; Xu, L.; Zhao, Y. A skin lesion segmentation network with edge and body fusion. Appl. Soft Comput. 2025, 170, 112683.
18. Mahmood, T.; Choi, J.; Park, K.R. Artificial intelligence-based classification of pollen grains using attention-guided pollen features aggregation network. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 740–756.
19. Ding, S.; Wu, Z.; yan Zheng, Y.; Liu, Z.; Yang, X.; kai Yang, X.; Yuan, G.; Xie, J. Deep attention branch networks for skin lesion classification. Comput. Methods Programs Biomed. 2021, 212, 106447.
20. Zhang, J.; Xie, Y.; Xia, Y.; Shen, C. Attention Residual Learning for Skin Lesion Classification. IEEE Trans. Med. Imaging 2019, 38, 2092–2103.
21. Askari, F.; Fateh, A.; Mohammadi, M.R. Enhancing few-shot image classification through learnable multi-scale embedding and attention mechanisms. Neural Netw. 2025, 187, 107339.
22. Shah, S.J.H.; Albishri, A.; Wang, R.; Lee, Y. Integrating local and global attention mechanisms for enhanced oral cancer detection and explainability. Comput. Biol. Med. 2025, 189, 109841.
23. Mubeen, A.; Dulhare, U.N. Enhanced Skin Lesion Classification Using Deep Learning, Integrating with Sequential Data Analysis: A Multiclass Approach. Eng. Proc. 2024, 78, 6.
24. Zaw, K.P.; Mon, A. Enhanced Multi-Class Skin Lesion Classification of Dermoscopic Images Using an Ensemble of Deep Learning Models. J. Comput. Theor. Appl. 2024, 2, 256–267.
25. Zhu, J.; Bolsterlee, B.; Song, Y.; Meijering, E. Improving cross-domain generalizability of medical image segmentation using uncertainty and shape-aware continual test-time domain adaptation. Med. Image Anal. 2025, 101, 103422.
26. Roy, A.G.; Navab, N.; Wachinger, C. Recalibrating Fully Convolutional Networks With Spatial and Channel “Squeeze and Excitation” Blocks. IEEE Trans. Med. Imaging 2018, 38, 540–549.
27. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 2011–2023.
28. Li, R.; Zheng, S.; Duan, C.; Yang, Y.; Wang, X. Classification of Hyperspectral Image Based on Double-Branch Dual-Attention Mechanism Network. Remote Sens. 2020, 12, 582.
29. Sarwar, N.; Irshad, A.; Naith, Q.H.; Alsufiani, K.D.; Almalki, F.A. Skin lesion segmentation using deep learning algorithm with ant colony optimization. BMC Med. Inform. Decis. Mak. 2024, 24, 265.
30. Song, W.; Wang, X.; Guo, Y.; Li, S.; Xia, B.; Hao, A. CenterFormer: A Novel Cluster Center Enhanced Transformer for Unconstrained Dental Plaque Segmentation. IEEE Trans. Multimed. 2024, 26, 10965–10978.
31. Saghir, U.; Singh, S.K.; Hasan, M. Skin Cancer Image Segmentation Based on Midpoint Analysis Approach. J. Imaging Inform. Med. 2024, 37, 2581–2596.
32. Bhati, D.; Neha, F.; Amiruzzaman, M. A Survey on Explainable Artificial Intelligence (XAI) Techniques for Visualizing Deep Learning Models in Medical Imaging. J. Imaging 2024, 10, 239.
33. Qian, S.; Ren, K.; Zhang, W.; Ning, H. Skin lesion classification using CNNs with grouping of multi-scale attention and class-specific loss weighting. Comput. Methods Programs Biomed. 2022, 226, 107166.
34. Gessert, N.; Sentker, T.; Madesta, F.; Schmitz, R.; Kniep, H.; Baltruschat, I.M.; Werner, R.; Schlaefer, A. Skin Lesion Classification Using CNNs With Patch-Based Attention and Diagnosis-Guided Loss Weighting. IEEE Trans. Biomed. Eng. 2019, 67, 495–503.
35. Martinez, F.; Zhao, Y. Integrating Multiple Visual Attention Mechanisms in Deep Neural Networks. In Proceedings of the 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC), Software, Torino, Italy, 26–30 June 2023; pp. 1191–1196.
36. Alizadeh, R.; Allen, J.K.; Mistree, F. Correction: Managing computational complexity using surrogate models: A critical review. Res. Eng. Des. 2025, 36, 2.
37. Lee, J.; Jang, J.; Lee, J.; Chun, D.; Kim, H. CNN-Based Mask-Pose Fusion for Detecting Specific Persons on Heterogeneous Embedded Systems. IEEE Access 2021, 9, 120358–120366.
38. Rodríguez-Torres, F.; Martínez-Trinidad, J.; Carrasco-Ochoa, J. An Oversampling Method for Class Imbalance Problems on Large Datasets. Appl. Sci. 2022, 12, 3424.
39. Vu, Y.N.T.; Wang, R.; Balachandar, N.; Liu, C.; Ng, A.Y.; Rajpurkar, P. MedAug: Contrastive learning leveraging patient metadata improves representations for chest X-ray interpretation. arXiv 2021, arXiv:2102.10663.
40. Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep Layer Aggregation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2403–2412.
41. Neha, F.; Bhati, D.; Shukla, D.K.; Dalvi, S.M.; Mantzou, N.; Shubbar, S. U-Net in Medical Image Segmentation: A Review of Its Applications Across Modalities. arXiv 2024, arXiv:2412.02242.
42. Verma, P.; Gupta, T.; Das, P.; Nali, A.R.; Hunsigida, V.; Acharyya, A. Hardware-Aware Network Adaptation using Width and Depth Shrinking including Convolutional and Fully Connected Layer Merging. In Proceedings of the 2024 IEEE 37th International System-on-Chip Conference (SOCC), Dresden, Germany, 6–19 September 2024; pp. 1–6.
43. Collin, A.S.; de Bodt, C.; Mulders, D.; De Vleeschouwer, C. Don’t skip the skips: Autoencoder skip connections improve latent representation discrepancy for anomaly detection. In Proceedings of the ESANN 2023, Bruges, Belgium, 4–6 October 2023; pp. 653–658.
44. Haider, A.; Arsalan, M.; Lee, M.B.; Owais, M.; Mahmood, T.; Sultan, H.; Park, K.R. Artificial Intelligence-based computer-aided diagnosis of glaucoma using retinal fundus images. Expert Syst. Appl. 2022, 207, 117968.
45. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807.
46. Hong, H.; Choi, D.; Kim, N.; Lee, H.; Kang, B.; Kang, H.; Kim, H. Survey of convolutional neural network accelerators on field-programmable gate array platforms: Architectures and optimization techniques. J. Real-Time Image Process. 2024, 21, 64.
47. Sitaula, C.; Hossain, M.B. Attention-based VGG-16 model for COVID-19 chest X-ray image classification. Appl. Intell. 2020, 51, 2850–2863.
48. Hong, H.; Choi, D.; Kim, N.; Kim, H. Mobile-X: Dedicated FPGA Implementation of the MobileNet Accelerator Optimizing Depthwise Separable Convolution. IEEE Trans. Circuits Syst. II Express Briefs 2024, 71, 4668–4672.
49. Hong, J.S.; Kim, S.G.; Kim, J.S.; Park, K.R. Deep learning-based restoration of multi-degraded finger-vein image by non-uniform illumination and noise. Eng. Appl. Artif. Intell. 2024, 133, 108036.
50. Shin, J.; Kim, H. L-TTA: Lightweight Test-Time Adaptation Using a Versatile Stem Layer. Adv. Neural Inf. Process. Syst. 2024, 37, 39325–39349.
51. Ho, Y.; Wookey, S. The Real-World-Weight Cross-Entropy Loss Function: Modeling the Costs of Mislabeling. IEEE Access 2020, 8, 4806–4813.
52. Xu, D.; Zhang, S.; Zhang, H.; Mandic, D.P. Convergence of the RMSProp deep learning method with penalty for nonconvex optimization. Neural Netw. 2021, 139, 17–23.
53. Chun, D.; Lee, S.; Kim, H. USD: Uncertainty-Based One-Phase Learning to Enhance Pseudo-Label Reliability for Semi-Supervised Object Detection. IEEE Trans. Multimed. 2024, 26, 6336–6347.
54. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Advances in Intelligent Computing, Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887.
55. Kassem, M.A.; Hosny, K.M.; Fouad, M.M. Skin Lesions Classification into Eight Classes for ISIC 2019 Using Deep Convolutional Neural Network and Transfer Learning. IEEE Access 2020, 8, 114822–114832.
56. Souza, L.A.; Pacheco, A.G.C.; de Angelo, G.G.; Oliveira-Santos, T.; Palm, C.; Papa, J.P. LiwTERM: A Lightweight Transformer-Based Model for Dermatological Multimodal Lesion Detection. In Proceedings of the 2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Manaus, Brazil, 30 September–3 October 2024; pp. 1–6.
57. International Skin Imaging Collaboration. SIIM-ISIC 2020 Challenge Dataset. 2020. Available online: https://challenge2020.isic-archive.com/ (accessed on 4 June 2025).
58. de Faria, S.M.M.; Filipe, J.N.; Pereira, P.M.M.; Tavora, L.M.N.; Assuncao, P.A.A.; Santos, M.O.; Fonseca-Pinto, R.; Santiago, F.; Dominguez, V.; Henrique, M. Light Field Image Dataset of Skin Lesions. In Proceedings of the 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23–27 July 2019; pp. 3905–3908.
59. Giotis, I.; Molders, N.; Land, S.; Biehl, M.; Jonkman, M.F.; Petkov, N. MED-NODE: A computer-assisted melanoma diagnosis system using non-dermoscopic images. Expert Syst. Appl. 2015, 42, 6578–6585.
60. Tschandl, P.; Rosendahl, C.; Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 2018, 5, 180161.
61. Pereira, P.M.M.; Thomaz, L.A.; Tavora, L.M.N.; Assuncao, P.A.A.; Fonseca-Pinto, R.; Paiva, R.P.; Faria, S.M.M. Multiple Instance Learning Using 3D Features for Melanoma Detection. IEEE Access 2022, 10, 76296–76309.
62. Shetty, B.; Fernandes, R.; Rodrigues, A.P.; Chengoden, R.; Bhattacharya, S.; Lakshmanna, K. Skin lesion classification of dermoscopic images using machine learning and convolutional neural network. Sci. Rep. 2022, 12, 18134.
63. Adebiyi, A.; Abdalnabi, N.; Smith, E.H.; Hirner, J.; Simoes, E.J.; Becevic, M.; Rao, P. Accurate Skin Lesion Classification Using Multimodal Learning on the HAM10000 Dataset. medRxiv 2024.
64. Steppan, J.; Hanke, S. Analysis of skin lesion images with deep learning. arXiv 2021, arXiv:2101.03814.
65. Mahbod, A.; Schaefer, G.; Wang, C.; Ecker, R.; Ellinge, I. Skin Lesion Classification Using Hybrid Deep Neural Networks. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 1229–1233.
66. Ozdemir, B.; Pacal, I. A robust deep learning framework for multiclass skin cancer classification. Sci. Rep. 2025, 15, 4938.
67. Li, Z.; Chen, Z.; Che, X.; Wu, Y.; Huang, D.; Ma, H.; Dong, Y. A classification method for multi-class skin damage images combining quantum computing and Inception-ResNet-V1. Front. Phys. 2022, 10, 1046314.
68. Fahad, N.M.; Sakib, S.; Khan Raiaan, M.A.; Hossain Mukta, M.S. SkinNet-8: An Efficient CNN Architecture for Classifying Skin Cancer on an Imbalanced Dataset. In Proceedings of the 2023 International Conference on Electrical, Computer and Communication Engineering (ECCE), Chittagong, Bangladesh, 23–25 February 2023; pp. 1–6.
69. S M, J.; P, M.; Aravindan, C.; Appavu, R. Classification of skin cancer from dermoscopic images using deep neural network architectures. Multimed. Tools Appl. 2022, 82, 15763–15778.
70. Pilania, U.; Kumar, M.; Garg, P.; Kaur, R. Detection and Classification of Skin Cancer Using Binary Classifier, Residual Network, and Convolutional Neural Network. In Proceedings of the 2024 2nd International Conference on Sustainable Computing and Smart Systems (ICSCSS), Coimbatore, India, 10–12 July 2024; pp. 1115–1122.
71. Akram, A.; Rashid, J.; Jaffar, M.A.; Faheem, M.; Amin, R.u. Segmentation and classification of skin lesions using hybrid deep learning method in the Internet of Medical Things. Skin Res. Technol. 2023, 29, e13524.
72. Mukherjee, S.; Adhikari, A.; Roy, M. Malignant melanoma classification using cross-platform dataset with deep learning CNN architecture. In Recent Trends in Signal and Image Processing; Springer: Singapore, 2019; pp. 31–41.
73. Georgiadis, P.; Gkouvrikos, E.V.; Vrochidou, E.; Kalampokas, T.; Papakostas, G.A. Building Better Deep Learning Models Through Dataset Fusion: A Case Study in Skin Cancer Classification with Hyperdatasets. Diagnostics 2025, 15, 352.
74. Ahmad, N.; Shah, J.H.; Khan, M.A.; Baili, J.; Ansari, G.J.; Tariq, U.; Kim, Y.J.; Cha, J.H. A novel framework of multiclass skin lesion recognition from dermoscopic images using deep learning and explainable AI. Front. Oncol. 2023, 13, 1151257.
75. Tajerian, A.; Kazemian, M.; Tajerian, M.; Akhavan Malayeri, A. Design and validation of a new machine-learning-based diagnostic tool for the differentiation of dermatoscopic skin cancer images. PLoS ONE 2023, 18, e0284437.
76. Alam, T.M.; Shaukat, K.; Khan, W.A.; Hameed, I.A.; Almuqren, L.A.; Raza, M.A.; Aslam, M.; Luo, S. An Efficient Deep Learning-Based Skin Cancer Classifier for an Imbalanced Dataset. Diagnostics 2022, 12, 2115.

Figure 1. Proposed squeeze-and-excitation-based architecture for classification with deep feature aggregation and channel-wise attention mechanism.

Figure 2. Squeeze-and-excitation-based architecture with single and double block groups.

Figure 3. Detailed overview of separable convolution block used in proposed architecture.

Figure 4. Steps involved in squeeze and excitation (SE) building block with detailed operations in every SE block.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
