Article

Lightweight Attention-Based Architecture for Accurate Melanoma Recognition

1 Richard and Loan Hill Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
2 Department of Dermatology, University of Illinois at Chicago, Chicago, IL 60607, USA
* Author to whom correspondence should be addressed.
Electronics 2025, 14(21), 4281; https://doi.org/10.3390/electronics14214281
Submission received: 28 August 2025 / Revised: 22 October 2025 / Accepted: 25 October 2025 / Published: 31 October 2025
(This article belongs to the Special Issue Digital Signal and Image Processing for Multimedia Technology)

Abstract

Dermoscopy, a non-invasive imaging technique, has transformed dermatology by enabling early detection and differentiation of skin conditions. Integrating deep learning with dermoscopic images enhances diagnostic potential but raises computational challenges. This study introduces APNet, an attention-based architecture designed for melanoma detection, offering fewer parameters than conventional convolutional neural networks. Two baseline models are considered: HU-Net, a trimmed U-Net that uses only the encoding path for classification, and Pocket-Net, a lightweight U-Net variant that reduces parameters through fewer feature maps and efficient convolutions. While Pocket-Net is highly resource-efficient, its simplification can reduce performance. APNet extends Pocket-Net by incorporating squeeze-and-excitation (SE) attention blocks into the encoding path. These blocks adaptively highlight the most relevant dermoscopic features, such as subtle melanoma patterns, improving classification accuracy. The study evaluates APNet against Pocket-Net and HU-Net using four large, annotated dermoscopy datasets (ISIC 2017–2020), covering melanoma, benign nevi, and other lesions. Results show that APNet achieves faster processing than HU-Net while overcoming the performance loss observed in Pocket-Net. By reducing parameters without sacrificing accuracy, APNet provides a practical solution for computationally demanding dermoscopy, offering efficient and accurate melanoma detection where medical imaging resources are limited.

1. Introduction

Skin cancer, one of the most prevalent forms of cancer, has become a major public health concern, with over 5 million cases diagnosed annually in the US [1,2]. According to the American Cancer Society, melanoma is the most dangerous type of skin cancer: delayed diagnosis can dramatically increase the five-year chance of fatality to 73%, whereas early detection can improve the survival chance to 99%. Many different non-invasive imaging modalities have been explored to aid melanoma diagnosis [3,4,5,6,7,8,9,10,11].
Several diagnostic markers of lesion appearance are used to identify melanoma, such as asymmetry, border irregularity, uneven color distribution, and diameter [12]. In practice, however, many clinicians rely on visual comparison of lesions on the same person and on their own experience. Dermoscopy image classification using novel deep learning models can assist in the diagnosis of skin cancer. Advancements in deep learning architecture design have led to the development of highly sophisticated models capable of analyzing and interpreting medical images. However, nearly all deep learning models are computationally intensive, requiring significant processing power and memory resources. Training these complex models may demand specialized hardware, such as powerful Graphics Processing Units (GPUs) and very large memory. Acquiring, labeling, and storing the extensive data demanded by more complex and deeper models is another formidable challenge, especially in the medical domain, where patient data are sensitive and privacy regulations are stringent. On top of that, deeper models can also require lengthy training times: hours, days, or even weeks.
A potent model in the field of biomedical image processing is the U-Net architecture, first applied to biomedical image segmentation tasks such as the ISBI cell tracking challenge [13]. The U-Net architecture derives its name from its U-shaped structure, which consists of two main components: a contracting path and an expansive path. Because of this design, it can generate very accurate segmentation results even with a small amount of training data, which makes it very useful for medical imaging tasks where annotated datasets are often scarce and expensive to obtain. The contracting path, analogous to the encoder of a typical convolutional neural network (CNN), includes multiple convolutional and pooling layers, which reduce the spatial dimensions of the input image while capturing its hierarchical features [13]. This process effectively extracts high-level representations of the input image. The expansive path mimics the decoder of a CNN and comprises up-sampling and convolutional layers that gradually recover the spatial resolution of the feature maps produced by the contracting path. By concatenating feature maps from the contracting path at corresponding layers in the expansive path, the U-Net model incorporates both local and global contextual information, facilitating precise localization of objects of interest in the input image. Moreover, the skip connections between the layers enable the U-Net model to preserve fine-grained details during the up-sampling process, thereby mitigating issues such as information loss and gradient vanishing commonly encountered in deeper networks. Although CNNs have demonstrated impressive performance in a variety of computer vision tasks, the growing size of models has made it more difficult to use them in embedded vision, mobile, and real-time applications. Half U-Net (HU-Net) is a modified version of the U-Net architecture in which only a portion of the original model is utilized for a specific task [13,14,15,16,17]. HU-Net consists solely of the encoding (contracting) path of the U-Net architecture for classification tasks, with the decoding (expanding) path removed. More specifically, the HU-Net architecture first encodes the input dermoscopy image into a latent representation, capturing essential hierarchical features relevant to classification; a fully connected model then discriminates between two or more classes.
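As a rough illustration of this encoder-only design, the sketch below builds a HU-Net-style classifier in Keras; the depth, filter counts, input size, and classification head are illustrative assumptions rather than the exact configuration used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters):
    # Two 3x3 convolutions, as in the U-Net contracting path.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_hu_net(input_shape=(256, 256, 3), num_classes=2,
                 base_filters=16, depth=4):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for level in range(depth):
        # Filters double at every down-sampling step (standard U-Net encoder).
        x = conv_block(x, base_filters * (2 ** level))
        x = layers.MaxPooling2D(2)(x)
    # Latent representation followed by a fully connected classification head.
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs, name="hu_net_sketch")

model = build_hu_net()
model.summary()
```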
Recently, researchers have investigated distributed and privacy-preserving paradigms to enable scalable learning and data sharing in healthcare. For instance, Selvaraj et al. (2025) [18] proposed a federated learning and Digital Twin-enabled distributed intelligence framework for 6G autonomous transport systems, demonstrating how decentralized architectures can enable collaborative learning without exchanging raw data. While this approach highlights the potential of distributed intelligence for privacy-sensitive domains like medical imaging, it also introduces challenges related to communication overhead, synchronization between clients, and the need for powerful edge hardware.
Recently, Song et al. (2024) [19] presented a framework that employs AI-generated annotations to improve quality in medical image segmentation, addressing the issue of limited and inconsistent manual labeling. Their results demonstrate that synthetic annotation can be effective in improving training quality, especially in the context of large and diverse datasets. However, depending on AI-generated labels can transfer bias and error from the generative model, so proper validation and human oversight are required. Issues of label reliability in medical imaging are also important to consider in our work, as discussed in Section 3.2, where we describe the use of well-curated ISIC datasets and ensure a balanced data distribution using oversampling methods where necessary.
Google created the MobileNet [20] family of lightweight deep neural networks for embedded and mobile vision applications with constrained processing power. Its key contribution is depth-wise separable convolutions, which divide the conventional convolution operation into depth-wise and point-wise convolutions, greatly lowering computational cost and model size while preserving high accuracy. Through the use of width and resolution multipliers, MobileNet offers developers the freedom to scale the model to meet certain hardware limitations [21]. The motivation behind these models is to provide lightweight variants of existing architectures that can be easily deployed in memory-constrained scenarios. Pocket-Net [22], a lightweight version of U-Net with fewer parameters, is designed to address computational and memory constraints while maintaining competitive performance in medical image segmentation and classification tasks. Pocket-Net is typically lighter than MobileNet. By using fewer filters or more efficient layers, Pocket-Net minimizes model size and computational requirements. Light models (e.g., MobileNet and Pocket-Net) are also designed for embedded board implementation, although the limited memory of micro-controller units remains a constraint for hardware deployment. Despite being lighter than other CNN models, MobileNet typically has inferior classification accuracy. Therefore, additional hyperparameters for modifying the model’s size have recently been proposed in an effort to strike a compromise between accuracy and complexity [23]. However, when the number of parameters is greatly increased, models become heavy and unsuitable for devices with limited processing resources.
We aim to reduce computational cost and memory usage by utilizing the Pocket-Net approach and incorporating attention mechanisms, resulting in our proposed Attention Pocket-Net (i.e., APNet). APNet integrates squeeze-and-excitation (SE) blocks into Pocket-Net, enabling the model to focus adaptively on the most relevant features for melanoma detection, further enhancing classification accuracy. This development holds significant promise for researchers who are end-users of neural networks in the domain of dermatology image analysis. The streamlined workflow facilitated by APNet’s resource-efficient architecture not only enhances the model training speed but also enables researchers to handle large and complex datasets more easily.

2. Related Works

Modern deep CNNs are complex and require substantial computational power, especially in the convolutional layers that dominate runtime during prediction. By reducing the size of these networks, for example by pruning entire feature maps, inference times can be made much faster, which is essential for real-time or near-real-time applications [24]. Large neural networks also require large amounts of storage, memory bandwidth, and computational resources. Reducing model size makes it possible to run these models on devices with limited resources, such as smartphones, IoT devices, and other edge computing platforms. This not only reduces demands on computer hardware but also extends battery life in mobile devices by lowering the power required per inference [25]. By interleaving greedy criteria-based pruning with fine-tuning via backpropagation, a balance can be achieved in which the pruned network retains its accuracy while becoming more efficient. Additionally, reducing network complexity can help reduce overfitting, because smaller networks are less prone to capturing noise when training samples are scarce. Deep networks with a large number of parameters have the capability to learn complex features and functions; however, more parameters require more data. On the other hand, shallow networks with few trainable parameters restrict learning capacity and can cause underfitting. Network pruning techniques eliminate less-important connections of the network [26], but numerous conditional operations and additional representations are required to indicate the position of zero or non-zero parameters.
Alongside attempts to design novel deep network architectures, the computational speed and memory requirements for large dermoscopy datasets encourage scholars to develop methods for reducing the number of parameters of these networks. A summary is provided in Table 1. Briefly, these attempts include efficient networks, which are often designed to perform well while using fewer parameters and computations [27,28]; model compression, which aims to reduce the computational requirements of deep networks while maintaining performance [29,30,31,32]; transfer learning and pre-trained models, which leverage the knowledge gained from training a neural network on one task and apply it to a different but related task [33,34,35]; and architectural innovation, which might include modifications to well-known architectures, such as skip connection variations, and pruning. Pruning refers to eliminating less important neurons or connections, thereby reducing the model’s computational complexity and memory footprint [24,25]. Pruning is typically applied to a pre-trained model or after a model has been trained on a large dataset. Such models are typically trained on large, diverse datasets for generic tasks such as image classification or language modeling, which means the model may have learned features or patterns that are not essential for the specific problem at hand.

3. Materials and Methods

3.1. HU-Net to Pocket-Net to APNet

The conventional HU-Net architecture [13], while highly effective, includes a substantial number of parameters, leading to increased memory and GPU requirements [44]. Pocket-Net, named to imply the network’s compactness (small enough to fit in one’s back pocket), instead holds the number of feature maps constant in each convolution layer (in other words, the number of output channels equals the number of input channels for all convolutional layers except the final output layer) and employs a hierarchical, scale-dependent strategy that eliminates the need to double the number of features at each down-sampling step. By keeping the number of feature maps constant, the number of network parameters is severely reduced, resulting in a significant reduction in computational cost. Figure 1 shows how a typical HU-Net structure can be modified into Pocket-Net. Pocket-Net is a compelling alternative in this scenario, as it offers significantly fewer parameters and thereby optimizes computational resources [22].
For many convolutional neural network (CNN) architectures, including popular methods like DenseNet [45], ENet [46], and fully convolutional networks (FCNs) [47], the number of feature maps in each convolution operation is doubled when the image resolution decreases. This strategy aims to compensate for the information loss caused by down-sampling by increasing the number of feature maps. However, traditional methods, such as wavelets and multigrid techniques, have long employed a hierarchical framework for scale-dependent information decomposition. This approach, involving the construction of resolution grids, efficiently preserves information without requiring exponential growth of channels during down-sampling. U-Net already contains this hierarchical, scale-dependent structure, suggesting that doubling the number of feature maps at each convolution operation is unnecessary and that the ability to maintain effective multi-scale information handling is already present.
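To make the effect of this design choice concrete, the following back-of-the-envelope sketch compares the parameter count of the 3 × 3 convolutions in a four-level encoder when the channel width doubles per level versus staying constant; the filter counts are illustrative assumptions, not the exact widths used in HU-Net or Pocket-Net.

```python
def conv_params(c_in, c_out, k=3):
    # Weights plus biases of a single standard k x k convolution.
    return k * k * c_in * c_out + c_out

def encoder_params(widths, c_in=3):
    total = 0
    for w in widths:
        total += conv_params(c_in, w) + conv_params(w, w)  # two convs per level
        c_in = w
    return total

doubling = encoder_params([16, 32, 64, 128])  # channels double at each level
constant = encoder_params([16, 16, 16, 16])   # channels held constant
print(doubling, constant, round(constant / doubling, 3))
```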
While Pocket-Net achieves resource efficiency, APNet takes this further by incorporating squeeze-and-excitation (SE) blocks [48,49,50,51] to enhance performance accuracy. The attention mechanism can be considered as a bias that allocates processing resources towards the most informative components of the input image. These SE blocks allow APNet to focus on critical features within dermoscopy images, improving the model’s classification accuracy over the original Pocket-Net model. To emphasize important feature channels, a channel-specific descriptor obtained by global average pooling is used to account for the spatial dependency.
Let the input image be represented as $X \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the height and width and $C$ is the number of channels (e.g., $C = 1$ for greyscale images). In APNet, we use depth-wise separable convolutions (Figure 2) to reduce the number of parameters and the computational load of each convolutional layer. For a given layer with input feature maps $X$ and convolutional filter $F$, a depth-wise separable convolution can be represented as two consecutive operations:
Depth-wise Convolution: Each channel in $X$ is convolved separately with a kernel of size $k \times k$, producing intermediate feature maps $X_d \in \mathbb{R}^{H \times W \times C}$:

$X_d = X \ast F_d$  (1)

where $F_d \in \mathbb{R}^{k \times k \times C}$ is the depth-wise filter applied to each channel individually.
Point-wise Convolution: A $1 \times 1$ convolution combines the depth-wise features across channels, producing the final output $X_p \in \mathbb{R}^{H \times W \times D}$, where $D$ is the number of output channels:

$X_p = X_d \ast F_p$  (2)

where $F_p \in \mathbb{R}^{1 \times 1 \times C \times D}$.
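A minimal Keras sketch of this two-step operation (Equations (1) and (2)) is shown below; the kernel size and channel counts are illustrative assumptions rather than the exact layer configuration of APNet.

```python
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_conv(x, out_channels, kernel_size=3):
    # Depth-wise step: one k x k kernel per input channel (Equation (1)).
    x = layers.DepthwiseConv2D(kernel_size, padding="same", activation="relu")(x)
    # Point-wise step: a 1 x 1 convolution mixes information across channels
    # and sets the number of output channels D (Equation (2)).
    x = layers.Conv2D(out_channels, 1, padding="same", activation="relu")(x)
    return x

# Example: a 256 x 256 RGB dermoscopy image mapped to 16 output channels.
inputs = layers.Input(shape=(256, 256, 3))
features = depthwise_separable_conv(inputs, out_channels=16)
```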
This depth-wise separable convolutional structure significantly reduces the parameters and computation compared to standard convolutions. To enhance feature selection, APNet integrates SE blocks in specific layers. The SE block first “squeezes” spatial information and then “excites”, or re-weights, each channel. For an input feature map $X \in \mathbb{R}^{H \times W \times C}$:
Squeeze Operation: Global average pooling (GAP) is applied to each channel, reducing $X$ to a vector $z \in \mathbb{R}^{C}$, where each element $z_c$ represents the average activation of channel $c$:

$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{i,j,c}$  (3)
Excitation Operation: The channel descriptor $z$ is passed through two fully connected (dense) layers with weights $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, where $r$ is the reduction ratio. These layers learn the channel dependencies, producing a reweighting vector $s \in \mathbb{R}^{C}$:

$s = \sigma(W_2 \, \delta(W_1 z))$  (4)

where $\delta$ is the ReLU activation function and $\sigma$ is the sigmoid function. The output vector $s$ contains weights for each channel that are then used to rescale $X$.
Scaling: Finally, each channel in $X$ is reweighted by $s$, yielding the output $X_{SE}$:

$X_{SE} = X \cdot s$  (5)

where the dot operator ($\cdot$) denotes channel-wise multiplication.
In the above formulation, all variables are dimensionally consistent with the SE literature [48]. The feature maps $X \in \mathbb{R}^{H \times W \times C}$ are unitless activations derived from convolutional operations. The global average pooling (Equation (3)) produces a dimensionless channel descriptor $z \in \mathbb{R}^{C}$, where each element represents the normalized average response of one channel. The reduction ratio $r$ is a scalar hyperparameter that controls the bottleneck dimension in the excitation operation. Following the SE formalism, the mapping $s = \sigma(W_2 \, \delta(W_1 z))$ (Equation (4)) yields a reweighting vector $s \in \mathbb{R}^{C}$, where each element $s_c \in [0, 1]$ serves as a multiplicative coefficient applied to the corresponding feature channel in Equation (5).
The reduction ratio $r$ in the squeeze-and-excitation (SE) block controls the dimensionality of the bottleneck applied to the channel descriptor in the excitation step. Specifically, the number of neurons in the first fully connected (FC) layer is reduced from $C$ (the number of input channels) to $C/r$, allowing the network to learn inter-channel dependencies while significantly lowering computational cost and overfitting risk. As in Equation (4), $s = \sigma(W_2 \, \delta(W_1 z))$ with $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, where $\delta$ is the ReLU activation function and $\sigma$ is the sigmoid function. The reduction ratio was kept fixed at $r = 16$, a commonly used value in the original SE-Net paper and subsequent lightweight CNN studies [52,53].
This choice was motivated by the need to balance two factors: (1) avoiding an excessive increase in parameters that would undermine the lightweight design philosophy of APNet, and (2) retaining enough capacity in the excitation step to model meaningful channel interdependencies.
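The sketch below shows one way such an SE block can be written in Keras, following Equations (3)–(5) with the reduction ratio fixed at r = 16; it is a minimal illustration rather than the exact APNet implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction_ratio=16):
    channels = x.shape[-1]
    # Squeeze: global average pooling yields one descriptor per channel (Eq. (3)).
    z = layers.GlobalAveragePooling2D()(x)
    # Excitation: bottleneck FC layers (C -> C/r -> C) learn channel weights (Eq. (4)).
    s = layers.Dense(channels // reduction_ratio, activation="relu")(z)
    s = layers.Dense(channels, activation="sigmoid")(s)
    # Scale: reweight each channel of the input feature map (Eq. (5)).
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])

# Example: apply the SE block to a 64 x 64 feature map with 16 channels.
x = layers.Input(shape=(64, 64, 16))
y = se_block(x)
```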

3.2. Datasets

We have tested the models separately on several widely used dermoscopy image datasets provided by The International Skin Imaging Collaboration (ISIC). These datasets include various types of skin diseases. The decision to analyze images from each dataset separately stems from the fact that subsequent datasets include prior images, so it is not possible to train on one dataset and use a different dataset for testing. In addition, each dataset poses unique characteristics and challenges. The ISIC datasets represent a spectrum of dermatological cases, showcasing variations in skin conditions, lighting conditions, and imaging technologies. By individually examining these datasets, we aim to capture the nuances intrinsic to each set, allowing for a more thorough evaluation of HU-Net, Pocket-Net, and the proposed model (i.e., APNet). This approach not only provides a robust assessment of the models’ generalizability across different datasets but also enables insights into their adaptability to varying data distributions and characteristics commonly encountered in real-world dermatology practices.
Class imbalance, where one class significantly outnumbers the others, can lead to poor generalization and performance degradation, as the model may prioritize accuracy on the majority class at the expense of minority classes. Such a skewed class distribution leads to biased model training, poor generalization, and misleading evaluation metrics. Various techniques can be employed to address this problem, such as oversampling the minority class or undersampling the majority class. Based on our experience, we employed oversampling of the minority class (melanoma) in the training stage: synthetic samples are generated or existing ones replicated until the class distribution is more balanced, ensuring that the model receives sufficient exposure to the minority class during training and preventing it from being overlooked or underrepresented. Detailed information on the classification datasets after augmentation and resizing is given in Table 2. Examples of dermoscopy images provided by The International Skin Imaging Collaboration (ISIC) are shown in Figure 3. ISIC 2017 [54] consists of melanoma and benign nevi images prepared for three tasks: lesion segmentation, dermoscopic feature classification, and disease classification. The ISIC 2018 [55] images were collected over 20 years from the Department of Dermatology at the Medical University of Vienna, Austria, and the skin cancer practice of Cliff Rosendahl in Queensland, Australia. Images taken before the introduction of digital cameras were digitized with a Nikon Coolscan 5000 ED scanner at 300 DPI and 15 × 10 cm size. These images were then manually cropped to 800 × 600 pixels at 72 DPI with the lesion centered, and histogram corrections were applied to enhance image quality. The remainder of the data was collected at the Medical University of Vienna with the MoleMax HD digital dermoscopy system. The dataset includes skin lesions of melanocytic nevi, melanoma, benign keratosis-like lesions, basal cell carcinoma, actinic keratoses, vascular lesions, and dermatofibroma, providing a diverse set of conditions for training and evaluating skin lesion classification models. ISIC 2019 [56] was collected between 2010 and 2016 by the Department of Dermatology at the Hospital Clínic de Barcelona from their patients. Images were captured with high-resolution cameras and filtered for better visualization; finally, they were manually revised to check the corresponding diagnoses against a reference table. The data can be categorized into nevus, melanoma, basal cell carcinoma, seborrheic keratosis, actinic keratosis, squamous cell carcinoma, dermatofibroma, vascular lesion, and “other” (lesions not contained in any of the other categories). ISIC 2020 [57] images were collected from six different centers over 22 years (1998–2020) with or without polarized light, in contact or non-contact dermoscope conditions. Notably, polarized light enables a clearer visualization of deeper skin structures, even without direct skin contact with the imaging interface or the use of a liquid interface.
For some of the images, diagnosis labeling and internal biopsies were reviewed; for the others, the lesions were monitored over six months and considered benign, without further diagnostic granularity, if they remained stable [58]. These images were categorized into nevi, atypical melanocytic proliferation, café-au-lait macule, lentigo NOS, lentigo simplex, solar lentigo, lichenoid keratosis, and seborrheic keratosis.
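As described above, class imbalance was handled by oversampling the minority (melanoma) class during training. The sketch below illustrates one simple oversampling scheme based on random replication; the array names and label convention are placeholders rather than the actual ISIC data loaders, and it assumes the minority class is smaller than the majority class.

```python
import numpy as np

def oversample_minority(images, labels, minority_label=1, seed=0):
    rng = np.random.default_rng(seed)
    minority_idx = np.where(labels == minority_label)[0]
    majority_idx = np.where(labels != minority_label)[0]
    # Replicate minority examples (with replacement) until both classes match.
    extra = rng.choice(minority_idx,
                       size=len(majority_idx) - len(minority_idx),
                       replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    rng.shuffle(keep)
    return images[keep], labels[keep]

# Toy example: 6 benign (0) and 2 melanoma (1) samples become 6 and 6.
images = np.zeros((8, 256, 256, 3), dtype="float32")
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])
balanced_images, balanced_labels = oversample_minority(images, labels)
```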
All procedures were implemented on a single NVIDIA GeForce RTX 4090 GPU with 64 GB RAM, using the TensorFlow 2.10 framework. We used the Adam optimizer with a learning rate of 0.001, a categorical cross-entropy loss function, a batch size of 32, and 1000 epochs for all datasets.
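A minimal sketch of this training configuration is shown below; the placeholder model and data stand in for APNet and a prepared ISIC split, and the epoch count is reduced purely for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Placeholder model and data standing in for APNet and an ISIC training split.
model = models.Sequential([
    layers.Input(shape=(256, 256, 3)),
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),
])
x_train = np.zeros((8, 256, 256, 3), dtype="float32")
y_train = tf.keras.utils.to_categorical(np.array([0, 1] * 4), num_classes=2)

# Settings described above: Adam (lr = 0.001), categorical cross-entropy,
# batch size 32; epochs reduced from 1000 to 1 purely for illustration.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=1)
```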

4. Results

Using the implementation details mentioned above, we report performance metrics of the HU-Net, Pocket-Net, and APNet architectures on the ISIC challenge classification task datasets from 2017 to 2020. Note that we performed all experiments with the same GPU and API. The metrics used for evaluation include accuracy, sensitivity, specificity, the area under the receiver operating characteristic curve (AUC), the ROC curve, and the confusion matrix. Overall, while all architectures demonstrated high levels of performance, HU-Net consistently achieved slightly higher accuracy scores on most datasets. All architectures had reasonably similar sensitivity, correctly predicting the presence of melanoma 96% of the time or better (although HU-Net’s sensitivity for 2019 was closer to 95%). The difference was in specificity: Pocket-Net had about double the number of false positives on each dataset (somewhat more for 2020), meaning that while HU-Net specificity ranged from 92.2 to 99.9%, Pocket-Net specificity ranged from 85.7 to 98.7% (both are far better than any individual dermatologist). APNet enhanced the specificity of Pocket-Net, achieving values ranging from approximately 88% to 99.2%. The performance of HU-Net, Pocket-Net, and APNet, as well as some other complexity reduction methods including MobileNet [20], 3D-DenseUNet-569 [37], and Alex-Net weight pruning [59], in terms of number of parameters, accuracy, sensitivity, specificity, and AUC on the four datasets is reported in Table 3. To assess the impact of minority class oversampling, we conducted an additional experiment in which APNet was trained without any oversampling during the training stage, while keeping all other parameters identical. The results, presented in Table S1 (Supplementary Section), demonstrate a noticeable degradation in performance, particularly in sensitivity and balanced accuracy, when oversampling was omitted. This confirms that minority class oversampling plays a critical role in improving the model’s ability to correctly identify melanoma cases and achieve a more balanced performance across classes.
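For reference, the sketch below shows how the sensitivity, specificity, and accuracy reported above are derived from a binary confusion matrix; the label vectors are made-up toy values, not results from the paper.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (1 = melanoma, 0 = benign); not results from the paper.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # true positive rate (melanoma detected)
specificity = tn / (tn + fp)   # true negative rate (benign correctly ruled out)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, accuracy)
```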
To provide a visual interpretation of the suggested network and a comparison with the former architectures (e.g., HU-Net and Pocket-Net), we applied Gradient-weighted Class Activation Mapping (Grad-CAM) to highlight the regions that most strongly influence the model’s predictions in the classification task, allowing for an intuitive comparison of feature attention between classes. As shown in Figure 4, the heatmaps reveal that APNet effectively compensates for the weaknesses of Pocket-Net and focuses on critical lesion regions, capturing discriminative patterns in the images.
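A minimal Grad-CAM sketch in TensorFlow is given below; the model and layer name are placeholders, and the exact implementation used to generate Figure 4 is not specified here.

```python
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index):
    # Map the input to the last convolutional feature maps and the predictions.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)        # d(score) / d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))        # average over spatial dims
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                            # keep positive evidence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # normalized heatmap
```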
The network parameters are reduced from 1,304,866 (HU-Net) to 70,114 (Pocket-Net) and 70,762 (APNet), so both Pocket-Net and APNet have far fewer parameters and thereby require less computation during both training and inference. This can lead to faster training times and lower inference latency, which is beneficial for real-time applications and resource-constrained environments. As shown in Figure 5, these models also consume less memory, which is crucial for deployment on devices with limited memory resources, such as mobile phones and micro-controllers. However, reducing the number of parameters led to a loss of performance in Pocket-Net, limiting its ability to capture fine-grained features in the data. Figure 5a compares the training speed and memory usage of the original HU-Net structure, Pocket-Net, and APNet for different batch sizes. Pocket-Net sped up training on the ISIC 2017, 2018, 2019, and 2020 datasets by 40% on average. Increasing the batch size is a straightforward way to boost GPU speed, but is not always successful. Since batches are computed in parallel on the GPU, the batch size can be increased as long as there is enough GPU memory. Figure 5b demonstrates the effect of batch size and architecture on memory usage.
Deep learning frameworks and GPUs tend to allocate GPU memory up front, especially during model initialization and graph construction. This pre-allocation ensures efficient memory utilization and avoids potential memory fragmentation issues during training. However, this behavior can sometimes lead to seemingly excessive memory consumption, particularly when working with large datasets or models. The HU-Net model, with over a million parameters, involves more complex operations and layers that require larger memory buffers and additional memory overhead during computation. Therefore, a model with a larger number of parameters tends to run out of memory.
To provide a hardware-independent measure of computational cost, we calculated the number of floating-point operations (FLOPs) for each model using TensorFlow’s built-in profiler. The FLOPs metric quantifies the total number of arithmetic operations required for a single forward pass and offers a more generalizable comparison than GPU memory usage, which can vary across hardware and framework configurations. Using an input size of 256 × 256 × 3, the total FLOPs were approximately 1.95 GFLOPs for HU-Net, 0.68 GFLOPs for Pocket-Net, and 0.42 GFLOPs for the proposed APNet. These results demonstrate that APNet achieves a substantial computational reduction of nearly 78% compared to HU-Net while maintaining competitive classification performance, confirming its suitability for deployment in resource-constrained or real-time clinical applications.

The ROC curves shown in Figure 6 demonstrate that HU-Net significantly outperforms Pocket-Net, while APNet shows the capability to improve the performance of the light model (i.e., Pocket-Net), highlighting the importance of incorporating attention mechanisms for improved feature prioritization and class discrimination. For Pocket-Net, the smoother rise of the ROC curve near the origin for ISIC 2017–2019 indicates lower sensitivity at low false positive rates compared to HU-Net. APNet compensated for the degradation of Pocket-Net by sharpening the curves, showing a more pronounced bow compared to Pocket-Net, reflecting its higher AUC and improved capability for distinguishing between classes.
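Returning to the FLOPs measurement mentioned above, the sketch below shows one common TensorFlow 2.x recipe for obtaining a FLOPs estimate of a Keras model with the built-in profiler; the paper does not detail its exact profiler call, so this is an assumed, illustrative approach.

```python
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import (
    convert_variables_to_constants_v2)
from tensorflow.python.profiler.model_analyzer import profile
from tensorflow.python.profiler.option_builder import ProfileOptionBuilder

def count_flops(model, input_shape=(1, 256, 256, 3)):
    # Trace the forward pass, freeze the variables, and profile the frozen graph.
    concrete = tf.function(lambda x: model(x)).get_concrete_function(
        tf.TensorSpec(input_shape, tf.float32))
    frozen = convert_variables_to_constants_v2(concrete)
    opts = ProfileOptionBuilder.float_operation()
    info = profile(graph=frozen.graph, options=opts)
    return info.total_float_ops  # multiplies and adds are counted separately
```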
The confusion matrices in Figure 7 demonstrate that APNet effectively addressed misclassification issues observed in the benign class with Pocket-Net for ISIC 2017–2019, highlighting its ability to enhance feature representation and improve class differentiation.

5. Discussion

The proposed APNet architecture demonstrates a promising balance between efficiency and performance in dermoscopic image classification. By integrating squeeze-and-excitation (SE) blocks into a lightweight Pocket-Net backbone, APNet enhances channel-wise feature recalibration, allowing the model to focus on more discriminative regions while maintaining a drastically reduced number of trainable parameters. This design enables faster inference, lower memory usage, and reduced computational demand—making APNet particularly suitable for deployment in clinical environments with limited hardware resources or real-time diagnostic applications.
Compared to traditional U-Net-derived architectures, APNet achieves comparable or superior classification accuracy, with a 94.6% reduction in parameters and approximately 40% less computational time. These results underscore the potential of attention-integrated lightweight models to deliver high diagnostic reliability without the overhead of large-scale networks. Furthermore, APNet’s ability to generalize across multiple ISIC datasets indicates robustness against inter-dataset variability, a key challenge in dermatological image analysis. As our goal was to demonstrate the generalizability of a single lightweight configuration rather than tailoring hyperparameters per dataset, we did not perform dataset-specific tuning across the ISIC 2017–2020 datasets.
However, some limitations should be acknowledged. First, despite its efficiency, APNet exhibited a slight drop in specificity compared to the full HU-Net, likely due to reduced feature representation capacity in deeper layers. Second, while the SE block improves feature weighting, it primarily captures channel dependencies and may not fully exploit spatial relationships—an aspect where hybrid attention mechanisms (e.g., CBAM or ECA) could offer further improvement. Additionally, the current evaluation focused on publicly available ISIC datasets; real-world deployment would benefit from testing on diverse, clinically acquired data to validate generalizability under different imaging conditions and device settings.
Overall, APNet provides an effective trade-off between model complexity and diagnostic accuracy. Its compact design offers a practical foundation for future extensions, such as integrating hybrid attention modules, self-supervised pre-training, or federated learning frameworks to further enhance reliability and scalability in real-world dermatology applications.
With the help of advanced 3D imaging modalities such as optical coherence tomography and image processing we hope to improve the performance of dermoscopy for melanoma detection [60,61,62,63,64].

6. Conclusions

While deep learning models often deliver impressive results on problems such as classification and segmentation, they come with several drawbacks, including computational complexity and large memory requirements for big datasets due to the massive number of trainable network parameters. Hence, parameter reduction can mitigate these disadvantages, improve efficiency, make it possible to work with big datasets, and enable embedding these complex models into portable devices. In this study we investigated the effects of network size reduction methods on trainable parameters and compared the computational costs of HU-Net, Pocket-Net, and our proposed model, APNet, for analyzing dermatology image data. The pursuit of practical and resource-efficient solutions for accurate diagnosis is of paramount importance, especially in clinical settings. This study addressed some of the computational challenges presented by the traditional HU-Net and other network size reduction methods for classification of dermatology images while maintaining their performance. APNet, with its integration of SE blocks, demonstrated enhanced feature focus and accuracy, with a 94.6% reduction in parameters and approximately 40% reduction in computational time compared to HU-Net. This result indicates that a high number of parameters may not be necessary for dermatology image classification. APNet’s performance across multiple dermoscopic datasets further confirms that the reduced architecture can effectively capture critical features. While Pocket-Net demonstrated this to some extent, it exhibited limitations in performance that APNet successfully addressed, ensuring both efficiency and accuracy are maintained. Specifically, APNet achieved around 1% classification accuracy improvement for ISIC 2017–2019 and 0.4% for ISIC 2020 compared to Pocket-Net. Meanwhile, Pocket-Net showed a dramatic specificity degradation of around 6.5% for the ISIC 2019 dataset; APNet compensates for this issue with a 4.3% sensitivity loss. The same can be seen for ISIC 2018, where APNet reduced the 2.77% specificity degradation of Pocket-Net to 1.13% and achieved a slight improvement (around 0.5%) for ISIC 2017 and ISIC 2020. The attention mechanism in APNet enhances feature selection, contributing to improved classification accuracy while preserving efficiency.
It is important to note that while APNet achieved substantial gains in efficiency and overall accuracy, HU-Net occasionally maintained higher specificity due to its larger representational capacity. This highlights the trade-off between model complexity and performance: APNet is best suited for resource-limited or real-time clinical settings where computational cost and memory usage are critical, whereas HU-Net may still be preferable when maximum specificity is required and computational resources are not a constraint.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics14214281/s1, Table S1: Performance of the proposed APNet architecture across different ISIC challenge datasets. The table summarizes the class imbalance ratio, accuracy, sensitivity, specificity, and AUC for each dataset. The results demonstrate that data augmentation plays a crucial role in improving model generalization and handling severe class imbalance.

Author Contributions

Conceptualization, M.J.B. and K.A.; methodology, M.J.B.; software, M.J.B. and R.M.; validation, M.J.B., M.T. and K.A.; formal analysis, M.J.B.; investigation, M.J.B.; resources, F.G., R.M. and K.A.; data curation, M.J.B. and R.M.; writing—original draft preparation, M.J.B. and K.A.; writing—review and editing, F.G., R.M., M.T. and K.A.; visualization, M.J.B. and R.M.; supervision, K.A.; funding acquisition, K.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Melanoma Research Alliance (grant number 624320).

Data Availability Statement

Dermoscopy image datasets were obtained from The International Skin Imaging Collaboration (ISIC).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rogers, H.W.; Weinstock, M.A.; Feldman, S.R.; Coldiron, B.M. Incidence estimate of nonmelanoma skin cancer (keratinocyte carcinomas) in the US population, 2012. JAMA Dermatol. 2015, 151, 1081–1086. [Google Scholar] [CrossRef] [PubMed]
  2. Siegel, R.L.; Miller, K.D.; Jemal, A. Cancer statistics, 2018. CA—Cancer J. Clin. 2018, 68, 7–30. [Google Scholar] [CrossRef] [PubMed]
  3. Xu, Q. The Advanced Applications for Optical Coherence Tomography in Skin Imaging. Ph.D. Thesis, Wayne State University, Detroit, Michigan, 2021. [Google Scholar]
  4. Eathara, A.; Siegel, A.P.; Tsoukas, M.M.; Avanaki, K. Applications of Optical Coherence Tomography (OCT) in Preclinical Practice. In Bioimaging Modalities in Bioengineering; Springer: Berlin/Heidelberg, Germany, 2025; pp. 61–80. [Google Scholar]
  5. Benavides-Lara, J.; Siegel, A.P.; Tsoukas, M.M.; Avanaki, K. High-frequency photoacoustic and ultrasound imaging for skin evaluation: Pilot study for the assessment of a chemical burn. J. Biophotonics 2024, 17, e202300460. [Google Scholar] [CrossRef] [PubMed]
  6. Avanaki, M.R.; Hojjat, A.; Podoleanu, A.G. Investigation of computer-based skin cancer detection using optical coherence tomography. J. Mod. Opt. 2009, 56, 1536–1544. [Google Scholar] [CrossRef]
  7. O’leary, S.; Fotouhi, A.; Turk, D.; Sriranga, P.; Rajabi-Estarabadi, A.; Nouri, K.; Daveluy, S.; Mehregan, D.; Nasiriavanaki, M. OCT image atlas of healthy skin on sun-exposed areas. Ski. Res. Technol. 2018, 24, 570–586. [Google Scholar] [CrossRef]
  8. Adabi, S.; Turani, Z.; Fatemizadeh, E.; Clayton, A.; Nasiriavanaki, M. Optical coherence tomography technology and quality improvement methods for optical coherence tomography images of skin: A short review. Biomed. Eng. Comput. Biol. 2017, 8, 1179597217713475. [Google Scholar] [CrossRef]
  9. Adabi, S.; Fotouhi, A.; Xu, Q.; Daveluy, S.; Mehregan, D.; Podoleanu, A.; Nasiriavanaki, M. An overview of methods to mitigate artifacts in optical coherence tomography imaging of the skin. Ski. Res. Technol. 2018, 24, 265–273. [Google Scholar] [CrossRef]
  10. Horton, L.; Fakhoury, J.W.; Manwar, R.; Rajabi-Estarabadi, A.; Turk, D.; O’Leary, S.; Fotouhi, A.; Daveluy, S.; Jain, M.; Nouri, K. Review of Non-Invasive Imaging Technologies for Cutaneous Melanoma. Biosensors 2025, 15, 297. [Google Scholar] [CrossRef]
  11. Akella, S.S.; Lee, J.; May, J.R.; Puyana, C.; Kravets, S.; Dimitropolous, V.; Tsoukas, M.; Manwar, R.; Avanaki, K. Using optical coherence tomography to optimize Mohs micrographic surgery. Sci. Rep. 2024, 14, 8900. [Google Scholar] [CrossRef]
  12. Gachon, J.; Beaulieu, P.; Sei, J.F.; Gouvernet, J.; Claudel, J.P.; Lemaitre, M.; Richard, M.A.; Grob, J.J. First prospective study of the recognition process of melanoma in dermatological practice. Arch. Dermatol. 2005, 141, 434–438. [Google Scholar] [CrossRef]
  13. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  14. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; pp. 3–11. [Google Scholar]
  15. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar] [CrossRef]
  16. Li, K.; Fathan, M.I.; Patel, K.; Zhang, T.; Zhong, C.; Bansal, A.; Rastogi, A.; Wang, J.S.; Wang, G. Colonoscopy polyp detection and classification: Dataset creation and comparative evaluations. PLoS ONE 2021, 16, e0255809. [Google Scholar] [CrossRef]
  17. Thanoon, M.A.; Zulkifley, M.A.; Mohd Zainuri, M.A.A.; Abdani, S.R. A Review of Deep Learning Techniques for Lung Cancer Screening and Diagnosis Based on CT Images. Diagnostics 2023, 13, 2617. [Google Scholar] [CrossRef]
  18. Selvaraj, A.K.; Govindarajan, Y.; Prathiba, S.B.; Vinod, A.A.; Ganesan, V.P.A.; Zhu, Z.; Gadekallu, T.R. Federated Learning and Digital Twin-Enabled Distributed Intelligence Framework for 6G Autonomous Transport Systems. IEEE Trans. Intell. Transp. Syst. 2025, 26, 18214–18224. [Google Scholar] [CrossRef]
  19. Song, Y.; Liu, Y.; Lin, Z.; Zhou, J.; Li, D.; Zhou, T.; Leung, M.-F. Learning from AI-generated annotations for medical image segmentation. IEEE Trans. Consum. Electron. 2024, 71, 1473–1481. [Google Scholar] [CrossRef]
  20. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  21. Sinha, D.; El-Sharkawy, M. Thin MobileNet: An Enhanced MobileNet Architecture. In Proceedings of the 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 10–12 October 2019; pp. 280–285. [Google Scholar]
  22. Celaya, A.; Actor, J.A.; Muthusivarajan, R.; Gates, E.; Chung, C.; Schellingerhout, D.; Riviere, B.; Fuentes, D. Pocketnet: A smaller neural network for medical image analysis. IEEE Trans. Med. Imaging 2022, 42, 1172–1184. [Google Scholar] [CrossRef]
  23. Kim, C.Y.; Um, K.S.; Heo, S.W. A novel MobileNet with selective depth multiplier to compromise complexity and accuracy. ETRI J. 2022, 45, 666–677. [Google Scholar] [CrossRef]
  24. LeCun, Y.; Denker, J.; Solla, S. Optimal brain damage. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 27–30 November 1989; Volume 2. [Google Scholar]
  25. Hagerty, J.R.; Stanley, R.J.; Almubarak, H.A.; Lama, N.; Kasmi, R.; Guo, P.; Drugge, R.J.; Rabinovitz, H.S.; Oliviero, M.; Stoecker, W.V. Deep learning and handcrafted method fusion: Higher diagnostic accuracy for melanoma dermoscopy images. IEEE J. Biomed. Health Inform. 2019, 23, 1385–1391. [Google Scholar] [CrossRef]
  26. Anwar, S.; Hwang, K.; Sung, W. Structured Pruning of Deep Convolutional Neural Networks. J. Emerg. Technol. Comput. Syst. 2017, 13, 32. [Google Scholar] [CrossRef]
  27. Alfed, N.; Khelifi, F.; Bouridane, A.; Seker, H. Pigment network-based skin cancer detection. In Proceedings of the 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 25–29 August 2015; pp. 7214–7217. [Google Scholar]
  28. Chaturvedi, S.S.; Gupta, K.; Prasad, P.S. Skin lesion analyser: An efficient seven-way multi-class skin cancer classification using MobileNet. In Proceedings of the Advanced Machine Learning Technologies and Applications: Proceedings of AMLTA 2020; Springer: Singapore, 2021; pp. 165–176. [Google Scholar]
  29. Ye, R.; Liu, F.; Zhang, L. 3d depthwise convolution: Reducing model parameters in 3d vision tasks. In Proceedings of the Advances in Artificial Intelligence: 32nd Canadian Conference on Artificial Intelligence, Canadian AI 2019, Kingston, ON, Canada, 28–31 May, 2019; pp. 186–199. [Google Scholar]
  30. van der Putten, J.; van der Sommen, F. Influence of decoder size for binary segmentation tasks in medical imaging. In Proceedings of the Medical Imaging 2020: Image Processing, Houston, TX, USA, 15–20 February 2020; pp. 276–281. [Google Scholar]
  31. Venugopal, V.; Raj, N.I.; Nath, M.K.; Stephen, N. A deep neural network using modified EfficientNet for skin cancer detection in dermoscopic images. Decis. Anal. J. 2023, 8, 100278. [Google Scholar] [CrossRef]
  32. Ahmad, N.; Shah, J.H.; Khan, M.A.; Baili, J.; Ansari, G.J.; Tariq, U.; Kim, Y.J.; Cha, J.-H. A novel framework of multiclass skin lesion recognition from dermoscopic images using deep learning and explainable AI. Front. Oncol. 2023, 13, 1151257. [Google Scholar] [CrossRef]
  33. Hosny, K.M.; Kassem, M.A.; Foaud, M.M. Classification of skin lesions using transfer learning and augmentation with Alex-net. PLoS ONE 2019, 14, e0217293. [Google Scholar] [CrossRef]
  34. Younis, H.; Bhatti, M.H.; Azeem, M. Classification of skin cancer dermoscopy images using transfer learning. In Proceedings of the 2019 15th International Conference on Emerging Technologies (ICET), Peshawar, Pakistan, 2–3 December 2019; pp. 1–4. [Google Scholar]
  35. Kumar, V.; Sinha, B.B. Skin Cancer Classification for Dermoscopy Images Using Model Based on Deep Learning and Transfer Learning. In Computational Intelligence and Data Analytics: Proceedings of ICCIDA 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 257–271. [Google Scholar]
  36. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. arXiv 2015, arXiv:1510.00149. [Google Scholar]
  37. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Inference. arXiv 2016, arXiv:1611.06440. [Google Scholar]
  38. Hinton, G.E.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  39. Alalwan, N.; Abozeid, A.; ElHabshy, A.A.; Alzahrani, A. Efficient 3D deep learning model for medical image semantic segmentation. Alex. Eng. J. 2021, 60, 1231–1239. [Google Scholar] [CrossRef]
  40. Iandola, F.N.; Moskewicz, M.W.; Ashraf, K.; Han, S.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  41. Denil, M.; Shakibi, B.; Dinh, L.; Ranzato, M.A.; Freitas, N.d. Predicting Parameters in Deep Learning. In Proceedings of the Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013. [Google Scholar]
  42. Jaderberg, M.; Vedaldi, A.; Zisserman, A. Speeding up Convolutional Neural Networks with Low Rank Expansions. arXiv 2014, arXiv:1405.3866. [Google Scholar] [CrossRef]
  43. Zoph, B.; Le, Q.V. Neural Architecture Search with Reinforcement Learning. arXiv 2016, arXiv:1611.01578. [Google Scholar]
  44. Weng, Y.; Zhou, T.; Li, Y.; Qiu, X. Nas-unet: Neural architecture search for medical image segmentation. IEEE Access 2019, 7, 44247–44257. [Google Scholar] [CrossRef]
  45. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  46. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar] [CrossRef]
  47. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  48. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2017; pp. 7132–7141. [Google Scholar]
  49. Wang, Q.; Wu, B.; Zhu, P.F.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2019; pp. 11531–11539. [Google Scholar]
  50. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.-S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
  51. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  52. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
  53. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  54. Codella, N.C.; Gutman, D.; Celebi, M.E.; Helba, B.; Marchetti, M.A.; Dusza, S.W.; Kalloo, A.; Liopyris, K.; Mishra, N.; Kittler, H. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). In Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA, 4–7 April 2018; pp. 168–172. [Google Scholar]
  55. Tschandl, P.; Rosendahl, C.; Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 2018, 5, 180161. [Google Scholar] [CrossRef]
  56. Combalia, M.; Codella, N.C.; Rotemberg, V.; Helba, B.; Vilaplana, V.; Reiter, O.; Carrera, C.; Barreiro, A.; Halpern, A.C.; Puig, S. Bcn20000: Dermoscopic lesions in the wild. arXiv 2019, arXiv:1908.02288. [Google Scholar] [CrossRef]
  57. Rotemberg, V.; Kurtansky, N.; Betz-Stablein, B.; Caffery, L.; Chousakos, E.; Codella, N.; Combalia, M.; Dusza, S.; Guitera, P.; Gutman, D. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Sci. Data 2021, 8, 34. [Google Scholar] [CrossRef]
  58. Moscarella, E.; Tion, I.; Zalaudek, I.; Lallas, A.; Athanassios, K.; Longo, C.; Lombardi, M.; Raucci, M.; Satta, R.; Alfano, R.; et al. Both short-term and long-term dermoscopy monitoring is useful in detecting melanoma in patients with multiple atypical nevi. J. Eur. Acad. Dermatol. Venereol. 2017, 31, 247–251. [Google Scholar] [CrossRef]
  59. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. Adv. Neural Inf. Process. Syst. 2015, 28, 1135–1143. [Google Scholar]
  60. Adabi, S.; Conforto, S.; Clayton, A.; Podoleanu, A.G.; Hojjat, A.; Avanaki, M.R. An intelligent speckle reduction algorithm for optical coherence tomography images. In Proceedings of the 2016 4th International Conference on Photonics, Optics and Laser Technology (PHOTOPTICS), Rome, Italy, 27–29 February 2016; pp. 1–6. [Google Scholar]
  61. Adabi, S.; Rashedi, E.; Clayton, A.; Mohebbi-Kalkhoran, H.; Chen, X.-w.; Conforto, S.; Avanaki, M.N. Learnable despeckling framework for optical coherence tomography images. J. Biomed. Opt. 2018, 23, 016013. [Google Scholar] [CrossRef]
  62. Xu, Q.; Jalilian, E.; Fakhoury, J.W.; Manwar, R.; Michniak-Kohn, B.; Elkin, K.B.; Avanaki, K. Monitoring the topical delivery of ultrasmall gold nanoparticles using optical coherence tomography. Ski. Res. Technol. 2020, 26, 263–268. [Google Scholar] [CrossRef]
  63. Avanaki, M.R.; Hojjatoleslami, A. Skin layer detection of optical coherence tomography images. Optik 2013, 124, 5665–5668. [Google Scholar] [CrossRef]
  64. Zafar, M.; Manwar, R.; Avanaki, K. Miniaturized preamplifier integration in ultrasound transducer design for enhanced photoacoustic imaging. Opt. Lett. 2024, 49, 3054–3057. [Google Scholar] [CrossRef]
Figure 1. Comparison of Half U-Net and Pocket-Net architectures. (a) In the half U-Net architecture, the number of channels (feature maps) doubles every down-sampling step. (b) In the Pocket-Net architecture, this stays the same in each step of down-sampling.
Figure 2. APNet architecture and attention modules. (a) Squeeze-and-excitation (SE) blocks help Pocket-Net focus on critical features. (b) APNet combines depth-wise separable convolutions with SE blocks for efficient and accurate melanoma detection, reducing model complexity while enhancing feature focus in dermoscopy images.
Figure 3. Examples of dermoscopy images from the datasets provided by The International Skin Imaging Collaboration (ISIC): (a) ISIC 2017, (b) ISIC 2018, (c) ISIC 2019, (d) ISIC 2020.
Figure 4. Grad-CAM visualization of APNet predictions on dermoscopic images. The highlighted regions indicate areas of the lesion that contributed most to the model’s classification. (a) ISIC 2017, (b) ISIC 2018, (c) ISIC 2019, (d) ISIC 2020.
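A Grad-CAM routine of the kind used to produce Figure 4 can be written with forward and backward hooks. The sketch below is illustrative rather than the paper's exact visualization code; it assumes a classifier that returns class logits of shape (batch, num_classes), where image is a (1, 3, H, W) tensor and target_layer would typically be the last convolutional block of the encoder.

import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=1):
    # Weight the target layer's activations by the spatial average of the
    # gradients of the chosen class score, then ReLU and upsample to image size.
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    model.eval()
    score = model(image)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)            # global-average-pooled gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted sum over channels
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1]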
Figure 5. Comparison between HU-Net, Pocket-Net, and APNet in terms of (a) training time and (b) GPU memory usage for batch sizes of 8, 16, and 32.
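Measurements like those in Figure 5 can be reproduced with standard PyTorch utilities. The helper below is a minimal sketch that assumes a CUDA-capable GPU and that the model, data loader, loss, and optimizer are supplied by the caller; it reports wall-clock time and peak GPU memory for one training epoch.

import time
import torch

def profile_one_epoch(model, loader, criterion, optimizer, device="cuda"):
    # Time a full pass over the loader and record the peak GPU memory it required.
    model.to(device).train()
    torch.cuda.reset_peak_memory_stats(device)
    start = time.time()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize(device)
    elapsed_s = time.time() - start
    peak_mb = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    return elapsed_s, peak_mb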
Figure 6. ROC curves of (a) HU-Net, (b) Pocket-Net, and (c) APNet. The dashed black line represents the line of no discrimination.
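The ROC curves and AUC values in Figure 6 follow directly from the predicted melanoma probabilities on the test set. A minimal scikit-learn sketch is shown below; the labels and scores are toy values used purely for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1])                 # 1 = melanoma, 0 = benign
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])   # predicted melanoma probability

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.plot([0, 1], [0, 1], "k--", label="No discrimination")  # the dashed reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()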
Figure 7. Confusion matrices of (a) HU-Net, (b) Pocket-Net, and (c) APNet for the four datasets.
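Each panel in Figure 7 is a standard binary confusion matrix, from which the sensitivity and specificity reported in Table 3 can be derived. The sketch below uses toy predictions only to show the bookkeeping.

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 0, 1, 0]   # ground truth: 1 = melanoma, 0 = benign
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)        # true positive rate (melanoma recall)
specificity = tn / (tn + fp)        # true negative rate (benign recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, accuracy={accuracy:.2f}")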
Table 1. Summary of popular neural network size reduction methods, highlighting their mechanisms, advantages, and limitations for achieving computational efficiency and maintaining performance.
Pruning [36,37]
Description: Removes unnecessary components of a network, such as weights or neurons, without significantly affecting accuracy.
Advantages: Reduces memory and computational requirements; retains much of the original model's performance; effective for sparse or over-parameterized networks.
Disadvantages: Time-intensive to tune the pruning process; often requires fine-tuning after pruning to recover performance.

Knowledge Distillation [38]
Description: Trains a smaller model (student) to mimic the outputs of a larger, pre-trained model (teacher).
Advantages: Produces smaller models with high accuracy, often close to the teacher model; can transfer robustness and generalization from teacher to student.
Disadvantages: Requires training an additional large teacher model; performance depends on the similarity of tasks and the quality of the teacher.

Lightweight Architecture Design [20,39,40]
Description: Develops models with efficiency in mind by rethinking the architecture.
Advantages: Designed from scratch to be resource-efficient; combines techniques such as depth-wise separable convolutions and bottleneck blocks.
Disadvantages: May underperform compared to larger models on complex tasks; requires careful design and optimization.

Low-Rank Factorization [41,42]
Description: Decomposes weight matrices into lower-rank approximations, reducing the number of parameters.
Advantages: Effective for reducing computational complexity in matrix operations; maintains reasonable performance when applied correctly.
Disadvantages: Limited by the rank approximation; aggressive factorization can degrade accuracy; suitable mainly for over-parameterized layers.

Neural Architecture Search (NAS) [43]
Description: Uses automated algorithms to design optimal neural network architectures.
Advantages: Finds architectures with an optimal balance of accuracy and efficiency; can be customized for specific hardware.
Disadvantages: Extremely computationally expensive during the search phase; requires advanced expertise and resources.
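As a concrete instance of the pruning entry above, magnitude-based pruning is available directly in PyTorch. The snippet zeroes the smallest 30% of a convolution layer's weights; the layer shape and pruning fraction are arbitrary choices for illustration and unrelated to APNet, which instead follows the lightweight-architecture route.

import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
prune.l1_unstructured(conv, name="weight", amount=0.3)  # mask the 30% smallest-magnitude weights
prune.remove(conv, "weight")                            # fold the mask into the weight tensor

sparsity = float((conv.weight == 0).sum()) / conv.weight.numel()
print(f"weight sparsity after pruning: {sparsity:.0%}")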
Table 2. Detailed information on the classification datasets after augmentation. A balanced representation of dataset classes helps the model learn discriminative features from both classes equally, preventing poor generalization and reducing bias towards the majority class. After oversampling, each dataset contains an equal number of samples for the melanoma and benign classes, ensuring a balanced distribution during training. For consistency across datasets and models, all images were resized to 256 × 256 pixels before being fed into the network.
Dataset | Melanoma (Augmented) | Benign (Augmented) | Training Set | Validation Set | Test Set
ISIC 2017 | 347 (1627) | 1627 (same) | 2602 | 325 | 325
ISIC 2018 | 6705 (8902) | 8902 (same) | 14,244 | 1780 | 1780
ISIC 2019 | 4522 (20,808) | 20,810 (same) | 33,295 | 4161 | 4162
ISIC 2020 | 584 (32,542) | 32,542 (same) | 52,068 | 6508 | 6508
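Class balance of the kind summarized in Table 2 can also be enforced at sampling time. The paper balances the classes by augmentation-based oversampling and resizes all images to 256 × 256; the sketch below instead uses a weighted sampler on a small, deliberately imbalanced toy dataset purely to illustrate the balancing idea.

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in for a dermoscopy training set: 900 benign (0) vs. 100 melanoma (1) samples.
# Tiny 32 x 32 tensors are used here; real images would be resized to 256 x 256.
images = torch.randn(1000, 3, 32, 32)
labels = torch.cat([torch.zeros(900), torch.ones(100)]).long()
train_set = TensorDataset(images, labels)

# Weight each sample inversely to its class frequency so melanoma and benign
# images are drawn with equal probability during training.
class_counts = torch.bincount(labels)
sample_weights = (1.0 / class_counts.float())[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(train_set, batch_size=32, sampler=sampler)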
Table 3. Performance comparison of HU-Net, Pocket-Net, and APNet with representative network parameter-reduction methods from Table 1.
Dataset | Architecture | Number of Parameters | Accuracy | Sensitivity | Specificity | AUC
ISIC 2017 | Mobile net | 1,882,498 | 95.36% | 93.85% | 96.90% | 0.9819
ISIC 2017 | 3D-DenseUNet-569 | 802,178 | 54.80% | 69.34% | 40.11% | 0.5676
ISIC 2017 | Alex-Net weight pruning | 656,511 | 73.83% | 96.34% | 51.20% | 0.8957
ISIC 2017 | HU-Net | 1,304,866 | 96.30% | 97.50% | 95.07% | 0.986
ISIC 2017 | Pocket-Net | 70,114 | 93.55% | 96.90% | 90.20% | 0.977
ISIC 2017 | APNet | 70,762 | 94.5% | 98.14% | 90.80% | 0.986
ISIC 2018 | Mobile net | 1,882,498 | 98.90% | 98.20% | 99.56% | 0.9989
ISIC 2018 | 3D-DenseUNet-569 | 802,178 | 95.20% | 93.70% | 96.73% | 0.9648
ISIC 2018 | Alex-Net weight pruning | 656,511 | 78.12% | 98% | 58.30% | 0.9457
ISIC 2018 | HU-Net | 1,304,866 | 98.44% | 99.66% | 97.17% | 0.996
ISIC 2018 | Pocket-Net | 70,114 | 96.80% | 99.20% | 94.40% | 0.988
ISIC 2018 | APNet | 70,762 | 97.90% | 99.76% | 96.04% | 0.9908
ISIC 2019 | Mobile net | 1,882,498 | 96.40% | 94.78% | 98.05% | 0.9917
ISIC 2019 | 3D-DenseUNet-569 | 802,178 | 92.21% | 9.09% | 93.26% | 0.9584
ISIC 2019 | Alex-Net weight pruning | 656,511 | 82.75% | 94.40% | 71.14% | 0.9389
ISIC 2019 | HU-Net | 1,304,866 | 93.80% | 95.36% | 92.24% | 0.976
ISIC 2019 | Pocket-Net | 70,114 | 91.26% | 96.88% | 85.70% | 0.96
ISIC 2019 | APNet | 70,762 | 92.6% | 97.27% | 87.9% | 0.965
ISIC 2020 | Mobile net | 1,882,498 | 99.56% | 99.37% | 99.70% | 0.9999
ISIC 2020 | 3D-DenseUNet-569 | 802,178 | 98.60% | 99.50% | 97.70% | 0.9834
ISIC 2020 | Alex-Net weight pruning | 656,511 | 97.75% | 100% | 95.46% | 0.9999
ISIC 2020 | HU-Net | 1,304,866 | 99.95% | 100.00% | 99.90% | 0.999
ISIC 2020 | Pocket-Net | 70,114 | 99.20% | 99.76% | 98.73% | 0.999
ISIC 2020 | APNet | 70,762 | 99.60% | 100% | 99.20% | 0.999
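The "Number of Parameters" column can be reproduced for any PyTorch model with a short helper; the toy network below is only a placeholder to show the call.

import torch.nn as nn

def count_parameters(model):
    # Total number of trainable parameters, as reported in the table.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

toy = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2),
)
print(count_parameters(toy))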