1. Introduction
Skin cancer is among the most common types of cancer and becomes life-threatening if not detected and treated early. In 2022, over 1.5 million new cases were reported worldwide, with melanoma causing 60,000 deaths from 330,000 cases, while non-melanoma cancers ranked fifth among global cancers with 1.23 million cases and 69,416 deaths [1]. These figures highlight the urgent need for early detection, especially in remote regions where access to specialized healthcare and internet infrastructure remains limited [2].
Recent advances in Deep Learning technology have shown great promise in automating the classification of skin cancer using large image datasets [
3]. Despite its potential, this approach often demands substantial computational resources and reliable internet connectivity—conditions not always available in rural or low-resource environments. To overcome these constraints, some studies have integrated Deep Learning models with Internet of Things (IoT) devices, enabling local inference. However, these solutions continue to face critical challenges related to data privacy and security. According to a HIMSS survey [
4], 75% of healthcare stakeholders identify data privacy as the main barrier to AI adoption, followed by risks of data breaches and algorithmic bias.
Convolutional Neural Networks (CNNs) have demonstrated superior performance compared to traditional classification methods such as Artificial Neural Networks (ANN), k-Nearest Neighbors (kNN), Random Forest (RF), and Logistic Regression (LR) [
5,
6]. Advanced architectures including EfficientNetB0, InceptionV3, and VGG-16 have achieved remarkable accuracy in skin cancer classification tasks. Nevertheless, many of these models remain computationally intensive and unsuitable for deployment on devices with limited processing power [
7,
8]. Moreover, previous research has often overlooked performance evaluations on edge or embedded systems.
To address these gaps, this study proposes a CNN-based skin cancer classification framework integrated with Federated Learning, as shown in
Figure 1. The proposed framework begins with data preprocessing, followed by data augmentation and annotation to prepare labeled skin lesion images for training. The pretrained CNN model is used as the initial global model. This model is then updated through an iterative Federated Learning process in which the weights from local servers are aggregated to obtain the final optimized global model.
This approach aims to enhance model accuracy while preserving patient data privacy, since models are trained collaboratively without centralizing sensitive information. Two benchmark datasets, HAM10000 [
9] and ISIC 2019 [
10,
11], are utilized in this study. The HAM10000 dataset is used to initialize the global model, while the ISIC 2019 dataset is distributed across multiple clients and employed as local training data. In addition, this research evaluates the inference performance of the proposed model using real-time inference on edge computing platforms, namely Jetson Orin and Raspberry Pi. The evaluation focuses on inference time, power consumption, RAM utilization, and energy dissipation to support early skin cancer detection while maintaining data confidentiality and enabling efficient deployment in resource-constrained environments.
2. Related Work
From 2021 to 2025, numerous studies have explored skin cancer detection using both machine learning and deep learning methods. The datasets used in these studies contain diverse skin lesion images with distinct visual characteristics, reflecting the variability commonly encountered in real-world clinical cases. Early works primarily focused on conventional machine learning models; however, as the complexity and volume of skin lesion data increased, more recent studies adopted advanced deep learning architectures. In addition, privacy-preserving techniques such as Federated Learning have been introduced to maintain data security without sacrificing performance.
One notable early study applied the Support Vector Machine (SVM) method for melanoma classification. The model utilized features such as asymmetry, color, and irregularity to distinguish skin lesions, achieving a specificity of 95.7% and an accuracy of 90.2% on the ISIC dataset. Although the results were promising, computational complexity remained a limitation, and the model’s performance was highly dependent on preprocessing and segmentation quality [
12].
In 2022, Pennisi et al. proposed the Squeeze U-Net model using the ISIC 2017 and PH2 datasets, as illustrated in
Figure 1. The model demonstrated strong segmentation performance, achieving Dice scores of 0.9035 and 0.9301, as well as Jaccard indices of 0.7758 and 0.8533, respectively. Its compact size (10 MB) and fast inference time made it suitable for deployment on embedded systems. However, the model showed reduced performance when applied to unseen ISIC data and images with low contrast [
13].
A year earlier, Zhang [
14] evaluated CNN-based architectures such as VGG16, VGG19, and EfficientNet-B6 for melanoma detection using the ISIC 2020 dataset. The EfficientNet-B6 model achieved the best result with an AUC-ROC of 0.917, outperforming earlier CNN models. Nonetheless, issues of overfitting, lack of preprocessing details, and missing analysis of data bias or inference speed were identified as limitations.
In a comparative study, Gupta et al. [
15] examined multiple classifiers including CNN, Random Forest (RF), Naïve Bayes (NB), k-Nearest Neighbors (kNN), and SVM, using 23,000 ISIC and 3150 Dermofit images. Their findings indicated that CNN achieved the highest accuracy of 94.91%, outperforming all traditional models, supported by the use of a large and diverse dataset.
Similarly, Ornek et al. [
5] compared several algorithms—ANN, kNN, RF, and Logistic Regression (LR)—for skin lesion segmentation and classification. The best-performing model, ANN, achieved 71.8% accuracy, while LR yielded the lowest accuracy at 67.2%. Although ANN performed relatively well among classical approaches, the overall accuracy was lower than deep learning-based methods reported in other studies.
Further improvements were demonstrated by Owida et al. [
16] and Sharma et al. [
6], who applied CNN-based architecture to the HAM10000 dataset, as shown in
Figure 2. Their models, including MobileNet + LSTM, Modified MobileNetV2, and DenseNet169, achieved up to 95.23% accuracy after applying preprocessing techniques to enhance image quality. Sharma et al. [
6] further compared Xception (95.17%), Inception-ResNet-V2 (95.17%), DenseNet121 (94.48%), and DenseNet201 (95.86%), showing strong performance across architectures. However, both studies were limited to a single dataset, reducing generalizability to real-world clinical settings. Similarly, Abdullah et al. [
3] reported an overall accuracy of 96.25% and training accuracy of 98.26% on HAM10000 using CNN, yet without addressing dataset bias.
Subsequent research explored lightweight CNN models for low-computation environments. Shrivastava [
17] tested MobileNet and ResNet50 using the PH2 dataset (200 images), reporting accuracies of 80% and 56.66%, respectively. A proposed improved CNN model achieved 93.33% accuracy, outperforming earlier versions but using a relatively small dataset. Similarly, Lilhore et al. [
18] developed a hybrid model combining U-Net and Improved MobileNet-V3, reaching 98.86% accuracy, 97.84% precision, 96.35% sensitivity, and 97.32% specificity. The use of Bayesian optimization for hyperparameter tuning enhanced model performance, allowing faster and more accurate skin cancer detection.
Recent research has also emphasized integrating deep learning with IoT and edge computing. Watt et al. [
2] implemented Tiny Machine Learning (TinyML) on Jetson Nano (Nvidia, Santa Clara, CA, USA) using the HAM10000 dataset to enable on-device inference. However, their model achieved only 78% accuracy, which was considered insufficient compared to state-of-the-art methods. Moreover, the study lacked detailed performance evaluations on the IoT hardware. In the same year, Satria et al. [
7] focused on melanoma and non-melanoma classification using EfficientNetB0, InceptionV3, and VGG-16, with EfficientNetB0 achieving 99% accuracy and 99% precision. Despite its impressive results, the study did not address potential overfitting or model optimization for hardware deployment.
Other works such as Aishwarya & GS [
8] used the YOLOv8s model on 3585 dermoscopic images from Roboflow Universe, achieving 99.5% precision, 99% recall, and 99.2% F1-score, with real-time inference on NVIDIA Jetson Nano. Similarly, Swaminathan et al. [
19] benchmarked deep learning models on Jetson Nano, reporting inference speedups of 16.7× (MobileNetV2), 13.6× (ShuffleNetV2), 8.07× (ResNetV2), and 6.05× (VGG). However, variations in performance across models were not thoroughly analyzed. Torti et al. [
20] also evaluated device-level performance, showing execution times of 3.34 s (5W mode) and 2.42 s (10W mode) on Jetson Nano, compared to 0.96 s on RTX 2080, highlighting the feasibility of efficient edge deployment.
Regarding data privacy and federated learning, Deng, Yang, and Suzuki (2025) [
21] explored Federated Active Learning (FedAL) using combined HAM10000 and MSK datasets distributed across four hospitals with 50% labeled data. Their model achieved Micro-F1 of 86.17%, Macro-F1 of 78.50%, and AUC of 91.43%, demonstrating strong performance while preserving patient confidentiality. However, the study did not examine variations in skin type or the impact of noisy or mislabeled data. Similarly, Khullar et al. [
22] proposed Federated Transfer Learning with MobileNetV2 and EfficientNetV2S, achieving 89.9% accuracy with model sizes under 10 MB—ideal for low-power devices. Nonetheless, annotation challenges and model robustness were not fully addressed. Finally, Akbar et al. [
23] combined CNN-LSTM with Federated Learning (FedAvg), obtaining 91.8% accuracy while introducing noise to enhance privacy protection.
Overall, the existing literature demonstrates continuous improvements in skin cancer classification accuracy from 2021 to 2025, driven by advances in deep learning architectures and optimization techniques. However, most privacy-preserving approaches based on federated learning primarily focus on model training performance, with limited attention to real-time inference on edge computing platforms. In addition, many studies do not evaluate deployment feasibility on resource-constrained devices, which is critical for practical clinical applications. These limitations highlight a clear research gap in developing skin cancer classification systems that jointly address data privacy through federated learning and efficient real-time inference on edge computing, motivating the proposed approach in this study.
3. Materials and Methods
The research stages conducted in this paper are illustrated in
Figure 3. The process begins with problem identification related to the need for early skin cancer detection that maintains data privacy and supports federated learning and real-time inference on edge computing devices. A literature review is then carried out to examine previous studies on skin cancer detection methods, the use of the Xception backbone architecture, and Federated Averaging (FedAvg) learning approaches. Subsequently, data preprocessing is performed, including data splitting, augmentation, and annotation, to ensure sufficient data quality for effective model training.
The dataset is divided into training and testing sets. In the next stage, a local model is trained using an Xception-based model, which is then used to initialize the global model in the Federated Learning architecture. The global model is evaluated using F1-score, precision, and recall to measure its performance. The Federated Learning process consists of one global server and two local servers. After several federated learning iterations, the best-performing model is selected and subsequently deployed on edge computing devices, namely the Jetson Orin Nano and Raspberry Pi (Raspberry Pi Ltd., Cambridge, UK), to assess its feasibility on low-power hardware. The final stage involves evaluating both the hardware performance and the federated learning model using several measurement parameters.
3.1. Data Preprocessing
The data used in this study consist of two datasets. The first is the HAM10000 dataset, which contains 10,015 images across 7 categories of skin conditions, with the largest class being melanocytic nevus (NV). The second is the ISIC 2019 dataset, which includes 9 categories of skin lesions with a total of 25,331 images. Only 7 categories are used in this research, namely Melanoma (MEL), Melanocytic Nevus (NV), Basal Cell Carcinoma (BCC), Actinic Keratosis (AK), Benign Keratosis-like Lesions (BKL), Dermatofibroma (DF), and Vascular Lesions (VASC), as shown in
Figure 4. The two remaining ISIC 2019 categories, Squamous Cell Carcinoma (SCC) and Normal skin, are not included. The two datasets are distributed across a global server and local client servers as listed in
Table 1. The number of samples shown represents the original data before any preprocessing or resampling techniques were applied. The HAM10000 dataset is used exclusively during the global pretraining phase to initialize the global model in the Federated Learning framework. Meanwhile, the ISIC 2019 dataset is evenly divided and used as local datasets for each client. Each client receives a different subset of data while maintaining an equal distribution across all selected categories.
Before the model training process, the dataset undergoes a series of data preprocessing steps to improve data quality while ensuring class balance and data suitability. The preprocessing process begins with an analysis of the image distribution across each class. To address class imbalance, random oversampling is applied, where classes with fewer images are augmented by duplicating existing samples until all classes contain an equal number of images.
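The random oversampling step described above (duplicating samples of the minority classes until all classes are the same size) can be sketched in plain Python; `images_by_class` is a hypothetical mapping from class label to image file paths, and the exact duplication strategy used in the study may differ.

```python
import random

def random_oversample(images_by_class, seed=42):
    """Duplicate samples of minority classes until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    target = max(len(paths) for paths in images_by_class.values())
    balanced = {}
    for label, paths in images_by_class.items():
        # Draw duplicates at random from the existing samples.
        extra = [rng.choice(paths) for _ in range(target - len(paths))]
        balanced[label] = list(paths) + extra
    return balanced
```

After this step, every class contributes an equal number of images to training, which is what the right-hand plot of Figure 5 reflects.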
Next, all images are subjected to pixel value normalization, in which pixel intensities are rescaled from the range [0, 255] to [0, 1]. This normalization step aims to improve training stability and accelerate model convergence. Following normalization, on-the-fly data augmentation is applied to the training data to increase data diversity and reduce the risk of overfitting. The augmentation techniques include random rotations, minor width shifts, horizontal flipping, and brightness adjustments.
Subsequently, the dataset is divided into training and validation sets using a ratio of 80% for training and 20% for validation, in accordance with the data loading configuration. This partitioning enables objective performance evaluation during the training process.
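The normalization, on-the-fly augmentation, and 80/20 split described above can be sketched with Keras preprocessing layers; the transform ranges below are assumptions, since the text reports the augmentation types but not their exact parameter values.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical parameter values; the study reports random rotations,
# minor width shifts, horizontal flips, and brightness adjustments
# without giving exact ranges.
augment = tf.keras.Sequential([
    layers.Rescaling(1.0 / 255),           # normalize [0, 255] -> [0, 1]
    layers.RandomRotation(0.05),           # random rotations
    layers.RandomTranslation(0.0, 0.1),    # minor width shifts
    layers.RandomFlip("horizontal"),       # horizontal flipping
    layers.RandomBrightness(0.2, value_range=(0.0, 1.0)),
])

# The 80/20 train/validation split can be made at load time, e.g.:
# train_ds = tf.keras.utils.image_dataset_from_directory(
#     "data/train", validation_split=0.2, subset="training",
#     seed=42, image_size=(224, 224), batch_size=32)
```

Because the augmentation layers are active only when called with `training=True`, the validation data passes through unaugmented, which matches the on-the-fly scheme described in the text.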
Figure 5 presents a comparison of class distribution after data cleaning and illustrates the changes before and after applying data augmentation and random oversampling. The left plot demonstrates a substantial imbalance among classes, where certain classes such as NV contain significantly more images than others. After preprocessing, as shown in the right plot, the number of images in each class is balanced. By applying these preprocessing steps, the dataset becomes more balanced, normalized, and diverse, thereby reducing overfitting and improving the model’s capability to perform skin cancer classification.
3.2. Xception-Based Convolutional Neural Network (CNN) Model Training
In this study, a Convolutional Neural Network (CNN) based on the Xception architecture is employed for skin disease classification. The training procedure is designed to enable the model to learn discriminative features effectively and to produce reliable classification results. The model training process is implemented using the Python 3.12.12 programming language and the TensorFlow 2.19.0 framework. To accelerate the training process and optimize GPU computation, Accelerated Linear Algebra (XLA) is enabled.
The model is trained using a predefined hyperparameter configuration. The input image size is set to 224 × 224, the batch size is 32, the learning rate is 1 × 10⁻⁴, and the weight decay is 1 × 10⁻⁵, with a maximum of 50 training epochs. At this stage, the training dataset consists of the HAM10000 dataset, which was processed using random oversampling and data augmentation in the preceding preprocessing stage. The model employs the Xception architecture as the backbone network, initialized with pretrained weights from the ImageNet dataset to enable transfer learning. This architecture is selected for its depthwise separable convolution mechanism, which improves feature extraction efficiency while reducing computational cost. The fully connected layers at the top of the network are removed, and global average pooling is applied to aggregate spatial features. All layers of the base model are set to trainable, allowing full fine-tuning to adapt the learned features to the characteristics of skin disease images. This fine-tuning strategy enables the model to better capture domain-specific patterns and enhances overall classification performance.
This model is used as the initial global model for the federated learning process.
Figure 6 shows the Xception-based CNN architecture. The model uses Xception as the backbone for feature extraction, followed by a classification head consisting of Batch Normalization, a Dropout layer with a rate of 0.4, a Dense layer with 256 neurons using ReLU activation and L2 regularization, and a Dropout layer with a rate of 0.3. The representation is then reduced by a Dense layer with 128 neurons, ReLU activation, and L2 regularization, followed by a final Dense layer with a Softmax activation function for multi-class classification across seven classes.
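The backbone and classification head described above can be reconstructed as a minimal Keras sketch; this is not the authors' exact code, and the L2 regularization coefficient is not reported, so `1e-4` is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

NUM_CLASSES = 7
IMG_SHAPE = (224, 224, 3)

def build_model(weights="imagenet"):
    # Xception backbone with the top removed; all layers trainable
    # for full fine-tuning, as described in the text.
    base = tf.keras.applications.Xception(
        include_top=False, weights=weights, input_shape=IMG_SHAPE)
    base.trainable = True

    inputs = tf.keras.Input(shape=IMG_SHAPE)
    x = base(inputs)
    x = layers.GlobalAveragePooling2D()(x)   # aggregate spatial features
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.4)(x)
    x = layers.Dense(256, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(x)  # assumed L2
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(128, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```

Passing `weights="imagenet"` applies transfer learning from ImageNet as in the text; `weights=None` builds the same topology with random initialization.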
The model is compiled using the AdamW optimizer, which combines an adaptive learning rate with weight decay regularization. In addition, gradient clipping is applied to maintain training stability. The loss function used is Categorical Crossentropy, while the evaluation metrics include accuracy and Area Under the ROC Curve (AUC). To improve training effectiveness and prevent overfitting, several callbacks are applied, including EarlyStopping, which stops training when the validation AUC does not improve for 10 consecutive epochs; ReduceLROnPlateau, which automatically reduces the learning rate when the validation loss stagnates; TerminateOnNaN, which terminates training if numerical instability occurs; and ModelCheckpoint, which saves the model with the best performance during training. After the training process is completed, the model is saved in both .keras and .h5 formats to ensure compatibility across various usage scenarios and inference implementations.
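The compilation and callback configuration above can be sketched as follows; the gradient-clipping norm, the `ReduceLROnPlateau` factor and patience, and the checkpoint filename are assumptions not specified in the text.

```python
import tensorflow as tf

def compile_with_callbacks(model):
    # AdamW with the learning rate and weight decay reported in the
    # text; clipnorm=1.0 is an assumed gradient-clipping setting.
    optimizer = tf.keras.optimizers.AdamW(
        learning_rate=1e-4, weight_decay=1e-5, clipnorm=1.0)
    model.compile(
        optimizer=optimizer,
        loss="categorical_crossentropy",
        metrics=["accuracy", tf.keras.metrics.AUC(name="auc")],
    )
    callbacks = [
        # Stop when validation AUC fails to improve for 10 epochs.
        tf.keras.callbacks.EarlyStopping(
            monitor="val_auc", mode="max", patience=10,
            restore_best_weights=True),
        # Reduce the learning rate when validation loss stagnates
        # (factor and patience are assumptions).
        tf.keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss", factor=0.5, patience=5),
        # Abort on numerical instability.
        tf.keras.callbacks.TerminateOnNaN(),
        # Keep the best-performing weights seen during training.
        tf.keras.callbacks.ModelCheckpoint(
            "best_model.keras", monitor="val_auc",
            mode="max", save_best_only=True),
    ]
    return callbacks
```

The returned list would be passed to `model.fit(..., callbacks=callbacks)`.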
3.3. Federated Learning Model Training
At this stage, the model training process is conducted using the Federated Learning approach, which is a distributed learning method that allows multiple servers or clients to collaboratively train a model without sharing raw data with a central server. This approach is highly suitable for the medical domain, as it preserves the privacy and security of patient data, where data remain stored locally at each institution or client. The dataset used in this stage is the ISIC 2019 dataset, which is divided according to
Table 1 for Local Server 1 and Local Server 2. The dataset is evenly distributed, with the same class categories allocated to each local server.
The Federated Learning architecture consists of a central server and multiple clients. The central server acts as the main coordinator responsible for initializing the global model, distributing the model to clients, and aggregating model parameters. Meanwhile, each client functions as a local training unit that maintains its own dataset and independently performs model updates. The global model used at this stage is a CNN model based on the Xception architecture, which has been initialized in the previous stage. This model is then distributed to all clients as the initialization for local training.
Figure 7 illustrates the architecture of Federated Learning, starting from the initial global model initialization stage to the inference stage on edge computing devices. The global model is stored in .keras format and distributed to each client. During the local training phase, training is performed independently at each client, ensuring that image data are not transmitted to the global server or to other local servers. Each client updates the model weights based on its local dataset over several training epochs. After the local training process is completed, each local server sends the updated model parameters back to the global server.
Next, the process proceeds to the global model aggregation stage. The global aggregation process is performed using the Federated Averaging (FedAvg) algorithm. In this stage, the system updates the global model by combining the weights from all local models. This process enables the global server to integrate knowledge from all clients without directly accessing the original local data. After that, an iterative process is conducted, consisting of global model distribution, local training, and global aggregation. These three processes are repeated for multiple federated learning rounds. In each iteration, the global model is progressively improved based on updates from all clients until optimal performance is achieved. The final optimized model is then stored and used in the inference stage.
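The FedAvg aggregation step above amounts to a sample-weighted, layer-wise average of the clients' parameter tensors. A minimal sketch, where `client_weights` is a list of per-client weight lists (e.g. from `model.get_weights()`) and `client_sizes` holds each client's number of training samples:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Federated Averaging: weight each client's parameters by its
    share of the total training samples, then sum layer-wise."""
    total = float(sum(client_sizes))
    fractions = [n / total for n in client_sizes]
    new_global = []
    # zip(*client_weights) groups the k-th tensor of every client.
    for layer_tensors in zip(*client_weights):
        agg = sum(f * w for f, w in zip(fractions, layer_tensors))
        new_global.append(agg)
    return new_global
```

Since this study distributes the ISIC 2019 data evenly across the two local servers, the weighted average reduces to a simple mean; the global server would apply the result with `global_model.set_weights(fedavg(...))` before the next round.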
3.4. Inference Model on Edge Computing Devices
This study utilizes two edge computing devices, namely the Raspberry Pi 5 and the Jetson Orin Nano Super, for the model inference process after the training stage using Federated Learning.
Table 2 presents the hardware specifications of the two devices used in this study. The specifications include the CPU and GPU architecture, which determine the computational capability for executing deep learning models, as well as memory capacity, which affects the efficiency of model loading and inference execution. In addition, the table also provides information on storage media, operating system, and power supply configuration, which are important factors in the implementation and performance analysis of model inference on edge devices.
The setup of the edge computing devices used in this study is shown in
Figure 8. On the Jetson Orin device, several supporting components are used, including a 9V–19V DC power adapter via USB Type-C, a LAN cable, and a standard HDMI cable for connecting to a monitor. Meanwhile, on the Raspberry Pi device, a 5V/5A DC USB Type-C power adapter, a LAN cable, and a micro HDMI cable are used. For the installation and initial configuration process, an internet connection is established using a LAN cable, as illustrated in
Figure 8.
Figure 8 shows the initial setup of the Jetson Orin device, which begins with the preparation of several required components, including a microSD card reader, a 64 GB microSD card, a USB keyboard and mouse, a computer display (HDMI or DisplayPort), and an appropriate power supply. For the initial setup process, an internet connection from another device is required to download the Jetson Nano Developer Kit SD Card Image. The operating system image is written to the microSD card using Etcher. Prior to this process, the microSD card must be formatted to ensure compatibility. After Etcher completes the flashing process successfully, the microSD card is inserted into the Jetson Orin device.
Upon powering on the device, a first boot process is performed, during which several initial configuration steps are completed, including system settings and user configuration. The initial operating system installation process for the Jetson Orin and Raspberry Pi devices follows a similar procedure. However, the Raspberry Pi uses Raspberry Pi Imager as the official tool for preparing and installing its operating system.
The libraries used during the inference and benchmarking stages include TensorFlow as the main framework for executing the CNN model, NumPy for numerical computation, Pillow for image processing, psutil for monitoring memory usage, and tqdm for visualizing the testing progress. In addition, several built-in Python modules such as os, time, csv, and glob are used for system management and result storage. Prior to inference, the global model in Keras format is first converted to TensorFlow Lite format to reduce model size and improve efficiency on the Raspberry Pi.
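The Keras-to-TensorFlow Lite conversion mentioned above can be sketched as follows; whether the study applied post-training optimization is not stated, so `Optimize.DEFAULT` is an assumption, and the file paths are placeholders.

```python
import tensorflow as tf

def convert_to_tflite(keras_model_path, tflite_path):
    """Convert a trained Keras model to TensorFlow Lite to reduce
    model size and improve inference efficiency on the Raspberry Pi."""
    model = tf.keras.models.load_model(keras_model_path)
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # Assumed: default size/latency optimization (dynamic-range
    # quantization); the study may have converted without it.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_bytes = converter.convert()
    with open(tflite_path, "wb") as f:
        f.write(tflite_bytes)
    return len(tflite_bytes)
```

On the device, the resulting `.tflite` file would be executed with `tf.lite.Interpreter`, while the Jetson Orin runs the Keras, SavedModel, or ONNX variants described next.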
On the Jetson Orin device, the libraries used during the inference stage are adapted to the model format employed, namely Keras, TensorFlow SavedModel, and ONNX. For models in Keras and TensorFlow SavedModel formats, TensorFlow together with its Keras API (tf.keras) is used as the main runtime, supported by NumPy, Pillow, psutil, and tqdm. Meanwhile, inference using the ONNX format is executed with ONNX Runtime (CUDA Execution Provider), also supported by NumPy, Pillow, psutil, and tqdm. Built-in Python modules such as os, time, csv, and glob are used for system management and result storage.
3.5. Performance Evaluation
The classification performance of the global Xception-based CNN model is evaluated using accuracy, precision, recall, and F1-score. Accuracy is used to evaluate the ability of a model to correctly classify data into their respective categories. It is defined as the ratio of correctly predicted observations to the total number of observations, as shown in Equation (1). Correct predictions include both True Positives (TP) and True Negatives (TN). In this context, True Positive (TP) refers to cases where the model correctly identifies a positive class, while True Negative (TN) denotes cases where the model correctly identifies a negative class. Conversely, False Positive (FP) occurs when the model incorrectly predicts a positive class, and False Negative (FN) occurs when the model fails to correctly identify a positive instance.
Precision is used to evaluate the correctness of the model predictions by measuring how many of the predicted positive instances are actually correct, as defined in Equation (2). Recall represents the model’s ability to identify all relevant positive samples within the dataset, indicating how effectively the model captures true positive cases. A higher recall value reflects a better capability in recognizing all relevant objects, as shown in Equation (3). The F1-score is a composite metric that combines precision and recall, providing a balanced evaluation between the two measures. This metric is calculated using the formula shown in Equation (4). Model performance is evaluated using training and validation loss and accuracy graphs to observe the learning behavior and convergence of the model. In addition, a confusion matrix is used to analyze the classification results across all classes.
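Equations (1)–(4) referenced above follow the standard definitions in terms of the TP, TN, FP, and FN counts, which can be expressed directly in code:

```python
def accuracy(tp, tn, fp, fn):
    """Equation (1): correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Equation (2): fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (3): fraction of actual positives that are recovered."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Equation (4): harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```

For the multi-class setting of this study, these per-class values are combined into the macro and weighted averages reported alongside the confusion matrix.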
This study evaluates the performance of the edge computing platforms using four hardware-related parameters: inference time (s), power consumption (mW), RAM utilization (MB), and dissipated energy (mJ). Inference time is used to assess the inference speed of the model, where a lower value indicates faster recognition. Power consumption represents the amount of electrical power drawn by the single-board computer (SBC) during the inference process.
RAM utilization (MB) reflects the memory resources required to run the skin cancer classification system. The dissipated energy (mJ) is calculated using Equation (5), which multiplies the power consumption by the inference time and converts the result to energy units. This metric represents the total energy expended while the system operates on the edge devices and is used to evaluate the overall energy efficiency of the proposed system during inference.
The hardware evaluation is conducted by selecting 20 random images from the dataset. These images are then randomly reused to perform 100 inference trials on each device, and the average values of the four performance metrics are reported in the hardware performance evaluation results table.
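The benchmarking protocol above (20 sampled images, 100 randomized trials, averaged metrics) can be sketched with the standard library; `infer` is a placeholder for the device-specific inference call, and `power_mw` stands in for a power reading that would come from an external meter or the board's telemetry, not from this script.

```python
import random
import time
from statistics import mean

def benchmark(infer, image_paths, trials=100, power_mw=5000.0, seed=0):
    """Run randomized inference trials and report the average
    inference time (s) and dissipated energy (mJ), where
    energy = power (mW) x time (s), per Equation (5)."""
    rng = random.Random(seed)
    times = []
    for _ in range(trials):
        path = rng.choice(image_paths)   # reuse the 20 sampled images
        start = time.perf_counter()
        infer(path)                      # device-specific model call
        times.append(time.perf_counter() - start)
    avg_t = mean(times)
    return {"inference_time_s": avg_t,
            "dissipated_energy_mJ": power_mw * avg_t}
```

RAM utilization would be sampled alongside each trial (the study uses `psutil` for this), and the per-trial results logged to CSV for the averages reported in the results tables.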
4. Results and Discussion
4.1. Initial Global Model Performance Evaluation
The initial global Xception-based model achieved a relatively high performance. As shown in
Figure 9, the training process was conducted for 12 epochs, covering both accuracy and loss for the training and validation datasets. In the early stage (Epoch 1), the training accuracy was relatively low, indicating that the model was still in the initial phase of feature learning. The accuracy increased significantly between Epochs 2 and 4, exceeding 97%, which demonstrates the model’s ability to effectively learn patterns from the dataset. The difference between training and validation accuracy remained small, indicating that no significant overfitting occurred. The training loss consistently decreased from 1.1395 in the first epoch to 0.0246 in the final epoch.
The best validation accuracy was achieved at Epoch 9, reaching 99.49%. The lowest validation loss was observed at Epoch 10 with a value of 0.0317. Considering both accuracy and loss, Epoch 9 was selected as the optimal model checkpoint due to its superior generalization performance. Furthermore, the training AUC reached 0.999, while the validation AUC remained above 0.995, indicating an excellent capability of the model in distinguishing between different classes of skin lesions.
Table 3 presents the classification performance, demonstrating a very high and consistent performance across all skin disease classes. Several classes, including AKIEC, BCC, DF, and MEL, achieved perfect scores with precision, recall, and F1-score of 1.00, indicating that the model is able to accurately recognize and classify skin lesion images for these classes without errors. The BKL class also shows excellent performance, with precision, recall, and F1-score values of 0.99, suggesting that only a very small number of misclassifications occurred. Meanwhile, the NV and VASC classes exhibit slightly lower performance compared to the other classes, with F1-scores of 0.99 and 0.98, respectively. These differences may be attributed to visual similarities among skin lesion classes or the higher complexity of the image characteristics.
Overall, the global Xception-based CNN model achieved an accuracy of 0.99, indicating a highly reliable classification capability across the entire dataset. The macro average and weighted average values for precision, recall, and F1-score also reached 0.99, demonstrating that the model performs consistently across all classes without significant bias toward either majority or minority classes. These results indicate that the model has strong generalization ability in maintaining balanced performance among different skin lesion classes.
Based on the confusion matrix shown in
Figure 10, the global Xception-based CNN model demonstrates very high classification performance, with the majority of predictions located along the main diagonal. Several classes, including AKIEC, DF, and VASC, are classified perfectly without any misclassifications. The number of misclassifications is very small and mainly occurs in the NV, MEL, and BKL classes. This indicates the presence of visual similarities among these skin lesion classes, which makes the classification task more challenging. Nevertheless, the number of errors is not significant and does not affect the overall performance of the model. Overall, the confusion matrix indicates that the model has good generalization capability for skin disease classification.
4.2. Federated Learning Model Performance Evaluation
Figure 11 illustrates the convergence behavior of the final global model during the Federated Learning process, observed over 16 training rounds. Model performance is evaluated in terms of global loss and global accuracy, which are used to analyze the stability and effectiveness of the distributed training process. The reported metrics are computed on a global test set constructed from 20% of each client's data. Training is conducted on the ISIC 2019 dataset, distributed evenly across two local servers. Both global accuracy and global loss reach stable convergence after approximately round 14, with only marginal improvements in subsequent rounds. Training was therefore stopped at round 16, as additional rounds did not yield significant performance gains and could increase the risk of overfitting.
As shown in Figure 11a, the global loss decreases sharply during the early rounds of Federated Learning. The initial loss of 0.4491 in the first round drops steeply within the first five rounds, indicating that the global model rapidly learns meaningful feature representations from the aggregated local model updates. Afterward, the loss continues to decrease gradually and stabilizes below 0.06, with minor fluctuations in subsequent rounds. Such fluctuations are common in Federated Learning and are generally caused by differences in local data distributions (non-IID data) as well as variations in model updates from individual clients.
Figure 11b presents the progression of global accuracy throughout the Federated Learning process. The accuracy increases consistently from 85.67% in the first round to over 97% by the fifth round. In the following rounds, the global accuracy continues to improve and reaches a maximum value of 98.87%. Overall, the decreasing trend in global loss and the increasing trend in global accuracy demonstrate that the implemented Federated Learning approach effectively optimizes the global model. These results indicate that the local model aggregation mechanism operates successfully, producing a stable global model with high performance while simultaneously preserving data privacy at each client or local server.
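The aggregation mechanism behind these results is Federated Averaging: each round, the server combines client model updates as a weighted average by local sample count. A minimal sketch of one aggregation step, with weights represented as plain nested lists of floats and client sizes that are illustrative rather than the paper's actual setup:

```python
def fedavg(client_weights, client_sizes):
    """Federated Averaging (FedAvg): combine per-client model weights
    (each a list of flattened layer vectors) into global weights,
    weighting each client by its number of local training samples."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    global_weights = []
    for layer in range(n_layers):
        agg = [0.0] * len(client_weights[0][layer])
        for w, n in zip(client_weights, client_sizes):
            # Each client contributes proportionally to its data share.
            for i, v in enumerate(w[layer]):
                agg[i] += (n / total) * v
        global_weights.append(agg)
    return global_weights

# Two clients (mirroring the two local servers), one flattened "layer" each.
w_a = [[1.0, 2.0]]
w_b = [[3.0, 4.0]]
print(fedavg([w_a, w_b], [100, 100]))  # equal sizes -> plain average: [[2.0, 3.0]]
```

In the actual system this averaging is applied to every layer tensor of the Xception-based CNN after each local training round, and the result is broadcast back to the clients.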
Table 4 presents a comparison between the proposed model and several state-of-the-art methods reported in recent studies. The comparison is conducted based on model architecture, classification accuracy, deployment on single-board computer (SBC) platforms, and data security aspects. Previous studies such as Deng et al. [21], Khullar et al. [22], and Akbar et al. [23] have demonstrated the effectiveness of Federated Learning approaches combined with deep learning models, achieving accuracies ranging from 86.17% to 91.8%. However, these methods were not implemented or evaluated on SBC platforms, limiting their applicability for edge-based medical systems.
Other works, including Watt et al. [2], Aishwarya & GS [8], and Shrivastava [17], focused on lightweight or edge-oriented solutions and successfully deployed their models on SBC platforms. Nevertheless, these approaches either lacked data privacy mechanisms or did not employ Federated Learning, which is a critical requirement in medical applications involving sensitive patient data.
In contrast, the proposed model integrates a CNN-based architecture with the Federated Averaging (FedAvg) algorithm, achieving an accuracy of 98.87%. Importantly, the performance of the federated model is comparable to that of centralized (global) training approaches, with only a very minor reduction in accuracy. This indicates that the adoption of Federated Learning does not significantly compromise model performance while offering substantial advantages in terms of data privacy. Moreover, unlike prior studies, the proposed approach simultaneously satisfies high classification performance, secure data handling through Federated Learning, and practical deployment on SBC platforms.
4.3. Hardware Performance Evaluation on Edge Computing Devices
Table 5 presents the results of the hardware performance evaluation during real-time inference of the skin cancer classification model on several edge computing configurations, namely the Jetson Orin Nano and Raspberry Pi 5, with different frameworks and quantization modes. The reported values are averages over 100 inference iterations for each evaluated parameter. On the Jetson Orin Nano, experiments were conducted using three model formats (Keras, TensorFlow, and ONNX), all operating in FP32 precision.
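Averaging latency over 100 iterations can be reproduced with a simple timing harness. A sketch of such a harness (the `infer` callable is a stand-in for the actual model runtime, which is not reproduced here; a few warm-up runs are excluded from timing so that one-off initialization cost does not skew the average):

```python
import time

def benchmark(infer, sample, iterations=100, warmup=5):
    """Average wall-clock inference latency in seconds over `iterations`
    timed runs, after `warmup` untimed runs."""
    for _ in range(warmup):
        infer(sample)
    latencies = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        infer(sample)
        latencies.append(time.perf_counter() - t0)
    return sum(latencies) / len(latencies)

# Placeholder "model": any callable taking one input works here.
avg_s = benchmark(lambda x: sum(x), list(range(1000)))
print(f"average latency: {avg_s * 1e6:.1f} us")
```

The same loop structure applies whether `infer` wraps a Keras `model.predict`, a TensorFlow Lite interpreter invocation, or an ONNX Runtime session.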
The results indicate that the ONNX model achieves the smallest model size (42.9 MB) and the fastest inference time (0.16 s) among the three formats. The ONNX format also yields the lowest dissipated energy, measured at 1552 mJ, demonstrating high computational efficiency on GPU-based devices. Although the instantaneous power consumption of ONNX is higher than that of Keras and TensorFlow, its extremely short inference time results in the lowest overall energy usage. Meanwhile, the TensorFlow model on the Jetson Orin Nano exhibits faster inference than the Keras model but has the largest model size (165.0 MB) and the highest RAM utilization, reaching 4200 MB.
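The apparent paradox of higher instantaneous power but lower energy follows from the relation energy = mean power x latency. A short sketch of that arithmetic, backing out the average power implied by the ONNX figures above (the implied power is a derived value, not a number reported in Table 5):

```python
def dissipated_energy_mj(power_w, latency_s):
    """Energy per inference in millijoules: mean power (W) x latency (s) x 1000."""
    return power_w * latency_s * 1000.0

# Back out the implied average power from the reported ONNX figures:
# 1552 mJ over 0.16 s -> about 9.7 W.
implied_power_w = 1552 / (0.16 * 1000.0)
print(f"implied ONNX average power: {implied_power_w:.1f} W")
print(f"energy check: {dissipated_energy_mj(implied_power_w, 0.16):.0f} mJ")
```

So a format drawing more watts can still dissipate less total energy per inference, as long as its latency is proportionally shorter.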
On the Raspberry Pi 5, experiments were performed using Keras (FP32) and TensorFlow Lite (FP16) formats. The evaluation results show that TensorFlow Lite with FP16 precision significantly reduces the model size to 42.8 MB and shortens the inference time to 0.88 s compared to the Keras model. Furthermore, TensorFlow Lite requires lower RAM utilization (2350 MB) and more efficient power consumption, leading to lower dissipated energy than the Keras configuration.
Figure 12 provides a comprehensive evaluation of inference performance on the edge computing devices (Jetson Orin Nano and Raspberry Pi 5), considering several key metrics: inference time, RAM utilization, power consumption, and dissipated energy. The inference time comparison shows that the ONNX model on the Jetson Orin Nano achieves the fastest inference among all configurations, consistent with the results in Table 5, where ONNX records an average inference time of 0.16 s.
Overall, these evaluation results reveal a clear trade-off between inference speed, memory usage, and energy efficiency. Jetson Orin Nano with the ONNX model is well suited for applications requiring fast inference and high energy efficiency, while Raspberry Pi 5 with TensorFlow Lite (FP16) is more appropriate for edge computing scenarios with limited resources and low power consumption requirements.
For future work, the number of federated clients can be increased, and more heterogeneous data distributions can be introduced to better reflect real-world multi-institution medical environments. This would provide deeper insights into the robustness of the Federated Learning framework under more challenging non-IID conditions. In addition, further optimization techniques such as INT8 quantization or TensorRT framework optimization on Jetson devices can be explored to further reduce inference latency and energy consumption. Finally, future studies may extend the system to real clinical settings by integrating real-time image acquisition and medical expert feedback, thereby enhancing the practical applicability of the proposed approach in clinical decision support systems.
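As a rough illustration of the INT8 quantization mentioned above, the sketch below shows generic affine (asymmetric) quantization; it is not the TensorRT or TensorFlow Lite implementation, only the underlying idea of mapping FP32 values to one-byte integer codes, which cuts weight storage roughly 4x:

```python
def quantize_int8(values):
    """Affine INT8 quantization: map floats to [-128, 127] via
    q = round(x / scale) + zero_point, storing 1 byte per value
    instead of 4 bytes for FP32."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0      # avoid zero scale for constant input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from INT8 codes."""
    return [(qi - zero_point) * scale for qi in q]

vals = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, s, z = quantize_int8(vals)
approx = dequantize(q, s, z)
# Round-trip error per value is bounded by roughly one quantization step.
print(max(abs(a - b) for a, b in zip(vals, approx)))
```

In practice, post-training INT8 quantization also requires a calibration dataset so that per-tensor (or per-channel) scales reflect the activation ranges seen at inference time.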
5. Conclusions
This research presents an end-to-end framework for skin cancer classification that integrates an Xception-based Convolutional Neural Network (CNN), Federated Learning, and real-time inference on edge computing platforms. The proposed approach is designed to achieve high classification performance while preserving data privacy and ensuring practical deployment on resource-constrained devices. The initial global Xception-based CNN demonstrated excellent classification capability, achieving an overall accuracy of 99%, with consistently high precision, recall, and F1-score across all skin lesion classes. The learning curve and confusion matrix analyses confirm that the model exhibits strong generalization ability and minimal overfitting, even for visually similar lesion categories.
The Federated Learning implementation further validated that distributed training using the FedAvg algorithm can effectively optimize the global model. Across 16 federated rounds, the global accuracy increased steadily from 85.67% to a maximum of 98.87%, while the global loss converged to a stable low value. Importantly, the performance of the federated model remains comparable to centralized training, experiencing only a negligible reduction in accuracy. This result confirms that Federated Learning can maintain high model performance while ensuring data privacy, making it particularly suitable for medical applications involving sensitive patient data.
Hardware performance evaluation on Jetson Orin Nano and Raspberry Pi 5 demonstrates the feasibility of deploying the proposed model on edge computing devices. The ONNX format on Jetson Orin Nano achieved the fastest inference time (0.16 s) and the lowest dissipated energy (1552 mJ), highlighting its suitability for real-time and energy-efficient applications. Meanwhile, TensorFlow Lite (FP16) on Raspberry Pi 5 significantly reduced model size, inference time, and memory usage, making it an effective solution for low-power edge environments. These results reveal a clear trade-off between inference speed, memory consumption, and energy efficiency, emphasizing the importance of selecting an appropriate model format for specific deployment scenarios.
Overall, the proposed system successfully combines high classification accuracy, privacy-preserving federated training, and efficient edge deployment, offering a practical and scalable solution for skin cancer classification in real-world healthcare environments. For future work, the system can be extended by increasing the number of federated clients and incorporating more heterogeneous data distributions to better represent multi-institution clinical settings. Additionally, further optimization techniques such as INT8 quantization and TensorRT acceleration can be explored to reduce inference latency and energy consumption, as well as to facilitate real-time deployment in clinical decision support systems.