An Adaptive Federated Machine Learning-Based Intelligent System for Skin Disease Detection: A Step toward an Intelligent Dermoscopy Device

The prevalence of skin diseases has increased dramatically in recent decades, and they are now considered major chronic diseases globally. People suffer from a broad spectrum of skin diseases, among which skin tumors are potentially aggressive and life-threatening. However, the severity of skin tumors can be managed (by treatment) if diagnosed early. Health practitioners usually apply manual or computer vision-based tools for skin tumor diagnosis, which may cause misinterpretation of the disease and lead to a longer analysis time. However, cutting-edge technologies such as deep learning using the federated machine learning approach have enabled health practitioners (dermatologists) to diagnose the type and severity level of skin diseases. Therefore, this study proposes an adaptive federated machine learning-based skin disease model (using an adaptive ensemble convolutional neural network as the core classifier) in a step toward an intelligent dermoscopy device for dermatologists. The proposed federated machine learning-based architecture consists of intelligent local edges (dermoscopy) and a global point (server). The proposed architecture can diagnose the type of disease and continuously improve its accuracy. Experiments were carried out in a simulated environment using the International Skin Imaging Collaboration (ISIC) 2019 dataset (dermoscopy images) to test and validate the model's classification accuracy and adaptability. In the future, this study may lead to the development of a federated machine learning-based (hardware) dermoscopy device to assist dermatologists in skin tumor diagnosis.


Introduction
Improving and ensuring healthy lives and promoting wellbeing in the community are among the United Nations' Sustainable Development Goals. For that reason, empowering and strengthening health practitioners with sufficient technological advancement (to attain the maximum potential of their health practices) is essential. Among other kinds of diseases, skin diseases have increased dramatically in recent decades, and they are now considered a major chronic disease globally [1]. People suffer from a broad spectrum of skin disease types (ranging from low to high severity level), such as eczema, chickenpox, measles, warts, acne, skin tumors, and others [2][3][4]. Skin tumors (melanoma, basal cell carcinoma, squamous cell carcinoma, and others) are the most dangerous type of skin disease and may be fatal if not treated early. For example, melanoma is a form of skin tumor that
1. This study proposes the idea of an adaptive federated machine learning-based skin disease detection system to assist dermatologists.
2. This study proposes a federated machine learning-based adaptive framework for skin disease.
3. This study validates the proposed model's classification performance and adaptability at the edge (local device) and cloud (global server) level.
Section 2 of this study surveys the related work and formulates a theoretical foundation. Section 3 proposes the idea of artificial intelligence-based dermoscopy and details the proposed model design. Section 4 presents the experiments and discusses the model validation process. Section 5 highlights the conclusions and future work.

Related Work and Theoretical Foundation
Previous research primarily focused on the early screening of skin cancer, whether a lesion was malignant or benign, or whether a lesion was a melanoma. However, upward of 90% of skin problems are not malignant, and addressing these more common conditions is also essential for reducing the global burden of skin disease. Especially in tropical regions, environmental changes (such as urbanization and industrialization) have increased the prevalence of skin diseases [22]. This section discusses the existing dermatology equipment (devices) and existing machine learning-based solutions for skin disease detection. This section also discusses the limitation of current technologies and highlights the motivation for developing adaptive federated machine learning-based dermoscopy devices.

Existing Devices Used for Skin Disease Analysis
Currently, manual or computer vision-based tools are used to diagnose skin diseases. Figure 1a shows a dermoscopy device widely used in advanced countries to diagnose melanoma cancer. It has demonstrated success in diagnosing other skin conditions such as psoriasis [4], lichen planus [4], and cicatricial alopecia. Dermoscopy visualizes the subtle clinical patterns of skin lesions and subsurface skin structures not typically visible to the unaided eye. However, when using this device, the health professional has to decide the severity and disease condition. Figure 1b presents the dermalite device, a user-friendly dermoscopy device attached to a smartphone. It enables capturing highly magnified dermatologist-grade photos of moles or other skin lesions that can be further shared with professional dermatologists for proper examination and diagnosis.  Figure 1c shows a microscope device developed by Stanford scientists to spot the seeds of cancer, diagnose diseases that include skin cancer, and perform exact surgery (without cutting the skin). However, laser-based imaging tools are expensive, slower, bulkier, and less accurate, and they are usually not meant for public adoption [23].

Current State of Machine Learning in Skin Disease Detection
In recent years, with the progress of machine learning technology, the expectations of artificial intelligence have been increasing, and research on its applications in dermatology has actively progressed [24]. In the literature, several studies proposed shallow learning-based skin disease solutions. For example, one study [25] used an artificial neural network to detect melanoma from color images. The model used a discriminant feature to diagnose the tumor shape, categorize the melanoma skin disease into three main categories, and separate melanoma-like diseases. Another study [26] used the fuzzy c-means region segmentation approach based on color discrimination. In this approach, a two-dimensional histogram is calculated along with the principal components and the Gaussian low-pass filter analysis. Similarly, an extreme machine learning-based approach was proposed in [27] to detect skin cancer. This approach requires an extensive process for feature extraction and texture analysis; thus, it is not feasible for clinical trials.
In contrast, some other studies worked on optimized feature extraction processes and determined that the spread of chronic skin diseases in different regions may lead to severe consequences. For example, [28] proposed a support vector machine-based model that automatically detects eczema and determines its severity on the basis of three different stages, i.e., effective segmentation, extraction of optimized features (color, texture, and borders), and severity of the disease itself. Another notable attempt to extract more appropriate features was made by [29]. This study used the joint power of computer vision and machine learning to detect six types of skin disease. However, these studies were not adequate when addressing similar features across multiple diseases (e.g., some types of eczema show similar features to cancer); as such, wrong classifications may be obtained. The feature similarity issue can be overcome using low-level feature extraction techniques. For that reason, deep learning approaches (such as the convolutional neural network) are more desirable due to their pixel-level feature extraction process. Recently, several studies used deep learning-based techniques to classify similar features more accurately. In [30], a hybrid approach was designed by combining shallow learning- and deep learning-based pretrained models, such as AlexNet and a support vector machine. A recent study [31] highlighted the enormous potential of deep learning to detect skin diseases with human-like diagnosis accuracy or better. Furthermore, this study urged the utilization of deep learning-based real-time intelligent healthcare systems for clinical utilization. The current approaches are based only on batch learning and are static; thus, they do not accommodate any future change (i.e., the model would need to be retrained if any changes are required).
However, clinical procedures require a continuous upgradation to increase their accuracy and add new kinds of skin disease, rendering the current static deep learning approaches inapplicable. Accordingly, new adaptive mechanisms are required to ensure adaptability with high classification accuracy.
After an extensive analysis of the literature, it can be safely concluded that health practitioners usually apply manual or computer vision-based tools for the diagnosis of skin diseases, which may lead to misinterpretation and a longer analysis time. Existing devices such as laser microscopy and multimeter wave devices are only applicable in particular situations and are not soon expected to be implemented in clinical practices. On the contrary, dermoscopy and dermalite are majorly used clinically, but they require extensive health practitioner input to observe skin diseases. Overall, the current technologies are more expensive and need more time to analyze the actual condition and severity level. Deep learning-enabled dermoscopy is an essential approach to diagnose skin diseases and reduce the proliferation of the skin disease epidemic. Research has shown that, with proper training, diagnostic accuracy with dermoscopy is reportedly 75-84% [32,33], which does not meet the desirable level of classification accuracy. Additionally, these devices are static in nature and, thus, do not fulfill the current technological needs. Therefore, it is essential to provide a solution that could help health practitioners (dermatologists) reduce the skin disease epidemic.

Methodology
This section elaborates on the proposed solution and presents the developed algorithms. Figure 2 illustrates a system-level diagram of intelligent federated machine learning-based dermoscopy. The proposed solution would be placed on the edge. The master model would be placed in the cloud server. However, the federated machine learning approach can allow continuous upgradation of the master copies by learning the averaging weights from all edges after each classification (classification done by the dermatologist). The proposed model is adaptive with respect to new disease knowledge (to be classified), and it can improve as a function of experience (by learning from recent examples identified by the dermatologist) during deployment.  Figure 3 depicts the application-level design. Two separate mobile applications are shown. They represent the two kinds of edges, one for the community (this module is not covered in this study) and the other for health practitioners (dermatologists). The community module aims to support patients in getting fundamental knowledge regarding skin diseases, as well as to motivate and assist users in visiting relevant health practitioners (nearest available dermatologists). In comparison, the health practitioner module is equipped with a dermoscopy device, which performs more detailed analysis of the captured dermoscopy images until diagnosis. This module also continuously transfers the updated weights of new samples (after each classification) to its cloud counterpart. Fundamentally, the health practitioner module can work together with the dermoscopy device by providing predictions (to health practitioners) regarding the skin disease type. Figure 4 represents the module and detailed architecture of the proposed model. This prototype was initially developed for the four most common skin diseases (mentioned in the International Skin Imaging Collaboration (ISIC) 2019 dataset). 
Later, it can incorporate new skin diseases through its adaptability feature. In future work, the provided model will be tuned with a multimodal solution (taking into account the input skin disease image and the patient's medical history) to detect and monitor the patient's progress, as depicted in Figure 5.
In this proposed approach, the authors used the ensemble mechanism's diversity from previous models (used to adapt to new spectral bands), which enabled handling the possible arrival of new classes and samples. Remarkably, the proposed approach (ensemble approach) contributes to diversity in a simple yet effective manner. This study also used the single-instance optimized CNN model inspired by [34,35] (which was carefully devised after numerous experiments) as an instance in the cloud server's ensemble. Furthermore, the authors trained the proposed model using a challenging dataset (the ISIC 2019 dataset). The proposed model contains two core contributions: (1) the model was deployed on the cloud server, and (2) its deployment on the edges majorly contributes toward adaptability by continuously updating. The authors used online training (OT) and online classifier updating (OCU), presented in [36], with some internal tweaking parameters to make the approach suitable for the federated machine learning environment.

Averaging Mechanism: The global model (server) collects the trained weights from all local models (edges). The global model updates the global weight matrix, called $G$, such that

$$G = G + \frac{I_k \, w_{1k}}{S},$$

where $I_k$ is the number of data points used to obtain $w_{1k}$, and $S$ is the sum of the number of data points across all local models (edges). Among all the available clients $K$, the server considers a small fraction of clients ($C$) in each round to update the global weight:

$$I_c = \max(C \cdot K, 1),$$

where $I_c$ is the number of local models (edges).
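The averaging mechanism can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `select_clients` and `federated_average`, the toy weight vectors, and the assumption that $G$ is reset to zero before each round's accumulation are all illustrative choices.

```python
import numpy as np

def select_clients(num_clients, fraction, rng):
    """Pick I_c = max(C * K, 1) distinct clients at random for one round."""
    n_selected = max(int(fraction * num_clients), 1)
    return rng.choice(num_clients, size=n_selected, replace=False)

def federated_average(client_weights, client_sizes):
    """Accumulate G = G + (I_k * w_k / S) over the participating edges.

    Assumption: G starts at zero for the round, so the result is the
    data-size-weighted mean of the edge weights.
    """
    total = float(sum(client_sizes))          # S: data points across edges
    global_weights = np.zeros_like(client_weights[0], dtype=float)
    for w_k, n_k in zip(client_weights, client_sizes):
        global_weights += (n_k / total) * w_k
    return global_weights

# Toy example: three edges with 10, 30, and 60 local samples (hypothetical).
rng = np.random.default_rng(0)
weights = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 30, 60]
G = federated_average(weights, sizes)
chosen = select_clients(num_clients=10, fraction=0.5, rng=rng)
```

Weighting each edge by its share of the data keeps an edge with few samples from dominating the global update.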

Federated Machine Learning-Based Algorithm for Cloud Server
The cloud-based adaptive ensemble CNN was inspired by a previously proposed approach [36]. The primary difference is that this study restructured the previously proposed framework into the federated machine learning-based architecture. Thus, the core dynamic ensemble classifier and other modules such as OT and OCU were used to perform online training and continuously update the new samples. The additional neuron layers and weight update mechanism were mostly used to ensure runtime learning. However, such approaches are yet to be investigated for complex and high-dimensional data streams. In this module, the existing instance classifiers are updated (trained on old data) according to the individual updated weights from unique local models (edges). The primary objective is to update the current classification weight into the newly obtained classification weight to continuously update the global module, as depicted in Figure 6. Similar class weight differences were considered a challenging task, and the maintenance of classification accuracy for the global and local modules was a primary requirement. Algorithm 1 outlines the steps followed in performing cloud-based adaptive ensemble CNN tasks.

4: Identify the misclassified images using the activate performance feedback module
5: Determine the ensemble accuracies using the majority voting mechanism
6: if (percentage of) % accuracy for S ≥ Th //correctly classify
7:     Repeat algorithm steps 3, 4, and 5
8: if % accuracy for S < Th //wrongly classify
9:     Save S //save samples
10:     Repeat algorithm steps 3, 4, and 5
12: if the counter is equal to 100 //number of wrongly classified instances reaches 100
13:     Identify possible new classes using Algorithm 2 [36].
14:     Repeat step 3
15: Send the updated model to the edge node
16: End while
Output: Module with (i_{n+1}) instances and classification using Cni + j.
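The majority-voting and threshold steps of Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm: the function names `majority_vote` and `feedback_step`, the tie-breaking rule (lowest label wins), and the choice to store indices of misclassified samples are hypothetical.

```python
import numpy as np

def majority_vote(votes):
    """votes: (n_instances, n_samples) integer labels -> voted label per sample.

    Ties are resolved in favor of the lowest label (an assumption).
    """
    return np.array([np.bincount(column).argmax() for column in votes.T])

def feedback_step(votes, true_labels, saved, threshold=0.9, limit=100):
    """Compare ensemble accuracy on a batch S with the threshold Th.

    If accuracy falls below Th, the misclassified sample indices are saved.
    Returns the accuracy and whether the saved count has reached `limit`,
    which in Algorithm 1 triggers the search for possible new classes.
    """
    predictions = majority_vote(votes)
    wrong = predictions != true_labels
    accuracy = 1.0 - wrong.mean()
    if accuracy < threshold:
        saved.extend(np.flatnonzero(wrong).tolist())
    return accuracy, len(saved) >= limit

# Toy batch: three ensemble instances voting on four samples (hypothetical).
votes = np.array([[0, 1, 2, 1],
                  [0, 1, 1, 1],
                  [0, 2, 2, 3]])
true_labels = np.array([0, 1, 2, 3])
saved = []
accuracy, trigger = feedback_step(votes, true_labels, saved, threshold=0.9, limit=1)
```

In this toy batch only the last sample is misclassified, so it is the one saved for later new-class analysis.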

Federated Machine Learning-Based Algorithm for Edges
Fundamentally, this algorithm distributes and enables edges to perform collaborative learning, and it avoids sending the sample data to the cloud server for its upgradation. This process reduces the massive computational complexity and resolves the privacy issues (a critical issue) when using a cloud-based server. The deployed model (global model) is first trained on a server using some initial data. Each edge (smart dermoscopy or mobile) then improves the model using data available on the device (samples of diseases which were correctly tested), i.e., federated data from the device. The edge is trained using the newly observed data and updates the local model's latest gradient weights. The changes made to the local model are summarized (as an update) and sent to the global model for global upgradation. However, to ensure faster transmission and avoid latency issues, random compression and quantization techniques are used. The process is repeated for several iterations (until a high-quality global model is obtained on the cloud server). The edges send their trained models to the global model, which are averaged to obtain a unified cloud service model. TensorFlow Federated and Federated Core application programming interfaces (APIs) are utilized for experimental purposes. A gradient guarantees convergence, whereas the model average cannot. The detailed steps of this algorithm are shown in Algorithm 2.

3: Perform the classification //regular operation
4: Perform training within the edge device //to compute the updated gradients
5: Update gradient weight to update the global model
6: Send the global updates to all local models
7: If sample data do not belong to existing classes, then
8:     Create and train and update the new instance //using Algorithm 1 [36]
9:     Update gradient weight to update the global model
10:     Send the global updates to all local models
Output: The edges send the updated ∆W to the cloud model.
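The edge-side update can be sketched as computing a weight difference ∆W locally and quantizing it before transmission. The example below is a hedged illustration only: it substitutes a small linear least-squares model for the CNN, uses uniform 8-bit quantization as one possible quantization scheme (the source does not specify which was used), and all function names are hypothetical.

```python
import numpy as np

def local_delta(global_w, X, y, lr=0.1, epochs=20):
    """Run a few gradient steps on local data and return delta W.

    A linear least-squares model stands in for the edge CNN (assumption).
    """
    w = global_w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w - global_w

def quantize_int8(delta):
    """Uniformly quantize delta W to signed 8-bit integers plus one scale."""
    max_abs = float(np.max(np.abs(delta)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = np.round(delta / scale).astype(np.int8)
    return quantized, scale

def dequantize(quantized, scale):
    """Recover an approximate delta W on the server side."""
    return quantized.astype(np.float64) * scale

# Toy local dataset on one edge (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
delta = local_delta(np.zeros(3), X, y)
q, scale = quantize_int8(delta)
restored = dequantize(q, scale)
```

Only `q` and `scale` would need to be transmitted, so the raw images never leave the edge and the payload shrinks from 64-bit floats to 8-bit integers, at the cost of a rounding error of at most half a quantization step per weight.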

Experimental Results
This section presents three subsections to validate the effectiveness and performance of the proposed framework. Section 4.1 details the data preparation and transformation of the datasets. Section 4.2 presents the experimental criteria and experimental setup. Section 4.3 displays the obtained results, along with their analysis.

Data Preparation and Transformation
To evaluate the proposed framework, this study used the ISIC 2019 dataset (the ISIC skin disease dataset is considered one of the most challenging due to its sophisticated features).

Skin Disease Data Stream Pipeline Preparation to Simulate Concept Drift (CD)
In this study, the authors selected a challenging real dataset created by the International Skin Imaging Collaboration (ISIC). ISIC released this dataset to the research and professional communities for open competition (skin lesion analysis toward melanoma detection) in 2019. The ISIC competition provides a challenging task to the research community with the aim of obtaining optimized solutions worldwide. ISIC is developing proposed standards to address the technologies, techniques, and terminology used in skin imaging with attention to the issues of privacy and interoperability (i.e., the ability to share images across technology and clinical platforms). In ISIC 2019, the skin disease

The maximum number of samples was 12,875 (for class 1), and the minimum number of samples was 239 (for class 5), which indicates highly imbalanced classes. The available number of samples was 4522, 12,875, 3323, 867, 2624, 239, 253, and 628 in classes 0, 1, 2, 3, 4, 5, 6, and 7, respectively, as shown in Figure 8. This class imbalance problem can cause an overfitting issue (bias toward classes for which the number of samples is greater). Therefore, image augmentation techniques, such as image flipping, random cropping, random scaling, central zooming, and increasing/decreasing brightness and sharpness, were used to handle the overfitting issue and to balance the number of samples in each class. Figure 9 depicts the random samples generated using image augmentation. Python libraries were also used with appropriate parameters to increase the number of image samples in each class. The image pixel intensity values were also normalized from the range 0 to 255 to the range 0 to 1 to reduce the computational complexity.
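The normalization and augmentation-based balancing described above can be sketched in NumPy. This is an illustrative sketch, not the authors' pipeline: only flipping and brightness changes are shown (cropping, scaling, zooming, and sharpness are omitted), and the function names, the flip probability, and the brightness range are assumptions.

```python
import numpy as np

def normalize(image_uint8):
    """Scale pixel intensities from the range 0-255 to the range 0-1."""
    return image_uint8.astype(np.float32) / 255.0

def augment(image, rng):
    """Apply a random horizontal flip and a random brightness change."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                  # flip along the width axis
    factor = rng.uniform(0.8, 1.2)          # brightness range is an assumption
    return np.clip(out * factor, 0.0, 1.0)

def oversample_to(images, target, rng):
    """Grow a minority class to `target` samples with augmented copies."""
    extra = [augment(images[rng.integers(len(images))], rng)
             for _ in range(target - len(images))]
    return list(images) + extra

# Toy minority class: three random 8x8 RGB images (hypothetical data).
rng = np.random.default_rng(0)
raw = [rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8) for _ in range(3)]
images = [normalize(img) for img in raw]
balanced = oversample_to(images, target=5, rng=rng)
```

In practice the target would be set per class so that minority classes such as class 5 (239 samples) approach the size of the larger classes.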

Experimental Criteria and Performance Measures
To simulate a federated learning environment, we designed two primary situations: (1) validation of the global model and local models by measuring the classification evaluation measures before and after observing new data samples, and (2) validation of the local models' overall classification accuracy and histogram clustering gradient for online training dataset formation.

Environment and Libraries
The experiments were carried out on the Google Cloud Platform (GCP) and Google Colaboratory on the GCP server (us-west1-b region) with a Compute Engine virtual machine and additional machine learning and deep learning libraries. To speed up the complex computing jobs, the authors used 16 virtual central processing units (vCPUs), with 104 GB random-access memory (RAM) and a single NVIDIA Tesla K80 graphics processing unit (GPU). The experiments were implemented using the Python 3 programming language and the libraries below.
4. OpenCV to perform image processing tasks;
5. NumPy and Pandas for data manipulation and processing;
6. Seaborn and Matplotlib for visualization of the results.

Hyperparameter Optimization and Performance Measures
To select hyperparameters for training the model, the authors of this study used a manual search strategy [37]. The authors acquired the optimized training hyperparameters after various tuning iterations through the manual search strategy, as shown in Table 1. They also followed best practices outlined by the research community, for example, selection of an optimization function (Adam) and cross-entropy selection (one-hot encoded). The classification accuracy is considered the most suitable metric to evaluate model performance in a nonstationary environment [38]. This study has used performance measures recognized as primary classification performance indicators by the research community [39,40].
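The loss and performance measures mentioned above can be made concrete with a short NumPy sketch. It assumes macro-averaged precision and recall, which the source does not specify, and the function names and toy labels are illustrative only.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Encode integer labels as one-hot rows."""
    return np.eye(n_classes)[np.asarray(labels)]

def cross_entropy(probs, targets, eps=1e-12):
    """Mean categorical cross-entropy for one-hot encoded targets."""
    return float(-np.mean(np.sum(targets * np.log(probs + eps), axis=1)))

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def macro_precision_recall(cm):
    """Per-class precision and recall, averaged over classes (assumption)."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)
    recall = tp / np.maximum(cm.sum(axis=1), 1)
    return float(precision.mean()), float(recall.mean())

# Toy two-class example (hypothetical labels).
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred, n_classes=2)
accuracy = float(np.trace(cm) / cm.sum())
precision, recall = macro_precision_recall(cm)
perfect_loss = cross_entropy(one_hot(y_true, 2), one_hot(y_true, 2))
```

With imbalanced classes, macro averaging weights every class equally, which is why precision and recall are worth reporting next to overall accuracy.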

Experimental Results and Discussion
The authors of this study performed two experiments to analyze the performance of the proposed framework. Initially, experiment 1 was carried out to validate the global and local models by measuring the classification evaluation measures before (case 1) and after (case 2) observing new data samples. Later, in experiment 2, the overall classification accuracy of local models (at edges) was measured. These experiments also allowed validating the histogram of clustering distance during edge training and validating the performance (accuracy and loss) during new sample adaptation.

Experiment 1: Validation of the Global and Local Models by Measuring the Classification Evaluation Measures before (Case 1) and after (Case 2) Observing New Data Samples
The primary intention of this experiment was to evaluate the performance of the proposed framework in a stable condition. The obtained results were promising, with a recorded classification accuracy of 95.6%, loss of 2.50 (as shown in Table 2), and 0.95 precision and recall (as shown in Table 3). This study analyzed the proposed framework's performance with a challenging dataset (complex features), i.e., the ISIC skin disease dataset. Here, the proposed framework was trained using four classes of the skin disease dataset, which were dermatofibroma (DF), vascular lesion (VASC), squamous cell carcinoma (SCC), and unknown (UKN). Despite the complex features and class imbalance problem in the skin disease dataset, the model's performance was satisfactory, even better than that reported in the literature. In case 2, the proposed model was trained on the first four classes and correctly classified them. Later, the proposed model incorporated four new classes (on which the proposed framework had not been trained), with a subsequent degradation in performance. The intent behind this experiment was to determine the accuracy of the proposed model at the cloud level. It was shown that the proposed framework outperformed those in the literature and achieved satisfactory classification accuracy. The reported classification accuracy was 89% (as shown in Table 2). Moreover, a significant level of precision and recall was noted. However, in this case, the loss increased to 3.5 from 2.5 in case 1 (as shown in Table 2). The model's overall performance after adapting to new classes was noticeably lower, because model training in offline mode generally performs better than that in online mode (online mode relies on online training dataset formation, which might contain noisy data). Thus, more advanced techniques are required to overcome these differences. Additionally, all the individual classification accuracies showed good performance after the arrival of new samples, as shown in Figure 10b.
This study also generated receiver operating characteristic (ROC) curve plots to demonstrate the tradeoff between sensitivity and specificity. A possible increase in sensitivity, accompanied by a decrease in specificity, was observed in the proposed framework before and after new class arrival, as depicted in Figure 8a,b. The ROC curve, which plots the true positive rate against the false positive rate, was close to the left-hand border and the top edge of the ROC space. In model1_SD, the obtained ROC curve was desirable in both cases, as shown in Figure 11. The intent behind this experiment was to test the overall classification accuracies of our deployed model, as well as apply its feature extraction technique to prepare for online training dataset formation. Lastly, the updated training and validation accuracies and loss are presented to validate the model's successful incorporation of the new dataset at the edges and on the cloud server.

Overall Classification Performance of the Cloud Models
Figure 12 shows the confusion matrix for all eight trained classes. It can be noted that the global model performed well in the stable scenario, with classification accuracy above 90%. Some classes performed exceptionally well, such as class 5 and class 1, with maximum correct predictions of 4005 and 3900 samples, respectively.

Histogram of Clustering Distance during Edge Model Training for New Samples
The histogram of clustering distances identifies how correctly the new samples are extracted and clustered for new dataset formation. New dataset formation is an essential procedure for training edges with newly collected samples. Table 4 represents the four classes of updated samples that were collected and clustered after the feature extraction process. The feature extraction process was done using the pretrained network. Later, the extracted features were clustered using the k-means clustering algorithm. In Table 4, it can be noted that class 6 clustered well, showing a mean value of 99.85, variance of 160.55, and standard deviation of 12.6. Class 4 showed worse performance with a mean value of 78.5, variance of 144.34, and standard deviation of 12.10. Well-clustered features ensure better training and validation accuracies when updating the edge models. Furthermore, they reduce the overfitting issue by avoiding bias toward a particular class.

Figure 13 represents each prediction input used for computing the cluster distance, assigned into histogram bins. Here, the y-axis is the number of predictions and the x-axis represents the cluster distance. Notably, the histogram represents how frequently the prediction/classification falls within a particular range of cluster distances.

Training Performance of Edge Models with New Samples
Figure 14 illustrates the training and validation accuracy and loss for the new sample adaptation at the edges. Figure 14a shows the training and validation accuracy and loss reported during the retraining on new samples at epoch 3. Figure 14b depicts the training and validation accuracy and loss reported during the retraining on new samples at epoch 6. Interestingly, in the presented results, it can be noted that the edge model learned the new samples by continuously increasing its training and validation accuracy and minimizing loss after each epoch. Furthermore, it can be noted that the observed loss and accuracy were stable at every epoch.
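The clustering-distance histogram discussed earlier in this subsection can be sketched with a small NumPy implementation of k-means. This is only an illustration under stated assumptions: synthetic two-dimensional points stand in for features extracted by the pretrained network, and the function name `kmeans_distances`, the number of clusters, and the bin count are hypothetical.

```python
import numpy as np

def kmeans_distances(features, k, rng, iters=20):
    """Cluster features with Lloyd's algorithm.

    Returns the cluster label of each sample and its Euclidean distance
    to the assigned centroid.
    """
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):                 # keep old centroid if cluster is empty
                centroids[j] = members.mean(axis=0)
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), dists.min(axis=1)

# Synthetic stand-in for extracted features: two well-separated blobs.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                      rng.normal(8.0, 1.0, size=(100, 2))])
labels, distances = kmeans_distances(features, k=2, rng=rng)

# Per-cluster statistics of the kind reported in Table 4.
stats = {int(j): (distances[labels == j].mean(),
                  distances[labels == j].var(),
                  distances[labels == j].std())
         for j in np.unique(labels)}

# Histogram: number of predictions (y-axis) per cluster-distance bin (x-axis).
counts, bin_edges = np.histogram(distances, bins=10)
```

A tight histogram concentrated at small distances suggests that the new samples group cleanly, which is the property the edge retraining relies on.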

Conclusions and Future Work
The majority of diagnoses in dermatology are based on visual pattern recognition of morphological features. Skin imaging technology currently involves dermoscopy devices, very-high-frequency (VHF) ultrasound, and reflectance confocal microscopy (RCM). Each method of skin imaging has its advantages and limitations. Dermatologists need to choose different imaging methods according to varying conditions of skin lesions. Skin imaging technology has become a vitally important tool for the clinical diagnosis of skin diseases, and it is widely accepted and applied in the world. At the same time, machine learning-based dermoscopy is exceedingly suitable for improving the diagnosis capabilities of dermatologists. Accordingly, this study proposed an intelligent dermoscopy device, which can be used by health practitioners for the clinical diagnosis of skin tumors. This study offers a continuous improvement in classification accuracy by developing a more robust solution through the adaptability mechanism. To ensure adaptability, this study proposed an adaptive federated machine learning-based model that can correctly classify the dermoscopy images for skin disease classification, capable of learning new features (new samples acquired during the classification task through the dermoscopy device). This study used the previously proposed online training and online classifier update (k-means clustering method for new training dataset formation). However, this study found that using the clustering-based mechanism to distinguish different classes for similar features resulted in some classification degradation after adaptation. Hence, a supervised learning mechanism should be used for new dataset formation. The proposed framework showed satisfactory performance for both the cloud and the edge models. The results demonstrated adequate classification performance (in terms of accuracy) for the edge models, which is essential for clinical trials. 
Furthermore, a patient-level mobile application is also offered to help patients locate the nearest dermatologist and to provide the necessary information regarding the skin disease to prevent negligence. However, this module was not practically tested herein and will be addressed in a future study. The authors also aim to develop a prototype (hardware) intelligent dermoscopy device for dermatologists (using the proposed federated machine learning-based model), which will be tested clinically.

Funding: This research study was conducted at the Universiti Teknologi PETRONAS (UTP), Malaysia, as a part of the research project "A novel approach to mitigate the performance degradation in big data classification model" under the matching grant scheme (Cost Centre: 015ME0-057).

Informed Consent Statement: Not applicable.
Data Availability Statement: The ISIC 2019 challenge dataset is available at https://challenge2019.isic-archive.com.