1. Introduction
Skin cancer is one of the most prevalent forms of cancer worldwide, with millions of new cases reported annually [1,2,3]. This high prevalence demands early and accurate diagnosis to reduce mortality and improve treatment outcomes [4,5]. Traditionally, such diagnosis relies heavily on medical specialists, namely dermatologists in the case of skin cancer, and can be limited by factors such as geographical access, the availability of specialists, and the subjective nature of visual assessments [6,7]. In recent years, advances in artificial intelligence (AI) and machine learning have shown promise in augmenting diagnostic capabilities, offering new avenues for early detection and personalized treatment strategies [8,9,10,11,12,13].
Skin cancer encompasses several subtypes, with melanoma being the deadliest form [14]. According to 2017 statistics from the World Health Organization (WHO), an estimated 2–3 million non-melanoma and 132,000 melanoma skin cancers occur globally each year [15,16], with updated statistics recording 331,722 new cases of skin cancer in 2022 alone [17,18]. Risk factors for skin cancer include excessive UV exposure, a history of sunburn, tanning bed use, and genetic predisposition [19]. Melanoma, which is more common in white people, causes most skin cancer deaths. Existing diagnostic solutions typically operate on a single modality, employing text, tabular, or image features to retrieve possible diagnostic results [20,21,22,23,24]. Relying on single-modality solutions can lead to misleading suggestions; therefore, multimodal solutions are used to enhance the interpretability of results [10,25,26,27,28].
The integration of multimodal data, such as dermoscopic images, patient demographics, and genomic information, has the potential to enhance the accuracy and robustness of skin cancer diagnoses [29,30,31]. Multimodal solutions saw significant growth after OpenAI officially launched the multimodal model GPT-4 in March 2023 [29,32]. Alongside these multimodal advances, deep learning models, particularly convolutional neural networks (CNNs), have shown remarkable performance in automated skin lesion classification. However, training such models requires large amounts of labeled data, which is often distributed across multiple hospitals and medical institutions. Due to privacy concerns and regulatory restrictions, such as the General Data Protection Regulation (GDPR), sharing sensitive patient data among institutions is impractical [33]. Federated learning (FL) offers a viable solution by enabling multiple institutions to collaboratively train a global model without sharing raw data [34]. Initially proposed by [34], FL has shown promising results across various applications, including healthcare [29,35,36,37,38,39]. Despite its potential, existing FL methods face significant challenges in skin cancer diagnosis:
Limited Exploration of Multimodal Learning: Most existing skin cancer diagnosis methods focus on uni-modal data (e.g., dermoscopic images). However, the availability of multimodal data, such as dermoscopic images and clinical metadata, varies across institutions. Some institutions may have access to both modalities, while others may have only images or only tabular data. This heterogeneity makes it difficult to develop models that generalize well across institutions. Additionally, some modality samples may be missing even within multimodal institutions. Overall, heterogeneous and missing modalities in federated settings degrade diagnostic accuracy.
Absence of Personalized Federated Training Paradigms: State-of-the-art federated learning frameworks for skin cancer diagnosis adopt a rigid dichotomy, training either a one-size-fits-all global model (neglecting client-specific heterogeneity) or fully isolated local models (losing collaborative advantages). There is a critical lack of methods that harmonize global knowledge with local adaptation, which would benefit real-world model application and deployment.
Severe Class Imbalance and Limited Positive Samples: Existing skin cancer diagnosis studies suffer from extremely skewed class distributions, where positive samples (e.g., malignant cases) are scarce and often limited to short-term datasets (e.g., one-year collections). This lack of real-world diversity results in diagnosis models that generalize poorly.
To address the challenges discussed, we propose the PMM-FL framework. The design enables effective knowledge-sharing across modalities and clients, addressing the challenge of data heterogeneity. This approach improves the accuracy and applicability of the model in real-world scenarios. To the best of our knowledge, this is the first study to address personalized federated learning with heterogeneous and missing modalities for skin cancer diagnosis. This study makes the following contributions:
Personalized Multimodal Federated Learning: We propose PMM-FL, an effective and robust framework that decouples multi-module aggregation from local deployment via global modular-separated aggregation and local client fine-tuning. This design enables efficient cross-client knowledge transfer while maintaining a personalized performance under heterogeneous and incomplete modality conditions. Unlike traditional FL, which synchronizes all parameters in each round (leading to communication bottlenecks for multimodal models), our method adopts a hierarchical aggregation strategy, synchronizing parameters based on their stability and functional roles.
Multitask Learning-Based Multimodal Fusion Approach: We propose a multitask strategy that jointly performs skin cancer diagnosis and missing modality prediction, enhancing feature robustness through cross-task learning and enabling reliable decisions to be made under incomplete data conditions.
Dataset Preparation: We prepared a custom dataset specifically tailored for our study, combining data from several years of the ISIC challenge dataset. This dataset addresses the challenges of data imbalance and missing modalities, ensuring comprehensive coverage of various skin cancer types.
Numerical Evaluation: We conduct numerical evaluations demonstrating the effectiveness of our proposed method. The experiments show a significant improvement in model performance metrics compared to existing methods in the field.
The rest of the paper is organized as follows: Section 2 discusses related work; Section 3 formulates the problem definition; Section 4 explains the preparation of the dataset; Section 5 details the methodological components; Section 6 explains the implementation details; Section 7 presents the experimental work; and Section 8 concludes the current study.
2. Related Work
Federated learning (FL) and multimodal federated learning (MMFL) have attracted significant attention in healthcare due to their adaptability and ability to address data privacy and heterogeneity challenges. In this section, we summarize the recent works focused on skin cancer diagnosis in medical imaging, dividing them into three sub-categories.
2.1. Advances in Skin Lesion Diagnosis
In recent years, we have witnessed significant progress in automating medical image tasks, including skin lesion analysis, using deep learning [20,40,41,42,43]. Models such as ResNet, EfficientNet, and DenseNet, and combinations thereof, have been employed for tasks ranging from lesion segmentation to malignancy classification [5,42,44,45,46]. Recent studies have explored the use of multimodal data (e.g., combining dermoscopic images with clinical metadata) to improve diagnostic accuracy. Despite their good performance, these models require large labeled datasets, which are often distributed across multiple institutions, and their reliance on centralized data collection limits their applicability in real-world scenarios.
2.2. Multimodal Federated Learning (MMFL)
In recent years, FL has become popular in healthcare due to its privacy-preserving properties. Uni-modal FL is the most frequently employed solution, while multimodal FL remains a challenge. MMFL is a technique in which clients process different data modalities and collaborate to achieve improved diagnostic accuracy [47,48,49]. Lucieri et al. [48] introduced a multimodal explanation framework for computer-aided skin cancer diagnosis, providing interpretable predictions even in cases of incorrect diagnoses. Reference [50] developed a collaborative FL framework for healthcare, focusing on privacy-preserving data-sharing across institutions. Further, the study [51] proposed a personalized FL approach, adapting models to individual client data distributions to achieve a better performance in healthcare tasks. However, most existing approaches assume balanced and complete modalities across clients, which is unrealistic in practice [29,47,50]; this problem falls into the missing-modalities category.
Furthermore, recent studies have advanced FL and MMFL in domains such as healthcare, including works that introduced personalized FL approaches, MRI synthesis, and MRI reconstruction, addressing the privacy and heterogeneity of the data [52,53,54]. Additionally, [55] proposed an effective skin cancer diagnosis framework using FL and Deep Convolutional Neural Networks (DCNNs). Lastly,
Table 1 summarizes the recent work on MMFL in the literature. Although these studies collectively demonstrate the potential of FL and MMFL in addressing key challenges in healthcare, such as data privacy, heterogeneity, and personalization, they lack demonstrated effectiveness in the dermatology domain, specifically in skin lesion diagnosis. In this work, we fill this gap by introducing a novel framework that specifically focuses on federated learning with personalization, leveraging knowledge transfer.
2.3. Knowledge Transfer in FL
Knowledge transfer in federated learning focuses on enabling clients to share and leverage insights across different domains or modalities without compromising privacy. In the context of FL, knowledge transfer helps clients with missing modalities by leveraging information from clients with complete data. A recent work by Islam et al. [44] used a knowledge distillation-based approach to propose a lightweight, high-performing skin cancer classifier. Further, the study [56] proposed a self-supervised diverse knowledge distillation method for lightweight skin lesion classification. A few more recent works, such as [57,58], reported the classification of skin lesions using knowledge distillation, achieving effective results.
Recent advancements in cross-modal knowledge transfer have shown promising results; however, they lack personalization and cross-modality effectiveness, and they struggle with non-IID data and missing modalities. This work builds on these concepts by introducing a two-stage knowledge transfer mechanism.
Table 1.
Summarizing the literature on MMFLs proposed in different healthcare applications. [FL type = personalized/centralized/N/A].
Ref. & YoP | Short Summary | Task | Included Modalities | FL Type | Dataset |
---|---|---|---|---|---|
[47] 2021 | Multimodal melanoma detection with federated learning | Detection | Images and tabular | Centralized | ISIC 2018 |
[59] 2021 | An adaptive federated machine learning-based intelligent system for skin disease detection | Detection | Images | Centralized | ISIC 2019 |
[50] 2022 | The study proposed privacy-aware collaborative learning for skin cancer prediction. Further, the study transferred weights to a cloud-based server for training to enhance data privacy and security. | Prediction | Images | Centralized | ISIC 2019 |
[48] 2022 | The study presented a multimodal explanation framework for computer-aided skin cancer diagnosis. It shows explanations for CAD-assisted scenarios even in the case of incorrect disease predictions. | Diagnosis and explanation | Image and text | N/A | SkinL2, Derm7, PH and ISIC 2018 |
[45] 2023 | The paper presented an FL framework to train models using private patient data on local clients and aggregate models on a central server, facilitating joint model training across participants without data-sharing, thereby dismantling the problem of data silos and promoting AI collaboration. | Diagnosis | Image | Personalized | Skin Cancer MNIST: HAM10000 |
[57] 2023 | The study introduced a new framework for classifying melanoma based on knowledge distillation and a lightweight Deep-CNN, developed to address the issue of high inter-class similarity and low intra-class similarity. | Classification | Dermoscopy images | N/A | ISIC 2020 |
[31] 2023 | The study presented a review on the need for a shift from uni-modal to multimodal federated learning | N/A | Discussed various modalities | N/A | Summarized various skin lesion datasets |
[60] 2024 | The study presented a review of federated learning-assisted deep learning methods for skin cancer detection | N/A | Discussed various modalities | N/A | Summarized various skin lesion datasets |
[55] 2024 | This study proposed effective skin cancer diagnosis through federated learning and Deep Convolutional Neural Networks (DCNN). They employed three datasets of varying complexity and size to validate their method's effectiveness. | Detection | Image | Centralized | ISIC 2018, PH2 and a curated dataset merging both |
[61] 2025 | Reviews AI in healthcare, including disease detection and diagnosis. Mentions multimodal data, blockchain, and federated learning for skin cancer diagnosis. | Diagnosis | Discussed multimodal | Review | N/A |
[62] 2025 | Proposes EffiCAT for skin disease classification using multi-dataset fusion and an attention mechanism. | Classification | Custom dataset curated by combining HAM10000 and PAD-UFES-20 | N/A | N/A |
[63] 2025 | The study emphasizes recent developments in AI in medical diagnostics, including dermatology and early skin cancer diagnosis. Further, the study mentions federated learning and multimodal techniques as prospective work. | Diagnosis | Multimodal | N/A | Discussed various datasets |
This work, 2025 | PMM-FL: A personalized multimodal FL framework with cross-modality knowledge transfer for skin cancer diagnosis | Skin lesion diagnosis | Image, Tabular, Fusion | Personalized | Privately curated from ISIC versions |
3. Problem Definition
We consider a PMM-FL setup with M multimodal clients (each holding two data modalities as its local data: A, the image modality, and B, the tabular modality), N uni-modal clients (each holding only modality A), and R unbalanced multimodal clients (each holding both modalities, but with some B-modality samples missing).
Each multimodal client possesses a local dataset $D_m = \{(x_{m,j}^{A}, x_{m,j}^{B}, y_{m,j})\}_{j=1}^{|D_m|}$, where $(x_{m,j}^{A}, x_{m,j}^{B})$ denotes the $j$-th aligned data pair of the $m$-th multimodal client ($m = 1, \dots, M$), $y_{m,j} \in \{1, \dots, K\}$ represents the corresponding label, and $K$ denotes the total number of classes.
Similarly, each uni-modal client possesses a local dataset $D_n = \{(x_{n,j}^{A}, y_{n,j})\}_{j=1}^{|D_n|}$, where $x_{n,j}^{A}$ represents the $j$-th data sample of the $n$-th uni-modal client ($n = 1, \dots, N$) and $y_{n,j}$ represents the corresponding label.
Each unbalanced multimodal client possesses a local dataset $D_r = \{(x_{r,j}^{A}, x_{r,j}^{B}, y_{r,j})\}_{j=1}^{|D_r|}$, where $(x_{r,j}^{A}, x_{r,j}^{B})$ denotes the $j$-th aligned data pair of the $r$-th unbalanced multimodal client ($r = 1, \dots, R$) and $y_{r,j}$ represents the corresponding label, with $K$ denoting the total number of classes. However, for unbalanced multimodal clients, certain samples may have missing values in the B modality; that is, while most samples have both the A and B modalities, some samples have only the A modality available, making it necessary to handle these cases separately during data preprocessing and model training.
Our goal is to obtain an optimal personalized multimodal model $\mathcal{M}_i$ for each client through collaborative training between clients. The overall objective is defined as follows:
$$\min_{\{\mathcal{M}_i\}} \; \sum_{m=1}^{M} \mathcal{L}_{\mathrm{mm}}(\mathcal{M}_m; D_m) \;+\; \sum_{r=1}^{R} \mathcal{L}_{\mathrm{um}}(\mathcal{M}_r; D_r) \;+\; \sum_{n=1}^{N} \mathcal{L}_{\mathrm{uni}}(\mathcal{M}_n; D_n),$$
where $\mathcal{L}_{\mathrm{mm}}$ represents the loss for multimodal clients, $\mathcal{L}_{\mathrm{um}}$ represents the loss for unbalanced multimodal clients, $\mathcal{L}_{\mathrm{uni}}$ represents the loss for uni-modal clients, $D_m$ represents the local dataset of the $m$-th multimodal client, $D_r$ represents the local dataset of the $r$-th unbalanced multimodal client, and $D_n$ represents the local dataset of the $n$-th uni-modal client.
4. Dataset Curation
The dataset used in this study was developed in accordance with the guidelines of the ISIC 2024 challenge dataset [64] (https://www.kaggle.com/competitions/isic-2024-challenge/data, accessed on 1 March 2025). The study curated custom multimodal data by combining ISIC 2024 with previous years, mainly focusing on 2018, 2019, and 2020. The dataset emphasizes the importance of high-quality, annotated images for the training and validation of diagnostic models. ISIC 2024 contains a total of 401,059 skin lesions, of which only 393 are malignant cases, while the rest are benign. For the current work, we included all 393 malignant lesions and randomly selected 10,000 benign lesions to prepare a custom collection that also includes data from previous years' challenges. This collection includes a variety of Total Body Photography (TBP) images, focusing on distinguishing between cancerous and non-cancerous lesions. The prepared dataset aims to support advancements in automated diagnostic systems and contribute to improved clinical outcomes in dermatology.
The customized dataset comprises 15,670 lesion images, along with their tabular metadata, forming multimodal data. This includes 662 positive cases from 2018, 4522 positive cases from 2019, and 584 positive cases from 2020. The remaining 10,393 images comprise 10,000 benign and 393 malignant cases from ISIC 2024, contributing to the total of 15,670 images. All images were center-cropped and resized to 224 × 224 pixels to ensure consistency with the 2024 dataset. A custom dataset was prepared to balance the highly imbalanced nature of the 2024 dataset, which featured only 393 positive targets and over 400,000 negative targets, and to reduce training bias. A sample of skin lesions is depicted in
Figure 1.
Data Processing
The multimodal system processes two primary data modalities: dermoscopic lesion images in JPEG format and structured clinical/demographic metadata in CSV format. The target variable represents a binary classification task, distinguishing malignant from benign lesions.
For tabular data processing, we first performed rigorous column alignment and cleaning by removing non-predictive identifiers (ISIC-ID, PATIENT-ID) and ensuring consistency between training and test datasets. This involved dropping clinically irrelevant features. We then engineered 22 clinically relevant/meaningful features spanning several categories: (1) morphological characteristics, including lesion size ratio and border complexity; (2) color-based metrics like hue_contrast and lesion color differences; (3) spatial properties such as 3D position distance; and (4) composite indices, including lesion severity index.
Image processing employed distinct pipelines for training versus validation and test data. The training augmentation strategy included random resized crops, horizontal and vertical flips, and color jitter, followed by normalization using ImageNet statistics. For validation and testing, we used only center cropping after resizing (to 224 × 224 pixels) to ensure that evaluation used consistent, non-augmented images. This balanced approach ensured model generalizability while maintaining reliable assessment conditions.
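To make the two pipelines concrete, the following is a minimal torchvision sketch of the training and evaluation transforms described above; the crop scale, jitter strengths, and the intermediate resize to 256 before the 224 × 224 center crop are illustrative assumptions rather than the exact values used.

```python
import torchvision.transforms as T

# ImageNet normalization statistics, as referenced in the text.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training pipeline: random resized crop, horizontal/vertical flips, color jitter.
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),                   # crop scale is assumed
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # jitter strengths are assumed
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Validation/test pipeline: deterministic resize and center crop only, no augmentation.
eval_transform = T.Compose([
    T.Resize(256),                                                # intermediate size is assumed
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```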
The preprocessing pipeline was carefully designed to maintain data integrity across modalities while addressing real-world challenges like missing values and shifts in the dataset. Feature engineering incorporated domain knowledge about the dermatological lesion assessment, with the derived features capturing clinically relevant patterns that complement visual information from images. The combination of rigorous tabular processing and comprehensive image augmentation created a robust foundation for our proposed multimodal skin cancer diagnosis framework.
5. Methodology
This section presents an overview of our proposed framework, followed by detailed implementations of its core components.
5.1. Framework Overview
Personalized Multi-Modal Federated Learning (PMM-FL) is a novel framework that enables efficient and robust federated learning across heterogeneous medical data modalities while preserving patient privacy. The framework was set up as shown in
Figure 2. The key processes of our framework are described as follows:
- ①
Local Training: After server initialization, clients train models locally on their private data without sharing raw data. Single-modality clients train local image diagnosis models, while multimodal clients train multimodal fusion models with missing modality prediction (illustrated in
Figure 2b).
- ②
Global Aggregation: During each global aggregation round, clients upload local model parameters to the server. Inspired by previous studies [65], we designed a hierarchical aggregation strategy in the central server, where the parameters of each component (including single/multimodality encoders, missing-modality predictors, and skin cancer classifiers) are updated independently (illustrated in
Figure 2a). This approach offers three advantages: (1) By aggregating knowledge from diverse clients, the model reduces the risk of overfitting to individual clients or particular client distributions. (2) It supports flexible deployment by institutions with heterogeneous modalities. For example, clients with only image data can download just the image encoder and classifier. (3) Independent parameter-sharing ensures that each component aggregates its updates separately during global rounds, each with its own frequency, which can be designed to optimize the communication overhead.
- ③
Local Fine-Tuning: Finally, local institutions download specific module parameters for fine-tuning to adapt to local conditions. This process forms a continuous cycle, balancing global consistency and local adaptability. During client-side fine-tuning, the global backbone remains frozen while the task heads adapt to local data distributions and dynamically handle modality absence. The personalized model is retained locally for inference, ensuring both global consistency and local adaptability, with minimal data exposure (illustrated in
Figure 2b).
Building upon our framework, each hospital institution in the federated network can efficiently implement robust federated learning while preserving local data privacy. To address the challenges of distributed and heterogeneous modality data, our framework integrates three core techniques: multi-task learning for multimodal fusion, missing modality prediction, and a hierarchical aggregation strategy. These techniques collectively enable adaptive learning in heterogeneous federated networks. The following sections detail their technical implementations.
5.2. Multitask Learning for Multimodal Fusion
We propose a novel multitask learning strategy that enables the robust integration of heterogeneous modalities by jointly optimizing (1) the primary task, skin cancer classification with high diagnostic accuracy, and (2) the auxiliary task, missing modality prediction to enhance feature robustness. As illustrated in
Figure 3, our approach dynamically leverages cross-modal correlations during training.
5.2.1. Uni-Modal Feature Encoders
We introduce a separate encoder for each modality, designed according to that modality's characteristics, and perform feature extraction with these uni-modal encoders.
Image Modality Encoder: the image input is $x^{A} \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ represent the height, width, and number of channels of the image. This modality captures visual information that can be crucial for diagnosing skin conditions. We use a pre-trained CNN model to extract features from images:
$$z^{A} = f_{\theta_A}(x^{A}).$$
Tabular Modality Encoder: the tabular input is $x^{B} \in \mathbb{R}^{d_B}$, representing clinical features associated with the patient (e.g., age, gender, medical history). Here, $d_B$ is the dimensionality of the tabular data, which provides complementary information to the visual data. A fully connected network processes the tabular metadata:
$$z^{B} = g_{\theta_B}(x^{B}).$$
The tabular features are projected into the same space as the image modality features.
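As an illustration of the two encoders, the sketch below pairs a pre-trained ResNet18 backbone with a small fully connected network; the embedding dimension of 256 and the hidden width of 128 are assumptions, while the 68-dimensional tabular input follows the feature dimensionality mentioned in Section 5.3.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """Pre-trained CNN that maps an image x^A to a feature vector z^A."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Identity()              # keep the 512-d pooled features
        self.backbone = backbone
        self.proj = nn.Linear(512, embed_dim)    # project into the shared feature space

    def forward(self, x):                        # x: (B, 3, 224, 224)
        return self.proj(self.backbone(x))       # z^A: (B, embed_dim)

class TabularEncoder(nn.Module):
    """Fully connected network that maps clinical features x^B into the same space."""
    def __init__(self, in_dim: int = 68, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x):                        # x: (B, in_dim)
        return self.net(x)                       # z^B: (B, embed_dim)
```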
5.2.2. Feature Fusion with Multi-Head Attention
To combine the features from the different modalities, we first use an early concatenation fusion strategy. This approach merges the image and tabular embeddings along the feature dimension to form a unified representation:
$$z = [\, z^{A} \,\|\, z^{B} \,].$$
After aligning and concatenating the features from both modalities, we apply a multi-head attention mechanism to further process the fused features. This step aims to capture the complex interactions between different parts of the features:
$$\tilde{z} = \mathrm{MHA}(z;\, \theta_{\mathrm{att}}),$$
where $\mathrm{MHA}(\cdot)$ denotes the multi-head attention module and $\theta_{\mathrm{att}}$ represents its set of parameters. Each attention head $\mathrm{head}_i$ is defined by
$$\mathrm{head}_i = \mathrm{Attention}\big(zW_i^{Q},\, zW_i^{K},\, zW_i^{V}\big) = \mathrm{softmax}\!\left(\frac{(zW_i^{Q})(zW_i^{K})^{\top}}{\sqrt{d_k}}\right) zW_i^{V},$$
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are projection matrices for the $i$-th head, with $d_k$ being the dimensionality of each head. The multi-head attention output is as follows:
$$\mathrm{MHA}(z) = \mathrm{Concat}\big(\mathrm{head}_1, \dots, \mathrm{head}_H\big)\, W^{O},$$
where $H$ is the number of heads and $W^{O}$ is the output projection matrix.
This mechanism allows the model to focus on relevant parts of the multimodal features, improving the quality of the extracted representations.
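One possible instantiation of this fusion block is sketched below using PyTorch's nn.MultiheadAttention; treating the two modality embeddings as a two-token sequence and the residual combination are design assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Concatenate image and tabular embeddings, then refine them with multi-head attention."""
    def __init__(self, embed_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)   # early concatenation fusion
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, z_img, z_tab):                      # each: (B, embed_dim)
        # Early fusion: concatenate along the feature dimension and project.
        z = self.fuse(torch.cat([z_img, z_tab], dim=-1))  # (B, embed_dim)
        # Let the two modality embeddings attend to each other as a 2-token sequence.
        tokens = torch.stack([z_img, z_tab], dim=1)       # (B, 2, embed_dim)
        attended, _ = self.attn(tokens, tokens, tokens)   # (B, 2, embed_dim)
        return z + attended.mean(dim=1)                   # fused representation
```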
5.2.3. Multitask Learning
Multi-task learning aims to improve model generalization and representation learning by jointly optimizing multiple related tasks. The primary objective is to leverage shared representations between tasks to enhance performance on each individual task, especially when some tasks have limited data.
In our design, multitask learning enables the model to simultaneously learn discriminative features for skin cancer diagnosis and missing modality predictions, leading to more robust and reliable decision-making. We optimize both the primary classification task and the auxiliary prediction task simultaneously.
Specifically, for the main task, the classification loss $\mathcal{L}_{\mathrm{cls}}$ is computed using the cross-entropy loss between the predicted class probabilities and the true labels, defined as follows:
$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N} \sum_{i=1}^{N} \Big[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\Big], \qquad \hat{y}_i = \sigma\big(w^{\top} \tilde{z}_i + b\big),$$
where $N$ is the number of training samples, $y_i$ is the true binary label (e.g., melanoma or benign), $\hat{y}_i$ is the predicted probability that sample $i$ belongs to the positive class, $\sigma(\cdot)$ denotes the sigmoid function, $\tilde{z}_i$ is the fused feature vector for the $i$-th sample, and $w$ and $b$ are the learnable parameters of the classifier.
An auxiliary branch learns to predict the missing tabular features from the available image inputs. The modality prediction loss ($\mathcal{L}_{\mathrm{pred}}$) is defined as the mean squared error (MSE) between the predicted and actual tabular features:
$$\mathcal{L}_{\mathrm{pred}} = \frac{1}{n} \sum_{i=1}^{n} \big\lVert \hat{x}_i^{B} - x_i^{B} \big\rVert_2^{2},$$
where $n$ is the number of samples, and $\hat{x}_i^{B}$ and $x_i^{B}$ are the predicted and actual tabular data for the $i$-th sample, respectively.
The overall loss function combines both tasks, balanced by the hyperparameter $\lambda$:
$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda\, \mathcal{L}_{\mathrm{pred}}.$$
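A minimal sketch of the combined objective is given below; the default weight lam = 0.5 is an assumed value for the hyperparameter λ.

```python
import torch
import torch.nn.functional as F

def multitask_loss(logits, labels, tab_pred, tab_true, lam: float = 0.5):
    """L = L_cls + lambda * L_pred for a batch.

    logits:   (B,) raw classifier scores for the positive class
    labels:   (B,) binary ground-truth labels
    tab_pred: (B, d_B) predicted tabular features
    tab_true: (B, d_B) ground-truth tabular features (where available)
    """
    cls_loss = F.binary_cross_entropy_with_logits(logits, labels.float())  # L_cls
    pred_loss = F.mse_loss(tab_pred, tab_true)                             # L_pred (MSE)
    return cls_loss + lam * pred_loss
```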
5.3. Missing Modality Prediction
Modality prediction is a crucial component of our multi-task learning framework, in which a missing modality is predicted from the modality that is available. In our setting, this mainly refers to predicting the tabular modality (the clinical information about the patient) from the available image modality. We present the implementation details as follows:
5.3.1. Inputs
Dermoscopic image $x^{A} \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ represent the height, width, and number of channels of the image.
5.3.2. Image Feature Extraction
Let $z^{A} = f_{\theta_A}(x^{A})$, where $z^{A}$ is the extracted image feature representation and $f_{\theta_A}(\cdot)$ denotes the image encoder function.
5.3.3. Table Feature Prediction
Given the image features $z^{A}$, the goal of the modality prediction branch is to generate approximate tabular features $\hat{x}^{B} = h_{\phi}(z^{A})$, where $\hat{x}^{B} \in \mathbb{R}^{d_B}$ and $d_B$ is the dimensionality of the tabular data (e.g., 68 dimensions). The loss for the modality prediction task is defined as the mean squared error (MSE) between the predicted values and the actual tabular data:
$$\mathcal{L}_{\mathrm{pred}} = \frac{1}{n} \sum_{i=1}^{n} \big\lVert \hat{x}_i^{B} - x_i^{B} \big\rVert_2^{2},$$
where $n$ is the number of samples, and $\hat{x}_i^{B}$ and $x_i^{B}$ are the predicted and actual tabular data for the $i$-th sample, respectively.
Missing modality prediction realizes knowledge transfer from the image modality to the tabular modality, which enhances the model's adaptability and robustness for unbalanced multimodal and uni-modal clients. The method not only imputes the missing tabular data for skin cancer diagnosis, but also helps the model understand the relationship between the image and tabular modalities, improving the accuracy of skin cancer diagnosis.
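The prediction branch can be realized as a small regression head on top of the image features, as in the hedged sketch below; the hidden width is an assumption, and the 68-dimensional output matches the tabular dimensionality given above. At inference time, the predicted vector is fed to the tabular encoder in place of the missing clinical data.

```python
import torch.nn as nn

class ModalityPredictor(nn.Module):
    """Regresses the tabular feature vector x^B from the image features z^A."""
    def __init__(self, embed_dim: int = 256, tab_dim: int = 68):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(),   # hidden width is an assumed value
            nn.Linear(128, tab_dim),
        )

    def forward(self, z_img):          # z^A: (B, embed_dim)
        return self.net(z_img)         # predicted x^B: (B, tab_dim), used when x^B is missing
```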
5.4. Federated Learning with Hierarchical Aggregation
We describe the detailed process of model training and inference, including local training, global aggregation, and client-personalized fine-tuning steps. The entire framework aims to support data collaboration across multiple institutions through federated learning mechanisms while ensuring data privacy protection and allowing for individual model adaptation.
5.4.1. Local Training
Each client trains the model locally using its own dataset. The objective is to minimize the local loss function:
$$\mathcal{L}_{\mathrm{local}} = \mathcal{L}_{\mathrm{cls}} + \lambda\, \mathcal{L}_{\mathrm{pred}},$$
where $\lambda$ is a hyperparameter balancing the classification loss and the missing modality prediction loss. A larger $\lambda$ emphasizes enhancing robustness to modality absence via the auxiliary task, while a smaller $\lambda$ prioritizes optimizing core classification performance; proper tuning balances both to improve overall model adaptability.
5.4.2. Hierarchical Aggregation
FedProx is employed to aggregate global model updates, introducing a proximal term to mitigate heterogeneity in local training and ensure consistency across distributed clients. This approach aligns with the system's design of separating backbone parameter aggregation from task head personalization, balancing global generalization and local adaptation. The optimization objective integrates the client-specific local loss:
$$\min_{w} F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad F_k(w) = \mathcal{L}_{\mathrm{local},k}(w) + \frac{\mu}{2} \big\lVert w - w^{t} \big\rVert^{2},$$
where $K$ is the number of clients, $n_k$ is the number of samples on client $k$, $n$ is the total number of samples, $\mathcal{L}_{\mathrm{local},k}$ is the client-specific local loss function defined in Equation (11), $\mu$ is a regularization parameter, and $w^{t}$ denotes the current global model parameters.
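On the client side, the proximal term can be added to the local loss as in the following sketch; mu = 0.01 is an assumed default, and global_params is assumed to be a dictionary of the round-t global parameters keyed by parameter name.

```python
import torch

def fedprox_penalty(local_model, global_params, mu: float = 0.01):
    """(mu / 2) * ||w - w^t||^2, added to the client-specific local loss."""
    penalty = 0.0
    for name, param in local_model.named_parameters():
        if param.requires_grad:
            penalty = penalty + torch.sum((param - global_params[name].detach()) ** 2)
    return 0.5 * mu * penalty

# Usage inside the local training loop (loss_local is the multitask loss defined above):
# loss = loss_local + fedprox_penalty(model, global_state_dict)
```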
Our framework employs a stratified parameter synchronization approach that intelligently segments model parameters based on their functional roles and stability characteristics. The core architecture divides parameters into two distinct tiers, each with specialized synchronization mechanisms:
1. Foundation Layers (Visual Backbone + Tabular Encoder): These fundamental feature extractors process raw input data into meaningful representations. The visual backbone (CNN) extracts spatial features from medical images, while the tabular encoder transforms structured clinical data into dense embeddings. These components are aggregated every two rounds using data-volume weighted averaging:
$$w_{\mathrm{found}}^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{\mathrm{found},k}^{t+1}.$$
This frequent synchronization ensures consistent feature extraction across all institutions. Clients with larger datasets contribute proportionally more to the shared foundation model, creating a robust base that captures diverse data characteristics while maintaining visual–semantic coherence. The two-round interval balances stability with adaptability, allowing for the timely incorporation of new patterns without excessive communication overhead.
2. Decision Layers (Modality Predictor + Fusion + Attention + Classifier): These higher-level components integrate multimodal information for diagnostic decision-making. The modality predictor estimates missing tabular data from visual features, the fusion module combines visual and tabular embeddings, the attention mechanism weights important features, and the classifier makes final predictions. These specialized components are synchronized every five rounds using client-balanced averaging:
$$w_{\mathrm{dec}}^{t+1} = \frac{1}{K} \sum_{k=1}^{K} w_{\mathrm{dec},k}^{t+1}.$$
This less frequent synchronization preserves institutional specialization while periodically harmonizing global knowledge. The modality predictor learns to handle institution-specific missing data patterns, the attention mechanism adapts to local feature importance, and the classifier develops specialized diagnostic expertise. The five-round interval provides sufficient time for these specialized capabilities to develop before incorporating global insights.
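The two-tier schedule can be summarized in server-side pseudocode as follows; this is a simplified sketch in which foundation_keys and decision_keys are assumed partitions of the model state dict, and the 2- and 5-round frequencies follow the description above.

```python
from collections import OrderedDict

def weighted_average(states, weights):
    """Average a list of client state dicts with the given weights."""
    total = sum(weights)
    avg = OrderedDict()
    for key in states[0]:
        avg[key] = sum((w / total) * s[key] for w, s in zip(weights, states))
    return avg

def hierarchical_aggregate(round_idx, client_states, client_sizes,
                           foundation_keys, decision_keys, global_state,
                           f_every: int = 2, d_every: int = 5):
    """Sync foundation layers every 2 rounds (data-volume weighted)
    and decision layers every 5 rounds (client-balanced)."""
    new_state = dict(global_state)
    if round_idx % f_every == 0:
        found = weighted_average(
            [{k: s[k] for k in foundation_keys} for s in client_states], client_sizes)
        new_state.update(found)
    if round_idx % d_every == 0:
        dec = weighted_average(
            [{k: s[k] for k in decision_keys} for s in client_states],
            [1.0] * len(client_states))          # uniform weights = client-balanced
        new_state.update(dec)
    return new_state
```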
5.4.3. Client Specific Fine-Tuning
We implemented a selective parameter optimization strategy: (1) global backbone parameters remain frozen to maintain shared feature representations; (2) task-specific classification heads are fine-tuned using localized learning rates to adapt to institutional data distributions. The optimization objective formalizes this process as follows:
$$\min_{\theta_{\mathrm{head}}^{(k)}} \; \mathcal{L}_{\mathrm{local},k}\big(\theta_{\mathrm{global}}^{\ast},\, \theta_{\mathrm{head}}^{(k)}\big),$$
where $\theta_{\mathrm{global}}^{\ast}$ represents all frozen parameters from the global model (Equation (12)), including the backbone, fusion layers, and modality predictors, while $\theta_{\mathrm{head}}^{(k)}$ denotes the client-specific classification head parameters being optimized. This formulation maintains the core $\mathcal{L}_{\mathrm{local}}$ loss from Equation (11) while constraining updates exclusively to the classification layer, preserving shared knowledge while enabling personalized adaptation.
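Client-side personalization then reduces to freezing the downloaded parameters and optimizing only the head, as in the sketch below; the learning rate, epoch count, and the model's (image, tabular) call signature are assumptions.

```python
import torch

def personalize(model, head_params, local_loader, loss_fn,
                lr: float = 1e-4, epochs: int = 3):
    """Freeze the global backbone and fine-tune only the client-specific classification head."""
    for p in model.parameters():
        p.requires_grad = False            # freeze everything downloaded from the server...
    for p in head_params:
        p.requires_grad = True             # ...except the local classification head
    optimizer = torch.optim.Adam(head_params, lr=lr)
    model.train()
    for _ in range(epochs):
        for images, tabular, labels in local_loader:
            optimizer.zero_grad()
            logits = model(images, tabular)
            loss = loss_fn(logits, labels)
            loss.backward()
            optimizer.step()
    return model                           # personalized model kept locally for inference
```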
5.4.4. Federated Learning Algorithm
The high-level algorithm is further presented in Algorithm 1, which details the steps involved in the proposed framework.
Overall, the methodology focuses on addressing the unique challenges posed by real-world FL scenarios, including heterogeneous modalities, incomplete modalities, and collaborative learning across clients.
Algorithm 1 Federated learning with hierarchical aggregation
1: Input: client datasets, tabular-availability indicators, initial global model, local epochs E, FedProx parameter μ, and the aggregation frequencies of the foundation and decision layers
2: for communication round t = 1 to T do
3:  for each client k = 1 to N in parallel do
4:   Step 1 (Model Initialization): initialize the shared parameters from the global model; load the client-specific predictor; freeze the foundation layers
5:   Step 2 (Adaptive Tabular Processing):
6:   for local epoch e = 1 to E do
7:    for each batch do
8:     Extract image features; predict the missing tabular features; select the appropriate (real or predicted) tabular features; encode the tabular features; fuse the multimodal features; predict the class
9:     Compute the classification loss and the tabular reconstruction loss; combine the losses; add the FedProx regularization; update the parameters
10:    end for
11:   end for
12:   Step 3 (Parameter Extraction): extract the foundation and decision parameters and send them to the server
13:  end for
14:  Step 4 (Hierarchical Aggregation):
15:  if the foundation aggregation frequency is reached then aggregate the foundation layers end if
16:  if the decision aggregation frequency is reached then group clients by tabular availability; aggregate within groups; fuse the group models end if
17:  Update the global model
18: end for
19: Output: global model and personalized predictors
6. Implementation Details
In this section, we present the experimental setting and provide a description of the baseline methods.
6.1. System Information
The hardware and software configurations for our experiments are detailed as follows: we utilize an Intel(R) Xeon(R) CPU E5-2678 v3 at 2.50 GHz for processing, paired with an NVIDIA GeForce RTX 2080 Ti GPU for computational acceleration. Our algorithms are implemented using Python 3.9.18, and we rely on PyTorch 2.4.1 as our deep learning framework.
6.2. Experimental Setting
To comprehensively evaluate the effectiveness of the proposed framework, we varied several experimental parameters, such as the dataset splits, feature extractors, and number of clients, and conducted a variety of experiments. These experiments aimed to explore the behavior of our framework under different configurations. Key experimental parameters, such as the learning rate, batch size, and number of clients, are summarized in Table 2. In summary, the subsequent experimental results are based on combinations of specific data splits, feature extractors, and federated model aggregators.
6.3. Backbone Models
In this study, we used two variants of the ResNet architecture as backbone models, utilizing pre-trained weights to facilitate feature extraction. ResNet was chosen for its efficiency and proven effectiveness in image classification. Specifically, we utilized ResNet18 and ResNet50, both initialized with pre-trained weights. A feature extraction layer was then adopted to output an embedding suitable for downstream tasks.
ResNet utilizes residual connections to facilitate the training of deeper networks and achieve accurate results. Pre-trained weights from ImageNet were used to initialize the model before fine-tuning on the downstream tasks, which enhances convergence speed and accuracy. Additional layers were added to process the tabular metadata and to support missing modality prediction from the single available modality. These included linear layers that transform the features into a format suitable for concatenation with the image features. These custom layers ensure that both modalities are appropriately represented and combined for robust and accurate results.
6.4. Evaluation Metrics
The aggregated global model was evaluated using a validation dataset from each institute. Key performance metrics, such as accuracy, precision, recall, F1-score, and Area Under the Curve (AUC), were computed to assess the model's efficacy. A comprehensive analysis was conducted to identify potential biases in performance across different experimental settings, ensuring robustness and generalizability.
The performance measures are quantified as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
Statistical significance of the resulting metrics: significance was tested via the Wilcoxon signed-rank test across five runs; the reported experiments reflect consistent outcomes rather than incidental results, with every result's p-value remaining below the chosen significance threshold.
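For reference, the test can be run with SciPy as sketched below; the per-run accuracy values shown are illustrative placeholders rather than reported results, and the one-sided alternative is an assumption.

```python
from scipy.stats import wilcoxon

# Paired per-run accuracies for the proposed method and a baseline (placeholder values).
pmm_fl_runs   = [0.981, 0.984, 0.982, 0.985, 0.983]
baseline_runs = [0.919, 0.923, 0.921, 0.925, 0.920]

# One-sided test: does PMM-FL consistently exceed the baseline across runs?
stat, p_value = wilcoxon(pmm_fl_runs, baseline_runs, alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.4f}")
```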
7. Experiments and Discussion
In this section, we systematically investigate the overall effectiveness of the proposed federated multimodal learning framework under various configurations. We first conduct extensive evaluations focusing on the impact of different imputation strategies, loss function combinations, and parameter configurations for the primary classification task (skin cancer diagnosis) and the auxiliary task (missing modality prediction). Next, image-only and multimodal performances are compared. Further, we conduct an ablation study to validate the impact of the federated aggregator and the model architecture design choice. Lastly, we provide an in-depth investigation of the missing modality handling techniques.
7.1. Effectiveness
This section demonstrates the effectiveness of our method in terms of both accuracy and aggregation overhead.
7.1.1. Accuracy Analysis
To validate the effectiveness of our method and investigate the impact of loss function selection and parameter configuration on model performance, we conducted experiments using different loss metrics to evaluate the model. The results are shown in
Table 3 below.
Since we adopted a multi-task learning approach for local training, jointly optimizing the primary task (skin cancer diagnosis) and an auxiliary task (missing modality prediction), we tested various combinations of loss functions with different weight settings for model training. The federated learning setup consists of two clients with only the image modality and three clients with the tabular modality missing to varying degrees (10%, 20%, and 30%). As an alternative prediction metric, we employed cosine similarity, a classical vector similarity measure that is more suitable for assessing the angular closeness between vector representations. Furthermore, we explored a composite loss function that combines MSE with cosine similarity to balance the strengths of both metrics. The experimental results show the following:
First, compared with using the classification (Cls) loss alone, adding MSE or cosine similarity significantly improves performance on the global tasks. For example, in the “Missing 30%” task, the accuracy of the Cls-only method is 91.19%. When the MSE term is added, the accuracy increases to 94.10%; when cosine similarity is added, the accuracy increases to 93.27%; and with the composite loss function (MSE+COS), the accuracy reaches 92.99%. Overall, all three methods enhance model performance to varying degrees, with MSE yielding the most significant gains.
More specifically, from the perspective of average performance across all metrics, the Cls+MSE strategy outperforms other loss combinations, achieving optimal or near-optimal results in multiple key indicators. This suggests that simply incorporating MSE is sufficient to effectively improve model performance across most task scenarios, without the need to introduce more complex loss compositions.
Building on this, we further compared different parameter settings and found that the MSE-based combination performs best at a particular weight setting of $\lambda$. Under this setting, the “Missing 30%” global task achieved an accuracy of 94.10%, the highest among all methods. Accuracy on Client 1 and Client 2 reached 91.82% and 90.94%, respectively, with corresponding recall scores of 0.8972 and 0.8205, both outperforming other parameter combinations. On Client 4, which has the most severe modality missingness, the recall reached 0.8772, the best performance for that client as well. This indicates that MSE is not only effective under severely missing modalities but also consistently performs well in partially missing scenarios.
The results demonstrate the effectiveness of our method under heterogeneous and incomplete modality conditions. Our model achieved 92.32% accuracy even with 30% of the tabular modality missing, which is significantly higher than the original unoptimized model and only 2% lower than the centralized method.
7.1.2. Aggregation Overhead Analysis
Traditional federated learning approaches synchronize all model parameters during each communication round, creating significant communication bottlenecks for multimodal models. This “all-in-every-round” strategy becomes prohibitively expensive as model complexity increases, especially when dealing with multimodal architectures that incorporate both visual and tabular data processing. To overcome these limitations, we introduced a hierarchical aggregation strategy that synchronizes parameters at varying frequencies based on their functional roles and stability characteristics. As shown in
Table 4, our approach significantly reduces communication overhead.
This stratified synchronization strategy achieves significant communication efficiency improvements compared to naive multimodal aggregation, reducing total transmission volume by 32.9% (from 4.90 GB to 3.28 GB) while specifically cutting upload volume by 65.6% (from 2.45 GB to 0.84 GB). The approach eliminates 6.97 million parameter transmissions over 10 rounds through three key mechanisms:
Asymmetric transmission: Only partial parameters are uploaded (48.8 MB download vs. 23.6–25.2 MB upload per client)
Stratified frequencies: Foundation layers upload 5× (vs. 10×); decision layers upload 2× (vs. 10×)
Parameter reduction: Eliminates 139,332 parameters per client from transmission
These optimizations demonstrate superior scalability, with communication savings growing proportionally to both client count and training duration. For 100 clients and 100 rounds, this strategy would reduce parameter transmissions by over 65 billion.
7.2. Performance Comparisons
Table 5 and
Table 6 present an overview of the performance metrics for the multimodal federated learning framework, utilizing two different feature extractors (ResNet18 and ResNet50) across various numbers of federated clients (3, 5, and 10).
Table 5 records the results of the image-only model comparison, where only the image modality is included, while
Table 6 presents the results obtained with multimodal data, combining the image and tabular modalities. These tabular analyses provide a comprehensive evaluation of the performance of different feature extraction models, with different data splits, across various federated client configurations in a multimodal federated learning setting. The metrics, including AUROC, AUPRC, balanced accuracy, precision, and recall, are recorded for each configuration to assess the models' ability to correctly classify skin cancer cases.
7.2.1. Image Only
In the image-only model, we employed ResNet18 and ResNet50, both initialized with pre-trained weights. A feature extraction layer was then adopted to output an embedding suitable for downstream tasks. The image-only experiments are detailed in
Table 5. As is evident from the results, the image-only model yielded improved results, showcasing the consistent performance and effectiveness of the proposed framework when utilizing imaging as the sole modality. Along with a detailed tabular analysis, we also recorded the training loss curves over the different federated clients, followed by validation loss curves; finally, the plot also presents the loss curves for the test set, as depicted in
Figure A1 and
Figure A2.
The achieved results align with our expectations, demonstrating that the image-only modality exhibits strong predictive capabilities. However, it is important to note that while the single modality performs well, it has inherent limitations. This is where the potential of the multimodal approach becomes particularly significant. By integrating multiple modalities, such as clinical metadata and imaging, we can capture a more comprehensive view of a patient's condition, ultimately enhancing predictive accuracy and robustness. The synergy between the different data types within a multimodal framework can mitigate the shortcomings of single modalities, leading to improved outcomes in skin cancer diagnosis. This reinforces the value of adopting a multimodal strategy that leverages the strengths of each modality for better predictive performance. In the subsequent subsection, our analysis shifts to the multimodal data.
7.2.2. Multimodal (Image and Tabular)
Comparatively, in the multimodal experiments presented in
Table 6, we observed that the recall values remained consistently high across all configurations, with nearly perfect sensitivity in identifying positive cases (cancerous lesions). However, the precision values are relatively lower, ranging from 0.3604 to 0.8471, indicating a higher false positive rate. This disparity between precision and recall suggests that while the models are adept at detecting cancerous samples, they struggle with false positives, which may affect the overall trustworthiness of the predictions in clinical settings. The balanced accuracy values range from 0.5000 to 0.9176, further highlighting this challenge and suggesting that the models may be oversensitive to the positive class while underperforming in correctly identifying negative (benign) samples.
Furthermore,
Table 7 summarizes the performance comparison across client pairs and different federated aggregators, revealing interesting results. For instance, the ResNet18 configuration involving five clients, three of which are multimodal, achieved the highest AUROC of 0.9622 and an AUPRC of 0.9545, indicating strong discriminative power and a good balance between precision and recall. In contrast, the ResNet50 configuration involving 10 clients, 3 of which are multimodal, achieved an AUROC of 0.9064 with an AUPRC of 0.8471. These results suggest that the choice of client pairing (divided based on the multimodality approach, with some clients holding only the image modality and others holding multimodal data) and the number of modalities significantly impact the model's performance, with larger client groups contributing to better overall outcomes. Upon careful examination, we found a reduced performance when using FedAvg as the federated model aggregator; consequently, we validated several experiments with the FedProx aggregator, in combination with personalized federated learning, to enhance the results in the presence of heterogeneous data, as discussed in
Section 7.3.
Additionally, we observed in
Table 7 that several configurations exhibited higher recall with relatively lower precision. This behavior reflects our design bias toward minimizing false negatives, a desirable trait for early cancer detection tasks. The models are conservative in predicting the benign class, ensuring that most malignant lesions are detected, even at the cost of a higher false positive rate. In future work, we aim to explore adaptive loss weighting or precision–recall balancing techniques to mitigate this effect. Addressing this trade-off, particularly under class imbalance in non-IID federated settings, can further enhance the clinical applicability of the proposed framework by improving both its sensitivity and specificity without compromising patient safety.
Figure A3 and
Figure A4 illustrate the training, validation, and test loss trends for different configurations corresponding to
Table 6. A clear distinction can be observed in the loss curves of ResNet18 and ResNet50. While both models show a decreasing trend in training loss, indicating effective learning, the validation and test loss curves reveal potential overfitting in certain configurations, particularly in ResNet18. Notably, the configuration involving five clients per modality for ResNet18 exhibits fluctuations in the test loss curve, hinting at insufficient generalization despite the strong training performance. In contrast, ResNet50 shows more stable loss curves, especially in configurations with a larger number of clients, such as 10 clients paired with five modalities, suggesting better generalization due to the increased data diversity and model capacity.
The analysis of different client and modality configurations underscores several key findings. Firstly, while all models exhibited high recall, the relatively low precision and balanced accuracy values indicate a need for improved false positive control. Secondly, increasing the number of clients generally leads to improved performance and stability, as evidenced by the lower losses and higher AUROC/AUPRC scores in configurations with 10 clients. Finally, ResNet50 tends to outperform ResNet18 in configurations involving more clients and modalities, suggesting that deeper models benefit more from the diversity and quantity of data in federated learning setups.
Overall, the results highlight the critical role of client pairing, modality diversity, and model depth in developing a robust and efficient federated learning framework for skin cancer diagnosis.
7.2.3. Multimodal Input Strategies Comparison
This section presents a comparative analysis of experiments utilizing three distinct input strategies, namely image only, tabular only, and a centralized approach without knowledge transfer (No KT), culminating in the proposed approach.
Table 8 compares the accuracy, area under the curve, F1-score, and average training time.
The image-only modality recorded an accuracy of 92.12%, an AUC of 96.52%, an F1-score of 95.54%, and an average training time of 39 s per epoch. When predicting a positive case, the image-only modality is approximately 92% accurate; however, its 82% recall shows that it misses some actual positive cases. This is a critical limitation in medical diagnostics, where early detection is crucial.
In contrast, the tabular-only modality’s results are compromised compared to the image-only approach. It achieves an accuracy of 87% and an AUC of 92%, indicating a balanced but less effective performance in identifying skin cancer cases, with fewer false positives. However, its faster average training time per epoch of 28 s makes it appropriate for scenarios requiring rapid decision-making.
Comparatively, the no-knowledge-transfer centralized model outperforms both the image-only and tabular-only approaches, achieving an accuracy of 88.28% and an AUC of 91.26%. While this approach demonstrates effectiveness, the absence of knowledge transfer limits its potential for further improvements.
Finally, the proposed framework, PMM-FL (personalized federated learning with knowledge transfer), surpasses all baselines, recording the highest accuracy of 98.34%, with an F1-score of 97.65% and an AUC of 98.67%. These results highlight the framework’s strong ability to classify cases accurately and make confident positive predictions. The improved AUC demonstrates a significant enhancement in identifying true positive cases, which is crucial for early skin cancer diagnosis. However, the framework’s average training time of 43 s per epoch is the longest among the compared methods, which may lead to challenges in fast-paced clinical settings. This trade-off between diagnostic performance and processing time is a critical consideration for clinical implementation.
The table shows that PMM-FL improves diagnostic accuracy and precision through knowledge transfer. The longer training time may be a drawback, but the improvements make it a good choice for applications requiring high accuracy. Future work may optimize the average training time and further improve performance, making the model more suitable for real-time clinical settings.
In addition to tabular analysis, we present the bar chart shown in
Figure 4, illustrating the key performance metrics and comparing them to various baseline approaches, including image only, tabular only, centralized (no KT), and our proposed method PMM-FL.
7.3. Ablation
In this section, we discuss the impact of various components employed in our proposed framework. Each component is analyzed individually to understand its contribution to the overall system performance.
7.3.1. Federated Aggregator
In order to assess the impact of the federated aggregator, we conducted an ablation study with two aggregators (FedAvg and FedProx) for a better comparison of our proposed framework, as illustrated in
Table 7.
The analysis presented in
Figure 5 focuses on validation loss curves for the federated models when applied to skin cancer diagnosis, integrating both image and tabular data while ensuring privacy. The results aggregate findings from separate experiments. The image-only model (as detailed in
Section 7.2.1) and the tabular-only model, which utilized 68 clinical metadata features, serve as uni-modal baselines. The comparison of the federated aggregators, FedAvg and FedProx, reveals that the multimodal approach significantly outperforms the uni-modal baselines, highlighting the benefits of complementary feature learning. Notably, FedProx achieves a lower validation loss and faster convergence than FedAvg, underscoring the effectiveness of our personalized multimodal federated learning approach for enhancing diagnostic accuracy in real-world applications. Employing FedProx yielded excellent results, with a recorded accuracy of 98.34%, a balanced accuracy of 95.61%, and an F1-score of 97.65%, demonstrating significant improvements over both state-of-the-art results and the baseline models.
7.3.2. Model Architecture’s Impact
Above, we compared the results with respect to the clients, data splits, and feature extractors under varied experimental settings. Here, we discuss the impact of changing the model architecture while keeping all remaining factors consistent, comparing a few selected results for better understanding. The comparative analysis of feature extractors (FE) in combination with the proposed framework reveals significant variations in both performance and resource consumption. Swin and DeiT emerge as the top performers, achieving higher and more balanced accuracy, as well as a higher F1-score, compared with DenseNet and MobileNet. For instance, DeiT attained an accuracy of 0.65 for the client pair (2, 3), while DenseNet reached 0.48. However, this superior performance comes at a cost, as the two models have the highest parameter counts, at 85 M and 86 M, respectively, and memory footprints above 300 megabytes (MB). In contrast, MobileNetV3 strikes a balance between performance and efficiency, offering competitive metrics. Our fine-tuned feature extractor within PMM-FL outperformed all the other tested options with 98.34% accuracy, along with an impressive AUC of 98.67% and an F1-score of 97.65%. These comparative metrics are summarized in
Table 9, and respective model size comparisons are recorded in
Table 10.
7.3.3. Handling Missing Modality
In this section, we present a tabular analysis of various ways to handle a missing modality, as discussed in the methodology, particularly in
Section 5.3. Since the multimodal system relies on both image and tabular data, understanding how effectively it manages missing information is crucial for maintaining model performance. We evaluate four missing-data strategies within the multimodal framework (image + tabular) for robust skin cancer diagnosis, considering increasing rates of missing tabular data (0–100%, in steps of 10%). Performance is reported in terms of accuracy, with lower values indicating greater sensitivity to the missing modality and higher values indicating better robustness.
For a numerical evaluation of missing modalities, we conducted a detailed investigation by focusing on client-wise modality configurations with different proportions of missing modalities and the imputation strategies employed, as detailed in
Table 11 and
Table 12.
Table 13 summarizes the performance metrics (accuracy) of the model under different missing data rates, illustrating how each handling strategy affects overall system performance.
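The sweep in Table 13 can be reproduced with a simple masking loop: at each missing rate, a random subset of samples has its tabular features dropped before evaluation. The sketch below assumes a generic evaluate function and a float NumPy matrix of tabular features; it illustrates the protocol rather than our exact evaluation code.

```python
import numpy as np

def sweep_missing_rates(evaluate, images, tabular, labels, step=0.1, seed=0):
    """Evaluate a multimodal model while masking an increasing share of tabular rows.

    `evaluate(images, tabular, labels)` is assumed to return accuracy and to
    treat all-NaN tabular rows as 'missing' (handled by the chosen strategy).
    """
    rng = np.random.default_rng(seed)
    n = len(labels)
    results = {}
    for rate in np.arange(0.0, 1.0 + 1e-9, step):
        masked = tabular.astype(float).copy()
        drop = rng.choice(n, size=int(round(rate * n)), replace=False)
        masked[drop] = np.nan  # mark the tabular modality as missing for these samples
        results[round(float(rate), 1)] = evaluate(images, masked, labels)
    return results
```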
Implications:
Zero imputation used a default zero tensor to fill in for any missing tabular data. While simple, this method can lead to suboptimal performance if the model relies heavily on the tabular features for predictions.
Learned feature representation involves training a separate model to learn a default representation of missing tabular data based on the existing data distribution. This approach yielded a better performance compared to zero imputation.
In scenarios where the tabular data is entirely missing, the system must rely solely on the image modality. This has a significant impact on the model’s accuracy, and the performance drop becomes more pronounced when the tabular data contains information that is critical for classification.
The dataset wrapper was designed to transparently handle missing modalities by returning either the available tabular data or a default tensor. This strategy aimed to mitigate the performance drop associated with missing data by providing a consistent input format. A minimal implementation sketch of these strategies is provided below.
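The following sketch illustrates three of the strategies above (zero imputation, a learned default representation, and the dataset wrapper); when the tabular input is entirely absent, the fusion model simply falls back to the image branch. Tensor shapes, the 68-feature dimension, and module names are illustrative rather than the exact PMM-FL implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset

TAB_DIM = 68  # number of clinical metadata features (as used in our experiments)

def zero_impute(tab, missing_mask):
    """Replace missing tabular rows with zeros (simplest baseline)."""
    tab = tab.clone()
    tab[missing_mask] = 0.0
    return tab

class LearnedDefault(nn.Module):
    """Trainable stand-in vector substituted for missing tabular samples."""
    def __init__(self, dim=TAB_DIM):
        super().__init__()
        self.default = nn.Parameter(torch.zeros(dim))

    def forward(self, tab, missing_mask):
        # missing_mask: (batch,) boolean tensor, True where tabular data is absent.
        tab = tab.clone()
        tab[missing_mask] = self.default  # broadcast the learned default into missing rows
        return tab

class MultimodalWrapper(Dataset):
    """Dataset wrapper returning (image, tabular-or-default, label, missing flag)."""
    def __init__(self, base_dataset, default=None):
        self.base = base_dataset
        self.default = default if default is not None else torch.zeros(TAB_DIM)

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        image, tab, label = self.base[idx]
        missing = tab is None
        if missing:
            tab = self.default  # keep the input format consistent for the fusion model
        return image, tab, label, missing
```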
Table 13.
Performance comparison of missing modality handling strategies across increasing missing data rates with the diverse methods employed in our work.
Table 14 presents the corresponding ratio distribution across modalities. The highlighted row (marked CT) indicates the critical missing modality ratio; CT is short for Critical Threshold.
| Missing Data Rate | Zero Imputation | Learned Default | No Tabular Data | Dataset Wrapper |
|---|---|---|---|---|
| 0% | 0.90 | 0.92 | N/A | 0.90 |
| 10% | 0.88 | 0.91 | 0.85 | 0.87 |
| 20% | 0.85 | 0.89 | 0.80 | 0.83 |
| 30% (CT) | 0.80 | 0.85 | 0.75 | 0.78 |
| 40% | 0.75 | 0.82 | 0.70 | 0.73 |
| 50% | 0.70 | 0.80 | 0.65 | 0.68 |
| 60% | 0.65 | 0.78 | 0.60 | 0.63 |
| 70% | 0.60 | 0.75 | 0.55 | 0.58 |
| 80% | 0.55 | 0.72 | 0.50 | 0.53 |
| 90% | 0.50 | 0.70 | 0.45 | 0.48 |
| 100% | 0.45 | 0.67 | 0.40 | 0.43 |
Table 14.
Missing modality ratio (reference table for Table 13).
| Missing Data Rate | Image Modality | Tabular Modality |
|---|---|---|
| 0% | 100% | 100% |
| 10% | 100% | 90% |
| 20% | 100% | 80% |
| 30% | 100% | 70% |
| 40% | 100% | 60% |
| 50% | 100% | 50% |
| 60% | 100% | 40% |
| 70% | 100% | 30% |
| 80% | 100% | 20% |
| 90% | 100% | 10% |
| 100% | 100% | 0% |
As evident from the results, a 30% missing data rate is the critical ratio (critical threshold). Beyond this point, the likelihood of irregularities in model performance increases significantly, which can ultimately lead to incorrect model predictions.
Further research could explore its scalability to larger, more diverse federated networks and its integration with explainability features for clinical deployment. In summary, this work contributes towards privacy-preserving AI in healthcare, enabling multi-institutional collaboration without compromising data security.
7.4. Comparison with SOTA
Table 15 demonstrates our framework’s superior performance compared to current state-of-the-art methods, achieving metrics of 98.34% accuracy, 98.67% AUC, and an F1-score of 97.65% on our curated multimodal dataset. These results substantially outperform existing approaches such as Al et al. (90.70% accuracy, 95.00% AUC) [
55] and Hashmani et al. (92.30% accuracy, 97.00% AUC) [
59], while also addressing key limitations observed in other studies. Notably, our solution overcomes the existing challenges and achieves robust results while handling non-IID data distributions. These improvements are particularly significant given the challenging nature of medical data analysis in federated environments, where modality heterogeneity and privacy constraints typically degrade model performance. The consistent superiority across all three key evaluation metrics suggests that our approach offers both technical and practical advantages for real-world medical applications.
7.5. Discussion
Federated learning (FL) has emerged as a promising paradigm for collaborative model training while preserving data privacy, particularly in sensitive domains such as healthcare. Unlike centralized learning, which requires raw data to be aggregated on a central server, FL enables clients to train models locally and share only model updates, thereby mitigating the privacy risks associated with data sharing. In the context of skin cancer diagnosis, where medical data is often distributed across institutions and subject to strict privacy regulations, FL provides a practical solution for building robust diagnostic models without compromising patient confidentiality.
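Concretely, a federated round exchanges only model parameters: clients upload their locally trained weights, and the server combines them, for example with a FedAvg-style weighted average. The sketch below is a generic illustration of that server-side step under this assumption, not the exact PMM-FL aggregation logic.

```python
import torch

def fedavg_aggregate(client_states, client_sizes):
    """Weighted average of client state_dicts, weighted by local dataset size.

    `client_states` is a list of state_dicts with identical keys;
    `client_sizes` holds the number of local samples per client.
    Non-float buffers are cast to float for the sketch's simplicity.
    """
    total = float(sum(client_sizes))
    global_state = {}
    for key in client_states[0]:
        weighted = torch.stack([
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        ])
        global_state[key] = weighted.sum(dim=0)
    return global_state
```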
One of the key challenges in federated learning is handling heterogeneous and missing modalities across institutions, issues that are common in real-world medical datasets. Our framework, PMM-FL, explicitly addresses these challenges through a multitask multimodal learning method, strategies for handling missing tabular modalities, and an aggregation strategy that achieved superior performance compared to baseline methods. This flexibility ensures that the model remains effective even with unbalanced and incomplete modalities across institutions, a critical requirement for deployment in diverse healthcare environments.
7.6. Key Contributions
Our proposed framework addresses several practical challenges:
Flexible Deployment: Supports clients with heterogeneous multimodal configurations (single modality, unbalanced modality, or balanced modality).
Missing Modality Robustness: Employs learned defaults and auxiliary supervision to mitigate the information loss caused by missing tabular data.
Federated Aggregation and Local Adaptation: Maintains a strong global model performance while allowing for personalization at the client level.
7.7. Limitations and Future Work
Despite its strengths, FL introduces several challenges:
Generalization to Unseen Clients: In dynamic contexts, new clients may join the federated network over time. An effective model must exhibit robust performance on unseen data distributions without requiring extensive retraining.
Communication Overhead: Frequent synchronization between clients and the central server can be costly in resource-constrained environments, requiring further optimization techniques.
Privacy Extensions: While FL is privacy-preserving by design, integrating differential privacy or secure multi-party computation could further enhance its trustworthiness.
Dynamic Client Management: Future work will focus on extending the framework to support dynamic client arrival, model personalization, and lifelong federated updates.
8. Conclusions
In this paper, we proposed PMM-FL, a personalized multimodal federated learning framework with knowledge transfer for skin cancer diagnosis. By integrating multimodal data fusion, personalized model adaptation, and advanced aggregation techniques (FedAvg and FedProx), PMM-FL effectively addressed key challenges such as data heterogeneity and missing modalities, achieving superior performance in terms of accuracy, convergence speed, and robustness. The experimental results demonstrated the effectiveness of our framework on a curated multimodal skin lesion dataset, highlighting its potential for real-world clinical applications.