A Systematic Review of Federated Learning in the Healthcare Area: From the Perspective of Data Properties and Applications

Recent advances in deep learning have produced many success stories in smart healthcare applications, offering data-driven insight for improving the quality of care at clinical institutions. Deep learning models depend heavily on data: the more data a model is trained on, the more robust and generalizable its performance tends to be. However, pooling medical data into centralized storage to train a robust deep learning model raises privacy, ownership, and regulatory challenges. Federated learning addresses these challenges by training a shared global deep learning model through a central aggregator server while patient data remain with the local party, preserving data anonymity and security. In this study, first, we provide a comprehensive, up-to-date review of research employing federated learning in healthcare applications. Second, we evaluate a set of recent challenges in federated learning from a data-centric perspective, such as data partitioning characteristics, data distributions, data protection mechanisms, and benchmark datasets. Finally, we point out several potential challenges and future research directions in healthcare applications.


Introduction
Deep learning technology has shown promising results in smart healthcare applications that assist medical diagnosis and treatment based on clinical data. For instance, deep learning assists cancer diagnosis and prediction [1–3], brain tumor segmentation and classification from magnetic resonance imaging (MRI) [4–6], and text detection in medical laboratory reports [7,8]. Good performance of a deep learning model in smart healthcare applications depends heavily on a large and diverse body of training data [9]. These training data are obtained from various sources, such as biomedical sensors, individual patients, clinical institutions, hospitals, pharmaceutical industries, and health insurance companies. However, acquiring the healthcare data required to develop a deep learning model can be challenging because a single healthcare institution may have few patients, particularly for pathologies with a low incidence rate. Furthermore, Zech et al. [10] showed that deep learning models trained with data from a single institution are vulnerable to institutional data bias, as shown in Figure 1a. Such a model can achieve high accuracy when evaluated on data from the same clinical institution, yet it does not work well when applied to data from a different institution or even across departments within the same institution. At the same time, training deep learning models in a centralized data lake [11], as depicted in Figure 1b, is infeasible because of patient privacy and government regulations related to clinical data. Thus, one way to increase both the diversity and quantity of training data is for several healthcare institutions to collaborate on a single deep learning model while maintaining patient privacy and confidentiality. Medical data are usually fragmented due to the complex nature of the medical system and its processes. For instance, each medical institution may be able to access only the medical data of its own patients.
As protected health information (PHI), medical data may be accessed, analyzed, and disclosed to third parties only under strict laws and regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) [12]. In addition, with the increasing number of data breaches at healthcare organizations, the importance of data security and privacy protection has become a global consensus. For instance, in the recent American Medical Collection Agency (AMCA) healthcare data breach, the perpetrators gained access to medical data, financial information, and payment details, affecting 11.9 million patients [13]. As a result, many countries around the globe are enacting stricter legislation to protect data security. For example, the European Union's General Data Protection Regulation (GDPR) went into effect in 2018 to ensure users' privacy while protecting their data [14]. Under the GDPR, business entities must clearly explain why they need access to user data and offer users the right to withdraw or delete their data. Business entities violating the regulation face severe penalties. Similar actions have been taken in the United States and Taiwan to protect individuals' privacy and security. For instance, Taiwan's Personal Data Protection Act (PDPA) and Cyber Security Management Act, enacted in 2018, prohibit online business entities from leaking or tampering with the personal data that they obtain [15]. These regulations require business activities to comply with legal data protection obligations. On the one hand, establishing these regulations contributes to the growth of a more civil society. On the other hand, they introduce new challenges for the data transactions and collaboration procedures that multi-institutional training of a deep learning model requires.
One recent approach to training a robust deep learning model from federated medical data while preserving patient privacy is federated learning (FL) [16,17]. This method provides decentralized machine learning model training, coordinated by a central aggregate server, without transmitting medical data. Medical institutions, working as client nodes, train their deep learning models locally and then periodically forward them to the aggregate server. The central server coordinates and aggregates the local models from each node to create a global model, then distributes the global model to all the nodes. It is worth noting that the training data are kept private to each node and are never transmitted during the training process. Only the model's weights and parameters are transmitted, so the medical data remain confidential. For these reasons, FL mitigates many security concerns because it keeps sensitive and private data local while enabling multiple medical institutions to work together. FL holds excellent promise in healthcare applications to improve medical services for both institutions and patients, for instance, in autism spectrum disorder prediction [18], mortality and intensive care unit (ICU) stay-time prediction [19], wearable healthcare devices [20,21], and brain tumor segmentation [22]. However, FL algorithms face several challenges, mainly due to the properties of medical data, such as:

• Data partitions: FL aims to solve the limited sample size problem when training a secure collaborative machine learning model by aggregating data from a group of clients. Choosing the right data partition (horizontal or vertical) for FL is essential to address limited sample size, limited sample features, or both.
• Data distribution (statistical challenge): When a machine learning model is developed in a centralized manner, the training data are stored centrally and can be balanced during training. In federated learning, however, each client generates its training data locally, the data remain decentralized, and no client can access the other clients' data. Thus, the data distribution at one client can differ significantly from the others, i.e., the data are non-independent and identically distributed (non-IID), which impacts the performance of the federated learning model [23,24].
• Privacy and security: Data privacy and security are critical issues in medical applications. It is unrealistic to assume that all clients in FL are reliable because the number of clients expected to participate is potentially in the thousands or millions. Thus, privacy-preserving mechanisms are needed to protect medical data from untrusted clients or third-party attackers.
• Benchmark medical dataset: The quantity and quality of medical datasets have often limited the development of robust FL algorithms. Depending on the research purpose, the datasets used in FL experiments vary significantly. For instance, some datasets focus on medical image classification and segmentation performance, while others focus on network communication performance. However, benchmark datasets have not yet been compiled specifically for medical data. Thus, a trusted benchmark is necessary to evaluate the performance of FL that uses multiple medical data sources. To this end, we provide a comprehensive list of relevant medical datasets for future research on this topic.
Due to the ever-changing development in FL, several valuable studies on FL have been published in reputable publications from 2018 to 2021. Therefore, this paper aims to provide a recent review of federated learning in the medical domain. Specifically, this study describes the existing FL techniques related to solving the challenges inherent in medical data together with future research direction on FL for healthcare applications.
This study differs from existing reviews. General descriptions of FL are given in [16,17], while recent challenges are discussed in detail in [25,26], security analysis in [27], and personalization techniques in [28]. Summaries of FL applications in edge computing [29], wireless networks [30], and healthcare [31,32] have also been published. However, none of the existing studies have explored the impact of medical data properties on the performance of FL in great detail. Moreover, a comprehensive overview of benchmarking FL on medical data is still needed. To fill this gap, this review presents a survey of FL from the perspective of data properties, including data partitions, data distribution, data privacy, benchmarking, and its promising applications.
After this brief introduction to FL, the rest of the paper is structured as follows. Section 2 describes the research method used to conduct this study. Section 3 presents the search results from existing publications. Section 4 discusses our findings on data partitions, data distribution properties, data privacy threats and protections, benchmark medical datasets, and open challenges in federated learning for medical applications. Finally, Section 5 concludes the paper.

Research Method
This study was guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [33]. PRISMA is a widely accepted standard for reporting evidence in systematic reviews that health-related organizations and journals have adopted [34]. The PRISMA approach provides several advantages, such as showcasing the review's quality, allowing readers to assess the review's strengths and flaws, enabling replication of the review process, and structuring and formatting the review using PRISMA headings [33]. However, conducting a systematic review and publishing it thoroughly takes time. Additionally, a review can soon become out of date, so it must be updated regularly to incorporate all primary material published since the project began.

Formulate Research Questions
We divide the research question into the following research questions.
- RQ1: What are the state-of-the-art FL methods in the healthcare area?
- RQ2: What are the FL methods proposed by scholars to solve challenging medical applications from a data properties perspective?
- RQ3: What are the research gaps and potential future research directions of FL related to medical applications?
The first research question (RQ1) aims to provide a comprehensive and systematic overview of all articles related to FL and to provide evidence that the healthcare area can benefit from incorporating FL. The second research question (RQ2) addresses the challenges that medical data settings pose for FL, such as data partitioning, statistical heterogeneity, and security. Finally, the third research question (RQ3) provides future directions for researchers in the FL field, primarily related to medical data challenges.

Data Eligibility and Analysis of the Literature
The article selection procedure uses the PRISMA flow diagram [33], as shown in Figure 2, which outlines the search, inclusion, and exclusion of papers. There are three steps in the PRISMA flow diagram: identification, screening, and included. Firstly, in the identification step, we performed a comprehensive literature search covering 1 January 2018 to 31 June 2021, using the PubMed, Web of Science (WoS), Association for Computing Machinery Digital Library (ACM DL), Science Direct, and IEEE Xplore digital libraries. We start from 2018 because we are interested in implementations in the medical area beginning one year after federated learning was proposed in 2017 [16]. The search phrases used in general were "Federate learning," "Healthcare," and "data privacy protection." Because each publication database has its own set of filters for search queries, the specific query terms are given in Appendix A Table A1. The initial search of the digital libraries returned 197 articles satisfying the search criteria. Then, 28 articles were removed as duplicates, leaving 169 articles at the end of the identification step. While systematic reviews offer various advantages, they are prone to biases that obscure the study's objective results and should be evaluated cautiously [35]. Several approaches were used to reduce bias and ambiguity in the research selection process: (i) conducting a dual review, (ii) defining clear and transparent inclusion and exclusion criteria, and (iii) tracing the selection flow using the PRISMA flow diagram. Firstly, two researchers independently analyzed the data and resolved inconsistencies through group discussion (P. and K.T.P). Then, the abstracts and complete texts of all relevant articles were carefully studied, and only those that fit the inclusion and exclusion criteria were chosen.
The researchers then confirmed the selected papers and resolved any conflicts; if any disagreement persisted, further researchers (Z.-Y.S., C.-R.S., and W.J.) were invited to discuss the matter and appraise the findings. There was no dispute over the papers included in this review.
This study aims to provide a sound overview of FL for the healthcare sector and a more in-depth look at establishing secure mechanisms for medical data in FL. Thus, in the screening step, we define the inclusion and exclusion criteria. We included publications that (i) use FL to develop a model on a medical dataset, (ii) are published in well-known journals, and (iii) are published in English. Published studies were excluded based on the following criteria: (i) articles not related to FL, (ii) FL for nonmedical applications or experiments that do not use a medical dataset, (iii) non-English language, (iv) review articles, (v) proceedings or conference papers, (vi) arXiv preprints, and (vii) books, book chapters, and book sections.
Several considerations weigh against including conference papers in this study [36]. Firstly, conference proceedings usually cover various topics and a much larger set of publications, so identifying suitable conferences, accessing their abstracts, and sifting through what are frequently thousands of abstracts can be time-consuming and resource-consuming. Secondly, because of their brevity, conference proceedings may lack sufficient information for systematic reviewers to evaluate the methods, risk of bias, and outcomes of the studies submitted to the conference. Finally, the reliability of the results is also in question, especially in the healthcare area, partly because they are frequently preliminary or based on limited investigations undertaken to meet conference deadlines. Thus, we do not include conference papers in the inclusion criteria.
After applying the inclusion and exclusion criteria to each study's title, abstract, and keywords, 56 articles were identified in the screening step. Next, in the eligibility assessment, 32 articles were excluded after applying the exclusion criteria to the full text, leaving 24 articles. Finally, in the included step, these 24 articles using FL in healthcare applications were selected for further analysis, and their results are discussed in this study. All 24 selected FL studies in the healthcare domain are listed in Table A2.
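The selection counts reported above can be checked with a few lines of arithmetic. The following is a minimal sketch; the variable names are illustrative, and the number of records excluded at title/abstract screening is derived here rather than stated in the text.

```python
# Counts reported in the text; variable names are illustrative only.
identified = 197            # records returned by the five digital libraries
duplicates = 28             # duplicates removed in the identification step
after_dedup = identified - duplicates

screened_in = 56            # kept after title/abstract/keyword screening
excluded_full_text = 32     # removed after full-text eligibility assessment
included = screened_in - excluded_full_text

# Derived value (an assumption of this sketch, not reported in the text):
excluded_at_screening = after_dedup - screened_in

print(after_dedup, excluded_at_screening, included)
```

The derived totals agree with the 169 deduplicated records and 24 included studies reported in the text.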
To provide a numerical description of the literature review, we gathered information from each article as follows: (i) paper information, such as author, title, year, and keywords; (ii) proposed methods, such as FL training algorithms and deep learning/machine learning models; (iii) data properties, such as medical datasets, data distribution techniques and challenges, data partition techniques, privacy attacks, and privacy mechanisms; and (iv) experiment results and discussion.

Results
We compiled the data properties in FL for healthcare applications from 24 published articles, as shown in Figure 3. The data scheme settings consisted of four layers: (i) data partitions such as horizontal federated learning (HFL), vertical federated learning (VFL), and federated transfer learning (FTL) (as discussed in Section 4.2); (ii) data distribution characteristics (non-IID) such as quantity skew, label distribution skew, feature distribution skew, and concept shift skew (as discussed in Section 4.3); (iii) possible data privacy attacks such as model inversion and membership inference attacks (as discussed in Section 4.4.1); (iv) additional data privacy protections such as differential privacy and homomorphic encryption (as discussed in Section 4.4.2). Above the medical data properties is the application task, where the task can be a classification or segmentation (as discussed in Section 4.6).  Figure 4b shows the number of studies with data partition characteristics employed in FL.
According to Figure 4b, most published FL studies use horizontal federated learning (HFL) as the medical data partition. Figure 4c shows the number of studies that use various defense methods against data privacy attacks; differential privacy is the most frequently employed type of data privacy protection. All of the data privacy protection methods are discussed in Section 4.4. Based on Figure 4d, quantity skew is typical when dealing with multi-institutional medical data in FL experiments.
Machine learning algorithms. We also outline the machine learning models employed in the studies to evaluate their proposed FL algorithms. The results are shown in Figure 4e: multilayer perceptron (MLP) is the most commonly used model for prediction on tabular medical datasets, such as mortality prediction, while convolutional neural network (CNN) is the architecture most frequently used for medical image datasets. Other models include support vector machine (SVM) and autoencoder (AE) models. Additionally, we compile the machine learning tasks in the 24 published articles, as shown in Figure 4f. There were 21 studies on classification tasks and three studies on segmentation tasks. Finally, Table 1 summarizes the strengths and weaknesses of the machine learning algorithms used in federated learning.

Table 1. Strengths and weaknesses of machine learning algorithms used in federated learning.
| Model | Strengths | Weaknesses | References |
| --- | --- | --- | --- |
| AE | | An autoencoder may exclude essential information from a medical dataset's characteristics. | [19,21,37] |
| CNN | Performs well on medical image classification tasks, such as prediction of COVID-19 using X-ray images. | Training a CNN with multiple layers is time-consuming if the client in the FL environment does not have powerful computation resources. | [18,20,38–42] |
| GAN | Generates synthetic samples of medical data when the experimental datasets are limited in quantity. | Training a GAN is challenging due to the unstable training process, the lack of a standard evaluation metric, and the numerous trial-and-error experiments required for effective outcomes. | [43,44] |
| LSTM | Performs well on time-series or sequential medical datasets, for instance, in human activity recognition. | Training an LSTM is difficult due to the vanishing and exploding gradient problems. | [45] |
| MLP | Good generalization performance on tabular medical datasets, such as mortality prediction based on drug data. | MLP is limited to learning elementary problems. Additionally, it is sensitive to feature scaling and requires setting numerous hyperparameters, such as the number of hidden neurons and layers. | |
| SVM | | SVM is memory-consuming, harder to tune because the appropriate kernel must be selected carefully, and does not scale well to larger datasets. | [52] |
| U-Net | Achieves accurate results on segmentation tasks for medical image datasets, for example, when segmenting brain tumors in brain magnetic resonance images. | U-Net model development is time-consuming because the network must be run separately for each patch, with redundancy due to overlapping patches. Additionally, a tradeoff exists between localization precision and the use of context. | [38,53] |
AE: autoencoder; CNN: convolutional neural network; GAN: generative adversarial network; LSTM: long short-term memory; MLP: multilayer perceptron; SVM: support vector machine.

RQ1: What are the state-of-the-art FL methods in the healthcare area?

Federated Learning Overview
FL is a technique for developing a robust, high-quality shared global model through a central aggregate server from data isolated among many different clients. In a healthcare application scenario, assume there are $K$ nodes, where each node $k$ holds its own data $D_k$ with $n_k$ samples. These nodes could be healthcare wearable devices, internet of health things (IoHT) sensors, or medical institution data warehouses. The FL objective is to minimize the loss function over the total data $n = \sum_{k=1}^{K} n_k$ with respect to the trainable weight vector of $d$ parameters, $w \in \mathbb{R}^d$, using Equation (1):

$$\min_{w \in \mathbb{R}^d} f(w), \qquad f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w) \tag{1}$$

where $f_i(w) = \ell(x_i, y_i; w)$ denotes the loss of the machine learning model on sample $(x_i, y_i)$ made with parameters $w$. For instance, Huang et al. [19] used the categorical cross-entropy loss function to update the model parameters for binary classification of patient mortality. In addition, Yang et al. [53] used the soft dice loss function for the COVID-19 region segmentation application.
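As a worked restatement, take the objective to be the average per-sample loss, as in the FedAvg formulation [16], and assume that every sample belongs to exactly one node's dataset $D_k$. The objective can then be regrouped by node. The per-node objective $F_k$ is notation introduced here for illustration only.

```latex
f(w) \;=\; \frac{1}{n}\sum_{i=1}^{n} f_i(w)
     \;=\; \sum_{k=1}^{K} \frac{n_k}{n}\, F_k(w),
\qquad
F_k(w) \;=\; \frac{1}{n_k}\sum_{i \in D_k} f_i(w).
```

Under this assumption, each node contributes to the global objective in proportion to its share of the data, $n_k / n$, which is one natural choice for weighting nodes during aggregation.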
In 2016, McMahan et al. [16] introduced the basic concept of data parallelism in FL, namely the federated averaging (FedAvg) algorithm. In FedAvg, every communication round $t$ consists of four phases. Firstly, the aggregate server initializes a global model with weights $w^g_t$ and shares it with a group of clients $S_t$ (medical nodes in our case), picked randomly as a fraction $C \in (0, 1]$ of all clients. Secondly, each client $k \in S_t$, after receiving the global model $w^g_t$ from the server, runs $E$ epochs of local training over minibatches $b \in B$ of its $n_k$ private data points. The local model parameters are updated with local learning rate $\eta$ by minimizing the loss function $L(\cdot)$. Thirdly, once local training is complete, client $k$ sends its local model $w^k_{t+1}$ back to the server. Finally, after receiving the local models $w^k_{t+1}$ from all selected clients in $S_t$, the aggregate server updates the global model $w^g_{t+1}$ by averaging the local model parameters using Equation (2):

$$w^g_{t+1} = \sum_{k=1}^{K} \alpha_k \, w^k_{t+1} \tag{2}$$

where $\alpha_k$ is a weighting coefficient that indicates the relative influence of each node $k$ on the global model update, and $K$ is the total number of nodes that participated in the training process. Choosing a proper weighting coefficient $\alpha_k$ in the averaging function can help improve the global model's performance (as discussed in Section 4.3.2, non-IID mitigation methods). The entire FL procedure is described in Algorithm 1.

Algorithm 1 FL with Federated Averaging (FedAvg) algorithm [16]
Input: $T$ global rounds, $C$ fraction of clients selected in each training round, $K$ number of clients, $\eta$ learning rate at a local client, $E$ number of epochs at a local client, $B$ local minibatches at a local client.
Initialize global model $w^g_{t=0}$
for each round $t = 1, 2, \ldots, T$ do
    $m \leftarrow \max(C \times K, 1)$
    $S_t \leftarrow$ ($m$ clients in a random order)
    for each client $k \in S_t$ do
        $w^k_{t+1} \leftarrow$ ClientUpdate($k$, $w^g_t$)
    $w^g_{t+1} \leftarrow \sum_k \alpha_k w^k_{t+1}$ (Equation (2))
ClientUpdate($k$, $w^g_t$):
    $w^k \leftarrow w^g_t$
    for each local epoch $e = 1, 2, \ldots, E$ do
        for each local batch $b \in B$ do
            $w^k \leftarrow w^k - \eta \nabla L(b; w^k)$
    return local model $w^k$
Output: $w^g_{t+1}$, a global model at round $t + 1$

FL differs from standard collaborative learning in the following properties: (1) training is carried out across a vast number of client nodes, and communication between the client nodes and the aggregate server is slow; (2) the central aggregate server has no control over individual nodes or devices, and full participation of all nodes is unrealistic because inactive devices do not respond to the server; (3) in real-world scenarios, the data are non-independent and identically distributed (non-IID), meaning that each node has a different distribution pattern from the other nodes. These properties were observed when the FL algorithm was first proposed and applied to mobile keyboard prediction [16,17]. However, they differ when FL is implemented in the healthcare area. First, FL training is carried out across a limited number of healthcare nodes, from 2 to 100 as listed in Table 2, and communication between healthcare participants and the aggregate server is usually reliable. Second, the aggregate server coordinates the participant nodes in the FL training scheme without exposing the participants' local data to the network; thus, data privacy and security can be guaranteed.
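The FedAvg procedure in Algorithm 1 can be illustrated with a small simulation. The following is a minimal sketch, not the implementation used in any reviewed study: it assumes a one-parameter linear model $y \approx w x$ with squared-error loss, synthetic client data, and weighting coefficients $\alpha_k = n_k / \sum_{j \in S_t} n_j$ over the selected clients. All function names and hyperparameter values are hypothetical.

```python
import random

def local_update(w_global, data, lr, epochs, batch_size, rng):
    """Client step: mini-batch SGD on squared error for the model y ~ w * x."""
    w_k = w_global
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            grad = sum(2.0 * (w_k * x - y) * x for x, y in batch) / len(batch)
            w_k -= lr * grad
    return w_k

def fedavg(client_data, rounds, frac, lr, epochs, batch_size, seed=0):
    """Server loop: sample clients, collect local models, average them."""
    rng = random.Random(seed)
    num_clients = len(client_data)
    w_global = 0.0  # initial global model
    for _ in range(rounds):
        m = max(int(frac * num_clients), 1)
        selected = rng.sample(range(num_clients), m)
        local_models = {
            k: local_update(w_global, client_data[k], lr, epochs, batch_size, rng)
            for k in selected
        }
        # Assumed weighting: alpha_k is the client's share of the selected samples.
        n_selected = sum(len(client_data[k]) for k in selected)
        w_global = sum(
            len(client_data[k]) / n_selected * local_models[k] for k in selected
        )
    return w_global

def make_client(n, rng, true_w=3.0):
    """Synthetic private dataset for one client (illustrative data only)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, true_w * x + rng.gauss(0.0, 0.05)))
    return data

data_rng = random.Random(42)
clients = [make_client(n, data_rng) for n in (20, 50, 80, 200)]  # quantity skew
w = fedavg(clients, rounds=30, frac=0.5, lr=0.1, epochs=2, batch_size=10)
print(round(w, 2))
```

Only the scalar model parameter moves between the simulated server and clients; the per-client lists of samples are never merged, which mirrors the privacy argument made in the text.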
FL is divided into two categories based on the aggregation schema: (a) centralized FL and (b) decentralized FL. As shown in Figure 5a, in centralized FL, the central server selects a subset of nodes at the beginning of training and aggregates the model updates received from the client nodes. Acting as nodes, the medical institutions periodically communicate their local updates $w^k_{t-1}$ to the central server to learn a global model $w^g_t$. The central server aggregates the updates and sends back the parameters of the updated global model. However, if the central server fails, the whole FL environment collapses. This single point of failure is one of the reasons decentralized FL was proposed. Specifically, in decentralized FL, all nodes coordinate among themselves and work together node to node to develop a global model, as shown in Figure 5b.
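To make the node-to-node idea concrete, the sketch below shows one possible way nodes could agree on an averaged model without a central server. It is an illustration under assumptions that do not come from the reviewed studies: a ring topology, scalar models, uniform mixing weights, and no local training between exchanges.

```python
def gossip_round(models):
    """Each node replaces its parameter with the mean of itself and its two ring neighbours."""
    n = len(models)
    return [
        (models[(i - 1) % n] + models[i] + models[(i + 1) % n]) / 3.0
        for i in range(n)
    ]

# Hypothetical scalar local models, one per institution.
models = [0.0, 1.0, 2.0, 5.0]
target = sum(models) / len(models)  # what a central server would have computed

for _ in range(50):
    models = gossip_round(models)

print(models, target)
```

Because every exchange preserves the sum of the parameters, repeated neighbour averaging drives all nodes toward the same mean that a central aggregator would produce, and no single node failure stops the remaining nodes from exchanging updates.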

RQ2: What are the FL methods proposed by scholars to solve challenging medical applications from a data properties perspective?

Data Partition Characteristics
This section discusses FL based on the characteristics of healthcare data partitions. Since FL uses data kept in various medical institutions, the data are frequently represented as a feature matrix. Let the matrix $D_k$ denote the medical data held by medical institution $k$. A row of the matrix represents a patient, indexed by $I$; a column represents a patient feature (such as a diagnosis), denoted by $X$; and some data may also contain labels $Y$. The complete training medical dataset $D_k$ at medical institution $k$ is denoted by $(I_k, X_k, Y_k)$. Thus, data partitioning in FL can be divided into horizontal federated learning (HFL), vertical federated learning (VFL), and federated transfer learning (FTL) [26].

Horizontal Federated Learning (HFL)
The horizontal federated learning (HFL) data partition, shown in Figure 6, is recommended when the sample size and variability available for developing a model are limited. In this data partition setting, the nodes could be different health institutions or health data application providers. HFL aims to develop a global model by integrating patients' sample data from different institutions without affecting patient privacy. Each node holds a different patient index $I$ but the same features $X$ and label information $Y$ [26]. HFL is denoted as:

$$X_i = X_j, \quad Y_i = Y_j, \quad I_i \neq I_j, \quad \forall D_i, D_j,\ i \neq j$$

where $D_i$ represents the dataset held by client $i$. For instance, two healthcare providers in the same line of business but located in different countries would like to develop an AI model. The user features of these two healthcare providers will mostly be the same because both operate the same business. However, the patient samples held by the two healthcare providers differ because of their geographic locations. In this case, HFL can increase the total number of training samples by aggregating both healthcare providers' user samples in a privacy-preserving manner to enhance the model's performance. Therefore, the HFL data partition resolves the lack of training samples because it combines the sample data of all healthcare institutions.
Figure 6. The typical medical data partition scenario for horizontal federated learning (HFL). Each node is a medical institution data silo or wearable medical device. The nodes share the same medical diagnosis features, $X_j = X_k$, but have different patient indices, $I_j \neq I_k$.
The HFL data partition is quite common in FL applied to medical applications. More than half of the FL studies on medical applications implemented a horizontal medical data partition in their experiments [18,19,21,37,39–49,51,52,54,55]. Unlike FL applied to nonmedical applications, where training is carried out across many nodes, FL studies in medical applications handle only a limited number of nodes, from 2 to 100, as listed in Table 2. For instance, Li et al. [18] experimented with four medical institutions in different places for the autism spectrum disorder (ASD) prediction scenario. Each medical party shares the same user features generated by medical equipment, and the model is trained on the patient samples from all four medical nodes.

Vertical Federated Learning (VFL)
Data partition in vertical federated learning (VFL) is depicted in Figure 7. In this data partition setting, the nodes share the same users' profiles but hold different feature information. The nodes could be different health institutions or health data application providers. VFL aims to develop a global model by integrating patient features from different institutions without directly sharing patient data. Each node holds different patient features $X$ and label information $Y$ but the same sample indices $I$ [26]. VFL can be denoted as:

$$X_i \neq X_j, \quad Y_i \neq Y_j, \quad I_i = I_j, \quad \forall D_i, D_j,\ i \neq j$$

For example, two distinct healthcare organizations exist in the same region: one hospital and one health insurance company. The users of these two healthcare organizations may mostly be the same because they are the region's residents. However, the user features may have nothing in common because the health insurer records users' income and medical reimbursements, while the hospital keeps users' medical treatment histories. The VFL data partition securely combines the different feature sets to enhance the performance of the model. Thus, the VFL data partition increases the feature dimension of the training data.
In contrast to HFL, there are few published VFL-based studies in medical applications. One such example was proposed by Cha et al. [56]. The authors developed an autoencoder federated learning model for vertically partitioned medical data. An autoencoder model transforms the user features at each client into a latent dimension. The proposed method does not share any raw medical data; it shares only the latent dimensions as secure perturbed data. After receiving the clients' latent dimensions, the aggregate server concatenates all latent dimensions to train the global model. However, this approach is prone to reverse-engineering, which could recover the original medical data from the latent dimensions. In addition, the proposed method requires all the clients to perform data alignment, meaning that the user data have the same row indices in all data silos (the first row of data at client k must correspond to the first row at client j).
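The data flow described above (each client encodes its own features, and the server concatenates the aligned latent vectors) can be sketched as follows. This is a simplified illustration, not the method of Cha et al. [56]: a fixed linear map stands in for a trained autoencoder, alignment is done by joining on a shared patient identifier (an assumption of this sketch), and all names and values are hypothetical.

```python
def encode(features, weights):
    """Hypothetical linear encoder: latent[j] = sum_i features[i] * weights[i][j]."""
    latent_dim = len(weights[0])
    return [
        sum(f * row[j] for f, row in zip(features, weights))
        for j in range(latent_dim)
    ]

# Each party holds different features for partly overlapping patients (toy values).
hospital = {"p1": [0.2, 1.0, 0.5], "p2": [0.9, 0.1, 0.4], "p3": [0.3, 0.3, 0.8]}
insurer = {"p2": [52.0, 1.0], "p3": [38.0, 0.0], "p4": [61.0, 1.0]}

# Stand-ins for trained encoders; a real system would learn these locally.
hospital_enc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 features -> 2 latent dims
insurer_enc = [[0.01], [1.0]]                         # 2 features -> 1 latent dim

# Only latent vectors leave each party, never the raw feature rows.
hospital_latent = {pid: encode(x, hospital_enc) for pid, x in hospital.items()}
insurer_latent = {pid: encode(x, insurer_enc) for pid, x in insurer.items()}

# Data alignment: keep only patients known to both parties, in a fixed order.
shared_ids = sorted(hospital_latent.keys() & insurer_latent.keys())

# Server side: concatenate the aligned latent vectors into one training row.
server_inputs = {
    pid: hospital_latent[pid] + insurer_latent[pid] for pid in shared_ids
}
print(shared_ids, server_inputs)
```

The sketch makes the alignment requirement visible: patients present at only one party (p1 and p4 here) cannot contribute a concatenated row.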

Federated Transfer Learning (FTL)
Unlike the data configurations in HFL and VFL, the data partition in federated transfer learning (FTL) considers the situation in which multiple nodes share neither the same user profiles nor the same feature information, as shown in Figure 8. The main issue in this data partition configuration is that one node lacks labeled data. The nodes could be different health institutions or health data application providers located in different regions. Furthermore, each node holds different patient features X, labels Y, and sample indices I [26]. FTL can be denoted as:
$X_i \neq X_j,\quad Y_i \neq Y_j,\quad I_i \neq I_j,\quad \forall\, i \neq j,$
where i and j index the participating nodes. For example, there are two distinct healthcare entities: one is a hospital in Taiwan, while the other is in the United States. Due to the geographical separation, the user groups of the two healthcare entities have little overlap, and the data features of the two entities' datasets may overlap only slightly. FTL addresses the limited datasets and labeled samples in this scenario, thus increasing the model's performance while protecting user privacy.
Research in FTL is still in its early stages, and there is plenty of room for improvement. Chen et al. [20] proposed FedHealth, assuming the FTL data partition. FedHealth first builds a shared model across several users/organizations using FL and then offers a personalized model for each user/organization using transfer learning. The model first learns to classify human activity, and the task is then extended to Parkinson's disease classification with transfer learning. In this case, FTL develops a global model for one prediction task, which can then be transferred to another task.

Data Distribution (Statistical Data Heterogeneity) Challenge
FL can solve the limited data quantity issue by combining data from each client without directly sharing each client's private data. However, FL also faces statistical data heterogeneity challenges due to the data distribution at each client. The data distributions at each client are likely to be different, leading to poor global model performance [23,24]. Zhao et al. [23] demonstrated that differing data distributions might considerably decrease FL model performance due to the weight divergence induced by different population distributions. Within an FL environment, data distribution is frequently classified into IID and non-IID. Non-IID data can result from an imbalance in data quantity, features, or labels. Non-IID data are a common occurrence in the medical domain. Different medical device manufacturers, calibration techniques, and data acquisition techniques are the main reasons why each medical institution generates a nonidentical data distribution. For instance, Li et al. [18] described how medical institutions use brain scanners from various manufacturers and give different instructions to patients when acquiring autism brain imaging data. Specifically, during data acquisition, one medical site instructs patients to keep their eyes open, while others instruct them to close their eyes during scanning. In the following subsection, we describe the non-IID characteristics and mitigation methods.

Non-IID Characteristics
The non-IID characteristics among healthcare nodes in the FL environment can take on four different forms such as (1) quantity distribution skew, (2) label distribution skew, (3) feature distribution skew, and (4) concept shift skew [24,25]. The non-IID characteristic summarized from 24 published FL studies applied for medical application is listed in Table 2.
Quantity skew (imbalanced data) characteristic. Quantity skew in non-IID data occurs when the class distribution of data instances I is not equal, or is far from equal, across nodes in the FL scheme. An illustration of quantity skew is shown in Figure 9. In the IID scenario, the ratio of positive to negative instances is almost equal. For instance, in node two, the negative and positive proportions are 45% and 55%, respectively. In the non-IID case, the ratio of positive to negative instances is far from equal. For example, in node one, positive instances are around 5%, while negative ones are 95%. Krawczyk et al. [57] divided imbalanced data into slight imbalance and severe imbalance. A slight imbalance is when the class distribution in the training dataset is uneven by a small amount, with a ratio ranging from 1:4 up to 1:100. A severe imbalance is when the class distribution is uneven by a large amount, with a ratio of more than 1:100. For example, the imbalance ratio in fraud detection tasks is up to 1:1000. Quantity skew exists in the experimental datasets of FL studies for medical applications such as [18,19,46,52,53]. Quantity skew (i.e., an imbalanced dataset) is common in medical datasets since they are acquired from multiple healthcare institutions, and the number of instances in a class is not equally distributed across institutions. For instance, larger hospitals have more patient records than small clinics in rural areas. Huang et al. [19] addressed this challenge using an imbalanced eICU dataset to predict patient mortality, where the ratio is 5% and 95% for the death and alive categories, respectively.
Label distribution skew characteristic. For label distribution skew, the distribution of labels $P(y_i)$ varies between different nodes. In the medical case, larger hospitals generally have more disease-related records than small clinics in rural areas. An illustration of the label distribution skew characteristic is shown in Figure 10. In the IID setting, the distribution of labels Y is the same across all nodes. However, in the non-IID setting, the distribution of labels Y varies between nodes. Specifically, there is a label $y_i$ that only exists in one or several nodes in the FL environment. This label distribution skew characteristic was initially demonstrated in the FedAvg experiment [16]. Data samples with the same label are divided into subsets, and each client is assigned no more than two subsets with distinct labels. Following FedAvg, this configuration is employed in published FL studies for medical applications [38].
Feature distribution skew characteristic. In the feature distribution skew characteristic, the distribution of features $P(x_i)$ varies between different nodes. An illustration of feature distribution skew is shown in Figure 11. In the IID case, the distribution of features X is the same across all nodes, while in the non-IID case, the distribution of features X varies between nodes. Specifically, there is a feature $x_i$ that only exists in one or several nodes in the FL environment. For instance, node two does not have the $x_1$ and $x_2$ features, while other nodes have those features. Missing features or missing data are a common occurrence in medical datasets. For instance, missing features can be caused by measurement failures in medical images. Measurement in medical image acquisition requires the images to be in focus. Medical images that are out of focus or blurred can cause missing pixel values. The absence of some features in one or several nodes can be a problem in the FL training process. Data imputation techniques such as probabilistic principal component analysis (PPCA) and multiple imputation by chained equations (MICE) can be employed to mitigate the problem [58].
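The FedAvg-style label distribution skew configuration described above (samples grouped by label, with each client receiving no more than two label subsets) can be simulated with a short sketch. The function name `shard_partition`, the toy label vector, and the number of clients are assumptions made for illustration; they are not taken from the cited studies.

```python
import numpy as np


def shard_partition(labels, num_clients, shards_per_client=2, seed=0):
    """Split sample indices so that each client sees only a few labels."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")      # group indices by label
    num_shards = num_clients * shards_per_client
    shards = np.array_split(order, num_shards)     # contiguous label blocks
    shard_ids = rng.permutation(num_shards)        # random shard assignment
    clients = []
    for c in range(num_clients):
        picked = shard_ids[c * shards_per_client:(c + 1) * shards_per_client]
        clients.append(np.concatenate([shards[s] for s in picked]))
    return clients


# Toy dataset: 10 classes with 100 samples each (hypothetical).
labels = np.repeat(np.arange(10), 100)
clients = shard_partition(labels, num_clients=5)
labels_per_client = [np.unique(labels[idx]).size for idx in clients]
print(labels_per_client)
```

Because every shard here covers a single class, no client holds more than two distinct labels, which reproduces the extreme non-IID case; an IID baseline would instead shuffle the indices before splitting.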

Concept shift skew characteristic. There are two forms of concept shift skew: the same label but different features, $P(x|y)$, and the same features but different labels, $P(y|x)$. An illustration of concept shift skew is depicted in Figure 12. The same-label-but-different-features form is related to the vertical federated learning data partition, where each node shares the sample index I but has different features X, while the same-features-but-different-labels form is not applicable in most FL studies.
Balance the training dataset method. When dealing with quantity skew in non-IID data, researchers balance the quantity of the minority and majority classes in the training dataset with synthetic data augmentation techniques. It is important to note that the balancing method in the FL environment should keep the data secure and private. There are two methods to generate synthetic data augmentation in the FL environment: (1) local data augmentation and (2) server data sharing.
(1) In the local data augmentation method, the healthcare node generates synthetic samples to balance the training dataset. The synthetic minority oversampling technique (SMOTE) [21,49], generative adversarial network (GAN) [44], or geometric transformation [40,48,53] is employed to generate synthetic samples in an FL environment. The SMOTE algorithm is an oversampling technique in which synthetic data are generated for the minority class. For instance, Wu et al. [21] and Rajendran et al. [49] employed SMOTE to balance heavily imbalanced fall detection and lung cancer training datasets, respectively. Zhang et al. [44] proposed secure synthetic COVID-19 data by combining the GAN and differential privacy methods. Feki et al. [40], Dou et al. [48], and Yang et al. [53] applied geometric transformations such as random flipping, random rotation, and random translation as data augmentation to balance the quantity of the minority class in their training datasets.
(2) In the server data sharing method, the aggregate server securely shares a small portion of data with the healthcare nodes. For instance, Zhao et al. [23] proposed a globally shared dataset partition to train on non-IID data. The authors demonstrated that by sharing only 5% of the data, they could obtain a 30% boost in accuracy. However, this raises model communication costs and is prone to data privacy attacks during the data sharing process.
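The local data augmentation idea in (1) can be sketched with a simplified, SMOTE-style interpolation. This is a minimal illustration of the principle (new minority samples are placed on the line segment between a minority sample and one of its nearest minority neighbors); it is not the implementation used in the cited studies, and the function name `smote_like`, the parameter `k`, and the toy data are assumptions.

```python
import numpy as np


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]   # k nearest minority neighbors
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))            # random minority sample
        j = neighbors[i, rng.integers(k)]          # one of its neighbors
        gap = rng.random()                         # interpolation factor in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)


# Hypothetical minority-class samples held by one healthcare node.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like(minority, n_new=10)
print(synthetic.shape)
```

The synthetic samples never leave the node that generated them, which is why this family of methods fits the privacy constraint stated above; `k` must be smaller than the number of minority samples.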
Adaptive Hyperparameters Method. The adaptive hyperparameters method tries to find proper FL hyperparameter values for each node during the training process. Each node can have different values of the FL hyperparameters, such as the learning rate, loss score, and weighting coefficient. Two adaptive hyperparameter methods appear in the published FL studies for medical applications: (1) the weighting coefficient [16,19,20,45] and (2) the adaptive loss function [46].
(1) The weighting coefficient $\alpha_k$ is a variable that indicates the relative influence of each node k in the aggregation equation, Equation (2), used to update the global model. Initially, McMahan et al. [16] proposed FedAvg, in which the weighting coefficient is $\alpha_k = \frac{n_k}{n}$, as shown in Equation (6), where $n_k$ and $n$ are the number of private data points held by node k and the total number of data points from all nodes that participated in training, respectively. In this case, a node with many data points has a considerable effect on the global model. This method worked well when dealing with the label distribution skew characteristics examined in their studies [16,20].
In comparison, Chen et al. [20] proposed the weighting coefficient $\alpha_k = \frac{1}{K}$, where K is the total number of nodes participating in FL, as shown in Equation (7). In this scenario, the authors considered that each node would contribute equally to the aggregation function.
Huang et al. [19] proposed the weighting coefficient $\alpha_k = \frac{m_k^c}{\sum_{c=1}^{C} m_k^c}$, as shown in Equation (8), where $m_k^c$ and $\sum_{c=1}^{C} m_k^c$ denote the cluster size in medical node k and the total number of clusters in community-based federated learning, respectively. In their method, the algorithm considers the weighted average from the clustered patient community.
Finally, Chen et al. [45] proposed the weighting coefficient $\alpha_k = \frac{n_k}{n} \times \left(\frac{e}{2}\right)^{-(t - \mathrm{timestamp}_k)}$, as shown in Equation (9), where e is the base of the natural logarithm, used to express the time effect, t is the current round, and $\mathrm{timestamp}_k$ is the round in which the local model of node k was most recently updated. Their proposed weighting coefficient considers not only the share of data samples held by node k, given by $\frac{n_k}{n}$, but also how recently the local node updated its model.
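To make the difference between these weighting coefficients concrete, the following sketch computes the data-proportional, equal, and temporally weighted variants and applies them in a weighted sum of local parameters. The function names and toy numbers are assumptions, the cluster-based coefficient of Huang et al. [19] is omitted, and the temporal weights are deliberately left unnormalized here because the text above does not state whether they are rescaled.

```python
import math

import numpy as np


def fedavg_weights(n_k):
    """alpha_k = n_k / n: proportional to each node's data size."""
    n_k = np.asarray(n_k, dtype=float)
    return n_k / n_k.sum()


def equal_weights(num_nodes):
    """alpha_k = 1 / K: every node contributes equally."""
    return np.full(num_nodes, 1.0 / num_nodes)


def temporal_weights(n_k, timestamps, t):
    """alpha_k = (n_k / n) * (e / 2) ** -(t - timestamp_k): stale nodes count less."""
    n_k = np.asarray(n_k, dtype=float)
    age = t - np.asarray(timestamps, dtype=float)
    return (n_k / n_k.sum()) * (math.e / 2.0) ** (-age)


def aggregate(local_weights, alpha):
    """Weighted sum of local parameter vectors."""
    return sum(a * w for a, w in zip(alpha, local_weights))


# Two hypothetical nodes with 100 and 300 samples.
local_models = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
global_model = aggregate(local_models, fedavg_weights([100, 300]))
print(global_model)  # the larger node pulls the average toward its parameters
```

With equal timestamps the temporal variant reduces to the data-proportional one, so the extra factor only matters when some nodes report models trained in earlier rounds.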
(2) In addition, the adaptive loss function method adapts the training process based on the loss score. The loss function is used to measure model performance: the lower the loss score, the better the model was trained. Specifically, Huang et al. [46] proposed the LoAdaBoost method, based on the loss function, in the FL environment for patient mortality prediction. In their proposed method, the adaptive loss function adaptively boosts the training process of weak learner nodes. At each training step, the local node computes both the local model and the training loss. If the training loss is higher than the loss threshold, the local model is retrained. Otherwise, it is sent to the aggregate server.
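The retrain-until-below-threshold behavior described above can be sketched on a toy one-parameter regression. This is only an illustration of the control flow, not the LoAdaBoost algorithm itself: the fixed `loss_threshold`, the `max_retries` cap, the learning rate, and the toy data are all assumptions.

```python
import numpy as np


def local_update(w, x, y, lr=0.02, steps=5):
    """A few gradient-descent steps on mean squared error for y ~ w * x."""
    for _ in range(steps):
        grad = 2.0 * np.mean((w * x - y) * x)
        w = w - lr * grad
    loss = float(np.mean((w * x - y) ** 2))
    return w, loss


def train_until_threshold(w, x, y, loss_threshold, max_retries=3):
    """Retrain locally while the loss exceeds the threshold, then report."""
    w, loss = local_update(w, x, y)
    retries = 0
    while loss > loss_threshold and retries < max_retries:
        w, loss = local_update(w, x, y)   # extra local training for a weak learner
        retries += 1
    return w, loss, retries               # what the node would send to the server


# Hypothetical local data following y = 2 * x.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x
w, loss, retries = train_until_threshold(0.0, x, y, loss_threshold=0.1)
print(w, loss, retries)
```

The retry cap is a design choice added for the sketch so that a node with persistently high loss cannot stall a communication round.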
Domain Adaptation Method. Domain adaptation (DA) is a subset of transfer learning in which a model developed in one or more "source domains" is applied to a new (but related) "target domain." DA is used when the source and target domains share the same feature space but have different data representations and distributions [59]. In comparison, transfer learning is used when the target domain's features are different from the source domain's features. The goal of DA is to minimize discrepancies in data distributions. Li et al. [18] incorporated domain adaptation in their FL algorithm. The fundamental assumption is that DA approaches can increase the overall performance of multiple nodes in an FL environment with non-IID data. Specifically, the authors implemented mixture of experts (MoE) and adversarial domain alignment methods. The MoE implements adaptation near the model output layer, whereas the adversarial domain alignment implements adaptation at the data knowledge representation level.

Data Privacy Attacks and Protections
Data security and privacy are critical issues in medical applications. In FL, it is usual for all nodes to calculate and upload their local model weights and parameters to an aggregate server. The steps of uploading and processing the weights and parameters may leak sensitive patient information contained in the medical data. The possible attacks include model inversion and membership inference attacks, which may leak patient data to an attacker. The common solutions for data privacy protection include differential privacy and homomorphic encryption [21] based techniques, which can guarantee the security of transferring the local weights and parameters in federated learning. In the following subsection, we describe the possible data privacy attacks and protections in FL.

Data Privacy Attacks on Federated Learning
There are two types of possible data privacy attacks on federated learning. The first type tries to recreate the input data, as in the model inversion attack, and the second type tries to discover whether specific data were part of the training data, as in the membership inference attack.
Model Inversion (MI) Attack. The model inversion attack is an attack method for recreating the data on which a machine learning model was trained [60]. In the case of federated learning for healthcare applications, this can leak the sensitive patient data used in the model's training process. Fredrikson et al. [60] demonstrated an MI attack in which, given the machine learning model and some demographic data about a patient, an attacker could infer the patient's genetic markers. Specifically, the attack exploits the confidence score (predicted output probability) that the machine learning model returns when predicting the class for given feature data. Consider a machine learning model as a function $\hat{y} = f(w; x_1, \ldots, x_n)$, where $\hat{y}$, $w$, and $X = \{x_1, x_2, \ldots, x_n\}$ are the predicted class probability, the machine learning parameters, and the input feature vector, respectively. The model inversion attack aims to recover a sensitive feature, for instance feature $x_1$, given some information about the other features $x_2, \ldots, x_n$ and the predicted output probability $\hat{y}$. One solution to this threat is a differential privacy mechanism, which can be incorporated into the learning process to protect the data from inversion attacks, such as those that exploit the model weights (discussed in Section 4.4.2).
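The inversion setting above can be illustrated with a deliberately simplified sketch: the attacker knows the model, the non-sensitive features, and the returned confidence score, and simply tries every candidate value of the sensitive feature. The toy logistic model, its weights, and the candidate set are hypothetical; the attack of Fredrikson et al. [60] additionally uses prior information that this sketch leaves out.

```python
import numpy as np


def model(w, x):
    """Toy logistic model returning a confidence score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))


def invert_sensitive_feature(w, known_features, observed_score, candidates):
    """Pick the candidate x_1 whose prediction best matches the observed score."""
    errors = [
        abs(model(w, np.concatenate(([c], known_features))) - observed_score)
        for c in candidates
    ]
    return candidates[int(np.argmin(errors))]


# Hypothetical model weights and a patient record; x_1 is the sensitive feature.
w = np.array([1.5, -0.4, 0.7])
true_x = np.array([1.0, 2.0, 0.5])
observed = model(w, true_x)            # confidence score leaked by the model

guess = invert_sensitive_feature(w, true_x[1:], observed, candidates=[0.0, 1.0, 2.0])
print(guess)
```

The sketch succeeds because the exact confidence score is exposed; perturbing or coarsening that score is one reason noise-based defenses are discussed for this threat.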
Membership inference attack. Given a machine learning model $f(w; x_1, \ldots, x_n)$ and some sample instances, the membership inference attack tries to discover whether or not an instance exists in the training dataset [61]. The membership inference attack poses a significant privacy issue, as membership can expose a person's private information. For instance, determining a person's presence in a hospital's clinical trial training dataset indicates that this person was once a patient at the hospital. The patient and the hospital are the two key parties interested in defending against such membership inference attacks. Patients consider their membership confidential and do not wish for their sensitive information to be made public. At the same time, the hospital does not want to be prosecuted for leaking patient data. Almadhoun et al. [62] demonstrated the first membership inference attack in the medical area, which infers the personal information of the participants in a genomic dataset. Truex et al. [63] showed the threat of membership inference attacks when the attacker is a member of the FL environment. The member could be the aggregation service or one of the client nodes. Their FL configuration is different from the one discussed above. Instead of pooling the weights to construct a new global model, each node trains its local model and contributes just the prediction probability when inferring a new instance.
The process of a membership inference attack consists of three steps [61]. Firstly, the attacker develops a shadow dataset $D'$ that mimics the target model's training dataset $D$. Secondly, the attacker creates a shadow model using the shadow dataset $D'$, which mimics the target model's behavior. In this step, the attacker observes how the shadow model responds to instances known to have been used during training compared with those that were not. This behavior is used to create an attack dataset that captures the difference between instances in the training data and data that have not been seen previously. Finally, this attack dataset is used to construct a binary classifier that predicts whether an instance was previously used to train the target model, based on the target model's output.

Data Privacy Protections for Federated Learning
There are two methods to protect data privacy from data leakage and attacker in the FL environment: perturbation and encryption. The perturbation method preserves private data and model privacy by adding a controlled random noise to the training data or the machine learning model parameters during the training process. For instance, differential privacy [18,43,44,55] and hybrid exchange parameters [39] algorithms are the perturbations techniques implemented in the FL studies published in medical applications. In comparison, the encryption method preserves private data and model privacy by encrypting the parameters exchanged and the gradients in the aggregation process in the FL environment, such as the homomorphic encryption algorithm [20,21,51].
Data Privacy Protections with Differential Privacy (DP) Method. Combining a deep learning model with privacy protection is an emerging research focus. For instance, many researchers use differential privacy (DP) methods to secure the deep learning model. Inspired by the successful implementation of DP in centralized learning, several researchers implemented DP in distributed training, especially in FL studies for medical applications [18,43,44,55]. Dwork et al. [64] introduced differential privacy as a notion of privacy, ensuring that data analytics do not compromise privacy. It ensures that the effect of an individual's data on the model output is restricted. In other words, differential privacy aims for an algorithm's result to be nearly identical whether or not the dataset contains data about a specific individual. This technique can prevent the membership inference attack, in which the attacker tries to find out whether a specific individual is in the training dataset. Differential privacy is achieved by adding controlled statistical noise to the machine learning model's input or output. While the added noise hides the contribution of any specific individual's data, the result still provides insights into the entire population without compromising privacy. The amount of added noise is governed by the privacy budget, denoted by epsilon ($\varepsilon$); a smaller $\varepsilon$ corresponds to more noise and stronger privacy. Gaussian and Laplace are the two controlled random noise mechanisms implemented in differential privacy for the FL studies in medical applications. Differential privacy with the Gaussian noise mechanism is a common technique in FL studies [18,43,44,55]. For instance, Li et al. [18] and Vaid et al. [55] incorporated Gaussian noise into the model learning process to protect against model inversion attacks. In addition, Zhang et al. [44] and Yan et al. [43] proposed a differential privacy technique with a generative adversarial network (DPGAN) to generate private data samples at a medical node in a federated environment. Specifically, Zhang et al. [44] added controlled noise to the gradient values in the discriminator of their generative adversarial network (GAN) for image sampling in federated learning, thereby perturbing the original data distribution. Their experiments showed that this method could address the lack of data availability and the non-IID issue in FL while keeping patient data private. In addition, Zhang et al. [44] found that a smaller amount of Gaussian noise in DP improves the model performance. Besides the Gaussian noise mechanism, differential privacy with the Laplace noise mechanism was implemented by Li et al. [18]. Li et al. [18] showed that when the Laplace noise level was too high, the deep learning model failed at the classification task.
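The two noise mechanisms mentioned above can be sketched as follows. The gradient clipping step, the parameter names (`clip_norm`, `noise_multiplier`, `sensitivity`, `epsilon`), and the toy values are assumptions chosen for illustration and do not reproduce the configurations of the cited studies; the Laplace scale of sensitivity divided by epsilon is the standard form of that mechanism.

```python
import numpy as np


def clip_and_noise(grad, clip_norm, noise_multiplier, rng):
    """Gaussian mechanism sketch: bound the gradient norm, then add noise."""
    grad = np.asarray(grad, dtype=float)
    scale = min(1.0, clip_norm / (np.linalg.norm(grad) + 1e-12))
    clipped = grad * scale                                   # L2 norm <= clip_norm
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Laplace mechanism: noise scale grows as the privacy budget shrinks."""
    value = np.asarray(value, dtype=float)
    return value + rng.laplace(0.0, sensitivity / epsilon, size=value.shape)


rng = np.random.default_rng(0)
private_grad = clip_and_noise([3.0, 4.0], clip_norm=1.0, noise_multiplier=0.5, rng=rng)

# Compare the average noise magnitude for a large and a small privacy budget.
weak_privacy = laplace_mechanism(np.zeros(10000), sensitivity=1.0, epsilon=10.0, rng=rng)
strong_privacy = laplace_mechanism(np.zeros(10000), sensitivity=1.0, epsilon=0.1, rng=rng)
print(np.abs(weak_privacy).mean(), np.abs(strong_privacy).mean())
```

The comparison at the end mirrors the trade-off reported above: a small privacy budget injects much larger noise, which is consistent with the observation that excessive noise degrades classification performance.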
Data Privacy Protection with Homomorphic Encryption Method. Homomorphic encryption (HE) is used to ensure data privacy by encrypting the parameters exchanged in the gradient aggregation process. Several recent FL studies for healthcare applications implemented HE during FL training [20,21,51]. Homomorphic encryption is categorized into fully homomorphic encryption (FHE) and additively homomorphic encryption (AHE) [65]. An FHE scheme is an encryption method that allows analytical functions to be run directly on the encrypted data while producing the same encrypted output as if the functions were executed on the plaintext. In other words, if we perform an add or multiply operation on the ciphertext, the decryption result is the same as the result obtained by performing the same operation on the plaintext. In comparison, AHE is an encryption method that allows only one type of operation to be run directly on the encrypted data and produces the same encrypted output as if the function were executed on the plaintext. In other words, the AHE scheme is intended for use with specific applications that require simple addition or multiplication operations. Formally, an encryption method is called homomorphic over an operation "+" if it supports Equation (10):
$E(w_1) + E(w_2) = E(w_1 + w_2),\quad \forall\, w_1, w_2 \in W,$
where $E(\cdot)$ is the encryption method and $W$ is the set of machine learning model parameters.
For instance, in the AHE scheme, for parameters $w_1$ and $w_2$, one can obtain $E(w_1 + w_2)$ by using $E(w_1)$ and $E(w_2)$ without knowing $w_1$ and $w_2$ explicitly. Most of the FL studies for healthcare applications leverage AHE rather than FHE since FHE is computationally more expensive than AHE. For example, Chen et al. [20] and Wu et al. [21] incorporated AHE in their local model parameter sharing and gradient aggregation between healthcare nodes and the aggregate server. Xue et al. [51] adopted two AHE schemes for a lightweight privacy module to prevent privacy leakage of patient EMRs in medical edge devices.
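The additive property can be demonstrated with a toy Paillier cryptosystem, a well-known additively homomorphic scheme. This is a sketch for intuition only: the primes are far too small to be secure, the cited studies may use different schemes, and the integer encoding of parameters (weights scaled by 1000) is an assumption. Note that in Paillier the "+" on ciphertexts is realized as a modular multiplication.

```python
import math
import random


def keygen(p=293, q=433):
    """Toy Paillier key generation with tiny, insecure primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    g = n + 1
    mu = pow(lam, -1, n)          # modular inverse; valid for g = n + 1
    return (n, g), (lam, mu)


def encrypt(pub, m, rng):
    """Encrypt an integer 0 <= m < n with fresh randomness r."""
    n, g = pub
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return (((x - 1) // n) * mu) % n


def add_encrypted(pub, c1, c2):
    """Homomorphic addition: multiply ciphertexts modulo n^2."""
    n, _ = pub
    return (c1 * c2) % (n * n)


rng = random.Random(0)
pub, priv = keygen()
# Hypothetical local parameters 1.234 and 4.321, encoded as integers (x1000).
c1 = encrypt(pub, 1234, rng)
c2 = encrypt(pub, 4321, rng)
total = decrypt(pub, priv, add_encrypted(pub, c1, c2))
print(total / 1000)  # the server added the values without seeing either one
```

In an FL setting, the aggregate server would perform only `add_encrypted`, while decryption stays with the party holding the private key.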

Benchmark Medical Dataset for Federated Learning
The dataset utilized in FL studies can vary depending on the task. For instance, some datasets concentrate on the performance of classification tasks, while others concentrate on segmentation tasks. There are datasets such as LEAF [66] and FedVision [67] for FL algorithm benchmarking. However, there is no specific open public medical dataset for FL algorithm benchmarking due to limited quantity, patient security, and privacy. Therefore, a comprehensive list of relevant medical datasets is compiled from published FL papers for future research on this topic. From 24 published FL papers in the healthcare area, 16 publications used the public dataset listed in Table 3. We exclude eight publications from the list because these papers use their institution/private dataset.
Besides benchmark medical datasets for federated learning, numerous scientific research communities and industries have developed various tools to accelerate the growth of federated learning. We summarize the federated learning tools based on data configuration challenges in Table 4.

Table 3. Summary of public medical datasets in recent FL studies applied in the medical area for algorithm benchmarking.

| Dataset Type | Dataset Name | Description | FL Study |
|---|---|---|---|
| Healthcare dataset: medical image classification | Autism Brain Imaging Data Exchange (ABIDE) I [68] | ABIDE I is a consortium dataset openly sharing 1112 functional magnetic resonance imaging (fMRI) datasets from 539 patients with autism spectrum disorders. | [18] |
| Healthcare dataset: medical image classification | Public COVID-19 Image Data Collection [69] | The dataset consists of 108 healthy chest X-ray images and 108 confirmed COVID-19 chest X-ray images taken from 76 patients. | [40,44,54] |
| Healthcare dataset: medical image classification | Facial Emotion Recognition (FER) 2013 [70] | The FER2013 dataset consists of 35,887 human facial emotion images. The dataset is labeled with seven emotions: neutral, anger, disgust, sadness, happiness, surprise, and fear. | [37] |
| Healthcare dataset: medical image segmentation | Brain Tumor Image Segmentation Benchmark (BraTS) 2017 and 2018 [71] | The BraTS 2017 data were collected from 13 institutions and consist of brain tumor scans from 359 patients. | [38] |
| Healthcare dataset: medical image segmentation | SPIE-AAPM PROSTATEx dataset [72] | The PROSTATEx dataset consists of 343 prostate cancer MRI images from Siemens 3T MR scanners, the MAGNETOM Trio and Skyra. | [43,50] |
| Healthcare dataset: electronic health record | MobiAct [73] | The MobiAct dataset is a human activity dataset taken from 57 volunteers (42 men and 15 women). | [21] |
| Healthcare dataset: electronic health record | Human Activity Recognition (HAR) [74] | The HAR dataset was collected from 30 volunteers. Each subject performed different activities such as walking, sitting, standing, and laying. There are 10,299 instances with 561 time-series features. | [20,45] |
| Healthcare dataset: electronic health record | WESAD (Wearable Stress and Affect Detection) [75] | WESAD is a dataset for wearable stress and affect detection. Taken from 15 participants, WESAD consists of 12 features with 63,000,000 time-series samples. | [42] |
| Healthcare dataset: electronic health record | Medical Information Mart for Intensive Care (MIMIC) III [76] | The MIMIC III dataset was collected from 40,000 patients during their stays in the ICU at Beth Israel Deaconess Medical Center between 2001 and 2012. | [46] |
| Healthcare dataset: electronic health record | The eICU collaborative research database [77] | Critical care dataset consisting of data from 200,859 patients from 208 hospitals in the United States. | [19,39,56] |
| Nonhealthcare dataset: image classification, sentiment analysis | LEAF Dataset [66] | The LEAF benchmarking framework consists of image and text datasets such as EMNIST, CelebA, Shakespeare, and Synthetic datasets. | [66] |
| Nonhealthcare dataset: image classification | FedVision-Real World image dataset for FL [67] | The FedVision dataset contains more than 900 real-world images generated from 26 street cameras. Precisely, it consists of 7 classes with detailed bounding boxes. This dataset has non-IID properties reflecting a real-world data distribution. | [67] |

ABIDE: autism brain imaging data exchange; BraTS: brain tumor image segmentation benchmark; eICU: electronic intensive care unit; FER: facial emotion recognition; FL: federated learning; fMRI: functional magnetic resonance imaging; HAR: human activity recognition; MIMIC: medical information mart for intensive care; MR: magnetic resonance; IID: independent and identical data distribution; WESAD: wearable stress and affect detection.
HFL: horizontal federated learning; VFL: vertical federated learning; FTL: federated transfer learning; IID: independent and identically data distribution; DP: differential privacy; HE: homomorphic encryption.

FL Studies for Healthcare Applications
Published FL studies in medical applications mostly come with two tasks: classification and segmentation, as summarized in Table 5. In our selected papers, there are 24 studies. Out of these studies, 21 studies are on classification tasks, and three are on segmentation tasks. The following subsections describe the existing studies on FL for healthcare applications, organized by the application task type.

Classification Task in FL for Healthcare Applications
Classification is a common task tackled in the published FL applications in the medical domain. In machine learning, classification algorithms learn how to classify or annotate a given set of instances with classes or labels. There are several classification tasks that are studied in federated learning setting in healthcare, e.g., autism spectrum disorder (ASD) [18], cancer diagnosis [41,43,49], COVID-19 detection [40,44,48,54], human activity and emotion recognition [20,21,37,42,45], patient hospitalization prediction [52], patient mortality prediction [19,39,46,47,55,56], and sepsis disease diagnosis [51]. The summary of classification tasks in FL studies for medical application is listed in Table 5.
Cancer diagnosis. Recent studies show that researchers are employing FL technology to develop machine learning models for cancer diagnostic applications [41,43,49]. For instance, Lee et al. [27] proposed a CNN-based model to classify whether thyroid nodules were benign or malignant. The training data were 8457 ultrasound images collected from six institutions. The results show that the performance of the FL-based method was comparable with centralized learning, with accuracy, sensitivity, and specificity of 97%, 98%, and 95%, respectively. Similarly, Rajendran et al. [49] implemented FL with an MLP model for lung cancer classification using two independent cloud providers. The model was initialized, trained, and transferred from one node to another using a cloud repository. The model achieved 92.8% accuracy in classifying cancer. Another study by Yan et al. [43] transformed all nodes' raw medical image data onto a common space via image-to-image translation without violating FL's privacy settings. The image-to-image translation was done using a cycle generative adversarial network (CycleGAN) model. The proposed method, trained with eight medical nodes, achieved 98% accuracy and 99% area under the curve (AUC) in classifying prostate cancer.

COVID-19 detection. For COVID-19 detection applications [40,44,48,54], FL is a potential approach for connecting medical image data from medical institutions, enabling them to develop a model while maintaining patient privacy. In this case, the model's performance is considerably enhanced by the diverse medical datasets from several institutions. For instance, Abdul Salam et al. [54] experimented with different federated learning architectures for binary COVID-19 classification. Their results showed that the federated learning model with a GAN architecture and stochastic gradient descent (SGD) optimizer had a higher accuracy while keeping the loss score lower than the centralized machine learning model. The model achieved accuracy and AUC of 98.30% and 9.63%, respectively. Similarly, Dou et al. [48] showed the efficacy of a federated learning system for detecting COVID-19-related CT anomalies using patients' medical data from hospitals in one country as training data and then validating the model with medical data from other countries. Specifically, the authors trained an MLP model using 132 patients from three hospitals in Hong Kong and validated the model's generalizability using medical datasets from China and Germany. The system achieved 83.12% in terms of AUC. Feki et al. [40] showed that increasing the number of medical nodes decreases the number of training rounds needed for the model to converge and increases the model performance in CT-X-ray COVID-19 prediction. The authors proposed a CNN-based model architecture and achieved a performance of 95.27% AUC. Similar results were obtained by Zhang et al. [44], who proposed an FL framework that enables medical nodes to generate high-quality training data samples with a privacy-protection approach. Specifically, the proposed method addresses the lack of COVID-19 medical training data in a federated environment. A GAN-based architecture was employed in the proposed system and achieved a comparable performance of 94.11% accuracy.
Human activity and emotion recognition. With increasing research on wearable technology and the internet of health things (IoHT), FL is one of the solutions for preserving users' privacy while collaborating to develop a model for human activity and emotion recognition [20,21,37,42,45]. For example, Chen et al. [20] developed a deep learning model to classify human activities such as walking, sitting, standing, and laying. The authors then refined the trained CNN-based model with federated transfer learning to obtain a personalized model for each edge device. The system achieved 99.4% accuracy in classifying human activities. Similarly, Wu et al. [21] developed a cloud-edge federated learning infrastructure to create a patient privacy-aware deep learning model for in-home monitoring applications. The authors developed an autoencoder (AE) model architecture and then deployed the model to five different healthcare nodes. The FL system achieved an accuracy of 95.41%. Chhikara et al. [37] combined speech signals and facial expressions to create an emotion index for monitoring a patient's mental health. Using the facial emotion recognition (FER) dataset collected from several data silos, the authors employed a federated learning technique and an AE-based architecture to create a secure machine learning model for classifying human emotions. The FL algorithm showed an AUC of 88%.
Patient mortality prediction. FL enables early predictive modeling based on several data sources, which can give clinicians additional insight into the risks and benefits of treating patients earlier [19,46,47,51,52,55,56]. Huang et al. [19] used drug features to forecast critical care patients' mortality and ICU stay time. Their algorithm, based on an AE architecture, also addresses non-IID ICU patient data by grouping patients into clinically significant communities with shared diagnoses and geographical regions and then training one model per community. The proposed FL algorithm showed an AUC of 69.13%. In a similar study, Brisimi et al. [52] proposed a method to forecast future hospitalizations of patients with heart-related disorders by solving an L1-regularized sparse support vector machine (SVM) classification problem in a federated learning environment. The proposed FL model achieved an AUC of 77.47%. Shao et al. [47] proposed an MLP-based framework to predict in-hospital mortality among patients admitted to the intensive care unit. Their findings indicate that training the model in a federated learning framework produces outcomes comparable to those obtained in a centralized learning environment, with an AUC of 97.76%. Vaid et al. [55] demonstrated federated learning with an MLP-based model architecture to predict mortality within seven days in patients with COVID-19. Their experiment showed that the federated learning algorithm produced a robust predictive model while preserving the patients' confidential information, with an AUC of 82.9%.
Other healthcare areas. Besides the healthcare areas mentioned above, FL has also been applied to sepsis treatment [51] and autism spectrum disorder classification [18]. Xue et al. [51] developed a fully decentralized federated framework (FDFF) that integrates a neural network model across edge devices to extract knowledge from internet-of-things data for healthcare applications. Using FDFF, the edge devices can build a double deep Q-network (DDQN) that gives suggestions for sepsis treatment. In addition, Li et al. [18] proposed FL for multisite autism spectrum disorder (ASD) fMRI analysis.

Segmentation Task in FL for Healthcare Applications
Segmentation of medical images has become an essential clinical task in healthcare applications. Medical image segmentation is the process of identifying and selecting a region of interest within a medical image. Medical images can take several forms, such as MRI or CT scans. Several FL studies on medical image segmentation have been published, covering brain tumors [38], COVID-19 regions [53], and prostate cancer regions [50]. The published FL studies on segmentation tasks are summarized in Table 5. Specifically, for brain tumor segmentation using brain MRI images, Sheller et al. [38] applied the FL algorithm with a CNN-based architecture for multi-institutional collaboration while keeping patient data private. Compared to existing collaborative learning approaches, FL achieved the highest dice score of 85% and scaled more effectively as the number of collaborating institutions increased. Using multinational three-dimensional chest CT images from three countries, Yang et al. [53] applied federated semi-supervised learning with a 3D u-shape fully connected layer model architecture to segment the COVID-19 disease region. Federated semi-supervised learning can maintain good training performance even when some healthcare sites have few annotated data compared to unannotated data. Additionally, the semi-supervised environment may alleviate some of the strain associated with expert annotation, which is critical given the present pandemic crisis. Similarly, Sarma et al. [50] performed prostate segmentation with a 3D anisotropic hybrid network (3D AH-Net) model on MRI in a collaboration among industry, public universities, and a federal institution. Evaluated with three medical nodes, the proposed FL algorithm showed a dice score of 88.9%.
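The dice scores reported above measure the overlap between a predicted mask and a reference mask, computed as twice the size of their intersection divided by the sum of their sizes. The following is a minimal sketch of that computation for binary masks, not the evaluation code of any cited study. The function name `dice_score`, the flat 0/1 list representation, and the convention for two empty masks are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labeled 1 in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (illustrative values, not medical data).
predicted = [0, 1, 1, 1, 0, 0]
reference = [0, 1, 1, 0, 0, 1]
print(dice_score(predicted, reference))
```

In a federated setting, each node could compute this score on its own held-out images, so that only the scalar metric, and not the images, leaves the node.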

RQ3:
What are the research gaps and potential future research directions of FL related to medical data?

Open Challenges
In this survey, we review the current progress on federated learning in the healthcare field. We highlight the comprehensive solutions to federated learning issues related to medical data configurations to provide a valuable resource for researchers. In what follows, we list some potential research directions or open questions when federated learning is applied in the healthcare area.
FL with Medical Data Stream. Medical data streams are collections of medical data that grow constantly and rapidly over time and are generated during the treatment and monitoring of patients. For instance, in telemedicine or patient monitoring, medical monitoring devices generate a large amount of time-sensitive data when tracking patient vital signs such as temperature, heart rate, and blood pressure. These data form a stream of medical signals displayed for interpretation by physicians. Parts of these data could be used in real time to alert physicians about changes in a patient's condition. Medical data streams arrive periodically, and we would like to develop an analytic model that extracts meaningful patterns or risk factors in real time. Federated learning combined with medical data streams could improve training and security performance, as inconsistencies in evolving medical datasets and the data transmission between the FL coordinator and participant nodes can be greatly reduced [25]. However, medical data streams are usually fast and large, and they must be handled in real time. In addition, medical data streams are dynamic, so the FL algorithm has to respond to these changes. Thus, it is essential to design an efficient federated learning algorithm that achieves good accuracy with low memory use and minimal processing time on medical data streams.
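One way to picture these requirements is sketched below. It is an illustrative assumption rather than a method from the reviewed studies: each participant node keeps only a bounded sliding window of recent readings (limiting memory and following changes in the stream), runs a few local SGD passes on a simple linear model, and a coordinator combines the results with a sample-weighted average in the style of federated averaging. The names `StreamNode` and `aggregate`, the linear model, and the synthetic signal are all hypothetical.

```python
from collections import deque


class StreamNode:
    """Hypothetical participant that trains on a bounded window of recent readings."""

    def __init__(self, window_size=50, lr=0.01):
        # Old readings drop out automatically once the window is full.
        self.window = deque(maxlen=window_size)
        self.lr = lr

    def observe(self, x, y):
        self.window.append((x, y))

    def local_update(self, w, b, epochs=5):
        """Run a few SGD passes for y ~ w*x + b over the current window only."""
        for _ in range(epochs):
            for x, y in self.window:
                err = (w * x + b) - y
                w -= self.lr * err * x
                b -= self.lr * err
        return w, b, len(self.window)


def aggregate(updates):
    """Sample-weighted average of (w, b, n) updates from all nodes."""
    total = sum(n for _, _, n in updates)
    w = sum(w_k * n for w_k, _, n in updates) / total
    b = sum(b_k * n for _, b_k, n in updates) / total
    return w, b


nodes = [StreamNode(window_size=20, lr=0.1), StreamNode(window_size=20, lr=0.1)]
w, b = 0.0, 0.0
t = 0
for round_idx in range(30):
    updates = []
    for k, node in enumerate(nodes):
        for _ in range(10):  # ten new readings arrive between rounds
            x = ((t + 3 * k) % 10) / 10.0
            node.observe(x, 2.0 * x + 1.0)  # synthetic signal, not patient data
            t += 1
        updates.append(node.local_update(w, b))
    w, b = aggregate(updates)
print(round(w, 2), round(b, 2))
```

Only the model parameters and window sizes travel to the coordinator; the raw readings stay in each node's window. A practical system would additionally need drift detection and secure transmission, which this sketch omits.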
FL with Hybrid Medical Data Partition. In the HFL data partition, the nodes share the same features X and label Y but have different data samples I. Thus, HFL aims to overcome limited sample size and variability by combining data samples from all nodes when developing a model. In the VFL data partition, the nodes share the same data samples I but have different features X and labels Y. Therefore, VFL aims to enrich the features by combining features from all nodes when developing a model. In practice, however, we often need to address limited sample size and enrich the features simultaneously. For instance, in healthcare insurance, a node may possess only partial features or partial data samples because it serves only a fraction of users and holds only partial records. Incorporating both the HFL and the VFL data partitions results in a hybrid data partition. Compared to HFL and VFL, a hybrid FL data partition has its own challenges. In HFL, each node shares neither its local data nor its labels. In contrast, in VFL, the users' index is either shared with the server or securely stored in one node as a key for aligning the features [56]. A hybrid FL data partition must handle both types of nodes so that the FL training algorithm can run without requiring the aggregation server to access any data, including the users' index. New architectures and training algorithms in FL will be required to exploit the benefits of the hybrid data partition effectively.
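The distinction among the three partition types can be made concrete with a small sketch. It assumes, for illustration only, that each node is described by the set of sample identifiers I it holds and the set of feature names X it records. The function `classify_partition` and the example nodes are hypothetical, and real deployments involve partial overlaps that are determined through privacy-preserving entity alignment rather than by comparing plain sets.

```python
def classify_partition(node_a, node_b):
    """Label how two nodes' data relate: horizontal, vertical, identical, or hybrid.

    Each node is a dict with 'samples' (set of sample IDs) and
    'features' (set of feature names).
    """
    same_samples = node_a["samples"] == node_b["samples"]
    same_features = node_a["features"] == node_b["features"]
    if same_samples and same_features:
        return "identical"
    if same_features:
        return "horizontal"  # same features X, different samples I
    if same_samples:
        return "vertical"  # same samples I, different features X
    return "hybrid"  # both samples and features differ


# Hypothetical nodes with made-up identifiers.
hospital_a = {"samples": {"p1", "p2", "p3"}, "features": {"age", "heart_rate"}}
hospital_b = {"samples": {"p4", "p5"}, "features": {"age", "heart_rate"}}
laboratory = {"samples": {"p1", "p2", "p3"}, "features": {"glucose", "creatinine"}}
insurer = {"samples": {"p2", "p5", "p9"}, "features": {"age", "claims"}}

print(classify_partition(hospital_a, hospital_b))
print(classify_partition(hospital_a, laboratory))
print(classify_partition(hospital_a, insurer))
```

The insurer overlaps with the hospital only partly in both patients and features, which is the case that neither a purely horizontal nor a purely vertical training scheme covers.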
FL with Incentive Mechanism for Good Data Contributor. The internet of health things (IoHT) uses internet of things (IoT) devices in e-health applications to connect healthcare resources and patients. IoHT devices such as smartwatches and wearable healthcare trackers can record heart rate, body temperature, and blood pressure. These rich healthcare data are well suited to personal smartphone healthcare apps that run on-device federated learning. However, the IoHT nodes bear significant computation and communication costs during the federated model training process. Without a proper incentive mechanism, those IoHT nodes will be reluctant to participate in federated learning. In addition, a suitable incentive mechanism can include both rewards and punishments. A contributor of good-quality personal healthcare data can obtain a reward, while a contributor of harmful data can receive a punishment. Thus, an effective and efficient incentive mechanism can attract good data contributors to join federated learning.
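The reward-and-punishment idea can be sketched as follows. This is a hypothetical scheme for illustration, not a mechanism from the reviewed literature. It assumes that a contribution score per node is already available (for example, the change in validation accuracy attributable to that node's update), splits a fixed budget among positive contributors in proportion to their scores, and charges a penalty to nodes with negative scores. The name `settle_round` and the parameters `budget` and `penalty_rate` are assumptions.

```python
def settle_round(contributions, budget=100.0, penalty_rate=500.0):
    """Map hypothetical per-node contribution scores to payouts for one round."""
    positive_total = sum(c for c in contributions.values() if c > 0)
    payouts = {}
    for node, score in contributions.items():
        if score > 0:
            # Positive contributors share the budget proportionally.
            payouts[node] = budget * score / positive_total
        else:
            # Zero earns nothing; a negative score incurs a penalty.
            payouts[node] = penalty_rate * score
    return payouts


# Made-up scores for three IoHT devices.
scores = {"watch_a": 0.03, "tracker_b": 0.01, "watch_c": -0.02}
payouts = settle_round(scores)
print(payouts)
```

How contribution scores should be measured without inspecting a node's private data is itself part of the open problem, so the scores here are simply taken as given.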
Limitation and future perspective. There are two limitations to the present study. The first limitation is that existing FL experiments focus exclusively on one of the non-IID properties, such as data imbalance or label skew. There are no comprehensive experiments on medical datasets that examine multiple non-IID properties together. Future work should identify additional algorithms for addressing the issues associated with hybrid non-IID features. The second limitation concerns hyperparameter search frameworks for FL. Hyperparameter tuning is a critical yet time-consuming step in the machine learning workflow. Optimization of hyperparameters becomes considerably more difficult in federated learning, in which models are trained across a dispersed network of heterogeneous data silos. Thus, an automatic tool or framework for selecting the optimal hyperparameters of an FL model is critically needed in future research.

Conclusions
We presented the advancement of federated learning in healthcare applications over the last four years in terms of data properties such as data partition, data distribution, data privacy attack and protection, and benchmark datasets. We hope that this study stimulates additional research into FL in healthcare applications and eventually becomes a guideline for handling sensitive medical data. Several open challenges remain, including FL for medical data streams, FL with hybrid medical data partitions, and incentive mechanisms for good medical data contributors. We envision the increased popularity of FL for medical purposes in the near future, resulting in more advanced protocols with security and privacy guarantees and the actual deployment of FL technology for solving real-world problems in the healthcare domain.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments:
The authors express their gratitude to the anonymous reviewers for their comments and recommendations, which significantly improved the original work.

Conflicts of Interest:
The authors declare no conflict of interest.