A Federated Learning Framework for Breast Cancer Histopathological Image Classification

The quantity and diversity of datasets are vital to model training in a variety of medical image diagnosis applications. However, two problems arise in real scenarios: the required data may not be available in a single institution because of the number of patients or the type of pathology, and sharing patient data is often infeasible under medical data privacy regulations. Keeping private data safe is therefore required and has become an obstacle to fusing multi-party data for training a medical model. To solve these problems, we propose a federated learning framework in which knowledge fusion is achieved by sharing the model parameters of each client through federated training rather than by sharing data. On the breast cancer histopathological dataset (BreakHis), our federated learning experiments achieve results similar to the performance of centralized learning, verifying the feasibility and efficiency of the proposed framework.


Introduction
With the rapid development of Artificial Intelligence (AI), machine learning approaches have been widely used in smart medical diagnosis [1,2]. The success of smart medical diagnosis depends on a large amount of high-quality labeled data from which machine learning models obtain knowledge. However, access to medical data is extremely restricted out of consideration for patient privacy and data confidentiality. Breaking down isolated data islands and strengthening the privacy and security of data are the two significant challenges when applying artificial intelligence to smart medical diagnosis. As a secure knowledge fusion approach, federated learning allows data owners to train their models locally and then aggregate model parameters rather than fusing data directly. This study explores a federated learning framework that enables an intelligent model to learn from multi-sourced data without damaging data privacy, and takes breast cancer (BC) as an example in the experiments.
Breast cancer tops the list of cancers among women, and early screening is critical for treatment effectiveness. Traditional diagnosis involves the participation of medical professionals, with the attendant risks of treatment delay and subjective judgment. Based on breast cancer histopathological datasets, intelligent diagnosis methods [3][4][5] have been developed to accelerate the cancer diagnostic process. In the early years of relevant research, hand-crafted feature engineering [6][7][8] dominated the automatic cancer classification task. In 2012, AlexNet won the ImageNet challenge, marking the beginning of the era of convolutional feature extraction [9]. The implementation of an AlexNet variant on classification tasks brought an increase in accuracy [10]. With the popularization of deep learning, multiple updated deep convolutional algorithms have been applied to histopathological images, all excelling on the most widely used breast cancer histopathological (BreakHis) dataset [11][12][13].
Research to date has tended to focus on algorithm innovations rather than data enhancement. It has previously been observed that 100% accuracy can be achieved or approached on the training set after continuous training with the five-fold strategy on the BreakHis dataset [14][15][16]. Present deep convolutional networks are adequate for the classification task on the BreakHis dataset [17,18]. Furthermore, real-world histopathological data is more complex and diverse than that of the experimental environment. This situation indicates a need for more extensive and diverse data to improve the generalization and robustness of models in practical applications.
In reality, medical data, including breast histopathological images, cannot be collected and exchanged outside hospitals. Sufficient high-quality data is essential for training machine learning models. However, medical datasets normally suffer from uneven distribution and insufficient data due to collection difficulties. As a result, the contradiction between privacy protection and the requirement of adequate data fusion is a crucial obstacle to smart healthcare development [19,20]. In this situation, a method that fuses knowledge derived from data instead of fusing the data itself, i.e., federated learning [21], is suitable for the development of intelligent medical diagnosis systems.
Two information security technologies are widely used to protect the knowledge fusion process of machine learning. The first is secure multi-party computation (SMC) [22][23][24], which realizes multi-participant joint training under the protection of model weights. However, since there is still a risk of deriving source data information from model parameters, a random noise mechanism is also adopted, known as differential privacy (DP) [25][26][27]. The federated learning process in this paper combines SMC and DP to achieve multi-party joint modeling without compromising performance.
In this paper, we propose an efficient and feasible federated learning framework for medical image diagnosis. Based on the proposed framework, we design an efficient system with resource efficiency in mind. We take breast cancer diagnosis as a practical case for the proposed framework. Several comparative experiments are conducted on the BreakHis dataset. The experimental results verify that federated learning is an effective way to solve the data silo and data privacy issues in the knowledge fusion process of the intelligent medical field.

Breast Cancer Diagnosis
The texture of the nuclear image provides a reproducible pattern in the histopathological and cytopathological determination of cancer [28]. This reproducible pattern makes it possible for pathological cancer images to be processed automatically without human intervention. Worldwide, breast cancer is the most common cause of cancer death in women. For the diagnosis of breast cancer, a large number of computer-aided technologies have been proposed. Generally, the early stage of diagnosis is a cytological examination of breast tumor material, after which several classifiers are applied to specific nuclei features. In the hand-crafted feature stage, histopathological image features need to be designed and filtered. For nuclei feature engineering, clustering methods such as k-means and fuzzy c-means are used in the color space for fast nuclei segmentation [29]. Instead of accurate segmentation, the circular Hough transform is adopted for nuclei estimation [30,31]. A neural network with back-propagation has also been used for the analysis of cytological images; its performance is comparable to a traditional support vector machine (SVM) classifier [31].
In terms of breast cancer imaging, most existing work [31][32][33] has been based on whole-slide imaging (WSI), which is costly to process and operate in practice. A large, publicly available, annotated dataset is crucial to developing intelligent cancer diagnosis systems. However, the difficulty in obtaining medical data is also a bottleneck in the development of breast cancer detection technology. The emergence of the BreakHis dataset [34] slightly alleviates the problem, supporting more efficient detection technologies in combination with deep learning. Deep convolutional neural networks are applied directly to the histopathological images for feature extraction [10,11]. It is worth noting that breast cancer image classification is feasible without considering magnification factors (40×, 100×, 200× and 400×) [11], which means that uniformity of image magnification across all medical parties is not required.

Federated Learning
As mentioned above, machine learning models can benefit from the fusion of mass data. However, access to data is extremely restricted in the medical field out of consideration for user privacy and data confidentiality. In this situation, privacy-preserving decentralized collaborative machine learning techniques are suitable for the development of intelligent medical diagnosis systems. Since Google first proposed federated learning in 2016, the concept has been extended to cover collaborative learning and knowledge fusion scenarios among organizations. There are several specific frameworks, including horizontal federated learning, vertical federated learning and federated transfer learning [21]. For isolated data and label deficiencies, federated learning provides a safe and efficient solution: it can train machine learning models that synthesize and fuse the information provided by multi-party data while keeping the data localized.
A considerable portion of research has been devoted to the secure computation of machine learning algorithms. For linear regression, a primary machine learning algorithm, researchers have fitted the best curve without disclosing the input data by using homomorphic encryption and Yao's garbled circuits [35]. A gradient boosting decision tree (GBDT) secure computing system with privacy protection allows each private party to train a model independently, after which the models are aggregated safely [36]. After WeBank proposed its industrial-level federated learning framework, the SecureBoost framework for vertical federation [37] and the loose privacy constraint method for horizontal federation [38] both provided efficient solutions for GBDT secure computing. Google has been committed to mobile federated learning, optimizing federated communication efficiency [39,40] and building scalable federated production systems [41]. Furthermore, cross-device federated learning [42] deals with a large number of unreliable devices with limited computing capabilities and slow access links; by contrast, cross-silo federated learning [43] handles at most a few hundred reliable data silos with powerful computing resources and high-speed connections.

Overview
The proposed federated learning framework for efficient medical image diagnosis is illustrated in Figure 1. The system consists of three modules: the user platform, the federated server, and the federated clients.
The user platform includes two sub-modules: medical tools and data servers. The medical tools provide users with various types of medical prediction services, and the data server provides the labeled data required for model training. The data sources can be datasets established by the platform or provided by medical institution partners.
The federated server contains a task scheduler, a model container, and a federated module. As illustrated in Figure 2, the task scheduler maps the service request, which is encrypted with the Advanced Encryption Standard (AES) [44], to the self-built or third-party model container for prediction. The model container decrypts the request, sends it to the corresponding model, and returns the result after AES encryption. When the prediction result of the model is not satisfactory, the federated module loads the model for online federated training optimization. The federated clients download the corresponding model from the federated server when the online federated training optimization begins. As shown in Figure 3, each client trains the model with its local data independently and uploads the parameters, encrypted with the Cheon-Kim-Kim-Song (CKKS) homomorphic encryption algorithm [45], to the federated server. The federated server adopts the federated averaging algorithm [46] to aggregate the parameters of each client. The model parameters are updated and sent back to each federated client. This online federated training optimization process is repeated until the model completes training.

Workflow
The overall workflow is as follows. Firstly, the user initiates a service request on the user platform. The service request is encrypted and sent to the task scheduler. Secondly, the processing center of the task scheduler analyzes the request and maps it to the self-built or third-party model container. The chosen model carries out the prediction task and returns the encrypted prediction result to the user. Thirdly, the processing center checks whether the user is satisfied with the result; if so, the workflow ends, otherwise a warning message is sent to the federated module. The federated server performs online federated training with K clients and saves the optimized model as a self-built model in the corresponding model container. Finally, the task scheduler returns the optimized result to the user.

Robustness
In this subsection, we analyze the robustness of the federated learning system. When a user initiates a medical service request to the task scheduler, the processing center adopts AES encryption to enhance request security and ensure model privacy, as the user cannot directly access the model container. If there are too many requests and the queuing time is too long, the task scheduler carries out the weighted round robin (WRR) scheduling algorithm [47] for load balancing. Furthermore, the distributed computing framework Spark [48] is also used for computing load balancing. When a user needs to query models in two different model containers, Spark can bring significant efficiency improvements. The model container returns only encrypted prediction results, which prevents the original information from being replaced with malicious links or malicious code. In general, the system is developed with high scalability; it can adapt to an increasing number of users, knowledge fusion model types, and third parties participating in online federated learning training.
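To make the WRR scheduling step concrete, the smooth variant of weighted round robin can be sketched as follows. This is an illustrative implementation under stated assumptions, not the scheduler code used in the system; the function name and the `(name, weight)` input format are introduced here for illustration only.

```python
def weighted_round_robin(servers):
    """Smooth weighted round robin: given (name, weight) pairs, produce one
    scheduling cycle in which each server appears in proportion to its weight,
    with occurrences spread out rather than bunched together."""
    order = []
    counters = {name: 0 for name, _ in servers}
    total = sum(w for _, w in servers)
    for _ in range(total):
        # each step: raise every counter by its weight, pick the largest,
        # then penalize the winner by the total weight
        for name, w in servers:
            counters[name] += w
        best = max(counters, key=counters.get)
        counters[best] -= total
        order.append(best)
    return order
```

For example, `weighted_round_robin([("a", 3), ("b", 1)])` yields a cycle of length 4 containing "a" three times and "b" once, interleaved so that "b" is not starved.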

Experiments
In this section, we conduct three kinds of confirmatory experiments (centralized, federated and independent training) to verify the feasibility and efficiency of the proposed federated learning framework for breast cancer histopathological image classification.
In BreakHis, benign and malignant tumors are each further categorized into four subtypes depending on the appearance of the tumor cells. The four categories of benign tumors are adenosis (A), fibroadenoma (F), phyllodes tumor (PT), and tubular adenoma (TA). Ductal carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC), and papillary carcinoma (PC) are the four corresponding classes of malignant tumors. Figure 4 shows an example slide of a breast malignant tumor at the four magnification factors 40×, 100×, 200× and 400×, and Table 1 details the magnification factors and histological subtypes of tumors with the number of images and patients contained in the BreakHis dataset. The BreakHis dataset is partitioned into training and test sets at a ratio of 7:3 with essentially the same benign and malignant proportions, as shown in Table 2. Moreover, we never use images of the same patient for both training and testing. In the experiments, the training set is segmented into K = 11 parts, i.e., eleven virtual clients in the experimental environment with similar amounts of data and distributions of tumor types (benign and malignant). Each client includes four to six patients. It is worth noting that the data partition follows a non-IID setting.
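The patient-disjoint partitioning described above (no patient's images split across clients) can be sketched as follows. The helper name `split_by_patient` and the `(patient_id, image)` record format are hypothetical, introduced for illustration; the paper does not publish its partitioning code.

```python
import random

def split_by_patient(records, k, seed=0):
    """Partition (patient_id, image) records into k client shards such that
    all images of a given patient land in exactly one shard (hypothetical
    helper mirroring the paper's patient-disjoint, non-IID setting)."""
    patients = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    # round-robin assignment of whole patients to clients
    assignment = {pid: i % k for i, pid in enumerate(patients)}
    shards = [[] for _ in range(k)]
    for pid, img in records:
        shards[assignment[pid]].append((pid, img))
    return shards
```

Because whole patients are assigned rather than individual images, the shards naturally differ in size and class mix, reproducing the non-IID character of the experimental setup.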

ResNet-152
This model introduces a deep residual learning framework to address the degradation problem, which lets the layers fit a residual mapping F(x).For the desired underlying mapping H(x), the residual mapping for the stacked nonlinear layers is set to be F(x) := H(x) − x, and the original mapping is recast into F(x) + x.We make use of the residual nets with a depth of up to 152 layers.
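The residual formulation F(x) := H(x) − x with output F(x) + x can be illustrated with a toy NumPy block. This is a minimal sketch of the skip-connection idea, not the actual ResNet-152 implementation; the function and its weight format are illustrative assumptions.

```python
import numpy as np

def residual_block(x, weights, activation=np.tanh):
    """Toy residual unit: the stacked layers learn the residual F(x),
    and the identity shortcut adds x back, so the block outputs F(x) + x."""
    f = x
    for w in weights:
        f = activation(f @ w)  # stacked nonlinear layers computing F(x)
    return f + x               # skip connection restores the identity term
```

A useful property falls out immediately: if the stacked layers output zero (e.g., zero weights), the block reduces to the identity mapping, which is what makes very deep nets like ResNet-152 trainable.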

DenseNet-201
This model proposes an architecture called the Dense Convolutional Network (DenseNet), which distills a simple connectivity pattern. On the one hand, to ensure maximum information flow between layers, DenseNet connects all layers directly with each other. On the other hand, to preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes its own feature maps on to all subsequent layers. Crucially, DenseNet combines features by concatenating them, which introduces L(L+1)/2 connections in an L-layer network. We apply DenseNet with 201 layers.
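The concatenation-based connectivity can be sketched in NumPy: each layer consumes the concatenation of all earlier feature maps, and its output is appended to that running state. This is a toy dense-layer analogue for illustration, not the DenseNet-201 implementation.

```python
import numpy as np

def dense_block(x, layer_weights, activation=np.tanh):
    """Toy dense block: layer i sees the concatenation of the input and the
    outputs of layers 0..i-1, and contributes its own features to the pool."""
    features = [x]
    for w in layer_weights:
        inp = np.concatenate(features, axis=-1)  # all preceding feature maps
        features.append(activation(inp @ w))     # this layer's new features
    return np.concatenate(features, axis=-1)     # everything flows onward
```

With an input of 4 channels and two layers each producing 3 channels (weight shapes (4, 3) and (7, 3)), the output has 4 + 2 × 3 = 10 channels, showing how the channel count grows by the "growth rate" at every layer.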

MobileNet-v2-100
This model is an improved version of MobileNet-v1 and contains a novel layer module: the inverted residual with a linear bottleneck. On the one hand, this module takes a low-dimensional compressed representation as input, which is first expanded to a high dimension and filtered with a lightweight depthwise convolution. On the other hand, features are subsequently projected back to a low-dimensional representation with a linear convolution. We use MobileNet-v2 with a depth multiplier of 1.0.
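The expand → filter → linearly-project pattern can be sketched as follows. Note the simplifications: the depthwise convolution is replaced by a per-channel scaling as a stand-in, and dense matrices stand in for 1×1 convolutions, so this is an illustrative shape-level sketch rather than the MobileNet-v2 module itself.

```python
import numpy as np

def inverted_residual(x, expand_w, depthwise_w, project_w):
    """Toy inverted residual: expand a narrow input to a wide representation,
    filter it channel-wise (stand-in for depthwise conv), then project back
    down through a *linear* bottleneck; the skip joins the two narrow ends."""
    h = np.maximum(x @ expand_w, 0.0)    # 1x1 expansion + ReLU-style nonlinearity
    h = np.maximum(h * depthwise_w, 0.0) # per-channel filtering (depthwise analogue)
    out = h @ project_w                  # linear projection, no activation
    return out + x if out.shape == x.shape else out
```

With a 4-channel input, an expansion to 24 channels (factor 6, as in MobileNet-v2), and a projection back to 4 channels, the residual connection links the two low-dimensional ends, which is the "inverted" part: the skip bypasses the wide interior rather than a narrow one.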

EfficientNet-b7
To find a principled method to scale up Convolutional Neural Networks (CNNs) that can achieve better accuracy and efficiency, this model proposes a simple yet effective compound scaling method, which uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients.Taking EfficientNet-b0 as a baseline, we scale up the baseline network with different compound coefficients to obtain EfficientNet-b1 to b7.
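The compound scaling rule can be written down directly: depth, width and resolution are scaled jointly as α^φ, β^φ and γ^φ for a compound coefficient φ. The default coefficients below (α = 1.2, β = 1.1, γ = 1.15) are those reported for the EfficientNet family's grid search, subject to the constraint α·β²·γ² ≈ 2 so that each increment of φ roughly doubles the FLOPs; treat them as illustrative of the method rather than as this paper's settings.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet-style compound scaling: return the multipliers for
    network depth, width and input resolution at compound coefficient phi."""
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    return depth, width, resolution
```

At φ = 0 all multipliers are 1 (the b0 baseline); increasing φ to 7 yields the b7 variant used in the experiments.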
All of these models have been pre-trained on ImageNet-1k with an input image size of 224 × 224. Generally, the performances of the four state-of-the-art models are in the ascending order of ResNet-152, DenseNet-201, MobileNet-v2-100 and EfficientNet-b7.

Metrics
Referring to [5], we use five evaluation metrics on the test dataset of BC histopathology images in this work: ACC_IL (test accuracy at image level), ACC_PL (test accuracy at patient level), F1 (F1 measure), DOR (diagnostic odds ratio) and Kappa (Kappa criterion).

ACC_IL
ACC_IL (Equation (1)) is the ratio of N_rec (the number of BC histopathology images correctly identified) to N_all (the total number of BC histopathology images).

ACC_PL
Patient score (Equation (2)) is the ratio of N_rec (correctly identified BC histopathology images of patient P) to N_P (all the BC histopathology images of patient P), and ACC_PL (Equation (3)) is the ratio of the sum of patient scores to the total number of patients.
Precision (Equation (4)) is the number of correct benign BC histopathology images divided by the number of all benign BC histopathology images returned by the classifier. Recall (Equation (5)) is the number of correct benign BC histopathology images divided by the number of all samples that should have been identified as benign. F1 (Equation (6)) is the harmonic mean of precision and recall. DOR is the ratio of the product of TP and TN to the product of FP and FN, which reflects the degree of correlation between the diagnostic prediction and the ground truth. When the value is greater than 1, the diagnostic prediction is reliable; when the value is less than 1, benign patients are more likely to be diagnosed as malignant; when the value equals 1, the diagnosis cannot distinguish between benign and malignant patients.
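The metric definitions above can be collected into one helper. `diagnosis_metrics` is a hypothetical name introduced for illustration; it assumes label 1 denotes benign (the positive class in the precision/recall definitions above) and, following the convention described later for Table 6, sets DOR to 0 when FP or FN is 0.

```python
def diagnosis_metrics(y_true, y_pred, patient_ids):
    """Compute ACC_IL, ACC_PL, precision, recall, F1 and DOR per the paper's
    definitions. Assumes binary labels with 1 = benign (positive class)."""
    n = len(y_true)
    acc_il = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # ACC_PL: mean over patients of their per-patient image accuracy
    per_patient = {}
    for t, p, pid in zip(y_true, y_pred, patient_ids):
        rec, tot = per_patient.get(pid, (0, 0))
        per_patient[pid] = (rec + (t == p), tot + 1)
    acc_pl = sum(r / t for r, t in per_patient.values()) / len(per_patient)
    # confusion-matrix counts with benign (1) as positive
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    dor = (tp * tn) / (fp * fn) if fp and fn else 0.0  # convention: DOR = 0 if FP or FN is 0
    return acc_il, acc_pl, precision, recall, f1, dor
```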

Kappa
Kappa is calculated as Equation (10) describes, where p0 (Equation (8)) is equal to ACC_IL defined in Equation (1), and pe is the ratio between the sum, over the benign and malignant categories, of the number of real images in a category multiplied by the predicted number of images in that category, and the square of the total number of samples. Kappa is used for consistency checking, and its value lies in the range [−1, 1], which can be divided into six groups representing the following consistency levels: [−1, 0) indicating no agreement, [0, 0.20] slight, [0.21, 0.40] fair, [0.41, 0.60] moderate, [0.61, 0.80] substantial and [0.81, 1] almost perfect agreement.
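Following the p0 and pe definitions above, Cohen's kappa for the binary benign/malignant case can be sketched as follows (an illustrative helper; the function name is introduced here for illustration):

```python
def cohen_kappa(y_true, y_pred):
    """Kappa = (p0 - pe) / (1 - pe), where p0 is the observed image-level
    agreement (ACC_IL) and pe is the chance agreement from class marginals."""
    n = len(y_true)
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    pe = 0.0
    for c in set(y_true) | set(y_pred):
        # (real count of class c) * (predicted count of class c) / n^2
        pe += (y_true.count(c) * y_pred.count(c)) / (n * n)
    return (p0 - pe) / (1 - pe)
```

Perfect agreement gives kappa = 1, while predictions that match the marginals only by chance give kappa = 0, which is why kappa is a stricter consistency check than raw accuracy.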

Implementation
For the federated case, the federated averaging algorithm [46] is adopted, as described in Algorithm 1. Regarding hyper-parameters, we use the following settings in the training phases. Each federated client trains over its local data with a mini-batch size of b_c = 32 and a learning rate of lr = 0.001, executing E_c = 5 epochs each round. The federated server then receives locally-calculated gradients from the K = 11 federated clients each round and computes their weighted average for a single global update. The global update takes E_s = 20 rounds, leading to a total of 100 epochs of training.
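The server-side aggregation step of federated averaging [46] reduces to a weighted mean of client parameters, with each client weighted by its local training-set size. The sketch below shows only that plaintext aggregation step; the CKKS encryption used in the actual system is omitted, and the function name is illustrative.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """One FedAvg aggregation: average client parameter vectors weighted by
    n_k / sum(n), where n_k is client k's number of local training samples."""
    total = sum(client_sizes)
    agg = np.zeros_like(client_weights[0], dtype=float)
    for w, n in zip(client_weights, client_sizes):
        agg += (n / total) * np.asarray(w, dtype=float)
    return agg
```

For instance, a client holding 3 samples with parameters [1, 1] and a client holding 1 sample with parameters [5, 5] aggregate to [2, 2], reflecting the 3:1 data proportion.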
For fairness, both centralized training and independent training are conducted with identical hyper-parameter settings. In the centralized case, models are trained directly on the overall training data for 100 epochs. In the independent experiment, the training processes of the eleven clients are confined to their local data, each for 100 epochs.
During the training process, we only update the parameters of the last classification layer and freeze the parameters of the previous layers in order to accelerate training.

Results
As shown in Tables 3-5, the average ACC_IL, ACC_PL, and F1 differences of the four models between centralized learning and federated learning on the whole dataset are −0.35%, −0.44% and 0.88%, respectively. The corresponding differences between federated learning and independent learning are 13.70%, 12.13%, and 26.27%, respectively. Furthermore, whether on the whole dataset or on each magnification subset (40×, 100×, 200×, 400×), the federated learning results are competitive with the centralized learning results and are much higher than those of independent learning. The experimental results are consistent with the theoretical analyses, showing that the federated learning method indeed brings significant improvements to all the independent clients. Regarding the performances of the four state-of-the-art models, ResNet-152 [49] achieved the best results on the whole dataset, with federated learning ACC_IL, ACC_PL, and F1 scores 3.68%, 5.7%, and 7.59% higher than those of the second-place MobileNet-v2-100 [51]. The results of EfficientNet-b7 [52] are not ideal; we believe the number of training rounds is not large enough, so the model has not fully converged. For the magnification subsets, most of the experimental results at 100× and 200× magnification are better than those at 40× and 400×; we believe these two magnifications maintain the balance between image information and precision.
It is worth mentioning that some federated learning results exceed the centralized learning results. The reason is that, under our experimental settings, federated learning aggregates updates every five epochs, while centralized learning updates the gradients every training round, which may cause some deviations in the experimental results.
As for the reliability and consistency of the four state-of-the-art models, Table 6 shows that, in the federated learning experiment on the overall dataset, the DOR of ResNet-152 is close to 100 and the DOR of DenseNet-201 exceeds 150, which indicates that the diagnostic results of these two models are very reliable. At the same time, the DOR of MobileNet-v2-100 is close to 40 and the DOR of EfficientNet-b7 exceeds 20, indicating that the experimental results of all four models are convincing. It is worth mentioning that ResNet-152 has a DOR of 0 on the 100× magnification subset; this is because the calculation sets the corresponding DOR to 0 when the value of FP or FN is 0, which here reflects the strength of the experimental results rather than a failure.
As illustrated in Table 7, in the federated learning experiment on the overall dataset, the Kappa criteria of DenseNet-201, MobileNet-v2-100, and EfficientNet-b7 all fall in [0.61, 0.80], indicating substantial agreement. Meanwhile, the Kappa of ResNet-152 falls in [0.41, 0.60], a moderate agreement, indicating that the experimental results of the four models have high agreement as a whole. It is noteworthy that DenseNet-201 has a Kappa of more than 0.80 on the 200× magnification subset, which is almost perfect agreement.

Conclusions
In this paper, we propose a federated learning framework for efficient medical image diagnosis, which conducts knowledge fusion by aggregating model parameters under data privacy requirements. In the system, the task scheduler plays a role in load balancing for multi-user access. Computing efficiency benefits from the distributed computing framework. At the heart of the federated training mechanism, the encryption algorithms ensure the privacy of requests and results. Moreover, the easy extensibility of the model container makes the framework applicable beyond the medical field.
We also conduct breast cancer histopathological image classification experiments based on this framework. In terms of ACC_IL, ACC_PL and the F1 measure, the four state-of-the-art models achieve federated learning results similar to the centralized learning results, indicating the feasibility and efficiency of the federated learning framework. In addition, the DOR and Kappa performances of the four models under federated learning reflect the reliability and consistency of the experimental results.
In future work, the trained network can be further tested with larger and balanced datasets from non-identically and independently distributed data sources. Since the problem of data imbalance is prevalent in the medical field, approaches to deal with data imbalance should also be investigated. Furthermore, we plan to improve the operating efficiency of the homomorphic encryption algorithms and measure the performance (including susceptibility to security attacks) of the entire federated learning framework in practice.

Figure 1 .
Figure 1. The federated learning framework for breast cancer histopathological image classification.

Figure 2 .
Figure 2. The task scheduler and model container sub-modules of the federated server module.

Figure 3 .
Figure 3. The federated learning sub-module and federated client module of the federated learning framework.

Figure 4 .
Figure 4. A slide of a breast malignant tumor at different magnification factors. The pathologist selects the key areas to be seen at the next higher magnification. For illustrative purposes, we manually add the highlighted rectangles.

Algorithm 1 :
Federated Averaging. To start with, the kth federated client generates a public key pk_k for encryption and a private key sk_k for decryption based on the security parameter λ_k, and sends the number of local training samples n_k to the federated server. Next, the kth federated client trains E_c epochs through the selected model separately to update the corresponding model parameters w_k^c, and sends [[w_k^c]], encrypted by CKKS with pk_k, to the federated server. The federated server then integrates the [[w_k^c]] of the K federated clients by weighted average to obtain the integrated parameters [[w^s]], where the weight of each federated client equals its proportion of the local training data, and returns them to the K federated clients. Finally, the kth federated client receives [[w^s]] from the federated server, decrypts it with sk_k, and updates w_k^c. The above steps are repeated E_s times to complete the federated training.

Table 1 .
Benign and malignant image distribution by magnification factors and histological subtypes.

Table 2 .
The partitions of training and test dataset.

Table 6 .
DOR of four models validated on the BreakHis dataset using centralized/federated/independent training.