1. Introduction
Over the last decade, the rapid progression of machine learning technologies has propelled a wave of artificial intelligence applications, encompassing fields such as computer vision, anomaly detection, fault diagnosis, and natural language processing, among others. The rise of machine learning can be largely attributed to two key factors: the accessibility of vast volumes of data and significant advancements in computational techniques and resources.
Nevertheless, the availability of extensive data, metaphorically a “double-edged sword” [1], poses significant risks of personal information leakage when customer, industrial, or public data are not properly managed and used. In response, stringent regulations such as the European Union’s General Data Protection Regulation (GDPR) [2] and the United States’ California Consumer Privacy Act (CCPA) [3] have been implemented to enhance the protection of personal data and privacy by regulating corporate behaviour [4].
With the increasing focus on data privacy, ownership, and confidentiality in contemporary society, there is a growing apprehension that personal information could be exploited for commercial or political purposes without the individual’s consent. This concern has catalyzed the emergence of a novel era in machine learning, characterized by approaches specifically designed to safeguard user data privacy. An example of such an approach is Federated Learning (FL), a technique introduced by McMahan et al. [5].
Federated learning serves as a privacy-focused alternative to machine learning approaches that require central data collection, allowing models to be trained directly at the data storage site of each user. This approach eliminates the need for raw data transmission, as only the locally trained model parameters are exchanged to develop and refine a more effective global model.
Federated learning systems, depending on the communication scheme between components, can be implemented in a centralized (client-server) or decentralized (peer-to-peer) fashion [6,7]. In a centralized scheme, the central server primarily orchestrates the training process and sets up the communication infrastructure among users. However, its pivotal role also introduces a potential vulnerability, rendering it a single point of failure within the system. Conversely, in a decentralized scheme, all clients can autonomously coordinate to acquire the global model, facilitating model updates and aggregations via peer-to-peer client interactions.
Although decentralized federated learning methods, such as those based on blockchain [8,9,10,11,12,13], can mitigate the challenges of centralized federated learning by eliminating the central server, they introduce their own challenges [14,15]. These include performance degradation as well as increased computational and storage costs. Consequently, this study focuses on federated learning systems that employ a centralized communication scheme and tackles their specific challenges.
In a centralized federated learning system, a central server might become a vulnerability, acting as a single point of failure due to physical damage, server node failure, or network disruptions. This can potentially interrupt the federated learning process. Although large organizations may handle such server roles in some scenarios, collaborative learning often faces constraints regarding the availability and reliability of a robust central server [16]. The server may also become a bottleneck when serving numerous clients, as highlighted by Lian et al. [17].
Hence, when conceptualizing a federated learning system based on a centralized design, it is essential to adopt a design pattern that is both fault-tolerant and performant. Although there are numerous platforms and frameworks for federated learning, challenges related to performance, scalability, and fault tolerance remain. While these platforms often emphasize user scalability and fault management at the user-end, they frequently neglect the vital aspect of server-side fault management and system scalability as the user base expands. Ideally, a federated learning system should inherently possess scalability and fault tolerance.
Scalability pertains to the capacity of the system to include additional devices in the federated learning process. More devices can improve the accuracy and speed up the convergence of the federated learning process [18,19]. Techniques such as resource optimization, prioritizing devices with high computational power, and implementing compression schemes for model parameter transfer can aid in scalability. Effective resource optimization allows more devices to participate in the federated learning process, thereby enhancing performance. However, increasing device participation requires an expansion of server-side computational resources.
Fault tolerance in the context of federated learning indicates the system’s capability to manage the federated learning process effectively, even when the server fails. Traditional federated learning, which relies on a centralized cloud server for global aggregation, can be disrupted if the aggregation server malfunctions [20]. Current concerns about fault tolerance in federated learning systems often revolve around adversarial or Byzantine attacks targeting the central server [21,22,23,24] and data-related faults such as handling missing data [25]. While numerous studies have investigated system performance, research into the effects of server faults, such as physical damage, on the performance of federated learning systems is relatively scant.
To achieve these properties for a performant federated learning system, this paper presents Micro-FL: a microservice-based federated learning platform designed to handle an expanding user base, guarantee high availability, and offer fault tolerance. Suitable for deployment on-site or in the cloud, it leverages the flexibility, modularity, scalability, and reliability that microservices provide. Micro-FL streamlines the testing, deployment, and maintenance of federated learning algorithms, while allowing dynamic resource allocation based on workload or user count, thus improving resource efficiency.
The principal contributions of this research paper are as follows:
Introduction of a Microservice-based Federated Learning (Micro-FL) Platform: This research paper proposes a novel Micro-FL platform that leverages a microservices architecture to address the fault tolerance and scalability challenges of federated learning systems. This approach differs significantly from traditional centralized federated learning frameworks by decomposing their monolithic architecture into smaller, manageable microservices. This decomposition enhances scalability, fortifies fault tolerance, and facilitates efficient resource management for the federated learning server.
Emphasis on Server-side Fault Management: Unlike other federated learning frameworks that mainly concentrate on user-end scalability and fault management, the current study emphasizes server-side challenges. This paper presents solutions for managing faults in the communication system of the federated learning system, which is crucial for maintaining the integrity of the federated learning process.
Scalability and Dynamic Resource Allocation: The Micro-FL platform facilitates dynamic resource allocation, which allows for the more efficient management of computational resources. This feature is crucial for accommodating a growing number of federated learning clients without compromising performance or reliability.
These contributions underscore the innovation and significance of the proposed Micro-FL platform in enhancing the resilience, efficiency, and scalability of federated learning systems.
3. Related Work
Addressing the single point of failure and improving fault tolerance in centralized federated learning has garnered substantial research attention. One approach is the implementation of a decentralized federated learning design, which eliminates the central server from the federated learning system, thus averting any single point of failure. This is made possible through the use of different blockchain technologies [8,9,38], such as proof-of-work [11], proof-of-authority [13], and proof-of-contribution [14], in conjunction with smart contracts [10,39]. Such a setup enables model updates and aggregations via direct client-to-client interactions [12], enhancing the overall robustness and fault tolerance of the system.
However, blockchain-integrated federated learning systems face several challenges [14,15], including: (1) performance issues due to a limited number of transactions, which can result in high latency; (2) the high computational cost of aggregation processes due to typically limited resources on client devices; (3) increased storage demands, as machine learning models must be stored on all client devices, leading to considerable strain on storage resources; and (4) potential data privacy risks, as all models are accessible to client devices.
Given these challenges with decentralized federated learning, our research proposes an alternative approach to improve the fault tolerance of centralized federated learning. This approach involves implementing a microservice-based design pattern for federated learning.
The following section examines previous research relevant to the area of microservice-based platforms for federated learning. In particular, it delves into existing tools that use a centralized server design for federated learning and their unique features.
TensorFlow Federated (TFF) [40] is an open-source framework for machine learning on decentralized data. Its interfaces include the high-level Federated Learning API for federated learning training with pre-existing TensorFlow models and the lower-level Federated Core API for developing new federated learning algorithms. While it supports various aggregation functions, it currently does not allow the use of GPUs for ML model training and only supports simulation mode. The framework is still under development, and its current limitations suggest that it might not be suitable for all use cases.
Federated AI Technology Enabler (FATE) [41] is an open-source project by Webank’s AI Department, designed to provide a secure computational framework for a federated AI ecosystem. FATE offers a suite of features such as federated statistics, feature engineering capabilities, machine learning algorithm support, and secure protocols. It can be deployed in simulation or federated modes, with installation streamlined via Docker containers. However, its high resource requirements, including 6 GB RAM and 100 GB disk space on both the client and server side, may render it impractical for real-world federated learning scenarios.
Paddle Federated Learning (PFL) [42] is an open-source platform that supports both horizontally and vertically partitioned data, and can handle neural networks and linear regression models. It leverages techniques like Federated Averaging, Secure Aggregation, and Differentially Private Stochastic Gradient Descent for model construction. Communication in PFL is managed using the ZeroMQ protocol, and it supports both simulation and federated modes, making it adaptable for various deployment scenarios.
PySyft [43] is an open-source project focusing on secure, private deep learning. It comprises components like PyGrid for connecting data owners and data scientists in a peer-to-peer network, KotlinSyft for training PySyft models on Android, SwiftSyft for iOS, and Syft.js for web interfacing. These elements collectively enable the secure, collaborative training of models using PySyft.
The Federated Learning and Differential Privacy (FL&DP) Framework [44] is an open-source framework that uses TensorFlow for deep learning tasks and the SciKit-Learn library for linear models and clustering. It offers various aggregation algorithms and uses adaptive Differential Privacy and randomized response coins to enhance data privacy protection during the learning process.
LEAF [45] is an open-source benchmark tailored for federated learning settings. It provides open-source datasets suitable for federated learning, metrics for evaluating federated learning algorithms, and a repository of standard methods such as minibatch Stochastic Gradient Descent and Federated Averaging. Serving as a valuable resource, LEAF aids in benchmarking and comparing federated learning algorithms.
Flower [46] is an open-source framework that supports large-cohort training on edge devices and compute clusters. It offers aggregation methods like SecAgg [47] and SecAgg+ [48], and supports both simulation and federated modes of operation. Notably, Flower is language- and machine learning framework-agnostic, ensuring broad compatibility.
Serverless federated learning (FedLess) [49] is a system designed for federated learning on diverse Function-as-a-Service (FaaS) platforms, supporting major commercial FaaS platforms such as AWS Lambda, Google Cloud Functions, Azure Functions, and IBM Cloud Functions. Implemented in Python 3, FedLess provides a command-line tool for orchestrating the training process and supports TensorFlow and Keras for deep learning models. Its default federated learning strategy is the FedAvg algorithm, commonly used for the aggregation of model updates. Other research projects, such as [50], have also adopted a serverless design for federated learning.
FedML [51] is an open-source research library and benchmark platform designed to aid the development of federated learning algorithms and provide objective performance comparisons. It supports on-device training, distributed computing, and single-machine simulation. FedML offers resources such as algorithmic implementations, benchmarks with evaluation metrics, access to real-world datasets, and validated baseline results. It is organized into FedML-API for high-level APIs and FedML-core for low-level APIs, using the Message Passing Interface for system communication. FedML supports various federated learning algorithms, including FedAvg, Decentralized FL, Vertical Federated Learning, and Split Learning.
Numerous other tools are designed to address scalability issues in federated learning systems, mainly with regard to scaling the number of clients [52,53,54,55,56]. Despite providing several beneficial features, these platforms fail to implement an efficient central server design that is scalable and fault tolerant. Micro-FL is introduced to mitigate these deficiencies. Based on the principles of microservices system architecture, Micro-FL is designed to enhance the scalability and robustness of centralized federated learning systems, effectively addressing these significant gaps in contemporary solutions.
4. Micro-FL
Figure 2 contrasts the building blocks of a commonly used federated learning server design (monolithic architecture) with the microservices-based design proposed in this study. In a monolithic federated learning server design, all components (e.g., user interface and communication services) are encapsulated into a single process, using a single database. A fault in any of these components can completely halt the federated learning process. This issue, known as a ‘single point of failure’ within the server, could cause substantial downtime and undermine system reliability. More importantly, scaling monolithic applications requires scaling the whole application, necessitating a considerable increase in resource requirements.
Conversely, when employing a microservices architecture, these components are decoupled from each other (i.e., isolated), each operating as an individual scalable process (microservice) with its own dedicated database. Furthermore, a communication mechanism is implemented to allow these services to interact with each other. As these microservices are horizontally scalable, multiple instances of each microservice can be created to enhance the fault tolerance of the federated learning system. The proposed architectural framework is called Microservice-based Federated Learning (Micro-FL) (https://github.com/asgaardlab/Micro-FL, accessed on 21 February 2024).
Micro-FL accommodates both Linear and Deep Neural Network (DNN) models, leveraging TensorFlow and Keras. Additionally, as a microservice-based application, Micro-FL possesses several notable attributes:
Micro-FL’s distinct modules handle specific functions, contributing to a compact codebase and easier debugging, while also enabling incremental upgrades. These upgrades allow for the coexistence of old and new versions for compatibility testing, and changes in a module do not require a system-wide reset, thus reducing the re-deployment cycle.
The framework enhances fault tolerance capabilities with Kubernetes; even with a communication microservice failure, the federated learning process continues without interruption.
Using containerization, Micro-FL allows for an extensive customization of the deployment environment and facilitates the scaling of individual microservices without impacting the whole application. This functionality supports the easy deployment or retraction of services based on demand and accommodates both horizontal and vertical scaling.
4.1. Micro-FL Workflow
This subsection provides an overview of the workflow of the Micro-FL system. The process consists of a series of steps that ensure the smooth execution of federated learning tasks, explained in detail below:
➊ Clients initiate a registration process with Micro-FL via a dedicated web application interface. ➋ Once enough registered clients are prepared to contribute, the Micro-FL administrator notifies them, signaling the start of a new training iteration. ➌ The Aggregator service actively monitors connected clients and their statuses. When a certain number of participating clients is reached, it initializes a model, which is subsequently disseminated to all clients. ➍ Clients continuously listen for updates from the Aggregator service. Upon receiving the model, they start training on their local datasets. ➎ After training, the clients transmit their model parameters to the server through the communication service. ➏ All messages submitted by clients are transmitted securely through the communication service and are logged in the database. ➐ The Aggregator microservice continuously monitors client messages during each iteration. When the number of messages equals the total number of clients, the aggregator synthesizes a new global model from the individual client models. ➑ The Aggregator service dispatches a fresh message to the clients, and the cycle from steps ➌ to ➑ repeats until a specified number of iterations is completed or a pre-defined model performance metric is achieved.
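The training loop above (steps ➌ to ➑) can be sketched as a simple simulation, with an in-memory list standing in for the communication service and database; all names here are illustrative, not Micro-FL's actual API:

```python
import numpy as np

def run_round(global_model, clients, local_train):
    """One iteration: broadcast the model, let every client train locally,
    wait until all participants have reported, then aggregate."""
    updates = [local_train(global_model, data) for data in clients]
    assert len(updates) == len(clients)   # aggregate only once all clients report
    return np.mean(updates, axis=0)       # simple unweighted average

# toy local training step: nudge the scalar model toward the local data mean
local_train = lambda model, data: model + 0.1 * (float(np.mean(data)) - model)
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]

model = 0.0
for _ in range(5):                        # repeat until the iteration budget is spent
    model = run_round(model, clients, local_train)
```

Each round moves the global model a fixed fraction toward the average of the clients' local targets, mimicking the convergence behaviour of the real iterative process.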
4.2. Framework Design
A minimalistic implementation of the proposed Micro-FL architectural design is presented in
Figure 3. All services operate as Docker containers and are orchestrated using Kubernetes. Additionally, load balancing is used to distribute client requests between the user interface and communication microservices. Building on the previous section, this part discusses the essential components, services, and applications used to actualize Micro-FL. These components are selected based on their performance, scalability with Kubernetes, and open-source nature. Each microservice is briefly explained, as follows:
User Interface. The web application is developed using the Flask library for Python and Nginx as the web server. A federated learning client can register and authenticate using this web interface, and Google Cloud Load Balancing is used to balance the workload of the web application. Both the web application and the server are deployed with three replicas to improve their performance and reliability.
Database. The database manages data access and caching for the federated learning system. Elasticsearch (ES), a NoSQL database, retains federated learning models, accuracy metrics, and client information. The proposed Micro-FL platform employs Elasticsearch due to its scalable and fault-tolerant design, which incorporates index replication and sharding. ES uses REST APIs for data storage and search, with a document-based structure in place of tables and schemas. Additionally, Kibana is used to visualize the federated learning process.
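Since Elasticsearch stores schema-free JSON documents rather than rows in tables, a per-iteration record might look like the following sketch (the field names are hypothetical, not Micro-FL's actual schema):

```python
import json

# hypothetical document recording one client's update in one iteration
update_doc = {
    "client_id": "client-42",
    "iteration": 7,
    "accuracy": 0.913,
    "weights": [[0.12, -0.08], [0.55]],  # serialized layer parameters
}

# Elasticsearch accepts such documents as JSON bodies via its REST API
payload = json.dumps(update_doc)
```

Because each document carries its own structure, new fields (e.g., per-client loss) can be added later without a schema migration.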
Communication. This microservice enables data exchange across various applications, services, and systems, which is essential for effective communication between microservices and clients. Apache Kafka is used as a message broker, renowned for its scalability, fault tolerance, and capacity to handle trillions of daily messages with minimal latency. Within Kafka, partitions serve as the foundational units of parallelism and scalability: they segment a Kafka topic into multiple smaller, immutable, ordered sequences of records, each hosted on a distinct Kafka broker within a Kafka cluster. Multiple partitions in a topic enable parallel processing and enhance scalability, as producers can write to various partitions simultaneously and consumers can consume data from numerous partitions concurrently. This design promotes high throughput and fault tolerance. Kafka also employs replication to ensure fault tolerance and high availability: each partition is replicated across multiple brokers for redundancy, and the replication factor determines how many copies of each partition are maintained in the cluster. Strimzi Kafka [57] is used for Kafka broker deployment on Kubernetes, and the Apache Camel Elasticsearch sink connector (Kafka Connect) is used for message transfer to the Elasticsearch database.
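For intuition, keyed messages are assigned to partitions by hashing the message key modulo the partition count, which the following simplified sketch illustrates (Kafka's default partitioner actually uses murmur2 hashing; CRC32 is used here only for demonstration):

```python
import zlib

def assign_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition. Messages with the same key always
    land in the same partition, so per-client ordering is preserved while
    different clients' updates can be consumed in parallel."""
    return zlib.crc32(key) % num_partitions

# e.g., each federated learning client's updates stay in one partition
partitions = {
    f"client-{i}": assign_partition(f"client-{i}".encode(), 6) for i in range(4)
}
```

Keying messages by client ID in this way is one natural choice; it guarantees that a client's sequence of model updates is read back in the order it was produced.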
Aggregator. The Aggregator microservice is responsible for aggregating client updates. It retrieves model updates from the database and, after all selected users have reported their local model updates, creates a new global model for the next federated learning iteration. Although numerous methods and structures can be used for aggregation, the simple and commonly used FedAvg algorithm has been selected for testing. Since aggregation is a synchronous process that occurs at the end of the federated learning cycle, this microservice is not replicated; if it fails, the Kubernetes controller manager automatically restarts it, ensuring no impact on the training process.
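As a sketch of the FedAvg rule named above (illustrative only, not Micro-FL's actual implementation), the aggregator averages the clients' parameter sets, weighted by local dataset size:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: average client parameter sets weighted by local dataset size.
    Each element of client_weights is a list of per-layer arrays."""
    total = sum(client_sizes)
    return [
        sum(n / total * layers[i] for n, layers in zip(client_sizes, client_weights))
        for i in range(len(client_weights[0]))
    ]

# two clients with one layer each; the larger dataset gets the larger weight
w1 = [np.array([1.0, 1.0])]
w2 = [np.array([3.0, 3.0])]
new_global = fedavg([w1, w2], client_sizes=[100, 300])  # -> [array([2.5, 2.5])]
```

With 100 and 300 samples respectively, the two clients contribute with weights 0.25 and 0.75, so the global parameters land three quarters of the way toward the larger client's update.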