Design and Implementation of a Cost-Effective Failover Mechanism for Containerized UPF

Nguyen Trung, Kiem; Kim, Younghan

doi:10.3390/electronics14152991

Open AccessArticle

Design and Implementation of a Cost-Effective Failover Mechanism for Containerized UPF

by

Kiem Nguyen Trung

and

Younghan Kim

^*

School of Electronic Engineering, Soongsil University, Seoul 06978, Republic of Korea

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(15), 2991; https://doi.org/10.3390/electronics14152991

Submission received: 23 June 2025 / Revised: 17 July 2025 / Accepted: 25 July 2025 / Published: 27 July 2025

(This article belongs to the Special Issue Advances in Intelligent Systems and Networks, 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

Private 5G networks offer exclusive, secure wireless communication with full control deployments for many clients, such as enterprises and campuses. In these networks, edge computing plays a critical role by hosting both application services and the User Plane Functions (UPFs) as containerized workloads close to end devices, reducing latency and ensuring stringent Quality of Service (QoS). However, edge environments often face resource constraints and unpredictable failures such as network disruptions or hardware malfunctions, which can severely affect the reliability of the network. In addition, existing redundancy-based UPF resilience strategies, which maintain standby instances, incur substantial overheads and degrade resource efficiency and scalability for the applications. To address this issue, this study introduces a novel design that enables quick detection of UPF failures and two failover mechanisms to restore failed UPF instances either within the cluster hosting the failed UPF or across multiple clusters, depending on that cluster’s resource availability and health. We implemented and evaluated our proposed approach on a Kubernetes-based testbed, and the results demonstrate that our approach reduces UPF redeployment time by up to 37% compared to baseline methods and lowers system cost by up to 50% under high-reliability requirements compared to traditional redundancy-based failover methods. These findings demonstrate that our design can serve as a complementary solution alongside traditional resilience strategies, offering a particularly cost-effective and resource-efficient alternative for edge computing and other constrained environments.

Keywords:

private 5G networks; 5G core network; use-plane function; failover mechanism for user-plane function; edge computing; multi-cluster orchestration system; Kubernetes

1. Introduction

The fifth generation (5G) of mobile communication offers three primary categories of services: enhanced mobile broadband (eMBB), massive machine-type communications (mMTCs), and ultra-reliable low-latency communications (URLLCs). Mobile Network Operators (MNOs) typically deploy these services through public 5G networks, utilizing Quality-of-Service (QoS) and network-slicing mechanisms to differentiate and manage them. Given the diverse requirements of various use cases, such as secure communication and full control over network infrastructure, private 5G [1,2], also referred to as non-public networks by the 3rd Generation Partnership Project (3GPP), has emerged as a promising solution. It allows companies and institutions to independently customize, deploy, and manage their own infrastructure to meet specific demands. Despite these benefits, the high costs of building and operating such networks remain a major concern. Consequently, reducing deployment and operational expenses has attracted significant attention in recent studies [3,4,5].

In private 5G architectures, edge computing is a critical component. It enables the deployment of local User Plane Function (UPF) and applications closer to end devices, thereby reducing latency, enhancing network performance, improving data security, and facilitating real-time data processing [6]. Hence, the future is expected to see an increasing deployment of small edge computing sites to extend coverage, enhance network responsiveness, and improve the scalability and performance of edge-enabled applications [7]. Such an approach also enables flexible deployment of edge services across multiple sites, allowing requests to be dynamically routed to the most appropriate location based on demand and context. Despite its many advantages, edge computing poses several critical challenges. In this work, we focus on two key aspects of it in a private 5G network.

The first challenge lies in the resource constraints of edge computing environments, which often have a limited Central Processing Unit (CPU), memory, and storage capacity. To optimize cost and fully utilize available resources, edge applications and UPF are typically deployed on the same edge server using virtualization technologies [5], particularly containers. Compared to traditional virtual machines, containers not only reduce overhead and improve resource utilization but also enable faster application startup times, enhancing responsiveness and deployment flexibility. Additionally, using multiple edge computing sites is essential to increase overall resource availability; however, managing workloads across multiple distributed edge sites introduces complexity. Therefore, efficient workload orchestration mechanisms across multiple edge sites are required to manage deployment efficiently.

The second challenge comes from the inherently unreliable nature of edge computing environments [8,9], which are susceptible to factors such as power outages, hardware or software failures, and operating system updates. These issues can lead to service interruptions and failed request processing. To mitigate these risks, highly available and resilient mechanisms must be implemented for both applications and local UPFs. For the application services, a common approach is to deploy multiple replicas across different edge sites, allowing requests to be redirected to active instances elsewhere. For the UPF, redundancy methods [10,11], such as 1:1 or N:M configurations, are widely used in traditional core networks by launching standby UPF instances. This setup enables quick traffic redirection in the case of failure, minimizing service disruption. However, these redundancy strategies can be inefficient and waste resources, as backup UPF instances often remain idle and infrequently used, leading to resource underutilization and reduced scalability of co-located applications on the same edge server [12].

To address these challenges, in this paper, we propose a novel design and implementation of failover mechanisms for container-based UPF deployed across clusters: for example, distributed edge sites. Our approach significantly improves resource efficiency by eliminating the need for idle backup instances while ensuring high resilience in resource-constrained edge environments. Moreover, the solution enables seamless UPF failover across clusters, minimizing service disruption and supporting continuous operation. Our main contributions can be summarized as follows:

Failover Container-Based UPF Within a Cluster: A mechanism that enables the recovery of a containerized UPF either on a different node or on the original node within the cluster, depending on conditions such as resource availability.
Failover Container-Based UPF Across Multiple Clusters: A mechanism that supports the recovery of a containerized UPF across different clusters, even when network configurations and other environmental conditions differ.
We implement the proposed mechanisms on a testbed and conduct a comprehensive evaluation, comparing their performance with state-of-the-art approaches from the literature.

Further details are presented throughout this paper, which is organized as follows: Section 2 provides background information and reviews related work about the state-of-the-art of failover UPFs. Section 3 introduces our proposed design and architecture for enhancing the resilience of container-based UPF. Section 4 describes the testbed and implementation details. Section 5 presents the evaluation results and discussion, and Section 6 concludes this paper with key findings and future directions.

2. Background and Related Works

2.1. Private 5G Network

As its name suggests, a private 5G network is a dedicated internal wireless network that utilizes similar network components as public 5G networks but is deployed to operate independently or in partial coordination with public cellular infrastructure [13]. Typically implemented by enterprises or organizations, private 5G is designed to meet stringent service requirements in terms of reliability, low latency, dynamic resource reconfiguration, and rapid redeployment [2].

According to [1], in the 3GPP Release 16 standard, private 5G networks are generally classified into two main deployment models: standalone deployment and integrated public network deployment. In the standalone model, the private 5G network is fully isolated and operates independently of any public network, offering maximum control and customization. In contrast, in the integrated public network model, the private 5G system shares resources with the public network mainly through the network slicing mechanism [14], reducing deployment costs, but it may have less customization, control, and security. Regardless of the deployment approach, edge computing plays a pivotal role in private 5G, enabling ultra-low-latency and high-bandwidth connectivity between devices and applications [15,16].

By positioning computation resources closer to the end devices, such as in multi-access edge computing (MEC) sites near Radio Access Network (RAN) base stations or housing compounds, retail centers, or even at the edge of the mobile operator’s core network [17], data processing can occur with fewer network hops and also increased reliability. Each edge node typically includes a local UPF, which allows user devices to directly connect to applications running on edge servers, thereby enhancing performance and supporting critical real-time use cases.

2.2. Migrate a Running Application

Edge computing can bring many benefits to users, but it is still hampered by reliability concerns. One of the main limitations is the constrained computational resources at edge nodes [18]. When these resources are fully utilized or insufficient for incoming requests, processing delays increase, and the quality of service may degrade. Additionally, the physical infrastructure of edge environments is often less stable and more vulnerable to external factors. Power outages, caused by extreme weather events, climate change [9], hardware failures, or software malfunctions, can lead to complete shutdowns of edge sites. In such cases, all applications running on the affected edge node may be disrupted. As a result, the ability to migrate applications from one edge node to another is not just beneficial but a necessary requirement to ensure service continuity and reliability.

Checkpoint/Restore In Userspace (CRIU) [19] is a powerful tool that enables application migration at the system level. As its name suggests, CRIU allows a running process, or even a group of processes, to be checkpointed from userspace into a set of image files stored on disk. These checkpointed images can later be restored either on the same machine or on a different host, effectively resuming execution from the exact state at the time of checkpointing. By leveraging CRIU, systems can significantly reduce downtime and eliminate the need to restart applications from scratch. These reasons make CRIU an important tool for live migrating container-based applications in edge computing environments, where ensuring uninterrupted service is essential due to resource constraints and the risk of system unreliability.

2.3. User-Plane Function 5G

Within the 5G Core Network, a UPF is a critical data-plane component designed to handle user traffic with high efficiency, enabling core functions such as traffic routing, policy enforcement, and real-time data forwarding across multiple network interfaces. Architecturally, the UPF consists of two primary sub-modules: the control sub-module and the forwarding sub-module [20].

The control sub-module operates based on instructions received from the Session Management Function (SMF) via the N4 interface and is responsible for orchestrating the behavior of the UPF by handling tasks such as node status reporting, Packet Forwarding Control Protocol (PFCP) message processing, and session management. In contrast, the forwarding sub-module is responsible for the real-time processing of user-plane traffic and related data, particularly at the N3, N6, and N9 interfaces. Its responsibilities include GPRS Tunneling Protocol–User Plane (GTP-U) data analysis and encapsulation for the N3 interface, maintaining GPRS Tunneling Protocol (GTP) channels, and managing forwarding forms with fast indexing mechanisms. Additionally, it has essential routing and packet-forwarding capabilities while enforcing various policy rules, such as the Packet Detection Rule (PDR), Forwarding Action Rule (FAR), QoS Enforcement Rule (QER), and Usage Reporting Rule (URR), to handle traffic behaviors and measurement reporting.

In private 5G deployments, edge computing environments often operate under resource constraints. To optimize cost-efficiency and resources, the UPF is typically deployed alongside applications on the same edge server, utilizing container-based virtualization technologies [5]. Furthermore, a container-based UPF not only enables deployment on commercial off-the-shelf (COTS) hardware but also inherits the key advantages of cloud-native designs, including dynamic scaling, rapid deployment, and improved operational efficiency.

2.4. Related Works

Ensuring high availability and reliability is paramount in 5G UPF deployments. A common mechanism to achieve this is redundancy, typically implemented via active-standby configurations. These can manifest as 1:1 UPF redundancy, where each active UPF is paired with a dedicated standby instance, or N:M configurations, where a shared pool of N standby UPFs supports M active UPFs. The choice between these configurations is generally driven by Service-Level Agreements (SLAs) and the criticality of the services being supported: for example, 1:1 redundancy is often used for mission-critical use cases such as the Internet Protocol Multimedia Subsystem (IMS) or emergency calls, while N:M redundancy is more suitable for less critical data and internet services.

In a 1:1 setup [10,21,22], the standby UPF can assume the active role immediately upon detecting a failure due to hardware faults, Operating System (OS) upgrades [23], or other disruptive events [9]. By leveraging pre-allocated session contexts, this mechanism ensures seamless user-plane continuity. However, dedicating one standby per active UPF leads to significant resource overhead, as these standby instances remain idle for long periods. On the other hand, N:M UPF redundancy [11] can significantly reduce the deployment cost compared to a 1:1 setup by allowing multiple active UPFs to share a smaller pool of standby UPFs. Even so, these standby instances must still be provisioned and maintained, consuming capacity even when not in use, and failover times may be slightly longer due to the need to restore session states.

In addition, several techniques have been proposed to accelerate fault detection and reduce restoration times for redundancy approaches. For example, leveraging networking protocols such as GTP echo request/response from gnodeB (gNB), packet duplication [24], or utilizing failure detection by upstream N6 routers can provide rapid identification of UPF failures. However, from a broader perspective, where cost-effectiveness and resource optimization are key considerations, especially in resource-constrained edge computing environments, the redundancy-based approach requires maintaining standby backup instances, which wastes capacity since these instances remain idle most of the time, ultimately conflicting with the primary objective of minimizing overhead at the edge.

Recognizing the inefficiencies of pre-provisioned standby resources, recent work has explored dynamic and on-demand UPF deployment strategies. Leiter et al. [25] utilize the Open Network Automation Platform (ONAP), an orchestrator, to automate failover in container-based UPF deployments. Upon detecting a failure, the orchestrator dynamically deploys a new UPF instance, eliminating the need for pre-provisioned backups, but the results show a significant delay in achieving the ready container state due to setting up network configurations for the new instance. Tsourdinis et al. [26] focus on accelerating the restoration process itself and investigate the use of CRIU to facilitate faster UPF restoration. However, the authors acknowledge the challenges inherent in the stateful nature of UPF applications. Therefore, their solution involves running UPF containers inside Virtual Machines (VMs) managed by KubeVirt and using live VM migration to transfer the UPF to a healthy host. While this technique offers potential benefits, the complexity of migrating entire VMs and the requirement for robust network infrastructure during the migration process limit its effectiveness, especially in edge computing scenarios.

In summary, ensuring high availability and resiliency for containerized UPF in edge computing environments, without relying on pre-provisioned backup instances and while enabling rapid recovery, remains an open research challenge. To address this need, in this paper, we propose a novel solution that combines advanced technologies to achieve both fast UPF recovery and resource efficiency by avoiding redundant instances. A detailed explanation of our design architecture is presented in the following section.

3. Design Architecture

In this section, we present our proposed design architecture specifically for a failover container-based UPFs operating within a cluster, which also hosts other containerized application components. In this work, a cluster is defined as a set of one or more interconnected machines, either physical or virtual, configured to operate under a unified system. At least one node assumes the role of a master, while others, if present, serve as workers. In the case of a single-node cluster, that node may fulfill both roles. These machines must share the same networking setup and be capable of direct communication. The overall design architecture is illustrated in Figure 1. The architecture comprises two distinct planes: a control plane and a data plane.. The control plane is composed of three main components responsible for centralized management and orchestration: the UPF Manager, the Multi-cluster Orchestrator, and the 5G core network control functions. In contrast, the data plane comprises the networking infrastructure, gNB, and other access network components, along with a set of clusters where UPF containers are deployed and operated to handle user data traffic in real time.

At the heart of the system’s operational oversight lies the control plane, overseeing configuration, monitoring, and failover container-based UPFs. It comprises the following:

Fifth-Generation Core Functions: Represent the suite of control-plane network functions, such as the AMF (Access and Mobility Management Function), SMF, UDR (Unified Data Repository), and PCF (Policy Control Function), which are responsible for session management, user data storage and retrieval, policy enforcement, network slicing, and other essential capabilities within the 5G Core Network.
Multi-cluster Orchestrator: Responsible for the management and monitoring of all clusters in the data plane, including handling the life-cycle management of containers, such as deployment, update, and deletion, and continuously monitoring the health status of clusters, individual nodes, and container instances across the infrastructure.
UPF Manager: Acts as the primary entity responsible for overseeing the failover process of container-based UPFs. It monitors the status of UPF containers by retrieving information from the Multi-cluster Orchestrator, while simultaneously maintaining direct connections with all UPF instances to receive real-time status reports. This dual approach enhances the accuracy and reliability of status detection. Additionally, the UPF Manager interacts with the 5G Core Network by communicating with 5G Core Functions either directly as a trusted Application Function (AF) or indirectly through the Network Exposure Function (NEF) when operating as a non-trusted AF. It also maintains a comprehensive backup repository of session-related information for all UPFs, including PDRs, FARs, and other relevant parameters.

On the data plane, where edge clusters host UPF instances, each UPF is equipped with a Session Manager component, which may be integrated within the UPF itself or deployed as a co-located auxiliary process. This component establishes a direct connection with the UPF Manager and continuously monitors the real-time state of the UPF, including all active sessions, PDRs, FARs, and other relevant parameters. Any changes in session-related data are promptly synchronized with the UPF Manager to ensure consistency.

Any changes in session-related data are promptly synchronized with the UPF Manager to ensure consistency. In addition to session monitoring, the Session Manager also tracks the PFCP associations maintained by the UPF. After PFCP association procedures such as setup, update, or release have been successfully completed, the Session Manager notifies the UPF Manager, which then initiates a checkpoint operation for the corresponding UPF container. This mechanism ensures that, in the event of a failure and the restoration of the UPF container, PFCP heartbeat procedures can resume seamlessly. To enable this, appropriate configurations must be in place to allow the UPF and the corresponding SMF to continue the previously established heartbeat procedure after recovery.

Each cluster is equipped with its own dedicated private storage, which is accessible only by the nodes within that cluster. The storage serves as a persistent repository for the checkpointed state data of UPF containers, written after a successful checkpoint operation. By keeping the state data local and access-restricted, the system achieves both improved security and faster recovery times.

The connection between each RAN and its associated compute cluster is established via the underlying networking infrastructure. Within a given cluster, all nodes share a unified internal network configuration, which may differ from that of other clusters. Moreover, to operate correctly, a UPF must be assigned multiple IP interfaces, such as N3, N4, N6, and N9. When a UPF container is restored from a checkpoint, it inherits the network settings associated with these interfaces. Consequently, the recovery process must ensure that the restored container can reconfigure its networking environment appropriately. Given these requirements, our design introduces two distinct failover mechanisms: (i) failover of a UPF container within the same cluster and (ii) failover across different clusters. Both failover mechanisms are handled by the UPF Manager.

3.1. Failover Container-Based UPF Within a Cluster

In the clustered environment defined earlier, where machines are interconnected and operate under a shared networking configuration, application workloads such as containerized services can be flexibly deployed and migrated across nodes. This capability allows the system to maintain service availability by relocating workloads to healthy nodes in response to failures. Therefore, in this section, we propose a restoration mechanism that ensures the recovery of a UPF container within the same cluster. Depending on the type of failure and the availability of resources, the UPF instance may be recovered either on a different node, for example, when the originally assigned node encounters an issue and becomes unavailable, or reinstated on the same one, such as in cases where it remains operational and resources are sufficient, and the issue may be related to the UPF itself. The detailed procedure is formalized in Algorithm 1.

Upon detecting a failure event, the UPF Manager identifies the affected UPF instance, along with the node and the cluster where the instance was previously operating. It then interacts with the Multi-cluster Orchestration component to assess the health status of the corresponding cluster and node, and it selects an appropriate node within that cluster to perform the restoration. Once a target node is selected, the UPF Manager verifies whether the checkpoint data of the affected UPF is already available on the selected node. If not, it instructs the node to retrieve the checkpoint data from a designated private storage system. After the checkpoint data is prepared, the UPF Manager backs up the networking configuration of the failed UPF container (e.g., N3, N4, N6, N9) and proceeds to remove the faulty container from the cluster. Finally, it restores the UPF container on the selected node by starting a new container with the checkpointed data and reattaches the saved networking configuration using container networking mechanisms such as the Container Network Interface (CNI).

Algorithm 1 Failover container-based UPF within a cluster

Require: Failure event of a UPF container and its hosting cluster is available
Ensure: Restored UPF container is in Ready state

1:: function Main( $e v e n t$ )
2:: $u p f \leftarrow$ GetUPFInfo( $e v e n t$ )
3:: $n o d e, c l u s t e r \leftarrow$ GetNodeAndClusterInfo( $e v e n t$ )
4:: if not IsNodeAvailable( $n o d e$ , $c l u s t e r$ ) then
5:: $n o d e \leftarrow$ FindAppropriateNodeInCluster( $c l u s t e r$ )
6:: end if
7:: $r e s t o r e d U P F \leftarrow$ RestoreUPF( $u p f$ , $n o d e$ )
8:: ReattachInterfaces( $r e s t o r e d U P F$ )
9:: RestoreState( $r e s t o r e d U P F$ )
10:: end function

11:: function RestoreUPF( $u p f$ , $n o d e$ )
12:: if not IsCheckpointImagePresent( $u p f$ , $n o d e$ ) then
13:: PullCheckpointImageFromPrivateStorage( $n o d e$ )
14:: end if
15:: $c o n f i g \leftarrow$ BackupConfigContainer( $u p f$ )
16:: RemoveContainer( $u p f$ )
17:: $r e s t o r e d U P F \leftarrow$ InstantiateFromCheckpoint( $c o n f i g$ , $n o d e$ )
18:: return restoredUPF
19:: end function

20:: function RestoreState( $r e s t o r e d U P F$ )
21:: while not IsConnectedToUPFManager( $r e s t o r e d U P F$ ) do
22:: /* waiting for connection */
23:: end while
24:: ReinstallSessionData( $r e s t o r e d U P F$ )
25:: end function

Once the UPF container has been restored and its original interface addresses reattached, communication between the Session Manager and UPF Manager is re-established. Upon confirmation of the renewed connection, the UPF Manager transfers the previously backed-up session-related data, including PDRs, FARs, and other parameters, and instructs the Session Manager to reinstall the corresponding state into the restored UPF. This restoration process is performed entirely between the UPF Manager and the UPF itself, without requiring any involvement from 5G control network functions. As a result, the restored UPF is able to continue its operation without initiating Protocol Data Unit (PDU) session re-establishment procedures or repeating PFCP association procedures with the SMF, making the restoration completely transparent to the UE.

3.2. Failover Container-Based UPF Across Multiple Clusters

In a multi-cluster environment, each cluster has different networking configurations. This makes it infeasible to apply the failover procedure described in Algorithm 1, because restoring a checkpointed UPF container requires the same networking interface configuration that existed at the time of checkpointing. When the checkpoint is created on a node in one cluster and restored on a node in another, this requirement cannot be guaranteed due to the heterogeneity of network configurations across clusters. To overcome this limitation, we propose a multi-cluster UPF failover mechanism, as described in Algorithm 2.

Upon detecting a failure event affecting a UPF instance and determining that the cluster hosting the instance has become unavailable, the UPF Manager examines the UPF’s associated policy to check whether a predefined target cluster for failover has been specified. If it exists, the failover proceeds accordingly. Otherwise, the UPF Manager interacts with the multi-cluster orchestrator to select a suitable target cluster, potentially favoring clusters already hosting the same application workload as the one originally served by the failed UPF. Once a target cluster is selected, the UPF Manager obtains the configuration of the selected cluster and initiates the deployment of a new UPF container through the multi-cluster orchestrator. The new UPF is provisioned with networking interfaces that match the destination cluster’s environment, and its operational configuration is replicated from the failed UPF. Once the new UPF container is instantiated, the UPF Manager coordinates with 5G core network functions to instruct the SMF to establish a PFCP association with the new UPF instance and to transfer all UE session contexts previously managed by the failed UPF to the new one. After the session transfer is completed, the UPF Manager proceeds to release the PFCP association with the affected UPF and remove its container from the system. This coordination may involve communication via the PCF if the UPF Manager operates as a trusted AF, or via the NEF in the case of a non-trusted AF.

Algorithm 2 Failover container-based UPF across multiple clusters

Require: Failure event of a UPF container and its hosting cluster have become unavailable
Ensure: New UPF container is recovered

1:: function Main( $e v e n t$ )
2:: $u p f \leftarrow$ GetUPFInfo( $e v e n t$ )
3:: $p o l i c y \leftarrow$ GetUPFPolicyFailover( $u p f$ )
4:: if $p o l i c y \neq \emptyset$ then
5:: $t a r g e t C l u s t e r \leftarrow$ GetPredefinedCluster( $p o l i c y$ )
6:: else
7:: $t a r g e t C l u s t e r \leftarrow$ SelectTargetCluster( $u p f$ )
8:: end if
9:: $c o n f i g \leftarrow$ GetClusterConfig( $t a r g e t C l u s t e r$ )
10:: $n e w U P F \leftarrow$ DeployUPF( $u p f$ , $t a r g e t C l u s t e r$ , $c o n f i g$ )
11:: CoordinateWith5GCore( $u p f$ , $n e w U P F$ )
12:: DeleteUPFContainer( $u p f$ )
13:: RestoreUPFState( $n e w U P F$ )
14:: UpdateRouting( $n e w U P F$ , $c o n f i g$ )
15:: end function

16:: function DeployUPF( $u p f$ , $t a r g e t C l u s t e r$ , $c o n f i g$ )
17:: $n e w U P F \leftarrow$ BuildUPFDeployment( $u p f$ , $c o n f i g$ )
18:: CreateUPFContainer( $n e w U P F$ , $t a r g e t C l u s t e r$ )
19:: return newUPF
20:: end function

21:: function CoordinateWith5GCore( $u p f$ , $n e w U P F$ )
22:: PFCPAssociationEstablishUPF( $n e w U P F$ )
23:: TransferSessionsInSMF( $u p f$ , $n e w U P F$ )
24:: PFCPAssociationReleaseUPF( $u p f$ )
25:: end function

26:: function RestoreUPFState( $n e w U P F$ )
27:: while not IsConnectedToUPFManager( $n e w U P F$ ) do
28:: /* waiting for connection */
29:: end while
30:: ReinstallSessionData( $n e w U P F$ )
31:: end function

Once the new UPF instance becomes operational, its embedded Session Manager component establishes a connection with the UPF Manager. The UPF Manager recognizes that this instance was created as part of a failover recovery. It then transfers the complete set of session-related data from the failed UPF to the Session Manager, allowing it to reinstall the sessions on the new UPF instance. Finally, the UPF Manager updates the routing for the networking infrastructure to redirect traffic flows toward the new UPF instance. The method for updating routing may vary depending on the underlying infrastructure and generally falls into two categories: update routing by following 3GPP procedures and update routing by using external components.

3.2.1. Update Routing by Following 3GPP Procedures

In this section, we present an approach that uses the UPF Manager to interact with 5G core functions to perform the update routing that follows [27,28], as illustrated in Figure 2, which details the procedure that assumes that the UPF Manager acts as a trusted AF. Since the UPF Manager already holds all session data of the failed UPF, it can identify all PDU sessions affected by the failure. Using this information, the UPF Manager initiates or updates policies with PCF regarding the impacted UE PDU sessions, indicating the failure of the old UPF and requesting the serving gNB to switch tunnels to the recovered UPF. Subsequently, PCF sends a notification about these policy updates to SMF. Upon receiving the notification, SMF responds with an acknowledgment back to PCF. The SMF then processes the policy requests for the affected UEs and triggers communication with AMF via the Namf_Communication_N1N2MessageTransfer interface, using N1 and N2 messages. These messages are encapsulated in a PDU Session Resource Modify Request sent to the gNB. The N1 message is left empty, while the N2 message contains the PDU Session Resource Modify Request List IE, which includes all affected PDU sessions represented as individual PDU Session Resource Modify Request IEs. Each request IE includes a User-Plane Failure Indication IE and a UL NG-U UP TNL Information IE, containing the recovered UPF’s IP address and Tunnel Endpoint Identifier (TEID). This information enables the gNB to release the old NG-U tunnel and establish a new one. Upon completion, the gNB replies with a PDU Session Resource Modify Response message, where the Use-Plane Failure Indication Report IE is set to “new transport address allocated”, indicating that the gNB has successfully established the new tunnel. The AMF receives and forwards this response to SMF, which updates rule reports with PCF, and finally, PCF sends a response back to the UPF Manager, completing the update routing procedure.

3.2.2. Update Routing by Using External Components

The second approach relies on external components when 3GPP-compliant procedures are unsupported by network elements, such as certain core functions or access nodes, or alternative network management components are available and preferable. In this case, solutions such as Software-Defined Networking (SDN), as depicted in Figure 3, or dedicated management switches between the gNBs and UPF are usually employed. Upon session restoration, in an SDN-based solution, the UPF Manager communicates with the SDN controller to update traffic-forwarding rules, thereby redirecting GTP traffic to the recovered UPF instance.

4. Experimental Implementation

4.1. Testbed Environment

To demonstrate the practical feasibility of the proposed architecture described in Section 3, we implemented a testbed as illustrated in Figure 4. The testbed consists of three Kubernetes clusters: One acts as the control plane, while the remaining two clusters are designated as data plane clusters, where the UPF containers and user applications are deployed. In addition, a separate machine is also included in the data plane to run a simulation of User Equipment (UE) and gNB components.

Due to limitations in the UERANSIM [29] simulator, particularly its lack of support for some essential 3GPP procedures (e.g., PDU Session Resource Modify [27]), which are critical for enabling UPF container failover across multiple clusters, we integrated Open vSwitch (OVS) [30] into our testbed. OVS is used to manipulate GTP headers, allowing redirection to a newly assigned UPF. To automate flow installation and dynamic control of OVS, a simple SDN controller is used.

All nodes across the Kubernetes clusters and the UE/gNB simulation machine are VMs provided with the same virtual Central Processing Unit (vCPU), Random Access Memory (RAM), and OS. Notably, while one of the data plane clusters is hosted on an on-premise OpenStack infrastructure and the other runs on bare-metal servers, both platforms use the same hardware specifications. Each cluster operates within an isolated network, and the two networks are connected via a router, allowing the simulation machine to access both clusters. Furthermore, all clusters use the same version of Kubernetes and are configured with the Cilium [31] and Multus CNI [32]. Detailed specifications are summarized in Table 1.

The control-plane cluster hosts all core components of the system, including the UPF Manager, a multi-cluster orchestration system, and 5G control-plane network functions. For 5G core functions, we selected Free5GC [33] due to its open-source nature, active community support, and compliance with 3GPP Release 17 specifications. For multi-cluster orchestration, we use Karmada [34] and Prometheus [35]. Karmada is an open-source orchestration platform that enables workload deployment and management across multiple Kubernetes clusters and both public and private clouds. It facilitates automated deployment and migration of UPF containers with minimal manual effort. Prometheus is employed to collect performance metrics (e.g., VM state such as Running or Shutdown) from exporters running on both OpenStack (Ceilometer project [36]) and bare-metal environments.

The UPF Manager is implemented as a Kubernetes Operator [37] consisting of three controllers, Registration, Map, and Migrate, each associated with a corresponding Custom Resource Definition (CRD) [38]:

The Migrate controller is responsible for executing failover procedures, either restoring a failed UPF within its original cluster or restoring it to another cluster.
The Registration controller monitors the current status of the UPF, including cluster location, networking configuration, and failover policies. It also performs health checks on the UPF using metrics from the orchestrator and direct alerts from the Session Manager, which communicates with the UPF Manager via a dedicated connection when errors are detected. All registration-related data is stored as Kubernetes resources under the Registration CRD.
The Map controller is responsible for managing the session data exchanged with the Session Manager. Specifically, it receives session data from the Session Manager, transforms the raw data into a structured format, and stores it as a Kubernetes resource under the Map CRD. When a failover occurs and the Migrate controller requests to restore a session, the Map controller retrieves the corresponding data, converts it back into the format expected by the Session Manager, and sends it accordingly.

The UPF Manager can interact with both Karmada and Free5GC, Prometheus via their exposed APIs to ensure coordinated logic and efficient management of network functions.

For the data plane, two Kubernetes clusters are registered as member clusters within the Karmada cluster. These clusters are dedicated to hosting UPF containers and applications serving the UE. Each node in these clusters supports the Kubelet Checkpoint API and has CRIU installed [39]. Since CRIU has specific requirements for process state management, we modified the open-source free5GC UPF in our testbed to ensure compatibility, enabling CRIU to successfully perform checkpoint and restore operations on it. Leveraging its open architecture, we further integrated the Session Manager as a plugin within the UPF itself, allowing it to directly interact with and fully control the UPF’s data-forwarding behavior.

In our design, communication between the Session Manager and the UPF Manager is bidirectional; therefore, we employed gRPC (Remote Procedure Call) bidirectional streaming for both of them, which not only enables two-way communication over a single persistent connection but also eliminates the need for both endpoints to be publicly exposed, and we choose only the UPF Manager that requires an externally accessible interface. This setup allows the Session Manager to receive control commands in real time and report any runtime issues immediately, even while operating behind a private network boundary.

4.2. Testbed Failure Scenarios

In our testbed, we choose Cluster 1 to have an UPF running, and Cluster 2 serves as a target for failover of the UPF with respect to another cluster scenario. To enable the execution of our proposed UPF failover algorithms, we simulate UPF failure events corresponding to their respective triggering conditions.

The first algorithm is designed for scenarios in which a UPF failure occurs while the hosting cluster remains operational. Such failures may originate either from the UPF itself due to internal issues, such as software bugs, process crashes, or overload, or from the underlying node hosting the UPF, for instance, due to hardware faults, operating system errors, or resource exhaustion. In our experiments, we specifically emulate application-level failures in the UPF as the representative case to trigger Algorithm 1, using a predefined fault message that serves as a fault injection mechanism inspired by similar approaches discussed in [25]. To enable this behavior, we created a REST API endpoint for the UPF instance, which can be externally invoked to trigger a failure. Prior to the fault injection, the UPF instance has already had a PDU session of the UE, and its session data has been fully synchronized with the UPF Manager. Upon receiving the trigger, the Session Manager immediately sends a fault message to the UPF Manager, which interprets it as an indication that the UPF instance is malfunctioning. It subsequently verifies that the underlying Kubernetes cluster and hosting node remain healthy and have sufficient resources available. Consequently, Algorithm 1 is activated to handle the UPF failure within the same cluster.

Algorithm 2 targets scenarios where the entire Kubernetes cluster hosting the UPF becomes unavailable. Such unavailability may result from various root causes, including a system misconfiguration [40], a network outage that isolates the cluster [41], or hardware-level failures, such as a power outage at an edge computing site, which affects the entire infrastructure hosting all cluster nodes. In our experiments, we emulate this type of failure by forcibly shutting down all nodes of Kubernetes Cluster 1, which is hosting the UPF instance. As a result, the gRPC connection between the UPF and the UPF Manager is abruptly broken, which serves as an initial indication of failure. Unlike the previous failure scenario, where only the UPF instance is affected while the hosting node and Cluster 1 remain operational, this case involves a complete loss of availability for all nodes in Cluster 1. The UPF Manager confirms this condition by querying Karmada and metrics reported by Prometheus. Upon verifying that Cluster 1 is entirely unavailable, the UPF Manager triggers Algorithm 2 to recover the UPF in an alternative cluster, specifically Cluster 2. To summarize, if the hosting cluster is healthy and reachable, Algorithm 1 is selected to handle the UPF failure within the same cluster. Otherwise, if the cluster is unreachable or unavailable, Algorithm 2 is chosen to recover the UPF in an alternative cluster.

Regarding traffic service, we prepare a UE that has an active PDU session established with the UPF. This UE consumes a video streaming service from an application deployed as a containerized service in a Kubernetes cluster. The traffic flows from the UE through the gNB and the UPF to the streaming server, with response packets following the reverse path. To simplify traffic recovery in the case of Algorithm 2, where the UPF is failed over with respect to a secondary cluster, the video streaming application is also pre-deployed in Cluster 2. This ensures that once the UPF is restored in Cluster 2, the UE can immediately resume service access without waiting for the application to be redeployed.

5. Results and Discussions

To evaluate the effectiveness of our design on the testbed, we consider two key metrics: UPF Redeployment Time and Service Disruption Time. The UPF Redeployment Time is adopted as a baseline metric, as defined by A. Leiter et al. [25]. This metric measures the time from UPF failure detection to the creation of a new instance, which is not immediately capable of processing traffic. In our testbed, the metric starts when the UPF Manager detects a faulty UPF and ends when the new UPF Pod, including both the UPF container itself and its Session Manager component, has been fully initialized through Readiness Probes successfully. The Service Disruption Time is defined as the total duration from when a UE becomes disconnected until it successfully regains access to the service. The baseline for this metric is a UPF failure scenario without session continuity.

We evaluate both UPF Redeployment Time and Service Disruption Time under two failure scenarios: (1) failover within the same cluster and (2) cross-cluster failover. Each test case conducts failover procedures 100 times under identical conditions to ensure statistical significance and consistency. Furthermore, before performing the measurements, we ensured that all necessary checkpoint data had already been downloaded and were available in the target to prevent the inclusion of data retrieval time in the evaluation metrics.

In addition to these evaluations, with the Service Disruption Time, we evaluate how cost-effective a system using our proposal is compared to the traditional methods under many reliability level requirements within the same cluster.

5.1. Performance Evaluation Using UPF Redeployment Time

Figure 5a compares UPF redeployment times within a cluster between our proposed approach and the baseline method. In this comparison, both methods reuse the network interface configuration of the failed UPF pod during restoration. The results show that our approach achieves significantly faster recovery, with a shorter average redeployment time compared to the baseline. This performance gain is mainly due to the fact that our method restores the UPF from a previously saved checkpoint, enabling both the UPF and its associated Session Manager to resume functionality without the need for re-initialization. In contrast, the baseline approach involves redeploying the container from scratch, which entails reinitializing not only the UPF but also its Session Manager component, thereby resulting in a longer overall recovery time. It is worth noting that our testbed utilizes a lightweight open-source UPF. Therefore, the performance gap between our approach and the baseline is expected to widen further when applied to more complex or sophisticated UPF implementations.

Figure 5b shows the UPF redeployment times when the recovered container is launched in another cluster, comparing our proposed approach to the baseline method. In this scenario, both methods exhibit comparable performance, with similar average recovery times. This is expected, as both approaches follow the same strategy: deploying a new UPF instance in a separate cluster, which requires full re-initialization of the UPF and its Session Manager. Unlike the UPF redeployment within a cluster, the restored UPF in this scenario must be assigned a new network interface configuration. As a result, the time required for the UPF to reach the “ready” state is significantly longer. This overhead arises from the additional steps involved in setting up new network interfaces, rather than reusing the previous configuration. This issue has also been discussed in [25], which highlights the cost of interface provisioning in containerized UPF environments.

In general, by utilizing our UPF Manager as a component for the real-time monitoring of the UPF status, the failure detection and recovery process becomes significantly faster compared to traditional mechanisms such as PFCP heartbeat-based monitoring. Overall, our findings indicate that the failover UPF within a cluster approach offers better recovery performance compared to the other one, but the choice of failover strategy should be made based on the specific failure context, as each method presents different trade-offs in terms of recovery time and system-level requirements.

5.2. Performance Evaluation Using Service Disruption Time

Figure 6 illustrates the comparison of service disruption times between our proposed approaches and the baseline method. Unsurprisingly, the same cluster UPF failover provides the best recovery performance, as the restored UPF only needs to retrieve its session data to resume operation. The other cluster failover, however, takes longer due to the need to not only recover session data but also establish a new N4 PFCP association with the SMF and update routing rules. The baseline method exhibits the longest recovery time, since all affected UEs must re-initiate their PDU sessions from scratch, resulting in a full session re-establishment process. The result of the baseline method is consistent with the findings in [21].

Regarding the implementation of the 5G Core, the baseline method introduces significant signaling overhead not only between the UE and the core network but also among multiple network functions such as the AMF, SMF, PCF, and others. This issue becomes more critical in private 5G environments, where a large number of connected devices can exponentially increase the number of signaling procedures, placing heavy processing demands on the core network. In contrast, with our proposed failover mechanisms, the UPF is restored rapidly, and in the failover UPF across clusters scenario, the 5G Core only needs to handle routing updates, significantly reducing the signaling burden.

5.3. Evaluation of Cost-Effective Performance Compared with Traditional Methods

To comprehensively evaluate the cost-effectiveness of our proposed approach compared to traditional methods, we introduce a total cost function that captures both the system’s ability to meet the allowed downtime for a given reliability level and the resource usage required to achieve it. The total cost function is defined in Equation (1):

C_{Total} = α \cdot C_{Resource} + β \cdot C_{Downtime} (R)

(1)

Here, R denotes the target reliability level,

C_{Total}

is the overall total cost,

C_{Resource}

represents the resource cost of deploying UPFs, and

C_{Downtime} (R)

denotes the downtime penalty cost based on the required reliability level. The coefficients

α

and

β

act as weighting factors that balance the trade-off between resource cost and downtime cost.

The resource cost of UPFs follows the model adopted from [42,43] and is expressed in Equation (2):

C_{Resource} = \sum_{i = 1}^{m} N_{i} \cdot R_{i} \cdot p r

(2)

where

N_{i}

is the number of containerized UPFs,

R_{i}

is the amount of resources allocated per containerized UPF, and

p r

is the unit price of resource usage.

Similarly, the downtime penalty cost is defined in Equation (3):

C_{Downtime} (R) = \frac{T_{ServiceDisruption}}{T_{MaximumDowntime} (R)} \cdot γ

(3)

where

T_{ServiceDisruption}

is the Service Disruption Time,

T_{MaximumDowntime} (R)

represents the maximum allowed downtime according to the target reliability level R, and

γ

is a penalty coefficient that reflects the cost sensitivity to downtime violations.

To compare and evaluate the cost-effectiveness of our proposed approach against traditional failover methods, we re-implemented two traditional schemes, 1:1 UPF failover and N:M UPF failover, based on their original concepts [10,11] in our testbed environment. For the 1:1 UPF failover, the backup UPF is pre-associated with the SMF via the N4 interface before failover occurs, and its PDU sessions are pre-installed manually to mirror the active UPF. In contrast, in the N:M UPF failover scheme, when a failover is triggered, the backup UPF first establishes the N4 association with the SMF; after this, the UPF Manager restores the PDU sessions from the active UPF.

We selected the failover UPF within a cluster method for comparison with traditional approaches because it provides the best service disruption time while avoiding additional delays caused by IP address allocation for the recovered UPF interface. This ensures fairness when compared to traditional methods, where the backup UPF is already pre-launched with an assigned IP address. Although all three methods are deployed within the same cluster, the traditional approaches require the use of OVS to redirect UE’s traffic to the backup UPF because the backup UPF has a different IP address than the active UPF.

Figure 7a presents the results of the Service Disruption Time for our approach compared to traditional methods within the same cluster. Unsurprisingly, our approach exhibits the longest disruption time due to the time required for CRIU to restore the containerized UPF. The 1:1 UPF failover achieves the shortest disruption time since the backup UPF is always ready to serve, followed by the N:M UPF failover, which has the second shortest recovery time. Based on these results, we evaluate the total cost of each method using the cost model defined in Equation (1). To emphasize resource efficiency, we set

α = 0.7

and

β = 0.3

, giving higher priority to reducing resource usage. For

γ

, we define

γ = p r

so that the downtime penalty is treated at the same cost scale as operating a single UPF instance. Additionally, for the N:M UPF failover method, we configured the number of backup UPFs to be two-thirds of the number of active UPFs. For

T_{MaximumDowntime} (R)

, we define it as the maximum allowed downtime within a one-month period according to the target reliability level.

Figure 7b illustrates the total cost comparison between our approach and the two traditional failover methods. In contrast to the Service Disruption Time results, our approach yields the lowest total cost when the reliability level ranges from 99% to 99.99%. This cost reduction is achieved because our method eliminates the need for pre-running backup instances while still satisfying the required reliability level. As a result, for example, at reliability level 99.99%, our method achieves a 50% cost reduction compared to the 1:1 UPF failover and a 41.5% reduction compared to the N:M UPF failover. When the reliability requirement increases to an ultra-high level of 99.9999%, our approach incurs a higher downtime penalty cost due to longer service disruption time compared to the two traditional methods. However, even under this condition, the overall total cost remains lower than both traditional approaches.

In general, although our proposed method introduces a longer Service Disruption Time compared to the traditional failover mechanisms, it achieves a significantly lower total cost, making it more resource-efficient while still maintaining acceptable reliability trade-offs.

5.4. Research Findings and Limitations

In this paper, we propose a novel design for failover in UPF-containerized deployments that is not only cost-effective but also eliminates the need for traditional redundancy mechanisms, such as continuously running backup instances. Our approach enables real-time fault detection and rapid recovery of UPF containers, which minimizes service disruption and improves overall QoS. The experimental results obtained from our testbed demonstrate that the proposed design fully satisfies the requirements of 3GPP-standardized services and URLLC use cases demanding up to 99.9999% reliability. Accordingly, this design can serve as a complementary alternative to traditional UPF resilience solutions, especially in resource-constrained environments where cost-efficiency is critical.

Additionally, our findings highlight that failover within a single cluster provides faster recovery than cross-cluster failover in terms of both UPF redeployment time and service disruption time, making the first algorithm the preferable choice. Therefore, even in multi-cluster setups, if networking consistency can be ensured (e.g., through VPNs or tunneling), it is feasible to use the first algorithm to achieve quicker recovery with lower complexity.

Despite these promising results, our study has several limitations. First, the feasibility of container-based UPF failover within a cluster depends heavily on the capabilities of CRIU. In production deployments, UPF containers typically leverage data-plane acceleration technologies such as DPDK and VPP to boost throughput and reduce latency by bypassing the kernel and avoiding context-switch overheads. However, CRIU currently lacks support for applications that require these accelerated setups due to diverse setup requirements [20]. Nevertheless, this limitation does not necessarily imply that CRIU is incompatible with such UPFs. Since CRIU is an open-source project, it can be extended to support checkpointing and restoration for accelerated UPF containers to meet their specialized requirements. Second, our proposed solution assumes stable and reliable communication links between the control-plane and data-plane components. If these connections degrade or fail, incorrect failover actions may be triggered. Third, the UPF Manager acts as a central component for failover operations. This component could introduce a single point of failure and become a scalability bottleneck as the number of UPF instances grows. Moreover, as a central controller, it may become an attractive target for attackers, and any compromise could disable the failover capability of all UPF instances. To address these concerns, future work should explore more advanced solutions, such as high-availability mechanisms and load-balancing strategies.

Although these limitations exist, our proposed failover design provides a promising foundation for future research and practical deployment. It paves the way for more advanced studies and contributes to the ongoing evolution of 5G and beyond infrastructure, where sustainability, resource optimization, and cost-effectiveness are key to delivering next-generation network services.

6. Conclusions

In this study, we tackle the open research challenge of ensuring high resiliency for containerized UPF in resource-constrained edge computing environments, achieving rapid recovery without relying on pre-provisioned backup instances. Specifically, we introduced the UPF Manager with two new failover mechanisms to support recovery both within a cluster and across clusters, along with a Session Manager to enable real-time monitoring and transparent state restoration. Our design ensures that UPF failover is seamless to the UE, eliminating the need for the UE to re-establish its sessions and significantly reducing signaling load on the 5G Core. This is especially beneficial in private 5G and Internet of Things (IoT)-intensive environments, where high device density is connected to the network. The experimental results on our testbed demonstrate that the proposed mechanisms reduce UPF redeployment time by up to 37% and lower service disruption time significantly compared to baseline methods. Moreover, under high-reliability requirements, the total system cost is reduced by up to 50% compared to traditional redundancy-based failover approaches. These findings confirm that our design offers a cost-effective and resource-efficient alternative for ensuring UPF availability and can complement existing resilience strategies in constrained edge computing environments. For future work, in addition to addressing current limitations, we plan to extend our solution to enable seamless UPF migration in more realistic environments, such as service function chains [44], characterized by multiple accelerated UPFs with high-volume traffic and diverse challenging conditions.

Author Contributions

Conceptualization, K.N.T. and Y.K.; methodology, K.N.T.; software, K.N.T.; validation, K.N.T. and Y.K.; formal analysis, K.N.T.; investigation, K.N.T.; writing—original draft preparation, K.N.T.; writing—review and editing, K.N.T. and Y.K. All authors have read and agreed to the published version of this manuscript.

Funding

This work was partly supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2020-II200946, Development of Fast and Automatic Service recovery and Transition software in Hybrid Cloud Environment, 50; and No. RS-2024-00398379, Development of High Available and High Performance 6G Cross Cloud Infrastructure Technology, 50).

Data Availability Statement

All the experimental data were generated using our lab testbed environment and are therefore not publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

3GPP	Third-Generation Partnership Project;
5G	The Fifth Generation;
AF	Application Function;
AMF	Access and Mobility Management Function;
CNI	Container Network Interface;
COTS	Commercial Off-The-Shelf;
CPU	Central Processing Unit;
CRD	Custom Resource Definition;
CRIU	Checkpoint/Restore In Userspace;
DPDK	Data Plane Development Kit;
eMBB	Enhanced Mobile Broadband;
FAR	Forwarding Action Rule;
gNB	gNodeB;
gRPC	gRPC Remote Procedure Calls;
GTP	GPRS Tunneling Protocol;
GTP-U	GPRS Tunneling Protocol–User Plane;
IMS	IP Multimedia Subsystem;
IoT	Internet of Things;
IE	Information Element;
KPI	Key Performance Indicator;
MEC	Multi-access Edge Computing;
mMTC	Massive Machine-Type Communication;
MNO	Mobile Network Operator;
NEF	Network Exposure Function;
N3/N4/N6/N9	Standard 5G Interface Points;
NG-U	Next Generation User-Plane;
ONAP	Open Network Automation Platform;
OS	Operating System;
OVS	Open vSwitch;
PCF	Policy Control Function;
PDR	Packet Detection Rule;
PDU	Protocol Data Unit;
PFCP	Packet Forwarding Control Protocol;
QoS	Quality of Service;
RAM	Random Access Memory;
RAN	Radio Access Network;
SDN	Software-Defined Networking;
SLA	Service Level Agreement;
SMF	Session Management Function;
TEID	Tunnel Endpoint Identifier;
TNL	Transport Network Layer;
UDR	Unified Data Repository;
UE	User Equipment;
UL	Up Link;
UP	User Plane;
UPF	User Plane Function;
URLLC	Ultra-Reliable Low-Latency Communication;
vCPU	Virtual Central Processing Unit;
VM	Virtual Machine;
VPP	Vector Packet Processing;
TNL	Transport Network Layer.

References

Wen, M.; Li, Q.; Kim, K.J.; López-Pérez, D.; Dobre, O.A.; Poor, H.V.; Popovski, P.; Tsiftsis, T.A. Private 5G Networks: Concepts, Architectures, and Research Landscape. IEEE J. Sel. Top. Signal Process. 2022, 16, 7–25. [Google Scholar] [CrossRef]
Maman, M.; Calvanese-Strinati, E.; Dinh, L.N.; Haustein, T.; Keusgen, W.; Wittig, S.; Schmieder, M.; Barbarossa, S.; Merluzzi, M.; Costanzo, F.; et al. Beyond Private 5G Networks: Applications, Architectures, Operator Models and Technological Enablers. EURASIP J. Wirel. Commun. Netw. 2021, 2021, 195. [Google Scholar] [CrossRef]
Islam, J.; Ahmad, I.; Gimhana, S.; Markkula, J.; Harjula, E. Energy Profiling and Analysis of 5G Private Networks: Evaluating Energy Consumption Patterns. In Proceedings of the 14th International Conference on the Internet of Things, Oulu, Finland, 19–22 November 2024; Association for Computing Machinery: New York, NY, USA IoT ’24. ; pp. 246–247. [Google Scholar] [CrossRef]
Flamini, M.; Naldi, M. Optimal Pricing in a Rented 5G Infrastructure Scenario with Sticky Customers. Future Internet 2023, 15, 82. [Google Scholar] [CrossRef]
Trakadas, P.; Sarakis, L.; Giannopoulos, A.; Spantideas, S.; Capsalis, N.; Gkonis, P.; Karkazis, P.; Rigazzi, G.; Antonopoulos, A.; Cambeiro, M.A.; et al. A Cost-Efficient 5G Non-Public Network Architectural Approach: Key Concepts and Enablers, Building Blocks and Potential Use Cases. Sensors 2021, 21, 5578. [Google Scholar] [CrossRef]
Spinelli, F.; Mancuso, V. Toward Enabled Industrial Verticals in 5G: A Survey on MEC-Based Approaches to Provisioning and Flexibility. IEEE Commun. Surv. Tutor. 2021, 23, 596–630. [Google Scholar] [CrossRef]
Ergen, M.; Saoud, B.; Shayea, I.; El-Saleh, A.A.; Ergen, O.; Inan, F.; Tuysuz, M.F. Edge Computing in Future Wireless Networks: A Comprehensive Evaluation and Vision for 6G and Beyond. ICT Express 2024, 10, 1151–1173. [Google Scholar] [CrossRef]
Qiu, T.; Chi, J.; Zhou, X.; Ning, Z.; Atiquzzaman, M.; Wu, D.O. Edge Computing in Industrial Internet of Things: Architecture, Advances and Challenges. IEEE Commun. Surv. Tutor. 2020, 22, 2462–2488. [Google Scholar] [CrossRef]
Sai, K.; Tipper, D. Sustainability and Power Outage-aware Placement of Edge Computing in NextG RANs. In Proceedings of the 2024 IEEE International Symposium on Technology and Society (ISTAS), Puebla, Mexico, 18–20 September 2024; pp. 1–8. [Google Scholar] [CrossRef]
Rodrigues, F.; Shetty, R.; Suthar, O.P.; Suryanarayanarao, R. SMF ASSISTED FASTER AND EFFICIENT UPF REDUNDANCY MODEL. Technical Disclosure Commons. Available online: https://www.tdcommons.org/dpubs_series/3550 (accessed on 10 June 2025).
Jaya, S.P.; Pant, D. MECHANISM FOR ACHIEVING FAST FAILOVER OF 5G USER PLANE FUNCTIONS DEPLOYED IN AN N:M HOT-STANDBY MODE. Technical Disclosure Commons. Available online: https://www.tdcommons.org/dpubs_series/3704 (accessed on 10 June 2025).
Abuibaid, M.; Ghorab, A.H.; Seguin-Mcpeake, A.; Yuen, O.; Yungblut, T.; St-Hilaire, M. Edge Workloads Monitoring and Failover: A StarlingX-Based Testbed Implementation and Measurement Study. IEEE Access 2022, 10, 97101–97116. [Google Scholar] [CrossRef]
Siriwardhana, Y.; Porambage, P.; Ylianttila, M.; Liyanage, M. Performance Analysis of Local 5G Operator Architectures for Industrial Internet. IEEE Internet Things J. 2020, 7, 11559–11575. [Google Scholar] [CrossRef]
Badmus, I.; Laghrissi, A.; Matinmikko-Blue, M.; Pouttu, A. End-to-End Network Slice Architecture and Distribution across 5G Micro-Operator Leveraging Multi-Domain and Multi-Tenancy. EURASIP J. Wirel. Commun. Netw. 2021, 2021, 94. [Google Scholar] [CrossRef]
Lin, Y.; Wang, X.; Ma, H.; Wang, L.; Hao, F.; Cai, Z. An Efficient Approach to Sharing Edge Knowledge in 5G-Enabled Industrial Internet of Things. IEEE Trans. Ind. Inform. 2023, 19, 930–939. [Google Scholar] [CrossRef]
Meira, J.; Matos, G.; Perdigão, A.; Cação, J.; Resende, C.; Moreira, W.; Antunes, M.; Quevedo, J.; Moutinho, R.; Oliveira, J.; et al. Industrial Internet of Things over 5G: A Practical Implementation. Sensors 2023, 23, 5199. [Google Scholar] [CrossRef]
Ishtiaq, M.; Saeed, N.; Khan, M.A. Edge Computing in the Internet of Things: A 6G Perspective. IT Prof. 2024, 26, 62–70. [Google Scholar] [CrossRef]
Tošić, A. Run-time Application Migration using Checkpoint/Restore In Userspace. J. Web Eng. 2024, 23, 735–748. [Google Scholar] [CrossRef]
Poggiani, L.; Puliafito, C.; Virdis, A.; Mingozzi, E. Live Migration of Multi-Container Kubernetes Pods in Multi-Cluster Serverless Edge Systems. In Proceedings of the 1st Workshop on Serverless at the Edge, Pisa, Italy, 3 June 2024; Association for Computing Machinery: New York, NY, USA SEATED ’24. ; pp. 9–16. [Google Scholar] [CrossRef]
Zhang, H.; Chen, Z.; Yuan, Y. High-Performance UPF Design Based on DPDK. In Proceedings of the 2021 IEEE 21st International Conference on Communication Technology (ICCT), Tianjin, China, 13–16 October 2021; pp. 349–354. [Google Scholar] [CrossRef]
Wernet, L.; Spang, L.M.; Siegmund, F.; Meuser, T. Resilient User Plane Traffic Redirection in Cellular Networks. In Proceedings of the 2024 IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), Natal, Brazil, 5–7 November 2024; pp. 1–6. [Google Scholar] [CrossRef]
Xie, H.; Du, S.; Zhang, Q.; Yang, Y.; Wang, H. Application of UPF Disaster Tolerant Networking in Power Private Network. In Proceedings of the 2023 IEEE 6th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 24–26 February 2023; Volume 6, pp. 811–815. [Google Scholar] [CrossRef]
Cisco. Perform Hardware Maintenance in 5G IMS and Data UPF Nodes. Available online: https://www.cisco.com/c/en/us/support/docs/wireless/ultra-cloud-core-user-plane-function/217994-perform-hardware-maintenance-in-5g-ims-a.html. (accessed on 10 June 2025).
Ihle, F.; Meuser, T.; Menth, M.; Scheuermann, B. Packet level resilience for the user plane in 5G networks. arXiv 2025, arXiv:2501.17964. [Google Scholar] [CrossRef]
Leiter, Á.; Hegyi, A.; Galambosi, N.; Lami, E.; Fazekas, P. Automatic Failover of 5G Container-Based User Plane Function by ONAP Closed-Loop Orchestration. In Proceedings of the NOMS 2022—2022 IEEE/IFIP Network Operations and Management Symposium, Budapest, Hungary, 25–29 April 2022; pp. 1–2. [Google Scholar] [CrossRef]
Tsourdinis, T.; Makris, N.; Fdida, S.; Korakis, T. DRL-based Service Migration for MEC Cloud-Native 5G and beyond Networks. In Proceedings of the 2023 IEEE 9th International Conference on Network Softwarization (NetSoft), Madrid, Spain, 19–23 June 2023; pp. 62–70. [Google Scholar] [CrossRef]
Tech-Invite. Inside TS 38.413: Content Part, 5 out of 16. 2025. Available online: https://www.tech-invite.com/3m38/toc/tinv-3gpp-38-413_e.html (accessed on 10 June 2025).
Tech-Invite. Inside TS 23.501: 5GS Session Management. 2025. Available online: https://www.tech-invite.com/3m23/toc/tinv-3gpp-23-501_x.html#e-5-6 (accessed on 10 June 2025).
Güngör, A. Aligungr/UERANSIM. Available online: https://github.com/aligungr/UERANSIM (accessed on 10 June 2025).
Open vSwitch Project. Open vSwitch. 2025. Available online: https://www.openvswitch.org/ (accessed on 10 June 2025).
Isovalent. Cilium—Cloud Native, eBPF-Based Networking, Observability, and Security. Available online: https://cilium.io (accessed on 10 June 2025).
K8snetworkplumbingwg. K8snetworkplumbingwg/Multus-Cni. Available online: https://github.com/k8snetworkplumbingwg/multus-cni (accessed on 10 June 2025).
free5GC Organization. free5GC. 2025. Available online: https://free5gc.org/ (accessed on 10 June 2025).
Karmada. Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration|Karmada. Available online: https://karmada.io/ (accessed on 10 June 2025).
Prometheus Authors. Prometheus: Monitoring System and Time Series Database. 2024. Available online: https://prometheus.io/ (accessed on 23 June 2025).
OpenStack Foundation. Ceilometer Documentation. 2024. Available online: https://docs.openstack.org/ceilometer/latest/ (accessed on 10 June 2025).
Red Hat, Inc. What Is a Kubernetes Operator? 2024. Available online: https://www.redhat.com/en/topics/containers/what-is-a-kubernetes-operator (accessed on 10 June 2025).
Kubernetes Authors. Custom Resources. 2024. Available online: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/ (accessed on 23 June 2025).
Kubernetes. Kubelet Checkpoint API. 2025. Available online: https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/ (accessed on 10 June 2025).
Barletta, M.; Cinque, M.; Di Martino, C.; Kalbarczyk, Z.T.; Iyer, R.K. Mutiny! How Does Kubernetes Fail, and What Can We Do About It? In Proceedings of the 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Brisbane, Australia, 24–27 June 2024; pp. 1–14. [Google Scholar] [CrossRef]
Malleni, S.S.; Sevilla, R.; Lema, J.C.; Bauer, A. Bridging Clusters: A Comparative Look at Multi-Cluster Networking Performance in Kubernetes. In Proceedings of the 16th ACM/SPEC International Conference on Performance Engineering, New York, NY, USA, 5–9 May 2025; ICPE ’25. pp. 113–123. [Google Scholar] [CrossRef]
Vu, D.D.; Tran, M.N.; Kim, Y. Predictive Hybrid Autoscaling for Containerized Applications. IEEE Access 2022, 10, 109768–109778. [Google Scholar] [CrossRef]
Amazon. Amazon EKS Pricing. 2025. Available online: https://aws.amazon.com/eks/pricing/ (accessed on 12 July 2025).
Nguyen, K.T.; Kim, Y. A Design and Implementation of Service Function Chaining Over Segment Routing IPv6 Network. In Proceedings of the 2024 15th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 14–17 October 2025; pp. 1938–1941. [Google Scholar] [CrossRef]

Figure 1. High-level architecture for containerized UPF failover within and across clusters.

Figure 2. Update routing by following 3GPP procedures.

Figure 3. Update routing by using a Software-Defined Network.

Figure 4. Testbed setup for evaluating the UPF failover mechanisms using Kubernetes clusters and simulated UE/gNB.

Figure 5. Comparison of UPF redeployment times between our proposed method and the baseline method: (a) within a cluster and (b) across clusters.

Figure 6. Comparison of Service Disruption Times between two proposed approaches and the baseline method.

Figure 7. Comparison between our proposed and traditional methods within a cluster: (a) Service Disruption Time and (b) total cost based on reliability levels.

Table 1. System components and configuration of the testbed.

Criterion	Component	Values
Server	Processor	Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz
Server	Memory	125 GB
Clusters	Kubernetes	v1.32.0
	Node Type	VM (4vCPU, 8 GB RAM)
	OS	Ubuntu Server 22.04 LTS
	CNI Plugins	Multus and Cilium
	CRIU	v4.0.0
Control Plane	5G Core Functions	free5GC v4.0.1
	Multi-cluster Orchestrator	Karmada v1.14.0 Prometheus v3.5.0
	UPF Manager	Kubernetes Operator
Data Plane	UE and gNB Simulator	UERANSIM v3.2.7
	Edge Cluster 1	Runs on OpenStack
	Edge Cluster 2	Runs on Bare-metal server
SDN	Control Plane	A custom controller
SDN	Data Plane	OVS v2.17.9

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Nguyen Trung, K.; Kim, Y. Design and Implementation of a Cost-Effective Failover Mechanism for Containerized UPF. Electronics 2025, 14, 2991. https://doi.org/10.3390/electronics14152991

AMA Style

Nguyen Trung K, Kim Y. Design and Implementation of a Cost-Effective Failover Mechanism for Containerized UPF. Electronics. 2025; 14(15):2991. https://doi.org/10.3390/electronics14152991

Chicago/Turabian Style

Nguyen Trung, Kiem, and Younghan Kim. 2025. "Design and Implementation of a Cost-Effective Failover Mechanism for Containerized UPF" Electronics 14, no. 15: 2991. https://doi.org/10.3390/electronics14152991

APA Style

Nguyen Trung, K., & Kim, Y. (2025). Design and Implementation of a Cost-Effective Failover Mechanism for Containerized UPF. Electronics, 14(15), 2991. https://doi.org/10.3390/electronics14152991

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Design and Implementation of a Cost-Effective Failover Mechanism for Containerized UPF

Abstract

1. Introduction

2. Background and Related Works

2.1. Private 5G Network

2.2. Migrate a Running Application

2.3. User-Plane Function 5G

2.4. Related Works

3. Design Architecture

3.1. Failover Container-Based UPF Within a Cluster

3.2. Failover Container-Based UPF Across Multiple Clusters

3.2.1. Update Routing by Following 3GPP Procedures

3.2.2. Update Routing by Using External Components

4. Experimental Implementation

4.1. Testbed Environment

4.2. Testbed Failure Scenarios

5. Results and Discussions

5.1. Performance Evaluation Using UPF Redeployment Time

5.2. Performance Evaluation Using Service Disruption Time

5.3. Evaluation of Cost-Effective Performance Compared with Traditional Methods

5.4. Research Findings and Limitations

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI