Distributed Edge Storage Systems: Proactive High-Availability Microservices with Live Migration and Rejuvenation Strategies
Abstract
1. Introduction
- Integrated SRN model for reactive and preventive fault management in edge storage. We build an SRN that unifies high availability, live migration, and software rejuvenation in a distributed edge storage architecture with microservices. The model captures aging, sudden failures, and rejuvenation at both edge node and microservice levels.
- Race-aware SRN formulation of the PC mechanism. We encode the contention between evacuation via live migration and node-level rejuvenation after the same trigger. This yields an SRN structure in which the trigger can lead either to successful evacuation or to premature termination of microservices.
- Capacity-oriented availability evaluation for multi-node edge storage. We adopt a reward-based definition of availability that counts microservices that remain usable while storage is up, instead of only checking whether the system is on or off. This metric lets us compare six combinations of HA, LM, and rejuvenation policies and quantify how much capacity is preserved under different failure and aging patterns.
- Systematic sensitivity analysis over timing parameters. We vary twelve timing parameters (failure, detection, repair, migration, and rejuvenation times) and quantify their effect on COA across all policy scenarios.
- Model-informed orchestration guidelines. We translate the SRN structure and COA analysis into practical guidelines for coordinating HA, live migration, and rejuvenation, with emphasis on enforcing an evacuation-before-reboot ordering and on selecting trigger intervals consistent with migration delays.
- Approach overview. We model the MEC-based distributed edge storage architecture of Section 3 using an SRN whose structure follows directly from an explicit set of modeling assumptions (Section 4.2). The SRN is built as three coupled submodels: (i) an edge-node submodel that captures hardware failures, detection, repair, and node-level rejuvenation; (ii) a microservice submodel that captures microservice failures, aging, self-rejuvenation, and dependency on the hosting node; and (iii) an inter-node policy submodel that encodes HA, live migration, and their interaction with rejuvenation triggers. We solve the induced continuous-time Markov chain (CTMC) with Mercury’s steady-state solver [22], and we report COA as a reward, which counts the expected number of usable microservices while storage remains operational. Results and sensitivity studies are presented in Section 5, and we interpret their implications for MEC orchestration in Section 6.
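As a concrete (toy) illustration of this solution step, the sketch below builds the generator Q of a two-state up/down node, solves π Q = 0 for the steady-state distribution, and evaluates a capacity reward that counts microservices only while the node is up. It is a minimal stand-in for Mercury's solver, not the paper's actual SRN; all rates are illustrative placeholders.

```python
# Illustrative sketch (not the paper's Mercury model): solve pi Q = 0 for a
# tiny CTMC and compute a capacity-oriented reward, as in Section 4.
# All rates below are hypothetical placeholders.

def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gaussian elimination (pure Python)."""
    n = len(Q)
    # Work on Q^T x = 0, replacing the last balance equation by normalization.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]  # transpose of Q
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Forward elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = b[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / A[r][r]
    return x

# Two-state node: state 0 = up, state 1 = down (hypothetical rates, 1/h).
lam, mu = 1.0 / 8760.0, 1.0  # MTTF = 8760 h, MTTR = 1 h
Q = [[-lam, lam],
     [mu, -mu]]
pi = steady_state(Q)
m = 1  # microservices hosted while the node is up
coa = m * pi[0]  # reward: usable microservices, counted only in the up state
```

For the full SRN, the same computation runs over all reachable tangible markings rather than two hand-written states.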
2. Related Works
2.1. Migration and High Availability in MEC and Edge Storage
2.2. Software Aging and Rejuvenation in Virtualized and Containerized Systems
2.3. Cloud-Native Self-Healing and Planned Maintenance as Rejuvenation
2.4. Gap Addressed by This Work
| NO. | Research Focus | Method | Rejuvenation? | LM/HA? | Contribution |
|---|---|---|---|---|---|
| [10] | Survey of task migration strategies for user mobility in MEC environments | Literature review and taxonomy (delay, energy, trade-off) | No. Aging is not considered. | Partial. Task offloading, not VM/container LM or HA | Focuses on offloading optimization; no VM LM or aging-based rejuvenation |
| [17] | Availability modeling treating VM migration as rejuvenation in virtualized systems | Stochastic models, assuming VM migration based on remote storage. | Yes. LM for VMM aging | Yes. LM is used as a core technology to enable REJ. | A single-node standby model based on cloud data centers. |
| [18] | Cloud-native NFV-MANO availability analysis. | Stochastic Activity Networks (SAN). | Yes. Aging + restart/reboot rejuvenation | Partial. Container respawn HA, no VM LM | Models software aging and container respawn separately. Focuses on NFV-MANO. |
| [16] | Evaluates VMM rejuvenation via live migration for cloud availability. | Extended Stochastic Petri Nets (SPNs) and Reliability Block Diagrams (RBDs). | Yes. Time-based VMM rejuvenation through VM migration. | Yes. Live migration minimizes rejuvenation downtime. | Targets centralized cloud datacenters, not distributed edge environments. |
| [15] | OS-level software aging impact on container service dependability and performance. | Semi-Markov Process (SMP) modeling. | Yes. OS reboot and live container migration as rejuvenation techniques. | Yes. Live container migration minimizes rejuvenation downtime. | Models backup resource aging/failure; determines optimal migration trigger intervals. |
| Our work | Proactive HA modeling for distributed edge storage: integrating reactive HA/LM with REJ | Integrated SRN modeling with 6 policy scenarios and sensitivity analysis | Yes. Models aging/rejuvenation at both edge node and microservice levels | Yes. Models both failure recovery (HA) and proactive LM | Explicitly models the race condition when integrating HA/LM and REJ |
3. System Architecture
- User Domain: includes wearable computing devices, haptic devices, or IoT sensors worn or used by users. These devices continuously generate user input and receive final services from edge nodes.
- Edge Node Domain: a distributed network composed of n edge-storage nodes capable of performing computing and data management.
- High availability and live migration: recovers services upon unpredictable failures based on HA/LM policies.
- Software rejuvenation: proactively manages the system before failures occur based on software aging and rejuvenation models.
- Proactive HA workflow. In the proposed architecture, each edge node continuously monitors both abrupt failures and aging signals. When an edge node reaches an aging threshold or its rejuvenation trigger time elapses, the orchestrator attempts to evacuate its microservices to neighboring nodes with spare capacity (LM). If the evacuation completes before the reboot, the node rejuvenates safely and then rejoins the pool. If evacuation is preempted (e.g., due to headroom limits or timeouts), rejuvenation may start first and clear the remaining microservices. In contrast, if a sudden node or microservice failure occurs, HA triggers a reactive failover or redeployment. The SRN model in Section 4 encodes these two paths and their interaction. At the model level, we capture whether a node-level rejuvenation trigger follows an evacuation-first (LM) or reboot-first (REJ) behavior by the probability parameter (Table 4).
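The evacuation-before-reboot ordering described above can be sketched as a small control loop. The function and field names below are hypothetical, not from any real orchestrator API; evacuation stops when neighbor headroom or the rejuvenation deadline runs out, which is exactly the situation in which microservices are terminated prematurely.

```python
# Hypothetical control-flow sketch of the proactive HA workflow described
# above; function and parameter names are illustrative, not from a real API.

def handle_rejuvenation_trigger(services, neighbor_headroom, deadline, migration_time):
    """Evacuate as many microservices as capacity and time allow, then reboot.

    Returns (migrated, terminated): services moved to neighbors via LM versus
    services lost because the reboot started before evacuation finished.
    """
    migrated, elapsed = [], 0.0
    for svc in services:
        # Evacuation stops when neighbors have no headroom or the
        # rejuvenation deadline would be exceeded (premature termination).
        if neighbor_headroom <= 0 or elapsed + migration_time > deadline:
            break
        migrated.append(svc)
        neighbor_headroom -= 1
        elapsed += migration_time
    terminated = [s for s in services if s not in migrated]
    # The node reboots here either way: safely if `terminated` is empty,
    # prematurely terminating the remaining microservices otherwise.
    return migrated, terminated

# Evacuation-first succeeds: enough headroom and time for both services.
ok = handle_rejuvenation_trigger(["ms1", "ms2"], neighbor_headroom=2,
                                 deadline=10.0, migration_time=1.0)
# Reboot-first outcome: headroom for only one service, so one is terminated.
pc = handle_rejuvenation_trigger(["ms1", "ms2"], neighbor_headroom=1,
                                 deadline=10.0, migration_time=1.0)
```

In the SRN, this outcome split is abstracted by a branch probability rather than simulated step by step.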
4. Stochastic Models
4.1. Abstraction
- Section roadmap. We first state the modeling assumptions and map them to SRN constructs (Section 4.2). We then define the edge-node and microservice submodels and their interaction layer, and we finally present the CTMC semantics and the COA reward used in the analysis.
- Notation and symbols. Let n be the number of edge-storage nodes, indexed by i ∈ {1, …, n}; we also use (i, j) when referring to an ordered neighbor pair. Each node initially hosts m microservice instances and has a maximum capacity of M instances (Table 4). We denote by P and T the sets of places and transitions in the SRN. For any place p ∈ P, #p(t) is its marking, i.e., the number of tokens in p at time t; when t is clear, we write #p. The SRN induces a CTMC on reachable markings (tangible markings under standard SRN/GSPN semantics, i.e., any sequences of immediate transition firings are resolved instantaneously before a timed transition fires) with infinitesimal generator Q = [q(x, y)], where q(x, y) is the transition rate from marking x to marking y and q(x, x) = −∑ over y ≠ x of q(x, y). Timed transitions are modeled as exponential; a mean delay of 1/λ corresponds to a firing rate of λ. We use 1{·} for an indicator function. In particular, storage availability is represented by an up place and a failed place, and an indicator on the storage-up place ensures that capacity is counted only when storage is operational. Finally, binary policy places enable or disable HA, LM, and REJ in the scenario definitions of Section 5.
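To make the notation concrete, a marking can be viewed as a token count per place, with guards and rewards as predicates and functions over markings. The place names below are illustrative stand-ins (not the paper's actual place names), and M_CAP plays the role of the per-node maximum capacity in Table 4.

```python
# Minimal illustration of the notation above: a marking assigns a token
# count to each place; guards and rewards are functions of the marking.
# Place names are hypothetical stand-ins for the paper's SRN places.

M_CAP = 2  # maximum microservices per node (Table 4)

marking = {
    "storage_up": 1,    # storage layer available
    "node1_up": 1,
    "node1_ms_run": 1,  # running microservices on node 1
    "node2_up": 0,      # node 2 currently failed
    "node2_ms_wait_ha": 1,
}

def guard_ha_move(m):
    """Enable an HA move onto node 1 only if node 1 is up and has
    capacity headroom (#running < M_CAP)."""
    return m["node1_up"] == 1 and m["node1_ms_run"] < M_CAP

def coa_reward(m):
    """Reward: usable microservices, counted only while storage is up."""
    indicator = 1 if m["storage_up"] == 1 else 0
    return indicator * m["node1_ms_run"]
```

The COA metric of Section 5 is the expectation of such a reward under the steady-state distribution over markings.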
4.2. Modeling Assumptions and SRN Formulation
- A1.
- A2. Capacity abstraction for microservices. Each node hosts m microservice tokens initially and can accept migrated tokens only if it has spare capacity up to its maximum capacity M. SRN encoding: initial markings place m tokens in the normal microservice place; capacity headroom is enforced by guards (e.g., a condition requiring the target node's token count to stay below M before enabling an incoming HA/LM move).
- A3. Sudden failures and repairs are memoryless and independent. Edge-node, microservice, and storage failures/repairs are modeled as exponential timed transitions with rates derived from MTTF/MTTR parameters, and their occurrence is independent across components. SRN encoding: the edge-node, microservice, and storage failure/repair transitions use the corresponding rates in Table 4, yielding a CTMC semantics on the reachable markings.
- A4. Software aging abstraction. Aging is represented as a progression between healthy, failure-probable, and aging-failure states, with optional recovery and rejuvenation actions. SRN encoding: microservice places for the healthy, failure-probable, and aging-failure states, together with timed aging, recovery, and rejuvenation transitions; node-level rejuvenation uses a clock-trigger place that enables rejuvenation-related transitions.
- A5. Failure detection is explicit. Detection delay is modeled separately from failure occurrence. SRN encoding: edge-node detection places/transitions with the mean delay parameters in Table 4; in this model, HA/LM guards are triggered when a node leaves the up place; detection delay contributes to the repair path and may be absorbed into the configured HA/LM transfer times when appropriate.
- A6. Policies are configuration switches. HA, LM, node rejuvenation, and microservice rejuvenation can be enabled/disabled to form the six scenarios. SRN encoding: policy places are referenced in guard functions so that the same SRN structure supports multiple policy configurations.
- A7. Proactive HA ordering. When a node-rejuvenation trigger occurs, the intended ordering is: trigger → migrate out → rejuvenate the node once empty. SRN encoding: the LM transition is enabled by the rejuvenation-trigger place; in the LM-first branch (selected with the evacuation-first probability described below), the node enters rejuvenation only after an empty-node condition (no microservice tokens remaining on the node) holds; in the REJ-first branch, this guard is bypassed to represent a forced reboot that may terminate remaining microservices.
- A8. Race abstraction between LM and rejuvenation. The model admits a concurrent situation in which rejuvenation can begin before LM completes, leading to microservice termination. SRN encoding: immediate transitions move microservice tokens into stop/down places when the node enters rejuvenation while LM is still in progress; the detailed enable/disable logic is captured via guard functions.
- SRN formulation roadmap. Given these assumptions, the SRN model is built in three steps: (i) we define local SRN submodels for edge nodes and microservices that capture failure, repair, and rejuvenation dynamics; (ii) we couple n instances of these submodels using inter-node transitions that represent HA and LM across neighbors; and (iii) we define a reward function that counts usable microservice instances while the storage layer is operational, yielding COA.
4.3. Edge Node Submodel
- Basic failure and recovery: Edge Node i starts with one token in the normal (up) state. As time passes, the failure transition fires after the node's mean time to failure and moves the token to the failure state. The failure is detected after the mean detection time, resulting in the detected-failure state. Finally, the node recovers after the mean repair time and returns to the up state.
- Edge node rejuvenation: the rejuvenation-trigger transition, which tracks the time for software aging in Figure 3a, fires to initiate rejuvenation. When the rejuvenation signal is detected, an immediate transition consumes the token in the up state and moves it to the rejuvenation-ready state. Once this state is confirmed, the rejuvenation transition fires after the mean rejuvenation execution time and returns the node to the up state. After the edge node recovers, an immediate transition resets the rejuvenation clock.
4.4. Microservice Submodel
- Normal state and aging: microservices initially have m tokens in the normal place. Due to software aging, the aging transition fires after the mean aging time and moves a token to the aging failure-probable state, from which the microservice recovers according to the recovery-time parameter (Table 4). Additionally, even in the failure-probable state, a sudden failure can occur (mean delay: the microservice MTTF) and is recovered through the microservice repair transition.
- Infra dependency: when Edge Node i goes down (i.e., enters the failure, detected-failure, or rejuvenation state), immediate transitions put its microservices into the down state. When the edge node returns to the up state, the corresponding recovery time passes and the microservices return to the normal state.
- Self-rejuvenation of microservices: each microservice generates its own rejuvenation trigger through a dedicated clock place and transition. When this trigger is generated (and microservice rejuvenation is enabled), it is first latched into a clock-side state. Then, depending on whether the microservice is currently in the normal or the failure-probable state, an immediate transition moves it to the self-rejuvenation state. After rejuvenation completes, the microservice clock is reset. After the mean rejuvenation execution time passes, the rejuvenation transition fires and the microservice returns to the normal state. This flow models the planned downtime of microservices.
4.5. System Interactions: HA, LM, and SW Rejuvenation
- Reactive high availability: when edge node i fails and its up place becomes empty, the corresponding immediate transitions fire. The microservice tokens move to the HA migration waiting state. After the mean HA failover time, which is the time required for microservices to move to an adjacent edge node, the failover transition fires and the microservices successfully fail over to the normal place of edge node j.
- Proactive live migration: when the rejuvenation signal of edge node i occurs, the control plane may either evacuate microservices first via LM (evacuation-first) or start node rejuvenation immediately (reboot-first). In the SRN, we represent this ordering by a probabilistic choice with the evacuation-first probability of Table 4; the default setting is 0.5.
- Service destruction and recovery by rejuvenation: as soon as the edge node enters the rejuvenation state, reset transitions fire to create a clean state for rejuvenation. These reset transitions forcibly remove tokens from all microservice states (normal, failure-probable, etc.) and move them to a clearing place. Tokens in this place are consumed by a transition with no output arc, so they are destroyed; after the rejuvenation work is completed, the post-rejuvenation transition fires. This recovers the destroyed microservices through a booting process to operate in the normal state.
- the place holding microservices of node i in the normal running state;
- the place holding aging but still running microservices of node i;
- the place holding microservices waiting for reactive HA migration from failed node i to its neighbor j;
- policy places that encode whether each mechanism is enabled in the considered scenario.
- Immediate-transition precedence. When an edge node fails, local dependency transitions in the per-node microservice submodel (the microservice-down transitions in Figure 4) and inter-node relocation transitions may be enabled at the same time. To enforce the intended policy semantics, we assign a higher firing priority to the inter-node HA/LM relocation transitions than to the local "microservice down" transitions. Infeasible relocations (e.g., no neighbor capacity) leave tokens to be handled by the local down transitions, which capture capacity loss.
- proactive live migration on all outgoing arcs from node i, enabled while the rejuvenation trigger holds;
- the immediate pre-rejuvenation transition, which moves the edge node into the rejuvenation state.
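The race between evacuation and rejuvenation encoded by these transitions can also be illustrated with a small Monte Carlo sketch. The parameter values are illustrative and the delays are exponential, as in the SRN: with probability p the node drains fully before rebooting, and otherwise each remaining microservice survives only if its migration happens to complete before the reboot starts.

```python
# Monte Carlo sketch of the LM-vs-rejuvenation race abstracted in the SRN.
# Parameter values are illustrative. In the evacuation-first branch
# (probability p) the node waits for LM to finish; in the reboot-first
# branch each microservice survives only if its (exponential) migration
# time is shorter than the (exponential) delay before the reboot starts.

import random

def survival_fraction(p, t_lm, t_reboot, m=1, trials=100_000, seed=7):
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        if rng.random() < p:              # evacuation-first: node drains fully
            survived += m
        else:                             # reboot-first: per-service race
            reboot_at = rng.expovariate(1.0 / t_reboot)
            for _ in range(m):
                if rng.expovariate(1.0 / t_lm) < reboot_at:
                    survived += 1
    return survived / (trials * m)

# With a 1 s mean migration time and a 10 s mean reboot delay, the analytic
# per-service survival probability in the reboot-first branch is 10/11, so
# the overall expectation at p = 0.5 is 0.5 + 0.5 * 10/11 ≈ 0.955.
est = survival_fraction(p=0.5, t_lm=1.0, t_reboot=10.0)
```

Setting p = 1 recovers the intended evacuation-before-reboot ordering, in which no microservice is lost to the race.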
| Name | Associated Transition | Conditional Statement of Immediate Transitions 1 |
|---|---|---|
| gTkm | If | |
| gTkmfp | If | |
| If | ||
| (Other HA transitions follow the same pattern: source index k, target index (mod n)) | ||
| gTkeprerej | If | |
| gTkeprrej | If | |
| gTkeresclk | If | |
| gTkestop | If | |
| gTkerejint | If | |
| gTkmfpdn | If | |
| gTkmprerej | If | |
| gTkmprrej | If | |
| gTkmclkstart | If | |
| gTkmclkstop | If | |
| gTkmo | If | |
| gTkmdnrejo | If | |
| gTkmafrej | If | |
| gTkmresclk | If | |
| gTkmrejint | If |
| Name | Associated Transition | Conditional Statement of Timed Transitions 1 |
|---|---|---|
| gTklmfp | If | |
| If | ||
| (Other LM transitions follow the same pattern: source index k, target index (mod n)) | ||
| gTkerej | If | |
| gTkmaf | If | |
| gTkmboot | If | |
| gTkmr | If | |
| gTkmrej | If |
| Parameters | Related Transitions | Description | Default Value |
|---|---|---|---|
| | | mean time to failure (MTTF) of an edge node | 8760 h |
| MTTF_Microservice | | mean time to failure (MTTF) of a microservice | 1258 h |
| | | mean time to failure (MTTF) of edge storage | 40,000 h |
| | | mean time to detect edge-node failure | 10 s |
| | | mean time to repair (MTTR) of an edge node | 1 h |
| MTTR_Microservice | | mean time to repair (MTTR) of a microservice | 0.238 h |
| | | mean time to repair (MTTR) of edge storage | 5 h |
| | | mean time to complete HA failover of a microservice | 2 min |
| | | mean time to complete live migration of a microservice (LM relocation time) | 1 s |
| | | mean time to trigger edge-node rejuvenation | 555 h |
| | | mean time to execute edge-node rejuvenation | 2 min |
| | | mean time to trigger microservice rejuvenation | 80 h |
| | | mean time to execute microservice rejuvenation | 1 min |
| | | mean time from the failure-probable state to aging failure | 72 h |
| | | mean time to enter the failure-probable state due to aging | 168 h |
| | | mean time to recover from the failure-probable state (resource cleanup) | 30 s |
| | | mean time to boot a microservice after edge-node rejuvenation | 5 s |
| m | | baseline number of microservice instances initially hosted per edge node | 1 |
| | | maximum microservice capacity per edge node (in number of instances) | 2 |
| | Equation (37) | probability of selecting the evacuation-first (LM) branch when a node-level rejuvenation trigger occurs | 0.5 |
5. Experiment and Analysis
5.1. Default Input Parameters and Model Configuration
- External grounding of key timing parameters. Several values in Table 4 correspond to control-plane timers and measured recovery latencies in container-orchestrated edge platforms. The detection-time parameter models the delay between an edge-node failure and its detection by the control plane. In Kubernetes, a node's Ready condition becomes Unknown if the controller has not heard from the node within node-monitor-grace-period (default 50 s), based on the kubelet heartbeat mechanism (Lease renewals, default every 10 s) and the node controller timers [25,26,27]. These defaults are version-dependent and can be tuned. After the node is tainted as node.kubernetes.io/not-ready or node.kubernetes.io/unreachable, the default system-added tolerations set tolerationSeconds=300 for these NoExecute taints, which bounds the default eviction delay in untuned clusters [28,29]. Taken together, these control-plane timers imply that end-to-end workload replacement can take tens of seconds to minutes in untuned clusters. In this SRN (Assumption A5), HA relocation is triggered when a node leaves the up place, so the workload-visible replacement latency is captured by the HA failover time, which can be calibrated to include both detection and post-detection actions (eviction delay, scheduling, image pull, and service warm-up); the detection time itself remains on the infrastructure repair path. At the edge, measured failover can also be on the order of seconds under an optimized setup (e.g., RTO ≈ 10.03 s end-to-end, with the failover stage ≈ 8.29 s in a StarlingX-based edge testbed) [30], while fast fault-detection extensions for containerized IoT services can reduce detection to sub-second levels in experimental settings (e.g., average detection time ≈ 0.84 s) [31]. Finally, the LM relocation-time parameter gives the mean time to complete a microservice relocation step via live migration in the SRN.
Empirical VM measurements indicate that the switchover downtime of post-copy migration can be sub-second (e.g., 0.2–0.7 s under 10 Gb/s interfaces) [32], so we treat service interruption during LM as negligible relative to other control-plane timers and focus on the relocation latency that governs how quickly a node can be drained. These mappings interpret the baseline values in Table 4 and motivate the ranges used in the sensitivity analysis (Table 7), while also clarifying how to calibrate SRN rates from platform logs and testbed measurements.
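Under these references, a rough calibration of the HA failover timer from the cited Kubernetes defaults can be sketched as follows. The scheduling/boot term is an assumed placeholder, and the tuned figure is the measured end-to-end RTO from [30].

```python
# Back-of-the-envelope calibration of the HA failover timer from the
# Kubernetes control-plane defaults cited above. The scheduling/boot
# figure is an assumed placeholder; the others are documented defaults.

node_monitor_grace_s  = 50    # node marked NotReady/Unknown (default)
toleration_seconds    = 300   # default tolerationSeconds for NoExecute taints
reschedule_and_boot_s = 30    # assumed: scheduling + image pull + warm-up

# Untuned worst case: detection + eviction delay + replacement actions.
t_ha_untuned_s = node_monitor_grace_s + toleration_seconds + reschedule_and_boot_s

# Tuned, edge-optimized case, in line with the measured RTO of ~10 s [30].
t_ha_tuned_s = 10.03
```

Table 4's default of 2 min (120 s) sits between these tuned and untuned figures, which is consistent with treating it as a workload-visible replacement latency.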
5.2. Experiment Metrics: COA
5.3. Stationary Analysis
- Mapping of scenarios to common HA baselines. Scenario (ii) corresponds to reactive failover/rescheduling: after an edge node failure is detected, microservices are restarted/rescheduled on neighboring nodes with spare capacity (no execution-state preservation). This is conceptually close to primary–standby activation (warm/cold standby) and to Kubernetes-style recovery, where workloads are recreated/rescheduled after node failure detection and eviction. Scenario (iii) adds proactive evacuation by live migration prior to planned node rejuvenation, analogous to draining a node before reboot in planned maintenance workflows (e.g., cordon+drain in Kubernetes or VM host maintenance with live migration). Scenario (iv) represents node rejuvenation without prior evacuation, analogous to maintenance that starts before draining completes. Scenario (vi) integrates HA, LM, and REJ, corresponding to evacuation-first maintenance combined with failure-driven failover. Scenario (i) is the baseline with no HA, and Scenario (v) is rejuvenation without LM.
- Relative gains. Table 6 reports the absolute COA differences and percentage gains of HA+LM (Case (iii)) over HA-only (Case (ii)) and over the baseline (Case (i)) under identical failure rates (default parameters). Under these low failure rates, the gain of HA+LM over HA-only is below 0.13%, and in part of the tested range the difference is below 0.001%. When failures are frequent, the effect size of LM becomes much larger (see the sensitivity analysis for edge-node MTTF). The percentages are computed from Table 5 and rounded to four decimal places.
5.4. Sensitivity Analysis
| Parameter | Test Range | Case (i) Without HA, LM and REJ | Case (ii) with Only HA | Case (iii) with HA and LM | Case (iv) with Only REJ | Case (v) with HA and REJ | Case (vi) with HA, LM and REJ |
|---|---|---|---|---|---|---|---|
| | 0.6–40 s | 5.6174 (<0.004) | 5.6300 (<0.002) | 5.6381 (<0.003) | 5.5212 (<0.003) | 5.3251 (<0.007) | 5.5321 (<0.003) |
| | 0.1–3 h | 5.6164 (<0.002) | 5.6297 (<0.003) | 5.6378 (<0.003) | 5.5213 (<0.003) | 5.3255 (<0.006) | 5.5326 (<0.004) |
| | 3–60 min | 5.6160 (<0.003) | 5.6298 (<0.005) | 5.6377 (<0.004) | 5.5206 (<0.004) | 5.3236 (<0.005) | 5.5323 (<0.004) |
| | 0.6–20 h | 5.6166 (<0.006) | 5.6299 (<0.006) | 5.6376 (<0.006) | 5.5210 (<0.005) | 5.3235 (<0.008) | 5.5327 (<0.006) |
| | 10–180 s | 5.6171 (<0.003) | 5.6296 (<0.002) | 5.6379 (<0.002) | 5.5213 (<0.005) | 5.3229 (<0.010) | 5.5325 (<0.006) |
- Why detection and repair parameters show low sensitivity in steady-state. Table 7 shows that varying the detection time, the repair times (node, microservice, and storage), and the microservice rejuvenation duration changes COA by less than 0.01 across the tested ranges. This is expected for a steady-state time-average metric: these parameters mainly affect the duration of a degradation episode, while failure rates and rejuvenation trigger intervals determine the frequency of such episodes. A first-order alternating-renewal approximation for a single component gives A = MTTF / (MTTF + MTTD + MTTR), so when MTTF is much larger than MTTD and MTTR, steady-state availability changes only in the fourth-to-sixth decimal place under realistic sweeps. For example, with the default MTTF = 8760 h, changing the detection time from 0.6 to 40 s changes the unavailability fraction by about 1.25 × 10⁻⁶, and changing the repair time from 0.1 to 3 h changes it by about 3.3 × 10⁻⁴. These magnitudes explain why the COA ranges in Table 7 stay below 0.01. In scenarios with HA/LM, capacity loss is further bounded because microservices can be relocated to neighboring nodes soon after a failure is detected, so COA becomes even less sensitive to the subsequent repair duration.
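These magnitudes can be checked numerically from the alternating-renewal approximation A = MTTF / (MTTF + MTTD + MTTR) with the Table 4 defaults:

```python
# Numeric check of the alternating-renewal approximation
# A = MTTF / (MTTF + MTTD + MTTR) using the Table 4 default values.

def availability(mttf_h, mttd_h, mttr_h):
    return mttf_h / (mttf_h + mttd_h + mttr_h)

MTTF = 8760.0  # edge-node MTTF (h)

# Sweep the detection time from 0.6 s to 40 s (MTTR fixed at its 1 h default):
# change in the unavailability fraction 1 - A.
dU_det = (1 - availability(MTTF, 40 / 3600, 1.0)) - \
         (1 - availability(MTTF, 0.6 / 3600, 1.0))

# Sweep the repair time from 0.1 h to 3 h (detection fixed at its 10 s default).
dU_rep = (1 - availability(MTTF, 10 / 3600, 3.0)) - \
         (1 - availability(MTTF, 10 / 3600, 0.1))
```

Both differences stay far below the 0.01 COA resolution of Table 7, in line with the argument above.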
6. Discussion
6.1. Effect of HA and Live Migration on COA
6.2. Implications of the PC Phenomenon
- Mapping to Kubernetes, KubeEdge, and OpenStack. In Kubernetes, planned node maintenance is typically executed as cordon+drain; drain uses the Eviction API and respects PodDisruptionBudgets and termination grace periods, then the node can be rebooted and returned to service [19,20,21]. In this sense, the evacuation-first branch corresponds to drain/evacuation-first coordination, while the reboot-first branch corresponds to forced disruption (e.g., a reboot that proceeds before drain completes, or forced deletion when drain is blocked or times out). Kubernetes does not provide transparent live migration for generic containers by default; however, when VM live migration is used in a Kubernetes setting (e.g., via KubeVirt), the same migrate-before-reboot pattern applies [8]. KubeEdge extends Kubernetes semantics to edge deployments with a cloud–edge split and supports operation under intermittent connectivity [4,5], which may reduce the chance that evacuation completes within a deadline and can increase the practical likelihood of a forced restart path. In OpenStack, Nova supports live migration for planned maintenance and evacuation for host failure [7,33]; Nova also exposes migration completion timeout policies [7], which aligns naturally with interpreting the evacuation-first probability as the chance that migration/evacuation completes before a forced maintenance action proceeds.
6.3. Parameter Sensitivity and Prioritization
- Relation to container orchestration and MEC reliability. The above ranking matches how Kubernetes-like orchestrators are engineered: liveness detection and reconciliation run on seconds-to-minutes time constants, while node hardware failures and software aging typically occur on much longer time scales. For example, Kubernetes marks an unresponsive node unhealthy after a grace period (default node-monitor-grace-period=50s) and then applies eviction/toleration logic (default tolerationSeconds=300s for NoExecute on not-ready/unreachable) before rescheduling workloads [25,26,28]. Therefore, moderate variation in these timers has a limited effect on long-run capacity when MTTF is large, while failure/aging rates and maintenance frequency dominate. In MEC settings, intermittent connectivity and constrained resources can stretch effective detection and repair times and can create false-positive liveness events, which can amplify the effect of orchestration timeouts [1,10]. The present SRN assumes perfect failure detection and does not explicitly model network partitions; adding partition/false-positive states is a direct extension for studying this regime.
6.4. Design Guidelines for MEC-Based Edge Storage
- Industrial mapping. The building blocks represented in our SRN correspond to standard operational controls in current edge stacks. For Kubernetes-based edge clusters, planned node maintenance typically follows a cordon/drain workflow (e.g., kubectl drain), where pods are safely evicted before the node is powered down, and PodDisruptionBudgets bound the number of concurrent voluntary disruptions [19,21]. Tools such as Kured automate safe reboots by cordoning and draining a node before reboot and then uncordoning it [34]. For VM-based MEC deployments, live migration is directly supported by widely deployed virtualization platforms (e.g., OpenStack) and by Kubernetes-based VM orchestration (KubeVirt) [7,8]. In this context, the PC regime should be interpreted as miscoordinated maintenance (reboot before evacuation completes or forced evacuation with insufficient timeouts), rather than a claim that real systems intentionally implement a fixed 50/50 race.
- Live migration should be treated as a first-class mechanism when building high availability for microservice-based edge storage. The steady improvement in COA across all failure and aging regimes indicates that migration provides a robust benefit even when hardware is relatively reliable.
- Node-level rejuvenation must be carefully coordinated with migration. Before rebooting an aging edge node, the orchestrator should verify that all microservice instances have been migrated or drained so that capacity is not lost through PC events.
- Rejuvenation trigger intervals should be tuned separately at the node and microservice levels. In this study, overly frequent node-level rejuvenation reduces COA, whereas shorter microservice trigger intervals improve COA within the tested range.
- When storage reliability is a concern, as in the low storage-MTTF region of Figure 6c, the overall COA becomes dominated by storage uptime (Equation (42)); HA/LM and rejuvenation can only improve the conditional microservice capacity when storage is up. In our SRN, periodic restarts may reduce the post-repair lag after storage recovers, but they cannot compensate for storage downtime.
- The model assumes some spare capacity on each node, encoded by the headroom between the initial number of instances m and the maximum per-node capacity (Table 4). In practice, migration-based high availability requires similar headroom, either in the form of reserved capacity on neighboring nodes or dedicated standby nodes.
6.5. Validation Strategy and External Grounding
- (i)
- Structural validation by reduction. The policy places in Figure 5 act as switches that enable or disable groups of transitions (Table 2, Table 3 and Table 4). Disabling live migration and rejuvenation collapses the reachable markings to the reactive HA-only case; disabling HA and live migration collapses them to the rejuvenation-only case. This reduction check confirms that the integrated SRN preserves the intended semantics of its component submodels and of the reward definition used for COA.
- (ii)
- Parameter grounding and qualitative trend checks. We map key SRN timers to orchestration timers and measurements. Under these mappings, the predicted ordering of policies is consistent with observations: proactive migration can keep service interruption short relative to failure-driven relocation, while fault detection and eviction timers mainly matter when failures are frequent or when the system is capacity-constrained. This is consistent with our sensitivity-analysis result that detection delay and repair time have limited effects on long-run COA within reasonable engineering ranges.
- (iii)
- Empirical validation workflow and limitations. This work does not include a dedicated MEC testbed experiment, so we do not claim a one-to-one match between the absolute COA values and any specific deployment. In future work (Section 6.6), we will validate SRN-predicted trends against testbed traces by (a) running the policy cases on a small MEC cluster, (b) logging failure, recovery, migration, and rejuvenation events, (c) computing an empirical COA as a time average of the number of ready microservice replicas multiplied by a storage-health indicator, and (d) fitting SRN parameters (or phase-type distributions) to observed traces, then comparing predicted sensitivities and policy ranking.
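Step (c) of this workflow amounts to a simple time average over logged samples. The sketch below uses a synthetic trace with hypothetical timestamps; in practice the samples would come from orchestrator and testbed logs.

```python
# Sketch of the empirical COA computation: a time average of ready replicas
# gated by a storage-health indicator. The trace below is synthetic.

def empirical_coa(samples):
    """`samples` is a list of (t, ready_replicas, storage_up) tuples sorted
    by time; each sample holds until the next timestamp (the last entry
    only marks the end of the observation window)."""
    total, weighted = 0.0, 0.0
    for (t0, ready, storage_up), (t1, _, _) in zip(samples, samples[1:]):
        dt = t1 - t0
        total += dt
        weighted += dt * ready * (1 if storage_up else 0)
    return weighted / total

trace = [
    (0.0, 6, True),     # all 6 replicas ready
    (100.0, 5, True),   # one node drained for rejuvenation
    (120.0, 6, True),   # replica migrated back
    (200.0, 6, False),  # storage outage: capacity counts as zero
    (210.0, 6, True),
    (300.0, 6, True),   # end-of-window marker
]
coa_hat = empirical_coa(trace)
```

The resulting estimate can then be compared against the SRN-predicted COA for the matching policy scenario.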
6.6. Future Works
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646. [Google Scholar] [CrossRef]
- ETSI. GS MEC 003-V3.2.1-Multi-Access Edge Computing (MEC); Framework and Reference Architecture. ETSI Group Specification GS MEC 003, European Telecommunications Standards Institute (ETSI). 2024. V3.2.1 (2024-04). Available online: https://www.etsi.org/deliver/etsi_gs/MEC/001_099/003/03.02.01_60/gs_mec003v030201p.pdf (accessed on 15 February 2026).
- Qu, Q.; Xu, R.; Nikouei, S.Y.; Chen, Y. An Experimental Study on Microservices Based Edge Computing Platforms. arXiv 2020, arXiv:cs/2004.02372. [Google Scholar] [CrossRef]
- Xiong, Y.; Sun, Y.; Xing, L.; Huang, Y. Extend Cloud to Edge with KubeEdge. In Proceedings of the 2018 IEEE/ACM Symposium on Edge Computing (SEC), Seattle, WA, USA, 25–27 October 2018; pp. 373–377. [Google Scholar] [CrossRef]
- Zhao, H.; Liu, S.; Luo, K.; Chen, S.; Kong, L.; Jia, F. Research on application of edge computing system based on KubeEdge. Zhineng Kexue Yu Jishu Xuebao 2022, 4, 118–128. [Google Scholar] [CrossRef]
- Waseem, M.; Liang, P.; Shahin, M. A Systematic Mapping Study on Microservices Architecture in DevOps. J. Syst. Softw. 2020, 170, 110798. [Google Scholar] [CrossRef]
- OpenStack Nova Documentation Team. Live-Migrate Instances. Available online: https://docs.openstack.org/nova/latest/admin/live-migration-usage.html (accessed on 15 February 2026).
- KubeVirt Project. Live Migration. KubeVirt User Guide. Available online: https://kubevirt.io/user-guide/compute/live_migration/ (accessed on 19 February 2026).
- Venkatesh, R.S.; Smejkal, T.; Milojicic, D.S.; Gavrilovska, A. Fast In-Memory CRIU for Docker Containers. In Proceedings of the 2019 International Symposium on Memory Systems (MEMSYS’19), Washington, DC, USA, 30 September–3 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 53–65. [Google Scholar] [CrossRef]
- He, X.; Meng, M.; Ding, S.; Li, H. A Survey of Task Migration Strategies in Mobile Edge Computing. In Proceedings of the 2021 IEEE 6th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 24–26 April 2021; pp. 400–405. [Google Scholar] [CrossRef]
- Kim, D.S.; Hong, J.B.; Nguyen, T.A.; Machida, F.; Park, J.S.; Trivedi, K.S. Availability Modeling and Analysis of a Virtualized System Using Stochastic Reward Nets. In Proceedings of the 2016 IEEE International Conference on Computer and Information Technology (CIT), Nadi, Fiji, 8–10 December 2016; pp. 210–218. [Google Scholar] [CrossRef]
- Cotroneo, D.; Natella, R.; Pietrantuono, R.; Russo, S. Software Aging Analysis of the Linux Operating System. In Proceedings of the 2010 IEEE 21st International Symposium on Software Reliability Engineering, San Jose, CA, USA, 1–4 November 2010; pp. 71–80. [Google Scholar] [CrossRef]
- Araujo, J.; Matos, R.; Alves, V.; Maciel, P.; de Souza, F.V.; Matias, R., Jr.; Trivedi, K.S. Software Aging in the Eucalyptus Cloud Computing Infrastructure: Characterization and Rejuvenation. ACM J. Emerg. Technol. Comput. Syst. 2014, 10, 1–22. [Google Scholar] [CrossRef]
- Alonso, J.; Bovenzi, A.; Li, J.; Wang, Y.; Russo, S.; Trivedi, K.S. Software Rejuvenation: Do IT & Telco Industries Use It? In Proceedings of the 23rd IEEE International Symposium on Software Reliability Engineering Workshops (ISSRE Workshops), Dallas, TX, USA, 27–30 November 2012; pp. 299–304. [Google Scholar] [CrossRef]
- Bai, J.; Chang, X.; Machida, F.; Trivedi, K.S. Understanding Container-Based Services Under Software Aging: Dependability and Performance Views. IEEE Trans. Sustain. Comput. 2025, 10, 562–575. [Google Scholar] [CrossRef]
- Melo, M.D.T.d.; Maciel, P.R.M.; Araujo, J.; Matos Júnior, R.d.S.; Araújo, C. Availability study on cloud computing environments: Live migration as a rejuvenation mechanism. In Proceedings of the 2013 IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Budapest, Hungary, 24–27 June 2013; pp. 1–6. [Google Scholar] [CrossRef]
- Torquato, M.; Maciel, P.; Vieira, M. Availability and Reliability Modeling of VM Migration as Rejuvenation on a System under Varying Workload. Softw. Qual. J. 2020, 28, 59–83. [Google Scholar] [CrossRef]
- Tola, B.; Jiang, Y.; Helvik, B.E. On the Resilience of the NFV-MANO: An Availability Model of a Cloud-native Architecture. In Proceedings of the 2020 16th International Conference on the Design of Reliable Communication Networks DRCN 2020, Milano, Italy, 25–27 March 2020; pp. 1–7. [Google Scholar] [CrossRef]
- Kubernetes Authors. Safely Drain a Node. Available online: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/ (accessed on 15 February 2026).
- Kubernetes Authors. API-Initiated Eviction. Available online: https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/ (accessed on 15 February 2026).
- Kubernetes Authors. Disruptions. Available online: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/ (accessed on 24 February 2026).
- Maciel, P.R.M.; Matos Júnior, R.d.S.; Silva, B.; Figueiredo, J.; Oliveira, D.; Fé, I.; Maciel, R.; Dantas, J. Mercury: Performance and Dependability Evaluation of Systems with Exponential, Expolynomial, and General Distributions. In Proceedings of the 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC), Christchurch, New Zealand, 22–25 January 2017; pp. 50–57. [Google Scholar] [CrossRef]
- Lim, D.; Nguyen, T.A.; Min, D.; Choi, E.; Fe, I.; Silva, F.A.; Maciel, P. Metaverse Distributed Storages: High Availability Quantification Using Stochastic Reward Nets. In Proceedings of the 2025 International Conference on Metaverse Computing, Networking and Applications (MetaCom), Seoul, Republic of Korea, 27–29 August 2025; pp. 181–188. [Google Scholar] [CrossRef]
- Nguyen, T.A.; Kim, D.S.; Park, J.S. A Comprehensive Availability Modeling and Analysis of a Virtualized Servers System Using Stochastic Reward Nets. Sci. World J. 2014, 2014, 1–18. [Google Scholar] [CrossRef] [PubMed]
- Kubernetes Authors. Node Status. Available online: https://kubernetes.io/docs/reference/node/node-status/ (accessed on 15 February 2026).
- Kubernetes Authors. Kube-Controller-Manager. Available online: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/ (accessed on 15 February 2026).
- Kubernetes Authors. Node Heartbeats. Available online: https://kubernetes.io/docs/concepts/architecture/nodes/#node-heartbeats (accessed on 15 February 2026).
- Kubernetes Authors. Kube-Apiserver. Available online: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/ (accessed on 23 February 2026).
- Kubernetes Authors. Taints and Tolerations. Available online: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ (accessed on 15 February 2026).
- Abuibaid, M.; Ghorab, A.; Seguin-McPeake, A.; Yuen, O.; Yungblut, T.; St-Hilaire, M. Edge Workloads Monitoring and Failover: A StarlingX-Based Testbed Implementation and Measurement Study. IEEE Access 2022, 10, 97101–97116. [Google Scholar] [CrossRef]
- Yang, H.; Kim, Y. Design and Implementation of Fast Fault Detection in Cloud Infrastructure for Containerized IoT Services. Sensors 2020, 20, 4592. [Google Scholar] [CrossRef] [PubMed]
- Biswas, M.I.; Parr, G.; McClean, S.I.; Morrow, P.; Scotney, B.W. A Practical Evaluation in Openstack Live Migration of VMs Using 10Gb/s Interfaces. In Proceedings of the 2016 IEEE Symposium on Service-Oriented System Engineering (SOSE), Oxford, UK, 29 March–2 April 2016; pp. 346–351. [Google Scholar] [CrossRef]
- OpenStack Nova Documentation Team. Recover from a Failed Compute Node. Available online: https://docs.openstack.org/nova/latest/admin/node-down.html (accessed on 15 February 2026).
- The Kured Authors. Kured—Kubernetes Reboot Daemon. Available online: https://kured.dev/docs/ (accessed on 23 February 2026).
- Law, A.M. Simulation Modeling and Analysis, 5th ed.; McGraw-Hill Education: Columbus, OH, USA, 2015. [Google Scholar]
- Bobbio, A.; Horváth, A. Petri Nets with Discrete Phase Type Timing: A Bridge Between Stochastic and Functional Analysis. Electron. Notes Theor. Comput. Sci. 2002, 52, 209–226. [Google Scholar] [CrossRef]
COA (expected number of usable microservices) for each policy case, by m:

| m | Case (i): without HA, LM, and REJ | Case (ii): HA only | Case (iii): HA and LM | Case (iv): REJ only | Case (v): HA and REJ | Case (vi): HA, LM, and REJ |
|---|---|---|---|---|---|---|
| 1 | 5.61623178 | 5.630850437 | 5.638009995 | 5.521797121 | 5.323315107 | 5.533842655 |
| 2 | 11.55509444 | 11.58247815 | 11.5900559 | 11.14945372 | 10.91820502 | 11.08190684 |
| 3 | 17.46508118 | 17.50285648 | 17.5123491 | 16.774778 | 16.32616795 | 16.77606607 |
| 4 | 23.3549476 | 23.41224225 | 23.41208215 | 22.37593148 | 21.69431434 | 22.43628906 |
COA gain of Case (iii) relative to Cases (ii) and (i):

| m | COA(iii) − COA(ii) | Gain vs. (ii) | COA(iii) − COA(i) | Gain vs. (i) |
|---|---|---|---|---|
| 1 | 0.007160 | 0.1271% | 0.021778 | 0.3878% |
| 2 | 0.007578 | 0.0654% | 0.034961 | 0.3026% |
| 3 | 0.009493 | 0.0542% | 0.047268 | 0.2706% |
| 4 | −0.000160 | −0.0007% | 0.057135 | 0.2446% |
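The gain rows can be reproduced directly from the per-case COA values. The snippet below is a minimal sketch using the reported numbers; the variable and key names are ours, not the paper's:

```python
# Reproduce the COA-gain table from the per-case COA values (Cases i-iii).
coa = {
    "i":   [5.61623178, 11.55509444, 17.46508118, 23.3549476],
    "ii":  [5.630850437, 11.58247815, 17.50285648, 23.41224225],
    "iii": [5.638009995, 11.5900559, 17.5123491, 23.41208215],
}

rows = []
for m, (ci, cii, ciii) in enumerate(zip(coa["i"], coa["ii"], coa["iii"]), start=1):
    rows.append({
        "m": m,
        "iii_minus_ii": ciii - cii,                     # absolute gain over HA only
        "gain_vs_ii_pct": (ciii - cii) / cii * 100,     # relative gain over HA only
        "iii_minus_i": ciii - ci,                       # absolute gain over no policy
        "gain_vs_i_pct": (ciii - ci) / ci * 100,        # relative gain over no policy
    })

for r in rows:
    print(f"m={r['m']}: d(iii-ii)={r['iii_minus_ii']:+.6f} "
          f"({r['gain_vs_ii_pct']:+.4f}%), "
          f"d(iii-i)={r['iii_minus_i']:+.6f} ({r['gain_vs_i_pct']:+.4f}%)")
```

Note the sign flip at m = 4: adding LM to HA yields a slightly negative COA difference there, matching the −0.000160 entry in the table.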
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Nguyen, T.A.; Lim, D.; Kyung, M.; Min, D. Distributed Edge Storage Systems: Proactive High-Availability Microservices with Live Migration and Rejuvenation Strategies. Mathematics 2026, 14, 1704. https://doi.org/10.3390/math14101704