Distributed Edge Storage Systems: Proactive High-Availability Microservices with Live Migration and Rejuvenation Strategies
Abstract
1. Introduction
- Integrated SRN model for reactive and preventive fault management in edge storage. We build an SRN that unifies high availability, live migration, and software rejuvenation in a distributed edge storage architecture with microservices. The model captures aging, sudden failures, and rejuvenation at both edge node and microservice levels.
- Race-aware SRN formulation of the PC mechanism. We encode the contention between evacuation via live migration and node-level rejuvenation after the same trigger. This yields an SRN structure in which the trigger can lead either to successful evacuation or to premature termination of microservices.
- Capacity-oriented availability evaluation for multi-node edge storage. We adopt a reward-based definition of availability that counts microservices that remain usable while storage is up, instead of only checking whether the system is on or off. This metric lets us compare six combinations of HA, LM, and rejuvenation policies and quantify how much capacity is preserved under different failure and aging patterns.
- Systematic sensitivity analysis over timing parameters. We vary twelve timing parameters (failure, detection, repair, migration, and rejuvenation times) and quantify their effect on COA across all policy scenarios.
- Model-informed orchestration guidelines. We translate the SRN structure and COA analysis into practical guidelines for coordinating HA, live migration, and rejuvenation, with emphasis on enforcing an evacuation-before-reboot ordering and on selecting trigger intervals consistent with migration delays.
- Approach overview. We model the MEC-based distributed edge storage architecture of Section 3 using an SRN whose structure follows directly from an explicit set of modeling assumptions (Section 4.2). The SRN is built as three coupled submodels: (i) an edge-node submodel that captures hardware failures, detection, repair, and node-level rejuvenation; (ii) a microservice submodel that captures microservice failures, aging, self-rejuvenation, and dependency on the hosting node; and (iii) an inter-node policy submodel that encodes HA, live migration, and their interaction with rejuvenation triggers. We solve the induced continuous-time Markov chain (CTMC) with Mercury’s steady-state solver [22], and we report COA as a reward, which counts the expected number of usable microservices while storage remains operational. Results and sensitivity studies are presented in Section 5, and we interpret their implications for MEC orchestration in Section 6.
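As a concrete (toy) illustration of this solution step, the sketch below builds the generator Q of a two-state up/down node, solves π Q = 0 for the steady-state distribution, and evaluates a capacity reward that counts microservices only while the node is up. It is a minimal stand-in for Mercury's solver, not the paper's actual SRN; all rates are illustrative placeholders.

```python
# Illustrative sketch (not the paper's Mercury model): solve pi Q = 0 for a
# tiny CTMC and compute a capacity-oriented reward, as in Section 4.
# All rates below are hypothetical placeholders.

def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gaussian elimination (pure Python)."""
    n = len(Q)
    # Work on Q^T x = 0, replacing the last balance equation by normalization.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]  # transpose of Q
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Forward elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = b[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / A[r][r]
    return x

# Two-state node: state 0 = up, state 1 = down (hypothetical rates, 1/h).
lam, mu = 1.0 / 8760.0, 1.0  # MTTF = 8760 h, MTTR = 1 h
Q = [[-lam, lam],
     [mu, -mu]]
pi = steady_state(Q)
m = 1  # microservices hosted while the node is up
coa = m * pi[0]  # reward: usable microservices, counted only in the up state
```

For the full SRN, the same computation runs over all reachable tangible markings rather than two hand-written states.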
2. Related Works
2.1. Migration and High Availability in MEC and Edge Storage
2.2. Software Aging and Rejuvenation in Virtualized and Containerized Systems
2.3. Cloud-Native Self-Healing and Planned Maintenance as Rejuvenation
2.4. Gap Addressed by This Work
| NO. | Research Focus | Method | Rejuvenation? | LM/HA? | Contribution |
|---|---|---|---|---|---|
| [10] | Survey of task migration strategies for user mobility in MEC environments | Literature review and taxonomy (delay, energy, trade-off) | No. Aging is not considered. | Partial. Task offloading, not VM/container LM or HA | Focuses on offloading optimization; no VM LM or aging-based rejuvenation |
| [17] | Availability modeling treating VM migration as rejuvenation in virtualized systems | Stochastic models, assuming VM migration based on remote storage. | Yes. LM for VMM aging | Yes. LM is used as a core technology to enable REJ. | A single-node standby model based on cloud data centers. |
| [18] | Cloud-native NFV-MANO availability analysis. | Stochastic Activity Networks (SAN). | Yes. Aging + restart/reboot rejuvenation | Partial. Container respawn HA, no VM LM | Models software aging and container respawn separately. Focuses on NFV-MANO. |
| [16] | Evaluates VMM rejuvenation via live migration for cloud availability. | Extended Stochastic Petri Nets (SPNs) and Reliability Block Diagrams (RBDs). | Yes. Time-based VMM rejuvenation through VM migration. | Yes. Live migration minimizes rejuvenation downtime. | Targets centralized cloud datacenters, not distributed edge environments. |
| [15] | OS-level software aging impact on container service dependability and performance. | Semi-Markov Process (SMP) modeling. | Yes. OS reboot and live container migration as rejuvenation techniques. | Yes. Live container migration minimizes rejuvenation downtime. | Models backup resource aging/failure; determines optimal migration trigger intervals. |
| Our work | Proactive HA modeling for distributed edge storage: integrating reactive HA/LM with REJ | Integrated SRN modeling with 6 policy scenarios and sensitivity analysis | Yes. Models aging/rejuvenation at both edge node and microservice levels | Yes. Models both failure recovery (HA) and proactive LM | Explicitly models the race condition when integrating HA/LM and REJ |
3. System Architecture
- User Domain: includes wearable computing devices, haptic devices, or IoT sensors worn or used by users. These devices continuously generate user input and receive final services from edge nodes.
- Edge Node Domain: a distributed network composed of n edge-storage nodes capable of performing computing and data management.
- High availability and live migration: recovers services upon unpredictable failures based on HA/LM policies.
- Software rejuvenation: proactively manages the system before failures occur based on software aging and rejuvenation models.
- Proactive HA workflow. In the proposed architecture, each edge node continuously monitors both abrupt failures and aging signals. When an edge node reaches an aging threshold or its rejuvenation trigger time elapses, the orchestrator attempts to evacuate its microservices to neighboring nodes with spare capacity (LM). If the evacuation completes before the reboot, the node rejuvenates safely and then rejoins the pool. If evacuation is preempted (e.g., due to headroom limits or timeouts), rejuvenation may start first and clear the remaining microservices. In contrast, if a sudden node or microservice failure occurs, HA triggers a reactive failover or redeployment. The SRN model in Section 4 encodes these two paths and their interaction. At the model level, we capture whether a node-level rejuvenation trigger follows an evacuation-first (LM) or reboot-first (REJ) behavior by the probability parameter (Table 4).
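The evacuation-before-reboot ordering described above can be sketched as a small control loop. The function and field names below are hypothetical, not from any real orchestrator API; evacuation stops when neighbor headroom or the rejuvenation deadline runs out, which is exactly the situation in which microservices are terminated prematurely.

```python
# Hypothetical control-flow sketch of the proactive HA workflow described
# above; function and parameter names are illustrative, not from a real API.

def handle_rejuvenation_trigger(services, neighbor_headroom, deadline, migration_time):
    """Evacuate as many microservices as capacity and time allow, then reboot.

    Returns (migrated, terminated): services moved to neighbors via LM versus
    services lost because the reboot started before evacuation finished.
    """
    migrated, elapsed = [], 0.0
    for svc in services:
        # Evacuation stops when neighbors have no headroom or the
        # rejuvenation deadline would be exceeded (premature termination).
        if neighbor_headroom <= 0 or elapsed + migration_time > deadline:
            break
        migrated.append(svc)
        neighbor_headroom -= 1
        elapsed += migration_time
    terminated = [s for s in services if s not in migrated]
    # The node reboots here either way: safely if `terminated` is empty,
    # prematurely terminating the remaining microservices otherwise.
    return migrated, terminated

# Evacuation-first succeeds: enough headroom and time for both services.
ok = handle_rejuvenation_trigger(["ms1", "ms2"], neighbor_headroom=2,
                                 deadline=10.0, migration_time=1.0)
# Reboot-first outcome: headroom for only one service, so one is terminated.
pc = handle_rejuvenation_trigger(["ms1", "ms2"], neighbor_headroom=1,
                                 deadline=10.0, migration_time=1.0)
```

In the SRN, this outcome split is abstracted by a branch probability rather than simulated step by step.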
4. Stochastic Models
4.1. Abstraction
- Section roadmap. We first state the modeling assumptions and map them to SRN constructs (Section 4.2). We then define the edge-node and microservice submodels and their interaction layer, and we finally present the CTMC semantics and the COA reward used in the analysis.
- Notation and symbols. Let n be the number of edge-storage nodes, indexed by i ∈ {1, …, n}; we also use (i, j) when referring to an ordered neighbor pair. Each node initially hosts m microservice instances and has a maximum capacity of M instances (Table 4). We denote by P and T the sets of places and transitions in the SRN. For any place p ∈ P, #p(t) is its marking, i.e., the number of tokens in p at time t; when t is clear, we write #p. The SRN induces a CTMC on reachable markings (tangible markings under standard SRN/GSPN semantics, i.e., any sequences of immediate transition firings are resolved instantaneously before a timed transition fires) with infinitesimal generator Q = [q(x, y)], where q(x, y) is the transition rate from marking x to marking y and q(x, x) = −∑ over y ≠ x of q(x, y). Timed transitions are modeled as exponential; a mean delay of 1/λ corresponds to a firing rate of λ. We use 1{·} for an indicator function. In particular, storage availability is represented by an up place and a failed place, and an indicator on the storage-up place ensures that capacity is counted only when storage is operational. Finally, binary policy places enable or disable HA, LM, and REJ in the scenario definitions of Section 5.
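To make the notation concrete, a marking can be viewed as a token count per place, with guards and rewards as predicates and functions over markings. The place names below are illustrative stand-ins (not the paper's actual place names), and M_CAP plays the role of the per-node maximum capacity in Table 4.

```python
# Minimal illustration of the notation above: a marking assigns a token
# count to each place; guards and rewards are functions of the marking.
# Place names are hypothetical stand-ins for the paper's SRN places.

M_CAP = 2  # maximum microservices per node (Table 4)

marking = {
    "storage_up": 1,    # storage layer available
    "node1_up": 1,
    "node1_ms_run": 1,  # running microservices on node 1
    "node2_up": 0,      # node 2 currently failed
    "node2_ms_wait_ha": 1,
}

def guard_ha_move(m):
    """Enable an HA move onto node 1 only if node 1 is up and has
    capacity headroom (#running < M_CAP)."""
    return m["node1_up"] == 1 and m["node1_ms_run"] < M_CAP

def coa_reward(m):
    """Reward: usable microservices, counted only while storage is up."""
    indicator = 1 if m["storage_up"] == 1 else 0
    return indicator * m["node1_ms_run"]
```

The COA metric of Section 5 is the expectation of such a reward under the steady-state distribution over markings.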
4.2. Modeling Assumptions and SRN Formulation
- A1.
- A2. Capacity abstraction for microservices. Each node hosts m microservice tokens initially and can accept migrated tokens only if it has spare capacity up to its maximum capacity M. SRN encoding: initial markings place m tokens in the normal microservice place; capacity headroom is enforced by guards (e.g., a condition requiring the target node's token count to stay below M before enabling an incoming HA/LM move).
- A3. Sudden failures and repairs are memoryless and independent. Edge-node, microservice, and storage failures/repairs are modeled as exponential timed transitions with rates derived from MTTF/MTTR parameters, and their occurrence is independent across components. SRN encoding: the edge-node, microservice, and storage failure/repair transitions use the corresponding rates in Table 4, yielding a CTMC semantics on the reachable markings.
- A4. Software aging abstraction. Aging is represented as a progression between healthy, failure-probable, and aging-failure states, with optional recovery and rejuvenation actions. SRN encoding: microservice places for the healthy, failure-probable, and aging-failure states, together with timed aging, recovery, and rejuvenation transitions; node-level rejuvenation uses a clock-trigger place that enables rejuvenation-related transitions.
- A5. Failure detection is explicit. Detection delay is modeled separately from failure occurrence. SRN encoding: edge-node detection places/transitions with the mean delay parameters in Table 4; in this model, HA/LM guards are triggered when a node leaves the up place; detection delay contributes to the repair path and may be absorbed into the configured HA/LM transfer times when appropriate.
- A6. Policies are configuration switches. HA, LM, node rejuvenation, and microservice rejuvenation can be enabled/disabled to form the six scenarios. SRN encoding: policy places are referenced in guard functions so that the same SRN structure supports multiple policy configurations.
- A7. Proactive HA ordering. When a node-rejuvenation trigger occurs, the intended ordering is: trigger → migrate out → rejuvenate the node once empty. SRN encoding: the LM transition is enabled by the rejuvenation-trigger place; in the LM-first branch (selected with the evacuation-first probability described below), the node enters rejuvenation only after an empty-node condition (no microservice tokens remaining on the node) holds; in the REJ-first branch, this guard is bypassed to represent a forced reboot that may terminate remaining microservices.
- A8. Race abstraction between LM and rejuvenation. The model admits a concurrent situation in which rejuvenation can begin before LM completes, leading to microservice termination. SRN encoding: immediate transitions move microservice tokens into stop/down places when the node enters rejuvenation while LM is still in progress; the detailed enable/disable logic is captured via guard functions.
- SRN formulation roadmap. Given these assumptions, the SRN model is built in three steps: (i) we define local SRN submodels for edge nodes and microservices that capture failure, repair, and rejuvenation dynamics; (ii) we couple n instances of these submodels using inter-node transitions that represent HA and LM across neighbors; and (iii) we define a reward function that counts usable microservice instances while the storage layer is operational, yielding COA.
4.3. Edge Node Submodel
- Basic failure and recovery: Edge Node i starts with one token in the normal (up) state. As time passes, the failure transition fires after the node's mean time to failure and moves the token to the failure state. The failure is detected after the mean detection time, resulting in the detected-failure state. Finally, the node recovers after the mean repair time and returns to the up state.
- Edge node rejuvenation: the rejuvenation-trigger transition, which tracks the time for software aging in Figure 3a, fires to initiate rejuvenation. When the rejuvenation signal is detected, an immediate transition consumes the token in the up state and moves it to the rejuvenation-ready state. Once this state is confirmed, the rejuvenation transition fires after the mean rejuvenation execution time and returns the node to the up state. After the edge node recovers, an immediate transition resets the rejuvenation clock.
4.4. Microservice Submodel
- Normal state and aging: microservices initially have m tokens in the normal place. Due to software aging, the aging transition fires after the mean aging time and moves a token to the aging failure-probable state, from which the microservice recovers according to the recovery-time parameter (Table 4). Additionally, even in the failure-probable state, a sudden failure can occur (mean delay: the microservice MTTF) and is recovered through the microservice repair transition.
- Infra dependency: when Edge Node i goes down (i.e., enters the failure, detected-failure, or rejuvenation state), immediate transitions put its microservices into the down state. When the edge node returns to the up state, the corresponding recovery time passes and the microservices return to the normal state.
- Self-rejuvenation of microservices: each microservice generates its own rejuvenation trigger through a dedicated clock place and transition. When this trigger is generated (and microservice rejuvenation is enabled), it is first latched into a clock-side state. Then, depending on whether the microservice is currently in the normal or the failure-probable state, an immediate transition moves it to the self-rejuvenation state. After rejuvenation completes, the microservice clock is reset. After the mean rejuvenation execution time passes, the rejuvenation transition fires and the microservice returns to the normal state. This flow models the planned downtime of microservices.
4.5. System Interactions: HA, LM, and SW Rejuvenation
- Reactive high availability: when edge node i fails and its up place becomes empty, the corresponding immediate transitions fire. The microservice tokens move to the HA migration waiting state. After the mean HA failover time, which is the time required for microservices to move to an adjacent edge node, the failover transition fires and the microservices successfully fail over to the normal place of edge node j.
- Proactive live migration: when the rejuvenation signal of edge node i occurs, the control plane may either evacuate microservices first via LM (evacuation-first) or start node rejuvenation immediately (reboot-first). In the SRN, we represent this ordering by a probabilistic choice with the evacuation-first probability of Table 4; the default setting is 0.5.
- Service destruction and recovery by rejuvenation: as soon as the edge node enters the rejuvenation state, reset transitions fire to create a clean state for rejuvenation. These reset transitions forcibly remove tokens from all microservice states (normal, failure-probable, etc.) and move them to a clearing place. Tokens in this place are consumed by a transition with no output arc, so they are destroyed; after the rejuvenation work is completed, the post-rejuvenation transition fires. This recovers the destroyed microservices through a booting process to operate in the normal state.
- the place holding microservices of node i in the normal running state;
- the place holding aging but still running microservices of node i;
- the place holding microservices waiting for reactive HA migration from failed node i to its neighbor j;
- policy places that encode whether each mechanism is enabled in the considered scenario.
- Immediate-transition precedence. When an edge node fails, local dependency transitions in the per-node microservice submodel (the microservice-down transitions in Figure 4) and inter-node relocation transitions may be enabled at the same time. To enforce the intended policy semantics, we assign a higher firing priority to the inter-node HA/LM relocation transitions than to the local "microservice down" transitions. Infeasible relocations (e.g., no neighbor capacity) leave tokens to be handled by the local down transitions, which capture capacity loss.
- proactive live migration on all outgoing arcs from node i, enabled while the rejuvenation trigger holds;
- the immediate pre-rejuvenation transition, which moves the edge node into the rejuvenation state.
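The race between evacuation and rejuvenation encoded by these transitions can also be illustrated with a small Monte Carlo sketch. The parameter values are illustrative and the delays are exponential, as in the SRN: with probability p the node drains fully before rebooting, and otherwise each remaining microservice survives only if its migration happens to complete before the reboot starts.

```python
# Monte Carlo sketch of the LM-vs-rejuvenation race abstracted in the SRN.
# Parameter values are illustrative. In the evacuation-first branch
# (probability p) the node waits for LM to finish; in the reboot-first
# branch each microservice survives only if its (exponential) migration
# time is shorter than the (exponential) delay before the reboot starts.

import random

def survival_fraction(p, t_lm, t_reboot, m=1, trials=100_000, seed=7):
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        if rng.random() < p:              # evacuation-first: node drains fully
            survived += m
        else:                             # reboot-first: per-service race
            reboot_at = rng.expovariate(1.0 / t_reboot)
            for _ in range(m):
                if rng.expovariate(1.0 / t_lm) < reboot_at:
                    survived += 1
    return survived / (trials * m)

# With a 1 s mean migration time and a 10 s mean reboot delay, the analytic
# per-service survival probability in the reboot-first branch is 10/11, so
# the overall expectation at p = 0.5 is 0.5 + 0.5 * 10/11 ≈ 0.955.
est = survival_fraction(p=0.5, t_lm=1.0, t_reboot=10.0)
```

Setting p = 1 recovers the intended evacuation-before-reboot ordering, in which no microservice is lost to the race.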
| Name | Associated Transition | Conditional Statement of Immediate Transitions 1 |
|---|---|---|
| gTkm | If | |
| gTkmfp | If | |
| If | ||
| (Other HA transitions follow the same pattern: source index k, target index (mod n)) | ||
| gTkeprerej | If | |
| gTkeprrej | If | |
| gTkeresclk | If | |
| gTkestop | If | |
| gTkerejint | If | |
| gTkmfpdn | If | |
| gTkmprerej | If | |
| gTkmprrej | If | |
| gTkmclkstart | If | |
| gTkmclkstop | If | |
| gTkmo | If | |
| gTkmdnrejo | If | |
| gTkmafrej | If | |
| gTkmresclk | If | |
| gTkmrejint | If |
| Name | Associated Transition | Conditional Statement of Timed Transitions 1 |
|---|---|---|
| gTklmfp | If | |
| If | ||
| (Other LM transitions follow the same pattern: source index k, target index (mod n)) | ||
| gTkerej | If | |
| gTkmaf | If | |
| gTkmboot | If | |
| gTkmr | If | |
| gTkmrej | If |
| Parameters | Related Transitions | Description | Default Value |
|---|---|---|---|
| | | mean time to failure (MTTF) of an edge node | 8760 h |
| MTTF_Microservice | | mean time to failure (MTTF) of a microservice | 1258 h |
| | | mean time to failure (MTTF) of edge storage | 40,000 h |
| | | mean time to detect edge-node failure | 10 s |
| | | mean time to repair (MTTR) of an edge node | 1 h |
| MTTR_Microservice | | mean time to repair (MTTR) of a microservice | 0.238 h |
| | | mean time to repair (MTTR) of edge storage | 5 h |
| | | mean time to complete HA failover of a microservice | 2 min |
| | | mean time to complete live migration of a microservice (LM relocation time) | 1 s |
| | | mean time to trigger edge-node rejuvenation | 555 h |
| | | mean time to execute edge-node rejuvenation | 2 min |
| | | mean time to trigger microservice rejuvenation | 80 h |
| | | mean time to execute microservice rejuvenation | 1 min |
| | | mean time from the failure-probable state to aging failure | 72 h |
| | | mean time to enter the failure-probable state due to aging | 168 h |
| | | mean time to recover from the failure-probable state (resource cleanup) | 30 s |
| | | mean time to boot a microservice after edge-node rejuvenation | 5 s |
| m | | baseline number of microservice instances initially hosted per edge node | 1 |
| | | maximum microservice capacity per edge node (in number of instances) | 2 |
| | Equation (37) | probability of selecting the evacuation-first (LM) branch when a node-level rejuvenation trigger occurs | 0.5 |
5. Experiment and Analysis
5.1. Default Input Parameters and Model Configuration
- External grounding of key timing parameters. Several values in Table 4 correspond to control-plane timers and measured recovery latencies in container-orchestrated edge platforms. The detection-time parameter models the delay between an edge-node failure and its detection by the control plane. In Kubernetes, a node's Ready condition becomes Unknown if the controller has not heard from the node within node-monitor-grace-period (default 50 s), based on the kubelet heartbeat mechanism (Lease renewals, default every 10 s) and the node controller timers [25,26,27]. These defaults are version-dependent and can be tuned. After the node is tainted as node.kubernetes.io/not-ready or node.kubernetes.io/unreachable, the default system-added tolerations set tolerationSeconds=300 for these NoExecute taints, which bounds the default eviction delay in untuned clusters [28,29]. Taken together, these control-plane timers imply that end-to-end workload replacement can take tens of seconds to minutes in untuned clusters. In this SRN (Assumption A5), HA relocation is triggered when a node leaves the up place, so the workload-visible replacement latency is captured by the HA failover time, which can be calibrated to include both detection and post-detection actions (eviction delay, scheduling, image pull, and service warm-up); the detection time itself remains on the infrastructure repair path. At the edge, measured failover can also be on the order of seconds under an optimized setup (e.g., RTO ≈ 10.03 s end-to-end, with the failover stage ≈ 8.29 s in a StarlingX-based edge testbed) [30], while fast fault-detection extensions for containerized IoT services can reduce detection to sub-second levels in experimental settings (e.g., average detection time ≈ 0.84 s) [31]. Finally, the LM relocation-time parameter gives the mean time to complete a microservice relocation step via live migration in the SRN.
Empirical VM measurements indicate that the switchover downtime of post-copy migration can be sub-second (e.g., 0.2–0.7 s under 10 Gb/s interfaces) [32], so we treat service interruption during LM as negligible relative to other control-plane timers and focus on the relocation latency that governs how quickly a node can be drained. These mappings interpret the baseline values in Table 4 and motivate the ranges used in the sensitivity analysis (Table 7), while also clarifying how to calibrate SRN rates from platform logs and testbed measurements.
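Under these references, a rough calibration of the HA failover timer from the cited Kubernetes defaults can be sketched as follows. The scheduling/boot term is an assumed placeholder, and the tuned figure is the measured end-to-end RTO from [30].

```python
# Back-of-the-envelope calibration of the HA failover timer from the
# Kubernetes control-plane defaults cited above. The scheduling/boot
# figure is an assumed placeholder; the others are documented defaults.

node_monitor_grace_s  = 50    # node marked NotReady/Unknown (default)
toleration_seconds    = 300   # default tolerationSeconds for NoExecute taints
reschedule_and_boot_s = 30    # assumed: scheduling + image pull + warm-up

# Untuned worst case: detection + eviction delay + replacement actions.
t_ha_untuned_s = node_monitor_grace_s + toleration_seconds + reschedule_and_boot_s

# Tuned, edge-optimized case, in line with the measured RTO of ~10 s [30].
t_ha_tuned_s = 10.03
```

Table 4's default of 2 min (120 s) sits between these tuned and untuned figures, which is consistent with treating it as a workload-visible replacement latency.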
5.2. Experiment Metrics: COA
5.3. Stationary Analysis
- Mapping of scenarios to common HA baselines. Scenario (ii) corresponds to reactive failover/rescheduling: after an edge node failure is detected, microservices are restarted/rescheduled on neighboring nodes with spare capacity (no execution-state preservation). This is conceptually close to primary–standby activation (warm/cold standby) and to Kubernetes-style recovery, where workloads are recreated/rescheduled after node failure detection and eviction. Scenario (iii) adds proactive evacuation by live migration prior to planned node rejuvenation, analogous to draining a node before reboot in planned maintenance workflows (e.g., cordon+drain in Kubernetes or VM host maintenance with live migration). Scenario (iv) represents node rejuvenation without prior evacuation, analogous to maintenance that starts before draining completes. Scenario (vi) integrates HA, LM, and REJ, corresponding to evacuation-first maintenance combined with failure-driven failover. Scenario (i) is the baseline with no HA, and Scenario (v) is rejuvenation without LM.
- Relative gains. Table 6 reports the absolute COA differences and percentage gains of HA+LM (Case (iii)) over HA-only (Case (ii)) and over the baseline (Case (i)) under identical failure rates (default parameters). Under these low failure rates, the gain of HA+LM over HA-only is below 0.13%, and in part of the tested range the difference is below 0.001%. When failures are frequent, the effect size of LM becomes much larger (see the sensitivity analysis for edge-node MTTF). The percentages are computed from Table 5 and rounded to four decimal places.
5.4. Sensitivity Analysis
| Parameter | Test Range | Case (i) Without HA, LM and REJ | Case (ii) with Only HA | Case (iii) with HA and LM | Case (iv) with Only REJ | Case (v) with HA and REJ | Case (vi) with HA, LM and REJ |
|---|---|---|---|---|---|---|---|
| | 0.6–40 s | 5.6174 (<0.004) | 5.6300 (<0.002) | 5.6381 (<0.003) | 5.5212 (<0.003) | 5.3251 (<0.007) | 5.5321 (<0.003) |
| | 0.1–3 h | 5.6164 (<0.002) | 5.6297 (<0.003) | 5.6378 (<0.003) | 5.5213 (<0.003) | 5.3255 (<0.006) | 5.5326 (<0.004) |
| | 3–60 min | 5.6160 (<0.003) | 5.6298 (<0.005) | 5.6377 (<0.004) | 5.5206 (<0.004) | 5.3236 (<0.005) | 5.5323 (<0.004) |
| | 0.6–20 h | 5.6166 (<0.006) | 5.6299 (<0.006) | 5.6376 (<0.006) | 5.5210 (<0.005) | 5.3235 (<0.008) | 5.5327 (<0.006) |
| | 10–180 s | 5.6171 (<0.003) | 5.6296 (<0.002) | 5.6379 (<0.002) | 5.5213 (<0.005) | 5.3229 (<0.010) | 5.5325 (<0.006) |
- Why detection and repair parameters show low sensitivity in steady-state. Table 7 shows that varying the detection time, the repair times (node, microservice, and storage), and the microservice rejuvenation duration changes COA by less than 0.01 across the tested ranges. This is expected for a steady-state time-average metric: these parameters mainly affect the duration of a degradation episode, while failure rates and rejuvenation trigger intervals determine the frequency of such episodes. A first-order alternating-renewal approximation for a single component gives A = MTTF / (MTTF + MTTD + MTTR), so when MTTF is much larger than MTTD and MTTR, steady-state availability changes only in the fourth-to-sixth decimal place under realistic sweeps. For example, with the default MTTF = 8760 h, changing the detection time from 0.6 to 40 s changes the unavailability fraction by about 1.25 × 10⁻⁶, and changing the repair time from 0.1 to 3 h changes it by about 3.3 × 10⁻⁴. These magnitudes explain why the COA ranges in Table 7 stay below 0.01. In scenarios with HA/LM, capacity loss is further bounded because microservices can be relocated to neighboring nodes soon after a failure is detected, so COA becomes even less sensitive to the subsequent repair duration.
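These magnitudes can be checked numerically from the alternating-renewal approximation A = MTTF / (MTTF + MTTD + MTTR) with the Table 4 defaults:

```python
# Numeric check of the alternating-renewal approximation
# A = MTTF / (MTTF + MTTD + MTTR) using the Table 4 default values.

def availability(mttf_h, mttd_h, mttr_h):
    return mttf_h / (mttf_h + mttd_h + mttr_h)

MTTF = 8760.0  # edge-node MTTF (h)

# Sweep the detection time from 0.6 s to 40 s (MTTR fixed at its 1 h default):
# change in the unavailability fraction 1 - A.
dU_det = (1 - availability(MTTF, 40 / 3600, 1.0)) - \
         (1 - availability(MTTF, 0.6 / 3600, 1.0))

# Sweep the repair time from 0.1 h to 3 h (detection fixed at its 10 s default).
dU_rep = (1 - availability(MTTF, 10 / 3600, 3.0)) - \
         (1 - availability(MTTF, 10 / 3600, 0.1))
```

Both differences stay far below the 0.01 COA resolution of Table 7, in line with the argument above.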
6. Discussion
6.1. Effect of HA and Live Migration on COA
6.2. Implications of the PC Phenomenon
- Mapping to Kubernetes, KubeEdge, and OpenStack. In Kubernetes, planned node maintenance is typically executed as cordon+drain; drain uses the Eviction API and respects PodDisruptionBudgets and termination grace periods, then the node can be rebooted and returned to service [19,20,21]. In this sense, the evacuation-first branch corresponds to drain/evacuation-first coordination, while the reboot-first branch corresponds to forced disruption (e.g., a reboot that proceeds before drain completes, or forced deletion when drain is blocked or times out). Kubernetes does not provide transparent live migration for generic containers by default; however, when VM live migration is used in a Kubernetes setting (e.g., via KubeVirt), the same migrate-before-reboot pattern applies [8]. KubeEdge extends Kubernetes semantics to edge deployments with a cloud–edge split and supports operation under intermittent connectivity [4,5], which may reduce the chance that evacuation completes within a deadline and can increase the practical likelihood of a forced restart path. In OpenStack, Nova supports live migration for planned maintenance and evacuation for host failure [7,33]; Nova also exposes migration completion timeout policies [7], which aligns naturally with interpreting the evacuation-first probability as the chance that migration/evacuation completes before a forced maintenance action proceeds.
6.3. Parameter Sensitivity and Prioritization
- Relation to container orchestration and MEC reliability. The above ranking matches how Kubernetes-like orchestrators are engineered: liveness detection and reconciliation run on seconds-to-minutes time constants, while node hardware failures and software aging typically occur on much longer time scales. For example, Kubernetes marks an unresponsive node unhealthy after a grace period (default node-monitor-grace-period=50s) and then applies eviction/toleration logic (default tolerationSeconds=300s for NoExecute on not-ready/unreachable) before rescheduling workloads [25,26,28]. Therefore, moderate variation in these timers has a limited effect on long-run capacity when MTTF is large, while failure/aging rates and maintenance frequency dominate. In MEC settings, intermittent connectivity and constrained resources can stretch effective detection and repair times and can create false-positive liveness events, which can amplify the effect of orchestration timeouts [1,10]. The present SRN assumes perfect failure detection and does not explicitly model network partitions; adding partition/false-positive states is a direct extension for studying this regime.
6.4. Design Guidelines for MEC-Based Edge Storage
- Industrial mapping. The building blocks represented in our SRN correspond to standard operational controls in current edge stacks. For Kubernetes-based edge clusters, planned node maintenance typically follows a cordon/drain workflow (e.g., kubectl drain), where pods are safely evicted before the node is powered down, and PodDisruptionBudgets bound the number of concurrent voluntary disruptions [19,21]. Tools such as Kured automate safe reboots by cordoning and draining a node before reboot and then uncordoning it [34]. For VM-based MEC deployments, live migration is directly supported by widely deployed virtualization platforms (e.g., OpenStack) and by Kubernetes-based VM orchestration (KubeVirt) [7,8]. In this context, the PC regime should be interpreted as miscoordinated maintenance (reboot before evacuation completes or forced evacuation with insufficient timeouts), rather than a claim that real systems intentionally implement a fixed 50/50 race.
- Live migration should be treated as a first-class mechanism when building high availability for microservice-based edge storage. The steady improvement in COA across all failure and aging regimes indicates that migration provides a robust benefit even when hardware is relatively reliable.
- Node-level rejuvenation must be carefully coordinated with migration. Before rebooting an aging edge node, the orchestrator should verify that all microservice instances have been migrated or drained so that capacity is not lost through PC events.
- Rejuvenation trigger intervals should be tuned separately at the node and microservice levels. In this study, overly frequent node-level rejuvenation reduces COA, whereas shorter microservice trigger intervals improve COA within the tested range.
- When storage reliability is a concern, as in the low storage-MTTF region of Figure 6c, the overall COA becomes dominated by storage uptime (Equation (42)); HA/LM and rejuvenation can only improve the conditional microservice capacity when storage is up. In our SRN, periodic restarts may reduce the post-repair lag after storage recovers, but they cannot compensate for storage downtime.
- The model assumes some spare capacity on each node, encoded by the headroom between the initial number of instances m and the maximum per-node capacity (Table 4). In practice, migration-based high availability requires similar headroom, either in the form of reserved capacity on neighboring nodes or dedicated standby nodes.
6.5. Validation Strategy and External Grounding
- (i)
- Structural validation by reduction. The policy places in Figure 5 act as switches that enable or disable groups of transitions (Table 2, Table 3 and Table 4). Disabling live migration and rejuvenation collapses the reachable markings to the reactive HA-only case; disabling HA and live migration collapses them to the rejuvenation-only case. This reduction check confirms that the integrated SRN preserves the intended semantics of its component submodels and of the reward definition used for COA.
- (ii)
- Parameter grounding and qualitative trend checks. We map key SRN timers to orchestration timers and measurements. Under these mappings, the predicted ordering of policies is consistent with observations: proactive migration can keep service interruption short relative to failure-driven relocation, while fault detection and eviction timers mainly matter when failures are frequent or when the system is capacity-constrained. This is consistent with our sensitivity-analysis result that detection delay and repair time have limited effects on long-run COA within reasonable engineering ranges.
- (iii)
- Empirical validation workflow and limitations. This work does not include a dedicated MEC testbed experiment, so we do not claim a one-to-one match between the absolute COA values and any specific deployment. In future work (Section 6.6), we will validate SRN-predicted trends against testbed traces by (a) running the policy cases on a small MEC cluster, (b) logging failure, recovery, migration, and rejuvenation events, (c) computing an empirical COA as a time average of the number of ready microservice replicas multiplied by a storage-health indicator, and (d) fitting SRN parameters (or phase-type distributions) to observed traces, then comparing predicted sensitivities and policy ranking.
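Step (c) of this workflow amounts to a simple time average over logged samples. The sketch below uses a synthetic trace with hypothetical timestamps; in practice the samples would come from orchestrator and testbed logs.

```python
# Sketch of the empirical COA computation: a time average of ready replicas
# gated by a storage-health indicator. The trace below is synthetic.

def empirical_coa(samples):
    """`samples` is a list of (t, ready_replicas, storage_up) tuples sorted
    by time; each sample holds until the next timestamp (the last entry
    only marks the end of the observation window)."""
    total, weighted = 0.0, 0.0
    for (t0, ready, storage_up), (t1, _, _) in zip(samples, samples[1:]):
        dt = t1 - t0
        total += dt
        weighted += dt * ready * (1 if storage_up else 0)
    return weighted / total

trace = [
    (0.0, 6, True),     # all 6 replicas ready
    (100.0, 5, True),   # one node drained for rejuvenation
    (120.0, 6, True),   # replica migrated back
    (200.0, 6, False),  # storage outage: capacity counts as zero
    (210.0, 6, True),
    (300.0, 6, True),   # end-of-window marker
]
coa_hat = empirical_coa(trace)
```

The resulting estimate can then be compared against the SRN-predicted COA for the matching policy scenario.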
6.6. Future Works
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646. [Google Scholar] [CrossRef]
- ETSI. GS MEC 003-V3.2.1-Multi-Access Edge Computing (MEC); Framework and Reference Architecture. ETSI Group Specification GS MEC 003, European Telecommunications Standards Institute (ETSI). 2024. V3.2.1 (2024-04). Available online: https://www.etsi.org/deliver/etsi_gs/MEC/001_099/003/03.02.01_60/gs_mec003v030201p.pdf (accessed on 15 February 2026).
- Qu, Q.; Xu, R.; Nikouei, S.Y.; Chen, Y. An Experimental Study on Microservices Based Edge Computing Platforms. arXiv 2020, arXiv:cs/2004.02372. [Google Scholar] [CrossRef]
- Xiong, Y.; Sun, Y.; Xing, L.; Huang, Y. Extend Cloud to Edge with KubeEdge. In Proceedings of the 2018 IEEE/ACM Symposium on Edge Computing (SEC), Seattle, WA, USA, 25–27 October 2018; pp. 373–377. [Google Scholar] [CrossRef]
- Zhao, H.; Liu, S.; Luo, K.; Chen, S.; Kong, L.; Jia, F. Research on application of edge computing system based on KubeEdge. Zhineng Kexue Yu Jishu Xuebao 2022, 4, 118–128. [Google Scholar] [CrossRef]
- Waseem, M.; Liang, P.; Shahin, M. A Systematic Mapping Study on Microservices Architecture in DevOps. J. Syst. Softw. 2020, 170, 110798. [Google Scholar] [CrossRef]
- OpenStack Nova Documentation Team. Live-Migrate Instances. Available online: https://docs.openstack.org/nova/latest/admin/live-migration-usage.html (accessed on 15 February 2026).
- KubeVirt Project. Live Migration. KubeVirt User Guide. Available online: https://kubevirt.io/user-guide/compute/live_migration/ (accessed on 19 February 2026).
- Venkatesh, R.S.; Smejkal, T.; Milojicic, D.S.; Gavrilovska, A. Fast In-Memory CRIU for Docker Containers. In Proceedings of the 2019 International Symposium on Memory Systems (MEMSYS’19), Washington, DC, USA, 30 September–3 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 53–65. [Google Scholar] [CrossRef]
- He, X.; Meng, M.; Ding, S.; Li, H. A Survey of Task Migration Strategies in Mobile Edge Computing. In Proceedings of the 2021 IEEE 6th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 24–26 April 2021; pp. 400–405. [Google Scholar] [CrossRef]
- Kim, D.S.; Hong, J.B.; Nguyen, T.A.; Machida, F.; Park, J.S.; Trivedi, K.S. Availability Modeling and Analysis of a Virtualized System Using Stochastic Reward Nets. In Proceedings of the 2016 IEEE International Conference on Computer and Information Technology (CIT), Nadi, Fiji, 8–10 December 2016; pp. 210–218. [Google Scholar] [CrossRef]
- Cotroneo, D.; Natella, R.; Pietrantuono, R.; Russo, S. Software Aging Analysis of the Linux Operating System. In Proceedings of the 2010 IEEE 21st International Symposium on Software Reliability Engineering, San Jose, CA, USA, 1–4 November 2010; pp. 71–80. [Google Scholar] [CrossRef]
- Araujo, J.; Matos, R.; Alves, V.; Maciel, P.; de Souza, F.V.; Matias, R., Jr.; Trivedi, K.S. Software Aging in the Eucalyptus Cloud Computing Infrastructure: Characterization and Rejuvenation. ACM J. Emerg. Technol. Comput. Syst. 2014, 10, 1–22. [Google Scholar] [CrossRef]
- Alonso, J.; Bovenzi, A.; Li, J.; Wang, Y.; Russo, S.; Trivedi, K.S. Software Rejuvenation: Do IT & Telco Industries Use It? In Proceedings of the 23rd IEEE International Symposium on Software Reliability Engineering Workshops (ISSRE Workshops), Dallas, TX, USA, 27–30 November 2012; pp. 299–304. [Google Scholar] [CrossRef]
- Bai, J.; Chang, X.; Machida, F.; Trivedi, K.S. Understanding Container-Based Services Under Software Aging: Dependability and Performance Views. IEEE Trans. Sustain. Comput. 2025, 10, 562–575. [Google Scholar] [CrossRef]
- Melo, M.D.T.d.; Maciel, P.R.M.; Araujo, J.; Matos Júnior, R.d.S.; Araújo, C. Availability study on cloud computing environments: Live migration as a rejuvenation mechanism. In Proceedings of the 2013 IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Budapest, Hungary, 24–27 June 2013; pp. 1–6. [Google Scholar] [CrossRef]
- Torquato, M.; Maciel, P.; Vieira, M. Availability and Reliability Modeling of VM Migration as Rejuvenation on a System under Varying Workload. Softw. Qual. J. 2020, 28, 59–83. [Google Scholar] [CrossRef]
- Tola, B.; Jiang, Y.; Helvik, B.E. On the Resilience of the NFV-MANO: An Availability Model of a Cloud-native Architecture. In Proceedings of the 2020 16th International Conference on the Design of Reliable Communication Networks DRCN 2020, Milano, Italy, 25–27 March 2020; pp. 1–7. [Google Scholar] [CrossRef]
- Kubernetes Authors. Safely Drain a Node. Available online: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/ (accessed on 15 February 2026).
- Kubernetes Authors. API-Initiated Eviction. Available online: https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/ (accessed on 15 February 2026).
- Kubernetes Authors. Disruptions. Available online: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/ (accessed on 24 February 2026).
- Maciel, P.R.M.; Matos Júnior, R.d.S.; Silva, B.; Figueiredo, J.; Oliveira, D.; Fé, I.; Maciel, R.; Dantas, J. Mercury: Performance and Dependability Evaluation of Systems with Exponential, Expolynomial, and General Distributions. In Proceedings of the 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC), Christchurch, New Zealand, 22–25 January 2017; pp. 50–57. [Google Scholar] [CrossRef]
- Lim, D.; Nguyen, T.A.; Min, D.; Choi, E.; Fe, I.; Silva, F.A.; Maciel, P. Metaverse Distributed Storages: High Availability Quantification Using Stochastic Reward Nets. In Proceedings of the 2025 International Conference on Metaverse Computing, Networking and Applications (MetaCom), Seoul, Republic of Korea, 27–29 August 2025; pp. 181–188. [Google Scholar] [CrossRef]
- Nguyen, T.A.; Kim, D.S.; Park, J.S. A Comprehensive Availability Modeling and Analysis of a Virtualized Servers System Using Stochastic Reward Nets. Sci. World J. 2014, 2014, 1–18. [Google Scholar] [CrossRef] [PubMed]
- Kubernetes Authors. Node Status. Available online: https://kubernetes.io/docs/reference/node/node-status/ (accessed on 15 February 2026).
- Kubernetes Authors. Kube-Controller-Manager. Available online: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/ (accessed on 15 February 2026).
- Kubernetes Authors. Node Heartbeats. Available online: https://kubernetes.io/docs/concepts/architecture/nodes/#node-heartbeats (accessed on 15 February 2026).
- Kubernetes Authors. Kube-Apiserver. Available online: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/ (accessed on 23 February 2026).
- Kubernetes Authors. Taints and Tolerations. Available online: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ (accessed on 15 February 2026).
- Abuibaid, M.; Ghorab, A.; Seguin-McPeake, A.; Yuen, O.; Yungblut, T.; St-Hilaire, M. Edge Workloads Monitoring and Failover: A StarlingX-Based Testbed Implementation and Measurement Study. IEEE Access 2022, 10, 97101–97116. [Google Scholar] [CrossRef]
- Yang, H.; Kim, Y. Design and Implementation of Fast Fault Detection in Cloud Infrastructure for Containerized IoT Services. Sensors 2020, 20, 4592. [Google Scholar] [CrossRef] [PubMed]
- Biswas, M.I.; Parr, G.; McClean, S.I.; Morrow, P.; Scotney, B.W. A Practical Evaluation in Openstack Live Migration of VMs Using 10Gb/s Interfaces. In Proceedings of the 2016 IEEE Symposium on Service-Oriented System Engineering (SOSE), Oxford, UK, 29 March–2 April 2016; pp. 346–351. [Google Scholar] [CrossRef]
- OpenStack Nova Documentation Team. Recover from a Failed Compute Node. Available online: https://docs.openstack.org/nova/latest/admin/node-down.html (accessed on 15 February 2026).
- The Kured Authors. Kured—Kubernetes Reboot Daemon. Available online: https://kured.dev/docs/ (accessed on 23 February 2026).
- Law, A.M. Simulation Modeling and Analysis, 5th ed.; McGraw-Hill Education: Columbus, OH, USA, 2015. [Google Scholar]
- Bobbio, A.; Horváth, A. Petri Nets with Discrete Phase Type Timing: A Bridge Between Stochastic and Functional Analysis. Electron. Notes Theor. Comput. Sci. 2002, 52, 209–226. [Google Scholar] [CrossRef]
COA (expected number of usable microservices) for each policy case, by m:

| m | Case (i): without HA, LM, and REJ | Case (ii): HA only | Case (iii): HA and LM | Case (iv): REJ only | Case (v): HA and REJ | Case (vi): HA, LM, and REJ |
|---|---|---|---|---|---|---|
| 1 | 5.61623178 | 5.630850437 | 5.638009995 | 5.521797121 | 5.323315107 | 5.533842655 |
| 2 | 11.55509444 | 11.58247815 | 11.5900559 | 11.14945372 | 10.91820502 | 11.08190684 |
| 3 | 17.46508118 | 17.50285648 | 17.5123491 | 16.774778 | 16.32616795 | 16.77606607 |
| 4 | 23.3549476 | 23.41224225 | 23.41208215 | 22.37593148 | 21.69431434 | 22.43628906 |
COA gain of Case (iii) relative to Cases (ii) and (i):

| m | COA(iii) − COA(ii) | Gain vs. (ii) | COA(iii) − COA(i) | Gain vs. (i) |
|---|---|---|---|---|
| 1 | 0.007160 | 0.1271% | 0.021778 | 0.3878% |
| 2 | 0.007578 | 0.0654% | 0.034961 | 0.3026% |
| 3 | 0.009493 | 0.0542% | 0.047268 | 0.2706% |
| 4 | −0.000160 | −0.0007% | 0.057135 | 0.2446% |
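The gain rows can be reproduced directly from the per-case COA values. The snippet below is a minimal sketch using the reported numbers; the variable and key names are ours, not the paper's:

```python
# Reproduce the COA-gain table from the per-case COA values (Cases i-iii).
coa = {
    "i":   [5.61623178, 11.55509444, 17.46508118, 23.3549476],
    "ii":  [5.630850437, 11.58247815, 17.50285648, 23.41224225],
    "iii": [5.638009995, 11.5900559, 17.5123491, 23.41208215],
}

rows = []
for m, (ci, cii, ciii) in enumerate(zip(coa["i"], coa["ii"], coa["iii"]), start=1):
    rows.append({
        "m": m,
        "iii_minus_ii": ciii - cii,                     # absolute gain over HA only
        "gain_vs_ii_pct": (ciii - cii) / cii * 100,     # relative gain over HA only
        "iii_minus_i": ciii - ci,                       # absolute gain over no policy
        "gain_vs_i_pct": (ciii - ci) / ci * 100,        # relative gain over no policy
    })

for r in rows:
    print(f"m={r['m']}: d(iii-ii)={r['iii_minus_ii']:+.6f} "
          f"({r['gain_vs_ii_pct']:+.4f}%), "
          f"d(iii-i)={r['iii_minus_i']:+.6f} ({r['gain_vs_i_pct']:+.4f}%)")
```

Note the sign flip at m = 4: adding LM to HA yields a slightly negative COA difference there, matching the −0.000160 entry in the table.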
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Nguyen, T.A.; Lim, D.; Kyung, M.; Min, D. Distributed Edge Storage Systems: Proactive High-Availability Microservices with Live Migration and Rejuvenation Strategies. Mathematics 2026, 14, 1704. https://doi.org/10.3390/math14101704