Article

Research on Multi-Level Monitoring Architecture Pattern of Cloud-Based Safety Computing Platform

by Lei Yuan 1, Bokai Zhang 1, Yu Liu 1,*, Qiang Fu 1 and Yixiong Wu 2
1 School of Automation and Intelligence, Beijing Jiaotong University, Beijing 100044, China
2 China Unicom Digital Technology Co., Ltd., Beijing 100032, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(10), 1706; https://doi.org/10.3390/sym17101706 (registering DOI)
Submission received: 20 August 2025 / Revised: 26 September 2025 / Accepted: 2 October 2025 / Published: 11 October 2025
(This article belongs to the Section Computer)

Abstract

As rail transit systems advance toward greater automation and intelligence, cloud computing, with its scalability and data processing capabilities, is steadily expanding its footprint in this domain. However, its adoption also introduces new safety challenges for train control systems. Traditional safety computers in train control systems rely on symmetric heterogeneous redundancy to enhance safety. In cloud computing environments, however, software that is nominally heterogeneous may still share the same source code, which creates a risk of common-cause software failures. To address this issue, this study proposes a multi-level monitoring architecture tailored to the characteristics of cloud-based safety computing platforms. The architecture integrates the three-level monitoring architecture pattern from the automotive field, the secure channel pattern, and the distributed safety mechanism architecture. It monitors the layers whose common-cause software failures cannot be eliminated through heterogeneity, and this multi-level active monitoring reduces the impact of common-cause software failures on system safety. A formal safety model is constructed, and quantitative evaluations are conducted separately for the single-channel structure with L2 and L3, the dual-channel structure with L4 but without degradation or monitoring, and the dual-channel L4 monitoring architecture with complete functions. The results verify the effectiveness of the proposed monitoring architecture in reducing the risk of common-cause software failures in the virtualization layer. This study provides a theoretical foundation and technical support for the safety-oriented design and development of next-generation intelligent rail transit systems.

1. Introduction

In August 2006, the concept of “cloud computing” was first proposed [1]. After more than a decade of development, cloud computing has become an indispensable foundational infrastructure for a multitude of organizations, with far-reaching implications across economic, social, industrial, and scientific spheres. With the rapid advancement of mobile networks and big data technologies, the vast majority of online services and data processing services now rely on cloud computing. Widely adopted across the globe, cloud computing is hailed as the third major wave in information technology [2,3,4,5].
Train safety computers are a crucial technology for ensuring the safety of railway transportation. To guarantee the security and reliability of critical systems, train safety computers usually adopt a redundant design. This means that key components such as processors, control modules, and communication devices are equipped with backup systems. These backup systems can automatically take over when the primary system fails, ensuring that train operations remain unaffected [6].
Owing to the layered architecture and virtualization technology of cloud computing, new safety challenges emerge when safety computers are migrated to cloud computing platforms. Numerous functional software modules sit between the physical hardware layer and the application layer, and potential failures in these layers can lead to multi-dimensional safety failures. Traditional heterogeneous redundancy fault-tolerance mechanisms face significant limitations in this multi-software environment: heterogeneous functional software modules often share core underlying modules, making the system vulnerable to common-cause software failures. Because of this software-induced common-cause failure mechanism, hardware or software heterogeneity alone cannot provide effective fault isolation and fault tolerance.
To address the above issues, a system monitoring framework is necessary. A robust multi-level monitoring mechanism makes it possible to monitor and analyze the operating status of the software layers in real time, promptly detect and handle potential safety threats, and guide the system toward a safe state.
In summary, our contributions are as follows:
  • This paper innovatively proposes a multi-level monitoring architecture system tailored to the characteristics of cloud security computing platforms for train control systems, which monitors the levels of common-cause software failures that cannot be eliminated through heterogeneity. The introduction of a multi-level active monitoring mechanism for risk management and control has reduced the impact of common-cause software failures on system safety.
  • Through rigorous mathematical analysis of the adopted multi-level monitoring architecture, this paper has constructed a formal safety model. This model is effectively applicable to cloud security computing platforms of train control systems.
  • Experimental verification of the multi-level monitoring architecture has been conducted in this paper. The results indicate that, relying on its multi-level and closed-loop monitoring mechanism, the multi-level monitoring architecture effectively compensates for the defects of local monitoring structures in risk perception and response, and its safety performance is more stable and reliable across different scenarios.
This paper adopts a progressive research approach. Section 2 reviews related work. Section 3 conducts a hierarchical analysis of the cloud-based safety computing platform. Section 4 proposes a multi-level monitoring architecture pattern. Section 5 discusses the feasibility of the multi-level monitoring architecture from a safety perspective. Section 6 summarizes the overall work of this paper.

2. Related Work

Compared with traditional train control systems, signal systems utilizing cloud computing technology have significantly enhanced the efficiency and reliability of rail transit. These cloud computing systems are capable of processing and storing vast amounts of data while providing great flexibility in resource utilization. They reduce dependence on physical equipment, effectively lowering operational costs. The powerful computing capability and fault-tolerant technology of cloud computing systems ensure the continuous availability of services. Due to the high scalability of cloud platforms, centralized management of multiple signaling applications can be achieved by standardizing the control software interfaces, data formats, communication protocols, and other interactive information within the cloud environment, thereby facilitating the network interconnection of cloud-based train control systems. Therefore, the application of cloud computing in rail transit has become an obvious trend in the development of the industry.
Ma et al. [7] present the design of a train control system test cloud platform based on Docker and Kubernetes clusters. By adopting containerization technology and orchestration tools, this platform realizes the modularity of test software functions, thereby improving the automation level of the test platform and supporting various future train control test tasks. Li K. [8] explores how to use cloud computing technology to solve the problem of signal centralized detection in high-speed railway systems. Through the virtualization of microcomputer servers and dynamic resource allocation, the monitoring scope of each station’s signal monitoring station can be flexibly adjusted, achieving effective remote detection of signal status and optimizing the detection function. Guo et al. [9] propose a cloud model with uncertainty cognition characteristics to evaluate the safety level of train control operations. It demonstrates the safety assessment steps based on the cloud model and determines the operational safety status of the train operation control system by calculating the similarity between the comprehensive cloud and the standard cloud.
Zou B. [10] analyzes the train control functions that can be implemented at each layer of the cloud platform and proposes an integration scheme for their convergence. Zheng T. [11] analyzes the practical application of cloud computing technology in the subway industry. Dawood M et al. [12] established a comprehensive theoretical framework for cloud computing security and systematically analyzed the types of cloud computing security issues. Zhu et al. [13] established a reinforcement learning cloud model to measure the fault repair of rail transit clouds. Gala G et al. [14] proposed a real-time cloud architecture based on virtualization technology and designed a resource management layer that includes node-level resource managers and global resource managers, so as to realize dynamic allocation, monitoring, and coordination of resources, thereby meeting the safety and real-time requirements of railway applications. Chen et al. [15] analyzed the primary system architectures of cloud computing platforms for urban rail transit, focusing on security and reliability, and proposed a tailored networking solution for such platforms.
Furthermore, Du S. [16] modeled the reliability and safety of traditional safety computer platforms, measuring their safety reliability through multiple indicators. However, Du S. did not conduct a detailed analysis of channel failure scenarios within different subsystems. Ren W. [17] employs the Monte Carlo method to model and analyze the safety and reliability of safety-critical redundant architectures. It presents a design for a safety computer platform based on private cloud infrastructure. However, the proposed platform considers only hardware-level redundancy and does not address the risk of common-cause software failures due to a shared source code. Zhang F. [18] optimizes the local data transmission method of the new train control system and conducts tests on multiple cloud platforms to reduce system latency. However, the simulation of axle counting sections still requires manual intervention, which may affect safety due to human operations. Yang Y. [19] proposes an optimized safety computer platform architecture and program sequence monitoring method to enhance the security of safety computer platforms based on cloud computing. However, while this improves the diagnostic coverage provided, the frequent occupation of CPU resources results in an increased self-checking load. Zhao Q. [20] proposed a novel architecture for cloud-based safety computing platforms and conducted qualitative and quantitative analyses on the real-time performance of such platforms. However, their evaluation was limited to homogeneous hardware configurations, failing to demonstrate the adaptability of the architecture in heterogeneous environments. Liu et al. [21] proposed a remote monitoring scheme for railway power supply systems based on cloud computing platforms. However, this scheme is limited to railway power supply systems. Zhou et al. [22] proposed a resource allocation method for railway safety-critical computing applications based on a Mixed Integer Linear Programming (MILP) model. However, the host power consumption model fails to consider the detailed differences in power consumption caused by hardware heterogeneity. Moreover, the safety verification only focuses on “whether deployment rules are met” and does not reference industry safety standards such as IEC 61508 (Geneva, Switzerland, 2010) for quantitative verification. A comparison of recent studies is presented in Table 1.
Although scholars at home and abroad have carried out systematic research on cloud-based security computing platforms from multi-dimensional perspectives, existing studies mostly focus on the functional implementation of cloud secure computing platforms or security analysis from a single dimension and have not fully considered the potential impact of common-cause software failures on system security. Regarding the quantitative security assessment of cloud security computing platforms, a systematic modeling method that integrates the characteristics of common-cause software failures and multi-level monitoring mechanisms has not yet been developed.
In view of the theoretical and practical challenges of security protection for cloud-based computing platforms, this paper innovatively proposes a multi-level monitoring architecture model for cloud-based secure computing platforms. Through formal modeling and quantitative analysis, the system’s security verification is completed, and experimental verification of the multi-level monitoring architecture is conducted. The results show that this architecture meets the Safety Integrity Level 4 (SIL4) requirements of the train control system, filling the adaptability gaps of traditional methods in cloud environments.

3. Hierarchical Analysis of Cloud-Based Security Computing Platforms

This section first analyzes the risks inherent to cloud platforms and proposes a heterogeneous redundant safety architecture. Fault Tree Analysis is then used to qualitatively analyze common-cause failures at each layer of that architecture.

3.1. Hierarchical Risk Analysis

In accordance with the IEC 61508 functional safety standard and the hierarchical deconstruction principle of Fault Tree Analysis (FTA), a security analysis is conducted on the cloud secure computing platform. This involves identifying potential security issues at each layer and conducting a hazard-cause analysis for each layer. As a safety-critical system in rail transit, the train control system must strictly avoid the spread of a single component failure to the entire system. Accordingly, a virtual machine-based architecture with enhanced isolation capabilities is adopted. The hierarchical architecture of the cloud security computing platform is illustrated in Figure 1.
To clearly analyze the security issues caused by each layer of the cloud platform, it is necessary to conduct hazard-cause analysis for each layer. Risk analyses are performed separately for the physical hardware layer, Host OS, Hypervisor, cloud platform layer, Guest OS, and application layer. The overall security risks are summarized as shown in Figure 2.
The two-out-of-two structure consists of two channels. It detects whether a system failure has occurred by comparing the outputs of the two channels, and switches to safe operations in the event of a failure, thereby implementing a fail-safe. In cloud security computing platforms, this architecture can, to a certain extent, address security issues such as common-cause software failures. Based on the security risks analyzed at each layer, a two-out-of-two heterogeneous redundancy approach is adopted to enhance the platform’s security. The dual channels of X86 and ARM feature equivalent functions and a symmetric structure. A cloud security computing platform with a basic two-out-of-two safety architecture is constructed using X86 and ARM architectures, as shown in Figure 3.
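The comparison step of the two-out-of-two structure can be illustrated with a minimal sketch. The names `ChannelOutput`, `vote_2oo2`, and `SAFE_STATE`, as well as the use of a cycle counter to align the two results, are assumptions made for illustration and are not taken from the platform described here.

```python
from dataclasses import dataclass
from typing import Optional

SAFE_STATE = "SAFE_STATE"  # hypothetical marker for the fail-safe output


@dataclass(frozen=True)
class ChannelOutput:
    channel: str            # e.g. "x86" or "arm"
    cycle: int              # computation cycle the result belongs to
    command: Optional[str]  # None means the channel produced no result


def vote_2oo2(a: ChannelOutput, b: ChannelOutput) -> str:
    """Release a command only if both channels agree for the same cycle."""
    if a.command is None or b.command is None:
        return SAFE_STATE   # a missing result is treated as a failure
    if a.cycle != b.cycle:
        return SAFE_STATE   # results from different cycles are not comparable
    if a.command != b.command:
        return SAFE_STATE   # disagreement reveals a fault in one channel
    return a.command
```

Any mismatch is mapped to the safe side, which is why a two-out-of-two structure favors safety over availability: a single faulty channel is enough to stop the output.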
Within the private cloud, multiple redundant virtual machines are configured, each independently running functional train control software, to form a traditional layered fail-safe system. In this system, a two-out-of-two architecture requires two virtual machines to independently host these critical software components. Additionally, an Object Controller (OC) connects to the voter through the network, where the voter plays a vital role in communication control. To clarify how heterogeneous redundancy achieves differentiation at each level, the heterogeneous configurations of each layer are summarized in Table 2.

3.2. Heterogeneous Architecture Hierarchical Common-Cause Failure Analysis

The analyses of common-cause failure factors for the two-out-of-two heterogeneous redundant architecture are conducted separately for the hardware and software components.
For the hardware layer of the constructed cloud-based security computing platform, the heterogeneous redundant structure can provide sufficient hardware diversity to eliminate the impact caused by common-cause failures.
In terms of software, common-cause failures mainly refer to those brought about by the virtualization layer and the application layer. Fault Tree Analysis (FTA) has been demonstrated to facilitate the connection of unit faults in a system through logic gates, thereby simplifying the analysis of system failures and enabling the calculation of the probability of failure occurrence based on corresponding algorithms [23]. To clearly represent the software common-cause failure factors of the cloud security computing platform, Fault Tree Analysis is adopted for hazard-causing factor analysis. The fault tree cause analysis structure diagram of the entire cloud security computing platform is obtained, as shown in Figure 4.
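How logic gates turn unit fault probabilities into a top-event probability can be sketched as follows. This is only an illustration under stated assumptions: the tree shape (shared virtualization layers under an OR gate, diverse applications under an AND gate), the event names, and all probability values are hypothetical, and the basic events are assumed to be independent.

```python
from math import prod


def or_gate(probs):
    """P(at least one input event occurs), inputs assumed independent."""
    return 1.0 - prod(1.0 - p for p in probs)


def and_gate(probs):
    """P(all input events occur), inputs assumed independent."""
    return prod(probs)


# Hypothetical basic-event probabilities (not values from this paper).
p_host_os = 1e-4      # shared Host OS code fails
p_hypervisor = 1e-4   # shared Hypervisor code fails
p_cloud = 1e-4        # shared cloud platform code fails
p_app_x86 = 1e-3      # application on the X86 channel fails
p_app_arm = 1e-3      # application on the ARM channel fails

# A fault in any shared layer affects both channels at once (OR gate).
p_virtualization_ccf = or_gate([p_host_os, p_hypervisor, p_cloud])
# Diverse applications must both fail to defeat the redundancy (AND gate).
p_both_apps_fail = and_gate([p_app_x86, p_app_arm])
# Top event: loss of the platform's safety function.
p_top = or_gate([p_virtualization_ccf, p_both_apps_fail])
```

With numbers of this kind, the shared-code branch dominates the top event, which is the qualitative point the fault tree is meant to expose.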
Analysis shows that the hardware layer can achieve sufficient diversity through heterogeneous redundancy measures, thereby eliminating the impact of common-cause failures. However, for the Host OS layer, Hypervisor layer, and cloud platform layer, the inevitable issue of identical source codes remains even after configuring heterogeneous redundancy. Once a common-cause failure occurs, it will render virtual machines unable to be established or cause them to stop running, which in turn affects the overall functionality of the system.
The diversity achieved at each layer of the cloud safety computing platform under the heterogeneous redundancy architecture, from the physical hardware layer to the application layer, is compiled, and security measures are formulated according to the risk characteristics of each layer. Table 3 summarizes the diversity among the layers and the corresponding security measures.

4. Multi-Level Monitoring Mechanism

The preceding section shows that, in a cloud-based safety computing platform adopting the basic two-out-of-two safety architecture, common-cause software failures cannot be completely eliminated through heterogeneity. This section proposes a multi-level monitoring mechanism with strong diagnostic capability to mitigate the safety risks arising from common-cause software failures.

4.1. Monitoring Architecture Pattern

According to EN 50129 (Brussels, Belgium, 2018), redundant architecture patterns fall into the category of composite fail-safety, while monitoring architecture patterns belong to reactive fail-safety [24]. The key to reactive fail-safety lies in program sequence monitoring, which achieves program logic sequence monitoring and time sequence monitoring by placing checkpoints at specific positions and inspecting them periodically. Monitoring architecture patterns mainly include the monitor–actuator pattern, the safety execution pattern, and the three-level monitoring architecture pattern [25]. A comparison of these architecture patterns in terms of reliability, security, cost, variability, and execution time shows that the three-level monitoring architecture pattern and the secure channel pattern are similar in security and reliability, with relatively low costs; however, the three-level monitoring architecture pattern lags far behind the secure channel pattern in terms of impact on execution time [26]. The three-level monitoring architecture pattern, also known as the E-Gas architecture, adopts a three-layer design in which each layer has its own functions and an independent failure control path. Through logical combination and collaboration of these layers, the system can quickly enter a “fail-safe state” when a problem occurs. The core purpose of the secure channel pattern is to ensure that the system remains safe even when a major failure occurs in the main functions. It uses Automotive Safety Integrity Level (ASIL) decomposition to distribute high-safety-level system requirements across different subsystems, reduce the failure risk of the actuator channel during normal operation, and transfer safety control to the healthy channel.
In the field of automotive autonomous driving, based on three-level monitoring, the concept of hierarchical monitoring has been expanded, and a distributed safety mechanism (DSM) has been proposed [27]. DSM distributes the safety layer across processors with different ASIL ratings. It also adopts hardware-assisted virtual machines to isolate software modules and realizes the fault-free shutdown behavior of faulty software stacks. It can be used to address the issues arising from the growing number and complexity of integrated system chips and software stacks required for autonomous operations. The train operation control system with a superimposed structure also faces the problem of an increasing number and complexity of physical devices in the process of developing towards autonomous operation [28], which is similar to the problems encountered in automotive autonomous operations. Therefore, DSM is not only applicable to the automotive field but can also be used to address similar issues in the rail transit sector.
A specific example of the three-layer monitoring concept in the E-Gas architecture under the network architecture provided by the DSM architecture is shown in Table 4.
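The program sequence monitoring described in this subsection, with checkpoints that are inspected periodically for both logic order and timing, can be sketched as below. The class `SequenceMonitor`, its method names, and the checkpoint labels are hypothetical; the sketch assumes one fixed expected sequence per cycle and a single deadline.

```python
import time
from typing import Callable, List, Sequence


class SequenceMonitor:
    """Checks that checkpoints are hit in the expected order and in time."""

    def __init__(self, expected: Sequence[str], deadline_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.expected = list(expected)
        self.deadline_s = deadline_s
        self.clock = clock          # injectable so the monitor can be tested
        self.start_cycle()

    def start_cycle(self) -> None:
        """Reset the recorded checkpoints and restart the cycle timer."""
        self.seen: List[str] = []
        self.t0 = self.clock()

    def checkpoint(self, name: str) -> None:
        """Called by the monitored program at each instrumented position."""
        self.seen.append(name)

    def check(self) -> bool:
        """Periodic inspection: True if logic and time sequence are both OK."""
        logic_ok = self.seen == self.expected
        time_ok = (self.clock() - self.t0) <= self.deadline_s
        return logic_ok and time_ok
```

A skipped or reordered checkpoint fails the logic check, while a cycle that completes too late fails the time check, so the two kinds of monitoring named in the text map onto two independent conditions.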

4.2. Multi-Level Monitoring Architecture

Multiple redundant virtual machines are configured within the private cloud, where each virtual machine runs the train control function software independently and forms a traditional hierarchical fail-safe system together with the voter. Analysis of common-cause software failures across the Host OS layer, Hypervisor layer, and cloud platform layer indicates that the traditional architecturally heterogeneous redundant system still has an unavoidable issue of identical source code. Therefore, monitoring of common-cause software failures must first consider monitoring of virtual machines.
To overcome the limitations of traditional architectures, a new multi-level monitoring architecture pattern is proposed. This study draws on the advantages of the ASIL decomposition technology in the secure channel pattern and the three-level monitoring principle of the E-Gas architecture and combines the DSM to enhance the ability of the cloud-based safety computing platform to prevent common-cause failures.
The proposed implementation is as follows:
  • The train control function software running on redundant virtual machines corresponds to L1 of the conventional application functions under the DSM architecture.
  • The monitoring state machine software based on runtime verification is built on another private cloud virtual machine to monitor L1 functions through the function channel, supporting the function monitor L2 under the DSM architecture. Runtime verification is a lightweight verification technique that combines testing and model checking. It verifies the system by monitoring whether the actual execution path of the target system meets the specified monitoring properties [29].
  • In the DSM architecture, L3 needs to be configured in an independent Microcontroller Unit (MCU). However, there is no such MCU in the private cloud. Consequently, the L3 software is deployed on another virtual machine in the private cloud that is configured for redundancy. This virtual machine monitors the normal operation of L1 and L2 through a challenge–response mechanism.
  • Under the DSM architecture, L4 is essentially an “external safety monitoring unit independent of the core functional layer,” which is highly aligned with the core requirements of the train control system, namely “safety redundancy, global monitoring, and failure emergency response.” L4 does not run on the private cloud; instead, it monitors the functional controllers in the two-out-of-two heterogeneous redundant architecture from the outside and forces a transition to a safe state when a hazard is detected. Furthermore, to incorporate the idea of safe degradation of the limp-home channel from the secure channel pattern, it is necessary to add a safe degradation function to the voter.
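The runtime verification used by the L2 function monitor can be illustrated with a small monitor state machine that observes the event trace emitted by L1 and checks it against one property. The property chosen here (an `overspeed` event must be followed immediately by a `brake` event), the event names, and the class name are hypothetical and serve only to show the mechanism.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    VIOLATION = "violation"


class BrakeAfterOverspeedMonitor:
    """Hypothetical property: once 'overspeed' is observed, the next event
    from L1 must be 'brake'."""

    def __init__(self) -> None:
        self.state = "idle"

    def observe(self, event: str) -> Verdict:
        if self.state == "violated":
            return Verdict.VIOLATION      # violations are latched
        if self.state == "idle":
            if event == "overspeed":
                self.state = "await_brake"
            return Verdict.OK
        # Remaining case: state == "await_brake"
        if event == "brake":
            self.state = "idle"
            return Verdict.OK
        self.state = "violated"
        return Verdict.VIOLATION


def check_trace(events):
    """Feed a whole trace to a fresh monitor and return the final verdict."""
    monitor = BrakeAfterOverspeedMonitor()
    verdict = Verdict.OK
    for event in events:
        verdict = monitor.observe(event)
    return verdict
```

The monitor only consumes the actual execution path, which is what keeps runtime verification lightweight compared with exhaustive model checking.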
Monitoring of L1 is achieved through the monitor state machine software based on runtime verification in L2, which continuously monitors the behavior of critical software such as the on-board Automatic Train Protection (ATP). However, since L2 is located within the virtual machine, it cannot distinguish between a virtual machine error and a common-cause failure error in its underlying Host OS layer or Hypervisor layer. Therefore, L3, which is located outside this virtual machine, is required to monitor these two layers through a challenge–response mechanism that proactively initiates challenges and verifies the legitimacy of responses. If L2 malfunctions or responds abnormally, L3 can determine whether the error occurs in L1 or L2. L4 adds a safety degradation function to the cloud-based secure computing platform following a failure, and the final security assurance mechanism is an enhanced DSM + voting function.
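A challenge–response exchange of the kind L3 uses can be sketched as follows. The text does not specify the algorithm, so the use of a pre-shared key with HMAC-SHA256, the 16-byte random challenge, and the function names are all assumptions made for this example.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def make_challenge() -> bytes:
    """L3 side: generate a fresh, unpredictable challenge."""
    return secrets.token_bytes(16)


def respond(key: bytes, challenge: bytes) -> bytes:
    """Monitored side (e.g. L2): prove liveness by keying the challenge."""
    return hmac.new(key, challenge, hashlib.sha256).digest()


def verify(key: bytes, challenge: bytes, response: Optional[bytes],
           elapsed_s: float, timeout_s: float) -> bool:
    """L3 side: accept only a correct response that arrived in time."""
    if response is None or elapsed_s > timeout_s:
        return False                      # silence or lateness is a failure
    expected = hmac.new(key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

Because every challenge is fresh, a component that has stopped running cannot pass by replaying an old answer, and a missing or late answer is handled exactly like a wrong one.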
Based on the above comparative analysis with the DSM structure, the final multi-level monitoring architecture model is proposed. Firstly, in the first layer, the task of rail transit signal processing is undertaken by the application software L1, which focuses on executing the basic functions of the train control system.
In the second layer, the second-level monitor L2, which is configured in the corresponding virtual machine to monitor L1, runs in parallel to implement runtime verification of the system’s main functions.
Finally, on the virtual machines supported by heterogeneous hardware servers, the third-level monitor L3 is deployed. It monitors the Hypervisor layer of the entire system and the second-level monitor L2, ensuring the continuity and security of the system’s operation.
The first, second, and third layers collectively form a single channel. On this basis, a dual-channel structure is adopted to constitute a two-out-of-two voting architecture. The fourth layer is the fourth-level monitor L4, which is responsible for voting on L1 of the two channels and conducting real-time monitoring on L3 of the two channels. The fault-handling process is governed by the “symmetry principle”, which stipulates that the command logic when faults occur in the two channels is completely symmetrical. In the event of a hazard being detected, the system is switched to operate in a degraded mode.
Each layer is equipped with separate failure management measures, and through a specific logical combination, it is ensured that the system can be transferred to a degraded state via L4 when necessary. Such a hierarchical structure helps to identify safety-related failures, and through the safety-handling mechanisms of each layer, it can quickly limit or stop the improper output of the system, thus ensuring that the system enters a safe mode and preventing the potential risk of expanding into a hazard. The schematic diagram of the final overall architecture is shown in Figure 5.
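One possible reading of the L4 decision logic is sketched below. The mode names, the function `l4_decide`, and in particular the rule that degraded mode forwards the output of the single healthy channel are assumptions for illustration; the text states only that L4 votes on L1, monitors L3, treats both channels symmetrically, and switches to a degraded mode when a hazard is detected.

```python
from enum import Enum
from typing import Optional, Tuple


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"
    SAFE = "safe"


def l4_decide(out_a: str, l3_ok_a: bool,
              out_b: str, l3_ok_b: bool) -> Tuple[Mode, Optional[str]]:
    """Combine voting on L1 outputs with the health reported by each L3."""
    if l3_ok_a and l3_ok_b:
        if out_a == out_b:
            return Mode.NORMAL, out_a     # both healthy and in agreement
        return Mode.SAFE, None            # healthy channels disagree
    if l3_ok_a != l3_ok_b:
        # Exactly one channel is healthy (assumed degradation rule).
        healthy_out = out_a if l3_ok_a else out_b
        return Mode.DEGRADED, healthy_out
    return Mode.SAFE, None                # neither channel is healthy
```

Swapping the two channels' arguments gives the same decision, which is one way to express the symmetry principle in code.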

5. Security Analysis of Cloud-Based Safety Computing Platform Multi-Level Monitoring Architecture

To verify the rationality of the proposed multi-level monitoring architecture scheme in this section, from the perspective of safety, the Markov method and the reliability block diagram method are adopted to quantitatively analyze the safety in the following scenarios: single channel with L2 and L3, dual-channel L4 without degradation or monitoring, and dual-channel L4 with complete functions. The Markov method and reliability block diagram method, as outlined in the IEC 61508 standard, encompass factors such as fault detection, repair, and common-cause failures [30]. These methods facilitate an objective and comprehensive analysis of the structural relationships between systems. However, a limitation of the reliability block diagram method is that it is unable to demonstrate the changes and transition processes between various states in the system. For these reasons, this section adopts the approach of taking the Markov method as the mainstay and the reliability block diagram method as a supplement to analyze the safety of the multi-level monitoring architecture pattern.

5.1. Single-Channel Structure with L2 and L3

This paper is the first to incorporate common-cause failures of cloud-based train control systems into a formal model. Furthermore, by incorporating the “fault detection–repair–degradation” behavior of the multi-level monitoring system into the Markov model, it becomes a quantitative verification tool for architectural design. The Markov safety model of a single channel with a second-level monitor L2 and a third-level monitor L3 is shown in Figure 6.
The state transition diagram for a single-channel structure with L2 and L3 contains 14 states. The definitions of the parameters in the figure are shown in Table 5.
IEC 61508 recommends a hardware failure rate value of $1 \times 10^{-5}/\mathrm{h}$. According to the American national standard ANSI/AIAAR-103-1992, the failure rate of civil software generally ranges between $1 \times 10^{-3}/\mathrm{h}$ and $1 \times 10^{-4}/\mathrm{h}$. The Mean Time Between Failures (MTBF) of the Hygon X86 server and Phytium ARMv8 server used in this study are 220,000 h and 100,000 h, respectively. Consequently, the failure rate of the X86 server has been set at $\lambda_{HX86} = 4.546 \times 10^{-6}/\mathrm{h}$, while that of the ARMv8 server has been set at $\lambda_{HARM} = 1 \times 10^{-5}/\mathrm{h}$. For L1, L2, and L3 level software, the failure rate is set to $1 \times 10^{-4}/\mathrm{h}$. For L4 safety-critical equipment with a safety integrity level not lower than SIL4, the failure rates of its hardware and software should be lower than those of conventional hardware and software in order to achieve safe and reliable operation. Consequently, the hardware failure rate is set to $1 \times 10^{-7}/\mathrm{h}$, and the software failure rate is set to $1 \times 10^{-5}/\mathrm{h}$. The range of values for the hardware common-cause failure coefficient is $\beta_H \in [1\%, 20\%]$.
For the detectable dangerous failure state DD and the undetectable failure state DU, the repair rates are $\mu_{DD} = 1/MTTR$ and $\mu_{DU} = 1/MRT$, where MTTR stands for Mean Time To Repair and MRT stands for Mean Repair Time. According to IEC 61508, $MTTR = MRT$ is considered; therefore, the repair rates are uniformly set to $\mu = 1/MTTR$.
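The server failure rates follow from the quoted MTBF values under the usual constant-failure-rate assumption, and the common-cause coefficient can be applied with the standard beta-factor split. The sketch below is illustrative: the function names are hypothetical, and treating $\beta_H$ as a simple beta-factor is an assumption rather than something the text spells out.

```python
def failure_rate_per_hour(mtbf_hours: float) -> float:
    """Constant failure rate implied by an MTBF, in failures per hour."""
    return 1.0 / mtbf_hours


def beta_split(total_rate: float, beta: float):
    """Beta-factor model: split a rate into independent and common-cause parts."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (1.0 - beta) * total_rate, beta * total_rate


# MTBF values quoted in the text.
lambda_h_x86 = failure_rate_per_hour(220_000)   # Hygon X86 server
lambda_h_arm = failure_rate_per_hour(100_000)   # Phytium ARMv8 server

# Hypothetical choice of beta_H inside the stated 1%..20% range.
independent_arm, common_cause_arm = beta_split(lambda_h_arm, beta=0.10)
```

This gives roughly $4.5 \times 10^{-6}/\mathrm{h}$ for the X86 server and $1 \times 10^{-5}/\mathrm{h}$ for the ARMv8 server, in line with the values set above.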
The meanings of each state are as follows:
State 1: L1, L2, and L3 all operate normally (W).
State 2: L1 and L2 are in an $R_{DD}$ state; L3 is normal.
State 3: L1 and L2 are in a DU state; L3 is normal.
State 4: L3 is in an $R_{DD}$ state; L1 and L2 are normal.
State 5: L3 is in a DU state; L1 and L2 are normal.
State 6: L1, L2, and L3 are all in an $R_{DD}$ state.
State 7: L1, L2, and L3 are all in a DU state.
State 8: L1 and L2 are in an $R_{DD}$ state; L3 is in a DU state.
State 9: L1 and L2 are in a DU state; L3 is in an $R_{DD}$ state.
State 10: L1 transitions from DU to $R_{DU}$ during testing; L2 and L3 are normal.
State 11: L2 transitions from DU to $R_{DU}$ during testing; L1 and L3 are normal.
State 12: L3 transitions from DU to $R_{DU}$ during testing; L1 and L2 are normal.
State 13: L1 and L2 transition from DU to $R_{DU}$ during testing; L3 is normal.
State 14: L1, L2, and L3 transition from DU to $R_{DU}$ during testing.
The Markov transition matrix of the single-channel structure with L2 and L3 is as follows:
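$$
M_{23} = \left[\begin{array}{*{14}{c}}
m_{11} & \mu & 0 & \mu & 0 & \mu/2 & 0 & 0 & 0 & \mu & \mu & \mu/2 & \mu/2 & \mu/2 \\
\lambda_{12DD} & m_{22} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\lambda_{12DU} & 0 & m_{33} & 0 & 0 & 0 & 0 & 0 & \mu & 0 & 0 & 0 & 0 & 0 \\
\lambda_{3DD} & 0 & 0 & m_{44} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\lambda_{3DU} & 0 & 0 & 0 & m_{55} & 0 & 0 & \mu & 0 & 0 & 0 & 0 & 0 & 0 \\
\beta_{123D}\lambda_{DD} & \lambda_{3DD} & 0 & \lambda_{12DD} & 0 & -\mu/2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\beta\lambda_{DU} & 0 & \lambda_{3DU} & 0 & \lambda_{12DU} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \lambda_{3DU} & 0 & 0 & \lambda_{12DD} & 0 & 0 & -\mu & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & \lambda_{3DD} & \lambda_{12DU} & 0 & 0 & 0 & 0 & -\mu & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu/2 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu/2 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu/2
\end{array}\right]
$$
$$
\begin{aligned}
m_{11} &= -(\lambda_{12DD} + \lambda_{12DU} + \lambda_{3DD} + \lambda_{3DU} + \beta_{123D}\lambda_{DD} + \beta\lambda_{DU}) \\
m_{22} &= -(\mu + \lambda_{3DD} + \lambda_{3DU}) \\
m_{33} &= -(\lambda_{3DD} + \lambda_{3DU}) \\
m_{44} &= -(\mu + \lambda_{12DD} + \lambda_{12DU}) \\
m_{55} &= -(\lambda_{12DU} + \lambda_{12DD})
\end{aligned}
$$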
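As a sanity check on the transition matrix of the single-channel model, the sketch below assembles it numerically and verifies that every column sums to zero, which is the defining property of a Markov generator in the column convention $dP/dt = MP$. It assumes that the diagonal entries carry a minus sign, as is usual for such matrices. The function names and all numeric rates are hypothetical placeholders chosen for illustration; `cc_dd` and `cc_du` stand for the common-cause products $\beta_{123D}\lambda_{DD}$ and $\beta\lambda_{DU}$.

```python
def build_m23(l12dd, l12du, l3dd, l3du, cc_dd, cc_du, mu):
    """14x14 generator of the single-channel model (column convention dP/dt = M P).

    Keys of `entries` are 1-based (row, col) = (target state, source state).
    """
    m = [[0.0] * 14 for _ in range(14)]
    entries = {
        # repairs that return the system to state 1
        (1, 2): mu, (1, 4): mu, (1, 6): mu / 2, (1, 10): mu, (1, 11): mu,
        (1, 12): mu / 2, (1, 13): mu / 2, (1, 14): mu / 2,
        # first failures out of state 1
        (2, 1): l12dd, (3, 1): l12du, (4, 1): l3dd, (5, 1): l3du,
        (6, 1): cc_dd, (7, 1): cc_du,
        # partial repairs
        (3, 9): mu, (5, 8): mu,
        # second failures
        (6, 2): l3dd, (6, 4): l12dd,
        (7, 3): l3du, (7, 5): l12du,
        (8, 2): l3du, (8, 5): l12dd,
        (9, 3): l3dd, (9, 4): l12du,
        # diagonal entries (assumed negative)
        (1, 1): -(l12dd + l12du + l3dd + l3du + cc_dd + cc_du),
        (2, 2): -(mu + l3dd + l3du), (3, 3): -(l3dd + l3du),
        (4, 4): -(mu + l12dd + l12du), (5, 5): -(l12du + l12dd),
        (6, 6): -mu / 2, (8, 8): -mu, (9, 9): -mu,
        (10, 10): -mu, (11, 11): -mu, (12, 12): -mu / 2,
        (13, 13): -mu / 2, (14, 14): -mu / 2,
    }
    for (row, col), value in entries.items():
        m[row - 1][col - 1] = value
    return m


def column_sums(m):
    """Sum of each column; all should vanish for a valid generator."""
    return [sum(m[r][c] for r in range(len(m))) for c in range(len(m))]


dc = 0.9  # illustrative diagnostic coverage
M23 = build_m23(
    l12dd=dc * 4.546e-6, l12du=(1 - dc) * 4.546e-6,
    l3dd=dc * 1e-5, l3du=(1 - dc) * 1e-5,
    cc_dd=1e-7, cc_du=1e-8,  # placeholder common-cause rates
    mu=1 / 8,
)
print(max(abs(s) for s in column_sums(M23)))
```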
In addition, a connection matrix is required to infer the situation in the next stage from the initial conditions. The connection matrix for the inspection test phase is as follows:
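$$
L_{23} = \left[\begin{array}{*{14}{c}}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}\right]
$$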
The initial state is as follows:
$$
P_1(0) = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^{T}
$$
The state at any moment $t = i\tau + \zeta$ within the inspection test interval is as follows:
$$
P(t) = e^{\zeta M}\left(L\,e^{\tau M}\right)^{i} P_1(0), \quad i\tau \le t < (i+1)\tau, \quad \zeta = t \bmod \tau
$$
Therefore, the probability that the single-channel architecture with L2 and L3 is in a dangerous failure state at time $t$ is as follows:
$$
P(t) = P_{(7,1)}(t) + P_{(8,1)}(t) + P_{(9,1)}(t)
$$
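The multiphase evolution above (continuous drift between inspection tests, then an instantaneous remapping by the connection matrix at each test) can be illustrated on a much smaller model. The sketch below is not the paper's 14-state model: it uses a hypothetical three-state chain (working, DU, under repair) with illustrative rates, and a plain-Python matrix exponential based on scaling and squaring so that no external library is assumed.

```python
import math


def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_exp(m, t, terms=20):
    """exp(m * t) via scaling-and-squaring with a truncated Taylor series."""
    n = len(m)
    norm = max(sum(abs(x) for x in row) for row in m) * abs(t)
    squarings = max(0, math.ceil(math.log2(norm)) + 2) if norm > 0 else 0
    scale = t / (2 ** squarings)
    a = [[x * scale for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def state_at(t, m, link, p0, tau):
    """P(t) = exp(zeta*M) (L exp(tau*M))^i P(0), with i = floor(t/tau), zeta = t - i*tau."""
    i = int(t // tau)
    zeta = t - i * tau
    step = mat_mul(link, mat_exp(m, tau))
    p = [[x] for x in p0]
    for _ in range(i):
        p = mat_mul(step, p)
    p = mat_mul(mat_exp(m, zeta), p)
    return [row[0] for row in p]


lam, mu, tau = 1e-5, 1 / 8, 2190.0  # illustrative rates (per hour) and test interval (h)
M = [[-lam, 0.0, mu],   # state 0: working
     [lam, 0.0, 0.0],   # state 1: dangerous undetected (DU)
     [0.0, 0.0, -mu]]   # state 2: under repair after the inspection test
L = [[1, 0, 0],
     [0, 0, 0],
     [0, 1, 1]]         # the test moves all DU probability into repair

p_mid = state_at(0.5 * tau, M, L, [1.0, 0.0, 0.0], tau)
p_after = state_at(tau, M, L, [1.0, 0.0, 0.0], tau)
print(p_mid, p_after)
```

In this toy model the DU probability grows as $1 - e^{-\lambda t}$ within the first interval and is reset to zero immediately after the test, which is the behavior the connection matrix encodes.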
The average frequency of dangerous failure (PFH) is applicable to systems that require continuous operation to ensure safety, such as train protection systems [31].
IEC 61508 provides the general formula for PFH calculation as follows:
$$
PFH(T) = \frac{1}{T}\int_{0}^{T} w(t)\,dt
$$
Here, $w(t)$ is the unconditional failure frequency. IEC 61508 defines the PFH as the average value of $w(t)$ over the operating cycle. In the Markov state transition diagram, State DU represents the dangerous failure state of the unit; therefore, $w(t)$ can be understood as the sum of the transition rates from all other states into State DU per unit time. Each contribution is the product of the probability $P_i(t)$ of currently being in State $i$ and the transition rate $\lambda_{iD}$ from State $i$ to the failure state DU, which cannot be detected by inspection tests. The expression for $w(t)$ is as follows:
$$
w(t) = \sum_{i \neq D} \lambda_{iD}\,P_i(t)
$$
In addition, the formula for calculating PFH using the Markov method, which is required in this paper, is derived as follows:
$$
PFH(T) = \frac{1}{T}\int_{0}^{T} \sum_{i \neq D} \lambda_{iD}\,P_i(t)\,dt = \frac{1}{T}\int_{0}^{T} M_{(2,1)}\,P_{(1,1)}(t)\,dt
$$
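The time-averaging step can be made concrete with a short numerical sketch. It assumes the simplest possible case, a single working state feeding the DU state at a hypothetical rate `lam_du`, so that $w(t) = \lambda e^{-\lambda t}$ and the average has the closed form $(1 - e^{-\lambda\tau})/\tau$ to compare against. The helper name `pfh_average` and the numbers are illustrative only.

```python
import math


def pfh_average(w, horizon, steps=10_000):
    """Time-average of the failure frequency w(t) over [0, horizon] (trapezoidal rule)."""
    h = horizon / steps
    total = 0.5 * (w(0.0) + w(horizon))
    total += sum(w(k * h) for k in range(1, steps))
    return total * h / horizon


lam_du = 1e-5  # illustrative rate into the dangerous undetected state, per hour
tau = 2190.0   # inspection test interval in hours


def w(t):
    # Only the working state feeds DU here, and P_working(t) = exp(-lam_du * t).
    return lam_du * math.exp(-lam_du * t)


pfh = pfh_average(w, tau)
exact = (1.0 - math.exp(-lam_du * tau)) / tau
print(pfh, exact)
```

For a full model, `w` would instead sum the products of each entering rate and the corresponding state probability obtained from the Markov solution.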
The dangerous failure states of the single-channel architecture with L2 and L3 are States 7, 8, and 9. States 3 and 5 transition to State 7 with rates $M_{7,3}$ and $M_{7,5}$, respectively; States 2 and 5 transition to State 8 with rates $M_{8,2}$ and $M_{8,5}$, respectively; and States 3 and 4 transition to State 9 with rates $M_{9,3}$ and $M_{9,4}$, respectively. The PFH of the single-channel architecture with L2 and L3 can then be expressed as follows:
$$
PFH_{23} = \frac{1}{\tau}\int_{0}^{\tau}\Big( M_{8,2}\,P_{(2,1)}(t) + \left(M_{7,3} + M_{9,3}\right)P_{(3,1)}(t) + M_{9,4}\,P_{(4,1)}(t) + \left(M_{7,5} + M_{8,5}\right)P_{(5,1)}(t) \Big)\,dt
$$
where $\tau$ denotes the test time interval. Consider the single-channel scenario in which L1 and L2 run on X86 hardware and L3 runs on ARM hardware. As stated above, $\lambda_{12} = 4.546 \times 10^{-6}$/h and $\lambda_{3} = 1 \times 10^{-5}$/h. Referring to IEC 61508, the inspection test interval is $\tau = 2190$ h (three months), and the single-channel repair rate is $\mu = 1/8\ \mathrm{h}^{-1}$ (in IEC 61508, MTTR = 8 h). The value of $\beta_{123}$ is taken as 2%. The variation in PFH for the single-channel structure with L2 and L3 is shown in Figure 7.
As can be seen from Figure 7, even when the diagnostic coverage (DC) is greater than or equal to 90%, the PFH does not fall within the range $[10^{-9}, 10^{-8})$/h, so the structure fails to meet the SIL4 requirements defined in IEC 61508.
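The SIL4 criterion applied here is one of four PFH bands that IEC 61508 defines for the high-demand or continuous mode of operation. A small helper, sketched below with the hypothetical name `sil_for_pfh`, makes the classification explicit; values outside all four bands are reported as 0.

```python
def sil_for_pfh(pfh_per_hour: float) -> int:
    """Return the SIL (1-4) whose high-demand/continuous band contains the PFH, else 0."""
    bands = [
        (1e-9, 1e-8, 4),
        (1e-8, 1e-7, 3),
        (1e-7, 1e-6, 2),
        (1e-6, 1e-5, 1),
    ]
    for low, high, sil in bands:
        if low <= pfh_per_hour < high:
            return sil
    return 0


for value in (5e-9, 2e-8, 3e-6):
    print(value, sil_for_pfh(value))
```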
Since it is necessary to study the dual-channel two-out-of-two structure with L2, L3, and L4 subsequently, for the convenience of calculation, it is necessary to obtain the safety equivalent parameters of a single channel with L2 and L3.
The detectable dangerous failure rate $\lambda_{eDD}$ of a single channel can be regarded as the failure rate that leads to a detectable dangerous failure of the entire single channel. In addition, the PFH of the single channel is taken as the equivalent undetectable dangerous failure rate of the single channel, denoted as $\lambda_{eDU}$.
The diagnostic coverage rate is as follows:
$$
DC_e = \frac{\lambda_{eDD}}{\lambda_{eDD} + \lambda_{eDU}}
$$
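The relationship between the equivalent rates and the diagnostic coverage can be exercised in both directions. The sketch below uses two hypothetical helpers and an illustrative total dangerous rate; the split $\lambda_{DD} = DC \cdot \lambda_D$, $\lambda_{DU} = (1 - DC) \cdot \lambda_D$ is the standard IEC 61508 decomposition.

```python
def equivalent_dc(lambda_e_dd: float, lambda_e_du: float) -> float:
    """Diagnostic coverage implied by detectable and undetectable dangerous rates."""
    total = lambda_e_dd + lambda_e_du
    if total <= 0:
        raise ValueError("the total dangerous failure rate must be positive")
    return lambda_e_dd / total


def split_by_dc(lambda_d: float, dc: float):
    """Split a dangerous failure rate into (detected, undetected) parts for a given DC."""
    if not 0.0 <= dc <= 1.0:
        raise ValueError("DC must lie in [0, 1]")
    return dc * lambda_d, (1.0 - dc) * lambda_d


lam_dd, lam_du = split_by_dc(2.0e-5, 0.90)  # illustrative values only
print(lam_dd, lam_du, equivalent_dc(lam_dd, lam_du))
```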

5.2. The Dual-Channel Two-out-of-Two Structure with L2, L3, and L4

The Markov safety model of the dual-channel two-out-of-two structure with the second-level monitor L2, the third-level monitor L3, and the fourth-level monitor L4 is shown in Figure 8.
The meanings of each state are as follows:
State 1: Both dual channels and the fourth-level monitor L4 are operating normally.
State 2: Either of the single channels is in the detectable repair state $R_{DD}$, and the fourth-level monitor L4 is operating normally.
State 3: Either of the single channels is in the undetectable state DU, and the fourth-level monitor L4 is operating normally.
State 4: Both channels are in the detectable repair state $R_{DD}$, and the fourth-level monitor L4 is operating normally.
State 5: Both channels are in the undetectable state DU, and the fourth-level monitor L4 is operating normally.
State 6: One single channel is in the detectable repair state $R_{DD}$, the other single channel is in the undetectable state DU, and the fourth-level monitor L4 is operating normally.
State 7: One single channel and the fourth-level monitor L4 are in the detectable repair state $R_{DD}$.
State 8: One single channel is in the detectable repair state $R_{DD}$, and the fourth-level monitor L4 is in the undetectable state DU.
State 9: One single channel is in the undetectable state DU, and the fourth-level monitor L4 is in the detectable repair state $R_{DD}$.
State 10: One single channel and the fourth-level monitor L4 are in the undetectable state DU.
State 11: Both channels and the fourth-level monitor L4 are in the detectable repair state $R_{DD}$.
State 12: Both channels are in the detectable repair state $R_{DD}$, and the fourth-level monitor L4 is in the undetectable state DU.
State 13: Both channels are in the undetectable state DU, and the fourth-level monitor L4 is in the detectable repair state $R_{DD}$.
State 14: Both channels and the fourth-level monitor L4 are in the undetectable state DU.
State 15: One single channel and the fourth-level monitor L4 are in the detectable repair state $R_{DD}$, and the other single channel is in the undetectable state DU.
State 16: One single channel and the fourth-level monitor L4 are in the undetectable state DU, and the other single channel is in the detectable repair state $R_{DD}$.
State 17: Both channels are operating normally, and the fourth-level monitor L4 is in the detectable repair state $R_{DD}$.
State 18: The fourth-level monitor L4 is in the undetectable state DU.
States 19–30: Any component that was in the undetectable state DU is in the undetectable repair state $R_{DU}$ after the inspection test time point arrives.
In the dual-channel two-out-of-two structure with L2, L3, and L4, when both dual channels are in a failed state and the L4 is also in a failed state and unable to provide the degradation function, the system will be in a dangerous failure state. Therefore, it is considered that States 11 to 16 correspond to the situation where the entire system enters a dangerous failure state.
The Markov transition matrix of the dual-channel two-out-of-two structure with L2, L3, and L4 within the inspection test interval is as follows:
M 234 = m 11 μ 0 μ / 2 0 0 μ / 2 0 0 0 μ / 3 0 0 0 0 0 μ 0 μ μ / 2 μ / 2 μ / 2 μ / 2 μ / 2 μ / 3 μ / 3 μ / 3 μ / 3 μ / 3 μ 2 λ D D m 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 λ D U 0 m 33 0 0 μ 0 0 μ 0 0 0 0 0 μ / 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 β D λ D D λ D D 0 m 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 β λ D U 0 λ D U 0 m 55 0 0 0 0 0 0 0 μ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ D U λ D D 0 0 m 66 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D D 0 0 0 0 m 77 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D U 0 0 0 0 0 m 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D D 0 0 0 0 0 m 99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D U 0 0 0 0 0 0 m 10 0 0 0 0 0 μ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D D 0 0 λ D D 0 0 0 μ / 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D U 0 0 0 λ D D 0 0 0 μ / 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D D 0 0 0 λ D U 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D U 0 0 0 0 λ D U 0 0 μ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D D λ D U 0 λ D D 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D U 0 λ D U 0 λ D D 0 0 0 0 μ / 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D D 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 λ 4 D U 0 0 0 0 0 0 μ 0 0 0 0 0 0 0 0 μ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ / 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ / 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ / 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ / 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ / 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ / 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ / 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ / 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ / 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 μ / 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 μ
m 11 = ( 2 λ D D + 2 λ D U + β D λ D D + β λ D U + λ 4 D D + λ 4 D U ) m 22 = ( μ + λ D D + λ D U + λ 4 D D + λ 4 D U ) m 33 = ( λ D U + λ D D + λ 4 D D + λ 4 D U ) m 44 = ( μ / 2 + λ 4 D D + λ 4 D U ) m 55 = ( λ 4 D D + λ 4 D U ) m 66 = ( λ 4 D D + λ 4 D U ) m 77 = ( μ / 2 + λ D D + λ D U ) m 88 = ( μ + λ D D + λ D U ) m 99 = ( μ + λ D D + λ D U ) m 10 = ( λ D D + λ D U )
The connection matrix is omitted here due to its excessively high dimensionality.
The initial state is as follows:
$$
P_1(0) = 1, \qquad P_i(0) = 0, \quad i = 2, 3, 4, \ldots, 30
$$
Therefore, the probability that the dual-channel two-out-of-two structure with L2, L3, and L4 is in a dangerous failure state at time $t$ is as follows:
$$
P(t) = P_{(11,1)}(t) + P_{(12,1)}(t) + P_{(13,1)}(t) + P_{(14,1)}(t) + P_{(15,1)}(t) + P_{(16,1)}(t)
$$
For the dual-channel two-out-of-two structure with L2, L3, and L4, the corresponding dangerous failure states are 11 to 16, and the PFH can be expressed as follows:
$$
\begin{aligned}
PFH_{234} = \frac{1}{\tau}\int_{0}^{\tau}\Big( &\left(M_{11,4} + M_{12,4}\right)P_{(4,1)}(t) + M_{14,5}\,P_{(5,1)}(t) + \left(M_{15,6} + M_{16,6}\right)P_{(6,1)}(t) \\
&+ \left(M_{11,7} + M_{15,7}\right)P_{(7,1)}(t) + \left(M_{12,8} + M_{16,8}\right)P_{(8,1)}(t) \\
&+ \left(M_{13,9} + M_{15,9}\right)P_{(9,1)}(t) + \left(M_{14,10} + M_{16,10}\right)P_{(10,1)}(t) \Big)\,dt
\end{aligned}
$$
The safety model using the reliability block diagram method for the two-out-of-two structure with L2, L3, and L4 is shown in Figure 9.
As can be seen from the figure, the relationship between independent failures and common-cause failures in the reliability block diagram is in series.
Using the method provided by PDS [32], the PFH values of the entire L1, L2, and L3 are calculated as follows:
$$
PFH_{123\_2} = PFH_{12} \cdot PFH_{3} \cdot \tau + C_{1oo4}\,\beta_{123}\sqrt{\lambda_{12DU}\,\lambda_{3DU}}
$$
where $\tau$ denotes the inspection test time interval, and $C_{1oo4}$ denotes the multiplicative coefficient used to quantify the common-cause failure, which can be obtained from the values tabulated in the PDS method.
For a single channel, the PFH is equal to its undetectable failure rate.
Then, the PFH of L4 is as follows:
$$
PFH_{4} = \lambda_{4DU}
$$
The overall PFH is obtained as follows:
$$
PFH_{1234} = PFH_{123\_2} \cdot PFH_{4} \cdot \tau + C_{1oo5}\,\beta_{1234}\sqrt[3]{\lambda_{12DU}\,\lambda_{3DU}\,\lambda_{4DU}} + (1 - K)\,\lambda_{4DU}
$$
Here, $K$ is the conversion coefficient for switching to the degraded mode. According to IEC 61508, the value of $K$ is 0.98.
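The reliability-block-diagram calculation can be sketched as two small functions. This is an illustration under stated assumptions, not the authors' implementation: the common-cause term is taken to use the geometric mean of the undetected rates (square root for two rates, cube root for three), and the values passed for `c_1oo4` and `c_1oo5` are placeholders rather than figures looked up in the PDS handbook. All other inputs are likewise illustrative.

```python
import math


def pfh_123(pfh_12, pfh_3, tau, c_1oo4, beta_123, lam_12_du, lam_3_du):
    """PFH of the combined L1/L2 and L3 branches: independent part plus common cause."""
    independent = pfh_12 * pfh_3 * tau
    common_cause = c_1oo4 * beta_123 * math.sqrt(lam_12_du * lam_3_du)
    return independent + common_cause


def pfh_1234(pfh_123_2, lam_4_du, tau, c_1oo5, beta_1234, lam_12_du, lam_3_du, k):
    """Overall PFH including L4, with PFH_4 taken equal to lam_4_du."""
    independent = pfh_123_2 * lam_4_du * tau
    common_cause = c_1oo5 * beta_1234 * (lam_12_du * lam_3_du * lam_4_du) ** (1 / 3)
    degraded = (1.0 - k) * lam_4_du  # share of L4 failures not covered by degradation
    return independent + common_cause + degraded


# Placeholder inputs, chosen only to exercise the formulas.
x = pfh_123(1e-6, 1e-6, 2190.0, c_1oo4=0.3, beta_123=0.02,
            lam_12_du=4e-6, lam_3_du=9e-6)
y = pfh_1234(x, lam_4_du=1e-7, tau=2190.0, c_1oo5=0.2, beta_1234=0.02,
             lam_12_du=4e-6, lam_3_du=9e-6, k=0.98)
print(x, y)
```

With these placeholder inputs the common-cause term dominates both results, which is consistent with the emphasis the text places on the common-cause failure coefficient.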

5.3. The Dual-Channel Two-out-of-Two Structure with L2, L3 and L4 (L4 Has a Voting-Based Degradation Function but No Monitoring Function)

In order to prove the safety and necessity of the dual-channel two-out-of-two structure with L2, L3, and L4 (which has the functions of voting, monitoring, and degradation simultaneously), it is also necessary to analyze the safety of this structure when it does not have the monitoring and degradation functions. The previous reliability analysis has shown that the absence of the degradation function will have a significant impact on the reliability. Therefore, when analyzing the safety, it is also necessary to ensure the reliability. So, only the situation where L4 has the degradation mode but does not have the monitoring function will be analyzed below, and then, it will be compared with the situation where L4 has both the monitoring and degradation mode functions.
The Markov safety analysis of the dual-channel two-out-of-two structure with L2, L3, and L4 (where L4 has the voting-based degradation function but no monitoring function) still refers to the model shown in Figure 8, but the dangerous failure states used to calculate the PFH differ from those described previously.
Since L4 does not have the monitoring function, the challenge–response mechanism formed between L3 and L4 fails. This means that when L4 is in a failed state, the single channel with L3 will no longer be able to know the state of L4, and thus the system will not stop operating.
Therefore, if L4 fails, the single channel or the dual channels will continue to operate. Because the failed L4 can no longer vote, there is no way to know whether the outputs of the dual channels are consistent. Once the outputs of the dual channels become inconsistent, the system can neither enter the degradation mode nor stop, which puts the system in a situation where a dangerous failure may occur. In terms of the Markov safety model in Figure 8, States 7, 8, 9, 10, 17, and 18 should therefore be added to the states in which the overall system output is in a dangerous failure state.
Based on the above analysis, the probability that the dual-channel two-out-of-two structure with L2, L3, and L4 (where L4 has the voting-based degradation function but no monitoring function) is in a dangerous failure state at time t is as follows:
$$
P(t) = \sum_{k=7}^{18} P_{(k,1)}(t) = P_{(7,1)}(t) + P_{(8,1)}(t) + \cdots + P_{(18,1)}(t)
$$
Furthermore, the PFH can be expressed as follows:
$$
\begin{aligned}
PFH_{234} = \frac{1}{\tau}\int_{0}^{\tau}\Big( &\left(M_{17,1} + M_{18,1}\right)P_{(1,1)}(t) + \left(M_{7,2} + M_{8,2}\right)P_{(2,1)}(t) + \left(M_{9,3} + M_{10,3}\right)P_{(3,1)}(t) \\
&+ M_{14,5}\,P_{(5,1)}(t) + \left(M_{15,6} + M_{16,6}\right)P_{(6,1)}(t) + \left(M_{11,7} + M_{15,7}\right)P_{(7,1)}(t) \\
&+ \left(M_{12,8} + M_{16,8} + M_{18,8}\right)P_{(8,1)}(t) + \left(M_{13,9} + M_{15,9}\right)P_{(9,1)}(t) \\
&+ \left(M_{14,10} + M_{16,10}\right)P_{(10,1)}(t) + M_{18,12}\,P_{(12,1)}(t) \Big)\,dt
\end{aligned}
$$
For L4 without the monitoring function, if L4 fails, the single channel has no ability to sense it. In this case, when L4 fails, it cannot make the system enter a safe state. Therefore, when using the reliability block diagram method to model and analyze the safety, the branch with L4 is omitted, but the impact of the degradation mode on the safety is retained. The model of it by the reliability block diagram method is shown in Figure 10.
At this time, the PFH is the overall PFH of L1, L2, and L3, plus the influence brought about by the degradation mode, that is as follows:
$$
PFH_{123\_2} = PFH_{12} \cdot PFH_{3} \cdot \tau + C_{1oo4}\,\beta_{123}\sqrt{\lambda_{12DU}\,\lambda_{3DU}} + (1 - K)\,\lambda_{4DU}
$$

5.4. Comparison of Safety Performance

When calculating with the Markov safety model, the failure rate of the single channel must first be specified. As noted above, the failure rate of a single channel in the dual-channel two-out-of-two structures with L2, L3, and L4 can be obtained from the equivalent dangerous failure rate of the single-channel structure with L2 and L3. When $DC_{23} = 90\%$, we obtain $\lambda_{eD} = 2.3256 \times 10^{-5}$/h. The inspection test interval is $\tau = 2190$ h (3 months), the repair rate of the single channel is $\mu = 1/8\ \mathrm{h}^{-1}$, and $DC_e \in [0, 100\%]$. To ensure the safe and reliable functioning of L4 safety-critical equipment with a safety integrity level not lower than SIL4, the failure rates of its hardware and software must be lower than those of conventional hardware and software. Consequently, the hardware failure rate is set to $\lambda_{4} = 1 \times 10^{-7}$/h. Whether one considers the common-cause failures within the 1oo2 (one-out-of-two) heterogeneous redundancy of L12 and of L3, namely $\beta_{12}$ and $\beta_{3}$, respectively, or the common-cause failure $\beta_{123}$ between the two 1oo2 groups as a whole, the root cause is still the common-cause failure between X86 and ARM. Therefore, these three common-cause failure coefficients are assigned the same value. In addition, the common-cause failure coefficient $\beta_{1234}$ between the two 1oo2 groups and L4 should be smaller than the previous three $\beta$ values. For a conservative calculation, $\beta_{1234}$ is also set equal to the previous three $\beta$ values. The cases $\beta = 2\%$ and $\beta = 20\%$ are discussed. The resulting comparison of the PFH between the dual-channel two-out-of-two structure with L2, L3, and L4 in which L4 provides voting-based degradation but no monitoring, and the same structure in which L4 has complete functions, is shown in Figure 11.
Table 6 shows the PFH values of the three structures under different diagnostic coverage rates and common-cause failure coefficients $\beta$. To show the safety comparison among the architectures clearly, data within the range $[10^{-9}, 10^{-8})$/h, which meets the SIL4 requirement for the high-demand (or continuous) mode, are marked in red.
In the actual design of a signal system, a safety structure with a high diagnostic coverage rate should be designed to ensure safety. The DC of a system is one of the important factors affecting its reliability and safety; it reflects the system's self-inspection and fault self-checking capability. According to IEC 61508, typical values of DC are 60%, 90%, and 99%, and the range of the common-cause failure coefficient $\beta$ is [1%, 20%]. For systems with high reliability and safety requirements, such as train control systems, the DC should be greater than or equal to 90%, and the common-cause failure coefficient $\beta$ should be made as small as possible. The PFH values calculated for the three structures under the various conditions in Table 6 show that the single-channel structure with L2 and L3 can hardly meet high-safety-level requirements under any circumstances. The dual-channel two-out-of-two structure with L2, L3, and L4 (where L4 has voting and degradation functions but no monitoring function) falls within the range $[10^{-9}, 10^{-8})$/h only when the diagnostic coverage rate is very high and the common-cause failure coefficient is very low; in the majority of cases, it remains incapable of meeting the stipulated requirements. However, the dual-channel two-out-of-two structure with L2, L3, and L4 with complete functions reaches the SIL4 safety level specified in IEC 61508 in most cases and thus meets the safety requirements.
It can thus be concluded that the multi-level monitoring architecture, through its multi-layered and closed-loop monitoring mechanism, effectively compensates for the deficiencies of local monitoring structures in risk perception and response. Its safety performance is more stable and secure across different scenarios, providing a key guarantee for the train control system’s cloud security platform to address software common-cause failures and meet SIL4 safety requirements. This difference also indicates that in a cloud computing environment, partial monitoring structures relying solely on local redundancy or simplified monitoring cannot cope with multi-dimensional security challenges, and multi-level active monitoring is an essential means to ensure system security.

6. Conclusions

This paper conducts innovative research on the safety issues arising from the application of cloud computing in the rail transit train control system. To address the issue of common-cause software failures in train control safety computers, which are caused by the cloud computing hierarchical structure and virtualization technology, this paper proposes systematic solutions and breaks through the applicability limitations of traditional heterogeneous redundancy architectures in the cloud computing environment. First, this study conducts a hierarchical analysis of the cloud-based secure computing platform to identify risks at each layer and the layers where common-cause failures can be eliminated. Next, combined with relevant architectural concepts, it designs a multi-level monitoring architecture, integrating monitoring into layers where risks cannot be mitigated through heterogeneity. This architecture quickly restricts or halts the system’s improper outputs when a fault occurs in the cloud computing platform, thereby ensuring the system enters a safe mode. This effectively mitigates the risk that traditional heterogeneous redundancy architectures may share identical source codes, which can lead to common-cause software failures. Subsequently, the Markov method is applied to establish a safety model for the architecture. By comparing the PFH of different structures, it is ultimately concluded that, compared with single-channel structures and dual-channel structures without monitoring, this multi-level monitoring architecture meets the SIL4 safety requirements and can effectively reduce the impact of common-cause software failures.
Although this study provides a theoretical foundation and technical support for the safe application of cloud computing in the train control system, several directions remain worthy of exploration. First, hardware heterogeneity was verified only for the X86 + ARM combination; adaptability to architectures such as RISC-V can be explored in the future. Second, because the multi-level monitoring architecture of the cloud-based secure computing platform exhibits numerous potential operational states during runtime, this study simplified the Markov safety model; consequently, there remains scope for further refinement of this model in subsequent research.

Author Contributions

Data curation, Y.L.; Formal analysis, B.Z.; Funding acquisition, L.Y.; Resources, L.Y.; Investigation, Y.W.; Writing (original draft), B.Z. and Q.F.; Writing (review and editing), Y.W. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was funded by the Key Program of the Railway Innovation Development Joint Fund, National Natural Science Foundation of China (U2469211).

Data Availability Statement

The original contributions presented in this study are included in the article material. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Yixiong Wu was employed by the company China Unicom Digital Technology Co., Ltd. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Liu, S.; Liu, J.; Wang, H.; Xian, M. Research on the development of cloud computing. In Proceedings of the 2020 International Conference on Computer Information and Big Data Applications (CIBDA), Guiyang, China, 17–19 April 2020; IEEE: Piscataway, NJ, USA, 2020.
  2. Molo, M.J.; Badejo, J.A.; Adetiba, E.; Nzanzu, V.P.; Noma-Osaghae, E.; Oguntosin, V.; Baraka, M.O.; Takenga, C.; Suraju, S.; Adebiyi, E.F. A review of evolutionary trends in cloud computing and applications to the healthcare ecosystem. Appl. Comput. Intell. Softw. Comput. 2021, 2021, 1843671.
  3. Srivastava, P.; Khan, R. A review paper on cloud computing. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 2018, 8, 17–20.
  4. Zhang, L.; Lü, J.; Yan, F.; Xiong, Y. A Review of Trusted Cloud Computing Research. J. Zhengzhou Univ. (Nat. Sci. Ed.) 2022, 54, 1–11.
  5. Rashid, A.; Chaturvedi, A. Cloud computing characteristics and services: A brief review. Int. J. Comput. Sci. Eng. 2019, 7, 421–426.
  6. Prasad, V.B. Fault tolerant digital systems. IEEE Potentials 1989, 8, 17–21.
  7. Ma, Q.; Xu, Z.; Mei, M. Design of Train Control System Test Container Cloud Platform Based on Kubernetes. Comput. Technol. Dev. 2021, 31, 52–58.
  8. Li, K. Research on Migration Scheme of Railway Signal Centralized Detection System for Railway Private Cloud. Railw. Commun. Signal Eng. Technol. 2019, 16, 34–39.
  9. Guo, R.; Chen, G.; Zhao, X. Safety Assessment of Train Control Operation Based on Cloud Model and Uncertain AHP. J. China Railw. Soc. 2016, 38, 69–74.
  10. Zou, B. Research on Integrated Cloud Scheme for Rail Transit. China Informatiz. 2020, 2, 71–73.
  11. Zheng, T. Brief Analysis of Cloud Computing Technology Application in Metro Industry. Telecommun. World 2019, 26, 113–114.
  12. Dawood, M.; Tu, S.; Xiao, C.; Alasmary, H.; Waqas, M.; Rehman, S.U. Cyberattacks and Security of Cloud Computing: A Complete Guideline. Symmetry 2023, 15, 1981.
  13. Zhu, L.; Zhuang, Q.; Jiang, H.; Liang, H.; Gao, X.; Wang, W. Reliability-aware failure recovery for cloud computing based automatic train supervision systems in urban rail transit using deep reinforcement learning. J. Cloud Comput. 2023, 12, 147.
  14. Gala, G.; Fohler, G.; Tummeltshammer, P.; Resch, S.; Hametner, R. RT-cloud: Virtualization technologies and cloud computing for railway use-case. In Proceedings of the 2021 IEEE 24th International Symposium on Real-Time Distributed Computing (ISORC), Daegu, Republic of Korea, 1–3 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 105–113.
  15. Chen, S.; Zhao, S. Research on Networking Solutions for Urban Rail Transit Cloud Computing Platforms. Railw. Signal. Commun. Eng. 2024, 21, 89–96.
  16. Du, S. Improvement of Safety Redundant Structure Considering Both Safety and Reliability. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2019.
  17. Ren, W. Research on Key Technologies of Safety Computer Based on Private Cloud. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2020.
  18. Zhang, F. Research on Real-Time Performance of Data Transmission in New Train Control System Based on All-IP. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2021.
  19. Yang, Y. Research on Program Sequence Monitoring Method of Safety Computer Platform Based on Cloud Computing. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2023.
  20. Zhao, Q. Research on Real-Time Performance of Cloud-Based Safety Computer Platform. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2023.
  21. Liu, J. Research on Remote Monitoring Technology for Railway Power Supply Systems Based on Cloud Computing Platforms. Inf. Rec. Mater. 2025, 26, 208–210.
  22. Zhou, P.; Wang, X.; Jin, J.; Wang, H.; Ying, Z.; Fei, Z.; Wang, L. A Cloud Resource Allocation Method for Railway Safety Critical Computing Application. In Proceedings of the 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), Edmonton, AB, Canada, 24–27 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 2711–2716.
  23. Wang, Y.; Wu, W.A.; Gao, X.H. Reliability Research on Redundant Structure of Station Computer Interlocking System Based on Dynamic Fault Tree. Autom. Instrum. 2021, 4, 31–34.
  24. Wang, Y.; Wang, Y.; Ma, L.; Wen, J.; Zhang, F. Comparative analysis of the M-out-of-N structure in EN50129: 2018 and IEC61508: 2010. J. Phys. Conf. Ser. 2020, 1654, 012082.
  25. Armoush, A. Design patterns for safety-critical embedded systems. Ph.D. Thesis, RWTH Aachen University, Aachen, Germany, 2010.
  26. Luo, Y.; Saberi, A.K.; Bijlsma, T.; Lukkien, J.J.; van den Brand, M. An architecture pattern for safety critical automated driving applications: Design and analysis. In Proceedings of the 2017 Annual IEEE International Systems Conference (SysCon), Montreal, QC, Canada, 24–27 April 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–7.
  27. Bijlsma, T.; Buriachevskyi, A.; Frigerio, A.; Fu, Y.; Goossens, K.; Örs, A.O.; van der Perk, P.J.; Terechko, A.; Vermeulen, B. A distributed safety mechanism using middleware and hypervisors for autonomous vehicles. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1175–1180.
  28. Kang, Y. Research on Design and Performance Evaluation Method of Safety Computer Platform Based on Cloud Computing. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2023.
  29. Li, X. Research on Dynamic Monitoring Method of Route Control Based on Runtime Verification. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2019.
  30. Hao, Y.; Jiang, Y.; Chen, T.; Cao, D.; Chen, M. iTaskOffloading: Intelligent task offloading for a cloud-edge collaborative system. IEEE Netw. 2019, 33, 82–88.
  31. Hokstad, P. Demand rate and risk reduction for safety instrumented systems. Reliab. Eng. Syst. Saf. 2014, 127, 12–20.
  32. Hauge, S.; Lundteigen, M.A.; Hokstad, P.; Håbrekke, S. Reliability Prediction Method for Safety Instrumented Systems: PDS Method Handbook, 2010 Edition. SINTEF Rep. STF50 A 2010, 6031, 460.
Figure 1. Hierarchical architecture diagram of the cloud platform.
Figure 1. Hierarchical architecture diagram of the cloud platform.
Symmetry 17 01706 g001
Figure 2. Functional safety risks of cloud security computing platforms.
Figure 2. Functional safety risks of cloud security computing platforms.
Symmetry 17 01706 g002
Figure 3. The two-out-of-two heterogeneous security architecture of the cloud-based secure computing platform.
Figure 4. Fault Tree Analysis of overall common-cause failures.
Figure 5. Multi-level monitoring architecture pattern for cloud-based security computing platform.
Figure 6. Single-channel Markov safety model with L2 and L3.
Figure 7. PFH of the single channel with L2 and L3.
Figure 8. The Markov safety model of the dual-channel with the second-level monitor L2, the third-level monitor L3, and the fourth-level monitor L4.
Figure 9. The safety model of the reliability block diagram for the dual-channel with L2, L3, and L4.
Figure 10. Safety model of the reliability block diagram for the dual-channel structure with L4 having no monitoring function.
Figure 11. Comparison of the PFH of the dual-channel structure with L2, L3, and L4 in two situations. (a) Comparison of PFH between the two structures (β = 2%); (b) comparison of PFH between the two structures (β = 20%).
Table 1. A comparison of recent studies.

| Author(s) | Main Contributions | Limitations |
| --- | --- | --- |
| Du S. [16] | Evaluates safety and reliability of safety computers using multiple metrics. | Does not consider channel failure scenarios. |
| Ren W. [17] | Designs a safety computer platform based on private cloud technology. | Only considers safety redundancy structures. |
| Zhang F. [18] | Optimizes the local data transmission method for the new train control system. | Still requires manual intervention. |
| Yang Y. [19] | Proposes an optimized safety computer platform architecture and a program sequence monitoring method. | Frequent occupation of CPU resources increases the load of self-checking. |
| Zhao Q. [20] | Conducts qualitative and quantitative analyses on the real-time performance of cloud-based safety computing platforms. | Only verified under homogeneous hardware. |
| Liu J et al. [21] | Proposes a cloud-based remote monitoring scheme for railway power supply systems. | Scope restricted to railway power supply. |
| Zhou P et al. [22] | Proposes an MILP-based resource allocation method for railway safety-critical computing. | Power consumption model neglects fine-grained variations due to hardware heterogeneity. |
Table 2. Heterogeneous configuration of each layer.

| Layer | x86_64 Architecture | ARM Architecture |
| --- | --- | --- |
| Application Layer | App for x86_64 | App for AArch64 |
| Guest OS Layer | FreeBSD/Windows | Ubuntu |
| Cloud Platform Layer | A certain domestic innovation private cloud platform; qemu-system-x86_64 | A certain domestic innovation private cloud platform; qemu-system-aarch64 |
| Hypervisor Layer | kvm.ko; kvm-amd | kvm.ko; kvm-arm |
| Host OS Layer | Linux Kernel; Kylin Linux for x86_64 | Linux Kernel; Kylin Linux for ARMv8 |
| Physical Hardware Layer | Hygon C5200 Xinchuang Server | Phytium S2500 Xinchuang Server |
Table 3. Summary of diversity across layers.

| Layer | Summary of Differences | Safety Measure Applied |
| --- | --- | --- |
| Application Layer | Functionally distinct application software developed on different Guest OSs provides sufficient diversity. | Two-out-of-two heterogeneous software configuration. |
| Guest OS Layer | Windows and Linux provide sufficient diversity. | Two-out-of-two heterogeneous operating system configuration. |
| Cloud Platform Layer | QEMU and the domestically developed private cloud software share identical source code, posing a common-cause software failure risk. | Proactive monitoring of virtual machine (VM) operation detects anomalies and directs the system to a safe state. |
| Hypervisor Layer | Part of the kvm.ko source code is identical, posing a common-cause software failure risk. | Proactive monitoring of virtual machine (VM) operation detects anomalies and directs the system to a safe state. |
| Host OS Layer | Part of the Linux kernel source code is identical, posing a common-cause software failure risk. | Proactive monitoring of virtual machine (VM) operation and Host OS (Linux) memory aging detects anomalies and directs the system to a safe state. |
| Physical Hardware Layer | Hygon and Phytium servers provide sufficient diversity. | Two-out-of-two heterogeneous hardware configuration. |
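The layer-by-layer assessment in Table 3 can be read as a simple rule: wherever heterogeneity does not remove shared source code, active monitoring takes over. The following minimal Python sketch encodes that reading as data. The dictionary name, the boolean encoding, and the function name are illustrative assumptions and are not part of the platform described in this article.

```python
# Table 3 encoded as data. True means the two channels are considered
# sufficiently diverse at that layer; False means shared source code remains.
# The names and the boolean encoding are hypothetical, for illustration only.
LAYER_DIVERSITY = {
    "Application": True,
    "Guest OS": True,
    "Cloud Platform": False,    # QEMU and private cloud software
    "Hypervisor": False,        # kvm.ko
    "Host OS": False,           # Linux kernel
    "Physical Hardware": True,
}


def layers_requiring_monitoring(diversity):
    """Return the layers whose common-cause risk is not removed by heterogeneity."""
    return [layer for layer, diverse in diversity.items() if not diverse]


if __name__ == "__main__":
    for layer in layers_requiring_monitoring(LAYER_DIVERSITY):
        print(f"{layer}: rely on proactive monitoring")
```

Under this encoding, the three virtualization-related layers are the ones returned, which matches the rows of Table 3 that list proactive monitoring as the safety measure.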
Table 4. The functions of each layer of the DSM.

| DSM Layer | Corresponding Safety Assurance Measures of DSM |
| --- | --- |
| L1 | Regular Application Functions |
| L2 | Function Monitor (FM) |
| L3 | Control Safety Monitor (CSM) |
| L4 | Vehicle Safety Monitor (VSM) |
Table 5. Markov safety model parameters.

| Parameter | Definition |
| --- | --- |
| W | Normal operating state |
| DD | Detectable hazardous failure state |
| DU | Undetectable hazardous failure state |
| $R_{DD}$ | Repair state entered after a detectable hazardous failure is detected |
| $R_{DU}$ | Repair state entered from the undetectable failure state when the inspection test moment arrives |
| $\lambda$ | Total failure rate of independent failures and common-cause failures |
| $\lambda_{DD}$ | Failure rate of detectable hazardous failures |
| $\lambda_{DU}$ | Failure rate of undetectable hazardous failures |
| $\beta$ | Proportion of common-cause failures in total failures |
| $\mu$ | Repair rate |
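To make the relationship between the parameters in Table 5 concrete, the sketch below splits a total hazardous failure rate into its detectable, undetectable, common-cause, and independent parts. It assumes the conventional relations λ_DD = DC · λ and λ_DU = (1 − DC) · λ, with β · λ as the common-cause share, where DC is the diagnostic coverage used in Table 6. These relations, the function name, and the numeric inputs are assumptions for illustration; the Markov models in this article may apply the split differently.

```python
def split_failure_rates(lam, dc, beta):
    """Split a total hazardous failure rate (per hour) into component rates.

    Assumed relations (not taken from this article):
      lambda_DD = dc * lam          detectable hazardous failures
      lambda_DU = (1 - dc) * lam    undetectable hazardous failures
      common-cause share = beta * lam, independent share = (1 - beta) * lam
    """
    if lam < 0 or not 0.0 <= dc <= 1.0 or not 0.0 <= beta <= 1.0:
        raise ValueError("lam must be >= 0; dc and beta must lie in [0, 1]")
    return {
        "lambda_DD": dc * lam,
        "lambda_DU": (1.0 - dc) * lam,
        "common_cause": beta * lam,
        "independent": (1.0 - beta) * lam,
    }


if __name__ == "__main__":
    # Hypothetical inputs: lam = 1e-6 /h, DC = 90%, beta = 2%.
    for name, rate in split_failure_rates(1e-6, 0.90, 0.02).items():
        print(f"{name}: {rate:.3e} /h")
```

The two pairs of rates each sum back to λ, which is a quick sanity check when transcribing transition rates into a Markov model.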
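Table 6. Comparison of PFH in different modes. PFH values are given per hour (/h) for the three structures.

| Structure | DC | β = 1% | β = 2% | β = 5% | β = 10% | β = 20% |
| --- | --- | --- | --- | --- | --- | --- |
| Single-channel Structure with L2 and L3 | 40% | $9.9293 \times 10^{-8}$ | $1.3933 \times 10^{-7}$ | $2.5941 \times 10^{-7}$ | $4.5947 \times 10^{-7}$ | $8.5932 \times 10^{-7}$ |
| Single-channel Structure with L2 and L3 | 60% | $6.6465 \times 10^{-8}$ | $9.3241 \times 10^{-8}$ | $1.7356 \times 10^{-7}$ | $3.0739 \times 10^{-7}$ | $5.7493 \times 10^{-7}$ |
| Single-channel Structure with L2 and L3 | 90% | $1.6718 \times 10^{-8}$ | $2.3445 \times 10^{-8}$ | $4.3623 \times 10^{-8}$ | $7.7253 \times 10^{-8}$ | $1.4451 \times 10^{-7}$ |
| Dual-channel two-out-of-two Structure with L2, L3 and L4 (L4 with Voting and Degradation but without Monitoring Function) | 40% | $1.1121 \times 10^{-8}$ | $2.5502 \times 10^{-8}$ | $6.2003 \times 10^{-8}$ | $1.2930 \times 10^{-7}$ | $2.4546 \times 10^{-7}$ |
| Dual-channel two-out-of-two Structure with L2, L3 and L4 (L4 with Voting and Degradation but without Monitoring Function) | 60% | $8.8949 \times 10^{-9}$ | $1.6992 \times 10^{-8}$ | $4.1304 \times 10^{-8}$ | $8.1885 \times 10^{-8}$ | $1.6328 \times 10^{-7}$ |
| Dual-channel two-out-of-two Structure with L2, L3 and L4 (L4 with Voting and Degradation but without Monitoring Function) | 90% | $2.2895 \times 10^{-9}$ | $4.3733 \times 10^{-9}$ | $1.0626 \times 10^{-8}$ | $2.1051 \times 10^{-8}$ | $4.1916 \times 10^{-8}$ |
| Dual-channel two-out-of-two Structure with L2, L3 and L4 | 40% | $3.2890 \times 10^{-9}$ | $5.3779 \times 10^{-9}$ | $1.1645 \times 10^{-8}$ | $2.2089 \times 10^{-8}$ | $4.2978 \times 10^{-8}$ |
| Dual-channel two-out-of-two Structure with L2, L3 and L4 | 60% | $2.1923 \times 10^{-9}$ | $3.5845 \times 10^{-9}$ | $7.7612 \times 10^{-9}$ | $9.9875 \times 10^{-9}$ | $2.8645 \times 10^{-8}$ |
| Dual-channel two-out-of-two Structure with L2, L3 and L4 | 90% | $5.4793 \times 10^{-10}$ | $8.9585 \times 10^{-10}$ | $2.2306 \times 10^{-9}$ | $4.2311 \times 10^{-9}$ | $8.2323 \times 10^{-9}$ |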
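One way to read Table 6 is as a set of PFH reduction factors between structures. The short Python sketch below computes those factors from the DC = 90% rows of the table. The PFH values are transcribed from Table 6; the dictionary layout, the short structure keys, and the function name are illustrative assumptions.

```python
# PFH (/h) at DC = 90%, transcribed from Table 6. Inner keys are beta in percent.
# The short structure keys are hypothetical labels for the three table rows.
PFH_DC90 = {
    "single": {1: 1.6718e-8, 2: 2.3445e-8, 5: 4.3623e-8, 10: 7.7253e-8, 20: 1.4451e-7},
    "dual_no_monitor": {1: 2.2895e-9, 2: 4.3733e-9, 5: 1.0626e-8, 10: 2.1051e-8, 20: 4.1916e-8},
    "dual_full": {1: 5.4793e-10, 2: 8.9585e-10, 5: 2.2306e-9, 10: 4.2311e-9, 20: 8.2323e-9},
}


def reduction_factor(baseline, improved, beta_percent, table=PFH_DC90):
    """Return PFH(baseline) / PFH(improved) for a given beta (in percent)."""
    return table[baseline][beta_percent] / table[improved][beta_percent]


if __name__ == "__main__":
    for beta in (1, 2, 5, 10, 20):
        vs_single = reduction_factor("single", "dual_full", beta)
        vs_unmonitored = reduction_factor("dual_no_monitor", "dual_full", beta)
        print(f"beta={beta:>2}%: {vs_single:.1f}x vs single channel, "
              f"{vs_unmonitored:.1f}x vs dual channel without L4 monitoring")
```

At DC = 90%, every factor is greater than one for all five values of β, so the fully monitored dual-channel structure has the lowest PFH in each column of that row group.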
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
