1. Introduction
In August 2006, the concept of "cloud computing" was first proposed [1]. After more than a decade of development, cloud computing has become indispensable foundational infrastructure for a multitude of organizations, with far-reaching implications across economic, social, industrial, and scientific spheres. With the rapid advancement of mobile networks and big data technologies, the vast majority of online services and data processing now rely on cloud computing. Widely adopted across the globe, cloud computing is hailed as the third major wave in information technology [2,3,4,5].
Train safety computers are a crucial technology for ensuring the safety of railway transportation. To guarantee the security and reliability of critical systems, train safety computers usually adopt a redundant design. This means that key components such as processors, control modules, and communication devices are equipped with backup systems. These backup systems can automatically take over when the primary system fails, ensuring that train operations remain unaffected [6].
Due to the unique layered architecture and virtualization technology of cloud computing, new security challenges have emerged after safety computers were migrated to cloud computing platforms. There are numerous functional software modules from the physical hardware layer to the application layer, and potential failures in these layers can lead to multi-dimensional security failures. Traditional heterogeneous redundancy fault-tolerance mechanisms face significant limitations in this multi-software environment: heterogeneous functional software modules often share core underlying modules, making the system vulnerable to common-cause software failures. This software-induced common-cause failure mechanism makes it difficult to achieve effective fault isolation and fault tolerance simply by relying on hardware or software heterogeneity.
To address the above issues, establishing a system monitoring framework becomes a necessary measure. By creating a robust monitoring mechanism—multi-level surveillance—it is possible to monitor and analyze the operating status of the software layer in real time, promptly detect and address potential security threats, and guide the system toward safety.
In summary, our contributions are as follows:
This paper innovatively proposes a multi-level monitoring architecture tailored to the characteristics of cloud-based safety computing platforms for train control systems, which monitors the layers in which common-cause software failures cannot be eliminated through heterogeneity. The introduction of a multi-level active monitoring mechanism for risk management and control reduces the impact of common-cause software failures on system safety.
Through rigorous mathematical analysis of the adopted multi-level monitoring architecture, this paper has constructed a formal safety model. This model is effectively applicable to cloud security computing platforms of train control systems.
Experimental verification of the multi-level monitoring architecture has been conducted in this paper. The results indicate that, relying on its multi-level and closed-loop monitoring mechanism, the multi-level monitoring architecture effectively compensates for the defects of local monitoring structures in risk perception and response, and its safety performance is more stable and reliable across different scenarios.
This paper adopts a progressive research approach.
Section 2 reviews related works.
Section 3 conducts a hierarchical analysis of the cloud-based security computing platform.
Section 4 proposes a multi-level monitoring architecture pattern.
Section 5 discusses the feasibility of the multi-level monitoring architecture from a security perspective. In
Section 6, the overall work of this paper is summarized.
2. Related Work
Compared with traditional train control systems, signal systems utilizing cloud computing technology have significantly enhanced the efficiency and reliability of rail transit. These cloud computing systems are capable of processing and storing vast amounts of data while providing great flexibility in resource utilization. They reduce dependence on physical equipment, effectively lowering operational costs. The powerful computing capability and fault-tolerant technology of cloud computing systems ensure the continuous availability of services. Due to the high scalability of cloud platforms, centralized management of multiple signaling applications can be achieved by standardizing the control software interfaces, data formats, communication protocols, and other interactive information within the cloud environment, thereby facilitating the network interconnection of cloud-based train control systems. Therefore, the application of cloud computing in rail transit has become an obvious trend in the development of the industry.
Ma et al. [
7] present the design of a train control system test cloud platform based on Docker and Kubernetes clusters. By adopting containerization technology and orchestration tools, this platform realizes the modularity of test software functions, thereby improving the automation level of the test platform and supporting various future train control test tasks. Li K. [
8] explores how to use cloud computing technology to solve the problem of signal centralized detection in high-speed railway systems. Through the virtualization of microcomputer servers and dynamic resource allocation, the monitoring scope of each station’s signal monitoring station can be flexibly adjusted, achieving effective remote detection of signal status and optimizing the detection function. Guo et al. [
9] propose a cloud model with uncertainty cognition characteristics to evaluate the safety level of train control operations. It demonstrates the safety assessment steps based on the cloud model and determines the operational safety status of the train operation control system by calculating the similarity between the comprehensive cloud and the standard cloud.
Zou B. [
10] analyzes the train control functions that can be implemented at each layer of the cloud platform and proposes an integration scheme for their convergence. Zheng T. [
11] analyzes the practical application of cloud computing technology in the subway industry. Dawood M et al. [
12] established a comprehensive theoretical framework for cloud computing security and systematically analyzed the types of cloud computing security issues. Zhu et al. [
13] established a reinforcement learning cloud model to measure the fault repair of rail transit clouds. Gala G et al. [
14] proposed a real-time cloud architecture based on virtualization technology and designed a resource management layer that includes node-level resource managers and global resource managers, so as to realize dynamic allocation, monitoring, and coordination of resources, thereby meeting the safety and real-time requirements of railway applications. Chen et al. [
15] analyzed the primary system architectures of cloud computing platforms for urban rail transit, focusing on security and reliability, and proposed a tailored networking solution for such platforms.
Furthermore, Du S. [
16] modeled the reliability and safety of traditional safety computer platforms, measuring their safety reliability through multiple indicators. However, Du S. did not conduct a detailed analysis of channel failure scenarios within different subsystems. Ren W. [
17] employs the Monte Carlo method to model and analyze the safety and reliability of safety-critical redundant architectures. It presents a design for a safety computer platform based on private cloud infrastructure. However, the proposed platform considers only hardware-level redundancy and does not address the risk of common-cause software failures due to a shared source code. Zhang F. [
18] optimizes the local data transmission method of the new train control system and conducts tests on multiple cloud platforms to reduce system latency. However, the simulation of axle counting sections still requires manual intervention, which may affect safety due to human operations. Yang Y. [
19] proposes an optimized safety computer platform architecture and a program sequence monitoring method to enhance the security of cloud-based safety computer platforms. However, while this improves diagnostic coverage, the frequent occupation of CPU resources increases the self-checking load. Zhao Q. [
20] proposed a novel architecture for cloud-based safety computing platforms and conducted qualitative and quantitative analyses on the real-time performance of such platforms. However, their evaluation was limited to homogeneous hardware configurations, failing to demonstrate the adaptability of the architecture in heterogeneous environments. Liu et al. [
21] proposed a remote monitoring scheme for railway power supply systems based on cloud computing platforms. However, this scheme is limited to railway power supply systems. Zhou et al. [
22] proposed a resource allocation method for railway safety-critical computing applications based on a Mixed Integer Linear Programming (MILP) model. However, the host power consumption model fails to consider the detailed differences in power consumption caused by hardware heterogeneity. Moreover, the safety verification only focuses on “whether deployment rules are met” and does not reference industry safety standards such as IEC 61508 (Geneva, Switzerland, 2010) for quantitative verification. A comparison of recent studies is presented in
Table 1.
Although scholars worldwide have carried out systematic, multi-dimensional research on cloud-based safety computing platforms, existing studies mostly focus on the functional implementation of such platforms or on security analysis from a single dimension, and they have not fully considered the potential impact of common-cause software failures on system security. Regarding the quantitative security assessment of cloud-based safety computing platforms, a systematic modeling method that integrates the characteristics of common-cause software failures with multi-level monitoring mechanisms has not yet been developed.
In view of the theoretical and practical challenges of security protection for cloud-based computing platforms, this paper innovatively proposes a multi-level monitoring architecture model for cloud-based secure computing platforms. Through formal modeling and quantitative analysis, the system’s security verification is completed, and experimental verification of the multi-level monitoring architecture is conducted. The results show that this architecture meets the Safety Integrity Level 4 (SIL4) requirements of the train control system, filling the adaptability gaps of traditional methods in cloud environments.
4. Multi-Level Monitoring Mechanism
As demonstrated in the preceding section, in a cloud-based safety computing platform adopting the basic two-out-of-two security architecture, common-cause software failures cannot be completely eliminated through heterogeneity. This section proposes a novel multi-level monitoring mechanism with robust diagnostics to mitigate the security risks arising from common-cause software failures.
4.1. Monitoring Architecture Pattern
According to EN 50129 (Brussels, Belgium, 2018), redundant architecture patterns fall into the category of composite fail-safety, while monitoring architecture patterns belong to reactive fail-safety [24]. The key to reactive fail-safety lies in program sequence monitoring, which realizes both logical sequence monitoring and temporal sequence monitoring by placing checkpoints at specific positions in the program and inspecting them periodically. Monitoring architecture patterns mainly include the monitor–actuator pattern, the safety execution pattern, and the three-level monitoring architecture pattern, among others [
25]. A comparison of the aforementioned architecture patterns in terms of reliability, security, cost, variability, and execution time shows that the three-level monitoring architecture pattern and the safety channel pattern are similar in security and reliability with relatively low costs; however, the three-level monitoring architecture pattern lags far behind the safety channel architecture pattern in terms of impact on execution time [
26]. The three-level monitoring architecture pattern, also known as the E-Gas architecture, adopts a three-layer design in which each layer has unique functions and an independent failure control path; through logical combination and collaboration, the system can quickly enter a fail-safe state when a problem occurs. The core purpose of the safety channel pattern is to ensure that the system can still maintain safety even when a major failure occurs in its main functions. Its implementation idea is to adopt Automotive Safety Integrity Level (ASIL) decomposition technology to decompose high-safety-level system requirements into different subsystems, reduce the failure risk of the actuator channel when it performs normal operating functions, and transfer safety control to the healthy channel. In the field of automotive autonomous driving, the concept of hierarchical monitoring has been extended beyond three-level monitoring, and a distributed safety mechanism (DSM) has been proposed [
27]. DSM distributes the safety layer across processors with different ASIL ratings. It also adopts hardware-assisted virtual machines to isolate software modules and realizes the fault-free shutdown behavior of faulty software stacks. It can be used to address the issues arising from the growing number and complexity of integrated system chips and software stacks required for autonomous operations. The train operation control system with a superimposed structure also faces the problem of an increasing number and complexity of physical devices in the process of developing towards autonomous operation [
28], which is similar to the problems encountered in automotive autonomous operations. Therefore, DSM is not only applicable to the automotive field but can also be used to address similar issues in the rail transit sector.
A specific example of how the three-layer monitoring concept of the E-Gas architecture maps onto the network architecture provided by the DSM is shown in
Table 4.
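To make the program sequence monitoring idea mentioned above more concrete, the following minimal Python sketch checks both the logical order and the timing of checkpoints placed in a control cycle; the checkpoint names and the cycle budget are hypothetical and are not taken from the platform described in this paper.

```python
import time

# Illustrative sketch of program sequence monitoring (reactive fail-safety per
# EN 50129): checkpoints placed at fixed positions are checked for both logical
# order and timing. All names and the time budget here are hypothetical.

EXPECTED_ORDER = ["read_inputs", "compute_speed_profile", "write_outputs"]
MAX_CYCLE_SECONDS = 0.2  # assumed temporal budget for one control cycle

class SequenceMonitor:
    def __init__(self):
        self.reached = []
        self.cycle_start = time.monotonic()

    def checkpoint(self, name: str) -> None:
        """Record that a checkpoint was reached; raise if order or timing is violated."""
        self.reached.append(name)
        expected = (EXPECTED_ORDER[len(self.reached) - 1]
                    if len(self.reached) <= len(EXPECTED_ORDER) else None)
        if name != expected:
            raise RuntimeError(f"logical sequence violation: got {name}, expected {expected}")
        if time.monotonic() - self.cycle_start > MAX_CYCLE_SECONDS:
            raise RuntimeError("temporal sequence violation: cycle budget exceeded")

    def end_cycle(self) -> None:
        """At the end of a cycle, verify every checkpoint was visited, then reset."""
        if self.reached != EXPECTED_ORDER:
            raise RuntimeError("incomplete cycle: missing checkpoints")
        self.reached, self.cycle_start = [], time.monotonic()
```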
4.2. Multi-Level Monitoring Architecture
Multiple redundant virtual machines are configured within the private cloud, where each virtual machine runs the train control function software independently and forms a traditional hierarchical fail-safe system together with the voter. Analysis of common-cause software failures across the Host OS layer, Hypervisor layer, and cloud platform layer indicates that the traditional architecturally heterogeneous redundant system still has an unavoidable issue of identical source code. Therefore, monitoring of common-cause software failures must first consider monitoring of virtual machines.
To overcome the limitations of traditional architectures, a new multi-level monitoring architecture pattern is proposed. This study draws on the advantages of the ASIL decomposition technology in the safety channel pattern and the three-level monitoring principle of the E-Gas architecture and combines them with the DSM to enhance the ability of the cloud-based safety computing platform to prevent common-cause failures.
The proposed implementation is as follows:
The train control function software running on redundant virtual machines corresponds to L1 of the conventional application functions under the DSM architecture.
The monitoring state machine software based on runtime verification is built on another private cloud virtual machine to monitor L1 functions through the function channel, supporting the function monitor L2 under the DSM architecture. Runtime verification is a lightweight verification technique that combines testing and model checking. It verifies the system by monitoring whether the actual execution path of the target system meets the specified monitoring properties [
29].
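As an illustration of the runtime-verification style of monitoring performed by L2, the following minimal Python sketch implements a monitor state machine that observes an execution trace and checks it against one specified property; the property and event names are hypothetical examples rather than the properties used on the actual platform.

```python
# Minimal sketch of a runtime-verification monitor: a state machine observes the
# execution trace of the monitored software (L1) and checks it against a
# specified property. Property and event names below are illustrative only.

from enum import Enum, auto

class Verdict(Enum):
    OK = auto()
    VIOLATION = auto()

class BrakeResponseMonitor:
    """Property: after 'brake_requested', 'brake_applied' must occur within max_steps events."""

    def __init__(self, max_steps: int = 3):
        self.max_steps = max_steps
        self.pending = None  # steps remaining once a request is outstanding

    def observe(self, event: str) -> Verdict:
        if event == "brake_requested" and self.pending is None:
            self.pending = self.max_steps
            return Verdict.OK
        if self.pending is not None:
            if event == "brake_applied":
                self.pending = None
                return Verdict.OK
            self.pending -= 1
            if self.pending <= 0:
                return Verdict.VIOLATION  # actual execution path violates the property
        return Verdict.OK

# Example trace fed to the monitor as L1 executes:
monitor = BrakeResponseMonitor()
for e in ["speed_update", "brake_requested", "speed_update", "brake_applied"]:
    assert monitor.observe(e) is Verdict.OK
```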
In the DSM architecture, L3 needs to be configured in an independent Microcontroller Unit (MCU). However, there is no such MCU in the private cloud. Consequently, the L3 software is deployed on another virtual machine in the private cloud that is configured for redundancy. This virtual machine monitors the normal operation of L1 and L2 through a challenge–response mechanism.
Under the DSM architecture, L4 is essentially an "external safety monitoring unit independent of the core functional layer," which is highly aligned with the core requirements of the train control system, namely "safety redundancy, global monitoring, and failure emergency response." L4 does not run on the private cloud; instead, it monitors the functional controllers in the two-out-of-two heterogeneous redundant architecture from the outside and forces a transition to a safe state when a hazard is detected. Furthermore, to incorporate the idea of safe degradation of the limp-home channel from the safety channel pattern, a safe degradation function must be added to the voter.
Monitoring of L1 is achieved through the monitor state machine software based on runtime verification in L2, which continuously monitors the behavior of critical software such as the on-board Automatic Train Protection (ATP). However, since L2 is located within the virtual machine, it cannot distinguish between a virtual machine error and a common-cause failure error in its underlying Host OS layer or Hypervisor layer. Therefore, L3, which is located outside this virtual machine, is required to monitor these two layers through a challenge–response mechanism that proactively initiates challenges and verifies the legitimacy of responses. If L2 malfunctions or responds abnormally, L3 can determine whether the error occurs in L1 or L2. L4 adds a safety degradation function to the cloud-based secure computing platform following a failure, and the final security assurance mechanism is an enhanced DSM + voting function.
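The challenge–response interaction between L3 and L2 can be sketched as follows; this is a simplified illustration under assumed names and an assumed keyed-hash scheme, not the platform's actual protocol.

```python
# Illustrative sketch of a challenge-response mechanism: the third-level monitor
# L3 periodically issues a random challenge, and the monitored party (L2,
# together with the layers it runs on) must return the expected keyed response
# within a deadline. Key, deadline, and function names are assumptions.

import hmac, hashlib, os, time

SHARED_KEY = b"demo-key"          # hypothetical pre-shared key
RESPONSE_DEADLINE_S = 0.05        # assumed response budget

def expected_response(challenge: bytes) -> bytes:
    return hmac.new(SHARED_KEY, challenge, hashlib.sha256).digest()

def l2_respond(challenge: bytes) -> bytes:
    """Stand-in for the L2 side: computes the keyed response to the challenge."""
    return expected_response(challenge)

def l3_check_once() -> str:
    challenge = os.urandom(16)
    start = time.monotonic()
    try:
        response = l2_respond(challenge)
    except Exception:
        return "L2 unresponsive: escalate (possible L1/L2 or underlying-layer fault)"
    late = (time.monotonic() - start) > RESPONSE_DEADLINE_S
    if late or not hmac.compare_digest(response, expected_response(challenge)):
        return "illegitimate or late response: treat channel as faulty"
    return "L2 alive and response legitimate"

print(l3_check_once())
```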
Based on the above comparative analysis with the DSM structure, the final multi-level monitoring architecture model is proposed. Firstly, in the first layer, the task of rail transit signal processing is undertaken by the application software L1, which focuses on executing the basic functions of the train control system.
In the second layer, the second-level monitor L2, which is configured in the corresponding virtual machine to monitor L1, runs in parallel to implement runtime verification of the system’s main functions.
Finally, on the virtual machines supported by heterogeneous hardware servers, the third-level monitor L3 is deployed. It monitors the Hypervisor layer of the entire system and the second-level monitor L2, ensuring the continuity and security of the system’s operation.
The first, second, and third layers collectively form a single channel. On this basis, a dual-channel structure is adopted to constitute a two-out-of-two voting architecture. The fourth layer is the fourth-level monitor L4, which is responsible for voting on L1 of the two channels and conducting real-time monitoring on L3 of the two channels. The fault-handling process is governed by the “symmetry principle”, which stipulates that the command logic when faults occur in the two channels is completely symmetrical. In the event of a hazard being detected, the system is switched to operate in a degraded mode.
Each layer is equipped with separate failure management measures, and through a specific logical combination, it is ensured that the system can be transferred to a degraded state via L4 when necessary. Such a hierarchical structure helps to identify safety-related failures, and through the safety-handling mechanisms of each layer, it can quickly limit or stop the improper output of the system, thus ensuring that the system enters a safe mode and preventing the potential risk of expanding into a hazard. The schematic diagram of the final overall architecture is shown in
Figure 5.
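As a simplified illustration of the L4 behaviour described above (two-out-of-two voting, monitoring of both channels' L3, and forced transition to the degraded mode), the following Python sketch uses assumed data types and an assumed degraded action.

```python
# Minimal sketch of the fourth-level monitor L4: it votes on the outputs of the
# two channels (two-out-of-two) and monitors both channels' L3 monitors; on any
# mismatch or detected hazard it forces the degraded (safe) mode. Data types and
# the degraded action are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ChannelState:
    output: int          # command produced by the channel's L1
    l3_healthy: bool     # result of L4's monitoring of that channel's L3

def l4_vote(ch_a: ChannelState, ch_b: ChannelState) -> str:
    # Symmetry principle: the handling logic is identical regardless of which
    # channel is faulty, so only the combined condition matters.
    if not (ch_a.l3_healthy and ch_b.l3_healthy):
        return "DEGRADED_MODE"            # a monitoring chain is broken: fail safe
    if ch_a.output != ch_b.output:
        return "DEGRADED_MODE"            # 2oo2 disagreement: fail safe
    return f"OUTPUT:{ch_a.output}"        # both channels agree and are monitored

assert l4_vote(ChannelState(1, True), ChannelState(1, True)) == "OUTPUT:1"
assert l4_vote(ChannelState(1, True), ChannelState(2, True)) == "DEGRADED_MODE"
assert l4_vote(ChannelState(1, False), ChannelState(1, True)) == "DEGRADED_MODE"
```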
5. Security Analysis of Cloud-Based Safety Computing Platform Multi-Level Monitoring Architecture
To verify, from the safety perspective, the rationality of the multi-level monitoring architecture proposed above, the Markov method and the reliability block diagram method are adopted to quantitatively analyze the safety of the following scenarios: the single channel with L2 and L3; the dual-channel structure in which L4 provides voting-based degradation but no monitoring; and the dual-channel structure with a fully functional L4. The Markov method and the reliability block diagram method, as outlined in IEC 61508, encompass factors such as fault detection, repair, and common-cause failures [30]. These methods facilitate an objective and comprehensive analysis of the structural relationships within the system. However, a limitation of the reliability block diagram method is that it cannot represent the transitions between the various states of the system. For this reason, this section takes the Markov method as the primary tool, supplemented by the reliability block diagram method, to analyze the safety of the multi-level monitoring architecture pattern.
5.1. Single-Channel Structure with L2 and L3
This paper is the first to incorporate common-cause failures of cloud-based train control systems into a formal model. Furthermore, by incorporating the “fault detection–repair–degradation” behavior of the multi-level monitoring system into the Markov model, it becomes a quantitative verification tool for architectural design. The Markov safety model of a single channel with a second-level monitor L2 and a third-level monitor L3 is shown in
Figure 6.
The state transition diagram for a single-channel structure with L2 and L3 contains 14 states. The definitions of the parameters in the figure are shown in
Table 5.
IEC 61508 recommends a reference hardware failure rate, and the American national standard ANSI/AIAA R-013-1992 gives a typical range for the failure rate of civil software. The Mean Time Between Failures (MTBF) of the Hygon X86 server and the Phytium ARMv8 server used in this study is 220,000 h and 100,000 h, respectively. Taking the failure rate as the reciprocal of the MTBF, the failure rate of the X86 server is set to approximately $4.5\times10^{-6}$/h, and that of the ARMv8 server to $1\times10^{-5}$/h. For the L1, L2, and L3 software, a failure rate consistent with the civil-software range is adopted. For L4 safety-critical equipment with a safety integrity level not lower than SIL4, the failure rates of its hardware and software must be lower than those of conventional hardware and software in order to achieve its safe and reliable function, and they are set accordingly. The hardware common-cause failure coefficient $\beta$ takes values in the range [1%, 20%].
For the detectable dangerous failure state DD and the undetectable dangerous failure state DU, the repair rates are $\mu_{\mathrm{DD}} = 1/\mathrm{MTTR}$ and $\mu_{\mathrm{DU}} = 1/\mathrm{MRT}$, where MTTR stands for Mean Time To Repair and MRT stands for Mean Repair Time. Following IEC 61508, MTTR = MRT = 8 h is assumed; therefore, the repair rates are uniformly set to $\mu = 1/8\ \mathrm{h}^{-1} = 0.125$/h.
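The worked arithmetic below makes the parameter choices explicit, under the common assumption that a constant failure rate can be approximated by the reciprocal of the MTBF and a constant repair rate by the reciprocal of the repair time.

```python
# Worked arithmetic for the parameter choices above, assuming failure rate
# ~ 1/MTBF and repair rate ~ 1/MTTR (a common constant-rate approximation).

mtbf_x86_h = 220_000      # Hygon X86 server MTBF, hours
mtbf_arm_h = 100_000      # Phytium ARMv8 server MTBF, hours
mttr_h = 8                # IEC 61508 assumption used in this paper

lambda_x86 = 1 / mtbf_x86_h   # ~4.5e-6 per hour
lambda_arm = 1 / mtbf_arm_h   # 1.0e-5 per hour
mu = 1 / mttr_h               # 0.125 per hour

print(f"{lambda_x86:.2e}/h, {lambda_arm:.2e}/h, mu = {mu}/h")
```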
The meanings of each state are as follows:
State 1: L1, L2, and L3 all operate normally (W).
State 2: L1 and L2 are in a DD state; L3 is normal.
State 3: L1 and L2 are in a DU state; L3 is normal.
State 4: L3 is in a DD state; L1 and L2 are normal.
State 5: L3 is in a DU state; L1 and L2 are normal.
State 6: L1, L2, and L3 are all in a DD state.
State 7: L1, L2, and L3 are all in a DU state.
State 8: L1 and L2 are in a DD state; L3 is in a DU state.
State 9: L1 and L2 are in a DU state; L3 is in a DD state.
State 10: The DU failure of L1 is detected at the inspection test and enters the repair state; L2 and L3 are normal.
State 11: The DU failure of L2 is detected at the inspection test and enters the repair state; L1 and L3 are normal.
State 12: The DU failure of L3 is detected at the inspection test and enters the repair state; L1 and L2 are normal.
State 13: The DU failures of L1 and L2 are detected at the inspection test and enter the repair state; L3 is normal.
State 14: The DU failures of L1, L2, and L3 are detected at the inspection test and enter the repair state.
The Markov transition matrix of the single-channel structure with L2 and L3 is as follows:
In addition, a connection matrix is required to infer the situation in the next stage from the initial conditions. The connection matrix for the inspection test phase is as follows:
The initial state is State 1, in which all components operate normally; that is, $P(0) = [1, 0, \dots, 0]$.
The state at any moment $t$ within the inspection test interval is as follows:
Therefore, the probability that the single-channel architecture with L2 and L3 is in a dangerous failure state at time $t$ is as follows:
$$P_{\mathrm{DF}}(t) = P_7(t) + P_8(t) + P_9(t).$$
The average frequency of dangerous failure (PFH) is applicable to systems that require continuous operation to ensure safety, such as train protection systems [
31].
IEC 61508 provides the general formula for PFH calculation as follows:
$$\mathrm{PFH} = \frac{1}{T}\int_0^{T} w(t)\,dt,$$
where $w(t)$ is the unconditional failure frequency and $T$ is the operating cycle; that is, PFH is defined in IEC 61508 as the average value of $w(t)$ over the operating cycle. In the Markov state transition diagram, a DU state represents an undetectable dangerous failure state of the unit; therefore, $w(t)$ can be understood as the sum, per unit time, of the transition rates from all other states into the DU states. Each contribution is the product of the probability $P_i(t)$ of currently being in State $i$ and the transition rate $\lambda_{i,\mathrm{DU}}$ from that state to a failure state that cannot be detected by inspection tests. The expression for $w(t)$ is therefore as follows:
$$w(t) = \sum_i P_i(t)\,\lambda_{i,\mathrm{DU}}.$$
In addition, the formula for calculating PFH with the Markov method, as required in this paper, follows directly: over an inspection test interval of length $T_1$,
$$\mathrm{PFH} = \frac{1}{T_1}\int_0^{T_1} \sum_i P_i(t)\,\lambda_{i,\mathrm{DU}}\,dt.$$
The dangerous failure states of the single-channel architecture with L2 and L3 are States 7, 8, and 9. States 3 and 5 transition into State 7, States 2 and 5 transition into State 8, and States 3 and 4 transition into State 9, each with its corresponding transition rate. The PFH of the single-channel architecture with L2 and L3 can therefore be expressed as follows:
$$\mathrm{PFH} = \frac{1}{T_1}\int_0^{T_1} \big[P_2(t)\,\lambda_{2,8} + P_3(t)\,(\lambda_{3,7}+\lambda_{3,9}) + P_4(t)\,\lambda_{4,9} + P_5(t)\,(\lambda_{5,7}+\lambda_{5,8})\big]\,dt,$$
where $T_1$ denotes the inspection test interval and $\lambda_{i,j}$ denotes the transition rate from State $i$ to State $j$. The single-channel scenario in which L1 and L2 run on X86 hardware and L3 runs on ARM hardware is selected, with the failure rates set as described above. Referring to IEC 61508, the inspection test interval $T_1$ is three months, and the single-channel repair rate is $\mu = 1/\mathrm{MTTR} = 0.125$/h (IEC 61508 assumes MTTR = 8 h). The value of $\beta$ is taken as 2%. The variation in PFH for the single-channel structure with L2 and L3 is shown in
Figure 7.
As can be seen from the figure, even when the diagnostic coverage (DC) is greater than or equal to 90%, the PFH does not fall within the range specified for SIL4 ($[10^{-9}, 10^{-8})$/h), thereby failing to meet the SIL4 requirements defined in IEC 61508.
Since the dual-channel two-out-of-two structure with L2, L3, and L4 is studied subsequently, the equivalent safety parameters of a single channel with L2 and L3 are derived here for convenience of calculation.
The equivalent detectable dangerous failure rate of a single channel, $\lambda_{\mathrm{DD,eq}}$, is taken as the total failure rate that leads to a detectable dangerous failure of the entire channel. In addition, the PFH of the single channel is taken as its equivalent undetectable dangerous failure rate, denoted $\lambda_{\mathrm{DU,eq}}$.
The equivalent diagnostic coverage rate is then as follows:
$$\mathrm{DC_{eq}} = \frac{\lambda_{\mathrm{DD,eq}}}{\lambda_{\mathrm{DD,eq}} + \lambda_{\mathrm{DU,eq}}}.$$
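To illustrate the computation procedure numerically, the following Python sketch applies the same PFH definition to a deliberately simplified three-state chain (working, detected dangerous, undetected dangerous); the rates are assumed values and the chain is not the 14-state model analyzed above.

```python
# Numerical sketch of the PFH computation on a tiny 3-state model
# (W = working, DD = detected dangerous, DU = undetected dangerous).
# It integrates dP/dt = P*Q over one inspection interval and averages the flow
# into DU, i.e. PFH = (1/T1) * integral of w(t), with w(t) = sum_i P_i * lambda_{i,DU}.
# All rate values are illustrative assumptions.

import numpy as np

lam_dd, lam_du, mu = 9e-6, 1e-6, 0.125     # assumed rates per hour (DC = 90%)
T1, dt = 2190.0, 0.5                        # assumed three-month interval, step in hours

# Generator matrix Q over states [W, DD, DU]; rows sum to zero.
Q = np.array([
    [-(lam_dd + lam_du), lam_dd, lam_du],
    [mu,                 -mu,    0.0   ],
    [0.0,                0.0,    0.0   ],   # DU is absorbing within the interval
])

P = np.array([1.0, 0.0, 0.0])               # initial state: everything working
w_integral = 0.0
for _ in range(int(T1 / dt)):
    w_integral += (P[0] * lam_du) * dt      # flow from W into DU during this step
    P = P + P @ Q * dt                      # explicit Euler step of dP/dt = P*Q

pfh = w_integral / T1
print(f"PFH over one inspection interval: {pfh:.3e} per hour")
```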
5.2. The Dual-Channel Two-out-of-Two Structure with L2, L3, and L4
The Markov safety model of the dual-channel two-out-of-two structure with the second-level monitor L2, the third-level monitor L3, and the fourth-level monitor L4 is shown in
Figure 8.
The meanings of each state are as follows:
State 1: Both channels and the fourth-level monitor L4 are operating normally.
State 2: Either single channel is in the detectable repair state DD, and the fourth-level monitor L4 is operating normally.
State 3: Either single channel is in the undetectable state DU, and the fourth-level monitor L4 is operating normally.
State 4: Both channels are in the detectable repair state DD, and the fourth-level monitor L4 is operating normally.
State 5: Both channels are in the undetectable state DU, and the fourth-level monitor L4 is operating normally.
State 6: One single channel is in the detectable repair state DD, the other single channel is in the undetectable state DU, and the fourth-level monitor L4 is operating normally.
State 7: One single channel and the fourth-level monitor L4 are in the detectable repair state DD.
State 8: One single channel is in the detectable repair state DD, and the fourth-level monitor L4 is in the undetectable state DU.
State 9: One single channel is in the undetectable state DU, and the fourth-level monitor L4 is in the detectable repair state DD.
State 10: One single channel and the fourth-level monitor L4 are in the undetectable state DU.
State 11: Both channels and the fourth-level monitor L4 are in the detectable repair state DD.
State 12: Both channels are in the detectable repair state DD, and the fourth-level monitor L4 is in the undetectable state DU.
State 13: Both channels are in the undetectable state DU, and the fourth-level monitor L4 is in the detectable repair state DD.
State 14: Both channels and the fourth-level monitor L4 are in the undetectable state DU.
State 15: One single channel and the fourth-level monitor L4 are in the detectable repair state DD, and the other single channel is in the undetectable state DU.
State 16: One single channel and the fourth-level monitor L4 are in the undetectable state DU, and the other single channel is in the detectable repair state DD.
State 17: Both channels are operating normally, and the fourth-level monitor L4 is in the detectable repair state DD.
State 18: The fourth-level monitor L4 is in the undetectable state DU.
States 19–30: Any component that was in the undetectable state DU enters the corresponding repair state once the inspection test time point arrives.
In the dual-channel two-out-of-two structure with L2, L3, and L4, when both channels are in a failed state and L4 is also failed and unable to provide the degradation function, the system is in a dangerous failure state. Therefore, States 11 to 16 are taken as the situations in which the entire system enters a dangerous failure state.
The Markov transition matrix of the dual-channel two-out-of-two structure with L2, L3, and L4 within the inspection test interval is as follows:
The connection matrix is omitted here due to its excessively high dimensionality.
The initial state is State 1, in which all components operate normally; that is, $P(0) = [1, 0, \dots, 0]$.
Therefore, the probability that the dual-channel two-out-of-two structure with L2, L3, and L4 is in a dangerous failure state at time $t$ is as follows:
$$P_{\mathrm{DF}}(t) = \sum_{i=11}^{16} P_i(t).$$
For the dual-channel two-out-of-two structure with L2, L3, and L4, the corresponding dangerous failure states are States 11 to 16, and the PFH can be expressed as follows:
$$\mathrm{PFH} = \frac{1}{T_1}\int_0^{T_1}\ \sum_{j=11}^{16}\ \sum_{i\notin\{11,\dots,16\}} P_i(t)\,\lambda_{i,j}\,dt.$$
The safety model using the reliability block diagram method for the two-out-of-two structure with L2, L3, and L4 is shown in
Figure 9.
As can be seen from the figure, the relationship between independent failures and common-cause failures in the reliability block diagram is in series.
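The series combination can be sketched numerically as follows; the β-factor split and the quadratic approximation for the independent double-failure term are common simplifications used here for illustration only and are not the PDS calculation performed in this paper.

```python
# Sketch of how a series reliability block diagram combines contributions: the
# system fails dangerously if either the independent-failure block or the
# common-cause block fails, so (for small frequencies) their PFH contributions
# approximately add. Beta-factor split and the rough quadratic approximation for
# the independent 1oo2 term are illustrative assumptions, as are the numbers.

beta = 0.02                 # assumed common-cause fraction
lam_du_channel = 1e-6       # assumed undetected dangerous rate of one channel, per hour
tau = 2190.0                # assumed proof test interval, hours

pfh_ccf = beta * lam_du_channel                              # common cause takes out both channels at once
pfh_independent = ((1 - beta) * lam_du_channel) ** 2 * tau   # rough approximation: two independent DU failures
pfh_total = pfh_independent + pfh_ccf                        # series combination: contributions add

print(f"independent: {pfh_independent:.2e}/h, CCF: {pfh_ccf:.2e}/h, total: {pfh_total:.2e}/h")
```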
Using the method provided by PDS [
32], the PFH values of the entire L1, L2, and L3 are calculated as follows:
where $T_1$ denotes the inspection test time interval and $C_{MooN}$ denotes the multiplicative coefficient used to quantify the common-cause failure, which can be obtained by looking up the values provided by the PDS method.
For a single channel, the PFH is equal to its undetectable failure rate.
Then, the PFH of L4 is as follows:
The overall PFH is obtained as follows:
Among them, K is the conversion coefficient for switching to the degraded mode. According to IEC 61508, the value of K is 0.98.
5.3. The Dual-Channel Two-out-of-Two Structure with L2, L3 and L4 (L4 Has a Voting-Based Degradation Function but No Monitoring Function)
To demonstrate the safety benefit and necessity of the dual-channel two-out-of-two structure with L2, L3, and L4 in which L4 simultaneously provides voting, monitoring, and degradation, the safety of the structure must also be analyzed when some of these functions are absent. The preceding reliability analysis showed that removing the degradation function significantly degrades reliability, and reliability must be preserved when analyzing safety. Therefore, only the case in which L4 retains the degradation mode but lacks the monitoring function is analyzed below, and it is then compared with the case in which L4 has both the monitoring and degradation functions.
The Markov safety analysis of the dual-channel two-out-of-two structure with L2, L3, and L4 (where L4 has the voting-based degradation function but no monitoring function) still refers to the model shown in Figure 8, but the dangerous failure states used to calculate the PFH differ from those described above.
Since L4 does not have the monitoring function, the challenge–response mechanism formed between L3 and L4 fails. This means that when L4 is in a failed state, the single channel with L3 will no longer be able to know the state of L4, and thus the system will not stop operating.
Therefore, if L4 fails, the single channel or the dual channels will still continue to operate. In that case, since the failed L4 can no longer provide the voting function, there is no way to know whether the outputs of the two channels are consistent. Once the outputs of the two channels are inconsistent, the system can neither enter the degradation mode nor stop, which places it in a situation where a dangerous failure may occur. Corresponding to the Markov safety model in Figure 8, States 7, 8, 9, 10, 17, and 18 should therefore be added to the set of states in which the overall system output is in a dangerous failure state.
Based on the above analysis, the probability that the dual-channel two-out-of-two structure with L2, L3, and L4 (where L4 has the voting-based degradation function but no monitoring function) is in a dangerous failure state at time $t$ is as follows:
$$P_{\mathrm{DF}}(t) = \sum_{i=7}^{18} P_i(t).$$
Furthermore, the PFH can be expressed as follows:
For an L4 without the monitoring function, if L4 fails, the single channel has no way to sense it, and a failed L4 can no longer drive the system into a safe state. Therefore, when the reliability block diagram method is used to model the safety, the branch containing L4 is omitted, while the effect of the degradation mode on safety is retained. The corresponding reliability block diagram model is shown in
Figure 10.
At this time, the PFH is the overall PFH of L1, L2, and L3, plus the influence brought about by the degradation mode, that is as follows:
5.4. Comparison of Safety Performance
When calculating with the Markov safety model, it is necessary to first specify the failure rate of the single channel. As established above, the failure rate of a single channel in the dual-channel two-out-of-two structure with L2, L3, and L4 can be obtained from the equivalent dangerous failure rates $\lambda_{\mathrm{DD,eq}}$ and $\lambda_{\mathrm{DU,eq}}$ of the single-channel structure with L2 and L3. The inspection test interval is three months, and the repair rates of the single channel are taken as before ($\mu_{\mathrm{DD}} = 1/\mathrm{MTTR}$, $\mu_{\mathrm{DU}} = 1/\mathrm{MRT}$). In order to ensure the safe and reliable functionality of L4 safety-critical equipment with a safety integrity level not lower than SIL4, the failure rates of its hardware and software must be lower than those of conventional hardware and software, and its hardware failure rate is set accordingly. Whether it is the common-cause failures of the 1oo2 (one-out-of-two) heterogeneous redundancy within L1/L2 and within L3, or the common-cause failure between the two 1oo2 groups as a whole, the root cause is still the common-cause failure between the X86 and ARM platforms; therefore, these three common-cause failure coefficients are taken as the same value. In addition, the common-cause failure coefficient between the two 1oo2 groups and L4 should be smaller than the previous three $\beta$ values, but for a conservative calculation it is also taken to be the same. The cases where $\beta$ is 2% and 20% are discussed, respectively. The comparison of the PFH of the dual-channel two-out-of-two structure with L2, L3, and L4 in which L4 has the voting-based degradation function but no monitoring function, and of the structure in which L4 has complete functions, is shown in
Figure 11.
Table 6 shows the PFH values of the three structures under different diagnostic coverage rates and common-cause failure coefficients $\beta$. To clearly show the safety comparison among the architectures, data within the range $[10^{-9}, 10^{-8})$/h, which meets the high-demand (or continuous) mode requirement of SIL4, are marked in red.
In the actual design of a signal system, a safety structure with a high diagnostic coverage rate should be adopted to ensure safety. The DC of a system is one of the important factors affecting its reliability and safety; this indicator reflects the system's self-inspection and fault self-checking capability. According to IEC 61508, typical values of DC are 60%, 90%, and 99%, and the range of the common-cause failure coefficient $\beta$ is [1%, 20%]. For systems such as train control systems, which have high requirements for reliability and safety, the DC should be greater than or equal to 90%, and the common-cause failure coefficient should be kept as small as possible. From the PFH values calculated for the three structures under the various conditions in the table, it can be seen that the safety of the single-channel structure with L2 and L3 cannot meet high-safety-level requirements under any circumstances. The dual-channel two-out-of-two structure with L2, L3, and L4 in which L4 has the voting and degradation functions but no monitoring function falls within the range $[10^{-9}, 10^{-8})$/h only when the diagnostic coverage rate is very high and the common-cause failure coefficient is very low; in the majority of cases, it remains incapable of meeting the stipulated requirements. In contrast, the dual-channel two-out-of-two structure with L2, L3, and a fully functional L4 reaches the SIL4 safety level specified in IEC 61508 in most cases, which meets the safety requirements well.
It can thus be concluded that the multi-level monitoring architecture, through its multi-layered and closed-loop monitoring mechanism, effectively compensates for the deficiencies of local monitoring structures in risk perception and response. Its safety performance is more stable and secure across different scenarios, providing a key guarantee for the train control system’s cloud security platform to address software common-cause failures and meet SIL4 safety requirements. This difference also indicates that in a cloud computing environment, partial monitoring structures relying solely on local redundancy or simplified monitoring cannot cope with multi-dimensional security challenges, and multi-level active monitoring is an essential means to ensure system security.