Railway Cloud Resource Management as a Service

Atanasov, Ivaylo; Dimitrova, Dragomira; Pencheva, Evelina; Trifonov, Ventsislav

doi:10.3390/fi17050192

Open Access Article

¹ Faculty of Telecommunications, Technical University of Sofia, 1000 Sofia, Bulgaria

² Faculty of Telecommunications and Electrical Equipment in Transport, Todor Kableshkov University of Transport, 1574 Sofia, Bulgaria

* Author to whom correspondence should be addressed.

Future Internet 2025, 17(5), 192; https://doi.org/10.3390/fi17050192

Submission received: 26 March 2025 / Revised: 22 April 2025 / Accepted: 23 April 2025 / Published: 24 April 2025

(This article belongs to the Special Issue Cloud and Edge Computing for the Next-Generation Networks)

Abstract

Cloud computing has the potential to accelerate the digital journey of railways. Railway systems are large and complex, comprising many parts such as trains, tracks, signaling systems, and control systems. The application of cloud computing technologies in the railway industry has the potential to enhance operational efficiency, data management, and overall system performance. Cloud management is essential for complex systems, and the automation of management services can speed up the provisioning, deployment, and maintenance of cloud infrastructure and applications by enabling visibility across the environment. It can provide consistent and unified management over resource allocation, streamline security processes, and automate the monitoring of key performance indicators. Key railway cloud management challenges include the lack of open interfaces and standardization, which are related to the vendor lock-in problem. In this paper, we propose an approach to designing railway cloud resource management as a service. Based on typical use cases, the requirements for fault and performance management of the railway cloud resources are identified. The main functionality is designed as RESTful services. The feasibility of the approach is proved by formal verification of the cloud resource management models supported by the cloud management application and services. The proposed approach is open, in contrast to proprietary solutions, and features scalability and interoperability.

Keywords: railways; cloud computing; cloud management; service-based architecture; RESTful services; state models; formal verification

1. Introduction

Nowadays, railway operators are striving to improve their services in competition with other forms of transport. The aim is to increase the capacity of trains to carry more passengers and freight in response to rising traffic demand, and to provide a better travel experience, while ensuring a higher level of safety and security. Integrating cloud computing into railways could enhance efficiency, maintenance, and operations by leveraging scalable, remote, and data-driven solutions. Typical cloud-based applications include maintenance and predictive analytics related to railway infrastructure and rolling stock [1,2], railway signaling and traffic management [3,4,5], internet of things integration [6,7,8], big data analytics and railway planning [9,10], Artificial Intelligence (AI) for operation optimization [11,12], and cloud-based ticketing and passenger management [13,14].

By adopting cloud technology, digital railways become more adaptive while maintaining high safety requirements. Adaptive railway systems utilize a higher level of automation to control train speed and the transportation of passengers and goods, reducing travel time and increasing efficiency and competitiveness. Furthermore, cloud-based solutions can provide accurate real-time data on rail operations to detect potential dangers and anomalies, and enable useful insights for optimal monitoring and control. Operational data reflect the state of physical assets, such as train location and speed, train door conditions, track temperature, and weather data. The data are gathered by IoT sensors onboard the train and alongside the tracks, sent to the cloud for further processing, and consumed by different cloud applications. Operational data and analytic applications enable real-time monitoring and control, and thus improve service reliability.

Essentially, the railway cloud is a segregated pool of computing and storage resources dedicated to running mission-critical applications and storing the associated data. Usually, it is hosted in private data centers. Due to the highly critical nature of the operational technology applications, the essential railway cloud attributes include strong resilience with five-nines (99.999%) availability, deterministic quality of service (QoS), robust network security, distributed architecture, and edge computing integration. Safety-critical railway applications, which are responsible for controlling and monitoring the train systems, are highly sensitive to communication disruptions, which may increase the risk of safety hazards; robust connectivity is therefore essential to preserve mission-critical-grade availability [15,16]. Strong and flexible QoS capabilities include railway application awareness and prioritization, which ensure good application performance, reduce congestion issues, and protect the network from server misconfiguration [17]. Railways handle sensitive data, so securing the cloud environment is critical to protect against cyber threats. Ensuring compliance with data protection regulations is essential when storing and processing passenger and operational data in the cloud [18,19]. Railway networks are geographically dispersed, so the cloud infrastructure must support distributed computing across multiple regions or locations [20]. Railway operations often involve edge computing to process data locally at stations, on trains, or along tracks before sending it to the cloud [21,22,23].

In this paper, a service-based approach to railway cloud resource management is proposed. The focus is on the cloud orchestration functions related to fault and performance management.

The main contributions in this paper may be summarized as follows:

The requirements for fault and performance management of the railway cloud resource are identified based on an analysis of different use cases;
The communication between railway cloud management applications and the railway cloud management services for fault and performance management is designed as RESTful Application Programming Interfaces (APIs);
Models representing the views on the alarm status of a fault management application and a fault management service are developed and formally described. Using the concept of synchronization between the states of two state machines, it is proved that the models maintain synchronized-in-time views on the alarm status;
Models representing the views of an application and a service on the performance management job status and on the process of subscription to and notification of performance data are developed and formally described. It is proved that the models maintain synchronized-in-time views.

The rest of this paper is organized as follows. Section 2 discusses related works and highlights this paper's outcomes. Section 3 and Section 4 describe, respectively, the typical use cases of cloud resource fault management and performance management, identify the required functionality, and present the RESTful APIs for interaction between cloud management applications and services. The conclusion summarizes the contribution.

2. Related Works

Cloud orchestration refers to the process of automating and managing various tasks related to deploying, configuring, scaling, and maintaining applications and services in a cloud environment. It involves coordinating multiple cloud resources—such as compute instances, storage systems, network configurations, security settings, databases, and more—to ensure that everything works together seamlessly. Being a cloud, the railway cloud naturally is characterized by properties like those of the commercial clouds. Some key scenarios where cloud orchestration is practically useful include the following:

Automated deployment of services and applications across multiple environments. This ensures consistent configurations and reduces human errors through automation. In [24], the authors present a survey on edge clouds that use automated deployment mechanisms, namely Infrastructure as Code tools. In [25], the authors describe a framework for managing edge computing and heterogeneous high-performance computing clusters. A framework that facilitates the development and deployment of AI services at the network edge is proposed in [26].
Resource management. This enables the management and scaling of infrastructure resources based on demand. Automatic allocation and deallocation of infrastructure resources optimizes the costs. In [27], the authors present a literature survey of cloud computing algorithms and provide a comparative study of various resource management techniques. In [28], the authors propose a strategy for cloud resource management based on an auction mechanism which improves the resource allocation rate. In [29], the authors present a collaborative cloud resource management approach based on a job scheduling algorithm, which is an improved version of a swarm intelligence algorithm that reduces the convergence speed for optimal results. In [30], the authors demonstrate the advantages and significance of the workload pattern for learning-based cloud resource management.
Management of multi-cloud or hybrid cloud environments. Coordinating resources across multiple cloud providers needs to ensure the seamless integration and management of diverse environments. A discussion on management of computing in the era of the hybrid cloud is presented in [31]. In [32], the authors provide a review of the research on multi-cloud management platforms.
Continuous integration/continuous deployment (CI/CD). This includes automation of the testing and deployment process to ensure rapid and reliable software delivery. In [33], the authors present an implementation of a cybersecurity approach to the CI/CD pipeline that automates the installation and deployment of its various components in cloud-based systems. An entire automated pipeline, starting with detecting changes in the application source code, creating new resources in the Kubernetes cluster to host this new version, and finally deploying the containerized application is presented in [34]. Based on a discussion of use cases and current challenges, the authors of [35] describe a framework for managing cloud-based AI application lifecycles and its key components. Research on configuration management of cloud-based applications is presented in [36].
Disaster recovery. The creation of automated disaster recovery plans can quickly restore services after an outage or failure. This includes testing and validating backup procedures in a streamlined manner. A discussion on the role of cloud computing in disaster management preparation, and on how an organization can use the latest technology to minimize the consequences of a disaster, is presented in [37]. A cloud platform disaster recovery model based on the characteristics of the cloud platform, data replication, and load balancing technologies is described in [38].
Security and Compliance. Implementing security policies across all cloud resources is important to maintain compliance with industry standards. This includes automation of security measures enforcement, such as access controls, encryption, and logging. In [39], the authors outline the security issues that cloud computing raises, and suggest solutions that safeguard private information and systems in cloud-based environments for businesses. In [40], the author provides general guidelines on auditing standards by referring to threats and vulnerabilities, and suggests a unified approach toward audit considerations in cloud computing environment.
Monitoring and analytics. Integration of cloud orchestration with monitoring and analytics platforms provides real-time insights into the performance of the cloud infrastructure. It is aimed at automatic resource adjustment based on usage patterns and alerts from monitoring tools. The tutorial presented in [41] discusses the AI techniques that can help in fault and performance management in multi-cloud virtual network. In [42], the authors present an application anomaly detection and bottleneck identification system based on cloud platform service components, that can monitor and analyze applications on multi-layered cloud platforms with customized index values. Studies on the platform design of distributed cloud monitoring and the key technologies of big data storage can be found in [43]. The survey presented in [44] provides an overview and analysis of advanced techniques for anomaly detection and localization of cloudified multi-service applications.
Container management. Orchestration tools, like Kubernetes, can manage containerized applications, ensuring high availability and efficient resource utilization. This also covers the automatic scaling of container instances based on application demand. The authors of [45] discuss and compare emerging container platforms and cloud-centric orchestration frameworks, highlighting the challenges involved.
Microservices architecture. Microservices are a powerful architectural paradigm for creating and deploying contemporary applications in the cloud computing environment. They can be managed and scaled independently. Microservices management must ensure service discovery and load balancing across distributed systems. In [46], the authors compare the features and constraints of several cloud platforms and tools for deploying and orchestrating microservices. A comprehensive overview of microservices as a suitable complementation of cloud computing is provided in [47], where the authors outline their technical challenges, such as performance, debugging, and data consistency.

In our previous research, the basic concepts related to the railway cloud orchestration and management are defined, and the requirements for its basic orchestration are identified [48]. The management interface provided by the railway cloud platform (RCP) to the Railway Management Automation and Orchestration (RMAO) platform includes two types of services:

Cloud resource management (CRM) services that orchestrate railway cloud lifecycle processes, and are responsible for the allocation and delivery of cloud resources and resource management software, including cloud deployment services, cloud infrastructure catalogue services, railway cloud monitoring services, railway cloud provisioning services, etc.
Cloudified function management (CFM) services that are responsible for the management of the lifecycles of the cloudified railway functions deployed on the railway cloud.

The basic orchestration of the railway cloud is related to the railway cloud’s provisioning. The provisioning of the railway cloud includes the railway cloud’s deployment and the integration of cloud services within the railway infrastructure. It assumes that the basic infrastructure (e.g., physical resources) is built and installed, and that secured connectivity between the components of the intelligent railway control systems exists. In [48], our focus was on the deployment of railway cloud services and resource allocation. An approach to design of functions related to the CFM services’ software update as a service was proposed. The infrastructure catalogue of the railway cloud contains information about the cloud physical infrastructure, the managed cloudified railway functions, as well as the CFM services, and information of the allocated cloud resources. In [49], we proposed an approach to design the management functionality of the railway cloudified applications.

In this paper, our focus is on the fault management (FM) and the performance management (PM) functions of railway cloud resources. Different unsupervised learning AI techniques that can detect and predict failures are discussed in [50]. Cloud computing failures, recovery approaches, and management tools are presented in [51]. In [52], the authors propose fault-tolerant scheduling by the use of dynamic re-clustering that addresses task and job failures in scientific workflows. A framework for a real-time system event analysis of log monitoring is proposed in [53], where the authors use the long short-term memory algorithm for predictive maintenance in a cloud fault monitoring system. An algorithm that optimizes the performance of cloud computing management for networks with high traffic is proposed in [54]. In [55], the authors present a model for cloud security and performance management. An overview of various power and performance management strategies in cloud computing is provided in [56]. In [57], the authors analyze performance, scalability, availability, and security aspects in different cloud computing environments. Different cloud service, cloud management, load-distributing, and cloud strength tools are analyzed in [58], with the aim of identifying their performance and extensibility. In [59], the authors analyze the types of failures related to cloud computing in railways and develop fault-tolerant mechanisms for physical nodes and virtual machines.

The specifics of railways add challenges and requirements that are unlike the commercial clouds, namely high reliability, low latency, rapid elasticity, and sustainability. Cloud hosting of critical railway applications reduces the impact of application and/or service outages through built-in availability and faster response times, which directly impact passenger safety, reduce property damages, and improve performance efficiency.

An analysis of the references shows a clear lack of open interfaces for railway cloud management. The communication between the managed objects, i.e., the cloud resources, and the managing system (the cloud management system) can be based on a service-based architecture, which integrates distributed, separately maintained and deployed software components. The referenced related works deal with methods and algorithms for cloud management and place no emphasis on APIs, as they do not focus on the communication between the managing part and the managed part. Following the principles of service-based architecture, these methods and algorithms may be realized as services that are used by cloud management applications. The open API approach to cloud management is language-agnostic and enables the discovery of the management capabilities of services.

We propose a service-based approach to the fault and performance management of railway cloud resources. The research methodology includes an analysis of use cases, an identification of management functionality, a synthesis of railway cloud management services, and a verification of the proposed approach.

The service design follows the principles of Representational State Transfer (REST). The “resource-based” principle is a fundamental concept in RESTful architecture, and it is what sets REST apart from other architectural styles. In REST, everything is considered a resource. As a client–server architecture, it provides a uniform interface used to communicate between clients and servers, which includes HTTP methods (GET, POST, PUT, DELETE), URI (Uniform Resource Identifier) syntax, and standard media types. The RESTful approach is stateless as each request from the client contains all the information necessary to complete the request, so the server does not need to store any information about the client state. Responses from the server can be cached by the client to reduce the number of requests made to the server.

RESTful APIs are widely used in web development, and have become the de facto standard for designing networked applications [60].

The verification of the approach includes the following: the development of cloud resource models supported by the cloud management application and by the cloud resource management services; their formal definitions as Labeled Transition Systems (LTSs); and proofs that the models maintain synchronized-in-time views on the state of the cloud resources. An LTS is a formal notation used to describe finite state machines as a quadruple of a set of states, a set of actions, a set of transitions, and an initial state. We introduce the concept of synchronization between the states of two LTSs, and provide its formal definition. This concept is used to prove that the developed models are synchronized in time. It is based on the identification of pairs of states in the respective LTSs and a mapping between transition sequences in synchronized states for both LTSs. We use the concept of synchronization between the states of two LTSs for the following reasons:

Verification of API: Synchronization between the states is used to verify that an API (as a way of communication in distributed system) behaves correctly under various conditions.
Concurrency verification: Synchronization between the states is used to verify whether the models maintained by a managing application and respective models maintained by a service, which communicate with each other, are synchronized in time. This is important for ensuring that systems behave correctly when they run concurrently.
Model checking: Synchronization between the states is used in model checking, which is a method of verifying the correctness of a system by constructing mathematical models of the application logic and service logic, and then checking them against the desired behavior. The concept is used to verify whether the models behave as expected.
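
As an illustration of the LTS notion described above, the quadruple of states, actions, transitions, and initial state can be sketched as a small Python data structure. The class name, the method names, and the two-state toy example are assumptions made for illustration only; they are not the notation or the models used in the formal verification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LTS:
    """Labeled Transition System: (states, actions, transitions, initial state)."""
    states: frozenset
    actions: frozenset
    transitions: frozenset  # set of (source, action, target) triples
    initial: str

    def step(self, state, action):
        """Return the set of states reachable from `state` via `action`."""
        return {t for (s, a, t) in self.transitions if s == state and a == action}

    def run(self, trace):
        """Follow a sequence of actions from the initial state.

        Returns the set of states in which the trace may end; an empty set
        means the LTS has no transition sequence labeled with this trace.
        """
        current = {self.initial}
        for action in trace:
            current = {t for s in current for t in self.step(s, action)}
        return current


# Toy example (hypothetical): an alarm that is raised and later cleared.
toy = LTS(
    states=frozenset({"Null", "Active"}),
    actions=frozenset({"raise", "clear"}),
    transitions=frozenset({("Null", "raise", "Active"), ("Active", "clear", "Null")}),
    initial="Null",
)
```

Representing transitions as explicit triples keeps the structure close to the quadruple definition, which makes it straightforward to enumerate pairs of states when comparing two such systems.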

3. Fault Management of Railway Cloud Resources

3.1. Fault Management Use Cases

Fault management and performance management are two pillars of railway cloud resource management. They are closely related yet independent, and address two different aspects of cloud health. While fault management deals with detecting and isolating cloud problems, performance management focuses on monitoring that gives proactive warning of cloud performance degradation.

Cloud fault management (FM) typically encompasses several key aspects:

Notification: alerting the cloud operator when a fault is detected, so that prompt action can be taken to resolve it;

Detection: identifying faults and localizing problems;

Analysis: investigating the root cause of the fault to understand its nature and impact on the system;

Resolution: implementing corrective actions to resolve the fault, such as restarting services, patching software, or replacing hardware.

Effective cloud FM strategies include tight integration with performance management, aimed at proactive monitoring which regularly checks for potential issues before they become major problems, automated remediation enabling the use of automated tools and scripts to quickly resolve common faults and minimize manual intervention, and incident management which implements a structured approach to managing incidents, including notification, analysis, and resolution.

The detection aspects of cloud FM include, for example, the following: anomaly detection that identifies patterns or events that deviate from normal behavior, such as unusual login times or unexpected changes in CPU usage; predictive analytics that use statistical models, machine learning algorithms, and historical data to predict when a fault is likely to occur; real-time processing that analyzes real-time data streams from cloud sources to detect issues before they become problems; and root cause analysis, which investigates the underlying causes of the problem.
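
To make the anomaly detection aspect concrete, the following sketch flags a CPU usage sample that deviates strongly from a baseline window. The z-score rule, the default threshold of 3, the function name, and the sample values are all assumptions chosen for illustration; no specific detection method is prescribed here.

```python
from statistics import mean, stdev


def is_anomalous(baseline, sample, z_threshold=3.0):
    """Return True if `sample` deviates from the baseline mean by more than
    `z_threshold` sample standard deviations (simple z-score rule)."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A constant baseline: any different value counts as a deviation.
        return sample != mu
    return abs(sample - mu) / sigma > z_threshold


# Made-up CPU usage readings in percent, used as the "normal" window.
cpu_baseline = [40.0, 42.0, 41.0, 39.0, 43.0, 40.0]
```

A sample close to the baseline, such as 41%, is not flagged, whereas a jump to 95% is; in practice the window length and threshold would need tuning per resource type.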

The notification aspects of cloud FM are related to alarm management and notifications alerting the cloud operator when a fault or anomaly is detected, including notifications about disk space exceeding threshold, high CPU usage, unusual network traffic, etc.

The analysis aspects of cloud FM, in addition to root cause analysis, include troubleshooting, which involves methodical steps taken to identify and resolve a fault or anomaly, such as checking system logs, network traffic, and cloudified function performance, and testing different configurations, versions, or approaches to identify the problem cause.

The resolution aspect of cloud FM involves implementing changes or fixes to resolve a fault or anomaly that has occurred in the cloud infrastructure or applications, and may include applying patches or updates of software components, updating system configuration, or deploying new versions of cloud platform software or cloudified applications.

The detection and notification of cloud FM could be synthesized as a Cloud Resource Fault Management (CRFM) service. Typical use cases for railway cloud FM include alarm configuration, subscription management to alarm notifications, notifications of alarm occurrence, acknowledgement and clearance, and retrieval of alarm information.

A fault is an error that the railway cloud resource detects. The railway cloud resource logs and stores all detected faults. To be notified about cloud resource faults, a cloud management (CM) application in the RMAO needs to subscribe to notifications of alarms. An alarm is a notification of a problem that needs the attention of the subscriber. To filter alarm notifications, the CM application defines the subscription criteria.

Figure 1 illustrates the communication model for an alarm subscription between a CM application in the RMAO and the CRFM service.

The railway cloud operator provides notification criteria to initiate the subscription. The CM application sends a subscription request, which identifies the subscriber, to the CRFM service. A callback notification check is performed, which includes a reachability test and authorization. The subscriber is informed about the reachability and authorization result. If the check succeeds, the subscription is acknowledged by the CRFM service, and the subscription status is indicated to the railway cloud operator.

The same communication pattern is used when the subscriber wants to change its subscription. To change the subscription characteristics, the subscriber needs to delete the previous alarm subscription and create a new one. These two operations have to be executed atomically, which ensures that no alarms are missed between them.
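
The subscription flow and the atomic delete-and-create requirement can be sketched as follows. This is a minimal in-memory stand-in, assuming a single-process service: the class name, the method names, and the `callback_check` hook (which models the reachability and authorization test) are hypothetical and not part of the CRFM API definition.

```python
import threading
import uuid


class SubscriptionRegistry:
    """In-memory stand-in for the subscription handling of a CRFM service."""

    def __init__(self, callback_check):
        # callback_check(uri) -> bool models the reachability and
        # authorization test of the callback address (assumed interface).
        self._callback_check = callback_check
        self._lock = threading.Lock()
        self._subs = {}

    def subscribe(self, callback_uri, criteria):
        """Create a subscription; return its ID, or None if the check fails."""
        if not self._callback_check(callback_uri):
            return None
        with self._lock:
            return self._add(callback_uri, criteria)

    def replace(self, old_id, callback_uri, criteria):
        """Delete the old subscription and create the new one atomically."""
        if not self._callback_check(callback_uri):
            return None  # the old subscription stays untouched
        with self._lock:
            if old_id not in self._subs:
                raise KeyError(old_id)
            del self._subs[old_id]
            return self._add(callback_uri, criteria)

    def snapshot(self):
        """Return a copy of the current subscriptions, taken under the lock."""
        with self._lock:
            return dict(self._subs)

    def _add(self, callback_uri, criteria):
        # Caller must hold self._lock.
        sub_id = str(uuid.uuid4())
        self._subs[sub_id] = (callback_uri, dict(criteria))
        return sub_id
```

Because a notification dispatcher would read the subscriptions through `snapshot()` under the same lock, it observes either the old subscription or the new one, never the gap between deletion and creation.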

A cloud resource may encounter two types of faults: a fault of type A, which can be remediated by the cloud resource itself, and a fault of type B, which cannot be resolved by the cloud resource. In the case of a fault occurrence, the cloud resource logs the alarm internally. Type A faults are those that can be handled autonomously, e.g., by route re-selection to a resource that has become inaccessible (due to congestion, network segment outage, or attack) or by data recovery (through a backup service). Type B faults still require human intervention (e.g., rocks on the rails or a balise malfunction). The railway cloud platform evaluates which faults must become alarms; for example, an alarm is raised for a fault of type B only. Optionally, the CRFM service may perform alarm analysis, alarm correlation, and alarm escalation based on alarm priority and alarm frequency. The CRFM service logs the alarm and evaluates the alarm criteria. The alarm is reported by the CRFM service to the CM application. If the fault is cleared by the cloud resource, the fault clearance is reported. The CRFM service determines which alarm needs to be cleared as a result of the fault clearance. The alarm may be cleared by the RMAO or by the CRFM service. Next, the alarm clearance is logged and reported to the CM applications with an active subscription.
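
The fault-to-alarm evaluation described above can be sketched as follows, assuming the example policy from the text in which only type B faults become alarms. The class names, record fields, and log entries are hypothetical and serve only to illustrate the sequence of logging, raising, and clearing.

```python
from dataclasses import dataclass, field
from enum import Enum


class FaultType(Enum):
    A = "self-remediable"      # the cloud resource can recover on its own
    B = "needs-intervention"   # cannot be resolved by the cloud resource


@dataclass
class AlarmLog:
    active: dict = field(default_factory=dict)   # fault_id -> alarm record
    history: list = field(default_factory=list)  # (event, fault_id) entries

    def on_fault(self, fault_id, fault_type, resource):
        """Log the fault; raise an alarm only for type B faults.

        Returns the alarm record to report, or None if no alarm is raised.
        """
        self.history.append(("fault", fault_id))
        if fault_type is not FaultType.B:
            return None
        alarm = {"faultId": fault_id, "resource": resource, "state": "raised"}
        self.active[fault_id] = alarm
        self.history.append(("alarm-raised", fault_id))
        return alarm

    def on_fault_cleared(self, fault_id):
        """Clear the alarm associated with a cleared fault, if there is one."""
        alarm = self.active.pop(fault_id, None)
        if alarm is None:
            return None
        alarm["state"] = "cleared"
        self.history.append(("alarm-cleared", fault_id))
        return alarm
```

Keying active alarms by fault identifier is one simple way to determine which alarm a fault clearance refers to; a real service would likely also support correlation across several faults.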

Alarm conditions experienced by a cloud resource may be managed or cleared. A management entity in the RMAO may request alarm acknowledgement/clearance. The CRFM service may also clear alarms autonomously. Some alarms can be cleared manually, while others can be cleared automatically (alarms related to faults of type A). Alarm acknowledgement is required when human intervention is necessary (alarms associated with faults of type B). Depending on the cloud resource type, an alarm can be manually cleared after it has been raised. The railway operator or the RMAO initiates an alarm acknowledgement/clear request, which is sent to the CRFM service. The CRFM service performs the alarm acknowledgement/clearance on the logged alarm. In the case of a successful alarm acknowledgement/clearance, a success response is returned. If the alarm does not exist or the acknowledgement/clearance fails due to an unexpected condition, a failure response is sent to the requester. The alarm acknowledgement/clearance status is reported and a notification is sent. If the CRFM service initiates an alarm clearance, a notification of a successful alarm clear operation is sent by the CRFM service to the CM application. A failed alarm clear operation, autonomously initiated by the CRFM service, is not reported. Figure 2 shows the communication model of alarm acknowledgement/clearance.
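
The service-side handling of an acknowledgement/clear request might look like the sketch below. The function name, the alarm record fields, and the mapping of outcomes to the HTTP status codes 200, 400, and 404 are assumptions for illustration; the response codes actually used by the CRFM service are not stated here.

```python
from http import HTTPStatus

ALLOWED_ACTIONS = {"acknowledge", "clear"}


def modify_alarm(alarms, alarm_id, action):
    """Apply an acknowledge/clear action to a logged alarm.

    `alarms` maps alarm IDs to dicts with 'acknowledged' and 'cleared' flags.
    Returns (status, body), loosely mirroring an HTTP response.
    """
    if action not in ALLOWED_ACTIONS:
        return HTTPStatus.BAD_REQUEST, {"detail": f"unsupported action: {action}"}
    alarm = alarms.get(alarm_id)
    if alarm is None:
        # The alarm does not exist: report the failure to the requester.
        return HTTPStatus.NOT_FOUND, {"detail": f"unknown alarm: {alarm_id}"}
    if action == "acknowledge":
        alarm["acknowledged"] = True
    else:
        alarm["cleared"] = True
    return HTTPStatus.OK, {"alarmId": alarm_id, **alarm}
```

On success the caller would additionally emit a notification to subscribed CM applications, which is omitted from this sketch.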

An alarm can be suppressed until a specific period of time has elapsed, which prevents alarm storms under a specific set of conditions. The purpose is to allow the railway operator or the RMAO to activate the suppression of alarm notifications under the suppression criteria and to deactivate the alarm suppression. Upon a request for alarm suppression activation, the CRFM service processes the request and activates the alarm suppression. The CRFM service returns a response to the requesting entity, including the configured alarm suppression criteria. The railway operator or the RMAO may query or update the alarm suppression information. Deactivation of the alarm suppression may be initiated by the railway operator or by the RMAO. Upon receiving a request to deactivate the alarm suppression, the CRFM service processes the request, executes the operation, and returns a response.
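
A time-bounded suppression rule of this kind can be sketched as below. The class, its fields, and the choice of matching on alarm type alone are illustrative assumptions; the actual suppression criteria of the CRFM service may be richer.

```python
from dataclasses import dataclass


@dataclass
class Suppression:
    alarm_type: str
    start: float            # seconds, same clock as the alarm timestamps
    duration: float         # suppression period in seconds
    active: bool = True

    def suppresses(self, alarm_type: str, timestamp: float) -> bool:
        """True if a notification for `alarm_type` at `timestamp` is muted."""
        return (
            self.active
            and alarm_type == self.alarm_type
            and self.start <= timestamp < self.start + self.duration
        )

    def deactivate(self) -> None:
        """Deactivate the suppression before its period has elapsed."""
        self.active = False
```

For example, a rule created with `start=1000.0` and `duration=600.0` mutes matching notifications for ten minutes, after which they are delivered again without an explicit deactivation request.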

3.2. Fault Management as a Service

Based on an analysis of typical cloud resource fault management use cases, the basic CRFM service functionality can be identified. The service needs to provide an API that enables a CM application to do the following:

create a new alarm definition with its criteria and actions, and update the criteria and actions for an existing alarm;
retrieve information about an existing alarm;
acknowledge/clear an existing alarm;
activate/deactivate the alarm suppression;
subscribe to alarm events by providing notification criteria, such as the alarm type and severity, and the address to which the notifications have to be sent;
be notified of an alarm occurrence that requires the subscriber's attention.

All resource URIs are defined under the root //{apiRoot}/crfm/v1 in a service directory where the CRFM service is registered.

Table 1 summarizes the RESTful resources and supported HTTP methods of the CRFM service.

The alarm dictionary provides alarm definition and meaning by associating alarms with specific fault conditions. It is updated when a new railway cloud resource type has been onboarded.

Figure 3 shows the flow of querying one or multiple alarms. The response returns a data structure of the alarm type, which describes the alarm ID, associated fault conditions, correlated alarms, etc.

Figure 4 shows the flow of alarm acknowledgement. The request body contains the data structure of the alarm modification type, which describes the required action (acknowledge) on the alarm, and the response confirms the action.

Figure 5 shows the flow of the alarm suppression. The request body contains the data structure of the alarm suppression type, which describes the required action (activate), and the response confirms the action.

Figure 6 shows the flow of the alarm subscription creation, read, and deletion, and the alarm notification. When the managing entity sends a request to create an alarm subscription, it provides the filter for notifications, callback address to receive notifications, and authentication parameters for authorization when sending notifications. The notifications are sent to the callback address provided by the requester.
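
The role of the notification filter and the callback address can be illustrated with the following sketch, which selects the subscriptions that an alarm should be delivered to. All field names (`filter`, `alarmTypes`, `severities`, `callbackUri`, `authentication`, `authType`) and the example values are hypothetical; they are not taken from the CRFM data model.

```python
def matches(filter_, alarm):
    """True if the alarm satisfies every criterion present in the filter.

    A missing or empty criterion matches any value.
    """
    types = filter_.get("alarmTypes")
    severities = filter_.get("severities")
    if types and alarm["alarmType"] not in types:
        return False
    if severities and alarm["severity"] not in severities:
        return False
    return True


def notifications_for(subscriptions, alarm):
    """Return (callbackUri, payload) pairs for subscriptions matching the alarm."""
    return [
        (sub["callbackUri"], {"notificationType": "AlarmNotification", "alarm": alarm})
        for sub in subscriptions
        if matches(sub["filter"], alarm)
    ]


# Hypothetical subscription request body; field names are illustrative only.
subscription_request = {
    "filter": {"alarmTypes": ["EQUIPMENT"], "severities": ["CRITICAL", "MAJOR"]},
    "callbackUri": "https://cm-app.example/alarm-notifications",
    "authentication": {"authType": "OAUTH2_CLIENT_CREDENTIALS"},
}
```

Each returned pair would then be sent as an HTTP request to the callback address, using the authentication parameters supplied at subscription time.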

3.3. Formal Verification of CRFM API

To verify the proposed approach to the design of cloud resource FM API, models representing the alarm status maintained by a CM application and by the CRFM service are developed.

Figure 7 shows the UML (Unified Modeling Language) state diagram of a simplified alarm status model maintained by a CM application.

The model includes subscription to and notification of alarms. After being notified of an alarm event, the CM application may query alarm information, request alarm acknowledgement or clearance, request alarm suppression, or request that a suppressed alarm be resumed.

In the AppNull state, the alarm does not exist. In this state, the cloud operator may define the alarm notification criteria (setNotifCriteria event), and the CM application requests to subscribe to the alarm notifications (AlarmSubscriptionReq action). In the AppNull state, the CRFM service may accept a subscription (AlarmSubscriptionRes(ack) event) or may reject it (AlarmSubscriptionRes(fail) event) and, in either case, the cloud operator is notified (subscribed action or notsubscribed action). In the AppNull state, the CRFM service may notify the CM application about an alarm (AlarmNotification(raised) event), and the CM application notifies the cloud operator (alarmRaised action).

In the AppActive state, the alarm is active. In this state, the cloud operator may query the alarm (queryAlarm event), and the CM application requests information about the alarm (AlarmQueryReq action). The CRFM service provides information about the alarm (AlarmQueryRes event), and the CM application sends the information to the cloud operator (alarmInfo action). The cloud operator may request alarm suppression (suppressAlarm event), and the CM application sends a request to the CRFM service to suppress the alarm (AlarmSuppressReq action). The cloud operator may request alarm acknowledgement (acknowledgeAlarm event), and the CM application sends a request to the CRFM service to acknowledge the alarm (AlarmAckReq action). The cloud operator may request alarm clearance (clearAlarm event), and the CM application sends a request to the CRFM service to clear the alarm (AlarmClearReq action).

In the Acknowledging state, the CM application waits for alarm acknowledgement. In this state, the CRFM service may acknowledge the alarm (AlarmAckRes event), and the CM application notifies the cloud operator (alarmAcknowledged action).

In the Suppressing state, the CM application waits for alarm suppression. In this state, the CRFM service may inform the CM application that the alarm is suppressed (AlarmSuppressRes event), and the CM application notifies the cloud operator (alarmSuppressed action).

In the AppSuppressed state, the alarm is suppressed. In this state, the cloud operator may request to resume the alarm processing (resumeAlarm event), and the CM application requests the CRFM service to resume the alarm (AlarmResumeReq action).

In the Resuming state, the CM application waits for alarm resumption. In this state, the CRFM service may inform the CM application that the alarm is resumed (AlarmResumeRes event), and the CM application notifies the cloud operator (alarmResumed action).

In the Clearing state, the CM application waits for alarm clearance. In this state, the CRFM service may inform the CM application that the alarm is cleared (AlarmClearRes event), and the CM application notifies the cloud operator (alarmCleared action).

In the AppCleared state, the alarm is cleared.

Figure 8 shows the UML state diagram of a simplified alarm status model maintained by the CRFM service.

In the Null state, the alarm does not exist. In this state, the CM application may request to subscribe to the alarm notification (AlarmSubscriptionReq event) or a fault may occur in a cloud resource (faultRaised event).

In the Authorizing state, the CRFM service authorizes the subscription request of the CM application. In this state, if the CM application is not authorized (unauthorized event) the CRFM service rejects the subscription (AlarmSubscriptionRes(fail) action), otherwise the CM application is authorized (authorized event).

In the ReachabilityTesting state, a reachability check is performed. If the reachability has passed successfully (reachable event), the CRFM service accepts the CM application’s subscription (AlarmSubscriptionRes(ack) action). If the reachability has failed (unreachable event), the CRFM service rejects the CM application’s subscription (AlarmSubscriptionRes(fail) action).

In the Analyzing&Logging state, the CRFM service performs alarm analysis, alarm correlation, and logs the alarm. After the alarm has been logged (AlarmLogged event), the CRFM service notifies the CM application that an alarm is raised (AlarmNotification(raised) action).

In the Active state, the alarm exists. In this state, the CM application may query the alarm (AlarmQueryReq event), and the CRFM service sends information about the alarm (AlarmQueryRes action). In the Active state, the CM application may request alarm acknowledgement (AlarmAckReq event), and the CRFM service acknowledges the alarm (AlarmAckRes action). In the Active state, the CM application may request alarm clearance (AlarmClearReq event), and the CRFM service clears the alarm (AlarmClearRes action). In the Active state, the CM application may request alarm suppression (AlarmSuppressReq event), and the CRFM service suppresses the alarm (AlarmSuppressRes action).

In the Suppressed state, the alarm is not propagated. In this state, the CM application may request to resume the alarm (AlarmResumeReq event), and the CRFM service resumes the alarm (AlarmResumeRes action).

In the Cleared state, the alarm is cleared.

The communication between the CM application and the CRFM service is based on HTTP methods applied on the alarm and subscription resources.

Because the developed models form the basis of the application’s logic and the service’s logic, they must provide synchronized-in-time views of the alarm status. For that reason, we formally describe the models as Labelled Transition Systems (LTSs).

In the following formal definitions, the short notation for the names of states and events is given in brackets.

Definition 1.

Let L^app = (S^app, Σ^app, →^app, s₀^app) be a formal description of the model representing the alarm status maintained by the CM application, where:

S^app = {AppNull [s^a₁], AppActive [s^a₂], Acknowledging [s^a₃], Suppressing [s^a₄], AppSuppressed [s^a₅], Resuming [s^a₆], Clearing [s^a₇], AppCleared [s^a₈]} is a set of states;

Σ^app = {setNotifCriteria [a], AlarmSubscriptionRes(ack) [b], AlarmSubscriptionRes(fail) [c], AlarmNotification(raised) [d], queryAlarm [e], AlarmQueryRes [f], acknowledgeAlarm [g], AlarmAckRes [h], suppressAlarm [i], AlarmSuppressRes [j], resumeAlarm [k], AlarmResumeRes [l], clearAlarm [m], AlarmClearRes [n]} is a set of events;

→^app = {(s^a₁ \overset{a}{\to} s^a₁), (s^a₁ \overset{b}{\to} s^a₁), (s^a₁ \overset{c}{\to} s^a₁), (s^a₁ \overset{d}{\to} s^a₂), (s^a₂ \overset{e}{\to} s^a₂), (s^a₂ \overset{f}{\to} s^a₂), (s^a₂ \overset{g}{\to} s^a₃), (s^a₃ \overset{h}{\to} s^a₂), (s^a₂ \overset{i}{\to} s^a₄), (s^a₄ \overset{j}{\to} s^a₅), (s^a₅ \overset{k}{\to} s^a₆), (s^a₆ \overset{l}{\to} s^a₂), (s^a₂ \overset{m}{\to} s^a₇), (s^a₇ \overset{n}{\to} s^a₈)} is a set of transitions;

s₀^app = s^a₁ is the initial state.
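To make Definition 1 concrete, the transition relation can be encoded as a lookup table and replayed against a sequence of events. The following Python sketch is illustrative only; the names `L_APP` and `run` and the plain-string state names `s1` to `s8` are assumptions standing in for s^a₁ to s^a₈.

```python
# Transition relation of L^app from Definition 1, keyed by (state, event).
# States s1..s8 and events a..n use the short notation given in brackets.
L_APP = {
    ("s1", "a"): "s1", ("s1", "b"): "s1", ("s1", "c"): "s1", ("s1", "d"): "s2",
    ("s2", "e"): "s2", ("s2", "f"): "s2", ("s2", "g"): "s3", ("s3", "h"): "s2",
    ("s2", "i"): "s4", ("s4", "j"): "s5", ("s5", "k"): "s6", ("s6", "l"): "s2",
    ("s2", "m"): "s7", ("s7", "n"): "s8",
}


def run(transitions, start, events):
    """Replay a sequence of events from `start` and return the final state.

    Raises ValueError if an event is not enabled in the current state.
    """
    state = start
    for event in events:
        key = (state, event)
        if key not in transitions:
            raise ValueError(f"event {event!r} is not enabled in state {state!r}")
        state = transitions[key]
    return state


# Subscribe (a, b), alarm raised (d), suppress (i, j), resume (k, l), clear (m, n).
print(run(L_APP, "s1", "abdijklmn"))
```

The example trace ends in `s8` (AppCleared), whereas an event such as `g` (acknowledgeAlarm) is rejected in `s1` (AppNull), because no alarm exists yet.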

Definition 2.

Let L^ser = (S^ser, Σ^ser, →^ser, s₀^ser) be a formal description of the model representing the alarm status, supported by the CRFM service, where:

S^ser = {Null [s^s₁], Authorizing [s^s₂], ReachabilityTesting [s^s₃], Analyzing&Logging [s^s₄], Active [s^s₅], Suppressed [s^s₆], Cleared [s^s₇]} is a set of states;

Σ ^ser = {AlarmSubscriptionReq [α], authorized [β], unauthorized [γ], unreachable [δ], reachable [ε], faultRaised [ζ], alarmLogged [η], AlarmQueryReq [θ], AlarmAckReq [ι], AlarmSuppressReq [κ], AlarmResumeReq [λ], AlarmClearReq [μ]} is a set of events;

→^ser = {(s^s₁ \overset{α}{\to} s^s₂), (s^s₂ \overset{β}{\to} s^s₃), (s^s₂ \overset{γ}{\to} s^s₁), (s^s₃ \overset{δ}{\to} s^s₁), (s^s₃ \overset{ε}{\to} s^s₁), (s^s₁ \overset{ζ}{\to} s^s₄), (s^s₄ \overset{η}{\to} s^s₅), (s^s₅ \overset{θ}{\to} s^s₅), (s^s₅ \overset{ι}{\to} s^s₅), (s^s₅ \overset{κ}{\to} s^s₆), (s^s₆ \overset{λ}{\to} s^s₅), (s^s₅ \overset{μ}{\to} s^s₇)} is a set of transitions;

s₀^ser = s^s₁ is the initial state.

The concept of synchronization between the states of two LTSs can be introduced, as in the following definition.

Definition 3.

Having two LTSs M = (P, Σ₁, →₁) and N = (Q, Σ₂, →₂), the relation R ⊆ P × Q is a synchronization if, for (p ↦ q) ∈ R, there exist p ⇒₁ p’ ↦ q ⇒₂ q’ such that (p’ ↦ q’) ∈ R, where

⇒₁ := p₁ \overset{a}{\to} p₂ \overset{b}{\to} … \overset{l}{\to} p_n, for some p₁, p₂, …, p_n ∈ P and a, b, …, l ∈ Σ₁, with p₁ = p and p_n = p’,

and

⇒₂ := q₁ \overset{α}{\to} q₂ \overset{β}{\to} … \overset{λ}{\to} q_m, for some q₁, q₂, …, q_m ∈ Q and α, β, …, λ ∈ Σ₂, with q₁ = q and q_m = q’.

The concept of synchronization is used to prove that the view of a CM application on the alarm status and the respective view of the CRFM service are synchronized in time, i.e., their views on the alarm status are the same.

Proposition 1.

Let R₁ ⊆ S^app × S^ser be a relationship between states of L^app and L^ser where R₁ = {(s^a₁, s^s₁), (s^a₂, s^s₅), (s^a₅, s^s₆), (s^a₈, s^s₇)}. R₁ is the synchronization between the states in L^app and L^ser.

Proof.

We identify the following transition mapping between the states of R₁ in Table 2.

Therefore, R₁ is a synchronization between the states. □

This means that L^app and L^ser have synchronized-in-time views on the alarm status.
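The claim of Proposition 1 can also be checked mechanically. The Python sketch below is not the authors' proof, which relies on the transition mapping in Table 2. It applies a stricter, assumed reading of Definition 3: for every pair in R₁, the set of synchronized states that the application can reach next must correspond exactly to the set the service can reach next. The service states are written `q1` to `q7` to keep them apart from the application states, Greek event labels are spelled out, and the service transitions follow the prose description of Figure 8 (unauthorized leaves Authorizing, unreachable leaves ReachabilityTesting). All function names are hypothetical.

```python
# L^app (Definition 1) and L^ser (Definition 2): (state, event) -> next state.
L_APP = {
    ("s1", "a"): "s1", ("s1", "b"): "s1", ("s1", "c"): "s1", ("s1", "d"): "s2",
    ("s2", "e"): "s2", ("s2", "f"): "s2", ("s2", "g"): "s3", ("s3", "h"): "s2",
    ("s2", "i"): "s4", ("s4", "j"): "s5", ("s5", "k"): "s6", ("s6", "l"): "s2",
    ("s2", "m"): "s7", ("s7", "n"): "s8",
}
L_SER = {
    ("q1", "alpha"): "q2", ("q2", "beta"): "q3", ("q2", "gamma"): "q1",
    ("q3", "delta"): "q1", ("q3", "epsilon"): "q1", ("q1", "zeta"): "q4",
    ("q4", "eta"): "q5", ("q5", "theta"): "q5", ("q5", "iota"): "q5",
    ("q5", "kappa"): "q6", ("q6", "lambda"): "q5", ("q5", "mu"): "q7",
}
R1 = [("s1", "q1"), ("s2", "q5"), ("s5", "q6"), ("s8", "q7")]


def successors(transitions, state):
    """States reachable from `state` in exactly one transition."""
    return [dst for (src, _event), dst in transitions.items() if src == state]


def next_sync(transitions, state, sync_states):
    """Synchronized states reachable from `state` by a non-empty path whose
    intermediate states all lie outside `sync_states`."""
    found, seen = set(), set()
    frontier = successors(transitions, state)
    while frontier:
        current = frontier.pop()
        if current in sync_states:
            found.add(current)
        elif current not in seen:
            seen.add(current)
            frontier.extend(successors(transitions, current))
    return found


def is_synchronization(app, ser, relation):
    """Check that paired states always lead to paired states in both models."""
    pair = dict(relation)
    app_sync, ser_sync = set(pair), set(pair.values())
    return all(
        {pair[p_next] for p_next in next_sync(app, p, app_sync)}
        == next_sync(ser, q, ser_sync)
        for p, q in relation
    )


print(is_synchronization(L_APP, L_SER, R1))
```

Under these assumptions the check succeeds for R₁ and fails if, for example, AppActive is paired with Suppressed instead of Active.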

4. Performance Management of Railway Cloud Resources

4.1. Performance Management Use Cases

Cloud Performance Management (PM) refers to the process of monitoring, measuring, and optimizing the performance of cloud-based applications, infrastructure, and services. It involves collecting and analyzing data from various sources to identify areas of improvement, to troubleshoot issues, and to ensure that cloud resources are utilized efficiently.

The primary goal of cloud PM is to ensure that cloud-based systems meet the required performance standards, while also providing a cost-effective and scalable solution for businesses. This is achieved as follows:

Monitoring performance metrics: Collecting data on key performance indicators (KPIs), such as response time, throughput, latency, and resource utilization. For example, latency is an important KPI, because of the requirements of real-time railway cloud operation derived from the demands for highly responsible, safe, secure, and time-determined railway services. The high throughput is required to provide the availability and continuity of the railway mission critical services, and it is related to resilience, which is the ability of the system to deal with certain types of failures and to remain reliable despite them.

Identifying bottlenecks: Analyzing performance data to identify areas of inefficiency or congestion in the cloud infrastructure.

Optimizing resource allocation: Adjusting resource allocation to ensure that the right resources are available for the workload, thereby reducing waste and costs. This refers also to dynamic resource allocation and load balancing.

Ensuring scalability: Proactively scaling up or down to match changing railway demands and to prevent over-provisioning or under-provisioning of resources.

Providing visibility and control: Offering real-time insights into cloud performance, enabling the railway cloud operator to make informed decisions about resource allocation and optimization.

Application performance metrics include average response time, error rate, and load, among others. Typical infrastructure performance metrics include CPU utilization, memory usage, disk space utilization, network bandwidth utilization, and storage performance metrics. Among the cloud provider performance metrics are availability, reliability, scalability, security metrics, and the percentage of resources being used efficiently and cost-effectively. Other performance metrics include time to repair, mean time between failures, throughput, latency, and packet loss. These performance metrics can be collected using various tools and techniques, including cloud provider APIs. By monitoring these performance metrics, railway operators can gain visibility into their cloud-based systems and make data-driven decisions to optimize resource allocation, improve user experience, and reduce downtime.
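As a small worked example of how two of these metrics combine, steady-state availability is commonly estimated from the mean time between failures (MTBF) and the mean time to repair (MTTR) as MTBF / (MTBF + MTTR). The Python sketch below assumes that MTBF denotes the mean operating time between failures; the function names and sample values are illustrative and not taken from the railway cloud.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, assuming MTBF is the mean up-time between
    failures and MTTR is the mean time needed to restore service."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def error_rate(failed_requests: int, total_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed_requests / total_requests if total_requests else 0.0


# Illustrative values only: 999 h between failures, 1 h to repair.
print(f"{availability(999.0, 1.0):.3f}")   # 0.999
print(f"{error_rate(5, 1000):.3f}")        # 0.005
```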

A PM job is a means to measure the cloud resource KPIs, and it is required for coordinating the PM activities. The cloud resource management services collect data by measuring cloud resource KPIs. PM jobs may be created, queried, updated, deleted, suspended, or resumed. This functionality may be designed as a Cloud Resource Performance Management (CRPM) service.

Figure 9 shows the communication model for creating a PM job, collecting and processing performance data for the PM job, and updating the PM job.

The PM job is created in the cloud resource management services. The creation begins with the start of default PM jobs during the railway cloud operation. If the CM application has an active subscription, it is notified of the status of the default PM jobs. The RMAO may decide to activate additional PM jobs, and the CM application sends a request. If the request contains subscription related information (e.g., thresholds for notifications), then a subscription is created for the PM job duration, and the CRPM service notifies the CM application of the new PM job status. The railway cloud resource operates, and the performance data are collected. The railway cloud resources send to the CRPM service the collected performance data that is processed for the PM job, and it is available for reporting.

The update of a PM job may be triggered by the railway cloud operator or autonomously by the RMAO. The PM job update request, which contains measurement selection criteria, is sent to the PM service. The PM service checks the existence of the PM job, checks the permissions of the requestor, and processes the request parameters. A response for the PM job update request is returned.

Subscription to the performance data enables an entity (the railway cloud operator or the RMAO) to receive data that has been collected in the railway cloud. The measurement reports may be used for performance analysis within the RMAO. The reporting of performance data may be in a form of event-based reporting, streaming-based reporting, or file reporting. In event-based reporting, a notification is sent to the subscriber when the thresholds are reached, and may be used for non-real time notifications. For real-time monitoring of KPIs, streaming-based reporting is used to send performance measurement data in a continuous session. File reporting is off-line reporting that uses a file-based notification reporting mechanism. The subscription to notifications must determine the frequency of reporting, the method of delivery, and the encoding scheme used for the payload, a list of PM jobs, and measurements of interest.
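Event-based reporting can be sketched as a threshold check over a series of KPI samples. In the Python sketch below, a notification is generated only when a sample reaches the threshold after having been below it. This edge-triggered behaviour, the function name `threshold_events`, and the sample values are assumptions for illustration; the text above only states that a notification is sent when thresholds are reached.

```python
def threshold_events(samples, threshold):
    """Yield (index, value) each time a KPI sample reaches or exceeds the
    threshold after having been below it (edge-triggered, an assumption)."""
    above = False
    for index, value in enumerate(samples):
        now_above = value >= threshold
        if now_above and not above:
            yield index, value
        above = now_above


# Illustrative CPU utilization samples in percent, with an 80% threshold.
cpu_utilization = [40, 55, 82, 90, 70, 85]
print(list(threshold_events(cpu_utilization, 80)))
```

With these samples, two notifications would be generated, at the third and the sixth sample. The fourth sample does not trigger a new notification because the KPI is already above the threshold.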

Figure 10 shows the communication model of performance measurements notification reporting. A connection between the PM subscription endpoint and the Performance Subscription Manager (PSM) in the CRPM service is established for each event to be reported. The performance data is reported. If the performance data is received successfully, the subscriber sends an acknowledgement, otherwise the subscriber initiates data rejection.

Figure 11 shows the communication model of performance measurements file-based reporting. A connection between the PM subscription endpoint and the PSM is established for each file to be reported. It is possible for the file to be uploaded by the PSM, where it pushes data in a file-based format to the CM application. As an alternative, the CM application pulls the performance measurement data after being notified that the file is ready. If the performance data are received successfully, the subscriber sends an acknowledgement, otherwise the subscriber initiates data rejection.

A PM job may be in an active state when it is currently running, in a suspended state when a request to stop measurements is received, or in a deprecated state when it is deleted after being suspended. A PM job is suspended in order to improve cloud performance because the PM jobs come at the cost of cloud overhead.
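The PM job lifecycle described above can be captured as a small state machine. The Python sketch below is a minimal illustration; the class names, operation names (`suspend`, `resume`, `delete`), and the job identifier are assumptions, and the sleeping period is left out for brevity.

```python
from enum import Enum


class PMJobState(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"
    DEPRECATED = "deprecated"


# Allowed lifecycle transitions: a job can be deleted only after being suspended.
ALLOWED = {
    (PMJobState.ACTIVE, "suspend"): PMJobState.SUSPENDED,
    (PMJobState.SUSPENDED, "resume"): PMJobState.ACTIVE,
    (PMJobState.SUSPENDED, "delete"): PMJobState.DEPRECATED,
}


class PMJob:
    """Illustrative PM job that starts in the active state."""

    def __init__(self, job_id: str):
        self.job_id = job_id
        self.state = PMJobState.ACTIVE

    def apply(self, operation: str) -> PMJobState:
        """Apply a lifecycle operation or raise ValueError if it is not allowed."""
        key = (self.state, operation)
        if key not in ALLOWED:
            raise ValueError(
                f"{operation!r} is not allowed in state {self.state.value!r}")
        self.state = ALLOWED[key]
        return self.state


job = PMJob("pm-job-1")
job.apply("suspend")
job.apply("delete")
print(job.state.value)
```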

Figure 12 shows the communication model of PM job suspension. The railway operator may request a PM job suspension, or the RMAO may autonomously initiate it. The PM job suspend request is sent to the CRPM service, which checks whether the PM job exists, evaluates the period for which the PM job has to be suspended (sleeping period), and returns a response in case of success. The requester is informed about the suspended status of the PM job. Exception use cases include a non-existent PM job or unexpected conditions.

Following the same pattern of communication, the railway cloud operator, or the RMAO autonomously, may resume a suspended PM job.

The railway cloud operator can query, update, or delete a PM job. The update of a PM job occurs when railway cloud resources are updated, added, or deleted, or upon a request by the railway cloud operator or the RMAO.

4.2. Performance Management as a Service

Based on an analysis of typical cloud resource PM use cases, the basic CRPM service functionality is identified. The service needs to provide interfaces that enable a CM application as follows:

create a PM job;
subscribe to reporting of the PM data;
be notified of the PM data;
query, delete, suspend, and resume an existing PM job.

Table 3 provides a brief overview of the resources and methods of the CRPM service API. All resource URIs follow the root {apiRoot}/crpm/v1, where the “apiRoot” and “crpm” can be discovered using the service registry.

Figure 13 shows the flow of PM job creation. A CM application creates a PM job by sending a POST request to the resource representing the PM jobs, including a data structure with information about the job in the message body. The PM job information contains the measured object type, identifiers of the measured object instance, criteria of the performance data collection, and a callback interface to receive notifications. The CRPM service creates a PM job and returns a “201 Created” response to the CM application.

A CM application queries all PM jobs by sending a GET request to the resource representing PM jobs. The PM service returns a “200 OK” response including a list of PM jobs in the response body.

The resource representing an individual PM job is manipulated by a CM application by applying on the resource a GET method for reading a specific PM job, a PUT method to update a PM job, a DELETE method to deprecate a PM job, or a PATCH method to suspend or resume a PM job.
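The mapping between PM job operations and HTTP methods can be summarized in a short Python sketch that builds, but does not send, the corresponding requests. The host name, the resource name `pm_jobs`, and the PATCH body with a `state` field are assumptions for illustration; the actual resource names are those listed in Table 3.

```python
import json

API_ROOT = "https://rail-cloud.example/api"  # placeholder for {apiRoot}
PM_JOBS = f"{API_ROOT}/crpm/v1/pm_jobs"      # "pm_jobs" is an assumed resource name


def pm_job_request(operation, job_id=None, payload=None):
    """Return (method, uri, body) for a PM job operation without sending it."""
    collection_ops = {"create": "POST", "list": "GET"}
    item_ops = {"read": "GET", "update": "PUT", "delete": "DELETE",
                "suspend": "PATCH", "resume": "PATCH"}
    if operation in collection_ops:
        method, uri = collection_ops[operation], PM_JOBS
    elif operation in item_ops:
        if job_id is None:
            raise ValueError(f"{operation!r} needs a job_id")
        method, uri = item_ops[operation], f"{PM_JOBS}/{job_id}"
    else:
        raise ValueError(f"unknown operation {operation!r}")
    if operation in ("suspend", "resume"):
        # Assumed PATCH body: the desired state of the PM job.
        payload = {"state": "suspended" if operation == "suspend" else "active"}
    body = json.dumps(payload) if payload is not None else None
    return method, uri, body


print(pm_job_request("suspend", job_id="42"))
```

Suspending and resuming share the PATCH method because both modify only part of the PM job representation, whereas PUT replaces it as a whole.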

Figure 14 shows the flow of suspending a PM job.

4.3. Formal Verification of the CRPM API

In order to verify the proposed service-based approach to the design of railway cloud resource performance management, the models representing a PM job status supported by a CM application and the CRPM service are developed. Transitions in models are triggered by applying HTTP methods on the resources representing a cloud resource PM.

Figure 15 shows the UML state diagram of a simplified PM job status model supported by a CM application.

In the App.Null state, the PM job is not created yet. In this state, default PM jobs may be started (DefaultPMjobsStatus event), or the cloud operator may request additional PM jobs (AddPMjob event), and the CM application requests the CRPM service to create a PM job (PMjobCreateReq action).

In the App.Active state, the CM application may receive the response of the PM job creation request (PMjobCreateRes event) and notify the cloud operator of the result (PMjobAdded action). The PM job is active. In the case of a successful PM job creation, the cloud operator may decide to suspend the PM job (SuspendPMjob event), and the CM application requests from the CRPM service to suspend the PM job (PMjobSuspendReq action).

In the Suspending state, the CM application waits for the PM job suspension. In this state, the CM application may receive the response to its request for PM job suspension (PMjobSuspendRes event) and, if so, it informs the cloud operator of the result of the operation (PMjobSuspended action).

In the App.Suspended state, the PM job is suspended. In this state, the cloud operator may request to resume the PM job (ResumePMjob event), and the CM application requests from the CRPM service to resume the suspended PM job (PMjobResumeReq action). In the App.Suspended state, the cloud operator may terminate the PM job (terminatePMjob event), and the CM application requests from the CRPM service to terminate the suspended PM job (PMjobTerminateReq action).

In the Resuming state, the CM application may receive the response of its request to resume the suspended PM job (PMjobResumeRes event), and it notifies the cloud operator of the result of operation (PMjobResumed action).

In the Terminating state, the CM application may receive the response of its request to terminate the PM job (PMjobTerminateRes event), and it notifies the cloud operator of the result of operation (PMjobTerminated action).

In the App.Terminated state, the PM job is terminated.

Figure 16 shows the UML state diagram of a simplified PM job status model supported by the CRPM service.

In the Null state, the PM job is not created. In this state, the CRPM service may start default PM jobs (DefaultPMjobsStarted event) and report to the CM application about its status (DefaultPMjobStatus action), or the CRPM service may receive from the CM application a request to create an additional PM job (PMjobCreateReq event) and may inform the CM application about the result of the operation (PMjobCreateRes action).

In the Active state, the PM job is active and measurements are performed. In this state, the CRPM service may receive a request from the CM application to suspend the PM job (PMjobSuspendReq event), and then it may inform the CM application about the result of the operation (PMjobSuspendRes action).

In the Suspended state, the PM job is suspended. In this state, the CRPM service may receive a request from the CM application to resume the PM job (PMjobResumeReq event), and may inform the CM application about the result of the operation (PMjobResumeRes action). In the same state, the CRPM service may receive a request from the CM application to terminate the PM job (PMjobTerminateReq event), and may inform the CM application about the result of the operation (PMjobTerminateRes action).

In the Terminated state, the PM job is terminated.

Both models are formally described to prove that the views of the CM application and the CRPM service on the PM job status are synchronized in time.

Definition 4.

Let M^app = (S^app, Σ ^app, →^app, s₀^app) be a formal description of the model representing the PM job status, supported by the CM application, where:

S^app = {App.Null [s^a₁], App.Active [s^a₂], Suspending [s^a₃], App.Suspended [s^a₄], Resuming [s^a₅], Terminating [s^a₆], App.Terminated [s^a₇]};

Σ^app = {DefaultPMJobsStatus [a], AddPMJob [b], PMJobCreateRes [c], SuspendPMJob [d], PMJobSuspendRes [e], ResumePMJob [f], PMJobResumeRes [g], TerminatePMJob [h], PMJobTerminateRes [i]};

→^app = {(s^a₁ \overset{a}{\to} s^a₁), (s^a₁ \overset{b}{\to} s^a₂), (s^a₂ \overset{c}{\to} s^a₂), (s^a₂ \overset{d}{\to} s^a₃), (s^a₃ \overset{e}{\to} s^a₄), (s^a₄ \overset{f}{\to} s^a₅), (s^a₅ \overset{g}{\to} s^a₂), (s^a₄ \overset{h}{\to} s^a₆), (s^a₆ \overset{i}{\to} s^a₇)};

s₀^app = s^a₁.

Definition 5.

Let M^ser= (S^ser, Σ^ser, →^ser, s₀^ser) be a formal description of the model representing the PM job status, supported by the CRPM service, where:

S^ser = {Null [s^s₁], Active [s^s₂], Suspended [s^s₃], Terminated [s^s₄]};

Σ^ser = {DefaultPMJobsStarted [α], PMJobCreateReq [β], PMJobSuspendReq [γ], PMJobResumeReq [δ], PMJobTerminateReq [ε]} is a set of actions;

→^ser = {(s^s₁ \overset{α}{\to} s^s₁), (s^s₁ \overset{β}{\to} s^s₂), (s^s₂ \overset{γ}{\to} s^s₃), (s^s₃ \overset{δ}{\to} s^s₂), (s^s₃ \overset{ε}{\to} s^s₄)};

s₀^ser = s^s₁.

Proposition 2.

Let R₂ ⊆ S^app × S^ser be a relationship between states of M^app and M^ser where R₂ = {(s^a₁, s^s₁), (s^a₂, s^s₂), (s^a₄, s^s₃), (s^a₇, s^s₄)}. R₂ is the synchronization between the states in M^app and M^ser.

Proof.

We identify the following transition mapping between the states of R₂ in Table 4.

Therefore, R₂ is a synchronization between the states. □

This means that M^app and M^ser have synchronized-in-time views on the PM job status.
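For illustration, one possible transition sequence mapping for the pairs in R₂ can be read directly from Definitions 4 and 5. The following LaTeX block is a sketch derived from those definitions; its layout is an assumption and may differ from the presentation in Table 4.

```latex
\[
\begin{array}{llll}
\text{From} & \text{In } M^{app} & \text{In } M^{ser} & \text{To} \\
\hline
(s^a_1, s^s_1) & s^a_1 \overset{a}{\to} s^a_1 & s^s_1 \overset{\alpha}{\to} s^s_1 & (s^a_1, s^s_1) \\
(s^a_1, s^s_1) & s^a_1 \overset{b}{\to} s^a_2 \overset{c}{\to} s^a_2 & s^s_1 \overset{\beta}{\to} s^s_2 & (s^a_2, s^s_2) \\
(s^a_2, s^s_2) & s^a_2 \overset{d}{\to} s^a_3 \overset{e}{\to} s^a_4 & s^s_2 \overset{\gamma}{\to} s^s_3 & (s^a_4, s^s_3) \\
(s^a_4, s^s_3) & s^a_4 \overset{f}{\to} s^a_5 \overset{g}{\to} s^a_2 & s^s_3 \overset{\delta}{\to} s^s_2 & (s^a_2, s^s_2) \\
(s^a_4, s^s_3) & s^a_4 \overset{h}{\to} s^a_6 \overset{i}{\to} s^a_7 & s^s_3 \overset{\varepsilon}{\to} s^s_4 & (s^a_7, s^s_4)
\end{array}
\]
```

Each row starts and ends in a pair of R₂: a request and response handled by the CM application as two transitions correspond to a single transition in the CRPM service.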

Figure 17 shows the UML state diagram of a simplified PM subscription and notification model, maintained by a CM application.

In the App.Unsubscribed state, the CM application is not subscribed to PM reports. In this state, the cloud operator may define criteria for PM reporting (NotificationCriteria event), and the CM application requests to subscribe to PM reports (PMSubscriptionReq action).

In the Subscribing state, the CM application waits for a response to its PM subscription request. In this state, a reachability check (connectivity and authorization check) may be performed (ReachabilityCheckReq event). When the reachability test has passed successfully (reachable event), the CM application responds with ReachabilityCheckRes(suc); otherwise (unreachable event), it responds with ReachabilityCheckRes(fail).

In the Unreachable state, the reachability check has failed. In this state, the CM application’s subscription request may be rejected (PMSubscriptionRes(rej) event), and the CM application notifies the cloud operator about the subscription status (SubscriptionStatus(fail) action).

In the Reachable state, the reachability check has passed. In this state, the CM application’s subscription requests may be accepted (PMSubscriptionRes(ack) event), and the CM application notifies the cloud operator about the subscription status (SubscriptionStatus(suc) action).

In the App.Subscribed state, the CM application has an active subscription to PM reports. In this state, the CM application may open the connection to receive PM reports (OpenConnectionReq event).

In the ConnectionSetup state, a connection to the subscription endpoint is attempted. If the connection is opened successfully (setupSuccess event), a success response is sent to the CRPM service (OpenConnectionRes(suc) action); otherwise (setupFailure event), the CM application sends OpenConnectionRes(fail).

In the OpenConnection state, the connection was established successfully. In this state, the CRPM service sends performance data to the CM application (SendPerformanceData event).

In the App.DataReceiving state, the performance data is submitted. If the performance data is received successfully (success event), the CM application responds to the CRPM service with PerformanceDataReceived (action), otherwise (failure event), the response is ReceiverRejectedData (action).

Figure 18 shows the UML state diagram of a simplified PM subscription and notification model, maintained by the CRPM service.

In the Unsubscribed state, the CM application is not subscribed to PM reports. In this state, a subscription request may be received (PMSubscriptionReq event), and the CRPM service initiates a reachability test (ReachabilityCheckReq action).

In the ReachabilityChecking state, the connectivity and authorization are checked. In the case of a reachability failure (ReachabilityCheckRes(fail) event), the CRPM service rejects the subscription request (PMSubscriptionRes(rej) action); otherwise (ReachabilityCheckRes(suc) event), the CRPM service acknowledges the subscription (PMSubscriptionRes(ack) action).

In the Subscribed state, the CM application is subscribed to receive PM reports. In this state, if the performance data needs to be sent (AvailableDataForReporting event), the CRPM service requests to open a connection (OpenConnectionReq action).

In the OpeningConnection state, a connection is attempted. In the case of a successful connection establishment (OpenConnectionRes(suc) event), the CRPM service sends the performance data (SendPerformanceData action), otherwise (OpenConnectionRes(fail) event), performance data cannot be sent.

In the DataSending state, the performance data is sent. In this state, the CRPM service may receive a notification that the data is received successfully (PerformanceDataReceived event) or that the receiver has rejected the data (ReceiverRejectedData event), e.g., due to loss of user privilege or improper data format.

Both models are formally described and it is proved that they are synchronized in time.

Definition 6.

Let N^app = (S^app, Σ^app, →^app, s₀^app) be a formal description of the subscription and PM reporting model maintained by a CM application, where:

S^app = {App.Unsubscribed [s^a₁], Subscribing [s^a₂], Unreachable [s^a₃], Reachable [s^a₄], App.Subscribed [s^a₅], ConnectionSetup [s^a₆], OpenConnection [s^a₇], App.DataReceiving [s^a₈]};

Σ^app = {NotificationCriteria [a], ReachabilityCheckReq [b], unreachable [c], PMSubscriptionRes(rej) [d], reachable [e], PMSubscriptionRes(ack) [f], OpenConnectionReq [g], setupFailure [h], setupSuccess [i], SendPerformanceData[j], success [k], failure [l]};

→^app = {(s^a₁ \overset{a}{\to} s^a₂), (s^a₂ \overset{b}{\to} s^a₂), (s^a₂ \overset{c}{\to} s^a₃), (s^a₃ \overset{d}{\to} s^a₁), (s^a₂ \overset{e}{\to} s^a₄), (s^a₄ \overset{f}{\to} s^a₅), (s^a₅ \overset{g}{\to} s^a₆), (s^a₆ \overset{h}{\to} s^a₅), (s^a₆ \overset{i}{\to} s^a₇), (s^a₇ \overset{j}{\to} s^a₈), (s^a₈ \overset{k}{\to} s^a₅), (s^a₈ \overset{l}{\to} s^a₅)};

s₀^app = s^a₁.

Definition 7.

Let N^ser = (S^ser, Σ^ser, →^ser, s₀^ser) be a formal description of the subscription and PM reporting model maintained by the CRPM service, where:

S^ser = {Unsubscribed[s^s₁], ReachabilityChecking [s^s₂], Subscribed [s^s₃], OpeningConnection [s^s₄], DataSending [s^s₅]};

Σ^ser = {PMSubscriptionReq [α], ReachabilityCheckRes(suc) [β], ReachabilityCheckRes(fail) [γ], AvailableDataForReporting [δ], OpenConnectionRes(suc) [ε], PerformanceDataReceived [ζ], ReceiverRejectedData [η], OpenConnectionRes(fail) [θ]};

→^ser = {(s^s₁ \overset{α}{\to} s^s₂), (s^s₂ \overset{β}{\to} s^s₃), (s^s₂ \overset{γ}{\to} s^s₁), (s^s₃ \overset{δ}{\to} s^s₄), (s^s₄ \overset{ε}{\to} s^s₅), (s^s₅ \overset{ζ}{\to} s^s₃), (s^s₅ \overset{η}{\to} s^s₃), (s^s₄ \overset{θ}{\to} s^s₃)};

s₀^ser = s^s₁.

Proposition 3.

Let R₃ ⊆ S^app × S^ser be a relationship between the states of N^app and N^ser, where R₃ = {(s^a₁, s^s₁), (s^a₅, s^s₃), (s^a₈, s^s₅)}. R₃ is synchronization between the states in N^app and N^ser.

Proof.

We identify the following transition sequence mapping between the states of R₃ in Table 5.

Therefore, R₃ is a synchronization between the states. □

This means that N^app and N^ser have synchronized-in-time views on the process of subscription to and notification of PM data.
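The synchronized views can be observed by replaying one interleaved message exchange through both models. The Python sketch below encodes Definitions 6 and 7 and records the pairs of R₃ that the two models pass through. The interleaving of application and service steps is an assumption derived from the state descriptions above (each action of one side becomes an event of the other); service states are written `q1` to `q5`, Greek labels are spelled out, and the function name is hypothetical.

```python
# N^app (Definition 6) and N^ser (Definition 7): (state, event) -> next state.
N_APP = {
    ("s1", "a"): "s2", ("s2", "b"): "s2", ("s2", "c"): "s3", ("s3", "d"): "s1",
    ("s2", "e"): "s4", ("s4", "f"): "s5", ("s5", "g"): "s6", ("s6", "h"): "s5",
    ("s6", "i"): "s7", ("s7", "j"): "s8", ("s8", "k"): "s5", ("s8", "l"): "s5",
}
N_SER = {
    ("q1", "alpha"): "q2", ("q2", "beta"): "q3", ("q2", "gamma"): "q1",
    ("q3", "delta"): "q4", ("q4", "epsilon"): "q5", ("q5", "zeta"): "q3",
    ("q5", "eta"): "q3", ("q4", "theta"): "q3",
}
R3 = {("s1", "q1"), ("s5", "q3"), ("s8", "q5")}


def synchronized_points(trace, app_state="s1", ser_state="q1"):
    """Replay an interleaved trace of ("app" | "ser", event) steps and return,
    in order, the pairs of R3 that the two models pass through."""
    visited = [(app_state, ser_state)]
    for side, event in trace:
        if side == "app":
            app_state = N_APP[(app_state, event)]
        else:
            ser_state = N_SER[(ser_state, event)]
        pair = (app_state, ser_state)
        if pair in R3 and pair != visited[-1]:
            visited.append(pair)
    return visited


# Successful subscription followed by one successfully delivered report.
trace = [
    ("app", "a"), ("ser", "alpha"),               # subscription request
    ("app", "b"), ("app", "e"), ("ser", "beta"),  # reachability check passes
    ("app", "f"),                                 # subscription acknowledged
    ("ser", "delta"), ("app", "g"), ("app", "i"), # connection opened
    ("ser", "epsilon"), ("app", "j"),             # performance data sent
    ("app", "k"), ("ser", "zeta"),                # data received successfully
]
print(synchronized_points(trace))
```

Under the assumed interleaving, the exchange passes from (Unsubscribed, Unsubscribed) to (Subscribed, Subscribed), then to (DataReceiving, DataSending), and back to (Subscribed, Subscribed).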

5. Discussion and Conclusions

The integration of cloud computing in railways enables scalability, cost efficiency, and data integration. Cloud computing provides the ability to scale resources up or down, as needed, which is particularly useful for handling varying workloads during peak and off-peak hours. Railways can reduce infrastructure costs by leveraging cloud services instead of maintaining physical servers. The railway cloud facilitates the integration of data from various sources, such as train operations, maintenance logs, and passenger systems, enabling optimal decision-making. However, successful integration depends on overcoming challenges like data security, complexity, and cost-effectiveness. To provide the required reliability and safety of the cloudified mission-critical railway applications, the cloud management system must operate in real time. Cloud resource fault and performance management are critical aspects of managing cloud computing infrastructure and applications, enabling proactive detection, analysis, and resolution of faults to ensure high availability, reliability, and performance. One of the key railway cloud management challenges is the lack of open APIs and standardization, which is related to the so-called vendor lock-in problem.

This paper proposes an approach to the design of open APIs for railway cloud resource fault and performance management. The API design aims to avoid the single-provider problem and to provide a higher degree of flexibility, for which open interfaces appear to be the only solution. Open APIs allow for a variety of providers and solutions for different tasks and applications during the whole lifecycle of the system. However, open interfaces do not necessarily imply publicly open access.

The proposed approach to railway cloud resource management inherits the advantages that make APIs essential for modern software development, such as simplified integration, scalability, and customization. Because the proposed RESTful APIs use the HTTP protocol for data transmission, they are programming-language-agnostic. They enable the separation of cloud management applications and services, as the API definition of the cloud management services defines the means of interaction between the clients and the server. The API specification can evolve over time as its functionality is improved and extended.

The proposed approach shares the generic drawbacks and limitations that are typical of APIs:

Security risks: Exposing parts of a railway cloud system can lead to vulnerabilities if not secured properly.
Complexity: APIs can be complex to design and maintain, especially for large systems, such as a railway cloud.
Rate limiting: Many APIs have rate limits, restricting how often they can be called.
Third-party dependency: Relying on railway cloud management APIs can be risky if the cloud provider changes or discontinues a given service.
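Of the limitations listed above, rate limiting is commonly mitigated on the client side. The following is a minimal sketch, assuming the service signals throttling with the standard HTTP status 429 (Too Many Requests) and, optionally, a Retry-After header given in seconds; the API definitions in this paper do not prescribe this behaviour. The function name `call_with_backoff` and the `send` callable are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str]]  # (status code, response headers)


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() and retry while the service answers 429 Too Many Requests."""
    response = send()
    for attempt in range(1, max_attempts):
        status, headers = response
        if status != 429:
            break
        retry_after = headers.get("Retry-After", "")
        # Prefer the server hint when it is a number of seconds
        # (an HTTP-date value is not handled in this sketch);
        # otherwise back off exponentially: 0.5 s, 1 s, 2 s, ...
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** (attempt - 1)
        sleep(delay)
        response = send()
    return response
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A mission-critical client would additionally bound the total waiting time.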

The deployment challenges of API integration, especially within legacy railway systems that are not digitalized, require further research.

Further elaboration of the proposed approach will be related to the development of logical data models. Data models enable unified access to data for different applications. Logical data models offer more detail about the concepts and relationships in the domain under consideration. They indicate data attributes, such as data types and their corresponding lengths, and show the relationships among entities.
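To illustrate what such a logical data model might capture, the following is a minimal sketch only. The entities `Alarm` and `AlarmSubscription`, all attribute names, and all length limits are hypothetical and are not part of the proposed API; they merely show attributes with data types and lengths, and a one-to-many relationship between entities.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime
from typing import List


@dataclass
class Alarm:
    # Hypothetical attributes; the maximum lengths are illustrative values.
    alarm_id: str = field(metadata={"max_length": 36})
    severity: str = field(metadata={"max_length": 16})
    raised_at: datetime = field(default_factory=datetime.now)


@dataclass
class AlarmSubscription:
    subscription_id: str = field(metadata={"max_length": 36})
    callback_uri: str = field(metadata={"max_length": 2048})
    # One subscription relates to many alarms.
    alarms: List[Alarm] = field(default_factory=list)


def violations(entity) -> List[str]:
    """Return the names of string attributes that exceed their declared length."""
    bad = []
    for f in fields(entity):
        limit = f.metadata.get("max_length")
        value = getattr(entity, f.name)
        if limit is not None and isinstance(value, str) and len(value) > limit:
            bad.append(f.name)
    return bad
```

For example, `violations(Alarm("x" * 40, "critical"))` reports `alarm_id`, because the identifier is longer than the declared 36 characters.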

By implementing a comprehensive cloud fault and performance management strategy, railway operators can ensure that their cloud-based systems perform optimally, provide a better user experience, and deliver reliable, safe, and secure transport services.

Author Contributions

I.A. contributed to the methodology. D.D. contributed to the conceptualization. E.P. contributed to the formal analysis, verification, and writing. V.T. contributed to the writing, review, and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

This research is part of project KP-06-H57/12, granted by the Bulgarian National Science Fund.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI	Artificial Intelligence
API	Application Programming Interface
CI/CD	Continuous Integration/Continuous Deployment
CFM	Cloudified Function Management
CM	Cloud Management
CPU	Central Processing Unit
CRFM	Cloud Resource Fault Management
CRM	Cloud Resource Management
CRPM	Cloud Resource Performance Management
FM	Fault Management
HTTP	Hypertext Transfer Protocol
ID	Identifier
KPI	Key Performance Indicator
LTS	Labeled Transition System
PM	Performance Management
QoS	Quality of Service
RCP	Railway Cloud Platform
REST	Representational State Transfer
RMAO	Railway Management Automation and Orchestration
UML	Unified Modeling Language
URI	Uniform Resource Identifier

Figure 1. Communication model of subscribing to alarm notifications.

Figure 2. Communication model of alarm acknowledgement/clear.

Figure 3. Flow of information retrieval of one or multiple alarms.

Figure 4. Flow of an alarm acknowledgement.

Figure 5. Flow of alarm suppression.

Figure 6. Flow of creating the subscription, retrieving the subscription information, notifying of the alarm occurrence, and deleting the subscription.

Figure 7. A simplified model representing the alarm status supported by a CM application.

Figure 8. A simplified model representing the alarm status maintained by the CRFM service.

Figure 9. Communication model of PM job creation and update.

Figure 10. Communication model of PM event reporting.

Figure 11. Communication model of performance measurements file-based reporting.

Figure 12. Communication model of PM job suspension.

Figure 13. Flow of PM job creation.

Figure 14. Flow of suspending a PM job.

Figure 15. A simplified model of the PM job status maintained by a CM application.

Figure 16. A simplified model of the PM job status maintained by the CRPM service.

Figure 17. A simplified PM subscription and notification model, maintained by a CM application.

Figure 18. A simplified PM subscription and notification model, maintained by the CRPM service.

Table 1. Summary of resources and supported HTTP methods of Cloud Resource Fault Management API.

Resource Name	Resource URI	HTTP Method	Description
All alarms	/Alarms	GET	Retrieves the list of all alarms (active or suppressed)
Individual alarm	/Alarms/{AlarmID}	GET	Retrieves information about an individual alarm
		PUT	Used to acknowledge an individual alarm
		DELETE	Used to clear an individual alarm
Alarm suppression	/Alarms/{AlarmID}/AlarmSuppression	GET	Retrieves the information about the alarm suppression criteria and status
		PUT	Used to activate or deactivate the alarm suppression, and to query or update alarm suppression criteria
All alarm subscriptions	/AlarmSubscriptions	GET	Retrieves the list of all alarm subscriptions
		POST	Creates a new alarm subscription
Individual alarm subscription	/AlarmSubscriptions/{AlarmSubscriptionID}	GET	Retrieves information about an individual alarm subscription
		DELETE	Terminates an individual subscription
All alarm logs	/AlarmLogs	GET	Retrieves the list of all alarm logs
Individual alarm log	/AlarmLogs/{AlarmLogID}	GET	Retrieves information about individual alarm log
All fault logs	/FaultLogs	GET	Retrieves the list of all fault logs
Individual fault log	/FaultLogs/{FaultLogID}	GET	Retrieves information about the individual fault log
All debug logs	/DebugLogs	GET	Retrieves the list of all debug logs
Individual debug log	/DebugLogs/{DebugLogID}	GET	Retrieves information about the individual debug log
Alarm dictionary	/AlarmDictionary	GET	Retrieves the definition of an alarm
		PUT	Updates an alarm definition
		DELETE	Deletes an alarm definition
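The resource structure in Table 1 can be expressed as a small routing table. The following is a minimal sketch, assuming that a client or gateway wants to reject unsupported method and resource combinations before sending a request; the URI templates and methods are transcribed from Table 1, while the names `CRFM_ROUTES` and `resolve` are hypothetical.

```python
from typing import Dict, FrozenSet

# URI templates and the HTTP methods they support, transcribed from Table 1.
CRFM_ROUTES: Dict[str, FrozenSet[str]] = {
    "/Alarms": frozenset({"GET"}),
    "/Alarms/{AlarmID}": frozenset({"GET", "PUT", "DELETE"}),
    "/Alarms/{AlarmID}/AlarmSuppression": frozenset({"GET", "PUT"}),
    "/AlarmSubscriptions": frozenset({"GET", "POST"}),
    "/AlarmSubscriptions/{AlarmSubscriptionID}": frozenset({"GET", "DELETE"}),
    "/AlarmLogs": frozenset({"GET"}),
    "/AlarmLogs/{AlarmLogID}": frozenset({"GET"}),
    "/FaultLogs": frozenset({"GET"}),
    "/FaultLogs/{FaultLogID}": frozenset({"GET"}),
    "/DebugLogs": frozenset({"GET"}),
    "/DebugLogs/{DebugLogID}": frozenset({"GET"}),
    "/AlarmDictionary": frozenset({"GET", "PUT", "DELETE"}),
}


def resolve(method: str, template: str, **ids: str) -> str:
    """Return the concrete path, or raise if Table 1 does not allow the method."""
    if method not in CRFM_ROUTES[template]:
        raise ValueError(f"{method} is not supported on {template}")
    return template.format(**ids)
```

For example, acknowledging an alarm resolves as `resolve("PUT", "/Alarms/{AlarmID}", AlarmID="42")`, whereas `resolve("POST", "/Alarms")` raises an error because Table 1 defines only GET on that resource.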

Table 2. Mapping between the transition sequences in models representing the alarm status.

Transition Abstraction	States Mapping	Transition Sequences in L^app	Transition Sequences in L^ser
Successful creation of subscription to alarm notifications	(s^a₁, s^s₁)	s^a₁ $\overset{a}{\to}$ s^a₁ $\overset{b}{\to}$ s^a₁	s^s₁ $\overset{α}{\to}$ s^s₂ $\overset{β}{\to}$ s^s₃ $\overset{ε}{\to}$ s^s₁
Unsuccessful creation of subscription to alarm notifications	(s^a₁, s^s₁)	s^a₁ $\overset{a}{\to}$ s^a₁ $\overset{c}{\to}$ s^a₁	s^s₁ $\overset{α}{\to}$ s^s₂ $\overset{γ}{\to}$ s^s₁ or s^s₁ $\overset{α}{\to}$ s^s₂ $\overset{β}{\to}$ s^s₃ $\overset{δ}{\to}$ s^s₁
A fault occurs, is processed, and an alarm notification is sent	(s^a₂, s^s₅)	s^a₁ $\overset{d}{\to}$ s^a₂	s^s₁ $\overset{ζ}{\to}$ s^s₄ $\overset{η}{\to}$ s^s₅
Alarm query	(s^a₂, s^s₅)	s^a₂ $\overset{e}{\to}$ s^a₂ $\overset{f}{\to}$ s^a₂	s^s₅ $\overset{θ}{\to}$ s^s₅
Alarm acknowledgement	(s^a₂, s^s₅)	s^a₂ $\overset{g}{\to}$ s^a₃ $\overset{h}{\to}$ s^a₂	s^s₅ $\overset{ι}{\to}$ s^s₅
Alarm suppression	(s^a₅, s^s₆)	s^a₂ $\overset{i}{\to}$ s^a₄ $\overset{j}{\to}$ s^a₅	s^s₅ $\overset{κ}{\to}$ s^s₆
Alarm retention	(s^a₂, s^s₅)	s^a₅ $\overset{k}{\to}$ s^a₆ $\overset{l}{\to}$ s^a₂	s^s₆ $\overset{λ}{\to}$ s^s₅
Alarm clearance	(s^a₈, s^s₇)	s^a₂ $\overset{m}{\to}$ s^a₇ $\overset{n}{\to}$ s^a₈	s^s₅ $\overset{μ}{\to}$ s^s₇

Table 3. Overview of the resources and methods of the Cloud Resource Performance Management API.

Resource Name	Resource URI	HTTP Method	Meaning
All PM jobs	/pmJobs	POST	Creates a PM job
		GET	Retrieves the list of PM jobs
Individual PM job	/pmJobs/{pmJobID}	GET	Queries an individual PM job
		PUT	Updates a PM job
		PATCH	Suspends or resumes a PM job
		DELETE	Deletes a PM job
PM subscriptions	/pmSubscriptions	POST	Creates a PM subscription
		GET	Retrieves the list of PM subscriptions
Individual PM subscription	/pmSubscriptions/{pmSubscriptionID}	GET	Reads a PM subscription
		DELETE	Deletes a PM subscription
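From the client side, the operations in Table 3 translate into plain HTTP requests. The following is a minimal sketch using only the Python standard library; the requests are constructed but not sent. The base URL, the job identifier `job-1`, and the JSON payload fields (`metric`, `periodSeconds`, `state`) are hypothetical, since the message bodies are not specified in the table.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://crpm.example.invalid/v1"  # hypothetical service root


def pm_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an HTTP request for a CRPM resource."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


# Create a PM job, suspend it, and delete it (payload fields are illustrative).
create = pm_request("POST", "/pmJobs", {"metric": "cpuUsage", "periodSeconds": 60})
suspend = pm_request("PATCH", "/pmJobs/job-1", {"state": "SUSPENDED"})
delete = pm_request("DELETE", "/pmJobs/job-1")
```

Each object could be passed to `urllib.request.urlopen` against a real deployment. Because the exchange is ordinary HTTP, a client written in any other language would issue the same requests.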

Table 4. Mapping between the transition sequences in M^app and M^ser.

Transition Abstraction	States Mapping	Transition Sequences in M^app	Transition Sequences in M^ser
Default PM jobs are started	(s^a₁, s^s₁)	s^a₁ $\overset{a}{\to}$ s^a₁	s^s₁ $\overset{α}{\to}$ s^s₁
Creation of an additional PM job	(s^a₂, s^s₂)	s^a₁ $\overset{b}{\to}$ s^a₂ $\overset{c}{\to}$ s^a₂	s^s₁ $\overset{β}{\to}$ s^s₂
Suspension of the PM job	(s^a₄, s^s₃)	s^a₂ $\overset{d}{\to}$ s^a₃ $\overset{e}{\to}$ s^a₄	s^s₂ $\overset{γ}{\to}$ s^s₃
Retention of the PM job	(s^a₂, s^s₂)	s^a₄ $\overset{f}{\to}$ s^a₅ $\overset{g}{\to}$ s^a₂	s^s₃ $\overset{δ}{\to}$ s^s₂
Termination of the PM job	(s^a₇, s^s₄)	s^a₄ $\overset{h}{\to}$ s^a₆ $\overset{i}{\to}$ s^a₇	s^s₃ $\overset{ε}{\to}$ s^s₄
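The service-side column of Table 4 can be read as a small state machine. The following is a minimal sketch, assuming a component that tracks one PM job and rejects operations that are not defined in the current state. The states `s1` to `s4` and the Greek labels are taken from Table 4; the event names and the class `PmJobTracker` are hypothetical.

```python
# Transitions of M^ser as (state, event) -> next state, transcribed from Table 4.
M_SER = {
    ("s1", "start_defaults"): "s1",  # α: default PM jobs are started
    ("s1", "create"): "s2",          # β: creation of an additional PM job
    ("s2", "suspend"): "s3",         # γ: suspension of the PM job
    ("s3", "retain"): "s2",          # δ: retention of the PM job
    ("s3", "terminate"): "s4",       # ε: termination of the PM job
}


class PmJobTracker:
    """Tracks the state of one PM job and rejects undefined transitions."""

    def __init__(self) -> None:
        self.state = "s1"

    def apply(self, event: str) -> str:
        key = (self.state, event)
        if key not in M_SER:
            raise ValueError(f"event {event!r} is not allowed in state {self.state}")
        self.state = M_SER[key]
        return self.state
```

In this reading, a job can be terminated only after it has been suspended, because Table 4 lists the ε transition from s^s₃ only.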

Table 5. Mapping of transition sequences in N^app and N^ser.

Transition Abstraction	State Mapping	Transition Sequences in N^app	Transition Sequences in N^ser
Successful creation of subscription to PM data	(s^a₁, s^s₁)	s^a₁ $\overset{a}{\to}$ s^a₂ $\overset{b}{\to}$ s^a₂ $\overset{e}{\to}$ s^a₄ $\overset{f}{\to}$ s^a₅	s^s₁ $\overset{α}{\to}$ s^s₂ $\overset{β}{\to}$ s^s₃
Unsuccessful creation of subscription to PM data	(s^a₁, s^s₁)	s^a₁ $\overset{a}{\to}$ s^a₂ $\overset{b}{\to}$ s^a₂ $\overset{c}{\to}$ s^a₃ $\overset{d}{\to}$ s^a₁	s^s₁ $\overset{α}{\to}$ s^s₂ $\overset{γ}{\to}$ s^s₁
PM data is available for reporting. The connection is established and PM data is sent.	(s^a₅, s^s₃)	s^a₅ $\overset{g}{\to}$ s^a₆ $\overset{i}{\to}$ s^a₇ $\overset{j}{\to}$ s^a₈	s^s₃ $\overset{δ}{\to}$ s^s₄ $\overset{ε}{\to}$ s^s₅
PM data is available for reporting. The connection setup fails.	(s^a₅, s^s₃)	s^a₅ $\overset{g}{\to}$ s^a₆ $\overset{h}{\to}$ s^a₅	s^s₃ $\overset{δ}{\to}$ s^s₄ $\overset{θ}{\to}$ s^s₃
The PM data is received successfully.	(s^a₈, s^s₅)	s^a₈ $\overset{k}{\to}$ s^a₅	s^s₅ $\overset{ζ}{\to}$ s^s₃
The receiver rejects the PM data.	(s^a₈, s^s₅)	s^a₈ $\overset{l}{\to}$ s^a₅	s^s₅ $\overset{η}{\to}$ s^s₃

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
