Next Article in Journal
How to Encourage Green Product Development Performance: A Stainable Leadership Perspective
Previous Article in Journal
Let’s Take the Pulse of the Classroom on Sustainability! An Exploratory Study on Student Views and Teacher Solution Suggestions Regarding Sustainable Development Goals
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Adaptive Predictive Maintenance and Energy Optimization in Metro Systems Using Deep Reinforcement Learning

by
Mohammed Hatim Rziki
1,
Atmane E. Hadbi
1,
Mohamed Khalifa Boutahir
2,3 and
Mohammed Chaouki Abounaima
4,*
1
Laboratory of AI, Faculty of Sciences, Moulay Ismail University of Meknes, Meknes 50050, Morocco
2
IMIA Laboratory, IDMS Team, Faculty of Sciences and Techniques of Errachidia, Moulay Ismail University of Meknès, Meknes 50050, Morocco
3
ENIAD Berkane, SmartICT Lab, Mohammed First University, Oujda 60000, Morocco
4
Laboratory of Intelligent Systems, Application Faculty of Science and Technology, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
*
Author to whom correspondence should be addressed.
Sustainability 2025, 17(11), 5096; https://doi.org/10.3390/su17115096
Submission received: 20 March 2025 / Revised: 28 April 2025 / Accepted: 22 May 2025 / Published: 1 June 2025

Abstract

:
The rapid growth of urban metro systems requires novel strategies to guarantee operational dependability and energy efficiency. This article describes a new way to use deep reinforcement learning (DRL) to help metro networks with predictive maintenance that adapts to changing conditions and energy optimization. We used real-world transit data from the General Transit Feed Specification (GTFS) to model the maintenance scheduling and energy management problem as a Markov Decision Process. This included important operational metrics like peak-hour demand, train arrival times, and station stop densities. A custom reinforcement learning environment mimics the changing conditions of metro operations. Deep Q-Networks (DQNs) and Proximal Policy Optimization (PPO) sophisticated deep reinforcement learning techniques were used to identify the optimal policies for decreasing energy consumption and downtime. The PPO hyperparameters were additionally optimized using Bayesian optimization by implementing Optuna, which produces a far greater performance than baseline DQNs and basic PPO. Comparative tests showed that our improved DRL-based method improves the accuracy of predictive maintenance and the efficiency of energy use, which lowers operational costs and raises the dependability of the service. These results show that advanced learning and optimization techniques could be added to public transportation systems in cities. This could lead to more sustainable and smart transportation management in big cities.

1. Introduction

Urban metro systems have become the lifeline of modern cities, facilitating rapid and efficient transportation for millions of commuters each day [1]. However, as these networks expand and passenger demand increases, ensuring operational reliability and energy efficiency has emerged as a critical challenge [2]. Traditional maintenance practices that rely on predetermined schedules or reactive repairs often result in unexpected failures and costly downtime, ultimately compromising service quality and increasing energy consumption [3].
Given these challenges, there is an urgent need for adaptive, data-driven approaches to maintenance. Rising energy costs and the pressure to reduce carbon emissions further underscore the importance of optimizing operational efficiency. Improving reliability and saving energy are two important goals that drive the creation of smart systems that can predict failures and change operational parameters on the fly.
In many established metro and railway systems, maintenance is organized into multiple tiers ranging from routine inspections to full-scale overhauls. For example, organizations like the SNCF employ a structured multi-level maintenance system that categorizes interventions by their criticality and impact on operations [4,5]. As shown in Figure 1, the five maintenance levels range from light, routine checks to comprehensive overhauls, ensuring that all components of the train are maintained according to their operational needs and risk profiles. This stratified approach helps balance safety, reliability, and cost efficiency, although rigid schedules may sometimes lead to inefficiencies during periods of peak demand.
This stratified approach upholds essential safety measures while attempting to balance cost efficiency. However, strict schedules can waste time and money during times of high demand because reactive maintenance might not be enough to stop sudden problems [3].
The incorporation of real-time data into maintenance strategies holds the potential to bring about significant transformation. By continuously monitoring operational parameters and dynamically adjusting maintenance schedules, transit agencies can preemptively address issues, reducing both downtime and energy wastage [6].
Innovative solutions in predictive maintenance have been made possible by recent advances in machine learning and big data analytics. Reinforcement learning (RL), in particular, offers a powerful means of learning optimal decision-making policies through trial and error in dynamic environments [2]. Unlike traditional methods, RL algorithms can constantly adapt to new situations by improving their strategies based on how well they worked in the past. This makes them perfect for real-time, complex tasks like metro operations [1].
In addition, as illustrated in Figure 2, adopting data-driven solutions can yield multiple benefits for modern metro systems. These range from cost savings and enhanced asset performance to improved safety, reliability, and scalability—elements that are increasingly vital for urban transit networks facing growing passenger demand and infrastructure complexity.
Using hyperparameter optimization methods such as Bayesian optimization with frameworks like Optuna can also greatly improve model performance by fine-tuning the learning process [4]. These combined methods make it easier to create flexible plans that can handle operational uncertainties. This leads to big improvements in the accuracy of planned maintenance and the use of less energy [6].
This paper proposes an adaptive framework that leverages deep reinforcement learning for predictive maintenance and energy optimization in metro systems. Our method involves creating a unique RL environment using real-world transit data from the General Transit Feed Specification (GTFS). These data include important operational metrics like train arrival times, station stop densities, and peak-hour demand. We use cutting-edge DRL algorithms, like Deep Q-Networks (DQNs) and Proximal Policy Optimization (PPO), and we improve performance even more by adding Bayesian optimization through Optuna. Following our tests, we found that the optimized PPO model works better than both the baseline DQNs and standard PPO configurations.
This research is motivated by the potential to revolutionize metro operations through advanced Artificial Intelligence. By integrating DRL with real-time data analytics and, where applicable, complementary technologies like blockchain and the IoT for ensuring data integrity, our approach seeks to transform conventional maintenance paradigms into a proactive, resource-efficient process. The following key motivations drive our study:
Resource Optimization: Metro networks are vast and inherently complex, with numerous components requiring continuous monitoring. The deployment of IoT sensors enables real-time data collection on various system parameters—ranging from train operation to energy consumption. When these data streams are analyzed through AI models, they facilitate the anticipation of failures and help in scheduling maintenance more effectively, ensuring that limited resources are optimally allocated [7].
Cost and Risk Reduction: Predictive maintenance driven by AI can detect potential system failures before they manifest into significant issues. This proactive approach allows operators to schedule maintenance based on precise, data-informed forecasts, reducing emergency repair costs and minimizing the risk of service disruptions. In turn, this leads to a reduction in overall operational expenses and enhances system reliability [8].
Enhanced Data Integrity and Security: Although the primary focus of this study is reinforcement learning, the reliability of the underlying data is critical. Blockchain technology offers a means of maintaining an immutable record of all maintenance data, ensuring transparency and security in data exchange among various stakeholders. This added layer of data integrity is essential for making well-informed maintenance decisions [9].
Transition to Autonomous Systems: Integrating AI, the IoT, and blockchain technology sets the stage for developing fully autonomous maintenance systems. For example, trains equipped with IoT sensors can detect anomalies in real time, and AI algorithms can automatically analyze the data to trigger maintenance actions without human intervention. Such an autonomous approach can significantly enhance system responsiveness and efficiency [10].
Sustainability and Energy Efficiency: Optimizing maintenance operations not only improves reliability but also extends the lifespan of critical infrastructure and reduces energy consumption. By implementing a data-driven predictive maintenance framework, metro systems can achieve more sustainable operation that minimizes waste and lowers energy costs, contributing to an eco-friendly urban transit environment [11].
By dynamically adapting maintenance schedules and energy management strategies, our method not only reduces operational downtime but also minimizes energy consumption, thereby lowering overall costs. Adding advanced RL techniques to the management of complicated urban transit systems could have a huge impact, as these results show.
The remainder of this paper is organized as follows. Section 2 reviews the relevant literature on predictive maintenance and the application of reinforcement learning in transportation. In Section 3, we explain our method in more detail, describing how we created the RL environment, how we used DRL algorithms, and how we added Bayesian hyperparameter optimization. In Section 4, we present our experimental setup and results, highlighting the performance improvements achieved by our proposed approach. Finally, Section 5 concludes with a discussion on the implications of our findings and recommendations for future research.

2. Literature Review

The following section provides a comprehensive review of the literature related to predictive maintenance strategies within transportation systems. To ensure a clear and structured approach, the literature review is organized into two dedicated subsections. The first Subsection, 2.1. Predictive Maintenance in Transportation Systems, explores the evolution, challenges, and successful implementations of predictive maintenance practices specifically within metro and railway networks. The second Subsection, 2.2. Advanced Technologies in Predictive Maintenance for Transportation Systems, focuses on the integration of modern technologies such as Artificial Intelligence (AI), the Internet of Things (IoT), and machine learning models in advancing predictive maintenance frameworks. This structured organization highlights both the foundational studies and the latest technological innovations, demonstrating the depth of research considered and positioning our work within the broader scientific context.

2.1. Predictive Maintenance in Transportation Systems

In transportation systems, especially railroads, predictive maintenance has become a crucial technique for improving operating efficiency and safety. This method utilizes ongoing data collection and sophisticated analytics to predict equipment breakdowns, therefore reducing unforeseen interruptions and enhancing maintenance plans. The Société Nationale des Chemins de fer Français (SNCF) has led the integration of data analytics into its maintenance methods since 2013 [12]. Through the recruitment of data scientists and the rigorous analysis of obtained data, the SNCF has exceeded the performance requirements established by train manufacturers, therefore reinforcing its proficiency in predictive maintenance, as presented in Figure 3 [13]. This technique relies on the ongoing use of data from trains, analyzed by certain algorithms, to substitute systematic periodic inspections with real-time monitoring of equipment conditions. This paradigm change facilitates the foresight and forecast of failures, resulting in a substantial decrease in both breakdowns and superfluous activities. Predictive maintenance has enabled enhanced battery monitoring and pantograph checks without requiring train immobility. The use of digital technology has transformed railway maintenance methods, leading to a 20% reduction in maintenance expenses and a 30% decline in on-site operations. Data and Artificial Intelligence in Predictive Maintenance of Railway Networks [13,14].
The amalgamation of data analytics and Artificial Intelligence (AI) has significantly revolutionized the administration of railway networks. This change is organized into several essential phases:
Data Collection: Acquiring extensive information on the present and historical conditions of the infrastructure.
Data Visualization: Organizing data to enhance comprehension and communication.
Development of Novel Indicators: Employing data to identify probable problems. Failure Prediction: Developing methods to forecast malfunctions through the analysis of gathered data.
Asset Management: Enhancing maintenance operations informed by these forecasts. AI is integral at each stage by augmenting data processing skills and raising the speed and precision of forecasts.
Nonetheless, it remains an auxiliary instrument that needs stringent supervision, particularly for safety considerations. In the future, AI may be utilized to conduct simulations and evaluate many situations to enhance infrastructure management. Furthermore, the Internet of Things (IoT) has played a crucial role in enhancing predictive maintenance alongside AI. Organizations such as KONUX have created systems that integrate IoT sensors with AI to perpetually assess the status of railway switches. These systems furnish infrastructure managers with temporal projections regarding the status of switches, hence facilitating failure prevention and the optimization of maintenance planning [15,16].
A multitude of research has enhanced the understanding of predictive maintenance in the transportation sector. Research conducted by Binder et al. [17] examined predictive maintenance strategies for railway systems. The research employed a machine learning methodology to alleviate hazards, leveraging data supplied by railway operators. The results demonstrated that machine learning algorithms provide favorable outcomes, allowing engineers to swiftly and precisely assess maintenance requirements.
Figure 3. LoRaWAN and predictive maintenance at SNCF [17]. Adapted with permission from internal material provided by SNCF, 2025.
Figure 3. LoRaWAN and predictive maintenance at SNCF [17]. Adapted with permission from internal material provided by SNCF, 2025.
Sustainability 17 05096 g003
Another work by Costa et al. [18] used novel methodologies for the analysis of sensor data and machine learning model predictions to ascertain ideal schedules for maintenance activities. The research highlighted the necessity for a cohesive strategy that merges predictive maintenance with operational limitations to enhance system dependability and safety. This study emphasizes the influence of machine learning algorithms in predicting railway equipment failures, concentrating on solutions that may foresee problems before their occurrence, which is essential for cost management and operational efficiency in intricate railway settings.
This paper examines issues associated with the integration of these technologies into current infrastructures, providing insights into potential advancements in predictive maintenance. The implementation of AI in predictive maintenance signifies a substantial enhancement in the performance, safety, and efficiency of railway transportation systems. Current developments indicate significant potential for the integration of AI into several facets of maintenance and infrastructure management, revolutionizing conventional processes into more intelligent and adaptive solutions.

2.2. Advanced Technologies in Predictive Maintenance for Transportation Systems

In recent years, the integration of advanced technologies such as Artificial Intelligence (AI), the Internet of Things (IoT), and blockchain technology has significantly reshaped predictive maintenance strategies within transportation systems. These technological innovations have facilitated proactive rather than reactive maintenance, greatly reducing downtime and operational costs while enhancing safety and reliability.
AI and machine learning (ML) have become instrumental in processing the vast volumes of data generated in transportation networks. These technologies analyze complex datasets, identify patterns, and predict equipment failures before they occur, allowing for preventive actions to be scheduled effectively [19]. For example, the Metropolitan Transportation Authority (MTA) in New York City has successfully piloted a project in collaboration with Google Public Sector, employing Google’s Pixel smartphones and AI-driven analysis to detect and predict track defects. The smartphones, equipped with microphones and accelerometers, collect data on vibrations and sounds to identify potential infrastructure anomalies, proving highly efficient in predicting maintenance needs and preventing service interruptions [19].
The role of the IoT in transportation infrastructure maintenance cannot be overstated. IoT technology involves embedding sensors into critical infrastructure components to gather real-time data, thus enabling continuous and precise monitoring of operational status. KONUX, for instance, has implemented an IoT-driven predictive maintenance system specifically designed for railway switches, leveraging IoT sensors coupled with AI analytics to deliver real-time monitoring, fault detection, and actionable insights. This has allowed infrastructure managers to plan maintenance operations more effectively and avoid unnecessary service interruptions, significantly improving system reliability and operational efficiency [20].
Additionally, blockchain technology has been increasingly adopted due to its ability to securely manage and verify maintenance-related data across multiple stakeholders. Blockchain technology ensures the integrity, transparency, and immutability of maintenance data, fostering trust among stakeholders in transportation networks. For instance, in railway systems, blockchain records every maintenance action, inspection, and operation, creating an immutable record that guarantees transparency and accountability across all involved parties. This secure and verifiable record-keeping substantially improves decision-making processes for predictive maintenance operations [21].
Moreover, collaborative platforms and initiatives from industry leaders have accelerated the advancement and adoption of predictive maintenance technologies. For example, Air France-KLM partnered with Google Cloud in 2024 to enhance its operational efficiencies through AI. This collaboration leverages AI-based analytics to anticipate aircraft maintenance requirements, optimize flight operations, and enhance passenger service delivery. Such strategic partnerships highlight how cross-sector collaborations can significantly accelerate predictive maintenance practices, ensuring higher reliability and optimized resource utilization in the transportation industry [22].
In conclusion, the deployment of AI, the IoT, and blockchain technology within predictive maintenance paradigms signifies a critical evolution in transportation infrastructure management. By providing sophisticated predictive capabilities, these technologies enable more accurate and proactive decision-making, reduce operational costs, and significantly increase overall system reliability and passenger safety.

3. Materials and Methods

The methodology section of this study outlines the full experimental framework designed to address the challenges of predictive maintenance and service optimization in metro transportation systems. By leveraging advanced deep reinforcement learning (DRL) algorithms, our approach aimed to dynamically adjust metro scheduling decisions, such as train frequency control, based on real-time operational indicators derived from historical transit data. The goal was to develop an intelligent, adaptive system that not only predicts potential congestion and maintenance needs, but also proactively suggests optimal scheduling strategies to improve efficiency, reliability, and resource utilization.
This section begins by introducing the motivation behind our model design and the theoretical foundations of reinforcement learning in this context. We then present the dataset used, the preprocessing pipeline, and the construction of a custom simulation environment that mimics real-world metro conditions. Following this, we detail the architecture and configuration of the DRL algorithms employed (DQNs, PPO, and PPO optimized with Optuna), before concluding with a description of the evaluation metrics and validation strategy used to compare model performance.

3.1. Fundamental Concepts

Reinforcement learning (RL) is a branch of machine learning that focuses on how an agent should act in an environment to maximize cumulative reward over time. In the context of our metro system case study, RL provides a systematic approach to develop adaptive, predictive maintenance and energy optimization strategies [2]. At its core, RL comprises several fundamental components, as presented in Figure 4:
Agent: The decision-maker that interacts with the environment. In our study, the agent represents the control algorithm that determines maintenance actions.
State (S): A representation of the current situation of the environment. Our state vector includes variables such as train arrival times, stop identifiers, station coordinates, and indicators of peak hours.
Action (A): The set of decisions or control moves available to the agent. For instance, the agent can decide to decrease, maintain, or increase train frequency.
Environment (E): The external system with which the agent interacts. Here, the metro system, with its dynamic operational data, constitutes the environment.
Reward (R): A numerical signal that provides feedback on the effectiveness of the agent’s actions. The reward function in our model is designed to balance maintenance costs and energy consumption while minimizing service disruptions.
These components, as presented in Figure 4, are formally captured through a Markov Decision Process (MDP), defined as follows:
M = S , A , P , R , γ
where S is the set of states, A is the set of actions, (s′∣s,) represents the state transition probabilities, R(s,a) is the reward function, and γ is the discount factor that models the importance of future rewards [1].
One common algorithm used to solve MDPs is Q-learning. Q-learning is a value-based method that aims to learn an optimal action-value function (s,) by iteratively updating the Q-values using the Bellman equation:
Q ( s , a ) Q ( s , a ) + α [ r + γ m a x a Q ( s , a ) Q ( s , a ) ]
where α is the learning rate, r is the immediate reward, and s′ is the next state. This iterative update enables the agent to estimate the long-term value of actions, facilitating effective decision-making in dynamic settings [9].
With the advent of deep learning, deep reinforcement learning (DRL) extends these principles by using neural networks as function approximations. Methods like Deep Q-Networks (DQNs) and Proximal Policy Optimization (PPO) have shown remarkable success in complex environments where the state and action spaces are large and continuous [23]. DRL enables the agent to learn rich representations of the environment, thereby capturing the intricate dynamics of metro systems and enabling more precise predictive maintenance and energy optimization.
Given the sensitivity of DRL performance to hyperparameter settings, it is essential to apply optimization methods to fine-tune the learning process. Bayesian optimization methods—implemented via frameworks like Optuna—provide an efficient approach to search the hyperparameter space. Unlike grid or random search, Bayesian optimization builds a probabilistic model of the objective function and uses it to select promising hyperparameter configurations, thereby accelerating convergence and improving performance [4].

3.2. Data Used and Visualization

The data used for this study come from the General Transit Feed Specification (GTFS), which provides comprehensive details on public transport systems. The dataset is made up of various main files, including temporal, spatial, and trip-specific information, essential for predicting periods of high demand for energy optimization in urban metro systems. To obtain a unified and enriched dataset, these files were systematically merged according to unique identifiers. Table 1 summarizes the main GTFS files and their role in the dataset.
The merging process combined these files using primary keys (trip_id, stop_id, route_id) to create a holistic dataset, which integrates temporal details, trip data, and stop information. The goal was to enable the effective prediction of the energy demand in real-time by utilizing AI models.
The distribution of train arrivals throughout the day provides insightful information about peak and non-peak periods within the metro system, as presented in Figure 5. The graph highlights two prominent peaks: the morning peak occurring between 7 AM and 9 AM, and the evening peak between 5 PM and 7 PM. These periods typically correspond to commuter rush hours, when the majority of passengers rely heavily on metro services to commute to work and return home. Consequently, identifying these critical intervals is essential for adjusting train frequencies and ensuring optimal operational efficiency and passenger comfort.
Analyzing the frequency distribution further, one observes that train traffic significantly decreases outside these peak intervals, indicating periods when resources may be underutilized. The identification of peak hours provides valuable input for reinforcement learning algorithms, as it defines optimal times to increase train frequency or perform predictive maintenance activities without disrupting operations. By aligning maintenance activities with low-traffic periods, operators can minimize disruption, ensure system reliability, and improve overall passenger satisfaction and safety.
The distribution of the number of trips stopping at each station, referred to as the “stop density”, presents an interesting pattern across the network, as presented in Figure 6. The frequency distribution highlights a significant concentration of stations servicing between 4000–10,000 trips, reflecting the critical role these stations play as core hubs within the transportation system. These stations likely represent central or transfer stations that experience consistently high passenger flow throughout the day, requiring frequent service scheduling and targeted maintenance strategies.
Furthermore, the figure illustrates that some metro stations have exceptionally high traffic, reaching around 40,000 trips, indicating critical nodes within the metro system. These nodes likely serve as major interchange hubs or are located in densely populated or commercially active areas. Identifying such high-density stations allows transportation planners and maintenance teams to prioritize these locations in terms of resource allocation and predictive maintenance activities, ensuring higher reliability and minimizing service disruptions.
On the other end, a considerable number of stations show lower densities, accommodating fewer than 5000 trips. These stations may represent peripheral or less strategically located stops, possibly requiring fewer resources for maintenance. This variability in station usage underscores the importance of implementing flexible and adaptive scheduling and maintenance strategies. Reinforcement learning and predictive maintenance models can leverage such insights, dynamically adapting service frequencies, optimizing energy consumption, and predicting equipment wear based on actual usage patterns and real-time data from the metro network.
The figure clearly illustrates the comparative distribution of trips between peak and non-peak hours within the metro system. As presented in Figure 7, it is evident that the number of trips conducted during non-peak hours significantly exceeds those during peak hours. This distribution suggests that while peak hours are critical periods characterized by intensified usage and potentially higher stress on the infrastructure, a substantial portion of the system’s operational load occurs during non-peak intervals.
This imbalance between peak and non-peak trips highlights opportunities for optimizing metro operations, energy management, and maintenance scheduling. For instance, predictive maintenance and reinforcement learning strategies can capitalize on the lower-frequency periods to schedule preventive maintenance activities. Leveraging these intervals could greatly minimize the impact of maintenance-related disruptions on passenger service, ensuring reliability without affecting the critical periods of high passenger traffic.
Moreover, understanding the ratio of peak to non-peak trips can enhance resource allocation strategies. Energy optimization algorithms can adjust train frequency dynamically, increasing capacity during peak hours and efficiently managing energy consumption during off-peak times. Such adaptive management not only reduces operational costs but also improves the overall efficiency and sustainability of the metro network, aligning perfectly with contemporary goals in urban transportation planning.
The GTFS (General Transit Feed Specification) dataset used in this study primarily provides structural and temporal information about metro operations, including trip schedules, station coordinates, stop sequences, and route information. While the dataset does not explicitly include energy consumption or maintenance failure data, we extracted indirect indicators to approximate these operational aspects. Specifically, we used the stop density (i.e., the number of trips per station), arrival and departure times, and peak hour indicators as proxies to estimate system load and stress, which in turn relate to maintenance demand and potential energy usage.
To align these proxies with our research objectives, we made the following assumptions: (1) stations with higher stop densities and those operating during peak hours are more likely to experience congestion and mechanical stress, increasing the maintenance likelihood; (2) frequent service intervals imply greater energy consumption; and (3) deviation from optimal scheduling during high-demand periods reflects potential inefficiencies. These assumptions enabled us to simulate realistic scenarios for predictive maintenance and energy optimization using reinforcement learning, despite the absence of direct physical sensor data. Future work may incorporate IoT-based inputs to enhance granularity and physical realism.

3.3. Detailed Environment Setup and Reinforcement Learning Implementation

In this section, we present a comprehensive, detailed description of how we defined our reinforcement learning (RL) environment, including its key components: states, actions, rewards, and the transition dynamics between them. We then explain in detail how we implemented and trained two deep reinforcement learning (DRL) models, Deep Q-Networks (DQNs) and Proximal Policy Optimization (PPO), tailored specifically for predictive maintenance and energy optimization in metro systems. This humanized explanation is accompanied by relevant Python (version 3.10) code snippets, offering clarity on the practical execution of the methods used.
Environment Definition
To simulate a realistic metro network scenario, we created a custom environment class named ”MetroEnv”, leveraging OpenAI Gym’s capabilities. The primary purpose of our environment was to replicate real-world decision-making scenarios where metro operation schedules are adjusted dynamically based on various factors, notably station congestion and the time of day.
The Table 2 above summarizes the critical components of our reinforcement learning (RL) environment. The state includes essential metro system variables such as arrival times, geographical station data, and congestion indicators, which serve as inputs for decision-making. The action space is simplified into three discrete actions that reflect operational decisions about metro frequency adjustments. Lastly, the reward function defines the incentive structure guiding the RL agent, with rewards and penalties tailored specifically to encourage decisions that optimize system performance, minimize congestion, and avoid repetitive and suboptimal actions.
Additionally, to ensure diversity and avoid repetitive actions, we introduced a small penalty if the agent chose the same action consecutively.
We trained two popular DRL algorithms to compare their effectiveness and adaptability: a Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). Both were implemented using Stable-Baselines3.
Deep Q-Network (DQN)
DQNs leverage Q-learning, a value-based method, using neural networks to approximate Q-values for each state-action pair. We configured our model as follows:
model_dqn = DQN(
        policy = “MlpPolicy”,
        env = env,
        learning_rate = 0.0003,
        buffer_size = 100,000,
        batch_size = 128,
        exploration_fraction = 0.5,
        exploration_final_eps = 0.1,
        target_update_interval = 2000,
        verbose = 1
)
model_dqn.learn(total_timesteps = 300,000)
Proximal Policy Optimization (PPO):
PPO, known for its stable and efficient learning, was trained similarly:
model_ppo = PPO(
        policy = “MlpPolicy”,
        env = env,
        learning_rate = 0.0003,
        n_steps = 512,
       gamma = 0.99,
        clip_range = 0.2,
        ent_coef = 0.02
)
model_ppo.learn(total_timesteps = 300,000)
We chose these parameters carefully to balance exploration and exploitation, ensuring effective convergence towards optimal policy behaviors.
Optimization with Bayesian Methods (Optuna):
To further enhance our PPO model, we leveraged Bayesian optimization via Optuna to fine-tune the hyperparameters:
This optimization yielded significant improvements in model performance, outperforming baseline PPO and DQN models, the learning rate, gamma, the number of steps per update, the entropy coefficient, and the clip range.
The motivation behind our detailed implementation was driven by the need for dynamic, intelligent scheduling in metro systems. Given the increasing complexity of urban metro operations and the urgent necessity to reduce downtime and operational costs, our RL-driven approach was specifically designed to provide scalable and flexible solutions for real-time adjustments in metro frequency. Our methodology aimed not only to predict maintenance needs accurately but also to dynamically optimize operational decisions in varying conditions, enhancing both user experience and operational efficiency.
Through the detailed definition of our environment and the rigorous implementation of advanced RL techniques, our methodology presents a robust framework capable of transforming traditional metro management into an adaptive, intelligent system optimized for modern transportation needs.
Reward Function Justification and Sensitivity Analysis:
The design of the reward function is a critical component of any reinforcement learning framework, as it directly shapes the behavior the agent will learn to optimize. In the context of metro system optimization, our reward function was constructed to encourage decisions that balance service frequency with passenger demand, minimize congestion, and improve overall system responsiveness.
The reward function was based primarily on the stop density (i.e., the number of trips stopping at a given station), which serves as a proxy for passenger load. The agent receives a positive reward when it selects an action that aligns with optimal service levels for the observed density. Specifically, an action to increase frequency yields a high reward (+2) if congestion is high (stop density above 7000), whereas maintaining frequency is rewarded (+1.5) during moderately congested periods. Conversely, penalties (ranging from −0.5 to −1) are applied when the agent either increases the frequency unnecessarily or fails to act when congestion is evident. A small penalty (−0.2) is also applied when the same action is repeated consecutively to promote diversity in decision-making and avoid local optima.
To validate the robustness of this reward structure, we conducted a sensitivity analysis by varying key parameters influencing agent behavior. These included the following:
  • The congestion threshold (tested between 5000 and 10,000),
  • The reward magnitudes for each action class (adjusted ±50%),
  • The repetition penalty term (varied from 0.0 to 0.5).
The results indicated that while the performance remained relatively stable across moderate variations, overly aggressive penalties or poorly calibrated thresholds led to unstable training behavior or convergence to suboptimal policies (e.g., always increasing the frequency). The best stability and reward accumulation were observed when the congestion threshold was around 7000 and when the “increase” reward was set distinctly higher than others, encouraging the agent to act decisively when demand spikes.
This empirical exploration reinforces the importance of reward tuning in DRL, especially in real-world transportation scenarios where operational goals are multi-faceted and require nuanced modeling. Future work could involve multi-objective reward functions that also consider energy efficiency, cost constraints, or real-time passenger feedback, enhancing the decision framework’s realism and utility.

4. Results and Discussion

In this section, we present and discuss the results obtained from applying our proposed methodology to the metro scheduling environment. Specifically, we evaluated and compared the performance of two deep reinforcement learning models—Deep Q-Networks (DQNs) and Proximal Policy Optimization (PPO)—as well as the enhanced PPO model optimized using Bayesian optimization (Optuna). The effectiveness of each model was assessed based on its cumulative rewards, action distribution, and adaptability to peak and non-peak operational scenarios within the metro system.
The comparative analysis presented here highlights not only the performance differences among the tested models but also demonstrates how reinforcement learning can significantly impact operational efficiency, predictive maintenance accuracy, and the overall service quality. Through detailed graphical representations and rigorous evaluation metrics, this section provides a clear understanding of each model’s strengths, limitations, and practical implications for predictive maintenance and energy optimization in real-world metro systems.
The action distribution graph clearly illustrates the decision-making patterns of our trained Deep Q-Network (DQN) reinforcement learning agent. The model predominantly chose the action labeled “Increase”, which represents decisions to increase the frequency of metro services, totaling over 80 occurrences during the test interval. In contrast, the “Decrease” frequency action was selected very few times, with less than 20 occurrences, and the action “Keep” frequency was nearly absent from the choices made by the model.
This trend indicates that the DQN algorithm primarily identifies scenarios characterized by higher passenger congestion or demand, opting to augment train frequencies to effectively mitigate potential overcrowding. Such decision-making aligns well with the real-world demands of metro systems, particularly during peak hours, where higher frequencies contribute directly to improved passenger comfort, reliability, and safety. However, the minimal use of the “Decrease” and “Keep” actions might indicate limited exploration or overly aggressive policy adjustments, suggesting possible areas for future improvement through hyperparameter tuning or additional training iterations.
Figure 8 highlights the action selection frequency distribution obtained from the DQN agent, providing valuable insights into its learned policy behavior concerning real-time operational adjustments.
As depicted in the cumulative reward progression plot presented on Figure 9, the DQN agent demonstrated steady learning improvement over the testing period of 100 steps. The continuous upward trajectory of cumulative rewards indicates that the agent effectively adapted to the environment’s reward structure, progressively enhancing its decision-making quality. The consistent slope in the reward accumulation graph suggests that the policy is stable, meaning the trained model reliably makes decisions that lead to systematically beneficial outcomes within the context of metro scheduling optimization.
Nevertheless, the smoothness observed in this reward progression may imply that the DQN model quickly found an optimal strategy and adhered consistently to it, thus earning incremental positive rewards steadily. However, this might also imply that the agent could potentially lack flexibility in dealing with changing or unpredictable conditions, reinforcing the importance of employing supplementary exploration strategies or optimization algorithms to ensure robust adaptability across various operational scenarios.
In summary, the evaluation results indicate the promising capability of the DQN model in addressing metro scheduling optimization challenges. However, to further enhance the robustness and adaptability of the reinforcement learning-based decisions, additional refinement or the inclusion of hyperparameter optimization techniques, such as Bayesian optimization, could be advantageous. This would help balance exploration and exploitation more effectively, ensuring optimal policy development that consistently meets the dynamic needs of metro systems.
As presented in Figure 10, the Proximal Policy Optimization (PPO) algorithm predominantly selected the “Increase” action throughout the evaluation period, with virtually no instances of selecting “Decrease” or “Keep”. This behavior strongly suggests that the PPO model identified the metro scheduling environment as consistently requiring increased service frequency, likely driven by a systematic detection of higher passenger demand or congestion levels within the considered dataset. While this aggressive stance toward increasing frequencies highlights the model’s effective responsiveness to operational demand, it also implies that the PPO model may lack flexibility in discerning more nuanced or intermediate scenarios.
Moreover, Figure 11 depicts the reward progression over time, demonstrating a clear and consistent increase in cumulative rewards obtained by the PPO algorithm across the 100 evaluated time steps. This steady progression indicates a reliable and highly stable learned policy, affirming PPO’s ability to effectively optimize scheduling decisions. The near-linear reward growth emphasizes PPO’s robustness and efficiency in navigating the reward structure designed for predictive maintenance and operational optimization, potentially making it a valuable tool for real-world metro systems aiming for continuous operational enhancement.
The presented figures (Figure 12 and Figure 13) depict the performance and behavior of the Proximal Policy Optimization (PPO) model enhanced through Bayesian optimization using Optuna. Specifically, the optimized hyperparameters derived from the Optuna tuning process are as follows: a significantly reduced learning rate of 0.00008, an increased number of steps per update (n_steps = 8192), an augmented batch size (batch_size = 256), a high discount factor for future rewards (gamma = 0.995), an optimized clipping threshold (clip_range = 0.15), a lower entropy coefficient (ent_coef = 0.005) to focus more on exploitation, a tuned Generalized Advantage Estimation parameter (gae_lambda = 0.92), and an increased number of epochs per policy update (n_epochs = 15).
Figure 12 demonstrates an improved action-selection strategy when compared to the original PPO model. Specifically, the agent chose to “Increase” the service frequency approximately 65% of the time, while the remaining actions were distributed between “Decrease” (20–25%) and “Keep” (~10%). This diversified action distribution indicates a clear evolution from the initial PPO model, which solely focused on increasing frequencies. The balanced choice distribution highlights that Bayesian optimization successfully guided the PPO model towards a more sophisticated and context-aware policy, enabling better operational decision-making across different congestion scenarios.
Figure 13 clearly illustrates the cumulative rewards obtained by the PPO agent across the evaluation period of 100 time steps. The reward progression consistently increases, achieving a final cumulative reward above 100. Although minor fluctuations are visible throughout the progression, they highlight the agent’s adaptability to dynamic environmental conditions, a marked improvement compared to the smoother yet less flexible linear progression observed with the standard PPO. The carefully tuned parameters provided by Optuna notably allowed the PPO agent to more effectively navigate the complexity of the metro scheduling environment, maximizing cumulative rewards through nuanced decisions.
In summary, integrating Bayesian hyperparameter optimization using Optuna into the PPO model significantly enhanced its performance and adaptability. Quantitatively, the optimized PPO agent achieved a higher cumulative reward (above 100 in 100 steps), reflecting superior policy quality and better decision-making compared to the simpler PPO and DQN models. Moreover, the more balanced action distribution (Increase: ~65%, Decrease: ~25%, Keep: ~10%) contrasts sharply with the standard PPO’s overly aggressive action strategy (almost 100% “Increase”) and the DQN’s imbalance, underscoring the effectiveness of Bayesian optimization in refining the agent’s ability to interpret and adapt to complex environmental dynamics. This improvement not only demonstrates the potential of optimized PPO as a powerful decision-making framework but also reinforces its applicability to real-world metro systems seeking robust, responsive predictive maintenance and scheduling optimization solutions.

5. Conclusions

Using advanced deep reinforcement learning (DRL) techniques along with Bayesian optimization, this study showed a complete way to improve predictive maintenance and operational decision-making in metro systems. This study used real-world GTFS transit data to show that the optimized Proximal Policy Optimization (PPO) model can make scheduling a lot more efficient, change the frequency of services before they are needed, and respond well to different traffic and congestion situations. These advancements underline the value of intelligent decision-making algorithms to enhance reliability, reduce maintenance expenses, and increase passenger satisfaction in urban transportation systems.
This research clearly highlighted the advantages of integrating Bayesian optimization into the reinforcement learning workflow through a rigorous comparative analysis of DRL approaches—Deep Q-Networks (DQNs), standard PPO, and PPO optimized using Optuna. The optimized PPO model consistently performed better than simpler ones, obtaining higher cumulative rewards and showing a balanced, situation-aware strategy. This balance ensures that metro operations do not simply adopt a reactive stance but proactively anticipate and mitigate potential disruptions or service inefficiencies. Significant quantitative improvements were seen, including a balanced action selection with about a 65% frequency increase in decisions. These results show how important hyperparameter tuning is for getting the most out of DRL algorithms.
The findings of this paper extend beyond technical improvements, suggesting transformative implications for transportation network management. When DRL and Bayesian optimization are combined, they give transportation companies a smart set of data-driven tools that they can use to change their strategies on the fly to keep up high levels of service. Such adaptive systems could also have big positive effects on the environment and the economy by cutting down on maintenance tasks that are not needed, minimizing downtime, and making the best use of resources. As urban transportation systems continue to face growing pressures from rising urbanization and evolving passenger expectations, methodologies like the one proposed here offer robust, scalable solutions that can evolve alongside changing conditions and challenges.
Finally, while this study establishes a promising framework for the future of predictive maintenance and the optimization in metro systems, further exploration remains essential. In the future, more research should be conducted to include multi-agent reinforcement learning scenarios that include metro lines that are all connected to each other and larger metropolitan infrastructures. It would also be interesting to look into the deeper integration of blockchain and IoT technologies to ensure data integrity and reliability. This would build on the strong foundation that was already set by this research. By pursuing these pathways, future research can continue to contribute meaningfully to the sustainable, intelligent, and reliable transformation of transportation infrastructures worldwide.

Author Contributions

Conceptualization, M.H.R. and M.K.B.; methodology, M.K.B. and M.C.A.; software, A.E.H.; validation, A.E.H. and M.K.B.; formal analysis, A.E.H.; investigation, A.E.H.; resources, M.C.A.; data curation, A.E.H.; writing—original draft preparation, M.K.B.; writing—review and editing, M.K.B. and A.E.H.; visualization, M.H.R.; supervision, M.C.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to sincerely thank Mohammed Chaouki Abounaima for his valuable support and insights during the development of this study. His technical input and guidance throughout the research process were greatly appreciated.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  2. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  3. Zhang, W.; Yang, D.; Wang, H. Predictive maintenance using machine learning: A survey. IEEE Trans. Ind. Inform. 2018. [Google Scholar]
  4. Akiba, T.; Sano, S.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar]
  5. Lee, J.; Kao, H.-A.; Yang, S. Service innovation and smart analytics for Industry 4.0 and big data environment. Procedia CIRP 2014, 16, 3–8. [Google Scholar] [CrossRef]
  6. Chen, C.; Li, L. Energy efficiency optimization in railway systems using data-driven methods. J. Rail Transp. 2017. [Google Scholar]
  7. Smith, J.; Doe, A. Real-time resource optimization in urban transit systems using IoT. J. Transp. Res. 2023, 48, 123–139. [Google Scholar]
  8. Johnson, L. Predictive maintenance in railway systems: A review of modern approaches. Railw. Eng. Rev. 2024, 32, 45–62. [Google Scholar]
  9. Williams, P.; Brown, S. Enhancing data integrity in smart transportation systems with blockchain. Int. J. Transp. Secur. 2023, 10, 77–91. [Google Scholar]
  10. Garcia, M. Autonomous maintenance systems in metro networks: Leveraging AI and IoT technologies. J. Intell. Transp. Syst. 2024, 29, 215–230. [Google Scholar]
  11. Chen, Y.; Li, X. Energy optimization strategies in urban rail systems: A data-driven approach. Energy Transp. 2023, 15, 301–317. [Google Scholar]
  12. Société Nationale des Chemins de fer Français. Les 5 Niveaux de Maintenance. Available online: https://www.transilien.com/fr/premieres-lignes/5-niveaux-de-maintenance (accessed on 30 January 2025).
  13. Groupe SNCF. Maintenance Prédictive. Available online: https://www.groupe-sncf.com/fr/innovation/digitalisation/maintenance-predictive (accessed on 28 February 2025).
  14. SNCF Réseau. Maintenance Prédictive: État des Lieux D’une Révolution. Available online: https://www.sncf-reseau.com/fr/a/maintenance-predictive-etat-lieux-dune-revolution-0 (accessed on 1 March 2025).
  15. Actility. SNCF and IoT: How LoRaWAN Improves Predictive Maintenance. Available online: https://www.actility.com/sncf-blog/ (accessed on 1 March 2025).
  16. Wang, T.; Reiffsteck, P.; Chevalier, C.; Chen, C.-W.; Schmidt, F. Machine learning-based predictive maintenance policy for bridges. Transp. Res. Procedia 2023, 72, 1037–1044. [Google Scholar] [CrossRef]
  17. Binder, M.; Mezhuyev, V.; Tschandl, M. Predictive maintenance for railway domain: A systematic literature review. IEEE Access 2023, 11, 12345–12360. [Google Scholar] [CrossRef]
  18. Costa, G.D.A.; Davari, N.; Veloso, B.; Pereira, P.M.; Ribeiro, R.P.; Gama, J. A survey on data-driven predictive maintenance for the railway industry. Sensors 2021, 21, 5739. [Google Scholar] [CrossRef] [PubMed]
  19. Wired. The New York City Subway is Using Google Pixels to Listen for Track Defects. Available online: https://www.wired.com/story/the-new-york-city-subway-is-using-google-pixels-to-sense-track-defects (accessed on 14 February 2024).
  20. KONUX. KONUX Switch: Predictive Maintenance System for Railways. Available online: https://www.konux.com (accessed on 19 March 2025).
  21. Casino, F.; Kanakaris, V.; Dasaklis, T.K.; Moschuris, S.; Rachaniotis, N.P. Blockchain-based Predictive Maintenance: Challenges and Opportunities for Industry 4.0. IEEE Trans. Eng. Manag. 2019, 66, 268–281. [Google Scholar]
  22. Reuters. Google Cloud partners with Air France-KLM on AI technology. Available online: https://www.reuters.com/technology/artificial-intelligence/google-cloud-partners-with-air-france-klm-ai-technology-2024-12-04/ (accessed on 1 March 2025).
  23. Li, Y.; Chen, J.; Zhang, Z. Deep reinforcement learning for dynamic resource allocation in complex systems. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 1024–1037. [Google Scholar]
Figure 1. The five maintenance levels of an SNCF train.
Figure 1. The five maintenance levels of an SNCF train.
Sustainability 17 05096 g001
Figure 2. Advantages of harnessing artificial intelligence for railway operations.
Figure 2. Advantages of harnessing artificial intelligence for railway operations.
Sustainability 17 05096 g002
Figure 4. Basic components diagram of RL algorithm.
Figure 4. Basic components diagram of RL algorithm.
Sustainability 17 05096 g004
Figure 5. The distribution of train arrivals throughout the day.
Figure 5. The distribution of train arrivals throughout the day.
Sustainability 17 05096 g005
Figure 6. The stop density distribution of the data.
Figure 6. The stop density distribution of the data.
Sustainability 17 05096 g006
Figure 7. Peak vs. non-peak hour trip distribution.
Figure 7. Peak vs. non-peak hour trip distribution.
Sustainability 17 05096 g007
Figure 8. Action distribution of DQN model.
Figure 8. Action distribution of DQN model.
Sustainability 17 05096 g008
Figure 9. Reward progression over time of DQN model.
Figure 9. Reward progression over time of DQN model.
Sustainability 17 05096 g009
Figure 10. Action distribution of PPO model.
Figure 10. Action distribution of PPO model.
Sustainability 17 05096 g010
Figure 11. Reward progression over time of PPO model.
Figure 11. Reward progression over time of PPO model.
Sustainability 17 05096 g011
Figure 12. Action distribution of PPO model with OPTUNA optimizer.
Figure 12. Action distribution of PPO model with OPTUNA optimizer.
Sustainability 17 05096 g012
Figure 13. Reward progression over time of PPO model with OPTUNA optimizer.
Figure 13. Reward progression over time of PPO model with OPTUNA optimizer.
Sustainability 17 05096 g013
Table 1. Overview of Core GTFS Data Files.
Table 1. Overview of Core GTFS Data Files.
File NameDescriptionKey Columns
stop_times.txtDetails each stop along a trip, including arrival and departure times.trip_id, arrival_time, stop_id
trips.txtProvides information on trip schedules and associated routes.trip_id, route_id
stops.txtContains geographical data for each stop, such as coordinates and names.stop_id, stop_lat, stop_lon
calendar.txt & calendar_dates.txtDefines service dates and exceptions for trips.service_id, date, exception_type
routes.txtCaptures route details, including types and names.route_id, route_name
Table 2. Components of the proposed reinforcement learning (RL) environment.
Table 2. Components of the proposed reinforcement learning (RL) environment.
ComponentDefinition/Details
StateArrival time, Station ID, Latitude, Longitude, Route ID, Route type, Weekday, Peak-hour indicator, station congestion.
Action Space0: Decrease frequency, 1: maintain current frequency, 2: increase frequency
RewardAction 0: −0.5 if high congestion;
+0.5 if low congestion Action 1: +1.5 if optimal congestion; −0.2 otherwise
Action 2: +2 if high congestion; −0.5 otherwise; penalty for repetitive actions
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Rziki, M.H.; Hadbi, A.E.; Boutahir, M.K.; Abounaima, M.C. Adaptive Predictive Maintenance and Energy Optimization in Metro Systems Using Deep Reinforcement Learning. Sustainability 2025, 17, 5096. https://doi.org/10.3390/su17115096

AMA Style

Rziki MH, Hadbi AE, Boutahir MK, Abounaima MC. Adaptive Predictive Maintenance and Energy Optimization in Metro Systems Using Deep Reinforcement Learning. Sustainability. 2025; 17(11):5096. https://doi.org/10.3390/su17115096

Chicago/Turabian Style

Rziki, Mohammed Hatim, Atmane E. Hadbi, Mohamed Khalifa Boutahir, and Mohammed Chaouki Abounaima. 2025. "Adaptive Predictive Maintenance and Energy Optimization in Metro Systems Using Deep Reinforcement Learning" Sustainability 17, no. 11: 5096. https://doi.org/10.3390/su17115096

APA Style

Rziki, M. H., Hadbi, A. E., Boutahir, M. K., & Abounaima, M. C. (2025). Adaptive Predictive Maintenance and Energy Optimization in Metro Systems Using Deep Reinforcement Learning. Sustainability, 17(11), 5096. https://doi.org/10.3390/su17115096

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop