Cyclic Air Braking Strategy for Heavy Haul Trains on Long Downhill Sections Based on Q-Learning Algorithm

Abstract: Cyclic air braking is a key factor affecting the safe operation of trains on long downhill sections. However, a train's cyclic braking strategy is constrained by multiple factors, such as the driving environment, speed, and air-refilling time. A Q-learning algorithm-based cyclic braking strategy for a heavy haul train on long downhill sections is proposed to address this challenge. First, the operating environment of a heavy haul train on long downhill sections is designed, considering various constraint parameters, such as the characteristics of special operating routes, allowable operating speeds, and train tube air-refilling time. Second, the operating status and braking operations of a heavy haul train on long downhill sections are discretized in order to establish a Q-table based on state–action pairs. Algorithm training is achieved by continuously updating the Q-table. Finally, taking the heavy haul train formation as the study object, actual line data from the


Introduction
Heavy haul trains have a large transportation capacity, high efficiency, and low transportation costs; thus, they have received widespread attention from countries worldwide. To control speed when heavy haul trains operate on long downhill sections, the braking system must repeatedly engage cyclic air braking [1]. The existing air braking strategy mainly relies on the conductor's experience, which is insufficient for meeting the safety and efficiency requirements of heavy haul train operation [2]. Therefore, an intelligent control strategy must be developed to improve the air braking performance of heavy haul trains on long downhill sections [3].
With the rapid development of heavy haul trains, scholars have recently conducted extensive research on the cyclic braking of heavy haul trains on long downhill sections. Related methods can be summarized as mechanistic model-, machine learning-, and reinforcement learning-based methods.
In terms of mechanistic model-based methods, a neural network-based air braking model was proposed to accurately predict pressure changes in the key components of train air braking systems [4]. In [5], a new hybrid long short-term memory (LSTM) model was developed to describe changes in the control force. In [6], an LSTM model with delayed information was constructed to address the inability of deep learning models to explain the impact of model inputs on system outputs. A real-time slope estimation model based on Kalman filtering was constructed for the electric and air braking systems of heavy haul trains [7]. Traditional physics-driven models usually fail to reflect the "true" dynamics of heavy haul trains because of the strong nonlinearity and uncertainty in the mechanistic model caused by air resistance, frequently switching working conditions, and variations in external influences such as weather and temperature. During heavy haul train operations, large amounts of data are accumulated, supporting research on data-driven cyclic air braking strategies for heavy haul trains.
In terms of machine learning-based research, an intelligent driving strategy for heavy haul trains based on expert knowledge and machine learning was proposed to determine feasible air pressure reductions and the exact times to apply and release the air brakes [8]. In [9], an optimization model for the operation of heavy haul trains was established, achieving optimal control while maximizing operating distance and minimizing air braking time. In [10], to address the severe imbalance in the proportions of operating data for heavy haul trains under different working conditions, a random forest algorithm was used to extract data and establish a model for automatic air braking. In [11], based on a train dynamics model, the model parameters, including energy consumption, running time, and pneumatic braking distance, were optimized, and the artificial bee colony (ABC) algorithm was introduced to find reasonable switching points between different states.
Reinforcement learning can handle large-scale state spaces and dynamically changing environments and is characterized by strong real-time decision-making ability; it has therefore received widespread attention in studies on braking strategies for heavy haul trains. In [12], an operation optimization method for long downhill sections suitable for long-formation heavy haul trains was developed to improve the braking performance of 20,000-ton heavy haul trains. In [13], a deep reinforcement learning method with a reference system was constructed, which satisfies the constraints on speed, time, and position during train operation and reduces the tracking errors of reinforcement learning. In [14], a double-switch Q-network (DSQ network) architecture was designed to solve the problem of the optimal control of multiple electric locomotives in heavy haul trains. However, fully using the massive amounts of data generated by trains during operation remains a key issue for reinforcement learning methods.
The Q-learning algorithm is a widely recognized and extensively used reinforcement learning method. It not only boasts a solid theoretical foundation, but also features a relatively simple application process. Additionally, it has demonstrated excellent performance in numerous practical scenarios, providing strong practical support for its utilization in the field of heavy haul train braking. Significantly, the Q-learning algorithm has a unique advantage in handling discrete action spaces, making it well-suited to address the challenges faced by heavy haul trains operating on long downhill sections. Based on these comprehensive considerations, we developed a cyclic air braking strategy for heavy haul trains on long downhill sections based on the Q-learning algorithm. The main contributions of this study are as follows: (1) A heavy haul train model with operational constraints was constructed, considering the vehicle's characteristics on long and steep slopes of railway lines, as well as heavy haul trains equipped with traditional pneumatic braking systems. In addition, with the optimization objectives of safe train operation and operational efficiency, a Q-learning algorithm-based cyclic braking strategy for heavy haul trains on long downhill sections was developed under constraints such as interval speed limits and air-refilling time. (2) Simulations and experiments were conducted under actual heavy haul train operating conditions, and the experimental results were compared under different parameters and ramp speeds. The experimental results showed that the proposed intelligent control strategy performs well in various scenarios, demonstrating its effectiveness and practicality in train braking.
The rest of this article is organized as follows: A model for heavy haul trains operating on long downhill sections is described in Section 1, introducing constraints on train operation and the performance indicators of train operation. Section 2 introduces the method of cyclic air braking for heavy haul trains based on the Q-learning algorithm. The effectiveness and robustness of the proposed method were verified through simulation experiments, as described in Section 3. Finally, Section 4 provides a summary of the study and outlines prospects for future research.

Dynamics Model
During the operation of heavy haul trains, various factors, such as track gradient, train formation, and on-board mass, exert diverse forces on each train. However, in this study, the interaction forces between carriages were not considered in the calculation of additional resistance. As a result, the forces acting on the train during operation primarily comprise locomotive traction, braking force (including electric and pneumatic braking), basic running resistance, and additional resistance, and the mathematical expression of the train model follows the principles of Newtonian dynamics. Generally, the running resistance R_F encountered by a heavy haul train during braking on a long and steep downhill slope is mainly composed of the basic resistance R_M and the additional resistance R_L. These resistances depend on the operating speed of the heavy haul train as well as its physical characteristics [15].
The formula for calculating the basic resistance of a heavy haul train follows previous research [16]. The additional resistance is determined by the slope force R_g, the curvature resistance R_c, and the tunnel resistance R_t [16], as shown in Equation (4). The specific calculations [17] of these factors are given in Equation (5).
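The resistance terms above can be sketched in code. The snippet below assumes a Davis-type quadratic for the basic resistance and common Chinese-railway approximations for the curve (600/R) and tunnel (0.00013·L_t) unit resistances; all coefficient values are illustrative placeholders, not the fitted values of Equations (4) and (5).

```python
G = 9.81  # gravitational acceleration, m/s^2

def basic_resistance(v, m, a=0.92, b=0.0048, c=0.000125):
    """Davis-type unit basic resistance (N/kN) converted to a force in kN.
    v: speed in km/h, m: train mass in tonnes. Coefficients are placeholders."""
    w0 = a + b * v + c * v * v          # unit resistance, N/kN
    return w0 * m * G / 1000.0          # total basic resistance, kN

def additional_resistance(m, i_permille, curve_radius=None, tunnel_len=None):
    """Slope, curvature, and tunnel resistance in kN. The 600/R curve rule and
    the 0.00013*L tunnel rule are common approximations, not the paper's."""
    r_g = m * G * i_permille / 1000.0                              # slope force
    r_c = (600.0 / curve_radius) * m * G / 1000.0 if curve_radius else 0.0
    r_t = (0.00013 * tunnel_len) * m * G / 1000.0 if tunnel_len else 0.0
    return r_g + r_c + r_t
```

On a downhill section the gradient is negative, so the slope term acts as an accelerating force, which is what makes cyclic braking necessary.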
To facilitate the understanding, the main symbols are introduced in Table 1.

Running Constraints
The aim of this study on the cyclic air braking of heavy haul trains on long and steep downhill sections is, essentially, to solve a multi-constraint, multi-objective optimization problem. Considering the actual requirements of train driving control and model design, the running constraints set in this study were as follows: When a train operates on a long and steep downhill section, cyclic braking is adopted for speed control. To ensure sufficient braking force in the next braking cycle, sufficient time is required to refill the brake pipe to full pressure [18]. That is, the duration of the release phase must not be less than the minimum air-refilling time T_a specified by the operating procedures: t_b^(j+1) − t_r^j ≥ T_a,
where t_b^(j+1) represents the time at which the air brake is engaged in the (j + 1)-th cycle, and t_r^j indicates the time at which the air brake is released in the j-th cycle. T_a is closely related to the train formation and the pressure drop in the brake pipe. For fixed train parameters, the refilling time under a given pressure drop must be determined; a longer train and a larger pressure reduction generally require a longer air-refilling time.
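As a minimal sketch, the refilling constraint can be checked as follows (time points in seconds; the function name is ours, not the paper's, and T_a would in practice be looked up from the train formation and pressure drop):

```python
def refill_ok(t_b_next, t_r_prev, T_a):
    """Check the air-refilling constraint t_b^(j+1) - t_r^j >= T_a:
    the release phase must last at least the minimum refilling time T_a."""
    return (t_b_next - t_r_prev) >= T_a
```

Any candidate braking plan whose release phase is shorter than T_a is infeasible and must be rejected by the controller.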
To ensure safety, the speed of a heavy haul train cannot exceed the speed limit V_max at any point on long and steep downhill sections. This value often depends on the infrastructure of the railway line or temporary setups during operation. Additionally, the train speed must be greater than the minimum air brake release speed V_rmin, which is specified as 40 km/h for a 10,000-ton heavy haul train formation [19]. Therefore, the speed should satisfy V_rmin ≤ V ≤ V_max. Regarding the optimization objectives of this study, the operation of a heavy haul train on long downhill sections is also constrained by the relative output ratio of the braking force [20]. The two main types of braking devices used for heavy haul trains are variable resistance and pneumatic braking systems. Variable resistance braking, also known as regenerative braking, can feed energy back to other locomotives to provide power. Pneumatic braking systems produce braking force by reducing the air pressure in the train's brake pipe [21].
The output electric brake force B_1 of a heavy haul train depends on the maximum electric brake force u_dmax(v) and the relative output ratio b_d. The output pneumatic braking force depends on whether air braking is applied. Therefore, a train's braking force can be expressed as B = b_d · u_dmax(v) + b_a · u_a, where the maximum electric brake force u_dmax is a piecewise function of the operating speed [22], and the air braking force u_a is a function of the air pressure drop [23].
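A hedged sketch of this braking-force composition follows. The knee speed, force level, and fixed air-brake force are hypothetical stand-ins for the piecewise u_dmax(v) curve of [22] and the pressure-drop-dependent u_a of [23].

```python
def electric_brake_max(v):
    """Illustrative piecewise maximum electric brake force u_dmax(v) in kN:
    constant up to 50 km/h, then constant-power (hyperbolic) above it.
    The 460 kN level and 50 km/h knee are hypothetical values."""
    return 460.0 if v <= 50.0 else 460.0 * 50.0 / v

def total_brake_force(v, b_d, air_on, u_a=300.0):
    """Total braking force: relative output ratio b_d in [0, 1] applied to
    u_dmax(v), plus a fixed air-brake force u_a (placeholder) when engaged."""
    return b_d * electric_brake_max(v) + (u_a if air_on else 0.0)
```

For example, at 40 km/h with a 50% electric-brake ratio, the electric contribution is 230 kN, and engaging the air brake adds the full u_a on top.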

Performance Indicators
This study mainly focuses on the safety and maintenance cost of the heavy haul train operation process. The maintenance cost is expressed as the air-braking distance. Hence, two indicators were used to evaluate the control of the heavy haul train.

• Safety: Safety is a prerequisite for train operation. The running speed of a heavy haul train must be kept under the upper limit but cannot be lower than V_rmin. Here, the indicator K is defined to show whether the train's speed remains within the speed limits.
• Air-braking distance: Excessive wear is caused by the friction between the wheels and brake shoes when the air brake is engaged over a long distance, and the replacement of air brake equipment increases maintenance costs. Reducing the air brake distance during operation therefore reduces the maintenance cost. Hence, the air brake distance L_a of a heavy haul train is defined as the total distance travelled with the air brake applied.
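The two indicators can be computed from a simulated trajectory, as in the following sketch, which assumes the trajectory is sampled as position and speed sequences (the function names and sampling scheme are our assumptions):

```python
def safety_indicator(speeds, v_max=80.0, v_rmin=40.0):
    """K = 1 if the whole speed profile stays within [v_rmin, v_max], else 0."""
    return int(all(v_rmin <= v <= v_max for v in speeds))

def air_brake_distance(positions, air_flags):
    """Air-braking distance L_a: sum of the position increments covered
    while the air brake was engaged (air_flags[i] covers increment i)."""
    return sum(p2 - p1
               for (p1, p2), on in zip(zip(positions, positions[1:]), air_flags)
               if on)
```

Minimizing L_a while keeping K = 1 is exactly the trade-off the reward function in the next section encodes.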

Algorithm Design
Reinforcement learning is a machine learning approach tailored for goal-oriented tasks [24]. Unlike traditional methods, reinforcement learning does not instruct the agent on how to act, but rather guides the agent through interactions with the environment to learn the correct strategies [25]. In this section, we first define the train operation process as a Markov decision process (MDP). Second, we describe a control algorithm based on Q-learning that learns the cyclic braking strategy for long downhill sections.

Markov Decision Process
Before applying the Q-learning algorithm, the process of controlling train operation on a long and steep downhill slope needed to be defined as a Markov decision process (MDP), that is, the formalization of sequential decision-making [24,26]. A schematic diagram of the MDP interaction of a heavy haul train running on a long and steep downhill slope is shown in Figure 1.

Action
A control command of the locomotives.

Agent
An on-board controller that makes control decisions.

Reward
An evaluation signal from the environment.

State
Train operation position, speed and operation time.

Environment
Train dynamics and rail infrastructure settings.

As shown in Figure 1, the MDP consists of five elements: the agent, environment, action, state, and reward. A heavy haul locomotive is defined as an agent that makes control decisions. The heavy haul train dynamics and railway infrastructure settings are defined as the environment. During the interaction process, the agent performs actions based on the environment, and the environment responds to the agent with new heavy haul train states and reward signals based on operational constraints. Therefore, location, speed, and operating time are defined as the states of the heavy haul train.
s_k = [P_k, V_k, T_k], where s_k is the status of the heavy haul train at step k, P_k is the position of the train, V_k is the train's speed, and T_k is the train's running time.
The control action is defined as the setting of the relative electric brake force and the air brake notch, i.e., a_k = [b_a^k, b_d^k], where b_a^k is a binary variable representing the air brake control command, and b_d^k is the relative output ratio of the electric brake force output by the train locomotive, which is limited by the constraint condition in Equation (8).
The control output of a heavy haul train in each period is determined only by the speed, position, and operating time of the train. Thus, the process of controlling a heavy haul train can be exactly defined as a Markov decision process in the reinforcement learning framework [26].
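A minimal sketch of how the continuous state and the two-component action could be discretized for a Q-table; the bin sizes and the eleven electric-brake levels are our assumptions, not values from the paper:

```python
def discretize_state(pos, speed, time, dp=100.0, dv=1.0, dt=50.0):
    """Map the continuous state (position, speed, time) onto a discrete grid
    so it can index a Q-table row; bin sizes dp, dv, dt are assumptions."""
    return (int(pos // dp), int(speed // dv), int(time // dt))

# One action = (air-brake command b_a in {0, 1}, electric-brake ratio b_d),
# with b_d discretized here into eleven levels 0.0, 0.1, ..., 1.0.
ACTIONS = [(b_a, round(level * 0.1, 1)) for b_a in (0, 1) for level in range(11)]
```

The Q-table then has one entry per (discretized state, action) pair, which is what makes a tabular method feasible here.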

Q-Learning Algorithm
In this section, the Q-learning algorithm-based intelligent control method is described for heavy haul trains operating on steep downhill slopes. The Q-learning algorithm is a reinforcement learning algorithm that learns in an environment without prior knowledge. Based on the principle of temporal difference control, the agent continuously updates the Q-value function through interactions with the environment. Using the Q-value as the evaluation criterion, the algorithm iteratively seeks the optimal action to maximize the expected total reward obtained during the interaction with the environment. The iteration process of the Q-learning algorithm involves learning the optimal actions from the Markov decision process (MDP). In a single simulation process, Q-learning updates the Q-values in real time to form new strategies for the next simulation, as shown in the control process diagram in Figure 2.
(1) Randomly initialize Q(s, a), ∀s ∈ S, a ∈ A(s).
(2) According to the ε-greedy policy π and the current state s, action a is selected from the Q-table. Execute action a as determined by the decision-making process; then, obtain the reward value r by interacting with the environment and proceed to the next state.
Update the Q-table, Q(s, a) ← Q(s, a) + λ[r + γ max_a' Q(s', a') − Q(s, a)], and set s → s'; continue until the termination state is reached. (3) By following this procedure, after multiple iterations, both the optimal policy and the optimal state–action value function can be obtained.
When the number of algorithm iterations reaches a preset quantity, the termination condition is met. The final policy is then no longer generated by the greedy exploration policy; instead, the optimal policy is formed by selecting, at each time step, the action with the optimal Q-value for the corresponding state.
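Steps (1)-(3) can be summarized as a tabular Q-learning loop. The sketch below uses the paper's λ (learning rate), γ (discount factor), and ε (exploration rate), but a generic environment interface, since the train-dynamics environment itself is not reproduced here:

```python
import random
from collections import defaultdict

def q_learning(env_step, env_reset, actions, episodes=500,
               lam=0.001, gamma=0.95, eps=0.1):
    """Tabular Q-learning: lam is the learning rate (the paper's lambda),
    gamma the discount factor, eps the exploration rate.
    env_step(s, a) -> (next_state, reward, done) is a stand-in for the
    train-dynamics environment described above."""
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0
    for _ in range(episodes):
        s = env_reset()
        done = False
        while not done:
            if random.random() < eps:                     # explore
                a = random.choice(actions)
            else:                                         # exploit
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env_step(s, a)
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            # temporal-difference update of the state-action value
            Q[(s, a)] += lam * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

Here env_step plays the role of the train dynamics plus the constraint checks; in the paper's setting it would return the next discretized train state and the reward signal.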

Policy Design
To ensure that the algorithm balances exploration and exploitation, an ε-greedy policy is adopted to define the agent's behavior at a given time step. Formally, the policy is a function that outputs the probability of selecting each possible action based on the Q-function: the greedy action a* = argmax_a Q(s, a) is chosen with probability 1 − ε + ε/|A(s_k)|, and each other action with probability ε/|A(s_k)|, where |A(s_k)| is the number of actions in the action set when the state is s_k and ε ∈ (0, 1). Specifically, using the ε-greedy policy to select control actions during train operation involves randomly choosing actions with probability ε and adopting the action with the highest estimated Q-value with probability 1 − ε. This approach enhances the algorithm's global search capability.
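The ε-greedy action probabilities described above can be written out directly; this is the standard formulation, in which every action receives ε/|A| and the greedy action receives an extra 1 − ε:

```python
def egreedy_probs(q_row, eps=0.1):
    """Action probabilities under the epsilon-greedy policy for one state:
    each of the n actions gets eps/n, and the greedy action a* gets an
    additional (1 - eps). q_row holds the Q-values of that state's actions."""
    n = len(q_row)
    a_star = max(range(n), key=lambda a: q_row[a])
    return [eps / n + (1.0 - eps if a == a_star else 0.0) for a in range(n)]
```

The probabilities sum to one by construction, and shrinking ε over training would shift the policy from exploration toward pure exploitation.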

Reward Function Design
The optimization goal of the reinforcement learning problem is reflected by the reward function. For the train control process in question, to ensure safe operation, the operating speed cannot exceed the upper limit; therefore, the constraint in Equation (10) must be satisfied. If the speed is higher than the upper limit V_max or lower than the minimum release speed V_rmin, a negative reward R_c is given to the agent. If the air brake is engaged by the heavy haul train at step k, a zero reward is given. Otherwise, a positive reward R_d is given to encourage the release of the air brake. Algorithm 1 summarizes the control method for heavy haul trains based on the Q-learning algorithm.
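A sketch of this reward shaping; the penalty and bonus magnitudes R_c and R_d are placeholders, since the paper's numeric values are not given in this excerpt:

```python
def reward(speed, air_on, v_max=80.0, v_rmin=40.0, R_c=-10.0, R_d=1.0):
    """Step reward: penalty R_c outside the safe speed band [v_rmin, v_max],
    0 while the air brake is applied, and R_d otherwise. R_c and R_d
    magnitudes are assumptions."""
    if speed > v_max or speed < v_rmin:
        return R_c
    return 0.0 if air_on else R_d
```

Under this shaping, maximizing cumulative reward simultaneously keeps the speed profile inside the limits (safety indicator K) and minimizes time spent with the air brake engaged (air-braking distance L_a).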

Algorithm Simulation and Analysis
This section describes the simulation experiments that were conducted using real data from the Shuozhou–Huanghua Railway in China. First, the setup of the experimental parameters and the data are introduced. Second, the experimental results are presented and analyzed in three main parts: the model training process, effectiveness testing in practical applications, and robustness testing of the algorithm.

Experimental Parameter Settings
To validate the effectiveness of the proposed intelligent control algorithm, simulation experiments were conducted using a "1 Locomotive + 100 Wagons" formation in combination with route data from the Shuohuang Railway in China. Our aim was to obtain a speed tracking curve for a heavy haul train on long downhill sections. The train consisted of HXD1 electric locomotives and C80 freight cars, with the specific train parameters shown in Table 2. The total length of the train route was S = 20,000 m, and the slope of the route mostly ranged from 10‰ to 12‰, which complies with the requirements of the Technical Management Regulations for Chinese Railways for long downhill sections. Additionally, the speed limit on this route was 80 km/h; specific route data are provided in Table 3. The hyperparameters of the Q-learning algorithm were set as shown in Table 4.

Using the parameter settings described above, the proposed Q-learning algorithm was validated through simulation experiments. During this study, the learning rate λ of the Q-learning algorithm was defined, and the sensitivity of this parameter was analyzed. Three groups of experiments were set up, with λ set to 0.0001, 0.001, and 0.01, respectively, and the iterative Q-learning process in each group was observed. The initial speed V0 of the heavy haul train entering the long downhill section was set to 40 km/h, and the other hyperparameters were set according to Table 4. The more interactions between the reinforcement learning agent and the environment, the richer the experience and the more accurate the strategies. During training, the agent and the environment interacted 1 million times over 100,000 episodes. For each episode, the total reward value corresponding to the solution generated from the Q-values was recorded. The cumulative reward change curves of the optimized algorithm are shown in Figure 3, with the best training performance achieved in the experiment depicted in Figure 3b. Compared with the other two groups of experiments, when λ = 0.001, the cumulative reward change curve of the Q-learning algorithm exhibited a faster and more stable convergence rate, as well as a higher convergence value. Therefore, we recommend setting the learning rate λ to 0.001 during training. Owing to the ε-greedy policy in the Q-learning algorithm, the agent initially explores randomly during training, and action selection during decision-making is random. Consequently, the optimization space for the Q-values is large, resulting in relatively small reward values and optimization effects. As exploration proceeds, the agent gradually learns the correct braking strategy, and the cumulative reward value continuously increases. As training progresses, the control policy optimization of the Q-learning algorithm stabilizes and approaches the optimal state. The agent tends to adopt the optimal action with the maximum Q-value, leading to a stable cumulative reward curve, which indicates convergence of the Q-learning algorithm's iterations.

Effectiveness Testing of Practical Application
After training the Q-learning algorithm, the effectiveness of the control algorithm in periodically braking the train was verified, considering different entry states of the train and the state transition process. The Q-values corresponding to different entry states and state transitions were read directly from the Q-table. The speed tracking curves of the train under different entry speeds are shown in Figure 4. When the entry speed is 30 km/h, the train adopts a three-cycle braking optimization strategy through reinforcement learning training. Similarly, when the entry speed is 40 km/h, the train also adopts a three-cycle braking optimization strategy. However, when the entry speed is 50 km/h, due to the increase in entry speed, the train adopts a four-cycle braking optimization strategy.

Additionally, according to the simulation results, when the train runs on a long downhill section, air braking tends to be applied at the maximum speed limit during the cyclic braking process, and the braking is released at the appropriate time. Despite the train entering the downhill section at the three different entry speeds mentioned above, the running speed of the train increases. This is because, on long downhill sections, trains tend to initially maintain a coasting state to save energy and ensure a higher running speed, and then apply braking at the appropriate time.

Figure 4 shows that, after training, for the three entry speeds mentioned above, the reinforcement learning agent can control train speed by applying air braking before reaching the maximum speed limit. This ensures that the train remains within a safe operating speed range until exiting the section. This indicates that the Q-learning algorithm can effectively train agents to develop good control strategies, keeping the train speed within the speed limits and maintaining a relatively high average speed. This validates the effectiveness of the algorithm.

Performance Comparison Experiment
To verify the robustness of the Q-learning algorithm in controlling heavy haul trains through cyclic braking on long downhill sections, the optimized results for different entry speeds were compared.The key parameters of the Q-learning algorithm were set as follows: the learning rate λ was 0.001; the maximum number of iterations M was 100,000; the discount factor γ was 0.95; the exploration rate ε was 0.1; and the state transition time interval Δt was 50.
Table 5 shows that, under different conditions, the Q-learning algorithm with the same hyperparameter settings, aiming to optimize air braking distance, running time, and running efficiency, shows robustness in braking distance and braking efficiency when heavy haul trains perform cyclic air braking on long downhill sections.


Figure 1. Schematic diagram of MDP interaction during operation of heavy haul trains.

Figure 2. Flowchart of the controller for the Q-learning algorithm.

Figure 3. The cumulative reward change curves of the algorithm for different learning rates: (a) the learning rate λ of the Q-learning algorithm is 0.0001; (b) the learning rate λ of the Q-learning algorithm is 0.001; (c) the learning rate λ of the Q-learning algorithm is 0.01.

Figure 4. Periodic braking strategies of heavy haul trains at different entry speeds.

Table 1. Description of symbols used in the train model.

t_r^j: time point of releasing the air brake in the j-th cycle
i: gradient of the track on which the train is running
V_max: upper limit of train running speed