1. Introduction
The urban subway is the backbone of urban public transportation and the main artery of passenger flow. As a key infrastructure system for keeping urban traffic fast and smooth, it plays an important role in improving residents' quality of life and promoting social and economic development. With the rapid development of urban subways and the continuous expansion of network scale, their sustainable development has attracted increasing attention. Although urban subways consume less energy per unit of traffic than other forms of public transportation, the scale of their operations is enormous, so total energy consumption has grown dramatically as networks expand. Traction energy for train operation accounts for a high proportion of this consumption: urban subway trains use on-board inverters to convert the DC power of the power supply system into AC power for the traction motors that drive the train, which requires a large amount of electricity. At the same time, auxiliary facilities such as lighting, air conditioning, and electric heating inside the carriages also draw energy from the traction power supply system. In this situation, optimizing train operation energy consumption while maintaining comfort and punctuality is crucial for train operations.
At present, many scholars have carried out research on the train operation optimization problem using heuristic methods such as genetic algorithms and differential evolution. The optimization process is roughly divided into two steps: first, optimize the train speed profiles, and then track the optimized speed profiles during operation. Manual tuning, speed profiles, and operational constraints have been considered in existing work to improve performance. However, these solutions may leave a gap between the control model and the actual operating process, thereby affecting the accuracy of the control model [1]. Reinforcement learning offers a feasible way around this problem: it can optimize the train operation control strategy directly, without the train speed profiles, and thus is free of the limitations of offline speed profiles and of the need for precise model information about the train.
In this paper, we propose an optimal control algorithm for subway train operation based on Proximal Policy Optimization (PPO) to reduce train operation energy consumption, ensure passenger comfort, and meet the running-time requirements of the train schedule. The position and speed of the train form the reinforcement learning state, energy consumption and comfort are the optimization goals, the train running time is treated as a hard constraint, and the sequence of train operation control outputs is optimized with PPO.
This study is organized as follows:
Section 2 presents a literature review about optimal control algorithms for subway train operation from the perspective of heuristic methods and reinforcement learning methods.
Section 3 introduces the optimal control algorithm for subway train operation based on PPO, including the original PPO algorithm and the combination between PPO and the train operation model.
Section 4 discusses the results of the PPO-based optimal control algorithm for subway train operation based on real data.
Section 5 summarizes the innovation of this study and provides directions for future research.
2. Literature Review
Since the 1960s, scholars abroad have studied the energy-saving optimization of train operation [2,3]. These results laid the foundation for subsequent research on train operation control optimization. However, because computing resources were limited at the time and the models were relatively simple, the information about the actual operating environment was not sufficiently captured, and the results could not be widely applied to actual train operation control.
With the development of technology, the operation control of subway trains has gradually evolved from the fixed-block, non-automatic train operation mode to semi-automatic train operation, attended automatic train operation, and unattended automatic train operation. Automatic speed control is one of the core technologies of automatic subway train operation. At present, automatic train operation (ATO) technology is applied in subway systems as follows: it first generates the speed profiles used to guide the train according to the safety constraints and the basic information of the line; it then uses speed-tracking technology to follow these profiles, combining them with the current running state of the train (speed, position, etc.) to calculate the control output required from the traction drive system, which is sent to the train interface unit to complete the operation of the train.
In this context, many scholars have studied the train operation optimization problem from the perspective of optimizing speed profiles. Gao et al. [4] systematically introduced the energy-saving approaches of the urban subway system from three aspects, namely train speed profile optimization, renewable energy utilization, and train schedule optimization combined with speed profiles, and clarified future research directions for urban subway energy conservation. Ahmadi et al. [5] proposed an energy-saving solution that satisfies the operational constraints: first, the optimal speed distribution of a single train is obtained by taking the net energy consumption of the train and the travel time between stations as the objective function; then, by allocating travel time among stations and adopting the predetermined optimal speed distribution, the total grid input power is minimized. De Martinis et al. [6] proposed an optimization framework for the definition and evaluation of energy-saving speed curves based on supply design models, considering the possible impact of the speed profiles on subway flow, which makes it possible to fully account for the business requirements of the service when evaluating the optimal speed profiles. Amrani et al. [7] proposed a speed profile optimization scheme based on a genetic algorithm, taking the physical constraints of the train and the operational constraints into consideration. On the premise of utilizing renewable energy, Zhou et al. [8] proposed an improved train control model that obtains the optimized speed profiles by combining timetable optimization with single-train operation optimization, considering the change in train mass caused by passengers boarding and alighting. Anh et al. [9,10] used an optimization method based on Pontryagin's maximum principle that not only finds the candidate switching points among the three operating phases of acceleration, coasting, and braking but also determines the optimal switching point among them; it yields good speed profiles while ensuring a fixed travel time, and nonlinear programming is used to determine the travel time by introducing a Lagrangian multiplier into the objective function as a time constraint. Considering the longitudinal section design, Kim et al. [11] proposed a multi-stage decision-making model to jointly optimize the longitudinal section design profile, the cruising speed, and the coasting switching point of the route to improve the energy efficiency of urban subway operation. Sandidzadeh et al. [12] used a Tabu Search (TS) algorithm to optimize the train speed profiles. Scheepmaker et al., based on the optimal allocation of running time over multiple stations for a single train [13], combined regenerative braking with mechanical braking to derive the optimal control structure and studied the impact of different speed limits on different driving strategies [14]. De Sousa et al. [15] proposed a metro energy regeneration model that controls stops and train departures throughout the journey and feeds the energy from regenerative braking back into the drive system, optimizing power consumption and improving efficiency.
Many scholars have also studied train speed profile optimization from other angles. Considering the actual locations of the substations, Chen et al. [16] modeled the traction power supply system and combined it with the optimal train speed profiles and the travel time distribution between stations to minimize energy consumption at the substation level. Su et al. [17] considered the speed tracking problem for multi-objective speed profile optimization and proposed a numerical algorithm for solving the train energy consumption control problem with a given travel time; on this basis, following the ATO control principle, the target speed profile optimization method guides the train to output the optimal control sequence. Pu et al. [18] constructed a comprehensive optimization method for train speed tracking through speed profiles and a fuzzy PID controller design, combined with the NSGA-II algorithm. Based on the analysis of train operating conditions and the operating environment, Zhu et al. [19] established a multi-objective model of automatic train operation using a multi-objective decision-making method with a penalty function; they used genetic algorithms to solve the model and then designed a fuzzy controller to achieve speed tracking control. Wu et al. [20] proposed an improved traction energy consumption evaluation model that considers the working efficiency of the traction motor under different conditions and the efficiencies of the inverter and gearbox, and then transformed speed profile optimization into a multi-stage decision problem solved by dynamic programming. Liu et al. [21] designed a moving horizon optimization algorithm that obtains the speed profile online and applied sequential quadratic programming (SQP) to a multi-objective optimization problem with several nonlinear constraints, realizing online optimization of speed profiles. In most urban subway operation optimization problems, the train is modeled as a single particle; Wang et al. [22] proposed a multi-particle train model for urban rail train speed profile optimization that takes the train length into account. In addition, some scholars have carried out research on the optimization of train operation profiles [23,24,25,26,27].
3. Methodology
In this section, we build the train dynamic model, including the traction/braking characteristics of the subway train. We then propose an optimal control algorithm for subway train operation based on PPO. The algorithm optimizes the operation control by using the dynamic model and considering parameters of the operating environment, such as the line slope.
3.1. The Train Dynamic Model
Figure 1 shows the force analysis of the train while it is running: $f$ is the traction force output by the train's traction system; $w_g(p)$ is the additional gradient resistance caused by the gradient of the track, where $p$ is the position of the train on the track; $w_r(v)$ is the resistance produced by friction, where $v$ is the running speed of the train; and $b$ is the braking force generated by the braking system of the train.
As shown in Figure 1, the resistance acting on the running train consists of $w_g(p)$ and $w_r(v)$. Taking the traction and braking forces into account, the kinematic equation of the train is given by Equation (1):
$$M(1+\rho)\,\frac{dv}{dt} = f - b - w_g(p) - w_r(v), \tag{1}$$
where $M$ is the train mass and $\rho$ is the rotating mass coefficient of the train, whose value is determined by the total mass of the train and the converted mass of its rotating parts.
The energy consumption is calculated by Equation (2):
$$E = \int_{0}^{T}\frac{f(t)\,v(t)}{\eta_t}\,dt + \int_{0}^{T}P_{aux}(t)\,dt - \int_{0}^{T}\eta_b\,b(t)\,v(t)\,dt, \tag{2}$$
where $T$ is the travel time of the train, $P_{aux}(t)$ is the power of the train auxiliary systems (such as lights and air conditioners) at time $t$, $\eta_t$ is the efficiency of converting electrical energy into mechanical energy when the traction system is working, and $\eta_b$ is the efficiency of converting mechanical energy into electrical energy during regenerative braking.
The first term of Equation (2) is the electrical energy consumed by the traction system of the train, the second term is the energy consumed by the auxiliary systems, and the third term is the energy recovered through regenerative braking. Compared with the energy consumed by the traction system, the auxiliary energy consumption during train operation is very small, so this paper takes the energy consumed by the traction drive system as the optimization object; that is, the optimization objective is given by Equation (3):
$$E_{tr} = \int_{0}^{T}\frac{f(t)\,v(t)}{\eta_t}\,dt - \int_{0}^{T}\eta_b\,b(t)\,v(t)\,dt. \tag{3}$$
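As an illustration, the following Python sketch numerically integrates Equations (2) and (3) from sampled force, speed, and auxiliary-power trajectories. The function names, the efficiency values, and the trapezoidal discretization are our own assumptions for illustration, not part of the original model.

```python
import numpy as np

def traction_energy(f, b, v, dt, eta_t=0.9, eta_b=0.6):
    """Numerically integrate Equation (3): traction energy minus regenerated energy.

    f, b, v : arrays of traction force [N], braking force [N], speed [m/s]
    dt      : sampling interval [s]
    eta_t, eta_b : assumed traction / regeneration efficiencies
    """
    consumed = np.trapz(f * v / eta_t, dx=dt)      # first term of Eq. (2)
    regenerated = np.trapz(eta_b * b * v, dx=dt)   # third term of Eq. (2)
    return consumed - regenerated                  # Joules

def total_energy(f, b, v, p_aux, dt, eta_t=0.9, eta_b=0.6):
    """Equation (2): traction energy plus auxiliary energy minus regeneration."""
    return traction_energy(f, b, v, dt, eta_t, eta_b) + np.trapz(p_aux, dx=dt)
```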
The additional gradient resistance $w_g(p)$ and the friction resistance $w_r(v)$ are calculated by Equations (4) and (5), respectively:
$$w_g(p) = M\,g\,\sin\theta(p), \tag{4}$$
$$w_r(v) = M\,(a_0 + a_1 v + a_2 v^2), \tag{5}$$
where $g$ is the gravitational acceleration, $\theta(p)$ is the gradient of the line at position $p$, and $a_0$, $a_1$, and $a_2$ are the basic resistance coefficients.
In Equation (4), the gradient $\theta(p)$ is measured in radians. To ensure the safety of train operation, the gradient of the line is strictly limited to a small range, so we can make the approximation in Equation (6):
$$\sin\theta(p) \approx \theta(p). \tag{6}$$
Equation (4) can then be rewritten as Equation (7) for more convenient calculation:
$$w_g(p) \approx M\,g\,\theta(p). \tag{7}$$
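A minimal Python sketch of the resistance and acceleration calculations in Equations (1), (5) and (7) is given below; the coefficient values, the rotating mass coefficient, and the variable names are placeholders for illustration, not data from this paper.

```python
G = 9.81          # gravitational acceleration [m/s^2]
RHO = 0.08        # assumed rotating mass coefficient
A0, A1, A2 = 2.0e-3, 3.0e-5, 6.0e-7   # assumed basic resistance coefficients

def gradient_resistance(mass, theta):
    """Equation (7): w_g = M * g * theta, using the small-angle approximation of Eq. (6)."""
    return mass * G * theta

def friction_resistance(mass, v):
    """Equation (5): Davis-type basic running resistance."""
    return mass * (A0 + A1 * v + A2 * v * v)

def acceleration(mass, v, theta, f, b):
    """Equation (1): M * (1 + rho) * dv/dt = f - b - w_g - w_r."""
    net = f - b - gradient_resistance(mass, theta) - friction_resistance(mass, v)
    return net / (mass * (1.0 + RHO))
```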
In addition to energy consumption, passenger comfort is taken as the other optimization objective. It is defined as the negative integral of the absolute jerk (the rate of change of acceleration) over time, as shown in Equation (8):
$$C = -\int_{0}^{T}\left|\frac{du(t)}{dt}\right|dt, \tag{8}$$
where $u$ is the acceleration of the train.
Equations (3) and (8) are the objectives to be optimized. In addition to the dynamic characteristics of the subway train in Equation (1), further constraints such as the travel time and the speed limit must be considered. The multi-objective optimization problem for train operation is then formulated as Equation (9):
$$\min_{f,\,b}\; J = E_{tr} - C \quad \text{s.t.}\quad \text{Equation (1)},\;\; 0 \le v(t) \le v_{\max}(t),\;\; T = T_{exp}, \tag{9}$$
where $J$ is the optimization function, $v_{\max}(t)$ is the maximum allowed speed at time $t$, and $T_{exp}$ is the scheduled travel time.
3.2. Optimization Model of Train Operation Based on Reinforcement Learning
This subsection presents the reinforcement learning (RL) modules for train operation.
The RL architecture includes the following elements: the agent, the environment, the state, the action, and the reward. The agent observes the current state of the environment and executes an action; the action is applied to the environment, which causes a state transition and generates a reward that is returned to the agent. The goal of RL is for the agent to obtain as much accumulated reward as possible from the environment.
Agent: As the output module of the train operation control command, the ATO subsystem plays the role of the agent in the RL architecture, which controls the speed and position of the train by changing the output level of the traction system.
Action Space: During the operation of the train, the output level of the traction/braking system is the only means of controlling the change in speed and position; changes in the traction/brake level result in corresponding changes in train speed and position. The traction and brake outputs of a subway train are continuous variables, and at the maximum level the train outputs its maximum traction force or maximum braking force (denoted by $F_{\max}$ and $B_{\max}$, respectively). We therefore define the action space of the RL architecture as $[-1, 1]$, and the action $a_t$ taken by the agent at time $t$ conforms to Equation (10):
$$a_t \in [-1, 1]. \tag{10}$$
If $a_t > 0$, the train is in the traction condition with traction force $a_t F_{\max}$; if $a_t < 0$, the train is in the braking condition with braking force $|a_t| B_{\max}$; and if $a_t = 0$, the train is in the coasting condition.
State Space: Two crucial elements describe the operating characteristics of the train: speed and position. When the agent, i.e., the ATO system, takes an action, the speed and position of the train change immediately. We therefore define the speed and position of the train as the state of the RL architecture, and the state $s_t$ at time $t$ is described by Equation (11):
$$s_t = (v_t,\, p_t). \tag{11}$$
Note that the speed and the position are bounded within certain ranges, as shown in Equation (12):
$$0 \le v_t \le v_{\max}(t), \qquad 0 \le p_t \le P, \tag{12}$$
where $P$ is the length of the inter-station section.
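For concreteness, the action space of Equation (10) and the bounded state space of Equations (11) and (12) can be declared with Gymnasium-style `Box` spaces, as sketched below; the library choice and the numerical bounds are our own assumptions for illustration.

```python
import numpy as np
from gymnasium import spaces

V_MAX = 22.2    # assumed global speed ceiling [m/s]; the line limit v_max(t) is enforced separately
P_MAX = 1500.0  # assumed inter-station distance P [m]

# Equation (10): a_t in [-1, 1], one continuous traction/brake level
action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

# Equations (11)-(12): s_t = (v_t, p_t) with bounded speed and position
observation_space = spaces.Box(
    low=np.array([0.0, 0.0], dtype=np.float32),
    high=np.array([V_MAX, P_MAX], dtype=np.float32),
    dtype=np.float32,
)
```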
Environment: The environment in the RL architecture must capture the exact relationship between the agent's output and the state transition. In this paper, the environment describes the relationship between the output level of the train traction/braking system and the speed and position of the train, i.e., it is an accurate train dynamics model with the control level as the independent variable.
Assume that at time $t$ the speed and position of the train are $v_t$ and $p_t$, respectively, and the agent takes an action $a_t$. The traction force and brake force are then calculated by Equations (13) and (14), respectively:
$$f_t = \max(a_t, 0)\,F_{\max}, \tag{13}$$
$$b_t = \max(-a_t, 0)\,B_{\max}. \tag{14}$$
The acceleration of the train $u_t$ is then obtained from Equations (1), (5) and (7). After that, the state of the RL architecture $s_t$ is transferred to $s_{t+1}$ by Equation (15), assuming that the action execution time is $\Delta t$:
$$v_{t+1} = v_t + u_t\,\Delta t, \qquad p_{t+1} = p_t + v_t\,\Delta t + \tfrac{1}{2}u_t\,\Delta t^2. \tag{15}$$
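The state transition of Equations (13)-(15) can be implemented as a step-style update, sketched below under the same assumptions as before; the `acceleration` helper is the one sketched in Section 3.1, and `F_MAX`, `B_MAX`, `DT`, `MASS`, and the gradient lookup `theta_at` are illustrative placeholders rather than the values used in the experiments.

```python
F_MAX = 3.1e5   # assumed maximum traction force [N]
B_MAX = 3.4e5   # assumed maximum braking force [N]
DT = 1.0        # assumed action execution time Delta t [s]
MASS = 3.0e5    # assumed train mass [kg]

def theta_at(p):
    """Placeholder gradient lookup (radians) for position p on the line."""
    return 0.0

def env_step(v, p, a):
    """One transition of the train environment.

    Equations (13)-(14): map the action level a in [-1, 1] to forces.
    Equation (15): advance speed and position over one step of length DT.
    """
    f = max(a, 0.0) * F_MAX                          # traction force when a > 0
    b = max(-a, 0.0) * B_MAX                         # braking force when a < 0
    u = acceleration(MASS, v, theta_at(p), f, b)     # Equation (1)
    v_next = max(v + u * DT, 0.0)                    # Equation (15), speed kept non-negative
    p_next = p + v * DT + 0.5 * u * DT ** 2          # Equation (15)
    return v_next, p_next, u, f, b
```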
Reward: The reward of the RL architecture comprises three components: a reward for energy saving, a reward for comfort, and a penalty term for exceeding the scheduled time. We define the reward $r_t$ at time $t$ as Equation (16):
$$r_t = -\alpha\,E_{tr,t} + \beta\,C_t, \tag{16}$$
where $E_{tr,t}$ and $C_t$ are calculated over the current time step by Equations (3) and (8), respectively, and $\alpha$ and $\beta$ are the coefficients of the energy-saving reward and the comfort reward. In practice, we want to minimize energy consumption and maximize comfort, and in an RL architecture we always maximize the expected accumulated reward; therefore, $\alpha$ and $\beta$ are both positive.
When the train arrives at the destination station, the total reward is calculated by Equation (17):
$$R = \sum_{t} r_t - \sigma\,\mathbb{1}\!\left(\left|T - T_{exp}\right| > \delta\right), \tag{17}$$
where $\sigma$ is a large positive number, $T_{exp}$ is the expected operating time, and $\delta$ is the allowed deviation between the real operating time and the expected operating time.
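A sketch of the per-step reward of Equation (16) and the terminal penalty of Equation (17) follows; the weights `ALPHA` and `BETA`, the penalty `SIGMA`, the expected time `T_EXP`, and the tolerance `DELTA_T` are illustrative values only, not the settings used in the case study.

```python
ALPHA = 1.0e-6   # assumed weight of the energy-saving reward
BETA = 1.0       # assumed weight of the comfort reward
SIGMA = 100.0    # assumed large penalty for violating the timetable
T_EXP = 80.0     # assumed expected operating time [s]
DELTA_T = 8.0    # assumed allowed deviation from the expected time [s]

def step_reward(f, b, v, u_prev, u, dt, eta_t=0.9, eta_b=0.6):
    """Equation (16): r_t = -alpha * E_tr,t + beta * C_t for one time step."""
    energy = (f * v / eta_t - eta_b * b * v) * dt    # Equation (3) over one step
    comfort = -abs(u - u_prev)                       # Equation (8) over one step: -|du/dt| * dt
    return -ALPHA * energy + BETA * comfort

def terminal_reward(step_rewards, arrival_time):
    """Equation (17): sum of step rewards minus a large penalty if the timetable is violated."""
    total = sum(step_rewards)
    if abs(arrival_time - T_EXP) > DELTA_T:
        total -= SIGMA
    return total
```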
3.3. Proximal Policy Optimization Method
Proximal Policy Optimization [28] is a policy gradient method; such methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The most commonly used gradient estimator has the form of Equation (18):
$$\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right], \tag{18}$$
and $\hat{g}$ is obtained by differentiating the objective function shown in Equation (19):
$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right], \tag{19}$$
where $\pi_\theta$ is a stochastic policy, $a_t$ and $s_t$ are the action and state at time $t$, respectively, and $\hat{A}_t$ is an estimator of the advantage function at time $t$.
PPO evolved from Trust Region Policy Optimization (TRPO). In TRPO, the objective function is maximized subject to a constraint on the size of the policy update, as shown in Equation (20) [29]:
$$\max_\theta\;\hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t\right] \quad \text{s.t.}\quad \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{old}}(\cdot \mid s_t),\,\pi_\theta(\cdot \mid s_t)\right]\right] \le \delta_{KL}, \tag{20}$$
where $\theta_{old}$ denotes the policy parameters before the update. The theory behind TRPO actually suggests replacing the constraint with a penalty, but choosing the penalty coefficient is difficult and TRPO remains hard and complicated to implement. There are two ways to make the implementation easier: an adaptive KL penalty coefficient and a clipped surrogate objective [28]. In this paper, we focus on the clipped surrogate objective because of its simplicity and good performance.
We denote the objective function in Equation (20) as Equation (21):
$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t\right] = \hat{\mathbb{E}}_t\!\left[r_t(\theta)\,\hat{A}_t\right], \tag{21}$$
where the superscript $CPI$ refers to conservative policy iteration [30] and $r_t(\theta)$ is the probability ratio between the new and old policies.
In PPO, a hyper-parameter $\epsilon$ is introduced to limit the distance between the old and new policies to a specific range, as shown in Equation (22):
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;\mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]. \tag{22}$$
The first term inside the minimum is the probability ratio multiplied by the advantage function, i.e., the unclipped objective $L^{CPI}$. The second term limits the probability ratio $r_t(\theta)$ to the interval $[1-\epsilon,\,1+\epsilon]$, giving the clipped objective. Finally, we take the minimum of the unclipped and clipped objectives to obtain $L^{CLIP}(\theta)$.
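For reference, the clipped surrogate objective of Equation (22) can be written in a few lines of PyTorch; this is a generic sketch of Equation (22), not the exact implementation used in the experiments.

```python
import torch

def clipped_surrogate_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Negative L^CLIP of Equation (22), ready for gradient descent.

    log_prob_new : log pi_theta(a_t | s_t) under the current policy
    log_prob_old : log pi_theta_old(a_t | s_t), detached from the graph
    advantages   : advantage estimates A_hat_t
    eps          : clipping hyper-parameter epsilon
    """
    ratio = torch.exp(log_prob_new - log_prob_old)                 # r_t(theta)
    unclipped = ratio * advantages                                 # L^CPI term
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # maximizing L^CLIP = minimizing its negative
```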
In order to speed up the training process and improve the efficiency of the PPO algorithm, we adopt reward scaling [31] during training, as shown in Equation (23):
$$r_t^{scaled} = \frac{r_t}{\sigma_R + \varepsilon_0}, \tag{23}$$
where $\sigma_R$ is the standard deviation of the rewards $R$ and $\varepsilon_0$ is a very small positive number that prevents a zero denominator.
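A minimal implementation of the reward scaling in Equation (23) keeps a running estimate of the reward standard deviation, as sketched below; maintaining the statistics with Welford's online update is our own implementation choice, not necessarily the one used in this paper.

```python
class RewardScaler:
    """Divide each reward by a running standard deviation, as in Equation (23)."""

    def __init__(self, eps=1e-8):
        self.eps = eps        # small constant preventing a zero denominator
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0         # sum of squared deviations (Welford's algorithm)

    def __call__(self, r):
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r - self.mean)
        std = (self.m2 / self.count) ** 0.5 if self.count > 1 else 0.0
        return r / (std + self.eps)
```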
The optimal control process for subway train operation based on PPO with reward scaling is shown as Algorithm A1 in Appendix A. The comparison between the PPO algorithm with and without reward scaling is described in detail in Section 4.
4. Case Study
This section presents the simulation results of several algorithms applied to subway operation control, including Deep Deterministic Policy Gradient (DDPG), the conventional PPO, and PPO with reward scaling. The simulations are based on actual track and train conditions in Beijing. The traction/braking characteristics are shown in Figure 2 and Figure 3, the track speed limit is shown in Figure 4, and the other simulation parameters are listed in Table 1.
Because the three algorithms differ, their numbers of training steps differ, but the numbers of training episodes are all around 20,000. We analyze the simulations from the following aspects: the convergence of the rewards, the training loss, and the train operation indicators.
Figure 5, Figure 6 and Figure 7 show how the rewards of the three algorithms vary with the training episodes. In Figure 5, the reward of DDPG reaches a stable state at around the 3000th episode. The reward of the conventional PPO fluctuates greatly and has not converged even after 21,000 episodes. After adding reward scaling, the reward of the PPO algorithm converges at about the 3000th episode; despite a small jump in the data around the 4000th episode, the reward quickly converges again. Judging from the change in reward alone, the DDPG algorithm and PPO with reward scaling both perform better than the conventional PPO.
Figure 8, Figure 9 and Figure 10 show the variation of the training loss in the three algorithms. The loss of DDPG jumps between very large and very small values, which means that the algorithm still has not converged after 20,000 episodes. The conventional PPO does not converge either, because its loss diverges every few episodes, which is also confirmed by Figure 6. In Figure 10 we can see that the loss of PPO with reward scaling converges to a small range at about the 3000th episode; despite occasional jumps over a few training steps, the loss stays within this small range, which means that the training of PPO with reward scaling converges. From Figure 8, Figure 9 and Figure 10, we can also see that although the reward of the DDPG algorithm reaches a stable state around the 3000th episode, its training loss does not converge. Among the three algorithms, only PPO with reward scaling converges in both reward and loss.
From the above analysis, the conventional PPO algorithm performs worst among the three. Although it might converge if training continued, it would consume a great deal of time and computational resources. The reward of the DDPG algorithm reaches a stable state, but the loss analysis shows that it has not reached a stable convergence state. PPO with reward scaling converges fastest and most stably. However, other indicators are also important for subway train operation.
The operation time is an important indicator in the subway system. In this paper, we select 110% of the prescribed operation time (listed in Table 1), i.e., 88 s, as the threshold for the operation time. The results are shown in Figure 11, Figure 12 and Figure 13.
From Figure 11, Figure 12 and Figure 13, we can see that after 4000 episodes the operation times of DDPG and of PPO with reward scaling meet the operational requirement (only a few episodes exceed the threshold, and they converge back quickly), while the operation time of the conventional PPO exceeds the threshold in many episodes. Although the operation time of DDPG is lower than that of PPO with reward scaling, the energy consumed by DDPG is much higher than that of PPO with reward scaling, as shown in Figure 14 and Figure 15. The energy consumed by the conventional PPO is lower than that of PPO with reward scaling because the conventional PPO consumes more time, as shown in Figure 15 and Figure 16; however, this does not meet the demands of subway operation.
The mean impact rates of the three algorithms are shown in Figure 17, Figure 18 and Figure 19. At the end of training, the mean impact rate of PPO with reward scaling is lower than that of DDPG. The mean impact rate of the conventional PPO is lower than that of PPO with reward scaling, but only because the conventional PPO consumes more time in operation, which does not meet the demands of subway operation.
From the above analysis, it can be seen that PPO with reward scaling has the fastest convergence speed and is the most stable after convergence. Compared with DDPG, it performs better in terms of operation time, energy consumption, and comfort. The conventional PPO may require more computing resources to achieve a stable convergence state. Although it performs better in terms of energy consumption and comfort than PPO with reward scaling in some episodes, it does not meet the requirements of the subway operating timetable.
5. Conclusions
With the increasing scale of the urban subway, the total energy consumption of the subway has increased dramatically and poses a great challenge to the comfort of passengers and the punctuality of train operation. In order to ensure on-time train operation and passenger comfort, and at the same time reduce the energy consumption of subway operation, this paper proposes a Proximal Policy Optimization (PPO)-based optimization algorithm for the optimal control of subway train operation. Firstly, a reinforcement learning architecture for optimal control of subway train operation is constructed with the position and speed of train operation as the reinforcement learning state, energy consumption and comfort as the optimization objectives, and train operation time as the constraint. The proposed reinforcement learning model is trained by the PPO algorithm, and the reward scaling is added to the training process to accelerate the training speed and improve the efficiency of the algorithm. Finally, the algorithm proposed in this paper is compared with the Deep Deterministic Policy Gradient (DDPG) algorithm and the conventional PPO, and the superiority of the algorithm is verified.
The research in this paper provides a new approach to the optimal control of train operation: it optimizes the train operation control strategy directly, without pre-computed speed profiles, and is therefore free of the limitations of offline speed profiles and of the need for precise train model information. In addition, this work can serve as the basic control method for a multi-train tracking model, although the train operation model still needs to be extended with the constraints of multi-train tracking. This is the subject of the authors' future research.