Article

A Multi-Objective Reinforcement Learning Framework for Energy-Efficient Electric Bus Operations

by Huan Liu, Hengyi Qiu, Wanming Lu and Xiaonian Shan
1 Business School, Nanjing University of Science and Technology Zijin College, Nanjing 210023, China
2 College of Civil and Transportation Engineering, Hohai University, Nanjing 210024, China
* Author to whom correspondence should be addressed.
Sustainability 2025, 17(23), 10695; https://doi.org/10.3390/su172310695
Submission received: 22 October 2025 / Revised: 25 November 2025 / Accepted: 26 November 2025 / Published: 28 November 2025

Abstract

In urban arterials, buses face dual constraints from signal-controlled intersections and bus stop dwell demands, and frequent start–stop cycles result in reduced operational efficiency and elevated energy consumption. To address this critical challenge, a sustainable eco-driving strategy integrating offline and online Reinforcement Learning (RL) is proposed in this study. Leveraging real-world trajectory data from a 15.47 km route with 31 stops, the energy consumption characteristics of electric buses under the combined effects of stops and intersections are systematically analyzed, and high energy consumption scenarios are precisely identified. An initial energy saving strategy is first trained using offline RL, and subsequently subjected to online optimization in a vehicle–infrastructure cooperative simulation environment that incorporates three typical stop configurations. The soft actor-critic algorithm is employed to reconcile the dual goals of energy efficiency and ride comfort. Simulation results reveal a significant improvement with the proposed strategy, achieving an 11.2% reduction in energy consumption and a 37.7% decrease in travel time compared to the Krauss benchmark model. This study confirms the effectiveness of RL in boosting the operational sustainability of public transport systems, offering a scalable technical framework to promote the development of green urban mobility. The research findings provide theoretical support and practical references for the large-scale promotion and engineering application of energy saving autonomous driving technology for electric buses.

1. Introduction

Currently, numerous countries worldwide have incorporated the development of the electric vehicle industry into their national strategies and successively issued plans for electric vehicle fleet development, covering multiple vehicle categories such as passenger cars and buses, with the aim of further enhancing energy conservation and emission reduction effects in the transportation sector. Incentive policies including financial subsidies, tax reductions and exemptions, and road priority rights have become key means to promote the market penetration of electric vehicles, and their implementation effects and optimization paths have emerged as a research hotspot in the field of sustainable transportation development [1].
Compared with private electric vehicles, electric buses offer advantages such as fixed operating routes, concentrated charging demands, and controllable operational scenarios, making it easier to develop a standardized promotion model. Additionally, electric buses demonstrate remarkable benefits in energy conservation, emission reduction, and operational cost reduction, thus emerging as a core priority area for the global promotion and application of electric vehicles. According to 2023 statistics, China’s bus fleet reached 682,500 vehicles, of which 554,400 were new energy buses, accounting for a notable 81.2% of the total. The sheer scale of this fleet underscores the immense potential for energy savings and emissions reduction through operational optimization.
Urban signalized intersections represent typical scenarios with interrupted traffic flow, where stochastic signal timing and vehicle dynamics lead to significant energy and operational inefficiencies. Furthermore, the dense distribution of bus stops and their close proximity to intersections, which are common characteristics of urban infrastructure, often disrupt the operational efficiency of bus services [2]. These disruptions, stemming from traffic signals and passenger boarding/alighting activities, result in repeated acceleration–deceleration cycles, lower average speeds, and degraded powertrain efficiency. Collectively, these factors elevate the energy consumption per unit distance and substantially undermine the overall energy utilization efficiency of electric buses [3], thereby counteracting their potential environmental benefits.
While various eco-driving strategies have been proposed in the existing literature to alleviate energy waste at signalized intersections, most reported methods are tailored and assessed primarily for intersection-specific scenarios, often sidelining the critical role of upstream and downstream bus stops. Stochastic delays introduced by passenger activities at these stops can fundamentally alter the vehicle trajectory, so intersection-focused optimization strategies become suboptimal or even counterproductive when applied to segments that couple a bus stop with an intersection. There is therefore an urgent need for strategies that can holistically orchestrate driving behavior across this coupled system.
This study addresses the critical sustainability challenge of low energy utilization efficiency caused by frequent starts and stops during bus operation. The innovation of this research lies in the development of a novel AI-enhanced eco-driving strategy, uniquely tailored to the complex scenario of the bus stop and intersection road segment. Based on real-world operational data from a typical bus route, this study analyzes the energy consumption characteristics of electric buses under the combined influence of bus stops and intersections. An AI-enhanced eco-driving strategy integrating offline and online Reinforcement Learning (RL) is proposed. Compared with previous studies, the proposed method features two core distinctive advantages: (1) synergistic integration of offline and online RL; (2) coordinated optimization of driving behaviors for both bus stop dwell and traffic signal response within a unified Vehicle-to-Infrastructure (V2I) cooperative framework.
As for the structure of this paper, Section 2 conducts a comprehensive literature review. Section 3 presents the research framework and data results. The methodology of eco-driving strategy is proposed in Section 4. Section 5 conducts the numerical experiments. Section 6 summarizes the main conclusions and puts forward prospects for future research directions.

2. Literature Review

2.1. Energy Consumption Characteristics of Electric Buses

Tan [4], by constructing a bus energy consumption factor model based on specific power correction, discovered that the bus energy consumption factor exhibits a decreasing power-function relationship with speed and is positively correlated with bus mass. The results of the corrected model for calculating average single-trip energy consumption were consistent with the MOVES energy consumption and emissions model, with a calculation error within 12%. Misanovic et al. [5], based on real-vehicle experimental results from the EKO1 line in Belgrade, found that operational factors such as air conditioning usage and passenger load significantly affect energy consumption per unit distance; in particular, the use of heating systems during winter could increase energy consumption by approximately 20% to 30%. Kivekäs et al. [6] established a linear positive correlation between the number of stops and total energy consumption using measured data collected on city Line 11 in Espoo, Finland. Belloni et al. [7] found that up to 50% of the energy consumed by the motor for traction could be dominated by driving behavior, specifically acceleration patterns and throttle pedal control styles. They noted that aggressive maneuvers such as rapid acceleration and frequent deep throttle applications led to high-magnitude energy consumption peaks, whereas gentle maneuvers such as gradual starts, constant-speed driving, and the judicious use of coasting could effectively reduce the overall energy consumption level. Chen et al. [8] utilized UAV aerial photography to achieve high-precision trajectory extraction (RMSE: 0.175 m), providing a data foundation for analyzing bus operational parameters. Zhuang et al. [9] adopted a few-shot learning strategy from autonomous driving perception, enhancing recognition capability in special scenarios to support energy consumption decisions. Chen et al. [10] developed a traffic flow prediction denoising model based on wavelet transform, supplying more accurate input data for bus scheduling systems. Li et al. [11] employed a three-stage predictive control framework (day-ahead electricity purchase optimization, charging power allocation, and real-time energy storage regulation), offering a systematic solution for energy efficiency management of charging infrastructure.
Synthesis of studies [4,5,6,7,8,9,10,11] indicates that the energy efficiency of electric buses is highly sensitive to operational dynamics, with frequent stops, aggressive driving, and auxiliary loads being primary contributors to energy waste. This understanding directly motivates the development of eco-driving strategies to smooth traffic flow and eliminate inefficient patterns. Integrated solutions incorporating high-precision trajectory monitoring, robust environmental perception, data cleansing, and smart charging management are progressively building a systematic framework for energy optimization. Future research should focus on breakthroughs in multi-source data collaboration and adaptive driving strategy optimization to achieve full-chain energy efficiency improvements.

2.2. Evolution of Eco-Driving Strategies

The research on eco-driving strategies has progressed from model-driven to data-driven approaches. Rule-based and model-based methods provide foundational solutions; for instance, Wu et al. [12] achieved up to 19.5% fuel consumption reduction through speed optimization at signalized intersections. Long et al. [13] balanced energy consumption, delay, and comfort under high saturation conditions with a trajectory model, and Xu et al. [14] planned speed intervals for buses that incorporated operational constraints, significantly improving system economic efficiency. Although intuitive and reliable, the performance of these methods is easily affected by model mismatches in highly stochastic real traffic environments. Optimization-based methods enable explicit multi-objective balancing. Zhu et al. [15] used Bayesian adaptive optimization to achieve synergistic reductions in energy consumption and emissions, Zhang et al. [16] demonstrated broad applicability across different driving styles with multi-objective optimization, and Feng et al. [17] combined model predictive control for real-time optimization, significantly improving energy efficiency and safety while ensuring computational efficiency. However, their effectiveness is constrained by computational complexity and model accuracy. To learn driving behavior patterns from data, supervised learning methods have been widely applied: Zhang et al. [18,19] constructed high-precision driving behavior recognition models based on trajectory data to support energy saving interventions; Niu et al. [20] explored cooperative car-following strategies for connected vehicles, reducing emissions by approximately 15%; Li et al. [21] proposed segmented speed strategies for different queue scenarios, achieving significant energy savings. However, the policies derived from these methods are often static and difficult to adapt online. Reinforcement learning methods have shown great potential due to their ability to learn optimal policies through interaction with the environment. Lu et al. [22] simultaneously optimized energy efficiency and safety in mixed traffic flow, while Qin et al. [23] and Xi et al. [24] proposed deep reinforcement learning frameworks for conventional vehicles and electric buses, respectively, demonstrating excellent multi-objective synergistic optimization capabilities and robustness in complex intersection scenarios.
Studies [12,13,14,15,16,17,18,19,20,21,22,23,24] indicate that research on eco-driving strategies has shifted from model-driven to data-driven approaches. Early rule-based and model-based methods, while intuitive, are susceptible to model mismatch; optimization methods are constrained by computational complexity; supervised learning generates static policies that are difficult to adapt online. In contrast, reinforcement learning methods achieve dynamic multi-objective collaborative optimization through environmental interaction, demonstrating strong robustness in complex scenarios. However, existing research mostly focuses on isolated intersections, and the sequential decision-making challenge for buses navigating the typical “bus stop–intersection” corridor, particularly reinforcement learning frameworks that integrate offline pre-training and online fine-tuning, remains underexplored.

2.3. Summary of Quantitative Findings from Literature

To provide a clear overview of the performance achievements in the field of eco-driving, Table 1 synthesizes key quantitative findings from the reviewed literature. This summary highlights the diversity of approaches and their demonstrated effectiveness across different performance metrics, such as energy savings, emissions reduction, and travel efficiency. The data underscores a common trend: significant energy savings are achievable through intelligent driving strategies, yet the application context and specific methodologies vary widely.
Existing studies on electric bus eco-driving have established a systematic understanding of their energy consumption patterns, yet remain predominantly vehicle-centric. While some research has begun to examine how bus stops influence stopping and driving behaviors, there is a notable absence of quantitative energy consumption analysis based on real operational data that jointly considers driving modes within integrated “bus stop–intersection” scenarios. Meanwhile, RL has shown growing potential in traffic control applications due to its capacity for continuous optimization and adaptive decision-making. Specifically, online RL enables dynamic control through real-time environmental interaction, whereas offline RL can train effective models directly from historical datasets, offering distinct advantages in high-risk or high-cost settings. To bridge these gaps, this paper introduces a dual-phase RL framework combining offline pre-training with online fine-tuning, designed to develop an eco-driving strategy for the “bus stop–intersection” scenario that reflects real-world operational characteristics while maintaining adaptability in complex traffic environments.

3. Materials

3.1. Research Framework

This study develops a three-phase sustainability optimization framework (detailed in Section 4) specifically designed to enhance the energy efficiency and operational sustainability of electric bus operations in urban environments. The optimization method of eco-driving strategy is shown in Figure 1.
Phase 1: Data-Driven Analysis of Operational Characteristics.
  • Leveraging real-world operational data from a specific bus route in China, this study systematically analyzes the overall distribution characteristics of power, speed, and acceleration for electric buses operating under complex environmental conditions, including varying road hierarchy levels, traffic flow states, and load conditions.
  • It further investigates the impact of frequent stops and starts at bus stops and intersections on overall operational efficiency and energy consumption levels.
  • The analysis examines the characteristics of variation in energy consumption per unit distance relative to the distance between bus stops and intersections, thereby providing the data foundation and theoretical basis for developing an eco-driving strategy.
Phase 2: Offline RL for Initial Strategy Learning.
  • Using offline reinforcement learning (RL) methods based on real bus trajectory data, this study extracts “state–action–reward” correlations.
  • It performs imitation learning of bus behavioral patterns under different states to output an energy saving control strategy that possesses practical deployment potential while aligning closely with real-world data distribution.
  • This process results in an initial policy exhibiting both stability and practicality.
Phase 3: Online RL for Strategy Optimization and Evaluation.
  • An online reinforcement learning approach is employed to optimize the pre-trained policy.
  • A simulation environment reflecting fundamental urban traffic characteristics is designed using the Simulation of Urban MObility (SUMO) platform, version 1.25.0.
  • Experience trajectories are generated based on the initial policy to build a behavioral experience dataset.
  • This dataset is introduced into the experience replay buffer as initial experience (a minimal buffer-seeding sketch follows this list).
  • The system iteratively generates new experience through ongoing interactions between the bus agent and the simulation environment for RL training, continuously updating the policy parameters in real-time.
  • This culminates in a robust eco-driving strategy.
  • A comparative simulation analysis is conducted between this strategy and SUMO’s default Krauss driving model to analyze the operational mechanisms contributing to energy savings.
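
To make the buffer-seeding step in Phase 3 concrete, here is a minimal Python sketch of pre-populating a replay buffer with offline transitions before online SAC training begins. The `ReplayBuffer` class and the `(state, action, reward, next_state, done)` trajectory format are illustrative assumptions rather than the authors' implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO experience replay buffer (capacity per Table 5)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=128):
        return random.sample(self.buffer, batch_size)

def seed_from_offline_policy(buffer, trajectories):
    """Insert transitions replayed from the initial (offline-learned) policy
    so that online SAC starts from a warm buffer instead of a cold start."""
    for trajectory in trajectories:  # each: list of (s, a, r, s', done) tuples
        for transition in trajectory:
            buffer.push(*transition)
```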

3.2. Data Collection

This study is grounded in a robust empirical foundation, utilizing real-world operational data from a specific bus route in China for July 2020. The selected route, spanning 15.47 km with 31 stops, traverses a representative variety of urban functional landscapes, including expressways, arterial roads, and secondary arterials. This carefully chosen corridor encapsulates the operational challenges and opportunities within diverse urban settings (e.g., residential, commercial), ensuring the findings are directly relevant for improving the sustainability of urban public transport systems under varying infrastructure and traffic conditions.
To capture a comprehensive picture of operational dynamics, data collection was strategically designed to include both weekdays and weekends, with balanced coverage of morning/evening peak hours and off-peak periods. This temporally diverse data collection strategy is critical for understanding vehicle performance across the full spectrum of real-world service conditions, including different passenger load factors and traffic flow environments. Such comprehensiveness is essential for developing an eco-driving strategy that is both effective and robust, key to achieving consistent energy savings and emissions reductions.
The subject vehicle was equipped with a sophisticated on-board diagnostics (OBD) system, recording high-resolution time series data on critical parameters such as energy consumption, GPS trajectories, speed, gradient, and altitude [25]. These multimodal datasets provide a holistic view of the vehicle–environment interactions that determine energy efficiency. The key collected metrics, which offer complementary insights into vehicle dynamics and energy use, are systematically detailed in Table 2.

3.3. Data Processing

3.3.1. Power Calculation

As shown in Figure 2, the power distribution of the target vehicle during stationary states exhibits specific characteristics. While most power values concentrate around 10 kW (reflecting stable auxiliary load demand), the presence of negative power values indicates limited energy recuperation. Crucially, clearly identifiable high-magnitude outliers are observed, likely caused by transient power surges during system wake-up or pre-charging events.
Given that these outliers differ significantly from steady-state power characteristics and cannot represent typical auxiliary system behavior, we exclude them from subsequent auxiliary power calculations.
The energy decomposition principle governs drive power derivation via Equation (1), where drive power equals total power minus auxiliary power:
$$P_{drive} = \frac{U \times I}{1000} - P_{assist}$$
where
$P_{drive}$: Drive power (kW)
$U$, $I$: Battery voltage (V) and battery current (A), as recorded in Table 2
$P_{assist}$: Auxiliary power (kW); its average value, excluding outliers, is 8.09 kW.
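
As a point of reference, Equation (1) translates directly into code. The sketch below assumes `voltage_v` and `current_a` are the OBD voltage and current readings from Table 2; the function and argument names are our own.

```python
AUX_POWER_KW = 8.09  # mean auxiliary power with outliers excluded

def drive_power_kw(voltage_v: float, current_a: float,
                   p_assist_kw: float = AUX_POWER_KW) -> float:
    """Equation (1): drive power = total electrical power - auxiliary power."""
    p_total_kw = voltage_v * current_a / 1000.0  # V x A gives W; /1000 gives kW
    return p_total_kw - p_assist_kw
```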

3.3.2. Acceleration Calculation

Vehicle acceleration is derived through force–power equilibrium according to Equations (2) and (3), balancing power output against the resistance forces:
$$a = \frac{\dfrac{P_{drive}}{v} - F_r - F_d - F_g}{m + m_p \times n}$$
$$F_r = (m + m_p \times n) \times g \times C_r, \qquad F_g = (m + m_p \times n) \times g \times \sin(\theta), \qquad F_d = 0.5 \times C_d \times A \times \rho \times v^2$$
where
v : Velocity (m/s)
F r : Rolling resistance (N)
F d : Aerodynamic drag (N)
F g : Grade resistance (N)
m : Vehicle curb mass (kg)
m p : Average passenger mass (kg)
n : Passenger count (onboard)
g : Gravitational acceleration (9.81 m/s2)
C r : Rolling resistance coefficient (0.008)
θ : Road gradient angle (rad)
C d : Aerodynamic drag coefficient (0.6)
A : Frontal area (m2)
ρ : Air density (kg/m3)
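
A minimal Python sketch of Equations (2) and (3) follows. It divides the net force by the total mass $(m + m_p \times n)$, the standard Newtonian step implied by the force–power equilibrium, and assumes a nominal air density of 1.225 kg/m³, which the paper does not specify.

```python
import math

G = 9.81      # gravitational acceleration (m/s^2)
C_R = 0.008   # rolling resistance coefficient
C_D = 0.6     # aerodynamic drag coefficient

def acceleration(p_drive_kw, v, m, m_p, n, theta, frontal_area, rho=1.225):
    """Equations (2)-(3): longitudinal acceleration from force-power
    equilibrium. Valid for v > 0 (traction force F = P/v is undefined at rest)."""
    mass = m + m_p * n                              # curb mass + passengers (kg)
    f_r = mass * G * C_R                            # rolling resistance (N)
    f_g = mass * G * math.sin(theta)                # grade resistance (N)
    f_d = 0.5 * C_D * frontal_area * rho * v ** 2   # aerodynamic drag (N)
    traction = p_drive_kw * 1000.0 / v              # F = P / v (N)
    return (traction - f_r - f_d - f_g) / mass
```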

3.3.3. Energy Consumption Analysis

The distribution of total vehicle power (Figure 3) reveals a fundamental contrast between static and dynamic states.
During stationary operation, power output is concentrated around 10 kW, indicating a consistent baseline auxiliary load. In deceleration phases, negative power values approaching zero confirm low-level energy recuperation capabilities. Notably, the power distribution during motion displays increased dispersion, significantly amplified fluctuations, and high-frequency clustering near 0 kW.
This diverse profile stems from the vehicle’s frequent transitions between acceleration, deceleration, idling, and recuperation cycles, resulting in wide-ranging power oscillations. The marked difference between static stability and dynamic variability directly correlates with the complex driving patterns typical of urban bus operations.
The power and stop distributions between bus stops and intersections, illustrated in Figure 4, show distinct patterns across operational phases. During bus stop approach, frequent braking and energy recuperation operations generate pronounced negative skewness in the power distribution, creating a contrast to acceleration behavior. Vehicle startup triggers a rapid change, where power levels surge upward, opposing the preceding deceleration phase.
A clear spatial relationship emerges after vehicle startup: power distribution patterns demonstrate an inverse variation relative to stop frequencies across different road segments. Areas with lower stop frequency show highly concentrated power distributions and reduced mean power values, whereas high stop frequency zones exhibit substantially broader dispersion and elevated averages.
This complementary pattern was quantitatively validated through correlation analysis of segments beyond a normalized distance of 0.2. A robust interdependence between mean power and stop counts is confirmed by a Pearson correlation coefficient of 0.807 (p < 0.001), signifying a statistically significant co-variation throughout vehicle operation. The combination of localized variations during transient phases and macro-level spatial patterns collectively characterizes the energy dynamics of electric buses in stop–intersection corridors.
The unit-distance energy consumption between stops and intersections is calculated according to Equation (4).
$$w = \int_{t_0}^{t_1} \frac{U \times I}{\Delta d}\, dt$$
where
$w$: Energy consumption per unit distance (kWh/km)
$\Delta d$: Distance between stop and intersection (km)
$t_0$: Entry time to stop zone (s)
$t_1$: Arrival time at intersection (s)
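
Numerically, Equation (4) can be evaluated from the 1 Hz OBD series. The sketch below assumes time-aligned voltage and current samples and converts joules to kWh; function and argument names are our own.

```python
def unit_distance_energy(voltage_v, current_a, dt_s, delta_d_km):
    """Equation (4): energy between stop zone entry (t0) and intersection
    arrival (t1), normalized by segment length, in kWh/km."""
    joules = sum(u * i for u, i in zip(voltage_v, current_a)) * dt_s  # W * s
    return (joules / 3.6e6) / delta_d_km  # 1 kWh = 3.6e6 J
```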

4. Methods

4.1. Offline Pre-Training with Q-Learning Decision Transformer

This study develops a three-phase sustainability optimization framework specifically designed to enhance the energy efficiency and operational sustainability of electric bus operations in urban environments. The framework comprises: (1) data-driven operational analysis, (2) offline reinforcement learning for initial strategy acquisition, and (3) online reinforcement learning for policy optimization. In alignment with this framework, the methodology encompasses both offline pre-training from historical data and online fine-tuning in simulated environments. A flow chart of the two-phase deep reinforcement learning algorithm is illustrated in Figure 5.
To acquire a stable and energy-efficient initial policy without the cost and risk of online interaction, we first employed Offline Reinforcement Learning (Offline RL). This approach learns directly from the collected historical trajectory data, imitating and optimizing the driving behaviors embedded in the real-world dataset.
We adopted the Q-learning Decision Transformer (QDT) algorithm for the offline pre-training. QDT synergizes the sequence modeling prowess of the Transformer architecture with the value function foundation of Q-learning. This hybrid design addresses the limitations of prior offline RL methods: it mitigates the extrapolation error common in traditional Q-learning in offline settings, while overcoming the suboptimal trajectory stitching issue of vanilla Decision Transformers by leveraging Q-value estimates to relabel the Return-to-Go (RTG) sequences. This makes it particularly suitable for learning competent policies from our mixed-quality historical driving data.
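
To illustrate the relabeling mechanism described above, the sketch below computes empirical Return-to-Go values and replaces them with a critic's estimate wherever the critic indicates a better achievable return. This is a simplified rendering of the QDT idea, not the authors' code; `q_fn` stands for an assumed pre-trained Q-function.

```python
import numpy as np

def relabel_returns_to_go(rewards, states, actions, q_fn, gamma=1.0):
    """QDT-style RTG relabeling sketch: where the learned critic promises a
    higher return than the logged trajectory actually achieved, substitute
    the Q estimate, enabling stitching of suboptimal trajectory fragments."""
    T = len(rewards)
    rtg = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # empirical discounted return-to-go
        running = rewards[t] + gamma * running
        rtg[t] = running
    for t in range(T):                    # relabel with the critic's estimate
        rtg[t] = max(rtg[t], q_fn(states[t], actions[t]))
    return rtg
```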

4.1.1. Reward-Driven Policy Design

The offline RL model was constructed with the following components:
  • State Space: A 4-dimensional vector encompassing: the current vehicle speed (km/h), the road gradient (rad), the number of onboard passengers (persons), and the instantaneous distance to the upcoming intersection (m).
  • Action Space: A 1-dimensional continuous variable representing the vehicle’s acceleration, bounded within $[-3.0, +3.0]\ \mathrm{m/s^2}$.
  • Reward Function (Equation (5)): A multi-objective reward function was designed to incentivize energy efficiency, travel efficiency, and comfort:
$$R = -w_1 \times P + w_2 \times v - w_3 \times a^2 + R_{arrival}$$
where $P$ is the instantaneous power (kW), $v$ is the speed (km/h), $a$ is the acceleration (m/s²), and $R_{arrival}$ is a sparse terminal reward granted upon successfully reaching the intersection. The weights $w_1$, $w_2$, and $w_3$ were tuned to balance these competing objectives (a direct transcription is sketched below).
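
The numeric weights in the following transcription of Equation (5) are placeholders, since the paper reports tuning $w_1$, $w_2$, $w_3$ without listing their values.

```python
def offline_reward(power_kw, speed_kmh, accel, reached_intersection,
                   w1=1.0, w2=0.1, w3=0.5, r_arrival=10.0):
    """Equation (5): penalize power draw and harsh acceleration, reward
    speed, and grant a sparse terminal bonus at the intersection.
    Weight values here are illustrative only."""
    r = -w1 * power_kw + w2 * speed_kmh - w3 * accel ** 2
    if reached_intersection:
        r += r_arrival
    return r
```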

4.1.2. Performance of the Initial Offline Policy

The QDT model was trained on a dataset comprising over 31,000 real-world trajectory segments. The performance of the resulting initial policy was rigorously evaluated by comparing its generated trajectories against the original human-driven trajectories, with key metrics summarized in Table 3.
The offline-learned initial policy demonstrated significant improvements across multiple dimensions, as quantified in Table 3.
  • Travel Efficiency: The average speed increased from 21.0 km/h to 31.0 km/h, indicating a substantial improvement in operational fluidity.
  • Driving Smoothness: The average acceleration shifted markedly from −0.996 m/s2 to −0.231 m/s2. This positive change, with a reduction in the magnitude of negative acceleration, signifies a decisive shift away from harsh braking patterns towards smoother driving.
  • Energy Consumption Profile: While the instantaneous average power increased—consistent with achieving higher speeds—the crucial metric of cumulative energy consumption decreased by 0.034 kWh per segment. This confirms that the policy successfully converts higher power expenditure into more efficient overall energy use by minimizing wasteful stop–start cycles.
These results validate that the offline RL phase successfully extracted a stable, smoother, and more energy-efficient initial policy from the historical dataset. The policy provides a high-quality and safer starting point for subsequent online optimization, effectively mitigating the cold-start problem.

4.2. Online Reinforcement Learning with Soft Actor-Critic

The SAC (Soft Actor-Critic) algorithm synthesizes the strengths of Actor-Critic methodologies with an entropy-regulated strategy. It maximizes expected returns while incorporating an entropy regularization term, thereby effectively encouraging exploration of state-action spaces and mitigating policy convergence to local optima [21]. Furthermore, its inherent dual focus on performance and adaptability is reinforced by an adaptive entropy temperature coefficient, which dynamically modulates policy stochasticity to maintain an optimal balance between exploration and exploitation [22].
The SAC algorithm was chosen as the primary optimization approach due to its unique entropy regularization mechanism, which effectively manages the exploration–exploitation trade-off. This makes it particularly suitable for high-dimensional state spaces (13-dimensional) and multi-objective reward scenarios. Key advantages include:
  • Entropy term encourages policy exploration, preventing convergence to local optima (e.g., mitigating premature convergence issues common in PPO);
  • Superior stochastic environment handling compared to TD3 (e.g., adapting to traffic signal phase transitions) through dynamically adjusted randomness via an adaptive temperature coefficient;
  • Pre-experimental validation: comparative tests of SAC, PPO, and TD3 showed that SAC outperformed both in cumulative reward (+10–15%) and convergence speed.

4.2.1. Physical State Representation

We construct a systematically designed state space that integrates critical operational metrics, traffic context, and mission progress indicators observable in real-road environments [23]. This framework incorporates thirteen carefully selected features across three categories, as detailed in Table 4, ensuring comprehensive characterization of kinematic dynamics, contextual constraints, and progression milestones.
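
As one way such a state vector could be assembled, the sketch below queries a subset of the Table 4 features through SUMO's TraCI Python API (the calls shown are standard TraCI functions). The two signal-phase features and any normalization are omitted for brevity, and the packaging logic is our assumption.

```python
import traci  # SUMO's Python control API

def observe(bus_id, next_stop_pos, dest_pos, route_len):
    """Build (most of) the Table 4 state from a running SUMO simulation."""
    pos = traci.vehicle.getDistance(bus_id)            # odometer position (m)
    speed = traci.vehicle.getSpeed(bus_id)             # m/s
    accel = traci.vehicle.getAcceleration(bus_id)      # m/s^2 (previous step)
    leader = traci.vehicle.getLeader(bus_id, 200.0)    # (id, gap) or None
    lead_gap = leader[1] if leader else 200.0
    lead_speed = traci.vehicle.getSpeed(leader[0]) if leader else 0.0
    tls = traci.vehicle.getNextTLS(bus_id)             # [(id, idx, dist, state), ...]
    tls_dist = tls[0][2] if tls else 1_000.0
    green = 1.0 if tls and tls[0][3] in "Gg" else 0.0
    return [speed, accel,
            next_stop_pos - pos,                       # distance to next stop
            dest_pos - pos,                            # remaining distance
            traci.vehicle.getAllowedSpeed(bus_id),     # lane speed limit
            lead_gap, lead_speed,
            float(traci.vehicle.isAtBusStop(bus_id)),  # docking flag
            (dest_pos - pos) / route_len,              # normalized remaining ratio
            tls_dist, green]
```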

4.2.2. Reward Function

The reward function retains its tri-objective architecture across energy efficiency, travel efficiency, and comfort dimensions:
  • Energy Dimension: Imposes a direct penalty based on instantaneous power consumption during operation.
  • Efficiency Dimension: Calculated as the ratio of current speed to ideal target speed. Provides positive incentives for higher speeds, balancing energy conservation with travel time requirements.
  • Comfort Dimension: Employs an acceleration-squared penalty term to constrain abrupt speed changes (mirroring the offline learning setup). Counters kinetic energy fluctuations to ensure passenger comfort.
The unified reward function maintains dimensional equilibrium through Equations (6) and (7).
$$reward_{SAC} = -w_{energy} \times E(v,a) + w_{speed} \times \frac{v}{v_{ideal}} - w_{comf} \times a^2 + r^{+}$$
$$E(v,a) = 2.7 \times 10^{-7} \times \left(w_E \times a \times v \times m + p_{static}\right) \times \Delta t$$
where
$w_{energy}$, $w_{speed}$, $w_{comf}$: Weight coefficients governing the contribution balance among energy consumption, efficiency, and comfort within the total reward; they keep the three objectives on mathematically commensurate scales. The values $w_{energy} = 0.7$, $w_{speed} = 0.2$, and $w_{comf} = 0.1$ were determined through grid search optimization to balance the competing objectives of energy efficiency, operational efficiency, and ride comfort.
$v_{ideal}$: Target cruise speed (10 m/s), representing the kinematic equilibrium point for efficiency optimization.
$w_E$: Dynamic energy coefficient, equal to 1.0 during acceleration (energy consumption) and 0.25 during deceleration (mirroring energy recuperation at 25% efficiency).
$m$: Mass of the bus (kg).
$p_{static}$: Stationary base power, a random value within [5, 10] kW.
$\Delta t$: Simulation timestep (s), serving as the temporal constant for discrete-time reward integration.
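
A compact transcription of Equations (6) and (7) using the reported weights is sketched below. Converting $p_{static}$ from kW to W inside Equation (7) is our assumption, made so the $2.7 \times 10^{-7}$ J-to-kWh factor applies uniformly.

```python
W_ENERGY, W_SPEED, W_COMF = 0.7, 0.2, 0.1  # grid-searched weights
V_IDEAL = 10.0                             # target cruise speed (m/s)

def sac_reward(v, a, m, p_static_kw, dt, r_plus=0.0):
    """Equations (6)-(7): w_E = 1.0 while accelerating (consumption),
    0.25 while decelerating (regeneration at 25% efficiency)."""
    w_e = 1.0 if a >= 0.0 else 0.25
    energy_kwh = 2.7e-7 * (w_e * a * v * m + p_static_kw * 1000.0) * dt
    return (-W_ENERGY * energy_kwh + W_SPEED * (v / V_IDEAL)
            - W_COMF * a ** 2 + r_plus)
```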

4.2.3. Model Parameters

The SAC model parameters, detailed in Table 5, adopt a shared structural framework across the policy and value networks:
  • Both networks utilize stacked fully connected layers with 256 hidden units each,
  • Activation: ReLU functions ensure nonlinear feature extraction uniformity,
  • Output: Acceleration commands undergo tanh-constrained compression, mapping outputs to the balanced interval [−3, 3] m/s2. This bounded action guarantees both kinematic feasibility and control stability for longitudinal vehicle dynamics while maintaining algorithmic equilibrium.
  • The complete training procedure comprised 150 steps, with a batch size of 128 and a replay buffer capacity of 100,000 to ensure stable policy convergence. Training was executed on an NVIDIA Tesla V100 GPU, requiring approximately 2 h of computation time.
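
The sketch below renders the actor described in Table 5 in PyTorch: two 256-unit ReLU hidden layers and a tanh squash mapping commands onto $[-3, 3]$ m/s². Details such as the log-std clamp follow common SAC practice rather than anything stated in the paper.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Gaussian SAC policy: MLP(256, 256) with tanh-bounded acceleration."""
    def __init__(self, state_dim=13, action_dim=1, a_max=3.0):
        super().__init__()
        self.a_max = a_max
        self.body = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                  nn.Linear(256, 256), nn.ReLU())
        self.mu = nn.Linear(256, action_dim)
        self.log_std = nn.Linear(256, action_dim)

    def forward(self, state):
        h = self.body(state)
        mu = self.mu(h)
        log_std = self.log_std(h).clamp(-20.0, 2.0)        # common SAC bounds
        a_raw = mu + log_std.exp() * torch.randn_like(mu)  # reparameterized sample
        return self.a_max * torch.tanh(a_raw)              # bounded to [-3, 3] m/s^2
```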

5. Simulation Analysis

Experimental Design and Evaluation Framework: A comprehensive evaluation of the SAC-optimized eco-driving strategy was conducted by comparing its performance against SUMO’s default Krauss model. The simulation environment, configured according to the parameters in Figure 6 and Table 6, enabled a rigorous comparative analysis focused on trajectory characteristics and cumulative energy consumption. To maintain methodological consistency, all performance metrics were evaluated using dimensionally comparable criteria across both models.
Road Network Optimization for Efficient Simulation: To enhance computational efficiency while preserving critical geometric properties, we developed a scaled-down roadway model reducing the total length from 15.47 km to 1547 m. This scaling approach maintained essential topological features—including gradient characteristics at intersections (with a preservation rate of 2.1 ± 0.3%) and curvature profiles—through a validated geometric scaling methodology.
Baseline Model Specification: The Krauss car-following model, originally developed by Stefan Krauss (1997) [24], establishes collision-free vehicle movement through dynamic speed constraints. The model continuously calculates safe speed thresholds based on surrounding traffic conditions, creating a balanced equilibrium between safety requirements and velocity maintenance.
Road Network Configuration: The simulated road network featured a main road with bidirectional six-lane design and branch roads with bidirectional four-lane access. This configuration ensured consistent maneuverability for buses in both directions, creating a representative environment for strategy learning and validation.
Scenario Coverage for Comprehensive Testing: To address diverse real-world operating conditions, we implemented three distinct station–intersection layouts with varying spatial arrangements:
  • There is a station upstream of the intersection;
  • There is a station downstream of the intersection;
  • There are stations both upstream and downstream of the intersection.
The probability of successful signal transmission decays monotonically with increasing vehicle-to-intersection distance. Drawing on the theoretical foundations of the Log-Distance Path Loss Model and the Log-Normal Shadow Fading Model, we model the loss probability with a centrally symmetric Sigmoid function [25]. The signal loss probability thus increases monotonically with distance, as defined by Equations (8) and (9).
$$p_{loss}(d) = \left(1 + e^{-k\,(d - d_{mid})}\right)^{-1}$$
$$k = \frac{\ln\!\left[p_{loss}(d)^{-1} - 1\right]}{d_{mid} - d} = \frac{\ln\!\left[0.05^{-1} - 1\right]}{800 - 600} = 0.01472$$
where
$d_{mid}$: Inflection point (800 m), representing the distance at which communication transitions from stable to unstable states.
$k$: Slope control coefficient, calibrated from the boundary condition $p_{loss}(600\ \mathrm{m}) = 0.05$.
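
Equations (8) and (9) reproduce numerically as follows; the calibration solves the boundary condition $p_{loss}(600\ \mathrm{m}) = 0.05$ read off Equation (9).

```python
import math

D_MID = 800.0  # inflection point (m)

def calibrate_k(p_target=0.05, d=600.0, d_mid=D_MID):
    """Equation (9): k = ln(1/p - 1) / (d_mid - d) = 0.01472 for p = 0.05."""
    return math.log(1.0 / p_target - 1.0) / (d_mid - d)

def p_loss(d, k=calibrate_k(), d_mid=D_MID):
    """Equation (8): packet-loss probability rises sigmoidally with distance."""
    return 1.0 / (1.0 + math.exp(-k * (d - d_mid)))
```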
All performance metrics (e.g., energy savings, time reduction) are reported with 95% confidence intervals and standard errors based on 10 independent simulation runs. Paired t-tests comparing SAC against the Krauss model ($H_0: \mu_{SAC} = \mu_{Krauss}$, significance level $\alpha = 0.05$) yielded p < 0.001, demonstrating the statistical significance of the optimized strategy’s superiority.
The trajectory comparison is shown in Figure 7. The comparison of cumulative energy consumption is shown in Figure 8.
Based on simulation data, the Krauss model achieved an average travel time of 346.6 s (mean of 10 simulation runs) over the identical 1547 m route, while the SAC model reduced this to 216.1 s. This corresponds to an absolute reduction of 130.5 s (346.6 − 216.1) and a relative reduction of (130.5/346.6) × 100% ≈ 37.7%.
Similarly for energy consumption, the Krauss model yielded an average cumulative energy consumption of 0.587 kWh, whereas the SAC model consumed 0.521 kWh. The absolute energy savings amounted to 0.066 kWh (0.587 − 0.521), equating to a relative reduction of (0.066/0.587) × 100% ≈ 11.2%.
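
For completeness, the per-run statistical comparison described above can be reproduced as follows; the input arrays stand for hypothetical per-run metrics from the 10 paired simulation runs.

```python
import numpy as np
from scipy import stats

def compare_runs(krauss_vals, sac_vals, alpha=0.05):
    """Paired t-test (H0: equal means) plus 95% CI of the per-run difference."""
    diff = np.asarray(krauss_vals) - np.asarray(sac_vals)
    t_stat, p_value = stats.ttest_rel(krauss_vals, sac_vals)
    se = diff.std(ddof=1) / np.sqrt(diff.size)             # standard error
    half = stats.t.ppf(1.0 - alpha / 2.0, diff.size - 1) * se
    return {"mean_diff": diff.mean(),
            "ci95": (diff.mean() - half, diff.mean() + half),
            "t": t_stat, "p": p_value}
```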

6. Conclusions

This study focuses on the operational segments of electric buses in integrated “bus stop–intersection” scenarios. Using real-world operational data, it systematically analyzes the impact of stop–start behavior on functional energy consumption and proposes an eco-driving strategy framework that integrates offline trajectory learning with online reinforcement optimization. By dynamically adjusting acceleration parameters to optimize vehicle operating states, the framework enhances energy efficiency, with its performance validated through comparative experiments and simulation.
Results demonstrate that stopping frequency in “bus stop–intersection” segments significantly influences the energy consumption per unit distance of electric buses. Compared with the original trajectories, trajectories optimized by the QDT model increase the average speed from 21 km/h to 31 km/h, markedly reduce fluctuations in acceleration, and lower the average cumulative energy consumption per segment from 0.219 kWh to 0.185 kWh. Relative to the Krauss driving model, the SAC model with integrated online strategy optimization reduces total travel time by 130.5 s (37.7%) and cumulative trajectory energy consumption by 0.066 kWh (11.2%).
While the proposed two-stage deep reinforcement learning framework demonstrates promising results for eco-driving of electric buses, several limitations remain, pointing to valuable directions for future research. First, there exists a gap between offline and online state representations: the offline pre-training employs a simplified state space, whereas the online simulation operates in a high-dimensional environment. This dimensional mismatch complicates the policy transfer and experience replay buffering process, potentially compromising policy stability during the fine-tuning phase. Second, the simulation network is structurally simplified and does not account for the complexities of a city-scale road network. Additionally, the strategy is trained from a single-agent perspective, neglecting potential interactions among multiple buses operating concurrently. Furthermore, the current strategy exhibits limited robustness to unplanned events, lacking explicit mechanisms to handle incidents such as accidents or severe congestion. To address these limitations, future research should focus on the following directions:
  • Develop a unified training framework that seamlessly integrates offline historical data with online interaction, algorithmically reconciling state representation differences to enable more stable and efficient policy initialization and fine-tuning.
  • Extend the simulation environment to city-scale road networks and introduce multi-agent reinforcement learning to train cooperative eco-driving strategies for multiple buses, achieving system-wide optimization of traffic flow and energy efficiency.
  • Enhance policy robustness through adversarial training, incorporating a wider range of stochastic disturbances and edge-case scenarios during training, and explore techniques such as control barrier functions to build dynamic emergency response capabilities.
Moreover, to improve the generalizability and practicality of the strategy, future work should expand the research to multi-powertrain systems (e.g., fuel cell and hybrid buses), investigate cross-vehicle policy transfer frameworks based on meta-reinforcement learning, and incorporate more complex road structures (e.g., interchanges, tidal lanes) into simulations, thereby promoting the application of eco-driving strategies in broader and more realistic traffic scenarios.

Author Contributions

Conceptualization, H.L. and H.Q.; methodology, W.L.; validation, X.S.; formal analysis, H.L.; investigation, H.Q.; resources, H.L.; data curation, X.S.; writing—original draft preparation, H.L.; writing—review and editing, H.Q.; visualization, W.L.; supervision, X.S.; project administration, H.L. and H.Q.; funding acquisition, X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Jiangsu Province (Grant No. BK20242054) and the Jiangsu Youth Science and Technology Talent Support Program (Grant No. JSTJ-2025-645).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Macioszek, E. The role of incentive programs in promoting the purchase of electric cars: Review of good practices and promoting methods from the world. In Research Methods in Modern Urban Transportation Systems and Networks; Lecture Notes in Networks and Systems; Springer: Berlin/Heidelberg, Germany, 2021; p. 207.
  2. Wang, H.X. Energy-Saving Driving Strategy Under the Environment of V2X: Some Cases for Electric Buses. Master’s Thesis, Chang’an University, Xi’an, China, 2020.
  3. Lan, Q. Analysis on the Basic Situation of Pure Electric Bus Operation. Automob. Appl. Technol. 2023, 24, 209–213.
  4. Tan, Y. Study on Energy Consumption Evaluation of Bus Routes Based on Operating Characteristics. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2019.
  5. Misanovic, S.M.; Glisovic, J.D.; Blagojević, I.A.; Taranovic, D.S. Influencing factors on electricity consumption of electric bus in real operating conditions. Therm. Sci. 2021, 25 (Suppl. 1), 81–90.
  6. Kivekäs, K.; Vepsäläinen, J.; Tammi, K.; Anttila, J. Influence of driving cycle uncertainty on electric city bus energy consumption. In Proceedings of the IEEE Vehicle Power and Propulsion Conference (VPPC), Belfort, France, 11–14 December 2017.
  7. Belloni, M.; Tarsitano, D.; Sabbioni, E. An Experimental Analysis of Driver Influence on Battery Electric Bus Energy Consumption. In Proceedings of the IEEE Vehicle Power and Propulsion Conference (VPPC), Washington, DC, USA, 7–10 October 2024.
  8. Chen, X.Q.; Li, Z.B.; Yang, Y.S.; Qi, L.; Ke, R. High-Resolution Vehicle Trajectory Extraction and Denoising from Aerial Videos. IEEE Trans. Intell. Transp. Syst. 2021, 22, 3190–3202.
  9. Zhuang, Y.F.; Liu, P.; Yang, H.; Zhang, K.; Wang, Y.; Pu, Z. Few-shot learning for novel object detection in autonomous driving. Commun. Transp. Res. 2025, 5, 100194.
  10. Chen, X.Q.; Wu, S.B.; Shi, C.J.; Huang, Y.; Yang, Y.; Ke, R.; Zhao, J. Sensing Data Supported Traffic Flow Prediction via Denoising Schemes and ANN: A Comparison. IEEE Sens. J. 2020, 20, 14317–14328.
  11. Li, Y.Q.; Pu, Z.Y.; Liu, P.; Qian, T.; Hu, Q.; Zhang, J.; Wang, Y. Efficient predictive control strategy for mitigating the overlap of EV charging demand and residential load based on distributed renewable energy. Renew. Energy 2025, 240, 122154.
  12. Wu, F.L.; Ye, H.B.; Bektas, T.; Dong, M. New and tractable formulations for the eco-driving and the eco-routing-and-driving problems. Eur. J. Oper. Res. 2025, 321, 445–461.
  13. Long, K.J.; Chen, B.H.; Gao, Z.B. Ecological Driving Strategy at Signalized Intersections in a Connected and Autonomous Vehicle Environment. J. Changsha Univ. Sci. Technol. 2025, 45, 130–142.
  14. Xu, W.H.; Li, X.; Wang, T.Q. Hybrid Scheduling Method for Conventional Bus and Demand-responsive Transit Based on Eco-driving Technology. J. Transp. Syst. Eng. Inf. Technol. 2025, 25, 227–240.
  15. Zhu, P.X.; Hu, J.J.; Li, J.J. Co-optimization Method for Personalized Eco-adaptive Cruise Control of Plug-in Hybrid Electric Vehicles Considering Uncertainties in Driving Style. J. Mech. Eng. 2025, 61, 214–229.
  16. Zhang, Y.L.; Fu, R.; Guo, Y.S. Eco-driving strategy for connected electric buses at the signalized intersection with a station. Transp. Res. Part D Transp. Environ. 2024, 128, 104076.
  17. Feng, Y.W.; Li, S.Y.; Hu, J. Data-driven Model Predictive Control Based Eco-driving Towards an Actuated Signalized Intersection. China J. Highw. Transp. 2025, 38, 57–69.
  18. Zhang, L.Q.; Zhu, Z.J.; Zhang, Z.Y. An improved method for evaluating eco-driving behavior based on speed-specific vehicle-specific power distributions. Transp. Res. Part D Transp. Environ. 2022, 113, 103476.
  19. Zhang, Y.L.; Yuan, W.; Wang, Y. Recognition model for eco-driving behavior of electric-buses entering and leaving stops. Energy 2025, 321, 135466.
  20. Niu, Q.Q. Research on the simulation of the collaborative following environment of intelligent networked vehicles based on ecological driving. Auto Driv. Serv. 2024, 8, 44–46.
  21. Li, Y.; Zhang, S.R.; Zhou, B. Eco-driving strategy for mixed platoons based on vehicle trajectory data. J. Chang’an Univ. 2025, 45, 138–152.
  22. Lu, K.; Li, D.J.; Wang, Q. Safe reinforcement learning-based eco-driving control for mixed traffic flows with disturbances. IEEE Trans. Intell. Transp. Syst. 2025, 26, 4948–4959.
  23. Qin, Y.Y.; Huang, Y.; Yan, S.Y. Deep Reinforcement Learning Control of CAV for Eco-driving in Random Environments at Signalized Intersections. China J. Highw. Transp. 2025, 38, 262–274.
  24. Krauss, S. Microscopic Modeling of Traffic Flow: Investigation of Collision Free Vehicle Dynamics. Ph.D. Thesis, University of Cologne, Köln, Germany, 1997.
  25. Chen, S.Y.; Piao, L.H.; Zang, X.D.; Luo, Q.; Li, J.; Yang, J.; Rong, J. Analyzing differences of highway lane-changing behavior using vehicle trajectory data. Phys. A Stat. Mech. Its Appl. 2023, 624, 128980.
Figure 1. Framework of Eco-Driving Strategy Optimization Based on Offline and Online Reinforcement Learning.
Figure 2. Power distribution in the parking state.
Figure 3. Total power distribution.
Figure 4. Power and parking distribution between stations and intersections.
Figure 5. Two-Phase Deep Reinforcement Learning Framework.
Figure 6. Simulation Scenario with consideration of the impact of Station and Intersection Layouts.
Figure 7. Trajectory Comparison Between SAC Model and Krauss Model.
Figure 8. Comparison of Cumulative Energy Consumption in Online Optimization.
Table 1. Summary of quantitative findings from eco-driving studies.

| Category | Reference | Core Methodology | Key Quantitative Findings | Context |
|---|---|---|---|---|
| Rule/Model-Based | Wu et al. [12] | Speed optimization with trigonometric model | Fuel consumption reduced by up to 19.5% | Signalized intersection |
| Rule/Model-Based | Long et al. [13] | Piecewise quadratic trajectory model | Fuel consumption reduced by 18%; delay reduced by 7.5% | High-saturation intersection |
| Rule/Model-Based | Xu et al. [14] | Speed intervals with operational constraints | System cost reduced by 16.2%; timetable synchronization improved by 81% | Bus timetable coordination |
| Optimization-Based | Zhu et al. [15] | Bayesian model and short-term prediction | Energy cost reduced by 4.77–23.97%; emissions reduced by 10.35–37.47% | Personalized eco-adaptive cruise control |
| Optimization-Based | Zhang et al. [16] | Speed pattern recognition and NSGA-II | Energy saving rates of 5.78–23.58% across driver types | Signalized intersection |
| Optimization-Based | Feng et al. [17] | MPC and signal timing prediction | Fuel consumption reduced by 8%; safety improved by 20% (<40 ms computation) | Actuated signalized intersection |
| Supervised Learning | Zhang et al. [18,19] | CatBoost behavior recognition | Recognition accuracy of 92.8% (entry) and 96.5% (exit) | Bus stop approach and departure |
| Supervised Learning | Niu et al. [20] | Cooperative car-following strategy | CO2 emissions reduced by ~15% | Intelligent and Connected Vehicles (ICVs) |
| Supervised Learning | Li et al. [21] | Two-phase speed strategy | Energy savings of 14.55–33.96% across queue scenarios | Mixed platoons based on trajectory data |
| Reinforcement Learning | Lu et al. [22] | RL with safety constraints | Overall energy efficiency improved by 10.88% | Mixed traffic flow with disturbances |
| Reinforcement Learning | Qin et al. [23] | Deep RL (enhanced TD3) | Fuel economy improved by 2.6–9.3% | Random environments at intersections |
| Reinforcement Learning | Xi et al. [24] | TD3-based optimization | Energy consumption reduced by 9.82–26.13% | Electric buses at signalized intersections |
Table 2. Description of Data Acquisition Parameters.

| Parameter | Description | Precision/Unit |
|---|---|---|
| Date | Date of data collection | / |
| Time | Timestamp within a day | 1 s |
| Longitude/Latitude | Geographic coordinates of vehicle front (GCJ-02 coordinate system) | 1 × 10⁻⁶° |
| Altitude | Elevation of vehicle front | 0.1 m |
| Gradient | Road slope angle at vehicle front | 0.1 rad |
| Mileage | Cumulative travel distance | 0.1 km |
| Speed | Vehicle speed | 0.1 km/h |
| SOC | State of Charge (% of remaining battery energy) | 0.01% |
| Remaining Energy | Absolute value of remaining battery energy | 1 Ah |
| Current | Battery current | 0.1 A |
| Voltage | Battery voltage | 0.1 V |
| Passenger Count | Number of onboard passengers (excluding driver) | 1 person |
Table 3. Comparison of Average Values of Each Metric Before and After Optimization.

| Metric | Original Trajectory | Optimized Trajectory | Optimization Effect |
|---|---|---|---|
| Average Speed (km/h) | 21.0 | 31.0 | +10.0 |
| Average Acceleration (m/s²) | −0.996 | −0.231 | +0.765 |
| Average Power (kW) | −10.0 | 4.0 | +14.0 |
| Average Cumulative Energy (kWh) | 0.219 | 0.185 | −0.034 |
Table 4. State and Action Spaces for SAC Training Model.

| Category | Attribute | Unit | Significance |
|---|---|---|---|
| Operational State | Current Speed | m/s | Directly reflects energy consumption level and travel progress |
| Operational State | Current Acceleration | m/s² | Indicates powertrain status and vehicle dynamic trends |
| Operational State | Distance to Next Stop | m | Supports stop/deceleration timing decisions |
| Operational State | Remaining Distance to Destination | m | Indicates overall mission progress |
| Operational State | Current Lane Speed Limit | m/s | Serves as control upper bound to prevent speeding |
| Operational State | Leading Vehicle Distance | m | Facilitates safe following distance maintenance |
| Operational State | Leading Vehicle Speed | m/s | Evaluates traffic dynamics for car-following planning |
| Operational State | Docking Status at Stop | - | Distinguishes driving vs. docking states for control logic adjustment |
| Operational State | Normalized Remaining Route Ratio | - | Standardizes route progress for global decision-making |
| Environmental Perception | Distance to Next Intersection | m | Anticipates traffic signal impact |
| Environmental Perception | Current Green Signal Status | - | Determines possibility for acceleration/deceleration planning |
| Environmental Perception | Current Signal Phase Index | - | Provides granular signal timing for sequential inference |
| Environmental Perception | Time to Next Signal Phase Switch | s | Predicts optimal passing windows for trajectory optimization |
| Output Action | Commanded Acceleration | m/s² | Controls longitudinal vehicle dynamics under current states |
Table 5. SAC Model Parameters.

| Module | Parameter Configuration | Module | Parameter Configuration |
|---|---|---|---|
| Network Architecture | MLP (256, 256) × 2 | Initial Entropy Temperature | 0.1 |
| Initial Learning Rate | 0.0003 | Replay Buffer Capacity | 100,000 |
| Batch Size | 128 | Total Training Steps | 150 |
| Discount Factor | 0.99 | State Dimension | 13 |
| Soft Update Coefficient | 0.005 (balanced ratio) | Action Dimension | 1 |
| Maximum Acceleration (m/s²) | 2 | Safe Time Distance (s) | 1.2 |
| Random Factor | 0.15 | Minimum Distance (m) | 2 |
Table 6. Simulation Parameters.

| Parameter | Unit | Value |
|---|---|---|
| Road Speed Limit | m/s | 14 (main road); 11 (branch road) |
| Max Acceleration | m/s² | +3 (acceleration); −3 (deceleration) |
| Signal Phase Cycles | s | 58; 75; 90 |
| Phase Duration Ratio | - | 0.5172; 0.6000; 0.4889 |
| Intersection Positions | m | 350; 850; 1350 |
| Station Positions | m | 640; 730; 990; 1395 |
| Simulation Timestep | s | 0.1 |

