Autonomous Vehicle Fuel Economy Optimization with Deep Reinforcement Learning

Abstract: The ever-increasing number of vehicles on the road puts pressure on car manufacturers to make their cars fuel-efficient. With autonomous vehicles, we can find new strategies to optimize fuel consumption. We propose a reinforcement learning algorithm that trains deep neural networks to generate a fuel-efficient velocity profile for autonomous vehicles, given road altitude information for the planned trip. We train our deep neural network model using a highly accurate, industry-accepted fuel economy simulation program. We developed a technique for adapting this heterogeneous simulation program on top of an open-source deep learning framework, and reduced the dimension of the problem output with a suitable parameterization to train the neural network much faster. The learned model, combined with reinforcement learning-based strategy generation, effectively generated a velocity profile that autonomous vehicles can follow to control themselves in a fuel-efficient way. We evaluate our algorithm's performance using the fuel economy simulation program for various altitude profiles. We also demonstrate that our method can teach neural networks to generate useful strategies that increase fuel economy even on unseen roads. Our method improved fuel economy by 8% compared to a simple grid search approach.


Introduction
An increase in the fuel economy of a vehicle can reduce environmental impact and cost. From 1990 to 2017, the average kilometers per liter (km/L) for all light-duty vehicles in the US increased by 18% [1]. Nevertheless, in the same period, greenhouse gas emissions from transportation increased by 71%, making transportation the second-fastest-growing emission source overall [2]. These statistics show that the increase in the number of new cars with better fuel economy adds more to total emissions than the retirement of older cars subtracts. One central area for improvement is how we drive the car along a given trajectory. Designing velocity profiles pertinent to given road grade information can reduce fuel consumption considerably. Since many vehicles now offer an autonomous driving mode, the autonomous driving system fully controls the vehicle, from path planning to selecting the speed it will take for a given trip. Accordingly, it is possible to make the vehicle more fuel-efficient in autonomous driving mode [3][4][5][6].
Studies on optimizing the energy efficiency of electric vehicles (EVs) are being carried out to reduce environmental impact. Precise prediction and planning of energy consumption is critical for EVs because of limited battery capacity and long recharging times. The effect of the reference cornering response on energy saving via control allocation is studied in [7]: the reference understeer characteristics affect the wheel torque distribution and the generated yaw moments, impacting energy consumption. Their experiment on a four-wheel-drive EV showed that more energy could be saved through appropriate shaping of the reference cornering response. A more comprehensive overview of existing torque distribution strategies for EVs with independently actuated drivetrains can be found in Reference [8]. In addition, Reference [9] derived a solution for energy-optimal routing using the A* search framework. Their A*-based solution is faster than the generic framework with the Dijkstra or Pallottino strategy.
Compared to human driving, using a cruise control system can increase fuel economy because it holds the vehicle speed steady. Advanced cruise control using model predictive control (MPC) can further optimize fuel economy. In Reference [10], the authors proposed a system to improve fuel economy on congested roads where traffic conditions change frequently and there are multiple intersections. The system captures the relevant information about traffic status and road grades and predicts the future motion of the preceding vehicles. It then computes the optimal vehicle control input using MPC. They showed that MPC can increase fuel economy by 15% compared to the Gipps model [10,11].
MPC methods can optimize fuel economy over finite time intervals with numeric methods and work well with linear systems. However, MPC is not well suited to long-term plans or to internal combustion engines (ICEs), which are the most common in automobiles. ICE models are highly nonlinear, making it hard to apply MPC in a straightforward manner.
A numerical method based on ordinary differential equations and differential-algebraic equations for nonlinear MPC (NMPC) of heavy-duty trucks was introduced in Reference [12]. This method allows for multiple predictive gear choices. However, the combination of nonlinear dynamics, constraints, and objectives with the hybrid nature of the gear choice results in a challenging combinatorial prediction problem.
Both linear and nonlinear MPC-based models may lead to unreliable results in real-world driving conditions if their parameters are not adjusted for varying conditions such as vehicle mass, weather, and fuel type. In Reference [13], the authors validated the concept of an adaptive nonlinear model predictive controller (ANLMPC) by implementing their algorithm in a vehicle equipped with a standard production powertrain control module. In their ANLMPC approach, the vehicle and fuel model parameters were adjusted automatically. During a real-world test on a sport utility vehicle (SUV), ANLMPC improved fuel economy by up to 2.4% on average compared to a production cruise controller with the same time of arrival. The works of References [14,15] incorporated quadratic programming to handle nonlinearity in improving fuel economy while cruising. Since NMPC controllers are difficult to design and can demand substantial CPU power, an approach was proposed to train a deep neural network (DNN) controller on data from a well-designed NMPC [16]. The DNN controller removes the need to solve complex optimization and state estimation problems in real time. Additionally, the DNN allows fast computation using a GPU. They showed that the MPC could be successfully replaced with the trained DNN controller.
Another approach to optimizing fuel in ICE vehicles is using dynamic programming (DP) to search for a velocity profile over the whole trip. In Reference [17], the method generates the optimal velocity profile in the distance domain using a DP algorithm. Cloud servers generate the route once the driver enters the destination, and traffic information is also used in solving the fuel optimization problem with the DP algorithm. They improved fuel economy by 5-15% without considerable time loss. Reference [18] also studied minimizing trip time and fuel consumption with a DP algorithm for a heavy diesel truck, reducing fuel consumption by 3.5% without a time increase on a 120 km route. However, the DP approach faces a crucial setback when the vehicle strays from its calculated optimal velocity profile: the DP algorithm has to recalculate the whole velocity profile from the current position to the destination, and the vehicle may again fail to follow it.
Deep reinforcement learning can overcome the above-mentioned issues of high complexity and changing environmental factors. A famous example of a deep learning application is DeepMind's AlphaGo [19]. Because of the vast search space and complexity of Go, artificial intelligence was expected to need decades to beat humans. However, with deep reinforcement learning, AlphaGo won 4-1 against Lee Se-dol, one of the best Go players in the world. AlphaGo showed that reinforcement learning can overcome high complexity and respond appropriately to changing situations. In Reference [20], the authors proposed a deep reinforcement learning method that controls the vehicle's velocity to optimize traveling time without losing dynamic stability. In Reference [21], deep reinforcement learning is used to control the electric motor's power output, optimizing a hybrid electric vehicle's fuel economy.
As discussed above, researchers have tried to optimize the fuel consumption of vehicles equipped with ICEs but confronted difficulties such as high nonlinearity and heavy computational load. We therefore propose a deep reinforcement learning-based algorithm that trains autonomous vehicles to optimize fuel while following a given trajectory. After the training phase, the learning algorithm is expected to generate the optimal velocity profile despite the high nonlinearity and to compute it quickly with the help of a high-performing GPU. This optimal speed profile, based on the given trajectory's altitude profile, will maximize the fuel economy improvement for autonomous vehicles or vehicles equipped with partial autonomous driving functionality, such as a smart cruise control system. Hyundai Motor's fuel economy simulation program (FESP), which calculates accurate fuel consumption for a given trip, was used to evaluate the learned velocity profile during training and testing. The fuel economy simulation, connected to the learning algorithm and treated as a black box, serves as a proxy for measuring the goodness of the learning algorithm. We developed a novel method that combines black-box numerical differentiation with backpropagation over our DNN to train the learning algorithm. Furthermore, we developed a deep reinforcement learning algorithm that considers both trip road information and vehicle hysteresis status and generates an optimal strategy, effectively reducing the dimension of the velocity profiles with a suitable parameterization for fast convergence. Our evaluation demonstrates that our method can teach a deep neural network to optimize fuel even on unseen roads.
The proposed deep reinforcement learning algorithm takes the grade information of the road ahead for a given journey and a time budget (TB) as input and produces an optimal velocity profile in the time domain. More specifically, the grade information is given as a sequence. Our model generates velocity blocks of T-second intervals as output, and the velocity profile is made by concatenating them. Our model is trained on a realistic fuel consumption simulation program for ICEs, treated as a black box for the purpose of calculating the gradient of the reward (representing the goodness of our DNN) with respect to the network weights. The contributions of this paper are as follows.

•
We developed a deep reinforcement learning based algorithm suitable for a fuel economy optimization problem.

•
We developed a method for training deep neural network weights by combining backpropagation and numeric differentiation when a heterogeneous program (written in Fortran in our case) intervenes in the process of training the network.

•
We represented the velocity profile in the time domain rather than in the spatial domain so that we can utilize the vehicle's time-varying states and control parameters. We further utilized a parameterized representation of the vehicle speed control strategies, rather than directly generating the velocity profile, for faster convergence to the optimal solution.

Problem Formulation
For an autonomous vehicle, or a vehicle equipped with smart cruise control, traveling a predetermined distance within a predetermined time, there are numerous possible speed profiles.
In this paper, we aim to find a speed profile that satisfies the specified time condition and maximizes fuel economy while the autonomous vehicle moves from A to B. A velocity profile is a velocity plan over a trip, often expressed as a sequence of velocities over distance or time, discretized at a sufficiently fine resolution.
A trip involves starting a vehicle at a specified velocity and driving along a given road until the end-point is reached. Our proposed method generates a velocity profile V_p, a vector of velocities v_t sampled evenly with sampling interval Ts, as output. Given the road's grade information and the time budget as input, V_p should lead to minimum fuel consumption for the autonomous vehicle. The road grade information is given as an array of distance (m) and grade (%) pairs. We call this array Grd = [(d_1, g_1), (d_2, g_2), ..., (d_N, g_N)], where the distances d_i are sampled with interval Ds, making d_i = (i − 1)·Ds. We list the variable names and their meanings in Table 1.
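As a concrete sketch (our own illustration, not the paper's code), the two central data structures can be built as follows; the helper name `make_grd` and the sample values are assumptions:

```python
# Illustrative sketch of the two core data structures: the road-grade
# array Grd and the velocity profile V_p.
Ds = 5.0   # distance sampling interval (m), as used in the experiments
Ts = 0.1   # time sampling interval (s)

def make_grd(grades_percent):
    """Grd = [(d_1, g_1), ..., (d_N, g_N)] with d_i = (i - 1) * Ds.
    Here the list index is 0-based, so d = index * Ds."""
    return [(i * Ds, g) for i, g in enumerate(grades_percent)]

grd = make_grd([0.0, 0.5, 1.2, 2.0, 1.1])
# V_p is simply a sequence of velocities sampled every Ts seconds.
v_p = [0.0, 0.3, 0.7, 1.2]  # m/s, starting from rest
```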

Grd(p_s): Grd for the road ahead, where distance is measured from the position p_s.

When the velocity profile is generated for a given trip, it is passed to a vehicle dynamics simulator to calculate the fuel consumption and fuel economy FE. In this study, we used the fuel economy simulation developed by the Hyundai Motor Company to calculate fuel consumption and fuel economy. Figure 1 shows the whole process from producing a velocity profile to calculating the fuel economy.

Fuel Economy Simulation Program
The Transmission Research Lab of the Hyundai Motor Company developed the FESP for this research, shown in Figure 2. The program calculates fuel economy from vehicle data such as engine BSFC (brake-specific fuel consumption), transmission efficiency, vehicle road load, road grade, and control characteristics, according to driving conditions. In addition, the shift pattern is optimized by the FESP: a dynamic shift pattern enables a smooth shift step strategy according to real-time optimal-point driving and a long-term driving strategy. As a result, the number of shifts is reduced and busy shifting is prevented, improving fuel economy and drivability.
We treat this simulation program as a black box so that we do not backpropagate through the fuel consumption model directly. This choice keeps further research free from a particular choice of fuel consumption model.

Problem Constraints
The vehicle starts at rest, fixing the first element in V_p to be 0. A simple rule-based program such as the one shown in Figure 4 can be used to ensure the vehicle stops exactly at the end-point. However, we allow the trip to finish at nonzero speed since our work focuses on optimizing fuel economy; stopping the vehicle can therefore be handled separately.
We limit the maximum speed to 150 km per hour (km/h) and the minimum speed to 0 km/h. To ensure that diverse V_p are explored in finding a fuel-efficient strategy, we gave a sufficiently long TB of 100 s per 1 km of distance. For example, a trip of 2 km length should take no more than 200 s to finish. It is acceptable to generate a V_p that travels 2 km in less than 200 s, but it must not take longer than 200 s in this case.
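The constraints above can be encoded in a few lines; this is a sketch with assumed helper names (`time_budget`, `clip_speed`), not the paper's implementation:

```python
# Problem constraints: speed bounds and the time budget rule.
V_MIN_KMH, V_MAX_KMH = 0.0, 150.0

def time_budget(distance_km):
    """TB: 100 s allowed per 1 km of trip distance."""
    return 100.0 * distance_km

def clip_speed(v_kmh):
    """Keep a candidate velocity inside the allowed [0, 150] km/h range."""
    return max(V_MIN_KMH, min(V_MAX_KMH, v_kmh))

tb = time_budget(2.0)  # a 2 km trip must finish within 200 s
```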

Reinforcement Learning
Reinforcement learning is a framework describing agents that act in a given environment to maximize their cumulative reward. The state S_t denotes all the information needed to characterize the environment at time t. The agent chooses an action A_t according to the state S_t. This action results in a new state S_{t+∆T}, where ∆T is the fixed time length that one action takes. Determining which action to take is governed by the policy π: π(A_t, S_t) denotes the probability that the agent takes action A_t in a given state S_t. Whenever the agent visits a state S_t, the agent is given a reward r_t. An episode is the sequence of state, action, and reward triples from the start state to the end state. The cumulative reward R is the sum of all r_t in the episode.
In our study, we want a DNN, which we call the velocity block generator, to generate a speed profile for a trip. The current state S_t contains the vehicle position, Grd for the road ahead, the velocity profile up to the current time V_p(t), and the remaining time allowed for the trip, TB − t. The velocity block generator determines the policy deterministically, i.e., π(A_t, S_t) is 1 for the output action and 0 for all other actions. The reward is given only at the terminal state, making R simply the FE of the trip if the trip time is no more than TB. If the trip takes longer, we give a small penalty proportional to the distance left to the destination.
The terminology of reinforcement learning in our problem is organized in Table 2. In this reinforcement learning framework, our velocity block generator takes S_t and returns A_t, which generates the acceleration between time t and t + ∆T and hence the velocities for this period. These velocities, called the velocity block V_p(t, t + ∆T), make up the whole V_p once the terminal state is reached. In Figure 5, the velocity profile generator calls the velocity block generator multiple times to generate V_p, where the velocity block generator is our DNN model. Figure 6 shows this process in more detail. The DNN takes the state through its direct inputs and through the hidden state of its long short-term memory (LSTM) layers. Its output generates the accelerations and velocities V_p(t, t + ∆T). These velocity blocks make up the whole V_p for the trip. The fuel consumption simulation program takes this V_p along with Grd to calculate the fuel economy.
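The control loop just described can be sketched as follows. This is a hedged outline of the interaction only: `run_episode`, the simple distance integration, and the stub functions standing in for the DNN and the FESP are our assumptions, not the paper's implementation.

```python
def run_episode(block_generator, fesp, grd, tb, dT=5.0, Ts=0.1, trip_len=2000.0):
    """Call the velocity block generator until the trip ends, then score
    the whole velocity profile with the black-box fuel economy program."""
    v_p, pos, t = [0.0], 0.0, 0.0            # the vehicle starts at rest
    while pos < trip_len and t < tb:
        state = (pos, grd, v_p, tb - t)      # S_t: position, road, history, time left
        block = block_generator(state)       # V_p(t, t + dT): one velocity block
        v_p.extend(block)
        pos += sum(block) * Ts               # integrate distance over the block
        t += dT
    return fesp(v_p, grd)                    # terminal reward R (fuel economy)

# Exercise the loop with stand-ins for the DNN and the FESP.
constant_block = lambda state: [10.0] * 50   # 5 s of 10 m/s at Ts = 0.1 s
dummy_fesp = lambda v_p, grd: 15.0           # pretend FE in km/L
fe = run_episode(constant_block, dummy_fesp, grd=[], tb=200.0)
```

In the paper, `block_generator` is the trained DNN and `fesp` is the Fortran simulation; only the control flow is shown here.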

Deep Neural Network Architecture
The DNN model used, shown in Figure 7, is structured to handle all the input data: the present condition of the drive, past velocity profiles, and grade information for the road ahead. Only the Grd of the road ahead and V_p(t − ∆T, t) are passed through long short-term memory (LSTM) layers to reduce dimensionality. The rest of the past velocity profile is represented indirectly by the previous hidden states of the LSTM layers. The LSTM layers that process Grd and V_p are called the Road model and the Car model, respectively. Through these models, high-level features of the road and the vehicle state are produced. These feature vectors are then concatenated with conditions such as how much time is left on the clock. They are all processed together through fully connected (dense) layers to return Y_t. The action Y_t generates the velocity profile block V_p(t, t + ∆T). We have freedom in representing this period with a finite number of parameters. However, there are a few things to consider when choosing the parameterization. First, even random outputs should be interpretable as a plausible vehicle movement: the DNN, in the first stage of training, will produce random outputs, and if these do not correspond to plausible movements, the episode is terminated and the DNN learns nothing about fuel-saving strategies. Second, the cumulative reward R and its gradient should be as smooth as possible in the DNN outputs; otherwise, numeric differentiation of R is not suitable for training the DNN. Third, the number of parameters should be as small as possible while still representing diverse vehicle movements, to shorten the time needed to calculate the numerical differentiation of R with respect to the outputs. Figure 8 illustrates these three points. Our scheme sets acceleration values at a few fixed time points, fills in the intermediate values with straight lines, and then integrates to make V_p(t, t + ∆T). Concretely, for the DNN output Y_t = [y_0, y_1, y_2, ..., y_{M−1}], the number y_0 determines the acceleration magnitude at t + ∆T, and the other numbers y_1, y_2, ..., y_{M−1} determine how the transition from a_t to a_{t+∆T} takes place.
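One plausible reading of this parameterization can be sketched in code: place M key-point accelerations at fixed times inside [t, t + ∆T], fill the intermediate values by linear interpolation, and integrate to get the velocity block. The exact mapping from y_1, ..., y_{M−1} to key points is our assumption, not the paper's specification.

```python
import numpy as np

def velocity_block(v_start, key_accels, dT=5.0, Ts=0.1):
    """Interpolate M key-point accelerations linearly over [0, dT] and
    integrate them (Euler, step Ts) into a velocity block."""
    M = len(key_accels)
    key_times = np.linspace(0.0, dT, M)       # fixed key-point times
    t = np.arange(Ts, dT + Ts / 2, Ts)        # sample times in (0, dT]
    a = np.interp(t, key_times, key_accels)   # piecewise-linear acceleration
    return v_start + np.cumsum(a) * Ts        # integrate to velocities

# Five key accelerations (m/s^2) spanning a 5 s block sampled at 0.1 s.
block = velocity_block(0.0, [0.0, 0.5, 1.0, 0.5, 0.0])
```

This illustrates why few parameters suffice: five numbers determine all 50 velocity samples of the block, and any choice of key values yields a physically plausible, continuous acceleration profile.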

Numerical Differentiation
To update θ, the DNN's learnable parameters, we use the gradient ascent algorithm directly on R with learning rate λ.
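The update rule referred to here (our reconstruction, consistent with the surrounding description of gradient ascent on R with learning rate λ) can be written as:

```latex
\theta_{\text{new}} \leftarrow \theta + \lambda \, \frac{\partial R}{\partial \theta}
```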
In this case, R is a function of θ, since θ determines the action, which in turn determines R.
We have to differentiate through the FESP to calculate the last term in (6), the gradient of R with respect to θ. We cannot use automatic differentiation in this step since the FESP is a black-box program. Instead, we use numerical differentiation.
The derivative of a real function f at x is defined as the limit below.
The term inside the limit, called Newton's difference quotient, approximates f′(x) for a small nonzero number h. The error of this approximation is of order O(h). Similarly, we use a symmetric difference quotient to approximate the derivative.
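The two quotients can be compared directly in code; this small sketch (not from the paper) shows the accuracy difference on a known derivative:

```python
# Forward (Newton) difference quotient: O(h) error.
def forward_diff(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h

# Symmetric difference quotient: first-order error terms cancel, O(h^2).
def symmetric_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Example: d/dx of x^3 at x = 2 is exactly 12.
err_fwd = abs(forward_diff(lambda x: x**3, 2.0) - 12.0)
err_sym = abs(symmetric_diff(lambda x: x**3, 2.0) - 12.0)
```

For h = 1e-5, the symmetric quotient is several orders of magnitude more accurate, which is why it is the better choice when each function evaluation (an FESP run) is expensive.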
With the symmetric difference quotient, the approximation error is of order O(h^2), since the first-order error terms cancel each other. Thus, we will use Equation (8) to calculate R's gradient. However, another level of indirection is needed to calculate this gradient, because the deep learning network has millions of parameters.
We will use the chain rule to divide the term ∂R/∂θ into a multiplication of two gradients.
Here, Y is the tensor of all the outputs Y_t. The gradient ∂R/∂Y is relatively easy to compute with Equation (8), as the number of DNN outputs over a trip is on the order of hundreds. Furthermore, ∂Y/∂θ can be differentiated automatically using standard deep learning libraries such as TensorFlow and PyTorch.
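The hybrid gradient can be sketched as follows. Here ∂R/∂Y is estimated numerically against a black-box reward (standing in for the FESP), while ∂Y/∂θ would come from the framework's automatic differentiation; in this NumPy-only sketch a hand-written Jacobian stands in for that autodiff step, and all function names and values are our own illustration.

```python
import numpy as np

def black_box_reward(Y):
    """Stand-in for the FESP: a reward we can only evaluate, not differentiate."""
    return -np.sum((Y - 1.0) ** 2)

def numeric_grad(f, Y, h=1e-5):
    """dR/dY via symmetric differences, one per output (hundreds in the paper)."""
    g = np.zeros_like(Y)
    for i in range(Y.size):
        e = np.zeros_like(Y)
        e[i] = h
        g[i] = (f(Y + e) - f(Y - e)) / (2.0 * h)
    return g

theta = np.array([0.5, 2.0])
J = np.array([[1.0, 0.0],               # dY/dtheta: in practice produced by
              [0.0, 1.0],               # TensorFlow/PyTorch autodiff, written
              [1.0, 1.0]])              # out by hand here for the sketch
Y = J @ theta                           # proxy for the DNN outputs
dR_dtheta = J.T @ numeric_grad(black_box_reward, Y)   # chain rule
```

The key point is the split: only the cheap, low-dimensional ∂R/∂Y requires black-box evaluations; the millions-of-parameters factor ∂Y/∂θ never touches the FESP.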
Algorithms 1-3 are pseudocode showing how numeric differentiation is implemented in the training process. The Velocity_Block_Generator (VBG) algorithm takes the velocity block of the previous timestep and the time left, and lets the DNN generate V_p for the trip.

Time Interval Experiment
The time interval ∆T has a profound impact on the training step and on the trained results. With a bigger ∆T, fewer actions are needed to finish the trip, resulting in faster training iterations. However, a small number of actions in a drive reduces the velocity block generator's controllability of the vehicle; it might settle into sub-optimal strategies that are less responsive to the environment. Thus, we can balance training time and training quality by adjusting ∆T. We experimented with ∆T = 5, 10, and 20 s on an easy 2 km long road with one hill and compared the results. To get reliable metrics for the training results, we repeated numerous trainings for all three ∆T values. This repetition is necessary since the initial parameters of the velocity block generator significantly impact the training outcomes.
We repeat 50 policy search rounds for each ∆T, where a policy search round, as shown in Algorithm 5, is defined as a single velocity profile generation trial with random initialization of a velocity block generator, composed of playing and learning from 20 episodes. Comparing the policy search rounds clarifies the impact of ∆T on the time and quality of training. All three results also confirm our method's ability to train the velocity block generator against a black-box fuel consumption program for an ICE. The time sampling Ts and distance sampling Ds are set to 0.1 s and 5 m, respectively.

Generalization Experiment
An advantage of the deep learning method is that it can generalize to unseen roads. We fixed ∆T to 5 s and made ten 5 km roads. The velocity block generator learned from nine of these roads (roads no. 1-9) and was tested on the one unseen road (road no. 10).
In this case, during a policy search round, the velocity block generator makes trips on all ten roads, including the test road, so we can observe its behavior during training. The velocity block generator's learnable parameters are updated from all nine training trips at once. After 20 updates, we evaluate how it did on the test road over the 20 iterations to see whether it generalized. The time sampling Ts and distance sampling Ds are set to 0.1 s and 5 m, respectively.

Evaluation Metrics
We measured the average FE, top-5 FE, best FE, and average search time over the 50 policy search rounds. For each policy search round, the highest FE value over its 20 episodes is selected. Then the average, top-5 average, and maximum are calculated over the 50 policy search rounds, respectively. The FE increase is the increase rate of the best FE compared to the FE of a constant-speed cruise control velocity profile. The constant-velocity cruise control profile is made by accelerating at α (m/s²) until the top speed ν km/h is reached; the velocity is then kept constant until the trip ends. The values of α and ν are determined by a simple grid search, explained in Appendix A. If the average FE of the last five trips of a policy search round is higher than that of the first five trips, we mark the round as a success. The success rate is the percentage of successful rounds among the 50 policy search rounds; this rate shows how stable the training is. We list these metrics in Table 3 to clarify their definitions.
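The aggregation of these metrics can be sketched as follows (an assumed helper following the definitions above; the sample FE values are invented for illustration):

```python
def summarize(best_fe_per_round, baseline_fe):
    """Aggregate the per-round best FE values (one per policy search round)
    into the Table 3 metrics. baseline_fe is the constant-cruise FE."""
    s = sorted(best_fe_per_round, reverse=True)
    best = s[0]
    return {
        "average_fe": sum(s) / len(s),                        # Average FE
        "top5_fe": sum(s[:5]) / 5,                            # Top 5 FE
        "best_fe": best,                                      # Best FE
        "fe_increase_pct": 100.0 * (best - baseline_fe) / baseline_fe,
    }

# Six illustrative rounds instead of 50, baseline from the grid search.
m = summarize([14.0, 15.5, 13.8, 16.2, 15.0, 14.4], baseline_fe=15.0)
```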

Implementation Environment
The training algorithm code was written mainly in Python 3.6 with TensorFlow 2.3 on Ubuntu 18.04. The FESP, which calculates an episode's fuel economy, was written in the Fortran programming language. It is accessible from the main Python program through NumPy's F2PY module, which enables Fortran subroutines to be executed in Python programs. For hardware, a Titan RTX graphics card and an Intel i9-9980XE CPU with 128 GB of RAM were used. Although the Titan RTX graphics card was used in training, TensorFlow's GPU memory usage was limited to 4 GB.

Time Interval Experimental Results
The training results for each ∆T had very high variance across the set's 50 policy search rounds. Nevertheless, as Figure 10 shows, averaging the fuel economy of the trips in each set reveals a tendency to rise as training progresses. The overall performance of each set is listed in Table 4; the best case and a detailed analysis of it are shown in Figures 11 and 12, respectively. Our results show that training was stable, with a success rate of 100% for all ∆T. The set with ∆T = 5 s was better on average, but the very best FE was found in the set with ∆T = 20 s. At the beginning of training, larger ∆T seems better on average, but this reverses at the end stage of training, where smaller ∆T does better on average. The FE increase was at least 5.1% compared to constant-speed cruise control. The average repetition times increase exponentially as ∆T gets smaller.

Generalization Experiment Results
The DNN learned to optimize FE on all ten roads. The V_p generated on these roads are shown in Figure 13. The network appears to have learned to accelerate to 70 km/h until it reaches an uphill or downhill and then decelerate or accelerate accordingly. The FE values resulting from constant-velocity cruising and from our velocity profile generator are compared in Table 5. We also verified the DNN's generalization through another experiment: the DNN uncovered strategies on the training roads (roads no. 1-9) and made a speed profile with reasonable FE for the unseen road (road no. 10). Unfortunately, we encountered some difficulties during the training loops: the DNN sometimes falls into a local-maximum FE strategy.

Conclusions and Future Work
We proposed a novel method, based on deep reinforcement learning, for improving fuel economy given the road grade information for a trip and the allowed time budget. Our algorithm was trained using various examples of road grades and time budgets, and the output velocity profile was evaluated using Hyundai Motor Group's Fuel Economy Simulation Program. We demonstrated that our algorithm increased fuel economy by up to 8% over normal constant-velocity cruising. We also learned that, contrary to our prediction, smaller ∆T is not always better. As mentioned in the results section, Figure 10 shows that larger ∆T seems better in the earlier stage of training. This seems to be because larger ∆T inherently generates smoother velocity profiles.
As shown in Table 4, for ∆T = 10 s, it takes about 186 s to generate a velocity profile and train the velocity profile generator 20 times. Thus, only 9.3 s is needed on average to generate and learn, and it takes only about 3 s to plan the whole 2 km trip and only 0.2 s for planning for the next

Figure 1 .
Figure 1. Input/output diagram of the proposed method. The velocity profile generator takes road grade information and the time budget as input and returns the velocity profile as output. The FESP calculates fuel economy at the end of the trip. Our goal is to generate velocity profiles that maximize FE.

Figure 2 .
Figure 2. Fuel economy simulation program developed by the Hyundai Motor Company. The calculation time of this program is 55 ms in FTP-75 mode. As shown in Figure 3, this program's fuel consumption rate is almost equivalent to that of a commercial program in most situations.

Figure 3 .
Figure 3. Comparison of the fuel consumption rate calculated by our program and another commercial program.

Figure 4 .
Figure 4. (a) The velocity at the end-point is not constrained. (b) The velocity is strictly 0 at the end-point.

Figure 5 .
Figure 5. The velocity profile generator calls the velocity block generator multiple times until the trip is finished. The velocity block generator is made with a DNN, and the dotted arrows represent the hidden states of the LSTM layers inside this neural network. It takes the state as input and outputs M values. These values are interpolated to form accelerations, which are integrated to form a velocity block.

Figure 6 .
Figure 6. The DNN, which is the velocity block generator, takes the state through direct inputs and through the hidden state of its long short-term memory (LSTM) layers. Its output generates accelerations and velocities, V_p(t, t + ∆T). These velocity blocks make up the whole V_p for the trip. The fuel consumption simulation program takes this V_p along with Grd to calculate the fuel economy.

Figure 7 .
Figure 7. Deep neural network (DNN) model structure. Sequential data is passed through LSTM layers, reducing its dimensionality. LSTM layers are suitable for processing sequential data and extracting meaningful features. These feature tensors are concatenated with non-sequential inputs and processed together through the dense layers to return the output tensors.

Figure 8 .
Figure 8. (a) Not a plausible vehicle acceleration; it results in an error during the learning process. (b) A plausible vehicle acceleration; however, 26 numbers are used to represent this simple line. (c) The same vehicle acceleration represented with only six numbers, likely a better parameterization of this acceleration profile.

Figure 9 .
Figure 9. Parameterization example with ∆T = 10 and M = 5. In this case, the value of y_0 is 1.0 m/s², the values of y_1 and y_2 are −0.49 m/s², and those of y_3 and y_4 are 0.49 m/s². With these values, we determined the key points, shown as red arrows in the diagram.

Figure 10 .
Figure 10. Fuel economy changes across training iterations. The points in the graph are the averages of the 50 policy search rounds in each set.

Figure 11 .
Figure 11. The best trip in the three sets. The velocity generator found the strategy of accelerating up to 70 km/h until the uphill starts and slowly decelerating, except for a slight acceleration given during the downhill.

Figure 12 .
Figure 12. Detailed analysis of the best trip in the three sets.

Figure 13 .
Figure 13. Generalization results on road no. 10. The trip was made at the final iteration of the 20 training iterations. Roads no. 1-9 were used in training, whereas road no. 10 was not. All trips, including that on road no. 10, seem to have a fuel-saving velocity profile for the road. The results for roads no. 1-9 are shown in Appendix B.

Figure A2 .
Figure A2. The final result for the nine roads (no. 1-9) used in the training process.

Table 1 .
Terminologies and definitions used in the fuel optimization problem.

Table 2 .
Reinforcement learning terminologies and explanations.

Table 3 .
Policy search round evaluation metrics.

Table 4 .
Policy search round evaluation.

Table 5 .
Comparison between constant velocity cruise and our algorithm.
Road | FE of Constant Velocity Cruise (km/L) | FE of Velocity Profile Generator (km/L) | FE Increase (%)