Weakly Supervised Reinforcement Learning for Autonomous Highway Driving via Virtual Safety Cages

The use of neural networks and reinforcement learning has become increasingly popular in autonomous vehicle control. However, the opaqueness of the resulting control policies presents a significant barrier to deploying neural network-based control in autonomous vehicles. In this paper, we present a reinforcement learning based approach to autonomous vehicle longitudinal control, where the rule-based safety cages provide enhanced safety for the vehicle as well as weak supervision to the reinforcement learning agent. By guiding the agent to meaningful states and actions, this weak supervision improves the convergence during training and enhances the safety of the final trained policy. This rule-based supervisory controller has the further advantage of being fully interpretable, thereby enabling traditional validation and verification approaches to ensure the safety of the vehicle. We compare models with and without safety cages, as well as models with optimal and constrained model parameters, and show that the weak supervision consistently improves the safety of exploration, speed of convergence, and model performance. Additionally, we show that when the model parameters are constrained or sub-optimal, the safety cages can enable a model to learn a safe driving policy even when the model could not be trained to drive through reinforcement learning alone.


Introduction
Autonomous driving has gained significant attention within the automotive research community in recent years [1][2][3]. The potential benefits in improved fuel efficiency, passenger safety, traffic flow, and ride sharing mean self-driving cars could have a significant impact on issues such as climate change, road safety, and passenger productivity [4][5][6].
Deep learning techniques have been demonstrated to be powerful tools for autonomous vehicle control, due to their capability to learn complex behaviours from data and generalise these learned rules to completely new scenarios [7]. These techniques can be divided into two categories based on the modularity of the system. On one hand, modular systems divide the driving task into multiple sub-tasks such as perception, planning, and control, and deploy a number of sub-systems and algorithms to solve each of these tasks. On the other hand, end-to-end systems aim to learn the driving task directly from sensory measurements (e.g., camera, radar) by predicting low-level control actions.
End-to-end approaches have recently risen in popularity, due to their ease of implementation and better leverage of the function approximation capability of Deep Neural Networks (DNNs). However, the high level of opaqueness in DNNs is one of the main limiting factors in the use of neural network-based control techniques in safety-critical applications, such as autonomous vehicles [8][9][10][11]. As the DNNs used to control autonomous vehicles become deeper and more complex, their learned control policies and any potential safety issues within them become increasingly difficult to evaluate. This is made more challenging still by the complex environment in which autonomous vehicles have to operate, as it is impossible to evaluate the safety of these systems in all possible scenarios they may encounter once deployed [12][13][14]. One class of solutions to introduce safety in machine learning enabled autonomous vehicles is to utilise a modular approach to autonomous driving, where the machine learning systems are used mainly in the decision making layer, whilst low-level control is handled by more interpretable rule-based systems [15][16][17]. Alternatively, safety can be guaranteed through redundancy for end-to-end approaches, where machine learning can be used for vehicle motion control with an additional rule-based virtual safety cage acting as a supervisory controller [18,19]. The purpose of the rule-based virtual safety cage is to check the safety of the control actions of the machine learning system, and to intervene in the control of the vehicle if the safety rules imposed by the safety cages are breached. Therefore, during normal operation, the more intelligent machine learning-based controller is in control of the vehicle. However, if the safety of the vehicle is compromised, the safety cages can step in and attempt to bring the vehicle back to a safe state through a more conservative rule-based control policy.
This work extends our previously developed Reinforcement Learning (RL) based vehicle following model [20] and virtual safety cages [21]. We make important extensions to our previous work by integrating our safety cages into the RL algorithm. The safety cages not only act as a safety function enhancing the vehicle's safety, but are also used to provide weak supervision during training, by limiting the amount of unnecessary exploration and providing penalties to the agent when breached. In this way, the vehicle can be safe during training by avoiding collisions when the RL agent takes unsafe actions. More importantly, the efficiency of the training process is improved, as the agent converges to an optimal control policy with fewer samples. We also evaluate our proposed framework on less safe agents with smaller neural networks, and show significant improvement in the final learned policies when it is used to train these shallow models. Our contributions can be summarised as follows:

• We combine the safety cages with reinforcement learning by intervening on unsafe control actions, as well as providing an additional learning signal for the agent to enable safe and efficient exploration.
• We compare the effect of the safety cages during training for both models with optimised hyperparameters, as well as less optimised models which may require additional safety considerations.
• We test all trained agents without safety cages enabled, in both naturalistic and adversarial driving scenarios, showing that even if the safety cages are only used during training, the models exhibit safer driving behaviour.
• We demonstrate that by using the weak supervision from the safety cages during training, the shallow model which otherwise could not learn to drive can be enabled to learn to drive without collisions.
The remainder of this paper is structured as follows. Section 2 discusses related work and explains the novelty of our approach. Section 3 provides the necessary theoretical background for the reader and describes the methodology used for the safety cages and reinforcement learning technique. The results from the simulated experiments are presented and discussed in Section 4. Finally, the concluding remarks are presented in Section 5.

Autonomous Driving
A brief overview of relevant works in this field is given in this section. For a more in-depth view of deep learning based autonomous driving techniques, we refer interested readers to the review in [7]. One of the earliest works in neural control for autonomous driving was Pomerleau's Autonomous Land Vehicle In a Neural Network (ALVINN) [22], which learned to steer an autonomous vehicle by observing images from a front facing camera, using the recorded steering commands of a human driver as training data. Among the first to adapt techniques such as ALVINN to use deep neural networks was NVIDIA's PilotNet [23]. PilotNet was trained for lane keeping using supervised learning with a total of 72 h of recorded human driving as training data.
Since then, these works have inspired a number of deep learning techniques, with imitation learning often being the preferred learning technique. For instance, Zhang et al. [24] and Pan et al. [25] extended the popular Dataset Aggregation (DAgger) [26] imitation learning algorithm to the autonomous driving domain, demonstrating that autonomous vehicle control can be learned from vision. While imitation learning based approaches have shown important progress in autonomous driving [27][28][29][30], they present limitations when deployed in environments beyond the training distribution [31]. Driving models relying on supervised techniques are often evaluated using performance metrics on pre-collected validation datasets [32]; however, low prediction error in offline testing is not necessarily correlated with driving quality [33]. Even when demonstrating desirable performance during closed-loop testing in naturalistic driving scenarios, imitation learning models often degrade in performance due to distributional shift [26], unpredictable road users [34], or causal confusion [35] when exposed to a variety of driving scenarios.
In contrast, RL-based techniques have shown promising results for autonomous vehicle applications [36][37][38]. These RL approaches are advantageous for autonomous vehicle motion control, as they can learn general driving rules, which can also adapt to new environments. Indeed, many recent works have utilised RL for longitudinal control in autonomous vehicles with great success [39][40][41][42][43]. This is largely due to the fact that longitudinal control can be learned from low-dimensional observations (e.g., relative distances, velocities), which partially overcomes the sample-efficiency problem inherent in RL. Moreover, the reward function for RL is easier to define in the longitudinal control case (e.g., based on safety distances to vehicles in front). For these reasons, we focus on longitudinal control and extend our previous work on RL-based longitudinal control in a highway driving environment [20].

Safety Cages
Virtual safety cages have been used in several cyber-physical systems to provide safety guarantees when the controller is not interpretable. The most straightforward application of such safety cages is to limit the possible actions of the controller to ensure the system is bounded to a safe operational envelope. If the controller issues commands that breach the safety cages, the safety cages step in and attempt to recover the system back to a safe state. This type of approach has been used to guarantee the safety of complex controllers in different domains such as robotics [44][45][46][47], aerospace [48], and automotive applications [49][50][51]. Heckemann et al. [18] suggested that these safety cages could be used to ensure the safety of black box systems in autonomous vehicles by utilising the vehicle's sensors to monitor the state of the environment, and then limiting the actions of the vehicle in safety-critical scenarios. Demonstrating the effectiveness of this approach, Adler et al. [49] proposed five safety cages based on the Automotive Safety Integrity Levels (ASIL) defined by ISO 26262 [52] to improve the safety of an autonomous vehicle with machine learning based controllers. Focusing on path planning in urban environments, Yurtsever et al. [53] combined RL with rule-based path planning to provide safety guarantees in autonomous driving. Similar approaches have also been used for highway driving, by combining rule-based systems with machine learning based controllers for enhanced driving safety [54,55].
In our previous work [21], we developed two safety cages for highway driving, and demonstrated these safety cages can be used to prevent collisions when the neural network controllers make unpredictable decisions. Furthermore, we demonstrated that the interventions by the safety cages can be used to re-train the neural networks in a supervised learning approach, enabling the system to learn from its own mistakes and further making the controller more robust. However, the main limitation of the safety cage approach was that the re-training happened in an offline manner, where the learning was broken down into three stages: (i) supervised training, (ii) closed-loop evaluation with safety cages, and (iii) re-training using the safety cage interventions as labels for supervised learning.
Here, we extend this approach by utilising the safety cages to improve the safety of an RL-based vehicle motion controller, and by using the interventions of the safety cages as weak supervision which enables the system to learn to drive more safely in an online manner. Weak supervision has been shown to improve the efficiency of exploration in RL [56] by guiding the agent towards useful directions during exploration. Here, the weak supervision enhances the exploration process in two ways: the safety cages stop the vehicle from taking unsafe actions, thereby eliminating the unsafe part of the action space from the exploration, and they maintain the vehicle in a safe state, thereby reducing the number of states that need to be explored. Reinforcement learning algorithms often struggle to learn efficiently at the beginning of training, since initially the agent is taking largely random actions, and it can take a significant amount of training before the agent starts to take the correct actions which are needed to learn its task. Therefore, by utilising weak supervision to guide the agent to the correct actions and states, the efficiency of the early training stage can be improved. We show that eliminating the unsafe parts of the exploration space improves convergence during training, which can be a significant advantage considering the low sample efficiency of RL. Furthermore, we show that the safety cages eliminate the collisions that would normally happen during training, which could be a further advantage of our technique, should the training occur in a real-world system where collisions are undesirable.

Reinforcement Learning
Reinforcement learning can be formally described as a Markov Decision Process (MDP). The MDP is denoted by a tuple $\{S, A, P, R\}$, where $S$ represents the state space, $A$ represents the action space, $P$ denotes the state transition probability model, and $R$ is the reward function [57]. As shown in Figure 1, at each time step $t$, the agent takes an action $a_t$ from the possible set of actions $A$, according to its policy $\pi$, which is a mapping from states $s_t$ to actions $a_t$. Based on the action taken in the current state, the environment then transitions to the next state $s_{t+1}$ according to the transition dynamics $p(s_{t+1} | s_t, a_t)$ as given by the transition probability model $P$. The agent then observes the new state $s_{t+1}$ and receives a scalar reward $r_t$ according to the reward function $R$. The aim of the agent in the RL setting is to maximise the total accumulated returns $R_t$:
$$R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$$
where $T$ is the final time step of the episode and $\gamma \in [0, 1]$ is the discount factor used to prioritise immediate rewards over future rewards.
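As a minimal illustration of the discounted return described above, the following Python sketch accumulates a finite sequence of rewards starting from the current time step. The function name `discounted_return` and the example values are illustrative assumptions and are not taken from the original implementation.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**k * rewards[k], where rewards[0] is the reward at step t."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Work backwards so that each earlier step discounts everything after it once.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

With $\gamma = 1$ the function reduces to a plain sum of rewards, while smaller values of $\gamma$ weight immediate rewards more heavily.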

Deep Deterministic Policy Gradient
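Deep Deterministic Policy Gradient (DDPG) [58] extends the Deterministic Policy Gradient algorithm by Silver et al. [59] by utilising DNNs for function approximation. It is an actor-critic based off-policy RL algorithm, which can scale to high-dimensional and continuous state and action spaces. DDPG uses the state-action value function, or Q-function, $Q(s, a)$, which estimates the expected returns after taking an action $a_t$ in state $s_t$ under policy $\pi$. Therefore, given a state visitation distribution $\rho^\pi$ under policy $\pi$ in environment $E$, the Q-function is denoted by:
$$Q^\pi(s_t, a_t) = \mathbb{E}_{r_{i \geq t},\, s_{i > t} \sim E,\, a_{i > t} \sim \pi}\left[ R_t \mid s_t, a_t \right]$$
The Q-function can be estimated by the Bellman equation for deterministic policies as:
$$Q^\pi(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1} \sim E}\left[ r_t + \gamma Q^\pi(s_{t+1}, \pi(s_{t+1})) \right]$$
As the expectations depend only on the environment, the critic network can be trained off-policy, using transitions from a different stochastic policy $\beta$ with the state visitation distribution $\rho^\beta$. The parameters of the critic network $\theta^Q$ can then be updated by minimising the critic loss $L_Q$:
$$L_Q = \mathbb{E}_{s_t \sim \rho^\beta,\, a_t \sim \beta,\, r_t \sim E}\left[ \left( Q(s_t, a_t | \theta^Q) - y_t \right)^2 \right]$$
where
$$y_t = r_t + \gamma Q(s_{t+1}, \pi(s_{t+1}) | \theta^Q)$$
The actor network parameters $\theta^\pi$ are then updated using the policy gradient [59] from the expected returns from a start distribution $J$ with respect to the actor parameters $\theta^\pi$:
$$\nabla_{\theta^\pi} J \approx \mathbb{E}_{s_t \sim \rho^\beta}\left[ \nabla_a Q(s, a | \theta^Q) |_{s = s_t, a = \pi(s_t)} \, \nabla_{\theta^\pi} \pi(s | \theta^\pi) |_{s = s_t} \right]$$
For updating the networks, mini-batches are drawn from a replay memory $D$, which is a finite sized buffer storing state transitions $e = [s_t, a_t, r_t, s_{t+1}]$. To avoid divergence and improve stability of training, DDPG utilises target networks [60], which copy the parameters of the actor and critic networks. These target networks, the target actor $\pi'(s | \theta^{\pi'})$ and the target critic $Q'(s, a | \theta^{Q'})$, are updated slowly based on the learned network parameters to improve stability:
$$\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau) \theta^{Q'}, \qquad \theta^{\pi'} \leftarrow \tau \theta^\pi + (1 - \tau) \theta^{\pi'}$$
where $\tau \ll 1$ is the mixing factor, a hyperparameter controlling the speed of target network updates.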
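Two of the steps above, the critic regression target and the slow target network update, can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: network parameters are represented as a hypothetical dictionary of flat weight lists (`Params`), and the function names `td_target` and `soft_update` are not from the original implementation, which would use a deep learning framework.

```python
from typing import Dict, List

# Hypothetical container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def td_target(reward: float, next_q: float, gamma: float) -> float:
    """Critic regression target: reward plus discounted next-state Q estimate."""
    return reward + gamma * next_q


def soft_update(target: Params, learned: Params, tau: float) -> Params:
    """Move every target parameter a fraction tau towards the learned value."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    return {
        name: [
            tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(learned[name], target[name])
        ]
        for name in target
    }


if __name__ == "__main__":
    target_critic = {"w": [0.0, 2.0]}
    critic = {"w": [1.0, 2.0]}
    print(soft_update(target_critic, critic, tau=0.25))
    print(td_target(reward=1.0, next_q=4.0, gamma=0.5))
```

A small mixing factor means the target networks trail the learned networks, so the regression target changes slowly between updates.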
To encourage the agent to explore the possible actions in continuous action spaces, noise is added to the actions of the deterministic policy $\pi(s_t | \theta^\pi)$. This exploration policy $\pi_e(s_t)$ samples noise from a noise process $\mathcal{N}$, which is added to the actor policy:
$$\pi_e(s_t) = \pi(s_t | \theta^\pi) + \mathcal{N}$$
Here, the chosen noise process $\mathcal{N}$ is the Ornstein-Uhlenbeck process [61], which generates temporally correlated noise for efficient exploration in physical control problems.
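The exploration policy can be sketched as follows, using a simple Euler-Maruyama discretisation of the Ornstein-Uhlenbeck process. The class and function names are illustrative, and the values of `theta` and `sigma` are assumed placeholders (the paper's tuned values are in Table 1, which is not reproduced here). The default `dt` of 0.04 s corresponds to the 25 Hz sampling rate used in this work, and clipping to $[-1, 1]$ is an assumption based on the stated action range.

```python
import math
import random


class OrnsteinUhlenbeckNoise:
    """Discretised OU process: dx = theta * (mu - x) * dt + sigma * dW."""

    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, dt=0.04, x0=0.0, seed=None):
        # theta and sigma are assumed placeholder values, not the paper's settings.
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.x = x0
        self._rng = random.Random(seed)

    def sample(self) -> float:
        # Mean-reverting drift plus a Gaussian increment scaled by sqrt(dt).
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self._rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def exploratory_action(policy_action: float, noise: OrnsteinUhlenbeckNoise) -> float:
    """Add one noise sample to the policy action and clip to [-1, 1]."""
    return max(-1.0, min(1.0, policy_action + noise.sample()))


if __name__ == "__main__":
    noise = OrnsteinUhlenbeckNoise(seed=0)
    print([round(exploratory_action(0.2, noise), 3) for _ in range(5)])
```

Because each sample depends on the previous one, consecutive perturbations are correlated in time rather than independent, which is the property that motivates this noise process for physical control.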

Safety Cages
Virtual safety cages provide interpretable rule-based safety for complex cyber-physical systems. The purpose of these safety cages is to limit the actions of the system to a safe operational envelope. The simplest way to achieve this would be to impose fixed upper or lower limits on the system's action space. However, by using run-time monitoring to observe the state of the environment, the safety cages can dynamically select the control limits based on the current states. Therefore, the system can be limited in its possible courses of action when faced with a safety-critical scenario, such as a near-accident situation on a highway. We utilise our previously presented safety cages [21], which limit the longitudinal control actions of a vehicle based on the Time Headway (TH) and Time-To-Collision (TTC) relative to the vehicle in front. The TH and TTC metrics represent the risk of potential forward collision with the vehicle in front, and are calculated as:
$$TH = \frac{x_{rel}}{v}$$
$$TTC = \frac{x_{rel}}{v_{rel}}$$
where $x_{rel}$ is the distance between the two vehicles in m, $v$ is the velocity of the host vehicle in m/s, and $v_{rel}$ is the relative velocity between the two vehicles in m/s. The TTC and TH metrics were chosen as the states monitored by the safety cages as they represent the risk of potential collision with the vehicle in front, thereby providing effective safety measurements for our vehicle following use-case. We utilise two metrics as the TTC and TH provide complementary information: the TTC measures the time to a forward collision assuming both vehicles continue at their current speeds, whilst the TH measures the distance to the vehicle in front in time and makes no assumptions about the lead vehicle's actions. For example, when the host vehicle is driving significantly faster than the vehicle in front, the TTC approaches zero as the distance between the vehicles decreases and correctly captures the risk of a forward collision. However, in a scenario where both vehicles are driving close to each other but at the same speed, the TTC will not signal a high risk of collision, even though a collision would be likely if the lead vehicle began to brake. In such a scenario, the two vehicles will have a low headway, therefore monitoring the TH will correctly inform the safety monitors of a collision risk.
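The two metrics can be computed as in the following sketch. The function names are illustrative, and two conventions are assumptions not stated in the text: the relative velocity is taken as the closing speed (host speed minus lead speed), and both metrics are reported as infinite when the denominator is not positive, i.e., when the host is stationary or the gap is not shrinking.

```python
import math


def time_headway(x_rel: float, v_host: float) -> float:
    """TH = x_rel / v: time for the host to cover the current gap, in seconds."""
    return x_rel / v_host if v_host > 0.0 else math.inf


def time_to_collision(x_rel: float, v_host: float, v_lead: float) -> float:
    """TTC = x_rel / v_rel, with v_rel assumed to be the closing speed."""
    closing_speed = v_host - v_lead  # assumption: positive when the gap shrinks
    return x_rel / closing_speed if closing_speed > 0.0 else math.inf


if __name__ == "__main__":
    # Close following at equal speeds: TTC sees no risk, TH does.
    print(time_to_collision(10.0, v_host=30.0, v_lead=30.0))
    print(time_headway(10.0, v_host=30.0))
```

With both vehicles travelling at 30 m/s only 10 m apart, the TTC is infinite while the TH is roughly 0.33 s, which reproduces the close-following case described above where only the headway signals the risk.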
The risk levels for both safety cages are as defined in [21], where the aim was to identify potential collisions in time to prevent them, whilst minimising unnecessary interventions on the control of the vehicle. The different risk levels and associated minimum braking values are illustrated in Figure 2. For each safety cage, there are three risk levels for which the safety cages will enforce a minimum braking value on the vehicle, with higher risk levels using an increased rate of braking relative to the associated safety metric. When the vehicle is in the low risk region, no minimum braking is necessary and the RL agent is in full control of the vehicle. The minimum braking values enforced by the safety cages can be formally defined as shown in (11)-(12). The braking value is normalised to the range [0, 1], where 0 is no braking and 1 is the maximum braking value. In this framework, both safety cages provide a recommended braking value, which is then compared to the current braking action from the RL agent. The final braking value used for the vehicle motion control, b, is then chosen as the largest braking value between the two safety cages and the RL agent, as given by (13).
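The arbitration between the RL agent and the two safety cages can be sketched as below. The thresholds and braking values in the two tables are hypothetical placeholders chosen only to make the example runnable; the actual risk levels are those defined in [21] and Figure 2. The piecewise-constant lookup is also a simplification, and all names are illustrative. Only the final step, taking the largest of the three braking values, follows directly from the description above.

```python
import math
from typing import Sequence, Tuple

# Each entry: (upper bound of the metric in seconds, minimum braking in [0, 1]),
# sorted by ascending bound. ALL VALUES ARE HYPOTHETICAL PLACEHOLDERS.
RiskTable = Sequence[Tuple[float, float]]
HYPOTHETICAL_TH_TABLE: RiskTable = ((0.5, 1.0), (1.0, 0.5), (1.5, 0.2))
HYPOTHETICAL_TTC_TABLE: RiskTable = ((1.0, 1.0), (2.0, 0.5), (4.0, 0.2))


def minimum_braking(metric: float, table: RiskTable) -> float:
    """Minimum braking of the first (highest) risk level the metric falls into."""
    for upper_bound, braking in table:
        if metric < upper_bound:
            return braking
    return 0.0  # low risk region: no minimum braking is enforced


def final_braking(
    b_rl: float,
    th: float,
    ttc: float,
    th_table: RiskTable = HYPOTHETICAL_TH_TABLE,
    ttc_table: RiskTable = HYPOTHETICAL_TTC_TABLE,
) -> Tuple[float, bool]:
    """Return (largest braking value, whether a safety cage overrode the agent)."""
    b_cage = max(minimum_braking(th, th_table), minimum_braking(ttc, ttc_table))
    return max(b_rl, b_cage), b_cage > b_rl


if __name__ == "__main__":
    print(final_braking(b_rl=0.1, th=2.0, ttc=math.inf))  # agent stays in control
    print(final_braking(b_rl=0.0, th=0.8, ttc=10.0))      # headway cage intervenes
```

Returning the intervention flag alongside the braking value is a design choice of this sketch: it makes it straightforward to record when a cage was breached, for example to apply a training penalty.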

Highway Vehicle Following Use-Case
The vehicle following use-case was framed as a scenario on a straight highway, with two vehicles travelling in a single lane. The host vehicle controlled by RL is the follower vehicle, and its aim is to maintain a 2 s headway from the lead vehicle. The lead vehicle velocities were limited to $v_{lead} \in [17, 40]$ m/s, and coefficient of friction values were chosen from the range [0.4, 1.0] for each episode. During training, the lead vehicle's acceleration was limited to $\dot{v}_{lead} \in [-2, 2]$ m/s², except for emergency braking manoeuvres, which occurred on average once an hour and used an acceleration in the range $\dot{v}_{lead} \in [-6, -3]$ m/s². The output from the RL agent is the gas and brake pedal values, which are continuous action values used to control the vehicle. As in [20], a neural network is used to estimate the longitudinal vehicle dynamics, by inferring the vehicle response to the pedal actions from the RL agent. This neural network acts as a type of World Model [62], providing an estimation of the simulator environment. This has the advantage that the neural network can be deployed on the same GPU as the RL network during training, thereby speeding up training significantly. The World Model was trained with 2,364,041 time-steps from the IPG CarMaker simulator under different driving policies, combining a total of 45 h of simulated driving. This approach was shown in [20] to speed up training by up to a factor of 20, compared to training with the IPG CarMaker simulator. However, to ensure the accuracy of all results, we also evaluate all trained policies in IPG CarMaker (Section 4.1).

Training
The DDPG model is trained in the vehicle following environment for 5000 episodes, where each episode lasts up to 5 min or until a collision occurs. The simulation is sampled at 25 Hz, therefore each time-step has a duration of 40 ms. The training parameters of the DDPG were tuned heuristically, and the final values can be found in Table 1. The critic uses a single hidden layer, followed by the output layer estimating the Q value. The actor network utilises 3 hidden feedforward layers, followed by a Long Short-Term Memory (LSTM) layer [63] and then the action layer. The actor network outputs the vehicle control action, for which the action space is represented by a single continuous value $a_t \in [-1, 1]$, where positive values represent the use of the gas pedal and negative values represent the use of the brake pedal. The observations of the agent are composed of 4 continuous state-values, which are the host vehicle velocity $v$, host vehicle acceleration $\dot{v}$, relative velocity $v_{rel}$, and time headway $TH$, such that $s_t = [v, \dot{v}, v_{rel}, TH]^T$. To enable the LSTM to learn temporal correlations, the mini-batches for training were sampled as consecutive time-steps, with the LSTM cell state reset between each training update. To encourage the agent to learn a safe vehicle following policy, a reward function based on its current headway and headway derivative was defined in [20] based on the reward function by Desjardins & Chaib-Draa [64], as shown in Figure 3. The agent gains the maximum reward when it is close to the target headway of 2 s, whilst straying further from the target headway results in smaller rewards. The headway derivative is used in the reward function to encourage the vehicle to move towards the target headway, by giving small positive rewards as it moves closer to the target and penalising the agent when it is moving further away from the target region.
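The interface between the agent and the vehicle described above can be sketched as follows. The helper names are illustrative, and two details are assumptions rather than statements from the paper: the pedal position is taken to be the magnitude of the action, and out-of-range actions are clipped to $[-1, 1]$.

```python
from typing import List, Tuple


def action_to_pedals(action: float) -> Tuple[float, float]:
    """Split a single action in [-1, 1] into (gas, brake), each in [0, 1]."""
    a = max(-1.0, min(1.0, action))  # clipping is an assumption of this sketch
    gas = max(a, 0.0)     # positive actions use the gas pedal
    brake = max(-a, 0.0)  # negative actions use the brake pedal
    return gas, brake


def build_observation(v: float, v_dot: float, v_rel: float, th: float) -> List[float]:
    """Assemble the 4-dimensional state: velocity, acceleration, relative velocity, headway."""
    return [v, v_dot, v_rel, th]


if __name__ == "__main__":
    print(action_to_pedals(0.3))
    print(action_to_pedals(-0.6))
    print(build_observation(v=25.0, v_dot=0.1, v_rel=-1.5, th=2.0))
```

Using one signed action instead of two separate pedal outputs prevents the agent from pressing both pedals at once.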
For further comparison, we also train the model with an additional penalty for breaching the safety cages, such that the final reward is given as follows:
$$r_t = r_{th} + r_{sc}$$
where $r_t$ is the reward for time-step $t$, $r_{th}$ is the headway based reward function as shown in Figure 3, and $r_{sc}$ is the safety cage penalty, equal to $-0.1$ if a safety cage is breached and 0 otherwise. The episode rewards during training can be seen in Figure 4, where three models are compared. The three models are DDPG only, DDPG+SC which is DDPG with safety cages, and DDPG+SC (no penalty) which is DDPG with safety cages but without the $r_{sc}$ penalty. As can be seen, the DDPG+SC model has lower rewards at the beginning of training as it receives additional penalties compared to the other two models. However, after the initial exploration the DDPG+SC is the first model to reach the optimal rewards per episode (∼7500 rewards), demonstrating improved convergence. Comparing the DDPG+SC models with and without penalties from the safety cages shows the model with the penalties converges to the optimal solution sooner, suggesting the penalty improves convergence during training. An additional benefit of the safety cages here is the safety of exploration, as the DDPG model collided 30 times during training, whilst the DDPG+SC model had no collisions during training. However, it can be seen that all three models converge to the same level of performance, therefore no significant difference in the trained policies can be concluded from the training rewards alone.
As an additional investigation of the effect of the safety cages on less safe control policies, we train two further models utilising smaller neural networks with constrained parameters. These models use the same parameters as in Table 1, except that they have only a single hidden layer with 50 neurons and no LSTM layer. We refer to these models as Shallow DDPG and Shallow DDPG+SC. It should be noted that the parameters of these models were not tuned for better performance, and indeed sub-optimal parameters were chosen on purpose to enable better insight into the effect of the safety cages in unsafe systems. The episode rewards for the two shallow models during training are shown in Figure 5. As can be seen, these two models have a more significant difference in training performance. The Shallow DDPG struggles to learn a feasible driving policy, whilst the Shallow DDPG+SC learns to drive without collisions, although at a lower level of overall performance compared to the deeper models.
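The safety cage penalty used for the DDPG+SC model can be expressed as a small reward-shaping function. In this sketch the headway reward is passed in as a number, since its shape is defined by Figure 3; the function name and the `use_penalty` switch (which distinguishes DDPG+SC from DDPG+SC without penalty) are illustrative assumptions.

```python
SAFETY_CAGE_PENALTY = -0.1  # penalty value stated in the text


def shaped_reward(r_th: float, cage_breached: bool, use_penalty: bool = True) -> float:
    """Headway reward plus the safety cage penalty when a cage is breached."""
    r_sc = SAFETY_CAGE_PENALTY if (cage_breached and use_penalty) else 0.0
    return r_th + r_sc


if __name__ == "__main__":
    print(shaped_reward(1.0, cage_breached=False))
    print(shaped_reward(0.5, cage_breached=True))
    print(shaped_reward(0.5, cage_breached=True, use_penalty=False))
```

In the no-penalty variant the cages still override unsafe braking actions, but the agent receives no explicit learning signal for having breached them.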

Results
To investigate the performance of the learned control policies, we evaluate the vehicle follower models in various highway driving scenarios. We utilise two types of testing for this evaluation. Naturalistic testing tests the control policies in typical driving scenarios, giving an idea of how the control policies perform in everyday driving. Adversarial testing utilises an adversarial agent to create safety-critical scenarios, showing how the vehicle performs in dangerous edge cases where collisions are likely to occur. The controller performance in both types of scenario is important: since most driving scenarios on the road fall into naturalistic driving, the controller must be able to drive efficiently and safely in these scenarios, but it must also be able to remain safe in dangerous edge cases in order to avoid collisions. To enable better analysis of the performance of the RL-based control policies, no safety cages are used during testing, so the vehicle follower models must depend on their own learned knowledge to keep the vehicle safe. This also enables better understanding of the effect of using the safety cages during training on the final learned control policy.

Naturalistic Testing
For the naturalistic driving, similar lead vehicle behaviours were used to those during training, with velocities in the range [17, 40] m/s and accelerations in the range [−2, 2] m/s². The exception to this was the harsh braking manoeuvres which occurred, on average, once an hour with decelerations in the range [−6, −3] m/s². At the start of the episode, the coefficient of friction is randomly chosen in the range [0.4, 1.0] and each episode lasts until 5 min has passed or a collision occurs. For each driving model, a total of 120 test scenarios were completed, totalling up to 10 h of testing. All driving for these tests occurred in the IPG CarMaker simulation environment to ensure accuracy of the results. Two types of baselines are provided for comparison: the IPG Driver is the default driver in the CarMaker Simulator and A2C is the Advantage Actor Critic [65] based vehicle follower model in [20].
The results from the naturalistic driving scenarios are summarised in Table 2. The table shows the RL based models outperform the default IPG Driver, with the exception of the shallow models. The results demonstrate that both DDPG based models outperform the previous A2C-based vehicle follower model. However, comparing the DDPG and DDPG+SC models shows the benefit of using the safety cages during RL training. While in most scenarios the two models have similar performance (the mean values seen are approximately equal), the minimum headway by the DDPG+SC during testing is higher, showing it can maintain a safer distance from the lead vehicle. However, as both models can maintain a safe distance without collisions this difference is not significant by itself. Therefore, investigating the difference between the Shallow DDPG and Shallow DDPG+SC models provides further insight into the role the safety cages play in supervision during RL training. Similar to the training rewards, the shallow models show a more extreme difference between the two models. The results show the Shallow DDPG model without safety cages fails to learn to drive safely, whilst the Shallow DDPG+SC model avoids collisions safely, although it comes relatively close to collisions with a minimum time headway at 0.79 s. This shows the benefit of the safety cages in guiding the model towards a safe control policy during training.

Adversarial Testing
Utilising machine learning to expose weaknesses in safety-critical cyber-physical systems has been shown to be an effective method for finding failure cases [66,67]. We utilise the Adversarial Testing Framework (ATF) presented in [34], which utilised an adversarial agent trained through RL to expose over 11,000 collision cases in machine learning based autonomous vehicle control systems. The adversarial agent is trained through A2C [65] with a reward function $r_A$ based on the inverse headway:
$$r_A = \min\left(\frac{1}{TH}, 100\right)$$
This reward function encourages the adversarial agent to minimise the headway and make collisions happen, while capping the reward at 100 ensures that the reward does not tend to infinity as the headway reaches zero. As the lead vehicle used in the adversarial testing can behave very differently to those seen during training, this testing focuses on investigating the models' generalisation capability as well as their response to hazardous scenarios. Each DDPG model is tested under two different velocity ranges: the first limits the lead vehicle's velocity to the same range as the training scenarios, $v_{lead} \in [17, 40]$ m/s, and the second uses a lower velocity range of $v_{lead} \in [12, 30]$ m/s, which enables the ATF to expose collisions more easily. For each model, 3 different adversarial agents were trained, such that results can be averaged between these 3 training runs. The minimum episode TH during the training of the adversarial agent can be seen for both deep models over the 2500 training episodes in Figures 6 and 7. These tests show that both deep models can maintain a safe distance from the lead vehicle even when the lead vehicle is attempting to cause collisions intentionally. A slight difference between the two models can be seen, as the DDPG+SC model has a slightly higher headway on average as well as significantly less variance. Nevertheless, as both deep models remain at a safe distance from the adversarial agent, these models can be considered safe even in safety-critical edge cases.
Comparing the two shallow models in Figures 8 and 9, a more significant difference can be seen. While both models are worse in performance than the deep models, the Shallow DDPG is significantly easier to exploit than the Shallow DDPG+SC model. The Shallow DDPG model continues to cause collisions during the adversarial testing, whilst the Shallow DDPG+SC model remains at a safer distance. In the training conditions, the Shallow DDPG+SC remains relatively safe, with no decrease in the minimum headway during the training of the adversarial agent, although it can be seen that the variance increases as the training progresses. In the lower velocity case, the Shallow DDPG+SC still avoids collisions, but the adversarial agent is able to reduce the minimum headway significantly further. This shows that the safety cages have helped the model learn a significantly more robust control policy, even when the model uses sub-optimal parameters. Without the additional weak supervision from the safety cages, it can be seen that these shallow models would not have been able to learn a reasonable driving policy. Therefore, the weak supervision by the safety cages can be used to train models with sub-optimal parameters. In addition, for models with optimal parameters, the safety cages provide improved convergence during training and slightly improved safety in the final trained policy.
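The capped inverse-headway reward of the adversarial agent can be written as a short function. The function name is illustrative, and returning the cap for a zero or negative headway is an assumption made here to avoid a division by zero at the moment of collision.

```python
def adversarial_reward(th: float, cap: float = 100.0) -> float:
    """Inverse time headway, capped so the reward stays finite near collisions."""
    if th <= 0.0:
        return cap  # assumption: treat contact as the maximum reward
    return min(1.0 / th, cap)


if __name__ == "__main__":
    for headway in (2.0, 0.5, 0.001, 0.0):
        print(headway, adversarial_reward(headway))
```

At the 2 s target headway the adversarial agent earns only 0.5 per step, and the cap is reached once the headway falls to 0.01 s or below, so the reward grows sharply as the follower is forced closer to the lead vehicle.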

Conclusions
In this paper, a reinforcement learning technique combined with rule-based safety cages was presented. The safety cages provide a safety mechanism for the autonomous vehicle in case the neural network-based controller makes unsafe decisions, thereby enhancing the safety of the vehicle and providing interpretability in the vehicle motion control system. In addition, the safety cages are used as weak supervision during training, guiding the agent towards useful actions and away from dangerous states.
We compared the model with safety cages to a model without them, and showed improvements in the safety of exploration, the speed of convergence, and the safety of the final control policy. In addition to improved training efficiency, simulated testing scenarios demonstrated that, even with the safety cages disabled, the model which used them during training learned a safer control policy, maintaining a minimum headway of 1.69 s in a safety-critical scenario, compared to 1.53 s without safety cage training. We additionally tested the proposed approach on shallow models with constrained parameters, and showed that the shallow model with safety cage training was able to drive without collisions, whilst the shallow model without safety cage training collided in every test scenario. These results demonstrate that the safety cages enabled the shallow models to learn a safe control policy, whereas without them the shallow models were not able to learn a feasible driving policy. This shows that the safety cages add beneficial supervision during training, enabling the model to learn from the environment more effectively. Therefore, this work provides an effective way to combine reinforcement learning based control with rule-based safety mechanisms, not only to improve the safety of the vehicle, but also to incorporate weak supervision in the training process for improved convergence and performance.
This work opens up multiple potential avenues for future work. The use-case in this study was a simplified vehicle following scenario. However, extending the safety cages to consider both longitudinal and lateral control actions, as well as potential objects in other lanes, would allow the technique to be applied to more complex use-cases such as urban driving. Moreover, comparing the use of the weak supervision for different use-cases or learning algorithms (e.g., on-policy vs. off-policy RL) would help with understanding the most efficient use of weak supervision in reinforcement learning. Furthermore, extending the reinforcement learning agent to use higher-dimensional inputs, such as images, would allow investigation into how the increased speed of convergence helps in cases where sample-inefficient reinforcement learning algorithms struggle. Finally, using the safety cages presented here in real-world training could better demonstrate the benefit in both safety and efficiency of exploration, compared to the simulated scenario presented in this work.