# High-Level Sensor Models for the Reinforcement Learning Driving Policy Training


## Abstract


## 1. Introduction

#### 1.1. Related Work

#### 1.2. Motivation

## 2. Sensor Models

#### 2.1. Dynamic Environment Perception

#### 2.1.1. Interfaces

#### 2.1.2. Model of Dynamic Environment Perception Stack

- False positive detections. Various physical phenomena, such as multipath reflections of a radar wave, may introduce non-existent objects into the object list ${\widehat{\mathbf{S}}}_{d}$, leading to situations in which ADAS/AD algorithms assume the presence of potentially dangerous objects in unoccupied areas.
- False negative detections. Limited sensor performance, as well as performance degradations caused by difficult environmental conditions such as bad weather, may lead to missed object detections, i.e., situations in which a dynamic object present in the ego’s vicinity is not represented in the object list ${\widehat{\mathbf{S}}}_{d}$.
- State estimation errors. Partial occlusions and performance degradations may lead to potentially dangerous differences between the ground truth state description ${\mathbf{S}}_{{d}_{i}}$ of a particular object and its estimate ${\widehat{\mathbf{S}}}_{{d}_{i}}$ produced by the sensor stack. It should be noted that, due to the filtering properties of the tracking algorithm, state estimation errors may be time-correlated.
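The three error classes above can be sketched as a post-processing step applied to a ground-truth object list. The toy model below is only an illustration: the function and field names and the default probabilities are simplified placeholders, not the calibrated models evaluated in this paper.

```python
import random

def perturb_object_list(objects, p_fn=0.001, p_fp=0.0175, sigma_pos=0.5, rng=None):
    """Apply the three error classes to a ground-truth object list:
    false negatives (dropped objects), false positives (phantom objects),
    and additive state estimation noise on positions."""
    rng = rng or random.Random(0)
    perceived = []
    for obj in objects:
        if rng.random() < p_fn:  # false negative: object missed entirely
            continue
        noisy = dict(obj)
        noisy["x"] = obj["x"] + rng.gauss(0.0, sigma_pos)  # state noise
        noisy["y"] = obj["y"] + rng.gauss(0.0, sigma_pos)
        perceived.append(noisy)
    if rng.random() < p_fp:  # false positive: phantom object injected
        perceived.append({"x": rng.gauss(45.0, 4.0),
                          "y": rng.gauss(0.0, 1.0),
                          "phantom": True})
    return perceived
```

Note that this memoryless sketch cannot reproduce the time correlation mentioned for state estimation errors; the models described later address that with correlated noise processes.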

#### 2.2. Static Environment Perception

#### Model of Static Environment Perception Stack

## 3. Driving Policy

#### 3.1. Proximal Policy Optimization

#### 3.2. Direct Control Policy

- Ego state observed at time $t$, defined as ${\mathbf{o}}_{\mathrm{ego}}^{(t)}=\left[{v}_{{s}_{e}}^{(t)},{v}_{{ex}_{e}}^{(t)},{a}_{{s}_{e}}^{(t)},{a}_{{s}_{e}}^{(t-1)},{\gamma}_{e}^{(t)}\right]$, where ${v}_{{s}_{e}}^{(t)}$ is the current longitudinal velocity, ${v}_{{ex}_{e}}^{(t)}$ is the speed limit execution, defined as the ratio of ${v}_{{s}_{e}}^{(t)}$ to the current speed limit, ${a}_{{s}_{e}}^{(t)}$ denotes the current longitudinal acceleration, ${a}_{{s}_{e}}^{(t-1)}$ the acceleration at the previous time update, and ${\gamma}_{e}^{(t)}$ the ratio of the current yaw rate to the absolute velocity.
- Other road users’ state, where each vehicle perceived by the ego is described with a vector ${\mathbf{o}}_{{\mathrm{obj}}_{i}}={\left[{\mathbf{q}}_{i},{\mathbf{x}}_{i},{\psi}_{i},{\mathbf{v}}_{i},{\mathbf{a}}_{i}\right]}^{T}$, where ${\mathbf{q}}_{i}\in {\mathbb{R}}^{2}$ is the width and length of the vehicle, ${\mathbf{x}}_{i}\in {\mathbb{R}}^{2}$ its position relative to the ego vehicle, ${\psi}_{i}\in \mathbb{R}$ its rotation with respect to the ego, ${\mathbf{v}}_{i}={[{v}_{{s}_{i}},{v}_{{d}_{i}}]}^{T}$ the vehicle’s velocity relative to the ego, and ${\mathbf{a}}_{i}\in {\mathbb{R}}^{2}$ its relative acceleration. Depending on the setup, the observation is created based on the sensor models’ output or on the ground truth data.
- Lane markers state, where each lane marker registered by the ego’s perception system is encoded with a vector ${\mathbf{o}}_{{\mathrm{lm}}_{i}}={\left[{\mathbf{d}}_{{\mathrm{lm}}_{i}},{h}_{{\mathrm{lm}}_{i}},{\gamma}_{{\mathrm{lm}}_{i}},{m}_{{\mathrm{lm}}_{i}}\right]}^{T}$, where ${\mathbf{d}}_{{\mathrm{lm}}_{i}}\in {\mathbb{R}}^{10}$ is a vector of uniformly spaced lateral position samples that describe the lane marker’s geometry, ${h}_{{\mathrm{lm}}_{i}}\in \mathbb{R}$ is the observed length of the marker, ${\gamma}_{{\mathrm{lm}}_{i}}\in \mathbb{R}$ is the marker’s rotation at the point adjacent to the ego’s position, and ${m}_{{\mathrm{lm}}_{i}}\in \{0,1\}$ encodes the marker type: ${m}_{{\mathrm{lm}}_{i}}=0$ if the marker is a broken line, and ${m}_{{\mathrm{lm}}_{i}}=1$ otherwise.
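As a minimal sketch of the first bullet, the ego observation could be assembled as follows (plain Python; the function name and the guard against zero velocity are illustrative assumptions, not the paper’s implementation):

```python
def ego_observation(v_s, speed_limit, a_s, a_s_prev, yaw_rate):
    """Assemble the ego observation o_ego^(t) described above as a list:
    longitudinal velocity, speed limit execution, current and previous
    longitudinal acceleration, and yaw rate over absolute velocity."""
    v_ex = v_s / speed_limit                # speed limit execution
    gamma = yaw_rate / max(abs(v_s), 1e-6)  # guard against division by zero
    return [v_s, v_ex, a_s, a_s_prev, gamma]
```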

#### 3.3. Rewards

- Speed limit execution, calculated at each step as the ratio of the ego’s current velocity to the current speed limit, multiplied by the factor ${r}_{\mathrm{speed}\_\mathrm{limit}}$.
- Action values, specifically the squared acceleration and squared steering angle, scaled by the factors ${r}_{\mathrm{acc}}$ and ${r}_{\mathrm{steer}}$, respectively.
- Lane centering, defined as the current distance of the ego vehicle’s center from the lane center, multiplied by the factor ${r}_{\mathrm{centering}}$.
- Time To Collision (TTC), calculated as the time at which the ego and another road user would collide if neither of them changed their current longitudinal acceleration or lateral position. If constant accelerations would not lead to a collision, or the TTC exceeds ${r}_{\mathrm{TTC}\_\mathrm{max}}$, this reward component is set to 0; otherwise, the TTC scaled by the ${r}_{\mathrm{TTC}}$ factor is assigned.
- Terminal states reward component, assigned in the event of a collision between the ego and another road user or a road barrier, or of exceeding the speed limit by $10~\mathrm{m}/\mathrm{s}$ or more.
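Putting these components together, a per-step reward could look as follows. The scaling factors are taken from the reward table in Appendix A, but the exact functional forms (in particular the inverse-TTC term) are one plausible reading of the component descriptions, not a confirmed reproduction of the paper’s reward:

```python
def step_reward(speed_ratio, acc, steer, center_dist, ttc=None, terminal=False):
    """One plausible composition of the per-step reward from the listed
    components; factor values follow the reward table in Appendix A."""
    r = 0.04 * speed_ratio ** 2        # speed limit execution (squared)
    r += -0.003 * acc ** 2             # acceleration penalty
    r += -1.0 * steer ** 2             # steering penalty
    r += -0.006 * abs(center_dist)     # lane centering penalty
    if ttc is not None and ttc <= 6.0: # only near-collision TTC counts
        r += -0.01 / ttc               # inverse-TTC penalty (assumed form)
    if terminal:
        r += -10.0                     # collision or speed limit violation
    return r
```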

#### 3.4. Training Setup

## 4. Evaluation Setup

#### 4.1. Baseline Sensor Models

- introduction of state estimation errors through the addition of Gaussian noise,
- limitation of the observed lane markers distance to a value drawn from a normal distribution,
- simulation of the false negative object detection errors by random assignment of the binary visibility flag at each timestep,
- simulation of the false positive object detections through the creation of single-timestep objects with normally distributed state values,
- disturbance of the observed lane markers geometry performed through adding the Gaussian noise to the coefficients of the lane markers polynomials.
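The list above can be sketched for a single object as a memoryless Gaussian model, using the variances listed for the baseline setup in Appendix A. Since the noise is redrawn independently at each call, errors are white in time, in contrast to the correlated-noise models proposed in this paper; field names are illustrative placeholders:

```python
import random

def baseline_perceive(obj, rng):
    """Baseline (memoryless Gaussian) perception of a single object,
    using the variances of the baseline model table in Appendix A."""
    if rng.random() < 0.1:  # false negative, probability 0.1
        return None
    return {
        "x": obj["x"] + rng.gauss(0.0, 1.2 ** 0.5),  # longitudinal position
        "y": obj["y"] + rng.gauss(0.0, 0.7 ** 0.5),  # lateral position
        "v": obj["v"] + rng.gauss(0.0, 2.0 ** 0.5),  # velocity
        # size errors are lower-limited at -1.0 m
        "length": max(obj["length"] + rng.gauss(0.0, 0.5 ** 0.5),
                      obj["length"] - 1.0),
    }
```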

#### 4.2. Test Scenarios

#### 4.3. Evaluated Policies

## 5. Results

## 6. Discussion

- Lane geometry errors in the absence of nearby vehicles. Even minor geometry errors frequently result in situations where the agent drives close to the side of the road, touching the road barrier and triggering the collision terminal state.
- Late detection of the vehicle in front. Since a situation in which a slow-moving vehicle appears in close proximity to the ego cannot occur in the ground-truth environment, where every vehicle is observed as soon as it enters the detection area, the driving policy is not trained to handle such situations properly. Late detection typically results in a severe steering maneuver in an attempt to avoid the collision. The excessive control values applied to achieve this, however, result in a sharp turn, causing a severe collision with a road barrier.
- False-positive object detections. False positives appearing in front of the ego vehicle result in behaviors similar to the ones described in the previous point: the ego attempts to avoid the collision through a severe steering maneuver, crashing into a road barrier due to an excessively sharp turn. Interestingly, false-positive objects that appear outside the road (behind a road barrier) also seem to destabilize the control policy, often triggering unexpected steering maneuvers that result in a collision.

## 7. Conclusions

## Funding

## Conflicts of Interest

## Appendix A. Calibration Parameters

**Table A1.** Calibration parameters of the proposed sensor models.

Parameter Name | Value | Unit | Description |
---|---|---|---|
${p}_{\mu \_\mathrm{delay}}$ | 0.3 | s | Mean object detection delay |
${p}_{\sigma \_\mathrm{delay}}$ | 0.55 | s | Object detection delay standard deviation |
${p}_{\mathrm{fn}\_\mathrm{prob}}$ | 0.001 | - | Probability of false negative object detection |
${p}_{\mathrm{fn}\_\mu}$ | 1.47 | s | Mean duration of false negative object detection |
${p}_{\mathrm{fn}\_\sigma}$ | 1.5 | s | False negative object detection duration standard deviation |
${p}_{\mathrm{fp}\_\mathrm{prob}}$ | 0.0175 | - | Probability of false positive object detection |
${p}_{\mathrm{fp}\_\mu}$ | 0.5 | s | Mean duration of false positive object detection |
${p}_{\mathrm{fp}\_\sigma}$ | 2.8 | s | False positive object detection duration standard deviation |
${\mu}_{q}$ | ${[4.34,1.89]}^{T}$ | - | Mean false positive object size |
${\mathsf{\Sigma}}_{q}$ | $\left[\begin{array}{cc}0.21& 0\\ 0& 0.01\end{array}\right]$ | - | False positive object size covariance matrix |
${\mu}_{x}$ | ${[45.1,0]}^{T}$ | - | Mean false positive object position |
${\mathsf{\Sigma}}_{x}$ | $\left[\begin{array}{cc}19.3& 0\\ 0& 0.97\end{array}\right]$ | - | False positive object position covariance matrix |
${\mu}_{\psi}$ | 0.0 | - | Mean rotation of false positive object |
${\sigma}_{\psi}$ | 0.44 | - | Standard deviation of false positive object rotation |
${\mu}_{v}$ | 0.0 | $\frac{\mathrm{m}}{\mathrm{s}}$ | Mean speed of false positive object relative to ego |
${\sigma}_{v}$ | 11.7 | $\frac{\mathrm{m}}{\mathrm{s}}$ | Standard deviation of false positive object speed |
${\mu}_{a}$ | 0.0 | $\frac{\mathrm{m}}{{\mathrm{s}}^{2}}$ | Mean acceleration of false positive object relative to ego |
${\sigma}_{a}$ | 3.46 | $\frac{\mathrm{m}}{{\mathrm{s}}^{2}}$ | Standard deviation of false positive object acceleration |
${\mathbf{p}}_{\mathrm{ou}\_\lambda}$ | diag (0.5, 0.65, 0.11, 0.45, 0, 0.5, 0) | - | State estimation noise parameter |
${\mathbf{p}}_{\mathrm{ou}\_\mathsf{\Sigma}\_{\mathrm{init}}_{i}}$ | diag (1.3, 1.0, 1.4, 0.7, 0, 2.2, 0) | - | State estimation noise parameter |
${\mathbf{p}}_{\mathrm{ou}\_\mathsf{\Sigma}\_u}$ | diag (2.0, 1.6, 1.3, 0.7, 0, 2.5, 0) | - | State estimation noise parameter |
${p}_{\mathrm{lm}\_\mathrm{ou}\_\lambda \_h}$ | 0.4 | - | Lane markers length noise parameter |
${p}_{\mathrm{lm}\_\mathrm{lim}}$ | 5.0 | m | Mean lane markers length shortening |
${p}_{\mathrm{lm}\_\mathrm{jump}}$ | 15.0 | m | Min lane markers length change for noise reset |
${p}_{\sigma \_h}$ | 5.6 | - | Lane markers length noise parameter |
${p}_{\mathrm{lm}\_\mathrm{disc}\_\mathrm{c}\_0}$ | 0.001 | - | Marker false negative probability parameter |
${p}_{\mathrm{lm}\_\mathrm{disc}\_\mathrm{l}\_0}$ | 0.01 | - | Marker false negative probability parameter |
${p}_{\mathrm{lm}\_\mathrm{disc}\_\mathrm{c}\_1}$ | 0.01 | - | Marker false negative probability parameter |
${p}_{\mathrm{lm}\_\mathrm{disc}\_\mathrm{l}\_1}$ | 0.01 | - | Marker false negative probability parameter |
${p}_{\mathrm{lm}\_\mathrm{disc}\_\mathrm{c}\_2}$ | 0.02 | - | Marker false negative probability parameter |
${p}_{\mathrm{lm}\_\mathrm{disc}\_\mathrm{l}\_2}$ | 0.01 | - | Marker false negative probability parameter |
${h}_{\mathrm{max}}$ | 90.0 | m | Maximum length of lane marker |
${p}_{\mathrm{rec}\_\mathrm{hyst}}$ | 0.005 | - | Marker false negative recovery probability parameter |
${p}_{\mathrm{rec}\_\mathrm{pps}}$ | 0.05 | - | Marker false negative recovery probability parameter |
${p}_{\mathrm{rec}\_\mathrm{sat}}$ | 0.3 | - | Marker false negative recovery probability parameter |
${\mathbf{p}}_{\mathrm{lm}\_\mathrm{ou}\_\lambda}$ | diag (5.5, 5.5, 1.5, 2.5) | - | Lane marker geometry noise parameter |
${\mathbf{p}}_{\mathrm{lm}\_\mathrm{ou}\_\mathsf{\Sigma}\_\mathrm{init}}$ | diag (2.5, 0.05, 0.001, 0.0001) | - | Lane marker geometry noise parameter |
${\mathbf{p}}_{\mathrm{lm}\_\mathrm{ou}\_\mathsf{\Sigma}\_u}$ | diag (0.15, 0.007, ${10}^{-4}$, ${10}^{-6}$) | - | Lane marker geometry noise parameter |
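The $\lambda$/$\Sigma$ parameters above configure Ornstein–Uhlenbeck (OU) noise processes. As a minimal illustration, a scalar zero-mean OU process can be simulated with an Euler–Maruyama step; the discretization scheme and default values below are illustrative assumptions, not the paper’s implementation:

```python
import random

def ou_step(x, lam, sigma_u, dt, rng):
    """One Euler-Maruyama step of a zero-mean Ornstein-Uhlenbeck process:
    x' = x - lam * x * dt + sigma_u * sqrt(dt) * eps, with eps ~ N(0, 1)."""
    return x - lam * x * dt + sigma_u * (dt ** 0.5) * rng.gauss(0.0, 1.0)

def ou_trajectory(steps, lam=0.5, sigma_u=2.0, dt=0.1, seed=0):
    """Simulate a correlated noise trajectory; unlike white Gaussian noise,
    successive samples are correlated and decay toward zero at rate lam."""
    rng = random.Random(seed)
    x, traj = 0.0, []
    for _ in range(steps):
        x = ou_step(x, lam, sigma_u, dt, rng)
        traj.append(x)
    return traj
```

The mean-reversion rate `lam` plays the role of the $\lambda$ entries above, and `sigma_u` that of the update-noise covariance entries $\Sigma_u$.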

**Table A2.** Direct control policy parameters.

Parameter Name | Value | Description |
---|---|---|
${p}_{{a}_{\mathrm{acc}}}$ | 3.5 | Acceleration output scaling. |
${p}_{{a}_{\mathrm{steer}}}$ | 0.125 | Steering output scaling. |
${n}_{\mathrm{max}\_\mathrm{obj}}$ | 10 | Max number of observed objects. |
${n}_{\mathrm{max}\_\mathrm{lm}}$ | 6 | Max number of observed lane markers. |

**Table A3.** Reward component scaling factors.

Parameter Name | Value | Description |
---|---|---|
${r}_{\mathrm{speed}\_\mathrm{limit}}$ | 0.04 | Speed limit execution squared. |
${r}_{\mathrm{acc}}$ | −0.003 | Ego acceleration squared. |
${r}_{\mathrm{steer}}$ | −1.0 | Steering angle squared. |
${r}_{\mathrm{centering}}$ | −0.006 | Lane centering. |
${r}_{\mathrm{TTC}\_\mathrm{max}}$ | 6.0 | Max Time To Collision to be included in reward, in s. |
${r}_{\mathrm{TTC}}$ | −0.01 | Time to collision (inversed). |
${r}_{\mathrm{terminal}}$ | −10.0 | Terminal states (collisions, speed limit violations). |

**Table A4.** PPO training hyperparameters.

Hyperparameter | Value |
---|---|
Train batch size | 250,000 |
Minibatch size | 5000 |
Num epochs | 15 |
Discount ($\gamma$) | 0.99 |
GAE parameter ($\lambda$) | 0.95 |
Clipping parameter ($\epsilon$) | 0.3 |
KL coefficient | 0 |
Entropy coefficient | 0 |
VF coefficient | 1.0 |
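The GAE parameter above enters the standard advantage recursion, which can be sketched independently of any RL framework (a generic textbook form, not the paper’s training code):

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation with the discount and GAE
    parameter listed above. values[t] approximates V(s_t); last_value
    bootstraps the value beyond the final step of the rollout."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        gae = delta + gamma * lam * gae                      # GAE recursion
        advantages[t] = gae
        next_value = values[t]
    return advantages
```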

**Table A5.** Baseline (Gaussian) sensor model parameters.

Parameter | Value |
---|---|
Object position error covariance matrix | $\left[\begin{array}{cc}1.2& 0\\ 0& 0.7\end{array}\right]$ |
Velocity error variance | 2.0 |
Object length error variance | 0.5 |
Object length error lower limit | −1.0 |
Object width error variance | 0.5 |
Object width error lower limit | −1.0 |
False positive detection probability ^{1} | 0.0575 |
False negative detection probability ^{2} | 0.1 |
Lane marker mean observed length | 87.0 |
Lane marker observed length variance | 5.0 |
Lane marker observed length upper limit | 90.0 |
Lane marker coefficients errors covariance matrix | diag (0.005, 0.0005, 0.00005, 0.000005) |

^{1}Parameters of the false positive object detections are drawn from the same distributions as described in Table A1.

^{2}Duration of the false negatives is fixed to a single timestep.

## Appendix B. Test Scenarios

#### Appendix B.1. Scenario A: Late Detection of a Slow-Moving Object in Front, Empty Highway

**Table A6.** Distributions of Scenario A parameters.

Parameter | Value |
---|---|
Ego’s initial velocity $\left[\frac{\mathrm{m}}{\mathrm{s}}\right]$ | $\mathcal{U}(20.0,30.0)$ |
Object’s initial relative longitudinal position [m] | $\mathcal{N}(30.0,3.0^{2})$ |
Object’s initial relative lateral position [m] | $\mathcal{N}(0.0,0.3^{2})$ |
Object’s initial absolute speed $\left[\frac{\mathrm{m}}{\mathrm{s}}\right]$ | $\mathcal{N}(10.0,2.0^{2})$ |
Object’s acceleration $\left[\frac{\mathrm{m}}{{\mathrm{s}}^{2}}\right]$ | $\mathcal{N}(0.0,1.0^{2})$ |
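Sampling one Scenario A initial condition from these distributions can be sketched as follows (the field names are hypothetical, chosen only for this illustration):

```python
import random

def sample_scenario_a(rng):
    """Draw one Scenario A initial condition from the distributions
    listed in the table above."""
    return {
        "ego_v": rng.uniform(20.0, 30.0),  # U(20, 30) m/s
        "obj_dx": rng.gauss(30.0, 3.0),    # N(30, 3^2) m, longitudinal
        "obj_dy": rng.gauss(0.0, 0.3),     # N(0, 0.3^2) m, lateral
        "obj_v": rng.gauss(10.0, 2.0),     # N(10, 2^2) m/s
        "obj_a": rng.gauss(0.0, 1.0),      # N(0, 1^2) m/s^2
    }
```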

#### Appendix B.2. Scenario B: Constant Error of Front Object’s Speed Estimation

**Table A7.** Distributions of Scenario B parameters.

Parameter | Value |
---|---|
Ego’s initial velocity $\left[\frac{\mathrm{m}}{\mathrm{s}}\right]$ | $\mathcal{U}(20.0,30.0)$ |
Object’s initial relative longitudinal position [m] | $\mathcal{N}(40.0,3.0^{2})$ |
Object’s initial relative lateral position [m] | $\mathcal{N}(0.0,0.3^{2})$ |
Object’s initial absolute speed $\left[\frac{\mathrm{m}}{\mathrm{s}}\right]$ | $\mathcal{N}(10.0,2.0^{2})$ |
Object’s acceleration $\left[\frac{\mathrm{m}}{{\mathrm{s}}^{2}}\right]$ | $\mathcal{N}(0.0,1.0^{2})$ |
Constant velocity estimation error $\left[\frac{\mathrm{m}}{\mathrm{s}}\right]$ | 20 |

#### Appendix B.3. Scenario C: Normally Distributed Error of Front Object’s Speed Estimation

**Table A8.** Distributions of Scenario C parameters. Ego and object initial state parameters are identical to those in Scenario B.

Parameter | Value |
---|---|
Velocity estimation error $\left[\frac{\mathrm{m}}{\mathrm{s}}\right]$ | $\mathcal{N}(0.0,10.0^{2})$ |

#### Appendix B.4. Scenario D: Normally Distributed Front Object’s Lateral Position Estimation Error

**Table A9.** Distributions of Scenario D parameters. Ego and object initial state parameters are identical to those in Scenario B.

Parameter | Value |
---|---|
Lateral position estimation error [m] | $\mathcal{N}(0.0,3.5^{2})$ |

#### Appendix B.5. Scenario E: Random Occurrences of False Negative Detection Errors of Front Objects

**Table A10.** Distributions of Scenario E parameters. Ego and object initial state parameters are identical to those in Scenario B.

Parameter | Value |
---|---|
Probability of false negative detection occurrence at each time step | 0.1 |
Duration of false negative detection error events [s] | 0.2 |

#### Appendix B.6. Scenario F: Frequent False Negative Road Markers Detection Errors

**Table A11.** Distributions of Scenario F parameters. Ego initial state parameters are identical to those in Scenario B.

Parameter | Value |
---|---|
Probability of lane marker false negative detection | 0.4 |

## References


**Figure 1.** Architecture of the direct control network. The ${n}_{\mathrm{embd}}$-dimensional input embeddings of a const output token, the ego features (${\mathbf{o}}_{\mathrm{ego}}$), the objects ${\mathbf{o}}_{{\mathrm{obj}}_{i}}$ for $i=1..{n}_{\mathrm{max}\_\mathrm{obj}}$, and the lane markers ${\mathbf{o}}_{{\mathrm{lm}}_{i}}$ for $i=1..{n}_{\mathrm{max}\_\mathrm{lm}}$ are concatenated into a matrix $I\in {\mathbb{R}}^{{n}_{\mathrm{embd}}\times (1+1+{n}_{\mathrm{obj}}+{n}_{\mathrm{lm}})}$ and consumed by the transformer encoder layers. The encoders use a masking mechanism to prevent non-existing objects or lane markers from impacting the outputs. Ultimately, the transformer encoder layers produce the output $O\in {\mathbb{R}}^{{n}_{\mathrm{embd}}\times (1+1+{n}_{\mathrm{obj}}+{n}_{\mathrm{lm}})}$. The first column of $O$ is then processed by a deep fully connected network to generate the control values ${a}_{\mathrm{acc}}$ and ${a}_{\mathrm{steer}}$. Note that the transformer structure lacks positional encodings, as there is no need for sorting or prioritizing the environmental features.

**Figure 2.** Simulation environment. Training and evaluations were performed in a simulation of a randomly generated multi-lane highway. Ground truth vehicles and road (light gray in the figure) are parsed by the sensor models to produce vehicle state estimations (blue) and the lane markers geometry model (red). The ego vehicle (green) is controlled by the direct control policy.

**Figure 3.**Training progress. Training was performed separately in environments with the described sensor models based on Ornstein–Uhlenbeck processes (OU-SM), baseline sensor models that utilize Gaussian noise (G-SM), and with the ground-truth sensor data (GT). Training in all environments progressed similarly, with the agent trained in the OU-SM environment reaching a slightly lower performance.

**Table 1.** Key Performance Indicators for evaluated agents. The table summarizes the evaluation of three driving policies (GT agent: policy trained in the ground-truth environment; G-SM agent: policy trained in the environment with baseline sensor models based on Gaussian noise; OU-SM agent: policy trained in an environment with the sensor models proposed in the previous sections). Each policy was evaluated in three simulation environments: with GT data, with the G-SM setup, and with the OU-SM setup.

Performance Indicator | GT Agent, GT env | GT Agent, G-SM env | GT Agent, OU-SM env | G-SM Agent, GT env | G-SM Agent, G-SM env | G-SM Agent, OU-SM env | OU-SM Agent, GT env | OU-SM Agent, G-SM env | OU-SM Agent, OU-SM env |
---|---|---|---|---|---|---|---|---|---|
Mean episode length (sim steps) | 977.4 | 716.8 | 291.5 | 925.8 | 927.2 | 804.0 | 907.6 | 941.8 | 910.8 |
Average speed [$\frac{\mathrm{m}}{\mathrm{s}}$] | 27.6 | 27.9 | 26.0 | 29.1 | 29.4 | 29.0 | 26.0 | 26.2 | 27.1 |
Average abs steering angle [rad] | 0.35 | 0.56 | 0.30 | 0.54 | 0.57 | 0.62 | 0.41 | 0.46 | 0.48 |
Average abs acceleration [$\frac{\mathrm{m}}{{\mathrm{s}}^{2}}$] | 0.64 | 0.91 | 1.23 | 0.68 | 0.76 | 0.82 | 0.98 | 0.91 | 0.99 |
Heavy braking events | 1.7 | 8.7 | 1.5 | 2.9 | 7.6 | 4.1 | 2.7 | 8.2 | 3.5 |
Fraction of episodes failed | 0.03 | 0.48 | 0.84 | 0.04 | 0.07 | 0.27 | 0.03 | 0.02 | 0.03 |

**Table 2.** Performance of the agents in test scenarios. Each agent was evaluated 100 times in each scenario type, with the scenarios for each run sampled from the distributions described in Appendix B. Performance is measured by the fraction of evaluations that ended in a collision (lower is better).

Scenario Type | GT Agent | G-SM Agent | OU-SM Agent |
---|---|---|---|
A. Late detection of a slow-moving object in front, empty highway | 0.16 | 0.23 | 0.06 |
B. A constant error of front object’s speed estimation | 0.41 | 0.73 | 0.31 |
C. Normally distributed error of front object’s speed estimation | 0.45 | 0.30 | 0.24 |
D. Normally distributed front object’s lateral position estimation error | 0.26 | 0.37 | 0.12 |
E. Random occurrences of false negative detection errors of front object | 0.12 | 0.14 | 0.06 |
F. Frequent false negative road markers detections | 0.11 | 0.06 | 0.04 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Turlej, W.
High-Level Sensor Models for the Reinforcement Learning Driving Policy Training. *Electronics* **2023**, *12*, 71.
https://doi.org/10.3390/electronics12010071
