Deep Reinforcement Learning and Imitation Learning for Autonomous Parking Simulation

Anagnostara, Ioanna Marina; Tsardoulias, Emmanouil; Symeonidis, Andreas L.

doi:10.3390/electronics14101992

by Ioanna Marina Anagnostara *, Emmanouil Tsardoulias * and Andreas L. Symeonidis

School of Electrical and Computer Engineering, Faculty of Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

* Authors to whom correspondence should be addressed.

Electronics 2025, 14(10), 1992; https://doi.org/10.3390/electronics14101992

Submission received: 15 April 2025 / Revised: 2 May 2025 / Accepted: 12 May 2025 / Published: 13 May 2025

(This article belongs to the Special Issue Deep Perception in Autonomous Driving, 2nd Edition)


Abstract

In recent years, system intelligence has revolutionized various domains, including the automotive industry, which has fully incorporated intelligence through the emergence of Advanced Driver Assistance Systems (ADAS). Within this transformative context, Autonomous Parking Systems (APS) have emerged as a foundational component, revolutionizing the way vehicles navigate and park with precision and efficiency. This paper presents a comprehensive approach to autonomous parallel parking, leveraging advancements in Artificial Intelligence (AI). Three state-of-the-practice approaches—Imitation Learning (IL), deep Reinforcement Learning (deep RL), and a hybrid deep RL-IL method—are employed and evaluated through extensive experiments in the CARLA Simulator using randomly generated parallel parking scenarios. Results demonstrate that the hybrid deep RL-IL approach achieves a remarkable success rate of 98% in parking attempts, surpassing the individual IL and deep RL methods. Furthermore, the proposed hybrid model exhibits superior maneuvering efficiency and higher overall reward accumulation. These findings underscore the advantages of combining deep RL and IL, representing a significant advancement in APS technology.

Keywords: autonomous vehicles; autonomous parking systems; reinforcement learning; imitation learning; deep deterministic policy gradient (DDPG)

1. Introduction

In an era defined by technological innovation, intelligence has emerged as a transformative force across diverse sectors, redefining task execution and enhancing operational efficiency. From manufacturing to healthcare, the integration of systems equipped with intelligence has not only increased efficiency but also opened doors to new possibilities in enhancing precision, reducing errors, and ultimately augmenting overall productivity [1].

This pervasive influence of intelligence sets the stage for the intersection of cutting-edge technology and transportation, introducing a transformative era marked by increasing autonomy in vehicles. The accelerating evolution of autonomous vehicles (AVs) has become closely tied to the rapid advancement of Artificial Intelligence (AI) and machine learning (ML). In the automotive sector, the application of ML through the integration of AI in vehicles, commonly referred to as self-driving vehicles or AVs, has become a focal point [2]. AI and ML are integral in reshaping the landscape of autonomous driving, fostering adaptability and efficiency. Various ML techniques have been employed across different domains, with one notable approach being Reinforcement Learning (RL). RL enables AVs to learn optimal behaviors through trial and error by interacting with their environment [3]. This mechanism allows vehicles to navigate complex scenarios, receive rewards for desirable actions, and continuously refine their decision-making processes.

AVs operate based on a three-phase design known as “sense-plan-act”, a fundamental principle shared by many robotic systems [4]. The primary challenge for AVs lies in effectively understanding the intricate and ever-changing driving environment they operate in. In tackling this challenge, AVs are equipped with technologies that provide advanced sensing capabilities, utilizing radar, lidar, cameras, and ultrasonic sensors, facilitating real-time perception of the vehicle’s surroundings. The collected data then serve as input for perception algorithms, playing a crucial role in constructing a comprehensive situational awareness model [5]. The decision-making process is a core functionality driven by the power of AI and ML. These systems interpret the perceived environment, enabling the vehicle to make complex decisions, including route planning, obstacle avoidance, and parking maneuvering, among others. The execution of these decisions falls under the domain of the control systems, which seamlessly integrate features such as adaptive cruise control, lane-keeping assistance, automated braking, and autonomous parking. This integration ensures precise control over critical aspects like acceleration, braking, and steering [6].

The transformative capabilities of AVs extend beyond mere intelligence, promising a revolution in the way we approach transportation. From adaptive cruise control to complex decision-making processes, AVs offer a spectrum of functions that redefine the driving experience. Within this array of autonomous capabilities, a particular focus lies on autonomous parking—a feature that involves intricate maneuvers to seamlessly move a vehicle from a traffic lane into a parking space, encompassing parallel, perpendicular, or angled parking scenarios. While all forms of autonomous parking contribute to convenience, parallel parking emerges as the most challenging due to its precision requirements and the need to navigate limited spaces effectively. Achieving centimeter-level accuracy in parallel parking necessitates a delicate interplay of advanced sensing capabilities, sophisticated decision-making algorithms, and precise control systems.

To build upon the advancements in autonomous vehicle technology, this work focuses on improving the efficiency and precision of parallel parking through the integration of deep RL and Imitation Learning (IL). By combining these techniques, we aim to develop an adaptive model for autonomous parking.

The rest of this paper is structured as follows: Section 2 reviews the state of the art in Autonomous Parking Systems (APS), providing an overview of current advancements in the field. Section 3 presents the theoretical background of RL and the DDPG algorithm, which are central to the methods used in this work. Section 4 covers implementation, including the environmental setup, and a detailed explanation of the three developed approaches for autonomous parallel parking. Section 5 presents experimental results, evaluating the performance of each approach. Finally, Section 6 provides conclusions and discusses potential future enhancements.

2. State of the Art

Recent studies in autonomous parking have explored a variety of methods to address its inherent challenges, such as accurately detecting parking spaces, navigating complex environments, and ensuring safe maneuvering without collisions. Approaches to APS can be categorized into two distinct groups. The first group, akin to the modular framework, employs a path planning approach, where the emphasis lies in meticulously devising optimal trajectories and paths for vehicles to navigate seamlessly into parking spaces. These methods involve pre-computing precise paths that adhere to vehicle dimensions and parking constraints, often utilizing techniques such as geometric calculations, search algorithms, or optimization-based formulations [7]. The modular nature of this approach involves the division of the system into distinct modules or components, each responsible for specific tasks, such as perception, planning, and control. In contrast, the second group follows an intelligent control approach, which leverages cutting-edge technologies like ML and RL, to enable real-time decision making for controlling the throttle, brakes, and steering during parking maneuvers. This approach, aligning with the end-to-end framework, allows vehicles to dynamically adapt their actions based on sensor inputs and learned policies, resulting in a more adaptable and context-aware parking process [7]. While both categories share the goal of achieving autonomous parking, they distinguish themselves by their primary focus: the path planning approach emphasizes the pre-computation of stages, while the intelligent control approach centers on real-time decision making and the integration of multiple tasks [8].

2.1. Conventional Autonomous Parking Approaches

Conventional approaches in APS typically utilize mathematical principles and algorithms to facilitate optimal vehicle paths. These systems are broadly categorized into three types: geometric methods, search methods, and optimization methods. Geometric-based methods in APS utilize predefined paths composed of multi-segment fundamental curves, including straight lines and arcs, as parking path patterns. Planning variables such as curve parameters are determined based on kinematic and environmental constraints using algebraic or optimization methods [9]. Vorobieva et al. [10] developed a geometric path using circle arcs and clothoid curves, emphasizing efficient computational processes. However, these methods often struggle in dynamic environments due to their reliance on fixed geometrical constraints. To address these limitations, Piao et al. [11] introduced a multi-sensor fusion approach for parking space recognition and adaptive trajectory generation, accommodating nonparallel initial states, for parallel parking. In terms of parking space recognition, their multi-sensor fusion method significantly outperformed both single and double sensor approaches, achieving a success rate of 94% compared to 45% and 64%, respectively. Geometric-based methods are advantageous due to their low computational cost but often exhibit a lower success rate in planning compared to other approaches [12].

Search-based methods, including algorithms like A* and Rapidly Exploring Random Trees (RRTs), aim to find collision-free paths and have been widely employed for path planning in autonomous parking. Huang et al. [13] demonstrated the application of the multi-heuristic hybrid A* algorithm, a state-space heuristic approach tailored for vehicle kinematics, in both forward and backward parallel parking scenarios, though it occasionally resulted in suboptimal paths. However, these algorithms are typically computationally expensive. To minimize computation costs, Kwon and Chung [14] proposed a three-step path-planning approach for both parallel and vertical parking. This method involves calculating reachable regions, generating candidate paths within these regions, and determining the optimal path based on a cost function.

Optimization-based methods, another area of focus in the literature, employ model-based techniques for path planning, integrating the dynamic characteristics of vehicles. Planning variables, such as linear velocity and steering angle, are optimized within a predefined framework to determine vehicle motion parameters. A study by Chi et al. [15] introduced a hierarchical parking framework that combined graph search techniques with nonlinear model predictive control, effectively handling complex parking maneuvers in numerical simulations. The quality of these solutions often depends on initial estimates, leading researchers to provide an initial estimate close to the optimal solution to expedite the optimization solver’s convergence and enhance the system’s overall efficiency. In contrast to other methods, optimization-based approaches offer superior path quality, yet they face challenges in representing obstacles as equations and incur high computational costs depending on the solver used for the optimization problem [12].

Research has also explored combining methods to improve the efficacy of individual approaches. Zhang et al. [16] introduced a method merging A* with the Optimization-Based Collision Avoidance (OBCA) algorithm, utilizing a global path planner to generate an initial coarse path. This coarse path serves as the starting point for OBCA, enhancing solution quality by aligning with vehicle dynamics.

Conventional APS approaches, while foundational, often encounter limitations in flexibility and adaptability. These methods primarily rely on predefined constraints and mathematical algorithms, which can result in suboptimal performance in complex parking scenarios. Additionally, their reliance on complex calculations and extensive data processing can lead to increased computational demands, resulting in slower decision-making processes and reduced efficiency in executing parking maneuvers.

2.2. Advanced Autonomous Parking Approaches

In contrast, advanced approaches leverage ML and AI techniques, particularly Neural Networks (NNs), to enhance adaptability and efficiency in APS. NNs have demonstrated significant potential in processing complex data, such as visual information for detecting parking space boundaries. Li et al. [17] utilized a Convolutional NN (CNN) for an end-to-end APS, achieving high success rates in bay parking scenarios by training with real-time video and control command data. IL, an ML method where models learn to replicate human behavior by training on example data, is also commonly used in advanced approaches. Within this context, Moon et al. [18] developed a deep NN-based parking controller that emulates human control laws through supervised learning, demonstrating robustness in various parking conditions. Gamal et al. [19] developed an intelligent perpendicular parking system using CNNs, employing a curve-based path planning algorithm to generate a path from the vehicle’s position to the target parking space while simultaneously collecting image data to train the CNN, enabling the vehicle to reach the parking space with good orientation regardless of the initial starting position. Chai et al. [20] developed a multilayer deep NN control scheme for ground AVs’ parking maneuvers. The approach involved iterative optimization for time-optimal parking trajectories, followed by deep NN training to learn system state–control action relationships. Additionally, a strategy was employed to enhance network accuracy by retraining on a selected trajectory ensemble with incorrect predictions. Field experiments validated the scheme’s real-world effectiveness across diverse scenarios.

Last but not least, deep RL has emerged as a potent technique in autonomous parking, enabling vehicles to learn optimal parking policies through trial and error. RL continuously learns and accumulates experience from numerous parking attempts, while also learning the optimal commands for different parking slots related to the vehicle. Within this context, Zhang et al. [21] introduced an end-to-end perpendicular parking framework applying the DDPG algorithm, allowing vehicles to iteratively learn the optimal steering wheel angles for different parking slots. This approach mitigates errors from path tracking and ensures reliable parking through a vision and chassis-based tracking algorithm. Takehara et al. [22] employed proximal policy optimization (PPO), a deep RL algorithm using segmentation images as inputs. This study demonstrated that segmentation images significantly improved the accuracy and efficiency of the perpendicular parking task compared to normal images, highlighting the potential of segmentation techniques in enhancing APS. Similarly, Thunyapoo et al. [23] introduced a perpendicular APS employing the PPO algorithm and utilizing depth sensors for varying parking difficulty levels based on the proximity of nearby vehicles. The study found better performance in sparse configurations, leading the authors to conclude that separate models yield improved results for different parking scenarios. Additionally, Song et al. [24] proposed a data-efficient RL (DERL) algorithm for parallel APS lateral planning and control, using a model-based approach coupled with a truncated Monte Carlo tree search (MCTS) algorithm. Their method utilized two NNs to learn state–action relationships, one to estimate the probability distribution of high-reward actions and another to predict state values. 
Data efficiency was further improved through an adaptive exploration scheme and weighted policy learning, resulting in enhanced parking performance compared to conventional path planning methods. Zhang et al. [25] proposed another model-based RL method that learns parallel parking policies through iterative data generation, evaluation, and training, reducing reliance on human expertise. Their approach establishes an environmental model for simulated interactions and introduces a data generation algorithm combining MCTS with longitudinal and lateral policies, leading to improved parking performance compared to the NN alone. Sousa et al. [26] also presented an RL approach to autonomous parking using the DDPG, enabling the vehicle to perform perpendicular, angular, and parallel parking maneuvers while controlling steering and speed for collision-free motion. However, the model was limited in the parallel parking task, as it could only park from the right side of the parking spot and struggled in environments where obstacles were less than twice the vehicle’s length apart. Finally, Du et al. [27] developed an RL-based trajectory planner for APS, focusing on both parallel and perpendicular tasks. They found that breaking the parking task into subtasks accelerated training and that the DQN outperformed the Deep Recurrent Q-Network (DRQN) approach, due to the fact that long-term dependencies on states may negatively impact the agent’s performance in simple parking scenarios.

While existing approaches demonstrate promise, they often overlook the complexities of parallel parking tasks, as most studies utilizing deep RL have primarily focused on perpendicular parking scenarios. Additionally, although NNs can be effective, they may be less efficient than RL methods for optimizing decision-making processes, as NNs typically require extensive training on large datasets that may not encompass the full range of parking scenarios.

2.3. Contribution

This study aims to implement a robust autonomous parallel parking system by developing advanced algorithms for detecting parking spaces and maneuvering the vehicle within them. Three distinct approaches are developed for the movement of the vehicle from the traffic lane to the identified parking space. Initially, the study explores IL, where an NN is trained to mimic human parking behaviors. Notably, the IL approach has not been employed for autonomous parking tasks, primarily due to its inefficiency in capturing the dynamic nature of parking scenarios. Subsequently, a deep RL approach is developed, specifically focusing on employing the Deep Deterministic Policy Gradient (DDPG) algorithm. This approach focuses on learning optimal parking policies through continuous interaction with the environment, enabling the system to handle a wide range of parking scenarios by adjusting actions based on observed states. Although the DDPG has been explored in other parking tasks, its application for parallel parking remains relatively unexplored. The primary contribution of this work, however, lies in the third approach which integrates the DDPG with IL. By leveraging IL to initially guide the DDPG algorithm, this hybrid method aims to accelerate convergence and enhance system performance. The IL provides a strong starting point, reducing the exploration phase and enabling the DDPG algorithm to fine-tune the policy more efficiently. The study then progresses to the evaluation and comparison of these three approaches. Each approach undergoes a comprehensive assessment, and their performance is meticulously compared to ascertain their relative strengths and weaknesses in handling diverse parking scenarios.

3. Theoretical Background

3.1. Reinforcement Learning Concepts

RL is a field of ML focused on training agents to make sequential decisions in dynamic environments. Unlike supervised learning, RL does not rely on labeled datasets but instead allows agents to learn through interaction, receiving feedback in the form of rewards or penalties based on their actions. The goal is for agents to learn policies that maximize cumulative rewards over time, adapting dynamically to complex and uncertain scenarios [28,29].

3.1.1. Key Components of Reinforcement Learning

RL operates within an interactive system where an agent continuously and dynamically engages with an external environment. The fundamental components of RL, collectively shaping the iterative learning process, include the agent, which is the entity tasked with decision making and learning within the system, and the environment, which is the external system with which the agent interacts, providing the context for decision making. Each interaction is characterized by a state, representing the current situation or configuration of the environment, and an action which denotes the decisions or moves that the agent can take given a particular state. The agent’s decisions are guided by rewards, which serve as the feedback mechanism indicating the desirability of its actions [30].

3.1.2. Markov Decision Process

RL problems are typically formulated as Markov Decision Processes (MDPs), where future states depend solely on the current state and action taken. An MDP is defined by a set of states ($S$), a set of actions ($A$), a transition probability function ($P$), and a reward function ($R$) [31]. In the context of RL, the agent and the environment interact at discrete time steps $t = 0, 1, 2, \dots$. At each time step $t$, the agent receives a representation of the environment’s state $S_{t} \in S$ and selects an action $A_{t} \in A(S_{t})$. One time step later, as a consequence of its action, the agent receives a numerical reward $R_{t+1} \in \mathbb{R}$ and finds itself in a new state $S_{t+1}$.
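The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: the `ToyCorridor` environment, its reward values, and the `run_episode` helper are hypothetical and are not part of the parking simulation used in this work.

```python
import random


class ToyCorridor:
    """Hypothetical 1-D environment: states 0..4, with the goal at state 4."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right); walls clamp the move.
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else -0.1  # plays the role of R_{t+1}
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Roll out one episode; return a list of (S_t, A_t, R_{t+1}, S_{t+1})."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # agent selects A_t
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


def random_policy(state):
    """Ignore the state and move left or right with equal probability."""
    return random.choice([-1, 1])


episode = run_episode(ToyCorridor(), random_policy)
print(len(episode), episode[-1])
```

Each stored tuple matches one time step of the MDP: the state observed, the action chosen, the reward received one step later, and the resulting state.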

3.1.3. Reward Function

In RL, rewards are a key part of the learning process. Rewards signal to an agent which of the actions it has taken are valuable, indicating which ones should be repeated when the same state is visited in the future. Rewards can be deterministic, consistently associated with specific actions, or stochastic, involving randomness and probability distributions. Sparse rewards are infrequent and tied to significant achievements, while dense rewards provide continuous feedback, aiding faster learning [29,32].

3.1.4. Returns and Episodes

In RL, the agent’s objective is to maximize cumulative rewards over time, expressed as the sum of rewards it receives at each time step $t$ until the final time step $T$. However, for continuing tasks that lack natural episode boundaries, defining the final time step as $T = \infty$ could lead to an infinite return. To address this challenge, the concept of discounting is introduced: the agent selects actions to maximize the sum of discounted future rewards, where the discounted return is defined as follows:

$$R_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \tag{1}$$

where $\gamma \in [0, 1]$ is the discount rate, determining the present value of future rewards. When $\gamma = 0$, the agent prioritizes immediate rewards, focusing on selecting actions that maximize only reward $R_{t+1}$. As $\gamma$ approaches 1, future rewards carry more weight in the agent’s decision-making process [28,33].
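As a small worked example of Equation (1), the sketch below computes the discounted sum for a finite list of rewards. The function name `discounted_return` and the reward values are illustrative assumptions; the infinite sum is simply truncated to the rewards supplied.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward list."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards using G_k = r_k + gamma * G_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, 0.0))  # 1.0  (only the immediate reward counts)
print(discounted_return(rewards, 0.5))  # 1.75 (1 + 0.5 + 0.25)
print(discounted_return(rewards, 1.0))  # 3.0  (undiscounted sum)
```

The three calls show the effect described above: with a discount rate of 0 only the first reward matters, while values closer to 1 give later rewards more weight.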

3.1.5. Policy and Value Functions

RL algorithms commonly involve the estimation of value functions to assess the desirability of states or state–action pairs, based on expected future rewards. These value functions are inherently linked to policies, denoted by $\pi$, which represent mappings from states to action probabilities. The value function of a state $s$ under a policy $\pi$, denoted as $v_{\pi}(s)$, represents the expected return when initiating from state $s$ and subsequently adhering to policy $\pi$ [28]. Similarly, the action-value function, denoted as $q_{\pi}(s, a)$, represents the expected return when initiating from state $s$, taking action $a$, and following policy $\pi$ thereafter. The ultimate goal of RL algorithms is to discover the optimal policy, denoted as $\pi^{*}$, that maximizes the expected cumulative reward over time by iteratively updating value functions or policies.
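For reference, the two value functions described above can be written as conditional expectations of the discounted sum of rewards from Equation (1). This is the standard textbook form, writing the state-value function as $v_{\pi}(s)$. The expectation operator $\mathbb{E}_{\pi}$ and the notation $\pi(a \mid s)$ for the probability of choosing action $a$ in state $s$ are conventions added here for illustration.

```latex
\begin{aligned}
v_{\pi}(s)    &= \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_{t} = s \right], \\
q_{\pi}(s, a) &= \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_{t} = s,\; A_{t} = a \right], \\
v_{\pi}(s)    &= \sum_{a \in A(s)} \pi(a \mid s)\, q_{\pi}(s, a).
\end{aligned}
```

The last line relates the two: the value of a state is the policy-weighted average of the values of the actions available in it.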

3.1.6. Exploration vs. Exploitation

RL involves a trade-off between exploration and exploitation. To fulfill its goal in achieving maximum rewards, the RL agent needs to select actions with proven effectiveness from past experiences. However, to uncover such actions, the agent must also take actions that lead into unfamiliar territories [28]. The dilemma lies in the agent’s need to exploit prior knowledge for immediate rewards while simultaneously exploring new actions to enhance future decision making [34]. Overemphasizing exploitation risks suboptimal long-term performance by potentially overlooking undiscovered, more rewarding actions. Conversely, excessive exploration may impede the agent’s ability to exploit already identified high-reward actions, hindering short-term gains.
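One common way to illustrate this trade-off is an epsilon-greedy rule for a discrete action set. The sketch below is a generic illustration and is not the exploration scheme used in this work. The function name `epsilon_greedy` and the example values are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: any action index
    # exploit: index of the highest estimated value
    return max(range(len(q_values)), key=q_values.__getitem__)


estimates = [0.1, 0.9, 0.3]  # hypothetical action-value estimates
print(epsilon_greedy(estimates, epsilon=0.0))  # 1 (always greedy)
```

Setting `epsilon` to 0 gives pure exploitation and setting it to 1 gives pure exploration; intermediate values mix the two.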

3.2. Deep Reinforcement Learning

Traditional RL methods using lookup tables for state and action values face significant scalability challenges. As the number of states and actions increases, the memory requirements become impractical, and the learning process slows down due to the need to individually update each state. Additionally, these tabular methods struggle to handle continuous action and state spaces efficiently.

Deep RL overcomes these limitations by integrating deep NNs into the learning process. Instead of relying on explicit tables for each state or state–action pair, deep RL utilizes NNs to approximate the mappings between states, actions, and their corresponding values. This adaptation enables deep RL to efficiently handle large and continuous state and action spaces, capturing intricate patterns and generalizing across diverse scenarios. Deep RL is particularly well suited for complex and realistic environments due to its ability to handle high-dimensional inputs [28,29].

3.3. Deep Deterministic Policy Gradient

Introduced by researchers Timothy P. Lillicrap et al. [35] in 2016, the Deep Deterministic Policy Gradient (DDPG) has become a notable algorithm in the field of RL. This algorithm is particularly influential in scenarios where the action space is continuous, addressing challenges that discrete action algorithms may struggle with. The DDPG combines elements of deep learning and policy gradient methods, making it well suited for complex tasks in various domains, including robotics and autonomous systems. Its off-policy nature allows it to learn from experiences generated by any policy, enhancing exploration efficiency. Moreover, its deterministic approach ensures stable learning in continuous action spaces by directly mapping states to specific actions rather than probabilistically choosing from a distribution [36].

3.3.1. Actor–Critic Architecture

The DDPG utilizes the actor–critic (AC) algorithm, consisting of two distinct networks: the actor network (policy network) and the critic network (evaluation network). The actor network selects actions based on current state information, while the critic network evaluates these actions by estimating their value function. These networks operate in tandem, with the critic network providing feedback to the actor network to refine its action strategy. The actor network adjusts its actions to maximize the value estimated by the critic network, which continuously updates its value function based on state and reward inputs from the environment. This iterative process improves the actor network’s action strategy over time, aiming to achieve optimal performance [37].

3.3.2. Architectural Foundations and Innovations in DDPG

The DDPG builds on the Deterministic Policy Gradient concept, employing NNs to model both the policy and Q functions, leveraging deep learning techniques to complete the training [3]. Derived from the AC framework, the DDPG integrates aspects of the Deterministic Policy Gradient (DPG) for stable continuous action optimization and of the Deep Q-Network (DQN) for effective Q-function approximation [35,36]. This combination enhances the algorithm’s performance by providing robust policy estimation and value function updates through Neural Network-based methods. The DDPG incorporates these foundational elements and introduces two additional techniques aimed at enhancing training stability and sample efficiency.

Target Networks

The DDPG algorithm employs two target networks, target actor and target critic, to stabilize training and reduce variance [38]. The target actor network is a slow-moving replica of the primary actor network, responsible for generating actions, while the target critic network mirrors the primary critic network, which evaluates the quality of these actions. The DDPG adopts a slow update mechanism of the target networks, facilitated through soft target updates, introducing a smoothing effect on the learning targets [39]. During each iteration, a fraction of the primary network weights is blended with the corresponding target network weights through the following equation:

$$\alpha \leftarrow \tau \cdot \beta + (1 - \tau) \cdot \alpha, \tag{2}$$

where $\alpha$ are the weights of the target networks, $\beta$ are the weights of the main networks, and $\tau$ is the soft update coefficient.
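The soft update in Equation (2) can be sketched on plain lists of weights, as below. The function name `soft_update` and the numeric values are illustrative assumptions; in particular, the coefficient 0.5 is chosen for readability and is not the setting used in this work.

```python
def soft_update(target_weights, main_weights, tau):
    """Blend main weights into target weights: tau * main + (1 - tau) * target."""
    if len(target_weights) != len(main_weights):
        raise ValueError("weight lists must have the same length")
    return [tau * main + (1.0 - tau) * target
            for target, main in zip(target_weights, main_weights)]


target = [0.0, 1.0]  # hypothetical target-network weights
main = [1.0, 0.0]    # hypothetical main-network weights
target = soft_update(target, main, tau=0.5)
print(target)  # [0.5, 0.5]
```

A coefficient of 1 would copy the main weights outright, while a small coefficient makes the target networks trail the main networks slowly, which is the smoothing effect described above.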

Experience Replay

Another technique employed by the DDPG is experience replay, which stores tuples of past states, actions, rewards, and next states in a dedicated memory buffer. Implemented as a circular buffer, the replay memory is periodically sampled in batches to update the agent’s learning model. This approach decorrelates sequential experiences, stabilizes learning, and improves decision making by incorporating diverse past events, including rare occurrences [37,40].
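A minimal replay buffer along these lines can be sketched with a bounded `deque`. The class name `ReplayBuffer`, the uniform random sampling, and the extra `done` flag stored with each transition are assumptions made for illustration, not details taken from this work.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity circular buffer of past transitions."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement (an assumption of this sketch).
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(step, 0.0, 1.0, step + 1, False)
print(len(buffer))  # 3: the two oldest transitions were overwritten
```

Because batches are drawn from anywhere in the buffer rather than in the order they were recorded, consecutive and highly correlated transitions rarely end up in the same update.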

3.3.3. Exploration Noise

The policy in the DDPG is deterministic since it directly outputs the action. However, to encourage exploration, the authors of the original DDPG paper [35] suggest the incorporation of noise into the actions determined by the policy. For this purpose, the Ornstein–Uhlenbeck (OU) process is recommended, as it generates temporally correlated exploration noise. The OU process is a stochastic, stationary Gauss–Markov process that produces noise correlated with the previous instances, preventing the noise from canceling out the overall dynamics [41]. The actions generated by applying the OU noise are obtained from the following expression:

$$a_{t}' = a_{t} + \theta \cdot dt \cdot (\mu - a_{t}) + \sigma \cdot \sqrt{dt} \cdot N(0, 1), \tag{3}$$

where $a_{t}$ is the original action sampled from the actor policy, $a_{t}'$ is the perturbed action with added noise, $\theta$ is the rate of mean reversion controlling how quickly the noise reverts to the mean, $\mu$ is the mean of the noise, $dt$ is the time step, $\sigma$ is the standard deviation of the noise, and $N(0, 1)$ is a sample from a standard normal distribution.
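Equation (3) can be transcribed directly for a scalar action, as in the sketch below. The function name `ou_perturb` and all parameter values are illustrative assumptions rather than the settings used in this work.

```python
import math
import random


def ou_perturb(action, theta, mu, sigma, dt, rng=random):
    """Apply Equation (3) to a scalar action and return the perturbed action."""
    drift = theta * dt * (mu - action)                        # pull toward mu
    diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)   # random kick
    return action + drift + diffusion


# With sigma = 0 the random term vanishes and only the mean reversion remains.
print(ou_perturb(1.0, theta=0.5, mu=0.0, sigma=0.0, dt=1.0))  # 0.5

# A noisy example with a seeded generator, using hypothetical parameter values.
rng = random.Random(0)
print(ou_perturb(0.3, theta=0.15, mu=0.0, sigma=0.2, dt=0.01, rng=rng))
```

In practice the perturbed action would usually also be clipped to the valid control range before being sent to the vehicle; that step is omitted here.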

3.3.4. Learning Mechanism

The learning mechanism of the DDPG algorithm involves simultaneous minimization of both the actor and the critic loss, using the Deterministic Policy Gradient to train the actor network alongside the critic update. The actor network is trained to maximize the expected cumulative reward by adjusting its $\theta$ parameters. The primary aim in updating the actor network is to optimize its parameters so that, for a given state, it generates actions that yield the maximum predicted value as assessed by the critic network. The critic network is trained to minimize the temporal difference between the predicted Q-values and the target Q-values obtained from the Bellman equation, by updating the network’s $\theta^{Q}$ parameters. The predicted Q-value estimates the expected return starting from state $s$ and taking action $a$, while the target Q-value is computed using the target action network $\mu'$, which provides the action for the next state $s'$. The objective of training the critic network is to reduce the squared difference between these predicted and target Q-values, thereby improving the accuracy of the Q-value estimates. This process is crucial for learning an effective action-value function that guides the actor network toward making better decisions in continuous action spaces, considering both immediate rewards and discounted future rewards. Figure 1 provides a visual representation of the DDPG network architecture.
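The two objectives described above can be written compactly. The following sketch works on plain lists of numbers rather than neural networks, so the Q-values are supplied as inputs. The function names, the `dones` flag for terminal transitions, and the symbol `gamma` for the discount rate are assumptions introduced for the example.

```python
def critic_targets(rewards, next_q_values, dones, gamma):
    """Bellman targets y = r + gamma * Q'(s', mu'(s')).

    next_q_values holds the target critic's estimate for the action chosen
    by the target actor mu' in the next state s'. Terminal transitions
    (done=True) bootstrap from zero, which is a common convention and an
    assumption here.
    """
    return [r + gamma * (0.0 if d else q)
            for r, q, d in zip(rewards, next_q_values, dones)]


def critic_loss(predicted_q, targets):
    """Mean squared difference between predicted and target Q-values."""
    return sum((y - q) ** 2 for q, y in zip(predicted_q, targets)) / len(targets)


def actor_loss(q_of_actor_actions):
    """The actor maximizes Q(s, mu(s)), i.e., minimizes its negative mean."""
    return -sum(q_of_actor_actions) / len(q_of_actor_actions)
```

In a full implementation, both losses would be differentiated with respect to the respective network parameters; here they only show which quantities are compared.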

4. Implementation

This section outlines the environment setup, including the chosen state and action spaces for the autonomous parallel parking simulation. It also presents the implementation of the parking space detection algorithm and details the three approaches for the autonomous maneuvering of the vehicle from the traffic lane to the parking space.

4.1. Perception and State Representation

An APS can rely on a variety of sensors, including ultrasonic or radar sensors, LiDAR, and camera systems. In this study, radar sensors were chosen exclusively for the autonomous parking simulation due to their robust performance in adverse weather conditions, accurate distance measurements, and suitability for proximity detection, aligning well with the precision requirements of parking maneuvers. In addition to radar readings, the state space in the autonomous parking simulation incorporates spatial characteristics, resulting in a representation with a total of 18 dimensions.

4.1.1. Radar Sensors

The ego vehicle is equipped with a perception system consisting of 10 radar sensors positioned primarily on the right-hand side of the vehicle. Nine of these sensors are distributed across various locations on the vehicle, contributing to a comprehensive perception of the surroundings. The remaining radar sensor is positioned near the back-right wheel of the ego vehicle. This placement is intentional: it provides detailed information about the proximity of the ego vehicle to the sidewalk during parallel parking. The precise positioning of the radar sensors is shown in Figure 2. Each radar sensor has a Field of View (FOV) of 50 degrees.

4.1.2. Spatial Characteristics

The implementation of the APS involves incorporating additional parameters crucial for accurately representing the state of the ego vehicle. Extensive analysis of various states was conducted throughout this research to determine the optimal state representation for effectively addressing the parking problem. The selected characteristics, thoroughly analyzed and presented in Table 1, play a pivotal role in defining the comprehensive state of the ego vehicle during the autonomous parallel parking process.

4.1.3. State Vector Representation

The state space in the autonomous parking simulation is represented as a comprehensive 18-dimensional state vector. This vector combines measurements from the radar sensors as well as spatial characteristics of the ego vehicle, as described in Section 4.1.1 and Section 4.1.2. The radar measurements provide essential information for proximity detection and obstacle avoidance, while the spatial characteristics capture the position, orientation, and alignment of the ego vehicle relative to the parking space. The full state vector can be mathematically represented as follows:

$$s = [\text{radar}_{1}, \text{radar}_{2}, \dots, \text{radar}_{10}, z, \text{angle}, \text{angle difference}, \text{distance}, \text{position}, \text{orientation}, \text{parallel distance}, \text{vertical distance}], \tag{4}$$

where $\text{radar}_{1}, \text{radar}_{2}, \dots, \text{radar}_{10}$ represent the 10 radar measurements, and the remaining variables are defined in detail in Table 1.
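As a rough illustration of Equation (4), the state can be assembled by concatenating the radar readings with the eight spatial characteristics. The container `SpatialFeatures` and the function `build_state` are hypothetical names, and each spatial characteristic is assumed to be a single scalar, which is what the 18-dimensional total implies.

```python
from dataclasses import dataclass, astuple


@dataclass
class SpatialFeatures:
    """The eight non-radar entries of Eq. (4), in the order they appear."""
    z: float
    angle: float
    angle_difference: float
    distance: float
    position: float
    orientation: float
    parallel_distance: float
    vertical_distance: float


def build_state(radar, spatial):
    """Concatenate 10 radar readings and 8 spatial features into an 18-vector."""
    if len(radar) != 10:
        raise ValueError("expected exactly 10 radar readings")
    return list(radar) + list(astuple(spatial))
```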

4.2. Action Space

The action space defines every action the agent can take in a given state. In the context of controlling the ego vehicle within the parallel parking simulation, the actions are continuous and are determined by parameters such as throttle, steer, brake, and reverse. Two fundamental properties, completeness and validity, are essential for a well-defined action space. Completeness ensures that the agent has the necessary actions to achieve its objectives; if any required action is omitted, the agent may struggle to fulfill its tasks. Validity, on the other hand, mandates that all actions within the space are legal and adhere to the rules of the environment [42].

To streamline the process, the action space for the parallel parking simulation focuses solely on two key control parameters: steer and reverse. In this context, “steer” denotes the steering angle of the ego vehicle’s front wheels, dictating the direction in which the vehicle turns. The “reverse” parameter signifies the backward movement of the vehicle. A visual representation of the vehicle’s final state, based on various steer and reverse combinations, is presented in Figure 3. The brake is consistently set to 0, indicating that the ego vehicle does not stop during parking unless the parking is successfully completed. Additionally, the throttle is maintained at a constant value of 0.3. For the selected vehicle, a Tesla Model 3, this throttle setting results in a speed range of approximately 18 to 25 km/h when the vehicle is in motion and the brakes are off.
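The reduced action space can be pictured as a mapping from the two learned quantities to a full control command. In the sketch below, `Control` is a stand-in dataclass whose field names mirror the `throttle`, `steer`, `brake`, and `reverse` attributes of CARLA's `carla.VehicleControl`; the helper `to_control` and the clamping of steer to [-1, 1] are assumptions for illustration.

```python
from dataclasses import dataclass

FIXED_THROTTLE = 0.3  # constant throttle stated in the text
FIXED_BRAKE = 0.0     # brake is kept at 0 during the maneuver


@dataclass
class Control:
    throttle: float
    steer: float
    brake: float
    reverse: bool


def to_control(steer, reverse):
    """Build a full control command from the two agent-controlled parameters."""
    steer = max(-1.0, min(1.0, steer))  # keep steer inside its valid range
    return Control(throttle=FIXED_THROTTLE, steer=steer,
                   brake=FIXED_BRAKE, reverse=bool(reverse))
```

Fixing throttle and brake this way leaves the agent with a two-dimensional decision at every step, which is the simplification the paragraph describes.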

4.3. Parking Space Detection

The initial task within this end-to-end framework is the detection of a suitable parking space from the perspective of the ego vehicle. To accomplish this, predefined criteria were established: the minimum acceptable width for a parking space was set to match the ego vehicle’s width, and the minimum acceptable length was determined to be 1.2 times the ego vehicle’s length.

In the initial phase, the ego vehicle maintains a constant throttle while the radar sensor located on the right side at the front of the vehicle takes continuous measurements to assess spatial relationships and identify potential obstacles within the sensor’s FOV. The process commences by initiating a timer when the sensor reading exceeds the minimum acceptable width. Subsequent readings are continuously recorded, and the distance traveled by the vehicle is calculated. The detection process concludes when this calculated distance reaches the minimum acceptable length for the parking spot. At this point, the ego vehicle stops, signaling the successful detection of the parking space. The parking space is considered to lie in the lane to the right of the ego vehicle, at the coordinates where the ego vehicle was located when the minimum acceptable length was reached. Essentially, this involves identifying the location where the ego vehicle encountered an obstacle-free distance equal to the minimum acceptable length and then projecting this point onto the right lane, forming the boundaries of the parking space rectangle. The pseudocode for the implementation of the parking space detection algorithm is presented in Algorithm 1.

Algorithm 1. Parking Space Detection Pseudocode

Maintain a constant throttle for the ego vehicle
Receive continuous readings from the front right-side radar sensor
distance = 0
While distance < minimum acceptable length
    if radar reading > minimum acceptable width
        Initiate timer
        While radar reading > minimum acceptable width
            Calculate distance traveled from timer initiation
            if distance >= minimum acceptable length
                Interrupt ego vehicle’s movement
                Parking space found!
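A runnable sketch of Algorithm 1 is given below under a few stated assumptions: the radar and speed readings arrive as a sequence of time-stamped samples, the distance traveled is obtained by accumulating speed multiplied by elapsed time since the gap started, and the gap measurement restarts whenever the reading drops back to the minimum width or below. The function name and the sample format are hypothetical.

```python
def detect_parking_space(samples, min_width, min_length):
    """Scan (time_s, radar_reading_m, speed_m_s) samples for a free gap.

    Returns the timestamp at which an obstacle-free stretch of at least
    min_length has been traveled, or None if no such gap is found.
    """
    gap_open = False
    distance = 0.0
    prev_time = None
    for time_s, reading, speed in samples:
        if reading > min_width:
            if gap_open:
                # Accumulate distance traveled since the timer was started.
                distance += speed * (time_s - prev_time)
            else:
                # Reading first exceeds the width threshold: start the timer.
                gap_open = True
                distance = 0.0
            if distance >= min_length:
                return time_s  # vehicle stops here; parking space found
        else:
            # An obstacle interrupts the gap, so the measurement restarts.
            gap_open = False
            distance = 0.0
        prev_time = time_s
    return None
```

With the criteria stated in the text, `min_width` would be the ego vehicle's width and `min_length` would be 1.2 times its length.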

4.4. Autonomous Parallel Parking Implementation

Once the parking space is identified, the subsequent task involves maneuvering the ego vehicle from the traffic lane to the identified parking space. The autonomous parallel parking implementation is approached through three distinct methods, as detailed in the following sections.

4.4.1. Imitation Learning

The first approach employs IL, a technique designed to replicate human behavior in a specific task. In this context, the agent is trained to execute a task by learning a mapping between observations and corresponding actions [43]. To implement IL, a dataset of imitation data is manually collected using the CARLA Simulator v.0.9.14 to train an NN, including state–action pairs for twenty complete parking sequences (https://github.com/IoannaMarina/Autonomous-parallel-parking-Imitation-data/, accessed on 1 May 2025). The ego vehicle is manually driven from its initial position to the parking space, and data capturing the ego vehicle’s state representation, along with the corresponding actions, are recorded at 1-second intervals. Figure 4 illustrates two indicative scenarios from the generation of imitation samples, showing the trajectory of the ego vehicle during the manual movement from the traffic lane to the parking space.

Subsequent to the collection of imitation data, an NN is trained using the state representation as input and actions as output. The only preprocessing step involves encoding the reverse output. In the CARLA Simulator, the reverse control of a vehicle is represented by either True or False. For NN training purposes, these values are translated into 1 if reverse is True and −1 if reverse is False.

Throughout this research, various architectures and hyperparameters were systematically tested through an iterative loop. Each iteration involved resetting the network, training it with different hyperparameters, and employing diverse architectures. The optimal network configuration, selected for achieving superior accuracy and minimizing loss, is detailed in Table 2. The input layer is composed of 18 neurons, corresponding to the dimensionality of the ego vehicle’s state space. Meanwhile, the output layer comprises two neurons, aligning with the action space of the ego vehicle. Notably, the output layer is equipped with a tanh activation function, which outputs values within the range of [−1, 1]. This activation choice is particularly suited for both steer and reverse, considering their encoding, as it confines their values within this predefined range.

The network is trained with a mean squared error loss function, which quantifies the disparity between predicted and actual outputs. The Adam optimizer, with a learning rate of 0.0001, is used to update the network’s weights during training. Early stopping, a regularization technique, is implemented to cease training if the model’s performance, monitored by the loss function, does not improve for 25 consecutive epochs. The loss plot of the training is presented in Figure 5.
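The early stopping rule can be illustrated independently of any deep learning framework. The sketch below takes a sequence of per-epoch loss values and reports when training would stop; the function name is hypothetical, and treating any strict decrease as an improvement is an assumption, since the text does not specify a tolerance or which loss (training or validation) is monitored.

```python
def early_stopping_epoch(losses, patience=25):
    """Return the 0-based epoch at which training stops, or None.

    Training stops once the monitored loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None
```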

Notably, the steer predictions exhibit a 75% matching sign accuracy, while the reverse estimates show a high accuracy of 96%. It is important to highlight that the network outputs reverse values in the range [−1, 1], where values above 0 translate to True and values below 0 translate to False. For steer predictions, given the continuous range of [−1, 1], the evaluation focuses on matching signs rather than exact values, considering the challenge of achieving precise predictions within this broad continuous space. This approach is particularly relevant since similar sign steer values correspond to movement in the same orientation.
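The reverse encoding and the sign-based evaluation can be sketched in a few lines. The function names are hypothetical, and the treatment of an exact zero (grouped with the negative values) is an assumption, since the text only specifies the behavior above and below 0.

```python
def encode_reverse(flag):
    """Map CARLA's boolean reverse flag to the NN training target."""
    return 1.0 if flag else -1.0


def decode_reverse(value):
    """Map a network output in [-1, 1] back to a boolean reverse flag."""
    return value > 0


def sign_match_accuracy(predicted, actual):
    """Fraction of samples whose predicted and actual values share a sign."""
    matches = sum(1 for p, a in zip(predicted, actual) if (p > 0) == (a > 0))
    return matches / len(actual)
```

For example, four steer predictions of which three have the same sign as the recorded values give an accuracy of 0.75, regardless of how far the magnitudes differ.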

4.4.2. DDPG

The second approach employs the DDPG algorithm for the implementation of the parallel parking simulation. The architecture of the actor network is the same as the architecture of the IL model, with the difference that the output layer is initialized with random values sampled from a uniform distribution between −0.003 and 0.003. This specific range for weight initialization is often used in RL scenarios to prevent weight values from being too large, which can lead to instability during training. The critic network’s architecture adheres to the recommended structure outlined by Keras [44], and the detailed layer-wise configuration is provided in Table 3.

Reward Function

The implementation of RL algorithms requires the construction of a nuanced and context-specific reward function. Various strategies were explored before arriving at the definitive reward function for the autonomous parallel parking simulation.

The refined reward function, which addresses issues of poor learning and suboptimal policies, is a combination of distance-related and orientation-related components, designed to encourage desired behaviors and penalize unwanted actions. The primary elements include the distance to the parking space ($\text{distance}$) and the angle difference ($\text{angle\_diff}$) between the desired and actual orientation of the ego vehicle. The overall reward is determined as a weighted sum of these components:

$$\text{reward} = 0.35 \cdot \text{angle\_reward} + 0.65 \cdot \text{distance\_reward}, \tag{5}$$

$$\text{angle\_reward} = \begin{cases} \text{abs}(\cos(\text{angle\_diff})) & \text{if } \text{angle\_diff} \leq 45 \\ 1.5 \cdot (1 - \text{abs}(\cos(\text{angle\_diff}))) & \text{if } \text{angle\_diff} > 45, \end{cases} \tag{6}$$

$$\text{distance\_reward} = \begin{cases} \dfrac{\text{max\_distance} - \text{distance}}{\text{max\_distance} - \text{min\_distance}} & \text{if } \text{min\_distance} < \text{distance} < \text{max\_distance} \\ 1 & \text{if } \text{distance} \leq \text{min\_distance}, \end{cases} \tag{7}$$

where $\text{max\_distance}$ is set at 10 m, serving as a critical threshold beyond which the agent incurs a penalty, prompting episode termination. Conversely, $\text{min\_distance}$ is established at 0.5 m, indicating proximity to the parking space, and the agent receives a positive reward of 1 for maintaining this close distance. The higher weight of 0.65 for distance reflects the priority of achieving accurate positioning within the parking spot, while the 0.35 weight for angle allows for slight orientation deviations, as perfect alignment is less critical than ensuring the vehicle is well placed. Notably, these values were determined after an exhaustive search to balance successful parking with manageable orientation flexibility.
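Equations (5) to (7) can be transcribed as follows. The sketch assumes that the angle difference is a non-negative value in degrees (the thresholds of 45 and, later, 3 degrees suggest this) and that the cosine is taken of that angle. The penalty applied at or beyond the maximum distance is not quantified in the text, so the sketch raises an error there instead of guessing a value.

```python
import math

MAX_DISTANCE = 10.0  # m, threshold beyond which the episode terminates
MIN_DISTANCE = 0.5   # m, proximity threshold that earns the full reward


def angle_reward(angle_diff_deg):
    """Eq. (6), transcribed as printed; angle_diff_deg assumed >= 0, in degrees."""
    c = abs(math.cos(math.radians(angle_diff_deg)))
    return c if angle_diff_deg <= 45 else 1.5 * (1 - c)


def distance_reward(distance):
    """Eq. (7); the out-of-range case is left to the caller."""
    if distance <= MIN_DISTANCE:
        return 1.0
    if distance < MAX_DISTANCE:
        return (MAX_DISTANCE - distance) / (MAX_DISTANCE - MIN_DISTANCE)
    raise ValueError("distance >= MAX_DISTANCE: penalty value not given in the text")


def base_reward(angle_diff_deg, distance):
    """Eq. (5): weighted sum of the angle and distance components."""
    return 0.35 * angle_reward(angle_diff_deg) + 0.65 * distance_reward(distance)
```

A perfectly aligned vehicle within 0.5 m of the parking space therefore obtains the maximum base reward of 1.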

Additionally, the reward function handles collision scenarios through the utilization of the front and rear radar sensors. When the measurement from one of these sensors becomes equal to zero, a collision is considered to have occurred. The ego vehicle’s velocity at the time of this measurement is then used to estimate the intensity of the collision. If the velocity of the ego vehicle is higher than 2 m/s, the collision is considered severe. If the velocity is 2 m/s or lower, the collision is considered minor. The handling of collisions in the reward function is as follows:

$$\text{if collision is not None:} \quad \text{reward} = \begin{cases} -1 & \text{if severe collision} \\ -0.5 \cdot \text{intensity} & \text{if minor collision}, \end{cases} \tag{8}$$

where if a severe collision is detected the episode is promptly terminated. In cases of minor collisions, a negative reward is assigned to the agent, without terminating the episode. To enhance the robustness of the algorithm, an additional check is implemented. If the ego vehicle experiences minor collisions on more than three occasions during a single episode, the episode is then terminated.
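The collision rules can be sketched as a small function that returns the reward, a termination flag, and the running count of minor collisions. The function name and return format are hypothetical. Because the text does not give the formula that converts velocity into collision intensity, `intensity` is taken as an input rather than computed.

```python
SEVERE_SPEED_M_S = 2.0    # collisions above this speed are severe
MAX_MINOR_COLLISIONS = 3  # more than this many minor collisions ends the episode


def collision_outcome(speed_m_s, intensity, minor_count):
    """Apply Eq. (8) to a detected collision.

    Returns (reward, terminate, updated_minor_count).
    """
    if speed_m_s > SEVERE_SPEED_M_S:
        # Severe collision: fixed penalty and immediate termination.
        return -1.0, True, minor_count
    minor_count += 1
    # Minor collision: scaled penalty; terminate only after repeated offenses.
    terminate = minor_count > MAX_MINOR_COLLISIONS
    return -0.5 * intensity, terminate, minor_count
```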

To refine the learning process, the RL framework incorporates supplementary rewards and penalties targeting specific behaviors. These include addressing substantial alterations in vehicle orientation, instances of incorrect steering, and proximity to sidewalks. The initial reward assigned to the agent, as defined by the comprehensive reward function in Equation (5), undergoes modification based on the conditions outlined in Table 4. Specifically, if the computed reward is positive, it is multiplied by a factor of −1; otherwise, it remains unchanged.

In determining the successful parking state, the ego vehicle is considered close and aligned with the parking space based on the following condition:

$$\text{distance} < 0.5 \ \text{AND} \ \text{angle\_diff} < 3. \tag{9}$$

Based on this condition, an episode terminates with a reward equal to 1 when the ego vehicle is less than 0.5 m from the parking space and the angular deviation between its orientation and the desired orientation is less than 3 degrees.

During the preliminary training of the DDPG model, a reward hacking phenomenon was observed, induced by action oscillations. Reward hacking in the context of RL refers to a situation where an agent exploits the reward function to achieve higher rewards without genuinely accomplishing the intended task [45]. In the observed scenario, the agent engaged in action oscillations, moving back and forth without making progress toward the parking space. Although these actions did not contribute to successful parking, they yielded positive rewards, allowing the agent to “hack” the reward function and artificially inflate its cumulative reward.

To mitigate reward hacking, two additional penalties are introduced into the reward function. The first penalty addresses back-and-forth oscillation: the reward obtained from the primary reward function is diminished by a factor of 0.8 if the current reverse value is opposite to that of the previous step. This approach acknowledges that occasional back-and-forth movement may occur, and while suboptimal, it is not a strictly forbidden action. The second penalty in the reward function introduces a temporal discounting mechanism for the value of actions. Specifically, it reflects the idea that actions performed earlier in an episode are considered more influential or significant than the same actions performed later. By assigning higher value to actions in the initial stages of an episode, the penalty aims to encourage the agent to make decisive and effective maneuvers early on, influencing its learning process to prioritize actions with a more immediate impact on achieving the task. To achieve the second penalty, the rewards are reduced by a factor of 0.001 multiplied by the step number in the episode. Finally, all rewards are clipped within the range of −1 to 1 to ensure normalization and prevent the algorithm from being overly sensitive to extreme reward values.
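One possible reading of these two penalties and the final clipping is sketched below. The wording in the text leaves some room for interpretation, so two assumptions are made explicit: "diminished by a factor of 0.8" is read as multiplying the reward by 0.8, and the temporal penalty is read as subtracting 0.001 times the step number. The function name is hypothetical.

```python
def shape_reward(reward, reverse, prev_reverse, step):
    """Apply the oscillation penalty, the step penalty, and clipping.

    reward       -- value from the primary reward function
    reverse      -- current reverse flag
    prev_reverse -- reverse flag of the previous step (None on the first step)
    step         -- index of the current step within the episode
    """
    if prev_reverse is not None and reverse != prev_reverse:
        reward *= 0.8          # assumed reading of the oscillation penalty
    reward -= 0.001 * step     # assumed reading of the temporal penalty
    return max(-1.0, min(1.0, reward))  # clip to [-1, 1]
```

Under this reading, a direction change at step 10 turns a base reward of 0.5 into 0.39.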

Exploration Noise

To facilitate exploration during the training process, the OU process is employed to introduce noise to the actions sampled from the actor network. A decay mechanism is strategically incorporated for the noise, emphasizing exploration initially while gradually diminishing it as training progresses. This dynamic adjustment allows the algorithm to transition from an exploratory phase to more deterministic behavior over time. To implement this, the noise action is multiplied by a noise factor, which progressively decays throughout the training duration. The noise factor commences with an initial maximum value set at 0.2 and decreases at a controlled rate of 0.001 until it converges to the minimum value of 0.0001.
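The decay of the noise factor might look like the following sketch. It assumes a subtractive decay of 0.001 per call; the text does not state whether the decrement is applied per step or per episode, nor whether it is subtractive or multiplicative, so this is only one plausible reading. The function name is hypothetical.

```python
NOISE_MAX = 0.2      # initial noise factor
NOISE_MIN = 0.0001   # floor of the noise factor
NOISE_DECAY = 0.001  # decrement applied on every call (assumed subtractive)


def decay_noise_factor(factor, rate=NOISE_DECAY, minimum=NOISE_MIN):
    """Reduce the noise factor by `rate`, never dropping below `minimum`."""
    return max(minimum, factor - rate)
```

The OU noise sample would then be multiplied by the current factor before being added to the actor's action.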

Modifications

To ensure continuous exploration, a strategy is implemented to reset the noise factor to its initial maximum value after a specific number of episodes. This systematic approach evaluates the success rate of parking attempts during training and triggers reinitialization of the noise factor based on the criteria outlined in Table 5.

Two additional approaches are explored: the first omits the reinitialization of the noise factor based on the success rate, and the second reinitializes it at fixed intervals of 20, 40, and 80 episodes. In the former, the noise is discernible only in the initial five episodes, resulting in inadequate exploration. The latter approach, where the noise is reinitialized at fixed intervals, leads to the emergence of catastrophic forgetting [46]. Catastrophic forgetting in RL denotes a phenomenon where NNs, particularly the actor and critic networks in the DDPG, tend to forget or lose previously acquired information when trained on new data. This issue arises during network fine-tuning or updates with new experiences, resulting in a decline in performance on tasks learned earlier [47]. Notably, this phenomenon is observed in the cumulative reward function over episodes, which exhibits significant oscillations, as depicted in Figure 6. The frequent adjustments to the noise factor lead the actor network to swiftly adapt to new exploration strategies, causing a struggle to retain knowledge from earlier episodes and a subsequent decline in the network’s ability to execute previously learned tasks effectively.

Furthermore, an additional strategic adjustment is implemented to augment training stability and optimize the performance of the DDPG model. Upon the observation of 20 consecutive successful parking attempts during the training process, the learning rates for both the actor and critic networks are divided by 10. This adaptive measure is introduced in response to indications of potential convergence or proximity to an optimal policy. The deliberate reduction in learning rate serves to prevent overshooting optimal values, mitigate the risk of divergence, and foster a smoother convergence of the model.
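The learning rate adjustment can be sketched as a small tracker of consecutive successes. The class name is hypothetical, and resetting the streak counter after each reduction is an assumption: the text does not say whether the division is applied once or again after every further run of 20 successes.

```python
class LearningRateSchedule:
    """Divide both learning rates by 10 after a run of consecutive successes."""

    def __init__(self, actor_lr, critic_lr, streak_needed=20):
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.streak_needed = streak_needed
        self.streak = 0

    def record(self, success):
        """Register one episode outcome; return True if the rates were reduced."""
        self.streak = self.streak + 1 if success else 0
        if self.streak == self.streak_needed:
            self.actor_lr /= 10
            self.critic_lr /= 10
            self.streak = 0  # assumption: start counting a fresh streak
            return True
        return False
```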

Finally, after several iterations with diverse combinations of hyperparameters, the final hyperparameters used for the standalone DDPG algorithm are summarized in Table 6.

The choice of a smaller learning rate for the actor network compared to the critic network in the DDPG is motivated by the inherent difference in the nature of their updates. The actor network is responsible for determining the policy, directly influencing the agent’s actions. Given the deterministic nature of the policy in the DDPG, small changes in the actor’s parameters can lead to substantial shifts in the output actions, potentially destabilizing the learning process. A smaller learning rate for the actor mitigates this issue, promoting more cautious and stable updates to the policy. On the other hand, the critic network is tasked with approximating the action-value function and plays a role in assessing the quality of the selected actions. Critic updates involve backpropagating the temporal difference errors, and a relatively larger learning rate allows the critic to adapt more quickly to changes in the environment.

Training

Subsequently, the DDPG algorithm undergoes training using the identified hyperparameters. The training consists of a maximum of 2000 episodes, each comprising 100 steps. Despite the absence of a clear plateau in the average cumulative reward, the training concludes after 2000 episodes. The training duration spans approximately 52 h, underscoring the computational demands of the simulation. Figure 7 visually represents the average cumulative reward plot of the DDPG algorithm training. Notably, the cumulative reward exhibits fluctuations, suggesting instability or a lack of convergence in the learning process. Remarkably, even with these fluctuations, the cumulative reward attains high values, potentially indicating proficient performance by the agent. In the concluding episodes of the training, the agent successfully executes parking attempts, demonstrating its ability to adapt and make optimal decisions, albeit within an environment characterized by some instability. It is worth noting that extending the simulation for additional episodes is a plausible approach, but it may necessitate modifications, particularly in the experience replay mechanism, to mitigate the potential effects of catastrophic forgetting.

4.4.3. DDPG-IL

The third and final approach combines the DDPG with IL for the implementation of the parallel parking simulation. Within this context, the weights obtained from the first approach (IL) are used as the initial weights for the actor network. The actor networks used in the IL phase and the RL phase share the same architecture. The critic’s network architecture remains consistent with the one used in the second approach, where the DDPG algorithm was tested. Additionally, each state–action pair from the IL samples undergoes reward evaluation using the specified reward function. Subsequently, the experience replay buffer is populated using the samples collected under the policy generated by IL, and these samples are frozen in the experience replay buffer throughout the training process. This freezing mechanism is implemented to maintain a reference of good practices established during the successful IL phase.
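The freezing mechanism can be pictured as a replay buffer with two regions: a permanent set of imitation samples and a circular region for experiences gathered during RL training. The sketch below is one way to realize this; the class name, the uniform sampling over both regions, and the choice to apply `capacity` only to the online region are assumptions.

```python
import random
from collections import deque


class DemoReplayBuffer:
    """Replay buffer whose imitation samples are never evicted."""

    def __init__(self, demonstrations, capacity, seed=None):
        self.demos = list(demonstrations)     # frozen IL transitions
        self.online = deque(maxlen=capacity)  # circular region for RL transitions
        self.rng = random.Random(seed)

    def add(self, transition):
        # Only the online region is overwritten when it fills up.
        self.online.append(transition)

    def sample(self, batch_size):
        pool = self.demos + list(self.online)
        return self.rng.sample(pool, min(batch_size, len(pool)))
```

However long training runs, every sampled batch can still contain imitation transitions, which is the reference of good practices the paragraph refers to.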

Hyperparameters Definition Process

To determine the optimal hyperparameters for the DDPG-IL algorithm, a comprehensive analysis is performed. The process involves utilizing pre-trained actor network weights and populating the experience replay buffer with imitation samples. During the analysis, both the actor and critic networks are trained without the additional generation of real-time simulation data. A total of 324 hyperparameter combinations, as outlined in Table 7, are systematically evaluated, and the criteria, detailed in Table 8, are employed for the assessment.

The evaluation yields the optimal hyperparameter combination: an actor learning rate of 0.0001, a critic learning rate of 0.001, a discount rate of 0.9, a soft update rate of 0.001, and an experience replay batch size of 64. Notably, this configuration achieves a steer estimation accuracy of 75% and a reverse value estimation accuracy of 100%. An important observation is that, although the same imitation samples are used for both the standalone IL approach and the integrated DDPG-IL approach, the latter matches the steer accuracy of the former (75%) and raises the reverse accuracy from 96% to 100%. This suggests that the DDPG-IL hybrid model can enhance performance compared to the standalone IL approach. Figure 8 depicts the convergence of actor and critic losses (a), alongside the residual variance (b), throughout the training process. The residual variance is expected to start at a high value, decrease rapidly early on, and then decrease more gradually over the remaining training period. Persistently high residuals may indicate that the critic network struggles to predict rewards accurately. Conversely, a drop to zero might signify a collapse of policy entropy, resulting in deterministic behavior, with the value network perfectly learning the collected rewards. Additionally, if the residuals oscillate wildly, it could suggest a high learning rate, emphasizing the importance of appropriate tuning for stable training dynamics [48].

Overall, the hyperparameters used for the training of the DDPG-IL algorithm are summarized in Table 6.

Training

Subsequently, the DDPG-IL algorithm undergoes training using the selected hyperparameters. The training consists of a maximum of 2000 episodes, each comprising 100 steps. The DDPG-IL training is halted upon reaching a plateau in the averaged cumulative reward plot, observed after approximately 800 episodes, as depicted in Figure 9. The decision to conclude the training process is driven by the consideration that the agent has likely converged to a stable policy, and additional iterations may not lead to significant performance gains. It is essential to note that reaching a plateau does not necessarily mean that the agent has found the globally optimal solution. It could be a local optimum, and further exploration or adjustments to the learning parameters may be needed to improve the agent’s performance or discover a better policy. However, at this point in training, the agent consecutively succeeds in parking attempts, providing a practical basis for concluding the training process. It is noteworthy that the training process here lasted approximately 26 h.

5. Experimental Results

To ensure a fair comparison among the three developed approaches, a standardized framework has been devised wherein the same parking scenario is applied to each approach. In this controlled setting, the three approaches undergo testing across 100 identical parking scenarios in the CARLA Simulator, and their respective performances are meticulously collected for in-depth analysis. The comparison primarily centers on three key aspects: success rate, average steps per successful episode, and average reward per successful episode.

5.1. Success Rate

The first aspect examines the success rate, quantifying the frequency with which each approach achieves successful parking outcomes. This metric serves as a pivotal indicator of the overall efficacy and reliability of the approaches under consideration. To further strengthen the results, 95% confidence intervals for the success rates are calculated based on 10 separate trials, each consisting of 100 identical parking scenarios. Each set of tests was conducted independently, providing a robust estimate of the variability and reliability of the success rates across multiple testing repetitions. Figure 10 illustrates a bar plot representing both the success rates and their corresponding 95% confidence intervals for the three approaches. The plot reveals that the DDPG-IL approach boasts the highest success rate (98%), with a 95% confidence interval of (97.17%, 98.83%). The DDPG approach follows closely with a success rate of 92% and a confidence interval of (90.74%, 93.26%). As anticipated, the IL approach exhibits the lowest success rate of 63%, with a confidence interval of (60.54%, 65.46%).
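A confidence interval of this kind can be computed from the per-trial success rates. The sketch below uses a two-sided Student-t interval with 9 degrees of freedom, which is an assumption: the text does not state which interval formula was used. The function name is hypothetical, and the input values in the usage note are made up for illustration, not taken from the experiments.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom,
# appropriate for 10 trials.
T_CRIT_95_DF9 = 2.262


def confidence_interval_95(success_rates):
    """Return (lower, upper) bounds of the mean success rate over 10 trials."""
    if len(success_rates) != 10:
        raise ValueError("the critical value used here assumes exactly 10 trials")
    mean = statistics.mean(success_rates)
    std_error = statistics.stdev(success_rates) / math.sqrt(len(success_rates))
    half_width = T_CRIT_95_DF9 * std_error
    return mean - half_width, mean + half_width
```

For instance, ten hypothetical trial results such as `[98, 97, 99, 98, 98, 99, 97, 98, 99, 97]` have a mean of 98 and give an interval slightly wider than one percentage point.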

In comparing these findings with those from other studies, Piao et al. [11] tested their multi-sensor self-adaptive trajectory generation method across ten parallel parking scenarios, achieving a success rate of 90%. While this demonstrates the effectiveness of their method within specific spatial constraints, as the parking space was 1.38 times the vehicle length, the limited scope of testing in only ten scenarios raises concerns about the applicability of their findings to varied real-world parking situations. Additionally, Sousa et al. [26] found that successful parallel parking was only achievable when the parking spot length was at least twice the vehicle’s length. Although specific success rates were not reported, the requirement for larger parking spaces significantly limits the applicability of this approach in real-world scenarios.

Furthermore, Zhang et al. [25] evaluated their method in five scenarios for both standalone NNs and those combined with the MCTS, using parking spot lengths of 1.28, 1.4, and 1.54 times the vehicle length. While all scenarios were completed successfully, the limited number of test scenarios suggests potential issues regarding the robustness of the findings across diverse parking environments. Song et al. [24] tested their approach in 25 real parallel parking scenarios, where the parking space was 1.54 times the vehicle length. The DERL method achieved a 100% success rate, while the baseline MCTS approach with a refined vehicle model recorded a success rate of 84%, both under conditions where the initial vehicle positions were the same as those used during training. To further test the generalization capability, the second set of experiments involved varying the initial orientation of the vehicle across 20 tests. In these tests, the DERL method maintained a 100% success rate, while the MCTS approach’s success rate dropped to 70%, highlighting the DERL method’s superior adaptability to unseen initial parking poses.

Notably, in the present work, the ratio of parking space length to vehicle length is 1.2, which is the smallest compared to all other studies discussed. This tighter constraint enhances the relevance of our findings to real-world scenarios where space limitations are more pronounced. Finally, Du et al. [27] evaluated their DQN and DRQN based on the performance of the latest 5000 training parking scenarios over 105,000 episodes, yielding average success rates of 98.4% for DQN and 94.3% for DRQN using discrete action spaces with 42, 62, 82, and 102 actions. While these results are commendable, there are notable concerns when addressing the complexities of autonomous parking tasks with discrete action spaces.

It is worth mentioning that research on conventional APS approaches [10,13,14] does not focus on evaluating success rates. Instead, these studies primarily utilize metrics such as the length of the parking path, the time taken to park, and the computation time associated with path planning.

5.2. Average Steps per Successful Episode

The second aspect focuses on the average steps required for a vehicle to successfully complete a parking maneuver. Within the simulation environment, a step involves a movement lasting approximately 2 s. This aspect holds significance as it provides insights into the efficiency and speed at which each approach identifies and navigates the optimal path during the parallel parking simulation. A lower average step count not only indicates a quicker and more streamlined parking process but also highlights the approach’s ability to swiftly adapt to diverse parking scenarios. Figure 11 presents a bar plot illustrating the average steps per episode for each approach. Upon examination, it is evident that the IL approach requires the fewest steps per successful parking attempt, with an average of 25 steps per episode. Following closely is the DDPG-IL approach, which requires 30 steps per episode. Conversely, the standalone DDPG approach necessitates the most steps compared to the other methods, averaging 36 steps per episode.

5.3. Average Reward per Successful Episode

Concluding the evaluation, the third aspect assesses the average reward earned by the agent in each successful attempt, using the reward function defined in the Reward Function section. The average reward measures the quality of the agent’s actions across scenarios: it indicates how consistently the agent makes favorable decisions and executes actions that contribute positively to the overall parking outcome. Figure 12 presents a bar plot illustrating the average reward per successful parking attempt for the three approaches. Notably, the DDPG-IL approach emerges with the highest average reward, followed by the standalone DDPG approach and the IL approach, respectively.
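As an illustration of how the three criteria of Sections 5.1 to 5.3 could be computed from evaluation logs, the following Python sketch aggregates per-episode records. The `Episode` record, its field names, and the sample values are hypothetical and chosen only for the example; the approximate 2 s step duration is the only value taken from the text.

```python
from dataclasses import dataclass
from statistics import mean

STEP_DURATION_S = 2.0  # approximate duration of one simulation step (Section 5.2)


@dataclass
class Episode:
    success: bool   # True if the vehicle parked successfully
    steps: int      # number of steps taken in the episode
    reward: float   # cumulative reward collected in the episode


def summarize(episodes):
    """Success rate (%) plus mean steps, time and reward over successful episodes."""
    successful = [e for e in episodes if e.success]
    if not successful:
        return {"success_rate": 0.0, "avg_steps": None,
                "avg_time_s": None, "avg_reward": None}
    avg_steps = mean(e.steps for e in successful)
    return {
        "success_rate": 100.0 * len(successful) / len(episodes),
        "avg_steps": avg_steps,
        "avg_time_s": avg_steps * STEP_DURATION_S,
        "avg_reward": mean(e.reward for e in successful),
    }


# Made-up log of four evaluation episodes (illustrative values only).
log = [Episode(True, 28, 40.0), Episode(True, 32, 44.0),
       Episode(False, 100, -5.0), Episode(True, 30, 42.0)]
stats = summarize(log)
```

Under the 2 s assumption, an average of 30 steps corresponds to roughly 60 s per maneuver. Failed episodes are excluded from the step and reward averages, matching the "per successful episode" wording of the section titles.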

5.4. Discussion

Overall, in the IL approach, the ego vehicle starts its parking maneuver proficiently, effectively replicating behaviors learned from the training scenarios. However, it struggles when confronted with novel, unanticipated situations, which introduce uncertainty into its decisions. Successful attempts occur mainly in scenarios that closely mirror those encountered during the training phase. Although IL requires fewer steps per successful parking attempt than the other two algorithms, its success rate of 47%, combined with the fact that the remaining episodes terminated because the ego vehicle collided with obstacles, diminishes its overall effectiveness and renders this approach unsafe.

With the DDPG algorithm, most parking attempts succeed, but the approach has two limitations. First, the vehicle needs a relatively high number of steps to complete the maneuver and does not strictly follow the conventional real-world procedure for parallel parking. Second, 8% of the attempts failed. In 3% of all attempts, the vehicle collided with obstacles, with impacts ranging from minor to severe. These collisions were primarily with motorbikes or bicycles parked in front of or behind the parking space, which are smaller and harder for radar sensors to detect; other obstacles, such as objects on the sidewalk, accounted for the rest. In the remaining 5% of all attempts, the vehicle did not collide but failed to complete the parking within the 100-step limit. Despite these failures, the overall success rate makes the approach effective and acceptable.
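The failure breakdown above can be reproduced from per-episode termination flags. The sketch below is an assumption about how such a classification might look: the function names and outcome labels are hypothetical, only the 100-step limit and the 3% / 5% split come from the text, and the list of outcomes is synthetic.

```python
from collections import Counter

MAX_STEPS = 100  # per-episode step limit reported in the text


def classify_episode(parked: bool, collided: bool, steps: int) -> str:
    """Map termination flags to an outcome label (labels are illustrative)."""
    if collided:
        return "collision"
    if parked:
        return "success"
    if steps >= MAX_STEPS:
        return "timeout"
    return "other"


def outcome_shares(outcomes):
    """Percentage of episodes per outcome label."""
    counts = Counter(outcomes)
    return {label: 100.0 * n / len(outcomes) for label, n in counts.items()}


# Synthetic set of 100 outcomes mirroring the reported DDPG breakdown.
outcomes = ["success"] * 92 + ["collision"] * 3 + ["timeout"] * 5
shares = outcome_shares(outcomes)
```

Separating collisions from timeouts matters for safety assessment, since only the former involve contact with obstacles.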

Remarkably, the DDPG-IL algorithm exhibits superior performance, achieving a 98% success rate. It excels in both successful attempts and average rewards, with the addition of Imitation Learning data significantly improving the vehicle’s ability to handle a wider range of scenarios. This improvement makes the agent more adept at avoiding obstacles, including motorbikes, bicycles, and other unexpected objects. Notably, all of the 2% unsuccessful attempts in the DDPG-IL approach were due to timeouts, with no collisions observed. In these episodes, the vehicle needed more steps than the limit allowed to complete the parking maneuver, which indicates room for improvement in efficiency. Importantly, the DDPG-IL approach closely mimics the parking maneuvers commonly executed by human drivers in real-life scenarios. It is also important to highlight that the experiments were conducted in a simulated environment. Therefore, real-world implementation would necessitate further adaptation to account for environmental variability, perception uncertainty, and integration challenges with onboard systems. Furthermore, given that most commercial Autonomous Parking Systems are deployed under human supervision, the presence of a driver can provide an additional safety layer by allowing manual intervention in rare cases of failure.

Additionally, the effect of increasing the amount of imitation data in the DDPG-IL approach was examined. Contrary to expectations, performance declined when 30 or 40 demonstrations were used instead of the original 20. This degradation is likely due to overfitting, where the agent learns a narrower behavioral pattern that does not generalize well to new situations, and to replay buffer imbalance, where imitation data disproportionately influenced the training due to their relatively large share. These findings underscore the need for careful calibration of demonstration volume to avoid compromising generalization capabilities.
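The replay buffer imbalance can be made concrete with a small calculation. Under uniform sampling, the expected fraction of demonstration transitions in a batch equals their share of the buffer. In the sketch below, the number of transitions per demonstration and the number of agent-collected transitions are assumed values, not figures from the study; only the demonstration counts (20, 30, 40) and the batch size of 64 come from the text and Table 6.

```python
BATCH_SIZE = 64            # batch size from Table 6
STEPS_PER_DEMO = 30        # ASSUMED transitions per demonstration
AGENT_TRANSITIONS = 5000   # ASSUMED agent-collected transitions in the buffer


def demo_share(n_demos, steps_per_demo, agent_transitions):
    """Fraction of the replay buffer occupied by demonstration transitions."""
    demo_transitions = n_demos * steps_per_demo
    return demo_transitions / (demo_transitions + agent_transitions)


for n_demos in (20, 30, 40):
    share = demo_share(n_demos, STEPS_PER_DEMO, AGENT_TRANSITIONS)
    # Expected number of demonstration samples in one uniformly drawn batch.
    print(f"{n_demos} demos: share={share:.3f}, "
          f"expected per batch={share * BATCH_SIZE:.1f}")
```

The share grows with the number of demonstrations and is largest early in training, when few agent transitions have been collected, which is consistent with the imbalance explanation given above.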

Ultimately, the DDPG-IL hybrid approach stands out as the most robust and effective solution for autonomous parallel parking in this study. The integration of IL data with DDPG training not only enhances overall performance but also contributes to safer parking maneuvers that require fewer steps than standalone DDPG. Notably, the unsuccessful parking episodes in the DDPG-IL approach ended in timeouts rather than collisions.

To further illustrate the performance of the developed approaches, Figure 13 and Figure 14a,b show snapshots of the ego vehicle’s movement at different stages of the same scenario for the IL, DDPG, and DDPG-IL algorithms, respectively.

In this scenario, both the DDPG-IL and DDPG approaches successfully complete the parking attempt, whereas the IL approach results in the vehicle deviating from the parking space and ending up on the sidewalk. The observed behavior highlights a critical limitation of the IL approach: its inability to navigate challenging situations effectively. In contrast, the DDPG-IL approach achieves the parking task within 26 steps, while the DDPG approach requires 33 steps. Importantly, the DDPG-IL approach closely mimics human-like parking maneuvers due to leveraging data designed to replicate human behavior, whereas the DDPG approach adopts a distinct, less efficient maneuvering strategy.

6. Conclusions and Future Enhancements

This study presents a comprehensive framework for autonomous parallel parking using three methodologies: IL, DDPG, and a hybrid DDPG-IL approach in the CARLA Simulator. While the IL approach achieved a 47% success rate, the DDPG algorithm improved performance significantly to 92%. The hybrid DDPG-IL method further enhanced success rates to 98%, demonstrating superior efficiency and safety.

Although the model is trained and evaluated in a simulation environment, its ability to generalize to real-world scenarios is a key consideration for future work. The transition from simulation to reality presents several challenges, including discrepancies between simulated and real-world conditions such as sensor noise, environmental variations, and the complexity of real-world traffic. However, by employing techniques like domain adaptation and fine-tuning with real-world data, the framework could be adapted to perform effectively in real-world settings. Additionally, incorporating real-world testing in controlled environments will help refine the system and ensure robustness.

Future enhancements to this framework include integrating prioritized experience replay, which assigns varying priorities to experiences based on their significance, potentially enhancing learning efficiency and overall performance. Additionally, introducing dynamic traffic conditions will make the simulation environment more challenging and realistic, better reflecting real-world scenarios. Another advancement involves integrating the parking framework with a broader autonomous driving system, ensuring seamless coordination between parking maneuvers and general driving tasks. Moreover, exploring alternative sensor configurations and investigating how the system adapts to changes in sensor input, such as the addition of more sensors, can further improve robustness. These advancements will help ensure that the framework is adaptable and scalable.
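To indicate what prioritized experience replay would involve, the following NumPy sketch shows the proportional variant, in which transitions are sampled with probability proportional to a power of their absolute temporal-difference (TD) error and importance-sampling weights correct the resulting bias. It is not part of the presented framework; the function name and the values of `alpha`, `beta`, and `eps` are assumptions (commonly used defaults).

```python
import numpy as np


def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6, rng=None):
    """Proportional prioritized sampling.

    Returns sampled indices, importance-sampling weights normalized by the
    largest weight in the batch, and the full sampling distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    td_errors = np.asarray(td_errors, dtype=float)
    # Priority p_i = (|delta_i| + eps)^alpha; eps keeps zero-error items sampleable.
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = rng.choice(len(td_errors), size=batch_size, p=probs)
    # Importance-sampling weights w_i = (N * P(i))^(-beta).
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights, probs


td = np.array([0.1, 0.5, 2.0, 0.0])  # illustrative TD errors
idx, weights, probs = per_sample(td, batch_size=8, rng=np.random.default_rng(0))
```

In a DDPG setting, the weights would scale each sample's critic loss, and the stored priorities would be refreshed with the new TD errors after every update.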

Author Contributions

All authors contributed to the conceptualization and methodology of the work. I.M.A. was responsible for the code and simulation implementation, as well as the manuscript writing. E.T. and A.L.S. provided supervision and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in https://github.com/IoannaMarina/Autonomous-parallel-parking-Imitation-data/tree/main (accessed on 1 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AC	Actor Critic
ADAS	Advanced Driver Assistance Systems
AI	Artificial Intelligence
APS	Autonomous Parking Systems
AV	Autonomous Vehicle
CNN	Convolutional Neural Network
DDPG	Deep Deterministic Policy Gradient
DERL	Data-efficient Reinforcement Learning
DPG	Deterministic Policy Gradient
DQN	Deep Q-Network
DRL	Deep Reinforcement Learning
DRQN	Deep Recurrent Q-Network
FOV	Field of View
IL	Imitation Learning
MCTS	Monte Carlo Tree Search
MDP	Markov Decision Process
ML	Machine Learning
NN	Neural Network
OBCA	Optimization-Based Collision Avoidance
OU	Ornstein–Uhlenbeck
PPO	Proximal Policy Optimization
RL	Reinforcement Learning
RRT	Rapidly-Exploring Random Trees

References

1. Wajcman, J. Automation: Is it really different this time? Br. J. Sociol. 2017, 68, 119–127.
2. Stanchev, P.; Geske, J. Autonomous cars. History. State of art. Research problems. Commun. Comput. Inf. Sci. 2016, 601, 1–10.
3. Liu, Q.; Zhai, J.W.; Zhang, Z.Z.; Zhong, S.; Zhou, Q.; Zhang, P.; Xu, J. A Survey on Deep Reinforcement Learning. Jisuanji Xuebao/Chin. J. Comput. 2018, 41, 1–27.
4. Behere, S.; Törngren, M. A functional architecture for autonomous driving. In Proceedings of the WASA ’15: Proceedings of the First International Workshop on Automotive Software Architecture, Montreal, QC, Canada, 4 May 2015; pp. 3–10.
5. Bagloee, S.A.; Tavana, M.; Asadi, M.; Oliver, T. Autonomous vehicles: Challenges, opportunities, and future implications for transportation policies. J. Mod. Transp. 2016, 24, 284–303.
6. Liu, Q.; Li, X.; Yuan, S.; Li, Z. Decision-Making Technology for Autonomous Vehicles: Learning-Based Methods, Applications and Future Outlook. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 30–37.
7. Zhang, C.; Zhou, R.; Lei, L.; Yang, X. Research on Automatic Parking System Strategy. World Electr. Veh. J. 2021, 12, 200.
8. Teng, S.; Hu, X.; Deng, P.; Li, B.; Li, Y.; Ai, Y.; Yang, D.; Li, L.; Xuanyuan, Z.; Zhu, F.; et al. Motion Planning for Autonomous Driving: The State of the Art and Future Perspectives. IEEE Trans. Intell. Veh. 2023, 8, 3692–3711.
9. Zhou, R.-F.; Liu, X.-F.; Cai, G.-P. A new geometry-based secondary path planning for automatic parking. Int. J. Adv. Robot. Syst. 2020, 17, 172988142093057.
10. Vorobieva, H.; Glaser, S.; Minoiu-Enache, N.; Mammar, S. Automatic parallel parking with geometric continuous-curvature path planning. In Proceedings of the 2014 IEEE Intelligent Vehicles Symposium Proceedings, Ypsilanti, MI, USA, 8–11 June 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 465–471.
11. Piao, C.; Zhang, J.; Chang, K.; Li, Y.; Liu, M. Multi-Sensor Information Ensemble-Based Automatic Parking System for Vehicle Parallel/Nonparallel Initial State. Sensors 2021, 21, 2261.
12. Jang, C.; Kim, C.; Lee, S.; Kim, S.; Lee, S.; Sunwoo, M. Re-Plannable Automated Parking System With a Standalone Around View Monitor for Narrow Parking Lots. IEEE Trans. Intell. Transp. Syst. 2020, 21, 777–790.
13. Huang, J.; Liu, Z.; Chi, X.; Hong, F.; Su, H. Search-Based Path Planning Algorithm for Autonomous Parking: Multi-Heuristic Hybrid A*. In Proceedings of the 2022 34th Chinese Control and Decision Conference (CCDC), Hefei, China, 21–23 May 2022; pp. 6248–6253.
14. Kwon, H.; Chung, W. Performance analysis of path planners for car-like vehicles toward automatic parking control. Intell. Serv. Robot. 2014, 7, 15–23.
15. Chi, X.; Liu, Z.; Huang, J.; Hong, F.; Su, H. Optimization-Based Motion Planning for Autonomous Parking Considering Dynamic Obstacle: A Hierarchical Framework. In Proceedings of the 2022 34th Chinese Control and Decision Conference (CCDC), Hefei, China, 21–23 May 2022; pp. 6229–6234.
16. Zhang, X.; Liniger, A.; Sakai, A.; Borrelli, F. Autonomous Parking Using Optimization-Based Collision Avoidance. In Proceedings of the 2018 IEEE Conference on Decision and Control (CDC), Miami Beach, FL, USA, 17–19 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4327–4332.
17. Li, R.; Wang, W.; Chen, Y.; Srinivasan, S.; Krovi, V.N. An end-to-end fully automatic bay parking approach for autonomous vehicles. In Proceedings of the ASME 2018 Dynamic Systems and Control Conference (DSCC2018), Atlanta, GA, USA, 30 September–3 October 2018; Volume 2.
18. Moon, J.; Bae, I.; Kim, S. Automatic Parking Controller with a Twin Artificial Neural Network Architecture. Math. Probl. Eng. 2019, 2019, 4801985.
19. Gamal, O.; Imran, M.; Roth, H.; Wahrburg, J. Assistive Parking Systems Knowledge Transfer to End-to-End Deep Learning for Autonomous Parking. In Proceedings of the 2020 6th International Conference on Mechatronics and Robotics Engineering (ICMRE), Barcelona, Spain, 12–15 February 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 216–221.
20. Chai, R.; Tsourdos, A.; Savvaris, A.; Chai, S.; Xia, Y.; Chen, C.L.P. Design and Implementation of Deep Neural Network-Based Control for Automatic Parking Maneuver Process. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1400–1413.
21. Zhang, P.; Xiong, L.; Yu, Z.; Fang, P.; Yan, S.; Yao, J.; Zhou, Y. Reinforcement learning-based end-to-end parking for automatic parking system. Sensors 2019, 19, 3996.
22. Takehara, R.; Gonsalves, T. Autonomous Car Parking System using Deep Reinforcement Learning. In Proceedings of the 2021 2nd International Conference on Innovative and Creative Information Technology (ICITech), Virtual, 23–25 September 2021; pp. 85–89.
23. Thunyapoo, B.; Ratchadakorntham, C.; Siricharoen, P.; Susutti, W. Self-Parking Car Simulation using Reinforcement Learning Approach for Moderate Complexity Parking Scenario. In Proceedings of the 2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Phuket, Thailand, 24–27 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 576–579.
24. Song, S.; Chen, H.; Sun, H.; Liu, M. Data Efficient Reinforcement Learning for Integrated Lateral Planning and Control in Automated Parking System. Sensors 2020, 20, 7297.
25. Zhang, J.; Chen, H.; Song, S.; Hu, F. Reinforcement Learning-Based Motion Planning for Automatic Parking System. IEEE Access 2020, 8, 154485–154501.
26. Sousa, B.; Ribeiro, T.; Coelho, J.; Lopes, G.; Ribeiro, A.F. Parallel, Angular and Perpendicular Parking for Self-Driving Cars using Deep Reinforcement Learning. In Proceedings of the 2022 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), Santa Maria da Feira, Portugal, 29–30 April 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 40–46.
27. Du, Z.; Miao, Q.; Zong, C. Trajectory Planning for Automated Parking Systems Using Deep Reinforcement Learning. Int. J. Automot. Technol. 2020, 21, 881–887.
28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; Bradford Books: Bradford, UK, 2018.
29. Plaat, A. Deep Reinforcement Learning; Springer Nature: Singapore, 2022.
30. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement Learning: A Survey. J. Artif. Intell. Res. 1996, 4, 237–285.
31. van Otterlo, M.; Wiering, M. Reinforcement Learning and Markov Decision Processes; Springer: Berlin/Heidelberg, Germany, 2012; pp. 3–42.
32. Bowyer, C. Rewards in Reinforcement Learning. 2022. Available online: https://www.linkedin.com/pulse/rewards-reinforcement-learning-caleb-m-bowyer/?trk=pulse-article_more-articles_related-content-card (accessed on 1 May 2025).
33. AlMahamid, F.; Grolinger, K. Reinforcement Learning Algorithms: An Overview and Classification. In Proceedings of the 2021 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Virtual, 12–17 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–7.
34. Wang, H.; Zariphopoulou, T.; Zhou, X. Exploration Versus Exploitation in Reinforcement Learning: A Stochastic Control Approach. SSRN Electron. J. 2019.
35. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, 2–4 May 2016.
36. Sewak, M. Deep Reinforcement Learning; Springer: Singapore, 2019.
37. Guo, S.; Zhang, X.; Zheng, Y.; Du, Y. An Autonomous Path Planning Model for Unmanned Ships Based on Deep Reinforcement Learning. Sensors 2020, 20, 426.
38. Kobayashi, T.; Ilboudo, W.E.L. t-soft update of target network for deep reinforcement learning. Neural Netw. 2021, 136, 63–71.
39. Fu, X.; Zhu, J.; Wei, Z.; Wang, H.; Li, S. A UAV Pursuit-Evasion Strategy Based on DDPG and Imitation Learning. Int. J. Aerosp. Eng. 2022, 2022, 3139610.
40. Zhang, H.; Xu, J.; Zhang, J.; Liu, Q. Network Architecture for Optimizing Deep Deterministic Policy Gradient Algorithms. Comput. Intell. Neurosci. 2022, 2022, 1–10.
41. Doob, J.L. The Brownian Movement and Stochastic Equations. Ann. Math. 1942, 43, 351–369.
42. Zhu, J.; Wu, F.; Zhao, J. An Overview of the Action Space for Deep Reinforcement Learning. In Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence, Sanya, China, 22–24 December 2021; ACM: New York, NY, USA, 2021; pp. 1–10.
43. Ahmed, H.; Medhat, G.M.; Eyad, E.; Chrisina, J. Imitation learning: A Survey of Learning Methods. ACM Comput. Surv. 2017, 50, 273–306.
44. Singh, H. Deep Deterministic Policy Gradient (DDPG). 2020. Available online: https://keras.io/examples/rl/ddpg_pendulum/ (accessed on 1 May 2025).
45. Skalse, J.; Howe, N.H.R.; Krasheninnikov, D.; Krueger, D. Defining and Characterizing Reward Hacking. Adv. Neural Inf. Process. Syst. 2022, 35, 9460–9471.
46. French, R. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 1999, 3, 128–135.
47. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
48. Jones, A. Debugging RL, Without the Agonizing Pain. 2021. Available online: https://andyljones.com/posts/rl-debugging.html (accessed on 1 May 2025).

Figure 1. The DDPG network architecture.

Figure 2. Ego vehicle sensors’ layout.

Figure 3. Steer and reverse combination maneuvers.

Figure 4. Vehicle trajectory during imitation data gathering.

Figure 5. Convergence of the implemented IL model.

Figure 6. Cumulative rewards in catastrophic forgetting.

Figure 7. Cumulative reward convergence in DDPG.

Figure 8. Learning dynamics: (a) actor and critic losses and (b) residual variance.

Figure 9. Cumulative reward convergence in DDPG-IL.

Figure 10. Comparative analysis: success rate.

Figure 11. Comparative analysis: average steps per successful episode.

Figure 12. Comparative analysis: average reward per successful episode.

Figure 13. Vehicle parking maneuvering with IL.

Figure 14. Vehicle parking maneuvering with (a) DDPG and (b) DDPG-IL.

Table 1. Spatial characteristics of the state representation.

Characteristic	Description	Purpose
z	The vertical position of the ego vehicle in 3-dimensional space.	Determines whether the ego vehicle is positioned on the sidewalk.
Angle	The current orientation of the ego vehicle.	Determines the ego vehicle’s heading.
Angle difference	Angular variance between current and desired orientation for parallel parking.	Quantifies the deviation between the ego vehicle’s yaw and the target yaw of the parking space.
Distance	Linear proximity measurement between the ego vehicle and the target parking location.	Quantifies the physical closeness of the ego vehicle to the parking space.
Position	Binary representation of the relative position of the ego vehicle. A value of 1 denotes the ego vehicle in front of the parking spot, and −1 indicates it is positioned behind.	Determines whether the ego vehicle is situated in front or behind the parking space.
Orientation	Binary representation of ego vehicle’s directional alignment. A value of 1 signifies the ego vehicle facing to the right, and −1 indicates it is oriented to the left.	Determines whether the ego vehicle is facing to the left or right concerning the target orientation.
Parallel distance	The distance between the ego vehicle and the parking space, considering parallel alignment.	Measures how closely the ego vehicle aligns parallel to the parking space.
Vertical distance	The distance between the ego vehicle and the parking space, considering vertical alignment.	Measures how closely the ego vehicle aligns vertically to the parking space.

Table 2. Neural Network architecture for Imitation Learning.

Layer	Neurons	Activation Function
Input	18	-
Dense (Hidden Layer 1)	256	ReLU
Dense (Hidden Layer 2)	128	ReLU
Dense (Hidden Layer 3)	64	ReLU
Dense (Output Layer)	2	tanh
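The layer sizes in Table 2 can be checked with a minimal NumPy forward pass. This is a sketch with randomly initialized weights, not the trained IL model: the initialization scheme and function names are assumptions, and the two outputs are taken here to be the steer and reverse commands, each bounded to [-1, 1] by the tanh activation.

```python
import numpy as np

LAYER_SIZES = [18, 256, 128, 64, 2]  # input, three hidden layers, output (Table 2)


def init_params(rng):
    """Random weights and zero biases for each dense layer (He-style scaling, assumed)."""
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])]


def forward(params, states):
    """ReLU on the hidden layers, tanh on the output layer."""
    x = states
    for weights, bias in params[:-1]:
        x = np.maximum(x @ weights + bias, 0.0)
    weights, bias = params[-1]
    return np.tanh(x @ weights + bias)


rng = np.random.default_rng(0)
params = init_params(rng)
actions = forward(params, rng.normal(size=(5, 18)))  # batch of 5 random state vectors
n_params = sum(w.size + b.size for w, b in params)   # 46,146 trainable parameters
```

The parameter count follows directly from the table: 18·256 + 256 + 256·128 + 128 + 128·64 + 64 + 64·2 + 2 = 46,146.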

Table 3. Critic network architecture for DDPG.

Layer	Neurons	Activation Function
State Input	18 (State Space)	-
Dense (State Layer 1)	16	ReLU
Dense (State Layer 2)	32	ReLU
Action Input	2 (Action Space)	-
Dense (Action Layer 1)	32	ReLU
Concatenation Layer	20 (State + Action Space)	-
Dense (Hidden Layer 1)	256	ReLU
Dense (Hidden Layer 2)	256	ReLU
Dense (Output Layer)	1	Linear

Table 4. Agent’s behaviors leading to reward modifications.

Behavior	Description
$z > 0.03$	The ego vehicle is on the sidewalk.
$\text{reverse} = \text{False} \ \text{AND} \ \text{distance} > 1.5 \cdot \text{initial\_distance} \ \text{AND} \ \text{location} = 1$	The ego vehicle is in front of the parking space and keeps moving forward.
$\text{reverse} = \text{True} \ \text{AND} \ \text{distance} > 1.5 \cdot \text{initial\_distance} \ \text{AND} \ \text{location} = -1$	The ego vehicle is behind the parking space and keeps moving backward.
$\text{right\_back\_wheel\_sensor} < 0.35 \ \text{AND} \ \text{reverse} = \text{True} \ \text{AND} \ \text{steer} > 0$	The ego vehicle is moving closer to the sidewalk with a positive steer while continuing to move backward.
$\text{orientation} = 1 \ \text{AND} \ \text{angle\_diff} > 5 \ \text{AND} \ \text{location} = 1$	The ego vehicle is facing to the right, with an angle difference above 5, and is in front of the parking space.
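The conditions in Table 4 translate directly into boolean checks. The sketch below only detects the listed behaviors; how each one modifies the reward is defined in the Reward Function section and is not reproduced here. The dictionary keys mirror the variable names in the table, while the function name and the behavior labels are hypothetical.

```python
def flagged_behaviors(s):
    """Return labels for the Table 4 behaviors that hold in state/action dict `s`."""
    flags = []
    far = s["distance"] > 1.5 * s["initial_distance"]
    if s["z"] > 0.03:
        flags.append("on_sidewalk")
    if not s["reverse"] and far and s["location"] == 1:
        flags.append("forward_past_space")
    if s["reverse"] and far and s["location"] == -1:
        flags.append("backward_past_space")
    if s["right_back_wheel_sensor"] < 0.35 and s["reverse"] and s["steer"] > 0:
        flags.append("reversing_toward_sidewalk")
    if s["orientation"] == 1 and s["angle_diff"] > 5 and s["location"] == 1:
        flags.append("facing_right_in_front")
    return flags


# Illustrative state: in front of the space, driving forward, far from the target.
example = {"z": 0.0, "reverse": False, "distance": 9.0, "initial_distance": 5.0,
           "location": 1, "right_back_wheel_sensor": 1.2, "steer": 0.0,
           "orientation": -1, "angle_diff": 2.0}
flags = flagged_behaviors(example)
```

Several behaviors can hold at once, so the function returns a list rather than a single label.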

Table 5. Exploration strategy based on success rate.

Success Rate (%)	Noise Factor Initiation Frequency
<20	every 40 episodes
<40	every 80 episodes
<60	every 120 episodes
<100	every 160 episodes
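Table 5 can be read as a lookup from the current success rate to the number of episodes between re-initiations of the noise factor. The sketch below encodes that reading, which is an interpretation of the table; the function name is hypothetical, and a success rate of exactly 100% is not covered by the table, so the function returns `None` in that case.

```python
# (upper bound on success rate in %, episodes between noise-factor re-initiations)
SCHEDULE = [(20, 40), (40, 80), (60, 120), (100, 160)]


def noise_reset_interval(success_rate):
    """Episodes between noise-factor re-initiations for a given success rate (%)."""
    for upper_bound, interval in SCHEDULE:
        if success_rate < upper_bound:
            return interval
    return None  # 100% success is outside the ranges listed in Table 5


intervals = [noise_reset_interval(r) for r in (5, 20, 59.9, 75, 100)]
```

The effect is that exploration noise is refreshed less often as the agent becomes more successful.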

Table 6. DDPG algorithm hyperparameters.

Hyperparameter	Value
Actor network learning rate	0.0001
Critic network learning rate	0.001
Gamma ($\gamma$)	0.9
Soft update coefficient ($\tau$)	0.001
Ornstein–Uhlenbeck noise	$\mu = 0, \sigma = 0.4, \theta = 0.15, dt = 0.01$
Noise factor maximum value	0.2
Noise factor minimum value	0.0001
Noise factor decay rate	0.001
Replay buffer size	100,000
Batch size	64
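To show where the exploration and target-network hyperparameters of Table 6 enter, the following NumPy sketch implements a discretized Ornstein–Uhlenbeck process and the soft target update $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$. The class and function names are assumptions, and the scaling of the noise by the decaying noise factor is omitted because its exact form is not given in the table.

```python
import numpy as np


class OUNoise:
    """Discretized Ornstein–Uhlenbeck process with the Table 6 parameters."""

    def __init__(self, size, mu=0.0, sigma=0.4, theta=0.15, dt=0.01, rng=None):
        self.mu, self.sigma, self.theta, self.dt = mu, sigma, theta, dt
        self.rng = np.random.default_rng() if rng is None else rng
        self.x = np.full(size, mu, dtype=float)

    def sample(self):
        # x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * self.rng.normal(size=self.x.shape)
        self.x = self.x + drift + diffusion
        return self.x


def soft_update(target_weights, online_weights, tau=0.001):
    """Move each target-network array a fraction tau toward the online network."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_weights, online_weights)]


noise = OUNoise(size=2, rng=np.random.default_rng(0))  # one value per action dimension
sample = noise.sample()
new_target = soft_update([np.zeros(3)], [np.ones(3)])
```

With $\tau = 0.001$, a single update moves the target weights only 0.1% of the way toward the online weights, which keeps the critic's learning targets slowly varying.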

Table 7. Hyperparameter combinations.

Hyperparameter	Value
Actor learning rate	0.001, 0.0001, 0.00001
Critic learning rate	0.001, 0.002, 0.0001, 0.0002
Discount rate ($\gamma$)	0.9, 0.95, 0.99
Soft update rate ($\tau$)	0.0005, 0.001, 0.005
Experience replay batch size	32, 64, 128
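For a sense of the size of this search space, the values in Table 7 can be enumerated as a Cartesian product. Whether every combination was actually trained is not stated here, so the sketch below only counts and lists the candidates; the dictionary keys are hypothetical names.

```python
from itertools import product

GRID = {
    "actor_lr": [0.001, 0.0001, 0.00001],
    "critic_lr": [0.001, 0.002, 0.0001, 0.0002],
    "gamma": [0.9, 0.95, 0.99],
    "tau": [0.0005, 0.001, 0.005],
    "batch_size": [32, 64, 128],
}

# One dict per candidate configuration: 3 * 4 * 3 * 3 * 3 = 324 in total.
combinations = [dict(zip(GRID, values)) for values in product(*GRID.values())]

# The configuration reported in Table 6 is one of the candidates.
selected = {"actor_lr": 0.0001, "critic_lr": 0.001, "gamma": 0.9,
            "tau": 0.001, "batch_size": 64}
```

With 324 candidates, an exhaustive search would require as many training runs, which is why such grids are often explored only partially.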

Table 8. Evaluation criteria.

Criterion	Description
Convergence of losses	Convergence behavior of the actor and critic losses over the training epochs
Residual of variances	Relative change in variance between predicted critic values and target values
Steer estimation accuracy	Percentage of correct predictions for the steering actions
Reverse estimation accuracy	Percentage of correct predictions for the reverse actions
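The residual variance criterion can be sketched as follows. The definition used here, the variance of the difference between target and predicted critic values divided by the variance of the targets, is one common formulation and is an assumption; it may differ in detail from the quantity plotted in Figure 8b.

```python
import numpy as np


def residual_variance(targets, predictions):
    """Var(target - prediction) / Var(target); near 0 means the critic tracks its targets."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return np.var(targets - predictions) / np.var(targets)


targets = np.array([1.0, 2.0, 4.0, 7.0])            # illustrative critic targets
perfect = residual_variance(targets, targets)        # critic matches targets exactly
constant = residual_variance(targets, np.full(4, 3.0))  # critic outputs a constant
```

A value close to 1 means the critic explains almost none of the variation in its targets, whereas a value close to 0 indicates a good fit.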

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

