1. Introduction
Automated valet parking (AVP) is a key function in intelligent transportation systems because it reduces the burden of low-speed parking, helps prevent minor collisions, and improves the use of limited parking resources in lots and garages [1]. Unlike highway autonomy, AVP operates in confined spaces with frequent curvature changes and tight terminal constraints, requiring both reliable long-range motion execution during cruising and precise low-speed maneuvers into a final target pose during parking. As a result, AVP must address long-horizon tracking and short-horizon high-precision docking simultaneously, while remaining robust to sensing imperfections, model mismatch, and discontinuities caused by curved approaches or mode switching [2,3].
Existing AVP solutions are generally divided into rule-based and learning-based methods [1,4]. Rule-based approaches typically combine geometric or graph-search planners with tracking controllers to achieve predictable behavior and constraint satisfaction in structured maps. Prior studies have shown that such planning and re-planning pipelines can handle narrow lots and perception errors with appropriate sensing configurations [5], and that search-based methods can generate feasible collision-free paths in tight parking scenarios [6]. However, these pipelines often require extensive retuning across layouts and can become fragile when errors accumulate over long approach segments, especially under high curvature, heading wrap-around, or local-minimum effects caused by limitations of low-speed actuation.
Learning-based methods, particularly deep reinforcement learning (DRL), have been explored to reduce hand-crafted rule complexity and to learn policies directly from interaction. Although reinforcement learning has shown promise in autonomous driving, its real-world deployment remains constrained by poor sample efficiency, training instability, and strong dependence on reward design and exploration strategy [4]. These issues are even more pronounced in parking tasks, where sparse rewards provide weak learning signals, slow convergence, and can induce undesirable behaviors such as oscillation or spinning near the goal [4,7]. To alleviate these problems, data-efficient and reward-shaped reinforcement learning formulations have been proposed to improve convergence and to better align planning and control objectives [8]. This has motivated hierarchical designs in which model-based and learning-based components handle long-range cruising and precise terminal docking within their respective operating regimes.
Recent AVP research increasingly adopts hierarchical architectures that combine model-based control with reinforcement learning, allowing long-range motion to remain constraint-aware while the final maneuver benefits from learned nonlinear behavior [7]. On the model-based side, optimization-based planners can explicitly enforce kinematic feasibility and collision-avoidance constraints, and have shown strong performance in cluttered or irregular environments, especially when iterative refinement or warm-start strategies are used to maintain practical computation [9,10]. Online re-planning methods have also been studied to preserve trajectory continuity when newly detected obstacles invalidate an ongoing parking plan [11]. In parallel, reinforcement learning has gained attention for low-speed parking because it can capture nonlinear maneuvers with continuous control, and modern algorithms have recently demonstrated improved robustness in non-ideal scenarios [12]. More broadly, MPC-RL integration has been investigated to reduce trial-and-error while preserving constraint structure, for example by using NMPC to assist policy learning or switching, or by learning decision policies that modulate MPC execution [13,14].
Recent studies also indicate that traditional planning methods and hybrid MPC-RL approaches have complementary advantages in cluttered parking environments. Traditional graph-search, geometry-based, and optimization-based planners provide interpretable path generation and explicit constraint handling, which makes them effective for structured lots and obstacle-rich spaces [15,16,17]. However, their performance may depend on map quality, warm-start quality, re-planning frequency, and parameter tuning, especially when the local geometry changes or when perception updates invalidate the planned trajectory. In contrast, hybrid MPC-RL methods attempt to combine the constraint awareness of model-based control with the adaptability of learned policies. Such approaches can use MPC to guide policy learning, assist switching decisions, or maintain feasibility while reinforcement learning improves behavior in nonlinear or difficult-to-model maneuvers [18,19]. Nevertheless, many recent studies still focus on isolated parking maneuvers, local trajectory planning, or general driving tasks, rather than on a complete perception-triggered cruise-to-park AVP process. Therefore, the present work focuses on the integration of NMPC-based cruising, TD3-based terminal parking, and camera-triggered phase switching within one structured parking-lot pipeline, while also evaluating transfer to multiple unseen target slots under fixed controller settings.
Despite this progress, several gaps remain in achieving a seamless cruise-to-park AVP pipeline in a structured parking-lot setting. First, much of the literature still formulates parking as an isolated start-to-slot task, whereas practical AVP requires long-range cruising, perception-driven slot confirmation, and reliable mode switching before the terminal maneuver. Second, long-horizon cruising in confined lots can suffer from tracking error accumulation and discontinuities caused by tight curvature and heading wrap-around, yet these effects are rarely analyzed together with downstream parking performance in an integrated framework. Third, learning-based parking policies often exhibit inefficient near-goal behaviors, such as oscillation or spinning under sparse or delayed rewards, while prior studies do not consistently report stage-wise error trends or demonstrate transfer performance across multiple target slots under fixed controller settings.
Motivated by these gaps, this paper proposes a cruise-to-park AVP framework that couples constraint-aware nonlinear model predictive control (NMPC) for cruising with a learning-based low-speed parking policy. In the implemented system, the vehicle first follows a predefined cruising route in a structured parking lot, and a camera-based free-slot detection module triggers the transition from slot searching to parking once an available space is detected [20]. During cruising and search, the NMPC formulation employs a customized cost function that balances path-tracking accuracy, speed regulation, and control smoothness, enabling accurate long-horizon execution with small tracking errors, including on curved segments. During parking, a Twin Delayed Deep Deterministic Policy Gradient (TD3) agent is trained for continuous control using LiDAR feedback for collision-aware docking [21]. The reward function includes an explicit time-penalty term to suppress pre-parking spinning and to encourage timely convergence to the terminal pose, consistent with the role of reward shaping in improving learning efficiency under sparse or delayed feedback [22]. The overall system is implemented in a structured Simulink environment with Unreal Engine-based geometry-aware sensing modules, enabling consistent evaluation of the perception-triggered phase transition and the end-to-end cruise-to-park process.
The integrated framework is validated through parking-lot simulations in Simulink. Intra-lot target-slot transferability is examined on six previously unseen target slots using identical controller settings and without further retraining, while approach-phase tracking errors and terminal parking errors are recorded for evaluation. The results show stable phase transitions, accurate cruising behavior, and transferable parking performance across the tested slots, suggesting that the proposed NMPC tracker with a customized cost function and the time-penalized TD3 parking policy provide a practical baseline for cruise-to-park AVP studies.
This work does not claim to replace existing hybrid MPC-RL parking frameworks or to demonstrate broad policy generalization across arbitrary parking layouts. Instead, it focuses on an integrated cruise-to-park implementation in which NMPC-based cruising, geometry-based free-slot detection, TD3-based terminal docking, and phase switching are evaluated together under a fixed structured parking-lot setup. The validation therefore emphasizes intra-lot target-slot transferability, baseline algorithm comparison, and reward-design ablation under consistent simulation conditions.
The main contributions of this paper are summarized as follows:
A hierarchical cruise-to-park AVP framework is developed by integrating camera-triggered slot detection, NMPC-based long-range cruising, and TD3-based low-speed parking within a unified simulation environment.
A customized NMPC cost design is introduced for the cruising phase to improve long-horizon path-following performance by jointly regulating tracking accuracy, vehicle speed, and control smoothness in structured parking-lot routes.
A time-penalized TD3 reward formulation is proposed for the parking phase to reduce inefficient oscillation or spinning near the goal and to improve convergence toward the target pose during collision-aware docking.
The proposed framework is validated on multiple previously unseen parking slots under fixed controller settings to examine phase-transition behavior, docking feasibility, and intra-lot target-slot transferability within the same parking-lot layout. Additional PPO/SAC baseline comparisons and a time-penalty ablation are included to evaluate the relative learning performance and the effect of reward shaping.
The remainder of this paper is organized as follows.
Section 2 presents the methodology, including the overall system architecture, the perception-triggered transition logic, the NMPC design for cruising, and the TD3 formulation for parking control.
Section 3 reports the simulation settings and experimental results, focusing on cruising performance, terminal parking accuracy, and intra-lot transfer performance across unseen parking slots.
Section 4 discusses the main findings, practical implications, and limitations of the proposed framework.
Section 5 concludes the paper and suggests directions for future work.
3. Results
3.1. Overall Performance of the Proposed AVP Framework
To evaluate the end-to-end capability of the proposed AVP framework, a representative cruise-to-park trial for slot 64 was examined in the structured parking-lot environment. As shown in Figure 8, the ego vehicle completed the full task from the initial position to the target slot. The vehicle first followed the predefined cruising route under the NMPC cruising controller and then switched to the TD3 parking controller for the terminal docking maneuver after the target slot was confirmed by the perception module. The task finished without collision or invalid termination. Slot 64 was selected as the representative end-to-end case because it illustrates the complete cruise-to-park process in one of the unseen target-slot scenarios.
The full trajectory demonstrates that the proposed framework successfully integrates long-range constrained cruising and low-speed parking within a unified closed-loop architecture. During the cruising phase, the NMPC controller maintained stable path-following behavior along the predefined route and guided the vehicle smoothly toward the parking region. After the cruise-to-park switching point, the TD3 controller generated the steering actions required for precise low-speed docking into slot 64. No obvious trajectory discontinuity was observed around the controller handoff, indicating that the phase manager and command selector achieved the intended smooth transition between the two control stages.
Overall, this representative result confirms the feasibility of the proposed cruise-to-park AVP pipeline in a structured parking-lot scenario. The successful completion of the full maneuver indicates that the perception-triggered switching logic, the NMPC cruising controller, and the TD3 parking controller can operate coherently within the proposed end-to-end AVP framework.
3.2. NMPC Cruising Results
The cruising-stage tracking performance of the proposed framework was evaluated using the representative test for slot 64. During the cruising stage, the NMPC controller guided the vehicle along the predefined route before the parking maneuver was triggered. The lateral tracking error and heading error remained bounded throughout the approach phase, indicating that the vehicle entered the parking stage with a well-aligned pose for the subsequent TD3 parking controller.
Figure 9 shows the lateral tracking error during the NMPC cruising stage. Since the cruising controller primarily aims to keep the vehicle aligned with the reference path, the evaluation focuses on lateral deviation and heading alignment rather than longitudinal tracking error. The lateral error remained bounded within a small range throughout the approach phase, with larger deviations mainly appearing during curved segments and local geometric transitions of the route. Quantitatively, the cruising-stage tracking remained accurate, with a maximum lateral deviation of 0.0369 m and a mean absolute lateral error of 0.0051 m. These transient peaks were limited in magnitude and decayed quickly after the vehicle completed the corresponding turning adjustments, demonstrating effective path regulation under the customized NMPC formulation.
Figure 10 presents the heading error of the vehicle relative to the local path tangent during cruising. The heading error remained well controlled overall and was mainly concentrated around the same turning regions where lateral correction was required. Although several short-duration peaks appeared during curvature changes, the error quickly returned toward zero after each transition, indicating that the controller regulated the vehicle orientation without persistent oscillation. The final yaw error at the end of the cruising phase was approximately zero, further confirming that the vehicle entered the parking stage with a well-aligned pose.
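As an illustration of how the two cruising metrics above can be obtained from logged poses, the following minimal Python sketch computes a signed lateral error, a wrapped heading error relative to the local path tangent, and the maximum and mean absolute values of an error sequence. The function names and the sign convention (positive to the left of the path tangent) are assumptions made for this example and are not taken from the Simulink implementation.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def tracking_errors(x, y, yaw, ref_x, ref_y, ref_yaw):
    """Return (lateral error, heading error) relative to one reference point.

    The lateral error is the position offset projected onto the left-pointing
    normal of the path tangent; the heading error is wrapped so that a
    reference heading crossing +/-pi does not produce a 2*pi jump.
    """
    dx, dy = x - ref_x, y - ref_y
    lateral = -math.sin(ref_yaw) * dx + math.cos(ref_yaw) * dy
    heading = wrap_angle(yaw - ref_yaw)
    return lateral, heading


def summarize(errors):
    """Return (maximum absolute error, mean absolute error) of a sequence."""
    abs_errors = [abs(e) for e in errors]
    return max(abs_errors), sum(abs_errors) / len(abs_errors)
```

Wrapping the heading difference before it enters the error signal is one common way to avoid the heading wrap-around discontinuity mentioned in the Introduction.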
Overall, the NMPC controller achieved accurate and well-behaved cruising motion before the phase switch to parking. The bounded lateral deviation and rapidly recovering heading response indicate that the controller provided a stable pre-parking trajectory, which is important for ensuring a smooth handoff to the TD3 parking controller in the integrated AVP framework.
The cruise-to-park switching behavior was further examined from the closed-loop trajectory and phase-transition response. Once a valid parking slot was detected, the target pose was latched and the system switched from NMPC cruising to TD3 parking without repeated mode switching. No visible trajectory discontinuity, collision, workspace violation, or post-switch oscillatory behavior was observed around the handoff point in the reported simulations. This indicates that the proposed latching and command-selection logic provided a smooth controller handoff under the deterministic simulation conditions considered in this study.
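The latching behavior described above can be sketched as a small state machine. This is a minimal illustration under stated assumptions: the class name `PhaseManager`, the phase labels, and the `select_command` helper are hypothetical and only mirror the described logic (latch the target pose on the first valid detection, then never switch back).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Pose = Tuple[float, float, float]  # (x, y, yaw), hypothetical layout


@dataclass
class PhaseManager:
    phase: str = "CRUISE"
    target_pose: Optional[Pose] = None

    def update(self, detected_slot_pose: Optional[Pose]) -> str:
        """Latch the first valid detection and switch to parking once."""
        if self.phase == "CRUISE" and detected_slot_pose is not None:
            self.target_pose = detected_slot_pose
            self.phase = "PARK"
        # Later detections or dropouts are ignored, so the mode cannot chatter.
        return self.phase


def select_command(phase, nmpc_command, td3_command):
    """Forward the command of the controller that owns the current phase."""
    return nmpc_command if phase == "CRUISE" else td3_command
```

Because the transition is one-way, a detection that flickers after the switch cannot change the latched target or trigger repeated mode switching.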
The NMPC solve time was estimated using the Simulink Profiler during the cruising phase. Since the NMPC optimization is implemented inside the MPC Tracking Controller block, the profiled execution time of this block was used as a conservative implementation-level estimate of the NMPC solve time. The profiled MPC Tracking Controller required an average of 0.0405 s per controller call, calculated from a total profiled time of 32.897 s over 813 calls. This value is below the controller sampling time of 0.1 s, suggesting that the current MATLAB/Simulink implementation is preliminarily feasible for the selected simulation control rate. Nevertheless, the reported value should be interpreted as a profiler-based controller-level timing estimate rather than a certified embedded real-time solver benchmark, and embedded deployment would require code generation, hardware-specific profiling, and validation under processor and communication constraints.
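The timing figure above follows from a single division, reproduced in the short Python check below using only the values reported in the text. The ratio to the sampling time is added for illustration. It is a mean value, so it does not bound the worst-case solve time of an individual call.

```python
total_profiled_time_s = 32.897  # total profiled time of the controller block
num_calls = 813                 # number of controller calls
sample_time_s = 0.1             # controller sampling time

mean_call_time_s = total_profiled_time_s / num_calls
utilization = mean_call_time_s / sample_time_s

print(f"mean time per call: {mean_call_time_s:.4f} s")
print(f"share of sample period: {utilization:.1%}")
```

On average the controller uses about 40% of each sample period in this simulation setting.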
3.3. TD3 Parking Results
The parking-stage performance of the proposed framework was evaluated using the TD3 training process and the terminal parking behavior in the training slot. Since the parking policy was trained in slot 7, this section focuses on the learned parking behavior in that slot before the unseen-slot transfer results are presented in Section 3.4.
Figure 11 shows the TD3 training reward curve for parking policy learning. As training progressed, the episode reward exhibited an overall increasing trend, while the average reward and evaluation statistics also improved gradually. Training was terminated after 850 episodes once the stopping criterion was satisfied. In the implemented training setup, the evaluation statistic was computed every 20 episodes as the mean episode reward over the evaluation window, and training stopped when this value exceeded 140. At termination, the final average reward was 89.0592 and the evaluation statistic reached 143.201. These results indicate that the agent progressively learned more effective parking behaviors during training. The stabilization of the reward trend in the later training stage suggests that the proposed reward formulation provided informative learning signals and supported convergence of the parking policy.
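The stopping rule described above can be sketched in a few lines of Python. The evaluation interval (20 episodes) and the threshold (140) come from the text; the number of evaluation episodes per window, the episode cap, and the two callables are hypothetical placeholders, since the training itself was run in MATLAB/Simulink and those details are not stated here.

```python
def should_stop(eval_rewards, threshold=140.0):
    """True when the mean reward over the evaluation window exceeds the threshold."""
    return sum(eval_rewards) / len(eval_rewards) > threshold


def train(run_training_episode, run_eval_episode,
          eval_every=20, eval_episodes=5, max_episodes=2000):
    """Run training until the evaluation statistic exceeds the threshold.

    eval_episodes and max_episodes are assumed values for illustration.
    Returns the episode index at which training stopped.
    """
    for episode in range(1, max_episodes + 1):
        run_training_episode()
        if episode % eval_every == 0:
            rewards = [run_eval_episode() for _ in range(eval_episodes)]
            if should_stop(rewards):
                return episode
    return max_episodes
```

Under such a rule the evaluation statistic at termination (143.201) necessarily exceeds the threshold, while the running average of training-episode rewards (89.0592) can remain lower because it still includes exploratory episodes.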
The learned policy was further examined in the training slot to evaluate its terminal docking performance. As shown in Figure 12, the vehicle successfully executed the parking maneuver in slot 7 and reached the target parking region with stable low-speed behavior. The terminal trajectory remained smooth, and no obvious oscillatory correction or unstable spinning behavior was observed near the goal, indicating that the TD3 controller was able to generate effective steering actions for the final docking stage.
Quantitatively, the parking maneuver in slot 7 was completed in 6.50 s. The final position error was 0.6270 m, including a lateral error of −0.5615 m and a longitudinal error of −0.2791 m, while the final yaw error was 0.0534 rad (3.06 deg). In addition, the trajectory-wide RMS distance-to-goal during parking was 7.8054 m and the maximum distance-to-goal was 12.7944 m, reflecting the large initial offset before convergence to the target pose. These results indicate that the learned policy achieved successful and stable terminal parking in the training slot, with good final heading alignment and reasonable terminal accuracy.
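The reported terminal metrics for slot 7 are mutually consistent, as the short Python check below shows: the final position error is the Euclidean norm of the lateral and longitudinal components, and the yaw error in degrees is a direct unit conversion. The helper `slot_frame_error` is a hypothetical illustration of how a world-frame offset could be decomposed along and across the slot axis; the frame convention actually used in the study is not stated here.

```python
import math

lateral_error_m = -0.5615
longitudinal_error_m = -0.2791
yaw_error_rad = 0.0534

position_error_m = math.hypot(lateral_error_m, longitudinal_error_m)
yaw_error_deg = math.degrees(yaw_error_rad)

print(f"final position error: {position_error_m:.4f} m")
print(f"final yaw error: {yaw_error_deg:.2f} deg")


def slot_frame_error(x, y, goal_x, goal_y, goal_yaw):
    """Decompose a world-frame offset into (longitudinal, lateral) slot-frame parts.

    Assumed convention: longitudinal along the slot heading, lateral to its left.
    """
    dx, dy = x - goal_x, y - goal_y
    longitudinal = math.cos(goal_yaw) * dx + math.sin(goal_yaw) * dy
    lateral = -math.sin(goal_yaw) * dx + math.cos(goal_yaw) * dy
    return longitudinal, lateral
```

The check also makes clear that the lateral component dominates the terminal offset in the training slot.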
Overall, the TD3 training and parking results confirm that the learned parking policy can provide a reliable low-speed docking controller within the proposed cruise-to-park AVP framework. Together with the designed reward shaping strategy, the controller enables stable parking behavior in the trained scenario and establishes the basis for the unseen-slot validation study presented in the next section.
3.4. Validation in Unseen Slots
The intra-lot target-slot transferability of the learned parking policy was validated on six previously unseen parking slots, namely Slots 14, 15, 23, 39, 47, and 64. All tests were conducted using the same controller settings as in the training slot, and no additional retraining or parameter adjustment was performed. This evaluation was intended to examine whether the learned TD3 parking policy could be transferred to new terminal parking scenarios within the same structured parking-lot environment.
As summarized in Table 5, the learned policy completed successful docking maneuvers in all six unseen validation slots under the same fixed controller settings. No collision, invalid termination, workspace violation, or timeout was observed in these tests, corresponding to a success rate of 100% (6/6) under the deterministic evaluation setting. Since the vehicle is expected to approach the target slot from outside during the terminal maneuver, parking-slot containment was evaluated at the final pose rather than imposed as an in-slot constraint throughout the entire approach. The computed final-footprint containment indicator was true for all six unseen-slot validation cases, and the final-pose visualization in Figure 13 further confirms that all final vehicle footprints were contained within their corresponding target parking regions.
The parking duration ranged from 6.4 s to 7.4 s. The final position error remained below 0.75 m in Slots 14, 15, 23, and 39, with corresponding final yaw errors of 0.1164 rad, 0.0362 rad, 0.0803 rad, and 0.0880 rad, respectively. More challenging but still successful cases were observed in Slots 47 and 64, where the final position errors increased to 1.3380 m and 1.1849 m, respectively. These two cases did not correspond to parking failure. Instead, the controller still guided the vehicle into the designated parking region with stable terminal behavior and acceptable final orientation, while exhibiting larger terminal offsets than the other validation-slot tests. This indicates that the main limitation in these cases was reduced terminal accuracy rather than loss of parking feasibility. Over all six validation-slot tests, the mean final position error was 0.8748 m and the mean final yaw error was 0.0818 rad (4.69 deg), indicating that the learned policy retained encouraging intra-lot target-slot transfer performance on validation slots that were not part of training.
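A final-footprint containment indicator of the kind described above can be sketched as a corner test: the vehicle rectangle is placed at its final pose in the slot frame, and all four corners must lie inside the slot rectangle. The Python example below is a minimal illustration. The vehicle and slot dimensions and the slot-centered frame are assumptions, not values from the study.

```python
import math


def footprint_corners(x, y, yaw, length, width):
    """Corners of a rectangular footprint centered at (x, y) with heading yaw."""
    c, s = math.cos(yaw), math.sin(yaw)
    half_l, half_w = length / 2.0, width / 2.0
    offsets = ((half_l, half_w), (half_l, -half_w),
               (-half_l, -half_w), (-half_l, half_w))
    return [(x + c * dx - s * dy, y + s * dx + c * dy) for dx, dy in offsets]


def footprint_in_slot(pose_in_slot, vehicle_size, slot_size):
    """True if every footprint corner lies inside the slot rectangle.

    pose_in_slot is (x, y, yaw) expressed in a frame centered on the slot,
    with the x axis along the slot length (assumed convention).
    """
    x, y, yaw = pose_in_slot
    half_len, half_wid = slot_size[0] / 2.0, slot_size[1] / 2.0
    corners = footprint_corners(x, y, yaw, vehicle_size[0], vehicle_size[1])
    return all(abs(cx) <= half_len and abs(cy) <= half_wid for cx, cy in corners)


# Hypothetical sizes in metres: 4.0 x 1.8 vehicle, 6.0 x 3.0 slot.
print(footprint_in_slot((0.3, 0.2, 0.05), (4.0, 1.8), (6.0, 3.0)))
print(footprint_in_slot((0.0, 0.8, 0.0), (4.0, 1.8), (6.0, 3.0)))
```

A test of this form depends on the slot dimensions as well as on the pose error, which is why a larger terminal offset does not by itself imply a containment failure.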
The larger terminal position errors observed in Slots 47 and 64 can be attributed to the greater mismatch between their local approach geometries and the training-slot condition. In these two cases, the vehicle enters the parking phase from a relative pose that is less similar to the training distribution associated with Slot 7. As a result, the TD3 controller must correct a larger terminal offset within a short low-speed docking horizon. The component-wise errors in Table 5 also indicate that the main degradation is positional rather than purely orientational. For example, Slot 47 has a very small final yaw error but larger lateral and longitudinal offsets, suggesting that the vehicle remained well aligned in heading while stopping at a shifted terminal position. Slot 64 also shows a larger combined positional offset, although the vehicle still reached the designated parking region without invalid termination. These results suggest that the learned policy achieves intra-lot target-slot transferability, but it is not fully invariant to slot-specific approach geometry when trained on only one target slot. Future work can mitigate this limitation by using multi-slot training, domain randomization over target poses and approach angles, slot-geometry-aware observation features, or online fine-tuning and adaptation for new parking layouts.
Figure 13 shows the local parking trajectories and final poses in the six validation slots. Although the target locations and approach geometries differed across cases, the learned TD3 controller generated feasible docking trajectories without retraining. The larger terminal offsets in Slots 47 and 64 suggest that terminal accuracy remains sensitive to local approach geometry, but these cases still achieved stable docking within the designated parking region. Overall, the validation results indicate that the policy learned transferable parking behavior within the structured parking-lot environment rather than memorizing the training slot.
3.5. Baseline Comparison and Reward Ablation
3.5.1. Baseline Comparison Among TD3, PPO, and SAC
To provide a quantitative comparison against alternative reinforcement learning baselines, PPO and SAC agents were additionally trained and evaluated under the same parking environment. The three algorithms used the same observation definition, action space, reward formulation, training target slot, and six-slot evaluation protocol. The comparison is intended to provide an algorithmic baseline under the implemented simulation setting rather than to establish a universal ranking of reinforcement learning methods for all AVP scenarios.
Table 6 summarizes the cross-algorithm comparison. PPO showed the fastest training convergence and better sample efficiency, while SAC showed stronger exploration ability but less stable terminal behavior in the more challenging validation cases. In terms of parking outcomes, both TD3 and PPO completed all six validation slots, whereas SAC produced an unstable trajectory with a large terminal error in Slot 64. The performance gap was most evident in the difficult cases, as reflected by the worst-case error, where TD3 achieved the lowest worst-case position error among the three algorithms. Therefore, TD3 was retained as the main parking controller in the proposed cruise-to-park framework because it provided the most reliable balance between terminal accuracy, stability, and robustness across the six-slot evaluation.
The comparison indicates that PPO had the best training efficiency, while TD3 provided the most reliable terminal parking performance across the six validation slots. SAC showed competitive behavior in easier slots, but its unstable looping behavior and large terminal error in Slot 64 reduced its overall reliability.
3.5.2. Ablation Study on the Explicit Time Penalty
To directly examine the effect of the explicit time-penalty term in the TD3 reward function, an ablation study was conducted by retraining the TD3 agent after removing the per-step time penalty, while keeping the observation definition, action space, remaining reward components, training target slot, and termination conditions unchanged. This experiment was designed to evaluate whether the explicit time penalty indeed helps suppress inefficient near-goal behavior during terminal docking.
Figure 14 shows the training reward evolution of the TD3 agent without the explicit time penalty. Relative to the time-penalized TD3 training results reported in the previous experiments, the ablated case exhibited a weaker reward-improvement trend and less stable convergence behavior. This suggests that the explicit time penalty contributes to more efficient policy learning by discouraging unproductive near-goal motion, rather than merely changing the final parking pose.
Figure 15 presents a representative parking trajectory obtained without the explicit time penalty. In this case, the vehicle exhibited noticeable spinning-like and repeated corrective motion near the target region before final docking. Although the agent was still able to enter the designated parking region, the terminal maneuver became less decisive and required more unnecessary adjustment, indicating that parking feasibility was largely preserved while parking efficiency deteriorated. Compared with the proposed reward design, the ablated policy showed more prolonged near-goal corrective motion, indicating reduced maneuver decisiveness in this representative case.
Overall, the ablation results support the inclusion of the explicit time penalty in the proposed reward formulation. The main benefit of this term is to suppress spinning-like and inefficient near-goal motion, improve training efficiency, and promote smoother and more decisive terminal parking behavior.
However, the present ablation study was intentionally focused on the explicit time-penalty term, because this term was directly associated with the spinning-like and inefficient near-goal behavior observed during pilot training. The progress reward and steering-smoothness penalty were kept unchanged in order to isolate the contribution of the time penalty. A more comprehensive component-wise reward ablation, including separate evaluations of the progress reward and the steering-smoothness penalty, will be conducted in future work.
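To make the role of the per-step time penalty concrete, the following Python sketch uses a simplified reward with a progress term, a steering-smoothness penalty, and a constant time penalty. It is not the reward used in this study: the functional form and all weights are hypothetical and chosen only to illustrate the mechanism. Because a distance-based progress term telescopes over an episode, a trajectory that moves toward the goal and then away again accumulates zero net progress reward, so only the time penalty makes such loitering costly.

```python
def step_reward(prev_dist, dist, steer, prev_steer,
                w_progress=1.0, w_steer=0.05, w_dsteer=0.1, time_penalty=0.02):
    """Simplified per-step reward; all weights are hypothetical."""
    progress = w_progress * (prev_dist - dist)           # reward for closing distance
    smoothness = (w_steer * steer ** 2                   # steering magnitude penalty
                  + w_dsteer * (steer - prev_steer) ** 2)  # steering increment penalty
    return progress - smoothness - time_penalty


def episode_return(dists, steers, **weights):
    """Sum step rewards; dists has one more entry than steers."""
    total, prev_steer = 0.0, 0.0
    for k, steer in enumerate(steers):
        total += step_reward(dists[k], dists[k + 1], steer, prev_steer, **weights)
        prev_steer = steer
    return total


# A loitering pattern: the distance to goal oscillates and ends where it began.
loiter_dists = [5.0, 4.0, 5.0, 4.0, 5.0]
loiter_steers = [0.0, 0.0, 0.0, 0.0]

ret_without = episode_return(loiter_dists, loiter_steers, time_penalty=0.0)
ret_with = episode_return(loiter_dists, loiter_steers, time_penalty=0.02)
print(ret_without, ret_with)
```

In this toy case the loitering pattern is cost-free without the time penalty and strictly negative with it, which is the qualitative effect the ablation examines.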
4. Discussion
The results support the effectiveness of separating the AVP task into long-horizon cruising and terminal parking. In this structure, NMPC provides a constraint-aware tracking layer before the switching point, while TD3 focuses on the nonlinear low-speed maneuver near the target slot. This division reduces the burden on a single controller and allows each component to operate within a more suitable regime. The observed cruise-to-park behavior suggests that such a hierarchical design can provide a practical baseline for integrated AVP studies in structured parking-lot environments [31]. The profiled MPC tracking controller block required an average of 0.0405 s per controller call, which was below the 0.1 s controller sampling time. This result provides a preliminary indication of computational feasibility under the selected MATLAB/Simulink simulation setting, although embedded real-time deployment would still require code generation and hardware-specific profiling.
The baseline comparison further supports the selection of TD3 as the main parking controller in the proposed framework. PPO showed the fastest training convergence, indicating higher training efficiency under the implemented reward and observation settings. SAC maintained stronger exploration ability, but its terminal behavior became less stable in the more challenging validation cases, especially in Slot 64. In contrast, TD3 achieved the lowest worst-case position error and the most reliable terminal behavior across the six-slot evaluation. Therefore, the comparison does not imply that TD3 is universally superior to PPO or SAC, but it shows that TD3 provided the best balance among terminal accuracy, stability, and robustness in the implemented parking environment.
The reward design also played an important role in shaping the parking behavior. The ablation study showed that removing the explicit time-penalty term weakened the training trend and produced spinning-like or repeated corrective motion near the target region. This directly supports the role of the time penalty in discouraging inefficient near-goal behavior. In the proposed time-penalized TD3 setting, no obvious repeated spinning behavior was observed near the goal, suggesting that the time-penalized reward helped suppress locally inefficient behaviors while preserving stable convergence. This finding is consistent with prior studies showing that reinforcement learning-based parking performance is highly sensitive to reward formulation and task structure [29,30,32,33]. Nevertheless, the current ablation does not fully decompose all reward components. Future experiments will further examine the independent and coupled effects of the progress reward, steering magnitude penalty, and steering-increment penalty on convergence speed, trajectory smoothness, time-to-park, and terminal parking accuracy.
The validation results further suggest that the learned policy did not merely memorize the training slot. Under fixed controller settings and without retraining, the controller completed successful docking maneuvers in all six unseen validation slots within the same structured parking-lot layout. However, this result should be interpreted as intra-lot target-slot transferability rather than broad policy generalization across different parking environments. The variation in terminal position error, especially in Slots 47 and 64, indicates that parking accuracy remains sensitive to target-slot geometry and relative pose at the switching point. This suggests that the learned docking strategy captured transferable behavior within the tested structured parking lot, while further improvements are still needed to reduce slot-dependent terminal offsets.
4.1. Modeling Assumptions and Practical Limitations
The present study is based on several modeling assumptions that are appropriate for an initial cruise-to-park validation but should be considered when interpreting the results. First, the kinematic bicycle model is suitable for the low-speed AVP setting considered in this work, where tire slip and high-order lateral dynamics are less dominant than in high-speed maneuvers [
34]. However, this model does not explicitly capture tire saturation, suspension effects, actuator dynamics, or low-friction conditions. These effects may influence tracking accuracy and phase-transition quality in real vehicles.
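For reference, a kinematic bicycle update of the kind assumed here can be sketched as a single forward-Euler step. The rear-axle reference point, the wheelbase value, and the use of 0.1 s as the integration step are assumptions made for this illustration; only the 2 m/s parking speed is taken from the study.

```python
import math


def bicycle_step(x, y, yaw, steer, v=2.0, wheelbase=2.8, dt=0.1):
    """Advance a rear-axle kinematic bicycle model by one forward-Euler step.

    wheelbase (m) and dt (s) are hypothetical values used for illustration.
    """
    x += v * math.cos(yaw) * dt
    y += v * math.sin(yaw) * dt
    yaw += v / wheelbase * math.tan(steer) * dt
    # Wrap the heading to [-pi, pi] to avoid wrap-around discontinuities.
    yaw = math.atan2(math.sin(yaw), math.cos(yaw))
    return x, y, yaw
```

The update contains no tire-force, actuator, or friction terms, which is why the effects listed above fall outside what this model can represent.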
Second, the camera-based free-slot detection module is implemented as a geometry-based sensor rather than a pixel-level perception pipeline. This design allows the switching logic and controller interaction to be evaluated in a controlled manner, but it also assumes reliable slot visibility and does not model false positives, false negatives, occlusion, or image-processing latency. Third, the LiDAR feedback used by the TD3 controller is represented by idealized ray-casting range measurements. Sensor noise, dropout, and calibration errors may affect the observation vector and therefore the learned parking policy. Fourth, the TD3 parking controller in this study controls only the steering command, while the longitudinal speed is fixed at 2 m/s during the parking phase. This design simplifies the parking MDP and allows the study to focus on steering-policy learning and terminal docking behavior under a consistent low-speed condition. Therefore, the reported results should be interpreted as steering-policy transfer under a fixed-speed parking setting rather than a fully coupled longitudinal-lateral parking policy. The sensitivity of the learned policy to different parking speeds remains an important limitation and will be examined in future work by introducing speed-varying policies or joint speed-steering control.
These assumptions may lead to optimistic performance compared with a real deployment. Nevertheless, they allow the present study to focus on the feasibility of the hierarchical NMPC-TD3 architecture and the cruise-to-park transition mechanism. Future work will introduce perception uncertainty, LiDAR noise and missing beams, localization drift, actuator delay, dynamic obstacles, mirrored or rotated parking-slot layouts, different aisle widths, more detailed vehicle dynamics, and hardware-in-the-loop or real-vehicle validation to further evaluate the robustness of the proposed framework.
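One simple way to introduce LiDAR noise and missing beams into an idealized range vector is sketched below. The Gaussian noise model, the dropout probability, and the convention of reporting a missing beam as the maximum range are all assumptions chosen for illustration, not properties of the simulation used in this study.

```python
import random


def impair_ranges(ranges, max_range, sigma=0.02, dropout_prob=0.05, rng=None):
    """Return a copy of `ranges` with additive Gaussian noise and random beam dropout.

    sigma (m) and dropout_prob are hypothetical values used for illustration.
    """
    rng = rng or random.Random()
    impaired = []
    for r in ranges:
        if rng.random() < dropout_prob:
            # Assumed convention: a missing beam is reported as max_range.
            impaired.append(max_range)
        else:
            noisy = r + rng.gauss(0.0, sigma)
            impaired.append(min(max(noisy, 0.0), max_range))  # clip to valid range
    return impaired
```

Passing a seeded `random.Random` instance as `rng` would make the impairments repeatable across evaluation runs.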
4.2. Extension to AI-Defined Vehicle Networks and Multi-Vehicle AVP
Although the present framework focuses on a single ego vehicle, the hierarchical cruise-to-park architecture can be extended to AI-defined vehicle networks and multi-vehicle AVP systems [
35]. In a networked parking environment, a higher-level parking management module or infrastructure-side planner could aggregate parking-slot availability, vehicle intents, and local traffic conditions [
36]. This network-level layer could assign target slots, resolve conflicts among multiple vehicles, and provide updated cruising references to each ego vehicle.
The NMPC cruising controller in the proposed framework could then track the assigned route under vehicle-level constraints, while the TD3 controller would remain responsible for local low-speed docking after the target slot is confirmed. Vehicle-to-Everything communication could be used to share information such as slot occupancy, intended parking maneuvers, and vehicle priority. For example, a vehicle entering a parking slot could broadcast its intended docking action, while nearby vehicles or the infrastructure could update their cruising targets to avoid local conflicts. In this way, the proposed single-vehicle architecture can serve as a lower-level control and docking module within a broader AI-defined parking network.
4.3. Offline Transferability, Network Constraints, and Future Adaptation
The current TD3 parking controller should be interpreted as an offline-trained policy rather than an online adaptive AI-defined vehicle controller. In the reported experiments, the policy is trained in Slot 7 and then deployed to unseen target slots without retraining. Therefore, the results demonstrate intra-lot transferability under a fixed parking-lot layout, but they do not demonstrate continuous adaptation to new parking-lot geometries, new perception conditions, or changing vehicle-network conditions.
In contrast, AI-defined vehicles are expected to improve their models or policies over time through online learning, cloud-edge updates, fleet-level knowledge sharing, or federated learning [
37]. Such mechanisms could allow parking policies to adapt to slot shapes, local traffic patterns, and sensor characteristics while reducing the need for complete retraining from scratch. Extending the proposed TD3 module with online fine-tuning, continual learning, or federated policy updates is therefore an important future direction.
Network constraints are another important consideration for extending the proposed framework to AI-defined vehicle networks. The current implementation assumes reliable state estimation, synchronized sensor updates, and stable target-slot information once the free slot is detected and latched. In a networked AVP system, communication delays, packet dropouts, or outdated infrastructure messages could affect slot-availability updates, target assignment, and multi-vehicle coordination [
38]. A delayed occupancy update may cause a vehicle to continue toward a slot that has already been assigned to another vehicle, while a delayed coordination message may affect the timing of the cruise-to-park transition.
To address these issues, future work should incorporate uncertainty-aware planning and communication-aware control. For example, the network planner could use conservative estimates of slot availability when communication is intermittent, while the vehicle-level controller could revalidate the target slot before switching to the parking mode. The TD3 parking policy could also be trained with simulated sensing and communication impairments so that it becomes more robust to noisy, delayed, or partially missing observations.
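The target-slot revalidation step mentioned above could take the form of a conservative gate that is evaluated just before the cruise-to-park switch. The sketch below is a hypothetical illustration: the `SlotReport` message, its fields, and the `max_age` staleness threshold are assumed names and values, not part of the implemented framework.

```python
from dataclasses import dataclass


@dataclass
class SlotReport:
    # Hypothetical infrastructure message; field names are assumptions.
    slot_id: int
    is_free: bool
    timestamp: float  # seconds, on the same clock as `now`


def may_switch_to_parking(report, target_slot_id, now, locally_seen_free, max_age=0.5):
    """Allow the switch only if fresh network data and onboard sensing agree the slot is free."""
    if report is None or report.slot_id != target_slot_id:
        return False  # no usable report for this slot: stay in cruising mode
    fresh = (now - report.timestamp) <= max_age
    return fresh and report.is_free and locally_seen_free
```

Under this rule a stale or missing occupancy message keeps the vehicle in cruising mode, which trades some parking opportunities for a lower risk of approaching a slot that has already been assigned.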
4.4. Safety Considerations
Safety guarantees also require further consideration before the proposed framework can be deployed in real-world or multi-vehicle AVP environments. The NMPC cruising controller can explicitly enforce input constraints and can be extended to include additional safety constraints during route tracking. However, the TD3 parking controller is a learned policy and does not by itself provide formal guarantees of collision avoidance or recursive feasibility.
In the present study, safety is encouraged through LiDAR-based observations, invalid-operation penalties, actuator saturation, and closed-loop simulation validation. These mechanisms are useful for training and evaluation, but they are not equivalent to certified safety. For real-world deployment, the learned parking policy should be combined with a supervisory safety layer, such as a control barrier function-based safety filter, a fallback NMPC controller, or an emergency override module that can intervene when predicted collisions or constraint violations are detected [
39].
This safety layer would be especially important in networked AVP scenarios, where target-slot changes, delayed coordination messages, and nearby moving vehicles may introduce additional risk during the phase transition and docking process.
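As a minimal illustration of where such a supervisory layer would sit, the sketch below wraps the learned steering command with actuator saturation and a distance-threshold emergency stop. This is a deliberately simple override rule, not a control barrier function filter and not a certified safety mechanism; the function name, steering limit, and stopping distance are assumptions.

```python
def supervise(steer_cmd, speed_cmd, ranges, max_steer=0.6, stop_dist=0.3):
    """Saturate steering and force a stop when any LiDAR range is too short.

    max_steer (rad) and stop_dist (m) are hypothetical values used for illustration.
    Returns (steer, speed, overridden).
    """
    steer = min(max(steer_cmd, -max_steer), max_steer)  # actuator saturation
    # Conservative choice: treat missing range data as unsafe.
    if not ranges or min(ranges) < stop_dist:
        return steer, 0.0, True
    return steer, speed_cmd, False
```

A practical safety layer would act on predicted rather than instantaneous clearance, but the interface is the same: the learned policy proposes a command and the supervisor decides whether it is passed through.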
5. Conclusions
This study presented a hierarchical cruise-to-park automated valet parking framework that combines an NMPC-based controller for the cruising phase with a TD3-based controller for the terminal parking phase in a unified Simulink environment. The objective was to achieve a smooth transition from route following to final parking while maintaining stable control performance throughout the maneuver.
The simulation results show that the proposed framework was able to complete the full parking task under the designed parking-lot setting. During the cruising phase, the NMPC controller kept the vehicle close to the reference path and provided suitable initial conditions for the parking stage. The profiled MPC tracking controller block required an average of 0.0405 s per controller call, roughly 40% of the 0.1 s controller sampling time, which provides a preliminary indication of computational feasibility in the current MATLAB/Simulink implementation. After the switching condition was satisfied, the TD3 controller completed the terminal maneuver without obvious discontinuity at the transition point. In addition to the training slot, the same controller configuration was validated on six previously unseen target slots within the same parking-lot layout, where successful parking behavior was obtained without retraining. These results indicate that the framework provides a degree of intra-lot target-slot transferability within the tested environment.
The additional baseline comparison showed that PPO achieved faster training convergence, while TD3 provided the most reliable terminal parking performance across the six validation slots. SAC showed competitive behavior in easier cases but became less stable in the most challenging slot. The time-penalty ablation further demonstrated that removing the explicit time penalty can lead to spinning-like or repeated corrective motion near the target region. These results support the use of TD3 with the proposed time-penalized reward formulation as the parking controller in the current cruise-to-park framework.
Although the current study was limited to simulation and a fixed parking-lot layout, it provides a structured basis for further development of end-to-end AVP systems. Future work can extend the framework by introducing speed-varying parking policies, joint longitudinal-lateral control, sensor noise, missing LiDAR beams, localization drift, actuator delay, dynamic obstacles, mirrored or rotated parking layouts, different aisle widths, more diverse parking scenarios, and component-wise reward ablations of the progress reward and steering-smoothness penalty. Further evaluation in higher-fidelity simulation [
40,
41], hardware-in-the-loop platforms, or real-vehicle experiments [
42] will also be necessary before practical deployment. Future work will also apply robust parameter-space control design to the path tracking used during parking-space seeking [
43,
44,
45,
46,
47].