1. Introduction
Autonomous navigation of micro aerial vehicles (MAVs) in confined and low-visibility environments remains a challenging problem with important applications in infrastructure inspection, underground exploration, and search-and-rescue missions. Among these environments, narrow tubular structures such as tunnels, pipes, and underground conduits pose particularly stringent constraints due to limited clearance, restricted sensing conditions, and strong coupling between perception, control, and safety. In such scenarios, even small deviations from the tube centerline or poorly negotiated turns may lead to collisions, loss of stability, or mission failure. In this work, we focus on the turn negotiation problem in narrow tubular environments, defining it as an explicit control challenge that requires coordination between steering, body stabilization, and local sensory perception. Our contribution lies not in proposing a new RL algorithm, but in formulating this problem, designing interpretable rewards, and systematically analyzing learned policies to show how turn-handling behavior emerges and generalizes across tube geometries.
A substantial body of prior work has addressed autonomous flight in tunnel-like environments using a variety of sensing modalities and control strategies. Vision- and Light Detection and Ranging (LiDAR)-based approaches have demonstrated reliable navigation in underground or dark environments by exploiting geometric cues extracted from onboard sensors [1,2,3,4,5,6,7]. Complementary research has explored alternative sensing paradigms for vision-denied conditions, including bio-inspired sonar [8] or millimeter-wave-based localization [9], highlighting the importance of robust perception in degraded visibility scenarios. While these methods provide strong foundations for perception and localization, they typically rely on explicit geometric reconstruction or global navigation objectives to ensure collision-free motion.
More recently, reinforcement learning (RL) has emerged as a promising paradigm for autonomous navigation in complex and constrained environments.
RL-based methods have shown impressive performance in constrained robotic navigation tasks, such as medical endoscopy [10,11,12], as well as in high-speed drone racing [13,14].
These approaches demonstrate the ability of learned policies to exploit rich sensory inputs and implicitly encode control strategies that would be difficult to design manually. However, most existing RL applications either focus on open or semi-structured environments, or assume task formulations in which navigation constraints are addressed implicitly through global reward shaping or predefined trajectories.
In the specific context of tubular environments, an important yet underexplored challenge is the turn negotiation problem: the need for an autonomous agent to adapt its motion when approaching and traversing bends or curvature variations while maintaining safe clearance and a centered trajectory. In the majority of prior work, turn negotiation is not addressed as an explicit problem. Instead, it is typically handled through heuristic controllers, local geometric rules, or as a side effect of global objectives such as collision avoidance or trajectory smoothness. As a result, the relationship between perception, local decision-making, and turn-specific behavior remains largely implicit and difficult to analyze systematically.
Unlike generic obstacle avoidance or path-following, turn negotiation in narrow tubular environments introduces a transient conflict between lateral clearance, heading alignment, and body stabilization. Successfully traversing a bend often requires the agent to initiate steering before the future direction of the tube becomes directly observable, effectively aligning with an implicit or temporally integrated direction inferred from past sensory cues, while simultaneously maintaining rear clearance to preserve stability.
In this work, we propose a reinforcement learning-based navigation framework that explicitly targets autonomous flight in narrow tubular environments without relying on prior knowledge of the tube geometry. Our approach leverages sparse range-based observations and adaptive weighting of front and rear LiDAR measurements to enable the agent to anticipate and negotiate turns in a stable and reactive manner. Rather than prescribing explicit geometric rules or predefined trajectories, the proposed policy learns to modulate its behavior based on local sensory cues, leading to robust turn handling and improved trajectory centering across a wide range of tunnel configurations.
The main contribution of this paper is not the introduction of a new reinforcement learning algorithm, but rather the formulation and analysis of tubular navigation as a learning problem in which turn negotiation emerges as a central and measurable behavior. By systematically comparing our approach with classical baselines such as Pure Pursuit [15] and by evaluating performance across multiple tunnel geometries, we demonstrate that learned policies can effectively address limitations observed in both heuristic controllers and prior learning-based methods. Furthermore, this work provides a clearer conceptual link between perception, local control decisions, and navigation performance in confined environments.
The remainder of the paper is organized as follows. Section 2 reviews existing work on autonomous navigation in tunnels, learning-based control for MAVs, and sensing in low-visibility environments, with a particular emphasis on how turn negotiation is addressed or overlooked. Section 3 presents the proposed reinforcement learning framework and sensor modeling. Experimental results and comparisons are discussed in Section 4, followed by perspectives for future work in Section 5 and a final conclusion in Section 6.
2. Related Work
This section reviews prior work related to autonomous navigation in confined and tubular environments, learning-based control for aerial robots, and sensing in low-visibility conditions. Beyond a general comparison of methodologies, we place particular emphasis on how existing approaches handle—or overlook—the problem of turn negotiation in narrow tubular geometries.
2.1. Autonomous Navigation in Tunnels and Confined Environments
Several works have addressed autonomous flight in tunnel-like or underground environments using classical perception and control pipelines. Vision-based approaches have demonstrated the feasibility of navigating dark or texture-poor underground spaces by relying on convolutional neural networks for feature extraction and obstacle avoidance [3,6]. Other methods rely on geometric modeling and reactive control to ensure collision-free navigation in unknown tunnel-like environments [4]. These approaches typically emphasize robustness to poor illumination and sensor noise, and often achieve reliable straight-line motion in constrained spaces.
More recent contributions specifically target flight through narrow tunnels, highlighting the aerodynamic and control challenges associated with limited clearance [1,2]. LiDAR-based methods, including the use of tilted sensors, have been proposed to improve perception of tunnel geometry and facilitate navigation in curved or irregular conduits [5]. While these systems can successfully traverse bends, turn handling is generally achieved through geometric heuristics or implicit controller behavior rather than through an explicit formulation of the turn negotiation problem.
In these works, turns are treated as particular instances of obstacle avoidance or path following. The control laws are typically designed to minimize distance to walls, track an estimated centerline, or follow a predefined path when available. As a result, the transition between straight motion and curved motion is handled implicitly, without isolating turning as a distinct control regime requiring specific coordination between perception and actuation.
2.2. Sensing and Navigation in Low-Visibility Environments
A parallel line of research focuses on navigation in environments where vision is unreliable or unavailable. Bio-inspired sensing approaches, such as sonar-based perception, have shown that autonomous drones can navigate without visual input by exploiting echo-based spatial cues [8]. Similarly, millimeter-wave-based localization techniques have been proposed to enable precise pose estimation in visually degraded environments [9]. These works emphasize sensing robustness and localization accuracy rather than control strategy design.
While such sensing modalities can provide valuable information for navigation, they do not, by themselves, address how an agent should adapt its control behavior when negotiating turns in confined geometries. The control policies used in these systems typically rely on standard obstacle avoidance or trajectory tracking mechanisms, and turning behavior again emerges implicitly from global navigation objectives.
2.3. Learning-Based Control for Constrained Navigation
Closer to our problem setting, several learning-based approaches have been proposed for navigation in narrow and tortuous environments in the context of robotic endoscopy [10,11,12]. These studies address motion within constrained tubular structures and highlight the relevance of reinforcement learning for achieving safe and adaptive behavior. Nevertheless, even in this domain, turn handling is usually embedded within global safety or progress objectives, rather than formulated as a distinct coordination problem between steering and body stabilization.
Reinforcement learning has also been successfully applied to agile flight and high-speed drone racing [13,14]. These works demonstrate that end-to-end learning can produce highly effective control policies when sufficient training data and well-designed reward functions are available. However, the environments considered are generally open or structured around race tracks, where obstacles are sparse and turning behavior is largely dictated by predefined gates or trajectories.
Across these learning-based approaches, the reward functions and policy architectures are generally designed to encourage smooth motion, collision avoidance, or task completion. While such objectives implicitly encourage successful turn traversal, they do not explicitly encode how control priorities should be redistributed when entering or exiting a turn, nor how different sensing regions should contribute differently depending on the local geometry.
2.4. Discussion: Turn Negotiation as an Explicit Control Problem
From the reviewed literature, it emerges that while many systems are capable of navigating curved or tubular environments, the handling of turns is rarely analyzed as a first-class control problem. Instead, turn negotiation is typically treated implicitly, either through heuristic controllers, geometric constraints, or learned policies optimized for global objectives. As a consequence, existing approaches provide limited insight into how sensing, steering, and stabilization priorities are redistributed during curvature transitions, particularly when steering must be initiated before the future direction of the tube becomes directly observable. Moreover, they do not offer metrics to specifically characterize turn-related behavior. This lack of explicit formulation makes it difficult to compare methods or to diagnose failure modes associated with turns in narrow tubular geometries.
In this work, we define the turn negotiation problem as the explicit modeling of the transition between straight motion and curved motion in tubular environments. This includes (i) the temporary acceptance of front asymmetry to enable steering, (ii) the stabilization of the vehicle body through rear constraints during rotation, and (iii) a continuous redistribution of control priorities without discrete mode switching or access to prior knowledge of the tube geometry.
To the best of our knowledge, none of the reviewed works explicitly formulate turn negotiation as a continuous coordination problem between steering and body stabilization driven by local sensory cues. Our contribution therefore complements existing approaches by focusing on this specific but critical aspect of tubular navigation. Rather than proposing a new reinforcement learning algorithm, we introduce a problem definition, an interpretable reward design, and analysis metrics that explicitly target turn-handling behavior, making it a central and analyzable aspect of policy performance.
3. Methodology
3.1. Problem Description
The problem addressed is to enable an autonomous drone to navigate within a three-dimensional tubular environment whose geometry may exhibit significant variations in curvature. The drone must traverse the tube to its endpoint while avoiding any collision with the internal walls and maintaining a stable trajectory, despite having no global information about the shape of the conduit.
The tube is generated from a smooth spatial curve acting as a guiding axis. This curve may present substantial local variations: abrupt changes in orientation, tightly curved regions, or portions temporarily unobservable from the drone’s current position. Turns therefore constitute the most critical situations, as they can cause a temporary loss of visibility of the central point of the conduit within the drone’s field of view.
The drone evolves under realistic kinematics: it possesses a forward direction, a controllable speed, and a continuously updated local frame. No map of the tube nor any prior knowledge of its geometry is provided. Its perception relies exclusively on local observations, including:
- proprioceptive information (orientation, forward direction, velocity, estimated progression);
- limited exteroceptive measurements, in particular a front/back perimeter LiDAR providing normalized distances to the walls;
- a basic visual module detecting, when possible, a point corresponding to the local center of the conduit in the field of view.
These elements constitute the perceptual basis enabling navigation, centering, and adaptation to geometric variations until the drone reaches the end of the tube.
3.2. General Description of the Approach
This subsection presents the principles guiding our navigation strategy before introducing a mathematical formalization within the framework of a Markov Decision Process.
At each time step, the agent constructs a perceptual state synthesizing all available local observations. This state includes: (i) instantaneous kinematics (orientation, forward direction, speed), (ii) the current progression along the tube, (iii) the possible detection of a target point visible in the field of view, (iv) and features derived from the front/back LiDAR, enabling inference of the local symmetry of the conduit and anticipation of upcoming curved regions.
When the target point is visible, the drone learns to orient itself preferentially toward this reference direction. When the target temporarily disappears during a turn, a short-term directional memory preserves the last useful orientation, ensuring a smooth transition until the structure becomes observable again. In the absence of reliable visual or memory cues, the tube geometry is estimated from LiDAR asymmetries.
The drone’s motion is thus governed by the joint adjustment of its speed and forward direction based on local observations and relevant geometric cues. Behaviors are evaluated according to instantaneous criteria of centering, alignment, and consistency with the implicit structure of the conduit, as well as the ability to negotiate turns effectively.
Finally, to provide a deterministic point of comparison, the agent’s performance is contrasted with that of a trajectory-following algorithm of the Pure Pursuit type, which directly exploits the centerline used to generate the tube.
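For reference, the behavior of such a baseline can be sketched as a minimal lookahead steering rule in the spirit of Pure Pursuit; the array-based centerline representation and the lookahead distance below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def pure_pursuit_direction(position, centerline, lookahead=2.0):
    """Return a unit steering direction toward a point on the centerline
    located roughly `lookahead` metres ahead of the closest point.
    `centerline` is an (N, 3) array of points ordered along the tube."""
    pts = np.asarray(centerline, dtype=float)
    # Index of the centerline point closest to the drone.
    closest = int(np.argmin(np.linalg.norm(pts - position, axis=1)))
    # Walk forward along the curve until the lookahead distance is covered.
    travelled, target = 0.0, pts[-1]
    for i in range(closest, len(pts) - 1):
        travelled += np.linalg.norm(pts[i + 1] - pts[i])
        if travelled >= lookahead:
            target = pts[i + 1]
            break
    direction = target - position
    return direction / np.linalg.norm(direction)
```

Unlike the learned policy, this rule requires direct access to the centerline used to generate the tube, which is precisely what makes it a useful deterministic reference.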
The next section formalizes this decision framework using a Markov Decision Process.
3.3. Problem Formulation
Notation: throughout this paper, bold symbols denote vectors, while italic symbols denote scalars.
The autonomous navigation problem of a drone inside a confined tubular environment is formulated as a Markov Decision Process (MDP), defined by the tuple

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma), \tag{1}$$

where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $\mathcal{P}$ the stochastic transition dynamics, $r$ the reward function, and $\gamma$ the discount factor. The objective of the agent is to maximize the expected cumulative reward (Equation (2)):

$$J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t\right], \tag{2}$$

where $\pi_\theta$ denotes the agent's parameterized policy.
PPO Algorithm for Reinforcement Learning
Among policy optimization methods, the Proximal Policy Optimization (PPO) algorithm has become a widely adopted reference due to its stability and performance [16].
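For concreteness, the clipped surrogate objective at the core of PPO can be sketched as follows; this is a minimal NumPy illustration of the standard formulation, not the training code used in this work, and the clipping parameter `eps = 0.2` is the common default rather than a value reported by the paper:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective from PPO: the policy-gradient term is
    bounded so a single update cannot move the policy too far.
    `ratio` = pi_new(a|s) / pi_old(a|s); both arguments are arrays."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # PPO maximizes the minimum of the two terms (a pessimistic bound).
    return np.mean(np.minimum(unclipped, clipped))
```

The clipping is what makes PPO stable across the highly varied tube geometries produced by the curriculum: large advantage estimates in rare, difficult turns cannot cause destructive policy updates.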
The MDP formulation (Equation (1)) provides a general mathematical framework for learning through interaction. To clarify how this formalism applies to navigation in tubular environments, we now describe the physical components of the system: the drone, its sensors, and the local structure of the environment, which jointly determine both the state space and the dynamics of the problem.
3.4. Drone Modeling
In this work, as illustrated in Figure 1, the drone is modeled as an autonomous quadcopter required to navigate through a narrow tube, negotiate turns, and avoid collisions. The modeling choice is inspired by the sensory configuration of commercial drones specialized in confined-structure inspection, such as the Flyability ELIOS 3 shown in Figure 2, designed for the exploration of tunnels, caves, and constrained industrial environments.
As shown in Figure 3, the theoretical drone considered in this study is equipped with the following perception sensors:
- a front-facing camera, aligned with the longitudinal axis of the quadcopter and used to acquire images of the tube and detect its center when visible;
- two LiDARs, positioned at the front and rear of the drone, each providing coverage on a vertical plane perpendicular to the drone's longitudinal axis, offering robust perception of radial distances to the tube walls.
Figure 3. Illustration of the drone equipped with its front-facing camera and its front and rear LiDARs.
This sensor configuration is consistent with recent practices in autonomous drone navigation within tubular environments. For instance, in [6], the authors propose a navigation approach based on a front-facing camera detecting the tunnel center in each image and controlling the drone's heading accordingly. Complementarily, several works [7,17,18] leverage LiDAR configurations to build a local geometric representation of the environment and provide reliable obstacle-distance estimation in underground environments.
In the present paper, we assume that the drone's front camera can detect the tunnel center when it lies within its field of view, an assumption directly inspired by the approach in [6]. The LiDARs complement this perception by providing continuous measurements of distances to the walls over the full vertical perimeter, which is essential for assessing navigability and correcting trajectory deviations when the tunnel center is not observable.
This camera–LiDAR combination, inspired both by industrial solutions (ELIOS 3) and recent scientific practices, offers a robust compromise between visual perception and geometric measurement for autonomous navigation in confined tubular environments.
3.5. Environment
As illustrated in Figure 4, the environment is modeled as a confined space in which the drone evolves. The kinematic state of the quadcopter at time $t$ is represented by the vector

$$\mathbf{s}_t = \left(\mathbf{p}_t,\; v_t,\; \mathbf{R}_t\right), \tag{3}$$

where:
- $\mathbf{p}_t$ is the position of the center of mass in the global reference frame,
- $v_t$ is the drone's speed along its longitudinal (forward) axis,
- $\mathbf{R}_t$ is the rotation matrix representing the drone's orientation with respect to the global frame, forming a local orthonormal basis.
This representation (Equation (3)) captures the drone's spatial position, forward speed, and orientation, elements essential for trajectory planning and autonomous navigation control in complex tubular environments.
Figure 4. Illustration of the drone's evolution within the tubular environment.
To facilitate reinforcement learning and improve policy robustness, we introduce a Curriculum Learning (CL) mechanism. The central idea is to progressively expose the agent to increasingly complex tubular environments while controlling the geometric difficulty of the curves used to generate the 3D conduits.
As shown in Figure 5, tube generation relies on a set of raw curves derived from predefined synthetic models. At each episode, two curves are selected according to the curriculum level, smoothed via spline interpolation, and combined to produce a mixed tubular geometry. This process follows three main steps:
- uniformization and smoothing: each curve is reparameterized via B-splines with random tension and degree, ensuring geometric diversity even within the same difficulty level;
- curriculum-controlled selection:
  - level 0: nearly straight curves;
  - level 1: moderately curved shapes;
  - level 2: highly curved shapes;
- geometric fusion: two curves are interpolated using a random factor to produce a new trajectory at each episode, even within the same curriculum level.
Figure 5. Illustration of the massive tube-generation principle during training.
This procedure ensures a progressive, controlled, yet non-repetitive increase in geometric complexity. It notably exposes the agent to difficult scenarios involving low-visibility turns, where the vanishing point temporarily disappears and navigation must rely solely on inertial and LiDAR cues.
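The geometric-fusion step can be sketched as follows, assuming curves are stored as ordered 3D point arrays; arc-length resampling is used here in place of the paper's B-spline reparameterization, so this is a simplified illustration of the mixing principle only:

```python
import numpy as np

def blend_curves(curve_a, curve_b, mix, n_samples=100):
    """Produce a new guiding curve as a point-wise convex combination of two
    source curves, each first resampled to `n_samples` points by arc length.
    `mix` in [0, 1] controls the geometric fusion (0 -> curve_a, 1 -> curve_b)
    and is drawn at random for each episode."""
    def resample(c):
        c = np.asarray(c, dtype=float)
        seg = np.linalg.norm(np.diff(c, axis=0), axis=1)
        s = np.concatenate([[0.0], np.cumsum(seg)])        # arc-length abscissa
        s_new = np.linspace(0.0, s[-1], n_samples)
        return np.stack([np.interp(s_new, s, c[:, k]) for k in range(3)], axis=1)
    a, b = resample(curve_a), resample(curve_b)
    return (1.0 - mix) * a + mix * b
```

Because `mix` is resampled every episode, the agent never sees the exact same conduit twice, even when the curriculum level (and therefore the pool of source curves) is fixed.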
The agent moves to the next level only when the average success rate exceeds a predefined threshold. Table 1 summarizes the levels used.
The success thresholds used in this curriculum were deliberately chosen to be moderate. In this initial development phase, the primary objective is not to maximize the controller’s absolute performance but rather to ensure stable RL policy convergence while validating the feasibility of the approach across increasingly complex environments.
Stricter thresholds (e.g., >0.9) would increase the risk of stagnation in intermediate levels, particularly when the vanishing point is lost in strongly curved tubes. They would also significantly increase the training time required to progress between levels, making the curriculum less practical for this preliminary study.
The adopted thresholds (0.80–0.85) therefore represent an effective compromise: high enough to enforce meaningful progressive learning, yet accessible enough to allow smooth exploration and full validation of the RL approach.
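The advancement rule can be sketched as a moving-window success-rate check; the window size and the reset of statistics at each transition are illustrative assumptions, while the 0.80–0.85 thresholds come from the text:

```python
from collections import deque

class CurriculumTracker:
    """Track recent episode outcomes and advance the curriculum level when
    the moving success rate exceeds the current threshold (0.80-0.85 here)."""
    def __init__(self, thresholds=(0.80, 0.85), window=100):
        self.thresholds = thresholds      # one threshold per level transition
        self.level = 0
        self.outcomes = deque(maxlen=window)

    def record(self, success):
        """Log one episode outcome; return the (possibly updated) level."""
        self.outcomes.append(1.0 if success else 0.0)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if self.level < len(self.thresholds) and window_full:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate > self.thresholds[self.level]:
                self.level += 1
                self.outcomes.clear()     # fresh statistics at the new level
        return self.level
```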
In summary, Curriculum Learning serves as a key mechanism to:
- stabilize RL policy convergence by avoiding excessively complex scenarios at the start,
- improve the drone's ability to generalize to tubes with varied shapes and tight turns,
- provide a progressive training framework suited to the interaction between the nominal Pure Pursuit controller and the adaptive RL module.
3.6. Observation Space
At each time step, the agent receives an observation vector constructed from five components: features derived from the LiDAR scans, the drone's orientation and kinematics, the visual perception of the target, a memory mechanism for handling turning phases, and a set of global context indicators.
3.6.1. Derived LiDAR Features
Rather than relying directly on raw LiDAR measurements, the environment uses a set of nine geometric features extracted from the front and rear scans. They include:
- horizontal and vertical asymmetries for the front and rear scans;
- front and rear symmetry scores;
- front and rear average distances;
- the normalized minimum distance, used as a safety indicator.
These quantities summarize the essential LiDAR information while removing the local variability of raw measurements.
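A plausible computation of these nine features is sketched below, assuming each scan has already been collapsed into four sector means; the precise asymmetry and symmetry formulas are not fixed in this excerpt, so the ones used here are illustrative assumptions:

```python
import numpy as np

def lidar_features(front, rear):
    """Collapse two perimeter scans into nine derived features.
    `front` and `rear` map sector names ('L','R','T','B') to mean
    normalized distances in [0, 1]. The asymmetry/symmetry formulas
    below are assumptions, not the paper's exact definitions."""
    def asym(a, b):
        return (a - b) / (a + b + 1e-6)   # signed asymmetry in (-1, 1)
    feats = {}
    for name, scan in (("front", front), ("rear", rear)):
        feats[f"{name}_h_asym"] = asym(scan["L"], scan["R"])
        feats[f"{name}_v_asym"] = asym(scan["T"], scan["B"])
        # Symmetry score: 1 when both axes are balanced, lower otherwise.
        feats[f"{name}_sym"] = 1.0 - 0.5 * (abs(feats[f"{name}_h_asym"])
                                            + abs(feats[f"{name}_v_asym"]))
        feats[f"{name}_mean"] = np.mean(list(scan.values()))
    feats["min_dist"] = min(min(front.values()), min(rear.values()))
    return feats
```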
3.6.2. Orientation and Kinematics
The drone provides its local orthonormal basis (three unit vectors corresponding respectively to its forward, upward, and lateral axes). In addition, the observation includes:
- the normalized longitudinal velocity;
- its increment between two time steps;
- the global progression along the tube (between 0 and 1);
- the drone's position normalized by the tube radius.

This subset contributes 15 dimensions.
3.6.3. Visual Perception of the Target
If the tube center is visible in the field of view, its direction is projected into the drone's local frame (3 values), along with:
- a normalized depth of the target;
- a binary visibility flag (1 if visible, 0 otherwise).

This module provides 5 dimensions.
3.6.4. Memory Mechanism
When a turn temporarily hides the target, the agent relies on:
- a temporal ratio indicating how long the target has been invisible;
- the last memorized target direction, expressed in the local frame (3 values);
- a memory-availability flag (1 or 0).

This block adds another 5 dimensions.
3.6.5. Global Context
Finally, three additional indicators complete the state:
- a secondary safety metric based on the LiDAR minimum distance;
- an indicator of whether the drone remains inside the tube;
- the normalized progression within the episode.
3.6.6. Complete Vector
The final observation is therefore a compact state composed of 37 informative features (9 LiDAR-derived, 15 kinematic, 5 visual, 5 memory-related, and 3 contextual), specifically designed for navigation in confined tubular environments.
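The assembly of the five blocks into the 37-dimensional state can be sketched as a simple concatenation with a shape check (block names are illustrative):

```python
import numpy as np

def build_observation(lidar9, kin15, vision5, memory5, context3):
    """Concatenate the five observation blocks into the 37-dimensional state:
    9 LiDAR features + 15 kinematic values + 5 visual-target values
    + 5 memory values + 3 global-context indicators = 37."""
    obs = np.concatenate([lidar9, kin15, vision5, memory5, context3])
    assert obs.shape == (37,), f"unexpected observation size {obs.shape}"
    return obs.astype(np.float32)
```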
3.7. Action Space
In the context of autonomous navigation in confined tubular environments, the agent must continuously move forward while applying corrective orientation adjustments to follow the conduit’s geometry. The action space, illustrated in
Figure 6, was designed to satisfy this constraint: available orientations are restricted to a cone consistent with the front camera’s field of view (FOV), ensuring that each action directs the drone toward plausible continuations of the conduit—even in phases where the tube center is no longer visible. This design allows the agent to explore controlled, geometrically plausible directions.
At each step, the action corresponds to an orientation–speed pair chosen from a discrete space combining four levels of longitudinal speed with nine orientations uniformly distributed within the angular limits of the FOV (36 discrete actions in total).
The maximum half-angles of the cone are given by $\theta_h = \arctan(w/c)$ and $\theta_v = \arctan(h/c)$, where $w$, $h$, and $c$ denote respectively the half-width, half-height, and synthetic focal length of the camera.
The set of orientations contains nine angle pairs, symmetrically and uniformly distributed inside the cone, with a scaling factor chosen to ensure sufficiently strong corrections while preserving flight stability. This provides a reduced yet expressive set of corrective directions consistent with the FOV geometry.
Each pair is converted into a unit orientation vector by combining the drone’s current forward direction with two locally defined transverse axes. An additional constraint prevents the selection of backward-facing orientations, which would be incompatible with forward navigation inside a narrow conduit.
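The conversion from an angle pair to a unit orientation vector can be sketched as follows; the exact parameterization combining the forward and transverse axes is an assumption consistent with the description above, not the paper's verified formula:

```python
import numpy as np

def action_to_direction(forward, up, lateral, yaw, pitch):
    """Convert a (yaw, pitch) action pair into a unit orientation vector by
    tilting the current forward axis along the two local transverse axes.
    Angles are in radians, bounded by the camera FOV half-angles."""
    d = (np.cos(yaw) * np.cos(pitch) * forward
         + np.sin(yaw) * lateral
         + np.sin(pitch) * up)
    d = d / np.linalg.norm(d)
    # Reject backward-facing corrections, incompatible with forward flight
    # inside a narrow conduit.
    if np.dot(d, forward) <= 0.0:
        raise ValueError("orientation would point backwards")
    return d
```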
This discrete action space is not intended to uniformly sample all possible 3D orientations. Rather, it provides a symmetric, uniformly distributed, and limited set of plausible directions for following the conduit. It allows the agent to:
- explore orientations consistent with the tube's geometric continuity, even when the center is temporarily invisible;
- apply sufficient corrections to follow curved sections;
- keep the action space compact, which favors efficient learning.
The next section describes how the agent exploits this action space to handle straight segments, anticipate turns, and maintain a stable trajectory even when the tube center becomes temporarily unobservable.
3.8. Turn Negotiation
As previously introduced, negotiating turns within a curved conduit is a major challenge for an agent equipped with a single forward-facing camera, whose field of view cannot guarantee continuous visibility of the geometric center of the tube. The strategy developed in this work relies on a progressive interplay between visual perception, directional memory, and geometric analysis from LiDAR measurements, allowing continuous adaptation to the local dynamics of the conduit.
(1) Early Turn Detection via Degradation of Visual Alignment
Even before the target leaves the field of view, the tube curvature induces a lateral drift of the target point in the image plane. This drift is reflected in practice as a progressive decrease of the visual alignment score (Equation (12)), computed as the dot product between the drone's instantaneous forward direction and the vector from the camera to the target. This indicator provides an explicit signal as long as the target remains visible: a significant reduction of the score reveals a local geometry unfavorable to frontal alignment, indicating that a turn is imminent.
This initiates a fully operational transitional phase: the drone actively adjusts its orientation in response to the degradation of visual centering, beginning to negotiate the turn before the target completely disappears from view. This phase is essential to reduce oscillations and to initiate a smooth rotation of the camera toward the new direction of the tube.
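A minimal sketch of this early-detection signal is given below; the score is the dot product described above, while the degradation threshold in `turn_imminent` is an illustrative assumption:

```python
import numpy as np

def alignment_score(forward, drone_pos, target_pos):
    """Visual alignment score: cosine between the drone's forward axis and
    the camera-to-target direction (1 = perfectly centered target)."""
    to_target = np.asarray(target_pos, float) - np.asarray(drone_pos, float)
    to_target = to_target / np.linalg.norm(to_target)
    return float(np.dot(forward, to_target))

def turn_imminent(scores, drop=0.15):
    """Flag an imminent turn when the score has degraded by more than `drop`
    over the recent window (threshold value is an illustrative assumption)."""
    return len(scores) >= 2 and (scores[0] - scores[-1]) > drop
```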
(2) Memory–Vision Transition When the Target Leaves the Field of View
When curvature becomes sufficient to push the target outside the camera field of view, the system retains the last valid direction (Equation (13)).
This directional memory, maintained as long as it remains recent, provides the dynamic continuity needed to enter the turn smoothly, extending the strategy initiated during the visual transitional phase.
(3) Geometric Navigation in Prolonged Absence of Vision
If the memory expires, navigation relies solely on the instantaneous analysis of front and rear LiDAR measurements. In a turn, a natural mechanical dissymmetry emerges: the drone’s front tends to remain oriented toward the previously perceived or memorized direction, while the rear, subject to rotational motion and slight lateral drift, tends to move toward the outer wall of the curve.
This dynamic makes the front LiDAR measurements less reliable for assessing lateral safety, as they partly reflect the previously followed direction and become ambiguous in tight curvature. In contrast, the rear LiDAR provides more relevant geometric information: it indicates the drone’s true lateral offset from the walls during rotation, acting similarly to a tactile perception.
Thus, in the absence of vision, the front primarily contributes to orientation, while the rear plays a stabilizing role by preventing collisions with the tube walls during the maneuver. This functional dissymmetry enables the drone to maintain controlled progression through turns even when vision and directional memory are unavailable.
(4) Global Dynamics of Turn Negotiation
The complete strategy produces a coherent, continuous sequence: (i) visual anticipation via degradation of frontal alignment, (ii) directional maintenance using active memory when the target (red dot in Figure 7) becomes invisible, (iii) geometric stabilization based on LiDAR measurements when vision is no longer exploitable.
This three-phase dynamic enables smooth, robust, and physically plausible navigation, without requiring explicit knowledge of the local curvature of the conduit.
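The three-phase selection can be summarized as a simple fallback chain; the memory-expiry horizon `max_age` is an illustrative assumption, as the paper only states that memory is used "as long as it remains recent":

```python
def steering_cue(target_dir, memory_dir, memory_age, lidar_dir, max_age=30):
    """Select the steering cue following the three-phase strategy:
    vision when the target is visible, recent directional memory when it
    is not, and LiDAR-derived geometry once the memory has expired.
    `memory_age` counts steps since the target was last seen."""
    if target_dir is not None:                        # phase 1: target visible
        return target_dir, "vision"
    if memory_dir is not None and memory_age <= max_age:
        return memory_dir, "memory"                   # phase 2: recent memory
    return lidar_dir, "lidar"                         # phase 3: geometric fallback
```

Note that in the learned policy this arbitration is not hard-coded: it emerges from the observation design (visibility flag, memory-availability flag, LiDAR features) and the reward shaping described next.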
3.9. Transition to Reward Shaping
The mechanisms described above—visual anticipation of turns, directional maintenance through memory, and geometric stabilization based on front and rear LiDAR—depend on a tight coordination between perception and drone dynamics. For these behaviors to emerge during learning, the reward function must encode each step of the process into exploitable signals.
The next section therefore details the construction of the instantaneous reward components. These explicitly encode: (i) early turn detection via degradation of visual alignment, (ii) transition to the memory regime when the target disappears, (iii) differentiated use of front and rear LiDAR to ensure safe recentering in curves, (iv) geometric progression and trajectory stability.
The goal of this reward shaping is to provide the drone with signals that are consistent with the navigation principles established in the previous section, enabling it to autonomously reproduce this robust behavior across all encountered configurations.
3.10. Reward Shaping
The reward function was designed to encourage robust navigation in a constrained tubular environment, where the agent operates with predominantly local and partial perception. It relies on a set of signals derived from the LiDAR sensors, the front-facing camera, the memory of the target, and the geometric progression within the tube. The instantaneous reward results from a weighted combination of these components, modulated by the local context (straight segments, turns, presence or absence of the target in the visual field).
3.10.1. LiDAR Features and Turn Detection
At each step, two LiDARs perpendicular to the drone’s axis (front and rear) provide a set of normalized distances in [0, 1]. These measurements are grouped into four sectors (left, right, top, bottom) for both front and rear scans, enabling the definition of horizontal and vertical asymmetries:
where F and R denote the front and rear sensing planes, and L, R, T, and B the left, right, top, and bottom sectors. These asymmetries are used to construct symmetry scores:
where the resulting front and rear scores measure the drone’s local centering within the tube cross-section.
Turn presence is estimated through a confidence indicator:
where the front and rear average LiDAR distances appear. The indicator remains close to 0 in straight segments and approaches 1 in tight turns.
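Since the underlying equations are rendered as images in the original, the following is only a plausible reconstruction of the sector statistics, symmetry scores, and turn-confidence indicator; the exact formulas may differ:

```python
def sector_means(scan):
    """scan: dict mapping sector name -> list of normalized distances in [0, 1]."""
    return {k: sum(v) / len(v) for k, v in scan.items()}

def asymmetries(sectors):
    # Horizontal and vertical asymmetries of one sensing plane (front or rear).
    a_h = sectors["left"] - sectors["right"]
    a_v = sectors["top"] - sectors["bottom"]
    return a_h, a_v

def symmetry_score(a_h, a_v):
    # 1.0 when perfectly centered, decreasing with the absolute asymmetries.
    return 1.0 - 0.5 * (abs(a_h) + abs(a_v))

def turn_confidence(d_front, d_rear):
    # One plausible form: curvature makes the front and rear mean clearances
    # diverge; normalize the gap into [0, 1].
    return min(1.0, abs(d_front - d_rear) / max(d_front, d_rear, 1e-6))
```

In a straight tube the front and rear mean distances agree and the confidence stays near 0; in a tight turn the gap grows and the confidence approaches 1.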
3.10.2. Centering and Alignment
Centering is evaluated via the radial distance to the tube’s local axis. The corresponding score is:
Alignment depends on target visibility:
where the quantities involved are the drone’s heading direction, the local tube axis, and the normalized directions to the visible or memorized target. An instantaneous trajectory score is defined as:
This score (Equation (20)) combines centering and alignment into a single measure of navigation quality.
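The centering and alignment scores and their combination can be sketched as follows; the 0.5/0.5 mixing weight `w` is a hypothetical placeholder, since Equation (20) itself is not reproduced here:

```python
def centering_score(d_radial, tube_radius):
    # 1.0 on the tube axis, 0.0 at the wall (same form as in the quality index).
    return max(0.0, 1.0 - d_radial / tube_radius)

def alignment_score(drone_dir, target_dir):
    # Cosine between unit vectors, rescaled from [-1, 1] to [0, 1].
    dot = sum(a * b for a, b in zip(drone_dir, target_dir))
    return 0.5 * (dot + 1.0)

def trajectory_score(d_radial, tube_radius, drone_dir, target_dir, w=0.5):
    # w is a hypothetical mixing weight combining centering and alignment
    # into a single instantaneous navigation-quality measure.
    return (w * centering_score(d_radial, tube_radius)
            + (1 - w) * alignment_score(drone_dir, target_dir))
```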
3.10.3. Adaptive Weighting Depending on Context
The front and rear contributions can be weighted with two scalars (a front weight and a rear weight) satisfying:
The nominal values define a balanced yet slightly front-biased baseline. This choice reflects the fact that, in nominal forward flight, front LiDAR measurements are more informative for immediate collision avoidance and steering, while rear measurements mainly contribute to stabilizing the agent’s body within the tube. These coefficients are not the result of fine-grained tuning but serve as reference values: they sum to one, which implicitly normalizes the symmetry term and keeps its contribution bounded and comparable across navigation contexts.
In straight segments, where the turn confidence is low, symmetry constraints applied at both the front and rear sensing planes provide a reliable proxy for centered and stable motion. The resulting nominal weights encourage symmetric LiDAR measurements along the entire body of the agent, enforcing global centering without privileging a specific section.
During curved motion, the control objective changes. As the drone advances through a turn, its front section must temporarily tolerate asymmetries to steer toward the target or a memorized direction; enforcing strong front symmetry in this phase would counteract the required rotation. Accordingly, the contribution of the front symmetry score is progressively relaxed, while the rear symmetry score is emphasized to stabilize the agent and promote a smooth coupling between rotation and translation.
This effect is directly encoded by the adaptive weights: in a pronounced turn, the front contribution decreases while the rear contribution increases correspondingly. Importantly, this mechanism does not rely on discrete mode switching but implements a continuous transition between control regimes, smoothly redistributing the relative importance of front and rear symmetry constraints.
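One way to realize this continuous redistribution is a linear shift driven by the turn confidence; the nominal value (0.55) and shift magnitude (0.25) below are hypothetical, since the paper does not expose its numeric coefficients in this passage:

```python
def adaptive_weights(c_turn, w_front_nominal=0.55, shift=0.25):
    """Continuously redistribute front/rear symmetry weights with turn
    confidence c_turn in [0, 1]. The paper requires only that the weights
    sum to one and that the front share decreases as curvature grows."""
    w_front = w_front_nominal - shift * c_turn
    w_rear = 1.0 - w_front  # normalization keeps the pair bounded
    return w_front, w_rear
```

Because the mapping is continuous in `c_turn`, there is no discrete mode switch: entering a turn gradually hands authority from front symmetry to rear symmetry.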
3.10.4. Three Instantaneous Reward Regimes
The instantaneous reward is composed of several normalized components whose relative priorities are encoded by fixed coefficients. In all cases, the symmetry-related term uses the adaptive front and rear weights introduced in the previous section. These weights (Equation (21)) modulate the influence of front and rear LiDAR symmetry according to the turn confidence and are reused consistently across all reward regimes. Depending on target visibility, additional terms are included or excluded, resulting in three distinct formulations.
Case 1—Target Visible
When the target is visible, the alignment term (Equation (19)) is dominant and directly guides the drone toward the target direction. The front and rear symmetry terms maintain centering within the tube, with their relative importance adjusted according to turn intensity through the adaptive weights.
Two secondary terms complement this primary guidance. A clearance term reflects the average front LiDAR distance and encourages trajectories in which no region of the front sensing plane comes too close to the surrounding walls; this contributes to collision avoidance and promotes stable motion, particularly in confined sections. A lateral term penalizes horizontal misalignment of the front, helping to keep the drone laterally aligned with the tube while approaching the target.
The associated coefficients scale these secondary components so that they provide meaningful corrective signals without overpowering the main alignment and symmetry terms. Their role is to refine the trajectory while preserving the dominance of target alignment and symmetry-based centering.
Case 2—Target Absent but Direction Memorized
When the target is no longer visible but its direction is memorized, the alignment term now reflects this stored direction. The symmetry terms remain unchanged and continue to anchor the agent within the tube using the adaptive front and rear weights.
In this partially blind regime, the secondary terms shift from clearance-based information to orientation stabilization. One term penalizes horizontal misalignment relative to the last known target direction, while another penalizes vertical misalignment. Together, these terms support stable orientation and upright flight in the absence of direct visual feedback.
The same scaling principle applies as in Case 1: the coefficients emphasize lateral corrections slightly more than vertical ones, while keeping these terms subordinate to the primary alignment and symmetry signals.
Case 3—Blind Navigation (LiDAR Only)
In the fully blind scenario, no alignment information is available and the alignment term is dropped. Navigation relies entirely on LiDAR-derived cues. The adaptive symmetry term becomes the primary guidance signal, encouraging the drone to remain centered within the tube while accounting for turn-dependent front–rear asymmetries.
The clearance term continues to reward globally free space in the front sensing plane, reducing collision risk even without directional information. An additional term penalizes the difference between the front and rear symmetry scores. Since these scores measure how well centered the drone is at the front and rear sensing planes, their difference captures front–rear imbalance: large values indicate a twisted or poorly centered configuration.
A negative coefficient converts this imbalance into a penalty, encouraging the agent to minimize front–rear asymmetries. This stabilizes both rotation and translation within the tube and prevents oscillatory or drifting behaviors in the absence of visual guidance. The clearance coefficient maintains free space as a strong stabilizing signal, while the imbalance coefficient is kept moderate so that the penalty does not overwhelm the primary symmetry contributions.
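The three regimes can be summarized in a single dispatch function. All coefficients and observation keys below are hypothetical placeholders standing in for the fixed values and signals used in the paper:

```python
def instantaneous_reward(obs, c_turn):
    """Sketch of the three reward regimes (visible / memorized / blind).
    Coefficients are illustrative, not the paper's actual values."""
    # Adaptive front/rear symmetry weights (hypothetical nominal values).
    w_f = 0.55 - 0.25 * c_turn
    w_r = 1.0 - w_f
    sym = w_f * obs["sym_front"] + w_r * obs["sym_rear"]

    if obs["target_visible"]:
        # Case 1: alignment to the visible target dominates; clearance and
        # lateral alignment act as secondary corrective signals.
        return (1.0 * obs["align"] + 0.5 * sym
                + 0.2 * obs["front_clearance"]
                - 0.3 * abs(obs["front_h_misalign"]))
    if obs["memory_valid"]:
        # Case 2: align to the memorized direction; stabilize orientation.
        return (1.0 * obs["align"] + 0.5 * sym
                - 0.3 * abs(obs["front_h_misalign"])
                - 0.2 * abs(obs["front_v_misalign"]))
    # Case 3: blind navigation, LiDAR symmetry is the primary signal,
    # with a penalty on front-rear imbalance.
    return (0.8 * sym + 0.4 * obs["front_clearance"]
            - 0.3 * abs(obs["sym_front"] - obs["sym_rear"]))
```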
3.10.5. Warm-Up Bonus
During the first steps of an episode, an additional bonus term is applied:
This warm-up bonus mitigates early misalignment by temporarily reinforcing front and rear symmetry while penalizing front–rear imbalance. Its influence decreases with turn intensity and vanishes after the warm-up phase, ensuring that it stabilizes initial behavior without biasing long-term navigation.
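One hypothetical form consistent with this description, since the bonus equation itself is not reproduced here, is a symmetry reward attenuated by turn intensity and active only during the warm-up window:

```python
def warmup_bonus(step, c_turn, sym_front, sym_rear,
                 warmup_steps=50, scale=0.2):
    """Hypothetical warm-up bonus: active only during the first steps,
    attenuated by turn confidence, rewarding front and rear symmetry
    while penalizing front-rear imbalance."""
    if step >= warmup_steps:
        return 0.0  # vanishes after the warm-up phase
    decay = 1.0 - c_turn  # influence decreases with turn intensity
    return scale * decay * (sym_front + sym_rear - abs(sym_front - sym_rear))
```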
3.10.6. Terminal Rewards
Three events terminate an episode:
4. Experimental Results
This section presents the full set of results obtained during the development, training, and evaluation of the RL agent. The analyses include: (i) training parameters, (ii) performance evolution, (iii) comparison with a reference algorithm, and (iv) validation in a simulated 3D environment.
4.1. Training Parameters
4.1.1. Model Architecture
The model uses a PPO architecture composed of a fully connected network that receives as input an 84-dimensional observation vector, including simulated LiDAR distances, relative position inside the tube, and drone velocity. The architecture consists of:
two dense layers with 256 and 128 neurons, respectively;
tanh activation for the main layers and ReLU for the final layers;
observation normalization;
an output layer producing continuous actions (linear and angular velocities).
4.1.2. Hyperparameters
Training was performed using the Ray library (version 2.49.2) and its RLlib implementation of the PPO paradigm. The main training hyperparameters are summarized in Table 2. These parameters result from an initial exploration aimed at ensuring return-signal stability and sufficient success rates.
4.1.3. Computation Resources in Python Environment
Training was conducted on a server machine equipped with:
- -
CPU: 2 × AMD EPYC 7F52 (32 physical cores each);
- -
GPU: 4 × NVIDIA (driver 535.230.02), of which 2 were used;
- -
OS: Ubuntu 20.04.6 LTS;
- -
Python version: 3.9.5.
4.2. Training Evaluation
The agent’s performance is evaluated using two metrics: (i) the average success rate, defined as the proportion of episodes in which the drone reaches the end of the tube without collision; (ii) the average return, corresponding to the cumulative sum of rewards obtained in each episode. The evolution of these indicators is shown in Figure 8 and Figure 9.
4.2.1. Average Success per Iteration
Figure 8 illustrates the progression of the success rate throughout training. A rapid performance increase is observed during the first few dozen iterations, corresponding to the learning of the simplest curriculum scenarios (near–rectilinear tubes).
After reaching the second curriculum level, a noticeable drop in success rate appears, reflecting the increased difficulty caused by tighter turns and temporary loss of the vanishing point. This initial degradation is followed by a gradual recovery, indicating that the agent progressively adapts to these more demanding conditions, eventually reaching stabilized behavior before advancing to the final level.
On the last level of the curriculum, the agent stabilizes around an average success rate between 0.75 and 0.80, indicating robust convergence in the most complex environments. This stabilization confirms the ability of the learned policy to handle strongly curved tube geometries, where navigation requires anticipating the trajectory from partial or intermittent visual cues.
4.2.2. Average Return per Episode
Figure 9 shows the evolution of the average return. The curve exhibits steady growth during the early curriculum levels, followed by stabilization phases as complexity increases.
Localized variations in return, particularly noticeable during level transitions, reflect the increasing difficulty of the environments: the tighter the turns, the more the policy must finely adjust its direction while avoiding collisions. Nevertheless, variance gradually decreases over the course of training, indicating an increasingly consistent and reproducible policy.
The final stabilization of the return confirms that the policy adapts well to the most demanding environments in the curriculum, highlighting the effectiveness of the proposed reward shaping.
It is worth noting that the average return decreases in the final curriculum level, stabilizing around values close to 150, whereas it remained between 200 and 250 in earlier levels. This reduction does not indicate a degradation of the learned policy, but rather reflects the intrinsic difficulty of highly curved tube geometries: episodes tend to be shorter, corrective maneuvers are more frequent, and the reward accumulated per time step becomes structurally lower even when the agent succeeds.
Moreover, due to the moderate success threshold used to trigger progression between levels, the agent performs only a limited number of iterations on the final level. As a consequence, the policy receives fewer optimization steps in the most demanding configuration, which partly explains why the return does not rise to the same values as in simpler levels, despite a comparable success rate.
These effects are therefore expected consequences of the reward structure and curriculum design rather than signs of instability in the learning process.
4.2.3. Impact of Curriculum Learning
The presented performances must be interpreted in light of the Curriculum Learning strategy employed. In this first study, the objective is not to maximize the absolute success rate but to ensure feasibility, stability, and convergence of learning across tubular environments of increasing complexity. For this purpose, the success thresholds governing progression between curvature levels (Table 1) were deliberately set to moderate values.
A stricter choice of thresholds (e.g., >0.9) would have increased the risk of stagnation in intermediate levels, especially when curvature leads to intermittent loss of the vanishing point. Excessively high success requirements at this stage would have lengthened training disproportionately, without providing substantial benefit for this proof of concept.
Conversely, the chosen thresholds allow for regular progression while giving the agent sufficient time to adapt to the specific characteristics of each difficulty level. The fluctuations observed in the performance curves therefore do not reflect instability in training, but rather the expected transitions between distinct navigation regimes.
This behavior validates the relevance of combining a directional action space, reward shaping, and controlled difficulty progression, as well as the suitability of the chosen thresholds for this initial demonstration.
4.3. Comparison with the Reference Algorithm
In this section, we experimentally evaluate the performance of the proposed PPO agent by comparing it with the deterministic Pure Pursuit (PP) algorithm, commonly used for trajectory tracking in mobile robotics. The comparison is carried out over the three complexity levels defined in the curriculum, each consisting of 100 independent episodes.
4.3.1. Comparison Framework and Information Asymmetry
It is essential to emphasize that the two methods do not operate under the same informational conditions. Pure Pursuit benefits from a major structural advantage: the algorithm has a priori access to the exact geometric axis of the tube, which it uses as a reference to compute a tracking command. The ideal trajectory is therefore provided explicitly and with perfect accuracy.
In contrast, the PPO agent has no information about the true geometry of the tube. It must infer the direction of progression from partial observations:
- -
local LiDAR measurements describing only wall proximity;
- -
visual detection of the tube center when it is within the camera’s field of view;
- -
implicit estimation of trajectory continuity when the vanishing point disappears in strongly curved regions.
Thus, PP operates in a context of complete information, whereas PPO must solve a navigation problem in a partially observable environment. The fact that PPO exceeds PP on several metrics should therefore be interpreted in light of this fundamental asymmetry.
4.3.2. Comparison Results
The results obtained are presented in Table 3. The selected metrics are: (i) success rate, (ii) tube-exit rate (failure without collision but loss of confinement), (iii) a navigation quality index aggregating centering and alignment.
Definition of the Navigation Quality Index
The navigation quality index Q used in our study is directly derived from the measurements collected at each simulation step. At every instant, two metrics are computed when the drone is inside the tube:
- -
a centering score, defined as 1 − d/R, where d is the radial distance of the drone to the local tube axis and R the tube radius;
- -
an alignment score, corresponding to the dot product between the drone’s direction and the estimated tube or vanishing-point direction (via camera or memory), normalized to [0, 1].
At the end of the episode, the average centering and alignment scores are computed, and the quality index is defined as their equally weighted average:
A high value of Q corresponds to a trajectory that is well centered and well aligned with the local tube direction, regardless of whether the episode ends in success or failure. It should be noted that this index is not a global performance measure, but rather an indicator of the geometric cleanliness of the trajectory. Therefore, Q provides complementary insight beyond success and exit rates and helps highlight differences in trajectory quality between approaches.
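The index follows directly from its definition; a minimal implementation over the per-step logs collected inside the tube:

```python
def quality_index(centering_scores, alignment_scores):
    """Q as defined in the text: equally weighted average of the per-episode
    mean centering score and mean alignment score."""
    c_bar = sum(centering_scores) / len(centering_scores)
    a_bar = sum(alignment_scores) / len(alignment_scores)
    return 0.5 * (c_bar + a_bar)
```

Because Q averages over the whole episode, a trajectory that exits the tube late can still score well geometrically, which is why Q complements, rather than replaces, the success and exit rates.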
The key observation is that, although the Pure Pursuit baseline has prior knowledge of the tube geometry, this does not always guarantee smoother or better-centered trajectories compared to the RL-based approach. As a result, even with access to a reference path, Pure Pursuit can still experience tube exits when local tracking inaccuracies accumulate.
In contrast, the RL-based agent does not rely on any prior knowledge of the tube trajectory. Instead, it infers navigation cues from onboard perception when available and otherwise operates in a reactive manner under partial or no visibility. Despite this lack of explicit geometric information, the experimental results show that the RL approach achieves higher success rates and lower exit rates.
4.3.3. Performance Analysis
Success Rate
At all curriculum levels, PPO achieves a substantially higher success rate than PP. The gap is especially pronounced at the intermediate level (71% vs. 34%), where tube curvature strongly disrupts Pure Pursuit due to its reliance on continuous visibility of the ideal trajectory.
Robustness and Tube Exits
PPO consistently and significantly reduces the tube-exit rate:
- -
34% → 16% at level 0,
- -
66% → 29% at level 1,
- -
27% → 13% at level 2.
This indicates that the learned policy is better equipped to correct trajectories when the environment imposes tight turns or when the tube center is not visible.
Interpretation of the Quality Index
While Pure Pursuit achieves a slightly higher average quality index—reflecting more aligned and centered paths—this difference remains moderate and must be interpreted cautiously. PP benefits from direct access to the ideal trajectory, allowing it to produce cleaner movements, but this does not guarantee its ability to remain within the tube when geometry becomes complex.
In contrast, PPO may produce less smooth trajectories at times, but systematically prioritizes viable navigation. This strategy results in a far higher success rate at all levels: the RL agent maintains the drone inside the tube even when visual cues become ambiguous or when curvature forces PP into maneuvers it cannot anticipate.
4.3.4. Summary and Significance of the Results
In a context where only the PPO agent must implicitly infer the tube structure from sparse sensor signals, while Pure Pursuit relies on complete geometric knowledge, the results are significant:
- 1.
PPO exceeds Pure Pursuit in success rate across all complexity levels;
- 2.
PPO exhibits more robust navigation, with a clear reduction in tube exits;
- 3.
the learned agent demonstrates an ability to implicitly reconstruct the direction of progression, even in zones where visual information is partial or absent.
These findings show that the reinforcement learning approach does more than merely imitate Pure Pursuit: it acquires a navigation capability that is not accessible to a geometric controller relying on full model knowledge. This opens promising perspectives for autonomous navigation in unknown or unmodeled tubular environments, such as natural tunnels, industrial infrastructures, or anatomical channels.
4.4. Validation in a High-Fidelity Simulated Environment
To test the agent in a setting close to real-world conditions, we developed a Unity scene designed to provide a faithful three-dimensional visualization of the agent’s navigation inside an industrial-type tubular conduit. Although perception and interactions with the tube walls are computed on the Python side to remain consistent with the training environment, the Unity 3D engine provides a physically coherent simulation based on a rigid-body model subject to gravity. This integration makes it possible to obtain and visualize a continuous inertial evolution of the drone under the control commands issued by the agent’s policy.
At this stage, the visualization serves primarily as a complementary tool to qualitatively illustrate the learned behavior. Future steps will include: (i) integrating a more realistic sensor model (camera, measurement noise), (ii) increasing the fidelity of the drone dynamics in Unity, (iii) conducting tests in real geometries digitized or reconstructed from 3D scans.
4.4.1. Description of the Unity Physical Model
The drone’s motion is described using classical Newtonian equations governing the translation and rotation of a rigid body. Linear dynamics are modeled as m dv/dt = m g + F, where m is the drone mass, v its linear velocity, g the gravity vector, and F the force resulting from the RL command transformed by Unity (desired direction and speed level).
Orientation is updated using rigid-body rotational dynamics I dω/dt + ω × (I ω) = τ, where ω is the angular velocity, I the inertia tensor, and τ the torque derived from the agent’s directional command.
Unity thus acts as a robust inertial integration engine: at each control interval, it computes the next pose from the current state and the applied command, ensuring dynamic continuity, gravity effects, and numerically stable orientation updates.
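For the translational part, one integration step reduces to a semi-implicit Euler update of m dv/dt = m g + F. The sketch below covers translation only and is not Unity's actual solver, which also handles rotation and contacts:

```python
def integrate_translation(pos, vel, force_cmd, mass, dt,
                          g=(0.0, 0.0, -9.81)):
    """One semi-implicit Euler step of the linear dynamics m*dv/dt = m*g + F.
    pos, vel, force_cmd are 3-tuples in world coordinates."""
    # Acceleration from gravity plus the commanded force.
    acc = tuple(g[i] + force_cmd[i] / mass for i in range(3))
    # Update velocity first, then position with the new velocity.
    new_vel = tuple(vel[i] + acc[i] * dt for i in range(3))
    new_pos = tuple(pos[i] + new_vel[i] * dt for i in range(3))
    return new_pos, new_vel
```

With a commanded force exactly cancelling gravity, the drone hovers; with no force it enters free fall, which is why the stabilizing vertical term described below is added to the raw policy action.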
4.4.2. Data Exchange Between Python and Unity
The actions produced by the agent specify only a desired movement direction in the drone’s visual space and a speed level. Before being transmitted to the Unity physics engine, this direction is normalized and may be adjusted by a small stabilizing vertical term. This term’s only purpose is to partially counteract gravity, preventing a drone initially in hover from immediately dropping in the absence of explicit thrust commands. This preprocessing step does not modify the underlying dynamic model: it merely translates the agent’s abstract action into the force applied to the rigid body.
The physical simulation itself is fully managed by Unity. At each control step requested by Python, Unity performs exactly one dynamic integration according to Newtonian equations, simultaneously applying the command force, gravity, inertial effects, and any potential contacts with the environment. Gravity is therefore intrinsic to Unity’s physics engine and is applied automatically at every simulation step, independent of the control frequency imposed by Python.
After this integration step, Unity sends back to Python the resulting dynamic state, consisting of the drone’s 3D position, its orientation as a quaternion, and its linear and angular velocities. This information forms the physical basis from which Python reconstructs the drone’s local reference frame and feeds all perception and evaluation modules.
Meanwhile, perception, tube geometry, and progression evaluation remain entirely handled on the Python side. This environment is responsible for: (i) computing the derived LiDAR measurements, (ii) verifying that the drone remains inside the tubular structure, (iii) estimating longitudinal progression, (iv) detecting episode termination conditions.
4.4.3. Exchange Protocol Between Python and Unity via Shared Memory
As illustrated in Figure 10, real-time communication between the Python environment (which computes the action and navigation indicators) and Unity (which integrates inertial dynamics) relies on a shared-memory mechanism. This approach enables bidirectional, very low-latency communication without network overhead, ensuring precise synchronization between the two execution engines. It further benefits from relying solely on native functionalities: MemoryMappedFile in C# for Unity and multiprocessing.shared_memory in Python, avoiding external dependencies and ensuring maximal portability.
Two shared-memory buffers are used:
- -
a Python → Unity buffer containing the instantaneous command issued by the RL policy, already adjusted to the Unity physical environment (desired direction and forward speed);
- -
a Unity → Python buffer storing the updated drone pose produced by the physics engine (position, orientation, and optionally velocities).
At each control step, Python writes into the first buffer the action computed by the agent, which Unity can immediately read and interpret as a physical command applied to the drone’s rigid body. Symmetrically, Unity writes the updated pose into the second buffer after performing inertial integration over one control interval. Python then reads this pose in order to:
(i) estimate the drone’s progression along the tube’s centerline; (ii) evaluate centering and alignment metrics; (iii) verify trajectory validity (remaining inside the tube, no collision); (iv) construct the next observation fed to the agent.
This shared-memory protocol offers several advantages: (1) it avoids any network overhead, (2) it guarantees constant and minimal access latency, (3) it enables a high control frequency compatible with a real-time physics engine, (4) it relies exclusively on standard mechanisms already available in both environments.
The Python–Unity coupling thus operates as a synchronous closed loop: Python writes the command, Unity executes the dynamics and writes the updated pose, and Python interprets this pose to produce the next action.
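The Python side of this protocol can be sketched with the standard library alone. Buffer layouts below (a 4-float command, a 13-float pose) are assumptions for illustration; the demo uses anonymous regions, whereas the actual coupling would open fixed, named regions that the C# MemoryMappedFile side opens as well:

```python
import struct
from multiprocessing import shared_memory

CMD_FMT = "4f"    # desired direction (x, y, z) + speed level
POSE_FMT = "13f"  # position (3) + quaternion (4) + linear vel (3) + angular vel (3)
CMD_SIZE = struct.calcsize(CMD_FMT)
POSE_SIZE = struct.calcsize(POSE_FMT)

def write_command(shm, direction, speed):
    # Python -> Unity buffer: the instantaneous command from the RL policy.
    shm.buf[:CMD_SIZE] = struct.pack(CMD_FMT, *direction, speed)

def read_pose(shm):
    # Unity -> Python buffer: the updated pose after one integration step.
    return struct.unpack(POSE_FMT, bytes(shm.buf[:POSE_SIZE]))

# Anonymous regions for this self-contained demo.
cmd_shm = shared_memory.SharedMemory(create=True, size=CMD_SIZE)
pose_shm = shared_memory.SharedMemory(create=True, size=POSE_SIZE)

write_command(cmd_shm, (0.0, 0.0, 1.0), 0.5)
# ... here Unity would read the command, integrate the dynamics, then write back:
pose_shm.buf[:POSE_SIZE] = struct.pack(POSE_FMT, *([0.0] * 13))
pose = read_pose(pose_shm)

cmd_shm.close(); cmd_shm.unlink()
pose_shm.close(); pose_shm.unlink()
```

Each side only ever writes to its own buffer and reads from the other's, which keeps the synchronous closed loop free of write conflicts.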
4.4.4. Use of the Blender 3D Modeling Tool
As illustrated in Figure 11, we modeled the industrial tubular conduit and its geometric centerline in Blender, then exported them separately in .obj format. The two environments make use of these files in the following way:
- -
In Unity, the .obj file of the tubular conduit is imported directly to provide a realistic 3D visualization of the drone’s inertial dynamics inside the structure;
- -
In Python, both the .obj conduit file and its centerline are loaded to manage tube-center perception and wall interactions during inference.
Figure 11.
(a) View of the industrial tubular structure in Blender. (b) View of the industrial tubular structure in a Python window (Vedo package). (c) View of the industrial tubular structure’s geometric centerline in Blender. (d) View of the industrial tubular structure’s geometric centerline in a Python window (Vedo package).
4.4.5. Inference Algorithm
The complete inference process of the drone with Unity-assisted inertial integration is summarized in Algorithm 1.
Figure 12 shows the motion of the drone inside the industrial tubular structure within the Unity 3D scene.
Algorithm 1: Inference of the trained navigation agent with Unity-assisted physical integration.
4.5. Summary
The experimental results demonstrate that the reinforcement learning approach adopted in this work enables reliable autonomous navigation in tubular environments of increasing complexity. The agent progressively acquires an effective strategy thanks to Curriculum Learning, which facilitates the transition from simple rectilinear scenarios to highly curved conduits where the tube center becomes intermittently visible. The observed convergence, between 75% and 80% success in the most demanding environments, shows the policy’s ability to anticipate the local geometry from partial visual and LiDAR cues.
The comparison with the deterministic Pure Pursuit algorithm highlights a central point: despite a major information asymmetry—PP has exact knowledge of the tube’s geometric axis while PPO must infer it indirectly—the learned policy systematically outperforms the classical controller in success rate and robustness. The trajectories produced by PPO are sometimes less smooth, but they allow the drone to remain confined within the tube in situations where geometry or visibility make strict tracking of an ideal trajectory ineffective.
Finally, inertial validation in Unity confirms that the learned strategy remains functional in a more realistic physical setting than that used during training. The agent maintains stable trajectories under continuous dynamics and gravity, showing that the learned behaviors do not rely on specific kinematic simplifications.
Overall, the results validate the combined relevance of (i) a directionally constrained yet expressive action space, (ii) reward shaping tailored to confined environments, and (iii) a controlled increase in task complexity. The trained agent proves not only effective but also capable of generalizing to situations in which a controller relying on explicit geometric modeling fails.
5. Discussion and Future Perspectives
The results obtained highlight several design choices that contributed to the success of the approach, while also revealing limitations that may guide future research.
5.1. Action Space and Modeling Limitations
The discrete action space used in this work was specifically designed for navigation in narrow tubular environments. By restricting orientations to those lying within the drone’s visual cone and pointing forward, it avoids dangerous or irrelevant maneuvers while reducing learning complexity. This pragmatic choice enabled the rapid emergence of stable behaviors.
However, this action space also presents limitations. First, the direct coupling between perception and action—only visually plausible directions are allowed—simplifies the problem but does not reflect the capabilities of a real quadrotor, which can initiate maneuvers outside its field of view. Second, the drone dynamics were modeled in an essentially kinematic manner during training.
These simplifications provide a solid foundation but naturally call for extensions: densification or regularization of the directional grid, introduction of continuous actions, or integration of a richer dynamic model allowing explicit handling of forces and torques.
Importantly, the proposed framework does not account for aerodynamic disturbances induced by close proximity to walls, such as lateral ground effects, flow deviations, or turbulence generated in narrow sections. These phenomena, common in confined environments, can significantly affect the real stability of a quadrotor and are absent from the current model. Incorporating them—whether via simplified aerodynamic simulators or experimental data—represents a key perspective for bringing the agent closer to real operational conditions.
5.2. Turn Negotiation and Partial Information Handling
Turn negotiation is the most challenging scenario in a tubular environment. The current strategy relies on a combination of three mechanisms: (i) direct use of the visual target when available, (ii) a memory mode that temporarily prolongs the previous direction, and (iii) exploitation of LiDAR symmetries to detect and follow the local geometry.
This approach has proven effective, but it also shows limitations. The use of directional memory may lead to suboptimal behaviors when the geometry changes abruptly, and LiDAR symmetry measures are sensitive to complex or anisotropic structures. Natural extensions include the introduction of recurrent models (LSTM or GRU), explicit estimation of the tube's local curvature, or richer 3D perception to better characterize the surrounding geometry.
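The priority cascade over the three mechanisms can be illustrated with a minimal sketch. All names, the scalar-yaw abstraction, the memory horizon, and the symmetry gain are hypothetical choices made for this example; the actual policy learns this behavior rather than executing hand-written rules.

```python
MEMORY_HORIZON = 10  # steps over which the last heading is prolonged (assumed)

def select_heading(visual_target, last_heading, memory_left, lidar_ranges,
                   gain=0.5):
    """Illustrative priority cascade; headings are scalar yaw angles (rad).

    lidar_ranges is a left-to-right horizontal scan. Returns the chosen
    heading and the remaining memory budget.
    """
    if visual_target is not None:
        # (i) tube center visible: steer toward it and refill the memory
        return visual_target, MEMORY_HORIZON
    if memory_left > 0:
        # (ii) memory mode: temporarily keep the previous heading
        return last_heading, memory_left - 1
    # (iii) LiDAR symmetry: yaw toward the side with more mean clearance
    half = len(lidar_ranges) // 2
    left = sum(lidar_ranges[:half]) / half
    right = sum(lidar_ranges[half:]) / (len(lidar_ranges) - half)
    return last_heading + gain * (left - right), 0
```

The fragility noted above is visible in mechanism (iii): when the scan is anisotropic (e.g., a side cavity), the left/right clearance imbalance no longer indicates the conduit direction, which motivates the recurrent or curvature-based extensions.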
5.3. Visual Perception: From Ideal Signal to Real Onboard Vision
In this work, the direction of the tube center is assumed to be detected ideally whenever it appears in the field of view. This assumption facilitates analysis of the navigation behavior but does not reflect the challenges of real onboard imaging. Illumination, textures, reflections, or motion blur strongly affect the visual estimation of a vanishing point or main conduit direction.
A key perspective is therefore to replace this idealized module with a real estimator, based for example on segmentation networks, geometric detection pipelines, or an end-to-end perception–control approach. Such a transition would bring the system closer to real operational conditions while enabling co-adaptation between learned perception and navigation strategy.
5.4. Generalization and Extension to Other Tubular Environments
Although developed for aerial navigation in industrial, natural, or underground environments, the proposed framework has a much broader scope. The challenges addressed here—progression in a confined conduit, partial local perception, intermittent loss of visibility, curvature anticipation—also arise in medical applications, particularly in endoscopy or robotic navigation inside anatomical canals. In such contexts, the mechanisms developed here (partial information handling, turn negotiation, compact geometric state representation) form a promising basis for miniature or semi-autonomous navigation approaches. The work presented here can also be extended to drone navigation in enclosed natural environments such as caves. An experiment based on field data will be carried out as part of the CAVESTOUR project (https://interreg-marittimo.eu/fr/web/cavestour/, accessed on 2 January 2026). The experimentation will use a high-resolution 3D reconstruction of the Baume Obscure cave (France) to simulate autonomous navigation in a real environment. The results of this work can likewise be extended to the navigation of underwater drones, and experiments in the context of exploring submerged pipelines will also be carried out.
5.5. Global Perspectives
Future work may focus on:
- integrating realistic real or synthetic visual perception;
- exploring continuous and dynamically coherent action spaces;
- incorporating full drone dynamics during training;
- extending to even more complex tubular geometries (helical or branched);
- experimental validation on a real robotic platform.
Thus, the methodology presented constitutes a robust initial foundation for learning autonomous behaviors in confined environments, while opening pathways toward transdisciplinary developments ranging from industrial robotics to miniature medical robotics.
6. Conclusions
This work presents a reinforcement learning approach for autonomous drone navigation in confined tubular environments, a context characterized by restrictive geometry, partial perception, and intermittent visibility loss. The proposed framework—directional action space, tailored reward shaping, Curriculum Learning progression, and inertial integration through Unity—enabled the training of a robust policy capable of generalizing to unknown geometries.
The results show that the PPO agent learns stably to negotiate both rectilinear and highly curved conduits, achieving success rates between 75% and 80% in the most complex scenarios. The comparison with the Pure Pursuit algorithm reveals a key insight: despite a significant information asymmetry—PP has access to the exact geometric axis of the tube, whereas PPO must infer it solely from LiDAR and visual cues—the learned policy consistently outperforms the classical controller in robustness and tube retention. These results demonstrate that the agent can implicitly reconstruct the direction of progression, even when the conduit center disappears from the field of view in sharp turns.
The 3D validation in Unity confirms that the learned behavior remains effective under continuous inertial dynamics, representing an important step toward deployment in real-world conditions. The proposed approach therefore provides a unified framework, spanning from kinematic learning to realistic physical evaluation, while maintaining geometric consistency through the use of a shared 3D model.
Beyond industrial or natural tubular environments, the developed methodology has broader applicability: navigation in confined spaces, automated inspection, subterranean robotics, or even guidance within anatomical channels in medical robotics. The mechanisms studied—partial perception management, turn negotiation, local geometry anticipation—are directly transferable to these fields.
Future perspectives include integrating more realistic real or synthetic visual perception, adopting continuous or dynamically coherent action spaces, using recurrent models for improved partial-information handling, testing on more complex or branching tubular geometries, and accounting for aerodynamic disturbances inherent to confined environments. Ultimately, experimentation on a real robotic platform will constitute the decisive step in validating the operational applicability of the approach.
Overall, this work shows that a reinforcement learning agent can not only master navigation in an unknown conduit but also surpass a classical solution built upon complete geometric knowledge. It thus opens the path toward autonomous systems capable of operating reliably in confined environments where explicit modeling is difficult or impossible.