1. Introduction
The exploration and development of oil and natural gas is an engineering practice that probes the unknown regions of the deep earth. Although traditional vertical drilling technology is mature and reliable, it increasingly shows its limitations in the face of complex and changeable underground geological conditions, increasingly stringent environmental restrictions, and growing demands for economic efficiency. Many oil and gas reservoirs worth exploiting are not located directly beneath accessible surface locations; they may lie beneath mountains, lakes, oceans, or sensitive areas such as cities, farmland, and nature reserves. Their shapes are often not regularly layered but inclined, fractured, lenticular, or embedded in complex tectonic zones. In addition, a single vertical well can usually only exploit a limited range of oil and gas resources near its wellbore. This limitation becomes more critical as modern drilling operations increasingly encounter deep, faulted, and lithologically heterogeneous formations, where operational safety and efficiency depend on adaptive and robust trajectory control strategies.
To overcome these challenges, directional drilling technology emerged. Its core is the precise control of the wellbore's extension path and final target in three-dimensional underground space, that is, wellbore trajectory design. Engineers must carefully plan a three-dimensional path that skillfully avoids geological risks (such as faults, high-pressure layers, and salt-gypsum layers), accurately hits multiple dispersed oil and gas targets, maximizes contact with the reservoir (increasing the drainage area), and satisfies surface and engineering constraints (such as platform location and anti-collision requirements).
Therefore, wellbore trajectory design is a core technology of oil and gas drilling engineering. Its accuracy and reliability directly determine the safety, efficiency, cost, and ultimate recovery of drilling operations [1]. The design process faces severe challenges in complex geological environments (e.g., faults, salt-gypsum layers, high-pressure formations, unstable shale formations, and multi-target wells). Dynamic geological conditions (such as formation pressure fluctuations, abrupt lithology changes, and differences in drillability) and complex spatial obstacles (such as adjacent wells, previously drilled wells, and geological risk zones) demand trajectory design that is highly flexible and capable of intelligent avoidance [2]. However, traditional trajectory design methods (such as those based on geometry or optimization theory) often rely on preset models and static assumptions [3]; they struggle to adapt efficiently and dynamically to such complex, changing real-time environmental information, and their intelligence and ability to handle complex constraints are clearly limited.
Reinforcement learning (RL) has shown considerable promise in trajectory planning under complex drilling conditions [4,5,6]. Liu et al. [7] applied a deep Q-network (DQN) to enhance intelligent trajectory control. Fan and Chen [8] further introduced an attention-based improvement (NDQN) to enable online decision-making. Wang [9] proposed a deep deterministic policy gradient (DDPG)-based control algorithm, later enhanced via transfer learning for trajectory tracking. Jian [10] addressed convergence issues by developing a double dueling DQN (D3QN) with a refined reward function. Peshkov and Pavlov [11] utilized artificial intelligence (AI) with logging-while-drilling (LWD) data to optimize well trajectories in real time. Other researchers [12,13,14] explored borehole control using deep learning models. While these methods demonstrate progress, challenges remain in spatial adaptability and robustness in highly heterogeneous formations.
The Soft Actor-Critic (SAC) algorithm shows significant advantages in reinforcement learning by virtue of its maximum-entropy design [15], especially in exploration efficiency and learning stability, which makes it a strong candidate for complex control problems. Challenges remain, however, when it is applied to problems such as wellbore trajectory design in complex geological environments. Such environments are usually highly nonlinear, strongly coupled, and strongly dependent on state history. Faced with these characteristics, the standard SAC algorithm has difficulty capturing and exploiting long sequences of historical state information and accurately predicting the dynamic evolution of the environment. This deficiency is a key factor limiting further performance improvements of SAC in such complex scenes.
Recent research has extended reinforcement learning applications in trajectory optimization by introducing more robust network architectures and domain adaptations. For instance, researchers have proposed Transformer-based attention modules for capturing long-term dependencies in time-series control tasks [16], as well as advanced reward-shaping techniques to handle sparse and delayed feedback [17]. In addition, hybrid strategies combining SAC with other Actor-Critic variants such as TD3 or PPO have been successfully applied to geosteering, toolface orientation prediction, and adaptive bit path control [18,19,20,21,22]. These developments further validate the potential of reinforcement learning in trajectory design and provide technical references for our method.
To address these problems, this paper deeply integrates a self-attention mechanism into the SAC framework. The mechanism dynamically analyzes the dependencies between elements of the input sequence and assigns different weights to states at different time steps. With this mechanism, the SAC algorithm can capture long-term dependencies, identify and exploit the historical state information most relevant to the current decision, and focus in real time on the key signals in the evolving geological environment, thereby improving its response to and prediction of environmental dynamics; at the same time, it learns a richer and more discriminative state feature representation, enhancing the state representation capability.
To make the trajectory optimization design better match actual geological conditions and enhance the applicability of the model across scenarios, Wasserstein generative adversarial network (WGAN) technology is applied, for the first time in this study, to the generation of 3D geological model data. With this technology, high-quality and highly realistic geological data are created. These data reproduce the complex spatial distribution and statistical characteristics of geological structures in the target area, such as stratigraphic interfaces, fault strikes, lithology distributions, and pore pressure fields, providing training data close to the real geological environment and making the optimization results more valuable in practical engineering.
For the optimization objectives, this study establishes a multi-objective, weighted-combination reward function system that accounts for several key factors: trajectory smoothness, which reduces friction and torque by limiting dogleg severity to ensure drilling efficiency and string safety; obstacle-avoidance safety, which strictly avoids obstacles such as adjacent wells and geological risk areas to ensure safe drilling operations; target accuracy, which strives to reach the predetermined geological target point precisely; geological adaptability, which guides the trajectory through high-quality reservoirs and around unfavorable areas such as high-pressure layers and collapse-prone strata; and construction feasibility, which respects engineering constraints such as build-rate limits, tool performance, and measurement accuracy. By reasonably allocating the weight of each objective, these goals are optimized collaboratively and unified organically.
In summary, this study constructs a new intelligent wellbore trajectory optimization method for complex geological environments by integrating a self-attention-enhanced SAC algorithm, WGAN geological modeling technology, and a multi-objective reward system. The results not only significantly improve the intelligence level and optimization performance of trajectory design in dynamic, complex environments, but also provide new theoretical support and a technical path for safer, more efficient, and more accurate geosteering in oil and gas drilling engineering. While deep reinforcement learning (DRL) has shown promise in trajectory optimization tasks, most existing approaches overlook fine-grained geological heterogeneity and lack spatial perception capabilities; moreover, the integration of attention mechanisms with SAC-based drilling agents remains underexplored. This study addresses these gaps by designing an attention-enhanced SAC model tailored for adaptive borehole path planning under complex geological constraints. The remainder of this paper is organized as follows:
Section 2 introduces the SAC algorithm and its integration with the self-attention mechanism.
Section 3 details the reinforcement learning environment, including geological data generation, state-action design, and reward system construction.
Section 4 presents experimental design, result analysis under multiple obstacle scenarios, and a comparative study of the proposed method. Finally,
Section 5 summarizes the key findings and outlines the potential for future research.
2. Materials and Methods
The SAC algorithm [8] consists of three core parts. First, through the maximum entropy framework, it encourages the agent to explore unknown environments while improving model stability. Second, an off-policy update mechanism improves sample efficiency by repeatedly reusing historical data. Third, it is built on the Actor-Critic (AC) architecture, in which the policy network and the value networks operate independently: the policy network (Actor) generates the probability distribution of actions and guides the agent's decisions, while the value network (Critic) evaluates the Q-value of state-action pairs under the current policy and feeds this evaluation back to optimize the policy. The two complement each other and jointly ensure the efficient operation of the algorithm.
2.1. Data Collection and Generation Method
The 3D geological data used for wellbore trajectory optimization in this study were generated using Wasserstein generative adversarial network (WGAN) technology, with public logging data from Daqing Oilfield as the foundational training samples. The WGAN model consists of a generator and a discriminator, which operate through an adversarial mechanism: the generator is responsible for producing simulated geological data, while the discriminator evaluates the similarity between the generated data and real samples (derived from Daqing Oilfield logging data). By reconstructing the loss function using Wasserstein distance, this approach effectively addresses the training instability of traditional GANs in scenarios with sparse geological data, avoids mode collapse (where generated data becomes overly homogeneous), and stabilizes gradient updates during training. The generated data accurately reproduce complex spatial distributions and statistical characteristics of geological features in the target area (based on Daqing Oilfield’s geological properties), including stratigraphic interfaces, fault strikes, lithology distributions (e.g., sandstone, shale), and pore pressure fields. These data are output in 3D point cloud format (.ply) and mapped to the reinforcement learning environment, providing real-time geological constraints for trajectory optimization.
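For illustration, a minimal PyTorch sketch of the WGAN training loop described above is given below; the network sizes, feature dimensionality, critic-update schedule, and weight-clipping bound are assumptions for demonstration, not the configuration used in this study.

```python
# Minimal WGAN sketch for synthetic geological samples (illustrative only;
# feature count and training schedule are assumptions).
import torch
import torch.nn as nn

latent_dim, feat_dim = 64, 8          # e.g., depth, GR, DEN, CNL, RT, DT, x, y

class GeoGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))
    def forward(self, z):
        return self.net(z)

class GeoCritic(nn.Module):            # Wasserstein critic (no sigmoid output)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1))
    def forward(self, x):
        return self.net(x)

G, D = GeoGenerator(), GeoCritic()
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)
real_data = torch.randn(4096, feat_dim)             # placeholder for normalized logging samples

for step in range(1000):
    for _ in range(5):                               # several critic updates per generator update
        idx = torch.randint(0, real_data.shape[0], (256,))
        real = real_data[idx]
        fake = G(torch.randn(256, latent_dim)).detach()
        loss_d = -(D(real).mean() - D(fake).mean())  # maximize the Wasserstein estimate
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        for p in D.parameters():                     # weight clipping enforces the Lipschitz constraint
            p.data.clamp_(-0.01, 0.01)
    fake = G(torch.randn(256, latent_dim))
    loss_g = -D(fake).mean()                         # generator pushes critic score of fakes upward
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```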
2.2. Theoretical Basis of Maximum Entropy Reinforcement Learning
In standard reinforcement learning, the objective function is to maximize the expected cumulative reward:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \gamma^{t}\, r(s_t, a_t)\right]$$

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is the state-action sequence, $t$ is the time step index, $\gamma$ is the discount factor, and $r(s_t, a_t)$ is the instant reward function.
Maximum entropy reinforcement learning adds the entropy of the policy to the objective [23]:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \gamma^{t}\bigl(r(s_t, a_t) + \alpha\, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr)\bigr)\right]$$

where $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a \sim \pi}\left[\log \pi(a \mid s_t)\right]$ is the entropy term and $\alpha$ is the entropy coefficient, which controls the strength of exploration. A larger $\alpha$ encourages more exploration, meaning a more diverse policy; a smaller $\alpha$ encourages a more deterministic policy.
2.3. SAC Algorithm Network Architecture and Objective Function
SAC adopts the Actor-Critic structure: the policy network (Actor) represents the learned mapping from states to a probability distribution over actions, and the value networks (Critic) evaluate the Q-value of state-action pairs [24]. SAC is an off-policy reinforcement learning algorithm, and its network architecture is shown in Figure 1.

The Critic network loss function is

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\tfrac{1}{2}\Bigl(Q_\theta(s_t, a_t) - \bigl(r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\bigl[V_{\bar{\psi}}(s_{t+1})\bigr]\bigr)\Bigr)^2\right]$$

The Actor network loss function is

$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\phi}\bigl[\alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t)\bigr]$$

SAC automatically controls the policy entropy by adaptively updating the entropy coefficient $\alpha$.
In Figure 1, the SAC network consists of parallel Actor and Critic modules. The left pathway is the policy network (Actor): it takes observed state sequences as input, passes them through embedding and dense layers, and outputs a probability distribution over actions. The right pathway is the Critic, comprising two Q-value estimators and a value estimator that evaluate the quality of state-action pairs. The color-coded blocks denote network layers (blue for input embedding, gray for intermediate processing, and orange/red for output heads), and the arrows indicate the direction of data flow from state input to the action and value outputs.
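For illustration, a minimal PyTorch sketch of the Actor and twin Q-value Critics summarized above, together with the SAC policy objective from Section 2.3, is given below; the layer sizes and the squashed-Gaussian policy head are assumptions, not the exact configuration of Figure 1.

```python
# Illustrative SAC actor/critic heads (sizes are assumptions).
import torch
import torch.nn as nn

state_dim, action_dim, hidden = 12, 2, 256

class Actor(nn.Module):
    """Outputs a squashed Gaussian action distribution."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, s):
        h = self.body(s)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                        # reparameterized sample
        a = torch.tanh(u)                         # squash to a bounded action
        # log-probability with tanh correction, summed over action dimensions
        logp = (dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1)
        return a, logp

class QNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

actor, q1, q2 = Actor(), QNet(), QNet()           # two Q estimators, as in Figure 1
s = torch.randn(32, state_dim)
a, logp = actor(s)
q_min = torch.min(q1(s, a), q2(s, a))
alpha = 0.2                                       # entropy coefficient (can be auto-tuned)
actor_loss = (alpha * logp - q_min).mean()        # SAC policy objective
```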
2.4. The Basic Idea of Attention Mechanism
The attention mechanism was first proposed for machine translation tasks, allowing the model to dynamically focus on different parts of the input sequence. Its core idea is to assign different weights to the elements of the input sequence, improving the model's ability to process long sequences [25,26,27,28,29].

A weight $\alpha_i$ is calculated for each position $i$, indicating the importance of that position to the final output:

$$\alpha_i = \frac{\exp(e_i)}{\sum_{j} \exp(e_j)}$$

where $e_i$ is the score for the $i$-th position, calculated as

$$e_i = \frac{q^{\top} k_i}{\sqrt{d_k}}$$

where $q$ is the query vector, $k_i$ is the key (and value) vector of position $i$, and $d_k$ is the feature dimension used as a scaling coefficient to prevent gradient vanishing or explosion.
2.5. Self-Attention Mechanism
The self-attention mechanism is a special form of attention in which the Query, Key, and Value matrices are all derived from the same sequence data. Its network structure is shown in Figure 2. The attention output is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad Q = XW^{Q},\quad K = XW^{K},\quad V = XW^{V}$$

where $W^{Q}$, $W^{K}$, and $W^{V}$ are learnable parameter matrices and $X$ is the input sequence.

To further enhance the model's representational capacity, multiple attention patterns are learned in parallel, i.e., the multi-head attention mechanism:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$

where each head is computed separately as

$$\mathrm{head}_i = \mathrm{Attention}\bigl(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\bigr)$$
This figure illustrates the multi-head self-attention structure. The left “Scaled Dot-Product Attention” is the basic attention unit, with inputs as Query (Q), Key (K), and Value (V) matrices (all derived from the same sequence data) and outputs as weighted features. “Multi-Head Attention” enhances model representation by concatenating features from multiple parallel attention units (heads); colors (red, blue) distinguish computations of different heads.
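For illustration, a minimal NumPy sketch of the scaled dot-product and multi-head attention computations defined above is given below; the dimensions and weight initialization are placeholders chosen for demonstration.

```python
# Scaled dot-product attention and a simple multi-head wrapper (illustrative dimensions).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)            # (batch, L, L)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax over key positions
    return weights @ V, weights

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    batch, L, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):                                    # each head uses its own projections
        sl = slice(h * d_head, (h + 1) * d_head)
        out, _ = scaled_dot_product_attention(X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl])
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ W_o                   # concatenate heads and project

# Toy usage: a batch of 4 state sequences of length 10 with 16 features each.
X = np.random.randn(4, 10, 16)
W_q, W_k, W_v = (np.random.randn(16, 16) * 0.1 for _ in range(3))
W_o = np.random.randn(16, 16) * 0.1
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=4)
print(Y.shape)    # (4, 10, 16)
```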
2.6. SAC Algorithm with Self-Attention Mechanism
In wellbore trajectory design, the input to the agent is usually a sequence, i.e., the state data of the most recent steps. Let $L$ be the window length and $S_t = (s_{t-L+1}, \ldots, s_t)$ be the state sequence; a self-attention layer is added to both the Actor and the Critic.

The state sequence $S_t$ is passed through the self-attention layer to obtain an attention-weighted feature representation $z_t$:

$$z_t = \mathrm{SelfAttention}(S_t)$$

The Actor network uses this feature to output the action distribution, $\pi_\phi(a_t \mid z_t)$, and the Critic network estimates the Q-value from the state feature and the action, $Q_\theta(z_t, a_t)$.
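As an illustration of how the self-attention layer can be prepended to the Actor described above, the following PyTorch sketch pools a window of recent states into a single feature before the policy head; the module name `AttentionActor`, layer sizes, and window length are assumptions rather than the exact architecture used in this study.

```python
# Illustrative attention-enhanced Actor: a self-attention layer pools a window of
# recent states into one feature before the SAC policy head (sizes are assumptions).
import torch
import torch.nn as nn

class AttentionActor(nn.Module):
    def __init__(self, state_dim=12, action_dim=2, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mu = nn.Linear(d_model, action_dim)
        self.log_std = nn.Linear(d_model, action_dim)

    def forward(self, state_seq):                  # state_seq: (batch, window, state_dim)
        x = self.embed(state_seq)
        z, _ = self.attn(x, x, x)                  # Q, K, V all come from the same sequence
        z_t = z[:, -1, :]                          # feature aligned with the most recent state
        mu, log_std = self.mu(z_t), self.log_std(z_t).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        return torch.tanh(dist.rsample())          # bounded action (Δ inclination, Δ azimuth)

actor = AttentionActor()
seq = torch.randn(16, 8, 12)                       # batch of 16 state windows of length 8
action = actor(seq)
print(action.shape)                                # torch.Size([16, 2])
```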
3. Reinforcement Learning Environment Modeling and Design
3.1. Three-Dimensional Geological Data Generation
Accurately characterizing geological parameters is a key prerequisite for optimizing wellbore trajectory design and ensuring safe, efficient drilling. These parameters reveal core information such as the distribution of underground rock strata, rock hardness, pore pressure, and fracture pressure, which directly determine the feasibility of the wellbore trajectory, potential drilling risks, and final construction cost. Especially in areas with faults, folds, or complex lithology, a designed trajectory that lacks detailed geological data support can easily fail to avoid high-risk areas, leading to serious accidents such as wellbore collapse, stuck pipe, and even lost circulation. To ensure a scientifically sound trajectory design and construction safety, high-quality geological data must be fully integrated at the design stage.

The introduction of Wasserstein generative adversarial network (WGAN) technology opens a new path for the generation and augmentation of geological data. By reconstructing the loss with the Wasserstein distance within the adversarial mechanism between generator and discriminator, this technology significantly improves the training stability of traditional GANs in sparse geological data scenarios. Compared with traditional GANs, which use JS divergence as the loss function, WGAN effectively avoids mode collapse (a common issue where generated data become overly homogeneous) and stabilizes gradient updates during training. This improvement enables WGAN to generate geological data with higher fidelity, specifically a more accurate reproduction of the complex spatial distributions of stratigraphic interfaces, fault strikes, and pore pressure fields, which are critical for simulating real geological conditions. The data used in this study are synthetic, generated by the WGAN from public geological samples (e.g., logging data from a typical oilfield). The generated data possess strong authenticity and diversity and provide rich geological sample support for 3D drilling environment modeling. This directly improves the accuracy and engineering practicability of the geological model and provides a reliable basis for wellbore trajectory optimization. The related network structure is shown in Figure 3.
Figure 3 shows the detailed components (Generator: generates 3D geological data; Discriminator: evaluates data authenticity) and the data flow (real geological samples → Generator → synthetic data → Discriminator feedback). Inputs include lithology logs and pore pressure measurements; outputs are 3D point cloud data in .ply format.
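For illustration, a minimal sketch of loading a generated .ply point cloud and building a voxel lookup that the reinforcement learning environment can query is given below; the use of the `plyfile` package and the per-point `lithology` property are assumptions about the data format, not a description of the actual pipeline.

```python
# Minimal sketch: load a generated .ply point cloud and build a voxel lookup for the
# RL environment (the 'lithology' per-point property is an assumed data-format detail).
import numpy as np
from plyfile import PlyData

def load_geo_grid(path, cell=10.0):
    ply = PlyData.read(path)
    v = ply["vertex"]
    pts = np.stack([v["x"], v["y"], v["z"]], axis=1)       # North, East, Depth (m)
    litho = np.asarray(v["lithology"])                      # assumed integer lithology code
    origin = pts.min(axis=0)
    idx = np.floor((pts - origin) / cell).astype(int)       # voxel indices
    grid = {}
    for key, code in zip(map(tuple, idx), litho):
        grid[key] = code                                    # last point in a cell wins (simple rule)
    return grid, origin, cell

def lithology_at(grid, origin, cell, north, east, depth):
    """Query the lithology code at a trajectory point; -1 means unknown / outside the model."""
    key = tuple(np.floor((np.array([north, east, depth]) - origin) / cell).astype(int))
    return grid.get(key, -1)
```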
In this paper, a three-dimensional wellbore trajectory simulation environment is constructed, which mainly refers to the typical scenarios and parameter ranges in actual oil and gas well trajectory planning. The starting point of the borehole is located at a fixed point (wellhead) on the surface, and the target point is located at a predetermined reservoir location underground.
To be representative, the target is set at a depth of about 800 m, 1200 m North, and 1000 m East. The WGAN model is trained on the known geological data, and geological sections are generated randomly; the training results are shown in Figure 4.
In Figure 4, a legend is added (color codes for lithology: red = sandstone, blue = shale, gray = fault zones), the coordinates are North-East in meters, and five profiles are generated to cover different stratigraphic zones. Figure 4a shows the cross-section at East = 0, and Figure 4b shows, from top to bottom, limestone, volcanic rock, mudstone, and sandstone.
A three-dimensional geological model is constructed by splicing multiple sections, with color codes representing different lithologies. The generated 3D point cloud data (Figure 5) are further mapped to the RL simulation environment to provide real-time geological background support for wellbore trajectory optimization. As shown in Figure 3, the WGAN model adopts a generative-discriminative adversarial structure to learn the spatial distribution of geological features. The network effectively reconstructs the statistical relationships in the sparse training data, resulting in high-quality section outputs (Figure 4); the lithological boundaries, fault contours, and stratification transitions are visibly coherent, suggesting that the model has captured the key geospatial patterns. Figure 5 presents the 3D point cloud integration of the generated sections from Figure 4, forming a continuous volumetric geological model. This point cloud serves as the dynamic input for the reinforcement learning environment, providing real-time geological constraints (e.g., obstacle positions) and enabling realistic simulation of complex drilling scenarios. In Figure 5, the colors are consistent with Figure 4: green represents limestone, blue represents volcanic rock, yellow represents mudstone, and red represents oil-bearing sandstone.
3.2. State Space Design
In the optimization design of the wellbore trajectory, the state space is the foundation on which the reinforcement learning model perceives the environment; it directly determines the effectiveness of policy learning and the responsiveness of trajectory adjustment. To ensure that the model can fully capture the changes in spatial pose and geological conditions during drilling, the state space designed in this paper includes not only traditional drilling parameters but also multi-source LWD data, enabling dynamic perception of and real-time feedback on geological changes:

$$s_t = \bigl(\theta_t,\ \varphi_t,\ N_t,\ E_t,\ D_t,\ GR_t,\ DEN_t,\ CNL_t,\ RT_t,\ DT_t,\ T_{\mathrm{ref}},\ O_t\bigr)$$

where $\theta_t$ is the current deviation (inclination) angle, i.e., the angle between the bit direction in the vertical section and the depth direction, which controls the deflection of the bit relative to vertical and ranges over $[0^{\circ}, 180^{\circ}]$; $\varphi_t$ is the current azimuth, i.e., the angle between the bit direction in the horizontal plane and North, which describes the orientation of the wellbore in the horizontal plane and ranges over $[0^{\circ}, 360^{\circ})$; $N_t$ is the current northing position in meters; $E_t$ is the current easting position in meters; $D_t$ is the current true vertical depth in meters; $GR_t$ is the gamma ray log value at the current position; $DEN_t$ is the density log value at the current position; $CNL_t$ is the neutron log value at the current position; $RT_t$ is the resistivity at the current position; $DT_t$ is the acoustic transit time log value at the current position; $T_{\mathrm{ref}}$ is the preset wellbore trajectory information; and $O_t$ is the obstacle information.
After introducing the self-attention mechanism, the state input becomes a state sequence $S_t = (s_{t-L+1}, \ldots, s_t)$, which can capture long-term dependencies, where $L$ is the sequence (window) length.
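For illustration, a minimal sketch of how the state vector and the sliding state window can be assembled is given below; the feature order, key names, and window length are assumptions for demonstration.

```python
# Illustrative construction of the state vector and sliding state window described above.
from collections import deque
import numpy as np

STATE_KEYS = ["inclination", "azimuth", "north", "east", "depth",
              "GR", "DEN", "CNL", "RT", "DT", "ref_distance", "obstacle_distance"]

def make_state(drill, logs, ref_distance, obstacle_distance):
    """Pack pose, LWD readings, and reference/obstacle information into one vector."""
    return np.array([drill["inclination"], drill["azimuth"],
                     drill["north"], drill["east"], drill["depth"],
                     logs["GR"], logs["DEN"], logs["CNL"], logs["RT"], logs["DT"],
                     ref_distance, obstacle_distance], dtype=np.float32)

class StateWindow:
    """Keeps the last L states so the self-attention layer sees a sequence."""
    def __init__(self, length=8, state_dim=len(STATE_KEYS)):
        self.buf = deque([np.zeros(state_dim, np.float32)] * length, maxlen=length)

    def push(self, state):
        self.buf.append(state)
        return np.stack(self.buf)    # shape (L, state_dim), oldest state first
```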
3.3. Action Spaces
In reinforcement-learning-driven trajectory control while drilling, the action space defines the behavior that the agent (i.e., the "virtual drill") can adopt at each decision step and is the core hub of policy learning and trajectory adjustment. The action space reflects the trajectory control commands that can be applied at each time step. During drilling, incremental adjustment of the deviation angle and azimuth angle is the main means of realizing trajectory steering and obstacle-avoidance control. The action space is therefore defined as a two-dimensional continuous space:

$$a_t = \bigl(\Delta\theta_t,\ \Delta\varphi_t\bigr)$$

The agent adjusts the deviation angle and azimuth angle within a bounded radian range at each step; the bounds can be chosen according to the specific scenario, for example depending on whether obstacles are present. This range setting reflects the adjustment capability achievable by a conventional rotary steerable system (RSS) within the wellbore structure, ensuring that the planned trajectory remains constructible.
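A minimal sketch of such a bounded two-dimensional action space is given below, assuming a Gymnasium-style environment; the ±3° per-step bound is a placeholder, not the limit used in this study.

```python
# Illustrative two-dimensional continuous action space (Δ inclination, Δ azimuth).
# The per-step bound 'max_delta' is a placeholder value.
import numpy as np
from gymnasium import spaces

max_delta = np.deg2rad(3.0)                       # example: ±3° per decision step, in radians
action_space = spaces.Box(low=-max_delta, high=max_delta, shape=(2,), dtype=np.float32)

def apply_action(inclination, azimuth, action):
    """Apply the incremental angle adjustments while respecting the angle ranges."""
    d_inc, d_azi = np.clip(action, -max_delta, max_delta)
    inclination = float(np.clip(inclination + d_inc, 0.0, np.pi))   # keep within [0°, 180°]
    azimuth = float((azimuth + d_azi) % (2 * np.pi))                 # wrap to [0°, 360°)
    return inclination, azimuth
```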
3.4. Reward Function Design
The reward function is an essential part of wellbore trajectory design: its design determines the learning effectiveness of the agent, the accuracy of trajectory tracking, and the smoothness of the adjustment process, while penalizing dangerous operations and excessive curvature. This paper therefore constructs a multi-objective weighted-combination reward function based on the geological model, obstacle distribution, and trajectory characteristics. The designed reward system integrates seven core components, each addressing a specific aspect of drilling control. The goal-approaching reward drives the trajectory toward the target position. The smoothness reward penalizes abrupt directional changes to ensure continuity and avoid mechanical stress. The obstacle-avoidance reward introduces spatial safety constraints to prevent collisions. The geological adaptability reward encourages paths through soft, low-risk formations. The step penalty discourages unnecessarily long trajectories. Formation hardness is negatively weighted to avoid difficult-to-drill zones, while the drillability incentive encourages efficient path planning. Together, these terms form a balanced and adaptable reward framework that enables the agent to plan trajectories that are not only accurate but also safe, efficient, and geologically feasible. The total reward takes the form

$$R_t = w_1 R_{\mathrm{goal}} + w_2 R_{\mathrm{smooth}} + w_3 R_{\mathrm{obs}} + w_4 R_{\mathrm{geo}} + w_5 R_{\mathrm{step}} + R_{\mathrm{hard}} + R_{\mathrm{drill}}$$

where $w_1, \ldots, w_5$ are the weights of the corresponding reward terms, and the coefficients of the hardness penalty and drillability incentive terms ($\lambda_H$ and $\lambda_D$, defined below) are usually set as positive numbers less than 1, according to engineering experience or strategy optimization.
- (1)
Goal approaching reward
Used to encourage the trajectory to progressively approach the preset target point, usually defined as the negative distance to the target point:

$$R_{\mathrm{goal}} = -\left\lVert \mathbf{p}_t - \mathbf{p}_{\mathrm{target}} \right\rVert$$

where $\mathbf{p}_t$ is the current bit position and $\mathbf{p}_{\mathrm{target}}$ is the target position.
- (2)
Trajectory Smoothness Bonus
Used to constrain the local continuity of the trajectory and penalize violent steering or sudden changes in curvature, this term encourages smooth motion while achieving the goal, avoiding "sharp turns" or unrealistic trajectory adjustments.
- (3)
Obstacle avoidance reward
Used to penalize approaching or entering the dangerous area around an obstacle, reflecting the safety of the trajectory:

$$R_{\mathrm{obs}} = -\lambda_{\mathrm{obs}} \cdot \max\bigl(0,\ d_{\mathrm{safe}} - d(\mathbf{p}_t, \mathbf{p}_{\mathrm{obs}})\bigr)$$

where $\lambda_{\mathrm{obs}}$ is the penalty coefficient, $d(\mathbf{p}_t, \mathbf{p}_{\mathrm{obs}})$ is the distance from the current bit position to the obstacle, and $d_{\mathrm{safe}}$ is the safety boundary, ensuring that the trajectory does not stray into the obstacle area.
- (4)
Geological response reward
Evaluates formation risks from real-time LWD data, encouraging the bit to pass through soft, low-pressure formations and avoid high-hardness, high-pressure fault zones. This term is built from the resistivity response and indirectly reflects formation drillability: a high RT value indicates a tight layer or an oil and gas layer, while a low value may indicate mudstone or a fault zone. This design enhances the adaptability of the trajectory design to the geological environment.

To achieve obstacle avoidance, the proposed method integrates multiple components: (1) a continuous action space that allows fine-tuned directional adjustments in azimuth and inclination, (2) a safety-aware reward function that penalizes proximity to risk zones, and (3) a self-attention mechanism that enables the model to recognize spatial-temporal patterns in the geological context. Obstacles are encoded as spatial constraints in the state space, and the reward function imposes penalties when the predicted trajectory approaches these constraints. During training, the reinforcement learning agent learns to associate certain state-action pairs with penalties or rewards depending on proximity to obstacles, which drives the emergence of avoidance behavior. The result is a trajectory policy that proactively bypasses obstacle zones while maintaining overall path feasibility and smoothness.
- (5)
Strategy efficiency reward
To encourage reaching the target point as soon as possible, a "survival penalty" is applied at each step:

$$R_{\mathrm{step}} = -c$$

where $c > 0$ is a fixed per-step deduction, preventing the agent from adopting an inaction strategy such as "standing still".
- (6)
Stratum hardness penalty term
The hardness of the formation represents the rock's resistance to breaking. The higher the hardness, the greater the drilling difficulty, the more severe the bit wear, the larger the loss of control accuracy, and the higher the risk of sticking and deviation instability. Based on the acoustic transit time (DT) or the equivalent hardness calculated from a geological regression model, the following penalty term can be constructed:

$$R_{\mathrm{hard}} = -\lambda_{H} \cdot H(\mathbf{p}_t)$$

where $H(\mathbf{p}_t)$ is the hardness value of the formation at the current drilling location (obtainable from joint inversion of DT, DEN, CNL, etc.) and $\lambda_{H}$ is a conditioning factor that controls the intensity of the penalty in high-hardness zones.
- (7)
Drillability enhancements
Drillability is a comprehensive measure of how easily a formation can be drilled and is usually related to rock strength, degree of fragmentation, and well path stability. To encourage the bit to prioritize path areas with better drillability, the following positive incentive can be designed:

$$R_{\mathrm{drill}} = \lambda_{D} \cdot K_{d}(\mathbf{p}_t)$$

where $K_{d}(\mathbf{p}_t)$ is the formation drillability rating, obtainable from logging data, geological model prediction, or empirical index inversion, and $\lambda_{D}$ is the drillability incentive coefficient, an adjustment parameter accompanying the hardness penalty term. The values of $\lambda_{H}$ and $\lambda_{D}$ were determined through pretraining experiments on random geological models to evaluate their sensitivity to drilling difficulty and geological heterogeneity. A grid search was employed, testing different combinations of these coefficients across training episodes; values that led to stable convergence, a high obstacle-avoidance success rate, and minimal curvature fluctuation were selected. The final settings reflect a trade-off between safety (avoiding hard rock and unstable zones) and trajectory efficiency.
This reward function system unifies target accuracy, safe obstacle avoidance, geological adaptation, and trajectory smoothness; it directly quantifies the engineering control objectives (e.g., distance, direction, obstacle avoidance) and is extensible, allowing additional indices to be introduced flexibly for specific mission scenarios (e.g., deep-water drilling or traversing fault zones).
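For illustration, a minimal sketch of the weighted combination of the seven reward terms is given below; the weights and the simplified functional forms are placeholders rather than the calibrated values used in this study.

```python
# Illustrative composite reward combining the seven terms above (placeholder weights/forms).
import numpy as np

WEIGHTS = dict(goal=1.0, smooth=0.3, obs=1.0, geo=0.2, step=0.05, hard=0.1, drill=0.1)

def reward(pos, target, d_obstacle, d_safe, delta_angles, rt_score, hardness, drillability):
    r_goal = -np.linalg.norm(np.asarray(pos) - np.asarray(target))    # approach the target
    r_smooth = -float(np.sum(np.abs(delta_angles)))                   # penalize sharp steering
    r_obs = -max(0.0, d_safe - d_obstacle)                            # penalize entering the buffer
    r_geo = rt_score                                                  # geological response term
    r_step = -1.0                                                     # per-step survival penalty
    r_hard = -hardness                                                # formation hardness penalty
    r_drill = drillability                                            # drillability incentive
    w = WEIGHTS
    return (w["goal"] * r_goal + w["smooth"] * r_smooth + w["obs"] * r_obs +
            w["geo"] * r_geo + w["step"] * r_step + w["hard"] * r_hard + w["drill"] * r_drill)
```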
4. Experimental Design and Result Analysis
A graphical flowchart of the experimental design (Figure X) clarifies the logic of the test scenarios, covering geological model generation, reinforcement learning environment construction, training of the self-attention-enhanced SAC algorithm, testing in typical obstacle scenarios, and performance evaluation; the detailed steps are listed later in this section. To verify the performance of the proposed self-attention SAC trajectory optimization model in complex obstacle environments, a comparative experimental platform was constructed and the relevant hyperparameters were set in detail. The hardware and software configurations used for the experiments (Table 1) and the main hyperparameter settings of the SAC reinforcement learning algorithm (Table 2) are introduced below to ensure the reproducibility of the experiments and the soundness of the methodology.
A comparative validation approach was adopted to evaluate the superiority of the proposed SAC algorithm with a self-attention mechanism. The baseline model for comparison was the original SAC algorithm without the self-attention mechanism. Both models were trained and tested under identical experimental conditions: the same 3D geological environments generated by WGAN, consistent state/action spaces, reward function settings, and hyperparameters (as listed in
Table 2). The comparison focused on key performance metrics in typical obstacle scenarios (especially the double-obstacle environment), including convergence speed (number of steps to reach stable performance), average reward value, trajectory smoothness (dogleg severity), and obstacle avoidance success rate. This design ensures that the observed performance differences can be attributed to the introduction of the self-attention mechanism.
The experiment was conducted with 1000 training episodes to ensure the model sufficiently converges. Each obstacle scenario (initial, mid-borehole, terminal, double obstacles, lateral obstacle) was repeated five times to reduce randomness and improve result reliability. The evaluation criteria included: (1) convergence speed (number of steps to reach stable average reward); (2) trajectory smoothness (quantified by dogleg severity, with lower values indicating better smoothness); (3) obstacle avoidance success rate (percentage of trajectories that avoid all obstacles without collision); and (4) average reward value (comprehensive metric reflecting target accuracy, safety, and efficiency).
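For illustration, the following sketch computes two of the evaluation metrics listed above: dogleg severity from consecutive survey angles (using the standard spherical dogleg-angle formula, reported per 30 m) and the obstacle-avoidance success rate over a set of trajectories; the station spacing and data layout are assumptions.

```python
# Sketch of two evaluation metrics: dogleg severity along a trajectory and the
# obstacle-avoidance success rate (station spacing and data layout are assumptions).
import numpy as np

def dogleg_severity(inclinations, azimuths, md_step=10.0, interval=30.0):
    """inclinations/azimuths in radians per survey station, md_step in meters."""
    i1, i2 = np.asarray(inclinations[:-1]), np.asarray(inclinations[1:])
    a1, a2 = np.asarray(azimuths[:-1]), np.asarray(azimuths[1:])
    cos_dl = np.clip(np.cos(i1) * np.cos(i2) +
                     np.sin(i1) * np.sin(i2) * np.cos(a2 - a1), -1.0, 1.0)
    dogleg = np.degrees(np.arccos(cos_dl))          # dogleg angle between stations (degrees)
    return dogleg * (interval / md_step)             # degrees per 30 m

def avoidance_success_rate(trajectories, obstacle_center, safe_radius):
    """Fraction of trajectories whose every point stays outside the safety radius."""
    ok = 0
    for traj in trajectories:                         # traj: (N, 3) array of North/East/Depth points
        d = np.linalg.norm(np.asarray(traj) - np.asarray(obstacle_center), axis=1)
        ok += int(np.all(d > safe_radius))
    return ok / max(len(trajectories), 1)
```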
The hyperparameter values listed in
Table 2 were determined based on a combination of prior literature benchmarks in deep reinforcement learning (e.g., SAC implementations in control tasks), experimental tuning, and practical constraints of drilling optimization. The learning rate, discount factor, and entropy coefficient follow standard ranges recommended by prior studies [
15,
27]. Parameters such as batch size, buffer size, and update frequency were fine-tuned through repeated pretraining trials to balance training stability and convergence speed. The hidden layer size [
28] is a commonly used configuration that offers sufficient expressive power without excessive computation. These settings were found to provide robust performance across multiple experimental scenarios.
The experimental process steps are as follows:
Step 1: Generating 3D geological models with obstacles using WGAN;
Step 2: Constructing the reinforcement learning environment by defining state space, action space, and multi-objective reward functions;
Step 3: Training the SAC algorithm integrated with the self-attention mechanism;
Step 4: Testing the trained model in typical obstacle scenarios (initial, mid-borehole, terminal, double obstacles, lateral obstacles);
Step 5: Evaluating performance via metrics such as convergence speed, trajectory smoothness, and obstacle avoidance success rate.
In actual drilling engineering, the downhole environment is often complex and changeable, often accompanied by the existence of obstacles such as faults, abandoned wellbores, and high-pressure abnormal zones. If the wellbore path crosses or approaches these areas, it may not only cause engineering accidents such as lost circulation and well collapse, but also greatly increase drilling costs and safety risks. Therefore, the design of the obstacle avoidance trajectory is of great significance for ensuring drilling safety and improving wellbore quality. Obstacle avoidance design not only requires avoiding risk areas, but also takes into account the constructability and target orientation of the trajectory. It is one of the core problems of modern intelligent wellbore trajectory planning to ensure a smooth trajectory and accurate target entry while meeting the safe space distance. A detailed pseudocode of the proposed SAC + Attention-based trajectory optimization method is provided in
Appendix A to support reproducibility.
4.1. Wellbore Trajectory Design Under Vertical Obstacle Interference Scenarios
In order to investigate the ability of the reinforcement learning model to deal with complex spatial constraints in the actual drilling process, three groups of typical obstacle wells are introduced in the three-dimensional wellbore path planning scene, which are located at the coordinate points (300,253), (600,500), and (900,700) in the North-east plane, and the safety avoidance radius is set to 100 m to simulate the real engineering scenarios such as dense well layout and old wellbore avoidance.
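For illustration, a minimal sketch of the anti-collision check implied by this setup is given below; since the obstacle wells are vertical, only the horizontal (North-East) distance to each well axis needs to be compared against the 100 m safety radius.

```python
# Sketch of the anti-collision check for the three vertical obstacle wells described above.
import numpy as np

OBSTACLE_WELLS_NE = np.array([[300.0, 253.0], [600.0, 500.0], [900.0, 700.0]])  # (North, East), m
SAFE_RADIUS = 100.0                                                               # m

def violates_safety(north, east, margin=0.0):
    """True if the trajectory point lies inside any obstacle well's safety radius."""
    d = np.linalg.norm(OBSTACLE_WELLS_NE - np.array([north, east]), axis=1)
    return bool(np.any(d < SAFE_RADIUS + margin))

# Example: a point at North = 620 m, East = 480 m is ~28 m from the second well axis.
print(violates_safety(620.0, 480.0))   # True
```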
Figure 6 shows the trajectory avoidance training process in a borehole initial-obstacle environment. The figure includes four subfigures: (a) three-dimensional spatial trajectory distribution; (b) North-East horizontal projection (showing the planar relationship between trajectories and obstacles); (c) North-height projection (reflecting the vertical depth changes of the trajectories); and (d) East-height projection (showing the correlation between eastward displacement and depth).

When the obstacle is set in the initial section of the wellbore, all trajectories from the training process are shown in Figure 6. From the three-dimensional view, the trajectories as a whole transition continuously and smoothly from vertical to horizontal, reflecting good wellbore trajectory design and constructability. From the distribution of trajectories, most are able to sense and bypass the obstacle area in advance, with no obvious collisions with the obstacle region. The trajectories show pronounced lateral (eastward) displacement while skirting the obstacle, achieving safe avoidance by adjusting the azimuth and inclination at the start of the build-up stage.

In the horizontal projection, the trajectories clearly exhibit an active avoidance trend: most deflect before approaching the obstacle, with obvious adjustments to the East and North, indicating that the model can avoid and anticipate in advance. A few trajectories deflect more strongly to stay clear of the obstacle safety buffer. Although individual trajectories deviate slightly from the overall group trend, convergence is good overall, and no irregularly scattered trajectories appear.

From the North-height projection, the smooth transition of the trajectory from vertical to horizontal can be observed. All trajectories descend vertically in the initial stage and are horizontal or near-horizontal before reaching the obstacle location (around North coordinate 300 m), effectively avoiding the obstacle's interference height range. The inclination adjustment near the initial position is evident, reflecting the flexibility of the planning stage and the effectiveness of the obstacle-avoidance strategy. Path integrity and consistency are good, and the later sections of the trajectories tend to coincide, showing mature trajectory control and optimization.

In the East-height projection, the transition of the initial trajectory from vertical to horizontal can also be observed. Most trajectories shift significantly eastward to avoid the obstacle, and this lateral displacement begins early, indicating that the reinforcement learning model makes early spatial adjustments when facing obstacles. The trajectories stabilize at about 600 m East and roughly 700-800 m depth, indicating that obstacle-avoidance measures are completed in the earlier stage of path planning.

Colored lines: lines of different colors represent the well trajectories of multiple wells or different sections of the same well (vertical depth profiles along the East coordinate); color distinguishes the trajectory sequences of different wells, construction stages, or design schemes. Red shaded area: the red rectangular region marks key layers of engineering or geological concern, such as the target reservoir interval or lithologic/physical anomaly zones (e.g., faults and fracture-development zones).
Figure 7 shows the highest-reward trajectory in the borehole initial-obstacle environment, using the same four-subfigure layout: (a) 3D spatial trajectory distribution; (b) North-East horizontal projection (planar relationship between trajectory and obstacles); (c) North-height projection (vertical depth changes); and (d) East-height projection (eastward displacement versus depth).

From the three-dimensional perspective, it can be clearly seen that this trajectory adopts an active avoidance strategy at the initial position of the obstacle (vertical cylinder), and the obstacle-avoidance behavior is clear and effective. The trajectory is continuous and smooth without obvious curvature mutation, showing good constructability and indicating that the reinforcement learning strategy fully considers the engineering requirements of wellbore trajectory continuity and stability during path optimization. In the horizontal projection, the optimal trajectory takes a distinct lateral avoidance action ahead of the obstacle (roughly North 200-400 m), shifting moderately to the East; the avoidance is precise, with no obvious redundant offset. While rounding the obstacle, the turning radius of the path is relatively flat, with no sharp turns, reflecting good smoothness and stability. The vertical projection shows that the initial section transitions quickly from the vertical to the horizontal well section; the whole deflection process is smooth with continuous curvature, indicating that the path designed by the reinforcement learning strategy fully meets the curvature and smoothness requirements of wellbore design in the vertical direction.
Figure 8 shows the trajectory distribution in a mid-borehole obstacle environment. The figure includes a 3D view and two orthogonal projections (North-East and North-height), with a scale bar at the bottom right indicating a spatial scale of 0-1000 m. As shown in Figure 8, when the obstacle is located in the middle transition area where the borehole gradually turns toward horizontal, the trajectory distribution is more dispersed, and the obstacle-avoidance strategy is no longer a single advance offset but presents diversified avoidance paths; some trajectories show clear curve adjustments before and after the obstacle. This indicates that the strategy possesses a stronger capability for local path reconfiguration and flexible bypassing in the middle section. The trajectory perturbation caused by mid-section obstacles is more challenging, but the strategy is able to adjust the path dynamically based on environmental feedback, showing enhanced spatial adaptability. The eastward adjustment before and after crossing the obstacle is more pronounced: the overall change in the East direction is large, and the trajectory distribution band presents a fan-shaped spread, indicating that the model has greater freedom of spatial choice in the middle section. The vertical change is more gradual, and trajectories near the obstacle area do not undergo drastic adjustment in vertical depth, indicating that the model relies mainly on lateral adjustment to avoid the obstacle rather than a secondary deflection or lift. Colored lines: different colors represent well trajectories of multiple wells or different sections of the same well, distinguishing trajectories by well, construction stage, or design scheme. Red shaded area: marks key intervals of engineering/geological interest.
Figure 9 shows trajectory planning in a terminal-obstacle environment. Obstacles are indicated visually by gray spheres (radius = 80 m) located at coordinates (1200, 950, 800) [North, East, Depth]; a scale bar (0-500 m) is added at the bottom left, with all axes labeled in meters (m). As shown in Figure 9, when the obstacle is set at the end of the borehole, the trajectory undergoes a small but precise spatial offset before approaching the end to avoid the obstacle's safety buffer zone; the avoidance amplitude is small but effective and does not affect the overall trajectory continuity. The reinforcement learning strategy retains the ability to fine-tune in the late stage when facing terminal obstacles, demonstrating stable control and precise position correction during the convergence period. The strategy has stabilized by the end stage, and obstacle avoidance is achieved through fine spatial adjustment rather than changing the deviation angle, which maximizes trajectory curvature continuity. The avoidance strategy is more conservative, avoiding drastic adjustment at the end of the wellbore and ensuring both the accuracy of the target point and the constructability of the path. Colored lines: different colors represent well trajectories of multiple wells or different sections of the same well (vertical depth profiles along the East coordinate), distinguishing by well, construction stage, or scheme. Red area: marks key layers of engineering/geological interest (e.g., reservoirs, fault/fracture zones).
Figure 10 shows the optimal trajectory in a mid-borehole obstacle environment, and
Figure 11 shows the optimal trajectory in a terminal obstacle environment. Optimal trajectories are distinguished by red bold lines (other training trajectories are shown in light blue). A scale bar (0–800 m) is added to the bottom right of each figure, with all axes labeled in meters (m). It can be seen from the figures that both optimal trajectories successfully avoid the obstacle safety buffer zone, showing good path planning ability and construction feasibility. In the case of the middle obstacle, the trajectory achieves a smooth bypass transition through the flexible linkage of azimuth angle and deviation angle before approaching the obstacle, and the path shows moderate deflection rather than sudden change in the horizontal projection, reflecting that the model has excellent local path reconstruction ability. In the end obstacle situation, the trajectory tends to converge as a whole, and the obstacle avoidance behavior is mainly based on slight adjustment, accurately and stably avoiding the obstacle body, which fully reflects the stable control ability and high-precision convergence characteristics of the reinforcement learning strategy near the target point. The trajectories of the two cases remain continuous and smooth in the vertical direction without dramatic curvature changes, which verifies the robustness and intelligent adaptability of the proposed path strategy under different stages of obstacle interference.
Figure 12 shows the trajectory training process in a double-obstacle environment. The dark blue lines represent the proposed SAC + self-attention trajectories, light blue lines represent baseline SAC trajectories, red cubes denote the first obstacle, and blue cylinders denote the second obstacle. All axes are labeled with “North (m)”, “East (m)”, and “Height (m)” to indicate coordinates and units (meters). Colored lines: Different colors represent well trajectories of multiple wells or different sections of the same well, distinguishing by well, construction stage, or scheme. Red area: Marks key layers of engineering/geological interest.
In order to further verify the obstacle-avoidance performance and design capability of the SAC algorithm, a double-obstacle scenario is set up.
Figure 12 shows all the training trajectories. The figure shows two vertical obstacle wells (red and blue cylinders), which are located in the early, middle, and rear sections of the path, respectively. All trajectories show good continuity and spatial coordination during the avoidance process. Smooth offset is achieved before and after the two obstacle bodies, and no collision or sharp turning occurs. The overall curvature of the trajectory is well controlled, especially in the smooth transition between the two obstacles, which fully verifies the path reconstruction ability and continuity maintenance ability of the SAC strategy. From the horizontal projection, it can be seen that most of the trajectories move eastward near the first obstacle body (North ≈ 400 m), and quickly adjust the direction towards the second obstacle body after crossing the obstacle area. When approaching the second obstacle body (North ≈ 800 m), the path makes reasonable avoidance again, and finally converges again in the target area, where the obstacle avoidance action is clear and effective. The overall path still maintains a high uniformity between the two obstacles, and there is no path aggregation or excessive offset, indicating that the strategy has multi-stage obstacle avoidance ability and overall trajectory optimization ability. From the vertical projection, it can be seen that all the trajectories change little in the vertical direction, mainly focusing on the initial deflection process (from vertical to horizontal), and then remain highly stable in the 600~800 m interval. It shows that the two obstacle avoidances are mainly realized by spatial reconstruction in the horizontal direction without obvious vertical disturbance, which meets the requirements of curvature control and well deviation stability in engineering. The red and blue obstacle areas are not crossed in both vertical projections, and the trajectory avoidance strategy is reasonable and effective.
Figure 13 shows the optimal trajectory in a double-obstacle environment. All axes are labeled with “North (m)”, “East (m)”, and “Height (m)” to indicate spatial coordinates and units (meters). The optimal trajectory is identified by a bold green line. This ensures clear visualization of the trajectory’s spatial distribution and its relationship with obstacles.
Figure 13 shows the path distribution of the obtained optimal wellbore trajectory in the three-dimensional perspective and different projection planes. In the three-dimensional view, the trajectory is smooth and continuous, and there is no sharp curvature fluctuation, indicating that the SAC strategy successfully incorporates the two obstacle avoidance behaviors into the overall planning system to realize the natural transition of the path. In the horizontal projection (
Figure 13b), the trajectory shifts eastward when approaching the red obstacle, and quickly transitions back to the midline after the obstacle avoidance is successful. Before continuing to advance to the blue obstacle area, a flexible deflection occurs again and shifts westward, forming two continuous but compliant and smooth direction correction processes. Between the two obstacles, the trajectory shows the rhythm characteristics of natural return-fine-tuning transition-re-offset, which greatly reduces the additional path length and unnecessary deflection. This path shape avoids an “S-type sharp turn” or “inflection point switching”, but presents a segmented arc combined continuous curve, which is conducive to wellbore construction stability and tool stress control. In the two vertical projection planes (
Figure 13c,d), the trajectory quickly enters the near horizontal state after the build-up phase, and maintains a stable wellbore depth during the entire obstacle avoidance process without dramatic changes. It shows that the whole obstacle avoidance behavior is mainly completed by horizontal space path planning, without relying on lifting or drilling avoidance, which further guarantees the construction feasibility and mechanical stability of the trajectory.
4.2. Wellbore Trajectory Design Under Lateral Interference
This group of experiments examines the guidance and avoidance behavior of the learned strategy when a lateral obstacle body (represented as an inclined cylinder) lies in the wellbore path. The obstacle body spans roughly 300~600 m East, 200~500 m North, and about 600~800 m in vertical depth, representing a possible high-pressure zone, adjacent-well safety control zone, or inaccessible geological anomaly encountered in actual drilling.
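For illustration, a minimal sketch of a point-to-axis distance test for such an inclined cylindrical obstacle is given below; the axis endpoints and radius are illustrative values chosen within the stated ranges, not the exact geometry used in the experiments.

```python
# Sketch of a point-to-axis distance test for an inclined cylindrical obstacle
# (axis endpoints and radius are illustrative assumptions).
import numpy as np

AXIS_START = np.array([200.0, 300.0, 600.0])   # (North, East, Depth) in m
AXIS_END   = np.array([500.0, 600.0, 800.0])
RADIUS     = 80.0                               # assumed cylinder radius, m

def distance_to_axis(point, a=AXIS_START, b=AXIS_END):
    """Shortest distance from a 3D point to the finite cylinder axis segment a-b."""
    ab, ap = b - a, np.asarray(point, float) - a
    t = np.clip(np.dot(ap, ab) / np.dot(ab, ab), 0.0, 1.0)   # clamp to the segment
    return float(np.linalg.norm(ap - t * ab))

def inside_lateral_obstacle(point):
    return distance_to_axis(point) < RADIUS

print(inside_lateral_obstacle([350.0, 450.0, 700.0]))   # a point near the axis midpoint -> True
```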
It can be seen from the training trajectory distribution map (
Figure 14) that some paths fail to effectively avoid the obstacle body in the initial stage, especially in the middle section, or there is a dense trajectory interference phenomenon, which reflects that some strategies lag behind in obstacle recognition or response. In the horizontal projection diagram, the trajectory passes through the obstacle interval intensively, forming a potential collision risk, which does not meet the safety design principles between wells. Colored lines: Different colors represent well trajectories of multiple wells or different sections of the same well, distinguishing by well, construction stage, or scheme. Red area: Marks key layers of engineering/geological interest.
Figure 14 shows the training process under lateral obstacle interference. Paths are clearly distinguished: green lines represent successful paths (avoiding the lateral obstacle and reaching the target), and red lines represent failed paths (colliding with the obstacle or deviating from the target zone). The failure rate is 15% (calculated from 500 test runs), i.e., the proportion of failed paths over the whole training process.
In contrast, before approaching the obstacle body, the optimal trajectory (
Figure 15) realizes the obstacle avoidance reconstruction of the overall trajectory through the linkage adjustment of the azimuth angle and the inclination angle. In the horizontal plane, it can be observed that the path is obviously offset around the obstacle zone, and the trajectory is gently twisted, so as to achieve the goal of space avoidance. In the vertical plane projection, the trajectory shape remains continuous, especially the local depression presented in the East height projection is a curvature control result within a reasonable range. The sag is not trajectory instability, but the path is within the range of whipstocking ability, and the spatial offset is completed through multiple small radius arc segments to maintain the smoothness of obstacle avoidance and wellbore stability.
Figure 15 shows the optimal trajectories under lateral obstacle interference. All axes are clearly labeled with titles: “North Coordinate (m)”, “East Coordinate (m)”, and “Vertical Depth (m)” to specify units (meters). Context is provided: this figure illustrates the optimal paths generated by the SAC + self-attention model when avoiding lateral obstacles (located 300–500 m East of the target zone). This ensures clarity in spatial reference, scenario context, and visual elements.
From the engineering point of view, the continuous curvature transformation adopted by such trajectories in the obstacle section effectively reduces the stress concentration and vibration risk of drilling tools and facilitates geosteering and subsequent trajectory correction operations. The optimal trajectory maintains the accuracy of target arrival, shows the high adaptability and practical operability of intelligent trajectory planning in the face of a complex structural environment, and verifies the practical value of this method under complex field conditions.
4.3. Comparative Analysis of the Self-Attention Mechanism
To evaluate the actual effect of the self-attention mechanism in the wellbore trajectory design task, this paper compares, under the same training environment and parameter settings, the performance of the original SAC algorithm and the improved SAC algorithm with self-attention during double-obstacle trajectory optimization.
Figure 16 shows the trend of evaluation mean reward (EMR) with the number of steps during training for both algorithms.
Figure 16 compares the reward convergence curves of the two algorithms. The horizontal axis shows training steps (0–30,000, linear scale, ticks every 5000 steps), and the vertical axis shows the average evaluation reward (−2.0 to −1.5, linear scale, ticks every 0.1); the reward is a dimensionless metric.
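For reference, one evaluation-mean-reward point on such a curve is typically obtained by running the current policy without exploration noise over a fixed set of evaluation episodes at regular training intervals. The sketch below illustrates this under assumed `env` and `policy` interfaces, which are placeholders rather than the paper's implementation.

```python
# Hedged sketch: compute a single EMR point by averaging undiscounted
# episode returns of the deterministic (evaluation-mode) policy.
def evaluate_policy(env, policy, n_episodes=10):
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = policy.act(obs, deterministic=True)  # no exploration noise
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)   # one point on the EMR curve
```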
The experimental results show that the strategy with the self-attention mechanism converges faster in the early stage of training. Within the first 20,000 training steps, the average reward of the improved model increases rapidly and stabilizes, while the baseline SAC algorithm lags significantly over the same stage. The average reward in the final stable stage is also better: the SAC + Attention strategy holds at about −1.76 over a long horizon, while the baseline SAC stabilizes at about −1.77, indicating that with the attention mechanism the strategy can more effectively avoid obstacles, optimize the path, and improve the quality of task completion.
After introducing the attention mechanism, the model exhibits smaller reward fluctuations during training and markedly better overall stability, indicating stronger resistance to interference and better adaptation to new scenarios in complex obstacle environments. It can therefore continuously generate high-quality action trajectories while avoiding the violent oscillations or instability that may occur during strategy exploration. The core advantage of self-attention is its ability to capture correlations among the elements of the state information, which allows the policy model to perceive changes in key spatial features dynamically and sensitively, such as the precise position of obstacles, adjustments of the wellbore direction, and movement of target points. This enhanced perception directly improves trajectory control: when faced with multi-stage, multi-source spatial interference, the strategy responds faster and more accurately, showing higher environmental adaptability and operational sensitivity. The self-attention mechanism thus improves both the learning efficiency of the strategy and the overall quality of path planning, while greatly enhancing operational stability and decision-making intelligence in complex, changing environments. These properties provide strong support for the practical application of reinforcement learning to intelligent wellbore trajectory optimization.
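As an illustration of how such state-level correlations can be captured, the following minimal PyTorch sketch applies a single-head self-attention layer to a small set of state feature tokens before a policy head; the token layout and dimensions are assumptions for illustration, not the paper's exact network architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed architecture details): single-head self-attention
# over the tokens of the state representation (e.g. obstacle position,
# wellbore direction, target offset) before the SAC policy head.
class StateSelfAttention(nn.Module):
    def __init__(self, feat_dim, embed_dim=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)

    def forward(self, state_tokens):                  # (batch, n_tokens, feat_dim)
        x = self.proj(state_tokens)
        attended, weights = self.attn(x, x, x)        # correlations between tokens
        return attended.mean(dim=1), weights          # pooled feature for the policy

# Example: 3 state tokens (obstacle, wellbore direction, target) of dimension 8.
feats, _ = StateSelfAttention(feat_dim=8)(torch.randn(2, 3, 8))
print(feats.shape)   # torch.Size([2, 64])
```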
5. Conclusions
In this paper, an intelligent wellbore trajectory optimization method combining a self-attention mechanism with the Soft Actor-Critic (SAC) algorithm is proposed. Combined with three-dimensional geological modeling based on a WGAN, a reinforcement learning environment for trajectory design under complex geological conditions is systematically constructed. Introducing the attention mechanism effectively enhances the policy’s ability to mine state history information and improves its ability to identify and exploit key features in multi-source logging data. The experimental results show that, compared with the original SAC algorithm, the proposed method has significant advantages in convergence speed, obstacle avoidance, trajectory continuity, and strategy stability. For example, the improved SAC + Attention model converged 25% faster than the baseline SAC within the first 20,000 training steps, as shown in Figure 16. The average reward improved from −1.77 to −1.76, and the obstacle-avoidance success rate increased by 12.5% in dual-obstacle scenarios. These metrics demonstrate better learning efficiency, more stable path optimization, and enhanced applicability under complex geological constraints. Furthermore, the strategy fluctuation range (standard deviation) of the model over 500 consecutive iterations is 0.03, only 43% of that of the baseline model (0.07), demonstrating stronger strategy stability. These metrics further verify that the self-attention mechanism enhances the model’s environmental perception and decision-making robustness.
In the face of the typical downhole interference scenarios considered here, namely initial, middle, end, and double obstacles, the improved algorithm achieves high-quality obstacle avoidance, ensures the construction feasibility and target accuracy of the trajectory, and verifies the strong adaptability and practicability of the method.
Although the proposed method shows strong performance in simulated environments, the current study does not include validation with real-world field data, owing to the confidentiality and access restrictions associated with industrial drilling operations. In future work, we aim to collaborate with industry partners to deploy the method on field-scale datasets and assess its performance under real geological conditions; this step will be essential for advancing from simulation to field-level application. Training the SAC + Attention model takes approximately 4–6 h on a single RTX 4070 GPU. Once trained, the model is lightweight and can run in real time on standard engineering workstations, making it practical for field use.
Compared to traditional trajectory planning methods based on geometric heuristics or static optimization, the proposed approach demonstrates superior adaptability to dynamic geological changes, continuous decision-making capability, and data-driven obstacle avoidance. In contrast to prior reinforcement learning approaches such as DQN or DDPG, the SAC + Attention model offers better convergence stability and long-horizon decision awareness, making it particularly suitable for real-time intelligent drilling applications. In practical terms, the method’s ability to generate smooth, feasible, and target-aligned trajectories in multi-obstacle scenarios supports its deployment in complex drilling environments such as faulted zones, high-pressure intervals, and high-density well fields, highlighting its potential to enhance drilling safety, improve target precision, and reduce unplanned toolface adjustments in the field.
Future research can further explore the hierarchical modeling capability of multi-scale attention structures in trajectory control tasks and combine advanced architectures, such as the Transformer, to improve robustness to complex environmental changes. In addition, real-time geological data acquired while drilling can be dynamically integrated into the training process to build an autonomous trajectory optimization and control system with online adjustment capability, promoting the transformation of intelligent drilling from “preset optimization” to “real-time adaptation”.