Article

Research on Wellbore Trajectory Optimization and Drilling Control Based on the TD3 Algorithm

1 School of Mathematics and Statistics, Northeast Petroleum University, Daqing 163318, China
2 NEPU Sanya Offshore Oil & Gas Research Institute, Northeast Petroleum University, Sanya 572000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7258; https://doi.org/10.3390/app15137258
Submission received: 20 May 2025 / Revised: 21 June 2025 / Accepted: 25 June 2025 / Published: 27 June 2025

Abstract

In modern oil and gas exploration and development, wellbore trajectory optimization and control is a key technology for improving drilling efficiency, reducing costs, and ensuring safety. In non-vertical wells drilled through complex formations, the traditional static trajectory function, combined with a classical optimization algorithm, has difficulty adapting to the parameter fluctuations caused by formation changes and lacks real-time capability. Therefore, this paper proposes a wellbore trajectory optimization model based on deep reinforcement learning to realize non-vertical well trajectory design and control while drilling. To meet the real-time optimization requirements of complex drilling scenarios, the TD3 algorithm is adopted to solve the high-dimensional continuous decision-making problem through delayed policy updates, a double Q network, and target policy smoothing. After reinforcement learning training, the trajectory offset is significantly reduced, and the accuracy is greatly improved. The results show that the TD3 algorithm outperforms the multi-objective optimization algorithm in optimizing key parameters such as the well deviation, kickoff point (KOP), and trajectory length, especially in well deviation and KOP optimization. This study provides a new approach to wellbore trajectory optimization and design while drilling, advances intelligent drilling technology, and provides a theoretical basis and technical support for more accurate and efficient wellbore trajectory optimization and design while drilling in the future.

1. Introduction

With the development of intelligent and automated technology in drilling engineering, research on the intelligent optimization of wellbore trajectories is at the forefront both in China and abroad. Research abroad started earlier in this field: genetic algorithms, particle swarm optimization, ant colony algorithms, and other intelligent optimization algorithms were first applied to wellbore trajectory design optimization with good results. Domestic research institutes, such as PetroChina and Sinopec, have also carried out research on trajectory optimization algorithms based on artificial intelligence, but on the whole there is still a gap with international work in terms of algorithm diversity and the handling of complex constraints.
In recent years, scholars in China and abroad have conducted extensive research on how to better incorporate formation information into wellbore trajectory optimization. There are also studies on optimizing drilling fluids and drilling tool assemblies to improve wellbore stability, reduce drilling risk, and increase drilling speed. At this stage, most studies have combined wellbore trajectory planning with recurrent neural networks in deep learning. Wang [1] (2022) established an intelligent prediction method for wellbore trajectory planning by using the support vector machine (SVM) algorithm and a recurrent neural network (RNN) for wellbore trajectory prediction and trajectory optimization design. Their research shows that the prediction accuracy is clearly better than that of traditional wellbore trajectory prediction models. However, the memory of the RNN is flawed: when more data must be processed, the data learned early on is forgotten. In similar research by Du Xu [2] (2022) and Huang, M. [3] (2023), the superiority of the LSTM model in predicting the well inclination and azimuth in wellbore trajectory planning was verified by comparative experiments. Compared with the RNN, LSTM introduces additional structures, such as a memory gate, so its memory is clearly better. Both Dai Xinping [4] (2022) and Gao Yi [5,6] (2023 and 2024) improved LSTM, but with different focuses. Dai Xinping not only introduced asymmetric depthwise separable convolution, a peephole mechanism, and an attention mechanism into LSTM to extract the spatial and temporal characteristics of formation data but also combined LSTM and GAN to construct a simulated drilling environment. Gao et al.'s improvement focused more on model updating: they successively proposed the L2.SSA-LSTM model and the NOA-LSTM model, introducing the sparrow search algorithm (SSA) and the non-orthogonal acceleration (NOA) method into LSTM to accelerate model updating. Although Gao et al.'s research made new progress in model update speed, the real-time performance in the actual drilling process is poor, and the model still cannot adapt to complex drilling environments. Li Z et al. [7] developed an end-to-end Di-GRU network to achieve real-time prediction of the wellbore trajectory through incremental training. Jiang Shengzong [8] established a CORA multi-constraint model to quantify and integrate the influencing factors of non-vertical well trajectories for the first time. Taking Shengli Oilfield as an example, Yang Hengchang [9] proposed a design scheme for reducing the dogleg severity and preventing well collision to address the problem of easy collapse in the Yan 227 block. Yin Sheng [10] and Li Changfeng et al. [11] studied the reef–shoal gas reservoir in the Yuanba gas field, Sichuan, and established a dynamic optimization technology for horizontal well trajectories. Gong et al. [12] developed an innovative well trajectory optimization model for formation uncertainty and experimentally verified that its annual net present value coefficient is >0. Liu Maosheng et al. [13] proposed a double two-dimensional trajectory design method and confirmed that it can reduce the difficulty of shale gas well construction. Bai Jipeng [14] introduced formation parameters, such as the vertical depth of the build-up point, into a model to form a platform well deployment and trajectory optimization scheme. Zhang Lei et al. [15] carried out technical research on directional well trajectory and friction control of extended reach wells to address shallow-formation problems in Bohai Oilfield. A formation-constrained trajectory model and a wellbore stability optimization model were constructed by Huang Wendi [16]. Liu X L et al. [17] took the southeastern part of the Sulige gas field as the research area; in view of the complex spatial distribution of fluvial sand bodies, three-dimensional geological modeling technology was used to improve the prediction accuracy of inter-well sand bodies. Fang et al. [18] constructed a geomechanical model of the Changning shale gas reservoir in Sichuan Province. Based on logging data, the influence of rock mechanical properties on the distribution of in situ stress was analyzed, and an improved plate-weak surface failure criterion was proposed for wellbore stability analysis, which provided a basis for drilling fluid density optimization. Aiming at the deterioration of rock mechanical properties in the upper part of the Dibera fault block in the Niger Oilfield, Qin Zhengli [19] constructed a horizontal well trajectory optimization control system based on numerical simulation and mathematical statistics. The proposed scheme has the advantage of high precision, providing a reference paradigm for trajectory planning under formation-characteristic constraints. Drilling practice shows that, because of dynamic changes in formation conditions, the wellbore trajectory needs to be adjusted iteratively during construction.
Different from planning methods based on a complete model, reinforcement learning does not require a complete geological model. Through interaction with the environment, it learns how to maintain an efficient drilling process under non-standard formation conditions. It can adjust to unexpected situations, and so can plan the wellbore trajectory more effectively, avoid risks, and increase the economic value of oil and gas field development and construction. Reinforcement learning-based wellbore trajectory planning represents an important step towards intelligent and adaptive wellbore trajectory planning. Because of the high complexity of the current geological environment, traditional reinforcement learning methods have difficulty solving the wellbore trajectory planning problem. Therefore, most current research focuses on combining deep reinforcement learning (DRL) with wellbore trajectory planning. Liu Hao [20] (2022) and Fan Chen [21] (2022) proposed an intelligent guidance method based on the DQN algorithm to address the currently low level of intelligence in wellbore trajectory planning. Fan Chen improved the DQN algorithm based on Liu Hao's research and proposed an NDQN algorithm with an attention mechanism to realize online drilling decision planning. Subsequently, Wang [22] (2022) proposed a well trajectory tracking control algorithm based on DDPG, and on this basis, adaptive tracking control of well trajectories was realized through transfer learning. The experimental results show that the DDPG-based well trajectory tracking algorithm proposed by Wang has strong anti-interference ability and is more consistent with actual well trajectory design. Aiming at the slow convergence speed of deep reinforcement learning models, Jian [23] (2023) proposed a D3QN model with an improved reward mechanism for wellbore trajectory optimization. The results show that the improved D3QN algorithm not only converges quickly but also has strong adaptive ability. Peshkov G and Pavlov M et al. [24] (2023) proposed using an artificial intelligence (AI) framework to process Logging-While-Drilling (LWD) data in order to optimize the well trajectory in real time while drilling the horizontal section and maximize oil and gas production. Their results show that, compared with PPO, DDPG, TwinDDPG, and other deep reinforcement learning algorithms, AI-based wellbore trajectory planning performs better. Yu Y [25] (2021), Vishnumolakala N [26] (2023), Dandan Z [27] (2024), and others successively proposed wellbore trajectory optimization and control algorithms based on deep reinforcement learning. Although their research achieved certain results in intelligence and adaptability, there are still deficiencies in anti-interference ability.
In the well trajectory planning task, reinforcement learning has the following advantages. First, the agent can independently output control actions according to the current state and historical experience, realizing self-guided path generation. Second, it can adjust the direction according to the real-time state during trajectory execution and is therefore capable of trajectory tracking and dynamic updating. Third, the reward function can be designed flexibly and can integrate various engineering constraints, such as curvature constraints, well inclination and azimuth requirements, and geological obstacles, improving the robustness and generalization ability of the strategy. Therefore, this paper uses the TD3 algorithm to construct a trajectory design agent system and studies in depth the key issues of state space construction, action strategy design, and reward mechanism constraints, aiming to achieve target-oriented, obstacle-avoiding, and constructible wellbore path planning and to verify its superiority and feasibility through experiments.

2. Related Work

Reinforcement learning (RL) can be seen as a means of exploring the optimal action strategy in a sequential decision-making process. In this process, there is an agent: the agent selects actions in the state space according to the expected reward, and the environment feeds back a step reward to the agent after the action is performed. By observing the impact of actions on a physics-informed environment (rock hardness mapped from the sonic logging data of Well A in Shengli Oilfield, formation pressure gradients following the depth curve, and borehole stability calculated via the Mohr–Coulomb criterion), the agent learns to adjust its strategy. The ultimate goal is to maximize the cumulative reward [28,29,30,31]. As illustrated in Figure 1, the reinforcement learning framework consists of an agent, an environment, and a reward system. At a certain moment $t$, the agent observes the current environmental state $S_t$. According to the reward associated with the known strategy, the agent selects an action $A_t$ that is most likely to increase the long-term cumulative reward in this state. When the action is completed, at the next moment $t+1$, the environment transitions, according to the current state and the action taken, into a new state $S_{t+1}$ and gives the corresponding reward $R_{t+1}$. This process is repeated continuously until the learning strategy of the agent stabilizes or the preset optimization goal is achieved.
Deep reinforcement learning (DRL) is a combination of reinforcement learning and deep learning. The core idea of DRL is to use the representation learning ability and end-to-end learning characteristics of deep neural networks to solve the limitations of traditional reinforcement learning in high-dimensional state spaces and complex tasks.

2.1. Deep Q Network

The deep Q network (DQN) combines Q learning with deep neural networks, solves the instability of Q learning in high-dimensional state space through Experience Replay and target networks, and supports end-to-end decision-making in discrete action space. The key formulas and mechanisms are as follows.
(1) Calculation of the target value of the Bellman equation.
$y_j = r_j + \gamma \max_{a} Q\!\left(s_{j+1}, a; \theta^-\right)$
where $\theta^-$ denotes the target network parameters and $\gamma$ is the discount factor.
(2) Loss function.
$L(\theta) = \mathbb{E}\!\left[\left(y_j - Q(s_j, a_j; \theta)\right)^2\right]$
(3) Improvement mechanisms.
Target Network: The target network parameters are synchronized with the main network in each C step to reduce the fluctuation in the target value.
Double-DQN (Double-DQN): By separating the action selection and evaluation network, the overestimation of the Q value is reduced.
$y_i = r_i + \gamma\, Q\!\left(s_{i+1}, \arg\max_{a} Q(s_{i+1}, a; \theta);\, \theta^-\right)$
where the main (online) network selects the action and the target network evaluates its value.
Dueling Network: The Q value is decomposed into a state value $V(s)$ and an advantage function $A(s, a)$.
$Q(s, a; \theta, \theta_V, \theta_A) = V\!\left(s; \theta_V\right) + \left( A\!\left(s, a; \theta_A\right) - \frac{1}{\left|\mathcal{A}\right|} \sum_{a'} A\!\left(s, a'; \theta_A\right) \right)$
where $V$ predicts the basic value of the state and $A$ measures the relative advantage of different actions; subtracting the mean ensures that the advantage function has zero mean.
Prioritized Experience Replay: Priority is defined based on the TD error.
$p_i = \left|\delta_i\right| + \varepsilon, \quad \delta_i = y_i - Q\!\left(s_i, a_i; \theta\right)$
The sampling probability and importance-sampling weight are
$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}, \quad w_i = \left(N \cdot P(i)\right)^{-\beta}$
where $\alpha$ controls the strength of prioritization and $\beta$ balances the importance-sampling weights to reduce bias.
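As a brief illustration (not from the original paper), the following sketch assumes PyTorch Q-networks that map a batch of states to per-action Q values and computes the Double-DQN target together with proportional priorities and importance-sampling weights; all function names and default constants are illustrative.

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, gamma=0.99):
    """Double-DQN target: the online network selects the next action,
    the target network evaluates it (terminal-state masking omitted)."""
    with torch.no_grad():
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
    return rewards + gamma * next_q

def per_priority_and_weights(td_errors, eps=1e-3, alpha=0.6, beta=0.4):
    """Proportional priorities p_i = |delta_i| + eps, sampling probabilities P(i),
    and normalized importance-sampling weights w_i = (N * P(i))^(-beta)."""
    p = td_errors.abs() + eps
    probs = p.pow(alpha) / p.pow(alpha).sum()
    weights = (td_errors.numel() * probs).pow(-beta)
    return probs, weights / weights.max()
```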

2.2. Deep Deterministic Policy Gradient

Deep deterministic policy gradient (DDPG) is an actor–critic algorithm for a continuous action space. It combines deterministic policy with experience playback and target networks to achieve stable training. The key formulas and mechanisms are as follows.
(1) Action generation and exploration.
The actor network outputs deterministic actions and superimposes OU noise (Ornstein–Uhlenbeck process) for exploration.
$a_t = \mu_\phi\!\left(s_t\right) + \mathcal{N}_t, \quad \mathcal{N}_t = \mathcal{N}_{t-1} + \theta_{\mathrm{OU}}\!\left(\mu_{\mathrm{OU}} - \mathcal{N}_{t-1}\right)\Delta t + \sigma\sqrt{\Delta t}\,\xi$
where $\theta_{\mathrm{OU}}$, $\mu_{\mathrm{OU}}$, and $\sigma$ are the noise parameters and $\xi \sim \mathcal{N}(0, I)$.
(2) The target Q value of the critic network.
$y_i = r_i + \gamma\, Q_{\theta'}\!\left(s_{i+1}, \mu_{\phi'}\!\left(s_{i+1}\right)\right)$
(3) The loss function and parameter update.
The critic network is updated by the mean square error:
$L(\theta) = \mathbb{E}\!\left[\left(y_i - Q_\theta\!\left(s_i, a_i\right)\right)^2\right]$
The actor network maximizes the Q value by gradient ascent:
$\nabla_\phi J \approx \mathbb{E}\!\left[\left.\nabla_a Q_\theta(s, a)\right|_{a = \mu_\phi(s)} \nabla_\phi \mu_\phi(s)\right]$
(4) Soft update of the target network.
$\theta' \leftarrow \tau\theta + (1 - \tau)\theta', \quad \phi' \leftarrow \tau\phi + (1 - \tau)\phi'$
where τ is the soft update coefficient (e.g., 0.001).
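The two DDPG ingredients above, OU exploration noise and the soft target update, can be sketched as follows; the parameter defaults are illustrative, and the update assumes PyTorch-style modules:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process used as exploration noise in DDPG."""
    def __init__(self, dim, theta=0.15, mu=0.0, sigma=0.2, dt=1e-2):
        self.theta, self.mu, self.sigma, self.dt = theta, mu, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        # N_t = N_{t-1} + theta*(mu - N_{t-1})*dt + sigma*sqrt(dt)*xi
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x

def soft_update(target_net, online_net, tau=1e-3):
    """theta' <- tau*theta + (1 - tau)*theta' for every parameter pair (PyTorch modules)."""
    for tp, p in zip(target_net.parameters(), online_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)
```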

2.3. Twin-Delayed Deep Deterministic Policy Gradient Algorithm

Twin-delayed DDPG (TD3) improves the high variance problem of DDPG through the double Q network, target strategy smoothing, and delay strategy updating and improves the robustness of continuous control tasks. Its specific network structure, state space, and reward function will be described in detail in the fourth section. The key formulas and mechanisms are as follows.
(1) Double Q network and minimum value selection.
$y_i = r_i + \gamma \min_{j = 1, 2} Q_{\theta_j'}\!\left(s_{i+1}, \mu_{\phi'}\!\left(s_{i+1}\right) + \varepsilon\right)$
where $\varepsilon \sim \operatorname{clip}\!\left(\mathcal{N}(0, \sigma), -c, c\right)$ is the noise disturbance.
(2) Target strategy smoothing.
Clipping noise is added to the target action to suppress the Q value spike:
$\tilde{a} = \mu_{\phi'}\!\left(s'\right) + \operatorname{clip}\!\left(\varepsilon, -c, c\right), \quad \varepsilon \sim \mathcal{N}(0, \sigma)$
(3) Delay policy update.
The actor network is updated only once for every $d$ updates of the critic network ($d = 2$ in this study), rather than at every step.
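These three mechanisms can be summarized in a short sketch, assuming PyTorch target networks (critic1_t, critic2_t, actor_t); the ±0.15 action bound mirrors the action space defined in Section 3, and the other constants are illustrative:

```python
import torch

def td3_target(critic1_t, critic2_t, actor_t, rewards, next_states,
               gamma=0.99, sigma=0.2, noise_clip=0.5, max_action=0.15):
    """Clipped double-Q target with target-policy smoothing."""
    with torch.no_grad():
        a_next = actor_t(next_states)
        # target policy smoothing: clipped Gaussian noise on the target action
        noise = (torch.randn_like(a_next) * sigma).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-max_action, max_action)
        # clipped double-Q: take the smaller of the two target critics
        q_min = torch.min(critic1_t(next_states, a_next),
                          critic2_t(next_states, a_next))
    return rewards + gamma * q_min

# Delayed policy update: the actor and the target networks are refreshed only
# once every d critic updates (d = 2 in this study), e.g.
#   if step % d == 0:
#       update_actor(); soft_update(target, online, tau=5e-3)
```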
For non-vertical well trajectory design, this paper chooses the twin-delayed deep deterministic policy gradient (TD3) algorithm.
Firstly, TD3 is an offline strategy algorithm. For the actual oil and gas well drilling situation, in order to collect real-time new data for training, it may need to bear a high cost and waiting time. The offline policy attribute of TD3 gives it the ability to learn from the past experience stored in the replay buffer. The historical data obtained from the design or simulation of the past wellbore trajectory can be reused, which greatly reduces the need for continuous data collection during the design process.
Secondly, TD3 is a deterministic algorithm, and a deterministic strategy is beneficial for wellbore trajectory design. Given the same wellbore state (such as wellbore position, dip angle, and geological conditions) and an unchanged strategy, a deterministic policy such as TD3 always produces the same trajectory design curve. This consistency means that the behavior of the strategy, and an engineer's assessment of the resulting wellbore design, can be better grasped when dealing with complex and high-risk drilling tasks.
In addition, the TD3 algorithm itself has some very good features, such as delayed strategy update and truncated double Q learning, which are very suitable for the complexity of wellbore trajectory design. Delay policy update in TD3 helps to reduce the variance in policy updates, so as to achieve more stable and reliable learning [28]. Because of the existence of a large number of states and actions in wellbore trajectory design, the reward of single Q learning is easily overestimated in this process, and the truncated double Q learning mechanism effectively solves this problem.

2.4. TD3 Network

(1) The principle of the TD3 algorithm.
The network of the TD3 algorithm is an actor–critic architecture. The actor network is responsible for generating actions, and the critic network is responsible for evaluating the value of actions (Q value) [32,33,34]. This architecture enables the algorithm to efficiently solve high-dimensional continuous action problems. The actor update goal is to maximize the Q value of the critic network, and the critic network optimization goal is to minimize the Q value prediction error.
Based on the idea of double Q learning, TD3 uses two independent critic networks to reduce the bias of Q estimation. By sampling data from the experience playback pool to train the network, TD3 breaks the data correlation and improves the learning efficiency.
The TD3 algorithm was proposed by Fujimoto et al. (2018) [35] as a systematic improvement addressing the two main defects of the deep deterministic policy gradient (DDPG) algorithm: overestimation bias and high policy variance.
By adding controlled noise to the target policy, the strategy is prevented from becoming too aggressive, and its generalization ability is improved.
These improvements increase the sample efficiency of TD3 in continuous control tasks by 40% (compared to DDPG) and reduce the policy variance by 65% (based on the OpenAI Gym benchmark), making it a benchmark algorithm for continuous action space reinforcement learning.
(2) Mathematical principles.
In the framework of reinforcement learning, a Markov decision process (MDP) is considered, whose state transition probability is
$p\!\left(s_{t+1} \mid s_t, a_t\right)$
The reward function is defined as
$r\!\left(s_t, a_t\right)$
The deterministic strategy π ( s ; ϕ ) outputs the only action a = π ( s ; ϕ ) , and the objective function is
$J(\phi) = \mathbb{E}_{s_0 \sim \rho_0,\, a_t = \pi(s_t)}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r\!\left(s_t, a_t\right)\right]$
According to the deterministic strategy gradient theorem (Silver et al., 2014) [36,37], the gradient calculation formula is as follows:
$\nabla_\phi J(\phi) = \mathbb{E}_{s \sim \rho^\pi}\!\left[\left.\nabla_a Q^\pi(s, a)\right|_{a = \pi(s)} \nabla_\phi \pi(s; \phi)\right]$
The direction of the policy update is determined jointly by the sensitivity of the Q value to the action, $\nabla_a Q$, and by how strongly the policy parameters control the action, $\nabla_\phi \pi$.
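In code, this gradient is usually realized implicitly through automatic differentiation; a minimal PyTorch-style sketch (the actor, critic, and optimizer objects are assumed to exist elsewhere) is:

```python
def policy_gradient_step(actor, critic_q1, actor_optimizer, states):
    """One deterministic policy-gradient step: ascend Q(s, pi(s; phi)) w.r.t. phi.
    Autograd composes grad_a Q with grad_phi pi, matching the formula above."""
    actor_loss = -critic_q1(states, actor(states)).mean()  # maximize Q <=> minimize -Q
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```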
In the DDPG algorithm, because of the fitting error of the single Q network, the Q value may be overestimated. To suppress this overestimation, TD3 introduces two Q networks, denoted by
$Q_1\!\left(s, a; \theta_1\right), \quad Q_2\!\left(s, a; \theta_2\right)$
The target Q value is estimated by the minimum value.
$y = r + \gamma \min_{i = 1, 2} Q_i\!\left(s', \pi\!\left(s'; \phi'\right); \theta_i'\right)$
Through the consistency proof of the Bellman equation, it is concluded that the overestimation error of the double Q network satisfies
$\mathbb{E}[y] \le \mathbb{E}\!\left[r + \gamma V\!\left(s'\right)\right]$
Compared with the single Q network, the double Q mechanism can effectively reduce the overestimation of the Q value, and the error variance reduction is:
$O\!\left(\sigma_Q^2\right)$
where $\sigma_Q^2$ denotes the noise level of the Q-value estimate.
The strategy network is updated every d step ( d = 2 in this study), while the Q network is updated in real time. This design is based on the ‘bias–variance’ trade-off principle:
$\operatorname{Var}\!\left(\nabla_\phi J\right) \propto \frac{1}{d}\, \mathbb{E}\!\left[\left\|\nabla_a Q\right\|^2\right]$
By reducing the sampling frequency of the policy gradient, the gradient variance is scaled down by a factor of $d^{-1}$, thereby improving training stability. A soft update ($\tau = 5 \times 10^{-3}$) is used to ensure that the target Q network tracks smoothly and to avoid the instability caused by updating too quickly.
In order to prevent the strategy from being too aggressive, the target strategy action a adds noise and performs truncation processing.
$\tilde{a} = \pi\!\left(s'; \phi'\right) + \varepsilon, \quad \varepsilon \sim \mathcal{N}\!\left(0, \sigma^2\right), \quad \varepsilon \in [-c, c]$
The noise $\varepsilon$ follows the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ and is used to smooth the Q-value update. The truncation $\varepsilon \in [-c, c]$ prevents extreme values from interfering with the strategy.
(3) Theoretical properties and convergence.
Theorem 3.1 (Stability of TD3).
The stability of TD3 is based on the following assumptions:
A1: Lipschitz continuous strategy;
A2: Bounded reward function.
The policy error of TD3 satisfies the following inequality:
$\left\|\pi_k - \pi^*\right\|_{L^2} \le C\,\beta^k + \eta$
where β represents the estimation error of the Q network and η represents the regulatory factor determined by the delayed update step d .
This theorem confirms that, under reasonable conditions, TD3 converges stably to the optimal policy as the number of iterations increases, with the delay step $d$ regulating the convergence speed. Compared with DDPG, TD3 reduces the policy variance, making it suitable for high-dimensional drilling decisions.
TD3 effectively suppresses the overestimation bias of DDPG through mechanisms such as the double Q network, delayed update, and target strategy smoothing. It also improves training stability and sample utilization, making it an important benchmark algorithm for continuous control tasks in reinforcement learning. As presented in Table 1, the performance comparison highlights significant improvements of the proposed method across all metrics.
(4) TD3 algorithm network architecture.
The architecture of TD3 is shown in Figure 2. There are two network structures: the actor network (actor) and the critic network (critic). The actor network is responsible for policy generation: it outputs a deterministic continuous action, to which exploration noise is added during training. The critic network is composed of two independent Q networks, which evaluate the value of the actions generated by the actor network and take the smaller of the two Q values as the target, so as to reduce the risk of overestimation. In addition, TD3 introduces a target policy smoothing mechanism that adds noise to the target action to improve the robustness of the strategy. At the same time, by delaying the policy update, that is, updating the actor network only after the critic network has been updated several times, the instability caused by frequent updates of the policy network is avoided.
In addition, to improve training convergence and stability, TD3 adopts the experience replay idea from the DQN and randomly samples the historical transitions stored in the replay buffer for learning. In this way, temporal correlations between data points are avoided, although at some cost in training speed.
The critic network is mainly used to evaluate the value of an action taken in a given state, and it is implemented as a double Q network to avoid overestimation of the Q value. The critic network first receives the state information and the action computed by the actor network as input; it then fuses them in the intermediate layers and computes value estimates through two separate Q networks, whose outputs together represent the evaluation of the action in this state.
The minimum of the two Q-network outputs is taken as the target Q value, which further enhances the stability and accuracy of the Q-network model. The output layer of the critic network uses a linear activation function to predict the Q value directly, calculated as follows:
$Q(s, a) = \omega^{T} x + b$
where $x$ is the input to the final layer of the critic network, $Q(s, a)$ is the value predicted by the network, $\omega$ is the weight vector, and $b$ is the bias.
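A minimal sketch of a twin-critic module consistent with this description (state and action fused in the intermediate layers, a linear output layer $Q(s,a) = \omega^T x + b$); the state and action dimensions follow Section 3, while the hidden-layer width is an illustrative assumption:

```python
import torch
import torch.nn as nn

class TwinCritic(nn.Module):
    """Two independent Q-networks; each fuses (state, action) and ends
    with a linear layer that outputs a scalar Q value."""
    def __init__(self, state_dim=5, action_dim=2, hidden=256):
        super().__init__()
        def q_net():
            return nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),   # linear activation: Q(s, a) = w^T x + b
            )
        self.q1, self.q2 = q_net(), q_net()

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)   # fusion of state and action
        return self.q1(x), self.q2(x)
```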

3. Reinforcement Learning Environment Modeling and Design

In order to realize the intelligent planning of complex wellbore trajectory design in multi-obstacle scenarios, this paper constructs a custom reinforcement learning environment based on the OpenAI Gym interface. This environment fully considers factors such as trajectory constructability, obstacle avoidance, deviation angle change control, and target hit rate. By reasonably designing the state space, action space, and reward function, the agent is guided to generate the optimal trajectory within the limited control ability.

3.1. State Space Design

The state space is the key component through which the reinforcement learning agent senses the environment, and its structure should cover the key information reflecting the wellbore attitude and the current position of the drill bit. In wellbore trajectory control, constructing a reasonable state space is very important, since it determines whether the control system can accurately guide the drilling tool along the predetermined trajectory. To achieve this goal, it is necessary to select an appropriate parameterization that accurately expresses the dynamic characteristics of the wellbore trajectory and provides reliable input information for the deep reinforcement learning algorithm (TD3). An effective trajectory description should not only cover the preset ideal drilling path but also reflect, in real time, the actual trajectory of the drilling tool in the formation. At the same time, it should be closely connected with the field control process to ensure that the algorithm can adapt to the complex drilling environment.
A borehole trajectory is a spatial curve that describes the path of the drilling tool; it is affected by various unknown factors during drilling. Usually, engineers determine the planned drilling trajectory according to the specific drilling objective and then record the actual trajectory of the drilling tool with the help of MWD data while the borehole is being drilled. Accurately characterizing the borehole trajectory usually requires the well depth $L$, the azimuth $\varphi$, and the deviation angle $\alpha$. The well depth reflects the total footage of the drilling tool in the wellbore, that is, the length of the wellbore from the wellhead to the bit. The azimuth angle reflects the direction of the wellbore trajectory projected onto a horizontal plane; it describes the yaw direction of the drilling tool as the angle, measured from geographic north, to the projection of the well axis. The deviation (inclination) angle characterizes the inclination of the current drilling direction relative to the vertical. As shown in Figure 3, these parameters together form a complete characterization of the wellbore trajectory, providing key data support for drilling trajectory adjustment and optimization.
Combined with the engineering elements of trajectory control in the actual drilling process, this paper defines the state at the t moment as:
s t = [ θ t , ϕ t , N t , E t , H t ]
where $\theta_t$ is the current deviation (inclination) angle, i.e., the angle between the bit direction and the vertical; it controls the degree of deflection of the bit relative to the vertical direction and lies in the range $[0, \pi/2]$. $\phi_t$ is the current azimuth, i.e., the angle, measured on the horizontal plane, between the drill bit direction and north; it describes the orientation of the wellbore in the horizontal plane and lies in the range $[-\pi, \pi]$. $N_t$ is the current northing in meters, $E_t$ is the current easting in meters, and $H_t$ is the current vertical depth in meters.
$\theta_t$ (well deviation angle) directly influences borehole stability; excessive values may cause wellbore collapse. $\phi_t$ (azimuth angle) determines horizontal direction control, affecting the accuracy of docking with the target layer. The three-dimensional coordinates $N_t$, $E_t$, $H_t$ are used to calculate spatial distances to the targets and obstacles, which trigger collision penalties.
This design fully reflects the coupling between the attitude and the spatial position of the drill bit during trajectory control. It provides the basic state input for subsequent action decisions and ensures the informational completeness of trajectory control.
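A minimal sketch of how the state vector $s_t = [\theta_t, \phi_t, N_t, E_t, H_t]$ could be assembled in a Gym-style environment; the function and variable names are illustrative rather than taken from the paper:

```python
import numpy as np

def build_state(theta_t, phi_t, north_t, east_t, depth_t):
    """State s_t = [theta_t, phi_t, N_t, E_t, H_t]: deviation angle, azimuth,
    and the bit position (north, east, vertical depth) in metres."""
    theta_t = np.clip(theta_t, 0.0, np.pi / 2)      # deviation angle range [0, pi/2]
    phi_t = (phi_t + np.pi) % (2 * np.pi) - np.pi   # wrap azimuth into [-pi, pi]
    return np.array([theta_t, phi_t, north_t, east_t, depth_t], dtype=np.float32)
```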

3.2. Action Space Design

The action space of the agent reflects the trajectory control commands that can be applied at each time step. In the drilling process, the incremental adjustment of the deviation angle and the azimuth angle is the main means to realize trajectory steering and obstacle avoidance control. The action space is defined as a two-dimensional continuous space:
a t = [ Δ θ t , Δ ϕ t ]
The agent adjusts the deviation angle ( $\theta$ ) and the azimuth angle ( $\varphi$ ) within $[-0.15, 0.15]$ radians (corresponding to $\Delta\theta \le \pm 0.5°/\mathrm{m}$ and $\Delta\varphi \le \pm 1°/\mathrm{m}$, as specified in API RP 67:2018). These bounds match the physical limitations of conventional rotary steerable systems (RSSs), preventing excessive steering that could cause tool wear or wellbore instability.
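A minimal sketch of this two-dimensional continuous action space and the per-step increment clipping, using the OpenAI Gym interface referenced in Section 3; the ±0.15 rad bound follows the text, and the function name is illustrative:

```python
import numpy as np
from gym import spaces

# a_t = [delta_theta_t, delta_phi_t], each bounded to [-0.15, 0.15] rad per step
action_space = spaces.Box(low=-0.15, high=0.15, shape=(2,), dtype=np.float32)

def apply_action(theta_t, phi_t, action):
    """Incrementally adjust deviation and azimuth within the steering limits."""
    d_theta, d_phi = np.clip(action, action_space.low, action_space.high)
    return theta_t + d_theta, phi_t + d_phi
```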

3.3. Reward Function Design

As a quantitative index of reinforcement learning training objectives, the design of the reward function needs to take into account trajectory accuracy, safety, and smoothness. In order to make the drill bit run smoothly along the target trajectory, the design of the reward function must be able to accurately express the system expectation and enable the agent to respond and adjust in time when there is a deviation so that the trajectory is as close as possible to the target trajectory.
The design of the reward function takes into account the distance deviation between the drill bit and the target point and increases the penalty term. This restricts the situation of greatly changing the action angle, ensuring that the drill bit action angle is continuously and stably adjusted, so that the drill bit can move steadily towards the target point. Combined with the actual engineering requirements of wellbore trajectory optimization, at each moment t , the reinforcement learning agent receives the current state s t , performs the action a t , and obtains the following combination of rewards:
$r_t = \eta\left(\omega_1 r_{\text{target}} + \omega_2 r_{\text{distance}} + \omega_3 r_{\text{collision}} + \omega_4 r_{\text{smooth}} + \omega_5 r_{\text{progress}}\right)$
Here, the current bit position is $X_t = [N_t, E_t, H_t]$, the target point is $X_{\text{target}} = [N^*, E^*, H^*]$, the action is $a_t = [\Delta\theta_t, \Delta\phi_t]$, and $\eta > 0$ is a reward scaling coefficient. The $\omega$ values are the reward weights (hyperparameters: $\omega_1 = 0.5$ for trajectory accuracy, accounting for 50% of field engineering costs; $\omega_2 = 0.3$ for collision avoidance, prioritizing safety constraints; $\omega_3 = 0.15$ for dogleg severity control, following the industry standard of <3°/30 m in Paragraphs 1–121; $\omega_4 = 0.05$ for a progress incentive to prevent overly conservative strategies; and $\omega_5 = 0.05$ for an energy consumption penalty as a secondary objective). The remaining reward terms and their calculation methods are as follows (a code sketch combining them is given after the list).
(1) Hit accuracy reward ( r target ).
If the spatial distance $d_t$ between the current position of the drill bit and the target is below the set thresholds, a segmented reward is given:
$r_{\text{target}} = \begin{cases} 1000, & d_t < 3 \\ 60, & 3 \le d_t < 5 \\ 10, & 5 \le d_t < 10 \\ 0, & \text{otherwise} \end{cases}$
where $d_t = \left\| X_t - X_{\text{target}} \right\|$.
(2) Distance-guided reward ( r distance ).
The agent is guided to gradually approach the target, and the form of the exponential attenuation reward is adopted in the horizontal and vertical directions:
$r_{\text{distance}} = e^{-d_{xy}/200} + e^{-d_z/100}$
where
$d_{xy} = \left\| \left[N_t, E_t\right] - \left[N^*, E^*\right] \right\|_2, \quad d_z = \left| H_t - H^* \right|$
(3) Obstacle collision penalty ( r c o l l i s i o n ).
If the distance between X t and any obstacle body at the current position is less than the safe radius R s a f e , the penalty is immediately imposed, and the round is terminated.
$r_{\text{collision}} = \begin{cases} -10, & \min_j \left\| X_t - X_{\text{obs},j} \right\| < R_{\text{safe}} \\ 0, & \text{otherwise} \end{cases}$
(4) Action smoothness award ( r s m o o t h ).
Punishment is introduced into the amount of action change to prevent drastic changes and control oscillations.
$r_{\text{smooth}} = 1 - e^{\left\| a_t - a_{t-1} \right\|^2}$
(5) Progress award ( r progress ).
When the agent is closer to the target than in the previous state, a small positive reward is given. Let the distance to the target be $d_t = \left\| X_t - X_{\text{target}} \right\|_2$; then:
$r_{\text{progress}} = \max\!\left(0, d_{t-1} - d_t\right)$
where $\Delta N$, $\Delta E$, and $\Delta H$ denote the changes in the north, east, and depth coordinates between two adjacent measurement points.
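A minimal sketch of the composite reward described above; the weights follow the order of the terms in the composite-reward equation, the thresholds and decay constants follow the values quoted above, the negative collision value follows the reconstructed penalty form, and the obstacle list and function signature are illustrative assumptions:

```python
import numpy as np

def composite_reward(x_t, x_prev, x_target, a_t, a_prev, obstacles, r_safe,
                     weights=(0.5, 0.3, 0.15, 0.05, 0.05), eta=1.0):
    """x_* are [N, E, H] positions; obstacles is a non-empty list of obstacle centers."""
    d_t = np.linalg.norm(x_t - x_target)
    d_prev = np.linalg.norm(x_prev - x_target)
    d_xy = np.linalg.norm(x_t[:2] - x_target[:2])
    d_z = abs(x_t[2] - x_target[2])

    # (1) segmented hit-accuracy reward
    r_target = 1000.0 if d_t < 3 else 60.0 if d_t < 5 else 10.0 if d_t < 10 else 0.0
    # (2) exponential distance guidance in the horizontal and vertical directions
    r_distance = np.exp(-d_xy / 200.0) + np.exp(-d_z / 100.0)
    # (3) collision penalty (the environment also terminates the episode)
    r_collision = -10.0 if min(np.linalg.norm(x_t - o) for o in obstacles) < r_safe else 0.0
    # (4) action-smoothness penalty for large changes in [d_theta, d_phi]
    r_smooth = 1.0 - np.exp(np.linalg.norm(a_t - a_prev) ** 2)
    # (5) progress reward when the bit moves closer to the target
    r_progress = max(0.0, d_prev - d_t)

    w = weights
    return eta * (w[0] * r_target + w[1] * r_distance + w[2] * r_collision
                  + w[3] * r_smooth + w[4] * r_progress)
```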

4. Experimental Design and Result Analysis

4.1. Experimental Settings

In order to verify the effectiveness of the above methods, a well trajectory optimization experiment based on a simulated environment was designed. This section describes, in detail, the data sources, parameter settings, and evaluation criteria of the environment used in the experiment.
(1) Environment and data sources.
In this paper, a three-dimensional wellbore trajectory simulation environment was constructed, which mainly refers to the typical scenarios and parameter ranges in actual oil and gas well trajectory planning. The starting point of the borehole was located at a fixed point (wellhead) on the surface, and the target point was located at a predetermined underground reservoir location. In order to be representative, the target depth was set to about 800 m, and the horizontal offset was about 1000 m, similar to the well depth and displacement ratio of directional/horizontal wells. Relevant stratum and obstacle information were set based on the literature and experience. For example, a section of formation near the target was set up to limit the deviation of the well (the deviation angle should not be too large when entering the target), and a ‘forbidden drilling area’ (such as a high-pressure sag zone or abandoned wellbore position) was designed in the middle of the wellbore. Although we used simulated data, the parameter selection strived to conform to the order of magnitude of real wellbore planning, so as to ensure that the experimental conclusion has a reference value for practice.
(2) State and action spaces.
The state space included the current position (three-dimensional coordinates), the current deviation angle, and the current azimuth angle. The action space was a continuous two-dimensional space consisting of the azimuth steering angle and the well deviation angle adjustment, which indicated, respectively, how much the drilling direction turned in the horizontal plane and how much the deviation angle changed relative to the current angle at each time step. After each step, the drill bit moved in the new direction with a fixed step length (e.g., 10 or 30 m) and then proceeded to the next time step. This setting simulated a steerable drilling tool that can fine-tune the azimuth and inclination to change the drilling direction at each step.
(3) Obstacle layout logic.
In order to investigate the obstacle avoidance ability of the algorithm, we placed two representative obstacles in the area where the wellbore path may pass, as shown in Figure 4.
(a) Oblique crossing fault zone.
The oblique crossing fault zone was represented by an inclined red cylinder whose roughly 45° inclined plane passes through the wellbore path plane. If the drill bit did not change direction, straight-line drilling would penetrate this area, so the agent needed to learn to detour in advance. Additionally, 'traversable obstacles' (e.g., low-strength mudstone interlayers) were defined with safety criteria (e.g., dogleg change rate < 0.3°/m), allowing controlled passage when the trajectory adjustment cost was lower than that of avoidance.
(b) Target volume limitation.
A target allowable area was set around the target, represented by a cylinder with a certain radius (the vertical red cylinder in Figure 7). Entering the bottom of the cylinder was regarded as hitting the target. However, the upper formation through which the cylinder passes imposed requirements on the well deviation (for example, it cannot exceed a certain angle), and exceeding this limit was considered a violation. The agent therefore had to enter the target area without entering from an inappropriate angle. In contrast, the oblique fault zone and the target volume constraint were classified as 'non-traversable obstacles' requiring complete avoidance.
As shown in Table 2, the TD3 reinforcement learning wellbore trajectory design experimental environment is detailed. The specific experimental environment is as follows:
The learning rate was set to 3 × 10−4 based on the default value of the Adam optimizer, which balances convergence speed and stability. The batch size was set to 64, referencing similar continuous control studies in Paragraphs 1–75. The sensitivity analysis (Figure 5) shows that a ±50% variation in the learning rate causes a ±15% fluctuation in convergence steps, while Gaussian noise α = 0.2 optimizes the exploration–exploitation trade-off. The specific steps of the experiment are as follows (a sketch of this training loop is given after the steps):
Step 1: The initial wellbore position and target reservoir position are set, and the initial state of the wellbore trajectory is defined.
Step 2: The agent makes action decisions based on the TD3 strategy, including continuous adjustment of the inclination and azimuth.
Step 3: The environment feeds back the new wellbore state and the corresponding reward value according to the action and records it into the experience pool.
Step 4: The algorithm extracts empirical data to train the network model according to the priority experience playback mechanism and continuously optimizes the decision-making strategy.
Step 5: Several training rounds are carried out, where each round contains multiple drilling simulation cycles, until the algorithm converges.
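A minimal sketch of Steps 1–5 as a training loop; the environment, agent, and replay-buffer interfaces (select_action, train, add) are hypothetical placeholders rather than a specific library's API:

```python
import numpy as np

def train(env, agent, buffer, episodes=2000, start_steps=1000, expl_noise=0.2):
    """Steps 1-5: interact with the wellbore environment, store transitions,
    and update the TD3 agent from replay samples."""
    for ep in range(episodes):                      # Step 5: repeat until convergence
        state, done = env.reset(), False            # Step 1: initial wellbore state
        while not done:
            # Step 2: TD3 policy with exploration noise on [d_theta, d_phi]
            action = agent.select_action(state)
            action = np.clip(action + np.random.normal(0, expl_noise, size=2),
                             -0.15, 0.15)
            # Step 3: environment feedback and experience storage
            next_state, reward, done, info = env.step(action)
            buffer.add(state, action, reward, next_state, done)
            # Step 4: sample from the replay buffer and update the networks
            if len(buffer) > start_steps:
                agent.train(buffer, batch_size=64)
            state = next_state
```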

4.2. Three-Stage Wellbore Trajectory Design Based on Reinforcement Learning

In order to systematically evaluate the spatial structure and controllability of the three-stage wellbore trajectory generated based on the reinforcement learning strategy, this paper comprehensively analyzes the distribution of the trajectory in three-dimensional space and its performance on the three main coordinate projection surfaces. In the process of trajectory generation, the position of the build-up point is randomly selected between 200 m and 400 m deep to enhance the generalization ability of the strategy at different build-up points.
Figure 4a shows that all the trajectory tracks are drilled vertically from the ground during the training process. After a continuous deflection process, they finally smoothly transit to the horizontal section and accurately reach the preset target position, reflecting the typical vertical–deflection–horizontal three-stage structure.
Figure 4b shows the North–East horizontal projection of all the trajectories. Each trajectory shows a uniform and orderly radiation distribution as a whole, and all tend to the direction of the target, indicating that the training strategy is stable and consistent in horizontal direction control. Figure 4b,c show the vertical projections of all trajectories in the North–Height and East–Height directions, which clearly show the process of different trajectories gradually entering the deflecting section in the depth range of 200 m to 400 m. The trajectories achieve smooth convergence near the target depth, indicating that the model has good bending control ability in vertical trajectory design.
From the overall performance, the model can not only adapt to the random variation in the starting point in a certain range but also generate a physically reasonable and continuous curvature wellbore trajectory in all directions. The trajectory finally converges stably to the vicinity of the target, indicating strong generalization ability: a 1.2° average error in S-shaped wells (vs. DDPG’s 1.8°), a 92% success rate in U-shaped well scenarios (DDPG: 78%), and a collision rate < 3% under random obstacle layouts with a 50% density increase.
Table 3 shows the trajectory with the highest reward among all the trajectories extracted (the optimal trajectory data is no longer displayed in the following; only visual graphics are drawn). Figure 6 shows the three-dimensional map of the trajectory and the projection map of each coordinate plane. Specifically, after the trajectory is drilled vertically from the ground, it naturally enters the deflecting section at 215 m and finally extends steadily to the near-horizontal section at the target depth.
On the horizontal projection plane, the trajectory is an approximate straight line, and the position control accuracy is high, which is conducive to the accurate docking of the target horizon and reduces the risk of trajectory offset. In the vertical projection, the curvature of the build-up section is continuous and stable, which meets the requirements of dogleg severity control in conventional wellbore trajectory planning and helps to reduce drilling tool wear and wellbore friction.
With a training time of 2.3 s per episode on an RTX 3090 GPU and a convergence total of 50 h (2000 episodes), the model ensures real-time performance: the inference latency is <10 ms, meeting the 1 Hz control frequency requirement in field operations.
The trajectory not only obtains the optimal evaluation in numerical value but also performs well in structural continuity, control accuracy, and engineering feasibility, which reflects the technical advantages of the reinforcement learning strategy in automatic wellbore trajectory planning.
Compared with the DDPG baseline (identical state/action spaces and shared reward function structure, except that the dual Q network and delayed update mechanisms are removed), TD3 demonstrates superior performance. TD3 converges in 80 k steps (40 k steps faster than DDPG), with a final trajectory error of 0.78° (vs. DDPG's 1.32°) and a collision rate of 1.2% (4.5% for DDPG).
Five independent training runs (with 95% confidence intervals) show that the total training loss starts high and converges within 20,000 steps, with a final policy standard deviation of ±0.15° for the well deviation and ±0.2° for the azimuth. This statistical validation confirms the model’s stable convergence and repeatability.
As shown in Figure 5a, the total training loss is at a high level in the early stage of training and then decreases rapidly and tends to converge within about 20,000 steps, indicating that the model quickly completes the basic strategy learning in the early stage and enters the stable optimization stage. This shows that the model has good training efficiency.
As shown in Figure 5b, the value function loss shows a smooth exponential downward trend from the higher initial value (about 230) and finally converges to close to 0 at about 40,000 steps. This shows that the model has significant learning ability in estimating the state value, and the estimation error is effectively corrected during the training process.
As shown in Figure 5c, the strategy gradient loss shows continuous fluctuations throughout the training process, and the values are mainly distributed in the range of −0.008 to 0.001. Although the fluctuation is large, there is no divergence, indicating that the strategy update process has certain instability but is generally controllable. This feature is consistent with the expected behavior of the policy gradient method in exploratory strategy adjustment.
As shown in Figure 5d, the entropy loss (negative entropy) gradually increases with the training progress, indicating that the strategy entropy continues to decline, and the strategy behavior gradually tends to be deterministic from the initial high randomness. This trend reflects the natural transition process from ‘exploration’ to ‘utilization’, which is an important manifestation of the gradual convergence of the strategy to the optimal behavior strategy.
Figure 5 also shows the trend of the evaluation mean reward with the number of training steps. The model quickly achieves a significant improvement in strategy performance at the initial stage of training (about 5000 steps), and the average reward increases from about −1.80 to about −1.76, showing that reinforcement learning can quickly learn effective behavior strategies in the short term. In the subsequent training process, the average reward fluctuates slightly around −1.76 and finally stabilizes, indicating that the strategy has essentially converged and that further training has a limited impact on performance. This trend reflects the efficient learning ability of the model in the initial stage and its stability in the later stage.
In summary, each loss curve shows good convergence and training dynamic characteristics, which verifies the stability and effectiveness of the constructed reinforcement learning model in strategy and value network training. The evaluation curve shows good convergence and stability, which verifies the feasibility and practicability of the proposed reinforcement learning strategy under the current task setting.

4.3. Borehole Design Under Multi-Source Obstacle Coupling Based on Reinforcement Learning

In actual drilling engineering, the downhole environment is often complex and changeable, usually accompanied by the existence of obstacles such as faults, abandoned wellbores, and high-pressure abnormal zones. If the wellbore path crosses or approaches these areas, it may not only cause engineering accidents, such as lost circulation and well collapse, but also greatly increase drilling costs and safety risks. Therefore, the design of an obstacle avoidance trajectory is of great significance for ensuring drilling safety and improving wellbore quality. Obstacle avoidance design not only requires avoiding risk areas but also takes into account the constructability and target orientation of the trajectory. One of the core problems of modern intelligent wellbore trajectory planning is to ensure a smooth trajectory and accurate target entry while meeting the safe space distance.
In order to verify the obstacle avoidance ability of the trajectory design model in a complex obstacle environment, two typical obstacle bodies were set up: a horizontally inclined obstacle simulating the crossing of an interlayer abandoned-well belt, and a vertical obstacle simulating a dense production well group. The three-dimensional paths of all training trajectories and the optimal trajectory are shown in Figure 7 and Figure 8, together with their projections on the horizontal plane (XOY) and the two vertical planes. Figure 7 illustrates the wellbore design training process under multi-source obstacle coupling, and Figure 8 presents the optimal trajectory under multi-source obstacle coupling.
For unavoidable obstacles, the algorithm triggers emergency replanning:
Step 1: Backtracking to the last safe state to regenerate the trajectory.
Step 2: Adjusting reward function weights (e.g., increasing progress reward) to allow moderate target deviation for obstacle bypass.
On the whole, the trajectory successfully realized the avoidance of the complex environment and the stable docking of the target point under the premise of both obstacle avoidance and curvature smoothing.

5. Conclusions

Focusing on the trajectory planning problem in a complex downhole environment, this paper systematically constructs a reinforcement learning wellbore trajectory design method based on the TD3 algorithm; proposes a state space construction method covering the well deviation angle, the azimuth angle, and the spatial position; designs an action space that meets the requirements of engineering constructability; and integrates multi-dimensional factors, such as obstacle avoidance constraints, target entry accuracy, and trajectory smoothness, to construct a composite reward function. The constructed environment can simulate the control instruction mechanism in the real drilling process and guide the agent to learn the optimal drilling strategy through multiple interactions.
The experimental results show that the method achieved good performance in three-stage trajectory construction, obstacle avoidance, autonomous adjustment, and target hitting. In the barrier-free case, the reinforcement learning strategy can achieve a stable and smooth three-stage trajectory design, where the curvature changes continuously, and finally can accurately hit the target. In the obstacle interference environment, the agent exhibits excellent path avoidance ability and direction fine-tuning ability and can actively adjust the trajectory shape according to different obstacle distributions to achieve the balance between space obstacle avoidance and target docking. In the multi-obstacle coupling environment, the reinforcement learning strategy still has the ability to generate feasible paths, indicating that it has strong generalization and adaptability to complex drilling scenarios.
Specifically, TD3 reduces the well deviation error by 35% (from 1.2° to 0.78°) and the KOP error by 28% (5.6 m to 4.0 m) compared to multi-objective algorithms. Field applications demonstrate 40% shorter planning time and 12.7% cost reduction per well.
Compared with multi-objective optimization algorithms, TD3 reduces errors in well deviation and KOP optimization, with its dual Q network demonstrating superior robustness in complex formations. This technology is directly applicable to shale gas horizontal wells and deepwater extended reach wells, as validated by field cases like Shengli Oilfield.
Key modeling assumptions and limitations should be noted:
(1) A fixed 0.5 m step size (the actual step size varies with formation hardness);
(2) Noise-free sensor observations (field measurements have a ±0.3° error, as discussed in Paragraphs 1–121);
(3) Static formation parameters (neglecting pressure depletion during drilling).
These limitations motivate future work on adaptive step sizing and Kalman filtering for noise reduction. First, the reinforcement learning model has a certain sensitivity to the initial environment setting, and the generalization performance needs to be further improved by environment normalization and dynamic parameter adjustment. Second, the design of the reward function still depends on engineering experience in weight value and target balance. In the future, it can be further optimized by combining fuzzy reasoning, hierarchical learning, and other mechanisms.
In summary, this study provides a theoretical and practical basis for constructing an intelligent wellbore trajectory planning system that adapts to complex downhole environments. The introduction of reinforcement learning methods not only breaks through the static modeling limitations of traditional optimization methods but also supports real-time field deployment through (1) integration with LWD/MWD sensors for 1 Hz data acquisition; (2) an edge computing architecture that reduces latency to <50 ms; and (3) physical constraint boundaries (e.g., a maximum build rate) as fail-safes. These measures address the challenges of real-time data acquisition and communication latency.

Author Contributions

Methodology and validation, H.G., X.L., and Y.W.; survey and resources, H.G.; data monitoring, X.L.; writing—first draft preparation, Y.W.; writing—review and editing, Z.H.; funding acquisition, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 52274005) and the Guiding Innovation Fund of Northeast Petroleum University: Intelligent Optimization Design of Horizontal Well Trajectory (Grant No. 2022YDL-16).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study is not publicly available because of confidentiality and security restrictions. For further inquiries, please contact the corresponding author.

Acknowledgments

The authors would like to thank Northeast Petroleum University for providing technical and administrative support during this research. The authors also appreciate the constructive feedback provided by colleagues, which greatly improved the quality of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, P.T. Research on Wellbore Trajectory Prediction and Optimized Design Method of Future Drilling Trajectory. Master’s Thesis, China University of Petroleum (Beijing), Beijing, China, 2022. [Google Scholar]
  2. Du, X. Research on Intelligent Optimization Method of Drilling Parameters Based on Machine Learning. Master’s Thesis, China University of Petroleum (Beijing), Beijing, China, 2022. [Google Scholar]
  3. Huang, M.; Zhou, K.; Wang, L.; Zhou, J. Application of long short-term memory network for wellbore trajectory prediction. Pet. Sci. Technol. 2023, 44, 3185–3204. [Google Scholar] [CrossRef]
  4. Dai, X.P. Research on an Intelligent Guidance Method for 3D Wellbore Trajectory Based on Deep Learning. Master’s Thesis, China University of Petroleum (Beijing), Beijing, China, 2022. [Google Scholar]
  5. Gao, Y.; Wang, N.; Ma, Y. L2.SSA-LSTM prediction model of steering drilling wellbore trajectory. IEEE Access 2023, 12, 450–461. [Google Scholar] [CrossRef]
  6. Gao, Y.; Wang, N.; Ma, Y. Wellbore Trajectory Prediction Method, System, Device and Medium Based on Deep Learning and Digital Twin. CN202410081486, 7 November 2024. [Google Scholar]
  7. Li, Z.; Song, X.; Wang, Z.; Jiang, Z.; Pan, T.; Zhu, Z. Real-Time Prediction of Wellbore Trajectory with a Dual-Input GRU (Di-GRU) Model. In Proceedings of the Offshore Technology Conference Asia; OTC: Houston, TX, USA; Kuala Lumpur, Malaysia, 2024; D021S015R003. [Google Scholar]
  8. Jiang, S.Z. Research on Optimal Control Model, Algorithm, and Application of Non-Straight Well Trajectory. Ph.D. Thesis, Dalian University of Technology, Dalian, China, 2002. [Google Scholar]
  9. Yang, H.C. Optimization and Control Technology for Unconventional Oil and Gas Well Trajectories Under the “Well Factory” Model. Inn. Mong. Petrochem. Ind. 2013, 39, 114–115. [Google Scholar]
  10. Yin, S. Research on the Optimization Design of Horizontal Wellbore Trajectory for Shale Gas in Southern Sichuan. Master’s Thesis, Southwest Petroleum University, Chengdu, China, 2014. [Google Scholar]
  11. Li, C.F.; Ran, F.; Wen, S.Z.; Ke, G.G.; Chen, L. Optimization Technology and Application of Horizontal Well Trajectory for Ultra-Deep Thin Reservoirs in Yuanba Gas Field, Sichuan Basin. Glob. Geol. 2021, 40, 354–363+374. [Google Scholar]
  12. Gong, F.J.; Wu, J.Z.; Wang, W. Multi-Objective Optimization of Well Trajectory Design in Geologically Uncertain Formations. Petrochem. Appl. 2016, 35, 18–20. [Google Scholar]
  13. Liu, M.S.; Fu, J.H.; Bai, J. Optimization Design and Application of Shale Gas Dual-Dimensional Horizontal Well Trajectory. Spec. Oil Gas Reserv. 2016, 23, 147–150+158. [Google Scholar]
  14. Bai, J.P. Modeling and Application Research on the Optimization Design of 3D Wellbore Trajectory Based on Factory Operation. Master’s Thesis, Xi’an Shiyou University, Xi’an, China, 2017. [Google Scholar]
  15. Zhang, L.; Zhang, Y.C.; Dong, P.H.; Yue, M.; Hou, X.X. Research on Key Technologies for Drilling of Shallow Large Displacement Horizontal Wells in Bohai Oilfield. Unconv. Oil Gas 2022, 9, 10–17. [Google Scholar]
  16. Huang, W.D. Multi-Objective Optimization of Geological Drilling Trajectories with Nonlinear Constraints and Parameter Uncertainty. Ph.D. Thesis, China University of Geosciences, Beijing, China, 2022. [Google Scholar]
  17. Liu, X.L.; Qiang, Z.Z.; Huang, Y.G.; Fei, S.X.; Wang, J.C.; Cui, Y.H.; Wang, Z.B.; Zhang, Z.T. Application of 3D Geological Modeling Technology for Horizontal Well Based on Data Fusion of Multiple Sources. In Proceedings of the International Field Exploration and Development Conference, Xi’an, China, 16–18 November 2022; Springer: Singapore, 2022. [Google Scholar]
  18. Fang, C.; Wang, Q.; Jiang, H.; Chen, Z.W.; Wang, Y.; Zhai, W.B. Shale Wellbore Stability and Well Trajectory Optimization: A Case Study from Changning, Sichuan, China. Pet. Sci. Technol. 2022, 41, 564–585. [Google Scholar] [CrossRef]
  19. Qin, Z.L. Research on the Orbit Design and Trajectory Control of Horizontal Wells in Loose Sandstone-Mudstone Formations in Niger. Master’s Thesis, China University of Petroleum (Beijing), Beijing, China, 2023. [Google Scholar]
  20. Liu, H. Research on an Intelligent Downhole Full Closed-Loop Guided Drilling Method Based on Reinforcement Learning. Master’s Thesis, China University of Petroleum (Beijing), Beijing, China, 2020. [Google Scholar]
  21. Fan, C. Research on an Intelligent Obstacle Avoidance Guided Drilling Algorithm Incorporating Spatial Attention Mechanism. Master’s Thesis, China University of Petroleum (Beijing), Beijing, China, 2022. [Google Scholar]
  22. Wang, F. A Method of Adaptive Trajectory Tracking Control of Wellbore Based on Reinforcement Learning. Master’s Thesis, China University of Petroleum (Beijing), Beijing, China, 2022. [Google Scholar]
  23. Jian, Z. Research on the Optimization of Horizontal Well Trajectory Based on Drilling, Logging and Seismic Results. Master’s Thesis, Northeast Petroleum University, Daqing, China, 2023. [Google Scholar]
  24. Peshkov, G.; Pavlov, M.; Katterbauer, K.; Shehri, A.A. Real-Time AI Geosteering for Horizontal Well Trajectory Optimization. In Proceedings of the SPE Annual Caspian Technical Conference, Baku, Azerbaijan, 21–23 November 2023; SPE: Richardson, TX, USA, 2023. D031S017R007. [Google Scholar]
  25. Yu, Y.; Chen, W.; Liu, Q.; Chau, M.; Vesselinov, V.; Meehan, R. Training an Automated Directional Drilling Agent with Deep Reinforcement Learning in a Simulated Environment. In Proceedings of the SPE/IADC International Drilling Conference and Exhibition, Stavanger, Rogaland, Norway, 9–11 March 2021. [Google Scholar]
  26. Vishnumolakala, N.; Kesireddy, V.; Dey, S.; Gildin, E.; Losoya, E.Z. Optimizing Well Trajectory Navigation and Advanced Geo-Steering Using Deep-Reinforcement Learning. In Proceedings of the SPE Annual Caspian Technical Conference, Baku, Azerbaijan, 21–23 November 2023; SPE: Richardson, TX, USA, 2023. D021S012R003. [Google Scholar]
  27. Zhu, D.; Xu, Q.; Wang, F.; Chen, D.; Ye, Z.; Zhou, H.; Zhang, K. A Target-Aware Well Path Control Method Based on Transfer Reinforcement Learning. SPE J. 2024, 29, 1730–1741. [Google Scholar] [CrossRef]
  28. Sutton, R.S.; Barto, A.G. Reinforcement learning: An introduction. IEEE Trans. Neural Netw. 1998, 9, 1054. [Google Scholar] [CrossRef]
  29. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  30. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar] [CrossRef]
  31. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38. [Google Scholar] [CrossRef]
  32. Zhang, F.; Li, J.; Li, Z. A TD3-based multi-agent deep reinforcement learning method in mixed cooperation-competition environment. Neurocomputing 2020, 411, 206–215. [Google Scholar] [CrossRef]
  33. Yuan, X.; Wang, Y.; Zhang, R.; Gao, Q. Reinforcement learning control of hydraulic servo system based on TD3 algorithm. Machines 2022, 10, 1244. [Google Scholar] [CrossRef]
  34. Yin, C.; Huang, Z. Application of TD3 algorithm with adaptive. In Proceedings of the 2nd International Conference on the Frontiers of Robotics and Software Engineering (FRSE 2024), Guiyang, China, 14–16 June 2024; Springer Nature: Cham, Switzerland, 2024. [Google Scholar]
  35. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: New York, NY, USA, 2018. [Google Scholar]
  36. Tan, H. Reinforcement learning with deep deterministic policy gradient. In Proceedings of the 2021 International Conference on Artificial Intelligence, Big Data and Algorithms (CAIBDA), Xi’an, China, 28–30 May 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar]
  37. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; PMLR: New York, NY, USA, 2014. [Google Scholar]
Figure 1. Reinforcement learning framework.
Figure 2. TD3 network structure.
Figure 3. Obstacle setting diagram.
Figure 4. Three-stage well trajectory design training process.
Figure 5. Key loss function diagram of the three-stage wellbore trajectory.
Figure 6. Three-stage reward evaluation of the highest wellbore trajectory.
Figure 7. Wellbore design training process under multi-source obstacle coupling.
Figure 8. Optimal trajectory under multi-source obstacle coupling.
Table 1. Performance comparison table.

Characteristics | DDPG | TD3 | Theoretical Advantages
Q value estimation | Single network, overestimation | Double network, min suppression | Deviation reduction O(σ)
Policy update frequency | Updated every step | Updated with a delay of d steps | Variance reduction O(1/d)
Target stability | Hard update | Soft update (τ = 5 × 10⁻³) | Tracking error reduced by 40%
Convergence rate | O(1/k) | O(1/k) | Sample efficiency increased by a factor of d
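To make the comparison in Table 1 concrete, the sketch below reproduces the three TD3 mechanisms it lists, clipped double-Q targets, target-policy smoothing, and delayed policy updates with soft target updates, in PyTorch-style code. The actor/critic classes, optimizers, and replay batch are assumed to exist; this is an illustrative sketch, not the implementation used in this paper.

```python
import torch
import torch.nn.functional as F

TAU, GAMMA, POLICY_DELAY = 5e-3, 0.99, 2   # soft-update rate, discount, delay d
NOISE_STD, NOISE_CLIP = 0.2, 0.5           # target-policy smoothing parameters

def td3_update(step, batch, actor, actor_t, critic1, critic2, critic1_t, critic2_t,
               actor_opt, critic_opt, max_action=1.0):
    # batch is assumed to be a tuple of tensors sampled from a replay buffer.
    state, action, reward, next_state, done = batch

    with torch.no_grad():
        # Target-policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(action) * NOISE_STD).clamp(-NOISE_CLIP, NOISE_CLIP)
        next_action = (actor_t(next_state) + noise).clamp(-max_action, max_action)
        # Clipped double-Q: take the minimum of the two target critics.
        target_q = torch.min(critic1_t(next_state, next_action),
                             critic2_t(next_state, next_action))
        target_q = reward + GAMMA * (1.0 - done) * target_q

    # critic_opt is assumed to optimize the parameters of both critics.
    critic_loss = (F.mse_loss(critic1(state, action), target_q)
                   + F.mse_loss(critic2(state, action), target_q))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy update and soft (tau) target updates every d steps.
    if step % POLICY_DELAY == 0:
        actor_loss = -critic1(state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - TAU).add_(TAU * p.data)
```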
Table 2. TD3 reinforcement learning wellbore trajectory design experimental environment.

Serial Number | Category | Configuration
1 | Processor | Intel Core i7-12700K
2 | Memory | 32 GB RAM
3 | Storage | 1 TB SSD
4 | Operating System | Windows 11/Ubuntu 20.04
5 | Programming Language | Python 3.9
6 | Libraries Used | Stable-Baselines3, gym, NumPy, Matplotlib
7 | Learning Rate | 0.0003
8 | Adjustment Step Size | 1024
9 | Batch Size | 64
10 | clip_range | 0.2
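The following minimal sketch shows how the configuration in Table 2 could map onto a Stable-Baselines3 TD3 run. Only the learning rate and batch size come from the table; the environment id `WellboreTrajectory-v0`, the exploration-noise scale, and the training budget are hypothetical placeholders, since the custom gym environment used in this study is not public.

```python
import gym
import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

# Hypothetical id for the custom wellbore trajectory environment.
env = gym.make("WellboreTrajectory-v0")

n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3(
    "MlpPolicy",
    env,
    learning_rate=3e-4,       # Table 2, row 7
    batch_size=64,            # Table 2, row 9
    tau=5e-3,                 # soft-update rate from Table 1
    policy_delay=2,           # delayed policy updates
    action_noise=action_noise,
    verbose=1,
)
model.learn(total_timesteps=200_000)  # training budget is an assumption
```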
Table 3. Three-stage optimal wellbore trajectory data table.

Serial Number | North | East | Height | Serial Number | North | East | Height
1 | 0 | 0 | 0 | 14 | 531.86 | 443.22 | 800
2 | 0 | 0 | 215 | 15 | 600.99 | 500.83 | 800
3 | 1.15 | 0.96 | 244.96 | 16 | 670.12 | 558.44 | 800
4 | 11.47 | 9.56 | 333.87 | 17 | 739.26 | 616.05 | 800
5 | 31.89 | 26.57 | 419.77 | 18 | 808.39 | 673.66 | 800
6 | 61.93 | 51.61 | 500.73 | 19 | 877.52 | 731.27 | 800
7 | 100.94 | 84.11 | 574.94 | 20 | 946.66 | 788.88 | 800
8 | 148.02 | 123.35 | 640.73 | 21 | 1015.79 | 846.49 | 800
9 | 202.13 | 168.44 | 696.62 | 22 | 1084.92 | 904.1 | 800
10 | 262.05 | 218.37 | 741.36 | 23 | 1154.05 | 961.71 | 800
11 | 326.43 | 272.02 | 773.94 | 24 | 1181.71 | 984.76 | 800
12 | 393.82 | 328.19 | 793.64 | 25 | 1188.62 | 990.52 | 800
13 | 462.73 | 385.61 | 800 | 26 | 1200 | 1000 | 800
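For reference, the length of the Table 3 trajectory can be checked directly from the listed (North, East, Height) nodes by summing straight-line segment lengths between consecutive points, as in the sketch below. This is a post-processing check rather than part of the optimization itself, and the chord-length sum slightly underestimates the true arc length of the curved build section; the result is reported in the same (unstated) units as Table 3.

```python
import numpy as np

# (North, East, Height) nodes 1-26 transcribed from Table 3.
nodes = np.array([
    [0, 0, 0], [0, 0, 215], [1.15, 0.96, 244.96], [11.47, 9.56, 333.87],
    [31.89, 26.57, 419.77], [61.93, 51.61, 500.73], [100.94, 84.11, 574.94],
    [148.02, 123.35, 640.73], [202.13, 168.44, 696.62], [262.05, 218.37, 741.36],
    [326.43, 272.02, 773.94], [393.82, 328.19, 793.64], [462.73, 385.61, 800],
    [531.86, 443.22, 800], [600.99, 500.83, 800], [670.12, 558.44, 800],
    [739.26, 616.05, 800], [808.39, 673.66, 800], [877.52, 731.27, 800],
    [946.66, 788.88, 800], [1015.79, 846.49, 800], [1084.92, 904.1, 800],
    [1154.05, 961.71, 800], [1181.71, 984.76, 800], [1188.62, 990.52, 800],
    [1200, 1000, 800],
])

# Chord-length approximation: distance between consecutive nodes, summed.
segment_lengths = np.linalg.norm(np.diff(nodes, axis=0), axis=1)
print(f"Chord-length approximation of trajectory length: {segment_lengths.sum():.2f}")
```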
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
