Attitude Control of Stabilized Platform Based on Deep Deterministic Policy Gradient with Disturbance Observer

Abstract: A rotary steerable drilling system is an advanced drilling technology, and stabilized platform tool face attitude control is one of its critical components. A multitude of downhole interference factors, coupled with nonlinearities and uncertainties, makes both model establishment and attitude control challenging. Furthermore, because the stabilized platform tool face attitude determines the drilling direction of the drill bit, the effectiveness of tool face attitude control under nonlinear disturbances, such as friction interference, directly impacts the precision and success of drilling tool guidance. In this study, a mathematical model and a friction model of the stabilized platform are established, and a Disturbance-Observer-Based Deep Deterministic Policy Gradient (DDPG_DOB) control algorithm is proposed to address the friction nonlinearity present in the rotary steering drilling stabilized platform. The numerical simulation results illustrate that the stabilized platform attitude control system based on DDPG_DOB can effectively suppress friction interference, improve nonlinear hysteresis, and exhibit strong anti-interference capability and good robustness.


Introduction
Rotary steering technology, as an emerging drilling innovation, has gained increasing attention from scholars. It marks a substantial stride towards intelligent and automated drilling, particularly in challenging environments [1,2]. Rotary steering technology offers numerous advantages, including rapid drilling velocities, reduced accident frequencies, excellent maneuverability, and cost savings. This technological advancement delineates the trajectory of progress in drilling methodologies and procedures, offering the potential for extended horizontal displacements, diminished risks of borehole obstructions, and the elimination of the need for recurrent insertion into and retrieval from the borehole, thereby augmenting drilling efficiency [3,4].
The core of a rotary steering system (RSS) comprises a stabilized platform situated within the drill collar, which governs the manipulation of the tool face angle; precise control of the tool face angle is paramount for achieving directional drilling and managing wellbore inclination [5][6][7][8]. Researchers have proposed various methods for controlling the attitude of stabilized platforms in rotary steering drilling, including PID control [9], fuzzy control with sliding mode variable structure control [10], and output feedback linearization control [11]. However, classic approaches have significant limitations: their control performance is closely tied to the chosen control parameters, and some approaches do not account for frictional nonlinearity at all. Moreover, the stabilized platform is always subject to unknown disturbances and parameter perturbations, of which frictional nonlinearity is a severe example, and most conventional control methods cannot address all such situations at the design stage. It is notable that previous studies have not extensively researched the impact of LuGre friction, which further accentuates the need for innovative control strategies that can overcome these challenges.
Deep Deterministic Policy Gradient (DDPG) is a reinforcement learning algorithm that can be used to solve continuous motion control problems, including nonlinear control challenges [12]. In comparison to traditional control methodologies, DDPG stands out for its adaptability and robustness [13]. Furthermore, DDPG demonstrates excellent generalization capabilities and scalability, rendering it suitable for a wide array of control scenarios and system dynamic models [14,15]. Therefore, in this study, the DDPG algorithm is applied to the control system of a rotary steering drilling stabilized platform; building upon this foundation, we introduce a novel approach, the Disturbance-Observer-Based Deep Deterministic Policy Gradient, to effectively counteract the impact of nonlinear frictional disturbances. The specific tasks are outlined as follows:
1. A rotary steering drilling stabilized platform model is established, and a LuGre friction model is constructed to provide a basis for the attitude control strategy.
2. A DDPG-based deep reinforcement learning attitude control system for the stabilized platform is developed. This involves the selection of the state vector, the design of the reward function, and the construction of the Actor-Critic network structure.
3. A Disturbance-Observer-Based Deep Deterministic Policy Gradient is proposed, aimed at effectively suppressing frictional disturbances and enhancing the control performance and robustness of the system.

Stabilized Platform Model
In accordance with the operational principles of a stabilized platform within rotary steerable drilling [16][17][18][19], we have formulated a comprehensive controlled object model for the stabilized platform, which is illustrated in Figure 1.

In the figure, K_M is the PWM-to-MOS-tube conversion ratio, K_E is the ratio of turbine electromagnetic torque to current, K_w is the gyroscope conversion coefficient, F_n is the external disturbance torque, and F_f is the friction torque.
Assuming x_1 = θ and x_2 = ω, the mathematical model of the stabilized platform control system can be expressed as

  ẋ_1 = x_2,
  ẋ_2 = −((f R_a + C_m C_e)/(J R_a)) x_2 + (C_m K_M/(J R_a)) u − (F_f + F_n)/J,

where f R_a + C_m C_e is the electromechanical constant, and J, R_a, f, C_m, and C_e represent the rotational inertia, armature resistance, viscous friction coefficient, motor torque coefficient, and counter-electromotive force coefficient, respectively.
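As a rough illustration, the state equations above can be integrated numerically. The sketch below uses Euler integration; the parameter values for J, R_a, f, C_m, C_e, and K_M are placeholders for illustration only, not the measured drilling-tool values used later in the simulations.

```python
import numpy as np

# Hypothetical parameter values for illustration only; the actual
# drilling-tool values are listed in the parameter tables.
J, Ra, f, Cm, Ce, KM = 0.05, 2.0, 0.01, 0.1, 0.1, 1.0
a = (f * Ra + Cm * Ce) / (J * Ra)   # electromechanical term
b = (Cm * KM) / (J * Ra)            # input gain

def platform_step(x, u, F_f=0.0, F_n=0.0, dt=1e-3):
    """One Euler step of the stabilized-platform state equations.
    x = [theta, omega]; u is the control command."""
    theta, omega = x
    dtheta = omega
    domega = -a * omega + b * u - (F_f + F_n) / J
    return np.array([theta + dt * dtheta, omega + dt * domega])

x = np.zeros(2)
for _ in range(1000):   # 1 s of open-loop response to a unit input
    x = platform_step(x, u=1.0)
```

With no friction or disturbance, the angular velocity rises monotonically toward the steady-state value b/a, as expected for a first-order velocity loop.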

Friction Characteristic Model
Given the complexity of the underground drilling environment, the control system of the stabilized platform is susceptible to the influence of nonlinear friction. Consequently, it is necessary to conduct a thorough analysis of the friction model. The LuGre model, initially proposed by Canudas in 1995, provides a comprehensive framework for describing the dynamic and static aspects of various friction phenomena [20,21].
The expression for the LuGre model is as follows:

  dZ/dt = ω_s − σ_0 |ω_s| Z / g(ω_s),
  g(ω_s) = F_c + (F_s − F_c) exp(−(ω_s/v_s)^2),
  F_f = σ_0 Z + σ_1 dZ/dt + σ_2 ω_s,

where ω_s is the angular velocity of the tool face; Z is the deformation of the bristles; v_s is the Stribeck velocity; F_f, F_c, and F_s represent the friction torque, Coulomb friction, and static friction, respectively; and σ_0, σ_1, and σ_2 represent the stiffness coefficient, viscous damping coefficient, and viscous friction coefficient, respectively. The LuGre friction model effectively describes both static and dynamic frictional processes. Moreover, it can capture intricate phenomena, encompassing viscous friction, Coulomb friction, and Stribeck friction. When operating the control system of a stabilized platform employed in rotary steerable drilling, it is essential to acknowledge the potential manifestation of nonlinear frictional effects, such as low-speed crawling, steady-state error, and limit cycle oscillation. Given the inherent compatibility between the friction characteristics modeled via the LuGre model and the friction encountered during the rotational processes of the stabilized platform, we select the LuGre model for integration into the control system and conduct rigorous research and analysis.
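To make the model concrete, the following sketch simulates the LuGre equations above by Euler integration. The parameter values (σ_0, σ_1, σ_2, F_c, F_s, and the Stribeck velocity v_s) are placeholders for illustration, not the values used in the paper's experiments.

```python
import numpy as np

# Illustrative LuGre parameters only, not the experimental values.
sigma0, sigma1, sigma2 = 1e3, 10.0, 0.1
Fc, Fs, v_s = 0.5, 1.0, 0.05

def g(w):
    """Stribeck curve: velocity-dependent friction level g(omega_s)."""
    return Fc + (Fs - Fc) * np.exp(-(w / v_s) ** 2)

def lugre_step(z, w, dt=1e-4):
    """Advance bristle deflection z one step; return (z_new, F_f)."""
    dz = w - sigma0 * abs(w) * z / g(w)
    z_new = z + dt * dz
    F_f = sigma0 * z_new + sigma1 * dz + sigma2 * w
    return z_new, F_f

z, w = 0.0, 0.02          # slow sliding inside the Stribeck regime
for _ in range(5000):     # 0.5 s, long enough for the bristle state to settle
    z, F_f = lugre_step(z, w)
```

At constant low velocity the bristle deflection settles at Z = g(ω_s)/σ_0, so the friction torque converges to roughly the Stribeck level g(ω_s), between the Coulomb and static friction values.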

DDPG Algorithm
The policy-based approach has effectively addressed the limitation of value-based deep reinforcement learning algorithms, which often face challenges when dealing with continuous action spaces. However, when confronted with potentially infinite action spaces, this method may inadvertently converge on local optima rather than globally optimal solutions. To address this challenge, Sutton introduced the Actor-Critic reinforcement learning framework, upon whose principles the DDPG algorithm builds [22,23].
The DDPG algorithm implements the Actor-Critic framework, where the Actor is responsible for policy updates and the Critic manages adjustments to the action value function [24,25]. Deep neural networks serve as nonlinear function approximators for both the Actor network μ(s|θ^μ) and the Critic network Q(s, a|θ^Q). The Critic network is updated by minimizing the mean square error, guiding the Actor network's policy updates to select appropriate actions. After extensive training, the optimal value target is achieved.
To mitigate the correlation between the current Q-value and the target Q-value, a dual network architecture is utilized for both the policy and value functions. In this architecture, both the policy network and the value network consist of online and target networks. The online network is tasked with updating the current network parameters, while the target network is dedicated to optimizing the target value.

Parameters Updating of DDPG Algorithm
The process of updating parameters in the DDPG algorithm is depicted in Figure 2.

While updating network parameters, a small batch of samples is randomly selected from the experience pool to train the network. The target value is calculated using the target Critic network, as expressed in Equation (5):

  y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1}|θ^{μ′}) | θ^{Q′}).   (5)

In Equation (5), y_i represents the Q-value of the current action, r_i is the reward for each step, γ is the discount factor, μ′(·) is the target policy, Q′(·) refers to the target value, and θ^{μ′} and θ^{Q′} denote the network parameters of the target Actor and target Critic networks, respectively.
The current value of Q is calculated based on the current state value s_i and action value a_i. Subsequently, the online Critic network is updated by minimizing the loss function, as shown in Equation (6):

  L(θ^Q) = (1/N) Σ_{i=1}^{N} (y_i − Q(s_i, a_i|θ^Q))^2.   (6)

In Equation (6), θ^Q represents the parameter of the online Critic network, and N signifies the number of samples.
Parameter updates to the online Actor network are executed through policy gradient techniques, optimizing reward maximization.
The objective function of the DDPG algorithm is defined as the expected value of the discounted cumulative reward. To enhance the agent's reward acquisition, the parameter θ^μ should be updated to maximize this objective function. Consequently, the chain rule is employed to derive the gradient of the objective function:

  ∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i},
where μ(s|θ^μ)|_{s=s_i} denotes the deterministic policy, and Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} signifies the value Q generated by selecting actions according to the deterministic policy μ in a given state s_i. The gradient ascent algorithm is employed to adjust the parameters θ^μ of the objective function. Furthermore, the target network parameters are modified through a soft update technique, incrementally adjusting the target network at each time step, as detailed in Equation (9):

  θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′},
  θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′},   (9)

where τ symbolizes the soft update coefficient, θ^μ represents the parameters of the online Actor network, θ^Q represents the parameters of the online Critic network, θ^{μ′} denotes the parameters of the target Actor network, and θ^{Q′} denotes the parameters of the target Critic network. Throughout the training process, the soft update method maintains gradual changes in the target network parameters, facilitating consistent gradient computation for the online networks and fostering easy convergence during training.

Design of Deep Reinforcement Learning Controller for Stabilized Platform

Overview of the Control System Framework
This framework primarily comprises three key components: the DDPG controller, the controlled object model of the stabilized platform, and the friction disturbance model (Figure 3). In this figure, the DDPG controller receives a state vector, denoted as s_t, and generates a corresponding action a_t through the policy network. The reward value, represented as r_t, is obtained by executing the action on the stabilized platform and evaluating it with the value network. Simultaneously, the current training sample is stored within the experience replay pool, with each data point stored in the form of a quadruple (s_t, a_t, r_t, s_{t+1}). A small batch of samples is randomly selected from the experience replay pool, and the controller undergoes extensive training to update the weight parameters of the Actor and Critic networks, achieving a nonlinear approximation of both networks and improving the control effect of the deep reinforcement learning algorithm.
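The replay mechanism described above can be sketched as a small Python class; the capacity and batch size below are illustrative, not the training settings used in the paper.

```python
import random
from collections import deque

# Minimal experience replay pool: stores (s_t, a_t, r_t, s_{t+1}) quadruples
# and returns uniformly random minibatches for training.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old samples fall off the end

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

pool = ReplayBuffer()
for t in range(100):                  # fake transitions for illustration
    pool.store(t, 0.1, 1.0, t + 1)
batch = pool.sample(16)
```

Random sampling from the pool is what breaks the temporal correlation between consecutive transitions, which stabilizes the network updates.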
Subsequently, with the DDPG algorithm and the stabilized platform model as the foundation, we proceed to select the appropriate state vector and design the reward function.

Selecting State Vectors
In the context of stabilized platform control, the reference tool face angle serves as the system input, denoted as θ_r. The difference between the reference tool face angle and the current tool face angle is e_t = θ_r − θ_t, while the difference between the reference tool face angle and the tool face angle at the previous moment is e_{t−1} = θ_r − θ_{t−1}. To achieve the desired tracking performance in the stabilized platform control problem, it is essential to adjust the current tool face angle in accordance with the reference tool face angle. As a result, when selecting state variables, the current tool face angle θ_t and the current error e_t become pivotal considerations. Moreover, to ensure that the current state progresses in the direction of a minimized error, we include the previous moment's error e_{t−1} as a state value. In summary, the state vector at the current moment is constructed as follows:

  s_t = [θ_t, e_t, e_{t−1}].

Designing the Reward Function
When formulating the reward function, we take into account the impact of system error on control performance. When the current tool face angle θ_t approaches θ_r, signifying a smaller error, a higher reward value is desired. Consequently, the reward function is structured as the summation of a term based on the current error e_t and a term based on the previous moment's error e_{t−1}. The design of the reward function ensures a positive reward when both the current error and the previous moment's error fall within the expected range. Any deviation beyond this range results in a negative penalty.
The reward function at the current moment can be expressed as:

  r(t) = r_1(t) + r_2(t),
  r_1(t) = α if |e_t| ≤ ε_1, and −α otherwise,
  r_2(t) = β if |e_{t−1}| ≤ ε_2, and −β otherwise.

In the equation, r_1(t) and r_2(t) represent the current error reward value and the previous moment error reward value, respectively. Additionally, α and β symbolize the current error reward coefficient and the previous moment error reward coefficient, respectively; these coefficients fine-tune the relative importance of r_1(t) and r_2(t) in the reward calculation. Furthermore, ε_1 and ε_2 denote the permissible error ranges for the current moment error and the previous moment error, respectively.
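Under a piecewise reading of this design (positive inside the tolerance band, negative outside), the reward computation is a few lines of code. The coefficient and tolerance values below are illustrative placeholders, not the settings used in the experiments.

```python
# Sketch of the described reward: positive when the current and previous
# errors fall inside their tolerance bands, negative otherwise.
# alpha, beta, eps1, eps2 are illustrative values only.
alpha, beta = 1.0, 0.5
eps1, eps2 = 0.05, 0.05

def reward(e_t, e_prev):
    r1 = alpha if abs(e_t) <= eps1 else -alpha      # current-error term
    r2 = beta if abs(e_prev) <= eps2 else -beta     # previous-error term
    return r1 + r2
```

Weighting the current error more heavily than the previous one (α > β) pushes the agent toward reducing the instantaneous error first while still rewarding consistent progress.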

Network Structure Design
In line with the preceding discussion and following the principles of the DDPG algorithm, the designed DDPG controller adopts a dual network framework encompassing the Actor and the Critic, each consisting of both online and target networks. These network structures are structurally identical aside from their parameterization. Figure 4 provides a graphical representation of the Actor and Critic network architectures. The Actor network's input layer interfaces with the state vector s_t. It includes two fully connected neural layers, consisting of 64 and 32 nodes, with Rectified Linear Unit (ReLU) activation functions. The output layer produces the action a_t, representing the control variable chosen by the agent in the current state. Conversely, the Critic network's input layer accommodates both the state vector s_t and the action a_t. The output layer generates the value associated with the agent's action selection under the present-state conditions. The overall structure of the Critic network mirrors that of the Actor network.
In accordance with the state vector, reward function, and network structure design of the DDPG controller as elucidated earlier, the iterative process of tracking the tool face angle unfolds. During each iteration, the network parameters undergo calculated adjustments until the system achieves convergence with the desired ideal state.
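A minimal sketch of the 64/32-node ReLU architecture described above, written as plain NumPy forward passes: the three-dimensional state s_t = [θ_t, e_t, e_{t−1}] follows the earlier design, while the weight initialization and the numeric input values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_init(sizes):
    """Random weights for a fully connected net with the given layer sizes."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """ReLU hidden layers, linear output, matching the 64/32 design."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)      # ReLU on hidden layers only
    return x

state_dim, action_dim = 3, 1            # s_t = [theta_t, e_t, e_{t-1}]
actor  = mlp_init([state_dim, 64, 32, action_dim])
critic = mlp_init([state_dim + action_dim, 64, 32, 1])

s = np.array([0.1, -0.05, 0.02])
a = forward(actor, s)                        # action a_t
q = forward(critic, np.concatenate([s, a]))  # Q(s_t, a_t)
```

The Critic simply concatenates the state and action into one input vector, which is the usual way a Q-network "accommodates both the state vector and the action".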

Design of DDPG_DOB
In the practical environment, the stabilized platform control system is affected by frictional disturbance, which introduces nonlinear characteristics such as dead-zone nonlinearity and saturation, leading in turn to steady-state error, oscillation, and hysteresis within the system. Therefore, this section proposes the integration of a disturbance observer to accurately estimate and compensate for frictional disturbances, thereby eliminating their adverse effects on the stabilized platform.
The structural configuration of the stabilized platform control system enhanced by the incorporation of a disturbance observer is shown in Figure 5. The utilization of a disturbance observer plays a crucial role in estimating and mitigating disturbances. This approach enables the real-time adjustment of the controller's output based on the observed disturbances, subsequently altering the input to the stabilized platform-controlled object. As illustrated in Figure 5, the disturbance observer takes the tool face angle as input and generates an estimated value for external disturbances. Simultaneously, the DDPG controller undergoes training, taking into account the current tool face angle and the tool face angle error. The disparity between the controller's output and the disturbance observer's output serves as the input to the stabilized platform-controlled object. The operational principle of the disturbance observer is elucidated in Figure 6.
As depicted in Figure 6, u is the output of the controller; G(s) symbolizes the transfer function of the controlled object; y signifies the system's control output; G_n^{-1}(s) denotes the inverse of the nominal model; Q(s) is the low-pass filter; d signifies the disturbance; d̂ represents the estimated value of the disturbance d; and ξ accounts for measurement noise. The difference between the controller output u and the estimate d̂ forms the control input to the system.
These components collectively contribute to the control inputs of the system. As delineated in Figure 6, when the controller output signal is denoted as U(s), the disturbance signal as D(s), and the measurement noise as N(s), the corresponding transfer functions from each signal to the output Y(s) are represented as follows:

  G_{UY}(s) = G(s) G_n(s) / [G_n(s) + Q(s)(G(s) − G_n(s))],
  G_{DY}(s) = G(s) G_n(s) [1 − Q(s)] / [G_n(s) + Q(s)(G(s) − G_n(s))],
  G_{NY}(s) = −G(s) Q(s) / [G_n(s) + Q(s)(G(s) − G_n(s))].   (14)

Leveraging Equation (14), Figure 6 is streamlined to Figure 7 through equivalent transformation.
The mathematical expression of the low-pass filter is

  Q(s) = 1 / (Σ_{k=0}^{N} a_k (τ s)^k),

where N signifies the order of the denominator; a_k = N!/(k!(N−k)!) represents the binomial coefficient; and τ denotes the filter time constant.
The incorporation of a disturbance observer into the system enables the observation of the equivalent disturbances acting on the system, allowing disturbance compensation within the control framework. Ultimately, this leads to effective disturbance suppression, mitigating the impact of disturbances on the control system.
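As a rough sketch, a discrete-time disturbance observer with a first-order filter Q(s) = 1/(τs + 1) can be written as follows. The nominal first-order velocity model and all parameter values here are assumptions for illustration, not the paper's design.

```python
import numpy as np

# Nominal first-order velocity model  omega_dot = -a_n*omega + b_n*(u - d),
# where d is an equivalent input disturbance.  a_n, b_n, tau are illustrative.
a_n, b_n, tau, dt = 0.3, 1.0, 0.05, 1e-3

class DisturbanceObserver:
    """Invert the nominal model, then low-pass the result with Q(s)."""
    def __init__(self):
        self.omega_prev = 0.0
        self.d_hat = 0.0

    def update(self, omega, u):
        omega_dot = (omega - self.omega_prev) / dt      # finite difference
        self.omega_prev = omega
        # Equivalent input implied by the measurement, minus the command:
        d_raw = u - (omega_dot + a_n * omega) / b_n
        # Q(s) = 1/(tau*s + 1) realized as a discrete low-pass filter:
        self.d_hat += (dt / tau) * (d_raw - self.d_hat)
        return self.d_hat

# Plant matching the nominal model; constant disturbance d = 0.2, input u = 1.
dob, omega, d_true = DisturbanceObserver(), 0.0, 0.2
for _ in range(2000):
    omega += dt * (-a_n * omega + b_n * (1.0 - d_true))
    d_hat = dob.update(omega, 1.0)
```

Because the simulated plant matches the nominal model, the filtered estimate converges to the true constant disturbance; in the full control loop this estimate would be subtracted from the controller output, as in Figure 5.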

Simulation
In this section, we seek to verify the superiority and control accuracy of the DDPG_DOB algorithm presented in the previous section through a comprehensive assessment encompassing key aspects: tracking simulation experiments, step simulation experiments, and robustness experiments.The parameters listed in Tables 1 and 2 correspond to the actual parameters of the drilling tool and are derived from measurements or known values in the field.These parameters are crucial for accurately modeling the drilling system and ensuring the validity of our simulations.

Parameter Description
The stabilized platform model [26], LuGre friction model, and neural network training parameters are shown in Tables 1-3, respectively.

Simulation Results and Analysis
In the presence of friction disturbances, we conducted a series of simulation and analysis experiments, encompassing both tool face angle set-point control and tool face angle tracking control.The study involved a comparative analysis of the DDPG_DOB control algorithm against the conventional PID, PID_DOB, and DDPG algorithms.

Tool Face Angle Step Simulation with Friction
For the simulation, the tool face angle set-point was 60 degrees, applied as a step signal; the external disturbance signal was set as F_n = sin(t + π); and the friction disturbance was modeled with the LuGre model. The experimental results of the PID, PID_DOB, DDPG, and DDPG_DOB control methods are presented in Figure 8 and Table 4. Based on these results, in comparison to PID, PID_DOB, and DDPG, the DDPG_DOB controller exhibited a substantial reduction in steady-state error of 0.023°, 0.012°, and 0.014°, respectively. While it displayed a slightly higher overshoot, the DDPG_DOB algorithm demonstrated the shortest settling time and rise time. These findings underscore the effectiveness of the refined DDPG_DOB control approach, highlighting its ability to enhance the system's response speed and mitigate steady-state error, thereby significantly improving the overall control performance.

Tool Face Angle Tracking Simulation with Friction

For the tracking simulation, the input signal of the system was set as θ_r = sin(πt), the external disturbance signal was set as F_n = sin(t + π), and the friction disturbance was modeled with the LuGre model. Figure 9 shows the comparative experimental results of the PID, PID_DOB, DDPG, and DDPG_DOB control methods. From Figure 9, it is evident that the DDPG_DOB method outperforms both the PID algorithm and the DDPG algorithm in terms of tracking accuracy and tracking error, maintaining the tracking error within 8.7%. The DDPG_DOB control method significantly enhances the control performance of the stabilized platform and is effective at suppressing disturbances.

Robustness Experimental Research
Experiments were conducted to evaluate the robustness of the stabilized platform control system under variations in the rotational inertia J, the armature resistance R_a, and the external disturbance F_n. The simulation results are presented in Figure 10, and Table 5 shows the maximum error of the control system for the stabilized platform.

The results presented in Figure 10 and Table 5 provide a clear conclusion. Even in the presence of parameter variations, both the PID and DDPG control methods demonstrate the capability to achieve a certain level of tracking precision. However, PID control exhibits a significant increase in error, leading to a deviation from the desired accuracy in tool face angle tracking. In contrast, DDPG control, while displaying lower error rates than PID, faces challenges related to latency. Remarkably, the DDPG_DOB algorithm displays superior control effectiveness, highlighting its ability to mitigate parameter variations, reduce the impact of frictional disturbances, and deliver robust and resilient performance. Consequently, the DDPG_DOB method proposed in this study emerges as a more suitable choice for the control system of a rotary steerable drilling stabilized platform.

Conclusions
In this study, we investigate the stabilized platform of a rotary steering drilling system and establish a mathematical model. To address issues related to friction and unknown disturbances, a DDPG_DOB-based attitude control algorithm is proposed. Specifically, to mitigate the impact of nonlinear friction interference on the stabilized platform, a disturbance observer is introduced for estimation. Numerical simulation experiments on the stabilized platform attitude control system validate the effectiveness of the DDPG_DOB method, with the following results: DDPG_DOB achieves a tracking response error range of 8.7%, outperforming PID and DDPG in terms of control accuracy, nonlinearity, and anti-disturbance capability. The DDPG_DOB method shows distinct advantages over PID, PID_DOB, and DDPG, reducing steady-state errors by 0.023°, 0.012°, and 0.014°, respectively, while achieving the shortest settling time and rise time. These improvements underscore its efficacy in enhancing response speed and accuracy. The DDPG_DOB-based stabilized platform control system effectively suppresses the effects of variations in rotational inertia, armature resistance, and external disturbance amplitude, and exhibits good adaptive ability and strong robustness under complex and continuous working conditions.

1. A rotary steering drilling stabilized platform model is established, and a LuGre friction model is constructed to provide a basis for the attitude control strategy.
2. A DDPG-based deep reinforcement learning attitude control system for the stabilized platform is developed. This involves the selection of the state vector, the design of the reward function, and the construction of the Actor-Critic network structure.
3. A Disturbance-Observer-Based Deep Deterministic Policy Gradient (DDPG_DOB) control algorithm is proposed.
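As an illustration of contribution 1, the LuGre friction dynamics can be sketched numerically. The bristle stiffness, damping, viscous, and Stribeck parameters below are illustrative placeholders, not the values identified for the platform in the paper.

```python
import math

# Minimal LuGre friction model sketch; parameter values are illustrative
# placeholders, not the values identified for the platform in the paper.
SIGMA0, SIGMA1, SIGMA2 = 1e3, 10.0, 0.4   # bristle stiffness/damping, viscous coeff.
FC, FS, VS = 0.28, 0.34, 0.01             # Coulomb level, static level, Stribeck velocity

def lugre_step(z, v, dt):
    """Advance the internal bristle state z one step; return (z, friction force)."""
    g = FC + (FS - FC) * math.exp(-(v / VS) ** 2)   # Stribeck curve g(v)
    zdot = v - SIGMA0 * abs(v) * z / g              # bristle deflection dynamics
    z = z + zdot * dt
    return z, SIGMA0 * z + SIGMA1 * zdot + SIGMA2 * v

# Sweep velocity slowly through zero to expose the presliding/hysteresis behavior.
z, dt = 0.0, 1e-4
samples = []
for k in range(20000):
    v = 0.05 * math.sin(2 * math.pi * 0.5 * k * dt)  # slow sinusoidal velocity
    z, f = lugre_step(z, v, dt)
    samples.append(f)
print(f"peak friction force: {max(samples):.3f}")
```

Because the friction force depends on the internal state z rather than on velocity alone, the model reproduces the hysteresis that makes tool face control difficult near velocity reversals.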

Figure 1. Control object model of stabilized platform.


Figure 2. Parameter update process of the DDPG algorithm.

θ_μ′ and θ_Q′ denote the network parameters of the target Actor and target Critic networks, respectively.


θ_μ represents the parameters of the online Actor network, θ_Q represents the parameters of the online Critic network, θ_μ′ denotes the parameters of the target Actor network, and θ_Q′ denotes the parameters of the target Critic network. Throughout the training process, the soft update method keeps the target network parameters changing gradually, which provides a consistent gradient target for the online networks and promotes stable convergence during training.
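The soft update described above can be sketched as Polyak averaging. The mixing factor TAU and the plain-list weights below are illustrative stand-ins for real network weight tensors.

```python
# Sketch of the soft (Polyak) update: target parameters track the online
# parameters with a small mixing factor tau, so the targets used for
# gradient computation change slowly. Plain lists stand in for tensors;
# TAU is an illustrative value, not the paper's hyperparameter.
TAU = 0.005

def soft_update(target_params, online_params, tau=TAU):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]

online = [0.8, -1.2, 3.0]    # e.g. current online Actor weights
target = [0.0, 0.0, 0.0]     # target Actor weights, initially stale
for _ in range(1000):        # repeated updates pull the target toward online
    target = soft_update(target, online)
print([round(w, 3) for w in target])
```

With tau small, each update moves the target only 0.5% of the way toward the online weights, which is what keeps the critic's bootstrapped targets from shifting abruptly between gradient steps.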

Figure 3. Stabilized platform attitude control system.


Figure 4. (a) Actor network structure and (b) Critic network structure.


Figure 5. Structure diagram of the stabilized platform control system enhanced with a disturbance observer.


Figure 6. Principle block diagram of disturbance observer.

G_n^{-1}(s) denotes the inverse of the nominal model; Q(s) is the low-pass filter; d signifies the disturbance; d̂ represents the estimated value of the disturbance d; and ξ accounts for measurement noise.
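A minimal discrete-time sketch of this observer structure may help. It assumes a first-order nominal plant and a first-order Q filter; the plant parameters A and B, the filter time constant TAU_Q, and the helper run are illustrative placeholders, not the platform model identified in the paper.

```python
# Discrete-time sketch of the disturbance observer above for a first-order
# nominal plant y_dot = -A*y + B*(u + d). The inverse nominal model
# recovers (u + d) from the measured output, and a first-order low-pass
# filter Q(s) = 1 / (TAU_Q*s + 1) smooths the differentiation.
# All values are illustrative assumptions, not the paper's plant.
A, B = 2.0, 1.0        # nominal plant parameters (assumed)
TAU_Q = 0.02           # Q filter time constant (assumed)
DT = 1e-3

def run(d_true, t_end=1.0):
    y = d_hat = 0.0
    for _ in range(int(t_end / DT)):
        u = 1.0 - d_hat                        # control input minus disturbance estimate
        y_next = y + DT * (-A * y + B * (u + d_true))  # plant with true disturbance
        ydot = (y_next - y) / DT               # finite-difference derivative of output
        raw = (ydot + A * y) / B - u           # inverse nominal model: recovers ~d
        d_hat += (DT / TAU_Q) * (raw - d_hat)  # first-order low-pass Q filter
        y = y_next
    return d_hat

print(f"estimated disturbance: {run(d_true=0.3):.3f}")
```

The estimate converges to the true constant disturbance with the time constant of Q(s); in practice TAU_Q trades off disturbance-tracking speed against sensitivity to the measurement noise ξ.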


Figure 7. Simplified block diagram of disturbance observer through equivalent transformation.


Figure 8. (a) Tool face step response of control system with frictional disturbances and (b) tool face step error of control system with frictional disturbances.
The friction disturbance was modeled as LuGre friction. Figure 9 shows the comparative experimental results of the PID, PID_DOB, DDPG, and DDPG_DOB control methods.


Figure 9. (a) Tool face tracking response of control system with frictional disturbances and (b) tool face tracking error of control system with frictional disturbances.


Figure 10. (a) Tool face tracking response and (b) tracking error when J and R_a are increased by 50%; (c) tool face tracking response and (d) tracking error when J and R_a are decreased by 50%; (e) tool face tracking response and (f) tracking error with 4 times the amplitude of F_n.

Table 4. Comparison of system performance indicators.


Table 5. Control system maximum error.
