Deep Reinforcement Learning-Based Robotic Puncturing Path Planning of Flexible Needle

Lin, Jun; Huang, Zhiqiang; Zhu, Tengliang; Leng, Jiewu; Huang, Kai

doi:10.3390/pr12122852

Open AccessArticle

Deep Reinforcement Learning-Based Robotic Puncturing Path Planning of Flexible Needle

by

Jun Lin

^1,2

,

Zhiqiang Huang

^3,4,

Tengliang Zhu

^3,4,

Jiewu Leng

^3,4,*

and

Kai Huang

^1,*

¹

School of Biomedical Engineering, Sun Yat-Sen University, Guangzhou 510275, China

²

China Electronic Product Reliability and Environmental Testing Research Institute, Guangzhou 510610, China

³

State Key Laboratory of Precision Electronic Manufacturing Technology and Equipment, Guangdong University of Technology, Guangzhou 510006, China

⁴

Guangdong Provincial Key Laboratory of Computer Integrated Manufacturing, Guangdong University of Technology, Guangzhou 510006, China

^*

Authors to whom correspondence should be addressed.

Processes 2024, 12(12), 2852; https://doi.org/10.3390/pr12122852

Submission received: 21 October 2024 / Revised: 14 November 2024 / Accepted: 11 December 2024 / Published: 12 December 2024

(This article belongs to the Special Issue Industry 4.0 and Industry 5.0: Simulators and Algorithms in Manufacturing Processes and Systems)

Download

Browse Figures

Versions Notes

Abstract

The path planning of flexible needles in robotic puncturing presents challenges such as limited model accuracy and poor real-time performance, which affect both efficiency and accuracy in complex medical scenarios. To address these issues, this paper proposes a deep reinforcement learning-based path planning method for flexible needles in robotic puncturing. Firstly, we introduce a unicycle model to describe needle motion and design a hierarchical model to simulate layered tissue interactions with the needle. The forces exerted by tissues at different positions on the flexible needle are considered, achieving a combination of kinematic and mechanical models. Secondly, a deep reinforcement learning framework is built, integrating obstacle avoidance and target attraction to optimize path planning. The design of state features, the action space, and the reward function is tailored to enhance the model’s decision-making capabilities. Moreover, we incorporate a retraction mechanism to bolster the system’s adaptability and robustness in the dynamic context of surgical procedures. Finally, laparotomy simulation results validate the proposed method’s effectiveness and generalizability, demonstrating its superiority over current state-of-the-art techniques in robotic puncturing.

Keywords:

deep reinforcement learning; flexible needle puncture; path planning; robotic puncturing; laparotomy

1. Introduction

In minimally invasive surgery, puncture needles are essential tools for sampling and injection therapies, primarily available in two types, namely rigid and flexible. Flexible needles, which can navigate around obstacles such as nerves, blood vessels, and bones, offer significant clinical benefits by reducing trauma and allowing for adaptable insertion paths. As a result, they have garnered considerable attention in medical engineering research [1]. Traditional path planning relies heavily on a physician’s expertise, which introduces variability. An automated and precise approach to path planning can reduce the physician’s workload, minimize surgical risks, and improve both the success and efficiency of procedures. This need becomes even more pressing with the integration of robotic systems in the operating room, as these systems must execute punctures with high precision and autonomy in real time under complex and dynamic conditions. Robust path planning is crucial for ensuring safe needle insertion while avoiding damage to sensitive anatomical structures, making accurate and adaptive planning vital for the success of such procedures.

The path planning of flexible needles includes rule-based, mathematical optimization, and simulation methods. Rule-based methods are simple to implement but struggle to handle complex situations [2]. Mathematical optimization methods can find the global optimum in complex environments but require accurate models and a large number of computations, and they are not suitable for real-time computation. Simulation methods are used for algorithm training and validation, but they may differ from the actual surgical environment, affecting accuracy. With the development of artificial intelligence (AI) algorithms, reinforcement learning (RL) methods have been proposed for path planning. However, they face the problems of the curse of dimensionality, poor convergence, and poor generalization, although no data need to be provided in advance [3].

This paper presents an unicycle model [4] as the kinematic basis for the flexible needle, designs a hierarchical model for human tissue representation, combines the kinematic and mechanical models, and finally proposes a deep reinforcement learning-based flexible needle path planning method in robotic puncturing. By setting the obstacle repulsion and endpoint gravitational model, the reinforcement learning model is designed. The state features, action space, and reward function are defined. The retraction action is designed to improve the efficiency of path planning and the adaptability and robustness in a complex environment. The DQN algorithm is designed for agent training. The effectiveness and generalization of the method are verified by parameter sensitivity analysis and simulation experiments.

2. Related Work

There are various path planning methods for flexible needle puncturing, among which the rapidly exploring random tree (RRT) and its improved versions have been widely studied. Zhao et al. [5] proposed the Reachability and Greedy Heuristic-Guided RRT algorithm (RGHG-RRT), which improves the computational speed and searching capability. Xiong et al. [6] presented an improved RRT algorithm (I-RRT), which solves the path continuity and physical constraint issues. Hon et al. [7] introduced the RRT algorithm to simulate the real puncture environment by considering anatomical hindrances and 3D scenes. Zhang et al. [8] further improved the RRT algorithm to optimize the path planning by 3D coordinate transformation and artificial potential field cost.

Cai et al. [9] proposed a simplified particle swarm optimization (PSO) algorithm for the path planning of flexible needle puncturing. This approach adjusts the center angle of the arc and the needle’s rotation angle to make the path conform to a circular arc kinematic model. It easily adapts to changes in boundary conditions, allowing for simple and efficient path planning, with errors controlled to the millimeter level. Tan et al. [10] proposed an improved PSO algorithm with an adaptive parameter adjustment mechanism to establish a needle motion model and a three-dimensional space puncture path transformation model through a multi-objective function to optimize the path length and collision detection. The needle motion model and the transformation model of the 3D spatial puncture path were established. The path length, error, and collision detection were optimized by a multi-objective function to achieve a faster planning speed and higher accuracy. Tan et al. [11] further proposed a bee foraging learning-based PSO algorithm (BFL-PSO) and a 3D needle retraction strategy. This approach shortens the puncture distance, improves accuracy, and significantly reduces both paths and errors, demonstrating good adaptability to multi-objective path planning.

With the rapid development of AI algorithms, Tan et al. [12], Lee et al. [13], and Hu et al. [14] proposed path planning methods based on the Markov decision process (MDP), deep Q-network (DQN), and double deep Q-network (DDQN), respectively. These approaches aim to improve penetration path planning for flexible needles in 2D and 3D environments while enhancing path optimization, safety, and robustness.

Compared with traditional path planning methods, deep reinforcement learning methods have the following metrics:

Dealing with high-dimensional state and action spaces;
Automatically extracting features;
Learning complex policies;
Improving environment perception and decision-making capabilities;
Better interpretability.

However, there are several challenges with this approach, such as enhancing goal planning and exploration capabilities.

3. Path Planning of Flexible Needle in Robotic Puncturing

This section describes and transforms the flexible needle path planning in robotic puncturing, including corresponding kinematic and mathematical models.

3.1. Definition of the Internal Human Environment

The surgical environment definition describes the physical parameters and constraints in path planning within the internal human environment. In laparotomy, the surgical environment definition mainly includes the human body space scale size, obstacle situation, lesion target location, and tissue layering model.

The length, width, and height of the abdomen of an adult is about 55–75 cm, 20–30 cm, and 15–25 cm, respectively. However, the exact values of surgery vary according to individual height and body size [15]. The size of the organs involved in abdominal puncture is usually 10~15 cm. The length of the space can be appropriately reduced when setting the space to improve the learning efficiency of the agent, while the width and height of the space can be appropriately increased so that the agent can fully utilize its exploration ability.

3.1.1. Obstacle Situation

Obstacle situations refer to potential obstacles that may be encountered in the puncture path, such as other organs, blood vessels, and similar structures. These obstacles may block or interfere with the puncture path and affect the smooth operation. Therefore, the location, distribution, and shape of the obstacles need to be considered when defining the environment, as irregularly shaped obstacles, such as curved blood vessels or non-uniform tissue structures, can complicate the boundaries and increase the difficulty of path planning. Meanwhile, a reasonable reward function needs to be designed to encourage the agent to avoid these obstacles and find the optimal path, taking into account the complexity introduced by their shapes.

3.1.2. Lesion Target Location

The lesion target location pinpoints the precise coordinates, dimensions, and contours of the lesion area within the abdominal region, with its precise definition being pivotal to the agent’s decision-making process. Consequently, the establishment of a well-calibrated reward function is essential to direct the agent towards the target area effectively.

In this paper, to facilitate a comprehensive exploration of the internal anatomy of the human body, the lesion target is delineated as a spherical point situated at the base of the abdominal cavity. Furthermore, the surrounding region is also designated as part of the lesion target location. This strategic configuration enables the agent to execute puncturing or sampling procedures with enhanced efficacy, ensuring that the agent’s actions are not only targeted but also optimally positioned for maximum impact.

3.1.3. Hierarchical Model of the Human Body

To achieve a higher fidelity in simulating real-world scenarios, this paper introduces a streamlined layered model of the human body, encompassing the skin, subcutaneous tissue and muscle, and nerve tissue layers. Each stratum is designed to represent tissues of analogous density, thereby capturing the fundamental characteristics of tissue interaction during the flexible needle puncture procedure.

Given the variance in density across these layers, the flexible needle’s trajectory will exhibit distinct motion radii as it navigates through each stratum. This nuanced approach allows for a more realistic portrayal of needle behavior, accounting for the physiological nuances of the human body’s layered structure.

3.2. Modeling of the Flexible Needle Puncture Problem

In laparotomy, the prime goal is to ensure that the puncture needle reaches the target location accurately. This requires precise positioning of the patient, precise control of the procedure, and protection of the surrounding tissues [16]. An algorithm that computes an optimal path for the puncture needle should be designed to reach the target location safely and accurately, given the structure of the patient’s abdomen and the target location.

3.2.1. Kinematic Modeling

During the puncture process, the motion trajectory produced by the flexible needle is influenced by the force interaction between the needle and the tissue. Currently, related kinematic models have been proposed, including the unicycle model [4], bicycle model [17], cantilever beam model [18], and spring–beam–damper model [19]. To accurately describe the flexible needle deflection with fewer parameters to reduce the decision variables in the reinforcement learning model and thus improve the learning efficiency [20], this paper introduces the unicycle model to describe the trajectory of the flexible needle.

In this unicycle model, it is defined that the flexible needle is rigidly attached in the axial direction and that the tissue to be punctured is isotropic, homogeneous, and non-deforming. As shown in Figure 1, after the tip of the needle is inserted into human tissue, the tip is subjected to pressure perpendicular to the beveled surface of the needle tip due to the compression of the tissue. The asymmetric forces on the beveled surface and the back of the needle tip change the trajectory of the needle. Meanwhile, the radius of the needle’s trajectory remains constant, and the insertion speed has little effect on the curvature of the flexible needle.

In a three-dimensional human body space, the trajectory of the needle is shown in Figure 2. Assume that the forward speed of the needle is denoted as ν and the rotational speed of the needle in the axial direction is denoted as ω. The local coordinate system of the needle tip is denoted as

Φ_{p}

and the world coordinate system is denoted as

Φ_{w}

. The position and state of the needle tip in the world coordinate system are represented as

\begin{matrix} L_{w} = {[x, y, z, α, β, γ]}^{T} \end{matrix}

(1)

where α, β, and γ are the rotation angles around the x, y, and z axes of the world coordinate system so that the velocity of the needle in the world coordinate system can be expressed as

\begin{matrix} V_{w} = L_{w}^{'} \end{matrix}

(2)

The trajectory of a flexible needle with a slanted tip has a constant radius in three dimensions. The velocity of the tip in the local coordinate system

Φ_{p}

is formulated as

\begin{matrix} V_{p} = v \cdot {[0, 0, 1, k, 0, 0]}^{T} + w \cdot {[0, 0, 0, 0, 0, 1]}^{T} \end{matrix}

(3)

The relationship between the local coordinate system of the needle tip and the world coordinate system can be expressed by a velocity transformation equation.

\begin{matrix} V_{p} = Q V_{w} \end{matrix}

(4)

where Q is the transformation matrix between the local coordinate system of the needle tip and the world coordinate system. According to this velocity transformation equation, the kinematic model of the needle can be obtained as

\begin{matrix} \dot{L} = [\begin{matrix} \dot{x} \\ \dot{y} \\ \dot{z} \\ \dot{α} \\ \dot{β} \\ \dot{γ} \end{matrix}] = [\begin{matrix} s i n β & 0 \\ - c o s β s i n α & 0 \\ c o s β c o s α & 0 \\ k c o s γ s e c β & 0 \\ k s i n γ & 0 \\ - k c o s γ t a n β & 1 \end{matrix}] \cdot [\begin{matrix} v \\ w \end{matrix}] \end{matrix}

(5)

Since the puncture trajectory of the flexible needle is independent of the puncture speed and rotation angle, the time variable can be separated from the control variables and the system model can be transformed into

\begin{matrix} \dot{L} = [\begin{matrix} s i n β & 0 \\ - c o s β s i n α & 0 \\ c o s β c o s α & 0 \\ k c o s γ s e c β & 0 \\ k s i n γ & 0 \\ - k c o s γ t a n β & 1 \end{matrix}] \cdot [\begin{matrix} s \\ θ \end{matrix}] \end{matrix}

(6)

In Equation (6), the tip advancement distance s and the rotation angle θ can be used as the control variables of the tip movement. With the stop-and-go strategy, the puncture trajectory consists of n segments of paths in

\{[s_{1}, θ_{1}] \cdot [s_{2}, θ_{2}], \dots, [s_{n}, θ_{n}]\}

, where

[s_{1}, θ_{1}]

denotes the advancement distance of the first segment of the puncture path and the rotation angle.

3.2.2. Mathematical Path Planning Model

To evaluate the planned path and find the optimal solution, a mathematical model is established based on this planning process. Table 1 provides an overview of the parameters, which define path information, obstacle information, and other related information.

The objective function is two-fold. The first aspect is defined to consider the path length, the deviation between the target point and the endpoint of the path, the collision obstacle indication, and the planning time. The optimal path is obtained based on computing the sum by multiplying it with the product of the length coefficient, the accuracy coefficient, the obstacle avoidance coefficient, and the time coefficient. The first objective function is formulated as

\begin{matrix} \begin{matrix} F = m i n {f (P_{1}), f (P_{2}), \dots, f (P_{n})\} \\ w h i c h f (P_{i}) = α D_{i} + β L_{i} + γ \frac{1}{d_{i, j}} + λ T_{i} \end{matrix} \end{matrix}

(7)

The second aspect of the objective function is to directly find the path that minimizes the punishment (i.e., maximizes the reward) based on the results of reinforcement learning with a reasonable reward function setting, and the second objective function is formulated as

\begin{matrix} R = m a x {R_{1}, R_{2}, \dots, R_{n}\} \end{matrix}

(8)

s . t . : R_{i} = \{\begin{matrix} 0 \\ R_{i} \end{matrix} \begin{matrix} J_{i, j} = 1 \\ J_{i, j} = 0 \end{matrix}, \forall i, j

(9)

\begin{matrix} D_{i} < d, \forall i \end{matrix}

(10)

l_{i} \leq x, \forall i

(11)

α + β + γ + λ = 1

(12)

Constraint (9) indicates whether the flexible needle collides with obstacles. And if no collision occurs, then the normal calculation is rewarded. Constraint (10) indicates that the end position of the path is considered to have reached the endpoint when the distance from the target point position is d. Constraint (11) indicates that the number of arc segments of the path should not be more than x. It restricts the number of rotations of the flexible needle and filters out the excessively long paths that are not in line with the optimal characteristics to improve search efficiency.

4. Deep Reinforcement Learning-Based Path Planning

This section describes the overall solution of deep reinforcement learning-based flexible needle path planning in robotic puncturing. The implementation framework, key technologies, and training processes are involved.

4.1. Framework of DRL-Based Path Planning

Figure 3 shows an end-to-end deep reinforcement learning-based robotic puncturing path planning of a flexible needle [21], which mainly includes the steps of environment perception, state representation, action space definition, reinforcement learning modeling, agent–environment interaction, path planning and execution, and result evaluation and feedback.

Firstly, sensors are used to obtain relevant information in the surgical environment, including tissue structure and the position of the flexible needle. These data are used to construct the environment state for the agent to make decisions. The environmental data are converted into states that can be understood by the deep reinforcement learning model, including the position, angle, speed, and obstacle information of the needle tip, and defines the space of actions that can be performed by the agent, such as the movement and rotation of the needle.

Secondly, a neural network is used to construct the reinforcement learning model, including the policy network and value network [22]. The policy network generates the action strategy, and the value network estimates the state value and gives the action value. The agent selects the action according to the current state and policy, interacts with the simulated or actual surgical environment, obtains new state and reward information, and updates the model parameters. Through continuous training and optimization, the agent gradually learns how to select the optimal action and find a safe, accurate, and effective path.

Finally, the path generated by the agent is evaluated to determine if it meets the preset metrics for safety and efficiency. If the path does not meet the requirements, feedback is provided to the reinforcement learning model for further training and optimization to improve the accuracy and reliability of path planning.

4.2. Reinforcement Learning Modeling

After defining the human body environment, the next step is to establish the reinforcement learning model for the flexible needle path planning, which includes state features, the action space, and the reward function.

4.2.1. State Features

State features are crucial to perceive the environment in reinforcement learning, which directly impacts learning efficiency and performance. In the path planning of flexible needle puncturing (PPFNP), state features are primarily defined based on the following aspects:

Position information: this includes the absolute and relative position of the needle tip and the degree of bending of the needle to ensure that the needle can accurately reach the target position;
Tissue information: this includes the density and hardness information of the surrounding tissues to ensure that no important tissues are damaged during the puncture process;
Obstacle information: this identifies the location and size of obstacles (e.g., other organs, blood vessels) in the path to avoid collision.

The following principles need to be followed when designing state features:

Able to describe the features and variations in the puncture environment, including global and local features;
Applicable to puncture problems in different environments, contained within the state space;
Provides a numerical representation of the information about the agent to enable the description of different problems in a uniform manner through formalization.

The puncturing path planning process requires a comprehensive consideration of state features such as needle position, posture, and the number of agent decisions, as shown in Table 2.

In Table 2,

(x_{0}, y_{0}, z_{0}), (x_{1}, y_{1}, z_{1}), (x_{2}, y_{2}, z_{2})

represent the position coordinates of the initial point, obstacle, and target point in the 3D space, respectively. Therefore, a total of 12 state variables are used in this paper to represent the state features of the PPFNP problem.

\begin{matrix} S = \{x, y, z, d x, d y, d z, d_{c} t o, d_{c} t t, d_{o} t t, d_{c} t s, n, s t e p\} \end{matrix}

(13)

Agents can make action decisions by sensing these features during interaction with the environment.

4.2.2. Action Space

The action space delineates the comprehensive array of actions that an intelligent agent is capable of executing within its operational environment. In the context of flexible needle path planning, the action space is meticulously crafted around two pivotal dimensions.

Feed action: This aspect pertains to the depth of needle insertion. By meticulously adjusting the feed length, the agent can effectively alter the needle tip’s position without necessitating a change in the needle’s base position. This capability is crucial for navigating through the intricacies of the tissue with precision.

Rotation action: Given the complexity of tissue structures and the imperative to circumvent obstacles, the agent must possess the agility to manipulate the needle’s rotation. This is adeptly managed by defining both the rotation angle and the axis of rotation, thereby enabling the agent to navigate around obstacles and adapt to the tissue’s contours.

Consequently, this paper establishes the action space for the flexible needle as a strategic amalgamation of these two critical actions, providing the agent with the requisite flexibility and control to effectively plan and execute needle paths. It can be formulated as follows:

\begin{matrix} A = \{s, θ\} \end{matrix}

(14)

4.2.3. Reward Function

Rewards in our system are meticulously divided into two distinct categories, the single-step reward and multiple rewards. Single-step rewards are promptly dispensed following each individual action, serving as an immediate reflection of the quality of the action taken. Conversely, multiple rewards are allocated upon the task’s completion, reflecting an assessment of the overall task execution. In addressing the PPFNP challenge, the design of the reward function demands a thorough consideration of both the task’s objectives and its constraints. This paper introduces an innovative composite reward structure, prioritizing single-step rewards to drive immediate quality action while incorporating multiple rewards to provide a holistic evaluation and to ensure that the agent strategically plans the optimal path.

The single-step reward structure encompasses the following three key components:

Tissue damage: negative rewards are imposed for any tissue damage incurred during needle insertions or rotations, thereby discouraging harmful maneuvers.
Proximity to target: positive rewards are awarded as the needle draws nearer to the target, incentivizing accurate navigation.
Distance from obstacles: negative rewards are imposed when the agent ventures too near an obstacle, promoting obstacle avoidance.

The round rewards, on the other hand, consist of the following:

Success rate: positive rewards are granted for successfully reaching the target location, reinforcing successful completion of the task.
Obstacle avoidance: negative rewards are applied for any collisions with obstacles or boundary infringements within the exploration space, emphasizing the importance of safe navigation.
Path efficiency: negative rewards are given for reaching the target via a convoluted path, encouraging the adoption of more direct routes.

By judiciously combining single-step and round rewards, the agent receives clear feedback after each action and at the end of each round. This feedback loop facilitates the gradual refinement of path planning behaviors, leading to the mastery of a more efficient, stable, and robust path planning strategy. To effectively tackle the PPFNP problem, this paper has meticulously defined specific reward functions, as detailed in Table 3.

The composite reward functions for the PPFNP contain a total of two kinds of positive rewards and six kinds of negative rewards. Regarding

{r_{-} n}_{-} o

and

{r_{-} n}_{-} t

, this paper also designs an obstacle repulsion model and an endpoint gravitational model, which are shown in Figure 4, respectively.

As depicted in Figure 4, the vicinity of the obstacle is demarcated into two distinct regions, a no-repulsion zone and a repulsion zone. The no-repulsion zone signifies an area where the agent remains unaffected by the obstacle’s repulsive force. Conversely, within the repulsion zone, the agent is subjected to a forceful repulsion emanating from the obstacle as it ventures into this space. The agent’s movement trajectory, as delineated in the figure, is influenced by a dual force dynamic, namely the repulsive force exerted by the obstacle and the gravitational pull of the target point, particularly when the agent is within the repulsion zone. Furthermore, the intensity of the repulsive force is not uniform throughout the repulsion zone; it escalates as the agent approaches closer to the obstacle. The direction of this force is consistently radially outward from the obstacle’s center, directed towards the agent.

As depicted in Figure 4, the vicinity of the target point is demarcated into two distinct zones, the weak gravity region and the strong gravity region. The strong gravity region, which is in closer proximity to the target point, exerts a more potent gravitational pull on the agent, effectively encouraging a swift and direct approach towards the target. Conversely, the weak gravity region, situated at a greater distance from the target point, applies a less intense gravitational force. This subtler influence fosters a more comprehensive exploration of the surrounding environment.

In both regions, the gravitational force escalates as the agent progresses along a direct path towards the target point, thereby discouraging aimless wandering. The gravitational force is consistently oriented from the agent towards the target point, ensuring that the agent’s trajectory is both goal-oriented and efficient. This strategic distribution of gravitational forces facilitates a balance between rapid target acquisition and thorough environmental scanning.

In order to reduce the useless exploration of the flexible needle agent and make it reach the target position as soon as possible, this paper adds

d_{-} j

in

{r_{-} n}_{-} t

, which indicates whether the current action is close to the target position or not, and

d_{-} j

is formulated as

\begin{matrix} d_{-} j = d_{-} c t t - (|x - x_{2}| + |y - y_{2}| + |z - z_{2}|) \end{matrix}

(15)

When the result of

d_{-} j

is greater than 0, it means that the current action of the agent is close to the target position and should be rewarded positively. When the result of

d_{-} j

is less than 0, it means that the current action of the agent is far from the target position and should be rewarded negatively.

Once the single-step and multiple rewards have been meticulously established, it becomes imperative to categorize the planned path and compute the cumulative reward. This cumulative reward is a reflection of the total reward amassed by the agent throughout the learning journey. Given the medical surgical context of this problem, the path classification should prioritize precision and take into account several critical factors.

Endpoint deviation: the classification must ascertain whether the discrepancy between the path’s endpoint and the target point adheres to the stipulated accuracy standards.

Obstacle collision: it is essential to evaluate whether the path intersects with any critical obstacles, such as vital organs or blood vessels, which could compromise patient safety.

Path length: the classification should also consider whether the path’s length falls within an acceptable range, ensuring that it does not surpass the maximum permissible length.

Model space adherence: lastly, it is crucial to verify that the agent remains within the designated model space during the path planning process, maintaining procedural integrity.

By meticulously considering these factors, the path classification system can effectively gauge the quality of the agent’s learning outcomes, ensuring that the planned paths are not only accurate but also safe and efficient within the constraints of medical surgery. Based on these criteria, this paper defines the cumulative reward obtained at the end of path planning as follows:

Among the five types of rewards outlined in Table 4, all except the normal puncture process are designed to trigger the termination of path planning. However, positive reward feedback is exclusively granted when the agent successfully navigates to the target location. The agent’s learning is contingent upon the diverse reward feedback it encounters, which guides its behavior towards planning paths that yield higher cumulative rewards.

Through this feedback-driven learning process, the agent gradually refines its path planning strategies, prioritizing those that not only meet the necessary criteria but also maximize the total reward. This approach ensures that the agent’s learning is both goal-oriented and reward-driven, ultimately leading to the selection of optimal paths that align with the desired outcomes.

4.3. Algorithms for Agent Training

Having established the exploration environment and formulated the reinforcement learning model, the choice of an apt reinforcement learning algorithm becomes a pivotal decision. In the realm of flexible needle puncture, the agent navigates within a three-dimensional space and encounters a multitude of intricate states, rendering conventional Q-learning methods infeasible due to their complexity.

The deep Q-network (DQN) emerges as an ideal candidate for tackling problems with discrete action spaces. Given that the flexible needle’s maneuverability is confined to two primary actions—feeding and rotating—these actions align seamlessly with the strengths of a DQN. This alignment enables the agent to effectively leverage DQN’s capabilities to navigate the complex state space and optimize its path planning, ensuring that the learning process is both efficient and effective.

4.3.1. DQN Algorithm

Upon initializing the network, the flexible needle agent employs an ε-greedy strategy to select a puncture action. This strategy adeptly marries a greedy approach with random exploration, ensuring that the agent balances the exploitation of known optimal actions with the discovery of new, potentially more effective strategies.

Upon executing the chosen action, the agent garners state and reward data, which serve as the foundation for learning. A time-differentiated objective function is meticulously computed to refine the Q-value function, while a loss function is concurrently calculated to fine-tune the Q-network’s parameters. This process nudges the parameters closer to their true values, enhancing the network’s predictive accuracy.

Every C training steps, the value network’s weights are meticulously copied to the target network, facilitating a systematic update process. This iterative cycle continues until the network achieves convergence or the pre-established number of training rounds is reached, ensuring that the learning process is both thorough and exhaustive.

The pseudocode encapsulating the DQN algorithm’s operational framework is presented in Algorithm 1, offering a clear, step-by-step guide to the agent’s learning journey.

Algorithm 1. DQN Algorithm
	Hyper-Parameter: experience pool capacity $N$ , reward discount factor $γ$ , delay step for target state-action value function update $C, ε$ in the $ε - g r e e d y$
	Input: empty experience pool $D$ , Initialize the state-action value function Q function with parameters θ
	Initialize the target state-action value function $\hat{Q}$ using the parameter $\hat{θ} \leftarrow θ$
1:	for episode = 0, 1, 2, …… do
2:	Initialize the environment and acquire observations $O_{0}$
3:	Initialization sequence $S_{0} = {O_{0}}$ and preprocess the sequence $ϕ_{0} = ϕ (S_{0})$
4:	for t = 0, 1, 2, …… do
5:	Select a random action $A_{t}$ by probability $ε$ , otherwise select the action: $A_{t} = a r g {m a x}_{a} Q (ϕ (S_{t}), a; θ)$
6:	Perform action $A_{t}$ and obtain observation data $O_{t + 1}$ and reward data $R_{t}$
7:	Set $D_{t} = 1$ if the council is over, otherwise $D_{t} = 0$
8:	Set $S_{t + 1} = {S_{t}, A_{t}, O_{t + 1}}$ and pre-processing $ϕ_{t + 1} = ϕ (S_{t + 1})$
9:	Store the state transfer data $(ϕ_{t}, A_{t}, R_{t}, D_{t}, ϕ_{t + 1})$ into D
10:	Random sampling of small batches of state transfer data from D $(ϕ_{i}, A_{i}, R_{i}, D_{i}, ϕ_{i}^{'})$
11:	If $D_{i} = 0$ , set $Y_{i} = R_{i} + γ {m a x}_{a^{'}} \hat{Q} (ϕ_{i}^{'}, a^{'}; \hat{θ})$ , otherwise, set $Y_{i} = R_{i}$
12:	Perform a gradient descent step on ${(Y_{i} - Q (ϕ_{i}, A_{i}; θ))}^{2}$ for θ
13:	Synchronize the target network $\hat{Q}$ every C steps
14:	Jump out of the loop if the segment ends
15:	end for
16:	end for

4.3.2. Training Process

As shown in Figure 5, a flowchart of the DQN algorithm is shown, and the specific steps for training a flexible needle intelligence with this algorithm are as follows:

First, to train the flexible needle agent, start by setting the following essential parameters:

MEMORY_SIZE: the capacity of the experience pool, storing more data as the size increases.
BATCH_SIZE: the number of experiences sampled for gradient computation.
MAX_EPISODES: the limit on the episodes per training cycle, with each episode consisting of a series of actions.
LEARNING_RATE: the step size for weight updates during learning.
Q_UPDATE_FREQ: the frequency at which the target network’s parameters are updated from the prediction network.
INITIAL_PARAMETERS: initialize the Q network’s parameters (θ), the target network’s parameters (θ^), and the reward discount factor (γ).

The agent performs the following processes in each puncture action decision stage:

Step 1: initialize the puncture environment to the original state, including the starting puncture point information, lesion target information, organ obstacle information, and flexible needle movement information.

Step 2: observe the current state features of the environment and the agent.

Step 3: use ε-greedy strategy to randomly select the puncture action a.

Step 4: The environment updates the state and feeds back rewards according to the action of the flexible needle agent. When a = 0, it indicates that the agent makes a needle-piercing movement and advances a distance along the current movement direction. When a = 1, it means that the agent performs a rotational movement, rotating a certain angle along the axial direction of the needle body to change the current movement direction of the needle.

Step 5: record the experience data

(s_{t}, a_{t}, r_{t}, s_{t + 1})

and save them to the experience pool D.

Step 6: Randomly select BATCH_SIZE bars of empirical data in the experience pool D. Use the target network

\hat{Q}

to calculate the target value

{\hat{y}}_{i}

.

\begin{matrix} {\hat{y}}_{i} = \{\begin{matrix} r_{i}, s_{i + 1} end state \\ r_{i} + γ \hat{Q} (s_{i + 1}, a r g \underset{a}{m a x} \hat{Q} (s_{i + 1}, a; \hat{θ}); \hat{θ}), Other statuses \end{matrix} \end{matrix}

(16)

Step 7: use the value network to calculate the predicted value

y_{i}

.

\begin{matrix} y_{i} = Q (s_{i}, a_{i}; θ) \end{matrix}

(17)

Step 8: the Q network parameters θ are updated using

{({\hat{y}}_{i} - y_{i})}^{2}

as the loss function, and gradient descent is performed on θ.

Step 9: the parameter θ of the Q network is assigned to the target network

\hat{Q}

every C steps.

Step 10: Determine whether the current turn is over or not using judgment criteria according to the reward function path planning termination condition judgment. If the termination condition is not reached, return to Step 2.

Step 11: determine whether the maximum number of cycles MAX_EPISODE has been reached—if not, return to Step 1 and continue to execute the next steps; otherwise, the cycle ends.

5. Simulation Experiment Results

The above tests for training algorithms were written based on Python 3.9.7. The hardware conditions for the experiments were a host computer with an Intel(R) Core(TM) i5-10400F CPU at 2.90 GHz and a personal computer with NVIDIA GeForce GTX 1650 4 GB. The parameter settings for constructing the experimental environment are detailed in Table 5. Based on these settings, the agent puncture environment was constructed. After setting up the experimental environment, the flexible needle agent underwent training, with training parameters specified in Table 6. Once the experimental environment was established and the training parameters were set, a sensitivity analysis of the relevant parameters was initiated.

5.1. Sensitivity Analysis

To identify the relevant parameters that can optimize the performance of the algorithm, this paper first conducts a parameter sensitivity analysis of the algorithm and compares and tests the key parameters that will affect the performance of the deep reinforcement learning algorithm, such as the learning rate [23], the discount factor [24], the batch size [25], and the target Q-network updating frequency [26]. The relevant parameters involved in the parameter sensitivity comparison experiment are shown below.

(1): Learning rate

Experiments varying the learning rate to optimize reward convergence are detailed in Figure 6a. A rate of 0.05 offers stable but lower rewards, possibly indicating a local optimum. At 0.001, rewards are higher but more volatile. Analysis suggests a learning rate of 0.005 strikes a balance, providing stable, higher rewards with less fluctuation, making it the optimal choice.

(2): Discount factor

This study assesses reward convergence across different discount factors to balance immediate and long-term rewards, as shown in Figure 6b. A discount factor of 0.9 starts with fluctuating rewards but eventually stabilizes at a high level, indicating that the agent initially struggles to learn the optimal puncture strategy but improves over time. Thus, a discount factor of 0.9 is found to be suitable for achieving stable total rewards.

(3): Batch size

The batch size impacts the agent’s learning efficiency. Experiments, as seen in Figure 6c, show that small batch sizes lead to unstable reward curves with high volatility, while large ones lead to lower rewards due to suboptimal learning. Batch sizes of 64 and 128 provide stable, high-reward curves, indicating superior agent performance. Other sizes result in erratic curves and variable strategy effectiveness.

(4): Update frequency of the target Q-network

A balanced update frequency is key to algorithm stability. This study examines reward acquisition across different frequencies, as shown in Figure 6d. Low frequencies lead to slow convergence, while high frequencies cause volatility. Findings indicate that an update frequency of 40 for the target Q-network stabilizes the reward curve, suggesting it is optimal for the agent to find effective puncture strategies.

5.2. DQN-Based Simulation Experiments for the PPFNP Problem

Once the reinforcement learning algorithm’s parameters were set, we executed targeted experiments for the PPFNP problem. In this setup, the human body’s internal environment is static, featuring a single target and obstacles to be circumvented during the 3D puncture from the starting point. The needle pauses and re-plans its path upon encountering boundaries or obstacles, persisting until it successfully hits the target.

To address this challenge, we simulated the human body’s interior, mapped obstacles and lesion targets, and developed a reinforcement learning model incorporating states, actions, and rewards. The DQN algorithm was selected to train the agent, with the results shown in Figure 7. By integrating digital twin technology, we created a virtual replica of the patient’s anatomy that updates in real time, allowing for dynamic simulations of the puncture process. This digital twin model enhances path planning accuracy by continuously reflecting the patient’s internal changes and providing real-time feedback for optimal decision-making.

Figure 7 illustrates the escalating reward trend in the flexible needle agent with increasing training rounds, signifying the agent’s improving performance and its ability to discover more rewarding puncture paths. Figure 7 offers a 3D view of the agent’s trajectory, where the red dot is the target, the larger spheres are obstacles, and the blue dots trace the agent’s planned path, with each dot representing a step in the agent’s decision-making process. The figure demonstrates the agent’s adept navigation around obstacles to reach the target. By integrating digital twin technology into this process, the system gains the ability to simulate complex, real-time scenarios more accurately, ensuring that the agent’s path is continuously optimized to account for patient-specific anatomical variations and dynamic changes.

To assess the DQN-based method’s efficacy and versatility in flexible needle puncture path planning, we performed experiments in simulated environments, comparing various methods across different scenarios. Our focus was on two primary scenarios.

5.2.1. Different Space Dimensions

Variations in individual characteristics, such as age and body shape, lead to variations in the size of abdominal space environments. To demonstrate the algorithm’s adaptability across these varying conditions, it is essential to evaluate its performance under different environmental space sizes. To this end, we conducted experiments with three distinct space dimensions, namely (30, 30, 30), (50, 50, 50), and (70, 70, 70), which represent smaller, medium, and larger abdominal space ranges, respectively (with values scaled up to ensure complete coverage of the actual space sizes). The outcomes of these experiments are depicted in Figure 7, showcasing the algorithm’s effectiveness across a spectrum of environmental conditions.

Figure 8 shows the algorithm’s consistent performance across varying abdominal space sizes. As space size grows, the algorithm’s execution remains smooth, with clear reward convergence. Despite facing challenges in smaller spaces, stability and convergence are maintained, with rewards being lower but stable. In larger spaces, higher rewards are achieved due to increased operational freedom. This trend confirms the algorithm’s adaptability and expected performance across different environments.

5.2.2. Number of Different Intelligences

In prior tests, the algorithm was found to have unused computational capacity. To optimize the computer’s performance, we raised the number of flexible needle agents per training round. This increase not only speeds up training but also bolsters learning stability, as agents learn autonomously without one’s failure affecting others. We thus experimented with setups involving one, three, and five agents, with results depicted in Figure 9.

Figure 9 presents three curves, each representing training with one, three, or five agents, and all exhibit similar upward trajectories. This indicates that the core learning pattern of the algorithm remains consistent regardless of the number of agents involved. The improvement in performance over time is uniform across setups, suggesting that the computational resources are effectively harnessed and learning efficiency is comparable. The curves, though subject to minor oscillations, demonstrate a commendable level of stability, highlighting the algorithm’s resilience and consistent performance across varying agent configurations.

5.2.3. Model Comparison

In the comparative experiment, we designed four control groups and conducted detailed performance evaluations for different models. The main objective of these control groups is to test the performance of each model in path planning experiments. We focused on calculating the average score, highest score, and lowest score of each model in the experiment to comprehensively evaluate the superiority and inferiority of the model. Table 7 shows the scores of each model on these indicators. By analyzing the data in the table, it is evident that the improved DRL model outperforms the original DQN model significantly in all indicators.

The improved model performs well on the average score, significantly higher than the traditional DQN, indicating that in most cases, the improved model can more effectively find paths and obtain higher returns. And the improved model also showed a significant improvement in the highest score of each round, indicating its superior performance in extreme situations and a better ability to handle complex path planning tasks. Finally, at the lowest score, the performance of the improved model is more robust, with a significantly higher minimum score than the traditional DQN, indicating that the improved model can maintain ideal performance even under the most unfavorable conditions rather than the possibility of extremely low scores or failures like the original DQN.

5.3. Discussion

This paper encapsulates the merits of employing deep reinforcement learning in the context of flexible needle puncture path planning, which are as follows:

Managing high-dimensional spaces: deep reinforcement learning autonomously distills essential features from raw data, adeptly handling complex, nonlinear interactions and alleviating the feature engineering burden.
Crafting sophisticated strategies: it surpasses conventional approaches by crafting more nuanced strategies, thereby boosting performance and efficiency.
Enhancing environmental awareness: it bolsters the agent’s environmental perception, aiding in collision avoidance and optimal path identification.
Adaptive strategy optimization: the dynamic strategy adjustment in response to performance feedback facilitates an ongoing refinement of goal-oriented planning and exploration.
Behavioral clarity: the agent’s behavior, shaped by environmental interactions, becomes more transparent and understandable, fostering usability and trust.

Looking ahead, future research directions are poised to include the following:

Addressing practical application challenges: Integrating path planning algorithms with medical equipment for precise navigation while ensuring needle stability, safety, and ease of use, presents significant challenges. Cross-disciplinary collaboration is crucial to merge these algorithms with practical applications, advancing flexible needle path planning in digital healthcare.
Incorporating dynamic needle characteristics: The kinematic model of a unicycle, which assumes quasi-static needle motion and neglects external forces, often deviates from reality due to factors like needle deformation and tissue friction. Future research could address these biases by integrating the needle’s dynamic properties into the model. Combining mechanical analysis with real-time feedback control—using force sensors, visual feedback, or other intelligent perception techniques—can improve path accuracy and system robustness. This fusion of data-driven and physical models holds promise for enhancing practical performance.
Integrating medical imaging: Existing medical imaging technologies, such as MRI, CT, and ultrasound, offer detailed anatomical insights that can be leveraged to guide needle path planning. Future work may focus on the seamless integration of these imaging modalities with real-time navigation systems, allowing for more accurate needle placement and adaptive path adjustments based on real-time visual feedback. This combination of imaging and path planning is key to advancing minimally invasive procedures and improving patient outcomes.
Insights from recent research: Recent studies on needle path planning have provided valuable insights into the optimization of puncture trajectories. One study highlights the use of machine learning to predict tissue properties and improve the adaptability of needle insertion paths. Another explores biomechanical modeling to simulate tissue deformation, helping refine path planning under different operational conditions [27,28,29]. These advancements suggest that combining predictive models with real-time sensor feedback could enhance the precision and reliability of needle navigation in complex medical scenarios.

6. Conclusions

This paper delves into the realm of flexible needle path planning, leveraging deep reinforcement learning to optimize puncture strategies. The needle is treated as an intelligent agent navigating a virtual human abdominal environment, aiming to refine its actions through reward-based feedback to ascertain the most effective sequence. The study commences by establishing a virtual environment that mimics the human body, detailing the spatial dimensions, positioning of obstacles, and the precise location of lesion targets. A tiered model is crafted to depict the trajectory radii of various needle segments. Following this, a reinforcement learning framework is formulated, encompassing 12 distinct states, two possible actions, and a comprehensive reward system consisting of eight single-step and five cumulative rewards to gauge the performance of the path. The deep Q-network (DQN) algorithm is then selected to train the agent, with a sensitivity analysis of parameters to identify the most favorable configuration. Experiments in path planning are conducted using these optimized parameters, evaluating the method’s efficacy and its ability to generalize across scenarios. The outcomes are encapsulated in three-dimensional path simulation diagrams, which vividly illustrate the approach’s success in addressing the static single-target flexible needle puncture path planning challenge.

Author Contributions

Conceptualization, J.L. (Jiewu Leng); methodology, J.L. (Jun Lin) and Z.H.; Software, Z.H. and T.Z.; resources, J.L. (Jun Lin) and K.H.; data curation, J.L. (Jun Lin), Z.H. and T.Z.; writing—original draft, T.Z.; supervision, J.L. (Jiewu Leng) and K.H.; project administration, K.H.; funding acquisition, J.L. (Jiewu Leng). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Planning Project of Guangdong Province of China under grant no. 2024A0505040024; the Outstanding Youth Fund of Guangdong Province under grant no. 2022B1515020006; and the Science and Technology Program of Guangzhou, China, under grant no. 2024A04J6301.

Data Availability Statement

The data that support the findings are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Wu, K.; Li, B.; Zhang, Y.; Dai, X. Review of research on path planning and control methods of flexible steerable needle puncture robot. Comput. Assist. Surg. 2022, 27, 91–112. [Google Scholar] [CrossRef]
Aggarwal, S.; Kumar, N. Path planning techniques for unmanned aerial vehicles: A review, solutions, and challenges. Comput. Commun. 2020, 149, 270–299. [Google Scholar] [CrossRef]
Yang, Y.; Li, J.; Peng, L. Multi-robot path planning based on a deep reinforcement learning DQN algorithm. CAAI Trans. Intell. Technol. 2020, 5, 177–183. [Google Scholar]
Momen, A.; Roesthuis, R.J.; Reilink, R.; Misra, S. Integrating deflection models and image feedback for real-time flexible needle steering. IEEE Trans. Robot. 2012, 29, 542–553. [Google Scholar]
Zhao, Y.J.; Joseph, F.O.M.; Yan, K.; Datla, N.V.; Zhang, Y.D.; Podder, T.K.; Hutapea, P.; Dicker, A.; Yu, Y. Path planning for robot-assisted active flexible needle using improved Rapidly-Exploring Random trees. In Proceedings of the 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Chicago, IL, USA, 26–30 August 2014; pp. 380–383. [Google Scholar]
Xiong, J.; Li, X.; Gan, Y.; Xia, Z. Path planning for flexible needle insertion system based on Improved Rapidly-Exploring Random Tree algorithm. In Proceedings of the 2015 IEEE International Conference on Information and Automation, Lijiang, China, 8–10 August 2015; pp. 1545–1550. [Google Scholar]
Hong, A.; Boehler, Q.; Moser, R.; Zemmar, A.; Stieglitz, L.; Nelson, B.J. 3D path planning for flexible needle steering in neurosurgery. Int. J. Med. Robot. Comput. Assist. Surg. 2019, 15, e1998. [Google Scholar]
Zhang, Y.; Ju, Z.; Zhang, H.; Qi, Z. 3-d path planning using improved RRT* algorithm for robot-assisted flexible needle insertion in multilayer tissues. IEEE Can. J. Electr. Comput. Eng. 2021, 45, 50–62. [Google Scholar]
Cai, C.; Sun, C.; Han, Y.; Zhang, Q. Clinical flexible needle puncture path planning based on particle swarm optimization. Comput. Methods Programs Biomed. 2020, 193, 105511. [Google Scholar]
Tan, Z.; Liang, H.G.; Zhang, D.; Wang, Q.G. Path planning of surgical needle: A new adaptive intelligent particle swarm optimization method. Trans. Inst. Meas. Control. 2021, 44, 766–774. [Google Scholar]
Tan, Z.; Zhang, D.; Liang, H.G.; Wang, Q.G.; Cai, W. A new path planning method for bevel-tip flexible needle insertion in 3D space with multiple targets and obstacles. Control. Theory Technol. 2022, 20, 525–535. [Google Scholar]
Tan, X.; Yu, P.; Lim, K.B.; Chui, C.K. Robust path planning for flexible needle insertion using Markov decision processes. Int. J. Comput. Assist. Radiol. Surg. 2018, 13, 1439–1451. [Google Scholar]
Lee, Y.; Tan, X.; Chng, C.B.; Chui, C.K. Simulation of robot-assisted flexible needle insertion using deep Q-network. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 342–346. [Google Scholar]
Hu, W.; Jiang, H.; Wang, M. Flexible needle puncture path planning for liver tumors based on deep reinforcement learning. Phys. Med. Biol. 2022, 67, 195008. [Google Scholar] [CrossRef]
Milovancev, M.; Townsend, K.L. Current Concepts in Minimally Invasive Surgery of the Abdomen. Vet. Clin. Small Anim. Pract. 2015, 45, 507–522. [Google Scholar] [CrossRef] [PubMed]
Huang, Y.; Yu, L.; Zhang, F. A survey on puncture models and path planning algorithms of bevel-tipped flexible needles. Heliyon 2024, 10, e25002. [Google Scholar] [CrossRef]
Webster, R.J., III; Kim, J.S.; Cowan, N.J.; Chirikjian, G.S.; Okamura, A.M. Nonholonomic modeling of needle steering. Int. J. Robot. Res. 2006, 25, 509–525. [Google Scholar] [CrossRef]
Abolhassani, N.; Patel, R. Deflection of a flexible needle during insertion into soft tissue. In Proceedings of the 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, New York, NY, USA, 30 August–3 September 2006; pp. 3858–3861. [Google Scholar]
Yan, K.G.; Podder, T.; Yu, Y.; Liu, T.I.; Cheng, C.W.; Ng, W.S. Flexible needle–tissue interaction modeling with depth-varying mean parameter: Preliminary study. IEEE Trans. Biomed. Eng. 2008, 56, 255–262. [Google Scholar] [CrossRef]
Płaskonka, J. The path following control of a unicycle based on the chained form of a kinematic model derived with respect to the Serret-Frenet frame. In Proceedings of the 2012 17th International Conference on Methods & Models in Automation & Robotics (MMAR), Miedzyzdroje, Poland, 27–30 August 2012; pp. 617–620. [Google Scholar]
Liu, F.; Guo, H.; Li, X.; Tang, R.; Ye, Y.; He, X. End-to-end deep reinforcement learning based recommendation with supervised embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining 2020, Houston, TX, USA, 3–7 February 2020; pp. 384–392. [Google Scholar]
Kriegeskorte, N.; Golan, T. Neural network models and deep learning. Curr. Biol. 2019, 29, R231–R236. [Google Scholar] [CrossRef]
Gulde, R.; Tuscher, M.; Csiszar, A.; Riedel, O.; Verl, A. Deep Reinforcement Learning using Cyclical Learning Rates. In Proceedings of the 2020 Third International Conference on Artificial Intelligence for Industries (AI4I), Irvine, CA, USA, 21–23 September 2020; p. 32. [Google Scholar]
Amit, R.; Meir, R.; Ciosek, K. Discount factor as a regularizer in reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, PMLR, Chendu, China, 18–20 October 2024; pp. 269–278. [Google Scholar]
Smith, S.L. Don’t decay the learning rate, increase the batch size. arXiv 2017, arXiv:1711.00489. [Google Scholar]
Wang, J.; Gou, L.; Shen, H.W.; Yang, H. Dqnviz: A visual analytics approach to understand deep q-networks. IEEE Trans. Vis. Comput. Graph. 2018, 25, 288–298. [Google Scholar] [CrossRef]
Mao, Z.; Kobayashi, R.; Nabae, H.; Suzumori, K. Multimodal Strain Sensing System for Shape Recognition of Tensegrity Structures by Combining Traditional Regression and Deep Learning Approaches. IEEE Robot. Autom. Lett. 2024, 9, 10050–10056. [Google Scholar] [CrossRef]
Mao, Z.; Hosoya, N.; Maeda, S. Flexible Electrohydrodynamic Fluid-Driven Valveless Water Pump via Immiscible Interface. Cyborg Bionic Syst. 2024, 5, 91. [Google Scholar] [CrossRef]
Peng, Y.; He, M.; Hu, F.; Mao, Z.; Huang, X.; Ding, J. Predictive modeling of flexible EHD pumps using Kolmogorov–Arnold Networks. Biomim. Intell. Robot. 2024, 4, 100184. [Google Scholar] [CrossRef]

Figure 1. The trajectory of flexible needle movement under the interaction force between the needle and tissue.

Figure 2. The design of a simplified human tissue layering model. This model constructs three layers of puncture spaces based on the density differences in internal tissues in the human body. Each layer of space represents a portion of tissues that are adjacent to each other in the internal space of the human body and have little density difference. When the puncture needle passes through these spaces, the radius of movement in each layer will be different due to the density difference.

Figure 3. Deep reinforcement learning-based robotic puncturing path planning of flexible needle.

Figure 4. Model of obstacle repulsion and lesion target gravity.

Figure 5. A flowchart of the DQN-based path planning of puncturing.

Figure 6. (a) Reward convergence curve for different learning rates; (b) reward convergence curve for different discount factors; (c) reward convergence curve for different batch sizes; (d) reward convergence curve for different target Q-network update frequencies.

Figure 7. Reward graph of the DQN-based PPFNP.

Figure 8. Reward convergence plot for different space sizes.

Figure 9. Reward convergence plots for different number of agents.

Table 1. Notations and implications in path planning model.

Notations		Implications
Path Information	$n$	Number of path planning
	$P = \{P_{1}, P_{2}, \dots, P_{n}\}$	Pathway set
	$P_{i}$	$i$ th path
	$L_{i}$	$Length of path P_{i}$
	$T_{i}$	$Elapsed time for path P_{i}$
	$l_{i}$	$Number of arc segments of path P_{i}$
	$D_{i}$	$Distance of endpoint of path P_{i}$ from target point
	$R_{i}$	$Total reward for path P_{i}$
Obstacle Information	$m$	Number of obstacles
	$O = (O_{1}, O_{2}, \dots, O_{m})$	Obstacle course
	$O_{j}$	$j$ th obstacle
	$R_{j}$	$Radius of obstacle O_{j}$
	$d_{i, j}$	$Distance of path P_{i}$ $from obstacle O_{j}$
	$J_{i, j} = {0, 1}$	$Whether path P_{i}$ $collides with obstacle O_{j}$
Other Information	α	Precision factor
	β	Length factor
	γ	Obstacle avoidance factor
	λ	Time factor

Table 2. State variables and features.

Notations	Implications
$x$	$Needle coordinates in x$ -axis direction
y	$Needle coordinates in y$ -axis direction
$z$	$Needle coordinates in z$ -axis direction
$d x = x_{2} - x$	$Projection of the straight - line distance of the needle from the target point in the x$ -axis direction
$d y = y_{2} - y$	$Projection of the straight - line distance of the needle from the target point in the y$ -axis direction
$d z = z_{2} - z$	$Projection of the straight - line distance of the needle from the target point in the z$ -axis direction
$d_{-} c t o = \|x - x_{0}\| + \|y - y_{0}\| + \| z - z_{0} \|$	Manhattan distance between the current position of the needle and its initial position
$d_{-} c t t = \|x - x_{2}\| + \|y - y_{2}\| + \| z - z_{2} \|$	Manhattan distance between the current state of the needle and the target point
$d_{-} o t t = \|x_{2} - x_{0}\| + \|y_{2} - y_{0}\| + \| z_{2} - z_{0} \|$	Manhattan distance of the initial point from the target point
$d_{-} c t s = \|x - x_{1}\| + \|y - y_{1}\| + \| z - z_{1} \|$	Manhattan distance of the needle from the obstacle
$n$	Number of needle rotations

Table 3. Composite reward functions for the PPFNP problem.

Classes	Reward Functions	Implications
Single-step reward	$r_{-} s$	Negative reward for damage to normal tissue per step
	$r_{-} c$	Negative reward per needle rotation
	${r_{-} n}_{-} o = \{\begin{matrix} 0, d_{-} c t o > d_{-} o \\ \frac{d_{-} o}{d_{-} c t o}, d_{-} c t o < d_{-} o \end{matrix}$	Negative reward for approaching an obstacle
	${r_{-} n}_{-} t = \{\begin{matrix} \frac{d_{-} o t t}{d_{-} c t t} \cdot d_{-} j, d_{-} c t t > d_{-} t \\ d_{-} c to \cdot d_{-} j, d_{-} c t t < d_{-} t \end{matrix}$	Positive rewards for approaching the target position
Multiple reward	$r_{-} o$	Negative reward for collision with obstacles
	$r_{-} b$	Negative reward to reach space boundaries
	$r_{-} w$	Negative reward for exceeding the worst step
	$r_{-} t$	Positive rewards for reaching the target location

Table 4. Cumulative rewards for the PPFNP problem.

Cumulative Reward	Implications
$R_{-} n = r_{-} s + r_{-} c + {r_{-} n}_{-} o + {r_{-} n}_{-} t$	Normal puncture procedure
$R_{-} o = R_{-} n + r_{-} o$	Collision with obstacles
$R_{-} b = R_{-} n + r_{-} b$	Outside the model boundary
$R_{-} t = R_{-} n + r_{-} t$	Reach target position
$R_{-} w = R_{-} n + r_{-} t + r_{-} w$	Exceeding the worst possible step size

Table 5. Surgical environment and agent parameters.

Parameters	Settings
Spatial extent	50 × 50 × 50
Obstacle radius	3
Flexible needle curvature	0.0069
Number of intelligences per turn	1
Initial point position	[25, 25, 0]
Location of the target	[25, 25, 50]
Maximum number of steps	50
Maximum number of revolutions	5

Table 6. Flexible needle agent training parameters.

Parameters	Settings
Neural network layers	3
Number of neurons in the hidden layer	20
Number of neural networks	2
Hidden layer activation function	ReLU
Output layer activation function	linear function
Gradient descent optimizer	RMSProp
Discovery strategy	ε-greedy
Experience pool size	5000
Learning rate	[0.001, 0.005, 0.01, 0.05, 0.1]
Discount factor	[0.80, 0.85, 0.90, 0.95, 0.99]
Batch size	[16, 32, 64, 128, 256]
Q-target update frequency	[10, 20, 30, 40, 50]
Number of training iterations	8000

Table 7. Scores in path planning experiments with different models.

	Original DRL				Improved DRL
Goals	Learning Rate	Discount Factor	Batch Size	Update Frequency	Improved DRL
Average	45	−4	134	3	246
Max	161	212	186	43	332
Min	−6	−64	64	−51	197

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lin, J.; Huang, Z.; Zhu, T.; Leng, J.; Huang, K. Deep Reinforcement Learning-Based Robotic Puncturing Path Planning of Flexible Needle. Processes 2024, 12, 2852. https://doi.org/10.3390/pr12122852

AMA Style

Lin J, Huang Z, Zhu T, Leng J, Huang K. Deep Reinforcement Learning-Based Robotic Puncturing Path Planning of Flexible Needle. Processes. 2024; 12(12):2852. https://doi.org/10.3390/pr12122852

Chicago/Turabian Style

Lin, Jun, Zhiqiang Huang, Tengliang Zhu, Jiewu Leng, and Kai Huang. 2024. "Deep Reinforcement Learning-Based Robotic Puncturing Path Planning of Flexible Needle" Processes 12, no. 12: 2852. https://doi.org/10.3390/pr12122852

APA Style

Lin, J., Huang, Z., Zhu, T., Leng, J., & Huang, K. (2024). Deep Reinforcement Learning-Based Robotic Puncturing Path Planning of Flexible Needle. Processes, 12(12), 2852. https://doi.org/10.3390/pr12122852

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Deep Reinforcement Learning-Based Robotic Puncturing Path Planning of Flexible Needle

Abstract

1. Introduction

2. Related Work

3. Path Planning of Flexible Needle in Robotic Puncturing

3.1. Definition of the Internal Human Environment

3.1.1. Obstacle Situation

3.1.2. Lesion Target Location

3.1.3. Hierarchical Model of the Human Body

3.2. Modeling of the Flexible Needle Puncture Problem

3.2.1. Kinematic Modeling

3.2.2. Mathematical Path Planning Model

4. Deep Reinforcement Learning-Based Path Planning

4.1. Framework of DRL-Based Path Planning

4.2. Reinforcement Learning Modeling

4.2.1. State Features

4.2.2. Action Space

4.2.3. Reward Function

4.3. Algorithms for Agent Training

4.3.1. DQN Algorithm

4.3.2. Training Process

5. Simulation Experiment Results

5.1. Sensitivity Analysis

5.2. DQN-Based Simulation Experiments for the PPFNP Problem

5.2.1. Different Space Dimensions

5.2.2. Number of Different Intelligences

5.2.3. Model Comparison

5.3. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI