Triplet-Fusion Self-Attention-Enhanced Pyramidal Convolutional Neural Network for Surgical Robot Kinematic Solution

Su, Tiecheng; Liang, Lu; Pan, Mingzhang; Fu, Changcheng; Huang, Hengqiu; Li, Jing’ao; Liang, Ke

doi:10.3390/act15020104


by Tiecheng Su ¹, Lu Liang ², Mingzhang Pan ¹, Changcheng Fu ¹, Hengqiu Huang ³, Jing’ao Li ¹ and Ke Liang ¹,*

¹ Guangxi Key Laboratory of Manufacturing System & Advanced Manufacturing Technology, College of Mechanical Engineering, Guangxi University, Nanning 530004, China

² College of Electrical Engineering, Guangxi University, Nanning 530004, China

³ College of Mathematics and Computer Science, Guangxi Minzu Normal University, Chongzuo 532200, China

* Author to whom correspondence should be addressed.

Actuators 2026, 15(2), 104; https://doi.org/10.3390/act15020104

Submission received: 20 December 2025 / Revised: 25 January 2026 / Accepted: 4 February 2026 / Published: 5 February 2026

(This article belongs to the Section Actuators for Robotics)


Abstract

Surgical robots are increasingly utilized in medicine for their reliability and convenience. An accurate kinematic model is essential for precise robot control and enhanced surgical safety. However, the high nonlinearity and computational complexity of kinematics pose significant challenges to traditional numerical methods. This study designs a surgical robotic arm and establishes the motion mapping relationship between the joint space and the end-effector workspace. Subsequently, a hybrid kinematic estimation model based on a deep pyramid convolutional neural network (DPCNN) is proposed, which integrates data sampling and an attention mechanism to improve computational accuracy. The Latin hypercube sampling technique is used to improve the uniformity of dataset sampling, and the triplet-fusion self-attention mechanism (TFSAM) is employed to extract multi-scale feature information. Experimental results show that the TFSAM-DPCNN model achieves coefficient of determination (R²) values exceeding 0.99 across all testing scenarios. Compared with other models, the proposed model reduces the root mean square error (RMSE) by up to 81.34%. Furthermore, the developed 3D simulation platform validates the effectiveness of the proposed model. This study offers a robust solution for multi-degree-of-freedom robot modeling, with potential applications across a range of robotic motion control systems.

Keywords: surgical robot; inverse kinematics; artificial neural networks; attention mechanism; deep learning

1. Introduction

Robotic minimally invasive surgery (RMIS) achieves high success rates with low blood loss and fast recovery [1], and single-port systems further reduce trauma and accelerate convalescence [2]. In these procedures, the single-port surgical robot inserts the surgical instrument arm into the patient’s body through a cannula. The surgeon manipulates the console to perform tasks such as viewing, cutting tissue, suturing, and knotting [3]. To overcome the limited bendability and workspace of rigid, chopstick-like tools, single-port platforms adopt flexible, serpentine instrument arms that provide redundancy and compliance [4], motivating precise modeling and control of such mechanisms.

With the application of serpentine mechanisms in single-port surgical robots, research on related control and information technologies has garnered wide attention [5]. Accurate kinematic modeling is foundational for control. Forward kinematics (FK) maps joint states to end-effector poses and is routinely formulated with Denavit–Hartenberg parameters [6], whereas inverse kinematics (IK) computes joint variables from desired poses and is often challenging for serpentine-structured, multi-degree-of-freedom (DoF) instrument arms with interlinked joints [7]. Existing IK solvers broadly include analytical and numerical approaches as well as learning-based estimators, each with distinct trade-offs in terms of accuracy, computational cost, and robustness, especially for multi-DoF cable-driven serpentine mechanisms.

Recently, data-driven approaches, particularly artificial neural networks (ANNs), have emerged as promising alternatives. Despite their progress, three critical gaps remain regarding multi-DoF, cable-driven serpentine instrument arms used in single-port surgery: (i) the high nonlinearity of multi-DoF joints establishes a complex mapping relationship between the joint space and the workspace. Generic neural network architectures often struggle to effectively approximate this highly coupled, non-convex function through simple structures, thereby limiting model fitting accuracy. In clinical scenarios requiring high precision, such as vascular dissection, such prediction errors can lead to abnormal trajectory deviations, potentially causing unintended tissue damage; (ii) model performance is highly sensitive to data coverage, yet systematic sampling strategies for the broad and morphologically complex workspace are often lacking. This deficiency poses significant risks in narrow lumen operations, where the robotic arm must maneuver in constrained configurations. In edge cases located at the boundaries of the workspace, insufficient model generalization capability may result in accidents such as collisions with lumen walls; (iii) the end-effector pose is the cumulative result of motions across multiple joints, inducing pronounced cross-dimensional, long-range dependencies among outputs. Generic architectures frequently fail to capture these dependencies, leading to error accumulation that degrades the accuracy required for delicate tasks like suturing and knotting.

Building on these insights, we focus on three complementary factors aligned with the above gaps: robustness to modeling uncertainty and strong multi-segment coupling; dataset coverage and sampling efficiency across broad, morphologically complex workspaces; and model capacity to capture global, cross-dimensional dependencies. Prior work indicates the importance of sampling strategies [8]; moreover, attention mechanisms can enhance performance by extracting global features and filtering irrelevant inputs, which is particularly pertinent when IK outputs across dimensions are strongly correlated.

To bridge these gaps and address the limitations of traditional analytical methods in cable-driven systems, we propose a multi-DoF single-port instrument arm with serpentine joints and develop a robust learning-based IK framework, TFSAM-DPCNN. Traditional analytical solvers struggle with the strong position–orientation coupling of serpentine joints and cannot adapt to unmodeled physical dynamics like hysteresis. In contrast, our data-driven approach builds a universal approximator that effectively handles these complex nonlinearities. Specifically, we employ Latin hypercube sampling (LHS), a stratified sampling technique that prevents data clustering inherent in random sampling, to uniformly cover the high-dimensional parameter space and improve data efficiency. We also cascade a triplet-fusion self-attention module with a deep pyramid convolutional neural network (DPCNN) and residual connections to capture multi-scale features and global dependencies while mitigating degradation. Comparative studies demonstrate superior accuracy and robustness, and simulations validate the effectiveness and reliability of the proposed estimator. This paper makes the following contributions:

A compact, flexible multi-DoF single-port instrument arm and its joint–workspace motion mapping for precise control.
A TFSAM module that augments self-attention with triplet attention, enabling explicit modeling of global cross-dimensional dependencies and long-range interactions.
A DPCNN with TFSAM and residual connections that captures multi-scale nonlinearities and enhances robustness to modeling uncertainty.
Extensive comparisons and simulations demonstrating state-of-the-art accuracy and robustness for the targeted surgical IK task.

2. Related Work

This work aims to resolve the highly nonlinear kinematics of multi-DoF instrument arms in single-port surgical robots using artificial intelligence algorithms. Consequently, this section reviews the literature pertinent to three key areas: traditional kinematic solution methods, general artificial intelligence approaches in robotics, and specific deep learning techniques relevant to this study.

2.1. Analytical and Numerical Kinematics

To develop accurate kinematic models and achieve precise robotic motion trajectory control, many scholars have proposed various approaches for analyzing and solving IK, including geometric [9], algebraic [10], and numerical methods [11]. Among them, geometric methods offer good intuitiveness and low computational cost, making them suitable for relatively simple kinematic chains but unsuitable for solving the IK of multi-DoF redundant robots [12]. Algebraic methods derive explicit mathematical models of the robotic system based on mechanical principles. However, these methods depend on prior knowledge and experimental estimation of the real system’s parameters [13]. Since surgical instrument arms have a serpentine structure that does not satisfy the robotic Pieper criterion, its IK solutions may result in multiple joint configurations to achieve the same end-effector position and orientation. Additionally, the high nonlinearity of cable drives can increase the complexity of algebraic-based mathematical models, making them susceptible to matrix singularities and nonlinear effects [14]. Numerical methods for solving IK are flexible and robust computational approaches, typically categorized into Jacobian methods [15], Newton methods [16], and heuristic methods [17,18]. Their advantages include the ability to handle complex structures and constraints, good adaptability to dynamic environments, and fault tolerance [19]. However, the accuracy of numerical analysis depends highly on the initial guess, making real-time application challenging. Additionally, numerical methods may accumulate iterative errors, leading to a lack of analytical precision.

2.2. Neural Network-Based Kinematics

ANNs can overcome the limitations of the traditional methods mentioned above: they have shown great ability to mitigate computational complexity, alleviate derivation difficulties, avoid matrix singularities, and reduce computational time in solving redundant robot kinematics problems [20]. Moreover, when kinematic models cannot be established due to uncertain parameters, neural networks can learn from experimental data to construct accurate models. Sharkawy and Khairullah [21] proposed a multilayer feedforward neural network (MLFFNN) to solve the kinematics of a three-DoF articulated robot hand, demonstrating excellent performance with a root mean square error close to zero. Similarly, Aysal [22] and Srisuk [23] applied neural networks to three-DoF robots, but these methods have not been validated for multi-DoF robots. As robots become more complex and multifunctional, IK solutions for low-DoF robots struggle to meet control requirements. Consequently, more researchers are focusing on kinematic solutions for complex redundant robots, such as parallel robots, soft robots, humanoid robots, and other multi-DoF robots. Thomas et al. [24] applied various artificial intelligence methods to solve the IK of a six-DoF parallel robot, including multiple linear regression, support vector machines (SVMs), decision trees, and random forests (RF), among which RF showed superior performance. Chawla et al. [25] applied ANNs to both the redundant constraint plane and the minimal constraint space of cable-driven parallel robots, using computation time as a key performance metric. The results showed that, compared to numerical analysis methods, ANNs reduced computation time by 83% and 44%, respectively, demonstrating that ANNs can improve computational efficiency. Wagaa et al. [26] sought the optimal deep neural network (DNN) configuration for solving the IK problem of a six-DoF robotic arm, comparing convolutional neural networks (CNNs), long short-term memory (LSTM), gated recurrent units (GRU), and bidirectional LSTM to determine which network is more effective. In conclusion, ANN-based methods for solving the kinematics of complex multi-DoF robots are well established and widely applied. However, generic CNN or RNN architectures often treat kinematic variables as independent channels or simple sequences, failing to explicitly model the strong spatial coupling and long-range dependencies inherent in serpentine mechanisms.

2.3. Attention Mechanisms in Robotics

Attention mechanisms, originally popularized in natural language processing, have subsequently been applied to various sequence-related problems [27]. Due to their excellent capability in capturing global dependencies and enhancing model interpretability, they have recently demonstrated immense potential in the field of robotics. Specifically in the domain of robot dynamics, Shao et al. [28] proposed a physics-informed dynamics learning network for soft robots, where an embedded self-attention mechanism significantly improved learning efficiency and data sensitivity, effectively addressing the challenges posed by the infinite degrees of freedom in continuum robots. Similarly, Zhang et al. [29] introduced a Directional Attention Mechanism (DAM) for chatter detection in robotic milling. By utilizing DAM to fuse spatial-domain point cloud data with time-domain vibration signals, their model successfully adapted to the varying physical stiffness of the robot under different poses, thereby achieving high detection accuracy across diverse machining conditions. Beyond dynamics, attention mechanisms have also been extensively explored in mobile robot navigation. Several studies have incorporated spatio-temporal features into Transformer architectures based on attention mechanisms to decouple the spatial relationships between agents and their temporal motion trajectories, enabling path planning in complex environments [30,31].

However, the aforementioned studies typically employ attention mechanisms to weigh input features, focusing primarily on temporal sequences or single-dimensional spatial attention. Given the high nonlinearity and strong coupling characteristics among the joints of the multi-DoF surgical instrument in this study, solving the inverse kinematics requires not only attending to the local motion transmission between adjacent joints but also capturing the global spatial mapping relationships across distant joints and dimensions. Therefore, a TFSAM is proposed to capture these cross-dimensional interactions, aiming to resolve the aforementioned long-range dependency problem. Compared to generic attention mechanisms, TFSAM provides a more granular feature extraction capability that is specifically tailored to kinematic characteristics.

3. System Description

Single-port surgical robots are typically composed of two parts: a passive main arm and an active surgical instrument arm. The passive main arm is used for pre-surgical adjustment of the robot’s position, ensuring that the instrument arm structure can access an appropriate workspace. During surgery, the passive main arm structure is usually fixed and does not participate in the surgical operation. The active surgical instrument arm is responsible for performing surgical operations, replicating the surgeon’s hand movements in real time during surgery, which requires the instrument arm to have higher operational flexibility and more precise positioning, as shown in Figure 1a. Therefore, this study focuses on the modeling and kinematic analysis of the active surgical instrument arm.

3.1. Surgical Instrument Arm Model

The main structure of the surgical instrument arm is shown in Figure 1b, mainly consisting of the surgical instrument arm base, motor box, instrument drive box, and the end multi-DoF surgical instrument arm. The surgical instrument arm base is used to fix the entire surgical instrument arm. A linear guide is installed on the surgical instrument arm base, providing the entire surgical instrument arm with the DoF for vertical movement. The motor box is installed on the linear guide and contains multiple motors and reducers. It is connected to the instrument drive box through a spline transmission device. The cable wheel is installed inside the instrument drive box; the motor can drive the cable on the fixing wheel, thus controlling all DoFs of the end instrument. Using cables as the driving method for surgical instruments offers higher flexibility, greater transmission efficiency, and requires less space compared to gear transmission and other traditional methods. It is suitable for long-distance transmission in limited spaces. Additionally, the motor box and instrument drive box are designed to allow rapid replacement of the instrument box during surgery. This design meets the needs of different surgical instruments.

The end-effector multi-DoF surgical instrument arm serves as the critical actuation mechanism. To meet the workspace and flexibility requirements of single-port surgery, a serial mechanism configuration with redundant DoFs is adopted, comprising six DoFs, as shown in Figure 1c. The rotation joint provides rotational movement for the entire instrument arm, which effectively enlarges the workspace. The parallel joint consists of two mutually constrained, linked rotational joints. These joints rotate synchronously in opposite directions by equal angles, ensuring that the manipulator arm axis remains parallel to the z-axis and maintaining wrist flexibility. The wrist rotation joint provides a rotational DoF for the end surgical instrument. The wrist serpentine joint consists of multiple rotational joints arranged perpendicularly to each other, providing pitch and yaw DoFs for the surgical instrument. Finally, the surgical instrument has a clamping DoF, which does not participate in the kinematics.

3.2. Forward Kinematics Analysis

To accurately map the joint space to the workspace and obtain the analytical FK solution of the surgical instrument, this study uses the DH parameter table. This approach offers a minimal and intuitive method for determining parameters through linear algebra [32]. To express the relationship between the links and joint coordinates of the surgical instrument, we redefined the coordinate systems of the instrument arm joints and used the center of the axis of the end surgical instrument as the centroid of the workspace, as shown in Figure 2. Two points should be noted. First, we incorporated the vertical degree of freedom d provided by the linear guideway into the kinematics, because vertical movement is necessary in surgery. Second, the wrist serpentine joints provide two degrees of freedom, pitch and yaw, and to improve the accuracy of the kinematic calculation we treated each serpentine joint as an independent kinematic unit. Serpentine joints that share the same degree of freedom have equal bending curvature, and each pitch/yaw unit bends by an angle of θ^{4} / θ^{5}.

Based on the DH convention, the DH parameter table for the surgical instrument arm is established, as shown in Table 1 and Figure 2. A set of four parameters, α, a, d, and θ, is then used to define the transformation matrix {}^{i-1}T_{i}:

{}^{i-1}T_{i} = \mathrm{Rot}(x, α_{i-1}) \, \mathrm{Trans}(x, a_{i-1}) \, \mathrm{Rot}(z, θ_{i}) \, \mathrm{Trans}(z, d_{i})

(1)

where Trans(·) and Rot(·) denote translation along and rotation about the indicated axis, respectively. Expanding Equation (1) gives the general expression for the homogeneous transformation matrix:

{}^{i-1}T_{i} = \begin{pmatrix} \cos θ_{i} & -\sin θ_{i} & 0 & a_{i-1} \\ \sin θ_{i} \cos α_{i-1} & \cos θ_{i} \cos α_{i-1} & -\sin α_{i-1} & -d_{i} \sin α_{i-1} \\ \sin θ_{i} \sin α_{i-1} & \cos θ_{i} \sin α_{i-1} & \cos α_{i-1} & d_{i} \cos α_{i-1} \\ 0 & 0 & 0 & 1 \end{pmatrix}

(2)
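As a rough illustration of Equations (1) and (2), the per-link transform and its chaining can be sketched in Python with NumPy. The function names `dh_transform` and `forward_kinematics`, and the two-row parameter list, are assumptions made for this sketch; the numbers are not the values of Table 1.

```python
import numpy as np

def dh_transform(alpha_prev, a_prev, theta, d):
    """Homogeneous transform {}^{i-1}T_i for one DH row, written out as in Eq. (2)."""
    ca, sa = np.cos(alpha_prev), np.sin(alpha_prev)
    ct, st = np.cos(theta), np.sin(theta)
    return np.array([
        [ct,      -st,       0.0,  a_prev],
        [st * ca,  ct * ca, -sa,  -d * sa],
        [st * sa,  ct * sa,  ca,   d * ca],
        [0.0,      0.0,      0.0,  1.0],
    ])

def forward_kinematics(dh_rows):
    """Multiply the per-link transforms in order to get the base-to-tip transform."""
    T = np.eye(4)
    for alpha_prev, a_prev, theta, d in dh_rows:
        T = T @ dh_transform(alpha_prev, a_prev, theta, d)
    return T

# Hypothetical two-row table (alpha_{i-1}, a_{i-1}, theta_i, d_i); NOT Table 1.
example_rows = [
    (0.0,       0.00, np.pi / 2, 0.10),
    (np.pi / 2, 0.05, 0.0,       0.00),
]
T_example = forward_kinematics(example_rows)
```

The upper-left 3×3 block of the result holds the orientation vectors and the last column holds the position, so the same routine could in principle be used to label sampled joint configurations with end-effector poses.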

According to Equation (2), the global transformation matrix in this study is as follows:

{}^{0}T_{10} = {}^{0}T_{1} {}^{1}T_{2} {}^{2}T_{3} {}^{3}T_{4} {}^{4}T_{5} {}^{5}T_{6} {}^{6}T_{7} {}^{7}T_{8} {}^{8}T_{9} {}^{9}T_{10} = \begin{pmatrix} n_{x} & s_{x} & a_{x} & P_{x} \\ n_{y} & s_{y} & a_{y} & P_{y} \\ n_{z} & s_{z} & a_{z} & P_{z} \\ 0 & 0 & 0 & 1 \end{pmatrix}

(3)

where {[n_{x}, n_{y}, n_{z}]}^{T}, {[s_{x}, s_{y}, s_{z}]}^{T}, and {[a_{x}, a_{y}, a_{z}]}^{T} are the orientation vectors of the end surgical instrument and {[P_{x}, P_{y}, P_{z}]}^{T} is its position vector.

4. Methodology

This study proposes a deep learning model with low iterative error and high solution accuracy to address the IK problem of multi-DoF systems. The proposed IK estimation model framework consists of three key stages: data sampling, hybrid model architecture design, and comprehensive performance evaluation and validation.

4.1. Data Sampling Module

In order to make the dataset cover the entire sampling space as uniformly as possible, the LHS method is added as a data sampling module in the hybrid model. LHS is a modified Monte Carlo method. Unlike simple random sampling, it uses stratification to force variables to be evenly partitioned within their ranges and ensures that each stratum of a variable is sampled only once [33]. As a form of stratified sampling, this method ensures that every region of each dimension is sampled, preventing sample points from clustering in small neighborhoods and thus maintaining uniformity in data sampling.
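The stratification idea can be made concrete with a minimal NumPy sketch. It assumes independent joint ranges given as (low, high) pairs; the function name `latin_hypercube` and the bounds below are illustrative and are not the ranges used in this study.

```python
import numpy as np

def latin_hypercube(n_samples, bounds, seed=None):
    """Draw n_samples points so that every dimension has exactly one point per stratum.

    bounds: sequence of (low, high) pairs, one per joint variable.
    """
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    n_dims = bounds.shape[0]
    # One uniform draw inside each of the n_samples equal-width strata of [0, 1).
    unit = (np.arange(n_samples)[:, None] + rng.random((n_samples, n_dims))) / n_samples
    # Shuffle the strata independently in each dimension to decouple the columns.
    for j in range(n_dims):
        unit[:, j] = rng.permutation(unit[:, j])
    low, high = bounds[:, 0], bounds[:, 1]
    return low + unit * (high - low)

# Hypothetical joint limits, for illustration only.
joint_bounds = [(0.0, 50.0), (-90.0, 90.0), (-45.0, 45.0)]
samples = latin_hypercube(1000, joint_bounds, seed=0)
```

Each column of `samples` then contains exactly one value in each of the 1000 equal-width intervals of its range, which is the property that prevents clustering. SciPy also ships a ready-made sampler, `scipy.stats.qmc.LatinHypercube`, which may be preferable in practice.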

4.2. Hybrid Model Architecture Design

4.2.1. DPCNN

The convolutional layers of CNNs effectively extract features from input data, which has made CNNs widely used and successful in fields such as image recognition and regression analysis. However, traditional CNNs have fewer convolutional kernels in the bottom layers and more in the top layers, and this structure is prone to gradient explosion and overfitting when handling discrete one-dimensional data [34,35]. To solve these problems, Johnson and Zhang proposed the deep pyramid convolutional neural network in 2017 [36]. The model allows the network to be deepened to extract long-range associations between input and output as well as deeper global information, while keeping computational complexity low and computation fast.

The DPCNN structure is shown in Figure 3. The basic embedding layer of the model generates embeddings by convolving the input data, and multiple stacked convolutional blocks then extract deeper features. Each convolutional block includes two convolutional layers, a residual connection, and a downsampling layer. The convolutional layers explicitly establish the dependency of angular changes between adjacent joints through local receptive fields. The residual connection is introduced because small initial weights slow gradient propagation, and it mitigates the vanishing-gradient problem of deep neural networks. Residual units are implemented as skip connections with pre-activation: the input is added directly to the output, so gradients can bypass the attenuating effect of the convolutional-layer weights and reach every block. Pyramidal downsampling reduces the input feature dimension through periodically inserted pooling layers. Each pooling operation halves the sequence length, so the sequence length decreases exponentially with the number of convolutional blocks, while the span of the original sequence that each convolution can perceive doubles. This pyramid-like structure not only significantly reduces computational complexity but also captures enhanced global features to improve prediction efficacy.
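To make the block structure more tangible, the following NumPy sketch implements one pyramid block under stated assumptions: pooling of size 3 with stride 2 applied first, followed by two pre-activation (ReLU before convolution) layers and an identity shortcut. The ordering, kernel sizes, channel count, and all function names are assumptions for illustration, not the configuration used in this paper.

```python
import numpy as np

def conv1d_same(x, w, b):
    """Length-preserving 1-D convolution (cross-correlation, as in deep learning).

    x: (c_in, length); w: (c_out, c_in, k) with odd k; b: (c_out,).
    """
    k = w.shape[-1]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    windows = np.lib.stride_tricks.sliding_window_view(xp, k, axis=1)  # (c_in, length, k)
    return np.einsum("oik,ilk->ol", w, windows) + b[:, None]

def max_pool_half(x, size=3, stride=2):
    """Max pooling with stride 2, so the sequence length is halved (rounded up)."""
    xp = np.pad(x, ((0, 0), (0, size - 1)), constant_values=-np.inf)
    windows = np.lib.stride_tricks.sliding_window_view(xp, size, axis=1)
    return windows[:, ::stride].max(axis=-1)

def dpcnn_block(x, params):
    """Downsample, then two pre-activation convolutions plus a skip connection."""
    x = max_pool_half(x)
    h = x
    for w, b in params:
        h = conv1d_same(np.maximum(h, 0.0), w, b)  # ReLU is applied before the convolution
    return x + h  # identity shortcut: the block input is added to the conv output

rng = np.random.default_rng(0)
channels, length = 8, 16  # hypothetical sizes
features = rng.standard_normal((channels, length))
# The same weights are reused in every block only to keep the sketch short.
block_params = [(0.1 * rng.standard_normal((channels, channels, 3)), np.zeros(channels))
                for _ in range(2)]

out = features
lengths = [out.shape[1]]
for _ in range(3):
    out = dpcnn_block(out, block_params)
    lengths.append(out.shape[1])
```

Tracking `lengths` across the three blocks shows the sequence shrinking by half at each stage while the channel count stays fixed, which is where the roughly constant per-block cost of the pyramid comes from.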

4.2.2. Triplet-Fusion Self-Attention Mechanism (TFSAM) Module

The IK estimation model proposed in this study uses the surgical instrument end-effector position and pose as input variables, and there exists a complex nonlinear coupling between the rotation matrix and the translation vector of the end-effector pose. However, conventional neural network training struggles to extract effective features from these variables and establish their interdependencies, and the module of attention mechanism needs to be introduced to address the limitations of the estimation model in these aspects. The attention mechanism has proved its effectiveness in various deep learning experiments in different domains such as image processing and regression prediction [37]. Therefore, this study proposes a dual-attention fusion module that cascades a self-attention mechanism (SAM) and triplet attention mechanism (TAM). The self-attention mechanism constructs global correlation weights for motion parameters to suppress redundant feature interference, while the triplet attention mechanism establishes local pose parameter correlations across multiple dimensions, capturing multi-scale data features to enhance the IK estimation performance.

In SAM, the model calculates the correlation of each element in the sequence with all other elements, and the resulting weights reflect the interrelationships between the elements, as shown in Figure 4. First, three learnable weight matrices W_q, W_k, and W_v are defined, and the query, key, and value of each element are obtained by projecting the input element x_{i} with these matrices:

q_{i} = W_{q} x_{i}

(4)

k_{i} = W_{k} x_{i}

(5)

v_{i} = W_{v} x_{i}

(6)

Stacked over all elements, they are denoted as Q, K, and V. The weight matrices are updated during training by backpropagation and gradient descent.

For each pair of elements, an attention score is computed to represent the degree of attention of x_{i} to x_{j}. The score is the dot product of the query vector and the key vector divided by a scaling factor, usually the square root of the key dimension, which prevents large scores from overflowing or saturating the softmax function, as shown in Equation (7):

α_{ij} = \mathrm{score}(q_{i}, k_{j}) = q_{i} \cdot k_{j} / \sqrt{d_{k}}

(7)

where d_{k} is the dimension of the key vector. The normalized attention weight matrix \hat{α} is then obtained by applying the softmax function over j:

\hat{α}_{ij} = \mathrm{softmax}_{j} (q_{i} \cdot k_{j} / \sqrt{d_{k}})

(8)

Finally, each value vector is multiplied by its corresponding attention weight and the products are summed to give the final output z_{i}:

z_{i} = \sum_{j = 1}^{n} \hat{α}_{ij} v_{j}

(9)
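Equations (4) to (9) can be condensed into a few lines of NumPy. The sketch below uses the row-vector convention (each x_i is a row of X, so the projections are written X @ W rather than W x_i); the function name and the random inputs are assumptions for illustration only.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over the rows of X.

    X: (n, d_in), one row per element x_i; W_q, W_k: (d_in, d_k); W_v: (d_in, d_v).
    Returns the outputs z_i stacked as rows and the (n, n) attention weights.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                    # Eqs. (4)-(6)
    scores = Q @ K.T / np.sqrt(K.shape[-1])                # Eq. (7)
    scores = scores - scores.max(axis=-1, keepdims=True)   # keeps exp() from overflowing
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # Eq. (8): softmax over j
    return weights @ V, weights                            # Eq. (9)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))       # hypothetical: 5 elements with 4 features each
W_q = rng.standard_normal((4, 3))
W_k = rng.standard_normal((4, 3))
W_v = rng.standard_normal((4, 3))
Z, A = self_attention(X, W_q, W_k, W_v)
```

Row i of `A` holds the weights that element i assigns to every element j, so each row sums to one, and row i of `Z` is the corresponding weighted sum of value vectors.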

The SAM captures the interdependencies between variables and allocates adaptive weights; the captured global feature information is then used as the input to the triplet attention mechanism to further acquire multi-scale feature information. Let the input be X \in ℝ^{C \times H \times W}. Denote the SAM output weight as S_{SAM} \in ℝ^{1 \times H \times W} and the SAM-refined feature as

\tilde{X} = X ⊙ S_{SAM} + X

(10)

which is then fed into TAM.

Considering that the input data in this study is derived from the homogeneous transformation matrix, it contains three orientation vectors and one position vector. These vectors are not only mathematically correlated via orthogonality but are also physically highly coupled through the kinematic chain of the serpentine manipulator. Traditional attention mechanisms often overlook the latent connections between these vector components. In contrast, incorporating the TAM allows for the computation of cross-dimensional attention weights by rotating the self-attention feature tensors, thereby establishing spatial dependencies among the input pose parameters and enhancing the accuracy of inverse kinematic estimation. As shown in Figure 4, the TAM consists of three branches, which detect interdependencies of the input tensor across channel (C), height (H), and width (W). Let Z(\cdot) denote the concatenation of average and max pooling along the channel dimension and let σ(\cdot) be the Sigmoid activation. Let Π_{H} and Π_{W} be permutation operators that swap (C \leftrightarrow H) or (C \leftrightarrow W), respectively, with inverses that restore the original layout. In the first and second branches (counterclockwise 90° rotation around the H and W axes, i.e., using Π_{H} and Π_{W}), the attention weights share the form

M_{d} = Π_{d}^{-1} (σ (\mathrm{Conv}_{7 \times 7} (Z (Π_{d} (\tilde{X}))))), \quad d \in \{H, W\}

(11)

In the third branch, no rotation is performed; instead, the same pooling is followed by a 7 \times 7 convolution and activation:

M_{C} = σ (\mathrm{Conv}_{7 \times 7} (Z (\tilde{X})))

(12)

Finally, the outputs of these three branches are averaged to generate the final attention weights:

A = (M_{C} + M_{H} + M_{W}) / 3

(13)

and the TAM output is obtained by reweighting:

Y = \tilde{X} ⊙ A

(14)

This approach allows for a focus on specific features while ensuring the completeness of input information, thereby effectively enhancing the performance of the motion estimation model.
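The three-branch computation of Equations (11) to (14) can be sketched in NumPy as follows. This is a simplified illustration: the 7 × 7 kernels are random placeholders rather than trained weights, no normalization layers are included, and the tensor sizes and function names are assumptions.

```python
import numpy as np

def z_pool(x):
    """Stack max and mean over the first axis: (D, A, B) -> (2, A, B)."""
    return np.stack([x.max(axis=0), x.mean(axis=0)])

def conv2d_same(x, kernel):
    """Zero-padded 2-D convolution with one output map; x: (2, A, B), kernel: (2, k, k)."""
    p = kernel.shape[-1] // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    windows = np.lib.stride_tricks.sliding_window_view(xp, kernel.shape[1:], axis=(1, 2))
    return np.einsum("cabij,cij->ab", windows, kernel)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def branch_weight(x, kernel, perm=None):
    """Attention weight of one branch; perm swaps the channel axis with H or W."""
    if perm is not None:
        x = np.transpose(x, perm)
    m = sigmoid(conv2d_same(z_pool(x), kernel))[None]   # (1, A, B)
    if perm is not None:
        m = np.transpose(m, np.argsort(perm))           # undo the permutation
    return m

def triplet_attention(x, k_c, k_h, k_w):
    m_c = branch_weight(x, k_c)                     # (1, H, W), Eq. (12)
    m_h = branch_weight(x, k_h, perm=(1, 0, 2))     # (C, 1, W), Eq. (11) with d = H
    m_w = branch_weight(x, k_w, perm=(2, 1, 0))     # (C, H, 1), Eq. (11) with d = W
    a = (m_c + m_h + m_w) / 3.0                     # Eq. (13), broadcast to (C, H, W)
    return x * a                                    # Eq. (14)

rng = np.random.default_rng(0)
x_refined = rng.standard_normal((3, 4, 3))          # hypothetical (C, H, W) tensor
k_c, k_h, k_w = (0.05 * rng.standard_normal((2, 7, 7)) for _ in range(3))
y_out = triplet_attention(x_refined, k_c, k_h, k_w)
```

Because each branch weight has a singleton axis along the dimension it pooled over, the average in Equation (13) relies on broadcasting to produce one weight per tensor entry.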

4.2.3. DPCNN–TFSAM Collaborative Framework

The proposed architecture integrates these modules in a cascaded manner to maximize their complementary strengths. First, the TFSAM serves as a global feature preprocessing and enhancement module; it processes the raw kinematic inputs to model long-range dependencies and assigns higher attention weights to features that are most relevant to the target joint variables. The extracted features are then fed into the DPCNN. Leveraging its pyramidal convolutional layers, DPCNN further captures multi-scale local patterns from these inputs. Finally, residual connections link the input and output of the DPCNN blocks, enabling the network to learn deep nonlinear representations without degradation and ultimately establishing a nonlinear mapping between the input pose and the joint variables.

4.3. Model Evaluation Indexes

To assess the accuracy of the proposed IK model, five evaluation indexes are used: the coefficient of determination (R²), mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and standard deviation (SD). The closer R² is to 1, the better the model fits the actual output. The other indexes measure the prediction error; the smaller their values, the higher the prediction accuracy of the model. The five evaluation indexes are defined below [38]:

R^{2} = 1 - \frac{\sum_{i = 1}^{N} {(\hat{y}(i) - y(i))}^{2}}{\sum_{i = 1}^{N} {(\bar{y} - y(i))}^{2}}

(15)

MAE = \frac{1}{N} \sum_{i = 1}^{N} | \hat{y}(i) - y(i) |

(16)

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(\hat{y}(i) - y(i))}^{2}}

(17)

MAPE = \frac{1}{N} \sum_{i = 1}^{N} | (\hat{y}(i) - y(i)) / y(i) | \times 100\%

(18)

SD = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(y(i) - \bar{y})}^{2}}

(19)

where \hat{y}(i) is the predicted value, y(i) is the actual value, \bar{y} is the mean of the actual values, and N is the total number of samples.
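For reference, the five indexes can be computed directly from their definitions in Equations (15) to (19). The helper name `evaluate` and the toy data are assumptions made for this sketch.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return R2, MAE, RMSE, MAPE (in percent), and SD as defined in Eqs. (15)-(19)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    y_bar = y_true.mean()
    return {
        "R2": 1.0 - np.sum(err ** 2) / np.sum((y_bar - y_true) ** 2),
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        # MAPE divides by the actual value, so it is undefined where y_true == 0.
        "MAPE": np.mean(np.abs(err / y_true)) * 100.0,
        "SD": np.sqrt(np.mean((y_true - y_bar) ** 2)),
    }

# Toy values chosen only to exercise the function.
metrics = evaluate([1.0, 2.0, 4.0], [1.0, 2.0, 5.0])
```

Since joint variables can legitimately be zero, MAPE needs care on such samples; how they are treated is a choice that this sketch leaves open.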

4.4. Framework of the Proposed Model

The flowchart of the IK estimation model for the multi-DoF surgical robotic instrument arm is illustrated in Figure 5. Initially, random values for each joint degree of freedom are generated within the predefined joint space using the Latin hypercube sampling method. Then, the corresponding end-effector pose vector $[O_{x}, O_{y}, O_{z}]^{T}$ and position vector $[P_{x}, P_{y}, P_{z}]^{T}$ are computed based on the DH parameter matrix in Table 1. The pose vector consists of three components: $[n_{x}, n_{y}, n_{z}]^{T}$, $[s_{x}, s_{y}, s_{z}]^{T}$, and $[a_{x}, a_{y}, a_{z}]^{T}$. These 12 parameters (9 pose components and 3 position components) are used as the input variables of the model, while the six joint degrees of freedom serve as the output variables. An IK estimation model is constructed using the TFSAM-DPCNN framework. Finally, statistical analysis is conducted for model evaluation and comparison. The predictive accuracy and generalization ability of the model are assessed using four performance metrics: RMSE, R², MAE, and MAPE. In addition, a 3D kinematic simulation platform for the single-port surgical robotic instrument arm is developed to validate the effectiveness of the proposed kinematic model. The framework of the proposed model is shown in Figure 6.
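The mapping from DH parameters to the 12 model inputs can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the modified (Craig) DH convention suggested by the $α_{i-1}$, $a_{i-1}$ column headers of Table 1, the helper names are hypothetical, the ordering of the 12 features is an assumption, and the two-link chain uses made-up values because the link lengths are not listed in this section.

```python
import math

def dh_transform(alpha_prev, a_prev, d, theta):
    """Link transform under the modified (Craig) DH convention (assumed here)."""
    ca, sa = math.cos(alpha_prev), math.sin(alpha_prev)
    ct, st = math.cos(theta), math.sin(theta)
    return [
        [ct, -st, 0.0, a_prev],
        [st * ca, ct * ca, -sa, -sa * d],
        [st * sa, ct * sa, ca, ca * d],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul4(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_kinematics(dh_rows):
    """Chain the link transforms from base to end-effector."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for row in dh_rows:
        T = matmul4(T, dh_transform(*row))
    return T

def pose_features(T):
    """Flatten n, s, a (rotation columns) and P (translation) into 12 inputs."""
    n = [T[i][0] for i in range(3)]
    s = [T[i][1] for i in range(3)]
    a = [T[i][2] for i in range(3)]
    p = [T[i][3] for i in range(3)]
    return n + s + a + p

# Hypothetical two-link chain: rows are (alpha_prev, a_prev, d, theta)
toy_rows = [(0.0, 0.0, 10.0, 0.0), (math.pi / 2, 5.0, 0.0, 0.0)]
features = pose_features(forward_kinematics(toy_rows))
```

In a data-generation loop, each sampled joint vector would be substituted into the rows of Table 1, and the resulting `features` list would be paired with that joint vector as one training example.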

5. Results and Discussion

5.1. Experimental Setup

Based on the characteristics of single-port surgery, the surgical workspace and the operating habits of doctors, the range of joint degrees of freedom is set to meet clinical requirements, taking into account transmission scheme limitations and the lifespan of the surgical instrument arm, as shown in Table 2.

After determining the range of motion for each joint, we use the LHS method to collect a dataset of 50,000 random points within the joint space for each joint. The pose and position matrices of the surgical instrument end-effector are obtained through the DH method. It should be noted that the dataset in this study is generated based on standard DH parameter equations. This implies that the data represents the idealized geometric kinematics of the robot and does not explicitly account for physical nonlinearities common in cable-driven mechanisms. Nevertheless, the model trained on this dataset provides a solid foundation for solving the complex geometric inverse kinematics mapping and serves as a robust baseline for subsequent physical compensation algorithms.
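Latin hypercube sampling over the joint ranges of Table 2 can be sketched in plain Python as below. This is a minimal illustration of the general technique, not the authors' sampling code: the function name, the seed, and the use of 1000 rather than 50,000 points are assumptions made to keep the example small.

```python
import math
import random

# Joint ranges taken from Table 2 (d in mm, angles in rad)
JOINT_RANGES = {
    "d": (0.0, 300.0),
    "theta1": (-math.pi / 2, math.pi / 2),
    "theta2": (0.0, math.pi / 12),
    "theta3": (-math.pi / 2, math.pi / 2),
    "theta4": (-math.pi / 12, math.pi / 12),
    "theta5": (-math.pi / 12, math.pi / 12),
}

def latin_hypercube(n_samples, ranges, seed=0):
    """Draw n_samples points so that each dimension has one point per stratum."""
    rng = random.Random(seed)
    columns = []
    for low, high in ranges:
        # One uniform draw inside each of n_samples equal-width strata of [0, 1)
        unit = [(k + rng.random()) / n_samples for k in range(n_samples)]
        # Shuffle independently per dimension to decouple the strata pairing
        rng.shuffle(unit)
        columns.append([low + u * (high - low) for u in unit])
    # Transpose columns into one row per sample
    return [list(row) for row in zip(*columns)]

joint_samples = latin_hypercube(1000, list(JOINT_RANGES.values()))
```

Compared with independent uniform sampling, every one-dimensional projection of the sample set covers its range evenly, which is the uniformity property the text refers to.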

In the training process, 70% of the dataset samples were used for training, 10% for validation, and the remaining 20% for testing. The 12 variables from the pose and position matrices were used as input to the proposed model. After each training epoch, the trained model was evaluated on the validation set. If the error on the validation set decreased, the current model parameters were saved. All processes were implemented using PyTorch 2.0.0 on an NVIDIA GeForce RTX 4090 GPU.
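The 70/10/20 partition and the save-on-improvement rule can be sketched without any deep learning framework. This is an illustrative sketch only: the function names and the example loss values are hypothetical, and the actual checkpoint criterion used in the study may differ in detail.

```python
import random

def split_indices(n_samples, train=0.7, val=0.1, seed=0):
    """Shuffle sample indices and split them into train/validation/test parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = round(n_samples * train)
    n_val = round(n_samples * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def checkpoint_epochs(val_losses):
    """Return the (epoch, loss) pairs at which parameters would be saved."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # save only when the validation loss improves
            best = loss
            saved.append((epoch, loss))
    return saved

train_idx, val_idx, test_idx = split_indices(50_000)
# Hypothetical validation losses for five epochs
saved = checkpoint_epochs([0.9, 0.5, 0.6, 0.4, 0.4])
```

With 50,000 samples this yields 35,000 training, 5000 validation, and 10,000 test indices, and the parameters retained at the end are those of the last improving epoch.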

5.2. Impact of Optimizers on Model Performance

During model training, the optimizer determines the direction of parameter updates based on the gradients of the loss function, aiming to locate the global minimum. This process plays a critical role in both model accuracy and convergence speed [39]. In this study, five commonly used optimizers (Adam, SGD, RMSprop, Adagrad, and Adadelta) were selected for model training, and their loss values were recorded throughout the training process, as shown in Figure 7. Adam demonstrated the fastest convergence and the highest accuracy, with a final loss approximately 47% lower than that of the second-best optimizer. In contrast, Adadelta converged more slowly, and Adagrad became trapped in a local optimum. Moreover, compared with RMSprop and SGD, Adam’s loss curve exhibits no significant oscillations, indicating that its adaptive learning rate mechanism effectively suppresses gradient noise. Based on these findings, Adam was selected to update the model parameters in this study.
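For readers unfamiliar with the adaptive learning rate mentioned above, the standard Adam update rule can be sketched on a one-dimensional toy problem. This is the textbook form of the algorithm with commonly used default coefficients, not the training configuration of this study; the quadratic objective and the hyperparameter values are assumptions chosen for illustration.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function with the standard Adam update rule."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter scaled step
    return x

# Toy objective f(x) = (x - 3)^2 with gradient 2 (x - 3)
x_star = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

The division by the square root of the second-moment estimate is what normalizes the step size against gradient magnitude, which is the mechanism the text credits for the smoother loss curve.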

To verify the generalization capability of the proposed TFSAM-DPCNN and rule out overfitting, we monitored the loss trajectories during training. As illustrated in Figure 8, both the training and validation losses decrease rapidly in the early stages and converge synchronously to a stable minimum. The absence of divergence between the two curves confirms that the model has effectively learned the kinematic mapping features.

5.3. Comparison of Different Methods

To evaluate the predictive performance of the proposed IK estimation model, the TFSAM-DPCNN model was compared with several state-of-the-art models that have demonstrated strong performance in the field of IK under identical conditions, including Extreme Learning Machine (ELM), Long Short-Term Memory network with Extended Kalman Filter (LSTM-EKF) [40], CNN [41], Multilayer Perceptron Artificial Neural Network (MLP-ANN) [8], and GRU [42]. To ensure experimental consistency, identical hardware configurations, dataset partitions, and preprocessing protocols were rigorously maintained. Furthermore, to minimize the effects of random factors during the experiments, each model was trained and evaluated three times, and the average results were reported.

The results are presented in Table 3. Across the evaluation metrics of the six output variables, TFSAM-DPCNN exhibits the best overall prediction performance. Except for the output variable $θ_{5}$, it achieved the best performance on all other variables. The average R² value reaches 0.993, demonstrating the model’s excellent fitting capability. Compared with the other models, the proposed method reduced the RMSE, MAE, and MAPE by up to 81.34%, 83.08%, and 85.58%, respectively. These results confirm the effectiveness of applying TFSAM-DPCNN to develop a multi-DoF IK estimation model. In addition to accuracy metrics, model stability is evaluated using the standard deviation (SD) of prediction errors on the test set. For each joint, the proposed model yields an SD value close to the RMSE, indicating consistent predictions and supporting the stability and robustness of the model. Among the comparative models, LSTM-EKF and GRU performed best, with predictions for certain variables approaching or even surpassing those of the proposed model. This could be attributed to their gating mechanisms, which effectively capture and transmit historical motion information, thereby enhancing model performance. In contrast, ELM yields unsatisfactory prediction results because its single-hidden-layer architecture struggles to capture the complex mapping relationships in high-dimensional kinematics.

From the perspective of individual variables, the vertical translational DoF d appears to be easier to predict. This may be because the relationship between d and the position vector of the transformation matrix, $[P_{x}, P_{y}, P_{z}]^{T}$, is easier for the model to capture in the process of solving forward kinematics. The relatively larger prediction errors for $θ_{1}$ and $θ_{3}$ are partly attributed to their wider ranges of motion. However, a more critical factor lies in their kinematic coupling. Since both $θ_{1}$ and $θ_{3}$ are functionally responsible for axial rotation, their rotation axes tend to become collinear in specific configurations, such as when the bending angles of the serpentine wrist joints are small. In these cases, different combinations of $θ_{1}$ and $θ_{3}$ can result in similar end-effector rotation matrices, leading to non-uniqueness in the inverse kinematics solution. Consequently, the proposed model struggles to precisely distinguish the individual contributions of these two joints during the regression process, resulting in higher prediction errors than for the bending joints.

For $θ_{5}$, although the GRU performs somewhat better than the proposed method, we found that the performance difference between the two is negligible, with an RMSE difference of only 0.00012. The apparent discrepancy in the table primarily stems from the MAPE metric. It is important to note that MAPE is highly sensitive to samples whose target values are close to zero. Given that $θ_{5}$ in the dataset contains a certain proportion of near-zero or small-magnitude samples, MAPE may still exhibit significant fluctuations even when the absolute error levels are comparable.
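The sensitivity of MAPE to near-zero targets can be illustrated with a small numerical sketch. The target values below are hypothetical and chosen only to lie within the $θ_{5}$ range of Table 2; the same constant absolute error of 0.005 rad is applied to both sets.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, as in Eq. (18)."""
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

ABS_ERR = 0.005  # identical absolute error (rad) applied to every sample

# Hypothetical targets: one set far from zero, one set close to zero
large_targets = [0.20, -0.25, 0.15, -0.22]
small_targets = [0.01, -0.02, 0.015, -0.01]

mape_large = mape(large_targets, [t + ABS_ERR for t in large_targets])
mape_small = mape(small_targets, [t + ABS_ERR for t in small_targets])
```

Here `mape_large` is roughly 2.5% while `mape_small` is roughly 40%, even though MAE and RMSE are identical for the two sets. This is why MAPE differences on small-magnitude joints should be read together with the absolute error metrics.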

Furthermore, to intuitively assess the prediction performance of the different models, a subset of prediction errors for each joint variable was randomly selected, as shown in the left panel of Figure 9. The TFSAM-DPCNN model demonstrated superior fitting performance for all output variables compared with the other models, exhibiting smaller fluctuations and fewer outliers. The right panel of Figure 9 illustrates the fitting performance of the TFSAM-DPCNN model and the distribution of prediction errors across different value intervals. The predictions of the proposed model lie close to the y = x line (the ideal fit, R² = 1), which is consistent with the R² values reported in Table 3. The prediction errors for all variables in TFSAM-DPCNN are controlled within 6.3% of their respective value ranges, with the vertical translational DoF d maintaining an error within 2.47%. Moreover, the prediction errors of the TFSAM-DPCNN model remained relatively stable across all value intervals, without noticeable increases in error at extreme true values. In summary, the proposed model outperforms the other advanced algorithms in terms of both prediction accuracy and stability.

5.4. Computational Efficiency Analysis

To comprehensively evaluate the practicality of the models, we compared the computational efficiency of the proposed TFSAM-DPCNN with other baseline methods in terms of parameter scale and training time. As shown in Table 4, the proposed TFSAM-DPCNN exhibits a highly compact architecture with only 142,660 parameters. This is significantly fewer than the parameters required by GRU, MLP-ANNs, and LSTM-EKF. The lightweight nature of our model indicates lower memory consumption, making it particularly suitable for deployment on resource-constrained embedded systems in surgical robots.
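To put the parameter counts of Table 4 in terms of memory, the weight storage can be estimated with simple arithmetic. This sketch assumes 32-bit floating-point weights (4 bytes per parameter), which is an assumption rather than a figure reported in the study, and it ignores activations and framework overhead.

```python
# Parameter counts copied from Table 4
PARAM_COUNTS = {
    "ELM": 1900,
    "LSTM-EKF": 804358,
    "MLP-ANNs": 398527,
    "CNN": 22118,
    "GRU": 254214,
    "TFSAM-DPCNN": 142660,
}

BYTES_PER_PARAM = 4  # assumption: float32 weights

def weight_memory_kib(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate weight storage in KiB, excluding activations and overhead."""
    return n_params * bytes_per_param / 1024.0

memory_kib = {name: weight_memory_kib(n) for name, n in PARAM_COUNTS.items()}
```

Under this assumption the proposed model needs about 557 KiB for its weights, compared with roughly 993 KiB for GRU and about 3.1 MiB for LSTM-EKF.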

Regarding training efficiency, the TFSAM-DPCNN requires approximately 197.63 min for training, which is longer than the baseline models. This increased training duration is attributed to the multi-branch parallel computation within the TFSAM module and the deeper hierarchical structure of the DPCNN, which are designed to capture complex spatial couplings and high-level features. However, considering that model training is a one-time offline process, this trade-off is justified by the significant improvements in estimation accuracy and the advantage of a lightweight parameter footprint for real-time inference.

Furthermore, to verify the applicability of the proposed method for surgical robot control, we set the batch size to 1 and evaluated the average single-sample inference time. The experiment is conducted on the aforementioned standard workstation hardware. The results show that the proposed model achieves an average inference latency of 1.28 ms, corresponding to a processing frequency of 784 Hz, demonstrating sufficient real-time capability. Although the current evaluation is performed on a workstation, the lightweight nature of the model suggests that it can be effectively deployed on embedded computing platforms without sacrificing real-time performance.
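A batch-size-1 latency measurement of this kind can be sketched with the standard library timer. The harness below is an illustration, not the authors' benchmark: `dummy_predict` is a hypothetical stand-in for the trained network, and the warm-up and repetition counts are assumptions. For a GPU model, device synchronization before reading the timer would also be needed.

```python
import time

def mean_latency_ms(predict, sample, n_runs=1000, n_warmup=50):
    """Average wall-clock time of predict(sample) in milliseconds."""
    for _ in range(n_warmup):  # warm-up calls are excluded from timing
        predict(sample)
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(sample)
    return (time.perf_counter() - start) * 1000.0 / n_runs

def latency_to_hz(latency_ms):
    """Convert a per-sample latency in milliseconds to a processing rate in Hz."""
    return 1000.0 / latency_ms

def dummy_predict(pose):
    """Hypothetical stand-in: maps 12 pose inputs to 6 joint outputs."""
    return [0.0 * sum(pose)] * 6

latency = mean_latency_ms(dummy_predict, [0.0] * 12, n_runs=100, n_warmup=5)
rate_hz = latency_to_hz(1.28)
```

A latency of exactly 1.28 ms converts to 781.25 Hz; the small gap to the reported 784 Hz is presumably due to rounding of the latency figure.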

5.5. Ablation Experiment

To evaluate the effectiveness of different structural components in the proposed IK estimation model, an ablation experiment is conducted. The model is decomposed into three baseline variants:

(1): TAM-DPCNN (with the self-attention module removed),
(2): SAM-DPCNN (with the triplet attention module removed),
(3): DPCNN (with all attention mechanisms removed).

All experiments are executed on the original dataset under identical parameter configurations.

Figure 10 illustrates the performance of the aforementioned models across the six joint variables. Experimental results demonstrate that both the TAM and SAM enhance the DPCNN prediction accuracy across all joints, indicating that IK estimation cannot solely rely on a stacked convolutional DPCNN architecture. The fixed receptive field of DPCNN limits its ability to capture global inter-joint features, resulting in cumulative joint prediction errors. In contrast, the TAM-DPCNN introduces spatial-channel features to establish spatial correlations among the joints, assigning different weights during the prediction of each joint, which reduces the average RMSE by 20.75% and the average MAE by 21.33%. Meanwhile, the SAM-DPCNN captures global features and automatically learns the coupling relationships among input parameters, leading to a 24.45% reduction in average RMSE and 26.39% reduction in average MAE. Notably, for rotational joint variables, the performance improvements of the attention-enhanced models are even more significant, indicating their enhanced capability in handling the highly nonlinear variations in end-effector pose and position.

The TFSAM-DPCNN demonstrates superior performance across all joint variables compared with the baseline variants. Relative to the DPCNN, TFSAM-DPCNN achieves average reductions of 27.98% in RMSE and 31.29% in MAE, with maximum reductions reaching 41.41% and 41.03%, respectively. It also outperforms the single-attention baseline models, with a maximum RMSE reduction of 14.13%, indicating that the cascaded attention mechanism effectively integrates both spatial and global features and dynamically adjusts the importance of different features to improve the prediction accuracy. Overall, each component of the proposed model contributes substantially to its prediction performance.

5.6. Mechanical Simulation Verification

To comprehensively validate the proposed kinematic model for the single-port surgical robotic instrument arm, a simulation platform based on SolidWorks 2021 and MATLAB 2020a was developed in this study. First, a parametric model of the six-DoF surgical instrument arm was constructed in SolidWorks, and a URDF descriptive file containing physical properties and a physics engine was generated. Subsequently, a motion simulation model of the instrument arm was built in the MATLAB simulation environment using the Robotics System Toolbox, enabling visualized motion simulation.

A random motion trajectory was generated in the workspace using the motion simulation model. The end-effector pose matrices, position matrices, and corresponding actual joint variables at 80 points along the trajectory were recorded. Figure 11 illustrates the poses of the surgical instrument arm simulation model at the start and end points of the trajectory. Subsequently, the end-effector pose and position matrices at each trajectory point were input into the proposed IK estimation model to predict the joint variables. A comparison with the ground-truth values is presented in Figure 12. Overall, the TFSAM-DPCNN demonstrated satisfactory prediction performance. Although minor phase delays are observed in joints θ₁ and θ₄, the overall mean error does not exceed 8%, and no cumulative error is observed. Additionally, the maximum error of joint d was only 4.1198 mm. Combined with the previous performance analyses of the IK estimation model, these results further confirm the satisfactory performance of the proposed approach.

6. Conclusions and Future Works

In this study, a design scheme of a multi-DoF instrument arm for single-port surgical robots is proposed. To address the kinematic issues of complex multi-DoF instrument arms, a hybrid kinematic estimation model is introduced. The model incorporates a data sampling module and a triplet-fusion self-attention mechanism module to enhance prediction performance. The main research conclusions are as follows:

(1): An effective hybrid kinematic estimation model is provided. The proposed TFSAM-DPCNN model can effectively fit and predict the end-effector position and pose of the surgical robot, and it achieves the best overall performance among the compared state-of-the-art models.
(2): The proposed TFSAM attention mechanism effectively extracts features between variables and enhances the model’s adaptability, demonstrating strong performance in predicting multiple joint variables, with R² values exceeding 0.99.

This level of accuracy improvement reduces deviations in the pose of the surgical end-effector. For delicate operations such as suturing, needle passing, and knot tying, it can improve trajectory repeatability and placement consistency, ensuring that the manipulator follows the planned path more faithfully, thereby minimizing the risk of unintended damage to surrounding healthy tissues within confined anatomical spaces. In conclusion, this study provides an effective solution for the kinematic modeling of complex multi-degree-of-freedom robots, which provides a basis for robot motion control and can be extended to various motion control schemes for complex robots.

In future work, we will consider the following research directions. First, the current model focuses on pointwise inverse kinematics and does not explicitly incorporate temporal continuity constraints. In future trajectory-tracking studies, we will consider introducing temporal smoothness constraints to ensure continuous and safe manipulator motion. Second, we will investigate uncertainty quantification methods to provide confidence intervals for each joint prediction, enabling online anomaly detection and safe switching strategies. Finally, although the proposed model achieves excellent accuracy for geometry-based inverse kinematics, it must be acknowledged that the current training data do not include non-ideal dynamic effects such as friction, hysteresis, or cable elongation. To bridge the gap between simulation and reality, future work will conduct Hardware-in-the-Loop testing, using the proposed model as a feedforward controller to provide an initial inverse solution and combining sensor-based closed-loop feedback to compensate for physical errors. In addition, real hardware data will be used to fine-tune the model via transfer learning, enabling deployment in real dynamic environments.

Author Contributions

T.S.: Writing—original draft, Methodology, Visualization, Conceptualization. L.L.: Validation, Methodology, Writing—review and editing. M.P.: Resources, Project administration. C.F.: Data curation, Investigation. H.H.: Formal analysis. J.L.: Software, Formal analysis. K.L.: Conceptualization, Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62461002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

ChatGPT 4.1 was used for English translation and polishing.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Liang, Y.; Du, Z.; Wang, W.; Yan, Z.; Sun, L. An improved scheme for eliminating the coupled motion of surgical instruments used in laparoscopic surgical robots. Robot. Auton. Syst. 2019, 112, 49–59.
2. Wu, Z.; Zhu, C.; Ding, Y.; Wang, Y.; Xu, B.; Xu, K. A robotic surgical tool with continuum wrist, kinematically optimized curved stem, and collision avoidance kinematics for single port procedure. Mech. Mach. Theory 2022, 173, 104863.
3. Vitiello, V.; Lee, S.L.; Cundy, T.P.; Yang, G.Z. Emerging robotic platforms for minimally invasive surgery. IEEE Rev. Biomed. Eng. 2012, 6, 111–126.
4. Wu, D.; Zhang, R.; Pore, A.; Alba, D.D.; Ha, X.T.; Li, Z.; Zhang, Y.; Herrera, F.; Ourak, M.; Kowalczyk, W. A review on machine learning in flexible surgical and interventional robots: Where we are and where we are going. Biomed. Signal Process. 2024, 93, 106179.
5. Wang, Z.; Gao, Q.; Zhao, H. CPG-inspired locomotion control for a snake robot basing on nonlinear oscillators. J. Intell. Robot. Syst. 2017, 85, 209–227.
6. Iliukhin, V.N.; Mitkovskii, K.B.; Bizyanova, D.A.; Akopyan, A.A. The modeling of inverse kinematics for 5 DOF manipulator. Procedia Eng. 2017, 176, 498–505.
7. Tang, L.; Zhu, L.; Zhu, X.; Gu, G. Confined spaces path following for cable-driven snake robots with prediction lookup and interpolation algorithms. Sci. China Technol. Sci. 2020, 63, 255–264.
8. Cagigas-Muñiz, D. Artificial Neural Networks for inverse kinematics problem in articulated robots. Eng. Appl. Artif. Intell. 2023, 126, 107175.
9. Hrdina, J.; Návrat, A.; Vašík, P. Control of 3-link robotic snake based on conformal geometric algebra. Adv. Appl. Clifford Algebras 2016, 26, 1069–1080.
10. Tong, Y.; Liu, J.; Liu, Y.; Yuan, Y. Analytical inverse kinematic computation for 7-DOF redundant sliding manipulators. Mech. Mach. Theory 2021, 155, 104006.
11. Jain, T.; Jain, J.K.; Roy, D. Joint space redundancy resolution of serial link manipulator: An inverse kinematics and continuum structure numerical approach. Mater. Today Proc. 2021, 38, 423–431.
12. Sardana, L.; Sutar, M.K.; Pathak, P.M. A geometric approach for inverse kinematics of a 4-link redundant In-Vivo robot for biopsy. Robot. Auton. Syst. 2013, 61, 1306–1313.
13. Lin, P.F.; Huang, M.B.; Huang, H.P. Analytical solution for inverse kinematics using dual quaternions. IEEE Access 2019, 7, 166190–166202.
14. Kucuk, S.; Bingul, Z. Inverse kinematics solutions for industrial robot manipulators with offset wrists. Appl. Math. Model. 2014, 38, 1983–1999.
15. Park, S.O.; Lee, M.C.; Kim, J. Trajectory planning with collision avoidance for redundant robots using jacobian and artificial potential field-based real-time inverse kinematics. Int. J. Control Autom. 2020, 18, 2095–2107.
16. Chen, Y.; Luo, X.; Han, B.; Jia, Y.; Liang, G.; Wang, X. A general approach based on Newton’s method and cyclic coordinate descent method for solving the inverse kinematics. Appl. Sci. 2019, 9, 5461.
17. Dereli, S.; Köker, R.I. A meta-heuristic proposal for inverse kinematics solution of 7-DOF serial robotic manipulator: Quantum behaved particle swarm algorithm. Artif. Intell. Rev. 2020, 53, 949–964.
18. Huang, H.C.; Chen, C.P.; Wang, P.R. Particle swarm optimization for solving the inverse kinematics of 7-DOF robotic manipulators. In Proceedings of the 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Seoul, Republic of Korea, 14–17 October 2012; IEEE: New York, NY, USA, 2012; pp. 3105–3110.
19. Aydogmus, O.; Boztas, G. Implementation of singularity-free inverse kinematics for humanoid robotic arm using Bayesian optimized deep neural network. Measurement 2024, 229, 114471.
20. Yang, Y.; Li, W.; Song, B.; Zou, Y.; Pan, Y. Enhanced fault tolerant kinematic control of redundant robots with linear-variational-inequality based zeroing neural network. Eng. Appl. Artif. Intell. 2024, 133, 108068.
21. Sharkawy, A.N.; Khairullah, S.S. Forward and Inverse Kinematics Solution of A 3-DOF Articulated Robotic Manipulator Using Artificial Neural Network. Int. J. Robot. Control Syst. 2023, 3, 330–353.
22. Aysal, F.E.; Çelik, İ.; Cengiz, E.; Oğuz, Y.K. A comparison of multi-layer perceptron and inverse kinematic for RRR robotic arm. Politek. Derg. 2023, 27, 121–131.
23. Srisuk, P.; Sento, A.; Kitjaidure, Y. Inverse kinematics solution using neural networks from forward kinematics equations. In 2017 9th International Conference on Knowledge and Smart Technology (KST); IEEE: New York, NY, USA, 2017; pp. 61–65.
24. Thomas, M.J.; Sanjeev, M.M.; Sudheer, A.P.; ML, J. Comparative study of various machine learning algorithms and Denavit–Hartenberg approach for the inverse kinematic solutions in a 3-PP SS parallel manipulator. Ind. Robot. 2020, 47, 683–695.
25. Chawla, I.; Pathak, P.M.; Notash, L.; Samantaray, A.K.; Li, Q.; Sharma, U.K. Inverse and forward kineto-static solution of a large-scale cable-driven parallel robot using neural networks. Mech. Mach. Theory 2023, 179, 105107.
26. Wagaa, N.; Kallel, H.; Mellouli, N.D. Analytical and deep learning approaches for solving the inverse kinematic problem of a high degrees of freedom robotic arm. Eng. Appl. Artif. Intell. 2023, 123, 106301.
27. Wu, Z.; Li, Z.; Zhu, D.; Liao, Q.; Yao, W. A Multi-Robot Task Allocation Method Based on Graph Attention Network and Unsupervised Learning. In Proceedings of the 2024 IEEE International Conference on Unmanned Systems (ICUS), Nanjing, China, 18–20 October 2024; IEEE: New York, NY, USA, 2024; pp. 1222–1227.
28. Shao, X.; Xu, L.; Sun, G.; Yao, W.; Wu, L.; Della Santina, C. Self-attention enhanced dynamics learning and adaptive fractional-order control for continuum soft robots with system uncertainties. IEEE Trans. Autom. Sci. Eng. 2025, 22, 18694–18708.
29. Zhang, X.; Zheng, L.; Fan, W.; Mao, L.; Li, Z.; Cao, Y. Multi-domain data-driven chatter detection in robotic milling under varied robot poses based on directional attention mechanism. Mech. Syst. Signal Process. 2025, 227, 112406.
30. He, H.; Fu, H.; Wang, Q.; Zhou, S.; Liu, W.; Chen, Y. Spatio-temporal transformer-based reinforcement learning for robot crowd navigation. In Proceedings of the 2023 IEEE International Conference on Robotics and Biomimetics (ROBIO), Koh Samui, Thailand, 4–9 December 2023; IEEE: New York, NY, USA, 2023; pp. 1–7.
31. Wang, W.; Wang, R.; Mao, L.; Min, B.C. Navistar: Socially aware robot navigation with hybrid spatio-temporal graph transformer and preference learning. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; IEEE: New York, NY, USA, 2023; pp. 11348–11355.
32. Žlajpah, L.; Petrič, T. Kinematic calibration for collaborative robots on a mobile platform using motion capture system. Robot. Comput.-Integr. Manuf. 2023, 79, 102446.
33. Lu, L.; Li, G.; Xing, P.; Gao, H.; Song, Y.; Zhang, H. Numerical calculation and experimental investigation of the dynamic alignment of ship propulsion shafting based on Latin hypercube stochastic finite element. Ocean Eng. 2024, 296, 116935.
34. Xia, W.; Zhang, R.; Zhang, X.; Usman, M. A novel method for diagnosing Alzheimer’s disease using deep pyramid CNN based on EEG signals. Heliyon 2023, 9, e14858.
35. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
36. Johnson, R.; Zhang, T. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; Volume 1, pp. 562–570.
37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
38. Pan, M.; Su, T.; Liang, K.; Liang, L.; Yang, Q. Sensorless force estimation of teleoperation system based on multilayer depth Extreme Learning Machine. Appl. Soft Comput. 2024, 157, 111494.
39. Shan, G.; Guoyin, Z.; Chengwei, J.; Yanxia, W. SGDAT: An optimization method for binary neural networks. Neurocomputing 2023, 555, 126431.
40. Kong, Y.; Yang, L.; Chen, C.; Zhu, X.; Li, D.; Guan, Q.; Du, G. Online kinematic calibration of robot manipulator based on neural network. Measurement 2024, 238, 115281.
41. Elkholy, H.A.; Shahin, A.S.; Shaarawy, A.W.; Marzouk, H.; Elsamanty, M. Solving inverse kinematics of a 7-DOF manipulator using convolutional neural network. In Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2020), Cairo, Egypt, 8–10 April 2020; Springer: Cham, Switzerland, 2020; pp. 343–352.
42. Toquica, J.S.; Oliveira, P.C.A.S.; Souza, W.S.; Motta, J.M.C.O.S.; Borges, D.B.O.L. An analytical and a Deep Learning model for solving the inverse kinematic problem of an industrial parallel robot. Comput. Ind. Eng. 2021, 151, 106682.

Figure 1. Overview of the surgical robotic system. (a) Main structures of the surgical robotic system. (b) Structural components of the surgical instrument arm. (c) Joint DoF configuration in the surgical instrument arm.

Figure 2. FK analysis of the surgical instrument arm.

Figure 3. The structure of DPCNN.

Figure 4. Computational method for TFSAM.

Figure 5. Training and testing flowchart of the TFSAM-DPCNN.

Figure 6. The framework of the IK estimation model.

Figure 7. Training process of hybrid models guided by different optimizers.

Figure 8. Training and validation loss convergence curves of the proposed model.

Figure 9. Comparison of prediction errors of different methods. (a) The prediction error and fitting degree of the model for DoF d. (b) The prediction error and fitting degree of the model for DoF $θ_{1}$. (c) The prediction error and fitting degree of the model for DoF $θ_{2}$. (d) The prediction error and fitting degree of the model for DoF $θ_{3}$. (e) The prediction error and fitting degree of the model for DoF $θ_{4}$. (f) The prediction error and fitting degree of the model for DoF $θ_{5}$.

Figure 10. Comparison of evaluation indicators for different attention mechanisms.

Figure 11. (a) Motion trajectory generated by the surgical instrument arm simulation model. (b) The robot’s pose and position at the start points. (c) The robot’s pose and position at the end points.

Figure 12. Comparison of the predicted joint variables obtained by the proposed model with the ground truth joint values along the simulated motion trajectory.

Table 1. DH matrix parameters of the surgical instrument arm.

$i$	$α_{i - 1}$ (rad)	$a_{i - 1}$ (mm)	$d_{i}$ (mm)	$θ_{i}$ (rad)
1	0	0	$L_{1} + d$	$π$
2	0	0	$L_{2}$	$θ_{1}$
3	$π / 2$	0	0	$π / 2 + θ_{2}$
4	$π$	$L_{3}$	0	$- π / 2 + θ_{2}$
5	$- π / 2$	0	$L_{4}$	$θ_{3}$
6	$π / 2$	0	$L_{5}$	$π / 2 + θ_{4}$
7	$π / 2$	$L_{5}$	0	$θ_{5}$
8	$- π / 2$	$L_{5}$	0	$θ_{4}$
9	$π / 2$	$L_{5}$	0	$θ_{5}$
10	0	$L_{6}$	0	0

Table 2. The range of joint degrees of freedom.

Joint Variable/Unit	Value
d/mm	0~300
$θ_{1}$/rad	$- π / 2$ ~ $π / 2$
$θ_{2}$/rad	$0$ ~ $π / 12$
$θ_{3}$/rad	$- π / 2$ ~ $π / 2$
$θ_{4}$/rad	$- π / 12$ ~ $π / 12$
$θ_{5}$/rad	$- π / 12$ ~ $π / 12$

Table 3. Evaluation metrics of each algorithm under different variables.

Joint Variable	Evaluation Indicators	ELM	LSTM-EKF	MLP-ANNs	CNN	GRU	TFSAM-DPCNN
d	RMSE	4.2381	2.5400	4.5329	10.8827	3.0990	2.3766
	R²	0.9976	0.9991	0.9972	0.9842	0.9987	0.9993
	MAE	3.0755	1.8864	3.1430	9.0097	2.5429	1.7315
	MAPE	0.2689	0.1869	0.2825	0.5944	0.2225	0.1125
	SD	4.2571	2.8692	4.1833	10.8500	3.1208	2.4078
$θ_{1}$	RMSE	0.3332	0.1539	0.1431	0.2442	0.1026	0.08285
	R²	0.8664	0.9715	0.9749	0.9282	0.9873	0.9917
	MAE	0.2384	0.0774	0.0841	0.1760	0.0413	0.03720
	MAPE	0.8497	0.3738	0.3898	0.6990	0.2396	0.2216
	SD	0.3210	0.1479	0.1394	0.2439	0.0933	0.0838
$θ_{2}$	RMSE	0.031	0.0058	0.0060	0.01754	0.0029	0.0027
	R²	0.8322	0.9941	0.9937	0.9461	0.9985	0.9987
	MAE	0.0233	0.0046	0.0046	0.0140	0.0022	0.0020
	MAPE	0.8544	0.3444	0.3235	0.8760	0.0767	0.0880
	SD	0.0260	0.0055	0.0060	0.0166	0.0032	0.0029
$θ_{3}$	RMSE	0.3514	0.1589	0.1468	0.2898	0.1022	0.08513
	R²	0.8496	0.9692	0.9737	0.8976	0.9873	0.9912
	MAE	0.2556	0.0845	0.0841	0.216	0.0428	0.0404
	MAPE	1.4864	0.5336	0.731	1.2089	0.2909	0.2529
	SD	0.3391	0.1512	0.1466	0.2869	0.0939	0.0869
$θ_{4}$	RMSE	0.0165	0.0116	0.0118	0.0482	0.0058	0.0055
	R²	0.988	0.9940	0.9938	0.8974	0.9985	0.9986
	MAE	0.0122	0.0084	0.0089	0.0397	0.0044	0.0042
	MAPE	0.465	0.2473	0.2866	0.9023	0.1124	0.1921
	SD	0.0205	0.0111	0.0117	0.0481	0.0057	0.0057
$θ_{5}$	RMSE	0.0281	0.0107	0.0121	0.0477	0.00544	0.00556
	R²	0.9657	0.9951	0.9936	0.8595	0.99874	0.99870
	MAE	0.0202	0.0075	0.0092	0.039	0.00416	0.00422
	MAPE	0.6374	0.2864	0.3378	0.9714	0.11014	0.1780
	SD	0.0257	0.0110	0.012	0.0474	0.00574	0.00567
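
For readers who wish to compute comparable indicators on their own predictions, the following sketch evaluates the five metrics reported in Table 3 with their standard textbook definitions. The function name is hypothetical, and two points are assumptions rather than facts from the table: MAPE is expressed here in percent, and SD is taken as the standard deviation of the prediction error. The definitions in Section 4.3 take precedence where they differ.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, R2, MAE, MAPE (percent) and SD of the error for one joint variable."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": np.mean(np.abs(err)),
        # Undefined where y_true == 0; such samples must be handled by the caller.
        "MAPE": 100.0 * np.mean(np.abs(err / y_true)),
        # Assumption: SD is the (population) standard deviation of the error.
        "SD": np.std(err),
    }


if __name__ == "__main__":
    truth = [1.0, 2.0, 3.0, 4.0]
    pred = [1.1, 1.9, 3.2, 3.8]
    for name, value in regression_metrics(truth, pred).items():
        print(f"{name}: {value:.4f}")
```

Because several joint ranges in Table 2 include zero, a relative metric such as MAPE can become unstable near zero-valued targets, which is one reason to read it alongside RMSE and MAE rather than on its own.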

Table 4. Comparison of parameter scale and training time for different models.

Model	Training Time (Minutes)	Parameters
ELM	2.2	1900
LSTM-EKF	134.02	804,358
MLP-ANNs	14.22	398,527
CNN	26.59	22,118
GRU	87.88	254,214
TFSAM-DPCNN	197.63	142,660


