Article

Learning-Based Variable Admittance Control Combined with NMPC for Contact Force Tracking in Unknown Environments

College of Mechanical and Electrical Engineering, Harbin Engineering University, Harbin 150001, China
* Author to whom correspondence should be addressed.
Actuators 2025, 14(7), 323; https://doi.org/10.3390/act14070323
Submission received: 7 June 2025 / Revised: 25 June 2025 / Accepted: 26 June 2025 / Published: 30 June 2025
(This article belongs to the Special Issue Motion Planning, Trajectory Prediction, and Control for Robotics)

Abstract

With the development of robotics, robots are playing an increasingly critical role in complex tasks such as flexible manufacturing, physical human–robot interaction, and intelligent assembly. These tasks place higher demands on the force control performance of robots, particularly in scenarios where the environment is unknown, making constant force control challenging. This study first analyzes the robot and its interaction model with the environment, highlighting the limitations of traditional force control methods in addressing unknown environmental stiffness. Based on this analysis, a variable admittance control strategy is proposed using the deep deterministic policy gradient algorithm, enabling the online tuning of admittance parameters through reinforcement learning. Furthermore, this strategy is integrated with a quaternion-based nonlinear model predictive control scheme, ensuring coordination between pose tracking and constant-force control and enhancing overall control performance. The experimental results demonstrate that the proposed method improves constant force control accuracy and task execution stability, validating the feasibility of the proposed approach.

1. Introduction

As robots are increasingly deployed in diverse industrial and service applications, their roles in flexible manufacturing, human–robot collaboration, and intelligent assembly have become increasingly prominent. These tasks not only require robots to possess high-precision position control capabilities but also demand the ability to apply accurate contact forces during interactions with the environment [1]. For example, in human–robot interaction, it is essential to ensure the operator’s safety and comfort; in compliant contact, robots must adapt their behavior to the characteristics of the environment or object to achieve a soft and compliant response, thereby preventing unknown external forces from disturbing the robot’s motion; and in environmental interaction, robots must be capable of exerting a required external force or maintaining a constant contact force, thereby ensuring the quality and efficiency of the interaction.
In these tasks, safety constraints, physical human–robot interaction, and the diversity of environmental interaction tasks impose higher demands on the force control performance of robots. To address the challenges of complex environmental interactions, admittance control and impedance control have become commonly adopted methods for achieving force–position control. Admittance control takes the external force as the input and outputs the desired position or velocity, enabling the robot to respond compliantly to external forces. Impedance control, on the other hand, establishes a mechanical impedance model by mapping the desired stiffness, damping, and inertia parameters to the robot’s dynamics, thereby achieving controlled force interactions between the robot and the environment. These control approaches have balanced the needs for both position and force to some extent, laying the foundation for safe interaction between robots and their environments [2,3]. However, in cases where the environment’s stiffness is unknown or subject to variation, achieving constant force control remains a significant challenge [4].
Ensuring the safety and compliant response of robots under unknown external forces has long been a key challenge in force control research. From the perspective of compliance, some researchers have designed controllers aimed at accelerating the attenuation of external force effects on robots, thereby enhancing the overall system’s compliance. The research presented by the authors of [5,6] combined model predictive control (MPC) with impedance and admittance control, formulating the control problem as an optimization problem that explicitly incorporates motion constraints while maintaining the contact force within a safe range. This approach effectively improves the system’s compliance and stability. The authors of [7] further transformed the problem of selecting impedance parameters into a multi-step optimization problem, embedding it directly into the control input to address the trade-off between tracking accuracy and motion compliance. The authors of [8,9] proposed optimization-based impedance parameter selection methods, including diagonal-dominant and non-diagonal-dominant inertia matrix designs, which treated unknown external forces as the free response to non-zero initial conditions. These approaches enhanced the robot’s recovery speed under unknown disturbances while preserving robustness and motion performance. The authors of [10,11] integrated sliding mode control (SMC) with admittance control, enabling rapid response to external force variations in joint spaces. This strategy effectively suppressed the influence of external disturbances and achieved high trajectory-tracking accuracy and steady-state performance, thereby substantially enhancing the disturbance rejection capabilities of the force control system.
Such strategies typically treat external forces as disturbances to be suppressed, aiming to avoid hard contact between the robot’s end-effector and the environment to ensure safety. However, this conservative approach makes it challenging for the robot to actively apply and maintain a constant contact force. When the environmental stiffness is unknown, these strategies often cause the robot to yield or retreat in response to external forces, making it difficult to achieve an active, constant force output toward the environment. Consequently, this approach fails to meet the demands of interaction tasks that require the stable and continuous application of a specific contact force.
In complex physical human–robot collaboration scenarios, robots are not only required to complete tasks but also to ensure safety and compliance [12,13]. Keemink et al. systematically analyzed issues such as feedforward control, force signal filtering, and joint flexibility in the admittance control framework [14]. They designed an admittance controller characterized by low inertia and high stability, achieving appropriate disturbance suppression and ensuring safe human–robot interaction. Yao et al. adopted an adaptive admittance control method using a radial basis function (RBF) neural network, enabling sensorless human–robot interaction for industrial robots and enhancing the robot’s force-sensing capability [15]. The authors of [16] leveraged MPC algorithms and task demonstration models to optimize—via online methods—both the robot’s trajectory and impedance model, thereby improving robot–environment interactions across multiple tasks. The authors of [17] proposed a haptic stability observer controller that, combined with MPC, enabled variable admittance adjustments, achieving a good balance between motion stability and compliance during human–robot interactions. The authors of [18] utilized neural networks to classify force and velocity data, thereby realizing the flexibility and portability of adaptive admittance control to accommodate different users and scenarios. The research conducted by the authors of [19] focused on exoskeleton technology, integrating adaptive impedance control into a virtual decomposition control framework, which significantly enhanced interaction performance and stability. Furthermore, the authors of [20] investigated human stiffness adaptation behavior in unstable external force fields based on model predictive control and variable stiffness strategies, thus improving the robot’s dynamic adaptability in complex environments. The authors of [21] proposed an online impedance adjustment method that incorporates safety control barrier functions (SCBFs) into MPC, ensuring force safety during human–robot interaction while enhancing control frequency and adaptability.
Such methods in force control strategies emphasize system compliance and stability to ensure safe interactions with humans or the environment. However, they typically do not aim to apply precise constant forces by design. When the environment’s stiffness is unknown, these methods often rely on higher compliance to accommodate external force disturbances rather than pursuing strict force tracking, which limits their applicability in tasks requiring high-precision constant force control.
In complex environmental interaction tasks, unknown environmental stiffness and non-ideal contact conditions place higher demands on robotic force control systems. The authors of [22] proposed a method within the impedance controller that avoids explicitly designing virtual stiffness; instead, it achieves constant force control in unknown environments by adjusting damping in response to variations in contact force. The authors of [23] utilized inverse reinforcement learning (IRL) to establish the relationship between velocity and force, enabling active compliant control in complex, unknown ultrasonic environments and improving pose stability during continuous scanning. The authors of [24] employed an integral barrier Lyapunov function to design environmental constraints, combined with a neural network to compensate for dynamic model errors; however, their study was limited to a two-link simulation and lacked validation on multi-degree-of-freedom systems. The authors of [25] proposed a deep reinforcement learning (DRL) controller that learns during simulations and transfers the acquired knowledge to the real environment, achieving good task adaptability, although the training process is complex and poses risks during transfers. The authors of [26] introduced a model predictive interaction control (MPIC) framework that integrates the robot’s motion and interaction force models. By incorporating an elastic environment model and a stiffness-constrained controller, it achieved compliant control for multiple tasks, though the study primarily focused on point-loading scenarios and provided limited insight into trajectory-tracking capabilities. The authors of [27] combined MPC with adaptive mechanisms to realize stable force interactions under unknown environmental conditions, but they did not directly treat contact force as a tracking objective.
Many existing environmental interaction strategies rely on prior knowledge of environmental stiffness to adjust control gains [28,29]. However, real-world environments often exhibit spatially and temporally varying stiffness, making accurate modeling difficult and fixed-model approaches prone to degraded force control. Some methods introduce higher damping to mitigate stiffness uncertainty, but this may neglect the essential force–position relationship and compromise control performance. Inspired by recent studies on neural-network-based system approximation [30], this study adopts a data-driven approach to adapt control parameters in the absence of explicit environment models.
For continuous control tasks, algorithms such as PPO, SAC, and TD3 have shown promise. However, PPO’s on-policy nature demands extensive interaction data, SAC incurs higher computation due to entropy tuning, and TD3’s improvements are more effective in highly stochastic settings. In contrast, this work adopts the Actor–Critic Deep Deterministic Policy Gradient (AC-DDPG) algorithm, which offers deterministic low-latency outputs, supports continuous bounded actions (e.g., stiffness/damping modulation), and achieves stable learning through experience replay and target networks—making it well-suited for real-time variable admittance control.
To address the challenges of unknown contact dynamics, this study proposes a hybrid control strategy that integrates AC-DDPG-based admittance regulation with quaternion-based nonlinear model predictive control (NMPC). The AC-DDPG component enables the online adjustment of stiffness and damping, improving force tracking and robustness under uncertain stiffness, while the NMPC ensures precise pose tracking through quaternion orientation representation. Together, this framework achieves the stable and adaptive control of both force and motion in unstructured environments. The main contributions of this study are as follows:
  • This study proposes an AC-DDPG-based variable admittance parameter optimization method that enables the online tuning of stiffness and damping parameters in unknown environments.
  • It designs a quaternion-based nonlinear model predictive controller to achieve high-precision pose and position tracking of the robot’s end-effector.
  • It organically combines these two methods to form a stable constant force–position hybrid control strategy in unknown environments, improving the robustness and accuracy of force–position control in complex scenarios.
The remainder of this study is organized as follows. Section 1 presents a literature review. Section 2 analyzes the robot and its contact environment model. Section 3 details the design of the main control strategy, including the AC-DDPG-based variable admittance strategy and the quaternion-based NMPC force–position hybrid tracking strategy. Section 4 presents experimental analyses, including comparisons with optimal-variable admittance and constant-stiffness strategies. Finally, Section 5 summarizes and concludes this study.

2. System Modeling

This section analyzes the process and modeling of robot–environment interactions. As shown in Figure 1, the interaction process can be divided into three main phases: (1) the location phase, (2) the loading phase, and (3) the following phase. The location phase involves the robot’s kinematic model, while the loading phase incorporates the environmental model. The following phase represents a synthesis of these two models.
The homogeneous matrix of the robot’s forward kinematics can be expressed as follows:
${}^{B}H_{E} = \begin{bmatrix} {}^{B}R_{E}(q) & {}^{B}P_{EORG}(q) \\ 0^{T} & 1 \end{bmatrix},$
where ${}^{B}R_{E}(q)$ represents the rotation matrix of the end-effector coordinate system $\{W_E\}$ with respect to the robot base coordinate system $\{W_B\}$, ${}^{B}P_{EORG}(q)$ represents the position vector of the origin of the end-effector with respect to the robot base, and $q = [q_1, \ldots, q_6]^T$ is a vector that contains the joint positions of the robot. According to the differential kinematics of the robot, we obtain the following:
$\begin{bmatrix} {}^{B}V_{E} \\ {}^{B}\Omega_{E} \end{bmatrix} = {}^{B}J_{E}(q)\,\dot{q},$
where ${}^{B}J_{E}(q)$ is the robot’s Jacobian matrix. When we transform the contact force ${}^{S}f_{m}(t)$—measured using the F/T sensor mounted on the end-effector—from the sensor’s coordinate system $\{W_S\}$ to the robot base’s coordinate system $\{W_B\}$, we obtain the following:
${}^{B}f_{m}(t) = \begin{bmatrix} {}^{B}R_{E}(q) & 0 \\ 0 & {}^{B}R_{E}(q) \end{bmatrix} {}^{S}f_{m}(t).$
For notational convenience, ${}^{B}f_{m}(t)$ is denoted as $f_m(t)$ throughout this study.
As shown in Figure 1, during the loading phase, let $x_c(t)$, $x_m(t)$, and $x_e(t)$ represent the commanded pose of the end-effector, the current pose of the end-effector, and the environmental position, respectively. Assuming that the environment is perfectly rigid, it can be approximated as a linear spring with stiffness $K_e$. If $x_m(t) = x_e$, the contact force between the robot’s end-effector and the environment can thus be simplified as follows:
$f_e = K_e \left( x_c(t) - x_m(t) \right) = K_e \left( x_e(t) - x_m(t) \right).$
Let $f_d$ represent the desired contact force; then, the force-tracking error is $\Delta f = f_e - f_d$. According to the classical admittance model, we have the following:
$M \left( \ddot{x}_c(t) - \ddot{x}_r \right) + D \left( \dot{x}_c(t) - \dot{x}_r \right) + K \left( x_c(t) - x_r \right) = f_m - f_d,$
where $M$, $D$, and $K$ denote the virtual mass, damping, and stiffness matrices, respectively. $x_r$ represents the reference pose, which, in constant force control, is typically selected as the pose of the end-effector in the loading direction at the loading point, i.e., $x_r = x_m(t)$. Assuming that the sensor is noise-free, it follows that $f_m = f_e$. Analyzing the steady-state behavior of the system in Equation (5), we obtain the following:
$K \left( x_c(t) - x_r \right) = f_m.$
A comparison with Equation (4) reveals that Equation (6) holds when $K = K_e$ and $x_c(t) = x_r - K_e^{-1} f_s$. However, in practice, the environmental stiffness $K_e$ is often unknown, and conventional admittance strategies cannot achieve constant-force tracking at the steady state. Therefore, it is necessary to adjust the virtual stiffness $K$ and virtual damping $D$ online, enabling the robot to achieve constant force control in environments with unknown stiffness.
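To make this limitation concrete, the following 1-DOF numerical sketch (Python; all values assumed rather than taken from the paper) simulates the admittance law of Equation (5) against a linear-spring environment with the reference pose held at the loading point. With a fixed virtual stiffness chosen without knowledge of $K_e$, the contact force settles away from the desired value, which is exactly the gap that the variable admittance strategy of Section 3 addresses.

```python
import numpy as np

# 1-DOF illustration (all values assumed, not taken from the paper).
# Environment: linear spring, in contact when x_m < x_e (pressing direction is negative x).
# Admittance (Eq. 5) with fixed M, D, K; reference x_r held at the loading point.
K_e = 50_000.0                      # true (unknown) environment stiffness [N/m]
M, D, K = 50.0, 2_000.0, 20_000.0   # fixed virtual admittance parameters
f_d = 10.0                          # desired contact force [N]
Ts = 0.001                          # integration step [s]

x_e, x_r = 0.0, 0.0                 # environment surface and reference (loading point)
x_c, dx_c, f_m = 0.0, 0.0, 0.0      # commanded pose, velocity, measured force

for _ in range(20_000):
    x_m = x_c                                       # ideal inner pose tracking assumed
    f_m = K_e * (x_e - x_m) if x_m < x_e else 0.0   # spring-like contact force
    ddx_c = (f_m - f_d - D * dx_c - K * (x_c - x_r)) / M
    dx_c, x_c = dx_c + Ts * ddx_c, x_c + Ts * dx_c  # forward Euler, as in Eq. (21)

print(f"steady-state force: {f_m:.2f} N (desired {f_d:.1f} N)")
# Here the force settles near K_e*f_d/(K_e + K) ~ 7.1 N: a fixed virtual stiffness
# chosen without knowledge of K_e cannot guarantee constant-force tracking.
```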

3. Main Control Strategy

This section employs the deep deterministic policy gradient algorithm to adjust the admittance parameters online. The goal is to improve constant force control accuracy in environments with unknown stiffness and to reduce oscillations during the force-tracking process. Additionally, a Cartesian-space model predictive controller based on quaternion representation is designed to enhance force control accuracy in the force application direction as well as trajectory-tracking precision. The underlying principle is illustrated in Figure 2.

3.1. Online Tuning of Admittance Parameters

As analyzed in Section 2, true constant force control cannot be achieved when the environmental stiffness $K_e$ is unknown. In this section, the deep deterministic policy gradient algorithm, combined with an actor–critic network structure (AC-DDPG), is employed to adaptively adjust the stiffness and damping matrices of the admittance model online. This enables the robot to achieve constant force control during interactions with the environment.
The state space designed for the AC-DDPG strategy is defined as follows:
$s(t) = \left[ e_f(t),\ \dot{e}_f(t),\ f_m(t),\ x_c(t) - x_m(t),\ \dot{x}_c(t) - \dot{x}_m(t) \right],$
where $e_f(t) = f_m(t) - f_d$ represents the real-time force-tracking error, the core performance indicator of the controller. $\dot{e}_f(t)$ denotes the rate of change of the force error, which captures the oscillatory trend in force variations and guides the adjustment of the damping. $f_m(t)$ represents the environmental force sensed by the sensor. $x_c(t) - x_m(t)$ denotes the integrated error between the commanded pose of the end-effector and its current pose, while $\dot{x}_c(t) - \dot{x}_m(t)$ represents the integrated error between the commanded velocity of the end-effector and its current velocity. By selecting these states, the force-tracking process within the admittance model is comprehensively described, encompassing both the environmental interaction forces and the robot’s state.
The action space designed for the AC-DDPG strategy is defined as follows:
$a(t) = \left[ \Delta K(t),\ \Delta D(t) \right],$
where Δ K ( t ) and Δ D ( t ) denote the variations in stiffness and damping at time t. Based on Equation (8), the update expressions for stiffness and damping at each time step can be described as follows:
$K(t+1) = K(t) + \eta \Delta K(t), \qquad D(t+1) = D(t) + \eta \Delta D(t),$
where η is the learning rate of stiffness and damping.
The reward function is designed to minimize the force-tracking error, reduce oscillations during the force-tracking process, and avoid abrupt changes in the admittance parameters, and it is expressed as follows:
$r(t) = -\alpha\, e_f^2(t) - \beta\, \dot{e}_f^2(t) - \gamma \left( \Delta K^2 + \Delta D^2 \right),$
where $\alpha$, $\beta$, and $\gamma$ are design weights. The reward terms are negative so that reducing the errors corresponds to maximizing the reward. In the reward function, the first term minimizes the force-tracking error, guiding the adjustment of the stiffness $K$. The second term suppresses oscillations, directing the adjustment of the damping $D$. The third term serves as a soft constraint that mitigates abrupt parameter changes.
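As an illustration of how the state, action, and reward of Equations (7)–(10) fit together, a minimal Python/NumPy sketch is given below. The helper names are hypothetical, the parameter bounds follow the values later listed in Section 4.1.2, and the paper’s actual implementation is in C++.

```python
import numpy as np

def build_state(f_m, f_d, e_f_prev, x_c, x_m, dx_c, dx_m, Ts):
    """Assemble the RL state of Eq. (7): force error, its rate, measured force,
    pose error, and velocity error (five 6-D blocks -> a 30-D vector)."""
    e_f = f_m - f_d
    de_f = (e_f - e_f_prev) / Ts
    return np.concatenate([e_f, de_f, f_m, x_c - x_m, dx_c - dx_m]), e_f

def apply_action(K, D, dK, dD, eta=1.0,
                 K_lim=(1e3, 1e5), D_lim=(5e2, 5e3)):
    """Update stiffness and damping as in Eq. (9), clipped to the admissible
    ranges listed in Section 4.1.2."""
    K = np.clip(K + eta * dK, *K_lim)
    D = np.clip(D + eta * dD, *D_lim)
    return K, D

def reward(e_f, de_f, dK, dD, alpha=100.0, beta=1.0, gamma=0.99):
    """Negative quadratic reward of Eq. (10): force error, oscillation, parameter jumps."""
    return -(alpha * np.sum(e_f ** 2) + beta * np.sum(de_f ** 2)
             + gamma * (np.sum(dK ** 2) + np.sum(dD ** 2)))
```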
Based on Equations (7) and (8), during the online tuning process of the admittance parameters using the AC-DDPG algorithm, the input is a continuous-state vector, and the output is a continuous-action vector. Consequently, fully connected feedforward neural networks are selected for both the actor and critic networks. According to the universal approximation theorem, a fully connected neural network with a sufficiently large number of hidden neurons can approximate any nonlinear function. This architecture ensures stable online training, facilitates hyperparameter tuning, and avoids additional computational complexity, making it suitable for implementing the online training and execution of the variable admittance control strategy.
The critic network $Q(s, a; \theta^Q)$ is a fully connected neural network with a single output, which estimates the cumulative return of taking action $a$ in state $s$. Let $\theta^Q$ denote the set of all trainable weights and biases; the critic network can then be expressed as follows:
$Q(s, a; \theta^Q) = \sigma \left( W_i [s; a] + b_i \right),$
where W i denotes the weight vector of the i-th hidden layer, b i represents the bias vector of the i-th hidden layer, and σ ( · ) is the ReLU activation function.
The actor network $\pi(s; \theta^\pi)$ is a fully connected neural network that takes the state $s$ as an input and outputs a continuous-action vector $a = [\Delta K, \Delta D]$. The output range is constrained, and the actor network is expressed as follows:
$a = \pi(s; \theta^\pi) = \kappa \tanh \left( \sigma(W_i s + b_i) \right),$
where κ is the range-limiting variable that ensures that the output remains bounded.
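The paper trains these networks with Libtorch in C++ (Section 4.1); the following PyTorch sketch, using the layer sizes reported in Section 4.1.2, is an illustrative stand-in rather than the authors’ code.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, KAPPA = 30, 12, 10.0   # sizes and scaling from Section 4.1.2

class Actor(nn.Module):
    """pi(s; theta_pi): two hidden ReLU layers, tanh output scaled by kappa (Eq. 12)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, ACTION_DIM), nn.Tanh())

    def forward(self, s):
        return KAPPA * self.net(s)            # bounded Delta K, Delta D in [-10, 10]

class Critic(nn.Module):
    """Q(s, a; theta_Q): scalar value of the concatenated state-action pair (Eq. 11)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```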
Under the AC-DDPG strategy, experience replay buffer and target network techniques are employed to train both the actor and critic networks, with the aim of optimizing the value and policy functions. At each time step $t$, the collected interaction data—comprising the state $s(t)$, action $a(t)$, reward $r(t)$, and the next state $s(t+1)$—are stored in the experience replay buffer for subsequent batch training. This mechanism helps break the correlation among samples, thereby improving training stability. During training, let $N$ denote the batch size; a batch of data $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{N}$ is randomly sampled from the experience replay buffer. The critic network then utilizes the target network to compute the temporal difference (TD) target value, where the action at the next time step is generated by the target actor network and is expressed as follows:
$a'_{i+1} = \pi' \left( s_{i+1}; \theta^{\pi'} \right).$
Moreover, the corresponding Q’-value is computed using the target critic network:
$Q'_i = Q' \left( s_{i+1}, a'_{i+1}; \theta^{Q'} \right).$
Then, the TD target value is obtained as follows:
$y_i = r_i + \lambda Q'_i,$
where λ denotes the discount factor. The loss function of the critic network is defined as follows:
$L_{\mathrm{Critic}} = \frac{1}{N} \sum_{i=1}^{N} \left( Q(s_i, a_i; \theta^Q) - y_i \right)^2 .$
By minimizing Equation (16), the learning parameter θ Q of the critic network is updated. The actor network is updated by maximizing the Q-value output of the critic network, with the corresponding loss function defined as follows:
$L_{\mathrm{Actor}} = -\frac{1}{N} \sum_{i=1}^{N} Q \left( s_i, \pi(s_i; \theta^\pi); \theta^Q \right).$
The actor network updates its parameters by minimizing Equation (17). Additionally, to enhance training stability, a soft update mechanism is introduced for the target network, with its parameter update expression defined as follows:
$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\, \theta^{Q'}, \qquad \theta^{\pi'} \leftarrow \tau \theta^{\pi} + (1 - \tau)\, \theta^{\pi'},$
where τ denotes the soft update coefficient. The AC-DDPG algorithm triggers training when the number of samples in the experience buffer exceeds N and the specified training interval is reached, ensuring both the efficiency of network updates and the real-time performance of the overall system. At each sampling time t, the virtual stiffness and damping values are updated by the actor network, which then allows the computation of the commanded pose at each time step. Based on Equation (9), the variable admittance controller is expressed as follows:
$M \left( \ddot{x}_c(t) - \ddot{x}_r(t) \right) + D(t) \left( \dot{x}_c(t) - \dot{x}_r(t) \right) + K(t) \left( x_c(t) - x_r(t) \right) = f_m(t) - f_d .$
By rearranging Equation (19), the commanded acceleration can be computed as follows:
$\ddot{x}_c(t) = \ddot{x}_r(t) + \frac{1}{M} \left[ e_f(t) - D(t) \left( \dot{x}_c(t) - \dot{x}_r(t) \right) - K(t) \left( x_m(t) - x_r(t) \right) \right].$
Using the forward Euler method, the commanded velocity and commanded pose can be computed as follows:
$\dot{x}_c(k+1) = \dot{x}_c(k) + T_s \ddot{x}_c(k), \qquad x_c(k+1) = x_c(k) + T_s \dot{x}_c(k).$
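Putting Equations (13)–(18) together, one AC-DDPG update can be sketched as below (PyTorch for illustration, reusing the Actor/Critic modules sketched above; the paper’s implementation uses C++/Libtorch). The buffer is assumed to be a list of tensor tuples, and the function name ddpg_update is hypothetical.

```python
import random
import torch

def ddpg_update(buffer, actor, critic, actor_t, critic_t,
                opt_actor, opt_critic, batch=64, gamma=0.99, tau=0.005):
    """One AC-DDPG training step: TD target from the target networks (Eqs. 13-15),
    critic regression (Eq. 16), policy gradient (Eq. 17), and soft updates (Eq. 18)."""
    s, a, r, s_next = map(torch.stack, zip(*random.sample(buffer, batch)))

    with torch.no_grad():
        a_next = actor_t(s_next)                                 # Eq. (13)
        y = r.unsqueeze(-1) + gamma * critic_t(s_next, a_next)   # Eqs. (14)-(15)

    critic_loss = ((critic(s, a) - y) ** 2).mean()               # Eq. (16)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    actor_loss = -critic(s, actor(s)).mean()                     # Eq. (17): maximize Q
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    for t_net, net in ((critic_t, critic), (actor_t, actor)):    # Eq. (18): soft update
        for p_t, p in zip(t_net.parameters(), net.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```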

3.2. Force–Position Hybrid Control Based on a Quaternion-Based MPC Controller

In the previous section, the commanded pose was computed. In this section, a pose controller is designed to enhance the performance of force–position hybrid control. Traditional end-effector pose control strategies often employ numerical inverse kinematics combined with PID joint control. However, these approaches can result in oscillations during position tracking, and offline trajectory-planning strategies cannot achieve the real-time tracking of the desired trajectory. This section transforms the trajectory-tracking and location problems into a constrained optimal control problem, ensuring both smooth motions and good trajectory-tracking performance. In high-stiffness scenarios, even small displacements in the force application direction can induce large force variations. As a result, the difference between the commanded pose x c ( t ) and the actual pose x m ( t ) is often small, making it difficult to directly drive the robot’s motion. The MPC strategy can amplify the error in the force application direction by selecting appropriate directional gains, thereby improving force control precision. Additionally, by incorporating joint constraints, it is possible to prevent the commanded pose from taking abnormal values due to the learning component, thus protecting the actuator and enhancing overall safety.
In the trajectory-tracking tasks of the robotic end-effector’s location, situations involving both pose adjustment and pose holding frequently occur. When using Euler angles to represent orientation changes, the gimbal lock problem arises. Furthermore, during optimization problems involving orientation angles, the periodic nature of angle parameters (with a period of 2 π ) can cause numerical discontinuities when angles cross the domain boundary (e.g., transitioning from π to − π ). Although both representations physically correspond to the same orientation, they can cause a sudden increase in the cost function, resulting in controller failure. Additionally, the linear interpolation of Euler angles often results in non-uniform rotation paths, which can cause end-effector oscillations. Even small oscillations can significantly impact force control accuracy. Rotation matrices require nine parameters to represent three degrees of freedom, introducing six redundant constraints that increase the computational complexity of the optimal control problem and require orthogonalization, which may violate constraints.
In this subsection, quaternions are employed to describe changes in orientation angles. Although this approach transforms the linear time-varying optimal control problem into a nonlinear problem, it offers notable advantages, including the absence of singularities in orientation representation, high computational efficiency, and numerical stability in optimization problems. These features make it well-suited for trajectory tracking within the lower-level controller of force–position hybrid control.

3.2.1. The Discrete Form of the System-State Equation

Let $x_m(t) = [p(t), q_u(t)]^T$ denote the end-effector’s pose, where $x_m(t) \in \mathbb{R}^{7 \times 1}$, $p(t) \in \mathbb{R}^{3 \times 1}$ represents the position vector, and $q_u(t) \in \mathbb{R}^{4 \times 1}$ represents the orientation vector in quaternion form. The linear velocity part of the discrete formulation of Equation (2) is computed using the fourth-order Runge–Kutta method and is expressed as follows:
$k_1 = f\left(p(k), u(k)\right), \quad k_2 = f\left(p(k) + \tfrac{T_s}{2} k_1, u(k)\right), \quad k_3 = f\left(p(k) + \tfrac{T_s}{2} k_2, u(k)\right), \quad k_4 = f\left(p(k) + T_s k_3, u(k)\right),$
$p(k+1) = p(k) + \tfrac{T_s}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right),$
where $u(k) = \left(q(k) - q(k-1)\right)/T_s$ represents the discrete form of the joint velocity, which serves as the system’s input. Using the exponential map of quaternions to discretize the orientation, the incremental quaternion is computed as follows:
$\Delta q_u(k) = \left[ \cos\tfrac{\theta}{2},\ \sin\tfrac{\theta}{2} \cdot v \right],$
where $\theta = \| {}^{B}\Omega_{E} \|_2$ and $v$ represents the unit rotation axis, expressed as follows:
$v = \begin{cases} 0, & \| {}^{B}\Omega_{E} \|_2 < \varepsilon \\ \dfrac{{}^{B}\Omega_{E}}{\| {}^{B}\Omega_{E} \|_2}, & \text{otherwise} \end{cases}$
Then, the quaternion at time step k + 1 is updated using quaternion multiplication:
$q_u(k+1) = \Delta q_u(k) \otimes q_u(k),$
where quaternion multiplication $a \otimes b$ is defined as follows:
$a \otimes b = \left( w_a w_b - v_a \cdot v_b \right) + \left( w_a v_b + w_b v_a + v_a \times v_b \right).$
By combining Equations (22) and (26), the nonlinear discrete equations based on quaternion representation can be obtained, and they are expressed as follows:
$x_m(k+1) = F \left( x_m(k), q(k), u(k), T_s \right).$
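A compact sketch of this discretization is given below (Python/NumPy for illustration; the paper’s implementation is in C++). quat_mul implements the Hamilton product of Equation (26) with quaternions stored as [w, x, y, z], quat_step applies the exponential-map update of Equations (23)–(25), where the rotation increment over one sampling period is taken as $\|\Omega\| T_s$ (an assumption about the discretization), and rk4_step integrates the position as in Equation (22); composing the two updates yields the discrete map $F$ of Equation (27).

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product a (x) b, quaternions stored as [w, x, y, z] (Eq. 26)."""
    wa, va = a[0], a[1:]
    wb, vb = b[0], b[1:]
    return np.concatenate(([wa * wb - va @ vb],
                           wa * vb + wb * va + np.cross(va, vb)))

def quat_step(q_u, omega, Ts, eps=1e-9):
    """Orientation update via the quaternion exponential map (Eqs. 23-25)."""
    w_norm = np.linalg.norm(omega)
    theta = w_norm * Ts                        # incremental rotation angle over one step
    v = omega / w_norm if w_norm >= eps else np.zeros(3)
    dq = np.concatenate(([np.cos(theta / 2)], np.sin(theta / 2) * v))
    q_next = quat_mul(dq, q_u)
    return q_next / np.linalg.norm(q_next)     # renormalize against numerical drift

def rk4_step(f, p, u, Ts):
    """Fourth-order Runge-Kutta step for the position kinematics (Eq. 22);
    f(p, u) returns the Cartesian linear velocity (in practice J_v(q) u)."""
    k1 = f(p, u)
    k2 = f(p + Ts * k1 / 2, u)
    k3 = f(p + Ts * k2 / 2, u)
    k4 = f(p + Ts * k3, u)
    return p + Ts / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
```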

3.2.2. Constraints Design

In the MPC pose controller, equality constraints are formulated based on the state Equation (27), along with joint smoothness constraints determined by the system’s control inputs. Since the end-effector of the robot can, in principle, reach any position in space during the location and tracking phases, no state constraints are imposed on the MPC controller.
Let the prediction horizon of the controller be N c . Using a multiple shooting approach, the predicted state equation at time step k is formulated as follows:
$x_{m,1|k} = F(x_{m,0|k}, q_{0|k}, u_{0|k}, T_s), \quad x_{m,2|k} = F(x_{m,1|k}, q_{1|k}, u_{1|k}, T_s), \quad \ldots, \quad x_{m,N_c|k} = F(x_{m,N_c-1|k}, q_{N_c-1|k}, u_{N_c-1|k}, T_s),$
where $i|k$ denotes the prediction at future time $k+i$ given the information available at time $k$. By analyzing the constraint in Equation (28), the time-varying joint position sequence $\{q_{i|k}\}_{i=0}^{N_c-1}$ is updated and expressed as follows:
$u_{i|k} = \frac{q_{i|k} - q_{i-1|k}}{T_s}, \qquad q_{i|k} = q_{i-1|k} + T_s u_{i|k-1}, \qquad i \in [0, N_c - 1],$
where $u_{i|k-1}$ represents the control input obtained from the solution of the optimal control problem at the previous time step.
To prevent an abnormal commanded pose x c ( k ) from resulting in invalid control input u ( k ) and to ensure the safety of the robot’s actuated joints, joint constraints are introduced. Specifically, the joint velocity constraints determined by the control input are provided by the following:
$\mathcal{U} := \left\{ u_{i|k} \in \mathbb{R}^{m \times 1} \;\middle|\; u_{\min} \le u_{i|k} \le u_{\max},\ i \in [0, \ldots, N_c] \right\},$
where $m$ is the dimension of the control action, with $U \in \mathbb{R}^{N_c m \times 1}$. The corresponding robot joint position limits are expressed as follows:
$\mathcal{Q} := \left\{ U_k \in \mathbb{R}^{m N_c \times 1} \;\middle|\; Q_{\min} \le T_s U_k + Q_{k-1} \le Q_{\max} \right\},$
where $Q_{\min}$ and $Q_{\max}$ denote the joint position limits extended over the entire prediction horizon. By combining the sets in Equations (30) and (31), the comprehensive control input constraints are given by the following:
$\mathbb{U} := \left\{ u_{i|k} \in \mathbb{R}^{m \times 1} \;\middle|\; u_{i|k} \in \mathcal{U} \cap \mathcal{Q} \right\}.$

3.2.3. The Optimal Control Problem

The objective of the NMPC controller is to enable the current pose x m ( k ) to stably track the commanded pose x c ( k ) computed by the admittance controller. Let the pose error at time step k be defined as follows:
$e(k) = x_m(k) - x_c(k) = \begin{bmatrix} e_p(k) \\ e_u(k) \end{bmatrix} = \begin{bmatrix} p_m(k) - p_c(k) \\ q_{u_m}(k) - q_{u_c}(k) \end{bmatrix}.$
The position component can be directly computed via algebraic subtraction. The orientation error is computed using the logarithmic map of quaternions as follows:
$q_{u0}(k) = \frac{\bar{q}_{u_m}(k)}{\| \bar{q}_{u_m}(k) \|_2} \otimes \frac{q_{u_c}(k)}{\| q_{u_c}(k) \|_2},$
where $\bar{q}_{u_m}(k)$ denotes the conjugate quaternion of $q_{u_m}(k)$. The error quaternion can be converted to the angle-axis representation, and it is expressed as follows:
$e_u(k) = 2 \arccos \left( w_{q_{u0}}(k) \right) \cdot \frac{v_{q_{u0}}(k)}{\| v_{q_{u0}}(k) \|_2}.$
From Equation (35), it can be seen that the dimension of the orientation error is $e_u \in \mathbb{R}^{3 \times 1}$. In the pose optimization problem, representing the quaternion error using the angle-axis form offers several advantages over directly using the relative rotation quaternion. These advantages include eliminating the double-covering singularity of quaternions, facilitating the formulation of unconstrained optimization problems, and improving the convergence of gradient-based algorithms.
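A minimal sketch of this orientation-error computation is shown below (Python/NumPy, reusing quat_mul from the sketch in Section 3.2.1); the sign flip for the quaternion double cover and the small-angle guard are assumptions added for numerical robustness.

```python
import numpy as np

def quat_conj(q):
    """Conjugate of a quaternion stored as [w, x, y, z]."""
    return np.concatenate(([q[0]], -q[1:]))

def orientation_error(q_m, q_c, eps=1e-9):
    """Angle-axis orientation error of Eqs. (34)-(35): the normalized conjugate of the
    measured quaternion composed with the normalized commanded quaternion."""
    q0 = quat_mul(quat_conj(q_m) / np.linalg.norm(q_m), q_c / np.linalg.norm(q_c))
    if q0[0] < 0.0:                               # resolve the quaternion double cover
        q0 = -q0
    w, v = np.clip(q0[0], -1.0, 1.0), q0[1:]
    n = np.linalg.norm(v)
    return np.zeros(3) if n < eps else 2.0 * np.arccos(w) * v / n
```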
The cost function of the optimal control problem is constructed as follows:
$V_{N_c}(e, u) = \sum_{i=0}^{N_c - 1} \left( e_{i|k}^{T} Q\, e_{i|k} + u_{i|k}^{T} R\, u_{i|k} \right) + e_{N_c|k}^{T} T\, e_{N_c|k},$
where $Q = Q^T \succ 0$ and $R = R^T \succ 0$, respectively, represent the state gain and the control gain, with $\succ$ denoting positive definiteness of a matrix. $T$ denotes the terminal gain, which estimates the cost at a future time point. In summary, the optimal control problem is defined as follows:
$\mathbb{P}_{N_c}(x_m, u): \quad V_{N_c}^{0}(e, u) = \min_{u} V_{N_c}(e, u)$
$\text{s.t.} \quad x_{m,i+1|k} = F(x_{m,i|k}, q_{i|k}, u_{i|k}, T_s), \quad i \in [0, N_c]$
$\phantom{\text{s.t.} \quad} u_{i|k} \in \mathbb{U}, \quad i \in [0, N_c - 1]$
The optimal control problem $\mathbb{P}_{N_c}$ is a constrained nonlinear programming problem. In this section, it is solved using the sequential quadratic programming (SQP) method. Let $[\tilde{x}_{m,0|k}, \tilde{x}_{m,1|k}, \ldots, \tilde{x}_{m,N_c|k}, \tilde{u}_{0|k}, \tilde{u}_{1|k}, \ldots, \tilde{u}_{N_c-1|k}]^T$ denote the optimal solution of the optimal control problem, and let $u(k) = \tilde{u}_{0|k}$ represent the input to the control system. The corresponding position input to the robot joint controller is then defined as follows:
$q(k) = q(k-1) + T_s u(k)$
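The paper formulates this problem with CasADi and solves it via SQP (Section 4.1). The following much-simplified, position-only CasADi (Python) sketch illustrates how such a multiple-shooting OCP can be posed with the weighted cost of Equation (36) and the input/joint bounds of Equations (30) and (31); the full controller additionally propagates the quaternion kinematics of Equation (27). The function name solve_nmpc, the fk_pos callable, and the use of IPOPT instead of SQP are illustrative assumptions.

```python
import casadi as ca
import numpy as np

Nc, Ts, m = 10, 0.02, 6                       # horizon, sample time, joint count
Q_w = np.diag([20, 20, 2000])                 # position weights (contact direction emphasized)
R_w = np.diag([0.1] * m)                      # input weights
T_w = np.diag([10, 10, 2000])                 # terminal weights

def solve_nmpc(q0, p_ref, fk_pos, u_lim=0.1, q_min=None, q_max=None):
    """Simplified multiple-shooting OCP: joint velocities u drive the joints, and a
    forward-kinematics callable fk_pos(q) (assumed) gives the end-effector position."""
    opti = ca.Opti()
    U = opti.variable(m, Nc)                  # joint-velocity inputs
    Qj = opti.variable(m, Nc + 1)             # predicted joint positions
    opti.subject_to(Qj[:, 0] == q0)

    cost = 0
    for i in range(Nc):
        opti.subject_to(Qj[:, i + 1] == Qj[:, i] + Ts * U[:, i])      # shooting constraint
        e = fk_pos(Qj[:, i]) - p_ref                                   # position error
        cost += ca.mtimes([e.T, Q_w, e]) + ca.mtimes([U[:, i].T, R_w, U[:, i]])
        opti.subject_to(opti.bounded(-u_lim, U[:, i], u_lim))          # Eq. (30)
        if q_min is not None and q_max is not None:
            opti.subject_to(opti.bounded(q_min, Qj[:, i + 1], q_max))  # Eq. (31)
    e_N = fk_pos(Qj[:, Nc]) - p_ref
    cost += ca.mtimes([e_N.T, T_w, e_N])                               # terminal cost

    opti.minimize(cost)
    opti.solver("ipopt")                      # the paper uses SQP; IPOPT shown for brevity
    sol = opti.solve()
    return sol.value(U[:, 0])                 # apply only the first input (Eq. 38)
```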

4. Experimental Results

In this section, point-loading and constant-force trajectory-tracking experiments were designed to validate the effectiveness of the proposed variable admittance control strategy based on the AC-DDPG algorithm. The point-loading experiments were conducted with target forces of 10 N and 20 N. The constant-force trajectory-tracking experiment was performed and compared with the strategies using constant admittance parameters and optimal admittance parameters.

4.1. Experimental Setup

Figure 3 illustrates the utilized experimental platform. The platform primarily consists of the following components: a six-degree-of-freedom collaborative robot with a positioning accuracy of 0.1 mm; an end-effector tool; an aluminum plate; and a KWR75C six-axis force/torque sensor mounted at the end-effector, featuring a maximum measurement range of 500 N in the f z direction and a measurement accuracy of 0.5 N. The robot communicates with the host computer via the TCP protocol, enabling command transmission and feedback acquisition. The control algorithm is implemented in C++ on an Ubuntu 20.04 operating system, and it integrates a real-time system plugin to ensure efficient and stable control loops. To meet real-time requirements and guarantee data integrity, the control architecture is designed with two separate threads: one for sensor data acquisition and one for the main control process. The sensor thread operates at a high sampling frequency of 1000 Hz to communicate with the force/torque sensor and acquire real-time force and position data. The main control thread operates at a frequency of 50 Hz to perform control computations and, when necessary, accesses the latest data from the sensor thread via a communication interface. For the nonlinear programming problem arising in the variable admittance force–position hybrid control strategy, the CasADi tool is employed to solve the optimization problem. Additionally, Libtorch is used to train the neural network, thereby addressing the system’s nonlinearity while ensuring computational efficiency. This integrated approach ensures the real-time feasibility and stability of the proposed control strategy.

4.1.1. Desired Parameters

In the point-loading experiments, a target force of $f_{d1} = [0, 0, 10, 0, 0, 0]^T$ N is selected for the first group, and $f_{d2} = [0, 0, 20, 0, 0, 0]^T$ N is selected for the second group, with the loading pose set to $x_d = [0.43091, 0.00312, 0.090324, 0, 0.707, 0.707, 0]^T$.
For the constant-force tracking experiments, the initial trajectory pose is similarly selected as x d . The trajectory is defined as follows:
$x_{dx}(t) = x_{dx}(0) + a \sin(\vartheta)\ \mathrm{m}, \qquad x_{dy}(t) = x_{dy}(0) + b \sin(\vartheta)\cos(\vartheta)\ \mathrm{m}, \qquad x_{dz}(t) = 0.090324\ \mathrm{m}$
where $a = 0.1$, $b = 0.05$, $\vartheta = 2\pi (t/T)$, and $T = 3000$ denotes the time period. During the tracking experiment, the desired force is set to $f_{d1}$.
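For reference, the figure-eight trajectory above can be generated as in the following sketch (Python; the function name and argument layout are illustrative).

```python
import numpy as np

def desired_position(t, x0, a=0.1, b=0.05, T=3000):
    """Figure-eight (infinity-shaped) x-y reference with constant z, as defined above;
    x0 is the position part [x, y, z] of the initial loading pose."""
    th = 2 * np.pi * (t / T)
    return np.array([x0[0] + a * np.sin(th),
                     x0[1] + b * np.sin(th) * np.cos(th),
                     x0[2]])
```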

4.1.2. AC-DDPG Parameters

Both the actor and target actor neural networks are designed as fully connected neural networks with two hidden layers. The input to the network is the admittance state, as defined in Equation (7), with a dimensionality of 30 that corresponds to information such as force errors, velocity errors, and position errors during a task. Each hidden layer contains 64 neurons, and the ReLU activation function is employed to introduce nonlinear representation capability. The output layer has a dimensionality of 12, representing the adjustment quantities for stiffness and damping in the admittance model. A scaling factor of κ = 10 is applied, and the tanh function is used to constrain the stiffness and damping variations within the range of [−10, 10].
The critic and target critic neural networks are also designed as fully connected neural networks with two hidden layers. Their input consists of the concatenation of the admittance state vector and the admittance action vector, resulting in an input dimensionality of 42. In the critic network, the first hidden layer contains 128 neurons, and the second hidden layer contains 64 neurons, both employing the ReLU activation function. The output layer consists of a single neuron that outputs a scalar value function of the admittance state–action pair, which evaluates the value of the current policy. The discount factor is selected as $\lambda = 0.99$. For all of the aforementioned networks, the Adam optimizer is utilized. The initial learning rates are set to $1 \times 10^{-3}$ for the actor network and $1 \times 10^{-2}$ for the critic network.
The network’s size was chosen to provide sufficient representation capacities for modeling the relationship between force/position errors and the appropriate admittance adjustment, without overcomplicating the model and risking overfitting. The learning rates for actor and critic networks were chosen to balance the difference in sensitivity between policy updates and value estimation. The critic is typically trained with a higher learning rate to rapidly reflect value changes, while the actor uses a lower rate to avoid instability in policy learning.
In the reward function, when the measured force lies within the reasonable range $f_m(k) \in [0, 2 f_m(0)]$, $\alpha$ is set to 100; otherwise, if there is no contact or the force is excessively large, $\alpha$ is set to 10,000. The oscillation gain is $\beta = 1$, and the abrupt change penalty gain is $\gamma = 0.99$ to balance tracking performance with the continuity of admittance parameter variations. The soft update coefficient is set to $\tau = 0.005$ to ensure smooth updates with respect to the target network. The capacity of the experience replay buffer is set to 1000, and at each training step, 64 samples are randomly drawn from the buffer for training. Network updates are triggered every 64 steps during training, ensuring the real-time performance and computational efficiency of the algorithm. All hyperparameters were preliminarily tuned through experiments to ensure control performance during constant-force trajectory tracking.
The learning rate for stiffness and damping is set to $\eta = 1$. The initial values are selected as $K(0) = 35{,}000$ N/m and $D(0) = 3000$ Ns/m. The lower and upper bounds for stiffness are set to $K_{\min} = 1000$ N/m and $K_{\max} = 100{,}000$ N/m, respectively. The lower and upper bounds for damping are set to $D_{\min} = 500$ Ns/m and $D_{\max} = 5000$ Ns/m, respectively. The virtual mass is set to $M = 50$ kg.
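For convenience, the admittance-learning settings listed in this subsection can be gathered into a single configuration structure; the sketch below merely restates the reported values, and the dictionary layout itself is an assumption.

```python
# Admittance-learning settings as reported in Section 4.1.2 (restated; layout assumed).
ADMITTANCE_CFG = dict(
    eta=1.0,                         # stiffness/damping learning rate (Eq. 9)
    K0=35_000.0, D0=3_000.0,         # initial virtual stiffness [N/m] and damping [Ns/m]
    K_lim=(1_000.0, 100_000.0),      # stiffness bounds [N/m]
    D_lim=(500.0, 5_000.0),          # damping bounds [Ns/m]
    M=50.0,                          # virtual mass
    buffer_size=1_000, batch=64,     # replay buffer capacity and batch size
    tau=0.005, gamma=0.99,           # soft update coefficient and discount factor
    lr_actor=1e-3, lr_critic=1e-2,   # Adam learning rates
)
```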

4.1.3. NMPC Parameters

In the nonlinear model predictive control (NMPC) framework proposed in this study, the state cost’s weighting matrix is set as Q = diag{20, 20, 2000, 20, 20, 20}; the control cost’s weighting matrix is set as R = diag{0.1, 0.1, 0.1, 0.1, 0.1, 0.1}; the terminal cost weighting matrix is set as T = diag{10, 10, 2000, 10, 10, 10}. Due to the high stiffness of the environment, the displacement command corresponding to the desired contact force is relatively small, which results in a reduction in the optimized joint commands obtained from the nonlinear program. This may cause the joint command to fall within the actuator’s dead zone, resulting in motion stagnation. To address this issue, large weights are assigned to the state and terminal cost in the contact direction (i.e., the force-dominant direction), thereby ensuring sufficient joint actuation. In contrast, applying the same weighting across the orthogonal degrees of freedom would cause the controller to overly prioritize position-tracking accuracy, compromising force control performance. One of the core advantages of the NMPC approach lies in the flexibility of its cost function’s configuration, which allows for differentiated weighting to achieve task-specific trade-offs between position-tracking and force-tracking precision.
The soft constraints for the robot joints are set as $q_{\max} = [3.05, 1.48, 2.82, 1.48, 3.05, 3.05]^T$ rad and $q_{\min} = [-3.05, -3.05, -2.82, -4.62, -3.05, -3.05]^T$ rad. The control input limits are set to 0.1 rad/s and $-0.1$ rad/s, respectively, with a prediction horizon of $N_c = 10$.

4.2. Experimental Results

All training was performed online during interactions, with the total update time constrained below 0.02 s per step to meet real-time requirements, and the convergence behavior was empirically tuned for the target hardware setup, which may vary across different platforms and sampling frequencies.

4.2.1. The Point-Loading Experiment

Figure 4 presents the system’s point-loading response curve under a target contact force of 10 N. During the initial loading phase, the contact force rapidly increases to a peak value of approximately 16 N, and a significant overshoot is observed at $t \approx 0.15$ s. Subsequently, the contact force exhibits a rapid decay characteristic and gradually converges to the steady-state value after $t \approx 0.2$ s. Ultimately, the system stably maintains the specified target value of 10 N.
Figure 5 shows the system’s point-loading response curve under a target contact force of 20 N. In the initial phase, the contact force also rises rapidly but is accompanied by oscillations. At $t \approx 0.3$ s, the system begins to gradually converge to the target value of 20 N, with a convergence trend similar to that observed in the 10 N loading scenario. However, at approximately 0.4 s, the system is subjected to an unknown disturbance, resulting in noticeable fluctuations in the measured contact force, deviating from the target value. Around 0.5 s, the admittance controller attempts to quickly reduce the contact force, causing the end-effector to momentarily detach from the contact point. At approximately 0.7 s, the system re-establishes stable contact, and after $t \approx 1.0$ s, the contact force gradually stabilizes, with the system ultimately maintaining the target force of 20 N.
In summary, the proposed variable admittance control strategy effectively accomplishes the point-loading task, demonstrating a fast response and strong robustness.

4.2.2. The Tracking Experiment

Figure 6, Figure 7, and Figure 8 present the trajectory-tracking performance in the non-loading direction during the tracking task, comparing the proposed nonlinear model predictive control strategy with the classical approach based on kinematic inverse solutions combined with a PID joint controller [31]. As shown in Figure 6, the trajectory generated by the proposed method (red trajectory) closely aligns with the desired infinity-shaped trajectory, with no observable tracking deviation, demonstrating excellent trajectory-tracking accuracy. In contrast, the classical method (blue trajectory) exhibits pronounced oscillations during the tracking process, particularly in regions with high curvature (e.g., at turning points), resulting in significant trajectory deviations and increased overall tracking errors. Furthermore, due to numerical precision limitations in the inverse kinematics computation, the classical method causes the end-effector to display sustained oscillatory behavior near the desired trajectory, thereby failing to achieve stable trajectory tracking. Such trajectory oscillations not only fail to meet the requirements of high-precision trajectory tracking but also adversely affect the stability and accuracy of contact force control.
As shown in Figure 7, a comparison of the x-direction position-tracking errors indicates that the proposed method consistently exhibits lower error amplitudes and smaller oscillations throughout the entire tracking period. In contrast, the classical method demonstrates pronounced periodic fluctuations during most of the time intervals, with particularly severe oscillations occurring at t = 0–100 s and t = 120–300 s, indicating its inferior tracking accuracy.
Figure 8 illustrates the y-direction trajectory-tracking error curve. Due to the relatively smaller variation in the desired y-direction trajectory, the tracking accuracy in the y-direction is generally better than in the x-direction. From the curves, it can be observed that both methods experience minor fluctuations at the beginning and end of the trajectory. However, the overall error amplitude of the classical method is slightly higher than that of the proposed method.
To quantitatively evaluate the performance differences between the two control strategies, the root mean square error (RMSE) and error variance in the x- and y-directions are calculated in this section to comprehensively assess tracking accuracy and stability. The results are presented in Table 1. The RMSE values in both the x-direction and y-direction for the proposed method are lower than those of the classical method, indicating smaller overall tracking errors and higher position-tracking accuracy. Additionally, the error variance for the proposed method is lower than that of the classical method, demonstrating higher system stability and smaller fluctuations during the trajectory-tracking process. In summary, the proposed nonlinear model predictive control tracking strategy offers advantages in tracking accuracy and stability, making it suitable for force–position hybrid control.
The classical control strategy exhibits limitations in pose-tracking accuracy. In this section, rather than adopting the classical method, the nonlinear model predictive control strategy for pose tracking is integrated with a variable admittance control strategy within the force–position hybrid control framework to achieve constant contact-force trajectory tracking. To evaluate the effectiveness of the proposed strategy, three parallel contact points and three fully parallel x-y plane desired trajectories, as shown in Figure 6, are selected. A constant contact force of 10 N is applied in the z-direction, and comparative experiments are conducted. The comparison strategies include the following:
  • Constant admittance control strategy: The parameters are set as K = 35,000 N/m and D = 2000 Ns/m.
  • Optimal admittance control strategy: The basic formulation is similar to the diagonal-dominant optimal impedance algorithm proposed in [9], which can be expressed as follows:
    $\arg\min_{D, K} \ \| \mu D + K \|_F^2 \quad \mathrm{s.t.} \quad {}^{l}d_{i,j} \le d_{i,j} \le {}^{u}d_{i,j}, \quad {}^{l}k_{i,j} \le k_{i,j} \le {}^{u}k_{i,j}, \quad \max |\tilde{x}(t)| \le b, \quad M \ddot{\tilde{x}}(t) + D \dot{\tilde{x}}(t) + K \tilde{x}(t) = f_m - f_d .$
  • The proposed variable admittance control strategy is based on the AC-DDPG algorithm.
Figure 9 shows the time-varying contact-force curves under the constant-force trajectory-tracking task for the three control strategies. Under the constant admittance control strategy, after an initial loading to 10 N, the contact force exhibits large-amplitude oscillations, with peak forces exceeding 40 N. This phenomenon arises from the fixed virtual stiffness setting: When a contact force significantly higher than the target is perceived, the admittance controller generates a pose command to rapidly move away from the contact surface in order to unload the force, resulting in an instantaneous decrease in the contact force to 0 N. Subsequently, based on the accumulated force error, the controller recalculates the pose command, causing the end-effector to collide with the contact surface and thereby generating impact forces. This “contact–separation–recontact” periodic oscillation pattern prevents the end-effector from maintaining stable contact with the aluminum plate, causing the constant force control objective to essentially fail.
In contrast, the optimal admittance control strategy, with online stiffness adjustments and slightly overdamped system characteristics, significantly reduces the oscillation amplitude compared to the constant admittance control and is able to maintain continuous contact with the aluminum plate. However, due to installation errors or flatness deviations, the aluminum plate is not a perfect plane. During the trajectory-tracking phase following the loading phase, the contact force still exhibits fluctuations. Throughout the tracking experiment, this strategy fails to achieve satisfactory constant force stability. Although this strategy improves the system’s stability, the oscillation problem in the force response remains.
The proposed method achieves the best overall smoothness of the contact force output. Although it is theoretically impossible to achieve completely oscillation-free force control due to the absolute positioning accuracy of the robot’s end-effector (approximately 0.1 mm), the proposed strategy effectively limits the force fluctuation amplitude to a low level. The contact force closely fluctuates around the target value of 10 N without severe peaks, demonstrating superior constant-force tracking performance.
Figure 10 further quantifies the time-varying force-tracking error of the three control strategies. Under the constant admittance control strategy, the error consistently exhibits large-amplitude oscillations, with maximum error amplitudes exceeding 30 N. This phenomenon also arises from the periodic “contact-separation-recontact” process, wherein the constant admittance controller fails to maintain contact after impacts, leading to frequent alternations between force surges and force drops to zero, resulting in severe periodic error fluctuations. Under the optimal admittance control strategy, the error shows significant fluctuations with amplitudes slightly lower than those observed in the constant admittance control strategy. However, small “surge-to-zero” cycles persist, with errors exceeding ±20 N. In contrast, the proposed method demonstrates superior smoothness, with the absolute force error remaining within ±1 N and no observed impacts or severe oscillations. By continuously optimizing the virtual stiffness and damping parameters in real time, the proposed method effectively maintains continuous contact between the end-effector and the environment, thereby avoiding the unstable “contact–separation–recontact” process and achieving constant force control.
Figure 11 further presents the motion trajectories of the robot’s end-effector in the z-direction (loading direction) under the three constant-force control strategies. The lower limits of all three trajectories indicate that the aluminum plate is not an ideal flat surface, which directly reflects the actual unevenness of the aluminum plate’s surface. Under the constant admittance control strategy, the end-effector frequently detaches from the contact surface (corresponding to the periods of contact force dropping to zero, as observed in Figure 9), clearly confirming the unstable “contact–separation–recontact” working mode. The trajectory under the optimal admittance control strategy more closely adheres to the contact surface overall, maintaining a relatively continuous contact state (consistent with the continuous but oscillatory contact observed in Figure 9). However, the trajectory still exhibits small high-frequency oscillations. Given the high environmental stiffness, these small position fluctuations amplify the contact force variations (as evidenced by the ±20 N force fluctuations in Figure 9). In contrast, the proposed strategy yields a relatively smooth z-direction trajectory, with significantly smaller fluctuations than the other two strategies. The trajectory shape aligns well with the contact force response and the force-tracking error. The smaller trajectory fluctuations reflect the ability of the proposed strategy to better maintain constant-force control.
Figure 12 provides an intuitive representation of the actual contact trajectories left by the robot’s end-effector on the contact surface under the three constant-force control strategies. These physical traces directly reflect the actual contact state between the end-effector and the environment (aluminum plate) during the task. Under the constant admittance control strategy, the trajectory exhibits pronounced alternating light and dark bands. This striped pattern results from repeated detachment and the subsequent impact of the end-effector with the aluminum plate, confirming its unstable “contact–separation–recontact” operational mode. Under the optimal admittance control strategy, the alternating light and dark patterns are comparatively mitigated, indicating a reduction in the degree and frequency of contact loss. This observation aligns with the relatively continuous contact state shown in the z-direction’s trajectory in Figure 11. However, subtle streaks of alternating light and dark remain in the trajectory, stemming from the small-amplitude oscillations of the end-effector under high contact force conditions (as shown in Figure 11). Although the position variations are minor, they are amplified into force fluctuations due to the high environmental stiffness. In contrast, the trajectory produced by the proposed strategy is the clearest, most coherent, and most uniform, without any alternating bands of light and dark.
In summary, based on the comprehensive analysis of the physical trajectory, z-direction motion trajectory, contact force response, and force-tracking error, the control strategy proposed in this section—featuring the online adjustment of virtual stiffness and damping—effectively achieves the desired constant force control in force–position hybrid tasks under unknown environmental stiffness conditions while enhancing trajectory-tracking smoothness. By dynamically adjusting the virtual impedance parameters, the system can adaptively compensate for force fluctuations caused by end-effector rigidity and environmental uncertainties, thereby avoiding the periodic “impact–separation–recontact” phenomenon and suppressing contact force oscillations. Moreover, the experimental results show that failure to maintain constant contact force leads to discontinuities or oscillations in the trajectory during contact, which in turn cause instability in force outputs and an increase in force-tracking error. Conversely, the proposed strategy successfully maintains stable contact between the end-effector and the environment, even under non-ideal flat surface conditions. In conclusion, compared to existing admittance strategies, the proposed variable admittance strategy exhibits superior constant-force control performance.
Finally, the performance of the AC-DDPG algorithm in the variable admittance control strategy is analyzed. As shown in Figure 13, the loss function of the actor network exhibits large values during the initial phase but quickly decreases to below 1, and in subsequent training iterations, it stabilizes and oscillates within a small range (approximately within 0.2). This observation indicates that the actor network effectively learns a reasonable action policy during the early stages and subsequently maintains stable action outputs through minor policy adjustments. The loss function of the critic network also shows large values (approximately $1 \times 10^7$; not displayed in the figure) during the initial training phase. As shown in Figure 14, the critic network loss gradually decreases after experiencing a series of oscillations and eventually converges to approximately 40. This result demonstrates that the critic network progressively improves the prediction accuracy of the Q-value while continuously fitting the state–action value function.
Furthermore, as shown in Figure 15, the reward function generally maintains a high level; however, occasional abrupt changes and decreases are observed at certain time steps. These fluctuations are closely related to external disturbances and the sensor noise present in the actual task. Overall, the consistent trends observed in the actor network, critic network, and reward function collectively corroborate the effectiveness of the AC-DDPG algorithm in the admittance control task and its convergence behavior in handling unknown environments.
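For reference, the actor and critic losses discussed above are typically computed as follows in a DDPG-style update. The sketch below is a generic PyTorch illustration under standard DDPG assumptions (replay buffer, target networks); it is not the authors’ AC-DDPG implementation, and names such as actor_target and critic_target are placeholders.

```python
import torch
import torch.nn.functional as F

def ddpg_losses(batch, actor, critic, actor_target, critic_target, gamma=0.99):
    """Standard DDPG losses for one minibatch of transitions (s, a, r, s_next, done).

    The critic loss is the mean-squared temporal-difference error of the Q-estimate;
    the actor loss is the negative Q-value of the actor's own action, so minimizing it
    performs gradient ascent on the expected return.
    """
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = actor_target(s_next)
        q_target = r + gamma * (1.0 - done) * critic_target(s_next, a_next)
    critic_loss = F.mse_loss(critic(s, a), q_target)   # regresses Q(s, a) toward the TD target
    actor_loss = -critic(s, actor(s)).mean()           # pushes the policy toward higher Q-values
    return actor_loss, critic_loss
```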

5. Conclusions

This study addresses the problem of constant-force control under unknown environmental stiffness and investigates the influence of admittance parameters on system stability and trajectory-tracking performance. Within a reinforcement learning framework, an AC-DDPG-driven online optimization method for admittance parameters is proposed, enabling the autonomous adjustment of stiffness and damping coefficients in the presence of unknown environmental stiffness. By integrating a nonlinear model predictive controller (NMPC) based on quaternion orientation representation, high-precision pose tracking is achieved, effectively ensuring both accurate end-effector trajectory tracking and precise force control.
Experimental validation demonstrates that the proposed AC-DDPG-NMPC variable admittance control architecture reduces the impact of unknown stiffness in rigid environment interactions. This approach not only improves the stability of constant force outputs but also optimizes the performance of force–position hybrid control, maintaining force tracking errors within ±1 N. These results verify the effectiveness of the synergy between the variable admittance control strategy and high-precision trajectory-tracking mechanisms in contact tasks.
Despite the demonstrated effectiveness of the proposed method in unknown high-stiffness environments, certain limitations remain. First, the current framework does not explicitly consider extreme operating conditions, such as sudden impacts or rapidly changing environment dynamics. Second, while online learning enables real-time adaptation, guaranteeing safety during the learning process—particularly during early interaction stages—remains a challenging issue. In future research, we will investigate safe online learning mechanisms to improve robustness during exploration and explore transfer learning across different robotic platforms to enhance scalability and reduce retraining efforts during deployment.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z.; software, Y.Z.; validation, C.Q. and Y.Z.; formal analysis, C.Q.; investigation, Y.Z.; resources, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.Z.; visualization, C.Q.; supervision, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported in part by the Ningbo Science and Technology Bureau under grant 2024Z266.

Data Availability Statement

The data, materials, and codes are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NMPC   Nonlinear model predictive control
DDPG   Deep deterministic policy gradient
AC     Actor–critic neural network scheme

Figure 1. The model for robot contact with environments.
Figure 2. The overview of the control strategy.
Figure 3. Experimental hardware setup.
Figure 4. The 10 N point-loading response curve.
Figure 5. The 20 N point-loading response curve.
Figure 6. XY-plane infinity-shaped trajectory tracking. The black line represents the desired trajectory, the blue line represents the trajectory generated via the IK-PID joint control strategy, and the red line represents the trajectory generated using the proposed quaternion-based NMPC strategy.
Figure 7. X-plane trajectory-tracking error.
Figure 8. Y-plane trajectory-tracking error.
Figure 9. Force performance comparison.
Figure 10. Force error comparison.
Figure 11. Z-plane infinity-shaped trajectory.
Figure 12. Real trajectory. The blue line represents the trajectory generated by the optimal admittance control strategy, while the yellow line represents the trajectory generated by the constant admittance control strategy.
Figure 13. Actor loss.
Figure 14. Critic loss.
Figure 15. Reward curve.
Table 1. Trajectory-tracking error.

            x-Proposed          y-Proposed          x-Classical         y-Classical
RMSE        0.0026216 m         0.0025137 m         0.0051176 m         0.0051035 m
Variance    6.8637 × 10⁻⁶ m²    6.3175 × 10⁻⁶ m²    2.6134 × 10⁻⁵ m²    2.5993 × 10⁻⁵ m²