1. Introduction
Driven by advances in manipulation controllers and navigation systems, mobile manipulation research for emergency rescue scenarios has made significant progress. Although existing research has shown that wheeled manipulators can handle household tasks [1,2], their application in outdoor environments with complex terrains remains challenging. Consider emergency rescue missions in rugged terrains: a robot capable of clearing rubble or delivering emergency supplies in disaster areas would greatly enhance rescue efforts.
Motivated by this, we study legged mobile manipulation systems equipped with robotic arms. Our research focuses on enabling these robots to autonomously approach rubble or supplies using their visual perception capabilities in unpredictable outdoor environments. Through this research, we aim to develop a flexible and practical robotic solution for emergency rescue applications.
In the context of emergency rescue, position and orientation control offers a core advantage: it enables the full utilization of all degrees of freedom of the robot, facilitating coordinated control of its entire body. While the legs primarily handle locomotion, integrating all joints of both the legs and arms can significantly enhance the robot’s operational capabilities and flexibility in complex and dynamic environments, demonstrating substantial application potential. This concept aligns with human behavior patterns: we stand up and leverage the strength of our legs to reach high objects that are otherwise difficult to access; similarly, we may bend or squat to more easily accomplish tasks at lower heights (as shown in Figure 1).
In practical emergency rescue applications, we envision robots, empowered by whole-body joint-coordinated control technology, to successfully perform a series of complex tasks. For example, during search and rescue missions in post-earthquake ruins, robots can use visual information to navigate to the roadside and, by bending their front legs and extending their arms, effortlessly pick up rubble or transport rescue supplies from grasslands or road surfaces. Without this whole-body coordination capability, robots might struggle to reach items on the ground using only the length of their arms, particularly in confined spaces. This limitation would significantly reduce their practicality and efficiency in emergency rescue operations.
Despite the practicality of whole-body coordination in actual rescue scenarios, simultaneously coordinating all high-degree-of-freedom joints is an extremely complex and challenging problem. A key challenge is ensuring that robots fully utilize visual feedback and maintain robustness against external disturbances.
Moreover, precise manipulation requires a stable robotic platform, which is more challenging for legged robots. Wheeled robots provide a stable base, while legged robots sacrifice stability for greater workspace flexibility.
Because relying solely on vision often produces low-frequency artifacts, jitter, and ghosting [3,4], which adversely affect perception, autonomously achieving all of the aforementioned functionalities in diverse environments using only egocentric camera observations is significantly more challenging.
Recently, learning-based methods have shown promising results in enabling legged robots to navigate obstacles, climb stairs, and jump over steps robustly using visual input [5,6]. However, these studies focus solely on mobility and lack delicate manipulation, which demands higher precision. Given the complexity of scenarios and the precision required for manipulation, relying on a single network alone is insufficient to meet current demands.
In this paper, we achieve mobility and manipulation by introducing a teacher–student training framework. In the teacher phase, the network learns to output end-effector pose and robot body velocity commands from visual information. Because the student network is difficult to train directly, supervised learning is introduced to assist its training. Additionally, a synchronous teacher–student training strategy is introduced, which jointly compares and optimizes the outputs of the teacher and student networks to simplify the training process. The core of our method therefore consists of three main points: supervised learning, a synchronous teacher–student network, and visual information guidance.
Overall, we adopt the teacher–student framework primarily to provide high-quality prior knowledge and control strategies through the teacher network. Specifically, the teacher network uses supervised learning to train its encoder and policy network to generate preliminary control commands, and optimizes the learning process of the student network through an information feedback mechanism. On this basis, the student network generates more flexible and adaptable motion commands. Compared with a single network, the teacher–student framework not only accelerates training but also improves the robustness and diversity of control commands, especially when adapting to grasping objects at different heights.
Our simulation environment is configured in Isaac Gym, and the hardware platform is based on the Unitree B1 quadruped robot, which mainly consists of two parts: a mobile robotic dog and a robotic arm. The simulation results demonstrate that our proposed method can achieve mobility and manipulation in the simulation environment, and the strategies presented in this paper can be effectively transferred to real-world scenarios.
We summarize the contributions of this paper as follows:
We have developed a vision-based locomotion and manipulation method, which leverages supervised reinforcement learning and a synchronized teacher–student network strategy to reduce the learning discrepancy between the teacher and student networks, thereby more effectively generating whole-body control commands.
Our proposed method integrates mutual feedback between the policy network and reinforcement learning, combined with the teacher–student networks and supervised learning, enhancing the system’s adaptability in complex environments while improving the collaborative performance between the quadruped robot and the robotic arm.
Our proposed method has been thoroughly validated through simulations, demonstrating its capability to perform object manipulation tasks across varying height positions, distances, and diverse terrains.
The remainder of this paper is organized as follows: In Section 2, we review typical methods for legged robot locomotion and manipulation, as well as recent advancements in robots integrating mobility with precision manipulation. In Section 3, we detail the fusion of visual information with reinforcement learning and the control command generation method. Section 4 describes the simulation setup and presents the results. Finally, in Section 6, we summarize the robot’s capabilities and discuss future research directions.
3. Our Control Method
In this section, we introduce our visual-consistency whole-body control framework (illustrated in Figure 2). As shown in the figure, privileged information and environmental information serve as inputs to the encoding network of the teacher network, while the proprioceptive observation state serves as input to the policy network of the teacher network and to the encoding network of the student network. The main goal of our method is to allow the robot to approach and manipulate objects, autonomously moving to the vicinity of the target object and executing manipulation commands while relying only on proprioceptive states. In the following, we elaborate on the core components of the method: the teacher–student network, the observations, the reward and objective functions, and the policy network.
3.1. Teacher–Student Network
First, the policy network of the teacher network is used to initialize the student network, i.e., the initial policy network parameters of the student network are identical to those of the teacher network. Second, the encoder and policy network of the teacher network are trained through supervised learning, using labeled data (e.g., predefined actions and corresponding environmental states) to adjust network parameters so that the output approaches the expected control commands. Meanwhile, information feedback learning is conducted between the teacher network and the student network, where the policy network of the teacher network and the encoder of the student network learn and improve by sharing information and feedback. The encoder network is optimized by comparing the output vectors of the encoders and computing a loss based mainly on the consistency of the encoded information.
Finally, the trained encoder and policy network can generate more flexible motion commands, which are used to control the quadruped robot and robotic arm, enabling them to autonomously move and manipulate objects.
3.2. Observations
In terms of environmental observation, we use visual sensors to acquire environmental information and process the visual information through a pre-trained visual information processing network. This extracts environmental features crucial for the motion planning of the quadruped robot, including but not limited to obstacle positions and terrain features. Based on the processed environmental information, we adjust the state of the quadruped robot’s base, including its position, posture, and movement speed, to ensure the robot can traverse complex environments stably and efficiently. Finally, we focus on the state of the end effector, achieving precise interaction with the environment by accurately controlling its position and posture to accomplish tasks such as grasping and placing. The entire process can be summarized as: environment perception → base state adjustment → end-effector control.
For the teacher policy, the state $s_t$ is composed of proprioceptive observations $o_t$, the privileged state $p_t$, and environmental point cloud information $e_t$. The proprioceptive observations $o_t$ encompass joint positions, joint velocities (excluding the gripper joint velocity), end-effector position and orientation, base velocities, and the previous action $a_{t-1}$ selected by the current policy. Given the presence of target objects in the task, the local state of the target object, including position and orientation, is incorporated as part of the privileged state $p_t$ to guide the end-effector and robot body towards executing actions. The environmental variable $e_t$ represents a latent shape feature vector encoded from the object point cloud, obtained through a pre-trained PointNet++. The student policy can only access and utilize the proprioceptive observations $o_t$, which means it must make decisions based solely on this limited internal state information. This can be mathematically expressed as:

$s_t^{\text{teacher}} = \left( o_t, \, p_t, \, e_t \right), \qquad s_t^{\text{student}} = o_t .$
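To make the composition of the two observation spaces concrete, the following sketch assembles the teacher and student inputs; the individual field sizes and the 32-dimensional PointNet++ latent are illustrative assumptions rather than the paper’s exact configuration.

```python
import numpy as np

def teacher_observation(proprio: np.ndarray,
                        privileged: np.ndarray,
                        point_cloud_latent: np.ndarray) -> np.ndarray:
    """Teacher state s_t = (o_t, p_t, e_t): proprioception, privileged object state,
    and the latent shape vector from the pre-trained PointNet++ encoder."""
    return np.concatenate([proprio, privileged, point_cloud_latent])

def student_observation(proprio: np.ndarray) -> np.ndarray:
    """Student state: proprioceptive observations o_t only."""
    return proprio

# Illustrative sizes: 19 joint positions, 18 joint velocities (gripper excluded),
# 7-dim end-effector pose (position + quaternion), 6-dim base velocity,
# and the 9-dim previous action (see Section 3.5).
proprio = np.zeros(19 + 18 + 7 + 6 + 9)
privileged = np.zeros(7)   # target object position + orientation (assumed quaternion)
latent = np.zeros(32)      # PointNet++ shape feature; dimension is an assumption

s_teacher = teacher_observation(proprio, privileged, latent)
s_student = student_observation(proprio)
```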
3.3. Policy Network Architecture
After processing the observation data, we pass these high-dimensional observation vectors as inputs to the policy network. The policy network performs dimensional transformation and feature extraction through its internal structure and algorithms, ultimately outputting actions to control the quadruped robot and the robotic arm.
The policy networks in the teacher–student framework are composed of a three-layer Multi-Layer Perceptron (MLP), with each layer having 128 neurons and using ELU as the activation function. This network receives the latent variables output by the encoder. Additionally, the policy network of the teacher network is used to initialize that of the student network. Through supervised learning and information feedback learning, the encoder and policy network of the teacher network are trained to provide a basis for generating control commands. Our method aims to generate more flexible motion commands.
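A minimal PyTorch sketch of one plausible reading of this architecture (three 128-unit ELU layers followed by a linear action head; the 64-dimensional encoder latent is an assumption) is given below, together with the teacher-to-student policy initialization described in Section 3.1.

```python
import torch
import torch.nn as nn

def make_policy(latent_dim: int = 64, action_dim: int = 9) -> nn.Sequential:
    """Three 128-unit ELU layers mapping the encoder latent to the
    9-dimensional action described in Section 3.5."""
    return nn.Sequential(
        nn.Linear(latent_dim, 128), nn.ELU(),
        nn.Linear(128, 128), nn.ELU(),
        nn.Linear(128, 128), nn.ELU(),
        nn.Linear(128, action_dim),
    )

teacher_policy = make_policy()
student_policy = make_policy()
# The student policy starts from the teacher policy's parameters (Section 3.1).
student_policy.load_state_dict(teacher_policy.state_dict())

latent = torch.zeros(1, 64)       # encoder output; dimension is illustrative
action = teacher_policy(latent)   # raw action; scaling/clipping happens downstream
```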
3.4. Reward Functions
The reward settings in the simulation are divided into three parts: the state reward, the speed-limit reward, and the action-smoothness reward, which are summed to form the total reward. The reward terms are as follows.
The state rewards $r_{\text{state}}$ include a success reward, a failure (approach) reward, and an operation reward. The success reward, denoted $r_{\text{succ}}$, is given when the robot successfully moves the object. The failure reward, also known as the approach reward $r_{\text{app}}$, is given when the robot fails to move the object. The operation reward $r_{\text{op}}$ is given for the robot’s effort in executing the commands. In the task, the robot needs to move the target object to a specified location, and the task is considered complete when the displacement of the target object exceeds a preset threshold.
The first stage is the approach phase, where the reward $r_{\text{app}}$ encourages the robotic arm and quadruped robot to get close to the target object; it is a function of $d_{\min}$, the current minimum distance between the gripper and the target object. The second stage is the task execution phase, where the reward $r_{\text{op}}$ guides the robotic arm and robot to achieve their goals, such as touching or moving the object; it depends on $d_{\max}$, the current maximum distance of the object. The third stage is the task completion phase, where the reward $r_{\text{succ}}$ is given when the quadruped robot and robotic arm complete the task.
In tasks involving lifting a target object, the condition for successful completion is met when the object’s displacement exceeds a predefined threshold. The design of auxiliary rewards is intended to enhance the smoothness of robot behavior, mitigating deviations and the accumulation of non-informative data. This involves ensuring that the robot’s body and gripper maintain alignment with the object, thereby preventing the camera from losing track of the target. Furthermore, action-related rewards have been defined to expedite the convergence of the algorithm. The primary goals encompass adhering to instructions, providing opportunities for the robot to re-explore, and ultimately achieving consistent command execution. Among these auxiliary terms, a penalty on the arm joint velocities serves to limit their rate of change, while an action-rate penalty facilitates smoother command execution by preventing excessively rapid transitions.
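As an illustration of this staged structure, the sketch below implements one plausible reward under our own choice of shaping functions, thresholds, and weights; it is not the paper’s exact reward.

```python
import numpy as np

def staged_reward(dist_gripper_obj: float,
                  obj_displacement: float,
                  arm_joint_vel: np.ndarray,
                  action: np.ndarray,
                  prev_action: np.ndarray,
                  success_threshold: float = 0.05) -> float:
    """Illustrative staged reward: approach, operation, success, plus smoothness
    penalties. Functional forms and weights below are assumptions."""
    # Stage 1: approach reward, larger as the gripper gets closer to the object.
    r_approach = 1.0 - np.tanh(5.0 * dist_gripper_obj)
    # Stage 2: operation reward, growing as the object is moved toward the threshold.
    r_operate = float(np.clip(obj_displacement / success_threshold, 0.0, 1.0))
    # Stage 3: success bonus once the displacement exceeds the preset threshold.
    r_success = 1.0 if obj_displacement > success_threshold else 0.0
    # Auxiliary terms: penalize fast arm joints and abrupt action changes.
    r_arm_vel = -0.01 * float(np.sum(np.square(arm_joint_vel)))
    r_action_rate = -0.01 * float(np.sum(np.square(action - prev_action)))
    return r_approach + r_operate + r_success + r_arm_vel + r_action_rate
```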
3.5. Actions
In our policy architecture, the teacher network and the student network exhibit significant differences in the definition and execution of action commands. Specifically, the core function of the teacher network is to generate complex and highly adaptive action commands, which have a decisive influence on the execution layer of the student network.
The commands generated by the teacher network and the student network, $a_t$, are defined as a vector comprising three components: the incremental gripper posture $\Delta T_{ee}$ (six dimensions), the linear and yaw velocity commands of the quadruped robot $(v, \omega)$ (two dimensions), and the gripper state $g$ (one dimension, representing open or closed). These commands together form a nine-dimensional vector, expressed as:

$a_t = \left[ \Delta T_{ee}, \; v, \; \omega, \; g \right] \in \mathbb{R}^{9} .$

When executing these commands, the linear velocity command and the yaw velocity command are each uniformly sampled from their predefined ranges.
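To make the nine-dimensional command concrete, the sketch below splits a raw policy output into the three groups; the clipping bounds and the open/close convention are illustrative assumptions.

```python
import numpy as np

def split_action(a: np.ndarray) -> dict:
    """Split a 9-dim action into incremental gripper pose (6), base velocity
    commands (2), and gripper open/close state (1). Bounds are illustrative."""
    assert a.shape == (9,)
    delta_ee_pose = a[0:6]        # incremental end-effector position + orientation
    base_cmd = a[6:8]             # linear velocity and yaw rate commands
    gripper_open = a[8] > 0.0     # >0 means "open" in this illustration
    return {
        "delta_ee_pose": np.clip(delta_ee_pose, -0.05, 0.05),
        "base_lin_vel": float(np.clip(base_cmd[0], -1.0, 1.0)),
        "base_yaw_vel": float(np.clip(base_cmd[1], -1.0, 1.0)),
        "gripper_open": bool(gripper_open),
    }

cmd = split_action(np.zeros(9))
```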
In summary, the teacher network is mainly used to learn action commands in complex environments, while the student network is responsible for learning and generating simpler commands and executing them. The two networks are structurally identical but have slight functional differences.
3.6. Objective Function
Due to the structural change of the overall architecture and the way we exploit the information from the teacher and student groups, the conventional training process does not apply directly and hence needs to be adjusted. Since agents are divided into two groups, the Monte-Carlo approximation of the PPO-Clip objective function of each group $g \in \{T, S\}$ is defined as:

$L^{g}(\theta_g) = \frac{1}{\left| \mathcal{D}_g \right| T} \sum_{\tau \in \mathcal{D}_g} \sum_{t=0}^{T} \min\!\left( r_t^{g}(\theta_g)\, \hat{A}_t, \; \operatorname{clip}\!\left( r_t^{g}(\theta_g), \, 1-\epsilon, \, 1+\epsilon \right) \hat{A}_t \right),$

where $\mathcal{D}_T$ and $\mathcal{D}_S$ are the sets of teacher-group and student-group trajectories obtained by interacting with the environment using $\pi_{\theta_T}$ and $\pi_{\theta_S}$, respectively, and $T$ is the length of the corresponding trajectory.

In the Proximal Policy Optimization (PPO) algorithm, the clip function is utilized to constrain the magnitude of policy updates. It is formally defined as:

$\operatorname{clip}\!\left( r, \, 1-\epsilon, \, 1+\epsilon \right) = \min\!\left( \max\!\left( r, \, 1-\epsilon \right), \, 1+\epsilon \right),$

where $r_t^{g}(\theta_g)$ are the ratio functions of the two groups:

$r_t^{g}(\theta_g) = \frac{\pi_{\theta_g}\!\left( a_t \mid s_t \right)}{\pi_{\theta_g^{\text{old}}}\!\left( a_t \mid s_t \right)}, \qquad g \in \{T, S\}.$

In order to enhance the learning of the student network from the teacher network, while taking into account the consistency of encoded information, a latent information reconstruction loss is introduced. This loss is used to further update the student network by minimizing the output discrepancy between the teacher network and the student network. Its approximate definition is as follows:

$L_{\text{recon}} = \left\lVert z_t^{T} - z_t^{S} \right\rVert_2^{2},$

where $z_t^{T}$ and $z_t^{S}$ denote the latent vectors produced by the teacher and student encoders, respectively.
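As a minimal sketch of these two terms, assuming PyTorch and that advantages and stored log-probabilities are precomputed elsewhere (all variable names are ours), the losses can be written as follows.

```python
import torch

def ppo_clip_loss(log_prob: torch.Tensor,
                  old_log_prob: torch.Tensor,
                  advantage: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Standard PPO-Clip surrogate (to be maximized); applied to each group separately."""
    ratio = torch.exp(log_prob - old_log_prob)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return torch.min(unclipped, clipped).mean()

def latent_reconstruction_loss(z_teacher: torch.Tensor,
                               z_student: torch.Tensor) -> torch.Tensor:
    """Consistency loss between teacher and student encoder outputs (to be minimized).
    Detaching the teacher latent is our assumption."""
    return torch.mean((z_teacher.detach() - z_student) ** 2)

# Combined objective: maximize L^T + L^S over both groups, and separately
# minimize L_recon = || z^T - z^S ||^2 when updating the student encoder.
```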
3.7. Randomization
To develop a robust gripper, we sample and track multiple gripper orientations. To ensure smooth trajectories, we define a coordinate system with the robot’s base as the origin and sample trajectories at fixed time intervals. This approach reduces the impact of the quadruped robot’s motion on the sampled targets and improves the system’s adaptability to diverse environments and objects. During training, we implemented a comprehensive randomization strategy to enhance system robustness: we randomized the terrain type (flat and rugged), friction coefficients, robot mass, and center-of-mass position. For picking tasks, we additionally randomized the table height as well as the initial positions and orientations of the robot and the object. Together, these measures ensure robust performance across diverse and complex environments.
To enhance the method’s robustness and generalization, we introduced randomization during training. Specifically, the initial position is randomized within ±0.5 m of the target object, and the initial orientation is randomized within ±30 degrees. This randomization simulates real-world uncertainties, enabling the method to adapt to varying initial conditions.
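The following configuration sketch summarizes the randomization described above; the position and orientation ranges follow the text, while the remaining ranges (friction, mass, center-of-mass offset, object yaw, table height) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def sample_episode_randomization() -> dict:
    """Per-episode domain randomization. Only the initial-pose ranges come from
    the text; all other numeric ranges are assumptions."""
    return {
        "terrain": rng.choice(["flat", "rugged"]),
        "friction": rng.uniform(0.5, 1.25),                    # assumption
        "added_base_mass_kg": rng.uniform(-1.0, 3.0),          # assumption
        "com_offset_m": rng.uniform(-0.03, 0.03, size=3),      # assumption
        "init_xy_offset_m": rng.uniform(-0.5, 0.5, size=2),    # ±0.5 m around target
        "init_yaw_offset_rad": rng.uniform(-np.deg2rad(30.0),
                                           np.deg2rad(30.0)),  # ±30 degrees
        "table_height_m": rng.uniform(0.0, 0.5),               # assumption
        "object_yaw_rad": rng.uniform(-np.pi, np.pi),          # assumption
    }

episode_cfg = sample_episode_randomization()
```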
4. Simulation
We conducted a series of simulations for mobile object tasks, demonstrating the method’s adaptability to varying heights and distances, as well as its stability across different terrains.
The core advantage of our method is the successful decoupling of base movement and robotic arm control through a state reward constraint mechanism, significantly enhancing task adaptability and execution efficiency. Specifically, the state reward mechanism comprises three stages: (1) the quadruped robot approaches the object on the table plane, (2) the robotic arm fine-tunes its distance to the object, and (3) the task is completed. This staged reward mechanism enables a whole-body control decoupling strategy. Unlike traditional methods that rely on fixed collaboration strategies, our method dynamically adjusts base and robotic arm control instructions based on task requirements.
Additionally, it utilizes consistent environmental information and the state-of-the-art PointNet++ network to analyze object shape features in depth. This foundation enables the execution of complex tasks, such as drawer opening.
4.1. Overview of the Simulation Platform
In this study, we simulate a robotic system based on the Unitree B1 quadruped robot, integrated with a Unitree Z1 robotic arm and a gripper. Two RealSense D435 cameras are simulated: one mounted on the quadruped body and the other near the gripper. The simulated system has 19 degrees of freedom (DoFs): 12 for the quadruped, 6 for the robotic arm, and 1 for the gripper. The system is visualized in Figure 3. For object detection, we employed a virtual depth camera in the simulation. The camera features a resolution of 640 × 480 pixels, a 30 fps frame rate, and ±2 mm accuracy.
4.2. Simulation Setting
In the simulation setup, given our focus on applications in disaster scenarios, we have categorized objects into six types based on their shape characteristics: spherical, cuboidal, cylindrical, concave, convex, and others (encompassing shapes that cannot be unequivocally classified into the aforementioned five categories).
During the simulation process, when the robot initiates the execution of a command, its initial position and orientation, as well as those of the object, are randomly assigned. We use two metrics to evaluate the proposed method. The success rate is defined as the proportion of trials in which the object is successfully moved, whereas failure encompasses situations where the object is incorrectly initialized underneath the table, drops to the ground, or remains unmoved after 100 attempts. The failure rate is the proportion of tasks that result in failure relative to the total number of tasks. The success rate indicates positive performance, while the failure rate indicates negative performance.
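For completeness, the two metrics can be computed from per-trial outcomes as in the following trivial sketch; the trial bookkeeping is our own.

```python
def success_and_failure_rate(outcomes: list) -> tuple:
    """outcomes: one entry per trial, either "success" or "failure",
    following the definitions above."""
    n = len(outcomes)
    assert n > 0, "at least one trial is required"
    successes = sum(1 for o in outcomes if o == "success")
    failures = sum(1 for o in outcomes if o == "failure")
    return successes / n, failures / n
```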
4.3. Algorithm
The algorithm “Encoder-consistency Training”, shown in Algorithm 1, initializes the environment and networks. For each iteration $k$, it collects the sets of trajectories $\mathcal{D}_T$ and $\mathcal{D}_S$ using the latest policies and computes $z^{T}$ and $z^{S}$ using the encoders. In the policy optimization loop, for each epoch $i$, it uses $\theta$ to represent $\theta_T$ and $\theta_S$ for notational brevity and updates $\theta$ by adding $\alpha$ times the gradient of the combined PPO loss $L^{T} + L^{S}$, where $\alpha$ is a hyperparameter used to control the magnitude of gradient updates in policy optimization; in this paper, the adopted value of $\alpha$ is 0.0005. In the reconstruction loss optimization loop, for each epoch $j$, it updates the student encoder parameters $\phi_S$ by subtracting $\alpha$ times the gradient of the reconstruction loss $L_{\text{recon}}$. This process iteratively trains the networks to improve visual consistency and policy performance.
Algorithm 1 Encoder-consistency Training
- 1: Initialize environment and networks
- 2: for iteration $k = 1, 2, \ldots$ do
- 3: Collect sets of trajectories $\mathcal{D}_T$ and $\mathcal{D}_S$ with the latest policies
- 4: Compute $z^{T}$ and $z^{S}$ using the encoders
- 5: for epoch $i = 1, \ldots, N_{\text{policy}}$ do
- 6: Use $\theta$ to represent $\theta_T$ and $\theta_S$ for notational brevity
- 7: $\theta \leftarrow \theta + \alpha \nabla_{\theta}\left( L^{T}(\theta) + L^{S}(\theta) \right)$
- 8: end for
- 9: for epoch $j = 1, \ldots, N_{\text{recon}}$ do
- 10: $\phi_S \leftarrow \phi_S - \alpha \nabla_{\phi_S} L_{\text{recon}}$
- 11: end for
- 12: end for
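Assuming the rollout collection, the PPO objective from Section 3.6, and the encoders are available as caller-supplied callables (all names below are placeholders), Algorithm 1 can be sketched in PyTorch as follows.

```python
import torch

ALPHA = 5e-4  # gradient step size alpha used for both update loops

def encoder_consistency_training(num_iterations: int,
                                 ppo_epochs: int,
                                 recon_epochs: int,
                                 collect_rollouts,        # () -> (batch_T, batch_S)
                                 ppo_objective,           # (batch) -> scalar tensor
                                 teacher_latents,         # (batch) -> z^T
                                 student_latents,         # (batch) -> z^S
                                 policy_params: list,
                                 student_encoder_params: list) -> None:
    """Sketch of Algorithm 1. All callables are placeholders supplied by the caller."""
    policy_opt = torch.optim.SGD(policy_params, lr=ALPHA)
    encoder_opt = torch.optim.SGD(student_encoder_params, lr=ALPHA)

    for _ in range(num_iterations):
        batch_T, batch_S = collect_rollouts()  # trajectory sets D_T and D_S

        # Policy optimization loop: ascend the combined PPO-Clip objective
        # (minimizing the negative objective is equivalent to gradient ascent).
        for _ in range(ppo_epochs):
            loss = -(ppo_objective(batch_T) + ppo_objective(batch_S))
            policy_opt.zero_grad()
            loss.backward()
            policy_opt.step()

        # Reconstruction loop: align student latents with detached teacher latents
        # computed for the same visited states.
        for _ in range(recon_epochs):
            z_t = teacher_latents(batch_S).detach()
            z_s = student_latents(batch_S)
            recon = torch.mean((z_t - z_s) ** 2)
            encoder_opt.zero_grad()
            recon.backward()
            encoder_opt.step()
```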
4.4. Quantitative Simulation
During the simulation process, we compared our method with networks designed for the following three scenarios and evaluated their performance across six shape datasets.
Without Visual Consistency Strategy: In this scenario, the variability of sensory information is not taken into account.
Non-Full-Body Control Strategy: This strategy possesses superior navigation capabilities, but lacks full-body coordination control. The robot can achieve good results within a height range of 0.4 to 0.55 m, but the workspace is limited, with 0.4 m serving as the lower bound. Here, the advantage of the larger workspace provided by the full-body control strategy becomes apparent.
Non-Phased Control Strategy: This strategy jointly trains the first and second stages in an end-to-end manner without differentiation. It combines observations from both stages and outputs the target positions for 12 robot joint angles and the target posture for the gripper. However, using this method did not yield satisfactory results, further validating the effectiveness and necessity of our proposed method.
In our quantitative simulation, we evaluated the success rate of manipulating six objects under four control strategies and tested them in a simulated environment. We collected test results for each object; during testing, the initial height of each object’s location was randomly set within the range of 0 to 0.5 m. As shown in Table 1, our method performs best on cuboid objects. It is worth noting that the strategy that does not consider visual consistency, i.e., does not use visual information, performs best on spherical objects, which have regular shapes. The key reason for this phenomenon is that visual consistency carries richer information, which has a more significant impact on objects with more complex shapes.
In addition, we tested the success rate of tasks at different heights in the simulator, as shown in Figure 4, where the height refers to the height of the object’s position. It can be observed that the success rate for the same object model varies with height across the compared methods, but our method performs consistently at different heights, which demonstrates its superiority in adapting to objects at different heights. Moreover, to validate the decoupling strategy of the model, we tested the success rate of tasks at different distances in the simulator, as shown in Table 2. Our method outperforms the method lacking visual consistency and demonstrates superiority in handling object manipulation tasks at different distances (the distance refers to the space between the object and the robot).
At the same time, relying on the larger permissible inclination angle afforded by the whole-body control strategy, our method can adapt more stably and maintain balance when tackling tasks on different terrains. Therefore, we use the failure rate as a measure of stability and test the mission failure rate for different terrains (plane, heightfield, trimesh) in the simulator, as shown in Table 3. It can be observed that our method is superior to the method lacking visual consistency across the different terrain tasks, indicating that it can adapt to different terrains.
Finally, the rewards in the simulation setup include the three reward types described in Section 3.4 (including the reward for successfully moving the object), which are uniformly weighted and summed. As shown in Figure 5, the reward values gradually increase as training continues, indicating that our method converges during training and ultimately achieves good results. The top-left corner shows the combined display of the maximum, minimum, and average rewards, where it can be seen that the rewards converge, demonstrating superior performance.
4.5. Qualitative Simulation
In the qualitative simulation, Figure 6 presents a visualization of the quadruped robot using its robotic arm to move objects. It particularly demonstrates the adaptability of the robot and robotic arm, under the whole-body control strategy, to objects at different heights, and the achievement of object manipulation. In the figure, we observe the postures of the quadruped robot on flat ground, including squatting and leaning forward. Through simulation data analysis, we found that the leaning-forward behavior can effectively lower the center of gravity and improve stability in specific tasks, such as when grasping objects at a low height. However, this behavior also increases the control torque required, so it needs to be further optimized in terms of task completion efficiency.
Furthermore, the figure shows that in a scene with a height of 0.6 m, the quadruped robot transitions from its initial pose to a pose that is not parallel to the ground. By combining whole-body control with consistent visual information, the quadruped robot and robotic arm can collaborate more effectively and handle objects of different heights and shapes. Meanwhile, for the whole-body control grasping task, the proposed grasping structure can initially achieve whole-body controlled movement. However, since this paper focuses on the collaboration between the quadruped robot and the robotic arm, flexible grasping end-effectors are not utilized. In summary, this figure vividly showcases the diversity and flexibility of robot behavior under the whole-body control strategy, as well as its better performance in object manipulation tasks.
It is equally noteworthy that during the training phase in a simulated environment, the robot exhibits retrying behavior when executing commands. Specifically, when the robot fails to successfully grasp an object, it automatically attempts to grasp again without any human intervention, showcasing the advantage of self-feedback learning in reinforcement learning. During our simulation process, we observed that randomly initialized poses and movements of the robot provided a safeguard for success rates.
6. Conclusions, Limitations, and Future Works
In this paper, we propose a fully autonomous mobile manipulation system based on quadruped robots and a teacher–student network. Our proposed method comprises a command generation module and a command execution module, which are trained through reinforcement learning and a teacher–student network. It also considers the consistency of visual information and employs a loss optimization method to enhance the training effect of the teacher–student network. Despite achieving remarkable results in moving and manipulating objects of various shapes, the system still has limitations stemming from system-level and real-world deployment constraints, which are outlined below:
Hardware and Environmental Adaptation: Our phased approach to mobility and manipulation demands high-precision and coherent modules for seamless cooperation between the quadruped robot and the robotic arm. However, the depth estimates provided by current depth cameras are inaccurate in certain scenarios, particularly in dimly lit environments with indistinct features.
Gripper Design Limitations: The currently used parallel gripper tends to push objects away during operation, which makes precise manipulation challenging.
These two factors are the most common causes of failure in our real-world experiments. It is worth noting that, due to the incomplete exploitation of environmental perception information, we plan to incorporate visual SLAM into the system to fully leverage this information. Specifically, by integrating lidar, visual odometry, and a computing board onto the quadruped robot, it can identify objects to be manipulated within a 3D model, fully understand environmental information, and improve manipulation accuracy.