Review

A Review of Embodied Grasping

1 School of Mechanical and Electrical Engineering, Henan University of Science and Technology, Luoyang 471000, China
2 School of Information Engineering, Henan University of Science and Technology, Luoyang 471000, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(3), 852; https://doi.org/10.3390/s25030852
Submission received: 18 November 2024 / Revised: 12 January 2025 / Accepted: 28 January 2025 / Published: 30 January 2025
(This article belongs to the Section Sensors and Robotics)

Abstract: Pre-trained models trained with internet-scale data have achieved significant improvements in perception, interaction, and reasoning. Using them as the basis of embodied grasping methods has greatly promoted the development of robotics applications. In this paper, we provide a comprehensive review of the latest developments in this field. First, we summarize the embodied foundations, including cutting-edge embodied robots, simulation platforms, publicly available datasets, and data acquisition methods, to fully understand the research focus. Then, the embodied algorithms are introduced, starting from pre-trained models, with three main research goals: (1) embodied perception, using data captured by visual sensors to perform point cloud extraction or 3D reconstruction, combined with pre-trained models, to understand the target object and external environment and directly predict the execution of actions; (2) embodied strategy: In imitation learning, the pre-trained model is used to enhance data or as a feature extractor to enhance the generalization ability of the model. In reinforcement learning, the pre-trained model is used to obtain the optimal reward function, which improves the learning efficiency and ability of reinforcement learning; (3) embodied agent: The pre-trained model adopts hierarchical or holistic execution to achieve end-to-end robot control. Finally, the challenges of the current research are summarized, and a perspective on feasible technical routes is provided.

1. Introduction

In embodied grasping tasks, the embodied robot is the basis for performing the task. Before performing a task, the robot usually needs to be pre-trained on data of a certain scale so that it can understand human intentions, fully perceive the surrounding environment, and make appropriate decisions during execution. The embodied robot then accurately executes the operation instructions and, ideally, learns from real-time interaction data to improve its adaptability and generalization ability in unstructured environments.
In recent years, thanks to the development of computing hardware such as GPUs [1] and TPUs [2], deep learning has been applied to mine large-scale labeled images, videos, texts, and other data [3] and has achieved a series of breakthroughs in fields such as image recognition [4,5] and language processing [6,7]. Visual foundation models (VFMs) [8] can accurately estimate object categories, poses, and geometric shapes, as well as the spatial relationships between objects, thereby helping the agent make decisions and allowing embodied robots to fully perceive dynamic, complex environments. Large language models (LLMs) [9] enable a robot to better understand language commands from humans and make corresponding decision inferences. Visual–language models (VLMs) [10] combine the advantages of VFMs and LLMs, enabling an agent to reason and make decisions based on task language commands and visual observations of the environment, thereby improving the agent’s perception and understanding of the environment. Generative large models (GLMs) [11] can generate new content such as images and videos based on various data types such as text, images, and videos, thereby enabling richer and more complex creative processes. Robotics domain-specific models (RDSMs) [12] are trained on specialized datasets, such as human or robotic operation videos, and achieve better results in robotics applications.
The breakthroughs in the perception, understanding, and decision-making of pre-trained models have pushed the research of traditional robots in vision-based manipulation [13], reinforcement learning [14], and imitation learning [15] to new heights. In vision-based manipulation and imitation learning, the pre-trained model acts as a visual encoder and text encoder, using its prior knowledge to improve the efficiency and generalization ability of visual manipulation and imitation learning. The pre-trained model can also generate expert data for imitation learning training in a simulated environment according to specific tasks, alleviating the problem of data scarcity. In reinforcement learning, a large model can generate an appropriate reward function based on an understanding of the task and scenario to guide the generation of reinforcement learning strategies. At the same time, reinforcement learning can serve as a basic strategy with which the pre-trained model continuously optimizes itself during interaction with the environment. In addition, a new research method has been pioneered, in which a pre-trained model is used as a high-level task planner or to directly output control commands based on sensor information.
As shown in Figure 1, this review is organized around five aspects. Embodied perception is the foundation for robots to perceive the external environment: robots can accurately understand the spatial position and state of target objects, directly predict actions, or provide input for subsequent decisions. On the basis of perception, embodied strategies generate specific operational policies through reinforcement learning and imitation learning, guiding robots to make reasonable decisions in complex and dynamic environments. Hierarchical execution within embodied agents integrates perception and strategy through multi-level task planning, ensuring the consistency and robustness of robotic actions; in hierarchical execution, traditional strategy methods are incorporated in addition to embodied strategy methods. Holistic execution is a fusion of both, achieving global optimization and efficient execution of complex tasks, thereby ensuring the overall consistency and final quality of tasks. Pre-trained models play a crucial driving role in this system: the prior knowledge of large models helps in understanding task objectives, environmental information, and multimodal inputs (such as the combination of vision and language), thereby enhancing the accuracy of perception, the intelligence of strategy generation, and the flexibility of task execution. The required training data come from the embodied foundations, including simulation platforms, open datasets, and data acquisition methods. These data are fully utilized under the impetus of pre-trained models, with the collaboration of the three main components (embodied perception, embodied strategy, and embodied agent), and tasks are ultimately executed by embodied robots in real environments. Section 2 introduces the embodied foundation, including various embodied robots, as well as the corresponding simulation platforms, open datasets, and data acquisition methods. Section 3 introduces five pre-trained models, and Section 4 introduces embodied perception from the perspectives of 3D features and 3D reconstruction. Section 5 introduces the embodied strategy based on imitation learning and reinforcement learning. Section 6 elaborates on the embodied agent in terms of hierarchical execution and holistic execution. Finally, Section 7 discusses existing challenges and promising future research directions.
The aim of this paper is to provide a review of research in the field of embodied grasping, with a focus on the application and research progress of traditional two-finger grippers and dexterous hands as end effectors. However, we acknowledge that end effectors also include other types (such as soft grippers, pneumatic grippers, etc.). Although this paper does not delve into these types of end effectors in detail, we believe that the specific structure of the end effector does not fundamentally affect the applicability of the discussed methods. Therefore, when introducing the end effectors included in the embodied foundation, we not only cover traditional two-finger grippers and dexterous hands but also briefly mention other types of end-effector mechanisms.
Comparison with related works: With the rise of large models, various reviews on pre-trained models have emerged [16,17]. However, the integration of pre-trained models and robotics is a relatively new field, and there are few review papers focusing on this area. Existing studies, such as [18,19,20,21], differ significantly from our survey. Yang et al. [18] explored how pre-trained models can be applied to real-world decision-making problems using methods such as prompting, conditional generative modeling, planning, optimal control, and reinforcement learning, while we specifically focus on the application of pre-trained models in robotic grasping and classify the work into three main branches: embodied perception, embodied strategy, and embodied agent. Hu et al. [19] categorized robotic capabilities and investigated systematic ways to enhance these capabilities by integrating pre-trained models, whereas our work focuses on how pre-trained models can be combined with robotic technologies to improve robotic abilities. Xiao et al. [20] reviewed the integration of pre-trained models with traditional robotic learning approaches, covering areas like manipulation, navigation, planning, and reasoning, whereas we concentrate on grasping and provide a more detailed analysis of current techniques in this domain. Finally, Zheng et al. [21] emphasized improving robotic manipulation through physical interaction and sensory feedback, while we highlight the role of pre-trained models in enhancing the learning process itself.

2. Embodied Foundation

As shown in Figure 2, this section separately introduces commonly used embodied robots, including single systems such as dexterous hands and robotic arms; low-integration mobile composite systems such as wheeled and quadrupedal robotic arms; and highly integrated mobile composite systems such as bipedal and wheeled humanoid robots, as well as the corresponding popular simulation platforms, high-quality datasets, and data acquisition methods.

2.1. Robotic Arm

Robotic arms were the first robots to be studied, and they have gone through the entire development history from traditional perception and motion control algorithms to intelligent algorithms, as well as the transition from bulky industrial robotic arms to lightweight robotic arms. They are currently widely used in fields such as industrial manufacturing [22] and agricultural harvesting [23]. Commonly used robotic arms include the Franka [24] manufactured by Franka Emika in Munich, Germany, the xArm series [25] produced by UFactory in Shenzhen, China, the UR series [26] developed by Universal Robots in Odense, Denmark, and the ViperX [27] made by Interbotix in Seattle, WA, USA. In robotic arm research, it is too costly to conduct algorithm research directly on the robot itself. To support research on artificial intelligence algorithms, researchers have launched a series of simulation platforms, including Gazebo [28], PyBullet [29], and SAPIEN [30], which focus on high-fidelity simulation; RoboSuite [31], the ManiSkill series [32,33], and RoboCasa [34], which are optimized for specific tasks or environments; and Isaac Sim [35], which combines commercial applications with advanced GPU computing. Simulation platforms can verify the advantages and disadvantages of an algorithm and guide improvements, reducing research costs. In addition, these platforms provide high-fidelity physics, and the interaction data obtained by robots within them can be used to train embodied-intelligence algorithms. However, due to the discrepancies between simulated and real-world environments, models tested on actual robots do not perform as well as the results achieved in simulation [36]. With the release of numerous high-quality human-demonstrated robotic arm datasets, research on training robots with real-world datasets has been further developed. Commonly used robotic operation datasets include BridgeData V2 [37], RH20T [38], and Open-X [39], and robotic vision datasets include RED [40], REGRAD [41], GraspNet-1Billion [42], and Grasp-Anything [43]. Additionally, there are datasets focused on specific objects or containing specific types of objects, such as Transpose [44], PokeFlex [45], ClothesNet [46], and SurgT [47]. In terms of constructing custom datasets, UMI [48] uses a handheld gripper as the data collection interface, combined with a carefully designed interface, to collect demonstrations of dynamic manipulation in a low-cost and portable manner. ALOHA [49], launched by Stanford University, and GELLO [50], released by the University of California, Berkeley, design leader manipulators that remotely control follower arms to directly collect real-world data for imitation learning training. Recently, Stanford University further launched ALOHA 2 [51], which not only optimizes the operational performance of the first-generation ALOHA but also integrates teleoperation within the Mujoco simulation platform, thereby reducing the cost of data collection. Other methods, including the 3D SpaceMouse [52] and RoboTurk [53], can also be employed.
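To illustrate how such platforms support low-cost algorithm verification, the following is a minimal sketch of loading an arm and stepping the physics in PyBullet; the KUKA URDF bundled with pybullet_data is used only as a stand-in for the arms discussed above, and the joint target is arbitrary.

```python
# Minimal sketch: loading a robot arm in PyBullet and stepping the simulation.
# Assumes the pybullet and pybullet_data packages are installed; the KUKA URDF
# shipped with pybullet_data stands in for any of the arms discussed above.
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                      # headless; use p.GUI for visualization
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

plane = p.loadURDF("plane.urdf")
arm = p.loadURDF("kuka_iiwa/model.urdf", useFixedBase=True)

# Drive the first joint toward a target angle and simulate a short rollout.
p.setJointMotorControl2(arm, jointIndex=0,
                        controlMode=p.POSITION_CONTROL,
                        targetPosition=0.5)
for _ in range(240):                     # 1 s at the default 240 Hz time step
    p.stepSimulation()

joint_state = p.getJointState(arm, 0)
print("joint angle:", joint_state[0])
p.disconnect()
```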

2.2. End Effector

Common end effectors include two-finger grippers and dexterous hands [54]. Different end effectors have been designed for various application scenarios, such as pneumatic grippers [55], suction cups [56], jamming grippers [57], Bernoulli grippers [58], and vortex grippers [59], as well as soft grippers such as soft pneumatic grippers [60], cable-driven grippers [61], and hydraulic grippers [62]. Common two-finger grippers include the Robotiq 2F-85 [63] produced by Robotiq in Lévis, Canada, and the Franka Emika Gripper [64] made by Franka Emika in Germany. In complex dexterous manipulation scenarios, dexterous hands [65] are a current research hotspot. Because they can imitate various functions of the human hand, they have great potential in medical services [66,67] and home services [68]. Common dexterous hands include the Allegro Hand [69] made by Allegro Robotics in Seattle, WA, USA, the Shadow Hand [70] manufactured by Shadow Robot in London, UK, and an open-source dexterous hand called the Leap Hand [71]. Commonly used simulation platforms for researchers include Isaac Gym [72] and Mujoco [73]. Isaac Gym supports large-scale parallel training for various dexterous manipulation tasks, enabling the rapid accumulation of extensive training data, and it allows for the exploration of optimal manipulation strategies under different conditions by adjusting simulation parameters. Using hardware such as data gloves [74] and cameras [75] in the real world is also an important way to capture training data. Data gloves can accurately capture the motion and posture of the operator’s hands and turn them into commands to control the end effector; when performing precise operations, they provide very fine motion control, thereby yielding high-quality training data. Cameras provide real-time visual information, enabling operators to accurately assess and adjust hand movements to ensure the precise execution of tasks. Common dexterous hand demonstration datasets include UniDexGrasp [76], Handversim [77], and DAPG [78].

2.3. Mobile Composite Robot

Mobile composite robots expand the application scenarios of fixed robotic arms. As a mobile chassis, wheeled robots have a simple structure, relatively low cost, and high energy efficiency, allowing rapid movement on flat surfaces, while quadruped robots can maintain balance and maneuverability on uneven terrain. According to their respective advantages, the two types of bases are combined with robotic arms to form mobile composite robots, which are applied in logistics [79] and disaster relief [80]. Common wheeled composite robots include Fetch Robotics [81] developed by Zebra in Lincolnshire, IL, USA, and Hello Robot Stretch [82] from Hello Robot in Martinez, CA, USA. Four-legged composite robots include the Spot Arm [83] designed by Boston Dynamics in Waltham, MA, USA, as well as the combination of the quadruped robot B1 and the robotic arm Z1 [84] launched by Unitree in Hangzhou, China. Commonly used simulation platforms include the iGibson series [85,86], the Habitat series [87,88], and AI2-THOR [89]; the iGibson series also provides data collection and labeling tools, which allow researchers to conveniently collect behavioral data of robots in a simulated environment and then label and analyze the data. Stanford University has developed the mobile composite robot Mobile ALOHA [90]. In terms of mobility, the robot’s movement speed is close to that of humans, which gives it a significant advantage in scenarios requiring collaboration with humans or tasks executed within human activity spaces. It can also maintain stability when handling large household items and is capable of performing delicate operations. Its integrated teleoperation function is a highlight: through teleoperation, operators can remotely control the robot to carry out tasks and directly collect real-world data in the process.

2.4. Humanoid Robot

Compared with other types of robots, humanoid robots are more aligned with human operational behaviors and more adaptable to the diverse scenarios of the human world. As highly integrated products, they are currently used in research areas such as motor capabilities [91,92] and cognitive operational abilities [93]. In recent years, with key breakthroughs in large models, humanoid robots have entered a golden age of rapid development. Representative humanoid robots include Optimus [94] from the American company Tesla, which integrates Tesla’s technological advances in electric vehicles and artificial intelligence, enabling relatively smooth limb movements. Atlas [95] from the American company Boston Dynamics and H1 [96] from China’s Unitree are renowned for their exceptional mobility, with outstanding power systems and balance control. China’s UBTech Walker series [97] and AGIbot Expedition series [98] stand out in terms of intelligence, focusing on deeply integrating artificial intelligence technology into the robots’ behavioral decisions. Under traditional control methods, whole-body motion control of humanoid robots is a complex challenge: because humanoid robots have many joints with multiple degrees of freedom, their kinematic and dynamic models are highly intricate, and traditional control algorithms often struggle to precisely coordinate the movements of each joint, leading to insufficient stability. They are also limited in environmental perception, as their limited sensor-data-processing capabilities make it difficult to deeply understand and analyze complex environmental information. Breakthroughs in pre-trained models and the application of embodied algorithms have brought new solutions for humanoid robots. As with other robots, simulation training is the primary method for humanoid robots to learn tasks. Commonly used simulation platforms include Isaac Gym, Mujoco, and BiGym [99]. The AMASS [100] dataset, a motion capture dataset created by the Graphics Laboratory at Carnegie Mellon University, captures a wide range of human movements and actions and is widely used in robotics research. The University of California, San Diego [101] designed an exoskeleton system that utilizes hand cameras to capture 3D hand gestures and accurately tracks the position of the end effector through the exoskeleton. In collaboration with the Massachusetts Institute of Technology, they launched a method for collecting data based on VR teleoperation of humanoid robots [102], which offers high real-time performance and stability. Stanford University [103] has developed a low-cost teleoperation method based on visual recognition: it uses visual equipment, such as cameras, to obtain information about the environment and human actions, and the robot operates based on this visual information. This method is low-cost and easy to promote. Embodied robots, simulation platforms, datasets, and data acquisition methods are summarized in Table 1.

3. Pre-Trained Model

Pre-trained models have accumulated rich general feature representation capabilities through self-supervised learning on large-scale datasets. They can not only improve the generalization ability of existing models in the robotics field and enable models to better adapt to unknown environments and tasks but also optimize downstream tasks by providing natural language descriptions and prompts.

3.1. Large Language Model

In 2018, Google and OpenAI released BERT [104], with a bidirectional transformer structure, and GPT [105], with a generative pre-trained transformer structure, respectively. In 2019, Google proposed T5 [106], a unified text-to-text framework. After several updates and iterations, in 2022, Google released PaLM [107], and OpenAI officially launched ChatGPT-3.5 [108]. Through pre-training on a large amount of data, this model obtained an accurate understanding of context and accumulated a wealth of common-sense knowledge about the real world. It also has the ability to solve different tasks through instruction fine-tuning: when it receives a variety of problems, it can flexibly adjust its coping strategies according to the specific needs of the problem based on the knowledge system obtained through pre-training, so as to effectively solve different types of tasks. At the same time, with the help of reinforcement-learning-based optimization, it can output answers that are better aligned with human values. Subsequently, in 2023, OpenAI released GPT-4 [9], which increased the model’s context length, allowing it to maintain coherence and accuracy when processing longer text, and added the ability to understand multiple modalities: in addition to processing textual information, it can also identify and extract image information, thereby providing users with a more comprehensive and richer interactive experience. GPT-4 has demonstrated powerful capabilities, whether answering questions about specialized domain knowledge or performing complex mathematical reasoning and programming tasks. In 2024, OpenAI released GPT-o1 [109], which uses the Chain-of-Thought (CoT) [110] method to internally generate a detailed chain of thought, much as humans plan and reason before solving complex problems: it first plans the steps needed to answer the question, then carries out a rigorous reasoning process, and finally gives the answer. This approach allows GPT-o1 to show unprecedented accuracy and efficiency when handling complex tasks.

3.2. Visual Foundation Model

In the early stages, the vision field was mainly based on CNN [111] architectures. With breakthroughs in language models, architectures such as the RNN [112] and the transformer [113] were adopted in vision, leading to important progress in image classification [114], object detection [115], and semantic segmentation [116] and further promoting the development of visual pre-trained models. Several visual foundation models have been released since GPT. DINO [117] proposes a self-supervised learning method based on knowledge distillation, which ensures the stability and consistency of training. DINOv2 [118] introduces a deeper transformer structure and a more complex attention mechanism based on DINO, significantly improving the model’s expressive power and performance. MAE [119] masks parts of the visual input and restores the original image; it can be pre-trained from a large number of unlabeled images, which greatly expands the data resources available for pre-training. SAM [120] is a visual segmentation model trained on more than 110 million segmentation masks and about 11 million images, and it can accurately segment corresponding regions from images based on given linguistic or visual cues. SAM2 [121], released recently, is trained with more than 51,000 videos and more than 600,000 mask annotations; it improves segmentation accuracy and processing speed over SAM and supports object segmentation in videos. AM-RADIO [122] and Theia [123] are trained by distilling multiple off-the-shelf vision foundation models (VFMs), yielding smaller models that achieve performance comparable to the larger models, thereby reducing the hardware requirements for robot deployment.

3.3. Visual–Language Model

LLMs cannot directly understand the output of visual encoders, so they need to convert image encoding into features that the LLM can understand. This is the origin of visual–language models (VLMs), which combine computer vision and natural-language-processing techniques to give models the ability to understand and process image and text data. These models excel at tasks that require simultaneous comprehension of visual content and language and can perform a range of tasks without task-specific training, demonstrating impressive generalization [124]. CLIP [8] uses a large number of paired image and text descriptions as learning materials. The image encoder and text encoder are used to process images and texts, respectively, to extract their respective features, and then these features are optimized in the feature space through contrastive learning [125], enhancing the model’s ability to understand the semantics of images. BLIP [126] and BLIP-2 [10] introduce a curriculum learning strategy that can guide from simpler tasks to more complex tasks, significantly improving performance on tasks such as image captioning and visual question answering. Both Flamingo [127] and GIT [128] pretrain an image encoder through contrastive learning and then perform generative pretraining. PandaGPT [129] and MiniGPT-4 [130] use a single projection layer to achieve visual text alignment, reducing the need to train additional parameters. LLaVa [131], LLaVa2 [132], and KOSMOS-2 [133] are transformer-based causal language models that add the ability to localize and cite. ConvLLaVA [134] uses a hierarchical ConvNeXt as the visual encoder and introduces two key optimization strategies: updating the visual encoder and adding an additional compression stage. Together, these improve the model’s performance on high-resolution inputs.
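The contrastive objective underlying CLIP-style training can be summarized in a few lines; the sketch below assumes image and text embeddings have already been produced by the two encoders and is an illustrative simplification rather than CLIP's actual training code.

```python
# Simplified CLIP-style contrastive (InfoNCE) loss over a batch of paired
# image/text embeddings. Assumes embeddings are already produced by the two
# encoders; this is an illustrative sketch, not CLIP's training pipeline.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # cosine similarities
    targets = torch.arange(len(img_emb))             # i-th image matches i-th text
    loss_i = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i + loss_t) / 2

img_emb = torch.randn(8, 512)   # batch of 8 paired embeddings
txt_emb = torch.randn(8, 512)
print(clip_contrastive_loss(img_emb, txt_emb).item())
```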

3.4. Generative Large Model

Diffusion models [135] have been used for controllable image generation [136] and text-conditional image generation [137], offering controllability, conditional generation capabilities, and high-fidelity image generation [138]. DALL-E [139] has learned the complex relationship between text and images through large-scale pretraining and has demonstrated the ability to generate high-quality images from text descriptions. DALL-E 2 [140] proposes a two-stage diffusion model consisting of a prior that generates a CLIP image embedding given a text caption and a decoder that generates an image conditioned on that embedding. GLIDE [141] is a text-conditional diffusion model that explores both CLIP guidance and classifier-free guidance. Make-A-Scene [142] introduces implicit conditioning by deriving optional control scene tokens from segmentation images and encodes and decodes images and scene tokens using two improved Vector-Quantized Variational Autoencoders (VQ-VAEs). IMAGEN [143] is another text-conditional diffusion model; unlike previous methods, it proposes dynamic thresholding to generate more realistic images and an efficient U-Net structure to improve training efficiency. Parti [144] designs a transformer-based autoregressive model that uses ViT-VQGAN as an image encoder to improve the quality of image reconstruction and code utilization. Video-LaVIT [145] predicts the next image or text token through autoregression and processes images and text simultaneously under a unified generation objective. Sora [146] is trained on large-scale video data, including videos and images of varying durations, resolutions, and aspect ratios. It can generate videos not only from text prompts but also from existing image or video prompts.

3.5. Robotics Domain-Specific Model

Pre-trained models, such as CLIP, have been widely used as visual front ends in the robotics field. However, they are trained on large-scale internet data. Although they are very versatile, they usually require dedicated data to obtain exclusive models in specialized fields such as robotics. Such models are more suitable for downstream robotics tasks.
MVP [12] and R3M [147] focus on masked autoencoding and contrastive learning methods, respectively, and perform better on specific types of tasks. VIP [148] is able to generate dense and smooth reward functions for previously unseen robot tasks by combining value function learning and time-contrastive learning and shows superior performance on multiple tasks, especially in reward function generation and robot control. VC-1 [149] adopts a large version of the Vision Transformer (ViT) with more parameters and is trained on a combination of more than 4000 h of human demonstration videos and the ImageNet dataset. Voltron [150] builds on MVP and R3M, using videos and corresponding language descriptions to learn representations through language-conditioned visual reconstruction and visual–language generation; it uses language supervision to improve the recognition of visual patterns and has better generalization capabilities. GR-1 [151] proposes a single GPT-style model that accepts language commands, image sequences, and robot state sequences as input and predicts robotic actions and future images in an end-to-end manner. After pre-training, it can be fine-tuned on robot data to learn multi-task visual robot manipulation. GR-2 [152] is pre-trained on 38 million video clips, one of the largest-scale video pre-training efforts for robotic manipulation agents to date, and a new model architecture has been developed so that the knowledge gained in the pre-training stage can be seamlessly and losslessly transferred to the fine-tuning stage. SpawnNet [153] designs an architecture that includes both a pre-trained network stream and a learnable network stream, enabling the model to combine powerful pre-trained features with task-specific learned features. In addition to the datasets commonly used in the vision field, such as ImageNet [154] and COCO [155], there are dedicated datasets in the field of robotics, such as Ego4D [156], Epic-Kitchens [157], and Kinetics-700 [158]. A summary of the pre-trained models is shown in Table 2.

4. Embodied Perception

Embodied perception is manifested as the estimation of grasping poses for target objects based on visual sensors. The accuracy of pose estimation significantly affects the robot’s ability to successfully grasp the target object, so a robust and efficient pose estimation algorithm is needed. Early grasping tasks were treated as 2D pose detection [159,160,161], which usually defined the grasping pose as a rectangle viewed top-down at a certain height and predicted the orientation and width of the rectangle. However, due to the lack of 3D geometric information, the predicted grasping pose is limited to 3-DoF. To enable more dexterous grasping, a significant amount of work has focused on 6-DoF grasping: enhancing grasp pose detection [162,163] by directly using depth information; converting the input RGB-D data into point clouds, voxelizing them to generate heatmaps, and estimating the 6-DoF grasp pose under the guidance of these heatmaps [164,165,166]; or feeding the point clouds directly into a pose estimator to generate the pose information required for robotic grasping [167,168]. These methods show high success rates when the depth information is accurate but degrade when encountering photometrically challenging objects (e.g., transparent objects). To alleviate this problem, depth information can be fused with RGB images [169,170].
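As a concrete reference for the RGB-D-to-point-cloud step mentioned above, the following minimal sketch back-projects a depth image with an assumed pinhole camera model; the intrinsics and the random depth map are placeholders.

```python
# Back-projecting a depth image into a point cloud with a pinhole camera model.
# fx, fy, cx, cy are assumed camera intrinsics; depth values are in meters.
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]          # drop invalid (zero-depth) pixels

depth = np.random.uniform(0.3, 1.5, size=(480, 640))   # placeholder depth map
cloud = depth_to_point_cloud(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape)   # (N, 3) points ready for voxelization or a pose estimator
```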
Other work focuses on 3D scene reconstruction to improve the model’s understanding of 3D geometry before predicting the robot’s grasping poses. There are usually two approaches: implicit representation methods [171,172], which use neural rendering and feature distillation to reconstruct 3D feature fields, and explicit representation methods [173,174], which reproject 2D features into 3D and optimize the feature field. However, these methods suffer from a lack of semantic information or insufficient generalization.
Pre-trained models possess abundant prior knowledge of visual semantics. Built on point cloud information or 3D scene reconstruction and combined with traditional 3D visual grasping methods, they enhance visual–language-guided robotic grasping.

4.1. Three-Dimensional Feature

(1) Semantic and 3D feature fusion: Extracting textual features using large language models and integrating them with point clouds or voxels enables robots to more comprehensively understand and execute operation tasks based on natural language instructions, as shown in Figure 3a. Polarnet [175] and Hiveformer [176] apply CLIP to fuse its outputs with point cloud features, while PERACT [177] utilizes CLIP to integrate its outputs with voxel features; both approaches help the model consider language context when predicting actions. GraspGPT [178] generates language descriptions of object categories and tasks through an LLM and merges these descriptions with the original dataset to form a new dataset; BERT is used to encode the language descriptions and instructions, which are combined with point cloud features to assess the compatibility of the grasping candidates with the task and predict the final grasping pose. PhyGrasp [179] employs the PointNeXt framework to convert the point cloud of an object into global and local visual features and uses Llama 2 to encode the language description of each instance and generate language features. Through a bridging network, the visual and language features are combined to generate grasping heatmaps and grasping embeddings, which provide the robot with a set of candidate grasping positions. A schematic sketch of this fusion pattern is given after this list.
(2) Point cloud extraction: In the two-stage grasping method, pre-trained models are used to improve visual localization, which can significantly improve the accuracy of point cloud extraction, as shown in Figure 3b. VL-Grasp [180] introduces BERT as a text extractor and ResNet as an image extractor in the first stage to predict the 2D bounding boxes and segmentation masks of target objects. The results of the first stage are then used to convert scene-level point clouds into object-level point clouds using a point-cloud-filtering module, followed by a 6-DoF grasp pose detection network to predict the optimal grasp pose. OVGNet [181] leverages GroundingDINO to combine image and text features for localizing the target object and generating a bounding box; based on this bounding box, it segments the point cloud belonging to the target object from the complete point cloud data.
(3) Affordance information: Extracting and mapping affordance data from the dataset to achieve zero-shot generalization capabilities for new objects, as shown in Figure 3c. OpenAD [182] jointly learns the visual features of 3D point clouds and the text embeddings of operability labels, leveraging the similarity of text embeddings to achieve zero-shot detection. Robo-abc [183] leverages CLIP to map cropped images of objects into feature vectors, enabling the retrieval of objects from affordance memory that are most visually and semantically similar to a new object. It employs a diffusion model to map the retrieved contact points onto the new object and utilizes AnyGrasp to predict grasp poses based on affordances. Ram [184] extracts unified 2D affordance information from various data sources to construct a comprehensive affordance memory bank. It uses a language model to retrieve tasks that match the given instructions, utilizes feature maps from a visual model to find demonstrations with the most similar viewpoints, then performs 2D affordance transfer, ultimately enhancing 3D affordance.
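A schematic sketch of the semantic/3D fusion pattern shared by the methods in this subsection is given below; the encoder choices, feature dimensions, and the simple scoring head are illustrative placeholders rather than the architecture of any cited method.

```python
# Schematic sketch of semantic/3D feature fusion: a language embedding from a
# frozen text encoder is broadcast over per-point features and fused by a small
# MLP head that scores grasp candidates. All dimensions and modules are
# placeholders, not the architecture of any specific method cited above.
import torch
import torch.nn as nn

class LanguageConditionedGraspHead(nn.Module):
    def __init__(self, point_dim=256, text_dim=512, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(point_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),            # per-point grasp quality score
        )

    def forward(self, point_feats, text_emb):
        # point_feats: (N, point_dim) from a point cloud backbone
        # text_emb:    (text_dim,) from a frozen language encoder
        text = text_emb.unsqueeze(0).expand(point_feats.size(0), -1)
        return self.fuse(torch.cat([point_feats, text], dim=-1)).squeeze(-1)

head = LanguageConditionedGraspHead()
scores = head(torch.randn(1024, 256), torch.randn(512))
print(scores.shape)   # (1024,) one score per candidate point
```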

4.2. Three-Dimensional Scene Reconstruction

The focus of 3D scene reconstruction tasks is to enhance the model’s understanding of 3D geometric shapes by converting the input RGB-D data into 3D representations such as NeRF or 3DGS.
(1) Based on traditional features: Focusing on using open-vocabulary features to understand image content, suitable for scenarios with limited computing resources or where the accuracy of instance segmentation is not critical, as shown in Figure 4a. SPARSEDFF [185] adopts features extracted by DINO to initialize the 3D feature field and designs a lightweight network to solve the problem of local feature differences. F3RM [186] and Splat-MOVER [187] integrate the visual attributes (such as color and lighting effects) and semantic embeddings extracted by CLIP from 2D images with feature fields. LERF-TOGO [188] employs Language Embedded Radiance Fields (LERFs) to combine CLIP’s powerful visual–language capabilities, DINO’s dense feature extraction capabilities, and 3D scene reconstruction technology to achieve zero-shot task-oriented grasping.
(2) Based on instance segmentation: It provides more refined scene segmentation, suitable for applications that require precise recognition and manipulation of individual objects in the scene as shown in Figure 4b. Object-Aware [189] utilizes GroundedSAM to dynamically segment the camera views and dynamically extract semantic features online to assign semantic labels to each Gaussian point. This endows the scene representation with “object awareness”. GaussianGrasper [190] employs the segmentation priors provided by SAM to accelerate feature field reconstruction and reduce memory usage. CLIP is capable of aligning text descriptions with image content, enabling the mapping of natural language instructions to corresponding visual content. The extracted features are then used to enhance 3D Gaussian primitives.
(3) Based on diffusion models: Suitable for scenarios that require high resolution and rich semantic information, as shown in Figure 4c. GNFactor [191] utilizes Stable Diffusion to extract semantic features from 2D images and designs the Generalizable Neural Feature Fields (GNFs) module to convert these 2D semantic features into representations within 3D space, thereby forming neural feature fields. ManiGaussian [192] introduces a dynamic Gaussian framework and also utilizes Stable Diffusion to extract semantic features from RGB images. Through a Gaussian regressor, the extracted semantic features are mapped into the Gaussian parameter space. This enhances the semantic understanding of the scene representation, enabling robots to comprehend the interactions between objects. A summary of representative algorithms is shown in Table 3.

5. Embodied Strategy

Research on embodied strategy primarily focuses on imitation learning and reinforcement learning (RL). Imitation learning achieves skill acquisition by collecting trajectory datasets for specific tasks and using deep neural networks to fit the mapping from time series of states or observations (such as first-person perspective images) to actions [193]. Reinforcement learning, on the other hand, involves the agent learning new skills by interacting directly with the environment and optimizing predefined reward functions related to specific tasks during the interaction [194].

5.1. Imitation Learning

Behavior cloning (BC) [195] is the fundamental framework for imitation learning. The loss function of behavioral cloning is expressed as follows:
$$\zeta(\theta) = -\,\mathbb{E}_{(\tau, l) \sim D}\left[\sum_{t=0}^{T-1} \log \pi_\theta\left(a_t \mid s_t, l\right)\right]$$
where the robot’s policy $\pi_\theta(a \mid s, l)$ is derived by imitating expert data. The expert dataset is represented as $D = \{(\tau_i, l_i)\}_{i \in [N]}$, in which $\tau_i = (s_0, a_0, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, s_T)$ represents an expert trajectory and $l_i$ represents the task description.
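In code, the behavior-cloning objective above reduces to a negative log-likelihood over batches of expert state-action pairs. The sketch below assumes a discretized action space and a toy MLP policy; a continuous action head (e.g., a Gaussian) is handled analogously.

```python
# Behavior cloning as negative log-likelihood over expert (state, action) pairs.
# The policy here outputs logits over a discretized action space; continuous
# actions would use, e.g., a Gaussian log-probability instead. Shapes and the
# simple MLP are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_loss(states, expert_actions):
    logits = policy(states)                         # (B, num_actions)
    return F.cross_entropy(logits, expert_actions)  # = -E[log pi(a_t | s_t)]

states = torch.randn(32, 64)                  # batch of observations
expert_actions = torch.randint(0, 10, (32,))  # batch of expert action labels
loss = bc_loss(states, expert_actions)
loss.backward()
optimizer.step()
```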
On this basis, imitation learning has continued to develop in recent years. ACT [151] decomposes complex action sequences into smaller “action chunks” and utilizes Conditional Variational Autoencoders (CVAEs) to learn the latent representations of these chunks. This approach reduces the effective temporal scope of the prediction task, thereby decreasing cumulative errors. During execution, a temporal ensembling method is employed, generating smooth action sequences by weighted averaging of multiple overlapping action predictions, thereby improving the accuracy and stability of the policy. The methods in [101,196] both build on this framework. ATM [197] first utilizes self-supervised learning to pre-train a trajectory model on a large amount of unlabeled video to predict the future motion trajectories of any point in the video; then, with a small amount of labeled demonstration data, it trains a policy model to learn how to generate control actions based on these predicted trajectories, thereby achieving transfer from video demonstrations to actual policies. Diffusion Policy [198] first processes the input image sequences through a visual encoder, then uses Denoising Diffusion Probabilistic Models (DDPMs) to simulate the generation process from noise to clear action sequences, and finally optimizes the action sequences through the predicted gradient fields to control the behavior of the robot. Pearce et al. [199] improved the accuracy and reliability of the imitation policy by using Diffusion-X and Diffusion-KDE as sampling strategies and designed a network architecture suitable for sequential environments.
Data for imitation learning are mainly collected by teleoperating the robot on specific tasks [200]. However, the high cost of collecting human demonstrations has raised concerns about scaling up demonstration data. For example, MimicGen [201] designed a system that takes a few expert demonstrations as input and creates an augmented dataset by integrating various scenes and segmenting objects.
Pre-trained models are used in imitation learning in two ways: (1) data augmentation: expanding the original dataset, generating expert demonstration data in a simulated environment, or creating video sequences to guide the model’s learning; (2) feature extraction: leveraging the prior knowledge of pre-trained models to improve the model’s ability to adapt to new environments.

5.1.1. Data Augmentation

(1) Direct generation: One method is to directly use a generative model to enhance image data, as shown in Figure 5a. GreenAug [202] employs image–text generation models and large-scale image segmentation models to identify objects and backgrounds within images. This approach enables semantic modification of interactive objects and backgrounds while keeping the robot’s behavior unchanged, thereby increasing the diversity of the dataset (a schematic sketch of this background-replacement idea is given at the end of this subsection). GenAug [203] provides semantically meaningful data augmentation for imitation learning by leveraging DALL-E 2 and Stable Diffusion in a simulated environment when only a small number of real-world samples are available and expands the dataset by generating diverse visual scenes, including different objects, distractions, and backgrounds. FoAM [204] introduces a fine-tuned vision–language model called InstructPix2Pix (Ip2p) as its Goal Imagination Module, whose role is to automatically generate target images that are then used as conditional inputs to the imitation learning policy. Another approach is to enrich the dataset of motion trajectories based on diffusion models: xTED [205] adopts diffusion models to edit the data and generate trajectories that are more consistent with the target domain distribution; the edited trajectories are used to train a policy network to generate actions.
(2) Indirect generation: First, tasks are generated using a large language model, and then the dataset is expanded based on these tasks, as shown in Figure 5b. SUaDD [206] uses a large language model for high-level task planning and code generation and, in combination with a sampling-based planner, generates a large amount of language-labeled robot data (grasping poses and motion trajectories). GenSim [207] and GenSim2 [208] apply language models to generate code implementations of tasks that can be executed directly in the simulation environment to produce expert demonstrations. These tasks and demonstrations are used to train robot policies, reducing the reliance on real-world data collection while enhancing the generalization of the policies to new environments and tasks. Typical applications of these two methods are shown in Figure 5c.
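The background-randomization idea behind direct generation (e.g., GreenAug) can be reduced to mask-based compositing once a foreground mask is available from a segmentation model; in the sketch below the frame, mask, and new background are placeholder arrays.

```python
# GreenAug-style background replacement as mask-based compositing: pixels that
# belong to the robot/object (mask == 1) are kept, everything else is replaced
# by a new background image. The mask would come from a segmentation model
# (e.g., a SAM-like foundation model); here it is a placeholder array.
import numpy as np

def replace_background(frame, mask, new_background):
    mask = mask[..., None].astype(frame.dtype)      # (H, W, 1) broadcast over RGB
    return mask * frame + (1.0 - mask) * new_background

frame = np.random.rand(240, 320, 3)          # original demonstration frame
mask = np.zeros((240, 320))
mask[60:180, 100:220] = 1.0                  # fake foreground (robot/object) mask
new_bg = np.random.rand(240, 320, 3)         # novel scene background
augmented = replace_background(frame, mask, new_bg)
print(augmented.shape)                       # same frame, new visual context
```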

5.1.2. Feature Extractor

(1) Text feature extractor, as shown in Figure 6a: MCIL [209] and HULC [210] adopt language models to convert natural language instructions into feature representations in the latent goal space, which are then used to train imitation-learning policies. MIDAS [211] enhances the model by adding a residual connection (RC) to the pre-trained language model and employs a two-stage training process consisting of inverse dynamics pre-training and multi-task fine-tuning: during inverse dynamics pre-training, the model learns to recover action sequences from observed image sequences, and during multi-task fine-tuning, it further learns to execute specific robotic manipulation tasks. RoboCat [212] uses a pre-trained VQ-GAN encoder and a CLIP text encoder and is trained to mimic the behaviors demonstrated by experts. RoboAgent [213] applies BERT to generate language embeddings for task descriptions; these embeddings condition the transformer policy network, enabling the robot to perform tasks based on natural language instructions. DROID [214] utilizes DistilBERT to convert natural language instructions into feature vectors, and a ResNet-50 visual encoder converts each image into a fixed-size feature vector. A diffusion model uses the outputs of the visual and language encoders and the robot’s proprioceptive state to generate action instructions for the robot.
(2) Visual feature extractor, as shown in Figure 6b: EmbCLIP [215] leverages features extracted by CLIP and language instructions as inputs to train a policy network that outputs actions. UMI [48] and DSL [216] extract visual features such as object pose, shape, and location from video data based on the visual foundation model and then apply the extracted features to learn the mapping from observation to action based on a diffusion strategy. HomeRobot [217] can use the visual features extracted by CLIP directly for training strategies or in combination with other modal information (such as depth and semantic segmentation). Vid2robot [218] processes each frame of a video through a visual model to obtain high-dimensional feature representations. The robot’s current state (captured by a camera image) is also encoded using the same image encoder to obtain feature representations similar to those of the prompt video. Actions are then trained using behavior cloning and cross-entropy loss.
(3) Text and visual feature extractor, as shown in Figure 6c: CLIPORT [219] employs a dual-stream architecture in which the semantic stream uses CLIP to process image data and combines it with language instructions; this stream is fused with the spatial stream and trained using expert demonstrations. VIMA [220] utilizes T5 to tokenize the input text prompt and convert these tokens into word embeddings, which are fed into the T5 encoder to output a high-dimensional representation of the text that provides the robot agent with contextual information about the task instructions. A Mask R-CNN detector identifies objects in the image and extracts bounding boxes that are used to crop the image; the cropped images are divided into fixed-size patches and fed into a ViT model to output visual feature embeddings, which are used together with the bounding-box embeddings to represent the objects in the image. Finally, training is carried out with behavior cloning to imitate expert behavior. Open-TeleVision [102] replaces the ResNet used in the ACT framework with the more powerful visual backbone DINOv2 to extract features from the input stereo vision images. SPOC [221] applies DINOv2 and SigLIP to first extract visual features from images captured by the RGB camera. The CLIP model then plays a dual role: on the one hand, it serves as a visual encoder to assist in the extraction of image features, and on the other hand, it handles the matching between images and text, helping the model understand and execute text-based instructions. Finally, T5, as a text encoder, processes natural language instructions and converts the text information into a format that the model can act on. These vectors are used to guide the behavior of the agent, enabling it to imitate the expert’s behavior to perform the task. MPI [222] utilizes deformable attention layers to capture the causal relationships between two initial states and the final state and generates aggregated tokens of visual and language embeddings through Multi-Headed Attention Pooling (MAP). SCR [223] extracts feature maps from the middle layers of the Stable Diffusion model and combines them into a final feature representation using a spatial aggregation method; a policy network is then trained using supervised learning (such as behavior cloning) to predict the optimal action. RDT [224] provides an algorithmic framework for developing and training foundational robot models for two-handed manipulation tasks, including mechanisms for handling multimodal inputs and for capturing the multimodality of action distributions through diffusion models.
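Many of the methods above share the same "frozen pre-trained encoder plus small trainable policy head" pattern. The sketch below uses torchvision's ResNet-50 purely as a stand-in for whichever visual foundation model (CLIP, DINOv2, etc.) a given method adopts, and the 7-dimensional action head is an arbitrary example.

```python
# Frozen pre-trained visual encoder as a feature extractor for an imitation
# policy: ResNet-50 from torchvision stands in for CLIP/DINOv2/etc.; only the
# small policy head is trained on demonstration data.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()                 # expose 2048-d pooled features
for param in backbone.parameters():
    param.requires_grad = False             # keep the pre-trained encoder frozen

policy_head = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 7))

images = torch.randn(4, 3, 224, 224)        # batch of camera observations
with torch.no_grad():
    features = backbone(images)             # (4, 2048) visual features
actions = policy_head(features)             # (4, 7) e.g., 6-DoF pose + gripper
print(actions.shape)
```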

5.2. Reinforcement Learning

By modeling the policy-learning process as a Markov decision process (MDP), the aim is to find a state–behavior mapping function that maximizes the total expected reward of the agent [225]:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^{t}\, r_t\left(v_t, a_t\right)\right]$$
where $\tau = \{(v_t, a_t)\}_{t=0}^{T}$ represents the trajectory, and $v_t$ and $a_t$ denote the observation and action at time step $t$, respectively. The reward $r$ corresponds to the reward provided as environmental feedback after each action is completed, and $\gamma \in (0, 1)$ is a discount factor that balances the importance of the current reward and future rewards. Therefore, the objective of RL can be expressed as follows:
$$\pi^{*} = \arg\max_{\pi} J(\pi)$$
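As a worked example of the objective above, the discounted return of a single trajectory can be computed directly from its reward sequence; the rewards below are placeholders for a sparse-reward task.

```python
# Discounted return of one trajectory, matching the objective above:
# G = sum_t gamma^t * r_t. The rewards here are placeholder values.
def discounted_return(rewards, gamma=0.99):
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

rewards = [0.0, 0.0, 0.0, 1.0]     # sparse reward: success on the last step
print(discounted_return(rewards))  # 0.99**3, roughly 0.970
```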
Commonly used algorithms for reinforcement learning include DDPG [226], the AC series [227,228,229], PPO [230,231], and SAC [232]. Among them, PPO is widely used in robotic manipulation because of its simplicity and effectiveness. In reinforcement learning, the quality of an agent’s policy depends on the designed reward function, but defining a reward function for a specific embodiment and task often requires prior knowledge. In current research on embodied intelligence based on reinforcement learning, the reward function is typically designed manually. Zeng et al. [233] assign a reward of +1 when the UR5 robot arm successfully pushes the box and a reward of 0 in all other cases. Berscheid et al. [234] used the Franka robot arm to perform grasping tasks in cluttered environments and defined the reward as a binary function. Zuo et al. [235] defined the reward as a function of the distance between the target point and the end-effector of the robot arm: an action is assigned a positive reward as long as it brings the end-effector closer to the target point; otherwise, it is assigned a negative reward. Another approach is to directly learn the reward function itself through human feedback [236], human annotation [237], or behavioral preferences [238].
In the context of large models, because pre-trained large models have a certain prior knowledge of the robot system and state, an optimal reward function can be obtained using a large language model or a visual–language model. This allows the reward function to evolve based on new insights or changes in task requirements, simplifies the solution to complex manipulation tasks, reduces the dependence on manually created reward functions, and potentially improves the effectiveness of the learning process. There are two methods: (1) reward function calculation and (2) reward function estimation.

5.2.1. Reward Function Calculation

(1) Generate reward function code: The generated reward function can be dynamically adjusted according to different task requirements and scenarios, as shown in Figure 7a. Text2Reward [239] represents the elements and states of the environment in the form of Python classes, providing an abstract representation of the environment for the model, along with function information and usage examples that help generate reward code. A large language model is used to generate a dense reward function that completes the task based on the instruction and the abstract environment, and, combined with human feedback, a more refined and effective reward function is iteratively generated. Eureka [240] adopts GPT-4 to generate an executable reward function using the environmental source code as context. An evolutionary search over rewards is performed within the context window of the LLM by iteratively sampling and improving candidate reward functions, and reward reflection is generated based on the scalar values of each reward component and the task fitness function at intermediate checkpoints of policy training to guide the improvement of the reward function. ASD [241] employs LLMs to generate task proposals and accordingly defines multiple candidate reward functions; it uses GPT-4V to understand visual scenes and explain the actions of the RL agent, thereby providing an objective assessment of whether the task was successful. The candidate reward functions that are evaluated as successful are retained and further fine-tuned. A high-level sketch of this generate-and-refine loop is given at the end of this subsection.
(2) Give reward signal: As an evaluation tool, pre-trained visual representations are used to automatically recognize task progress and provide a reward signal, as shown in Figure 7b. ALF [242] applies CLIP to calculate the reward as the similarity score between the observed image after robot execution and two text prompts (one indicating that the door is closed and the other indicating that the door is open). UVD [243] utilizes VIP or R3M to identify the sub-steps that need to be completed first, generates an embedding vector for each sub-goal, and uses the distance between the embedding of the agent’s current state and the sub-goal embedding as a reward signal. ROBOFUME [244] uses VLMs as reward models and constructs a surrogate reward function that can provide reward signals during the online fine-tuning stage by fine-tuning these models; this reward model outputs binary success labels based on the current observation and task name, thereby providing the necessary reward signals when the robot learns a new task. MOKA [245] inputs the description and observation image of the subtask to a VLM and prompts it to generate corresponding key points and path points to indirectly estimate the reward signal. RLFP [246] proposes a framework in which GPT-4V serves as the success-reward prior model, used to automatically determine whether a task has been successfully completed and to output corresponding success reward signals (0 or 1); VIP functions as the value base model, accepting current and target image observations as input to infer the value of the state; and Seer, an open-source video diffusion model, generates videos and, through an inverse dynamics model, generates actions to provide inputs for the policy base model.
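The kind of reward signal described for ALF can be sketched with the Hugging Face CLIP interface: the similarity between the current observation and two text prompts is converted into a scalar reward. The checkpoint name, the prompts, and the choice of using the goal prompt's probability are assumptions for illustration.

```python
# CLIP-similarity reward in the style described for ALF: compare the current
# observation against two text prompts and use the probability of the "goal"
# prompt as the reward. The checkpoint name and prompts are assumptions.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(observation: Image.Image) -> float:
    prompts = ["the cabinet door is open", "the cabinet door is closed"]
    inputs = processor(text=prompts, images=observation,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image       # (1, 2) similarity scores
    probs = logits.softmax(dim=-1)
    return probs[0, 0].item()        # probability mass on the goal description

fake_obs = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))
print(clip_reward(fake_obs))
```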
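The generate-evaluate-refine loop shared by the reward-code-generation methods in the previous item can be sketched at a high level as follows; query_llm and evaluate_reward_code are hypothetical placeholders for an LLM call and an RL training-and-evaluation run, not APIs of Text2Reward, Eureka, or ASD.

```python
# High-level sketch of the generate-evaluate-refine loop used by reward-code
# generation methods. query_llm and evaluate_reward_code are hypothetical
# placeholders (an LLM API call and an RL training run, respectively); this is
# not the implementation of any of the cited systems.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("call a large language model here")

def evaluate_reward_code(reward_code: str) -> float:
    raise NotImplementedError("train an RL policy with this reward and return a task score")

def generate_reward(task_description: str, env_source: str, iterations: int = 3) -> str:
    best_code, best_score = None, float("-inf")
    feedback = ""
    for _ in range(iterations):
        prompt = (f"Task: {task_description}\nEnvironment:\n{env_source}\n"
                  f"Previous feedback: {feedback}\n"
                  "Write a Python reward function compute_reward(state, action).")
        candidate = query_llm(prompt)                 # sample a candidate reward function
        score = evaluate_reward_code(candidate)       # policy trained with this reward
        feedback = f"Last candidate achieved task score {score:.3f}."
        if score > best_score:
            best_code, best_score = candidate, score  # keep the best reward so far
    return best_code
```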

5.2.2. Reward Function Estimation

(1) Non-parametric estimation: The form of the reward function is not strictly defined, and a more flexible model is usually used to estimate it, as shown in Figure 7c. MWM [247] designs an auxiliary reward prediction task by adding a linear output head to the autoencoder to predict rewards. VoxPoser [248] leverages a VLM to locate objects of interest in the scene from the linguistic description of the task, assigns rewards to the relevant positions in the observation space (for example, high values to regions of objects that need to be manipulated and low values to regions that should be avoided), and finally synthesizes a 3D value map that encodes task-related reward and cost information. These value maps serve as the objective function for the motion planner. LIV [249] estimates rewards by learning a multimodal representation that implicitly represents the reward function: the distance between the feature representation of a video frame and that of the target is computed, and the smaller the distance, the closer the frame is to the target and the higher the assigned reward, and vice versa. The representation can be improved through pretraining and fine-tuning to suit specific tasks and environments. RL-VLM-F [250] queries a VLM for preference labels over pairs of agent observation images, given a textual description of the task goal, and then learns a neural-network reward function from these preference labels.
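The embedding-distance idea behind LIV [249] and similar non-parametric estimators can be sketched as follows; `encoder` stands in for any frozen pre-trained visual representation, and the image tensors are assumed to be preprocessed already.

```python
import torch

def embedding_distance_reward(encoder: torch.nn.Module,
                              frame: torch.Tensor,
                              goal: torch.Tensor) -> float:
    """Non-parametric reward sketch: the closer the current frame's representation
    is to the goal representation, the higher the reward."""
    with torch.no_grad():
        z_frame = encoder(frame.unsqueeze(0)).flatten(1)
        z_goal = encoder(goal.unsqueeze(0)).flatten(1)
    distance = torch.norm(z_frame - z_goal, dim=-1)
    return (-distance).item()  # smaller feature gap => larger reward
```

No explicit functional form of the reward is assumed; everything is carried by the learned representation, which is why these approaches are grouped as non-parametric.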
(2) Parameterized estimation: Assuming the reward function has a specific mathematical form, it is fitted by adjusting predefined parameters, as shown in Figure 7d. The CenterGrasp [251] reward function is defined as a weighted sum of residual terms, with the parameters defining each residual estimated by LLMs. LAMP [252] also adopts an intrinsic reward function defined by uncertainty in reinforcement learning to guide the agent to explore the environment efficiently. SARU [253] has GPT-4 define the basic structure of the reward function, determine which environmental observation features are task-relevant and should be included, and assign initial parameter values to each component; these parameters are then adjusted during self-alignment to optimize the performance of the reward function. FuRL [254] uses a VLM to generate a reward signal by comparing the cosine similarity between language embeddings (task instructions) and image embeddings (observations of the current state); this reward signal is designed to assist in sparse-reward tasks. Among video prediction models, VIPER [255] trains an autoregressive transformer on expert videos to obtain a model that predicts the probability distribution of the next frame in a video sequence and uses the log likelihood of the next frame given the context as the reward function. Diffusion Reward [256] uses expert videos to pretrain a video model; by estimating the conditional entropy of a given historical frame and taking its negative value as a reward, the generative diversity of the expert's behavior can be captured, which better guides the training of the model. A summary of representative algorithms is shown in Table 4.
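The parameterized setting can be illustrated with a weighted sum of residual terms, in the spirit of CenterGrasp [251] and SARU [253]; the term names and weight values below are hypothetical stand-ins for what an LLM would propose and later refine.

```python
from typing import Dict

def parameterized_reward(residuals: Dict[str, float],
                         weights: Dict[str, float]) -> float:
    """Parameterized reward sketch: a fixed structure (weighted sum of residuals)
    whose weights are proposed by an LLM and tuned during training/self-alignment."""
    return sum(-weights[name] * residuals[name] for name in residuals)

# Illustrative weights an LLM might propose, later adjusted during self-alignment.
weights = {"position_error": 2.0, "orientation_error": 0.5, "collision": 10.0}
residuals = {"position_error": 0.04, "orientation_error": 0.2, "collision": 0.0}
print(parameterized_reward(residuals, weights))  # higher when all residuals are small
```

In contrast to the non-parametric case, only the scalar parameters change; the mathematical form of the reward stays fixed, which makes the resulting function easy to inspect and tune.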
Figure 7. Reward function calculation framework.

6. Embodied Agent

Model-driven robotic manipulation is a research approach that has emerged in recent years. There are two types: (1) Hierarchical execution: large models perform high-level task planning, decomposing long-horizon tasks into simpler subtasks; the plan is then executed, without human intervention, by low-level control strategies or by a skill library predefined by humans. (2) Holistic execution: one approach is fine-tuning a pre-trained model, representing robot actions as text tokens and training on internet-scale vision–language data to obtain a VLA (vision–language–action) model, in which the robot takes the task and environmental information and directly outputs actions; another is visual motion planning, in which a pre-trained model synthesizes a video that directly drives robot control. Actions can also be generated directly by a pre-trained model.

6.1. Hierarchical Execution

6.1.1. Low-Level Control Strategy

(1) Traditional control: stable performance in known and controlled environments, low computing requirements, and suitability for real-time control, as shown in Figure 8a. LLM-GROP [257] applies an LLM to assist the robot in rearranging multiple objects, realized in the Gazebo simulator. HIP [258] adopts an LLM to generate a symbolic sub-goal sequence, a video diffusion model to generate detailed observation trajectories that account for the geometry and physics of the environment, and an inverse dynamics model to convert the observation trajectories into specific action instructions. CLOVER [259] utilizes text-conditional video diffusion models to generate visual plans that guide robotic actions; the reliability and accuracy of these plans are enhanced through depth information and optical-flow regularization, and they are executed by a controller designed with inverse dynamics modeling. LMPC [260] leverages PaLM 2 to perform task decomposition, and MPC computes the specific movement paths and speeds of each robot joint from the actions or strategies provided by the LLM, ensuring that the robot executes these actions accurately. OK-Robot [261] uses OWL-ViT to scan the home environment, generate a 3D map, and identify the objects in it. After receiving a natural language command, CLIP first converts the command into a semantic embedding, and the VoxelMap is searched for the voxel that most closely matches the embedding to locate the target object. After the robot navigates near the target object, Lang-SAM further refines the mask of the target object so that AnyGrasp can generate a more accurate grasping pose.
(2) Strategy learning: the ability to adapt to complex, dynamically changing environments and handle diverse tasks, as shown in Figure 8b. DEPS [262] introduces a trainable goal selector that chooses among parallel candidate subgoals according to how easy they are to achieve and finally generates actions from the goal-conditioned policy, the current state, and the subgoals. PSL [263] uses GPT-4 to generate high-level plans, and a serialization module converts each step of the high-level plan into a target region that the robot needs to reach; reinforcement learning is then used to learn the low-level control actions executed after reaching the target region. EmbodiedGPT [264] maps features to concrete actions executed by a multilayer perceptron (MLP) policy network by combining the visual encoder embedding with the planning information provided by the LLM. PALO [265] and YAY Robot [266] adopt pre-trained models to decompose long-horizon tasks at the semantic level, generating a series of candidate instructions for subtasks; during training, imitation learning is used to learn control strategies from expert demonstrations.
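A goal-conditioned low-level policy of the kind used in these pipelines can be sketched as a small network that maps the current state and the sub-goal produced by the high-level planner to an action; the dimensions below are illustrative placeholders.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """Minimal MLP policy: concatenates state and sub-goal and outputs an action.
    State, goal, and action dimensions are illustrative."""
    def __init__(self, state_dim=32, goal_dim=16, action_dim=7, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions normalized to [-1, 1]
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

# Each sub-goal from the high-level planner is executed by repeatedly querying the policy.
policy = GoalConditionedPolicy()
action = policy(torch.zeros(1, 32), torch.zeros(1, 16))
```

Whether the weights come from reinforcement learning (PSL) or imitation learning (PALO, YAY Robot), the interface is the same: the planner supplies the goal, and the policy supplies the action.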

6.1.2. Skills Library

A skills library is a set of predefined actions or action sequences. After a task is decomposed, the system selects and combines actions from the library to complete specific tasks, and combining different actions allows diverse task requirements to be met.
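In code, a skill library can be as simple as a mapping from skill names to parameterized primitives that a planner composes after task decomposition; the primitives below are hypothetical placeholders for trained policies or motion routines.

```python
from typing import Callable, Dict

def move_to(target: str) -> None:
    print(f"moving to {target}")          # placeholder for a navigation/motion primitive

def pick_up(obj: str) -> None:
    print(f"picking up {obj}")            # placeholder for a grasping policy

def place_on(obj: str, surface: str) -> None:
    print(f"placing {obj} on {surface}")  # placeholder for a placement policy

SKILL_LIBRARY: Dict[str, Callable[..., None]] = {
    "move_to": move_to,
    "pick_up": pick_up,
    "place_on": place_on,
}

# After task decomposition, the planner composes skills by name.
plan = [("move_to", ("table",)), ("pick_up", ("coke_can",)), ("place_on", ("coke_can", "tray"))]
for name, args in plan:
    SKILL_LIBRARY[name](*args)
```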
(1) Dynamic invocation: A value function is matched to each skill, allowing the skill selection strategy to be adjusted dynamically, as shown in Figure 8c. SayCan [267] utilizes PaLM to break high-level instructions down into a series of lower-level subtasks or skills. For example, ‘get me a can of Coke’ might be broken down into the subtasks ‘find Coke’, ‘pick up Coke’, and ‘bring it to you’. For each subtask, PaLM evaluates a set of pre-trained skills, each consisting of an action description (such as ‘pick up object’) and a policy network trained with imitation learning. A value function, learned with reinforcement learning, is matched to each skill; it quantifies the likelihood of successfully executing the skill from the current state and helps select the skill most likely to advance the current subtask. PaLM-E [268] generates a textual plan from perceived images and high-level language instructions. In a mobile manipulation environment, SayCan maps the generated plan to executable low-level instructions, and the plan can be revised according to changes in the environment while the low-level policy executes the operation.
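The SayCan-style selection rule can be sketched as combining, for each candidate skill, the language model's score that the skill is useful for the instruction with a learned value (affordance) function scoring its feasibility from the current state; both scoring functions below are hypothetical toy stand-ins.

```python
from typing import Callable, Dict, List

def select_skill(instruction: str,
                 state,
                 skills: List[str],
                 llm_score: Callable[[str, str], float],
                 value_fn: Callable[[str, object], float]) -> str:
    """SayCan-style selection: argmax over skills of
    p_LLM(skill useful for instruction) * value(skill feasible from state)."""
    combined: Dict[str, float] = {
        skill: llm_score(instruction, skill) * value_fn(skill, state)
        for skill in skills
    }
    return max(combined, key=combined.get)

# Toy scores standing in for a PaLM prompt likelihood and an RL-trained value function.
llm_score = lambda instr, skill: {"find coke": 0.6, "pick up coke": 0.3}.get(skill, 0.1)
value_fn = lambda skill, state: {"find coke": 0.9, "pick up coke": 0.2}.get(skill, 0.5)
print(select_skill("get me a can of coke", None,
                   ["find coke", "pick up coke", "wipe table"], llm_score, value_fn))
```

The product of the two scores is what makes the invocation dynamic: a skill that is semantically relevant but currently infeasible (low value) is deferred in favor of one that can actually succeed from the present state.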
(2) Direct invocation: Generated code invokes the skill library, indirectly calling the functions defined in it, and the library is designed to suit different scenarios and hardware, as shown in Figure 8d. VoicePilot [269] applies GPT-3.5 Turbo to a feeding-assistance robot: high-level robot control functions are defined and named in the prompt, so the model can understand the user’s spoken commands, generate a plan, and call these functions to control the robot. ChatGPT for Robotics [270] and RobotGPT [271] introduce ChatGPT to generate code that instructs the robot to perform a task by calling predefined API functions, which represent the robot’s primitive actions. G4R [272] uses GPT-4V to analyze a given RGB video, compile the affordance information and task plan into a hardware-independent executable file, and save it in JSON format. COME-robot [273] designs and implements a complete library of motion primitives, including API functions for exploration, navigation, and manipulation; it can detect the cause of a failure during execution, dynamically request expert feedback, and re-plan actions based on that feedback to resume the task. LABOR [274] generates sequences of uncoordinated or coordinated motion commands to coordinate the robot’s two hands in complex tasks and adjusts and optimizes the plan through an interactive feedback loop.
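The direct-invocation pattern, as in ChatGPT for Robotics [270], can be sketched as executing LLM-generated code against a small, documented API; the API functions and the generated snippet below are hypothetical illustrations, not the interfaces used in the cited works.

```python
# Hypothetical primitive API exposed to the language model (names are illustrative).
def go_to(location: str) -> None:
    print(f"navigating to {location}")

def grasp(obj: str) -> None:
    print(f"grasping {obj}")

def release(obj: str) -> None:
    print(f"releasing {obj}")

# Code of the kind an LLM might return when prompted with the API descriptions
# and the instruction "put the sponge in the sink".
GENERATED_CODE = """
go_to("counter")
grasp("sponge")
go_to("sink")
release("sponge")
"""

# Execute the generated plan against the whitelisted primitives only.
exec(GENERATED_CODE, {"go_to": go_to, "grasp": grasp, "release": release})
```

Restricting execution to a whitelisted namespace is one simple way to keep the generated code within the bounds of the skill library; real systems typically add validation and failure handling around this step.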
Figure 8. Low-level control strategy and skills library framework.

6.2. Holistic Execution

(1) Fine-tuning or training: The entire network is trained or fine-tuned as a whole, outputting the robot’s actions directly, with the whole process completed within a unified framework. RT-1 [275] uniformly discretizes each dimension of the robot’s motion, tokenizes the actions, takes the robot’s state and historical images as input, and outputs the motion directly from the model. LEO [276] uses a two-stage strategy, first performing 3D vision–language alignment and then 3D vision–language–action instruction tuning. RT-2 [277] adopts co-fine-tuning, fine-tuning simultaneously on internet-scale vision–language data and robot trajectory data; this allows the model not only to learn robot actions but also to retain the rich visual and linguistic knowledge acquired during pre-training. LLaRP [278] uses an LLM as a frozen base network and adds adaptation layers; the output adaptation layer (action output module) is trained, through interaction with the environment, to adjust its strategy based on the sparse rewards the environment provides. OpenVLA [279] employs and compares multiple fine-tuning strategies. LLARA [280] fine-tunes the language model and adaptation layer with auxiliary instruction data generated through self-supervision, without additional action labels. CoGeLoT [281] uses T5 to encode the linguistic part of the command and injects the encoded visual features into T5’s embedding space; the model then outputs an action defining the linear movement of the robot arm in 3D space, including the start and end poses. DeeR-VLA [282] proposes an efficient framework that dynamically adjusts the model size, enabling real-time execution of complex language-guided tasks on resource-constrained robotic platforms.
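The action tokenization used by RT-1 [275] and RT-2 [277] can be sketched as uniform discretization of each continuous action dimension into a fixed number of bins, so that an action becomes a short sequence of integer tokens the model can predict like text; the bounds and bin count below are illustrative.

```python
import numpy as np

NUM_BINS = 256  # illustrative; RT-1 discretizes each action dimension into 256 bins

def tokenize_action(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to an integer token in [0, NUM_BINS - 1]."""
    normalized = (action - low) / (high - low)                 # scale to [0, 1]
    return np.clip((normalized * NUM_BINS).astype(int), 0, NUM_BINS - 1)

def detokenize_action(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map tokens back to bin-center continuous values for execution."""
    return low + (tokens + 0.5) / NUM_BINS * (high - low)

# Example: a 7-DoF end-effector action (xyz, rpy, gripper) with illustrative bounds.
low, high = np.full(7, -1.0), np.full(7, 1.0)
action = np.array([0.1, -0.3, 0.5, 0.0, 0.2, -0.1, 1.0])
tokens = tokenize_action(action, low, high)
recovered = detokenize_action(tokens, low, high)
```

Treating actions as tokens is what lets a single vision–language backbone be co-fine-tuned on web data and robot trajectories without changing its output interface.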
(2) Video and image prediction: Actions are inferred from videos or images produced by a generative model. VLP [283] combines PaLM-E and a text-to-video model to generate detailed and rich video and language plans and uses a goal-conditioned policy to perform the task. DrM [284] leverages a CLIP text encoder to encode text descriptions into vector embeddings, which condition a video generation model; actions are inferred from the synthesized videos and used to train a robot policy. Dreamitate [285] applies the Stable Video Diffusion model to receive an image of a new scene and generate a video showing the execution of the task; the generated video is used to track the 3D trajectory of the tool, and this trajectory is converted into movement instructions so that the robot can accurately imitate the motions in the video to complete the task. UniPi [286] uses the output (text embedding) of T5-XXL as conditioning for a diffusion model to generate a video sequence matching the text description; the action sequence is then inferred from the generated video through inverse dynamics modeling. GR-MG [287] generates target images with a target-image generation model, and a multimodal goal-conditioned policy uses these target images to predict actions.
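The action-extraction step shared by these video-plan methods can be sketched as an inverse dynamics model that predicts the action connecting two consecutive (generated) frames; the architecture and feature dimensions below are illustrative placeholders, not the models used in the cited papers.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Toy inverse dynamics model: given frame features at t and t+1,
    predict the action that produced the transition."""
    def __init__(self, feat_dim=512, action_dim=7, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, feat_t, feat_t1):
        return self.head(torch.cat([feat_t, feat_t1], dim=-1))

# Given features of consecutive frames from a generated video plan,
# the model yields the action sequence that the robot then executes.
idm = InverseDynamicsModel()
video_feats = torch.randn(16, 512)                # 16 generated-frame features (illustrative)
actions = idm(video_feats[:-1], video_feats[1:])  # 15 predicted actions
```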
(3) VLM-based: No additional training is required, enabling rapid deployment and efficient processing. ZSTG [288] requires no pre-trained skills, motion primitives, or external trajectory optimizer; it uses GPT-4 to directly generate high-level plans for manipulation tasks from task descriptions and predicts a series of dense end-effector poses to guide the robot through specific manipulation tasks. KaP [289] designs a parser that converts the geometric and kinematic structure of an object into a unified text description, including kinematic joints and contact positions; it generates an abstract textual operation sequence with GPT-4, which is further used to generate precise 3D operation points. Chen et al. [290] apply GPT-4o to understand the task and translate it into an action plan the robot can execute; it then generates a series of action instructions for the task and uses its spatial reasoning ability to determine how to move from the current state to the target state. KAT [291] utilizes GPT-4 Turbo to convert the visual observations and action sequences in the demonstration into a token sequence and then generates an action sequence. Figure 9 presents the classic frameworks for implementing the three approaches, and a summary of the algorithms is provided in Table 5.
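Methods in this category query an off-the-shelf model directly at run time; a minimal sketch with the OpenAI Python client is shown below, where the model name, prompt, and expected JSON schema are illustrative assumptions rather than the prompts used in the cited works.

```python
import json
from openai import OpenAI  # assumes the `openai` package and an API key are available

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a robot task planner. Given a manipulation instruction, "
    "reply with JSON: {\"steps\": [{\"skill\": str, \"target\": str}]}."
)

def plan_task(instruction: str) -> list:
    """Query the model for a high-level plan and parse it into steps."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
    )
    return json.loads(response.choices[0].message.content)["steps"]

# Example output steps such as [{"skill": "pick_up", "target": "red block"}, ...]
# would then be dispatched to low-level controllers or converted into pose sequences.
```

Because nothing is trained, the whole pipeline reduces to prompt design plus a thin layer that maps the parsed plan onto the robot's control interface, which is what makes rapid deployment possible.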

7. Challenges and Prospects

In the past few years, the use of learning-based methods in robotic grasping tasks has increased significantly, promoting rapid development in this field. However, current technology still faces some highly challenging problems. Further exploration of these issues is critical to promoting the widespread use of robots in various fields. This section will discuss several challenges and potential future research directions.

7.1. Problems with Dataset Acquisition

Data sources are divided into public datasets and self-built datasets, and data can be obtained from simulation environments or real environments. To obtain valid data for operational scenarios, however, a human usually needs to teleoperate the robot through the task; alternatively, motion capture of humans or animals can be performed and the collected data retargeted for training. Current simulation platforms are still deficient in simulating complex flexible objects, fluids, and tactile sensors, with large gaps in friction, collision, dynamic interaction, and rendering. They also consume substantial computational resources, do not make full use of hardware, and lack sufficient environmental diversity. The datasets required for embodied perception are mainly generated in simulation or by combining simulated and real data; differences from objects in the real environment lead to gaps in transfer learning. Moreover, such datasets cover only a limited number of object types and grasping methods, lacking broad coverage of diverse objects and complex grasping scenarios, which results in insufficient generalization. Embodied strategies mainly rely on datasets obtained through robot teleoperation, but teleoperation and manual collection are costly. Open datasets lack unified formats and standards, which makes data integration and comparison more difficult; they may also contain missing data, incorrect labels, or outdated information, affecting the reliability of research results. Future research needs to develop more realistic simulation platforms while optimizing their computational efficiency to reduce resource consumption. Advanced transfer learning methods can be explored to narrow the gap between simulated and real-world environments. Large-scale datasets covering diverse object shapes, materials, interaction modes, and complex dynamic scenarios should be developed. To address the lack of standardized formats and protocols for public datasets, efforts can be made to establish cross-domain open dataset standards, including data formats and annotation specifications. The direct use of large models to generate datasets, as an emerging approach, has made some progress, but further improvements are still needed in acquiring high-quality data and achieving efficient code generation.

7.2. Adaptation Problems in Realistic Tasks of Models

Current research is mainly limited to specific tasks and usually achieves good test results in simulation environments. In real environments, however, factors such as sensor accuracy and environmental noise limit the ability of models to capture the complexity of the real world; test results have not yet reached an ideal level, and failures during operation are not handled immediately and effectively. When a large model decomposes a task, it is difficult to ensure that the generated subtask sequence is both semantically logical and effectively executable in a real environment, which increases the probability of task failure. In real-time applications, task decomposition and planning must be completed quickly and accurately, and when API interfaces are called directly, transmission-rate limitations make it difficult to meet real-time requirements. Direct deployment requires models with numerous parameters, which demands substantial computing resources and complicates deployment. VLA models must process and integrate information from multiple modalities, including vision, language, and action; although significant progress has been made, achieving optimal integration of these modalities remains an ongoing challenge. Future research could introduce diversified simulation scenarios to train models to handle different types of noise, sensor errors, and environmental changes, thereby enhancing robustness. More efficient large models should be designed so that robots can dynamically adjust subtask sequences based on real-time environmental changes and ensure the rationality of task planning. More lightweight model architectures are needed to improve execution efficiency and reduce the computational demands of large models, supporting local real-time deployment. For VLA models, multimodal information integration capabilities need to be further enhanced.

7.3. Problem of Generalization of Strategies

Existing robotic-grasping technology has insufficient generalization ability when faced with objects of uncommon or special materials or shapes. Robots lack an intuitive understanding of physical properties (such as material, density, mass, and friction) when grasping, which may lead to improper handling, for example, damaging fragile objects. Embodied tasks often involve diverse entity types, and the environment changes dynamically, through variations in lighting, movement of surrounding objects, or interference. In such settings, even slight changes in the dynamic parameters of the agent or the environment can make a learned embodied strategy difficult to apply directly. This is because existing strategies are typically built on specific training data covering only a limited range of object types, physical properties, and environmental conditions; when faced with new environments or tasks outside the training data, the robot’s existing strategies can hardly generalize effectively. For example, a grasping system trained in a laboratory on objects of specific shapes and materials may fail to grasp objects accurately in actual industrial or home environments, where the object types and physical properties can differ substantially from the training data. Future research should focus on addressing these limitations of grasping generalization. Key areas include enhancing the perception and modeling of physical properties, improving adaptability to dynamic environments, strengthening physical common-sense reasoning, exploring multi-task learning methods, and constructing diversified datasets. Improving cross-modal data integration and user interaction capabilities will also lay a technical foundation for efficient robotic grasping in diverse environments.

7.4. Problems in Executing Long Sequence Tasks

Executing long-sequence tasks requires breaking complex tasks down into multiple subtasks and ensuring their logical order. During execution, robots must retain long-term goals and adjust their action plans in real time in response to environmental changes and unexpected events. Pre-trained models have achieved initial success here, but task continuity and real-time replanning capabilities are not yet strong, and various perception and decision-making modules still need to be integrated to enhance the robot’s overall collaborative capabilities. Future research should continue to focus on intelligent task decomposition, long-term memory and context awareness in task models, real-time response to unexpected events, and deep integration of multimodal perception. Furthermore, enhancing the application of pre-trained models in task planning, with an emphasis on task-continuity modeling, failure detection, and recovery mechanisms, will further improve the generalization and adaptability of robots in executing long-sequence tasks.

7.5. Interpretability Problem

Despite the rapid development of model applications in robotics, as model complexity continues to increase, the decision-making process remains a “black box”, which poses a major challenge for model debugging and verification. For example, when the robot encounters a sudden obstacle, the model quickly adjusts the motion trajectory, but it is very difficult to understand how the model makes this adjustment from previous experience and current environmental changes. Improving the interpretability of the model makes its decision-making process more transparent, which facilitates debugging and verification. Due to the heterogeneity of the data, the model may not be able to effectively extract and integrate all relevant features, making it difficult to determine the contribution of each sensor’s data to the final decision and further increasing the difficulty of interpretation. A better understanding of the dimensions in which existing models fail to extract features effectively is needed in order to collect and utilize data more effectively. Future research should focus on improving the interpretability of robotic models, with an emphasis on feature-contribution analysis of multimodal data, transparency of decision-making paths, and real-time interpretation in dynamic environments. The trade-off between model performance and interpretability must also be balanced.

8. Conclusions

In this review, we conducted a comprehensive survey of research methods applying foundational models to robotic grasping tasks. We discussed embodied foundations, including robotic systems, simulation platforms, and datasets. Subsequently, we analyzed embodied perception, embodied strategy, and the embodied agent in detail, with a focus on the applications of vision models, language models, vision–language models, and generative models. We summarized the methodologies employed in these studies. Finally, we discussed the remaining challenges in robotics that foundational models have yet to address, as well as promising research directions. We hope this survey will provide researchers with a comprehensive understanding of this emerging field and offer new insights for future exploration.

Author Contributions

Conceptualization, J.S., P.M., L.K. and J.W.; methodology, J.S., P.M., L.K. and J.W.; investigation, J.S., P.M., L.K. and J.W.; resources, J.S., P.M., L.K. and J.W.; writing—original draft preparation, J.S., P.M., L.K. and J.W.; writing—review and editing, L.K. and J.W.; visualization, J.S., P.M., L.K. and J.W.; supervision, P.M., L.K. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Major Program for Science and Technology of Luoyang (2101018A).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sun, Y.; Baruah, T.; Mojumder, S.A.; Dong, S.; Gong, X.; Treadway, S.; Bao, Y.; Hance, S.; McCardwell, C.; Zhao, V. Mgpusim: Enabling Multi-Gpu Performance Modeling and Optimization. In Proceedings of the 46th International Symposium on Computer Architecture, Phoenix, AZ, USA, 22–26 June 2019. [Google Scholar]
  2. Jouppi, N.; Kurian, G.; Li, S.; Ma, P.; Nagarajan, R.; Nai, L.; Patil, N.; Subramanian, S.; Swing, A.; Towles, B. Tpu V4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. In Proceedings of the 50th Annual International Symposium on Computer Architecture, Los Angeles, CA, USA, 18–22 June 2023. [Google Scholar]
  3. Mining, W.I.D. Data Mining: Concepts and Techniques. Morgan Kaufmann 2006, 10, 4. [Google Scholar]
  4. Chen, X.; Hsieh, C.-J.; Gong, B. When Vision Transformers Outperform Resnets without Pre-Training or Strong Data Augmentations. arXiv 2021, arXiv:2106.01548. [Google Scholar]
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  6. Liu, Y. Roberta: A Robustly Optimized Bert Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  7. Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res. 2022, 23, 1–39. [Google Scholar]
  8. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine learning, Virtual Event, 18–24 July 2021. [Google Scholar]
  9. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S. Gpt-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  10. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  11. Li, C. Large Multimodal Models: Notes on Cvpr 2023 Tutorial. arXiv 2023, arXiv:2306.14895. [Google Scholar]
  12. Xiao, T.; Radosavovic, I.; Darrell, T.; Malik, J. Masked Visual Pre-Training for Motor Control. arXiv 2022, arXiv:2203.06173. [Google Scholar]
  13. Nguyen, N.; Vu, M.N.; Huang, B.; Vuong, A.; Le, N.; Vo, T.; Nguyen, A. Lightweight Language-Driven Grasp Detection Using Conditional Consistency Model. arXiv 2024, arXiv:2407.17967. [Google Scholar]
  14. Seo, Y.; Uruç, J.; James, S. Continuous Control with Coarse-to-Fine Reinforcement Learning. arXiv 2024, arXiv:2407.07787. [Google Scholar]
  15. Sharma, M.; Fantacci, C.; Zhou, Y.; Koppula, S.; Heess, N.; Scholz, J.; Aytar, Y. Lossless Adaptation of Pretrained Vision Models for Robotic Manipulation. arXiv 2023, arXiv:2304.06600. [Google Scholar]
  16. Yang, J.; Jin, H.; Tang, R.; Han, X.; Feng, Q.; Jiang, H.; Zhong, S.; Yin, B.; Hu, X. Harnessing the Power of Llms in Practice: A Survey on Chatgpt and Beyond. ACM Trans. Knowl. Discov. Data 2024, 18, 1–32. [Google Scholar] [CrossRef]
  17. Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A Survey on Large Language Model (Llm) Security and Privacy: The Good, the Bad, and the Ugly. High-Confid. Comput. 2024, 4, 100211. [Google Scholar] [CrossRef]
  18. Yang, S.; Nachum, O.; Du, Y.; Wei, J.; Abbeel, P.; Schuurmans, D. Foundation Models for Decision Making: Problems, Methods, and Opportunities. arXiv 2023, arXiv:2303.04129. [Google Scholar]
  19. Hu, Y.; Xie, Q.; Jain, V.; Francis, J.; Patrikar, J.; Keetha, N.; Kim, S.; Xie, Y.; Zhang, T.; Fang, H.-S. Toward General-Purpose Robots Via Foundation Models: A Survey and Meta-Analysis. arXiv 2023, arXiv:2312.08782. [Google Scholar]
  20. Xiao, X.; Liu, J.; Wang, Z.; Zhou, Y.; Qi, Y.; Cheng, Q.; He, B.; Jiang, S. Robot Learning in the Era of Foundation Models: A Survey. arXiv 2023, arXiv:2311.14379. [Google Scholar]
  21. Zheng, Y.; Yao, L.; Su, Y.; Zhang, Y.; Wang, Y.; Zhao, S.; Zhang, Y.; Chau, L.-P. A Survey of Embodied Learning for Object-Centric Robotic Manipulation. arXiv 2024, arXiv:2408.11537. [Google Scholar]
  22. Ma, S.; Tang, T.; You, H.; Zhao, Y.; Ma, X.; Wang, J. An Robotic Arm System for Automatic Welding of Bars Based on Image Denoising. In Proceedings of the 2021 3rd International Conference on Robotics and Computer Vision (ICRCV), Guangzhou, China, 19–21 November 2021. [Google Scholar]
  23. Li, M.; Wu, F.; Wang, F.; Zou, T.; Li, M.; Xiao, X. Cnn-Mlp-Based Configurable Robotic Arm for Smart Agriculture. Agriculture 2024, 14, 1624. [Google Scholar] [CrossRef]
  24. Haddadin, S.; Parusel, S.; Johannsmeier, L.; Golz, S.; Gabl, S.; Walch, F.; Sabaghian, M.; Jähne, C.; Hausperger, L.; Haddadin, S. The Franka Emika Robot: A Reference Platform for Robotics Research and Education. IEEE Robot. Autom. Mag. 2022, 29, 46–64. [Google Scholar] [CrossRef]
  25. Feng, Y.; Hansen, N.; Xiong, Z.; Rajagopalan, C.; Wang, X. Finetuning Offline World Models in the Real World. arXiv 2023, arXiv:2310.16029. [Google Scholar]
  26. Raviola, A.; Guida, R.; Bertolino, A.C.; De Martin, A.; Mauro, S.; Sorli, M. A Comprehensive Multibody Model of a Collaborative Robot to Support Model-Based Health Management. Robotics 2023, 12, 71. [Google Scholar] [CrossRef]
  27. Humphreys, J.; Peers, C.; Li, J.; Wan, Y.; Sun, J.; Richardson, R.; Zhou, C. Teleoperating a Legged Manipulator through Whole-Body Control. In Proceedings of the Annual Conference Towards Autonomous Robotic Systems, Manchester, UK, 12–14 July 2022. [Google Scholar]
  28. Koenig, N.; Howard, A. Design and Use Paradigms for Gazebo, an Open-Source Multi-Robot Simulator. In Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No. 04CH37566), Sendai, Japan, 28 September–2 October 2004. [Google Scholar]
  29. Coumans, E.; Bai, Y. Pybullet, a Python Module for Physics Simulation for Games, Robotics and Machine Learning. 2016. Available online: http://pybullet.org (accessed on 16 November 2024).
  30. Xiang, F.; Qin, Y.; Mo, K.; Xia, Y.; Zhu, H.; Liu, F.; Liu, M.; Jiang, H.; Yuan, Y.; Wang, H. Sapien: A Simulated Part-Based Interactive Environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  31. Zhu, Y.; Wong, J.; Mandlekar, A.; Martín-Martín, R.; Joshi, A.; Nasiriany, S.; Zhu, Y. Robosuite: A Modular Simulation Framework and Benchmark for Robot Learning. arXiv 2020, arXiv:2009.12293. [Google Scholar]
  32. Mu, T.; Ling, Z.; Xiang, F.; Yang, D.; Li, X.; Tao, S.; Huang, Z.; Jia, Z.; Su, H. Maniskill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations. arXiv 2021, arXiv:2107.14483. [Google Scholar]
  33. Gu, J.; Xiang, F.; Li, X.; Ling, Z.; Liu, X.; Mu, T.; Tang, Y.; Tao, S.; Wei, X.; Yao, Y. Maniskill2: A Unified Benchmark for Generalizable Manipulation Skills. arXiv 2023, arXiv:2302.04659. [Google Scholar]
  34. Nasiriany, S.; Maddukuri, A.; Zhang, L.; Parikh, A.; Lo, A.; Joshi, A.; Mandlekar, A.; Zhu, Y. Robocasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots. arXiv 2024, arXiv:2406.02523. [Google Scholar]
  35. Zhou, Z.; Song, J.; Xie, X.; Shu, Z.; Ma, L.; Liu, D.; Yin, J.; See, S. Towards Building Ai-Cps with Nvidia Isaac Sim: An Industrial Benchmark and Case Study for Robotics Manipulation. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, Madrid, Spain, 29 May–2 June 2024. [Google Scholar]
  36. Zhao, W.; Queralta, J.P.; Westerlund, T. Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: A Survey. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, Australia, 1–4 December 2020. [Google Scholar]
  37. Walke, H.R.; Black, K.; Zhao, T.Z.; Vuong, Q.; Zheng, C.; Hansen-Estruch, P.; He, A.W.; Myers, V.; Kim, M.J.; Du, M. Bridgedata V2: A Dataset for Robot Learning at Scale. In Proceedings of the Conference on Robot Learning, New York, NY, USA, 13–15 October 2023. [Google Scholar]
  38. Fang, H.-S.; Fang, H.; Tang, Z.; Liu, J.; Wang, J.; Zhu, H.; Lu, C. Rh20t: A Robotic Dataset for Learning Diverse Skills in One-Shot. In Proceedings of the RSS 2023 Workshop on Learning for Task and Motion Planning, New York, NY, USA, 13–15 October 2023. [Google Scholar]
  39. O’Neill, A.; Rehman, A.; Gupta, A.; Maddukuri, A.; Gupta, A.; Padalkar, A.; Lee, A.; Pooley, A.; Gupta, A.; Mandlekar, A. Open X-Embodiment: Robotic Learning Datasets and Rt-X Models. arXiv 2023, arXiv:2310.08864. [Google Scholar]
  40. Chen, X.; Ye, Z.; Sun, J.; Fan, Y.; Hu, F.; Wang, C.; Lu, C. Transferable Active Grasping and Real Embodied Dataset. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–4 June 2020. [Google Scholar]
  41. Zhang, H.; Yang, D.; Wang, H.; Zhao, B.; Lan, X.; Ding, J.; Zheng, N. Regrad: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter. IEEE Robot. Autom. Lett. 2022, 7, 2929–2936. [Google Scholar] [CrossRef]
  42. Fang, H.-S.; Wang, C.; Gou, M.; Lu, C. Graspnet-1billion: A Large-Scale Benchmark for General Object Grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  43. Vuong, A.D.; Vu, M.N.; Le, H.; Huang, B.; Huynh, B.; Vo, T.; Kugi, A.; Nguyen, A. Grasp-Anything: Large-Scale Grasp Dataset from Foundation Models. arXiv 2023, arXiv:2309.09818. [Google Scholar]
  44. Kim, J.; Jeon, M.-H.; Jung, S.; Yang, W.; Jung, M.; Shin, J.; Kim, A. Transpose: Large-Scale Multispectral Dataset for Transparent Object. Int. J. Robot. Res. 2024. [Google Scholar] [CrossRef]
  45. Obrist, J.; Zamora, M.; Zheng, H.; Hinchet, R.; Ozdemir, F.; Zarate, J.; Katzschmann, R.K.; Coros, S. Pokeflex: A Real-World Dataset of Deformable Objects for Robotics. arXiv 2024, arXiv:2410.07688. [Google Scholar]
  46. Zhou, B.; Zhou, H.; Liang, T.; Yu, Q.; Zhao, S.; Zeng, Y.; Lv, J.; Luo, S.; Wang, Q.; Yu, X. Clothesnet: An Information-Rich 3d Garment Model Repository with Simulated Clothes Environment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 27 October–2 November 2023. [Google Scholar]
  47. Cartucho, J.; Weld, A.; Tukra, S.; Xu, H.; Matsuzaki, H.; Ishikawa, T.; Kwon, M.; Jang, Y.E.; Kim, K.-J.; Lee, G. Surgt Challenge: Benchmark of Soft-Tissue Trackers for Robotic Surgery. Med. Image Anal. 2024, 91, 102985. [Google Scholar] [CrossRef]
  48. Chi, C.; Xu, Z.; Pan, C.; Cousineau, E.; Burchfiel, B.; Feng, S.; Tedrake, R.; Song, S. Universal Manipulation Interface: In-the-Wild Robot Teaching without in-the-Wild Robots. arXiv 2024, arXiv:2402.10329. [Google Scholar]
  49. Zhao, T.Z.; Kumar, V.; Levine, S.; Finn, C. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. arXiv 2023, arXiv:2304.13705. [Google Scholar]
  50. Wu, P.; Shentu, Y.; Yi, Z.; Lin, X.; Abbeel, P. Gello: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators. arXiv 2023, arXiv:2309.13037. [Google Scholar]
  51. Aldaco, J.; Armstrong, T.; Baruch, R.; Bingham, J.; Chan, S.; Draper, K.; Dwibedi, D.; Finn, C.; Florence, P.; Goodrich, S. Aloha 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation. arXiv 2024, arXiv:2405.02292. [Google Scholar]
  52. Dhat, V.; Walker, N.; Cakmak, M. Using 3d Mice to Control Robot Manipulators. In Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Chicago, IL, USA, 11–14 March 2024. [Google Scholar]
  53. Mandlekar, A.; Booher, J.; Spero, M.; Tung, A.; Gupta, A.; Zhu, Y.; Garg, A.; Savarese, S.; Fei-Fei, L. Scaling Robot Supervision to Hundreds of Hours with Roboturk: Robotic Manipulation Dataset through Human Reasoning and Dexterity. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019. [Google Scholar]
  54. Freiberg, R.; Qualmann, A.; Vien, N.A.; Neumann, G. Diffusion for Multi-Embodiment Grasping. arXiv 2024, arXiv:2410.18835. [Google Scholar] [CrossRef]
  55. Chen, C.-C.; Lan, C.-C. An Accurate Force Regulation Mechanism for High-Speed Handling of Fragile Objects Using Pneumatic Grippers. IEEE Trans. Autom. Sci. Eng. 2017, 15, 1600–1608. [Google Scholar] [CrossRef]
  56. Pham, H.; Pham, Q.-C. Critically Fast Pick-and-Place with Suction Cups. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019. [Google Scholar]
  57. D’Avella, S.; Tripicchio, P.; Avizzano, C.A. A Study on Picking Objects in Cluttered Environments: Exploiting Depth Features for a Custom Low-Cost Universal Jamming Gripper. Robot. Comput.-Integr. Manuf. 2020, 63, 101888. [Google Scholar] [CrossRef]
  58. Li, X.; Li, N.; Tao, G.; Liu, H.; Kagawa, T. Experimental Comparison of Bernoulli Gripper and Vortex Gripper. Int. J. Precis. Eng. Manuf. 2015, 16, 2081–2090. [Google Scholar] [CrossRef]
  59. Peng, H.-S.; Liu, C.-Y.; Chen, C.-L. Dynamic Performance Analysis and Design of Vortex Array Grippers. Actuators 2022, 11, 137. [Google Scholar] [CrossRef]
  60. Guo, J.; Elgeneidy, K.; Xiang, C.; Lohse, N.; Justham, L.; Rossiter, J. Soft Pneumatic Grippers Embedded with Stretchable Electroadhesion. Smart Mater. Struct. 2018, 27, 055006. [Google Scholar] [CrossRef]
  61. Manes, L.; Fichera, S.; Fakhruldeen, H.; Cooper, A.I.; Paoletti, P. A Soft Cable Loop Based Gripper for Robotic Automation of Chemistry. Sci. Rep. 2024, 14, 8899. [Google Scholar] [CrossRef] [PubMed]
  62. Sinatra, N.R.; Teeple, C.B.; Vogt, D.M.; Parker, K.K.; Gruber, D.F.; Wood, R.J. Ultragentle Manipulation of Delicate Structures Using a Soft Robotic Gripper. Sci. Robot. 2019, 4, eaax5425. [Google Scholar] [CrossRef] [PubMed]
  63. Patni, S.P.; Stoudek, P.; Chlup, H.; Hoffmann, M. Evaluating Online Elasticity Estimation of Soft Objects Using Standard Robot Grippers. arXiv 2024, arXiv:2401.08298. [Google Scholar]
  64. Dai, H.; Lu, Z.; He, M.; Yang, C. Novel Gripper-Like Exoskeleton Design for Robotic Grasping Based on Learning from Demonstration. In Proceedings of the 2022 27th International Conference on Automation and Computing (ICAC), Manchester, UK, 7–9 September 2022. [Google Scholar]
  65. Song, D.; Ek, C.H.; Huebner, K.; Kragic, D. Embodiment-Specific Representation of Robot Grasping Using Graphical Models and Latent-Space Discretization. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, CA, USA, 25–30 September 2011. [Google Scholar]
  66. Wang, X.; Geiger, F.; Niculescu, V.; Magno, M.; Benini, L. Smarthand: Towards Embedded Smart Hands for Prosthetic and Robotic Applications. In Proceedings of the 2021 IEEE Sensors Applications Symposium (SAS), Virtual Event, 22–24 February 2021. [Google Scholar]
  67. Moore, C.H.; Corbin, S.F.; Mayr, R.; Shockley, K.; Silva, P.L.; Lorenz, T. Grasping Embodiment: Haptic Feedback for Artificial Limbs. Front. Neurorobotics 2021, 15, 662397. [Google Scholar] [CrossRef] [PubMed]
  68. Agarwal, A.; Uppal, S.; Shaw, K.; Pathak, D. Dexterous Functional Grasping. In Proceedings of the 7th Annual Conference on Robot Learning, New York, NY, USA, 13–15 October 2023. [Google Scholar]
  69. Attarian, M.; Asif, M.A.; Liu, J.; Hari, R.; Garg, A.; Gilitschenski, I.; Tompson, J. Geometry Matching for Multi-Embodiment Grasping. In Proceedings of the Conference on Robot Learning, New York, NY, USA, 13–15 October 2023. [Google Scholar]
  70. Li, Y.; Liu, B.; Geng, Y.; Li, P.; Yang, Y.; Zhu, Y.; Liu, T.; Huang, S. Grasp Multiple Objects with One Hand. IEEE Robot. Autom. Lett. 2024, 9, 4027–4034. [Google Scholar] [CrossRef]
  71. Shaw, K.; Agarwal, A.; Pathak, D. Leap Hand: Low-Cost, Efficient, and Anthropomorphic Hand for Robot Learning. arXiv 2023, arXiv:2309.06440. [Google Scholar]
  72. Makoviychuk, V.; Wawrzyniak, L.; Guo, Y.; Lu, M.; Storey, K.; Macklin, M.; Hoeller, D.; Rudin, N.; Allshire, A.; Handa, A. Isaac Gym: High Performance Gpu-Based Physics Simulation for Robot Learning. arXiv 2021, arXiv:2108.10470. [Google Scholar]
  73. Todorov, E.; Erez, T.; Tassa, Y. Mujoco: A Physics Engine for Model-Based Control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, Portugal, 7–12 October 2012. [Google Scholar]
  74. Wang, C.; Shi, H.; Wang, W.; Zhang, R.; Fei-Fei, L.; Liu, C.K. Dexcap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation. arXiv 2024, arXiv:2403.07788. [Google Scholar]
  75. Qin, Y.; Yang, W.; Huang, B.; Van Wyk, K.; Su, H.; Wang, X.; Chao, Y.-W.; Fox, D. Anyteleop: A General Vision-Based Dexterous Robot Arm-Hand Teleoperation System. arXiv 2023, arXiv:2307.04577. [Google Scholar]
  76. Xu, Y.; Wan, W.; Zhang, J.; Liu, H.; Shan, Z.; Shen, H.; Wang, R.; Geng, H.; Weng, Y.; Chen, J. Unidexgrasp: Universal Robotic Dexterous Grasping Via Learning Diverse Proposal Generation and Goal-Conditioned Policy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 18–24 June 2023. [Google Scholar]
  77. Chao, Y.-W.; Paxton, C.; Xiang, Y.; Yang, W.; Sundaralingam, B.; Chen, T.; Murali, A.; Cakmak, M.; Fox, D. Handoversim: A Simulation Framework and Benchmark for Human-to-Robot Object Handovers. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022. [Google Scholar]
  78. Rajeswaran, A.; Kumar, V.; Gupta, A.; Vezzani, G.; Schulman, J.; Todorov, E.; Levine, S. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. arXiv 2017, arXiv:1709.10087. [Google Scholar]
  79. Jauhri, S.; Lueth, S.; Chalvatzaki, G. Active-Perceptive Motion Generation for Mobile Manipulation. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 20–24 May 2024. [Google Scholar]
  80. Ulloa, C.C.; Domínguez, D.; Barrientos, A.; del Cerro, J. Design and Mixed-Reality Teleoperation of a Quadruped-Manipulator Robot for Sar Tasks. In Proceedings of the Climbing and Walking Robots Conference, Virtual Event, 7–9 September 2022. [Google Scholar]
  81. Gu, J.; Chaplot, D.S.; Su, H.; Malik, J. Multi-Skill Mobile Manipulation for Object Rearrangement. arXiv 2022, arXiv:2209.02778. [Google Scholar]
  82. Shafiullah, N.M.M.; Rai, A.; Etukuru, H.; Liu, Y.; Misra, I.; Chintala, S.; Pinto, L. On Bringing Robots Home. arXiv 2023, arXiv:2311.16098. [Google Scholar]
  83. Bharadhwaj, H.; Mottaghi, R.; Gupta, A.; Tulsiani, S. Track2act: Predicting Point Tracks from Internet Videos Enables Diverse Zero-Shot Robot Manipulation. arXiv 2024, arXiv:2405.01527. [Google Scholar]
  84. Zhang, J.; Gireesh, N.; Wang, J.; Fang, X.; Xu, C.; Chen, W.; Dai, L.; Wang, H. Gamma: Graspability-Aware Mobile Manipulation Policy Learning Based on Online Grasping Pose Fusion. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 20–24 May 2024. [Google Scholar]
  85. Shen, B.; Xia, F.; Li, C.; Martín-Martín, R.; Fan, L.; Wang, G.; Pérez-D’Arpino, C.; Buch, S.; Srivastava, S.; Tchapmi, L. Igibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021. [Google Scholar]
  86. Li, C.; Xia, F.; Martín-Martín, R.; Lingelbach, M.; Srivastava, S.; Shen, B.; Vainio, K.; Gokmen, C.; Dharan, G.; Jain, T. Igibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks. arXiv 2021, arXiv:2108.03272. [Google Scholar]
  87. Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J. Habitat: A Platform for Embodied Ai Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  88. Szot, A.; Clegg, A.; Undersander, E.; Wijmans, E.; Zhao, Y.; Turner, J.; Maestre, N.; Mukadam, M.; Chaplot, D.S.; Maksymets, O. Habitat 2.0: Training Home Assistants to Rearrange Their Habitat. Adv. Neural Inf. Process. Syst. 2021, 34, 251–266. [Google Scholar]
  89. Kolve, E.; Mottaghi, R.; Han, W.; VanderBilt, E.; Weihs, L.; Herrasti, A.; Deitke, M.; Ehsani, K.; Gordon, D.; Zhu, Y. Ai2-Thor: An Interactive 3d Environment for Visual Ai. arXiv 2017, arXiv:1712.05474. [Google Scholar]
  90. Fu, Z.; Zhao, T.Z.; Finn, C. Mobile Aloha: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation. arXiv 2024, arXiv:2401.02117. [Google Scholar]
  91. Marew, D.; Perera, N.; Yu, S.; Roelker, S.; Kim, D. A Biomechanics-Inspired Approach to Soccer Kicking for Humanoid Robots. arXiv 2024, arXiv:2407.14612. [Google Scholar]
  92. Bertrand, S.; Penco, L.; Anderson, D.; Calvert, D.; Roy, V.; McCrory, S.; Mohammed, K.; Sanchez, S.; Griffith, W.; Morfey, S. High-Speed and Impact Resilient Teleoperation of Humanoid Robots. arXiv 2024, arXiv:2409.04639. [Google Scholar]
  93. Guo, Y.; Wang, Y.-J.; Zha, L.; Jiang, Z.; Chen, J. Doremi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment. arXiv 2023, arXiv:2307.00329. [Google Scholar]
  94. Malik, A.A.; Masood, T.; Brem, A. Intelligent Humanoids in Manufacturing to Address Worker Shortage and Skill Gaps: Case of Tesla Optimus. arXiv 2023, arXiv:2304.04949. [Google Scholar]
  95. Feng, S.; Whitman, E.; Xinjilefu, X.; Atkeson, C.G. Optimization Based Full Body Control for the Atlas Robot. In Proceedings of the 2014 IEEE-RAS International Conference on Humanoid Robots, Madrid, Spain, 18–20 November 2014. [Google Scholar]
  96. Cheng, X.; Ji, Y.; Chen, J.; Yang, R.; Yang, G.; Wang, X. Expressive Whole-Body Control for Humanoid Robots. arXiv 2024, arXiv:2402.16796. [Google Scholar]
  97. Yang, S.; Chen, H.; Fu, Z.; Zhang, W. Force-Feedback Based Whole-Body Stabilizer for Position-Controlled Humanoid Robots. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021. [Google Scholar]
  98. Zeng, F.; Gan, W.; Wang, Y.; Liu, N.; Yu, P.S. Large Language Models for Robotics: A Survey. arXiv 2023, arXiv:2311.07226. [Google Scholar]
  99. Chernyadev, N.; Backshall, N.; Ma, X.; Lu, Y.; Seo, Y.; James, S. Bigym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark. arXiv 2024, arXiv:2407.07788. [Google Scholar]
  100. Mahmood, N.; Ghorbani, N.; Troje, N.F.; Pons-Moll, G.; Black, M.J. Amass: Archive of Motion Capture as Surface Shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 27 October–2 November 2019. [Google Scholar]
  101. Yang, S.; Liu, M.; Qin, Y.; Ding, R.; Li, J.; Cheng, X.; Yang, R.; Yi, S.; Wang, X. Ace: A Cross-Platform Visual-Exoskeletons System for Low-Cost Dexterous Teleoperation. arXiv 2024, arXiv:2408.11805. [Google Scholar]
  102. Cheng, X.; Li, J.; Yang, S.; Yang, G.; Wang, X. Open-Television: Teleoperation with Immersive Active Visual Feedback. arXiv 2024, arXiv:2407.01512. [Google Scholar]
  103. Fu, Z.; Zhao, Q.; Wu, Q.; Wetzstein, G.; Finn, C. Humanplus: Humanoid Shadowing and Imitation from Humans. arXiv 2024, arXiv:2406.10454. [Google Scholar]
  104. Devlin, J. Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  105. Radford, A. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 30 October 2024).
  106. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  107. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S. Palm: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
  108. Massey, P.A.; Montgomery, C.; Zhang, A.S. Comparison of Chatgpt–3.5, Chatgpt-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. JAAOS-J. Am. Acad. Orthop. Surg. 2023, 31, 1173–1179. [Google Scholar] [CrossRef] [PubMed]
  109. Hu, H.; Shang, Y.; Xu, G.; He, C.; Zhang, Q. Can Gpt-O1 Kill All Bugs? arXiv 2024, arXiv:2409.10033. [Google Scholar]
  110. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  111. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25. Available online: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf (accessed on 30 October 2024). [CrossRef]
  112. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3d Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  113. Lyu, H.; Sha, N.; Qin, S.; Yan, M.; Xie, Y.; Wang, R. Advances in Neural Information Processing Systems. Adv. Neural Inf. Process. Syst. 2019, 32. Available online: https://par.nsf.gov/servlets/purl/10195511 (accessed on 30 October 2024).
  114. Dosovitskiy, A. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  115. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 14–19 June 2020. [Google Scholar]
  116. Jain, J.; Li, J.; Chiu, M.T.; Hassani, A.; Orlov, N.; Shi, H. Oneformer: One Transformer to Rule Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  117. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
  118. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A. Dinov2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  119. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  120. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar]
  121. Ravi, N.; Gabeur, V.; Hu, Y.-T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L. Sam 2: Segment Anything in Images and Videos. arXiv 2024, arXiv:2408.00714. [Google Scholar]
  122. Ranzinger, M.; Heinrich, G.; Kautz, J.; Molchanov, P. Am-Radio: Agglomerative Model—Reduce All Domains into One. arXiv 2023, arXiv:2312.06709. [Google Scholar]
  123. Shang, J.; Schmeckpeper, K.; May, B.B.; Minniti, M.V.; Kelestemur, T.; Watkins, D.; Herlant, L. Theia: Distilling Diverse Vision Foundation Models for Robot Learning. arXiv 2024, arXiv:2407.20179. [Google Scholar]
  124. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling up Visual and Vision-Language Representation Learning with Noisy Text Supervision. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021. [Google Scholar]
  125. Girdhar, R.; El-Nouby, A.; Liu, Z.; Singh, M.; Alwala, K.V.; Joulin, A.; Misra, I. Imagebind: One Embedding Space to Bind Them All. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–23 June 2023. [Google Scholar]
  126. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
  127. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M. Flamingo: A Visual Language Model for Few-Shot Learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
  128. Wang, J.; Yang, Z.; Hu, X.; Li, L.; Lin, K.; Gan, Z.; Liu, Z.; Liu, C.; Wang, L. Git: A Generative Image-to-Text Transformer for Vision and Language. arXiv 2022, arXiv:2205.14100. [Google Scholar]
  129. Su, Y.; Lan, T.; Li, H.; Xu, J.; Wang, Y.; Cai, D. Pandagpt: One Model to Instruction-Follow Them All. arXiv 2023, arXiv:2305.16355. [Google Scholar]
  130. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
  131. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F. Llama: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  132. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  133. Peng, Z.; Wang, W.; Dong, L.; Hao, Y.; Huang, S.; Ma, S.; Wei, F. Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv 2023, arXiv:2306.14824. [Google Scholar]
  134. Ge, C.; Cheng, S.; Wang, Z.; Yuan, J.; Gao, Y.; Song, J.; Song, S.; Huang, G.; Zheng, B. Convllava: Hierarchical Backbones as Visual Encoder for Large Multimodal Models. arXiv 2024, arXiv:2405.15738. [Google Scholar]
  135. Dhariwal, P.; Nichol, A. Diffusion Models Beat Gans on Image Synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  136. Epstein, D.; Jabri, A.; Poole, B.; Efros, A.; Holynski, A. Diffusion Self-Guidance for Controllable Image Generation. Adv. Neural Inf. Process. Syst. 2023, 36, 16222–16239. [Google Scholar]
  137. Clark, K.; Jaini, P. Text-to-Image Diffusion Models Are Zero Shot Classifiers. Adv. Neural Inf. Process. Syst. 2024, 36. Available online: https://proceedings.neurips.cc/paper_files/paper/2023/file/b87bdcf963cad3d0b265fcb78ae7d11e-Paper-Conference.pdf (accessed on 30 October 2024).
  138. Li, R.; Li, W.; Yang, Y.; Wei, H.; Jiang, J.; Bai, Q. Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation. Neural Comput. Appl. 2023, 36, 17245–17260. [Google Scholar] [CrossRef]
  139. Reddy, M.D.M.; Basha, M.S.M.; Hari, M.M.C.; Penchalaiah, M.N. Dall-E: Creating Images from Text. UGC Care Group I J. 2021, 8, 71–75. [Google Scholar]
  140. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with Clip Latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  141. Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv 2021, arXiv:2112.10741. [Google Scholar]
  142. Gafni, O.; Polyak, A.; Ashual, O.; Sheynin, S.; Parikh, D.; Taigman, Y. Make-a-Scene: Scene-Based Text-to-Image Generation with Human Priors. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
143. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
  144. Yu, J.; Xu, Y.; Koh, J.Y.; Luong, T.; Baid, G.; Wang, Z.; Vasudevan, V.; Ku, A.; Yang, Y.; Ayan, B.K. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. arXiv 2022, arXiv:2206.10789. [Google Scholar]
  145. Jin, Y.; Sun, Z.; Xu, K.; Chen, L.; Jiang, H.; Huang, Q.; Song, C.; Liu, Y.; Zhang, D.; Song, Y. Video-Lavit: Unified Video-Language Pre-Training with Decoupled Visual-Motional Tokenization. arXiv 2024, arXiv:2402.03161. [Google Scholar]
  146. Brooks, T.; Peebles, B.; Holmes, C.; DePue, W.; Guo, Y.; Jing, L.; Schnurr, D.; Taylor, J.; Luhman, T.; Luhman, E. Video Generation Models as World Simulators. 2024. Available online: https://openai.com/index/video-generation-models-as-world-simulators (accessed on 1 November 2024).
  147. Nair, S.; Rajeswaran, A.; Kumar, V.; Finn, C.; Gupta, A. R3m: A Universal Visual Representation for Robot Manipulation. arXiv 2022, arXiv:2203.12601. [Google Scholar]
  148. Ma, Y.J.; Sodhani, S.; Jayaraman, D.; Bastani, O.; Kumar, V.; Zhang, A. Vip: Towards Universal Visual Reward and Representation Via Value-Implicit Pre-Training. arXiv 2022, arXiv:2210.00030. [Google Scholar]
  149. Majumdar, A.; Yadav, K.; Arnaud, S.; Ma, J.; Chen, C.; Silwal, S.; Jain, A.; Berges, V.-P.; Wu, T.; Vakil, J. Where Are We in the Search for an Artificial Visual Cortex for Embodied Intelligence? Adv. Neural Inf. Process. Syst. 2023, 36, 655–677. [Google Scholar]
  150. Karamcheti, S.; Nair, S.; Chen, A.S.; Kollar, T.; Finn, C.; Sadigh, D.; Liang, P. Language-Driven Representation Learning for Robotics. arXiv 2023, arXiv:2302.12766. [Google Scholar]
  151. Wu, H.; Jing, Y.; Cheang, C.; Chen, G.; Xu, J.; Li, X.; Liu, M.; Li, H.; Kong, T. Unleashing Large-Scale Video Generative Pre-Training for Visual Robot Manipulation. arXiv 2023, arXiv:2312.13139. [Google Scholar]
  152. Cheang, C.-L.; Chen, G.; Jing, Y.; Kong, T.; Li, H.; Li, Y.; Liu, Y.; Wu, H.; Xu, J.; Yang, Y. Gr-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. arXiv 2024, arXiv:2410.06158. [Google Scholar]
  153. Lin, X.; So, J.; Mahalingam, S.; Liu, F.; Abbeel, P. Spawnnet: Learning Generalizable Visuomotor Skills from Pre-Trained Network. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  154. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. Imagenet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  155. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft Coco: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Cham, Switzerland, 2014. [Google Scholar]
  156. Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X. Ego4d: Around the World in 3,000 Hours of Egocentric Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  157. Damen, D.; Doughty, H.; Farinella, G.M.; Fidler, S.; Furnari, A.; Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W. The Epic-Kitchens Dataset: Collection, Challenges and Baselines. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4125–4141. [Google Scholar] [CrossRef] [PubMed]
  158. Carreira, J.; Noland, E.; Hillier, C.; Zisserman, A. A Short Note on the Kinetics-700 Human Action Dataset. arXiv 2019, arXiv:1907.06987. [Google Scholar]
  159. Mahler, J.; Liang, J.; Niyaz, S.; Laskey, M.; Doan, R.; Liu, X.; Ojea, J.A.; Goldberg, K. Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics. arXiv 2017, arXiv:1703.09312. [Google Scholar]
160. Guo, D.; Sun, F.; Liu, H.; Kong, T.; Fang, B.; Xi, N. A Hybrid Deep Architecture for Robotic Grasp Detection. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017. [Google Scholar]
  161. Zhou, X.; Lan, X.; Zhang, H.; Tian, Z.; Zhang, Y.; Zheng, N. Fully Convolutional Grasp Detection Network with Oriented Anchor Box. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018. [Google Scholar]
  162. Zhu, X.; Sun, L.; Fan, Y.; Tomizuka, M. 6-Dof Contrastive Grasp Proposal Network. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021. [Google Scholar]
  163. Zhu, X.; Zhou, Y.; Fan, Y.; Sun, L.; Chen, J.; Tomizuka, M. Learn to Grasp with Less Supervision: A Data-Efficient Maximum Likelihood Grasp Sampling Loss. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022. [Google Scholar]
  164. Chisari, E.; Heppert, N.; Welschehold, T.; Burgard, W.; Valada, A. Centergrasp: Object-Aware Implicit Representation Learning for Simultaneous Shape Reconstruction and 6-Dof Grasp Estimation. IEEE Robot. Autom. Lett. 2024, 9, 5094–5101. [Google Scholar] [CrossRef]
  165. Chen, S.; Tang, W.; Xie, P.; Yang, W.; Wang, G. Efficient Heatmap-Guided 6-Dof Grasp Detection in Cluttered Scenes. IEEE Robot. Autom. Lett. 2023, 8, 4895–4902. [Google Scholar] [CrossRef]
  166. Ma, H.; Shi, M.; Gao, B.; Huang, D. Generalizing 6-Dof Grasp Detection Via Domain Prior Knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  167. Lu, Y.; Deng, B.; Wang, Z.; Zhi, P.; Li, Y.; Wang, S. Hybrid Physical Metric for 6-Dof Grasp Pose Detection. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022. [Google Scholar]
  168. Ma, H.; Huang, D. Towards Scale Balanced 6-Dof Grasp Detection in Cluttered Scenes. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023. [Google Scholar]
  169. Zhai, G.; Huang, D.; Wu, S.-C.; Jung, H.; Di, Y.; Manhardt, F.; Tombari, F.; Navab, N.; Busam, B. Monograspnet: 6-Dof Grasping with a Single Rgb Image. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar]
  170. Shi, J.; Yong, A.; Jin, Y.; Li, D.; Niu, H.; Jin, Z.; Wang, H. Asgrasp: Generalizable Transparent Object Reconstruction and 6-Dof Grasp Detection from Rgb-D Active Stereo Camera. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  171. Ichnowski, J.; Avigal, Y.; Kerr, J.; Goldberg, K. Dex-Nerf: Using a Neural Radiance Field to Grasp Transparent Objects. arXiv 2021, arXiv:2110.14217. [Google Scholar]
  172. Dai, Q.; Zhu, Y.; Geng, Y.; Ruan, C.; Zhang, J.; Wang, H. Graspnerf: Multiview-Based 6-Dof Grasp Detection for Transparent and Specular Objects Using Generalizable Nerf. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar]
  173. Tung, H.-Y.F.; Cheng, R.; Fragkiadaki, K. Learning Spatial Common Sense with Geometry-Aware Recurrent Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  174. Tung, H.-Y.F.; Xian, Z.; Prabhudesai, M.; Lal, S.; Fragkiadaki, K. 3d-Oes: Viewpoint-Invariant Object-Factorized Environment Simulators. arXiv 2020, arXiv:2011.06464. [Google Scholar]
  175. Chen, S.; Garcia, R.; Schmid, C.; Laptev, I. Polarnet: 3d Point Clouds for Language-Guided Robotic Manipulation. arXiv 2023, arXiv:2309.15596. [Google Scholar]
  176. Guhur, P.-L.; Chen, S.; Pinel, R.G.; Tapaswi, M.; Laptev, I.; Schmid, C. Instruction-Driven History-Aware Policies for Robotic Manipulations. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023. [Google Scholar]
  177. Shridhar, M.; Manuelli, L.; Fox, D. Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023. [Google Scholar]
  178. Tang, C.; Huang, D.; Ge, W.; Liu, W.; Zhang, H. Graspgpt: Leveraging Semantic Knowledge from a Large Language Model for Task-Oriented Grasping. IEEE Robot. Autom. Lett. 2023, 8, 7551–7558. [Google Scholar] [CrossRef]
  179. Guo, D.; Xiang, Y.; Zhao, S.; Zhu, X.; Tomizuka, M.; Ding, M.; Zhan, W. Phygrasp: Generalizing Robotic Grasping with Physics-Informed Large Multimodal Models. arXiv 2024, arXiv:2402.16836. [Google Scholar]
  180. Lu, Y.; Fan, Y.; Deng, B.; Liu, F.; Li, Y.; Wang, S. Vl-Grasp: A 6-Dof Interactive Grasp Policy for Language-Oriented Objects in Cluttered Indoor Scenes. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023. [Google Scholar]
  181. Meng, L.; Qi, Z.; Shuchang, L.; Chunlei, W.; Yujing, M.; Guangliang, C.; Chenguang, Y. Ovgnet: A Unified Visual-Linguistic Framework for Open-Vocabulary Robotic Grasping. arXiv 2024, arXiv:2407.13175. [Google Scholar]
  182. Nguyen, T.; Vu, M.N.; Vuong, A.; Nguyen, D.; Vo, T.; Le, N.; Nguyen, A. Open-Vocabulary Affordance Detection in 3d Point Clouds. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023. [Google Scholar]
  183. Ju, Y.; Hu, K.; Zhang, G.; Zhang, G.; Jiang, M.; Xu, H. Robo-Abc: Affordance Generalization Beyond Categories Via Semantic Correspondence for Robot Manipulation. arXiv 2024, arXiv:2401.07487. [Google Scholar]
  184. Kuang, Y.; Ye, J.; Geng, H.; Mao, J.; Deng, C.; Guibas, L.; Wang, H.; Wang, Y. Ram: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation. arXiv 2024, arXiv:2407.04689. [Google Scholar]
  185. Wang, Q.; Zhang, H.; Deng, C.; You, Y.; Dong, H.; Zhu, Y.; Guibas, L. Sparsedff: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation. arXiv 2023, arXiv:2310.16838. [Google Scholar]
  186. Shen, W.; Yang, G.; Yu, A.; Wong, J.; Kaelbling, L.P.; Isola, P. Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation. arXiv 2023, arXiv:2308.07931. [Google Scholar]
  187. Shorinwa, O.; Tucker, J.; Smith, A.; Swann, A.; Chen, T.; Firoozi, R.; Kennedy, M.D.; Schwager, M. Splat-Mover: Multi-Stage, Open-Vocabulary Robotic Manipulation Via Editable Gaussian Splatting. In Proceedings of the 8th Annual Conference on Robot Learning, Munich, Germany, 6–9 November 2024. [Google Scholar]
  188. Rashid, A.; Sharma, S.; Kim, C.M.; Kerr, J.; Chen, L.Y.; Kanazawa, A.; Goldberg, K. Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping. In Proceedings of the 7th Annual Conference on Robot Learning, London, UK, 5–8 November 2023. [Google Scholar]
  189. Li, Y.; Pathak, D. Object-Aware Gaussian Splatting for Robotic Manipulation. In Proceedings of the ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, Yokohama, Japan, 17 May 2024. [Google Scholar]
  190. Zheng, Y.; Chen, X.; Zheng, Y.; Gu, S.; Yang, R.; Jin, B.; Li, P.; Zhong, C.; Wang, Z.; Liu, L. Gaussiangrasper: 3d Language Gaussian Splatting for Open-Vocabulary Robotic Grasping. arXiv 2024, arXiv:2403.09637. [Google Scholar] [CrossRef]
  191. Ze, Y.; Yan, G.; Wu, Y.-H.; Macaluso, A.; Ge, Y.; Ye, J.; Hansen, N.; Li, L.E.; Wang, X. Gnfactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023. [Google Scholar]
192. Lu, G.; Zhang, S.; Wang, Z.; Liu, C.; Lu, J.; Tang, Y. Manigaussian: Dynamic Gaussian Splatting for Multi-Task Robotic Manipulation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  193. Zhang, T.; McCarthy, Z.; Jow, O.; Lee, D.; Chen, X.; Goldberg, K.; Abbeel, P. Deep Imitation Learning for Complex Manipulation Tasks from Virtual Reality Teleoperation. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018. [Google Scholar]
  194. Zhu, X.; Wang, D.; Su, G.; Biza, O.; Walters, R.; Platt, R. On Robot Grasp Learning Using Equivariant Models. Auton. Robot. 2023, 47, 1175–1193. [Google Scholar] [CrossRef]
  195. Torabi, F.; Warnell, G.; Stone, P. Behavioral Cloning from Observation. arXiv 2018, arXiv:1805.01954. [Google Scholar]
  196. Fang, H.; Fang, H.-S.; Wang, Y.; Ren, J.; Chen, J.; Zhang, R.; Wang, W.; Lu, C. Airexo: Low-Cost Exoskeletons for Learning Whole-Arm Manipulation in the Wild. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  197. Wen, C.; Lin, X.; So, J.; Chen, K.; Dou, Q.; Gao, Y.; Abbeel, P. Any-Point Trajectory Modeling for Policy Learning. arXiv 2023, arXiv:2401.00025. [Google Scholar]
  198. Chi, C.; Xu, Z.; Feng, S.; Cousineau, E.; Du, Y.; Burchfiel, B.; Tedrake, R.; Song, S. Diffusion Policy: Visuomotor Policy Learning Via Action Diffusion. Int. J. Robot. Res. 2023. [Google Scholar] [CrossRef]
  199. Pearce, T.; Rashid, T.; Kanervisto, A.; Bignell, D.; Sun, M.; Georgescu, R.; Macua, S.V.; Tan, S.Z.; Momennejad, I.; Hofmann, K. Imitating Human Behaviour with Diffusion Models. arXiv 2023, arXiv:2301.10677. [Google Scholar]
  200. Dass, S.; Ai, W.; Jiang, Y.; Singh, S.; Hu, J.; Zhang, R.; Stone, P.; Abbatematteo, B.; Martin-Martin, R. Telemoma: A Modular and Versatile Teleoperation System for Mobile Manipulation. arXiv 2024, arXiv:2403.07869. [Google Scholar]
  201. Mandlekar, A.; Nasiriany, S.; Wen, B.; Akinola, I.; Narang, Y.; Fan, L.; Zhu, Y.; Fox, D. Mimicgen: A Data Generation System for Scalable Robot Learning Using Human Demonstrations. arXiv 2023, arXiv:2310.17596. [Google Scholar]
  202. Teoh, E.; Patidar, S.; Ma, X.; James, S. Green Screen Augmentation Enables Scene Generalisation in Robotic Manipulation. arXiv 2024, arXiv:2407.07868. [Google Scholar]
  203. Chen, Z.; Kiami, S.; Gupta, A.; Kumar, V. Genaug: Retargeting Behaviors to Unseen Situations Via Generative Augmentation. arXiv 2023, arXiv:2302.06671. [Google Scholar]
  204. Liu, L.; Wang, W.; Han, Y.; Xie, Z.; Yi, P.; Li, J.; Qin, Y.; Lian, W. Foam: Foresight-Augmented Multi-Task Imitation Policy for Robotic Manipulation. arXiv 2024, arXiv:2409.19528. [Google Scholar]
  205. Niu, H.; Chen, Q.; Liu, T.; Li, J.; Zhou, G.; Zhang, Y.; Hu, J.; Zhan, X. Xted: Cross-Domain Policy Adaptation Via Diffusion-Based Trajectory Editing. arXiv 2024, arXiv:2409.08687. [Google Scholar]
  206. Ha, H.; Florence, P.; Song, S. Scaling up and Distilling Down: Language-Guided Robot Skill Acquisition. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023. [Google Scholar]
  207. Wang, L.; Ling, Y.; Yuan, Z.; Shridhar, M.; Bao, C.; Qin, Y.; Wang, B.; Xu, H.; Wang, X. Gensim: Generating Robotic Simulation Tasks Via Large Language Models. arXiv 2023, arXiv:2310.01361. [Google Scholar]
  208. Hua, P.; Liu, M.; Macaluso, A.; Lin, Y.; Zhang, W.; Xu, H.; Wang, L. Gensim2: Scaling Robot Data Generation with Multi-Modal and Reasoning Llms. arXiv 2024, arXiv:2410.03645. [Google Scholar]
  209. Lynch, C.; Sermanet, P. Language Conditioned Imitation Learning over Unstructured Data. arXiv 2020, arXiv:2005.07648. [Google Scholar]
  210. Mees, O.; Hermann, L.; Burgard, W. What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data. IEEE Robot. Autom. Lett. 2022, 7, 11205–11212. [Google Scholar] [CrossRef]
  211. Li, J.; Gao, Q.; Johnston, M.; Gao, X.; He, X.; Shakiah, S.; Shi, H.; Ghanadan, R.; Wang, W.Y. Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-Task Fine-Tuning. arXiv 2023, arXiv:2310.09676. [Google Scholar]
  212. Bousmalis, K.; Vezzani, G.; Rao, D.; Devin, C.; Lee, A.X.; Bauza, M.; Davchev, T.; Zhou, Y.; Gupta, A.; Raju, A. Robocat: A Self-Improving Foundation Agent for Robotic Manipulation. arXiv 2023, arXiv:2306.11706. [Google Scholar]
  213. Bharadhwaj, H.; Vakil, J.; Sharma, M.; Gupta, A.; Tulsiani, S.; Kumar, V. Roboagent: Generalization and Efficiency in Robot Manipulation Via Semantic Augmentations and Action Chunking. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  214. Khazatsky, A.; Pertsch, K.; Nair, S.; Balakrishna, A.; Dasari, S.; Karamcheti, S.; Nasiriany, S.; Srirama, M.K.; Chen, L.Y.; Ellis, K. Droid: A Large-Scale in-the-Wild Robot Manipulation Dataset. arXiv 2024, arXiv:2403.12945. [Google Scholar]
  215. Khandelwal, A.; Weihs, L.; Mottaghi, R.; Kembhavi, A. Simple but Effective: Clip Embeddings for Embodied Ai. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  216. Lin, F.; Hu, Y.; Sheng, P.; Wen, C.; You, J.; Gao, Y. Data Scaling Laws in Imitation Learning for Robotic Manipulation. arXiv 2024, arXiv:2410.18647. [Google Scholar]
  217. Yenamandra, S.; Ramachandran, A.; Yadav, K.; Wang, A.; Khanna, M.; Gervet, T.; Yang, T.-Y.; Jain, V.; Clegg, A.W.; Turner, J. Homerobot: Open-Vocabulary Mobile Manipulation. arXiv 2023, arXiv:2306.11565. [Google Scholar]
  218. Jain, V.; Attarian, M.; Joshi, N.J.; Wahid, A.; Driess, D.; Vuong, Q.; Sanketi, P.R.; Sermanet, P.; Welker, S.; Chan, C. Vid2robot: End-to-End Video-Conditioned Policy Learning with Cross-Attention Transformers. arXiv 2024, arXiv:2403.12943. [Google Scholar]
  219. Shridhar, M.; Manuelli, L.; Fox, D. Cliport: What and Where Pathways for Robotic Manipulation. In Proceedings of the Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022. [Google Scholar]
  220. Jiang, Y.; Gupta, A.; Zhang, Z.; Wang, G.; Dou, Y.; Chen, Y.; Fei-Fei, L.; Anandkumar, A.; Zhu, Y.; Fan, L. Vima: General Robot Manipulation with Multimodal Prompts. arXiv 2022, arXiv:2210.03094. [Google Scholar]
  221. Ehsani, K.; Gupta, T.; Hendrix, R.; Salvador, J.; Weihs, L.; Zeng, K.-H.; Singh, K.P.; Kim, Y.; Han, W.; Herrasti, A. Spoc: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  222. Zeng, J.; Bu, Q.; Wang, B.; Xia, W.; Chen, L.; Dong, H.; Song, H.; Wang, D.; Hu, D.; Luo, P. Learning Manipulation by Predicting Interaction. arXiv 2024, arXiv:2406.00439. [Google Scholar]
  223. Gupta, G.; Yadav, K.; Gal, Y.; Batra, D.; Kira, Z.; Lu, C.; Rudner, T.G. Pre-Trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control. arXiv 2024, arXiv:2405.05852. [Google Scholar]
  224. Liu, S.; Wu, L.; Li, B.; Tan, H.; Chen, H.; Wang, Z.; Xu, K.; Su, H.; Zhu, J. Rdt-1b: A Diffusion Foundation Model for Bimanual Manipulation. arXiv 2024, arXiv:2410.07864. [Google Scholar]
  225. Zakka, K.; Wu, P.; Smith, L.; Gileadi, N.; Howell, T.; Peng, X.B.; Singh, S.; Tassa, Y.; Florence, P.; Zeng, A. Robopianist: Dexterous Piano Playing with Deep Reinforcement Learning. arXiv 2023, arXiv:2304.04150. [Google Scholar]
  226. Lillicrap, T. Continuous Control with Deep Reinforcement Learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  227. Mnih, V. Asynchronous Methods for Deep Reinforcement Learning. arXiv 2016, arXiv:1602.01783. [Google Scholar]
  228. Han, M.; Zhang, L.; Wang, J.; Pan, W. Actor-Critic Reinforcement Learning for Control with Stability Guarantee. IEEE Robot. Autom. Lett. 2020, 5, 6217–6224. [Google Scholar] [CrossRef]
  229. Gu, S.; Holly, E.; Lillicrap, T.P.; Levine, S. Deep Reinforcement Learning for Robotic Manipulation. arXiv 2016, arXiv:1610.00633. [Google Scholar]
  230. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  231. Yuan, H.; Zhou, B.; Fu, Y.; Lu, Z. Cross-Embodiment Dexterous Grasping with Reinforcement Learning. arXiv 2024, arXiv:2410.02479. [Google Scholar]
  232. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  233. Zeng, A.; Song, S.; Welker, S.; Lee, J.; Rodriguez, A.; Funkhouser, T. Learning Synergies between Pushing and Grasping with Self-Supervised Deep Reinforcement Learning. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018. [Google Scholar]
  234. Berscheid, L.; Meißner, P.; Kröger, T. Robot Learning of Shifting Objects for Grasping in Cluttered Environments. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, SAR, China, 3–8 November 2019. [Google Scholar]
  235. Zuo, Y.; Qiu, W.; Xie, L.; Zhong, F.; Wang, Y.; Yuille, A.L. Craves: Controlling Robotic Arm with a Vision-Based Economic System. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  236. Bıyık, E.; Losey, D.P.; Palan, M.; Landolfi, N.C.; Shevchuk, G.; Sadigh, D. Learning Reward Functions from Diverse Sources of Human Feedback: Optimally Integrating Demonstrations and Preferences. Int. J. Robot. Res. 2022, 41, 45–67. [Google Scholar] [CrossRef]
  237. Cabi, S.; Colmenarejo, S.G.; Novikov, A.; Konyushkova, K.; Reed, S.; Jeong, R.; Zolna, K.; Aytar, Y.; Budden, D.; Vecerik, M. A Framework for Data-Driven Robotics. arXiv 2019, arXiv:1909.12200. [Google Scholar]
  238. Ibarz, B.; Leike, J.; Pohlen, T.; Irving, G.; Legg, S.; Amodei, D. Reward Learning from Human Preferences and Demonstrations in Atari. Adv. Neural Inf. Process. Syst. 2018, 31. Available online: https://proceedings.neurips.cc/paper_files/paper/2018/file/8cbe9ce23f42628c98f80fa0fac8b19a-Paper.pdf (accessed on 30 October 2024).
  239. Xie, T.; Zhao, S.; Wu, C.H.; Liu, Y.; Luo, Q.; Zhong, V.; Yang, Y.; Yu, T. Text2reward: Reward Shaping with Language Models for Reinforcement Learning. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  240. Ma, Y.J.; Liang, W.; Wang, G.; Huang, D.-A.; Bastani, O.; Jayaraman, D.; Zhu, Y.; Fan, L.; Anandkumar, A. Eureka: Human-Level Reward Design Via Coding Large Language Models. arXiv 2023, arXiv:2310.12931. [Google Scholar]
  241. Zhao, X.; Weber, C.; Wermter, S. Agentic Skill Discovery. arXiv 2024, arXiv:2405.15019. [Google Scholar]
  242. Xiong, H.; Mendonca, R.; Shaw, K.; Pathak, D. Adaptive Mobile Manipulation for Articulated Objects in the Open World. arXiv 2024, arXiv:2401.14403. [Google Scholar]
  243. Zhang, Z.; Li, Y.; Bastani, O.; Gupta, A.; Jayaraman, D.; Ma, Y.J.; Weihs, L. Universal Visual Decomposer: Long-Horizon Manipulation Made Easy. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  244. Yang, J.; Mark, M.S.; Vu, B.; Sharma, A.; Bohg, J.; Finn, C. Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  245. Liu, F.; Fang, K.; Abbeel, P.; Levine, S. Moka: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting. arXiv 2024, arXiv:2403.03174. [Google Scholar]
  246. Ye, W.; Zhang, Y.; Weng, H.; Gu, X.; Wang, S.; Zhang, T.; Wang, M.; Abbeel, P.; Gao, Y. Reinforcement Learning with Foundation Priors: Let Embodied Agent Efficiently Learn on Its Own. In Proceedings of the 8th Annual Conference on Robot Learning, Munich, Germany, 6–9 November 2024. [Google Scholar]
  247. Seo, Y.; Hafner, D.; Liu, H.; Liu, F.; James, S.; Lee, K.; Abbeel, P. Masked World Models for Visual Control. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023. [Google Scholar]
  248. Huang, W.; Wang, C.; Zhang, R.; Li, Y.; Wu, J.; Fei-Fei, L. Voxposer: Composable 3d Value Maps for Robotic Manipulation with Language Models. arXiv 2023, arXiv:2307.05973. [Google Scholar]
  249. Ma, Y.J.; Kumar, V.; Zhang, A.; Bastani, O.; Jayaraman, D. Liv: Language-Image Representations and Rewards for Robotic Control. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  250. Wang, Y.; Sun, Z.; Zhang, J.; Xian, Z.; Biyik, E.; Held, D.; Erickson, Z. Rl-Vlm-F: Reinforcement Learning from Vision Language Foundation Model Feedback. arXiv 2024, arXiv:2402.03681. [Google Scholar]
  251. Yu, W.; Gileadi, N.; Fu, C.; Kirmani, S.; Lee, K.-H.; Arenas, M.G.; Chiang, H.-T.L.; Erez, T.; Hasenclever, L.; Humplik, J. Language to Rewards for Robotic Skill Synthesis. arXiv 2023, arXiv:2306.08647. [Google Scholar]
  252. Adeniji, A.; Xie, A.; Sferrazza, C.; Seo, Y.; James, S.; Abbeel, P. Language Reward Modulation for Pretraining Reinforcement Learning. arXiv 2023, arXiv:2308.12270. [Google Scholar]
  253. Zeng, Y.; Mu, Y.; Shao, L. Learning Reward for Robot Skills Using Large Language Models Via Self-Alignment. arXiv 2024, arXiv:2405.07162. [Google Scholar]
  254. Fu, Y.; Zhang, H.; Wu, D.; Xu, W.; Boulet, B. Furl: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning. arXiv 2024, arXiv:2406.00645. [Google Scholar]
  255. Escontrela, A.; Adeniji, A.; Yan, W.; Jain, A.; Peng, X.B.; Goldberg, K.; Lee, Y.; Hafner, D.; Abbeel, P. Video Prediction Models as Rewards for Reinforcement Learning. Adv. Neural Inf. Process. Syst. 2024, 36. Available online: https://proceedings.neurips.cc/paper_files/paper/2023/file/d9042abf40782fbce28901c1c9c0e8d8-Paper-Conference.pdf (accessed on 30 October 2024).
  256. Huang, T.; Jiang, G.; Ze, Y.; Xu, H. Diffusion Reward: Learning Rewards Via Conditional Video Diffusion. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024. [Google Scholar]
257. Ding, Y.; Zhang, X.; Paxton, C.; Zhang, S. Task and Motion Planning with Large Language Models for Object Rearrangement. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023. [Google Scholar]
  258. Ajay, A.; Han, S.; Du, Y.; Li, S.; Gupta, A.; Jaakkola, T.; Tenenbaum, J.; Kaelbling, L.; Srivastava, A.; Agrawal, P. Compositional Foundation Models for Hierarchical Planning. Adv. Neural Inf. Process. Syst. 2024, 36. Available online: https://proceedings.neurips.cc/paper_files/paper/2023/file/46a126492ea6fb87410e55a58df2e189-Paper-Conference.pdf (accessed on 30 October 2024).
  259. Bu, Q.; Zeng, J.; Chen, L.; Yang, Y.; Zhou, G.; Yan, J.; Luo, P.; Cui, H.; Ma, Y.; Li, H. Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation. arXiv 2024, arXiv:2409.09016. [Google Scholar]
  260. Liang, J.; Xia, F.; Yu, W.; Zeng, A.; Arenas, M.G.; Attarian, M.; Bauza, M.; Bennice, M.; Bewley, A.; Dostmohamed, A. Learning to Learn Faster from Human Feedback with Language Model Predictive Control. arXiv 2024, arXiv:2402.11450. [Google Scholar]
  261. Liu, P.; Orru, Y.; Paxton, C.; Shafiullah, N.M.M.; Pinto, L. Ok-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics. arXiv 2024, arXiv:2401.12202. [Google Scholar]
  262. Wang, Z.; Cai, S.; Chen, G.; Liu, A.; Ma, X.; Liang, Y. Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents. arXiv 2023, arXiv:2302.01560. [Google Scholar]
  263. Dalal, M.; Chiruvolu, T.; Chaplot, D.; Salakhutdinov, R. Plan-Seq-Learn: Language Model Guided Rl for Solving Long Horizon Robotics Tasks. arXiv 2024, arXiv:2405.01534. [Google Scholar]
  264. Mu, Y.; Zhang, Q.; Hu, M.; Wang, W.; Ding, M.; Jin, J.; Wang, B.; Dai, J.; Qiao, Y.; Luo, P. Embodiedgpt: Vision-Language Pre-Training Via Embodied Chain of Thought. Adv. Neural Inf. Process. Syst. 2024, 36. Available online: https://proceedings.neurips.cc/paper_files/paper/2023/file/4ec43957eda1126ad4887995d05fae3b-Paper-Conference.pdf (accessed on 30 October 2024).
  265. Myers, V.; Zheng, B.C.; Mees, O.; Levine, S.; Fang, K. Policy Adaptation Via Language Optimization: Decomposing Tasks for Few-Shot Imitation. arXiv 2024, arXiv:2408.16228. [Google Scholar]
  266. Shi, L.X.; Hu, Z.; Zhao, T.Z.; Sharma, A.; Pertsch, K.; Luo, J.; Levine, S.; Finn, C. Yell at Your Robot: Improving on-the-Fly from Language Corrections. arXiv 2024, arXiv:2403.12910. [Google Scholar]
  267. Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Fu, C.; Gopalakrishnan, K.; Hausman, K. Do as I Can, Not as I Say: Grounding Language in Robotic Affordances. arXiv 2022, arXiv:2204.01691. [Google Scholar]
  268. Driess, D.; Xia, F.; Sajjadi, M.S.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T. Palm-E: An Embodied Multimodal Language Model. arXiv 2023, arXiv:2303.03378. [Google Scholar]
  269. Padmanabha, A.; Yuan, J.; Gupta, J.; Karachiwalla, Z.; Majidi, C.; Admoni, H.; Erickson, Z. Voicepilot: Harnessing Llms as Speech Interfaces for Physically Assistive Robots. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, Pittsburgh, PA, USA, 13–16 October 2024. [Google Scholar]
  270. Vemprala, S.H.; Bonatti, R.; Bucker, A.; Kapoor, A. Chatgpt for Robotics: Design Principles and Model Abilities. IEEE Access 2024, 12, 55682–55696. [Google Scholar] [CrossRef]
  271. Jin, Y.; Li, D.; Yong, A.; Shi, J.; Hao, P.; Sun, F.; Zhang, J.; Fang, B. Robotgpt: Robot Manipulation Learning from Chatgpt. IEEE Robot. Autom. Lett. 2024, 9, 2543–2550. [Google Scholar] [CrossRef]
  272. Wake, N.; Kanehira, A.; Sasabuchi, K.; Takamatsu, J.; Ikeuchi, K. Gpt-4v (Ision) for Robotics: Multimodal Task Planning from Human Demonstration. IEEE Robot. Autom. Lett. 2024, 9, 10567–10574. [Google Scholar] [CrossRef]
  273. Zhi, P.; Zhang, Z.; Han, M.; Zhang, Z.; Li, Z.; Jiao, Z.; Jia, B.; Huang, S. Closed-Loop Open-Vocabulary Mobile Manipulation with Gpt-4v. arXiv 2024, arXiv:2404.10220. [Google Scholar]
  274. Chu, K.; Zhao, X.; Weber, C.; Li, M.; Lu, W.; Wermter, S. Large Language Models for Orchestrating Bimanual Robots. arXiv 2024, arXiv:2404.02018. [Google Scholar]
  275. Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Dabis, J.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Hsu, J. Rt-1: Robotics Transformer for Real-World Control at Scale. arXiv 2022, arXiv:2212.06817. [Google Scholar]
  276. Huang, J.; Yong, S.; Ma, X.; Linghu, X.; Li, P.; Wang, Y.; Li, Q.; Zhu, S.-C.; Jia, B.; Huang, S. An Embodied Generalist Agent in 3d World. arXiv 2023, arXiv:2311.12871. [Google Scholar]
  277. Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Chen, X.; Choromanski, K.; Ding, T.; Driess, D.; Dubey, A.; Finn, C. Rt-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv 2023, arXiv:2307.15818. [Google Scholar]
278. Szot, A.; Schwarzer, M.; Agrawal, H.; Mazoure, B.; Metcalf, R.; Talbott, W.; Mackraz, N.; Hjelm, R.D.; Toshev, A.T. Large Language Models as Generalizable Policies for Embodied Tasks. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  279. Kim, M.J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Nair, S.; Rafailov, R.; Foster, E.; Lam, G.; Sanketi, P. Openvla: An Open-Source Vision-Language-Action Model. arXiv 2024, arXiv:2406.09246. [Google Scholar]
  280. Li, X.; Mata, C.; Park, J.; Kahatapitiya, K.; Jang, Y.S.; Shang, J.; Ranasinghe, K.; Burgert, R.; Cai, M.; Lee, Y.J. Llara: Supercharging Robot Learning Data for Vision-Language Policy. arXiv 2024, arXiv:2406.20095. [Google Scholar]
  281. Parekh, A.; Vitsakis, N.; Suglia, A.; Konstas, I. Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks. arXiv 2024, arXiv:2407.03967. [Google Scholar]
  282. Yue, Y.; Wang, Y.; Kang, B.; Han, Y.; Wang, S.; Song, S.; Feng, J.; Huang, G. Deer-Vla: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution. arXiv 2024, arXiv:2411.02359. [Google Scholar]
  283. Du, Y.; Yang, M.; Florence, P.; Xia, F.; Wahid, A.; Ichter, B.; Sermanet, P.; Yu, T.; Abbeel, P.; Tenenbaum, J.B. Video Language Planning. arXiv 2023, arXiv:2310.10625. [Google Scholar]
  284. Ko, P.-C.; Mao, J.; Du, Y.; Sun, S.-H.; Tenenbaum, J.B. Learning to Act from Actionless Videos through Dense Correspondences. arXiv 2023, arXiv:2310.08576. [Google Scholar]
  285. Liang, J.; Liu, R.; Ozguroglu, E.; Sudhakar, S.; Dave, A.; Tokmakov, P.; Song, S.; Vondrick, C. Dreamitate: Real-World Visuomotor Policy Learning Via Video Generation. arXiv 2024, arXiv:2406.16862. [Google Scholar]
  286. Du, Y.; Yang, S.; Dai, B.; Dai, H.; Nachum, O.; Tenenbaum, J.; Schuurmans, D.; Abbeel, P. Learning Universal Policies Via Text-Guided Video Generation. Adv. Neural Inf. Process. Syst. 2024, 36. Available online: https://proceedings.neurips.cc/paper_files/paper/2023/file/1d5b9233ad716a43be5c0d3023cb82d0-Paper-Conference.pdf (accessed on 30 October 2024).
  287. Li, P.; Wu, H.; Huang, Y.; Cheang, C.; Wang, L.; Kong, T. Gr-Mg: Leveraging Partially Annotated Data Via Multi-Modal Goal Conditioned Policy. arXiv 2024, arXiv:2408.14368. [Google Scholar] [CrossRef]
  288. Kwon, T.; Di Palo, N.; Johns, E. Language Models as Zero-Shot Trajectory Generators. IEEE Robot. Autom. Lett. 2024, 9, 6728–6735. [Google Scholar] [CrossRef]
  289. Xia, W.; Wang, D.; Pang, X.; Wang, Z.; Zhao, B.; Hu, D.; Li, X. Kinematic-Aware Prompting for Generalizable Articulated Object Manipulation with Llms. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  290. Chen, H.; Yao, Y.; Liu, R.; Liu, C.; Ichnowski, J. Automating Robot Failure Recovery Using Vision-Language Models with Optimized Prompts. arXiv 2024, arXiv:2409.03966. [Google Scholar]
  291. Di Palo, N.; Johns, E. Keypoint Action Tokens Enable in-Context Imitation Learning in Robotics. arXiv 2024, arXiv:2403.19578. [Google Scholar]
Figure 1. Main organizational framework of this article.
Figure 2. Embodied-foundation content.
Figure 3. Three-dimensional feature framework.
Figure 4. Three-dimensional scene reconstruction framework.
Figure 5. Data augmentation framework and application.
Figure 6. Feature extractor framework.
Figure 9. Classic framework for the comprehensive implementation of three methods.
Table 1. Summary of embodied robots, simulation platforms, datasets, and data acquisition methods.
Embodied robots
  Robotic arms: Franka [24], xArm series [25], UR series [26], ViperX [27]
  End effectors: Robotiq 2F-85 [63], Franka Emika Gripper [64], Allegro [69], Shadow [70], Leap [71]
  Mobile composite robots: Fetch Robotics [81], Hello Robot Stretch [82], Spot Arm [83], B1 and Z1 [84]
  Humanoid robots: Optimus [94], Atlas [95], H1 [96], Walker series [97], Expedition series [98]
Simulation platforms: Gazebo [28], PyBullet [29], SAPIEN [30], RoboSuite [31], ManiSkill series [32,33], RoboCasa [34], Isaac Sim [35], Isaac Gym [72], MuJoCo [73], iGibson series [85,86], Habitat series [87,88], AI2-THOR [89], BiGym [99]
Datasets: BridgeData V2 [37], RH20T [38], Open-X [39], RED [40], REGRAD [41], GraspNet-1Billion [42], Grasp-Anything [43], Transpose [44], PokeFlex [45], ClothesNet [46], SurgT [47], UniDexGrasp [76], Handversim [77], DAPG [78], AMASS [100]
Data acquisition methods: Self-made equipment [48,49,50,51,90], 3D SpaceMouse [52], RoboTurk [53], Data gloves [74], Camera [75,103], Exoskeleton system [101], VR [102]
Table 2. Representative pre-trained models by year and type (LLM, VFM, VLM, GLM, RDSM).
2018: LLM: BERT [104], GPT [105]
2019: LLM: T5 [106]
2020: (none)
2021: VFM: DINO [117]; VLM: CLIP [8]; GLM: DALL-E [139], GLIDE [141]
2022: LLM: PaLM [107], GPT-3.5 [108]; VLM: BLIP [126], Flamingo [127], GIT [128]; GLM: DALL-E 2 [140], Make-A-Scene [142], IMAGEN [143], Parti [144]; RDSM: MVP [12], R3M [147], VIP [148]
2023: LLM: GPT-4 [9]; VFM: DINOv2 [118], SAM [120], Am-radio [122]; VLM: BLIP-2 [10], PandaGPT [129], MiniGPT-4 [130], LLaVA [131], LLaVA2 [132], KOSMOS-2 [133], ConvLLaVA [134]; GLM: Video LaVIT [145]; RDSM: VC-1 [149], Voltron [150], GR-1 [151]
2024: LLM: GPT-o1 [109]; VFM: SAM2 [121], Theia [123]; GLM: Sora [146]; RDSM: GR-2 [152], SpawnNet [153]
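Many of the models in Table 2 are used downstream as frozen feature extractors rather than being fine-tuned end to end. The sketch below is a minimal illustration of that pattern, not the pipeline of any cited work: a frozen DINOv2 backbone (loaded through Hugging Face Transformers) feeds a small trainable policy head, where the head architecture and the 7-dimensional action output are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoImageProcessor, AutoModel

# Frozen pre-trained backbone (DINOv2 via Hugging Face Transformers).
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
backbone = AutoModel.from_pretrained("facebook/dinov2-base").eval()
for p in backbone.parameters():
    p.requires_grad = False  # the backbone stays frozen; only the head is trained

# Small trainable policy head (illustrative assumption: 6-DoF pose + gripper width).
policy_head = nn.Sequential(
    nn.Linear(backbone.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, 7),
)

def predict_action(image) -> torch.Tensor:
    """Map one RGB observation (PIL image) to a grasp action via frozen features."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = backbone(**inputs).last_hidden_state[:, 0]  # CLS token embedding
    return policy_head(feats)
```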
Table 3. Representative algorithms for embodied perception.
3D feature
  Semantic and 3D feature fusion: Polarnet [175], Hiveformer [176], PerAct [177], GraspGPT [178], PhyGrasp [179]
  Point cloud extraction: VL-Grasp [180], OVGNet [181]
  Affordance information: OpenAD [182], Robo-ABC [183], RAM [184]
3D scene reconstruction
  Based on traditional features: SparseDFF [185], F3RM [186], Splat-MOVER [187], LERF-TOGO [188]
  Based on instance segmentation: Object-Aware [189], GaussianGrasper [190]
  Based on diffusion models: GNFactor [191], ManiGaussian [192]
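To make the open-vocabulary flavor of the methods in Table 3 concrete, the following minimal sketch ranks candidate grasp regions against a language instruction with a frozen CLIP model. It is not the pipeline of any specific entry in the table: the upstream grasp detector that produces the image crops is assumed, and only the instruction-to-region matching step is shown.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_grasps_by_instruction(crops: list[Image.Image], instruction: str) -> int:
    """Return the index of the grasp-region crop that best matches the instruction."""
    inputs = processor(text=[instruction], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_crops, 1): similarity of each crop to the text.
    scores = out.logits_per_image.squeeze(-1)
    return int(scores.argmax())

# Usage: crops would be cut out around each candidate grasp from an upstream detector.
# best = rank_grasps_by_instruction(crops, "grasp the red mug by its handle")
```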
Table 4. Representative algorithms for embodied strategy.
Imitation learning
  Data augmentation
    Direct generation: GreenAug [202], GenAug [203], FoAM [204], xTED [205]
    Indirect generation: SUaDD [206], GenSim [207], GenSim2 [208]
  Feature extractor
    Text feature extractor: MCIL [209], HULC [210], MIDAS [211], RoboCat [212], RoboAgent [213], DROID [214]
    Visual feature extractor: EmbCLIP [215], UMI [48], DSL [216], HomeRobot [217], Vid2Robot [218]
    Text and visual feature extractor: CLIPort [219], VIMA [220], Open-TeleVision [102], SPOC [221], MPI [222], SCR [223], RDT [224]
Reinforcement learning
  Reward function calculation
    Generate reward function code: Text2Reward [239], Eureka [240], ASD [241]
    Provide reward signal: ALF [242], UVD [243], ROBOFUME [244], MOKA [245], RLFP [246]
  Reward function estimation
    Non-parametric estimation: MWM [247], VoxPoser [248], LIV [249], RL-VLM-F [250]
    Parametric estimation: CenterGrasp [251], LAMP [252], SARU [253], FuRL [254], VIPER [255], Diffusion Reward [256]
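The "generate reward function code" row of Table 4 (e.g., Text2Reward, Eureka) follows a common loop: an LLM proposes reward code, the code is executed in simulation, and the resulting score is fed back as a critique for the next proposal. The sketch below illustrates only that loop under loose assumptions; query_llm and run_episodes are toy placeholders standing in for a chat-model client and a simulator rollout, not any cited system's API.

```python
from typing import Callable

REWARD_PROMPT = (
    "You write Python reward functions for a tabletop grasping task.\n"
    "Observation keys: 'gripper_to_object' (float, metres), 'grasped' (bool).\n"
    "Return only a function `reward(obs, action) -> float`."
)

def query_llm(prompt: str) -> str:
    # Placeholder for a chat-model call; here it returns a fixed toy reward.
    return ("def reward(obs, action):\n"
            "    return 1.0 if obs['grasped'] else -obs['gripper_to_object']\n")

def run_episodes(reward_fn: Callable, n: int = 10) -> float:
    # Placeholder for simulator rollouts; would return the mean task success rate.
    return 0.0

def iterate_reward_design(rounds: int = 3) -> Callable:
    """LLM proposes reward code; the rollout score is fed back as a critique."""
    feedback, best_fn, best_score = "", None, -1.0
    for _ in range(rounds):
        code = query_llm(REWARD_PROMPT + "\n" + feedback)
        namespace: dict = {}
        exec(code, namespace)                 # assumes the generated code is trusted
        candidate = namespace["reward"]
        score = run_episodes(candidate)
        if score > best_score:
            best_fn, best_score = candidate, score
        feedback = f"The previous reward reached success rate {score:.2f}; improve it."
    return best_fn
```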
Table 5. Representative algorithms for embodied agent.
Hierarchical execution
  Low-level control strategy
    Traditional control: LLM-GROP [257], HIP [258], CLOVER [259], LMPC [260], OK-Robot [261]
    Strategy learning: DEPS [262], PSL [263], EmbodiedGPT [264], PALO [265], YAY Robot [266]
  Skills library
    Dynamic invocation: SayCan [267], PaLM-E [268]
    Direct invocation: VoicePilot [269], ChatGPT for Robotics [270], RobotGPT [271], G4R [272], COME-robot [273], LABOR [274]
Holistic execution
  Fine-tuning or training: RT-1 [275], LEO [276], RT-2 [277], LLaRP [278], OpenVLA [279], LLaRA [280], CoGeLoT [281], DeeR-VLA [282]
  Video and image prediction: VLP [283], DrM [284], Dreamitate [285], UniPi [286], GR-MG [287]
  Based on VLM: ZSTG [288], KaP [289], Chen et al. [290], KAT [291]
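For the skills-library branch of Table 5, hierarchical execution typically scores every skill in the library by how useful the language model judges it for the instruction and by how feasible it currently is, then executes the best-scoring skill. The sketch below illustrates that selection rule in the spirit of SayCan; llm_score and affordance_score are toy placeholders standing in for a real LLM likelihood and a learned affordance value, and the skill strings are illustrative.

```python
# Candidate skill library (illustrative).
SKILLS = ["pick up the cup", "open the drawer", "place the cup in the drawer", "terminate"]

def llm_score(instruction: str, history: list[str], skill: str) -> float:
    # Placeholder: a real system would score log p(skill | instruction, history).
    return 1.0 if any(word in instruction for word in skill.split()) else 0.1

def affordance_score(skill: str, observation: dict) -> float:
    # Placeholder: a real system would query a value function trained per skill.
    return observation.get(skill, 0.5)

def plan_next_skill(instruction: str, history: list[str], observation: dict) -> str:
    """Pick the skill that is both useful for the instruction and feasible now."""
    return max(SKILLS, key=lambda s: llm_score(instruction, history, s)
               * affordance_score(s, observation))

# Example: plan_next_skill("put the cup in the drawer", [], {"open the drawer": 0.9})
```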
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
