Review

Grasping in Shared Virtual Environments: Toward Realistic Human–Object Interaction Through Review-Based Modeling

Faculty of Telecommunications, Technical University of Sofia, 8 Kliment Ohridski Blvd., 1756 Sofia, Bulgaria
* Author to whom correspondence should be addressed.
Electronics 2025, 14(19), 3809; https://doi.org/10.3390/electronics14193809
Submission received: 14 August 2025 / Revised: 21 September 2025 / Accepted: 24 September 2025 / Published: 26 September 2025
(This article belongs to the Section Computer Science & Engineering)

Abstract

Virtual communication, involving the transmission of all human senses, is the next step in the development of telecommunications. Achieving this vision requires real-time data exchange with low latency, which in turn necessitates the implementation of the Tactile Internet (TI). TI will ensure the transmission of high-quality tactile data, especially when combined with audio and video signals, thus enabling more realistic interactions in virtual environments. In this context, advances in realism increasingly depend on the accurate simulation of the grasping process and hand–object interactions. To address this, in this paper, we methodically present the challenges of human–object interaction in virtual environments, together with a detailed review of the datasets used in grasping modeling and the integration of physics-based and machine learning approaches. Based on this review, we propose a multi-step framework that simulates grasping as a series of biomechanical, perceptual, and control processes. The proposed model aims to support realistic human interaction with virtual objects in immersive settings and to enable integration into applications such as remote manipulation, rehabilitation, and virtual learning.

1. Introduction

Grasping is one of the human motor actions that allows us to interact with and manipulate objects in our environment. Virtual communication is an important part of the developing modern world, combining various methods and algorithms, including intelligent systems for data analysis, visualization, and interaction. It is used as a substitute for physical presence, although it does not yet provide true real-time interaction, and it creates new opportunities for interacting with and presenting information. In this regard, the Tactile Internet (TI) provides real-time communication characterized by extremely low latency and high reliability [1]. This, in turn, allows not only the transmission of audio and video data but also of sensor data, providing prerequisites for precise remote capture and manipulation of objects in virtual environments. The introduction of TI into augmented reality (AR), virtual reality (VR), and mixed reality (MR) would significantly improve the realism of interactions, overcoming important limitations that exist in the modeling of human–object interactions [2]. The main stages in the implementation of this type of communication range from multidimensional data processing to the creation of holographic environments and virtual applications.
For example, stroke rehabilitation applications that use haptic feedback to help patients relearn hand movements could benefit from the Tactile Internet. Similarly, simulations for training engineers, mechanics, and surgeons could mimic real-world tasks and provide learners with hands-on experience in virtual environments [3]. The Tactile Internet could provide realistic simulations of remote grasping and control without exposing the operator to danger in situations where physical access is limited or hazardous. This could be achieved while maintaining extremely accurate grasping by combining real-time control and haptic feedback. Simulations could include emergency, accident, or disaster response scenarios, where robots with tactile sensors could be programmed to perform delicate tasks such as moving broken objects or debris [4]. The constant tactile feedback provided by the TI allows for real-time adaptation to different shapes and materials of objects. This enables the grasping models to dynamically adjust to the hardness, texture, and deformation of objects [5].
The potential of grasping, in the context of its application through the Tactile Internet in virtual, augmented, and mixed reality, is gaining even greater importance in areas such as mental health and medical education. For example, Space Exodus is a VR role-playing game for children with intellectual disabilities that has been shown to improve fine motor skills, attention, and engagement after six weeks of use [6]. Similarly, virtual reality simulations based on problem-based learning allow medical students to study remotely without losing the quality of education [7]. Furthermore, VR therapies are effective in reducing symptoms of anxiety, depression, and stress, suggesting that integrating grip modeling and haptic feedback could enhance their effect and therapeutic value [8].
Although the Tactile Internet is the next step in the development of telecommunications, its implementation and modeling in virtual environments must meet a number of requirements. The first is to achieve submillisecond latency for real-time feedback, especially in large-scale or geographically dispersed networks [9]. It is then necessary to ensure the transmission of high-quality tactile data, which has a major impact on network capacity, especially when audio and video data are added [10]. Further, fusion algorithms must be developed to synchronize haptic feedback with other sensory modalities in dynamic environments. Data ethics and security, privacy, and accountability in failure scenarios are necessary for applications in telemedicine and human–robot collaboration [11]. Potential solutions to these limitations include the development of 6G networks, algorithmic optimization, and the creation of standardized protocols for tactile data transmission and synchronization, which would facilitate the implementation of the TI in grasping modeling.
Motivation: Human–object interactions in virtual environments should become more natural and realistic with the development of VR, AR, and MR. However, these applications are currently limited by challenges related to accurately simulating the grasping process and hand–object interactions [12]. Additional limitations arise in scenarios where physical access is dangerous or restricted, particularly with respect to incorporating real-time feedback, adapting dynamically to changing object attributes, and accurately simulating human grasping [13]. Accurate real-time exchange of tactile and sensory data is made possible by the Tactile Internet. Implementing tactile and sensory infrastructure in grasping modeling requires high-quality datasets, accurate physics-based simulations, and machine learning models that ensure realism and scalability.
Contributions:
This article makes the following important contributions:
1. A seven-step grasping framework: We propose a multi-layered grasping model that combines biomechanical, perceptual, physical, and contextual processes. Our framework differs from others because it combines the layers of object geometry and physics into a single sequence, making human–object interactions more realistic and adaptive.
2. Addressing technical issues: The proposed framework methodically addresses a number of important technical issues. We consider the latency and synchronization of multimodal feedback. We propose ways to address the computational complexity of real-time physics-based simulations. The model takes into account data limitations, such as dataset bias and lack of diversity. We explore approaches to modeling deformable and composite objects that have not been extensively addressed in the majority of the current literature.
3. Real-world impact: Our conceptual framework supports real-world applications in areas where safe and realistic interactions are crucial, by connecting grasping modeling to the Tactile Internet and low-latency communication over 6G.
Structure: The article is organized into six main sections. Section 1 presents the motivation behind the study and our contributions. In Section 2, a process for modeling grasping in virtual environments is proposed. In Section 3, the role of physics-based modeling in grasping is presented. In Section 4, we present the more commonly used datasets in different studies and compare them. In Section 5, the effectiveness of grasping models is given, and potential limitations are suggested. In Section 6, future directions for grasping modeling research are suggested.
Figure 1 presents examples of a realistic hand in a virtual environment interacting with objects of different shapes and materials (grasping objects). Several positions of a hand grasping a green cube (a solid, geometrically simple object) are shown; the hand uses different grips as a consequence of changes in finger positioning and in the dynamics of the interaction. The hand is then shown interacting with a blue object (a geometrically complex object) resembling a rabbit figurine. The type of material cannot be determined from the visual characteristics alone, but it can be assumed based on knowledge about the object itself, which creates uncertainty regarding its virtual properties (deformability, etc.). The third object with which the virtual hand interacts is a complex geometric structure containing intertwined circles and a central red part, for which, in addition to the material, there is no prior knowledge of what the object represents. In this case, a decision about the interaction must be made when choosing a grip: whether the object is likely to bend under pressure, whether precise gripping of small and thin parts is necessary, and so on. Finally, an example of grasping a familiar everyday object (a jug) is presented, where different grips are possible: in one case the rim of the jug is grasped, and in the other the handle is used. All these examples highlight the importance of choosing a grip depending on the shape and functionality of the object. In the next section, we describe the grasping process in seven steps. Together, these steps enable the construction of a complete, realistic grasping model that mimics human behavior in virtual environments. The subsequent sections expand on the data, physics, and evaluation mechanisms used to support each stage of this framework.

2. Modeling of the Grasping Process

In order to model grasping in a virtual environment, it is first necessary to consider the process in a real environment (see Figure 2).
Initially, the hand is in an arbitrary position within its range of motion and is at rest. We assume there is a brief instant during which the hand does not move, even when switching from one action to another. If we assume that we have information about the object (shape, mass, volume, material, etc.) and know its position (within the range of coordinates allowed for movement), then we first change the position of the hand itself. We then choose a grip, including the number of fingers involved.
There are different grasping strategies depending on the object geometry, material, and scene constraints (Figure 1). This decision has become unconscious in humans over time and is influenced not only by the object itself but also by the environment in which it is located. Once a choice has been made, for example, to use five fingers and the palm, the fingers and palm are positioned so that they enclose a volume greater than that of the object itself. The hand approaches the object and grasps it while simultaneously applying force to perform an additional action: lifting, moving, pushing, etc. This subsequent action requires adjusting the initial force in response to the real capabilities of the hand; that is, even with initial knowledge of the object and the hypothetical forces that would need to be applied, a correction may occur.
Modeling grasping in a virtual environment involves simulating the processes of perception, control, and adaptation that allow humans to interact with various objects. This involves building geometric and physical models, modeling the movement of one or both hands and the dynamics of the contact itself, and integrating contextual information to adapt grasping to specific tasks. We split the modeling process into three main stages (see Figure 3):
  • Stage 1: Perceptual Modeling (Steps 1–2)—representation of the hand and the object and identification of contact candidates;
  • Stage 2: Physical and Dynamic Modeling (Steps 3–6)—simulation of the movement, applying physics, and refining the grip through feedback;
  • Stage 3: Contextual Adaptation (Step 7)—adjustment of the grip according to the scene, the capabilities of the object, or the goals of the task.
In this section, we present a framework consisting of seven steps, which are summarized in Figure 3. The steps are sequential and aim to represent the construction of realistic grasping interactions.
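To make the sequencing of the three stages concrete, the following Python sketch outlines the seven steps as a pipeline of placeholder functions. All names, signatures, and the GraspState container are illustrative assumptions rather than an implementation from the reviewed literature; each stub stands in for the techniques discussed in Sections 2.1–2.7.

```python
from dataclasses import dataclass, field

@dataclass
class GraspState:
    """Accumulates the intermediate outputs passed between the seven steps."""
    hand_model: object = None            # Step 1: geometric hand representation
    object_model: object = None          # Step 1: geometric object representation
    contact_candidates: list = field(default_factory=list)    # Step 2
    approach_trajectory: list = field(default_factory=list)   # Step 3
    finger_configuration: dict = field(default_factory=dict)  # Step 4
    physics_valid: bool = False                                # Step 5
    refined_grasp: dict = field(default_factory=dict)         # Step 6
    adapted_grasp: dict = field(default_factory=dict)         # Step 7

# Placeholder implementations; a real system would wrap the methods
# reviewed in Sections 2.1-2.7.
def represent_geometry(hand, obj):       return hand, obj      # Step 1
def predict_contacts(hand_m, obj_m):     return []             # Step 2
def plan_hand_motion(state):             return []             # Step 3
def select_finger_configuration(state):  return {}             # Step 4
def simulate_physics(state):             return True           # Step 5
def refine_with_feedback(state):         return {}             # Step 6
def adapt_to_context(state, context):    return {}             # Step 7

def run_grasp_pipeline(hand, obj, scene_context):
    """Runs the three stages: perceptual, physical/dynamic, contextual."""
    s = GraspState()
    # Stage 1: perceptual modeling (Steps 1-2)
    s.hand_model, s.object_model = represent_geometry(hand, obj)
    s.contact_candidates = predict_contacts(s.hand_model, s.object_model)
    # Stage 2: physical and dynamic modeling (Steps 3-6)
    s.approach_trajectory = plan_hand_motion(s)
    s.finger_configuration = select_finger_configuration(s)
    s.physics_valid = simulate_physics(s)
    s.refined_grasp = refine_with_feedback(s)
    # Stage 3: contextual adaptation (Step 7)
    s.adapted_grasp = adapt_to_context(s, scene_context)
    return s

state = run_grasp_pipeline(hand="hand_mesh", obj="object_mesh", scene_context={"task": "lift"})
print(state.physics_valid)
```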

2.1. Step 1: Geometric Representation of Hand and Object

The first step of the modeling process is a digital representation of the real world, in which geometric and structural descriptions of the human hand and objects can be used. Realistic and physically plausible interactions can be produced using a variety of techniques, including a combination of geometric representations, sensor-based feedback, and simulation methods. To represent the human hand, high-resolution 3D meshes, joint angle mapping, or parametric models such as MANO [15] or SMPL-X [16] are typically used. Other methods include fitting a Universal Hand Model (UHM) [17], which simulates accurate articulation and deformations of the hand using 3D joint coordinates, identification codes, and pose matching [17,18]. Depending on the computational needs of the application and the level of precision, objects can be modeled using Signed Distance Functions (SDFs), voxel grids, triangle meshes, or point clouds. Geometric models can be derived from synthetic environments or photogrammetric reconstruction, as well as from real-world sensors such as RGB-D cameras, LiDAR, or tactile arrays, to facilitate contact prediction and simulation. To estimate possible interaction zones and create smooth surface representations of the hand and object, point clouds, 3D Gaussian representations, and implicit surface modeling can be used [19]. Combining these representations yields spatial information about size, shape, and relative location in three dimensions, which serves as input for the next step in the grasping process.
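As a concrete illustration of two representations named above, the short Python sketch below samples a point cloud from a spherical object and evaluates an analytic Signed Distance Function for it. The primitive shape, sizes, and fingertip coordinates are assumptions chosen only for the example; real systems would load MANO/SMPL-X hand meshes and scanned or synthetic object models instead.

```python
import numpy as np

def sample_sphere_point_cloud(center, radius, n_points=2048, rng=None):
    """Uniformly sample a point cloud from the surface of a sphere."""
    rng = np.random.default_rng() if rng is None else rng
    directions = rng.normal(size=(n_points, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return center + radius * directions

def sphere_sdf(query_points, center, radius):
    """Signed distance to the sphere surface: negative inside, positive outside."""
    return np.linalg.norm(query_points - center, axis=1) - radius

# Represent a virtual object (a 4 cm sphere) and query how far two fingertip
# points are from its surface; such distances feed contact prediction in Step 2.
obj_center, obj_radius = np.array([0.0, 0.0, 0.10]), 0.04
object_cloud = sample_sphere_point_cloud(obj_center, obj_radius)
fingertips = np.array([[0.00, 0.00, 0.15], [0.03, 0.00, 0.12]])
print(object_cloud.shape)                               # (2048, 3)
print(sphere_sdf(fingertips, obj_center, obj_radius))   # distances in meters
```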

2.2. Step 2: Predicting Contact Points

Using the geometric representations of the hand and the object, it is possible to predict the contact areas involved in the interaction. The goal of contact estimation is to predict both the contact points and the associated forces required for stable interaction; this can be formulated as an optimal control problem [20,21]. Various methods are used for this purpose, including proximity sensors applied to the phalanges of the fingers or capsule triggers for collision detection, which detect the initial contact and calculate the distances between the hand and the object surface [22]. Other approaches are learning-based and use annotated contact map datasets to predict interaction areas. A third group of techniques is optimization-based; their goal is to refine hand postures by minimizing the elastic or potential energy in the contact fields [23]. The contact points can be dynamically adjusted during the grasping process using physics-based models that impose a balance of force and torque to maintain grip stability [24,25]. Accurate contact prediction is essential both for the fidelity of the simulation and for computational performance during virtual grasping, and these contact points are necessary to ensure stable, physically realistic, and efficient grips.
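A purely geometric baseline for this step is sketched below, assuming point-cloud representations from Step 1: fingertip points whose nearest-neighbor distance to the object surface falls below a threshold are treated as contact candidates. The 5 mm threshold and the random clouds are illustrative assumptions; the learned or optimization-based predictors cited above would replace this heuristic.

```python
import numpy as np
from scipy.spatial import cKDTree

def predict_contact_candidates(fingertip_points, object_cloud, threshold=0.005):
    """Return (indices of fingertips in contact, matching surface points)."""
    tree = cKDTree(object_cloud)
    distances, nearest_idx = tree.query(fingertip_points)
    in_contact = np.where(distances < threshold)[0]
    return in_contact, object_cloud[nearest_idx[in_contact]]

# Usage: 5 fingertip positions tested against a 2048-point object cloud.
rng = np.random.default_rng(0)
object_cloud = rng.uniform(-0.05, 0.05, size=(2048, 3))
fingertips = rng.uniform(-0.06, 0.06, size=(5, 3))
contacts, surface_points = predict_contact_candidates(fingertips, object_cloud,
                                                      threshold=0.01)
print(len(contacts), "contact candidates", surface_points.shape)
```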

2.3. Step 3: Modeling Hand Motion Towards Grasp

Once the contact zones are predicted, the hand needs to be moved from its initial position to one that is more suitable for grasping. This movement must be consistent with the predicted contact zones, be able to occur in real time, and be anatomically correct. When creating trajectories for transitioning into a grasping position, joint constraints, motion continuity, and virtual environment constraints need to be taken into account. A common way to formalize this process is through physical interaction modeling, using simulation to predict how the hand and object will move. Motion paths, for example, are directed to the best contact areas, and 3D point clouds and Gaussian fields are used to estimate proximity [26,27].
Hand movements can be produced using inverse kinematics; to adhere to the principles of human biomechanics, anatomical constraints need to be introduced. Another option is to use movement models trained on real human motion, or sensor-based control, which adapts the movement in real time based on tactile and visual feedback and thus improves the stability of the grip and the precision of the interaction. It is also possible to use tactile data analysis to understand how objects will react to movement, from which the system can predict the grip [28]. Convolutional Neural Networks (CNNs) can also be used to make decisions about grip selection and movement type. Watkins et al. [29] also use them to represent the surfaces of objects that are hidden or only partially visible in the virtual scene. The goal is to move the hand into the correct position to grasp an object in a way that is physically plausible under the given circumstances.
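The sketch below illustrates the trajectory-generation idea in its simplest form, assuming a minimum-jerk profile for the wrist and a clamp that enforces anatomical joint limits. The start and goal poses and the limit values are arbitrary example numbers, and a full system would rely on inverse kinematics or learned motion models as described above.

```python
import numpy as np

def minimum_jerk(start, goal, n_steps=100):
    """Smooth point-to-point profile: x(t) = x0 + (xf - x0)(10t^3 - 15t^4 + 6t^5)."""
    tau = np.linspace(0.0, 1.0, n_steps)[:, None]
    blend = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return start + (goal - start) * blend

def clamp_joints(joint_angles, lower, upper):
    """Enforce simple anatomical joint limits (radians)."""
    return np.clip(joint_angles, lower, upper)

wrist_start = np.array([0.30, 0.00, 0.20])   # resting wrist position (m)
wrist_goal = np.array([0.05, 0.00, 0.12])    # pre-grasp position near the object
trajectory = minimum_jerk(wrist_start, wrist_goal)

# Example joint clamp: flexion of one finger joint limited to [0, 1.6] rad.
print(trajectory.shape, clamp_joints(np.array([1.9]), 0.0, 1.6))
```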

2.4. Step 4: Finger Configuration and Initial Grasp

Let us assume that the hand is approaching the object and that, during the movement towards it, no finger configuration has yet been chosen to establish a stable initial grip. To optimize the stability of the grip and the compatibility of the grasp itself, it is necessary to select which fingers will grasp the object and to establish joint angles and initial force vectors. One way to do this is to define the goal of the grasp, the number of fingers, the type of contact with the object, and its characteristics. This also requires sampling strategies and quality metrics to explore appropriate grasping postures. It is possible to train neural network architectures with reinforcement learning and Bayesian optimization to navigate the high-dimensional space of finger configurations while taking into account the geometry of the object and the constraints of the virtual environment.
Haptic testing can be used to assess grip stability. It involves a robotic hand interacting with different points on an object. In the case of sensory uncertainty, it is used to determine the safest configurations using Bayesian optimization [30]. Modern systems increasingly integrate visual and tactile information to predict the outcome of a grasp, using local feedback signals and joint attention mechanisms. Reliable representations of the contact interface are created by fusing tactile sensor readings with 3D point cloud data, which allows for more precise and flexible grasping [31]. In addition, some models simulate complex human–object interactions, such as pushing, carrying, and lifting, using multimodal sensors, such as RGB-D. To faithfully represent physical contact and force distribution during manipulation, these simulations use accurate 3D meshes for the hand and the object [32].
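For intuition, the following sketch implements one of the oldest analytic grasp-quality checks, the antipodal condition for a two-finger pinch: the grasp is accepted if the line connecting the two contacts lies inside both friction cones. The contact points, normals, and friction coefficient are assumed values; multi-finger metrics and the learning-based methods cited above generalize this idea.

```python
import numpy as np

def is_antipodal(p1, n1, p2, n2, mu=0.5):
    """p1, p2: contact points; n1, n2: inward surface normals; mu: friction coefficient."""
    line = p2 - p1
    line = line / np.linalg.norm(line)
    cone_half_angle = np.arctan(mu)
    # Angle between the contact line and each inward normal must stay in the cone.
    angle1 = np.arccos(np.clip(np.dot(n1, line), -1.0, 1.0))
    angle2 = np.arccos(np.clip(np.dot(n2, -line), -1.0, 1.0))
    return angle1 <= cone_half_angle and angle2 <= cone_half_angle

# Thumb and index contacts on opposite faces of a small box.
p_thumb, n_thumb = np.array([0.0, -0.02, 0.0]), np.array([0.0, 1.0, 0.0])
p_index, n_index = np.array([0.0, 0.02, 0.0]), np.array([0.0, -1.0, 0.0])
print(is_antipodal(p_thumb, n_thumb, p_index, n_index, mu=0.5))  # True
```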

2.5. Step 5: Applying Physics-Based Simulation

To verify that the grip is feasible, stable, and physically realistic, a physics-based simulation can be implemented. This allows the system to validate that the proposed hand–object interaction is dynamically plausible under real-world constraints. Depending on the physical properties of the material, the simulation can include three elements: rigid- and soft-body dynamics to model deformable and articulated components; simulation of contact forces and torques to estimate grip and sliding forces; and an equilibrium check based on torque balance and force distribution to estimate stability.
Some systems reduce the geometry of objects to primitives and use these abstractions to select the best gripping strategies based on various performance metrics obtained from human demonstrations [33]. To increase the accuracy of physical interaction reconstructions, others use kinematic modeling to estimate object attributes, contact points, and hand joint positions [34]. To obtain high-quality simulation and training data, multi-camera systems are often used to capture full 3D sequences of real-world grasping poses, covering the stages of approach, contact, grip closure, and stabilization [35]. Depending on the task, different simulation platforms can be used, such as Blender or TACTO. They can be used to simulate complex surface interactions, such as transparency or tactile effects [36,37]. Reliable dynamic modeling must be considered when managing interactions, collisions, and occlusions of objects in more complex or cluttered virtual environments. To ensure reliability in these circumstances, various techniques have been used, including neural network architectures and volumetric reasoning [38,39].
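As a minimal, hedged example of such a simulation, the PyBullet sketch below emulates a closed grip with a force-limited constraint and uses the object's drift under gravity after one simulated second as a crude stability indicator. The shapes, masses, and the 5 N force limit are illustrative assumptions, not values from the cited systems.

```python
import pybullet as p

p.connect(p.DIRECT)                      # headless physics simulation
p.setGravity(0, 0, -9.81)
p.setTimeStep(1.0 / 240.0)

# "Object": a 4 cm cube weighing 100 g, floating at the grasp location.
cube_col = p.createCollisionShape(p.GEOM_BOX, halfExtents=[0.02, 0.02, 0.02])
cube = p.createMultiBody(baseMass=0.1, baseCollisionShapeIndex=cube_col,
                         basePosition=[0, 0, 0.20])

# "Hand": a static anchor body above the object standing in for the palm.
hand_col = p.createCollisionShape(p.GEOM_SPHERE, radius=0.01)
hand = p.createMultiBody(baseMass=0, baseCollisionShapeIndex=hand_col,
                         basePosition=[0, 0, 0.25])

# Emulate the grip as a constraint whose maximum force models grip strength.
grip = p.createConstraint(hand, -1, cube, -1, p.JOINT_POINT2POINT,
                          [0, 0, 0], [0, 0, -0.05], [0, 0, 0])
p.changeConstraint(grip, maxForce=5.0)   # about 5 N of available grip force

start_pos, _ = p.getBasePositionAndOrientation(cube)
for _ in range(240):                     # simulate one second
    p.stepSimulation()
end_pos, _ = p.getBasePositionAndOrientation(cube)

drift = sum((a - b) ** 2 for a, b in zip(start_pos, end_pos)) ** 0.5
print(f"object drift after 1 s: {drift:.4f} m")  # large drift => unstable grip
p.disconnect()
```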

2.6. Step 6: Grasp Refinement Using Feedback

Due to environmental variability, sensor noise, or uncertainties in object properties, initial grasps are often unstable. To overcome this, the system needs a monitoring and adaptation mechanism that allows it to optimize the hand configuration in real time. The idea is to ensure stability and task performance in a manner comparable to human behavior in a real environment.
The fine-tuning process can be implemented through various types of feedback, such as haptic information, visual inspection, or trained models. In this context, deep learning approaches such as CNNs, Variational Autoencoders (VAEs), and transformer architectures are used. They assist the system in predicting appropriate hand configurations, assessing the quality of the grasp, and adapting to changes in the posture or position of the object [40,41]. Generative models such as Conditional Variational Autoencoders (CVAEs) are used to generate initial grasp candidates, which are then refined using physical reasoning and feedback. Reinforcement learning, especially in tactile robotic grasping, allows the system to develop adaptive policies that maximize stability under uncertainty [37]. Training the model on large datasets of hand–object interactions, often recorded in controlled conditions such as tabletop scenes, also makes it much more realistic [35]. This kind of data supports refinement models that reproduce how people act, such as two-handed and object-aware corrections [42]. Further improvements are achieved through hierarchical policy structures that separate high-level planning (e.g., task goal or object semantics) from low-level control (e.g., joint configurations and contact forces) [43]. These methods allow for task-specific improvements that generalize across a wide range of grasping scenarios.
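A deliberately simple version of this feedback loop is sketched below: a proportional controller increases the commanded grip force whenever a slip signal is reported by a (simulated) tactile sensor. The gain and force bounds are assumed values; the learning-based refinement methods cited above would replace this hand-tuned rule.

```python
def refine_grip_force(initial_force, slip_readings, gain=2.0,
                      min_force=1.0, max_force=15.0):
    """slip_readings: per-step slip magnitudes from a (simulated) tactile sensor."""
    force = initial_force
    history = []
    for slip in slip_readings:
        # Proportional correction, clamped to a plausible grip-force range.
        force = min(max(force + gain * slip, min_force), max_force)
        history.append(force)
    return history

# The grip starts too weak; slip decreases as the commanded force is increased.
print(refine_grip_force(2.0, [0.8, 0.5, 0.2, 0.0, 0.0]))
```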
The output of this step will be used as input to the final stage of the grasping process, which involves contextual adaptation of the grasp based on the task, scene, and semantic information about the object. In practice, this can be achieved through reinforcement learning with task-oriented reward functions or multimodal fusions that combine tactile and visual data. In the context of the Tactile Internet, for real-time application systems, feedback optimization is of particular importance. Ultra-low latency communication allows users to correct unstable grasps in real time in shared virtual or remote environments. This ensures realistic and immediate tactile responses, consistent with the goals of immersive VR, AR, and MR applications.

2.7. Step 7: Contextual Adaptation of the Grasp

Adapting the grasp according to the task goal, the context of the scene, and the semantic information about the object is the final step in the grasping process. For example, a cup may be held differently depending on whether it is being placed on a surface, filled by the user, or handed to someone. The reasoning required for this semantic variation should not go beyond what is physically possible. Contextual reasoning modules contribute to better efficiency by aligning the selected grips with typical human behavior. In this way, the system not only performs the grip physically correctly, but also adapts it to the specific task and context in which it occurs. This allows a better connection between low-level control, such as positioning and grip strength, and higher-level understanding of the situation and goal. When this type of reasoning is integrated into the grasping process, the system responds more flexibly and its actions become more meaningful, both functionally and in relation to the intention behind the interaction itself.
Object movements affect the overall stability of the scene, and the dependencies between objects must be modeled. By combining scene graphs and manipulation relationship graphs (MRGs), context modeling makes this flexibility possible [44]. Another option is to use task-specific constraints and language-based descriptions to help map grasps to desired actions or goals [45]. The type of grasp is chosen according to the functional role, affordances, and purpose of the object. In addition to taking environmental constraints into account, context-aware approaches allow grasp planning in scenes that are dynamic or cluttered. Such systems reason about the broader implications of each grasp by combining information about the scene layout, object semantics, and current interactions between the hand and the object [46].
In practical applications, this contextual adaptation can be achieved through manipulation relationship graphs that define permissible interactions, or by using high-level language models to translate task instructions into specific grasp constraints. These strategies transform the conceptual stage into a functional module that effectively links high-level task semantics with low-level grasp planning.
When integrated with the Tactile Internet, semantic adaptation ensures that the system not only generates physically stable grasps, but also dynamically adjusts them to task-specific goals transmitted over multimodal channels in real time. This integration allows virtual and remote interactions to accurately reflect human intent, which is a priority for the next generation of immersive and teleoperation systems.
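To make the idea of contextual adaptation tangible, the sketch below encodes a small rule table that maps an object label and a task intent to grasp constraints. The object, tasks, regions, and force limits are invented for illustration; in the cited approaches, this mapping would be provided by manipulation relationship graphs or language-based models rather than a hand-written dictionary.

```python
# Hypothetical rule table: (object label, task intent) -> grasp constraints.
TASK_GRASP_RULES = {
    ("mug", "hand_over"): {"grasp_region": "body",   "max_force_n": 8,  "orientation": "upright"},
    ("mug", "pour"):      {"grasp_region": "handle", "max_force_n": 12, "orientation": "free"},
    ("mug", "place"):     {"grasp_region": "rim",    "max_force_n": 6,  "orientation": "upright"},
}

def adapt_grasp(object_label, task, default=None):
    """Return task-specific grasp constraints, or a conservative default."""
    default = default or {"grasp_region": "body", "max_force_n": 6, "orientation": "upright"}
    return TASK_GRASP_RULES.get((object_label, task), default)

print(adapt_grasp("mug", "pour"))       # handle grasp, free orientation for pouring
print(adapt_grasp("mug", "hand_over"))  # body grasp, kept upright for the receiver
```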

2.8. Guidelines for Implementing Tactile Internet in Grasping Frameworks

In order to model the grasping process in a virtual environment, it is necessary to take into account the technological requirements to achieve realism and responsiveness. Recent studies have shown that the choice of Tactile Internet solutions must be adapted to meet the requirements of ultra-low latency, reliability, and multimodal feedback [47,48]. Similar to human behavior, where unconscious choices are made about the type of grasp, reinforcement learning techniques have been used in the algorithmic design to enable the acquisition of adaptive grasping policies in a dynamic context [49].
In addition, generative models such as Conditional Variational Autoencoders and diffusion models can generate different grasping variants [49]. This creates a wealth of possibilities that need to be managed effectively. For this reason, data preprocessing is required, including techniques such as normalization, filtering, and the use of compact geometric representations, such as signed distance functions. These methods are essential for noise reduction and lower computational costs [49].
To improve the performance of Tactile Internet systems, it is recommended to use lightweight models (e.g., pruned or quantized Convolutional Neural Networks), which can be combined with additional algorithms. This ensures real-time operation without loss of responsiveness [48]. The balance between latency and computational load can be further achieved through optimization methods such as distributed learning, ADMM-based decomposition, or splitting the model between edge and cloud nodes. Particular attention is paid to the balance between latency, realism, and user experience, especially in medical training and rehabilitation applications [47].
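A back-of-the-envelope latency budget illustrates why such edge-oriented optimizations matter; all numbers below are illustrative assumptions rather than measured values, with the roughly 1 ms end-to-end target taken from the submillisecond requirement discussed in Section 1.

```python
def latency_budget_ms(sensing, inference, network_rtt, rendering, target=1.0):
    """Sum the loop components (ms) and check them against the end-to-end target."""
    total = sensing + inference + network_rtt + rendering
    return total, total <= target

# Edge-deployed, quantized model close to the user (assumed values):
print(latency_budget_ms(sensing=0.1, inference=0.3, network_rtt=0.4, rendering=0.15))
# Full-precision model in a distant cloud (assumed values):
print(latency_budget_ms(sensing=0.1, inference=2.0, network_rtt=8.0, rendering=0.15))
```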

3. Physics-Informed Modeling of Grasping

Besides being used as a simulation tool to validate grasping performance (as discussed in Section 2.5), physics is at the heart of the design, training, and evaluation of grasping models. The direct integration of physical laws, constraints, and material properties into the computational representation of hands, objects, and their interactions is known as physics-informed modeling. This section presents how physics is embedded in grasping approaches, from precomputed constraints and loss functions to real-time dynamic simulation and differentiable physics models. The goal is to improve grasping realism, physical plausibility, and generalization to complex scenarios, such as manipulating deformable objects or cluttered environments.
To summarize the main approaches to grasp modeling, a comparative analysis of different techniques used in the literature is presented in Table 1. The table provides an overview of the methods used, their specific techniques, and main applications in the simulated environment.
One of the most challenging aspects of accurate and realistic hand–object interaction models is the incorporation of physics into the representation of hands, surfaces, and objects. Physics is incorporated into these methods in a number of ways, ranging from real-time simulation with physics engines to precomputed physical features. To create realistic hand–object interactions, several datasets incorporate physics during preprocessing. For example, to ensure that hands do not interpenetrate objects unrealistically and remain consistent with physical expectations, methods often use penetration losses and plane constraints [18,54]. By imposing constraints derived from physical principles, techniques such as SDFs and loss functions that minimize interpenetration improve realism [55].
To simulate dynamic interactions, physics engines such as Bullet [56], PyBullet [57], Gazebo [58], and MuJoCo [59] are often used. They enable complex and flexible interactions during grasping tasks by simulating forces, torques, collisions, and deformations. The DART physics engine used in the Gazebo environment allows for the simulation of contact forces and object dynamics, making it suitable for modeling realistic robotic grasping [53,60]. Similar goals are achieved using PyBullet, which is used to evaluate grip stability and derive physically consistent configurations [61]. These tools are particularly effective in simulations involving soft materials or deformable objects, where real-time calculations are required [25].
Modeling dynamic interactions relies on fundamental physical principles, such as Newton’s laws, force balance, and collision detection. Calculations of contact forces, torques, and friction coefficients are used to improve the grip. In other cases, biomechanical concepts, such as virtual springs or finite element methods (FEMs), are applied to simulate soft tissue deformation. These techniques can be applied to activities that require a high degree of precision. Physics is often incorporated into learning-based models as additional losses or constraints. For example, realistic contact dynamics is simulated using repulsive and attractive forces in spring-mass systems [23]. Soft tissue deformation is addressed by differentiable, physics-based contact models, such as DiffContact [52]. The robustness of the trained models is improved by loss functions that penalize interpenetration and violations of physical constraints, such as force closure [62,63].
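A minimal sketch of such an interpenetration penalty is given below, assuming an SDF representation of the object: hand vertices with negative signed distance (i.e., inside the object) contribute to the loss. The analytic sphere SDF and the vertex coordinates are stand-ins for a learned or mesh-derived SDF and a full hand mesh.

```python
import torch

def penetration_loss(hand_vertices, object_sdf):
    """hand_vertices: (N, 3) tensor; object_sdf: callable mapping (N, 3) -> (N,) signed distances."""
    sdf_values = object_sdf(hand_vertices)
    # Only negative distances (penetration) contribute to the loss.
    return torch.clamp(-sdf_values, min=0.0).mean()

# Analytic SDF of a 5 cm sphere used as a stand-in object.
def sphere_sdf(points, center=torch.zeros(3), radius=0.05):
    return torch.linalg.norm(points - center, dim=-1) - radius

hand_vertices = torch.tensor([[0.00, 0.00, 0.02],   # inside the sphere -> penalized
                              [0.00, 0.00, 0.08]])  # outside -> no penalty
print(penetration_loss(hand_vertices, sphere_sdf))  # tensor(0.0150)
```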
When physics is implemented in grasp simulation in a virtual environment, physical properties can be pre-computed, which saves computational effort at the expense of simulation flexibility. This approach is effective for real-time manipulation by robotic hands, which requires fast inference [64]. Dynamic physical simulation, which calculates forces and interactions in real time, is more precise and adaptive, but also requires more processing resources. The Isaac Gym simulator [63], for example, is used to verify that grasping motions are physically valid.
Grasping simulations based on differentiable physics engines [65] require substantial computing power to accurately model contact forces, object deformations, and soft tissue interactions. Although such approaches can be computationally efficient once trained, a drawback is their limited adaptability to dynamic scenarios [42]. On the other hand, models that include real-time physics engines [66] must balance accuracy and efficiency, which further slows down data processing. In addition, multi-view grasping methods [67] incur additional computational costs due to the need to integrate multiple viewpoints and, consequently, to process the corresponding sensor data.

4. Types of Datasets and Their Focus

The way in which different elements and surfaces are represented in a virtual environment should be tied to the way in which grasping is represented as a process, and therefore to the interaction between people and objects. This includes considering the type of material and its physical characteristics. For the simulation to be accurate, the user’s hand needs to be represented, and the movement during the grasping process must be realistic. The datasets used in grasping simulation models can be broadly divided into real-world and synthetic datasets. Some visual examples are given in Figure 4. A comparison of nine datasets used in material synthesis and analysis is given in Table 2. The Text-to-Motion column indicates whether grasping movements can be performed or modified using natural language instructions, such as “grab the jug by the handle.” These methods combine semantic descriptions with movement, allowing the grasping process to be controlled both semantically and geometrically.

4.1. Real-World Datasets

The first group, real-world datasets, is obtained using physical sensors such as RGB-D cameras, LiDAR, or tactile sensors. Such recordings contain, in addition to the useful signal, noise and variability that require preprocessing. BEHAVE is an example of such a dataset; it records hand–object interactions using RGB-D cameras with multiple viewpoints and includes contact points, object pose, and hand pose [32]. Similarly, the GRAB dataset uses motion capture (MoCap) to record hand–object interactions, with 3D reconstructions of hand positions and object interactions as the final output [72]. The DexYCB dataset uses real-world RGB-D sequences and provides hand–object annotations for comparative analysis [35]. The problem with collecting such datasets is the cost, time, and complexity of the process itself [28,73].

4.2. Synthetic Datasets

The second group of datasets, hereafter called synthetic datasets, is generated using 3D modeling software or simulation environments. This type of data provides significant advantages, including faster data generation, greater adaptability, and improved control over the characteristics of the dataset. For example, the ReplicaGrasp dataset simulates whole-body interactions between a person and an object in various 3D environments [74], while the ObMan dataset offers synthetic hand–object interaction data produced using the MANO model and ShapeNet objects [41]. Text descriptions are included in other datasets, such as TextGraspDiff, to guide the creation of synthetic grasp data using diffusion models [75]. A disadvantage of such datasets is their limited real-world realism, which can affect the generalization of the models trained on them [51,76].

4.3. Data Representation Approach

The way in which the data are represented is directly related to the way in which the grasping process is modeled and represented. This concerns not only the data structure or the quality of visualization, but also the analysis and interpretation of the hand–object interactions and the grasping process. Commonly used data representations include point clouds, polygon meshes, voxels (see Figure 5), Signed Distance Functions (SDFs), and depth images.
Point clouds are represented as sets of 3D points in space and are used precisely because of their detailed spatial information, which makes them useful for capturing fine geometric features and modeling object surfaces [78]. To illustrate hand–object interactions with accurate 3D spatial annotations, point clouds are used in datasets such as ObMan and ContactPose [41,52]. In real-time applications such as the REGRAD dataset, where the representation facilitates the creation of grip points and orientations, point clouds are also preferred due to their computational efficiency [44].
Polygon meshes, composed of vertices, edges, and faces, offer an even more precise representation that captures complex hand and object geometries. They are commonly used in datasets that require detailed modeling of hand–object interactions, such as the GRAB and DexYCB datasets, which use meshes to model objects at a fine-grained level [35,72]. DexYCB combines RGB-D images, meshes, and 6D object poses used for developing models for grip prediction and posture estimation [35]. The Tactile Glove dataset [28] provides sensory feedback for dynamic grasping tasks by combining tactile data and point clouds. Meshes are particularly suitable for applications that demand realistic 3D reconstructions and interaction modeling. For example, the MOW dataset contains reconstructed hand and object meshes from real-world images [55].
Voxels represent 3D objects as structured volumetric data, making them compatible with Convolutional Neural Networks (CNNs) [78]. This structured representation is well suited for tasks that involve completing the shape of an object or predicting grasp stability, as seen in IsaacGym-based simulation datasets that use voxel grids to evaluate multi-finger grasps on deformable objects [51]. Depth maps, which are also compatible with 2D CNNs, provide structured, pixel-based data that capture 3D spatial information. Some datasets, such as BEHAVE, use depth images to supplement RGB data to analyze and model hand–object interactions [32]. Depth images are also used in datasets such as HUMANISE, where they provide additional spatial context for human movement within 3D indoor scenes [45]. Although less sophisticated than point clouds or meshes, voxel and depth representations are attractive for their simplicity and memory efficiency, as well as their compatibility with deep neural networks.
Signed Distance Functions provide a continuous representation of three-dimensional shapes by encoding the distance between a point in space and the nearest surface of an object. In datasets such as ShapeNet and the Grasp database, objects are represented by SDF values to aid the planning of collision-free interactions and grasp prediction [50,79]. SDFs are well suited to geometric modeling of object surfaces in simulation environments.
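The representations above are routinely converted into one another; the sketch below shows one such conversion under simple assumptions, quantizing a point cloud into a binary occupancy voxel grid of the kind consumed by 3D CNNs. Grid size, bounds, and the synthetic cloud are arbitrary example values.

```python
import numpy as np

def voxelize_point_cloud(points, grid_size=32, bounds=(-0.1, 0.1)):
    """points: (N, 3) array; returns a (grid_size,)*3 binary occupancy grid."""
    lo, hi = bounds
    # Map coordinates into [0, grid_size) and clip points outside the bounds.
    idx = ((points - lo) / (hi - lo) * grid_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)
    grid = np.zeros((grid_size, grid_size, grid_size), dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

rng = np.random.default_rng(1)
cloud = rng.uniform(-0.05, 0.05, size=(4096, 3))   # synthetic object surface points
occupancy = voxelize_point_cloud(cloud)
print(occupancy.shape, int(occupancy.sum()), "occupied voxels")
```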

4.4. Object Material Types

The materials found in databases range from soft and flexible, such as fabric, foam, or rubber, to rigid, such as metal and plastic. Accurate representation of interaction forces, deformation, and friction behavior is essential in simulations of realistic contact dynamics, since these quantities depend on the material properties.
Almost all datasets contain rigid materials, such as wood, plastic, and metal, which are among the easiest to represent. For example, the YCB dataset includes objects made of plastic, metal, and other rigid materials, which allow for precise modeling of shapes with high structural stability and constant geometry [22,67]. Such objects are suitable for improving the accuracy, robustness, and reproducibility of grasping-related tasks.
On the other hand, when modeling interactions with deformable materials such as textiles, foams, or rubber, it is necessary to account for their elastic and plastic properties. For realistic simulation, datasets that contain physical characteristics such as Young’s modulus and loading behavior are used, which allow for an accurate representation of the material response upon contact. Some examples include virtual workspaces where precision during movement is required (e.g., virtual operating rooms). To track the interaction between deformable objects and a virtual hand, Grady et al. [52] have constructed the ContactPose dataset. It contains thermal data and labels related to soft materials, which are part of the deformable object family.
Glass and polished metal are examples of transparent and reflective materials that are somewhat more difficult to model. Datasets such as the Blender-based synthetic transparent object dataset are used to accurately reproduce the optical characteristics of these materials [36]. Algorithms that interpret complex visual features can be trained on real or synthetic data, but optical distortions can also occur.
In the real world, not all objects are made of a single type of material. Implementing these complex objects in the virtual world also requires complex feature extraction. For example, in the OBJECTFOLDER dataset [37], objects made of materials such as ceramic, wood, steel, and polycarbonate are added through visual, auditory, and tactile inputs. Similarly, the BEHAVE dataset represents objects made of fabric and plastic [32]. To simulate interactions with hard furniture and soft household objects, for example, datasets such as ReplicaGrasp include semantic information regarding the materials themselves [74]. For activities such as furniture assembly, packaging, or handling delicate products in virtual environments, this semantic information would enhance the realism of the simulation.
Datasets based on haptic feedback and interaction are already available. These represent material-specific characteristics, including compliance and coefficient of friction. Key features, such as surface texture, flexibility, and friction, are otherwise impossible to identify with visual information alone and are captured from measurements taken using tactile sensors. More exact simulation of grip stability is therefore possible in the future [30]. These datasets are particularly useful for models that aim to distinguish between slippery and stable surfaces. The goal is to improve grip planning and performance encountered in real-world grasping tasks. Methods based primarily on visual information often ignore important factors such as sliding resistance or deformation of the object during grasping. Due to the limitations of current simulation approaches, fragile or highly deformable objects are often absent from the databases. This makes it difficult to compare different models and realistically reproduce complex material properties.

5. Evaluation and Limitations

In the real world, two types of criteria, objective and subjective, can be distinguished to assess grasping. Under the objective criteria, there are two possible outcomes: success and failure. Success is considered to be achieved when the object is held in the hand and has been moved from its initial coordinates, with the process being controlled and the object consciously following the movement and direction of the hand. Failure can be considered the inability to grasp the object, an improper handover, dropping it, etc. For the robotic grasping of 3D deformable objects, Huang et al. [25] consider the following performance metrics: pickup success, stress, deformation, strain energy, linear and angular instability, and controllability of deformation. In Table 3, we propose six formal evaluation categories for hand–object interaction in a virtual world.
The evaluation of grasping models in shared virtual environments is essential for achieving realistic human–object interaction. In one of our previous studies [88], we provide guidance for selecting appropriate evaluation methods based on the application context. For comparisons across large datasets, quantitative metrics are most appropriate. For physics-based simulations, physical realism tests can be applied. For VR/AR applications, where user perception and immersion are most important, qualitative evaluation is needed. To ensure a fair and balanced evaluation, we suggest that the selection be based on identifying the application scenario, selecting the main dimension to evaluate (e.g., geometric accuracy, physical stability, or user experience), and using at least two complementary metrics.

5.1. Evaluation

Real-world testing, benchmarking, and quantitative and qualitative evaluations are part of the evaluation of generated hand–object interactions. They are intended to assess the practical utility, variety, realism, and accuracy of the models. The idea of quantitative evaluation is to assess the accuracy and realism of the interactions generated using objective measurements. The accuracy of pose and shape reconstructions is often evaluated using metrics such as the mean joint position error (MPJPE) [26,80,81], Chamfer distance (CD) [17,80,82], and F-score [80,83]. While CD quantifies the similarity between the generated and ground-truth point clouds to assess geometric accuracy, MPJPE calculates the Euclidean distance between the predicted and actual joint positions. The assessment of physical plausibility in hand–object interactions is often performed using metrics that measure the degree of penetration between surfaces, such as penetration depth and intersection volume [18,84]. Additional information about the realism of the contact can be extracted from metrics such as contact ratio [52,75] and intersection volume (IV) [50,52], which assess the degree of overlap and precision of the established contact points. It is also possible to use the so-called simulation displacement (SD), which measures the displacement of the object during the simulation [18,78].
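For reference, the sketch below implements two of these metrics directly from their definitions: MPJPE over corresponding hand joints and a symmetric Chamfer distance between point clouds. The 21-joint hand layout and the random test data are assumptions used only to exercise the functions.

```python
import numpy as np
from scipy.spatial import cKDTree

def mpjpe(predicted_joints, gt_joints):
    """Mean Euclidean distance between corresponding joints, (J, 3) arrays."""
    return np.linalg.norm(predicted_joints - gt_joints, axis=1).mean()

def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two point clouds (mean squared distances)."""
    d_ab, _ = cKDTree(cloud_b).query(cloud_a)   # nearest neighbour in B for each point of A
    d_ba, _ = cKDTree(cloud_a).query(cloud_b)   # nearest neighbour in A for each point of B
    return (d_ab ** 2).mean() + (d_ba ** 2).mean()

rng = np.random.default_rng(2)
gt_joints = rng.uniform(size=(21, 3))                      # 21 hand joints (assumed layout)
pred_joints = gt_joints + rng.normal(scale=0.01, size=(21, 3))
print("MPJPE:", mpjpe(pred_joints, gt_joints))
print("Chamfer:", chamfer_distance(rng.uniform(size=(500, 3)),
                                   rng.uniform(size=(500, 3))))
```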
The capacity of a model to reproduce the diversity and unpredictability of real-world interactions is assessed using other criteria, such as diversity and coverage. Diversity metrics assess the variation in the movements produced, often examining rotational axes and angles [18,26]. Coverage metrics assess how well the generated movements capture the distribution of hand–object interactions in the real world [35,36]. Combining these criteria allows models to be generalized and increases robustness. In addition to classical quantitative metrics such as intersection over union (IoU) and the scene interaction vector (SIV), which are used to assess shape reconstruction and the accuracy of predicted contacts [29,40,62], indicators based on sensory output are also applied; for example, the grasp success rate, which measures the efficiency and quality of the performed manipulation [89,90].
Qualitative methods are often added to quantitative assessments, which provide a more complete picture of the perceived behavior of the system. Experts or end users evaluate naturalness, visual plausibility, and overall sense of realism through visual inspection [72,85]. Additional information on subjective perception is obtained through surveys and Likert-type scales that detect the extent to which interactions are perceived as natural and functional [61,70,78].
Benchmarking, in particular the HOMan [86] and PointNetGPD [40] models, is used to objectively measure the performance of the models. HOMan uses a generative model to predict a human’s ability to grasp an object among many other objects from single RGB images to assess the realism and accuracy of grasping. On the other hand, PointNetGPD evaluates performance by generating spatial grasp predictions from sparse point clouds. The model evaluates the success rate, precision, and efficiency of grasp detection.
Ablation studies are used to assess the contribution of different components of a model [55,63]. The goal is to remove or modify components in the model to examine their impact on overall performance. For example, Cao et al. [55] conducted ablation studies and evaluated the effectiveness of using 2D images and 3D contacts together and separately. Xu et al. [63] used ablation studies to investigate how each part of their proposed approach contributes to the quality of the final grasp and the generalizability.
In addition to the above, tests are also performed in real-world conditions. For example, robotic experiments are conducted [30,87], in which models are evaluated in practical conditions. Performance is measured in the presence of sensor noise, changing object positions, and environmental disturbances. For instance, GraspNet-1Billion [87] uses analytical computation to predict grip stability across a wide range of real-world objects, configurations, and locations. Siddiqui et al. [30] examine grip stability prediction using tactile feedback combined with Bayesian optimization. They analyze the capacity of models to handle uncertainty encountered during real-world robotic manipulations.
Simulation-based validation [14,19] creates controlled conditions for testing model performance without real-world limitations. Karunratanakul et al. [19] evaluated their Grasping Field model by comparing synthesized grasps with ground truth data in simulations, assessing the physical plausibility of the generated grasps for various invisible objects. Additional evaluation metrics are diversity [18,75], simulated displacement [78], and penetration depth [18,84]. These measures assess the diversity, stability, and physical plausibility of interactions. Furthermore, the accuracy and reliability of the methods in different contexts are characterized by surface estimation errors, depth estimation accuracy, action recognition speeds, and object position estimation metrics such as average distance (ADD) [21,53,79].

5.2. Limitations

5.2.1. Hardware Limitations

Simplified physical models or faulty sensor data often cause unrealistic movements, resulting in contact points or forces that do not match real-world expectations. Approaches that rely on objects without distinctive features, such as highly reflective or untextured surfaces, cannot accurately represent interactions [64]. Errors in object recognition and pose estimation are more noticeable in cluttered or poorly lit virtual environments. Due to technological limitations and the difficulty of reproducing the sensation of touch in virtual worlds, the integration of tactile sensors remains a challenge.
The creation of unrealistic movements is one of the most common shortcomings that can compromise the apparent realism of interactions [91]. The scope of these methods is limited because they focus on single-handed grasping of static objects, typically ignoring dynamic interactions or those involving multiple hands [92]. In tasks that require coordinated actions, including grasping with two hands or interacting with moving objects, this can lead to confusion. For example, current models often assume stationary objects and predictable trajectories, making them unsuitable for tasks that involve grasping, passing, or manipulating objects with different movement patterns. Ineffective integration of tactile feedback prevents models from adapting to small changes in object surfaces, such as differences in texture or compliance, and therefore from producing solid and secure grips [52]. This often results in a grip that is either too weak, causing the object to drop, or too strong, which can cause the object to deform or break.

5.2.2. Data Limitations

If the data are to be used for model training, they must be of good quality. Neural networks make poor predictions when trained on inaccurate or biased datasets, which limits their application in the real world [93]. A lack of data or improper tuning of model parameters can produce overfitted models that are unable to handle the diversity of the real world. Current techniques have limited robustness and adaptability, as studies report difficulties in generalizing across different types and sizes of objects [94]. Variations in hand size, posture, and individual grasping approaches, which are often underrepresented in training data, make generalization very difficult.

5.2.3. Methodological Limitations

In addition, some approaches require substantial computation, which limits their use in real time [60]. This limitation is particularly important for methods with high processing demands, since both the use of neural networks in model building and physics-based simulations are computationally intensive. Touch dynamics and realistic simulation of hand deformability and finger articulation should also be considered when refining the models. Especially when working with fragile or unusually shaped objects, inconsistencies between predicted and actual contact pressures can lead to unstable grasps or sliding [38].

6. Conclusions

In this article, we propose a conceptual framework for modeling human grasping in shared virtual environments, based on an extensive literature review. The framework is built around realism, feedback, and a structured sequence of interactions. Future work will focus on implementation, validation, and application in immersive systems such as virtual learning and remote collaboration.
Future research should explore the integration of the proposed grasping framework with innovative technologies that shape next-generation interactive systems. The Tactile Internet and 6G communication networks will together provide the ultra-reliable, low-latency channels needed for real-time grasping and haptic interaction on a global scale. Finally, embedding the grasping model in digital twin platforms would connect simulated and real-world environments. This would allow grasping strategies to be tested, validated, and co-designed iteratively before they are deployed in real-world or robotic systems. For example, it is possible to combine our conceptual framework for grasping in virtual environments with emerging energy-efficient robotic architectures such as neuromorphic computing and liquid flow batteries [95].

Author Contributions

Conceptualization N.C. and K.T.; methodology, N.C. and K.T.; formal analysis, N.C., N.N.N. and R.P.; investigation, N.C. and N.N.N.; resources, K.T.; data curation, N.C. and N.N.N.; writing—original draft preparation, N.C., N.N.N. and K.T.; writing—review and editing, N.C., N.N.N., R.P. and K.T.; supervision, A.M.; project administration, A.M.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research is financed by the European Union-Next Generation EU, through the National Recovery and Resilience Plan of the Republic of Bulgaria, project No. BG-RRP-2.004-0005: “Improving the research capacity and quality to achieve international recognition and resilience of TU-Sofia” (IDEAS).

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADD: Average distance
AR: Augmented reality
CD: Chamfer distance
CVAE: Conditional Variational Autoencoder
CNN: Convolutional Neural Networks
FEM: Finite element methods
IoU: Intersection over union
IV: Intersection volume
MPJPE: Mean joint position error
MRGs: Manipulation relationship graphs
MoCap: Motion capture
MR: Mixed reality
RL: Reinforcement learning
SIV: Scene interaction vector
SD: Simulation displacement
SDFs: Signed Distance Functions
TI: Tactile Internet
UHM: Universal Hand Model
VAE: Variational Autoencoders
VR: Virtual reality

References

  1. Sharma, S.K.; Woungang, I.; Anpalagan, A.; Chatzinotas, S. Toward tactile internet in beyond 5G era: Recent advances, current issues, and future directions. IEEE Access 2020, 8, 56948–56991. [Google Scholar] [CrossRef]
  2. Stefanidi, Z.; Margetis, G.; Ntoa, S.; Papagiannakis, G. Real-time adaptation of context-aware intelligent user interfaces, for enhanced situational awareness. IEEE Access 2022, 10, 23367–23393. [Google Scholar] [CrossRef]
  3. Lawson McLean, A.; Lawson McLean, A.C. Immersive simulations in surgical training: Analyzing the interplay between virtual and real-world environments. Simul. Gaming 2024, 55, 1103–1123. [Google Scholar] [CrossRef]
  4. Haynes, G.C.; Stager, D.; Stentz, A.; Vande Weghe, J.M.; Zajac, B.; Herman, H.; Kelly, A.; Meyhofer, E.; Anderson, D.; Bennington, D.; et al. Developing a robust disaster response robot: CHIMP and the robotics challenge. J. Field Robot. 2017, 34, 281–304. [Google Scholar] [CrossRef]
  5. Zhu, J.; Cherubini, A.; Dune, C.; Navarro-Alarcon, D.; Alambeigi, F.; Berenson, D.; Ficuciello, F.; Harada, K.; Kober, J.; Li, X.; et al. Challenges and outlook in robotic manipulation of deformable objects. IEEE Robot. Autom. Mag. 2022, 29, 67–77. [Google Scholar] [CrossRef]
  6. Berrezueta-Guzman, S.; Chen, W.; Wagner, S. A Therapeutic Role-Playing VR Game for Children with Intellectual Disabilities. arXiv 2025, arXiv:2507.19114. [Google Scholar] [CrossRef]
  7. Alverson, D.C.; Saiki Jr, S.M.; Kalishman, S.; Lindberg, M.; Mennin, S.; Mines, J.; Serna, L.; Summers, K.; Jacobs, J.; Lozanoff, S.; et al. Medical students learn over distance using virtual reality simulation. Simul. Healthc. 2008, 3, 10–15. [Google Scholar] [CrossRef] [PubMed]
  8. Saeed, S.; Khan, K.B.; Hassan, M.A.; Qayyum, A.; Salahuddin, S. Review on the role of virtual reality in reducing mental health diseases specifically stress, anxiety, and depression. arXiv 2024, arXiv:2407.18918. [Google Scholar]
  9. Santos, J.; Wauters, T.; Volckaert, B.; De Turck, F. Towards low-latency service delivery in a continuum of virtual resources: State-of-the-art and research directions. IEEE Commun. Surv. Tutor. 2021, 23, 2557–2589. [Google Scholar] [CrossRef]
  10. Awais, M.; Ullah Khan, F.; Zafar, M.; Mudassar, M.; Zaigham Zaheer, M.; Mehmood Cheema, K.; Kamran, M.; Jung, W.S. Towards enabling haptic communications over 6G: Issues and challenges. Electronics 2023, 12, 2955. [Google Scholar] [CrossRef]
  11. Manoj, R.; Krishna, N.; TS, M.S. A Comprehensive Study on the Integration of Robotic Technology in Medical Applications considering Legal Frameworks & Ethical Concerns. Int. J. Health Technol. Innov. 2024, 3, 29–37. [Google Scholar]
  12. Ahmad, A.; Migniot, C.; Dipanda, A. Tracking hands in interaction with objects: A review. In Proceedings of the 2017 13th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), Jaipur, India, 4–7 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 360–369. [Google Scholar]
  13. Si, W.; Wang, N.; Yang, C. A review on manipulation skill acquisition through teleoperation-based learning from demonstration. Cogn. Comput. Syst. 2021, 3, 1–16. [Google Scholar] [CrossRef]
  14. Tian, H.; Wang, C.; Manocha, D.; Zhang, X. Realtime hand-object interaction using learned grasp space for virtual environments. IEEE Trans. Vis. Comput. Graph. 2018, 25, 2623–2635. [Google Scholar] [CrossRef]
  15. Romero, J.; Tzionas, D.; Black, M.J. Embodied Hands: Modeling and Capturing Hands and Bodies Together. Acm Trans. Graph. (Proc. SIGGRAPH Asia) 2017, 36, 245. [Google Scholar] [CrossRef]
  16. Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.A.; Tzionas, D.; Black, M.J. Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 10975–10985. [Google Scholar]
  17. Moon, G.; Xu, W.; Joshi, R.; Wu, C.; Shiratori, T. Authentic Hand Avatar from a Phone Scan via Universal Hand Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 16–22 June 2024; pp. 2029–2038. [Google Scholar]
  18. Wang, Y.K.; Xing, C.; Wei, Y.L.; Wu, X.M.; Zheng, W.S. Single-View Scene Point Cloud Human Grasp Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 16–22 June 2024; pp. 831–841. [Google Scholar]
  19. Karunratanakul, K.; Yang, J.; Zhang, Y.; Black, M.J.; Muandet, K.; Tang, S. Grasping field: Learning implicit representations for human grasps. In Proceedings of the 2020 International Conference on 3D Vision (3DV), Virtual, 25–28 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 333–344. [Google Scholar]
  20. Li, Z.; Sedlar, J.; Carpentier, J.; Laptev, I.; Mansard, N.; Sivic, J. Estimating 3d motion and forces of person-object interactions from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 8640–8649. [Google Scholar]
  21. Hampali, S.; Rad, M.; Oberweger, M.; Lepetit, V. Honnotate: A method for 3d annotation of hand and object poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 3196–3206. [Google Scholar]
  22. Oprea, S.; Martinez-Gonzalez, P.; Garcia-Garcia, A.; Castro-Vargas, J.A.; Orts-Escolano, S.; Garcia-Rodriguez, J. A visually realistic grasping system for object manipulation and interaction in virtual reality environments. Comput. Graph. 2019, 83, 77–86. [Google Scholar] [CrossRef]
  23. Yang, L.; Zhan, X.; Li, K.; Xu, W.; Li, J.; Lu, C. Cpf: Learning a contact potential field to model the hand-object interaction. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 11097–11106. [Google Scholar]
  24. Wang, C.; Zang, X.; Zhang, X.; Liu, Y.; Zhao, J. Parameter estimation and object gripping based on fingertip force/torque sensors. Measurement 2021, 179, 109479. [Google Scholar] [CrossRef]
  25. Huang, I.; Narang, Y.; Eppner, C.; Sundaralingam, B.; Macklin, M.; Bajcsy, R.; Hermans, T.; Fox, D. DefGraspSim: Physics-based simulation of grasp outcomes for 3D deformable objects. IEEE Robot. Autom. Lett. 2022, 7, 6274–6281. [Google Scholar] [CrossRef]
  26. Cha, J.; Kim, J.; Yoon, J.S.; Baek, S. Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 16–22 June 2024; pp. 1577–1585. [Google Scholar]
  27. Pokhariya, C.; Shah, I.N.; Xing, A.; Li, Z.; Chen, K.; Sharma, A.; Sridhar, S. MANUS: Markerless Grasp Capture using Articulated 3D Gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 16–22 June 2024; pp. 2197–2208. [Google Scholar]
  28. Zhang, Q.; Li, Y.; Luo, Y.; Shou, W.; Foshey, M.; Yan, J.; Tenenbaum, J.B.; Matusik, W.; Torralba, A. Dynamic modeling of hand-object interactions via tactile sensing. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2874–2881. [Google Scholar]
  29. Watkins-Valls, D.; Varley, J.; Allen, P. Multi-modal geometric learning for grasping and manipulation. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7339–7345. [Google Scholar]
  30. Siddiqui, M.S.; Coppola, C.; Solak, G.; Jamone, L. Grasp stability prediction for a dexterous robotic hand combining depth vision and haptic bayesian exploration. Front. Robot. AI 2021, 8, 703869. [Google Scholar] [CrossRef]
  31. Zhang, Z.; Zhang, Z.; Wang, L.; Zhu, X.; Huang, H.; Cao, Q. Digital twin-enabled grasp outcomes assessment for unknown objects using visual-tactile fusion perception. Robot.-Comput.-Integr. Manuf. 2023, 84, 102601. [Google Scholar] [CrossRef]
  32. Bhatnagar, B.L.; Xie, X.; Petrov, I.A.; Sminchisescu, C.; Theobalt, C.; Pons-Moll, G. Behave: Dataset and method for tracking human object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 15935–15946. [Google Scholar]
  33. Palleschi, A.; Angelini, F.; Gabellieri, C.; Pallottino, L.; Bicchi, A.; Garabini, M. Grasp It Like a Pro 2.0: A Data-Driven Approach Exploiting Basic Shape Decomposition and Human Data for Grasping Unknown Objects. IEEE Trans. Robot. 2023, 39, 4016–4036. [Google Scholar] [CrossRef]
  34. Chen, Z.; Chen, S.; Schmid, C.; Laptev, I. gsdf: Geometry-driven signed distance functions for 3d hand-object reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 12890–12900. [Google Scholar]
  35. Chao, Y.W.; Yang, W.; Xiang, Y.; Molchanov, P.; Handa, A.; Tremblay, J.; Narang, Y.S.; Van Wyk, K.; Iqbal, U.; Birchfield, S.; et al. DexYCB: A benchmark for capturing hand grasping of objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 9044–9053. [Google Scholar]
  36. Sajjan, S.; Moore, M.; Pan, M.; Nagaraja, G.; Lee, J.; Zeng, A.; Song, S. Clear grasp: 3d shape estimation of transparent objects for manipulation. In Proceedings of the 2020 IEEE international conference on robotics and automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 3634–3642. [Google Scholar]
  37. Gao, R.; Chang, Y.Y.; Mall, S.; Fei-Fei, L.; Wu, J. Objectfolder: A dataset of objects with implicit visual, auditory, and tactile representations. arXiv 2021, arXiv:2109.07991. [Google Scholar] [CrossRef]
  38. Breyer, M.; Chung, J.J.; Ott, L.; Siegwart, R.; Nieto, J. Volumetric grasping network: Real-time 6 dof grasp detection in clutter. In Proceedings of the Conference on Robot Learning. PMLR 2021, London, UK, 8–11 November 2021; pp. 1602–1611. [Google Scholar]
  39. Mayer, V.; Feng, Q.; Deng, J.; Shi, Y.; Chen, Z.; Knoll, A. FFHNet: Generating multi-fingered robotic grasps for unknown objects in real-time. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 762–769. [Google Scholar]
  40. Ni, P.; Zhang, W.; Zhu, X.; Cao, Q. Pointnet++ grasping: Learning an end-to-end spatial grasp generation algorithm from sparse point clouds. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 3619–3625. [Google Scholar]
  41. Li, H.; Lin, X.; Zhou, Y.; Li, X.; Huo, Y.; Chen, J.; Ye, Q. Contact2grasp: 3d grasp synthesis via hand-object contact constraint. arXiv 2022, arXiv:2210.09245. [Google Scholar]
  42. Wang, S.; Liu, X.; Wang, C.C.; Liu, J. Physics-aware iterative learning and prediction of saliency map for bimanual grasp planning. Comput. Aided Geom. Des. 2024, 111, 102298. [Google Scholar] [CrossRef]
  43. Liu, Q.; Cui, Y.; Ye, Q.; Sun, Z.; Li, H.; Li, G.; Shao, L.; Chen, J. DexRepNet: Learning dexterous robotic grasping network with geometric and spatial hand-object representations. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 3153–3160. [Google Scholar]
  44. Zhang, H.; Yang, D.; Wang, H.; Zhao, B.; Lan, X.; Ding, J.; Zheng, N. Regrad: A large-scale relational grasp dataset for safe and object-specific robotic grasping in clutter. IEEE Robot. Autom. Lett. 2022, 7, 2929–2936. [Google Scholar] [CrossRef]
  45. Wang, Z.; Chen, Y.; Liu, T.; Zhu, Y.; Liang, W.; Huang, S. Humanise: Language-conditioned human motion generation in 3d scenes. Adv. Neural Inf. Process. Syst. 2022, 35, 14959–14971. [Google Scholar]
  46. Liu, S.; Jiang, H.; Xu, J.; Liu, S.; Wang, X. Semi-supervised 3d hand-object poses estimation with interactions in time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 14687–14697. [Google Scholar]
  47. Chaudhari, B.S. Enabling Tactile Internet via 6G: Application Characteristics, Requirements, and Design Considerations. Future Internet 2025, 17, 122. [Google Scholar] [CrossRef]
  48. Xiang, H.; Wu, K.; Chen, J.; Yi, C.; Cai, J.; Niyato, D.; Shen, X. Edge computing empowered tactile Internet for human digital twin: Visions and case study. arXiv 2023, arXiv:2304.07454. [Google Scholar]
  49. Lu, Y.; Kong, D.; Yang, G.; Wang, R.; Pang, G.; Luo, H.; Yang, H.; Xu, K. Machine learning-enabled tactile sensor design for dynamic touch decoding. Adv. Sci. 2023, 10, 2303949. [Google Scholar] [CrossRef]
  50. Turpin, D.; Wang, L.; Heiden, E.; Chen, Y.C.; Macklin, M.; Tsogkas, S.; Dickinson, S.; Garg, A. Grasp’d: Differentiable contact-rich grasp synthesis for multi-fingered hands. In Proceedings of the European Conference on Computer Vision 2022, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 201–221. [Google Scholar]
  51. Huang, I.; Narang, Y.; Eppner, C.; Sundaralingam, B.; Macklin, M.; Hermans, T.; Fox, D. Defgraspsim: Simulation-based grasping of 3d deformable objects. arXiv 2021, arXiv:2107.05778. [Google Scholar]
  52. Grady, P.; Tang, C.; Twigg, C.D.; Vo, M.; Brahmbhatt, S.; Kemp, C.C. Contactopt: Optimizing contact to improve grasps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 1471–1481. [Google Scholar]
  53. Van der Merwe, M.; Lu, Q.; Sundaralingam, B.; Matak, M.; Hermans, T. Learning continuous 3d reconstructions for geometrically aware grasping. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11516–11522. [Google Scholar]
  54. Yang, L.; Li, K.; Zhan, X.; Lv, J.; Xu, W.; Li, J.; Lu, C. Artiboost: Boosting articulated 3d hand-object pose estimation via online exploration and synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 2750–2760. [Google Scholar]
  55. Cao, Z.; Radosavovic, I.; Kanazawa, A.; Malik, J. Reconstructing hand-object interactions in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 12417–12426. [Google Scholar]
  56. Coumans, E. Bullet physics simulation. In ACM SIGGRAPH 2015 Courses; Association for Computing Machinery: New York, NY, USA, 2015; p. 1. [Google Scholar]
  57. Coumans, E.; Bai, Y. PyBullet Quickstart Guide. 2021. Available online: https://raw.githubusercontent.com/bulletphysics/bullet3/master/docs/pybullet_quickstartguide.pdf (accessed on 23 September 2025).
  58. Koenig, N.; Howard, A. Design and use paradigms for gazebo, an open-source multi-robot simulator. In Proceedings of the 2004 IEEE/RSJ international conference on intelligent robots and systems (IROS)(IEEE Cat. No. 04CH37566), Sendai, Japan, 28 September–2 October 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 3, pp. 2149–2154. [Google Scholar]
  59. Todorov, E.; Erez, T.; Tassa, Y. Mujoco: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 5026–5033. [Google Scholar]
  60. Lu, Q.; Van der Merwe, M.; Sundaralingam, B.; Hermans, T. Multifingered grasp planning via inference in deep neural networks: Outperforming sampling by learning differentiable models. IEEE Robot. Autom. Mag. 2020, 27, 55–65. [Google Scholar] [CrossRef]
  61. Li, K.; Wang, J.; Yang, L.; Lu, C.; Dai, B. Semgrasp: Semantic grasp generation via language aligned discretization. arXiv 2024, arXiv:2404.03590. [Google Scholar] [CrossRef]
  62. Liang, H.; Ma, X.; Li, S.; Görner, M.; Tang, S.; Fang, B.; Sun, F.; Zhang, J. Pointnetgpd: Detecting grasp configurations from point sets. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 3629–3635. [Google Scholar]
  63. Xu, Y.; Wan, W.; Zhang, J.; Liu, H.; Shan, Z.; Shen, H.; Wang, R.; Geng, H.; Weng, Y.; Chen, J.; et al. Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 4737–4746. [Google Scholar]
  64. Murali, A.; Liu, W.; Marino, K.; Chernova, S.; Gupta, A. Same object, different grasps: Data and semantic knowledge for task-oriented grasping. In Proceedings of the Conference on Robot Learning. PMLR 2021, London, UK, 8–11 November 2021; pp. 1540–1557. [Google Scholar]
  65. Braun, J.; Christen, S.; Kocabas, M.; Aksan, E.; Hilliges, O. Physically plausible full-body hand-object interaction synthesis. In Proceedings of the 2024 International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 464–473. [Google Scholar]
  66. Zhang, H.; Christen, S.; Fan, Z.; Zheng, L.; Hwangbo, J.; Song, J.; Hilliges, O. ArtiGrasp: Physically plausible synthesis of bi-manual dexterous grasping and articulation. In Proceedings of the 2024 International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 235–246. [Google Scholar]
  67. Kasaei, H.; Kasaei, M. Mvgrasp: Real-time multi-view 3d object grasping in highly cluttered environments. Robot. Auton. Syst. 2023, 160, 104313. [Google Scholar] [CrossRef]
  68. Hasson, Y.; Varol, G.; Tzionas, D.; Kalevatykh, I.; Black, M.J.; Laptev, I.; Schmid, C. Learning joint reconstruction of hands and manipulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 11807–11816. [Google Scholar]
  69. Brahmbhatt, S.; Tang, C.; Twigg, C.D.; Kemp, C.C.; Hays, J. ContactPose: A dataset of grasps with object contact and hand pose. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIII 16. Springer: Cham, Switzerland, 2020; pp. 361–378. [Google Scholar]
  70. Taheri, O.; Ghorbani, N.; Black, M.J.; Tzionas, D. GRAB: A dataset of whole-body human grasping of objects. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IV 16. Springer: Cham, Switzerland, 2020; pp. 581–600. [Google Scholar]
  71. Yang, L.; Li, K.; Zhan, X.; Wu, F.; Xu, A.; Liu, L.; Lu, C. Oakink: A large-scale knowledge repository for understanding hand-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 20953–20962. [Google Scholar]
  72. Taheri, O.; Choutas, V.; Black, M.J.; Tzionas, D. GOAL: Generating 4D whole-body motion for hand-object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 13263–13273. [Google Scholar]
  73. Fan, Z.; Taheri, O.; Tzionas, D.; Kocabas, M.; Kaufmann, M.; Black, M.J.; Hilliges, O. ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 12943–12954. [Google Scholar]
  74. Tendulkar, P.; Surís, D.; Vondrick, C. Flex: Full-body grasping without full-body grasps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 21179–21189. [Google Scholar]
  75. Chang, X.; Sun, Y. Text2Grasp: Grasp synthesis by text prompts of object grasping parts. arXiv 2024, arXiv:2404.15189. [Google Scholar] [CrossRef]
  76. Farias, C.; Marti, N.; Stolkin, R.; Bekiroglu, Y. Simultaneous tactile exploration and grasp refinement for unknown objects. IEEE Robot. Autom. Lett. 2021, 6, 3349–3356. [Google Scholar] [CrossRef]
  77. Gao, M.; Ruan, N.; Shi, J.; Zhou, W. Deep neural network for 3D shape classification based on mesh feature. Sensors 2022, 22, 7040. [Google Scholar] [CrossRef]
  78. Jiang, H.; Liu, S.; Wang, J.; Wang, X. Hand-object contact consistency reasoning for human grasps generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 11107–11116. [Google Scholar]
  79. Li, Y.; Schomaker, L.; Kasaei, S.H. Learning to grasp 3d objects using deep residual u-nets. In Proceedings of the 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Naples, Italy, 31 August–4 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 781–787. [Google Scholar]
  80. Fan, Z.; Parelli, M.; Kadoglou, M.E.; Chen, X.; Kocabas, M.; Black, M.J.; Hilliges, O. HOLD: Category-agnostic 3d reconstruction of interacting hands and objects from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 16–22 June 2024; pp. 494–504. [Google Scholar]
  81. Hao, Y.; Zhang, J.; Zhuo, T.; Wen, F.; Fan, H. Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling. In Proceedings of the AAAI Conference on Artificial Intelligence 2024, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 2076–2084. [Google Scholar]
  82. Petrov, I.A.; Marin, R.; Chibane, J.; Pons-Moll, G. Object pop-up: Can we infer 3d objects and their poses from human interactions alone? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 4726–4736. [Google Scholar]
  83. Ye, Y.; Gupta, A.; Tulsiani, S. What’s in your hands? 3d reconstruction of generic objects in hands. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 3895–3905. [Google Scholar]
  84. Zhou, K.; Bhatnagar, B.L.; Lenssen, J.E.; Pons-Moll, G. Toch: Spatio-temporal object-to-hand correspondence for motion refinement. In Proceedings of the European Conference on Computer Vision 2022, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 1–19. [Google Scholar]
  85. Kiatos, M.; Malassiotis, S.; Sarantopoulos, I. A geometric approach for grasping unknown objects with multifingered hands. IEEE Trans. Robot. 2020, 37, 735–746. [Google Scholar] [CrossRef]
  86. Corona, E.; Pumarola, A.; Alenya, G.; Moreno-Noguer, F.; Rogez, G. Ganhand: Predicting human grasp affordances in multi-object scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 5031–5041. [Google Scholar]
  87. Fang, H.S.; Wang, C.; Gou, M.; Lu, C. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 11444–11453. [Google Scholar]
  88. Christoff, N.; Neshov, N.N.; Tonchev, K.; Manolova, A. Application of a 3D talking head as part of telecommunication AR, VR, MR system: Systematic review. Electronics 2023, 12, 4788. [Google Scholar] [CrossRef]
  89. Zapata-Impata, B.S.; Gil, P.; Pomares, J.; Torres, F. Fast geometry-based computation of grasping points on three-dimensional point clouds. Int. J. Adv. Robot. Syst. 2019, 16, 1729881419831846. [Google Scholar] [CrossRef]
  90. Tekin, B.; Bogo, F.; Pollefeys, M. H+o: Unified egocentric recognition of 3d hand-object poses and interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 4511–4520. [Google Scholar]
  91. Delrieu, T.; Weistroffer, V.; Gazeau, J.P. Precise and realistic grasping and manipulation in virtual reality without force feedback. In Proceedings of the 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Atlanta, GA, USA, 22–26 March 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 266–274. [Google Scholar]
  92. Fan, H.; Zhuo, T.; Yu, X.; Yang, Y.; Kankanhalli, M. Understanding atomic hand-object interaction with human intention. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 275–285. [Google Scholar] [CrossRef]
  93. Le, T.T.; Le, T.S.; Chen, Y.R.; Vidal, J.; Lin, C.Y. 6D pose estimation with combined deep learning and 3D vision techniques for a fast and accurate object grasping. Robot. Auton. Syst. 2021, 141, 103775. [Google Scholar] [CrossRef]
  94. Baek, S.; Kim, K.I.; Kim, T.K. Weakly-supervised domain adaptation via gan and mesh model for estimating 3d hand poses interacting objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 6121–6131. [Google Scholar]
  95. Stavrev, S. Reimagining Robots: The Future of Cybernetic Organisms with Energy-Efficient Designs. Big Data Cogn. Comput. 2025, 9, 104. [Google Scholar] [CrossRef]
Figure 1. Examples of different grasping strategies depending on object type and task [14].
Figure 2. Graphical representation of the sequential process of initiating and executing a grasp in a real-world environment. The process begins with the hand at rest in an arbitrary starting position. The hand then changes position to select an appropriate grip, followed by a decision regarding the number of fingers or the use of the palm. Finally, the grasp is performed in interaction with the environment, which requires knowledge of the object’s properties (e.g., shape, mass, volume, and material) as well as its spatial location.
Figure 3. Overview of the seven-step grasping process in virtual environments.
Figure 4. Dataset examples: (a) HO-3D [21], (b) ObMan [68], (c) Contact Pose [69], (d) HUMANISE [45], (e) GRAB [70], and (f) DexYCB [35].
Figure 5. Three-dimensional models can be represented as point clouds, voxels, or polygon meshes [77].
Table 1. Comparative analysis of grasp modeling techniques.

| Approach | Techniques | Applications |
|---|---|---|
| Geometric | Point clouds, 3D Gaussians, SDFs | Prediction of contact points and refinement of hand–object interactions [19,26,27]; applicability to precise positioning and shape adaptation; use of implicit surface modeling [19], contact by Gaussian approximation [27], and text-driven diffusion models [26]. |
| Physical interaction modeling | Optimal control, force/torque equilibrium, elastic energy | Evaluation and stabilization of contact points and grip forces [20,22,24]; dynamic correction via capsular triggers [22]; realistic poses via elastic energy optimization [23]; fine-tuning with control strategies [20]. |
| Tactile and sensor-based | Tactile data, haptic feedback, visual-tactile fusion | Tactile–visual grasp planning [28,29,30]; Bayesian optimization of secure configurations [30]; application to uncertain environments and transparent objects [28,36]. |
| Simulation and kinematic | Rigid-body dynamics, differentiable simulation, 3D pose tracking | Simulation of contact dynamics and sequences from approach to stabilization of the grip [35,50,51]; extraction of detailed kinematic data through simulations and multi-camera configurations [35,50]; application in realistic animation and testing of robotic grippers. |
| Machine learning | CNNs, transformers, VAEs, RL | Predicting poses and grip quality through deep learning [37,40,52]; improving touch-based grips through reinforcement learning [37]; generative approaches to modeling stable configurations and refining through physical constraints [14,53]. |
| Contextual | MRGs, contextual reasoning, language descriptions | Processing of task-specific constraints and scenes with high object density [44,46]; modeling dependencies between objects via manipulation relationship graphs [44]; contextual modules for task understanding and action prediction [45,46]. |
| Multidimensional | Co-attention, RGB-D and tactile fusion, 3D fitting | Integration of multidimensional data for more accurate modeling [31,32]; combining RGB-D and tactile information in manipulation tasks [32]; co-attention mechanisms for combining visual and tactile signals in grasp prediction [31]. |
| Human-centric learning | Human motion tracking, bimanual grasp, shape decomposition | Bimanual grip prediction [42]; structure extraction from human demonstrations for better grip quality [33]; motion tracking for adaptive learning [35]. |
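As an illustration of the SDF-based geometric techniques summarized in Table 1, the following minimal Python sketch uses an analytic sphere SDF as a stand-in for a learned or mesh-derived signed distance field, with made-up fingertip coordinates and an assumed 2 mm contact tolerance; it only demonstrates the contact/penetration test, not any specific method from the table.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return np.linalg.norm(points - center, axis=1) - radius

# Made-up fingertip positions (metres) around a 5 cm sphere at the origin.
fingertips = np.array([
    [0.051, 0.000, 0.000],   # 1 mm outside the surface
    [0.048, 0.000, 0.000],   # 2 mm inside the surface (interpenetration)
    [0.000, 0.000, 0.080],   # 3 cm away (no contact)
])
d = sphere_sdf(fingertips, center=np.zeros(3), radius=0.05)

contact_tol = 0.002                    # assumed 2 mm contact band
in_contact = d < contact_tol           # touching the surface or inside the object
penetration = np.clip(-d, 0.0, None)   # depth of any interpenetration

print("signed distances:", np.round(d, 4))
print("contact flags:   ", in_contact)
print("max penetration: ", penetration.max())
```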
Table 2. Comprehensive comparison of distinct datasets used in material synthesis and analysis.

| Dataset | Type | Materials Represented | Data Representation | Preprocessing | Text-to-Motion |
|---|---|---|---|---|---|
| HO-3D [21] | Real | Ten objects (plastic, organic materials, metal, paper, etc.) | Three-dimensional hand/object poses; segmentation masks | Manual alignment of grasps and masks | No |
| ObMan [68] | Synthetic | Eight object categories from ShapeNet (bottles, bowls, cans, jars, knives, cellphones, cameras, remote controls) | MANO hand model; object meshes; point clouds | Contact map derivation; SDF computation | No |
| GRAB [70] | Real | Fifty-one objects (plastic, glass, organic materials, etc.) | SMPL-X models; 3D contact and motion data | Annotation filtering; ground-truth cleanup | Yes |
| DexYCB [35] | Real | YCB objects (wood, metal, plastic, cardboard, natural materials, etc.) | Six-dimensional hand/object poses; RGB-D images | Downsampling; normalization | No |
| OakInk [71] | Real | Ceramic, metal, etc. | Point clouds; labeled grasp vectors | Point segmentation | No |
| ContactPose [69] | Real | Glass, metal, plastic, etc. | Three-dimensional hand/object poses; high-resolution thermal-based contact maps; multi-view RGB-D images | Thermal data preprocessing; object surface extraction | No |
| MOW [55] | Real | Organic, plastic, metal, etc. | RGB images; 3D reconstructed hand–object models | Object segmentation; 3D mesh selection | No |
| HUMANISE [45] | Synthetic | Indoor objects; full-body interactions | Three-dimensional human motion sequences; point clouds | Alignment of motion and scenes; text description generation | Yes |
| Tactile Glove [28] | Real | Ceramic, plastic, metal | Tactile images; 3D positions; velocities | Tactile encoding via CNN; data embedding | No |
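The contact-map derivation listed as a preprocessing step in Table 2 can be sketched as a nearest-neighbor distance test between hand and object point clouds. The sketch below is a hedged illustration only: the point clouds are random stand-ins (the 778-point hand cloud merely mimics the MANO vertex count) and the 5 mm threshold is an assumed value, not one taken from any cited dataset.

```python
import numpy as np
from scipy.spatial import cKDTree

# Stand-in point clouds; in practice these would come from a hand mesh
# (e.g., MANO vertices) and a sampled object surface.
rng = np.random.default_rng(1)
hand_points = rng.uniform(-0.1, 0.1, size=(778, 3))
object_points = rng.uniform(-0.1, 0.1, size=(2048, 3))

# Distance from every hand vertex to its nearest object point.
dists, _ = cKDTree(object_points).query(hand_points)

contact_threshold = 0.005                  # assumed 5 mm contact band
contact_map = dists < contact_threshold    # boolean label per hand vertex

print(f"{contact_map.sum()} of {len(contact_map)} hand vertices in contact")
```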
Table 3. Evaluation techniques for hand–object interaction.

| Evaluation Category | Techniques | Advantages | Limitations |
|---|---|---|---|
| Quantitative metrics | MPJPE [26,80,81], Chamfer distance [17,80,82], F-score [80,83], intersection volume [50,52], contact ratio [52,75], simulation displacement [18,78] | Objective and precise; widely used to assess pose and geometry accuracy; provides quantitative evaluation of joint-level accuracy (MPJPE) and of shape fidelity (Chamfer distance). | Fails to capture nuances of human perception or complex interactions; does not always correlate with subjective quality; comparisons across different object geometries are difficult. |
| Physical realism | Penetration depth [18,84], contact consistency [52,75], stability in simulation [18,78,84] | Ensures grasps are physically possible; penetration depth highlights unrealistic overlaps; stability in simulation tests interaction feasibility in robotic settings. | Simulations do not fully capture real-world physics; variability in object surfaces and friction coefficients may affect accuracy; computationally intensive for complex models. |
| Qualitative assessments | Visual inspection [72,85], perceptual studies (e.g., Likert scales, user feedback) [61,70,78] | Captures nuances of human perception; identifies naturalness and aesthetic aspects missed by quantitative metrics; effective for end-user-focused evaluations. | Subjective and often biased; requires large-scale user studies for statistical significance; difficult to replicate consistently. |
| Comparative analysis | Baseline methods (e.g., PointNetGPD [40], HOMan [86]), ablation studies [55,63] | Provides context by benchmarking against existing approaches; highlights the impact of individual model components. | Dependent on the choice of baselines; interdependencies between components are not always captured. |
| Dataset-specific metrics | Diversity metrics [18,26], coverage of real-world interactions [35,36] | Ensures models handle a wide range of interaction scenarios, making them robust for real-world use; diversity metrics capture variability across grasp styles and motions. | Dependent on dataset quality and diversity; limited coverage may reduce generalizability; metrics may overemphasize diversity while neglecting physical feasibility. |
| Real-world testing | Simulation-based validation [14,19], robotic experiments [30,87] | Tests the model in practical environments, validating its application readiness; highlights failure modes not evident in simulations. | Expensive and resource-intensive; real-world conditions such as lighting, clutter, or material properties may introduce variability, making results less reproducible. |
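To make the quantitative metrics in Table 3 concrete, the following minimal NumPy sketch computes MPJPE and a symmetric Chamfer distance on random stand-in joint and point arrays. It shows the forms commonly used for hand-pose and shape evaluation; exact definitions (e.g., squared versus unsquared Chamfer terms) vary slightly between papers.

```python
import numpy as np

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean squared nearest-neighbor distance in both directions."""
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    return (d.min(axis=1) ** 2).mean() + (d.min(axis=0) ** 2).mean()

rng = np.random.default_rng(2)
pred = rng.normal(size=(21, 3))                     # 21 hand joints, a common convention
gt = pred + rng.normal(scale=0.01, size=(21, 3))
print(f"MPJPE: {mpjpe(pred, gt) * 1000:.2f} mm")    # assumes coordinates are in metres

obj_pred = rng.uniform(size=(512, 3))
obj_gt = rng.uniform(size=(512, 3))
print(f"Chamfer distance: {chamfer_distance(obj_pred, obj_gt):.4f}")
```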