Article

HaDR: Hand Instance Segmentation Using a Synthetic Multimodal Dataset Based on Domain Randomization

Faculty of Mechanical Engineering, VSB—Technical University of Ostrava, 17. listopadu 2172/15, Ostrava-Poruba, 708 00 Ostrava, Czech Republic
* Author to whom correspondence should be addressed.
Submission received: 8 January 2026 / Revised: 9 February 2026 / Accepted: 10 February 2026 / Published: 13 February 2026

Abstract

Hand localization in cluttered industrial environments remains challenging due to variations in appearance and the gap between synthetic and real-world data. Domain randomization addresses this “reality gap” by intentionally introducing randomized and unrealistic visual features in simulated scenes, encouraging neural networks to focus on essential domain-invariant cues. In this study, we applied domain randomization to generate a synthetic Red-Green-Blue–Depth (RGB-D) dataset for training multimodal instance segmentation models, with the aim of achieving color-agnostic hand localization in complex industrial settings. We introduce a new synthetic dataset tailored to various hand detection tasks and provide ready-to-use pretrained instance segmentation models. To enhance robustness in unstructured environments, the proposed approach employs multimodal inputs that combine color and depth information. To evaluate the contribution of each modality, we analyzed the individual and combined effects of color and depth on model performance. All evaluated models were trained exclusively on the proposed synthetic dataset. Despite the absence of real-world training data, the results demonstrate that our models outperform corresponding models trained on existing state-of-the-art datasets, achieving higher Average Precision and Probability-Based Detection Quality.

1. Introduction

When it comes to flexibility in the manufacturing process, robot programming can be a barrier to customization and reconfiguration. In practice, programming is usually done through interaction with the robot control panel or through workspace simulation. Today, industrial robot programming increasingly considers human-centered interaction design [1] and often takes human–human interaction as a model for human–robot interaction modalities such as gestures, speech, and gaze [2,3]. Interfaces based on hand gestures may be considered more natural and easier to use, because gestures are a form of interaction that humans already employ in social communication; they are applicable not only to high-level robot action control but also to augmented reality, virtual reality, and general human–computer interaction. Typical approaches to gesture recognition and hand tracking use data gloves and various markers [4,5]; however, most approaches focus on vision-based systems since they do not require the user to wear additional equipment.
Machine learning and, in particular, deep learning (DL) approaches have made it possible to achieve high accuracy and robustness and have generally made recognition models more accessible, provided that an appropriate amount of data is used to train the model. In computer vision, instance segmentation is the task of performing pixel-level segmentation of individual instances (as opposed to semantic segmentation). It is more challenging and demanding than object detection because it requires both instance-level and pixel-level predictions. There have been notable recent advancements in the performance and accuracy of instance segmentation. Nevertheless, training these systems from scratch remains a challenge, since these methods depend on the availability of large, high-quality datasets, whose pixel-level annotation is expensive due to the time required for a human to manually label even a single sample.
One possible solution to address this limitation is to use graphical simulation platforms to generate large sets of automatically labeled samples without human intervention. Yet, in order to successfully use these simulated environments in an autonomous generation process, they must first be manually prepared, which is a tedious and expensive task because it requires careful modeling of specific environments with high attention to photorealistic details. Consequently, the cost required to generate photorealistic environments undermines the main advantage of synthetic data, namely, the generation of arbitrarily large amounts of labeled data. Domain randomization (DR) is a methodology that aims to reduce the expenses associated with generating large quantities of precisely labeled data. This approach involves deliberately disregarding photorealism and instead introducing non-photorealistic variations to the environment through random perturbations, such as adding random textures, light sources, and distracting objects. The goal is to force the network to generalize to essential domain features.
The motivation for this work was to mitigate a persistent problem of publicly available DL solutions (such as MediaPipe [6]): since they are mostly trained on real-world data, they are biased towards relying on the texture and skin color of the hand and degrade when these cues are absent. Our goal was to provide a fully synthetic dataset that would allow a DL model to be trained in a color-agnostic manner without using real camera data. Our intended field of application is an unstructured industrial environment, since we aim to further incorporate the developed system as part of a human–robot interaction interface. As the color of work gloves (which are often part of personal protective equipment [7]) and the background can often be difficult to distinguish from the Red-Green-Blue (RGB) data alone, we explored a multimodal input that includes color and depth information about the scene and analyzed the influence of its components, as opposed to our previous work in which only depth data were used to segment the image [8]. We focus on the instance segmentation task because it will further allow us to aggregate and filter all points related to the user’s hand and then combine these point clouds from multiple cameras to evaluate the shape and position of the hand more accurately (which would not be possible in the case of hand landmark recognition). To the best of the authors’ knowledge, this is the first attempt to create a color-agnostic RGB-D synthetic dataset that can mitigate the issue of overreliance on human skin features. In our previous work, we focused on different modalities and on the number of samples in the dataset and their impact on model predictions.
Instance segmentation models trained solely on our synthetic dataset outperform the corresponding models trained on state-of-the-art real and synthetic datasets by achieving up to 0.525 AP (average precision, COCO’s challenge standard metric Intersection over Union (IoU) at 0.5:0.95 [9], which is assumed throughout the text unless otherwise stated). Our models also outperform the MediaPipe hands solution in the bounding box detection task in both AP and Probability-Based Detection Quality (PDQ) [10] metrics on a challenging dataset representing an unstructured industrial environment, demonstrating independence from the color of the work gloves worn on the hands.
In this study, our aim was to develop a hand instance segmentation system which can handle conditions typical for industrial applications. Based on analysis and verification of current state-of-the-art methods, we defined expanded requirements for standard approaches. The developed solution was evaluated and verified with strict metrics, and the results and relevance were verified against state-of-the-art datasets and solutions.

2. Related Work

Recent advances in DL have boosted research and produced many breakthroughs in computer vision and machine learning in general [11]. However, supervised deep neural network training is highly dependent on the availability of a sufficiently large, domain-specific, and properly labeled dataset, which is time-consuming and generally complicated to prepare manually. Several simplifying approaches have been proposed to streamline the process of collecting large datasets. In the case of the domain of hand recognition, the hand detection task itself can be simplified using various (visual or magnetic) tags [4,5,12,13], sensor-equipped gloves (data gloves) [14,15], and specialized sensors [16]. Another alternative is represented by synthetic approaches in which each sample is constructed by combining a known ground truth with a random background [17] or by generating a complete sample using a Generative Adversarial Network (GAN) [18,19] or simulation environment [20,21].

2.1. Synthetic Dataset Generation

Synthetic datasets have gained popularity in recent years because they offer a way to generate arbitrarily large, accurately labeled datasets. A commonly used option is to incorporate renderers or full-fledged simulators into the dataset acquisition process. These visualization tools may generate the entire image or only the objects of interest while using a real photo as the background. Dwibedi et al. [22] and Georgakis et al. [23] proposed an alternative to complete image synthesis within a simulation, instead embedding real object images onto a set of randomly selected real background images.
Synthetic data can have a “reality gap” [24] with the real world due to limitations in fully replicating real world data, such as textures, lighting, and complex domain specifics [25,26]. Realistic simulations are only able to cover a user-defined scale of conditions, such as specific lighting conditions or a limited subset of object positions and interactions. As a result, the generated environments may only represent a portion of the many conditions that can occur in reality. There are two common approaches to overcoming this disparity: reducing the “reality gap” by trying to increase the similarity between the simulated and real environments [27], or focusing on features that are specific and important for generalization.
Achieving photorealism using high-fidelity rendering engines comes at the cost of computing resources and rendering time, as well as costs associated with manual synthetic scene layout, environment design, shader and effect adjustments, and implementation of domain rules within the simulated model. Several research groups [18,19,28] have attempted to use GANs to improve the photorealism of synthetic images as an alternative to high-fidelity rendering. Mueller et al. [19] utilized the CycleGAN image-to-image translation model [29] to transform synthetic hand images generated in a simulation into more photorealistic representations while ensuring geometric consistency. However, GAN-based approaches require additional effort to train the GAN itself.
As an alternative, the latter approach embraces the reality gap and, rather than trying to perfectly mimic reality, applies additional disturbances to the features over which the machine learning model should generalize. Sadeghi and Levine [30] extended the known domain adaptation approach [31] and proposed fully simulated vision-based policy training for indoor quadcopter navigation and collision avoidance tasks, where a simulation environment was used without relying on visual fidelity. The term domain randomization was introduced in the work of Tobin et al. [24], where the researchers attempted to mitigate the “reality gap” by generating unrealistic synthetic RGB training data with sufficient variability in the domain features. The randomized parameters included the number, positions, and shapes of distractor objects; the textures of all objects and of the background; the position, orientation, and field of view of the camera and of the lights in the scene; and the type and amount of random noise added to the images. This resulted in a dataset with a wide distribution of features, as opposed to one that might be observed when manually collecting real-world data, which helps to increase robustness to high environmental variability. Hinterstoisser et al. [32] combined DR with synthetic images generated by overlaying rendered objects of interest on real backgrounds, where these objects were randomized using random noise, illumination, and blurring. Tremblay et al. [33] used the DR approach to create a large dataset for training an object detection model, which reduced the time required to prepare the dataset and also provided better results than using photorealistic datasets or real data alone. Similar dataset generation approaches were applied by Dehban et al. [34], Khirodkar et al. [35], and Horvath et al. [36] for object detection tasks in various domains.
In many cases [33,37], synthetic datasets are complemented and fine-tuned with real-world images to achieve better results; however, in contrast, we use synthetic data to train networks that segment instances of real-world objects based only on the synthetic dataset while achieving stable results.

2.2. Hand Datasets

Here, we focus on the domain of hand detection by covering datasets for semantic and instance segmentation, key point detection, and pose estimation.
For hand detection and segmentation, typical datasets consist of manually annotated images obtained with a real RGB camera. The EgoHands dataset [38] includes 4.8K images of two people’s interactions, where pixel-level segmentation masks for each hand were manually annotated. Nuzzi et al. presented an RGB-D HANDS dataset [39,40] comprising samples with single hand gestures (only gesture classification was provided, and no segmentation masks were created).
Automatically annotated datasets may significantly simplify hand detection and segmentation using markers or colored gloves. An example of such a dataset was presented by A. Bojja et al. [41] with the HandSeg dataset containing 150K samples with random hand gestures in front of a depth camera (ground truth annotations were created automatically using color gloves and Hue–Saturation–Value (HSV) thresholding). Real camera datasets are advantageous as they capture realistic domain features. However, it is worth noting that they typically represent only a subset of environmental conditions, and as a result, they cannot cover the full range of scenarios, such as changes in obstacle positions, illumination, reflections, and shadows. This bias towards the captured conditions may affect the resulting model.
Fully synthetic datasets generated using renderers or simulators are advantageous because of the ease and speed with which they can acquire perfectly aligned multimodal samples. However, one of their limitations is their inability to represent domain-specific characteristics accurately and realistically. Mueller et al. [37] demonstrated pose and shape reconstruction of interacting hands using a model trained on a synthetic dataset (Dense Hands). The dataset consisted of depth image samples enhanced with RGB-encoded segmentation masks representing the vertices of a MANO hand model [20]. To overcome the limitations of the generated dataset (absence of background obstacles and augmentations), the model was also trained on real camera data to improve its generalization. Combining RGB and depth modalities provides more information on segmentation and increases the possible variability of the generated dataset. Zimmermann and Brox presented the RGB-D Rendered Hand Pose dataset (RHD) as part of their work dedicated to the estimation of the 3D hand pose from the camera image [21]. They used 3D character models matched to a highly parameterizable hand 3D MANO model [20]. Each dataset frame contained a randomly posed human character model (using realistic animation), while the camera position was randomly selected from a vicinity surrounding the hand of the character. The generated scenes included randomized backgrounds, global lighting, specular reflections, and directional light sources. However, the possible range of hand positions was limited, so the hand appeared only in the center of the image and never at the edges. The ObMan dataset [42] is a fully synthetic dataset consisting of RGB-D images created using 3D models of human characters holding various objects. The camera was placed randomly to capture various poses, backgrounds, textures, and lighting conditions. The dataset includes 150k fully annotated images (key points in the hand, segmentation masks for objects, and hands).
Our strategy differs from earlier works by utilizing domain randomization instead of relying on photorealistic synthetic images. Additionally, we aim to address limitations present in publicly available datasets, which include:
  • Being limited to RGB or depth information [37,38];
  • Supposing that the hand occupies the majority of the image area [43,44];
  • Only considering samples with a single instance [13,43,44];
  • Biasing instance locations toward the center of the image [21,41,42];
  • Supposing that the hand is the object closest to the camera [21,41,42];
  • A lack of distractor objects and obstructions around the hand [13,37,43,44].
Our decision to create a customized dataset was also driven by the fact that the currently available datasets assume a camera placement that is different from the one used in our specific industrial case. This mismatch, along with other factors, served as a motivation to generate our own dataset tailored to our specific requirements.

2.3. Instance Segmentation

As the scope of this paper does not encompass a comprehensive examination of the instance segmentation task, we direct the readers to surveys such as the work of Gu et al. [45].
Instance segmentation requires correctly detecting all objects in an image while semantically segmenting each instance at the pixel level. Although the objects in the image are drawn from a predefined set of semantic categories, the number of instances of these objects can vary arbitrarily. Recent approaches in this field can be classified into three distinct groups.
In the top-down approach to the segmentation of instances [46,47,48], the first step involves detecting the boundaries, followed by segmenting the instance mask within each of these boundary boxes. The Mask R-CNN (Region-Based Convolutional Neural Network) [49] is the most widely recognized architecture for instance segmentation as it expands upon the two-stage Faster R-CNN [50] object detector by incorporating a branch that segments object instances within detected bounding boxes.
The bottom-up approach to instance segmentation [51,52,53] produces instance masks by grouping pixels into a variable number of object instances within the image. This is achieved by assigning an embedding vector to each pixel, pushing apart the embeddings of pixels belonging to different instances and pulling together those of nearby pixels within the same instance. Subsequently, a clustering post-processing step is required to separate the individual instances.
Direct instance segmentation involves predicting both instance masks and their corresponding semantic categories in a single step, without requiring subsequent grouping processing. The SOLO [54] and SOLOv2 [55] architectures exemplify this approach, as they directly produce instance masks and the corresponding class probabilities in a fully convolutional manner without relying on bounding boxes or clustering. Unlike bottom-up and top-down methods, direct instance segmentation does not depend on precise bounding box detection, nor does it require per-pixel embedding and grouping learning [55].

3. Methods

This section describes the methodology used to enable robust hand instance segmentation in cluttered industrial environments. We first present a simulation-based data generation pipeline built on domain randomization, designed to overcome the reality gap while providing large-scale, accurately annotated RGB-D training data (Section 3.1). We then examine an alternative diffusion-based synthetic data generation approach to contextualize our design choices and highlight the advantages of rule-based simulation for controllability and annotation quality (Section 3.2). Next, we detail the model architectures and training procedure used to learn from the generated data (Section 3.3). Finally, we describe the real-world evaluation protocol and metrics employed to assess generalization performance and to quantify the contribution of multimodal inputs under challenging industrial conditions (Section 3.4).
We propose a method for training instance segmentation models using a synthetic dataset generated through simulation. The dataset is created to mimic real-world scenarios captured by a camera. To generate the synthetic dataset, we utilize domain randomization techniques, which are illustrated in Figure 1. Our dataset generator is based on the CoppeliaSim [56] simulation platform. The generator randomly places up to two instances of objects of interest in a 3D scene with random positions and orientations. Additionally, to improve the network generalization to real-world scenarios, we include a variety of geometric shapes and distractors (unrelated work tools) in the scene. These shapes and distractors have random textures applied to them. Furthermore, we insert a variable number of lights at random locations in the scene. The scene is then rendered in both RGB and depth components, and the dataset generation pipeline automatically generates ground-truth labels (masks) by adjusting object visibility parameters.
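To make the per-frame randomization concrete, the sketch below illustrates the kind of parameter sampling the generator performs. It is a simplified, self-contained example (plain NumPy rather than the actual CoppeliaSim script); all ranges, counts, and field names are illustrative assumptions, not the exact values used in our pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_frame_parameters():
    """Sample one randomized scene configuration (illustrative ranges only)."""
    return {
        # number of hand instances (0-2) and distractor objects
        "num_hands": int(rng.integers(0, 3)),
        "num_distractors": int(rng.integers(3, 12)),
        # per-distractor pose: position in metres, yaw-pitch-roll in radians
        "distractor_poses": rng.uniform(
            low=[-0.5, -0.5, 0.2, -np.pi, -np.pi, -np.pi],
            high=[0.5, 0.5, 1.0, np.pi, np.pi, np.pi],
            size=(12, 6),
        ),
        # random RGB texture colors and light setup (1 to 4 directional lights)
        "texture_colors": rng.uniform(0.0, 1.0, size=(12, 3)),
        "num_lights": int(rng.integers(1, 5)),
        "light_positions": rng.uniform(-1.0, 1.0, size=(4, 3)),
    }

params = sample_frame_parameters()
print(params["num_hands"], params["num_lights"])
```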

3.1. Domain Randomization for Dataset Generation

Our scripted CoppeliaSim environment simplifies the generation of large-scale synthetic datasets with accurate annotations at the pixel level. The conditions of the simulated scene are adjusted to match the intended use and typical camera parameters. The implemented pipeline for dataset generation produces a multimodal output as we assume that each modality provides complementary information and aids model performance in cases where a single modality would fail (e.g., due to blending with background or insufficient depth resolution) by providing distinctive features to the networks. Similarly to the approach of Tremblay et al. [33], our expectation is that generating datasets comprising non-realistic images will force the network to differentiate the most salient features related to the shape of the objects, and this, in turn, would result in better generalization to actual images.
The simulation environment employs a vision sensor to emulate a real camera, with its settings based on the field of view of a standard RGB-D camera (using values for the RealSense L515). During dataset generation, the 3D hand model is moved through a grid (20 mm increments in each axis) across the entire field of view of the camera, which allows for a more uniform generation of dataset samples. The hand’s orientation is semi-random; however, it follows specific policies to ensure that it remains within the field of view of the vision sensor. Specifically, the fingertip point and the point in the center of the palm are always within the truncated pyramid that corresponds to the camera’s field of view. Furthermore, the roll of the palm is limited to ±15° and the pitch to ±30°. The vision sensor captures both depth and color images of the scene while scaling the depth pixel values in the range [0.2, 1.0] m into a single-channel, 8-bit grayscale image. By our definition, random objects in the background can be closer to or farther from the camera than the hand; see Figure 2.
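A minimal sketch of this depth-to-grayscale mapping is shown below, assuming the depth image is given in metres as a NumPy array; the clipping range follows the [0.2, 1.0] m convention used for both the synthetic and the real data.

```python
import numpy as np

def depth_to_grayscale(depth_m: np.ndarray,
                       near: float = 0.2,
                       far: float = 1.0) -> np.ndarray:
    """Rescale metric depth into a single-channel 8-bit image."""
    clipped = np.clip(depth_m, near, far)
    normalized = (clipped - near) / (far - near)   # 0.0 at 0.2 m, 1.0 at 1.0 m
    return (normalized * 255).astype(np.uint8)
```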
The placement of these objects in the scene is also randomized with a policy in place that prevents them from overlapping with the hand in the camera view. If an overlap occurs, the object is relocated until this requirement is met. Random distractors are added to the scene to reduce the sensitivity of the system to irrelevant objects that might be misidentified as fingers or hands. To avoid overfitting during model training, most textures are intentionally designed with unrealistic patterns. Separate vision sensors are responsible for capturing only instances of hands and outputting a single binary mask for each instance. The resulting images and masks are resized to a resolution of 320 × 256. The samples are stored in lossless Portable Network Graphics (PNG) format.
The following aspects of the scene are randomly varied during the generation of the dataset at each frame:
  • Number, colors, textures, scales and types of distractor objects selected from a set of 3D models of general tools and geometric primitives. A special type of distractor is used—an articulated human body model without hands (see Figure 2b).
  • Hand gestures (see Figure 3).
  • Positions and orientations of the models of the hand.
  • Texture and surface properties (diffuse, specular, and emissive) of the objects of interest and of the background, as well as the number of objects of interest (from none to two).
  • The number and positions of directional light sources, ranging from one to four, along with a planar light that provides ambient illumination.
The simulated scene contained only right hands; however, to help the models learn both sides, we used flip augmentation during training. Our dataset generation pipeline also addressed the generation of instance-free samples in which only distractors are present. The maximum number of instances per sample was limited to 2 (see Figure 4) since we assume one user in our industrial scenario. For our DR dataset, we generated a total of 117k images. Annotations in Common Objects in Context (COCO) [9] format were generated using PyCocoCreator [57].
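For illustration, converting one rendered binary instance mask into a COCO-style annotation entry can be sketched with pycocotools (rather than the PyCocoCreator helpers used in our pipeline); the category id and function name here are assumptions for the example.

```python
import numpy as np
from pycocotools import mask as mask_utils

def mask_to_coco_annotation(binary_mask: np.ndarray,
                            image_id: int,
                            annotation_id: int,
                            category_id: int = 1) -> dict:
    """Convert one binary instance mask into a COCO-style annotation dict."""
    rle = mask_utils.encode(np.asfortranarray(binary_mask.astype(np.uint8)))
    area = float(mask_utils.area(rle))
    bbox = mask_utils.toBbox(rle).tolist()         # [x, y, width, height]
    rle["counts"] = rle["counts"].decode("ascii")  # make RLE JSON-serializable
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "segmentation": rle,
        "area": area,
        "bbox": bbox,
        "iscrowd": 0,
    }
```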
Our dataset contains only a relatively small number of images; however, since we traverse the entire available workspace in a grid during the collection process, the dataset covers the range of possibilities more uniformly than other datasets (more on this in Section 4.2). This is due to the fact that the collection process for other datasets typically involves random generation without using a grid-based approach; furthermore, most datasets assume that hands can only appear in a certain region of the image. The particular distribution observed in Figure 5 is due to the fact that we generate images in the field of view of the camera, which is represented by a pyramid.

3.2. Alternative Synthetic Methods

In addition to the simulator-based approach with its rule-based image generation process, we also explored a different approach to image generation. We used the Stable Diffusion 1.5 model with ControlNet by Zhang et al. [58]. ControlNet conditions the diffusion-based generative network to produce output in a specific form and can take different kinds of input, such as a depth image or pose data. As demonstrated in Figure 6, the results are interesting but highly unpredictable. The results shown were produced with the input prompt “realistic human hand above the table with tools”. The upper row shows a depth-image input with a successfully generated RGB image, and beneath it is an image generated from a pose input, which fails in this example. In other cases, a given input either produced a successful result or the image did not follow the prompt. To create a large number of images, the poses could be generated from the output of pose estimation systems such as OpenPose or MediaPipe and used to control the generative network. For ground-truth generation, further post-processing is required. Using generative networks is very promising for future research.
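A depth-conditioned generation of this kind can be sketched with the Hugging Face diffusers library as below; the model identifiers, file names, and sampling settings are assumptions for illustration, not the exact configuration used in our experiment.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Assumed model identifiers: SD 1.5 base with a depth-conditioned ControlNet.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

depth_image = Image.open("depth_condition.png")  # 8-bit depth map used as the control signal
result = pipe(
    prompt="realistic human hand above the table with tools",
    image=depth_image,
    num_inference_steps=30,
).images[0]
result.save("generated_rgb.png")
```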

3.3. Training Process

This study adopted two state-of-the-art neural networks, Mask R-CNN and SOLOv2 (each with two backbone options, ResNet-50 and ResNet-101, as feature extractors), using their open-source implementations in the MMDetection framework [59]. All models were trained for 20 epochs without freezing the backbone, with an initial learning rate of 0.02 that was divided by 10 after epochs 7 and 18. The training was performed with stochastic gradient descent (SGD) on 2 GPUs with 2 images per mini-batch. The remaining hyperparameters were kept at standard values, with a weight decay of 0.001.
To improve sensitivity to both left and right hands, we employed horizontal and vertical image flipping augmentations during training. This was necessary since the datasets were generated solely using a model of the right hand. All networks were initialized with COCO-pretrained (MS COCO train2017 [9]) 3-channel weights (ResNet backbones initially pre-trained on ImageNet) to reduce the required training time and dataset size. The dataset was divided into training and validation sets with an 80/20 ratio. The training was performed on a workstation with an Intel Core i7-9700F CPU and 32 GB of RAM, equipped with two MSI NVIDIA GeForce RTX 3080 Ti VENTUS 3X 12G OC (New Taipei City, Taiwan) graphics cards.
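An MMDetection (2.x-style) configuration fragment consistent with the schedule and augmentation described above might look as follows; this is a sketch of the relevant fields only (resizing and normalization steps are omitted), and the checkpoint path and per-GPU batch split are assumptions, not the released configuration files.

```python
# Fragment of an MMDetection 2.x config reflecting the training setup above.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', step=[7, 18], gamma=0.1)   # lr divided by 10 at epochs 7 and 18
runner = dict(type='EpochBasedRunner', max_epochs=20)

data = dict(samples_per_gpu=2, workers_per_gpu=2)          # 2 images per mini-batch (assumed per GPU)

# Horizontal and vertical flips compensate for the right-hand-only synthetic scenes.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='RandomFlip', flip_ratio=0.5, direction=['horizontal', 'vertical']),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]

# COCO-pretrained weights are loaded to initialize the 3-channel models.
load_from = 'checkpoints/mask_rcnn_r50_fpn_coco.pth'  # assumed local path
```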

3.4. Model Evaluation

To assess the performance of domain randomization, we compared the results of instance segmentation models trained on different datasets and evaluated on a test dataset corresponding to the cluttered and unstructured industrial environment, which was acquired using a RealSense L515 calibrated to spatially align the depth and RGB frames. During our experiment, we employed a 640 × 480 depth stream which underwent processing through a modified Intel RealSense librealsense library equipped with improved depth image filtering. To pre-process the data, we used a colorization tool that rescaled the depth information (in the range [0.2, 1.0] m) to an 8-bit value. Additionally, we utilized a customized hole-filling filter to eliminate depth image shadows that arise from the camera’s stereoscopic technology. To restore missing depth information in obstacle-free areas, we incorporated a static image of the scene captured in the workplace. Our test dataset contains 706 images (see Figure 7) obtained in different circumstances, including variable lighting, background and obstacles, number of hands, and different work gloves (red, green, white, and yellow) with different sleeve lengths. For this dataset, the same assumption was made that the system would have only one user, so the maximum number of instances per sample was limited to 2. This evaluation setup creates difficult conditions for the models: they were not evaluated on a held-out part of the training dataset but on real camera data, which tests their generalization to different conditions and environments and is relevant to the application conditions in real scenarios.
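A simplified version of the background-based hole filling described above is sketched below; it assumes both the live depth frame and the static scene capture are already colorized to the same 8-bit convention, with 0 marking missing depth (the actual filter also performs additional filtering steps).

```python
import numpy as np

def fill_depth_holes(depth_frame: np.ndarray,
                     static_scene_depth: np.ndarray) -> np.ndarray:
    """Replace missing (zero) depth pixels with the static workplace capture."""
    filled = depth_frame.copy()
    holes = filled == 0                      # pixels with no depth measurement
    filled[holes] = static_scene_depth[holes]
    return filled
```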
To evaluate the average precision for small, medium, and large objects separately, we analyzed the area distribution of the instances in our dataset (see Figure 8a), as the commonly used values from the COCO dataset are clearly out of place for our case: as shown in Figure 8b, the “small” category would contain only two instances. Figure 8c shows the adapted categorization of instances into small, medium, and large categories according to their size. This ensures that each category contains the same number of instances, allowing for a fairer comparison.
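The rebalanced size categories can be plugged into the COCO evaluation through the area-range parameters, as sketched below; the file names and area boundaries are hypothetical placeholders, since the actual boundaries are read off the dataset statistics in Figure 8.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("test_annotations.json")            # assumed ground-truth file
coco_dt = coco_gt.loadRes("predictions.json")      # assumed detection results

evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
# Custom area boundaries (in pixels^2) chosen so each bin holds about a third
# of the instances; the numbers below are placeholders, not the values from Figure 8.
small_max, medium_max = 3000, 9000
evaluator.params.areaRng = [[0, 1e10],
                            [0, small_max],
                            [small_max, medium_max],
                            [medium_max, 1e10]]
evaluator.params.areaRngLbl = ["all", "small", "medium", "large"]
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
```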
We hypothesize that to obtain reliable predictions in a complex unstructured environment, it is necessary to employ a multimodal input that integrates both color and depth information, with each modality contributing to enhancing the accuracy of model predictions. To assess this hypothesis, we investigate the individual impact of each modality, as well as their synergy, on the overall performance of the model in relation to the chosen metrics.

4. Results

We evaluated our approach in a set of experiments that compared the results of the models trained on our DR dataset with the results of the same networks trained on publicly available hands datasets and with the existing ready-to-use hand tracking solution MediaPipe Hands [6]. The experiments conducted in this study involved training the model exclusively on synthetic datasets without any additional fine-tuning on real images. The purpose of this approach is to evaluate the effectiveness of the proposed method in terms of generalization. The evaluation was done using AP and PDQ metrics and qualitative analysis of the output, where the actual predictions were examined.

4.1. Comparison with Pretrained Models

As the first evaluation strategy, we compared the AP of the custom-trained models with that of pre-trained RGB COCO models, since the COCO dataset itself contains the Person class, which offers a simple way of extracting information about the hands. To make a fair comparison with the COCO-pre-trained models, a second version of the real dataset was prepared, in which the masks were adjusted to represent the whole arms. Since IoU is a relative metric, this procedure allows for a fairer assessment of the performance of the COCO-trained models. We evaluated the trained models on the presented test dataset using the AP@0.5:0.95 and AP@0.5 metrics (since our task consists only of a single class) with the default minimum score threshold of 0.1; the results are presented in Table 1.
It can be seen that, in general, the Mask R-CNN models have better results for all modalities (the only exception is for the pre-trained COCO RGB models, where the results of the models with the corresponding backbones differ by a factor of 2.5). In terms of modalities, RGB enabled the best results for all models, while the RGB-D models performed comparably or worse by up to 0.065 AP, achieving AP 0.449 for the Mask R-CNN ResNet50 model. The depth models performed similarly and significantly worse than the RGB models, reaching a maximum of 0.364 AP on the Mask R-CNN with the ResNet50 backbone. We assume that these results are due to differences in the properties of the simulated and real depth images. The best AP was obtained with Mask R-CNN ResNet50 RGB, which reached an AP of 0.523. The ratios of the results of the models evaluated using AP@0.5 are similar; however, the differences between their performances are less prominent, especially considering RGB and RGB-D models. The best-performing models remain the same for all modalities—the Mask R-CNN models, achieving 0.821 AP for RGB input. This outcome is somewhat surprising as it was anticipated that the use of RGB-D data, which provides more information, would result in improved predictions compared to RGB. All models trained on our DR dataset show a significant increase in metrics compared to models trained using the COCO dataset. We assume that the lower AP shown by models trained with COCO is due to the fact that these models perform better when a larger portion of the human body can be observed in the image. When testing with a real camera, we noticed that the pre-trained COCO models with the Person class performed better when the image contained part of the human body; however, if only the hands were visible or if gloves were on the palms, the results deteriorated significantly. It was also found that SOLO models had significantly lower prediction confidence scores compared to the corresponding Mask R-CNN models. This is in part due to the way the confidence scores are calculated in each approach; however, we assume that the differences in the characteristics of the generated DR dataset and the actual camera images also play a role in the low prediction scores. The evaluation of the models using the COCO evaluation API for different confidence score thresholds leads to similar results for all models (see Figure 9). The AP of the models monotonically decreases with increasing thresholds, but the trend of the decrease is quite different for the Mask R-CNN and SOLOv2 models, where the deterioration of the results is much steeper for SOLOv2, while the Mask R-CNN maintains a high AP up to a reliability score threshold of 0.95 without any significant deterioration. The other COCO metrics (APsmall, APmedium, APlarge, etc.) show similar trends for each model, but the ratios between them are ordered differently for each model than for the AP comparison.
The choice of the confidence score threshold plays a critical role in the evaluation of the performance of a model. Lower confidence thresholds may result in a higher number of false positives, while higher thresholds result in more false negatives. The mean average precision score, as used in the COCO evaluation, penalizes false positives only marginally [10]: multiple true-positive bounding boxes predicted for a single instance are not penalized as false positives as long as they satisfy the IoU threshold used in the evaluation. However, in practical applications, a high number of false positives can have significant negative consequences (especially in the context of robotics).
To further investigate the performance of our models, we utilized the probability-based detection quality (PDQ) metric presented by Hall et al. [10]. PDQ deals with a probabilistic evaluation of object detection, explicitly penalizing false positives and false negatives. PDQ evaluates the best match between the detected instances and ground-truth labels for each sample, with mismatched bounding boxes considered false positives. The search for the optimal confidence score threshold was carried out by evaluating the PDQ as a function of the confidence score threshold in the range [0.0, 1.0] with 0.025 steps (see Figure 10), following the experiment presented by Wenkel et al. [60]. The PDQ evaluation currently supports only the bounding box-based evaluation; thus, it was performed using the bounding boxes predicted by the Mask R-CNN models and bounding boxes calculated for the masks predicted by the SOLOv2 models.
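The threshold sweep can be expressed as a simple loop, as sketched below; evaluate_pdq stands in for the evaluation code released with [10] and is a hypothetical placeholder, as is the detection format assumed here.

```python
import numpy as np

def find_optimal_threshold(detections, ground_truth, evaluate_pdq):
    """Sweep confidence thresholds in [0, 1] with 0.025 steps and keep the best PDQ.

    `evaluate_pdq` is a placeholder for the probabilistic detection quality
    evaluation of Hall et al.; `detections` is a list of dicts with a 'score' key.
    """
    best_threshold, best_pdq = 0.0, -1.0
    for threshold in np.arange(0.0, 1.0 + 1e-9, 0.025):
        kept = [d for d in detections if d["score"] >= threshold]
        pdq = evaluate_pdq(kept, ground_truth)
        if pdq > best_pdq:
            best_threshold, best_pdq = threshold, pdq
    return best_threshold, best_pdq
```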
Calculating the PDQ as a function of the confidence score threshold results in curves that are distinct from the curves obtained from the COCO metrics (Figure 9). Taking into account these PDQ curves, we identified the confidence score threshold that represents the optimal operating point for each of the models. Unlike the COCO evaluation, the PDQ imposes penalties on both false negatives and false positives while also assessing the confidence scores of the predictions. Therefore, the curve could be interpreted as follows: selecting a confidence score threshold lower than the one corresponding to the maximum PDQ value results in a higher occurrence of false positives, while choosing a higher threshold leads to an increasing number of false negatives. The curve of each model is distinctive; however, in general, the Mask R-CNN and SOLOv2 models have their own typical trends. Table 2 presents the results of the analysis of the optimal confidence score, in which the threshold that yields the maximum PDQ is selected as the optimal confidence threshold. Furthermore, the table provides information on the AP value at the same confidence threshold, along with the highest AP value achieved by the model (which is achieved by all models at the score threshold of 0, as illustrated in Figure 10).
SOLOv2 models typically have an optimal confidence score threshold in the range [0.5, 0.6], while Mask R-CNN models tend to perform the best, with a score threshold greater than 0.825. According to the results shown in Table 2, the best-performing model is Mask R-CNN ResNet50 RGB-D, obtaining up to 0.1576 PDQ at a confidence score threshold of 0.9, while retaining high AP.
In addition, the RGB-D models performed consistently better in terms of the PDQ metric, while the previously high-performing Mask R-CNN RGB models obtained the worst results; the corresponding depth-based models achieved results that were up to two times better.
Figure 11 shows several sample predictions for the evaluated models. The examples include various environments and lighting conditions, as well as different work gloves and background combinations, which makes instance segmentation challenging.
The results observed during the qualitative evaluation also support the conclusions drawn from the PDQ evaluation. The qualitative evaluation was performed by comparing the mask generated by the model with the ground-truth mask of the image from the test dataset. The evaluation threshold was set according to the previous PDQ analysis; due to this setting, the predictions largely do not contain false positive regions. Compared to the AP evaluation, where the models working with RGB input showed the best results, in the qualitative evaluation these models achieved the worst results in terms of the quality of instance identification. In the columns in Figure 11d,e, corresponding to RGB images, 6 out of 10 instances are identified; in the columns in Figure 11a,f, corresponding to the SOLOv2 models, 7 out of 10 instances are identified when using depth-only and RGB-D inputs. The best results are achieved by the Mask R-CNN models, where 9 out of 10 instances are identified when using depth input (Figure 11b) and all instances are identified with the RGB-D input (Figure 11g). According to the qualitative test, prediction based on RGB-D images shows the best results in terms of the number and quality of detected instances. In terms of the difficulty of recognizing the hand instance, the cases of a white glove and of the background blending with the hand color were the most difficult; here the RGB-D models showed the best results, while the RGB models produced a higher number of false positives. In contrast, all models identified the cases where the hand color was significantly different from the background.

4.2. Comparison with Existing Datasets

To benchmark our dataset against existing works in this field, we adapted several popular publicly available hand datasets (see Table 3; descriptions are provided in the Related Work section). Statistical distributions of these datasets are shown in Figure 12; it can be observed that instances are biased towards the centers of the images. In order to adapt the datasets to our pipeline, the following modifications were applied:
  • Unified and merged class masks: DenseHands—dense correspondence maps were binarized and used as masks for hand instances; ObMan and RHD—all masks except hands were omitted.
  • Depth range: The depth range [0.2, 1.0] m was assigned to the byte range according to the settings in the test environment. The remaining range was truncated to the 1 m limit.
The code for adapting the datasets can be accessed in our GitHub repository. The models were trained using the adapted datasets with the training parameters outlined in Section 3.3. We evaluated the trained models on the test dataset, which is representative of the expected environment. Many existing datasets for hand recognition rely on certain assumptions, such as the hands being the closest objects to the camera, being in the center of the image, and the absence of other objects in the frame. In reality, these assumptions may not hold in all situations, and we assume that they influenced the evaluation results presented in Table 4.
The PDQ curves generated by models trained on external datasets resemble those of our DR dataset in shape but have lower values for any given score threshold, resulting in lower maximum PDQ values.
The DenseHands dataset showed the worst generalization in terms of both PDQ and AP metrics. This dataset does not contain any distractors in the scene; in addition, the images in this dataset are characterized by low variability in the position and orientation of the hand. In contrast, the ObMan dataset, which contains several background objects, including a 3D human model, showed better results in terms of both PDQ and AP metrics. However, the ObMan hands are predominantly located in the center of the image while maintaining a similar distance from the camera. The high variability of the RHD images allows the corresponding models to obtain results similar to those obtained with ObMan. However, the absence of distractors in the RHD dataset limits the quality of the predictions. The results of the models trained on RHD and ObMan are similar, but the RHD models obtain slightly better results in terms of PDQ, whereas the ObMan models obtain slightly higher APs.
Despite having the smallest sample size among all datasets, EgoHands produced the best results for both metrics (PDQmax 0.1040, APmax 0.25). However, it is also one of the two datasets based on real camera data, so we assume that this factor is the reason for such a high result. The HandSeg dataset, which was also collected with a real depth camera, shows low variability in dataset features and performs significantly worse than EgoHands in terms of both metrics. The best results for all datasets were achieved using Mask R-CNN models.
However, in general, the results obtained by training the models on the existing datasets are much worse than the results obtained by training the models on our DR dataset, despite the fact that this dataset is randomly generated. In general, these findings highlight the importance of the characteristics of the dataset, particularly the inclusion of distractors and the variability of the image content, for the development of effective hand detection models.

4.3. Comparison with Existing Solution

In order to compare the performance of our trained models with the state-of-the-art MediaPipe solution (release 0.9.0), we evaluated bounding-box detection for both models. Since MediaPipe predicts landmark positions for each hand and SOLOv2 models output instance masks, we enveloped their predictions in axis-aligned bounding boxes. The complexity of the MediaPipe model was set to 1 (denoting the more accurate and therefore more complex model), and the static image mode was enabled, as the images in the dataset do not represent a video sequence and the use of MediaPipe tracking would not be beneficial. The maximum number of recognized hand instances was set to 2 since in our dataset the maximum number of instances per sample was also limited to 2. AP and PDQ evaluations were performed for confidence thresholds in the range [0.0, 1.0] with 0.025 steps. The evaluation results are shown in Table 5.
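Enveloping MediaPipe hand landmarks in an axis-aligned bounding box can be done as sketched below, using the MediaPipe Hands solution API with the settings listed above; the function name is ours, and boxes are returned in pixel coordinates.

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True,
                                 max_num_hands=2,
                                 model_complexity=1)

def detect_hand_boxes(image_bgr):
    """Return one [x_min, y_min, x_max, y_max] box per detected hand."""
    h, w = image_bgr.shape[:2]
    results = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    boxes = []
    for landmarks in results.multi_hand_landmarks or []:
        xs = [lm.x * w for lm in landmarks.landmark]
        ys = [lm.y * h for lm in landmarks.landmark]
        boxes.append([min(xs), min(ys), max(xs), max(ys)])
    return boxes
```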
The results of this evaluation show that a deep learning model for hand detection based on Mask R-CNN trained on a custom synthetic dataset outperforms the state-of-the-art MediaPipe solution in terms of both PDQ and AP metrics when evaluated on bounding boxes. Specifically, the MediaPipe solution achieved a PDQ of 0.0836 and an AP of 0.18, while the proposed Mask R-CNN model trained on the synthetic RGB-D dataset achieved a PDQ of 0.1546 and an AP of 0.455. To better understand the reason for this outcome, we made a qualitative comparison of the models’ predictions. We also compared our solution with the YOLOv10 model trained on the HaGRIDv2 dataset [61], which contains over 1 million annotated gestures and performs very well in hand and gesture detection. Examples of side-by-side predictions are shown in Figure 13.
It can be observed that MediaPipe’s predictions tend to fail when work gloves are used. In addition, it is worth noting that red and yellow gloves were predicted with a higher probability than green ones, which we assume is due to the similarity of typical skin colors to the red part of the color spectrum. In contrast, our best-performing model, Mask R-CNN ResNet50 RGB-D, predicts the instances more stably regardless of illumination conditions and work glove color. We do not claim that our trained models are superior to MediaPipe and can fully replace it, especially considering the wide range of capabilities that MediaPipe provides. However, within our specific industrial setting, our models effectively recognize hand instances and perform this particular task. In the comparison with the YOLO model, we observed results similar to those of the original PDQ study [10]: YOLO is very confident in its predictions and reaches the best AP of 0.645 with the largest version of the model, while its PDQ of 0.05 is lower than that of the other models.

5. Discussion

Our initial goal was to develop an instance segmentation solution for localization of the operator’s hands. In the process, we also developed a cross-validation platform for models with different outputs and datasets of varying complexity. Our results show the importance of rigorous validation of model performance and that the frequently used average precision metric is not always sufficient.
Our qualitative comparison of other models and solutions showed that even when a model achieves better quantitative results according to AP, the PDQ-based comparison better reflects which predictions are actually more accurate.
It should be mentioned that the synthetic data generator employed in this study has certain limitations, such as a restricted range of hand meshes and the absence of simulation models for hand interactions with tools. Increasing the variety of hand meshes and allowing them to interact with each other could create more variations in the synthetic dataset and improve the generalization of the models trained on it. Furthermore, simulating hand interactions with a diverse set of tools may help to narrow the discrepancy between synthetic data and real-world images and therefore present promising avenues for future research.

6. Conclusions

An essential factor for achieving optimal performance in deep learning-based applications is the availability of large, accurately labeled training datasets. However, manual data collection and annotation can be a laborious and costly process. This study follows on from the topic of synthetic dataset generation [8], which incorporates domain randomization to create large and accurately annotated datasets for multimodal hand instance segmentation. The generator randomizes several scene features, including hand pose and texture, scene texture and lighting, and simulated noise and occluding objects. Although the resulting images are crude and not photorealistic, these characteristics can encourage the deep neural network to focus on the fundamental problem structure instead of details that may be absent in real-world scenarios during testing, improving the network’s ability to generalize. The high data variety in the samples reduces the likelihood that the models overfit to synthetically generated training datasets. Furthermore, these datasets can be created more quickly and with less expertise required.
The effectiveness of the synthetic datasets was assessed by training state-of-the-art instance segmentation models (SOLOv2 and Mask R-CNN) and evaluating their performance on a complex test dataset that included real images of hands. In general, the Mask R-CNN models outperform the other models for all modalities, with the RGB-D models performing comparably with the RGB models in terms of average precision. All models trained on the DR dataset showed a significant increase in metrics compared to models trained using the COCO dataset. The SOLO models had significantly lower prediction confidence scores compared to the corresponding Mask R-CNN models, partly due to the way the confidence scores are calculated in these models but also potentially due to differences in the characteristics of the generated DR dataset and the actual camera images. Since the choice of confidence score threshold plays a critical role in evaluating the performance of a model, the probability-based detection quality (PDQ) metric was used to find the optimal confidence score threshold. The Mask R-CNN ResNet50 RGB-D model was found to be the best-performing model, achieving high PDQ at a high confidence score threshold while retaining high average precision. The qualitative evaluation also supports the conclusion that the RGB-D models outperform both depth and RGB models in terms of the quality of instance detection. By comparing our DR dataset to several publicly available hand datasets, we found that many existing datasets for hand recognition rely on certain assumptions, such as the hands being the closest objects to the camera and being located in the center of the image, and the absence of other objects in the image. The results of AP and PDQ evaluations obtained by training the models on the existing datasets (even those based on real camera data) are significantly worse than the results obtained by training the models on our DR dataset. This again highlights the importance of the characteristics of the dataset in the development of accurate hand detection models. In particular, incorporating variability in image content and the presence of random distractors can significantly improve the performance of such models, making them more suitable for real-world applications in various environments.
The trained instance segmentation models are intended to be part of the human–robot interaction interface. For this particular scenario, it is essential that trained models recognize instances regardless of their color as workers often use work gloves in industrial environments. In order to compare the performance of our trained models with the state-of-the-art MediaPipe solution, we evaluated both models in terms of AP and PDQ metrics for bounding boxes. The results of this evaluation show that a deep learning model for hand detection based on the Mask R-CNN trained on a custom synthetic dataset outperforms the state-of-the-art solution MediaPipe in terms of both PDQ and AP metrics when evaluated on bounding boxes. Our qualitative comparison of model predictions showed that MediaPipe’s predictions tend to fail when work gloves are used and that red and yellow gloves were predicted with higher probability than green ones. We do not claim that our trained models are superior to MediaPipe and can fully replace it, especially considering the wide range of capabilities that MediaPipe provides. However, within our specific industrial setting, our best-performing model predicts instances more stably regardless of illumination conditions and work glove colors. In general, our work highlights the importance of using a high-quality and diverse dataset to train hand detection models in industrial settings and provides a promising direction for future research in this area.
The findings of this study indicate that deep convolutional neural networks can be trained to accurately segment hand instances in real-world images utilizing only synthetic datasets. This method has the advantage of not requiring manual annotation, a labor-intensive and time-consuming process, and of producing pixel-level accurate masks for each instance. Conversely, annotated datasets created manually are vulnerable to errors that may arise from human oversight, leading to the production of noisy data because multiple annotators may have discrepancies in their annotations. The use of synthetic datasets eliminates the need to collect and annotate large training datasets for a specific object. Instead, it becomes possible to create extensive datasets for various objects of interest utilizing their CAD models. This is particularly advantageous in specialized tasks, such as bin picking and logistics, where data may be limited. It is crucial to comprehend the essential characteristics that are required in the training dataset when utilizing synthetic data for deep learning. This is necessary to enable the model to generalize effectively to real-world images. For hand instance segmentation, the model must be capable of learning the features associated with the shape of the hand. Consequently, the synthetic data generator was specifically designed to incorporate both realistic and unrealistic textures to encourage the network to learn shape-related features. This helps to ensure that the model perceives real-world images as mere variations of the synthetic data on which it was trained.

Author Contributions

Conceptualization, S.G. and A.V.; methodology, S.G.; software, S.G.; validation, A.V. and J.C.; formal analysis, J.C.; data curation, A.V. and J.C.; writing—original draft preparation, S.G.; writing—review and editing, A.V.; visualization, A.V.; supervision, S.G. All authors have read and agreed to the published version of the manuscript.

Funding

This article has been produced with the financial support of the European Union under the REFRESH – Research Excellence For REgion Sustainability and High-tech Industries project number CZ.10.03.01/00/22_003/0000048 via the Operational Programme Just Transition. The article received financial assistance from the state budget of the Czech Republic as part of the SP2024/082 research project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset and pretrained models are available at: https://doi.org/10.34740/KAGGLE/DS/2970535 (accessed on 7 January 2026). The code is available at: https://github.com/anion0278/HaDR (accessed on 7 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pan, Y.; Chen, C.; Zhao, Z.; Hu, T.; Zhang, J. Robot teaching system based on hand-robot contact state detection and motion intention recognition. Robot. Comput.-Integr. Manuf. 2023, 81, 102492. [Google Scholar] [CrossRef]
  2. Li, S.; Zheng, P.; Liu, S.; Wang, Z.; Wang, X.V.; Zheng, L.; Wang, L. Proactive human–robot collaboration: Mutual-cognitive, predictable, and self-organising perspectives. Robot. Comput.-Integr. Manuf. 2023, 81, 102510. [Google Scholar] [CrossRef]
  3. Schött, S.Y.; Amin, R.M.; Butz, A. A Literature Survey of How to Convey Transparency in Co-Located Human Robot Interaction. Multimodal Technol. Interact. 2023, 7, 25. [Google Scholar] [CrossRef]
  4. Kim, E.; Kirschner, R.; Yamada, Y.; Okamoto, S. Estimating probability of human hand intrusion for speed and separation monitoring using interference theory. Robot. Comput.-Integr. Manuf. 2020, 61, 101819. [Google Scholar] [CrossRef]
  5. Amorim, A.; Guimarães, D.; Mendonça, T.; Neto, P.; Costa, P.; Moreira, A.P. Robust human position estimation in cooperative robotic cells. Robot. Comput.-Integr. Manuf. 2021, 67, 102035. [Google Scholar] [CrossRef]
  6. Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. MediaPipe Hands: On-device Real-time Hand Tracking. arXiv 2020, arXiv:2006.10214. [Google Scholar] [CrossRef]
  7. Lucci, N.; Monguzzi, A.; Zanchettin, A.M.; Rocco, P. Workflow modelling for human–robot collaborative assembly operations. Robot. Comput.-Integr. Manuf. 2022, 78, 102384. [Google Scholar] [CrossRef]
  8. Vysocky, A.; Grushko, S.; Spurny, T.; Pastor, R.; Kot, T. Generating Synthetic Depth Image Dataset for Industrial Applications of Hand Localization. IEEE Access 2022, 10, 99734–99744. [Google Scholar] [CrossRef]
  9. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar] [CrossRef]
  10. Hall, D.; Dayoub, F.; Skinner, J.; Zhang, H.; Miller, D.; Corke, P.; Carneiro, G.; Angelova, A.; Sünderhauf, N. Probabilistic Object Detection: Definition and Evaluation. arXiv 2020, arXiv:1811.10800. [Google Scholar] [CrossRef]
  11. Jalayer, R.; Jalayer, M.; Orsenigo, C.; Tomizuka, M. A review on deep learning for vision-based hand detection, hand segmentation and hand gesture recognition in human–robot interaction. Robot. Comput.-Integr. Manuf. 2026, 97, 103110. [Google Scholar] [CrossRef]
  12. Hillebrand, G.; Bauer, M.; Achatz, K.; Klinker, G. Inverse kinematic infrared optical finger tracking. In Proceedings of the 9th International Conference on Humans and Computers (HC 2006), Aizu, Japan, 6–9 March 2006; pp. 6–9. [Google Scholar]
  13. Wetzler, A.; Slossberg, R.; Kimmel, R. Rule Of Thumb: Deep derotation for improved fingertip detection. arXiv 2015, arXiv:1507.05726. [Google Scholar] [CrossRef]
  14. Baldi, T.L.; Scheggi, S.; Meli, L.; Mohammadi, M.; Prattichizzo, D. GESTO: A Glove for Enhanced Sensing and Touching Based on Inertial and Magnetic Sensors for Hand Tracking and Cutaneous Feedback. IEEE Trans. Hum.-Mach. Syst. 2017, 47, 1066–1076. [Google Scholar] [CrossRef]
  15. Grushko, S.; Vysocký, A.; Heczko, D.; Bobovský, Z. Intuitive Spatial Tactile Feedback for Better Awareness about Robot Trajectory during Human-Robot Collaboration. Sensors 2021, 21, 5748. [Google Scholar] [CrossRef]
  16. Vysocký, A.; Grushko, S.; Oščádal, P.; Kot, T.; Babjak, J.; Jánoš, R.; Sukop, M.; Bobovský, Z. Analysis of Precision and Stability of Hand Tracking with Leap Motion Sensor. Sensors 2020, 20, 4088. [Google Scholar] [CrossRef] [PubMed]
  17. Mazhar, O.; Navarro, B.; Ramdani, S.; Passama, R.; Cherubini, A. A real-time human-robot interaction framework with robust background invariant hand gesture detection. Robot. Comput.-Integr. Manuf. 2019, 60, 34–48. [Google Scholar] [CrossRef]
  18. Yurtsever, E.; Yang, D.; Koc, I.M.; Redmill, K.A. Photorealism in Driving Simulations: Blending Generative Adversarial Image Synthesis with Rendering. IEEE Trans. Intell. Transp. Syst. 2022, 23, 23114–23123. [Google Scholar] [CrossRef]
  19. Mueller, F.; Bernard, F.; Sotnychenko, O.; Mehta, D.; Sridhar, S.; Casas, D.; Theobalt, C. GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 49–59. [Google Scholar] [CrossRef]
  20. Romero, J.; Tzionas, D.; Black, M.J. Embodied Hands: Modeling and Capturing Hands and Bodies Together. ACM Trans. Graph. 2017, 36, 1–17. [Google Scholar] [CrossRef]
  21. Zimmermann, C.; Brox, T. Learning to Estimate 3D Hand Pose from Single RGB Images. arXiv 2017, arXiv:1705.01389. [Google Scholar] [CrossRef]
  22. Dwibedi, D.; Misra, I.; Hebert, M. Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection. arXiv 2017, arXiv:1708.01642. [Google Scholar] [CrossRef]
  23. Georgakis, G.; Mousavian, A.; Berg, A.C.; Kosecka, J. Synthesizing Training Data for Object Detection in Indoor Scenes. arXiv 2017, arXiv:1702.07836. [Google Scholar] [CrossRef]
  24. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. arXiv 2017, arXiv:1703.06907. [Google Scholar] [CrossRef]
  25. Ganin, Y.; Lempitsky, V. Unsupervised Domain Adaptation by Backpropagation. arXiv 2015, arXiv:1409.7495. [Google Scholar] [CrossRef]
  26. Liebelt, J.; Schmid, C. Multi-View Object Class Detection with a 3D Geometric Model. In 23rd IEEE Conference on Computer Vision & Pattern Recognition; IEEE: New York, NY, USA, 2010; p. 1688. [Google Scholar] [CrossRef]
  27. Planche, B.; Singh, R.V. Physics-based Differentiable Depth Sensor Simulation. arXiv 2021, arXiv:2103.16563. [Google Scholar] [CrossRef]
  28. Oprea, S.; Karvounas, G.; Martinez-Gonzalez, P.; Kyriazis, N.; Orts-Escolano, S.; Oikonomidis, I.; Garcia-Garcia, A.; Tsoli, A.; Garcia-Rodriguez, J.; Argyros, A. H-GAN: The power of GANs in your Hands. arXiv 2021, arXiv:2103.15017. [Google Scholar] [CrossRef]
  29. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. arXiv 2020, arXiv:1703.10593. [Google Scholar] [CrossRef]
  30. Sadeghi, F.; Levine, S. CAD2RL: Real Single-Image Flight without a Single Real Image. arXiv 2017, arXiv:1611.04201. [Google Scholar] [CrossRef]
  31. Tzeng, E.; Devin, C.; Hoffman, J.; Finn, C.; Abbeel, P.; Levine, S.; Saenko, K.; Darrell, T. Adapting Deep Visuomotor Representations with Weak Pairwise Constraints. arXiv 2017, arXiv:1511.07111. [Google Scholar] [CrossRef]
  32. Hinterstoisser, S.; Lepetit, V.; Wohlhart, P.; Konolige, K. On Pre-Trained Image Features and Synthetic Images for Deep Learning. arXiv 2017, arXiv:1710.10710. [Google Scholar] [CrossRef]
  33. Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. arXiv 2018, arXiv:1804.06516. [Google Scholar] [CrossRef]
  34. Dehban, A.; Borrego, J.; Figueiredo, R.; Moreno, P.; Bernardino, A.; Santos-Victor, J. The Impact of Domain Randomization on Object Detection: A Case Study on Parametric Shapes and Synthetic Textures. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 2593–2600. [Google Scholar] [CrossRef]
  35. Khirodkar, R.; Yoo, D.; Kitani, K.M. Domain Randomization for Scene-Specific Car Detection and Pose Estimation. arXiv 2018, arXiv:1811.05939. [Google Scholar] [CrossRef]
  36. Horváth, D.; Erdős, G.; Istenes, Z.; Horváth, T.; Földi, S. Object Detection Using Sim2Real Domain Randomization for Robotic Applications. IEEE Trans. Robot. 2022, 39, 1225–1243. [Google Scholar] [CrossRef]
  37. Mueller, F.; Davis, M.; Bernard, F.; Sotnychenko, O.; Verschoor, M.; Otaduy, M.A.; Casas, D.; Theobalt, C. Real-time Pose and Shape Reconstruction of Two Interacting Hands With a Single Depth Camera. ACM Trans. Graph. 2019, 38, 1–13. [Google Scholar] [CrossRef]
  38. Bambach, S.; Lee, S.; Crandall, D.J.; Yu, C. Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1949–1957. [Google Scholar] [CrossRef]
  39. Nuzzi, C.; Pasinetti, S.; Pagani, R.; Coffetti, G.; Sansoni, G. HANDS: An RGB-D dataset of static hand-gestures for human-robot interaction. Data Brief 2021, 35, 106791. [Google Scholar] [CrossRef] [PubMed]
  40. Nuzzi, C.; Pasinetti, S.; Pagani, R.; Ghidini, S.; Beschi, M.; Coffetti, G.; Sansoni, G. MEGURU: A gesture-based robot program builder for Meta-Collaborative workstations. Robot. Comput.-Integr. Manuf. 2021, 68, 102085. [Google Scholar] [CrossRef]
  41. Bojja, A.K.; Mueller, F.; Malireddi, S.R.; Oberweger, M.; Lepetit, V.; Theobalt, C.; Yi, K.M.; Tagliasacchi, A. HandSeg: An Automatically Labeled Dataset for Hand Segmentation from Depth Images. arXiv 2018, arXiv:1711.05944. [Google Scholar] [CrossRef]
  42. Hasson, Y.; Varol, G.; Tzionas, D.; Kalevatykh, I.; Black, M.J.; Laptev, I.; Schmid, C. Learning joint reconstruction of hands and manipulated objects. arXiv 2019, arXiv:1904.05767. [Google Scholar] [CrossRef]
  43. Qian, C.; Sun, X.; Wei, Y.; Tang, X.; Sun, J. Realtime and Robust Hand Tracking from Depth. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1106–1113. [Google Scholar] [CrossRef]
  44. Tompson, J.; Stein, M.; Lecun, Y.; Perlin, K. Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks. ACM Trans. Graph. 2014, 33, 1–10. [Google Scholar] [CrossRef]
  45. Gu, W.; Bai, S.; Kong, L. A review on 2D instance segmentation based on deep neural networks. Image Vis. Comput. 2022, 120, 104401. [Google Scholar] [CrossRef]
  46. Huang, Z.; Huang, L.; Gong, Y.; Huang, C.; Wang, X. Mask Scoring R-CNN. arXiv 2019, arXiv:1903.00241. [Google Scholar] [CrossRef]
  47. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. arXiv 2019, arXiv:1904.02689. [Google Scholar] [CrossRef]
  48. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. arXiv 2018, arXiv:1803.01534. [Google Scholar] [CrossRef]
  49. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef]
  50. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2016, arXiv:1506.01497. [Google Scholar] [CrossRef]
  51. De Brabandere, B.; Neven, D.; Van Gool, L. Semantic Instance Segmentation with a Discriminative Loss Function. arXiv 2017, arXiv:1708.02551. [Google Scholar] [CrossRef]
  52. Liu, S.; Jia, J.; Fidler, S.; Urtasun, R. SGN: Sequential Grouping Networks for Instance Segmentation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3516–3524. [Google Scholar] [CrossRef]
  53. Newell, A.; Huang, Z.; Deng, J. Associative Embedding: End-to-End Learning for Joint Detection and Grouping. arXiv 2017, arXiv:1611.05424. [Google Scholar] [CrossRef]
  54. Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. SOLO: Segmenting Objects by Locations. arXiv 2020, arXiv:1912.04488. [Google Scholar] [CrossRef]
  55. Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and Fast Instance Segmentation. arXiv 2020, arXiv:2003.10152. [Google Scholar] [CrossRef]
  56. Rohmer, E.; Singh, S.P.N.; Freese, M. V-REP: A versatile and scalable robot simulation framework. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; pp. 1321–1326. [Google Scholar] [CrossRef]
  57. Wspanialy, P. Pycococreator, version 0.2.1; Zenodo: Geneva, Switzerland, 2018. [Google Scholar] [CrossRef]
  58. Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023. [Google Scholar]
  59. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar] [CrossRef]
  60. Wenkel, S.; Alhazmi, K.; Liiv, T.; Alrshoud, S.; Simon, M. Confidence Score: The Forgotten Dimension of Object Detection Performance Evaluation. Sensors 2021, 21, 4350. [Google Scholar] [CrossRef]
  61. Kapitanov, A.; Kvanchiani, K.; Nagaev, A.; Kraynov, R.; Makhliarchuk, A. HaGRID–HAnd Gesture Recognition Image Dataset. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: New York, NY, USA, 2024. [Google Scholar]
Figure 1. To enhance instance segmentation, domain randomization was implemented by overlaying annotated synthetic data onto randomized backgrounds, alongside random distractors such as geometric shapes and generic tools. The resulting scenes were then rendered with random lighting conditions, and random textures were applied to both the objects of interest and the distractor objects prior to rendering. These samples, along with automatically generated masks, were utilized for training instance segmentation models.
Figure 2. CoppeliaSim scene used as dataset generator. The scene contains randomly positioned objects, random colors of textures and random noise in background: (a) scene containing two hand meshes; (b) scene containing no hand meshes and a single articulated human body model.
Figure 3. Meshes with annotated hand instances used in dataset generator. Colors of forearm and hand are random.
Figure 4. Samples from the generated dataset: (a) RGB image of the scene; (b) depth image of the scene; (c) mask of the first generated instance; (d) mask of the second instance. Generated samples include different lighting, background clutter, distractors, and multiple cases where hand segmentation from the RGB image is complicated by blending with the background. Samples may contain up to two hand instances.
Figure 5. Statistics of our dataset: the distributions of the number of hand instances per image (left) and hand centroid locations (right).
Figure 6. Images generated by a Stable Diffusion model controlled by ControlNet using the depth input (top row) and hand keypoints (bottom row).
Figure 7. Real dataset for evaluation: (a) RGB; (b) depth; (c) mask of the first instance; (d) mask of the second instance.
Figure 8. Object count histogram: (a) instances in our DR dataset; (b) by categorization into small, medium and large object groups according to standard COCO-calculated thresholds—the instances are distributed into categories unevenly; (c) by categorization into small, medium and large object groups according to thresholds calculated for our dataset—instances are distributed evenly into the categories.
Figure 9. AP curves (higher is better) for trained models evaluated for different confidence score thresholds. Mask R-CNN models retain their AP up to a confidence threshold of 0.95.
Figure 10. PDQ curves (higher is better) obtained by evaluating PDQ as a function of confidence score threshold. Evaluated on test dataset (bounding boxes).
Figure 11. Qualitative results on test dataset. Model predictions: (a) RGB image of the scene; (b) SOLOv2 ResNet-101 depth, threshold: 0.55; (c) Mask R-CNN ResNet-101 depth, threshold: 0.825; (d) SOLOv2 ResNet-50 RGB, threshold: 0.6; (e) Mask R-CNN ResNet-50 RGB, threshold: 0.975; (f) SOLOv2 ResNet-101 RGB-D, threshold: 0.5; (g) Mask R-CNN ResNet-50 RGB-D, threshold: 0.9. True positive (white), false positive (red), and false negative (blue) pixels in the predicted output compared to the ground truth.
Figure 12. Statistics of the datasets: (a) RHD; (b) Dense Hands; (c) ObMan; (d) HandSeg; (e) EgoHands. Graphs depict distributions of the number of instances per image (left) and instance centroid locations (right). Applying DR simplifies the generation of data with a wider distribution (see Figure 5).
Figure 13. Side-by-side comparison of predictions: (first row) MediaPipe Hands—landmarks and bounding boxes; (second row) Mask R-CNN ResNet50 RGB-D trained on our DR dataset—masks and bounding boxes; (third row) YOLOv10 trained on the HaGRIDv2 dataset—bounding boxes (red is ground truth).
Table 1. Average precision and average recall of models trained on our DR dataset with all modalities (depth, RGB, and RGB-D) compared with metrics for COCO. Evaluated on test dataset (real camera images) with minimum confidence score threshold of 0.1. Mask R-CNN ResNet50 achieves highest AP. The best result in each section is highlighted in bold.
| Model | AP@0.5:0.95 | AP@0.5 | APsmall@0.5:0.95 | APmedium@0.5:0.95 | APlarge@0.5:0.95 | AR@0.5:0.95 |
|---|---|---|---|---|---|---|
| SOLOv2 ResNet50 (Depth) | 0.338 | 0.644 | 0.263 | 0.369 | 0.405 | 0.265 |
| SOLOv2 ResNet101 (Depth) | 0.356 | 0.681 | 0.293 | 0.392 | 0.415 | 0.274 |
| Mask R-CNN ResNet50 (Depth) | 0.364 | 0.712 | 0.291 | 0.424 | 0.415 | 0.273 |
| Mask R-CNN ResNet101 (Depth) | 0.357 | 0.686 | 0.308 | 0.415 | 0.394 | 0.274 |
| SOLOv2 ResNet50 (RGB) | 0.473 | 0.717 | 0.399 | 0.578 | 0.557 | 0.330 |
| SOLOv2 ResNet101 (RGB) | 0.474 | 0.740 | 0.391 | 0.583 | 0.550 | 0.336 |
| Mask R-CNN ResNet50 (RGB) | 0.523 | 0.821 | 0.459 | 0.612 | 0.576 | 0.357 |
| Mask R-CNN ResNet101 (RGB) | 0.443 | 0.704 | 0.388 | 0.567 | 0.497 | 0.321 |
| SOLOv2 ResNet50 (RGB-D) | 0.408 | 0.718 | 0.326 | 0.465 | 0.460 | 0.306 |
| SOLOv2 ResNet101 (RGB-D) | 0.410 | 0.709 | 0.348 | 0.457 | 0.458 | 0.306 |
| Mask R-CNN ResNet50 (RGB-D) | 0.449 | 0.800 | 0.412 | 0.505 | 0.479 | 0.325 |
| Mask R-CNN ResNet101 (RGB-D) | 0.445 | 0.779 | 0.423 | 0.504 | 0.468 | 0.323 |
| SOLOv2 ResNet50 (COCO) | 0.329 | 0.607 | 0.394 | 0.490 | 0.274 | 0.281 |
| SOLOv2 ResNet101 (COCO) | 0.252 | 0.574 | 0.278 | 0.349 | 0.244 | 0.225 |
| Mask R-CNN ResNet50 (COCO) | 0.123 | 0.292 | 0.218 | 0.322 | 0.082 | 0.157 |
| Mask R-CNN ResNet101 (COCO) | 0.101 | 0.234 | 0.221 | 0.284 | 0.082 | 0.148 |
Table 2. Summary of PDQ bounding box evaluation. The maximum PDQ is presented along with the corresponding confidence score threshold and AP obtained. PDQmax—maximum PDQ obtained by the model; confidence score threshold at PDQmax—score threshold at which PDQmax was obtained (optimal confidence threshold); AP at PDQmax—AP that corresponds to PDQmax; APmax—maximum AP obtained by the model. The best result in each section is highlighted in bold.
| Model | PDQmax | Confidence Score Threshold at PDQmax | AP at PDQmax | APmax |
|---|---|---|---|---|
| SOLOv2 ResNet50 (Depth) | 0.1071 | 0.550 | 0.301 | 0.338 |
| SOLOv2 ResNet101 (Depth) | 0.1178 | 0.550 | 0.318 | 0.357 |
| Mask R-CNN ResNet50 (Depth) | 0.1259 | 0.925 | 0.346 | 0.372 |
| Mask R-CNN ResNet101 (Depth) | 0.1318 | 0.825 | 0.339 | 0.364 |
| SOLOv2 ResNet50 (RGB) | 0.1147 | 0.600 | 0.380 | 0.474 |
| SOLOv2 ResNet101 (RGB) | 0.1054 | 0.550 | 0.385 | 0.475 |
| Mask R-CNN ResNet50 (RGB) | 0.0874 | 0.975 | 0.498 | 0.525 |
| Mask R-CNN ResNet101 (RGB) | 0.0768 | 0.975 | 0.420 | 0.445 |
| SOLOv2 ResNet50 (RGB-D) | 0.1351 | 0.575 | 0.342 | 0.408 |
| SOLOv2 ResNet101 (RGB-D) | 0.1449 | 0.500 | 0.372 | 0.410 |
| Mask R-CNN ResNet50 (RGB-D) | 0.1576 | 0.900 | 0.424 | 0.455 |
| Mask R-CNN ResNet101 (RGB-D) | 0.1480 | 0.975 | 0.409 | 0.450 |
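The optimal operating points reported in Table 2 (and later in Tables 4 and 5) result from sweeping the confidence score threshold and keeping the value that maximizes PDQ. The sketch below illustrates that sweep in schematic form; evaluate_pdq and evaluate_ap are placeholders for the PDQ [10] and COCO-style AP evaluators, and the 0.025 threshold grid is an assumption consistent with the thresholds listed in the tables.

```python
# Schematic sketch of the confidence-threshold sweep used to report PDQmax and
# the AP at PDQmax. `predictions` are assumed to carry a "score" field;
# evaluate_pdq / evaluate_ap are placeholder callables, not a specific library API.
def sweep_thresholds(predictions, ground_truth, evaluate_pdq, evaluate_ap):
    best = {"pdq": -1.0, "threshold": None, "ap_at_pdq": None}
    thresholds = [i / 40 for i in range(0, 40)]  # 0.000 ... 0.975 in 0.025 steps (assumed grid)
    for t in thresholds:
        kept = [p for p in predictions if p["score"] >= t]
        pdq = evaluate_pdq(kept, ground_truth)
        if pdq > best["pdq"]:
            best = {"pdq": pdq,
                    "threshold": t,
                    "ap_at_pdq": evaluate_ap(kept, ground_truth)}
    return best
```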
Table 3. Existing hand datasets—comparison of the features.
| Dataset | Annotation Method | Samples Source | Instances | Modalities | Number of Images |
|---|---|---|---|---|---|
| EgoHands | Manual | Real | Up to 4 | RGB | 4800 |
| HandSeg | Automatic (marker/gloves) | Real | Up to 2 | RGB-D | 158,315 |
| DenseHands | Automatic | Synthetic | Up to 2 | Depth | 85,611 |
| Rendered Hand Pose (RHD) | Automatic | Synthetic | Up to 2 | RGB-D | 43,986 |
| ObMan | Automatic | Synthetic | Up to 2 | RGB-D | 154,298 |
| HaDR (Ours) | Automatic | Synthetic | Up to 2 | RGB-D | 117,438 |
Table 4. AP and PDQ evaluations for models trained on existing datasets (evaluated on test dataset). The best result in each section is highlighted in bold.
| Model | PDQmax | Confidence Score Threshold at PDQmax | AP at PDQmax | APmax |
|---|---|---|---|---|
| DenseHands: SOLOv2 ResNet50 (Depth) | 0.0000 | 0.000 | 0.009 | 0.009 |
| DenseHands: SOLOv2 ResNet101 (Depth) | 0.0001 | 0.225 | 0.008 | 0.008 |
| DenseHands: Mask R-CNN ResNet50 (Depth) | 0.0061 | 0.975 | 0.084 | 0.093 |
| DenseHands: Mask R-CNN ResNet101 (Depth) | 0.0034 | 0.975 | 0.027 | 0.029 |
| HandSeg: SOLOv2 ResNet50 (Depth) | 0.0104 | 0.350 | 0.035 | 0.039 |
| HandSeg: SOLOv2 ResNet101 (Depth) | 0.0194 | 0.450 | 0.060 | 0.069 |
| HandSeg: Mask R-CNN ResNet50 (Depth) | 0.0147 | 0.975 | 0.018 | 0.020 |
| HandSeg: Mask R-CNN ResNet101 (Depth) | 0.0158 | 0.925 | 0.017 | 0.018 |
| EgoHands: SOLOv2 ResNet50 (RGB) | 0.1025 | 0.575 | 0.264 | 0.299 |
| EgoHands: SOLOv2 ResNet101 (RGB) | 0.0975 | 0.500 | 0.229 | 0.250 |
| EgoHands: Mask R-CNN ResNet50 (RGB) | 0.0742 | 0.950 | 0.161 | 0.174 |
| EgoHands: Mask R-CNN ResNet101 (RGB) | 0.1040 | 0.900 | 0.214 | 0.253 |
| ObMan: SOLOv2 ResNet50 (RGB-D) | 0.0606 | 0.450 | 0.138 | 0.165 |
| ObMan: SOLOv2 ResNet101 (RGB-D) | 0.0535 | 0.375 | 0.135 | 0.160 |
| ObMan: Mask R-CNN ResNet50 (RGB-D) | 0.0798 | 0.975 | 0.187 | 0.217 |
| ObMan: Mask R-CNN ResNet101 (RGB-D) | 0.0786 | 0.950 | 0.206 | 0.227 |
| RHD: SOLOv2 ResNet50 (RGB-D) | 0.0591 | 0.725 | 0.130 | 0.148 |
| RHD: SOLOv2 ResNet101 (RGB-D) | 0.0794 | 0.675 | 0.157 | 0.168 |
| RHD: Mask R-CNN ResNet50 (RGB-D) | 0.0775 | 0.975 | 0.165 | 0.175 |
| RHD: Mask R-CNN ResNet101 (RGB-D) | 0.0839 | 0.975 | 0.169 | 0.178 |
Table 5. AP and PDQ evaluated for bounding boxes for Mask R-CNN ResNet50 models trained on our DR dataset compared with MediaPipe. Evaluated on test dataset (real camera images). The best result in each section is highlighted in bold.
| Model | PDQmax | Confidence Score Threshold at PDQmax | AP at PDQmax | APmax |
|---|---|---|---|---|
| Mask R-CNN ResNet101 (Depth) | 0.1318 | 0.825 | 0.339 | 0.364 |
| SOLOv2 ResNet50 (RGB) | 0.1147 | 0.600 | 0.380 | 0.474 |
| Mask R-CNN ResNet50 (RGB-D) | 0.1576 | 0.900 | 0.424 | 0.455 |
| MediaPipe (RGB) | 0.0836 | 0.050 | 0.181 | 0.181 |
| YOLOv10 (RGB) | 0.0613 | 0.200 | 0.825 | 0.825 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
