HaDR: Hand Instance Segmentation Using a Synthetic Multimodal Dataset Based on Domain Randomization
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The abstract should be improved to include quantitative results that express the proposed method's superiority over the literature.
The paper outline, described from lines 84 to 97, could be eliminated without losing the manuscript quality. Instead, the authors should include the main findings of this study.
Several claims of novelty and superiority over similar methods are only partially substantiated when compared with the state of the art, technically and scientifically.
The method uses only one dataset to obtain a general conclusion; additional comparisons are required.
Results contradict general color image understanding; the RGB color space is a correlated space many times outperformed by non-correlated color frameworks. The authors claim RGB superiority, but results show that the composed/multimodal RGB-D color space is superior; please revise Table 1 and the conclusions.
Why was the PDQ only applied to bounding boxes, not to masks?
In this study, the articulation realism is weaker, and the hand orientation is limited to ±15° roll and ±30° pitch, as well as gesture diversity. Therefore, hand pose variation is constrained to narrow angular ranges and static meshes.
A deeper ablation study is required, including issues such as added image noise (to add realism), occlusions, and so on.
The claim of “general hand detection” is not supported by experimental results and the controlled conditions used in this study. Make a deep literature review, which provides methods for analysing multi-person and interaction scenarios.
Incomplete baseline fairness, the MediaPipe was optimized for landmarks, not for segmentation.
It is important to explicitly state the position of the dataset as industrial-HRI-specific, not universal, including the technical drawbacks for general use.
Comments on the Quality of English Language
The manuscript requires deep proofreading.
Author Response
Comments 1: The abstract should be improved to include quantitative results to express the proposed method superiority over the literature.
Response 1: Thank you for your comment. We revised the abstract so that it better follows the structure of the paper. We did not include quantitative results because, without context, they could confuse the reader. The results need a wider explanation, and it is not possible to simply add a few numbers to the abstract.
Comments 2: The paper outline, described from lines 84 to 97, could be eliminated without losing the manuscript quality. Instead, the authors should include the main findings of this study.
Response 2: We agree and we removed this paragraph.
Comments 3: Several claims of novelty and superiority over similar methods are only partially substantiated when compared with the state of the art, technically and scientifically.
Response 3: We appreciate your observation regarding the widespread use of domain randomization in data synthesis. Our approach leverages this technique to create a custom synthetic RGB-D dataset tailored specifically for hand instance segmentation in industrial environments. This customization allows us to generate a large volume of training data efficiently, significantly reducing the time and resources typically required for manual data collection and annotation.
In our study, models trained exclusively on this synthetic dataset demonstrated superior performance compared to those trained on existing real-world datasets. This indicates that our domain-randomized dataset effectively captures the variability and complexity necessary for robust hand detection, even in challenging industrial settings. In this article, we not only propose our system for creating a hand detector but also describe a methodological approach for evaluating the proposed system and comparing it with other systems that produce different outputs. While domain randomization itself is not novel, our contribution lies in its application to generate a specialized dataset that addresses the specific challenges of hand instance segmentation. This approach offers a practical and efficient solution for developing high-performing models in scenarios where acquiring extensive labeled data is impractical. We have added this information to the conclusion section.
Comments 4: The method uses only one dataset to obtain a general conclusion; additional comparisons are required.
Response 4: We are not sure how to interpret this comment. In Section 4.2 and Table 4, we show results of models trained on state-of-the-art datasets. We compare models trained on different datasets as well as different models.
Comments 5: Results contradict general color image understanding; the RGB color space is a correlated space many times outperformed by non-correlated color frameworks. The authors claim RGB superiority, but results show that the composed/multimodal RGB-D color space is superior; please revise Table 1 and the conclusions.
Response 5: This is one of the findings of this paper. The standard AP evaluation in Table 1 shows RGB superiority, which was also hard for us to understand, since the depth channel should bring benefits through additional information. In Table 2 and the PDQ analysis, RGB-D performs better, which is the expected result.
Comments 6: Why was the PDQ only applied to bounding boxes, not to masks?
Response 6: Thank you for your comment. We agree that it could bring more interesting results, but we had to work with the tools we had. The PDQ evaluation tool we used does not work with pixel-level results, so we could only use bounding box evaluations. We tried to set up the fairest possible conditions for comparing different models and methods. Developing new tools was beyond the time available to us.
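One common way to feed mask-producing models into a box-only evaluation tool is to reduce each predicted instance mask to its tight axis-aligned bounding box. The following is a minimal sketch of that reduction, not the authors' evaluation code; the function name `mask_to_bbox` and the inclusive `(x_min, y_min, x_max, y_max)` pixel convention are assumptions made for illustration.

```python
import numpy as np


def mask_to_bbox(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) of a binary mask.

    Coordinates are inclusive pixel indices. Returns None for an empty mask.
    """
    mask = np.asarray(mask, dtype=bool)
    rows = np.flatnonzero(mask.any(axis=1))  # row indices containing foreground
    cols = np.flatnonzero(mask.any(axis=0))  # column indices containing foreground
    if rows.size == 0:
        return None
    return int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1])


# Toy instance mask: a 3 x 4 block of foreground pixels.
mask = np.zeros((6, 8), dtype=bool)
mask[2:5, 3:7] = True
box = mask_to_bbox(mask)
```

Reducing masks to boxes discards the pixel-level detail, which is the limitation the reviewer's question points at.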
Comments 7: In this study, the articulation realism is weaker, and the hand orientation is limited to ±15° roll and ±30° pitch, as well as gesture diversity. Therefore, hand pose variation is constrained to narrow angular ranges and static meshes.
A deeper ablation study is required, including issues such as added image noise (to add realism), occlusions, and so on.
Response 7: Our previous work extensively examined the impact of different DR levels, including object presence, random noise, and post-processing techniques designed to closely mimic camera images. As this topic was thoroughly addressed in our earlier publication, it falls outside the scope of the current study. We highlighted this in the paper.
Comments 8: The claim of “general hand detection” is not supported by experimental results and the controlled conditions used in this study. Make a deep literature review, which provides methods for analysing multi-person and interaction scenarios.
Response 8: Our goal was to provide a fully synthetic dataset that would allow training the DL model in a color-agnostic manner without using real camera data. To the best of the authors’ knowledge, this is the first attempt to create a color-agnostic RGB-D synthetic dataset that can mitigate the issue of over-reliance on human skin features. We think that another analysis would make this study confusing.
Comments 9: Incomplete baseline fairness, the MediaPipe was optimized for landmarks, not for segmentation.
Response 9: We use MediaPipe in other research, where we encountered some restrictions and limits, so we decided to create our own solution. We compared it with MediaPipe and tried to make the comparison between different outputs (bounding box, landmarks, masks) as fair as possible.
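A comparison across output types usually needs a common representation. The sketch below shows one possible reduction, assuming landmarks are given as `(x, y)` coordinates normalized to the range 0 to 1: the landmarks are converted to a pixel bounding box, which can then be scored against another box with intersection over union. The names `landmarks_to_bbox` and `box_iou` are hypothetical, and this is not the procedure used in the paper.

```python
import numpy as np


def landmarks_to_bbox(landmarks, width, height):
    """Pixel box (x_min, y_min, x_max, y_max) around normalized (x, y) landmarks."""
    pts = np.asarray(landmarks, dtype=float)[:, :2] * [width, height]
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    box = np.array([x_min, y_min, x_max, y_max])
    # Landmarks may fall slightly outside the frame, so clip to the image.
    upper = np.array([width, height, width, height], dtype=float)
    return np.clip(box, 0.0, upper)


def box_iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Three toy landmarks in a 100 x 200 image.
toy_landmarks = [(0.25, 0.5), (0.5, 0.75), (0.75, 0.25)]
toy_box = landmarks_to_bbox(toy_landmarks, width=100, height=200)
```

A box built from landmarks tends to be tighter than a box built from a mask, because landmarks sit on joints rather than on the hand outline, so any such comparison would need to account for that bias.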
Comment 10: It is important to explicitly state the position of the dataset as industrial-HRI-specific, not universal, including the technical drawbacks for general use.
Response 10: As there are no available industrial datasets to compare with, we decided to make the dataset more general, but we focused on conditions typical for industrial use, such as wearing colored gloves or the presence of tools in the background with a shape similar to human fingers.
Reviewer 2 Report
Comments and Suggestions for Authors
AUTHORS propose a study using domain randomization to generate a synthetic RGB-D dataset to train multimodal instance segmentation models, with the aim of achieving color-agnostic hand localization in cluttered industrial environments.
Domain randomization is a simple technique for addressing the "reality gap" by randomly rendering unrealistic features in a simulation scene to force the neural network to learn essential domain features.
AUTHORS provide a new synthetic dataset for various hand detection applications in industrial environments, as well as ready-to-use pretrained instance segmentation models.
To achieve robust results in a complex unstructured environment, THEY use a multimodal input that includes both color and depth information, which THEY hypothesize helps to improve the accuracy of model prediction.
In order to test this assumption, THEY analyze the influence of each modality and their synergy.
The evaluated models were trained solely on THEIR synthetic dataset; yet, THEY show that THEIR approach enables the models to outperform corresponding models trained on existing state-of-the-art datasets in terms of Average Precision and Probability-Based Detection Quality.
The study is interesting and makes a valuable contribution to the scientific literature.
I have the following comments for the authors:
- The abstract requires improvements in its structure. It should more clearly follow the logical flow of background / aim / methods / results. In its current form, for instance, it starts directly with the aim and then returns to the background to explain what domain randomization is.
- The abstract contains an acronym that is not defined (RGB-D).
- In general, study titles would preferably avoid the use of acronyms.
- In the Introduction, the sentence appearing abruptly in rows 52–56 (“Domain randomization (DR) is a methodology that aims to…”) seems to represent the core of the study; however, it is not adequately supported by references to the scientific literature.
- A clear aim should be explicitly stated, including a general objective and, if appropriate, specific sub-objectives. I found a partial description in Section 2 (around lines 201–210), but it does not appear to be well structured or clearly formulated.
- All reported figures (e.g., Figures 1 and 2) require a greater descriptive effort in the main text. Figures should be placed immediately after their first citation and within the same section; otherwise, the reader may become confused (see Figures 7 and 8, which appear in the subsequent section).
- The Methods section does not follow a standard structure, but this does not necessarily represent a major issue. This could be addressed by adding a brief introductory summary or sketch at the beginning of the section to guide the reader.
- Figure 5 is incomplete, as data labels are missing from the histograms.
- The Results section is excellent overall, aside from minor improvements needed in some figures (e.g., adding data labels).
- I suggest including a Discussion section, even a brief one, with comparisons to other studies and a discussion of the study’s limitations.
Author Response
Comments 1: The abstract requires improvements in its structure. It should more clearly follow the logical flow of background / aim / methods / results. In its current form, for instance, it starts directly with the aim and then returns to the background to explain what domain randomization is.
The abstract contains an acronym that is not defined (RGB-D).
Response 1: Thank you for your advice. We rewrote the abstract according to your suggestion and included the meaning of the abbreviation.
Comments 2: In general, study titles would preferably avoid the use of acronyms.
Response 2: We agree with this idea, but papers introducing a dataset usually have a short name, which is beneficial for subsequent papers that compare against our solution.
Comments 4: A clear aim should be explicitly stated, including a general objective and, if appropriate, specific sub-objectives. I found a partial description in Section 2 (around lines 201–210), but it does not appear to be well structured or clearly formulated.
Response 4: Thank you for your advice. We added a comment at the end of the Introduction section. Our first goal was a dataset for developing a system for our specific purposes. However, we found that evaluating models and cross-evaluating datasets, which usually differ in input format and output (bounding box, landmarks, mask), needs to be carefully verified, and we also had to prove that a synthetic dataset is valid for real camera data. Starting from a simple aim, we had to solve many "side problems", which we tried to cover in this paper.
Comments 5: All reported figures (e.g., Figures 1 and 2) require a greater descriptive effort in the main text. Figures should be placed immediately after their first citation and within the same section; otherwise, the reader may become confused (see Figures 7 and 8, which appear in the subsequent section).
Response 5: We tried to add the necessary information to the text, but the conditions of dataset generation are described in more detail in our previous paper. The current paper focuses more on evaluation and validation, which is why some topics are not described in full detail. We tried to optimize the figure positions, but LaTeX follows its own placement rules and this is not the final layout of the paper; after review and editorial checks, the proportions of the paper may change.
Comments 6: The Methods section does not follow a standard structure, but this does not necessarily represent a major issue. This could be addressed by adding a brief introductory summary or sketch at the beginning of the section to guide the reader.
Response 6: We added the suggested section.
Comments 7: Figure 5 is incomplete, as data labels are missing from the histograms.
Response 7: We added data labels to Figures 5 and 12.
Comments 8:
The Results section is excellent overall, aside from minor improvements needed in some figures (e.g., adding data labels).
I suggest including a Discussion section, even a brief one, with comparisons to other studies and a discussion of the study’s limitations.
Response 8:
Thank you, we added the suggested sections.
Reviewer 3 Report
Comments and Suggestions for Authors
In this paper, the authors present a simple technique for randomly rendering unrealistic features in a simulation scene to force the neural network to learn essential domain features. They provided a new synthetic dataset for various hand detection applications in industrial environments and ready-to-use pretrained instance segmentation models to achieve robust results in a complex unstructured environment. For this purpose, they employed a multimodal input including color and depth information, and obtained relatively high accuracy in their model predictions. The authors compared their models, including the Mask R-CNN ResNet50 RGB-D, to the MediaPipe model. They turned out to be more stable regardless of illumination conditions and work glove color. Compared to the next YOLO model, their models lead to similar results. Their tests seem solid; however, I have some comments and issues requiring clarification.
1. The diagram in Figure 1 is imprecise regarding the authors' research methodology. A workflow-flowchart would be more accurate, including precise objects, activities, theoretical foundations, technical models, and the method of developing results, including the stages of the presented algorithm.
2. The diagram and architecture of the networks used and the description of their operation as well as the method of loading data and obtaining results seem very necessary for the reader.
3. Very small descriptions, e.g., in Figure 12.
4. A dictionary of abbreviations is necessary.
5. The concept in the form of a diagram explaining which computer programs were used for which purposes, in what order, what data they accepted, how they operated, and how they produced results. The basics should be provided and the algorithm explained in detail using examples so that the reader can repeat the tests to check the solidity of this research.
Author Response
Comments 1: The diagram in Figure 1 is imprecise regarding the authors' research methodology. A workflow-flowchart would be more accurate, including precise objects, activities, theoretical foundations, technical models, and the method of developing results, including the stages of the presented algorithm.
The diagram and architecture of the networks used and the description of their operation as well as the method of loading data and obtaining results seem very necessary for the reader.
Response 1: We tried to develop such a diagram, but the problem is too complex for a simple diagram. In the end, we included an image in the style of a simple graphical abstract. We provide the whole code and datasets (links to the repositories are in the paper) to ensure the repeatability of this experiment. We tried to keep the paper compact, and we share the details in the code, which is fully available.
Comments 3: Very small descriptions, e.g., in Figure 12.
Response 3: We agree that the descriptions in Figure 12 are small in the PDF version, but this paper is available online, and the figures are available in full resolution. We could adjust the size of the text, but we think it is important for the reader to see those graphs next to each other, and the graph can be enlarged to see details.
Comments 4: A dictionary of abbreviations is necessary.
Response 4: Thank you for the advice. As a dictionary would increase the length of the paper, we instead revised the text and added a description at the first use of each abbreviation.
Comments 5: The concept in the form of a diagram explaining which computer programs were used for which purposes, in what order, what data they accepted, how they operated, and how they produced results. The basics should be provided and the algorithm explained in detail using examples so that the reader can repeat the tests to check the solidity of this research.
Response 5: Our answer is the same as for the first two comments.
Reviewer 4 Report
Comments and Suggestions for Authors
The paper demonstrates strong performance of models trained solely on the proposed synthetic RGB-D dataset; however, a more detailed ablation study on domain randomization parameters (e.g., types of distractors, texture randomization, lighting variability, and hand pose distributions) would clarify which components contribute most to the observed generalization gains.
Although multimodal RGB-D input is hypothesized to improve robustness, the results show mixed behavior across AP and PDQ metrics for RGB, depth-only, and RGB-D models; a deeper analysis explaining these discrepancies, possibly supported by feature-level visualization or failure-case categorization, would strengthen the interpretation of results.
The dataset limits the maximum number of hand instances per image to two and assumes a single user scenario; discussing how the method could scale to multi-user or crowded scenes, and whether additional domain randomization strategies would be required, would improve the scope and applicability of the work.
While the evaluation on real industrial images is a strong point, including a small amount of real-data fine-tuning or cross-domain adaptation experiments could help quantify the potential performance ceiling and assess whether hybrid training further improves robustness.
Some figures and tables contain dense information (e.g., AP/PDQ curves and large comparison tables); improving readability through clearer legends, consistent color coding across plots, and a concise summary table highlighting the best-performing configurations would enhance clarity for readers.
Author Response
Comments 1: The paper demonstrates strong performance of models trained solely on the proposed synthetic RGB-D dataset; however, a more detailed ablation study on domain randomization parameters (e.g., types of distractors, texture randomization, lighting variability, and hand pose distributions) would clarify which components contribute most to the observed generalization gains.
Response 1: Our previous work extensively examined the impact of different DR levels, including object presence, random noise, and post-processing techniques designed to closely mimic camera images. As this topic was thoroughly addressed in our earlier publication, it falls outside the scope of the current study. We highlighted this in the paper.
Comments 2: Although multimodal RGB-D input is hypothesized to improve robustness, the results show mixed behavior across AP and PDQ metrics for RGB, depth-only, and RGB-D models; a deeper analysis explaining these discrepancies, possibly supported by feature-level visualization or failure-case categorization, would strengthen the interpretation of results.
Response 2: This is a very good question, but the answer is not easy. We expected that the additional modality would give the model more information and therefore better results; however, the effect is the opposite. We have a few hypotheses. The RGB and D channels are captured with different devices, and even though we performed calibration and mapping, there can still be some shift between the two images. The model can then be distracted when features do not correspond to the same pixel or area in the image. We have started to perform sensitivity analyses for different situations and a feature sensitivity analysis of the model, but this is a topic for a separate paper, and we would prefer not to go so deep into this specific topic in this study.
This problem could be somewhat expected, since the position of the hands relative to the other objects in the image varies (from near to far from the objects), so the model could not rely on it to distinguish the hand from the rest of the image. If the hands were the only object hovering above the others, the task would have been trivial, and the model would have captured that right away.
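The misalignment hypothesis could, in principle, be probed by deliberately translating the depth channel relative to the color channels and re-running the evaluation at several offsets. The sketch below only illustrates the perturbation step and is not the authors' sensitivity analysis; the function name `shift_depth`, the `(H, W, 4)` channel layout with depth last, and the zero fill value are all assumptions.

```python
import numpy as np


def shift_depth(rgbd, dx, dy, fill=0.0):
    """Copy an (H, W, 4) RGB-D array and translate its depth channel by (dx, dy) pixels.

    Positive dx moves depth to the right, positive dy moves it down.
    Pixels uncovered by the shift are set to `fill`; RGB is left unchanged.
    """
    out = np.array(rgbd, dtype=float, copy=True)
    depth = out[..., 3]
    shifted = np.full_like(depth, fill)
    h, w = depth.shape
    # Source and destination windows that stay inside the image.
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = depth[src_y, src_x]
    out[..., 3] = shifted
    return out


# Toy frame with a single depth reading at row 1, column 2.
frame = np.zeros((4, 5, 4))
frame[1, 2, 3] = 7.0
moved = shift_depth(frame, dx=1, dy=2)
```

Evaluating a trained RGB-D model on frames perturbed this way, for a range of `dx` and `dy`, would show how quickly its scores degrade as the two modalities drift apart.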
Comments 3: The dataset limits the maximum number of hand instances per image to two and assumes a single user scenario; discussing how the method could scale to multi-user or crowded scenes, and whether additional domain randomization strategies would be required, would improve the scope and applicability of the work.
Response 3: Thank you for this point. This paper focuses on a specific field of industrial applications: our research addresses assistive assembly scenarios, and we tried to keep to this scope. Crowded scenes put more emphasis on the diversity of the scene, whereas we focus on precise detection and details.
Comments 4: While the evaluation on real industrial images is a strong point, including a small amount of real-data fine-tuning or cross-domain adaptation experiments could help quantify the potential performance ceiling and assess whether hybrid training further improves robustness.
Response 4: We agree with this comment. During the study we performed several cross-validation experiments, and we validated on both synthetic and real images. In the paper we summarized the most interesting information. There is always room for a wider study, but we do not have the time for it now.
Comments 5: Some figures and tables contain dense information (e.g., AP/PDQ curves and large comparison tables); improving readability through clearer legends, consistent color coding across plots, and a concise summary table highlighting the best-performing configurations would enhance clarity for readers.
Response 5: We tried different styles of visualization, but the best and most informative were the graphs we used, where all solutions are compared in one picture. In the online version, the reader can always view the image in full resolution. We adjusted Figures 5 and 12, where we added data labels.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors addressed all my concerns satisfactorily.
Author Response
Thank you
Reviewer 3 Report
Comments and Suggestions for Authors
The authors responded to my comments in a positive way and provided relevant links to websites. The way the results are presented and the research is conducted seems correct. I think that the work is suitable for publishing.
Author Response
Thank you
Reviewer 4 Report
Comments and Suggestions for Authors
The manuscript has been well improved after the authors addressed all the reviewer's comments. Some minor comments: Authors are encouraged to include more of the latest state-of-the-art references, like those here: doi:10.3390/electronics13030530 and doi:10.3390/app14135784
Author Response
Comments 1: Authors are encouraged to include more of the latest state-of-the-art references, like those here: doi:10.3390/electronics13030530 and doi:10.3390/app14135784
Response 1: Thank you for your suggestion. We found one very recent review (2026) and added it to the related work section.
