HaDR: Hand Instance Segmentation Using a Synthetic Multimodal Dataset Based on Domain Randomization
Abstract
1. Introduction
2. Related Work
2.1. Synthetic Dataset Generation
2.2. Hand Datasets
2.3. Instance Segmentation
3. Methods
3.1. Domain Randomization for Dataset Generation
Randomized aspects of the generated scenes include the following (a minimal sampler sketch follows this list):
- The number, colors, textures, scales, and types of distractor objects, selected from a set of 3D models of general tools and geometric primitives. A special type of distractor, an articulated human body model without hands, is also used (see Figure 2b).
- Hand gestures (see Figure 3).
- Positions and orientations of the hand models.
- Texture and surface properties (diffuse, specular, and emissive) of the objects of interest and of their background, as well as the number of objects of interest (from zero to two).
- The number (from one to four) and positions of directional light sources, along with a planar light that provides ambient illumination.
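To make the randomization scheme concrete, the sketch below samples one scene configuration per rendered frame. It is a minimal illustration rather than the actual generator: the asset pools, scale and count ranges, and parameter names are assumptions, while the zero-to-two hands and one-to-four directional lights follow the list above.

```python
import random

# Hypothetical asset pools; the actual 3D model sets are not listed here.
DISTRACTOR_MODELS = ["tool_wrench", "tool_drill", "cube", "sphere", "cylinder"]
HAND_GESTURES = ["open", "fist", "point", "pinch"]  # illustrative gesture set


def random_color():
    """Random RGB triple in [0, 1], reused for diffuse/specular/emissive."""
    return tuple(random.random() for _ in range(3))


def sample_scene_config():
    """Sample one domain-randomized scene, following the parameter list above."""
    return {
        # Distractors: random count, type, color, and scale (ranges assumed).
        "distractors": [
            {
                "model": random.choice(DISTRACTOR_MODELS),
                "color": random_color(),
                "scale": random.uniform(0.5, 2.0),
            }
            for _ in range(random.randint(0, 10))
        ],
        # Objects of interest: zero to two hands, each with a random gesture,
        # pose, and surface properties.
        "hands": [
            {
                "gesture": random.choice(HAND_GESTURES),
                "position": [random.uniform(-0.5, 0.5) for _ in range(3)],
                "orientation_deg": [random.uniform(0, 360) for _ in range(3)],
                "diffuse": random_color(),
                "specular": random_color(),
                "emissive": random_color(),
            }
            for _ in range(random.randint(0, 2))
        ],
        "background_texture": random.randrange(1000),  # texture index
        # One to four directional lights plus an ambient planar light.
        "lights": [
            {"position": [random.uniform(-1.0, 1.0) for _ in range(3)]}
            for _ in range(random.randint(1, 4))
        ],
    }
```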
3.2. Alternative Synthetic Methods
3.3. Training Process
3.4. Model Evaluation
4. Results
4.1. Comparison with Pretrained Models
4.2. Comparison with Existing Datasets
The existing datasets were adapted for this comparison as follows (a sketch of the depth mapping follows this list):
- Unified and merged class masks: for DenseHands, the dense correspondence maps were binarized and used as masks for hand instances; for ObMan and RHD, all masks except hands were omitted.
- Depth range: the depth range [0.2, 1.0] m was mapped to the byte range according to the settings of the test environment; depth values beyond the 1 m limit were clipped to it.
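A minimal sketch of this depth encoding, assuming a linear mapping of [0.2, 1.0] m onto the 8-bit range [0, 255] (the exact mapping used in the test environment is not specified, so the linear form is an assumption):

```python
import numpy as np

DEPTH_MIN, DEPTH_MAX = 0.2, 1.0  # metres, as stated above


def depth_to_byte(depth_m: np.ndarray) -> np.ndarray:
    """Map metric depth to one byte per pixel, clipping values beyond 1 m."""
    clipped = np.clip(depth_m, DEPTH_MIN, DEPTH_MAX)
    normalized = (clipped - DEPTH_MIN) / (DEPTH_MAX - DEPTH_MIN)
    return (normalized * 255).astype(np.uint8)
```

For example, depths of 0.1 m, 0.5 m, and 1.7 m map to byte values 0, 95, and 255, respectively.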
4.3. Comparison with Existing Solutions
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Pan, Y.; Chen, C.; Zhao, Z.; Hu, T.; Zhang, J. Robot teaching system based on hand-robot contact state detection and motion intention recognition. Robot. Comput.-Integr. Manuf. 2023, 81, 102492.
- Li, S.; Zheng, P.; Liu, S.; Wang, Z.; Wang, X.V.; Zheng, L.; Wang, L. Proactive human–robot collaboration: Mutual-cognitive, predictable, and self-organising perspectives. Robot. Comput.-Integr. Manuf. 2023, 81, 102510.
- Schött, S.Y.; Amin, R.M.; Butz, A. A Literature Survey of How to Convey Transparency in Co-Located Human Robot Interaction. Multimodal Technol. Interact. 2023, 7, 25.
- Kim, E.; Kirschner, R.; Yamada, Y.; Okamoto, S. Estimating probability of human hand intrusion for speed and separation monitoring using interference theory. Robot. Comput.-Integr. Manuf. 2020, 61, 101819.
- Amorim, A.; Guimarães, D.; Mendonça, T.; Neto, P.; Costa, P.; Moreira, A.P. Robust human position estimation in cooperative robotic cells. Robot. Comput.-Integr. Manuf. 2021, 67, 102035.
- Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. MediaPipe Hands: On-device Real-time Hand Tracking. arXiv 2020, arXiv:2006.10214.
- Lucci, N.; Monguzzi, A.; Zanchettin, A.M.; Rocco, P. Workflow modelling for human–robot collaborative assembly operations. Robot. Comput.-Integr. Manuf. 2022, 78, 102384.
- Vysocky, A.; Grushko, S.; Spurny, T.; Pastor, R.; Kot, T. Generating Synthetic Depth Image Dataset for Industrial Applications of Hand Localization. IEEE Access 2022, 10, 99734–99744.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755.
- Hall, D.; Dayoub, F.; Skinner, J.; Zhang, H.; Miller, D.; Corke, P.; Carneiro, G.; Angelova, A.; Sünderhauf, N. Probabilistic Object Detection: Definition and Evaluation. arXiv 2020, arXiv:1811.10800.
- Jalayer, R.; Jalayer, M.; Orsenigo, C.; Tomizuka, M. A review on deep learning for vision-based hand detection, hand segmentation and hand gesture recognition in human–robot interaction. Robot. Comput.-Integr. Manuf. 2026, 97, 103110.
- Hillebrand, G.; Bauer, M.; Achatz, K.; Klinker, G. Inverse kinematic infrared optical finger tracking. In Proceedings of the 9th International Conference on Humans and Computers (HC 2006), Aizu, Japan, 6–9 March 2006; pp. 6–9.
- Wetzler, A.; Slossberg, R.; Kimmel, R. Rule Of Thumb: Deep derotation for improved fingertip detection. arXiv 2015, arXiv:1507.05726.
- Baldi, T.L.; Scheggi, S.; Meli, L.; Mohammadi, M.; Prattichizzo, D. GESTO: A Glove for Enhanced Sensing and Touching Based on Inertial and Magnetic Sensors for Hand Tracking and Cutaneous Feedback. IEEE Trans. Hum.-Mach. Syst. 2017, 47, 1066–1076.
- Grushko, S.; Vysocký, A.; Heczko, D.; Bobovský, Z. Intuitive Spatial Tactile Feedback for Better Awareness about Robot Trajectory during Human-Robot Collaboration. Sensors 2021, 21, 5748.
- Vysocký, A.; Grushko, S.; Oščádal, P.; Kot, T.; Babjak, J.; Jánoš, R.; Sukop, M.; Bobovský, Z. Analysis of Precision and Stability of Hand Tracking with Leap Motion Sensor. Sensors 2020, 20, 4088.
- Mazhar, O.; Navarro, B.; Ramdani, S.; Passama, R.; Cherubini, A. A real-time human-robot interaction framework with robust background invariant hand gesture detection. Robot. Comput.-Integr. Manuf. 2019, 60, 34–48.
- Yurtsever, E.; Yang, D.; Koc, I.M.; Redmill, K.A. Photorealism in Driving Simulations: Blending Generative Adversarial Image Synthesis with Rendering. IEEE Trans. Intell. Transp. Syst. 2022, 23, 23114–23123.
- Mueller, F.; Bernard, F.; Sotnychenko, O.; Mehta, D.; Sridhar, S.; Casas, D.; Theobalt, C. GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 49–59.
- Romero, J.; Tzionas, D.; Black, M.J. Embodied Hands: Modeling and Capturing Hands and Bodies Together. ACM Trans. Graph. 2017, 36, 1–17.
- Zimmermann, C.; Brox, T. Learning to Estimate 3D Hand Pose from Single RGB Images. arXiv 2017, arXiv:1705.01389.
- Dwibedi, D.; Misra, I.; Hebert, M. Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection. arXiv 2017, arXiv:1708.01642.
- Georgakis, G.; Mousavian, A.; Berg, A.C.; Kosecka, J. Synthesizing Training Data for Object Detection in Indoor Scenes. arXiv 2017, arXiv:1702.07836.
- Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. arXiv 2017, arXiv:1703.06907.
- Ganin, Y.; Lempitsky, V. Unsupervised Domain Adaptation by Backpropagation. arXiv 2015, arXiv:1409.7495.
- Liebelt, J.; Schmid, C. Multi-View Object Class Detection with a 3D Geometric Model. In Proceedings of the 23rd IEEE Conference on Computer Vision & Pattern Recognition; IEEE: New York, NY, USA, 2010; p. 1688.
- Planche, B.; Singh, R.V. Physics-based Differentiable Depth Sensor Simulation. arXiv 2021, arXiv:2103.16563.
- Oprea, S.; Karvounas, G.; Martinez-Gonzalez, P.; Kyriazis, N.; Orts-Escolano, S.; Oikonomidis, I.; Garcia-Garcia, A.; Tsoli, A.; Garcia-Rodriguez, J.; Argyros, A. H-GAN: The power of GANs in your Hands. arXiv 2021, arXiv:2103.15017.
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. arXiv 2020, arXiv:1703.10593.
- Sadeghi, F.; Levine, S. CAD2RL: Real Single-Image Flight without a Single Real Image. arXiv 2017, arXiv:1611.04201.
- Tzeng, E.; Devin, C.; Hoffman, J.; Finn, C.; Abbeel, P.; Levine, S.; Saenko, K.; Darrell, T. Adapting Deep Visuomotor Representations with Weak Pairwise Constraints. arXiv 2017, arXiv:1511.07111.
- Hinterstoisser, S.; Lepetit, V.; Wohlhart, P.; Konolige, K. On Pre-Trained Image Features and Synthetic Images for Deep Learning. arXiv 2017, arXiv:1710.10710.
- Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. arXiv 2018, arXiv:1804.06516.
- Dehban, A.; Borrego, J.; Figueiredo, R.; Moreno, P.; Bernardino, A.; Santos-Victor, J. The Impact of Domain Randomization on Object Detection: A Case Study on Parametric Shapes and Synthetic Textures. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 2593–2600.
- Khirodkar, R.; Yoo, D.; Kitani, K.M. Domain Randomization for Scene-Specific Car Detection and Pose Estimation. arXiv 2018, arXiv:1811.05939.
- Horváth, D.; Erdős, G.; Istenes, Z.; Horváth, T.; Földi, S. Object Detection Using Sim2Real Domain Randomization for Robotic Applications. IEEE Trans. Robot. 2022, 39, 1225–1243.
- Mueller, F.; Davis, M.; Bernard, F.; Sotnychenko, O.; Verschoor, M.; Otaduy, M.A.; Casas, D.; Theobalt, C. Real-time Pose and Shape Reconstruction of Two Interacting Hands With a Single Depth Camera. ACM Trans. Graph. 2019, 38, 1–13.
- Bambach, S.; Lee, S.; Crandall, D.J.; Yu, C. Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1949–1957.
- Nuzzi, C.; Pasinetti, S.; Pagani, R.; Coffetti, G.; Sansoni, G. HANDS: An RGB-D dataset of static hand-gestures for human-robot interaction. Data Brief 2021, 35, 106791.
- Nuzzi, C.; Pasinetti, S.; Pagani, R.; Ghidini, S.; Beschi, M.; Coffetti, G.; Sansoni, G. MEGURU: A gesture-based robot program builder for Meta-Collaborative workstations. Robot. Comput.-Integr. Manuf. 2021, 68, 102085.
- Bojja, A.K.; Mueller, F.; Malireddi, S.R.; Oberweger, M.; Lepetit, V.; Theobalt, C.; Yi, K.M.; Tagliasacchi, A. HandSeg: An Automatically Labeled Dataset for Hand Segmentation from Depth Images. arXiv 2018, arXiv:1711.05944.
- Hasson, Y.; Varol, G.; Tzionas, D.; Kalevatykh, I.; Black, M.J.; Laptev, I.; Schmid, C. Learning joint reconstruction of hands and manipulated objects. arXiv 2019, arXiv:1904.05767.
- Qian, C.; Sun, X.; Wei, Y.; Tang, X.; Sun, J. Realtime and Robust Hand Tracking from Depth. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1106–1113.
- Tompson, J.; Stein, M.; Lecun, Y.; Perlin, K. Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks. ACM Trans. Graph. 2014, 33, 1–10.
- Gu, W.; Bai, S.; Kong, L. A review on 2D instance segmentation based on deep neural networks. Image Vis. Comput. 2022, 120, 104401.
- Huang, Z.; Huang, L.; Gong, Y.; Huang, C.; Wang, X. Mask Scoring R-CNN. arXiv 2019, arXiv:1903.00241.
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. arXiv 2019, arXiv:1904.02689.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. arXiv 2018, arXiv:1803.01534.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2016, arXiv:1506.01497.
- De Brabandere, B.; Neven, D.; Van Gool, L. Semantic Instance Segmentation with a Discriminative Loss Function. arXiv 2017, arXiv:1708.02551.
- Liu, S.; Jia, J.; Fidler, S.; Urtasun, R. SGN: Sequential Grouping Networks for Instance Segmentation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3516–3524.
- Newell, A.; Huang, Z.; Deng, J. Associative Embedding: End-to-End Learning for Joint Detection and Grouping. arXiv 2017, arXiv:1611.05424.
- Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. SOLO: Segmenting Objects by Locations. arXiv 2020, arXiv:1912.04488.
- Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and Fast Instance Segmentation. arXiv 2020, arXiv:2003.10152.
- Rohmer, E.; Singh, S.P.N.; Freese, M. V-REP: A versatile and scalable robot simulation framework. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; pp. 1321–1326.
- Wspanialy, P. Pycococreator, version 0.2.1; Zenodo: Geneva, Switzerland, 2018.
- Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023.
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
- Wenkel, S.; Alhazmi, K.; Liiv, T.; Alrshoud, S.; Simon, M. Confidence Score: The Forgotten Dimension of Object Detection Performance Evaluation. Sensors 2021, 21, 4350.
- Kapitanov, A.; Kvanchiani, K.; Nagaev, A.; Kraynov, R.; Makhliarchuk, A. HaGRID–HAnd Gesture Recognition Image Dataset. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: New York, NY, USA, 2024.
Table: Comparison with pretrained models. COCO-style mask AP and AR on the test set for models trained on HaDR (Depth, RGB, and RGB-D input modalities) and for the corresponding COCO-pretrained models (COCO rows).

| Model | AP@0.5:0.95 | AP@0.5 | AP_small@0.5:0.95 | AP_medium@0.5:0.95 | AP_large@0.5:0.95 | AR@0.5:0.95 |
|---|---|---|---|---|---|---|
| SOLOv2 ResNet50 (Depth) | 0.338 | 0.644 | 0.263 | 0.369 | 0.405 | 0.265 |
| SOLOv2 ResNet101 (Depth) | 0.356 | 0.681 | 0.293 | 0.392 | 0.415 | 0.274 |
| Mask R-CNN ResNet50 (Depth) | 0.364 | 0.712 | 0.291 | 0.424 | 0.415 | 0.273 |
| Mask R-CNN ResNet101 (Depth) | 0.357 | 0.686 | 0.308 | 0.415 | 0.394 | 0.274 |
| SOLOv2 ResNet50 (RGB) | 0.473 | 0.717 | 0.399 | 0.578 | 0.557 | 0.330 |
| SOLOv2 ResNet101 (RGB) | 0.474 | 0.740 | 0.391 | 0.583 | 0.550 | 0.336 |
| Mask R-CNN ResNet50 (RGB) | 0.523 | 0.821 | 0.459 | 0.612 | 0.576 | 0.357 |
| Mask R-CNN ResNet101 (RGB) | 0.443 | 0.704 | 0.388 | 0.567 | 0.497 | 0.321 |
| SOLOv2 ResNet50 (RGB-D) | 0.408 | 0.718 | 0.326 | 0.465 | 0.460 | 0.306 |
| SOLOv2 ResNet101 (RGB-D) | 0.410 | 0.709 | 0.348 | 0.457 | 0.458 | 0.306 |
| Mask R-CNN ResNet50 (RGB-D) | 0.449 | 0.800 | 0.412 | 0.505 | 0.479 | 0.325 |
| Mask R-CNN ResNet101 (RGB-D) | 0.445 | 0.779 | 0.423 | 0.504 | 0.468 | 0.323 |
| SOLOv2 ResNet50 (COCO) | 0.329 | 0.607 | 0.394 | 0.490 | 0.274 | 0.281 |
| SOLOv2 ResNet101 (COCO) | 0.252 | 0.574 | 0.278 | 0.349 | 0.244 | 0.225 |
| Mask R-CNN ResNet50 (COCO) | 0.123 | 0.292 | 0.218 | 0.322 | 0.082 | 0.157 |
| Mask R-CNN ResNet101 (COCO) | 0.101 | 0.234 | 0.221 | 0.284 | 0.082 | 0.148 |
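For reference, metrics in this COCO style (AP averaged over IoU thresholds 0.5:0.95, with size-stratified variants) are typically computed with pycocotools. The sketch below shows a standard evaluation run, assuming ground truth and detections stored in COCO-format JSON files; the file names are placeholders, not the project's actual files.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths; any COCO-format annotation/result files work here.
coco_gt = COCO("hand_test_annotations.json")
coco_dt = coco_gt.loadRes("model_segmentation_results.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")  # mask-level AP/AR
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@0.5:0.95, AP@0.5, AP_small/medium/large, AR
```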
Table: Probability-based Detection Quality (PDQ) results for HaDR-trained models: the maximum PDQ over the confidence score threshold sweep, the threshold at which it occurs, the AP at that threshold, and the maximum AP over the sweep.

| Model | PDQ_max | Confidence Score Threshold at PDQ_max | AP at PDQ_max | AP_max |
|---|---|---|---|---|
| SOLOv2 ResNet50 (Depth) | 0.1071 | 0.550 | 0.301 | 0.338 |
| SOLOv2 ResNet101 (Depth) | 0.1178 | 0.550 | 0.318 | 0.357 |
| Mask R-CNN ResNet50 (Depth) | 0.1259 | 0.925 | 0.346 | 0.372 |
| Mask R-CNN ResNet101 (Depth) | 0.1318 | 0.825 | 0.339 | 0.364 |
| SOLOv2 ResNet50 (RGB) | 0.1147 | 0.600 | 0.380 | 0.474 |
| SOLOv2 ResNet101 (RGB) | 0.1054 | 0.550 | 0.385 | 0.475 |
| Mask R-CNN ResNet50 (RGB) | 0.0874 | 0.975 | 0.498 | 0.525 |
| Mask R-CNN ResNet101 (RGB) | 0.0768 | 0.975 | 0.420 | 0.445 |
| SOLOv2 ResNet50 (RGB-D) | 0.1351 | 0.575 | 0.342 | 0.408 |
| SOLOv2 ResNet101 (RGB-D) | 0.1449 | 0.500 | 0.372 | 0.410 |
| Mask R-CNN ResNet50 (RGB-D) | 0.1576 | 0.900 | 0.424 | 0.455 |
| Mask R-CNN ResNet101 (RGB-D) | 0.1480 | 0.975 | 0.409 | 0.450 |
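The threshold column above implies a sweep over confidence score cut-offs when searching for PDQ_max. The sketch below illustrates such a sweep; the 0.025 step and the `compute_pdq` and `compute_ap` callbacks are hypothetical stand-ins for the actual PDQ and AP evaluation code, not the authors' implementation.

```python
import numpy as np


def sweep_confidence_thresholds(detections, ground_truth,
                                compute_pdq, compute_ap,
                                thresholds=np.arange(0.0, 1.0, 0.025)):
    """Find the confidence threshold that maximizes PDQ.

    `detections` is a list of dicts with a 'score' field; `compute_pdq`
    and `compute_ap` are injected evaluation callbacks (hypothetical).
    """
    best = {"pdq": -1.0, "threshold": None, "ap": None}
    for t in thresholds:
        kept = [d for d in detections if d["score"] >= t]  # filter by score
        pdq = compute_pdq(kept, ground_truth)
        if pdq > best["pdq"]:
            best = {"pdq": pdq, "threshold": float(t),
                    "ap": compute_ap(kept, ground_truth)}
    return best
```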
Table: Overview of the compared hand datasets: annotation method, source of samples, maximum number of hand instances per image, available modalities, and number of images.

| Dataset | Annotation Method | Sample Source | Instances per Image | Modalities | Number of Images |
|---|---|---|---|---|---|
| EgoHands | Manual | Real | Up to 4 | RGB | 4800 |
| HandSeg | Automatic (marker/gloves) | Real | Up to 2 | RGB-D | 158,315 |
| DenseHands | Automatic | Synthetic | Up to 2 | Depth | 85,611 |
| Rendered Hand Pose (RHD) | Automatic | Synthetic | Up to 2 | RGB-D | 43,986 |
| ObMan | Automatic | Synthetic | Up to 2 | RGB-D | 154,298 |
| HaDR (Ours) | Automatic | Synthetic | Up to 2 | RGB-D | 117,438 |
Table: Comparison with existing datasets. PDQ and AP results for models trained on each existing dataset, evaluated under the same protocol as above.

| Training Dataset: Model | PDQ_max | Confidence Score Threshold at PDQ_max | AP at PDQ_max | AP_max |
|---|---|---|---|---|
| DenseHands: SOLOv2 ResNet50 (Depth) | 0.0000 | 0.000 | 0.009 | 0.009 |
| DenseHands: SOLOv2 ResNet101 (Depth) | 0.0001 | 0.225 | 0.008 | 0.008 |
| DenseHands: Mask R-CNN ResNet50 (Depth) | 0.0061 | 0.975 | 0.084 | 0.093 |
| DenseHands: Mask R-CNN ResNet101 (Depth) | 0.0034 | 0.975 | 0.027 | 0.029 |
| HandSeg: SOLOv2 ResNet50 (Depth) | 0.0104 | 0.350 | 0.035 | 0.039 |
| HandSeg: SOLOv2 ResNet101 (Depth) | 0.0194 | 0.450 | 0.060 | 0.069 |
| HandSeg: Mask R-CNN ResNet50 (Depth) | 0.0147 | 0.975 | 0.018 | 0.020 |
| HandSeg: Mask R-CNN ResNet101 (Depth) | 0.0158 | 0.925 | 0.017 | 0.018 |
| EgoHands: SOLOv2 ResNet50 (RGB) | 0.1025 | 0.575 | 0.264 | 0.299 |
| EgoHands: SOLOv2 ResNet101 (RGB) | 0.0975 | 0.500 | 0.229 | 0.250 |
| EgoHands: Mask R-CNN ResNet50 (RGB) | 0.0742 | 0.950 | 0.161 | 0.174 |
| EgoHands: Mask R-CNN ResNet101 (RGB) | 0.1040 | 0.900 | 0.214 | 0.253 |
| ObMan: SOLOv2 ResNet50 (RGB-D) | 0.0606 | 0.450 | 0.138 | 0.165 |
| ObMan: SOLOv2 ResNet101 (RGB-D) | 0.0535 | 0.375 | 0.135 | 0.160 |
| ObMan: Mask R-CNN ResNet50 (RGB-D) | 0.0798 | 0.975 | 0.187 | 0.217 |
| ObMan: Mask R-CNN ResNet101 (RGB-D) | 0.0786 | 0.950 | 0.206 | 0.227 |
| RHD: SOLOv2 ResNet50 (RGB-D) | 0.0591 | 0.725 | 0.130 | 0.148 |
| RHD: SOLOv2 ResNet101 (RGB-D) | 0.0794 | 0.675 | 0.157 | 0.168 |
| RHD: Mask R-CNN ResNet50 (RGB-D) | 0.0775 | 0.975 | 0.165 | 0.175 |
| RHD: Mask R-CNN ResNet101 (RGB-D) | 0.0839 | 0.975 | 0.169 | 0.178 |
Table: Comparison with existing solutions. The best-performing HaDR-trained model per modality versus MediaPipe and YOLOv10.

| Model | PDQ_max | Confidence Score Threshold at PDQ_max | AP at PDQ_max | AP_max |
|---|---|---|---|---|
| Mask R-CNN ResNet101 (Depth) | 0.1318 | 0.825 | 0.339 | 0.364 |
| SOLOv2 ResNet50 (RGB) | 0.1147 | 0.600 | 0.380 | 0.474 |
| Mask R-CNN ResNet50 (RGB-D) | 0.1576 | 0.900 | 0.424 | 0.455 |
| MediaPipe (RGB) | 0.0836 | 0.050 | 0.181 | 0.181 |
| YOLOv10 (RGB) | 0.0613 | 0.200 | 0.825 | 0.825 |
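For context on the MediaPipe row, the sketch below shows one way to run MediaPipe Hands on an RGB image and derive per-hand bounding boxes from its landmarks. The low detection confidence of 0.05 mirrors the threshold reported above; the image path and the box-from-landmarks conversion are illustrative assumptions, not the paper's evaluation protocol.

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(
    static_image_mode=True,         # per-image detection, no tracking
    max_num_hands=2,                # matches the up-to-two-instances setting
    min_detection_confidence=0.05,  # low threshold, as in the table above
)

image = cv2.imread("test_image.png")  # placeholder path
results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    h, w = image.shape[:2]
    for landmarks in results.multi_hand_landmarks:
        # Derive a bounding box from the 21 normalized hand landmarks.
        xs = [lm.x * w for lm in landmarks.landmark]
        ys = [lm.y * h for lm in landmarks.landmark]
        print("hand bbox:", min(xs), min(ys), max(xs), max(ys))
```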
