Robust 3D Hand Detection from a Single RGB-D Image in Unconstrained Environments
Abstract
1. Introduction
- We propose a novel adaptive fusion network (AF-Net) which adaptively fuses multi-level features for 3D hand detection. The core of AF-Net is a cross-modal feature fusion unit named the “adaptive fusion unit” (AFU). As shown in Figure 3e, AFUs control the connectivity of fusion paths: if the weights of an AFU are set to 0, the fusion path is blocked; otherwise it is activated. The fusion structures enumerated in Figure 3d can be obtained by adjusting the weights of the AFUs, so AF-Net can be regarded as a generalized version of [20]. Instead of exhaustively searching for an optimal joint at which to fuse the RGB-D branches [20], multi-level features are adaptively fused by AFUs whose weights are optimized in an end-to-end manner (a minimal illustrative sketch of this gating idea is given after this list). The resulting detector is significantly more robust than hand detectors without fusion.
- We propose a stacked sub-range representation (SSR) for 3D hand detection in unconstrained environments. The whole depth range of the D channel is evenly divided into a series of smaller stacked sub-ranges, so that the normalized local depth representation within each sub-range is enhanced (please refer to Section 3.1). The D channel is first transformed into SSR, which is then fed into the network for feature extraction, fusion, and hand detection (an illustrative sketch of this transform is also given after this list). SSR produces much more accurate results than the raw depth representation.
- We propose a challenging RGB-D hand detection dataset named “CUG Hand”. To the best of our knowledge, it is the first RGB-D hand detection dataset collected in unconstrained environments. Existing RGB-D hand datasets are normally captured indoors and contain only a single subject (up to 2 hands) per image, whereas our dataset covers unconstrained environments, with 1 to 7 subjects and up to 8 hands per image. In order to evaluate the robustness and accuracy of state-of-the-art methods, various challenging factors, such as extreme lighting conditions, hand shape, scale, viewpoint, and partial occlusion, are covered by this dataset.
- The proposed 3D hand detection approach is extensively evaluated on the CUG Hand dataset as well as the public RHD hand dataset [21]. Experimental results show that the proposed approach significantly outperforms the state of the art in terms of accuracy, and it can robustly detect 3D hands even under extreme lighting conditions. The proposed approach can support a wide range of hand-related applications, such as hand gesture recognition, hand pose estimation, activity analysis, human–computer interaction, and so on.
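The following is a minimal sketch of the gating idea behind the AFU, not the actual unit defined in Section 3.2.1: a learnable per-channel weight scales the feature coming from one modality branch before it is added to the other branch, so that a zero weight blocks the fusion path and a non-zero weight activates it. The class name `AdaptiveFusionUnit`, the per-channel weighting, and the additive fusion are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class AdaptiveFusionUnit(nn.Module):
    """Toy AFU sketch: a learnable gate on a cross-modal fusion path.

    A weight of 0 blocks the path (no fusion); non-zero weights let
    source-branch features flow into the target branch.
    """

    def __init__(self, channels, init_weight=1.0):
        super().__init__()
        # One learnable weight per channel (an assumption; a single scalar
        # or a small sub-network would also fit the description).
        self.weight = nn.Parameter(torch.full((1, channels, 1, 1), init_weight))

    def forward(self, target_feat, source_feat):
        # e.g., fuse depth features (source) into the RGB branch (target)
        return target_feat + self.weight * source_feat

# Usage: fused = AdaptiveFusionUnit(256)(rgb_feat, depth_feat)
```

Because the gate is an ordinary learnable parameter, it is optimized end-to-end together with the backbone, which is what allows the fusion structure to be learned rather than exhaustively enumerated as in [20].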
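The SSR transform of Section 3.1 can likewise be sketched under stated assumptions: the depth range is split into equal sub-ranges, depth values inside each sub-range are min–max normalized to [0, 1], and values outside it are zeroed. The function name and the exact normalization below are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def stacked_sub_range(depth, num_ranges=4, max_depth=None):
    """Toy SSR sketch: split the depth range into equal sub-ranges and
    normalize depth locally within each, stacking the results as channels."""
    depth = depth.astype(np.float32)
    max_depth = float(depth.max()) if max_depth is None else float(max_depth)
    edges = np.linspace(0.0, max_depth, num_ranges + 1)
    channels = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (depth >= lo) & (depth < hi)
        local = np.zeros_like(depth)
        # local min-max normalization within the sub-range (an assumption)
        local[mask] = (depth[mask] - lo) / (hi - lo + 1e-6)
        channels.append(local)
    return np.stack(channels, axis=0)  # shape: (num_ranges, H, W)
```

The stacked multi-channel output then replaces the raw D channel as the input to the depth branch.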
2. Related Work
- 2D convolutional representations. In [41], the raw depth image (i.e., the D channel) is concatenated with the RGB channels, and then the RGB-D channels are fed into a 2D convolutional network. In [38,42], the depth image is transformed into a 3-channel HHA representation (Horizontal disparity, Height above ground, and Angle with gravity) for semantic segmentation of indoor scenes. In [43], object detection proposals are generated in a top-down bird's-eye view, which relies on the restrictive assumption that all objects lie on the same spatial plane, e.g., cars on a road.
- 3D convolutional representations. The RGB-D image can be converted into 3D convolutional representations such as voxels [44] and TSDF [45]. However, due to the curse of dimensionality, these representations are computationally expensive and have large memory footprints. 3D convolutional representations are usually applied in constrained environments within a limited cubic range, e.g., indoor scenes.
- Point-cloud representations. The depth image can be represented as a point cloud [39] for recognition. The point-cloud representation can be further enhanced by concatenating each point with its corresponding RGB features extracted from a CNN [40,46]. These methods follow the two-step scheme mentioned above. As they take 2D bounding boxes detected from the RGB image alone as input, the information in the D channel is not fully fused for detection.
- Late fusion. The RGB-D channels are fused at the end of the feature-extraction CNNs [15,16,17,18,19]. The RGB and D branches are trained in parallel, and the features from both modalities are fused at the last stage. High-level features are fused by late fusion, but mid-level features are not fully fused.
- Intermediate fusion. The RGB-D channels are fused at intermediate stages of the CNNs [20,48]. A single stage or multiple stages are selected at which the RGB and D branches are joined, so both mid-level and high-level features are fused. However, it is not clear which position is the optimal fusion joint. One solution is to conduct an exhaustive enumeration [20] so that the best position can be found. Another solution [49,50,51] is to progressively fuse the features from one branch into another at multiple corresponding stages. However, the latter solution is primarily applied in per-pixel classification tasks such as semantic segmentation and salient object detection, and it is seldom used for region-proposal-based object detection.
- Basic fusion operator. In [49], the pixel-wise summation operator is used for RGB-D fusion in semantic segmentation. In [52], basic operators such as concatenation, summation, and multiplication are compared, and the summation operator is found to work well for extreme-exposure image fusion.
- Advanced fusion layer. Instead of directly using basic operators, advanced fusion layers are designed by combining basic operators or sub-networks. In [20], a fusion layer is defined as a combination of a concatenation operator and a 2D convolutional layer (a rough sketch of this style of layer follows this list). In [53], a fusion layer is proposed by combining a contrast-enhanced sub-network and a pixel-wise multiplication operator for the per-pixel salient object detection task. Furthermore, sub-networks such as graph convolutional networks [17], gating networks [15], and LSTMs [19] have been used to construct advanced fusion layers for the high-level features in the late fusion stage. In [54], a tree-structured LSTM is used to extract relations between lexical-level features and syntactic features.
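As a rough illustration of the concat-then-convolve style of fusion layer attributed to [20] above, a sketch is shown below; the 1×1 kernel size and the channel arguments are assumptions chosen for brevity, not the configuration actually used in [20].

```python
import torch
import torch.nn as nn

class ConcatConvFusion(nn.Module):
    """Rough sketch of a fusion layer in the style of [20]:
    channel-wise concatenation followed by a 2D convolution."""

    def __init__(self, rgb_channels, depth_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(rgb_channels + depth_channels, out_channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        fused = torch.cat([rgb_feat, depth_feat], dim=1)  # concatenate along channels
        return self.conv(fused)                           # mix the two modalities
```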
3. Methods
3.1. Stacked Sub-Range Representation (SSR)
3.2. Adaptive Fusion Network (AF-Net)
3.2.1. Adaptive Fusion Unit (AFU)
3.2.2. Feature Extraction and Fusion
3.3. 3D Hand Detection
3.3.1. Region Proposal Network (RPN)
3.3.2. 2D Hand Detection
3.3.3. Cascaded 3D Estimation
3.4. Hand Reconstruction
4. CUG Hand Dataset
5. Experiments
5.1. Experimental Settings
5.2. CUG Hand Dataset
5.2.1. Robustness in the Unseen Cases
5.2.2. Fusion Direction
5.2.3. Reconstruction Module
5.2.4. The Number of Sub-Ranges in SSR
5.2.5. 3D Hand Location Estimation on Z-Axis
5.2.6. Qualitative Results
5.3. RHD Dataset
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Paravati, G.; Gatteschi, V. Human-Computer Interaction in Smart Environments. Sensors 2015, 15, 19487–19494.
- Xu, C.; Cheng, L. Efficient Hand Pose Estimation from a Single Depth Image. In Proceedings of the International Conference on Computer Vision (ICCV), Darling Harbour, Sydney, Australia, 1–8 December 2013; pp. 3456–3462.
- Xu, C.; Govindarajan, L.N.; Zhang, Y.; Cheng, L. Lie-X: Depth Image Based Articulated Object Pose Estimation, Tracking, and Action Recognition on Lie Groups. Int. J. Comput. Vis. (IJCV) 2017, 123, 454–478.
- Ge, L.; Ren, Z.; Li, Y.; Xue, Z.; Wang, Y.; Cai, J.; Yuan, J. 3D Hand Shape and Pose Estimation From a Single RGB Image. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–18 June 2019; pp. 10833–10842.
- Kirishima, T.; Sato, K.; Chihara, K. Real-time gesture recognition by learning and selective control of visual interest points. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2005, 27, 351–364.
- Lin, H.; Hsu, M.; Chen, W. Human hand gesture recognition using a convolution neural network. In Proceedings of the International Conference on Automation Science and Engineering (CASE), Taipei, Taiwan, 18–22 August 2014; pp. 1038–1043.
- Mittal, A.; Zisserman, A.; Torr, P.H.S. Hand detection using multiple proposals. In Proceedings of the British Machine Vision Conference (BMVC), Dundee, UK, 29 August–2 September 2011; pp. 1–11.
- Le, T.H.N.; Quach, K.G.; Zhu, C.; Duong, C.N.; Luu, K.; Savvides, M. Robust Hand Detection and Classification in Vehicles and in the Wild. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1203–1210.
- Deng, X.; Zhang, Y.; Yang, S.; Tan, P.; Chang, L.; Yuan, Y.; Wang, H. Joint Hand Detection and Rotation Estimation Using CNN. IEEE Trans. Image Process. 2018, 27, 1888–1900.
- Narasimhaswamy, S.; Wei, Z.; Wang, Y.; Zhang, J.; Hoai, M. Contextual attention for hand detection in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 9567–9576.
- Yang, L.; Qi, Z.; Liu, Z.; Liu, H.; Ling, M.; Shi, L.; Liu, X. An embedded implementation of CNN-based hand detection and orientation estimation algorithm. Mach. Vis. Appl. 2019, 30, 1071–1082.
- Xu, C.; Cai, W.; Li, Y.; Zhou, J.; Wei, L. Accurate Hand Detection from Single-Color Images by Reconstructing Hand Appearances. Sensors 2020, 20, 192.
- Feng, R.; Perez, C.; Zhang, H. Towards transferring grasping from human to robot with RGBD hand detection. In Proceedings of the Conference on Computer and Robot Vision (CRV), Edmonton, AB, Canada, 16–19 May 2017.
- Xu, C.; Govindarajan, L.N.; Cheng, L. Hand action detection from ego-centric depth sequences with error-correcting Hough transform. Pattern Recognit. 2017, 72, 494–503.
- Mees, O.; Eitel, A.; Burgard, W. Choosing Smartly: Adaptive Multimodal Fusion for Object Detection in Changing Environments. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016.
- Schwarz, M.; Milan, A.; Periyasamy, A.S.; Behnke, S. RGB-D Object Detection and Semantic Segmentation for Autonomous Manipulation in Clutter. Int. J. Robot. Res. 2018, 37, 437–451.
- Yuan, Y.; Xiong, Z.; Wang, Q. ACM: Adaptive Cross-Modal Graph Convolutional Neural Networks for RGB-D Scene Recognition. Assoc. Adv. Artif. Intell. (AAAI) 2019, 33, 9176–9184.
- Rahman, M.M.; Tan, Y.; Xue, J.; Shao, L.; Lu, K. 3D object detection: Learning 3D bounding boxes from scaled down 2D bounding boxes in RGB-D images. Inf. Sci. 2019, 476, 147–158.
- Li, G.; Gan, Y.; Wu, H.; Xiao, N.; Lin, L. Cross-Modal Attentional Context Learning for RGB-D Object Detection. IEEE Trans. Image Process. 2019, 28, 1591–1601.
- Ophoff, T.; Van Beeck, K.; Goedemé, T. Exploring RGB+Depth fusion for real-time object detection. Sensors 2019, 19, 866.
- Zimmermann, C.; Brox, T. Learning to estimate 3D hand pose from single RGB images. In Proceedings of the International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Binkovitz, L.A.; Berquist, T.H.; McLeod, R.A. Masses of the hand and wrist: Detection and characterization with MR imaging. Am. J. Roentgenol. 1990, 154, 323–326.
- Nölker, C.; Ritter, H. Detection of fingertips in human hand movement sequences. In Gesture and Sign Language in Human-Computer Interaction; Springer: Berlin/Heidelberg, Germany, 1998; pp. 209–218.
- Sigal, L.; Sclaroff, S.; Athitsos, V. Skin color-based video segmentation under time-varying illumination. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2004, 26, 862–877.
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893.
- Meng, X.; Lin, J.; Ding, Y. An extended HOG model: SCHOG for human hand detection. In Proceedings of the International Conference on Systems and Informatics (ICSAI), Lądek Zdrój, Poland, 20–23 June 2012.
- Guo, J.; Cheng, J.; Pang, J.; Guo, Y. Real-time hand detection based on multi-stage HOG-SVM classifier. In Proceedings of the International Conference on Image Processing (ICIP), Melbourne, Australia, 15–18 September 2013; pp. 4108–4111.
- Del Solar, J.R.; Verschae, R. Skin detection using neighborhood information. In Proceedings of the International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, 19 May 2004.
- Li, C.; Kitani, K.M. Pixel-Level Hand Detection in Ego-centric Videos. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 3570–3577.
- Gao, Q.; Liu, J.; Ju, Z. Robust real-time hand detection and localization for space human–robot interaction based on deep learning. Neurocomputing 2020, 390, 198–206.
- Wang, G.; Luo, C.; Sun, X.; Xiong, Z.; Zeng, W. Tracking by instance detection: A meta-learning approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6288–6297.
- Kohli, P.; Shotton, J. Key developments in human pose estimation for kinect. In Consumer Depth Cameras for Computer Vision; Springer: London, UK, 2013; pp. 63–70.
- Qian, C.; Sun, X.; Wei, Y.; Tang, X.; Sun, J. Realtime and Robust Hand Tracking from Depth. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 24–27 June 2014.
- Xu, C.; Nanjappa, A.; Zhang, X.; Cheng, L. Estimate Hand Poses Efficiently from Single Depth Images. Int. J. Comput. Vis. 2015, 116, 21–45.
- Oberweger, M.; Lepetit, V. Deepprior++: Improving fast and accurate 3d hand pose estimation. In Proceedings of the International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017.
- Tompson, J.; Stein, M.; Lecun, Y.; Perlin, K. Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks. ACM Trans. Graph. 2014, 33, 1–10.
- Rogez, G.; Khademi, M.; Supančič, J.S., III; Montiel, J.M.M.; Ramanan, D. 3D Hand Pose Detection in Egocentric RGB-D Images. In European Conference on Computer Vision Workshops (ECCVW); Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 356–371.
- Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. In European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2014; pp. 345–360.
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3D object detection from rgb-d data. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Wang, C.; Xu, D.; Zhu, Y.; Martin-Martin, R.; Lu, C.; Fei-Fei, L.; Savarese, S. DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
- Li, Y.; Wang, X.; Liu, W.; Feng, B. Deep attention network for joint hand gesture localization and recognition using static RGB-D images. Inf. Sci. 2018, 441, 66–78.
- Gupta, S.; Arbelaez, P.; Malik, J. Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013.
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D Object Detection Network for Autonomous Driving. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Zhao, C.; Sun, L.; Purkait, P.; Duckett, T.; Stolkin, R. Dense RGB-D Semantic Mapping with Pixel-Voxel Neural Network. Sensors 2018, 18, 3099.
- Song, S.; Xiao, J. Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
- Xu, D.; Anguelov, D.; Jain, A. PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 244–253.
- Peng, H.; Li, B.; Xiong, W.; Hu, W.; Ji, R. RGBD Salient Object Detection: A Benchmark and Algorithms. In European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2014; pp. 92–109.
- Xu, X.; Li, Y.; Wu, G.; Luo, J. Multi-modal deep feature learning for RGB-D object detection. Pattern Recognit. 2017, 72, 300–313.
- Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. In Computer Vision—ACCV 2016; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 213–228.
- Chen, H.; Li, Y. Progressively Complementarity-Aware Fusion Network for RGB-D Salient Object Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
- Chen, H.; Li, Y.; Su, D. Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognit. 2019, 86, 376–385.
- Prabhakar, K.R.; Srikar, V.S.; Babu, R.V. DeepFuse: A Deep Unsupervised Approach for Exposure Fusion with Extreme Exposure Image Pairs. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4724–4732.
- Zhao, J.X.; Cao, Y.; Fan, D.P.; Cheng, M.M.; Li, X.Y.; Zhang, L. Contrast Prior and Fluid Pyramid Integration for RGBD Salient Object Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
- Geng, Z.Q.; Chen, G.F.; Han, Y.M.; Lu, G.; Li, F. Semantic Relation Extraction Using Sequential and Tree-structured LSTM with Attention. Inf. Sci. 2020, 509, 183–192.
- Misra, I.; Shrivastava, A.; Gupta, A.; Hebert, M. Cross-stitch networks for multi-task learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3994–4003.
- El, R.O.; Rosman, G.; Wetzler, A.; Kimmel, R.; Bruckstein, A.M. RGBD-fusion: Real-time high precision depth recovery. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
- Bambach, S.; Lee, S.; Crandall, D.J.; Yu, C. Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; Volume 2015, pp. 1949–1957.
- Martin, S.; Yuen, K.; Trivedi, M.M. Vision for Intelligent Vehicles & Applications (VIVA): Face detection and head pose challenge. In Proceedings of the Intelligent Vehicles Symposium (IV), Gothenburg, Sweden, 19–22 June 2016.
- Yuan, S.; Ye, Q.; Stenger, B.; Jain, S.; Kim, T.K. BigHand2.2M Benchmark: Hand Pose Dataset and State of the Art Analysis. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Mueller, F.; Mehta, D.; Sotnychenko, O.; Sridhar, S.; Casas, D.; Theobalt, C. Real-time hand tracking under occlusion from an egocentric rgb-d sensor. In Proceedings of the International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 1284–1293.
- Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Uijlings, J.R.R.; Sande, K.E.A.V.D.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
- Alexe, B.; Deselaers, T.; Ferrari, V. Measuring the Objectness of Image Windows. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2189–2202.
- He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2D pose estimation using Part Affinity Fields. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 2015, pp. 91–99.
- Khan, A.U.; Borji, A. Analysis of Hand Segmentation in the Wild. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
- Baek, S.; Kim, K.I.; Kim, T.K. Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
Split | Simple | Ordinary | Complex | Back-Light | Dark | All
---|---|---|---|---|---|---
Training | \ | 367 | 257 | \ | \ | 625
Testing | 96 | 108 | 168 | 128 | 119 | 619
Channels | Method | Simple | Ordinary | Complex | Back-Light | Dark | All |
---|---|---|---|---|---|---|---|
RGB | OpenPose [66] | 74.3 | 62.0 | 56.5 | 54.2 | 37.3 | 55.2 |
RGB | FPN [61] | 100.0 | 78.9 | 63.1 | 56.7 | 27.8 | 61.8 |
RGB | FasterRCNN [67] | 93.6 | 79.5 | 67.7 | 63.3 | 30.0 | 63.2 |
RGB | Xu2020 [12] | 99.8 | 86.1 | 68.4 | 66.9 | 48.9 | 69.1 |
D | Raw | 89.5 | 34.8 | 14.8 | 28.1 | 56.0 | 31.0 |
D | SSR | 96.5 | 45.8 | 19.7 | 28.3 | 67.5 | 37.5 |
RGB-D | Exhaustive enumeration [20] | 99.9 | 79.1 | 71.9 | 56.2 | 36.7 | 65.5 |
RGB-D | Cross-stitch [55] | 100.0 | 84.0 | 71.3 | 62.5 | 45.1 | 68.5 |
RGB-D | Ours w/o reconstruct | 100.0 | 87.5 | 71.6 | 65.7 | 58.4 | 72.2 |
RGB-D | Ours | 100.0 | 88.0 | 72.7 | 65.9 | 62.5 | 74.1 |
Fusion Direction | Simple | Ordinary | Complex | Back-Light | Dark | All |
---|---|---|---|---|---|---|
from D to RGB | 100.0 | 87.5 | 71.6 | 65.7 | 58.4 | 72.2 |
from RGB to D | 100.0 | 88.0 | 72.7 | 63.2 | 45.5 | 70.1 |
Options | Simple | Ordinary | Complex | Back-Light | Dark | All |
---|---|---|---|---|---|---|
w/o reconstruct | 100.0 | 87.5 | 71.6 | 65.7 | 58.4 | 72.2 |
Reconstruct SSR | 100.0 | 87.1 | 69.1 | 71.3 | 58.9 | 72.0 |
Reconstruct RGB | 100.0 | 89.4 | 72.0 | 64.5 | 59.1 | 73.0 |
Reconstruct RGB-SSR | 100.0 | 88.0 | 72.7 | 65.9 | 62.5 | 74.1 |
Distance | <1000 mm | 1000–2000 mm | 2000–3000 mm | >3000 mm | All |
---|---|---|---|---|---|
Direct regression | 22.233 | 27.312 | 27.785 | 51.110 | 29.156 |
Cascaded network | 6.396 | 10.619 | 13.998 | 24.253 | 12.154 |
Patch Size | |||||
---|---|---|---|---|---|
Direct regression | 54.746 | 32.754 | 29.156 | 33.712 | 34.897 |
Cascaded network | 26.016 | 16.895 | 12.154 | 13.223 | 14.356 |