Designing a Computer-Vision Application: A Case Study for Hand-Hygiene Assessment in an Open-Room Environment
Abstract
1. Introduction
- A multi-modality framework to recognize cross-scenario hand-hygiene actions in untrimmed video sequences. This hierarchical system incorporates several modalities (RGB, optical flow, hand masks, and human skeleton joints), each used to recognize a particular subset of hand-hygiene actions. Combined, these modalities provide effective hand-hygiene assessment across multiple scenarios.
- A comparison of a baseline spatial-only deep learning model and a spatio-temporal model on same-scenario hand hygiene, for both trimmed and untrimmed hand-hygiene videos. We demonstrate that the spatial-only model performs as well as the spatio-temporal model for hand-hygiene recognition within the same scenario.
- A demonstration of two methods (Grad-CAM and a hidden-patch experiment) that explore potential reasons why our baseline RGB-only model performs poorly across scenarios. These indicate that a primary reason a model designed for one scenario does not transfer well to a second scenario is that it may have learned to focus on irrelevant objects.
2. Background and Motivation
2.1. Applications for a Hand-Hygiene Assessment System
2.2. Background in Video Analytics and Computer Vision
2.3. Implications for Designing a Cross-Scenario Hand-Hygiene System
3. Method
- Step 1: Define the problem and gather appropriate data;
- Step 2: Label the data, which helps assess the appropriateness of the visual task;
- Step 3: Design and implement a baseline system;
- Step 4: Evaluate the baseline system;
- Step 5: Redesign, both to improve the robustness of our feature representations and to leverage advances in computer vision;
- Step 6: Evaluate the improved system.
- Step 4(a): Apply Grad-CAM to better understand the weak design;
- Step 4(b): Apply a hidden-patch experiment to verify that the baseline system learns to pay attention to extraneous information.
- Step 6(a): Evaluate each subsystem of the hierarchical system separately;
- Step 6(b): Evaluate the overall hierarchical system.
4. Dataset: Class-23 Hand Hygiene Dataset
4.1. Rationale for a New Dataset
4.2. Data Collection
4.3. Data Processing and Labeling
4.3.1. Creation of Untrimmed Videos
4.3.2. Defining the Set of Actions to Be Labeled
4.3.3. Labeling and Creation of the Trimmed Videos
4.4. Training, Validation and Testing Data Creation
5. A Baseline System for Hand-Hygiene Action Recognition Using RGB Inputs
5.1. Baseline Hand-Hygiene System Description
5.2. Performance of Baseline System and Its Variants
- Does the baseline system perform well in the same scenario case?
- Does adding temporal information to the baseline system improve performance?
- Is performance of the baseline system degraded if it must also decide about non-hygiene actions?
- Does the baseline system still perform well in the cross-scenario case?
5.3. Discussion about Baseline System
6. Exploring Reasons for Poor Performance of the Baseline System
6.1. Grad-CAM Exploratory Experiment: Method, Results, and Discussion
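The Grad-CAM computation named in this heading can be sketched as follows. This is a minimal illustration, assuming the activations of the last convolutional layer and the gradients of the target class score with respect to them have already been extracted as NumPy arrays of shape (K, H, W). The function name `grad_cam` and the array layout are assumptions made for this example, not the implementation used in the study.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Illustrative Grad-CAM heat map (hypothetical helper, not the paper's code).

    activations: (K, H, W) feature maps of the last convolutional layer.
    gradients:   (K, H, W) gradients of the class score w.r.t. those maps.
    Returns an (H, W) map scaled to [0, 1].
    """
    # Channel weights: global average of the gradients over the spatial axes.
    weights = gradients.mean(axis=(1, 2))              # shape (K,)
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    # Keep only regions with a positive influence on the class score.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

Overlaying the resulting map on the input frame shows which regions drive a prediction, which is how attention to irrelevant objects can be spotted.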
6.2. Hidden-Patch Exploratory Experiment: Method, Results, and Discussion
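A hidden-patch experiment of this kind can be sketched as below: a rectangular region (for example, the sanitizer ROI) is overwritten before the frame is passed to the classifier, and the change in the class logits is recorded. The helper names `hide_patch` and `logit_shift`, the `(x0, y0, x1, y1)` box convention, and the constant fill value are all assumptions made for illustration; `model_fn` stands for any callable that maps a frame to a logit vector.

```python
import numpy as np


def hide_patch(frame: np.ndarray, box: tuple, fill: float = 0.0) -> np.ndarray:
    """Return a copy of `frame` (H, W, C) with the box region overwritten.

    `box` is (x0, y0, x1, y1) in pixel coordinates (assumed convention).
    """
    x0, y0, x1, y1 = box
    hidden = frame.copy()          # leave the original frame untouched
    hidden[y0:y1, x0:x1] = fill    # blank out the patch
    return hidden


def logit_shift(model_fn, frame: np.ndarray, box: tuple) -> np.ndarray:
    """Per-class logit change caused by hiding the patch (hidden minus original)."""
    return model_fn(hide_patch(frame, box)) - model_fn(frame)
```

A large shift after hiding an object that is unrelated to the action suggests the model relies on that object rather than on the hands.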
7. Hierarchical Action Detection System Using Multiple Modalities for Cross-Scenario Hand Hygiene
7.1. The Overall Hierarchical System
7.2. Subsystem Descriptions for Each Modality
7.2.1. Subsystem 1: Deciding among Hygiene and Non-Hygiene Actions Using Skeleton and Object Coordinates
7.2.2. Subsystem 2: Deciding among Hand-to-Hand and Hand-to-Object Actions Using Optical Flow Modality
7.2.3. Subsystem 3: Deciding between Actions Apply Soap and Touch Faucet with Hand Using a Hand-Mask Modality
7.2.4. Subsystem 4: Deciding between Actions Rub Hands with Water and Rub Hands without Water Using the RGB Modality
7.2.5. Summary of Observations Regarding the Multiple Modalities
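The decision hierarchy implied by the four subsystem headings can be sketched as a small routing function. This is an illustration only: it assumes each subsystem has already produced a binary decision, and that the subsystems are consulted in the order 1, 2, then 3 or 4. The names `SubsystemOutputs` and `classify_action` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SubsystemOutputs:
    """Binary decisions from the four subsystems (hypothetical container)."""
    is_hygiene: bool       # Subsystem 1: skeleton and object coordinates
    is_hand_to_hand: bool  # Subsystem 2: optical flow
    is_apply_soap: bool    # Subsystem 3: hand mask
    has_water: bool        # Subsystem 4: RGB


def classify_action(out: SubsystemOutputs) -> str:
    """Route one video segment through the assumed hierarchy."""
    if not out.is_hygiene:
        return "Non-Hygiene"
    if out.is_hand_to_hand:
        # Hand-to-hand branch: rubbing with or without water.
        return "Rub Hands with Water" if out.has_water else "Rub Hands without Water"
    # Hand-to-object branch: soap dispenser or faucet.
    return "Apply Soap" if out.is_apply_soap else "Touch Faucet with Hand"
```

With this structure, each modality only has to separate the classes within its own branch.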
8. Cross-Scenario Results
8.1. Experiments and Results: Individual Subsystems
8.1.1. Subsystem 1: Experimental Method and Results
8.1.2. Subsystem 2: Experimental Method and Results
8.1.3. Subsystem 3: Experimental Method and Results
8.1.4. Subsystem 4: Experimental Method and Results
8.2. Results: Hierarchical System
8.2.1. Experimental Methods for Hierarchical System
8.2.2. Experimental Results for Hierarchical System
9. Discussion
9.1. Discussion on System Limitations
9.2. Suggestions for Designing Video Analytics Systems for Real-World Applications
10. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Action Name | Hand-Hygiene | Hand-to-Hand | Hand-to-Object |
---|---|---|---|
Touch Faucet with Hand | Y | N | Y |
Rub Hands with Water | Y | Y | N |
Apply Soap | Y | N | Y |
Rub Hands without Water | Y | Y | N |
Swing Hands | N | - | - |
Grab Paper Towel | N | - | - |
Dry Hands with Paper Towel | N | - | - |
Camera Occlusion | N | - | - |
ROI | Purpose |
---|---|
Sink | To eliminate people not performing hand hygiene |
Hands | Necessary for all hand-hygiene actions |
Sanitizer | Only for the hidden-patch experiment in Section 6.2 |
Faucet | To assist the hand-mask model in detecting hand-to-object (H2O) actions |
Soap head | To assist the hand-mask model in detecting hand-to-object (H2O) actions |
Water spout | To assist the hand-mask model in detecting hand-to-object (H2O) actions, and to create the water-flow ROI |
Water flow | To help distinguish rubbing with water from rubbing without water; for model training only; derived from the water spout ROI |
Scene | Train | Validation | Test |
---|---|---|---|
Room1 cam1 | 127 (177) | 16 (26) | 60 (80) |
Room2 cam1 | 55 (78) | 9 (12) | 34 (48) |
Room2 cam2 | 96 (114) | 14 (17) | 53 (65) |
Model / Scene | R1C1 | R2C1 | R2C2 | ALL |
---|---|---|---|---|
ResNet50 (H) | 91.7% | 97.1% | 98.1% | 95.2% |
ResNet50 + LSTM (H) | 91.8% | 94.6% | 93.4% | 95.2% |
ResNet50 + TRN (H) | 93.7% | 92.7% | 91.4% | 95.9% |
ResNet50 (H + N) | 92.5% | 91.7% | 95.4% | 93.3% |
Model / Scene | R1C1 | R2C1 | R2C2 |
---|---|---|---|
R1C1 CNN | 91.7% | 5.9% | 30.2% |
R2C1 CNN | 33.3% | 97.1% | 43.4% |
R2C2 CNN | 56.7% | 73.5% | 98.1% |
R1C1 LSTM | 91.8% | 11.8% | 43.4% |
R2C1 LSTM | 28.3% | 94.6% | 47.2% |
R2C2 LSTM | 35.0% | 55.9% | 93.4% |
R1C1 TRN | 93.7% | 38.2% | 54.7% |
R2C1 TRN | 58.3% | 92.7% | 45.3% |
R2C2 TRN | 50.0% | 61.8% | 91.4% |
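
The gap between same-scenario and cross-scenario performance in the table above can be summarized with a short calculation. The sketch below copies the three CNN rows and assumes that each row is a model trained on the named scene and each column is the scene it is tested on; the function name `same_vs_cross` is hypothetical.

```python
# Accuracy (%) of the CNN models, copied from the CNN rows of the table above.
# Assumed layout: outer key = training scene, inner key = test scene.
cnn_accuracy = {
    "R1C1": {"R1C1": 91.7, "R2C1": 5.9, "R2C2": 30.2},
    "R2C1": {"R1C1": 33.3, "R2C1": 97.1, "R2C2": 43.4},
    "R2C2": {"R1C1": 56.7, "R2C1": 73.5, "R2C2": 98.1},
}


def same_vs_cross(table: dict) -> tuple:
    """Return (mean same-scenario accuracy, mean cross-scenario accuracy)."""
    same = [row[train] for train, row in table.items()]
    cross = [
        acc
        for train, row in table.items()
        for test, acc in row.items()
        if test != train
    ]
    return sum(same) / len(same), sum(cross) / len(cross)


same_mean, cross_mean = same_vs_cross(cnn_accuracy)
print(f"same-scenario: {same_mean:.1f}%, cross-scenario: {cross_mean:.1f}%")
```

Under these assumptions the unweighted means are about 95.6% within a scenario and 40.5% across scenarios.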
Model / Action | Act1 | Act2 | Act3 | Act4 |
---|---|---|---|---|
R1C1 origin | −1.93 | 4.71 | 0.02 | −2.86 |
R1C1 hide | −1.24 | 2.39 | −1.24 | 0.08 |
R2C1 origin | −1.98 | 4.99 | −0.05 | −3.10 |
R2C1 hide | −1.67 | 4.44 | 0.25 | −3.24 |
(a) Left side joint detection rate.

Scene | Shoulder | Elbow | Wrist | Hand |
---|---|---|---|---|
R1C1 | 98.4% | 97.0% | 91.2% | 91.2% |
R2C1 | 98.8% | 98.6% | 97.9% | 97.9% |
R2C2 | 98.7% | 83.9% | 52.7% | 52.7% |

(b) Right side joint detection rate.

Scene | Shoulder | Elbow | Wrist | Hand |
---|---|---|---|---|
R1C1 | 99.4% | 99.3% | 99.0% | 99.0% |
R2C1 | 95.8% | 94.0% | 56.4% | 56.4% |
R2C2 | 99.3% | 99.2% | 99.1% | 99.1% |
Model / Scene | R1C1 | R2C1 | R2C2 |
---|---|---|---|
R1C1 flow | 98.3% | 94.1% | 98.1% |
R2C1 flow | 95.0% | 100.0% | 96.2% |
R2C2 flow | 96.7% | 94.1% | 100.0% |
Action / Scene | R1C1(C) | R1C2(C) | R2C2(C) |
---|---|---|---|
Faucet hand | 100.0% | 100.0% | 68.0% |
Soap | 90.9% | 28.6% | 75.0% |
Model / Scene | R1C1 | R2C1 | R2C2 |
---|---|---|---|
R1C1 W | 100.0% | 88.9% | 62.2% |
R1C1 W + D | 100.0% | 96.3% | 70.3% |
R1C1 W + D + A | 100.0% | 92.6% | 56.8% |
R1C1 W + D + A(I) | 100.0% | 88.9% | 81.1% |
R2C1 W | 73.1% | 100.0% | 62.2% |
R2C1 W + D | 84.6% | 100.0% | 78.4% |
R2C1 W + D + A | 100.0% | 100.0% | 83.8% |
R2C1 W + D + A(I) | 100.0% | 100.0% | 94.6% |
R2C2 W | 88.5% | 100.0% | 100.0% |
R2C2 W + A | 88.5% | 100.0% | 100.0% |
R2C2 W + A(I) | 100.0% | 100.0% | 100.0% |
(a) Baseline system

Metric / Model | R1C1 | R2C1 | R2C2 |
---|---|---|---|
R1C1 F-acc | 85.8% | 10.9% | 12.3% |
R1C1 W-acc | 86.3% | 10.3% | 12.3% |
R2C1 F-acc | 9.7% | 74.8% | 34.1% |
R2C1 W-acc | 9.5% | 79.2% | 33.4% |
R2C2 F-acc | 38.0% | 52.7% | 82.8% |
R2C2 W-acc | 38.9% | 55.5% | 85.4% |

(b) Multi-modal hierarchical system

Metric / Model | R1C1 | R2C1 | R2C2 |
---|---|---|---|
R1C1 F-acc | 74.9% | 67.6% | 54.6% |
R1C1 W-acc | 76.7% | 67.7% | 56.7% |
R2C1 F-acc | 58.1% | 75.1% | 68.7% |
R2C1 W-acc | 63.1% | 76.1% | 71.5% |
R2C2 F-acc | 53.2% | 74.1% | 82.3% |
R2C2 W-acc | 57.2% | 74.6% | 86.1% |

(c) Ideal multi-modal hierarchical system

Metric / Model | R1C1 | R2C1 | R2C2 |
---|---|---|---|
R1C1 F-acc | 74.9% | 67.1% | 66.0% |
R1C1 W-acc | 76.7% | 66.9% | 69.9% |
R2C1 F-acc | 58.5% | 75.1% | 69.4% |
R2C1 W-acc | 63.8% | 76.1% | 72.8% |
R2C2 F-acc | 62.8% | 75.3% | 82.3% |
R2C2 W-acc | 66.7% | 75.7% | 86.1% |
(a) Flow labeling: R1C1 model

Scene / Task | RW | RNW | Faucet | Soap |
---|---|---|---|---|
R1C1 | 0.89 s | 4.11 s | 67% | 89% |
R2C1 | 3.90 s | 6.30 s | 70% | 80% |
R2C2 | 6.87 s | 8.93 s | 80% | 87% |

(b) Flow labeling: R2C1 model

Scene / Task | RW | RNW | Faucet | Soap |
---|---|---|---|---|
R1C1 | 2.67 s | 3.44 s | 67% | 100% |
R2C1 | 2.30 s | 2.40 s | 60% | 90% |
R2C2 | 4.00 s | 3.80 s | 27% | 93% |

(c) Flow labeling: R2C2 model

Scene / Task | RW | RNW | Faucet | Soap |
---|---|---|---|---|
R1C1 | 6.33 s | 6.11 s | 89% | 100% |
R2C1 | 2.10 s | 3.60 s | 60% | 70% |
R2C2 | 1.60 s | 2.20 s | 33% | 100% |
(a) Ideal labeling: R1C1 model

Scene / Task | RW | RNW | Faucet | Soap |
---|---|---|---|---|
R1C1 | 0.89 s | 4.11 s | 67% | 89% |
R2C1 | 4.00 s | 6.50 s | 70% | 80% |
R2C2 | 4.60 s | 6.80 s | 80% | 93% |

(b) Ideal labeling: R2C1 model

Scene / Task | RW | RNW | Faucet | Soap |
---|---|---|---|---|
R1C1 | 2.00 s | 2.89 s | 67% | 100% |
R2C1 | 2.30 s | 2.40 s | 60% | 90% |
R2C2 | 3.60 s | 3.67 s | 27% | 93% |

(c) Ideal labeling: R2C2 model

Scene / Task | RW | RNW | Faucet | Soap |
---|---|---|---|---|
R1C1 | 2.00 s | 3.89 s | 89% | 89% |
R2C1 | 1.90 s | 3.20 s | 60% | 80% |
R2C2 | 1.60 s | 2.20 s | 33% | 100% |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhong, C.; Reibman, A.R.; Mina, H.A.; Deering, A.J. Designing a Computer-Vision Application: A Case Study for Hand-Hygiene Assessment in an Open-Room Environment. J. Imaging 2021, 7, 170. https://doi.org/10.3390/jimaging7090170