Ensuring Computers Understand Manual Operations in Production: Deep-Learning-Based Action Recognition in Industrial Workflows
Abstract
1. Introduction
- We propose a novel method that combines multiple neural networks to recognize manual operations in an industrial environment, a setting that has rarely been addressed in the field.
- The proposed method leverages a spatial transformer network (STN) to reduce the recognition errors caused by the diversity of human poses in real working environments.
- A graph convolutional neural network is constructed to extract spatial and temporal information from the skeleton data simultaneously; combined with a classifier, it can accurately recognize human actions.
- An attention mechanism is introduced: considering the characteristics of the real-world production environment, different weights are applied to the human body keypoints (more than a dozen in total) to improve recognition accuracy (see the sketch after this list).
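The contributions above name three components: STN-based pose normalization, a spatio-temporal GCN over the skeleton, and keypoint-level attention. As an illustration only, the following minimal PyTorch-style sketch shows how learnable per-keypoint attention weights could be combined with a single spatial graph convolution; the joint count (18), channel sizes, and adjacency matrix are assumptions, not the authors' implementation.

```python
# Minimal sketch (assumption, not the authors' code): per-keypoint attention
# applied to skeleton features, followed by one spatial graph convolution.
import torch
import torch.nn as nn

class KeypointAttentionGCN(nn.Module):
    def __init__(self, num_joints=18, in_channels=3, out_channels=64):
        super().__init__()
        # One learnable attention weight per keypoint (joint), broadcast over
        # channels and frames; initialized to uniform importance.
        self.joint_attention = nn.Parameter(torch.ones(num_joints))
        # A 1x1 convolution implements the per-joint feature transform.
        self.transform = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x, adjacency):
        # x: (batch, channels, frames, joints); adjacency: (joints, joints)
        attn = torch.softmax(self.joint_attention, dim=0)  # normalized keypoint weights
        x = x * attn.view(1, 1, 1, -1)                     # re-weight keypoints
        x = self.transform(x)                              # per-joint feature transform
        # Aggregate neighboring joints according to the skeleton graph.
        x = torch.einsum('bcfj,jk->bcfk', x, adjacency)
        return x

# Illustrative usage with random data: 2 clips, 3-channel (x, y, score)
# keypoint coordinates, 30 frames, 18 joints; identity adjacency as a placeholder.
if __name__ == "__main__":
    layer = KeypointAttentionGCN()
    clips = torch.randn(2, 3, 30, 18)
    adjacency = torch.eye(18)
    print(layer(clips, adjacency).shape)  # torch.Size([2, 64, 30, 18])
```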
2. Literature Review
2.1. Human Action Recognition
2.2. Graph Convolutional Neural Network
3. The Proposed Method
3.1. Overall Scheme
Algorithm 1 The process of action recognition
Input: V, the video to be recognized; pre-trained parameters of the STN
Output: result matrix [r_1, r_2, ⋯, r_n], where r_i is the probability of operation i
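Algorithm 1 fixes only the input/output contract: a video goes in, and a probability per operation comes out. The skeleton below is a hedged, non-authoritative sketch of that contract in Python; every stage function is a hypothetical placeholder for the corresponding network in the paper (human detector, single-person pose estimator, STN, attention-weighted GCN and classifier), and the operation names are taken from the dataset table in Section 4.

```python
# Hedged sketch of the Algorithm 1 pipeline (video -> probability per operation).
import numpy as np

OPERATIONS = ["blasting sand", "spraying gelcoat", "laying materials",
              "pumping gas", "plastering adhesive", "using remote controller"]

def detect_person(frame):
    # Hypothetical placeholder for the human detection stage (Section 3.2.1).
    return frame

def estimate_pose(person):
    # Hypothetical placeholder for single-person pose estimation (Section 3.2.2);
    # 18 keypoints with (x, y, confidence) each is an assumed format.
    return np.random.rand(18, 3)

def normalize_pose(skeleton):
    # Hypothetical placeholder for the STN step that reduces pose diversity (Section 3.3).
    return skeleton

def classify_skeletons(skeletons):
    # Hypothetical placeholder for the attention-weighted spatio-temporal GCN and
    # classifier (Sections 3.4-3.5); random scores turned into probabilities here.
    scores = np.random.rand(len(OPERATIONS))
    exp = np.exp(scores)
    return exp / exp.sum()

def recognize(video_frames):
    """Return {operation: r_i}, where r_i is the probability of operation i."""
    skeletons = np.stack([normalize_pose(estimate_pose(detect_person(f)))
                          for f in video_frames])  # (frames, 18, 3)
    return dict(zip(OPERATIONS, classify_skeletons(skeletons)))

if __name__ == "__main__":
    dummy_video = [np.zeros((480, 640, 3)) for _ in range(30)]
    print(recognize(dummy_video))
```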
3.2. Pose Estimation
3.2.1. Human Detection for Pose Estimation
3.2.2. Pose Estimation for a Single Human
3.3. Spatial Transformer Networks
3.4. Graph Convolutional Neural Network
3.5. Attention Mechanism
4. Experimental Study
4.1. Datasets
4.2. Implementation Details
4.3. Experimental Results
4.4. Discussion
5. Conclusions and Future Work
Author Contributions
Funding
Conflicts of Interest
References
- Yu, M.; Zhang, W.; Zeng, Q.; Wang, C.; Li, J. Human-Object Contour for Action Recognition with Attentional Multi-modal Fusion Network. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Okinawa, Japan, 11–13 February 2019; pp. 241–246.
- Lughofer, E.; Zavoianu, A.C.; Pollak, R.; Pratama, M.; Meyer-Heye, P.; Zörrer, H.; Eitzinger, C.; Radauer, T. Autonomous supervision and optimization of product quality in a multi-stage manufacturing process based on self-adaptive prediction models. J. Process Control 2019, 76, 27–45.
- Makantasis, K.; Doulamis, A.; Doulamis, N.; Psychas, K. Deep learning based human behavior recognition in industrial workflows. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 1609–1613.
- Luo, X.; Li, H.; Yang, X.; Yu, Y.; Cao, D. Capturing and Understanding Workers’ Activities in Far-Field Surveillance Videos with Deep Action Recognition and Bayesian Nonparametric Learning. Comput.-Aided Civ. Infrastruct. Eng. 2019, 34, 333–351.
- Lu, M.; Li, Z.N.; Wang, Y.; Pan, G. Deep Attention Network for Egocentric Action Recognition. IEEE Trans. Image Process. 2019, 28, 3703–3713.
- Weinland, D.; Ronfard, R.; Boyer, E. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 2006, 104, 249–257.
- Wang, H.; Kläser, A.; Schmid, C.; Liu, C.L. Action recognition by dense trajectories. In Proceedings of the CVPR 2011—IEEE Conference on Computer Vision & Pattern Recognition, Providence, RI, USA, 20–25 June 2011; pp. 3169–3176.
- Wang, H.; Kläser, A.; Schmid, C.; Liu, C.L. Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. 2013, 103, 60–79.
- Fathi, A.; Mori, G. Action recognition by learning mid-level motion features. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 24–26 June 2008; pp. 1–8.
- Sun, J.; Wu, X.; Yan, S.; Cheong, L.F.; Chua, T.S.; Li, J. Hierarchical spatio-temporal context modeling for action recognition. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 2004–2011.
- Gall, J.; Yao, A.; Razavi, N.; Van Gool, L.; Lempitsky, V. Hough forests for object detection, tracking, and action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2188–2202.
- Kovashka, A.; Grauman, K. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2046–2053.
- Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 3551–3558.
- Lv, F.; Nevatia, R. Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 359–372.
- Whitehouse, S.; Yordanova, K.; Ludtke, S.; Paiement, A.; Mirmehdi, M. Evaluation of cupboard door sensors for improving activity recognition in the kitchen. In Proceedings of the 2018 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), Athens, Greece, 19–23 March 2018; pp. 167–172.
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 568–576.
- Liu, J.; Shahroudy, A.; Wang, G.; Duan, L.Y.; Chichung, A.K. Skeleton-Based Online Action Prediction Using Scale Selection Network. IEEE Trans. Pattern Anal. Mach. Intell. 2019.
- Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition. In Proceedings of the CVPR 2019, Long Beach, CA, USA, 16–20 June 2019.
- Qin, Y.; Mo, L.; Li, C.; Luo, J. Skeleton-based action recognition by part-aware graph convolutional networks. Vis. Comput. 2019, 1–11.
- Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
- Zhang, S.; Liu, X.; Xiao, J. On geometric features for skeleton-based action recognition using multilayer LSTM networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 148–157.
- Tu, Z.; Xie, W.; Qin, Q.; Poppe, R.; Veltkamp, R.C.; Li, B.; Yuan, J. Multi-stream CNN: Learning representations based on human-related regions for action recognition. Pattern Recognit. 2018, 79, 32–43.
- Ullah, A.; Muhammad, K.; Haq, I.U.; Baik, S.W. Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments. Future Gener. Comput. Syst. 2019, 96, 386–397.
- Huang, Y.; Lai, S.H.; Tai, S.H. Human Action Recognition Based on Temporal Pose CNN and Multi-dimensional Fusion. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Qi, M.; Wang, Y.; Qin, J.; Li, A.; Luo, J.; Van Gool, L. stagNet: An Attentive Semantic RNN for Group Activity and Individual Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 2019.
- Kuehne, H.; Richard, A.; Gall, J. A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation. arXiv 2019, arXiv:1906.01028.
- Majd, M.; Safabakhsh, R. Correlational Convolutional LSTM for human action recognition. Neurocomputing 2019.
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks. arXiv 2019, arXiv:1912.06971.
- Moya Rueda, F.; Grzeszick, R.; Fink, G.; Feldhorst, S.; ten Hompel, M. Convolutional neural networks for human activity recognition using body-worn sensors. Informatics 2018, 5, 26.
- Yu, M.; Liu, L.; Shao, L. Structure-preserving binary representations for RGB-D action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 1651–1664.
- Yordanova, K. From textual instructions to sensor-based recognition of user behaviour. In Proceedings of the 21st International Conference on Intelligent User Interfaces, Sonoma, CA, USA, 7–10 March 2016; ACM: New York, NY, USA, 2016; pp. 67–73.
- Liu, R.; Xu, C.; Zhang, T.; Zhao, W.; Cui, Z.; Yang, J. Si-GCN: Structure-induced Graph Convolution Network for Skeleton-based Action Recognition. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8.
- Tang, Y.; Tian, Y.; Lu, J.; Li, P.; Zhou, J. Deep progressive reinforcement learning for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5323–5332.
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
- Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 483–499.
- Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2017–2025.
- Kim, S.; Yun, K.; Park, J.; Choi, J.Y. Skeleton-based Action Recognition of People Handling Objects. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 61–70.
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
- Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics human action video dataset. arXiv 2017, arXiv:1705.06950.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Wang, L.; Qiao, Y.; Tang, X. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4305–4314.
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
- Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5533–5541.
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv 2018, arXiv:1812.08008.
- Fang, H.S.; Xie, S.; Tai, Y.W.; Lu, C. RMPE: Regional Multi-person Pose Estimation. In Proceedings of the ICCV 2017, Venice, Italy, 22–29 October 2017.
Operation | Train | Val | Test | Total |
---|---|---|---|---|
Blasting sand | 552 | 50 | 100 | 702 |
Spraying gelcoat | 599 | 50 | 100 | 749 |
Laying materials | 662 | 50 | 100 | 812 |
Pumping gas | 605 | 50 | 100 | 755 |
Plastering adhesive | 803 | 50 | 100 | 953 |
Using remote controller | 728 | 50 | 100 | 878 |
Threshold | Precision (%) | Recall (%) | F-Measure (%) | G-Mean (%) |
---|---|---|---|---|
0.15 | 73.80 | 99.54 | 84.76 | 85.71 |
0.2 | 76.68 | 99.47 | 86.60 | 87.34 |
0.25 | 78.16 | 99.01 | 87.36 | 87.97 |
0.3 | 88.62 | 98.78 | 93.42 | 93.56 |
0.35 | 93.64 | 98.40 | 95.96 | 95.99 |
0.4 | 95.34 | 97.87 | 96.59 | 96.60 |
0.45 | 95.92 | 91.19 | 93.49 | 93.52 |
0.5 | 96.56 | 76.75 | 85.52 | 86.09 |
0.55 | 97.86 | 72.95 | 83.59 | 84.49 |
0.6 | 98.08 | 69.91 | 81.63 | 82.81 |
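For reference, the F-measure and G-mean columns follow the standard definitions F = 2·P·R/(P + R) and G = √(P·R), with precision P and recall R; the short check below reproduces the 0.4-threshold row of the table above.

```python
# Quick check of the F-measure and G-mean definitions against the row
# for threshold 0.4 in the table above (values in percent).
import math

precision, recall = 95.34, 97.87
f_measure = 2 * precision * recall / (precision + recall)
g_mean = math.sqrt(precision * recall)
print(round(f_measure, 2), round(g_mean, 2))  # 96.59 96.6
```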
Metric | Blasting Sand | Spraying Gelcoat | Laying Materials | Pumping Gas | Plastering Adhesive | Using Remote Controller | Average |
---|---|---|---|---|---|---|---|
Accuracy | 96.5% | 92.5% | 91.0% | 93.5% | 93.0% | 90.5% | 93.6% |
Methods | Blasting Sand (%) | Spraying Gelcoat (%) | Laying Materials (%) | Pumping Gas (%) | Plastering Adhesive (%) | Using Remote Controller (%) | Total (%) |
---|---|---|---|---|---|---|---|
iDT+SVM [42] | 75.26 | 83.51 | 72.16 | 80.41 | 82.47 | 79.38 | 76.50 |
Two-stream [16] | 82.47 | 85.57 | 83.51 | 77.32 | 78.35 | 91.75 | 80.83 |
C3D [43] | 85.57 | 90.72 | 90.72 | 90.72 | 87.63 | 61.86 | 82.17 |
ST-GCN [28] | 85.57 | 92.78 | 92.78 | 91.75 | 88.66 | 87.63 | 88.33 |
R(2+1)D [44] | 88.66 | 93.81 | 93.81 | 91.75 | 90.67 | 92.78 | 90.67 |
Ours | 92.78 | 95.88 | 94.85 | 93.81 | 89.69 | 96.91 | 91.17 |
W/o attention | 85.57 | 86.60 | 87.63 | 88.66 | 89.69 | 90.72 | 88.50 |
W/o STN | 82.22 | 88.92 | 84.79 | 85.05 | 85.57 | 80.41 | 81.17 |