Pairwise CNN-Transformer Features for Human–Object Interaction Detection
Abstract
1. Introduction
- We extract features of human and object instances from both the CNN backbone and the transformer architecture.
- The instance features are fused and paired to form enhanced human–object pair features.
- The enhanced features are fed into the interaction head with global features to predict actions.
- We propose fusing the CNN and transformer features of the instances to enhance the feature representations. These enhanced representations effectively improve the precision of HOI detection while introducing little computational cost.
- We propose utilizing global features from the transformer to enrich contextual information. Through cross-attention, human–object pair features can aggregate valuable information from these global features (see the sketch after this list).
- We fairly compare the performance of pairwise CNN and pairwise transformer features in HOI detection. The results show that CNN features outperform transformer features.
- Our model achieves competitive performance on two commonly used HOI datasets, HICO-DET and V-COCO.
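To make the pipeline described above concrete, the following is a minimal PyTorch sketch of pairwise CNN-transformer fusion with a cross-attention interaction head. It is an illustration only: the module and argument names (`PairwiseFusionHead`, `cnn_feats`, `tr_feats`, `global_feats`), the hidden width, the number of attention heads, and the 117-action output (the HICO-DET verb count) are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class PairwiseFusionHead(nn.Module):
    """Schematic pairwise CNN-transformer fusion with a cross-attention
    interaction head. Names, widths, and layer counts are illustrative
    assumptions, not the authors' exact configuration."""

    def __init__(self, cnn_dim=2048, tr_dim=256, hidden_dim=256, num_actions=117):
        super().__init__()
        # Project per-instance CNN (backbone) and transformer (decoder) features
        # to a common width so they can be fused by addition.
        self.cnn_proj = nn.Linear(cnn_dim, hidden_dim)
        self.tr_proj = nn.Linear(tr_dim, hidden_dim)
        # Embed each (human, object) pair from the concatenated fused features.
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Cross-attention lets each pair query image-wide (global) features.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, cnn_feats, tr_feats, pair_idx, global_feats):
        # cnn_feats:    (N, cnn_dim)  RoI-pooled backbone features per instance
        # tr_feats:     (N, tr_dim)   transformer decoder features per instance
        # pair_idx:     (P, 2)        (human, object) instance indices per pair
        # global_feats: (1, L, hidden_dim) flattened image-wide context features
        fused = self.cnn_proj(cnn_feats) + self.tr_proj(tr_feats)           # (N, hidden_dim)
        pairs = torch.cat([fused[pair_idx[:, 0]], fused[pair_idx[:, 1]]], dim=-1)
        pairs = self.pair_mlp(pairs).unsqueeze(0)                            # (1, P, hidden_dim)
        ctx, _ = self.cross_attn(pairs, global_feats, global_feats)          # pairs attend to context
        return self.action_head(ctx.squeeze(0))                              # (P, num_actions) logits


if __name__ == "__main__":
    head = PairwiseFusionHead()
    logits = head(
        cnn_feats=torch.randn(6, 2048),
        tr_feats=torch.randn(6, 256),
        pair_idx=torch.tensor([[0, 1], [0, 2], [3, 4]]),
        global_feats=torch.randn(1, 800, 256),
    )
    print(logits.shape)  # torch.Size([3, 117])
```

The design choice illustrated here mirrors the bullets above: per-instance CNN and transformer features are fused before pairing, so each pair embedding already carries both sources of appearance information, and cross-attention then lets every human–object pair pull additional context from the image-wide features.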
2. Related Work
2.1. Object Detection
2.2. Features Fusion
2.3. Human–Object Interaction Detection
3. Method
3.1. Overview
3.2. CNN Features for Human–Object Pairs
3.3. Interaction Head
3.4. Training and Inference
3.4.1. Training
3.4.2. Inference
4. Experiments
4.1. Experiment Settings
4.1.1. Datasets
4.1.2. Evaluation Metrics
4.1.3. Implementation Details
4.2. Comparisons with State-of-the-Art Methods
4.3. Ablation and Contrast Studies
4.3.1. Ablation Study
4.3.2. Contrast Study
4.4. Computational Costing Study
4.5. Comparing CNN and Transformer Features
4.6. Qualitative Results and Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Xiao, Y.; Gao, G.; Wang, L.; Lai, H. Optical flow-aware-based multi-modal fusion network for violence detection. Entropy 2022, 24, 939.
- Lv, J.; Hui, T.; Zhi, Y.; Xu, Y. Infrared Image Caption Based on Object-Oriented Attention. Entropy 2023, 25, 826.
- Wang, L.; Yao, W.; Chen, C.; Yang, H. Driving behavior recognition algorithm combining attention mechanism and lightweight network. Entropy 2022, 24, 984.
- Antoun, M.; Asmar, D. Human object interaction detection: Design and survey. Image Vis. Comput. 2023, 130, 104617.
- Chao, Y.W.; Liu, Y.; Liu, X.; Zeng, H.; Deng, J. Learning to detect human–object interactions. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 381–389.
- Gao, C.; Zou, Y.; Huang, J.B. iCAN: Instance-Centric Attention Network for Human–Object Interaction Detection. In Proceedings of the British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018; pp. 1–13.
- Gkioxari, G.; Girshick, R.; Dollár, P.; He, K. Detecting and recognizing human–object interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8359–8367.
- Liao, Y.; Liu, S.; Wang, F.; Chen, Y.; Qian, C.; Feng, J. Ppdm: Parallel point detection and matching for real-time human–object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020, Seattle, WA, USA, 14–19 June 2020; pp. 482–490.
- Wang, T.; Yang, T.; Danelljan, M.; Khan, F.S.; Zhang, X.; Sun, J. Learning human–object interaction detection using interaction points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020, Seattle, WA, USA, 14–19 June 2020; pp. 4116–4125.
- Kim, B.; Choi, T.; Kang, J.; Kim, H.J. Uniondet: Union-level detector towards real-time human–object interaction detection. In Computer Vision–ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 498–514.
- Tamura, M.; Ohashi, H.; Yoshinaga, T. Qpic: Query-based pairwise human–object interaction detection with image-wide contextual information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Virtual, 19–25 June 2021; pp. 10410–10419.
- Kim, B.; Lee, J.; Kang, J.; Kim, E.S.; Kim, H.J. Hotr: End-to-end human–object interaction detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 74–83.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Computer Vision–ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 213–229.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Proc. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703.
- Zhang, A.; Liao, Y.; Liu, S.; Lu, M.; Wang, Y.; Gao, C.; Li, X. Mining the benefits of two-stage and one-stage hoi detection. Proc. Adv. Neural Inf. Process. Syst. 2021, 34, 17209–17220.
- Ulutan, O.; Iftekhar, A.; Manjunath, B.S. Vsgnet: Spatial attention network for detecting human object interactions using graph convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 14–19 June 2020; pp. 13617–13626.
- Sun, X.; Hu, X.; Ren, T.; Wu, G. Human object interaction detection via multi-level conditioned network. In Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Ireland, 8–11 June 2020; pp. 26–34.
- Zhang, F.Z.; Campbell, D.; Gould, S. Efficient two-stage detection of human–object interactions with a novel unary-pairwise transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 20104–20112.
- Zhang, F.Z.; Campbell, D.; Gould, S. Spatially conditioned graphs for detecting human–object interactions. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Virtual, 11–17 October 2021; pp. 13319–13327.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; Curran Associates: Red Hook, NY, USA, 2015; Volume 28, pp. 91–99.
- Chen, M.; Liao, Y.; Liu, S.; Chen, Z.; Wang, F.; Qian, C. Reformulating hoi detection as adaptive set prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Virtual, 19–25 June 2021; pp. 9004–9013.
- Qu, X.; Ding, C.; Li, X.; Zhong, X.; Tao, D. Distillation using oracle queries for transformer-based human–object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 19558–19567.
- Liao, Y.; Zhang, A.; Lu, M.; Wang, Y.; Li, X.; Liu, S. Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 20123–20132.
- Wang, G.; Guo, Y.; Wong, Y.; Kankanhalli, M. Distance Matters in Human–Object Interaction Detection. In Proceedings of the 30th ACM International Conference on Multimedia 2022, Lisbon, Portugal, 10–14 October 2022; pp. 4546–4554.
- Liu, Y.; Wang, L.; Cheng, J.; Chen, X. Multiscale feature interactive network for multifocus image fusion. IEEE Trans. Instrum. Meas. 2021, 70, 1–16.
- Kansizoglou, I.; Bampis, L.; Gasteratos, A. Deep feature space: A geometrical perspective. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6823–6838.
- Gao, C.; Xu, J.; Zou, Y.; Huang, J.B. Drg: Dual relation graph for human–object interaction detection. In Computer Vision–ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 696–712.
- Liang, Z.; Liu, J.; Guan, Y.; Rojas, J. Visual-semantic graph attention networks for human–object interaction detection. In Proceedings of the 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China, 27–31 December 2021; pp. 1441–1447.
- Li, Y.L.; Liu, X.; Wu, X.; Huang, X.; Xu, L.; Lu, C. Transferable Interactiveness Knowledge for Human–Object Interaction Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3870–3882.
- Wu, X.; Li, Y.L.; Liu, X.; Zhang, J.; Wu, Y.; Lu, C. Mining cross-person cues for body-part interactiveness learning in hoi detection. In Computer Vision–ECCV 2022; Springer: Cham, Switzerland, 2022; pp. 121–136.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 18–24 July 2021; pp. 8748–8763.
- Zhang, Y.; Pan, Y.; Yao, T.; Huang, R.; Mei, T.; Chen, C.W. Exploring structure-aware transformer over interaction proposals for human–object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 19548–19557.
- DETR’s Hands on Colab Notebook. Facebook AI. Available online: https://github.com/facebookresearch/detr (accessed on 26 May 2020).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Gupta, S.; Malik, J. Visual Semantic Role Labeling. arXiv 2015, arXiv:1505.04474.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014; Springer: Cham, Switzerland, 2014; pp. 740–755.
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations 2019, New Orleans, LA, USA, 6–9 May 2019.
- Tu, D.; Min, X.; Duan, H.; Guo, G.; Zhai, G.; Shen, W. Iwin: Human–object interaction detection via transformer with irregular windows. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 87–103.
- Xia, L.; Li, R. Multi-stream neural network fused with local information and global information for HOI detection. Appl. Intell. 2020, 50, 4495–4505.
- Zhu, L.; Lan, Q.; Velasquez, A.; Song, H.; Kamal, A.; Tian, Q.; Niu, S. SKGHOI: Spatial-Semantic Knowledge Graph for Human–Object Interaction Detection. arXiv 2023, arXiv:2303.04253.
- Zou, C.; Wang, B.; Hu, Y.; Liu, J.; Wu, Q.; Zhao, Y.; Li, B.; Zhang, C.; Zhang, C.; Wei, Y.; et al. End-to-end human object interaction detection with hoi transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Virtual, 19–25 June 2021; pp. 11825–11834.
- Li, Z.; Zou, C.; Zhao, Y.; Li, B.; Zhong, S. Improving human–object interaction detection via phrase learning and label composition. In Proceedings of the AAAI Conference on Artificial Intelligence 2022, Online, 22 February–1 March 2022; Volume 36, pp. 1509–1517.
- Kim, B.; Mun, J.; On, K.W.; Shin, M.; Lee, J.; Kim, E.S. Mstr: Multi-scale transformer for end-to-end human–object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 19578–19587.
- Peng, H.; Liu, F.; Li, Y.; Huang, B.; Shao, J.; Sang, N.; Gao, C. Parallel Reasoning Network for Human–Object Interaction Detection. arXiv 2023, arXiv:2301.03510.
Comparison with state-of-the-art methods on HICO-DET (mAP) under the Default and Known Objects settings:

Method | Backbone | Default Full | Default Rare | Default Non-Rare | Known Objects Full | Known Objects Rare | Known Objects Non-Rare
---|---|---|---|---|---|---|---
CNN-based methods | | | | | | |
HO-RCNN [5] | CaffeNet | 7.81 | 5.37 | 8.54 | 10.41 | 8.94 | 10.85
InteractNet [7] | Res50-FPN | 9.94 | 7.16 | 10.77 | - | - | -
ICAN [6] | Res50 | 14.84 | 10.45 | 16.15 | 16.26 | 11.33 | 17.73
Xia et al. [42] | Res50 | 15.85 | 13.12 | 16.65 | - | - | -
UnionDet [10] | Res50-FPN | 17.58 | 11.72 | 19.33 | 19.76 | 14.68 | 21.27
VSGNet [18] | Res152 | 19.80 | 16.05 | 20.91 | - | - | -
TIN [31] | Res50-FPN | 20.93 | 18.95 | 21.32 | 23.02 | 20.96 | 23.42
PPDM [8] | Hourglass104 | 21.94 | 13.97 | 24.32 | 24.81 | 17.09 | 27.12
DRG [29] | Res50-FPN | 24.53 | 19.47 | 26.04 | 27.98 | 23.11 | 29.43
SKGHOI [43] | Res50-FPN | 26.95 | 21.28 | 28.56 | - | - | -
SCG [21] | Res50-FPN | 29.26 | 24.61 | 30.65 | 32.87 | 27.89 | 34.35
Transformer-based methods | | | | | | |
HOTR [12] | Res50 | 25.10 | 17.34 | 27.42 | - | - | -
HOI-Trans [44] | Res101 | 26.61 | 19.15 | 28.84 | 29.13 | 20.98 | 31.57
AS-Net [23] | Res50 | 28.87 | 24.25 | 30.25 | 31.74 | 27.07 | 33.14
QPIC [11] | Res101 | 29.90 | 23.92 | 31.69 | 32.38 | 26.06 | 34.27
PhraseHOI [45] | Res101 | 30.03 | 23.48 | 31.99 | 33.74 | 27.35 | 35.64
MSTR [46] | Res50 | 31.17 | 25.31 | 32.92 | 34.02 | 28.83 | 35.57
CDN [17] | Res101 | 32.07 | 27.19 | 33.53 | 34.79 | 29.48 | 36.38
UPT [20] | Res101 | 32.31 | 28.55 | 33.44 | 35.65 | 31.60 | 36.86
Iwin [41] | Res101-FPN | 32.79 | 27.84 | 35.40 | 35.84 | 28.74 | 36.09
PR-Net [47] | Res101 | 32.86 | 28.03 | 34.30 | - | - | -
Ours (PCT) | Res50 | 33.63 | 28.73 | 35.10 | 36.96 | 31.47 | 38.60
Ours (PCT) | Res101 | 33.79 | 29.70 | 35.00 | 37.16 | 32.63 | 38.52
Comparison with state-of-the-art methods on V-COCO (role AP, Scenario 1 / Scenario 2):

Method | Backbone | Scenario 1 | Scenario 2
---|---|---|---
CNN-based methods | | |
InteractNet [7] | Res50-FPN | 40.0 | -
ICAN [6] | Res50 | 45.3 | -
Xia et al. [42] | Res50 | 47.2 | -
UnionDet [10] | Res50-FPN | 47.5 | 56.2
VSGNet [18] | Res152 | 51.8 | 57.0
TIN [31] | Res50-FPN | 49.1 | -
DRG [29] | Res50-FPN | 51.0 | -
SCG [21] | Res50-FPN | 54.2 | 60.9
Transformer-based methods | | |
HOTR [12] | Res50 | 55.2 | 64.4
HOI-Trans [44] | Res101 | 52.9 | -
AS-Net [23] | Res50 | 53.9 | -
QPIC [11] | Res101 | 58.3 | 60.7
MSTR [46] | Res50 | 62.0 | 65.2
CDN [17] | Res101 | 63.9 | 65.9
UPT [20] | Res101 | 60.7 | 66.2
Iwin [41] | Res101-FPN | 60.9 | -
PR-Net [47] | Res101 | 62.9 | 64.2
Ours (PCT) | Res50 | 59.4 | 65.0
Ours (PCT) | Res101 | 61.4 | 67.1
Instance suppression strategies:

Backbone | λ = 1.0 | λ = 1.9 | λ = 2.8 | TIN [31]
---|---|---|---|---
Res50 | 32.10 | 33.44 | 33.63 | 32.92
Res101 | 32.48 | 33.63 | 33.79 | 33.11
Δ | +0.38 | +0.19 | +0.16 | +0.19
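The λ values in the table above suggest a power-law suppression of detector confidences at inference, in the style of UPT [20]. Below is a hedged sketch of how such a scoring rule can be applied; the paper's exact formula is not reproduced here, and the function and variable names are hypothetical.

```python
import torch


def score_pairs(action_logits, human_scores, object_scores, lam=2.8):
    """Hypothetical inference-time scoring with instance suppression.

    Raising the product of the two detector confidences to the power `lam`
    down-weights pairs built from low-confidence detections; lam = 1.0 applies
    no additional suppression, matching the leftmost column of the table above.
    """
    interaction_probs = action_logits.sigmoid()            # (P, num_actions)
    suppression = (human_scores * object_scores) ** lam    # (P,)
    return interaction_probs * suppression.unsqueeze(-1)   # (P, num_actions)
```

Under this reading, a larger λ penalizes uncertain detections more strongly, which is consistent with the steady gains from λ = 1.0 to λ = 2.8 in the table.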
Ablation study of the CNN features and decoder submodules:

Model | CNN Features | Decoder | Full | Rare | Non-Rare
---|---|---|---|---|---
Baseline | - | - | 29.48 | 24.46 | 30.97
 | - | ✔ | 31.50 | 26.98 | 32.84
 | ✔ | - | 32.80 | 27.89 | 34.26
Ours | ✔ | ✔ | 33.63 | 28.73 | 35.10
Changes (Δ) for positive, easy-negative, and hard-negative pairs when each submodule is added (counts in parentheses):

Reference | Δ Submodules | Δ Pos. (25,342) | Δ Easy Neg. (4,373,390) | Δ Hard Neg. (46,172)
---|---|---|---|---
PCT w/o encoder layer | + encoder layer | +0.0064 | −0.0000 | −0.0025
PCT w/o decoder layer | + decoder layer | +0.0127 | −0.0000 | +0.0001
PCT w/o CNN features | + CNN features | +0.0167 | −0.0001 | −0.0152
Baseline | PCT | +0.0332 | −0.0001 | −0.0273
Interaction Head | Full | Rare | Non-Rare |
---|---|---|---|
Only E.N. | 32.80 | 27.89 | 34.26 |
Only D.E. | 32.84 | 28.48 | 34.14 |
E.N.+E.N. | 33.15 | 28.03 | 34.68 |
E.N.+D.E. | 33.41 | 28.64 | 34.84 |
D.E.+D.E. | 33.63 | 28.46 | 35.18 |
D.E.+E.N. (ours) | 33.63 | 28.73 | 35.10 |
Computational cost comparison:

Method | Backbone | Params (M) | FLOPs (G) | FPS
---|---|---|---|---
SCG | Res50-FPN | 54.16 | 130.90 | 11.98
DETR | Res50 | 41.52 | 91.73 | 13.90
 | Res101 | 60.46 | 163.70 | 9.61
QPIC | Res50 | 41.46 | 91.82 | 13.78
 | Res101 | 60.40 | 163.79 | 9.62
UPT | Res50 | 54.76 (13.24) | 91.91 | 12.23
 | Res101 | 73.70 (13.24) | 163.97 | 8.84
PCT (ours) | Res50 | 50.12 (8.60) | 92.47 | 13.02
 | Res101 | 69.06 (8.60) | 164.49 | 9.27
Method | Params (M) | FLOPs (G) | FPS |
---|---|---|---|
UPT | 13.24 | 91.913 | 12.23 |
Only E.N. (Baseline) | 3.21 (+0.00) | 91.773 | 13.41 |
+CNN | 4.26 (+1.05) | 91.777 | 13.18 |
+D.E. | 7.55 (+4.34) | 92.468 | 13.21 |
+all (ours) | 8.60 (+5.39) | 92.472 | 13.02 |
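For reference, figures like the Params (M) columns above can be reproduced by counting the trainable parameters of each module; a small PyTorch helper (hypothetical name, shown only as an illustration) is sketched below.

```python
from torch import nn


def count_params_millions(module: nn.Module) -> float:
    """Trainable parameters of a module, in millions (cf. the Params (M) columns)."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad) / 1e6
```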
Feature | Coop. | Full | Rare | Non-Rare |
---|---|---|---|---|
transformer | E.N. | 30.39 | 24.95 | 32.02 |
CNN | E.N. | 31.16 | 27.12 | 32.37 |
transformer | M.E. | 30.22 | 24.71 | 31.86 |
CNN | M.E. | 32.17 | 26.61 | 33.82 |
transformer | M.E.&Res101 | 30.47 | 26.38 | 31.69 |
CNN | M.E.&Res101 | 32.23 | 27.65 | 33.60 |