Data
  • Article
  • Open Access

25 January 2024

MHAiR: A Dataset of Audio-Image Representations for Multimodal Human Actions

1 School of Engineering, Edith Cowan University, 270 Joondalup Drive, Joondalup, Perth, WA 6027, Australia
2 School of Science, Edith Cowan University, 270 Joondalup Drive, Joondalup, Perth, WA 6027, Australia
3 School of Computing and Information Systems, The University of Melbourne, Melbourne Connect, 700 Swanston Street, Carlton, VIC 3053, Australia
* Author to whom correspondence should be addressed.

Abstract

The audio-image representations for multimodal human actions (MHAiR) dataset contains six different image representations of audio signals that capture the temporal dynamics of actions in a compact and informative way. The audio was extracted from the videos of an existing dataset, UCF101. Each sample is approximately 10 s long, and the dataset is split into 4893 training samples and 1944 testing samples. The resulting feature sequences were converted into images, which can serve as a benchmark for evaluating machine learning models on human action recognition and related tasks. These audio-image representations could suit a wide range of applications, such as surveillance, healthcare monitoring, and robotics. The dataset can also be used for transfer learning, where pre-trained models are fine-tuned on a specific task using the audio images. The dataset can thus facilitate the development of new techniques for improving the accuracy of human action-related tasks and serve as a standard benchmark for testing different machine learning models and algorithms.

1. Introduction

The recent progress in deep learning architectures, coupled with enhancements in Graphics Processing Unit (GPU) hardware and software stacks, has significantly empowered the handling of computationally demanding tasks, including Multimodal Human Action Recognition (MHAR). Analyzing human activities in a multimodal information context is a challenging endeavor that necessitates substantial computational resources [1]. This has emerged as a prominent research issue in the field of computer vision. Human Action Recognition (HAR) involves the process of categorizing human actions depicted in a sequence of images, essentially entailing the classification of objectives pursued by individuals across a series of image frames.
The video modality inherently holds spatial information, which lends itself well to Convolutional Neural Network (CNN)-based classification architectures. To capture the multimodal facets of action data more effectively, a contemporary approach integrates data from various modalities, including optical flow, RGB difference, and warped optical flow. Audio is a lightweight signal in comparison to video data. However, image-based representations are better suited to vision models in machine learning, specifically convolutional neural network-based vision models. Further, features from spectral centroid-based representations are visually favorable when compared to convolution-based methods. Spectral centroids provide a compact and informative representation of the audio signal that captures the discriminative features and temporal dynamics of human actions.
The dataset described in this manuscript was therefore generated while screening diverse image-based representations of action sequences for multimodal fusion with video data, and it can be used to analyze critical features of the action sequences in image form. The dataset extends our previous publication [2], which outperforms state-of-the-art methods with an accuracy of 91.2%, by focusing on multimodal representations of action sequences that present critical audio features from different perspectives, as captured from each action sample. These data were also a prerequisite for developing an intelligent multimodal action recognition system that classifies actions using deep learning algorithms based on the acoustic and video modalities. To the best of our knowledge, MHAiR is the first audio-image representation dataset for multimodal human action recognition that uses image-based representations of audio to leverage CNN and transformer-based architectures for improving action recognition.
The key contributions of our work can be summarized as follows:
  • We introduce MHAiR (Multimodal Audio-image Representations), a new lightweight multimodal dataset.
  • We build a new feature representation strategy to select the most informative candidate representations for audio-visual fusion.
  • We achieve state-of-the-art or competitive results on standard public benchmarks, validating the generalizability of our proposed approach through extensive evaluation.

Value of Data

This dataset can be valuable in several ways, both in comparison to the original dataset and in serving other novel use cases. Its distinguishing characteristics are the following:
  • It provides a significant reduction in dimensionality. The spectral centroid images represent the frequency content of the audio signal over time, which is a lower-dimensional representation of the original video dataset. This can make it easier and faster to process the data and extract meaningful features.
  • It is robust against visual changes. The spectral centroid images are based on the audio signal, which is less affected by visual changes such as changes in lighting conditions or camera angles. This makes the dataset more robust to visual changes and can improve the accuracy of human action analysis.
  • It offers standardization as spectral centroid images can be standardized to a fixed size and format, which can make it easier to compare and combine data from diverse sources. This can be useful for tasks such as cross-dataset validation and transfer learning. Hence, this dataset can serve as a standard benchmark for evaluating performance of different machine learning algorithms for human action analysis based on audio signals.
  • It is suitable for privacy-oriented applications such as surveillance or healthcare monitoring, which may require analysis of human actions without capturing original visual information. Spectral centroid images provide a privacy-preserving alternative that can still enable effective analysis in applications where audio can be fused and aligned with non-visual sensory datasets such as HH105 and HH125 (see Note 1).
  • Its versatility can facilitate the exploration of different approaches, the development of newer techniques for various applications, and the extension of existing ones.
  • Audio images, derived from sound data, when fused with visual data can enhance interpretation, improve noise reduction, augment AR/VR experiences, refine content-based multimedia retrieval, and assist in healthcare applications like telemedicine. However, effective fusion requires advanced algorithms and careful attention to challenges such as data alignment, synchronization, and fusion model selection.
The structure of this paper is organized as follows. Section 2 discusses related works. Section 3 describes the key characteristics of the dataset. Section 4 elaborates on the process of extraction of distinct modalities and rationale behind feature extraction in the context of multimodal human action recognition. Section 5 provides an analysis and comparison of a downstream task to establish a benchmark for our proposed dataset, and Section 6 presents the conclusion of this paper.

3. Data Description

Data in this study were arranged in two directories: one for training and another for evaluating the model. Audio samples were extracted from videos lasting an average of 10 s (6837 samples overall with 4893 for training and 1944 for testing) [18].
Figure 1 shows a sample action in different image representations. Images in both the training and testing folders are named in the format {category}_{action group}_{sample number}.{file extension}; e.g., in “ApplyEyeMakeup_g08_c01.png”, “ApplyEyeMakeup” is the class, “g08” is the action group of the sample, and “c01” is the sample number within this particular action class. The statistics describing all image representation samples employed in this experimental setting, including the action classes and the number of samples, are reported in Table 1.
Figure 1. Six different audio-image representations of the same action. Each image represents different characteristics of the same audio signal (adopted from [19]). (a) Waveplot. (b) Spectral Centroids. (c) Spectral Rolloff. (d) MFCCs. (e) MFCCs Feature Scaling. (f) Chromagram.
Table 1. Statistics describing the image representations employed in the experimental setting: for all considered categories, we report the total number of training and testing samples.
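The file naming convention described in this section can be parsed programmatically, for instance to rebuild per-class sample counts like those reported in Table 1. The following is a minimal sketch using only the Python standard library. The regular expression assumes that the action group always looks like "g" followed by digits and the sample number like "c" followed by digits, as in the single example given in the text; the names `SampleName`, `parse_sample_name`, and `count_per_category` are hypothetical helpers, not part of the dataset.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple


class SampleName(NamedTuple):
    category: str
    group: str
    sample: str
    extension: str


# {category}_{action group}_{sample number}.{file extension}
# Assumption: groups look like "g08" and sample numbers like "c01".
NAME_PATTERN = re.compile(
    r"^(?P<category>.+)_(?P<group>g\d+)_(?P<sample>c\d+)\.(?P<extension>\w+)$"
)


def parse_sample_name(filename: str) -> SampleName:
    """Split a dataset file name into its four components."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return SampleName(**match.groupdict())


def count_per_category(filenames: Iterable[str]) -> Counter:
    """Count how many samples each action class contributes."""
    return Counter(parse_sample_name(name).category for name in filenames)


example = parse_sample_name("ApplyEyeMakeup_g08_c01.png")
```

In practice the file names would come from listing the training or testing folder; the counts can then be compared against Table 1 as a sanity check.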

4. Methodology

A high-level schematic of a prospective downstream multimodal task is illustrated in Figure 2. Audio samples for this dataset of human actions were extracted from the UCF101 videos with the “ffmpeg” tool at a sampling rate of 22,050 Hz, and each resulting audio file was saved separately. For each image representation, post-processing and metadata handling were applied; in particular, following best practices, a hop length of 512 was used for the chromagram-based representation. The extracted audio files were organized and stored according to the UCF101 splits, and a quality control check was performed to ensure that the audio met the desired standards. This process isolated the audio component from the video data, making it available for various applications, including multimodal action recognition and standalone audio analysis.
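To make the extraction step concrete, the sketch below builds an ffmpeg command line that drops the video stream and resamples the audio to 22,050 Hz. It is an illustration under assumptions rather than the authors' exact invocation: the WAV output format, the mono downmix (`-ac 1`), the example paths, and the helper names `build_ffmpeg_command` and `audio_path_for` are all hypothetical. The command is only constructed here, not executed.

```python
from pathlib import Path

SAMPLE_RATE_HZ = 22050  # sampling rate stated in the text


def audio_path_for(video_path, output_dir) -> Path:
    """Map a video file to a .wav file with the same stem in output_dir."""
    return Path(output_dir) / (Path(video_path).stem + ".wav")


def build_ffmpeg_command(video_path, audio_path, sample_rate=SAMPLE_RATE_HZ):
    """Return an ffmpeg argument list that extracts the audio track."""
    return [
        "ffmpeg",
        "-y",                      # overwrite an existing output file
        "-i", str(video_path),     # input video
        "-vn",                     # discard the video stream
        "-ac", "1",                # assumption: downmix to a single channel
        "-ar", str(sample_rate),   # resample to the target sampling rate
        str(audio_path),
    ]


# Hypothetical paths, used for illustration only.
video = Path("UCF101/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi")
command = build_ffmpeg_command(video, audio_path_for(video, "audio/train"))
```

Passing `command` to `subprocess.run(command, check=True)` would perform the extraction on a machine where ffmpeg is installed.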
Figure 2. High-level schematic representation of our approach.
These features were then projected onto images that can be processed by Convolutional Neural Networks (CNNs) such as IRV4 [20] or Transformers such as AST [21]. Samples without an audio channel were removed from consideration. In total, 51 categories were analyzed to represent the audio-image features extracted from the audio signals.
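As an illustration of one such feature, the spectral centroid of a frame is the magnitude-weighted mean of its frequency bins; computing it frame by frame yields the feature sequence that is subsequently plotted as an image. The sketch below uses NumPy only and is not the authors' implementation. The frame length of 2048 samples and the Hann window are assumptions, the hop length of 512 is the value the text states for the chromagram, and the function name `spectral_centroid` is hypothetical.

```python
import numpy as np


def spectral_centroid(signal, sample_rate=22050, frame_length=2048, hop_length=512):
    """Return the spectral centroid in Hz for each frame of a mono signal."""
    if len(signal) < frame_length:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_length)
    freqs = np.fft.rfftfreq(frame_length, d=1.0 / sample_rate)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    centroids = np.empty(n_frames)
    for i in range(n_frames):
        start = i * hop_length
        frame = signal[start:start + frame_length] * window
        magnitude = np.abs(np.fft.rfft(frame))
        total = magnitude.sum()
        # Magnitude-weighted mean frequency; silent frames map to 0 Hz.
        centroids[i] = (freqs * magnitude).sum() / total if total > 0 else 0.0
    return centroids


# Sanity check on a synthetic signal: a pure 1 kHz tone, one second long.
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
centroids = spectral_centroid(tone, sample_rate=sr)
```

For a pure tone the centroid of every frame should sit close to the tone frequency, which makes this a convenient check before applying the function to real recordings.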
Since the dataset supports experiments on human action recognition in daily life scenarios, all daily life actions were retained in order to inform the models (e.g., through fine-tuning) about the specific characteristics of the audio at hand. The data were thus preserved in raw format: no image normalization or other pre-processing was applied to the collected data. No Data Augmentation (DA) approaches (such as horizontal flipping) were adopted, in order to prevent injecting any kind of noise into the samples and to ensure the inclusion of extensively trimmed action sequences. DA is customarily performed through Rotation [22], Flipping [23], Cropping [24], Scaling [25], Translation [26], Noise Injection [27], Color Modification [28], and other modes. Carefully selecting appropriate data augmentation techniques ensures that modified images remain representative of the original dataset and do not introduce unwanted biases. Starting from our data, these types of processing can easily be completed with off-the-shelf software libraries according to specific application needs.
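Because the images are distributed without normalization or augmentation, users who need either can apply it downstream. The following sketch shows two such operations on an image held as a NumPy array: min-max scaling to the range [0, 1] and a horizontal flip. It is an illustration only; the random array stands in for a loaded image, and the function names are hypothetical. If time runs along the horizontal axis of an audio image, a horizontal flip reverses the time axis, so whether it is an appropriate augmentation depends on the task.

```python
import numpy as np


def min_max_normalize(image):
    """Scale pixel values linearly to the range [0, 1]."""
    image = image.astype(np.float32)
    lo, hi = image.min(), image.max()
    if hi == lo:
        return np.zeros_like(image)  # constant image: nothing to scale
    return (image - lo) / (hi - lo)


def horizontal_flip(image):
    """Mirror an image of shape (height, width, channels) left to right."""
    return image[:, ::-1]


# Stand-in for a loaded RGB image (height 4, width 6).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

normalized = min_max_normalize(image)
flipped = horizontal_flip(image)
```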

5. Results

The Multimodal Audio-image and Video Action Recognition (MAiVAR) framework [2] uses these data for multimodal action recognition, where they demonstrate superior performance compared to other audio representations. That study establishes a benchmark approach for using this dataset. Table 2 reports the accuracy of multimodal deep learning models using different audio representations, namely Waveplot, Spectral Centroids, Spectral Rolloff, and MFCCs, in two scenarios: audio only, and fusion of audio and video. The Waveplot representation performs poorly in the audio-only scenario (12.08%) but excels when combined with video, reaching 86.21% in the fusion scenario. Similarly, the Spectral Centroids representation performs poorly in the audio-only scenario (13.22%) but improves when combined with video, achieving 86.26% in the fusion scenario. The Spectral Rolloff representation performs slightly better than the previous two in the audio-only scenario (16.46%). Lastly, the MFCC representation also performs poorly in the audio-only scenario (12.96%), and its accuracy in the fusion scenario (83.95%) is lower than that of the other representations. In summary, all representations perform significantly better in the fusion scenario, indicating that the combined use of audio and video data enhances the effectiveness of these models. The MFCC representation, however, appears less effective than the others when combined with video data, which suggests that preprocessing steps for audio representations might play a crucial role in improving the model’s performance.
Table 2. Comparing audio-image representations before and after fusion based on accuracy in percentage (adopted from [2]). Note: video-only accuracy is 75.67%.
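The benefit of fusion can be expressed as the gain in percentage points over the video-only accuracy given in the Table 2 caption. The short sketch below performs that arithmetic using only the figures quoted in the text; Spectral Rolloff is omitted because its fusion accuracy is not quoted there. The names `accuracy` and `fusion_gain` are hypothetical.

```python
VIDEO_ONLY = 75.67  # video-only accuracy (%) from the Table 2 caption

# Accuracies (%) quoted in the text for audio-only and fusion scenarios.
accuracy = {
    "Waveplot": {"audio": 12.08, "fusion": 86.21},
    "Spectral Centroids": {"audio": 13.22, "fusion": 86.26},
    "MFCCs": {"audio": 12.96, "fusion": 83.95},
}


def fusion_gain(fusion_accuracy, baseline=VIDEO_ONLY):
    """Gain of the fused model over the video-only baseline, in points."""
    return round(fusion_accuracy - baseline, 2)


gains = {name: fusion_gain(scores["fusion"]) for name, scores in accuracy.items()}
```

With these inputs the gains are 10.54, 10.59, and 8.28 percentage points for Waveplot, Spectral Centroids, and MFCCs, respectively, so each quoted audio representation adds to the video-only baseline even though its audio-only accuracy is low.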
Finally, our previous work in [2] reports state-of-the-art results for action recognition on audio-visual datasets, highlighting the impact of this work in the research community. We use this dataset [29] to conduct an experiment for human action recognition. Extensive experiments are conducted in the following publications listed in Table 3 against several features.
Table 3. Prior publications produced using the proposed dataset.
We conducted comprehensive experiments on the proposed datasets using the features discussed in our prior publications; Table 4 compares the resulting classification accuracy with that of state-of-the-art methods.
Table 4. Classification accuracy of MAiVAR using Chromagram representation and comparison to the state-of-the-art methods on the UCF51 dataset after fusion of audio and video features.

6. Conclusions

In conclusion, this paper presents an innovative dataset comprising spectral centroid images representing human actions, derived from audio signals of the UCF101 video dataset. These spectral centroid images provide a compact and information-rich representation of the temporal dynamics of human actions, making them robust to noise and distortion and highly suitable for diverse applications such as surveillance, healthcare monitoring, and robotics.
Moreover, the unique characteristics of the dataset allow for it to serve as a robust benchmark for assessing the efficacy of various machine learning models in human action recognition tasks. It also provides opportunities for cross-dataset validation and transfer learning, opening avenues for fine-tuning pre-existing models on new tasks. Therefore, this dataset not only enhances the accuracy of human action-related tasks, but also provides a novel methodology that can contribute to the field of human action recognition.
Future investigations might center on exploring various large-scale multimodal datasets in conjunction with more efficient feature representations to extend and improve multimodal action recognition applications.

Author Contributions

Conceptualization, M.B.S. and D.C.; Methodology, M.B.S.; Software, M.B.S.; Validation, N.A.; Formal Analysis, M.B.S.; Investigation, M.B.S.; Data Curation, M.B.S.; Writing—Original Draft Preparation, M.B.S.; Writing—Review and Editing, S.M.S.I.; Visualization, M.B.S.; Supervision, D.C.; Project Administration, D.C.; Funding Acquisition, M.B.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Edith Cowan University (ECU), Australia and Higher Education Commission (HEC), Pakistan under GRANT No. PM/HRDI-UESTPs/UETs-I/Phase-1/Batch-VI/2018.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in Mendeley Data at [29,31,32,33,34,35].

Acknowledgments

Naveed Akhtar is a recipient of the Office of National Intelligence National Intelligence Postdoctoral Grant NIPG-2021-001 funded by the Australian Government.

Conflicts of Interest

The authors declare no conflict of interest.

Note

1 https://casas.wsu.edu/datasets/ (accessed on 19 January 2024)

References

  1. Shaikh, M.B.; Chai, D. RGB-D data-based action recognition: A review. Sensors 2021, 21, 4246.
  2. Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. MAiVAR: Multimodal Audio-Image and Video Action Recognizer. In Proceedings of the International Conference on Visual Communications and Image Processing (VCIP), Suzhou, China, 13–16 December 2022; pp. 1–5.
  3. Sudhakaran, S.; Escalera, S.; Lanz, O. Gate-shift networks for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1102–1111.
  4. Yang, G.; Yang, Y.; Lu, Z.; Yang, J.; Liu, D.; Zhou, C.; Fan, Z. STA-TSN: Spatial-Temporal Attention Temporal Segment Network for action recognition in video. PLoS ONE 2022, 17, e0265115.
  5. Zhang, K.; Li, D.; Huang, J.; Chen, Y. Automated video behavior recognition of pigs using two-stream convolutional networks. Sensors 2020, 20, 1085.
  6. Lei, J.; Li, L.; Zhou, L.; Gan, Z.; Berg, T.L.; Bansal, M.; Liu, J. Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7331–7341.
  7. Girdhar, R.; Ramanan, D.; Gupta, A.; Sivic, J.; Russell, B. ActionVLAD: Learning spatio-temporal aggregation for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 971–980.
  8. Li, Y.; Li, W.; Mahadevan, V.; Vasconcelos, N. VLAD3: Encoding dynamics of deep features for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1951–1960.
  9. Zhou, B.; Andonian, A.; Oliva, A.; Torralba, A. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 803–818.
  10. Kwon, H.; Kim, M.; Kwak, S.; Cho, M. Learning Self-Similarity in Space and Time As Generalized Motion for Video Action Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 13065–13075.
  11. Mei, X.; Lee, H.C.; Diao, K.Y.; Huang, M.; Lin, B.; Liu, C.; Xie, Z.; Ma, Y.; Robson, P.M.; Chung, M. Artificial intelligence–enabled rapid diagnosis of patients with COVID-19. Nat. Med. 2020, 26, 1224–1228.
  12. Gu, J.; Cai, H.; Dong, C.; Ren, J.S.; Timofte, R.; Gong, Y.; Lao, S.; Shi, S.; Wang, J.; Yang, S. NTIRE 2021 challenge on perceptual image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 677–690.
  13. Yan, C.; Teng, T.; Liu, Y.; Zhang, Y.; Wang, H.; Ji, X. Precise no-reference image quality evaluation based on distortion identification. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 1–21.
  14. Liu, J.; Wang, X.; Wang, C.; Gao, Y.; Liu, M. Temporal decoupling graph convolutional network for skeleton-based gesture recognition. IEEE Trans. Multimed. 2023, 26, 811–823.
  15. Giannakopoulos, T.; Pikrakis, A. (Eds.) Introduction. In Introduction to Audio Analysis; Academic Press: Oxford, UK, 2014.
  16. Imtiaz, H.; Mahbub, U.; Schaefer, G.; Zhu, S.Y.; Ahad, M.A.R. Human Action Recognition based on Spectral Domain Features. Procedia Comput. Sci. 2015, 60, 430–437.
  17. Peeters, G. A large set of audio features for sound description (similarity and classification) in the CUIDADO project. CUIDADO Ist Proj. Rep. 2004, 54, 1–25.
  18. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402.
  19. Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Multimodal Fusion for Audio-Image and Video Action Recognition. Neural Comput. Appl. 2024, 1–14.
  20. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284.
  21. Gong, Y.; Chung, Y.A.; Glass, J. AST: Audio Spectrogram Transformer. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 571–575.
  22. Chen, T.; Zhai, X.; Ritter, M.; Lucic, M.; Houlsby, N. Self-supervised GANs via auxiliary rotation loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12154–12163.
  23. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and flexible image augmentations. Information 2020, 11, 125.
  24. Takahashi, R.; Matsubara, T.; Uehara, K. Data augmentation using random image cropping and patching for deep CNNs. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 2917–2931.
  25. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48.
  26. Anoosheh, A.; Sattler, T.; Timofte, R.; Pollefeys, M.; Van Gool, L. Night-to-day image translation for retrieval-based localization. In Proceedings of the International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5958–5964.
  27. Alharbi, Y.; Wonka, P. Disentangled image generation through structured noise injection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5134–5142.
  28. Liao, X.; Yu, Y.; Li, B.; Li, Z.; Qin, Z. A new payload partition strategy in color image steganography. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 685–696.
  29. Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Spectral Centroid Images for Multi-Class Human Action Analysis: A Benchmark Dataset. Mendeley Data. 2023. Available online: https://data.mendeley.com/datasets/yfvv3crnpy/1 (accessed on 29 October 2023).
  30. Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. PyMAiVAR: An open-source Python suite for audio-image representation in human action recognition. Softw. Impacts 2023, 17, 100544.
  31. Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Chroma-Actions Dataset: Acoustic Images. Mendeley Data. 2023. Available online: https://data.mendeley.com/datasets/r4r4m2vjvh/1 (accessed on 29 October 2023).
  32. Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Waveplot-Based Dataset for Multi-Class Human Action Analysis. Mendeley Data. Available online: https://data.mendeley.com/datasets/3vsz7v53pn/1 (accessed on 29 October 2023).
  33. Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Spectral Rolloff Images for Multi-class Human Action Analysis: A Benchmark Dataset. Mendeley Data. 2023. Available online: https://data.mendeley.com/datasets/nd5kftbhyj/1 (accessed on 29 October 2023).
  34. Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. MFFCs for Multi-Class Human Action Analysis: A Benchmark Dataset. Mendeley Data. 2023. Available online: https://data.mendeley.com/datasets/6ng2kgvnwk/1 (accessed on 29 October 2023).
  35. Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. MFCCs Feature Scaling Images for Multi-Class Human Action Analysis: A Benchmark Dataset. Mendeley Data. 2023. Available online: https://data.mendeley.com/datasets/6d8v9jmvgm/1 (accessed on 29 October 2023).
  36. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
  37. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 20–36.
  38. Takahashi, N.; Gygli, M.; Van Gool, L. AENet: Learning deep audio features for video analysis. IEEE Trans. Multimed. 2017, 20, 513–524.
  39. Tian, Y.; Shi, J.; Li, B.; Duan, Z.; Xu, C. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 247–263.
  40. Brousmiche, M.; Rouat, J.; Dupont, S. Multimodal Attentive Fusion Network for audio-visual event recognition. Inf. Fusion 2022, 85, 52–59.
  41. Long, X.; De Melo, G.; He, D.; Li, F.; Chi, Z.; Wen, S.; Gan, C. Purely attention based local feature integration for video classification. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2140–2154.
  42. Gao, R.; Oh, T.H.; Grauman, K.; Torresani, L. Listen to Look: Action Recognition by Previewing Audio. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10457–10467.
  43. Shaikh, M.B.; Chai, D.; Shamsul Islam, S.M.; Akhtar, N. MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers. In Proceedings of the 2023 11th European Workshop on Visual Information Processing (EUVIP), Gjovik, Norway, 11–14 September 2023; pp. 1–6.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
