MHAiR: A Dataset of Audio-Image Representations for Multimodal Human Actions

Abstract: The audio-image representations for multimodal human actions (MHAiR) dataset contains six different image representations of audio signals that capture the temporal dynamics of actions in a compact and informative way. The dataset was extracted from audio recordings captured from an existing video dataset, i.e., UCF101.


Introduction
The recent progress in deep learning architectures, coupled with enhancements in Graphics Processing Unit (GPU) hardware and software stacks, has significantly empowered the handling of computationally demanding tasks, including Multimodal Human Action Recognition (MHAR). Analyzing human activities in a multimodal information context is a challenging endeavor that necessitates substantial computational resources [1]. This has emerged as a prominent research issue in the field of computer vision. Human Action Recognition (HAR) involves categorizing human actions depicted in a sequence of images, essentially classifying the goals pursued by individuals across a series of image frames.
Video modality inherently holds spatial information, which lends itself well to Convolutional Neural Network (CNN)-based classification architectures. In the pursuit of more effectively encompassing the multimodal facets of action data, a contemporary approach involves the integration of data from various modalities, including optical flow, RGB difference, and warped optical flow. Audio is a lightweight signal in comparison to video data. Moreover, image-based representations are optimal for vision models in machine learning, specifically for convolutional neural network-based vision models. Further, features from spectral centroid-based representations are visually favorable when compared to convolution-based methods. Spectral centroids provide a compact and informative representation of the audio signal that captures the discriminative features and temporal dynamics of human actions. Therefore, the dataset described in this manuscript was generated while screening diverse image-based representations of action sequences for multimodal fusion with video data. This dataset can thus be used to analyze critical features from the action sequences in image form. This dataset extends our previous publication [2], which outperforms state-of-the-art methods with an accuracy of 91.2%, by focusing on multimodal representations of action sequences that present critical features of the audio from different perspectives, as captured from each action sample. These datasets were also used as a prerequisite for developing an intelligent multimodal action recognition system that classifies actions using deep learning algorithms based on acoustic and video modalities. To the best of our knowledge, MHAiR is the first audio-image representation dataset for multimodal human action recognition that uses image-based representations of audio to leverage CNN and transformer-based architectures for improving action recognition. The key contributions of our work can be summarized as follows:

• We introduce MHAiR, a new lightweight multimodal dataset of audio-image representations for human actions.

• We build a new feature representation strategy to select the most informative candidate representations for audio-visual fusion.

• We achieve state-of-the-art or competitive results on standard public benchmarks, validating the generalizability of our proposed approach through extensive evaluation.

Value of Data
There are several ways in which this dataset can be valuable compared to the original dataset and in serving other novel use cases. The distinguishing characteristics of this dataset are the following:

• It provides a significant reduction in dimensionality. The spectral centroid images represent the frequency content of the audio signal over time, which is a lower-dimensional representation of the original video dataset. This can make it easier and faster to process the data and extract meaningful features.

• It is robust against visual changes. The spectral centroid images are based on the audio signal, which is less affected by visual changes such as changes in lighting conditions or camera angles. This makes the dataset more robust to visual changes and can improve the accuracy of human action analysis.

• It offers standardization, as spectral centroid images can be standardized to a fixed size and format, which can make it easier to compare and combine data from diverse sources. This can be useful for tasks such as cross-dataset validation and transfer learning. Hence, this dataset can serve as a standard benchmark for evaluating the performance of different machine learning algorithms for human action analysis based on audio signals.

• It is suitable for privacy-oriented applications such as surveillance or healthcare monitoring, which may require analysis of human actions without capturing original visual information. Spectral centroid images provide a privacy-preserving alternative that can still enable effective analysis in applications where audio can be fused and aligned with non-visual sensory datasets such as HH105 and HH125.

The structure of this paper is organized as follows. Section 2 discusses related work. Section 3 describes the key characteristics of the dataset. Section 4 elaborates on the process of extracting the distinct modalities and the rationale behind feature extraction in the context of multimodal human action recognition. Section 5 provides an analysis and comparison of a downstream task to establish a benchmark for our proposed dataset, and Section 6 presents the conclusion of this paper.

Multimodal Recognition Methods
Feature extraction is the process of deriving critical information from raw instances, which in turn contributes to the learning process. The Temporal Segment Network (TSN) is used as a feature extractor based on its temporal pooling of frame-level features, and it has been widely employed as an efficient video feature extractor for different problems. The Gate-Shift Module (GSM) can turn a 2D CNN into a highly efficient spatio-temporal feature extractor. For example, when TSN is plugged into GSM [3], an accuracy improvement of 32% is achieved. Furthermore, Yang et al. [4] used TSN with a soft attention mechanism to capture important frames from each segment. Moreover, Zhang et al. [5] used the TSN model as a feature extractor with ResNet101 for efficient behavior recognition of pigs.
Recently, TSN has been adapted as a backbone in video understanding scenarios [6][7][8][9][10], where it is typically used in conjunction with a succeeding module. In [10], TSN was employed as a 2D CNN backbone to learn motion dynamics in videos. Meanwhile, IRV2 has been used for feature extraction from images [11], helping with different image restoration and enhancement tasks [12,13]. In another work, Liu et al. [14] addressed a limitation in existing skeleton-based gesture recognition methods by introducing temporal-dependent adjacency matrices. This innovative approach enhanced the ability of GCNs to model temporal information.

Audio-Image Representations
This subsection describes the six different image representations of audio signals.

Waveplot
A waveplot is a specialized graphical representation predominantly utilized in the fields of signal processing and music technology for the analysis of audio data. This plot renders the temporal progression of an audio signal's amplitude, offering a vivid depiction of the audio properties and their fluctuations over time. In the construction of a waveplot, the horizontal axis, or x-axis, represents time, while the vertical axis, or y-axis, represents amplitude. The fluctuations of the wave's amplitude over time generate an illustrative portrayal of the auditory characteristics of the sound, including its loudness and periods of silence. However, it is crucial to acknowledge that a waveplot, while informative, lacks the specificity to offer insights into an audio file's frequency content or pitch. For a more nuanced understanding of an audio file, analysts often resort to other types of plots, such as spectrograms or mel spectrograms, which are capable of illuminating frequency-related information. The waveform provides a visual representation of the audio signal's temporal structure. This can be especially useful for recognizing actions that have distinct audio patterns or that start and end abruptly. For example, the waveform for a clapping action shows sharp spikes corresponding to claps.
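As an illustration, a waveplot can be rendered with standard Python audio tooling. The following is a minimal sketch, not the authors' pipeline; it assumes librosa (0.10 or later, where waveshow replaces the older waveplot) and matplotlib are installed, and the file name clapping.wav is hypothetical.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the clip; sr=22050 matches the sampling rate used for this dataset.
y, sr = librosa.load("clapping.wav", sr=22050)  # hypothetical example file

# Plot amplitude against time: sharp spikes would correspond to individual claps.
fig, ax = plt.subplots(figsize=(8, 3))
librosa.display.waveshow(y, sr=sr, ax=ax)
ax.set(title="Waveplot", xlabel="Time (s)", ylabel="Amplitude")
fig.savefig("waveplot.png", bbox_inches="tight")
```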

Spectral Centroid
The spectral centroid is a measure of the center of "gravity" of the power spectrum of an audio signal [15]. Mathematically, the spectral centroid for the frame at time $t$ is defined as

$$SC_t = \frac{\sum_{k} k \, m_t(k)}{\sum_{k} m_t(k)},$$

where $SC_t$ is the spectral centroid frequency at time $t$, $k$ is the $k$th frequency bin, $m_t(k)$ is the power spectral density value at frequency bin $k$, and the summation is taken over all frequency bins. Essentially, this equation calculates the average frequency of a signal weighted by the power at each frequency.
In practice, the spectral centroid is usually computed using the Discrete Fourier Transform (DFT) of a short-time windowed segment of the audio signal.This results in a sequence of spectral centroids over time, which can be further processed and analyzed to extract useful features for various audio signal processing applications.
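A minimal sketch of this computation, assuming librosa is available and using a hypothetical input file, mirrors the equation above by weighting each FFT bin frequency by its power:

```python
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=22050)  # hypothetical example file

# Short-time windowed DFT: rows are frequency bins k, columns are frames t.
stft = librosa.stft(y, n_fft=2048, hop_length=512)
m = np.abs(stft) ** 2  # power spectral density per bin, m_t(k)
freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)

# Power-weighted mean frequency per frame: SC_t = sum_k f(k) m_t(k) / sum_k m_t(k).
sc = (freqs[:, None] * m).sum(axis=0) / (m.sum(axis=0) + 1e-10)

# Note: librosa.feature.spectral_centroid offers a built-in variant that
# weights by magnitude rather than power, so its values differ slightly.
```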
Overall, spectral centroid-based images provide an efficient, robust, and informative representation of the audio signal that can be used for human action recognition [16]. For example, a higher spectral centroid value often corresponds to a "brighter" or "sharper" sound, while a lower spectral centroid value usually indicates a "duller" or "muddier" sound [17]. By converting the spectral centroid over time into an image, we can capture spatial and temporal information that can be effectively processed by deep learning models [2]. The spectral centroid can also help in distinguishing actions based on their tonal or harmonic characteristics. For example, it can be valuable in recognizing actions involving musical instruments or vocalizations, where the timbre or brightness of the sound varies.

Spectral Rolloff
Spectral rolloff is a measure in digital signal processing that estimates the frequency below which a specified percentage of the total spectral energy lies. In other words, it is the cutoff frequency beyond which any additional increase in frequency contributes less power or energy. Typically, spectral rolloff is expressed as a fraction of the Nyquist frequency (half of the sampling rate), and it serves as an important feature in audio analysis for various tasks, including music information retrieval, speech processing, and detection of musical onsets and offsets. The rolloff frequency can provide a sense of the bandwidth of the signal. A lower rolloff frequency often indicates a narrower bandwidth or a more tonal signal, while a higher rolloff frequency may suggest a broader bandwidth or a noisier signal. Spectral rolloff can be relevant for recognizing actions based on the high-frequency content of the audio. For instance, actions that involve high-pitched sounds or that have significant energy in the higher frequency range can be distinguished using spectral rolloff.
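For example, a short sketch using librosa's built-in estimator (same hypothetical clip as above); the roll_percent parameter sets the fraction of total spectral energy that lies below the reported frequency:

```python
import librosa

y, sr = librosa.load("clip.wav", sr=22050)  # hypothetical example file

# Frequency below which 85% of the spectral energy lies, computed per frame.
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)[0]
print(rolloff.shape, rolloff.mean())
```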

Mel Frequency Cepstral Coefficients (MFCCs)
Mel Frequency Cepstral Coefficients (MFCCs) are a type of feature widely used in the field of digital signal processing and speech recognition.They provide a representation of the power spectrum of an audio signal that is more aligned with human auditory perception.
MFCCs are based on the known variation of the human ear's critical bandwidths. This variation is often expressed in terms of the Mel scale, a perceptual scale of pitches judged by listeners to be equal in distance from one another. Hence, MFCCs take into account the non-linear frequency perception of the human ear, making them a robust feature for speech and music modeling.
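For reference, a commonly used formulation of the Mel scale (a standard form from the literature, not stated in the original text) maps a frequency $f$ in Hz to Mels as

$$m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right).$$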
The process to extract MFCCs involves several steps:

• Pre-emphasis: This step is performed to increase the amplitude of the high-frequency part of the signal.

• Framing: The continuous signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N).

• Windowing: Each frame is multiplied by a window function (a Hamming window, for instance).

• Fast Fourier Transform (FFT): This step converts each frame from the time domain to the frequency domain.

• Mel Filter Bank Processing: The power spectrum is then multiplied with a set of Mel filters to obtain a set of Mel-scaled spectra.

• Discrete Cosine Transform (DCT): Finally, the log Mel spectrum is transformed back to the time domain using the DCT. The resulting coefficients are the Mel Frequency Cepstral Coefficients.
In the context of human action recognition, MFCCs can provide information about the rhythm, tempo, and acoustic cues related to actions.
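In practice, this whole pipeline is available as a single call in common audio libraries. Below is a minimal sketch assuming librosa and the same hypothetical clip as above:

```python
import librosa

y, sr = librosa.load("clip.wav", sr=22050)  # hypothetical example file

# 13 MFCCs per frame; librosa handles framing, windowing, FFT,
# Mel filter bank processing, log compression, and the DCT internally.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)
```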

MFCC Feature Scaling
MFCC Feature Scaling is a normalization process used when working with Mel Frequency Cepstral Coefficients (MFCCs) in machine learning applications, particularly in audio and speech processing.
The goal of feature scaling, also known as data normalization, is to normalize the range of feature values in order to promote computational efficiency and reduce the potential impact of the so-called "curse of dimensionality". This is especially critical in machine learning models, such as neural networks, where features with different scales can have a detrimental impact on the learning process.
When applied to MFCCs, feature scaling might take a couple of forms:

• Standardization: This technique scales the MFCC features so that they have the properties of a standard normal distribution, with a mean of zero and a standard deviation of one. This is achieved by subtracting the mean and then dividing by the standard deviation.

• Min-Max Scaling: Also known as normalization, this technique rescales the features to a fixed range, usually 0 to 1 or −1 to 1. The scaler subtracts the minimum value of the feature and then divides by the range (max value − min value).
By applying MFCC Feature Scaling, it is possible to optimize the performance of machine learning models by ensuring that all MFCC features contribute equitably to the model's learning, preventing features with larger scales from dominating those with smaller scales.
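A minimal sketch of both options with NumPy, applied per coefficient across frames, assuming the mfccs array from the previous snippet:

```python
import numpy as np

# mfccs: array of shape (n_mfcc, n_frames), e.g. from librosa.feature.mfcc.

# Standardization: zero mean, unit standard deviation per coefficient.
standardized = (mfccs - mfccs.mean(axis=1, keepdims=True)) / (
    mfccs.std(axis=1, keepdims=True) + 1e-10
)

# Min-max scaling: rescale each coefficient to the [0, 1] range.
mn = mfccs.min(axis=1, keepdims=True)
mx = mfccs.max(axis=1, keepdims=True)
minmax = (mfccs - mn) / (mx - mn + 1e-10)
```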

Chromagram
A chromagram is a graphical representation of the chroma features of an audio signal, utilized extensively in the field of music information retrieval. The term "chroma" pertains to the 12 different pitch classes in music, which correspond to the traditional Western music scale. In other words, it refers to the "color" of music, offering a sense of key and harmony.
A chromagram visually represents how the intensity of these pitch classes changes over time in a piece of music. Each row in a chromagram corresponds to one of the 12 pitch classes, and the columns correspond to points in time. The color or intensity at each point in the plot shows the degree to which that pitch class is present in the sound at that moment.
Generating a chromagram involves several steps (a code sketch follows the list):

• The audio signal is first converted into the frequency domain using the Fourier Transform or a similar method.

• The resulting spectral information is then mapped onto the 12 pitch classes in an octave using a filter bank tuned to chroma frequencies.

• Over time, a 2D representation (time versus pitch-class intensity) is obtained.
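A minimal sketch with librosa; the hop length of 512 matches the setting reported for the chromagram-based representation in the Methodology section, and the input file name is again hypothetical:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("clip.wav", sr=22050)  # hypothetical example file

# 12 pitch-class rows, one column per frame; hop_length=512 as used for this dataset.
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=512)

fig, ax = plt.subplots(figsize=(8, 3))
img = librosa.display.specshow(chroma, sr=sr, hop_length=512,
                               x_axis="time", y_axis="chroma", ax=ax)
fig.colorbar(img, ax=ax)
fig.savefig("chromagram.png", bbox_inches="tight")
```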
There are several benefits of using image-based representations for human action recognition, including:

• Efficient representation: Spectral centroid-based images provide an efficient representation of the audio signal that can be easily processed by deep learning models. Unlike raw audio signals, which can be difficult to process due to their high dimensionality and variability, spectral centroid-based images provide a compact and informative representation that captures the temporal dynamics of the audio signal.

• Robustness to noise: Spectral centroid-based images are less sensitive to noise and distortions than other audio features, such as the raw audio signal or Mel-frequency cepstral coefficients (MFCCs). This is because spectral centroids capture the "center of gravity" of the frequency content, which is less affected by noise and distortions than the fine-grained details of the audio signal. This makes them suitable for noisy environments where other audio features might be unreliable.

• Spatial information: Spectral centroid-based images provide spatial information that can be used by deep learning models to recognize human actions. By converting the spectral centroid over time into an image, we can capture the spatial and temporal information of the frequency distribution of the audio signal, which can be interpreted by deep learning models to recognize different human actions.

• Transfer learning: Spectral centroid-based images can be used for transfer learning, where pre-trained models are fine-tuned on a specific task. This is because spectral centroid-based images provide a standardized and efficient representation that can be used to compare and combine data from dissimilar sources. This can be useful for tasks such as cross-dataset validation and transfer learning, where models trained on one dataset can be applied to another dataset.

Methodology
A high-level schematic of a prospective downstream multimodal task is illustrated in Figure 2. Audio samples for this dataset of human actions were extracted from videos at a sampling rate of 22,050 Hz. The audio was extracted from the UCF101 video dataset using the "ffmpeg" tool, and the resulting audio files were saved separately. For each image representation, post-processing and metadata handling were applied. Following best practices, a hop length of 512 was used for the chromagram-based representation. The extracted audio files were organized and stored according to the UCF101 splits, and a quality control check was performed to ensure the audio met the desired standards. This process allowed for the isolation of the audio component from the video data, making it available for various applications, including multimodal action recognition and standalone audio analysis. These features were then projected onto images that can be processed by Convolutional Neural Networks (CNNs) such as IRV4 [20] or Transformers such as AST [21]. Samples that did not have any audio channels were removed from consideration. In total, 51 categories were analyzed to represent the audio-image features extracted from the audio signals.
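A hedged sketch of this extraction step is shown below. It assumes ffmpeg is on the PATH; the directory layout, file naming, and downmixing to mono are illustrative assumptions, not the authors' actual scripts.

```python
import subprocess
from pathlib import Path

def extract_audio(video_path: Path, out_dir: Path, sr: int = 22050) -> Path:
    """Extract the audio track of one clip as a WAV file."""
    out_path = out_dir / (video_path.stem + ".wav")
    # -vn drops the video stream; -ar resamples to 22,050 Hz; -ac 1 mixes to mono (assumption).
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video_path),
         "-vn", "-ar", str(sr), "-ac", "1", str(out_path)],
        check=True, capture_output=True,
    )
    return out_path

# Clips without an audio stream make ffmpeg exit with an error, which mirrors
# the paper's removal of samples that have no audio channel.
for video in Path("UCF101").rglob("*.avi"):  # hypothetical dataset location
    try:
        extract_audio(video, Path("audio"))
    except subprocess.CalledProcessError:
        pass  # skip silent clips
```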
Since the dataset targets experimentation on human action recognition in daily-life scenarios, all daily-life actions occurring in action recognition were retained in order to inform the models (e.g., through fine-tuning) of the specificities characterizing the audio at hand. Data were thus preserved in raw format: no form of image normalization was undertaken, and no pre-processing was applied to the collected data. No Data Augmentation (DA) approaches were adopted (such as horizontal flipping), in order to prevent injecting any kind of noise into the samples and to ensure the inclusion of extensively trimmed action sequences. DA is customarily performed through Rotation [22], Flipping [23], Cropping [24], Scaling [25], Translation [26], Noise Injection [27], Color Modification [28], and other modes. Carefully selecting appropriate data augmentation techniques ensures that modified images remain representative of the original dataset and do not introduce any unwanted biases. These types of processing can easily be completed with off-the-shelf software libraries, starting from our data and according to specific application needs.

Results
In the context of multimodal action recognition, as in the Multimodal Audio-image and Video Action Recognition (MAiVAR) framework [2], these data are utilized and demonstrate superior performance compared to other audio representations. The study establishes a benchmark approach for using this dataset. Table 2 illustrates the performance of multimodal deep learning models using different audio representations, namely Waveplot, Spectral Centroids, Spectral Rolloff, and MFCCs. These representations are used in two scenarios: audio only and fusion of audio and video. The Waveplot representation shows mediocre performance in the audio-only scenario (12.08) but excels when combined with video, reaching a performance of 86.21 in the fusion scenario. The Spectral Centroids representation likewise performs poorly in the audio-only scenario (13.22) but improves when combined with video, achieving a performance of 86.26 in the fusion scenario. The Spectral Rolloff representation performs slightly better than the previous two in the audio-only scenario (16.46). Lastly, the MFCC representation shows deficient performance in the audio-only scenario (12.96), and its performance in the fusion scenario (83.95) is also lower compared to that of the other representations. In summary, all representations perform significantly better in the fusion scenario, indicating that the combined use of audio and video data enhances the effectiveness of these models. The MFCC representation, however, seems to be less effective when combined with video data compared to the others, which indicates that pre-processing steps for audio representations might play a crucial role in improving model performance. Finally, our previous work in [2] reports state-of-the-art results for action recognition on audio-visual datasets, highlighting the impact of this work in the research community. We use this dataset [29] to conduct experiments on human action recognition. Extensive experiments are conducted in the publications listed in Table 3 against several features, and comprehensive results on the proposed datasets, derived against the various features discussed in our prior publications, are listed in Table 4.

Conclusions
In conclusion, this paper presents an innovative dataset comprising spectral centroid images representing human actions, derived from the audio signals of the UCF101 video dataset. These spectral centroid images provide a compact and information-rich representation of the temporal dynamics of human actions, making them robust to noise and distortion and highly suitable for diverse applications such as surveillance, healthcare monitoring, and robotics.
Moreover, the unique characteristics of the dataset allow it to serve as a robust benchmark for assessing the efficacy of various machine learning models in human action recognition tasks. It also provides opportunities for cross-dataset validation and transfer learning, opening avenues for fine-tuning pre-existing models on new tasks. Therefore, this dataset not only enhances the accuracy of human action-related tasks but also provides a novel methodology that can contribute to the field of human action recognition.
In the future, subsequent investigations might center on the exploration of various large-scale multimodal datasets in conjunction with more efficient feature representations to extend and improve multimodal action recognition applications.

Figure 2. High-level schematic representation of our approach.

Table 1. Statistics describing the image representations employed in the experimental setting: for all considered categories, we report the total number of training and testing samples.

Table 3. Prior publications produced using the proposed dataset.

Table 4. Classification accuracy of MAiVAR using the Chromagram representation, compared to state-of-the-art methods on the UCF51 dataset after fusion of audio and video features.