Human Action Representation Learning Using an Attention-Driven Residual 3DCNN Network

The recognition of human activities using vision-based techniques has become a crucial research field in video analytics. Over the last decade, there have been numerous advancements in deep learning algorithms aimed at accurately detecting complex human actions in video streams. While these algorithms have demonstrated impressive performance in activity recognition, they often exhibit a bias towards either model performance or computational efficiency. This biased trade-off between robustness and efficiency poses challenges when addressing complex human activity recognition problems. To address this issue, this paper presents a computationally efficient yet robust approach, exploiting saliency-aware spatial and temporal features for human action recognition in videos. To achieve effective representation of human actions, we propose an efficient approach called the dual-attentional Residual 3D Convolutional Neural Network (DA-R3DCNN). Our proposed method utilizes a unified channel-spatial attention mechanism, allowing it to efficiently extract significant human-centric features from video frames. By combining dual channel-spatial attention layers with residual 3D convolution layers, the network becomes more discerning in capturing spatial receptive fields containing objects within the feature maps. To assess the effectiveness and robustness of our proposed method, we have conducted extensive experiments on four well-established benchmark datasets for human action recognition. The quantitative results obtained validate the efficiency of our method, showcasing significant improvements in accuracy of up to 11% as compared to state-of-the-art human action recognition methods. Additionally, our evaluation of inference time reveals that the proposed method achieves up to a 74× improvement in frames per second (FPS) compared to existing approaches, thus showing the suitability and effectiveness of the proposed DA-R3DCNN for real-time human activity recognition.


Introduction
Convolutional neural networks (CNNs) have become one of the most widely used deep learning architectures in computer vision due to their ability to effectively capture the spatial features of image and video data. In recent years, CNNs have shown remarkable success in a variety of applications, including object detection and recognition [1], image segmentation [2], and scene understanding [3]. The design of a deep neural network, including the depth and structure of its layers, is crucial for its efficiency and depends on the task. In some cases, such as object recognition and video analytics, over-parameterization is necessary to ensure the model captures complex hidden patterns and generalizes well. However, this comes at the cost of increased computational complexity, making such models unsuitable for real-time environments and resource-limited devices, and requiring high-end GPUs for training [4,5]. The design of the network architecture is task-specific and varies from problem to problem. For instance, recognizing objects in still images demands a plain 2DCNN composed of convolutional layers for spatial feature extraction and a classification layer for the classification task. Recognizing human actions in a video stream cannot be handled with a single 2DCNN, because videos are composed of long sequences of frames in which the action unfolds along the temporal dimension (t).
To cope with the challenge of human action recognition in video streams, researchers have introduced different solutions, including two-stream 2D convolutional neural network (CNN) architectures [6], 3D CNN (2D + 1D) architectures [7], and CNNs combined with a recurrent neural network (RNN) [8]. Typically, two-stream 2DCNNs [6] use two different CNN architectures to extract two different kinds of features from the input video. The first CNN model extracts the spatial features from the input video frames, whereas the second network extracts the temporal optical flow features with respect to their corresponding spatial features. The extracted features from both models can then be combined into a single latent representation vector for the activity recognition task. On the other hand, 3DCNNs [7] use a single CNN architecture with 3D convolution kernels, where the first two dimensions capture the spatial features and the last dimension of the 3D kernel captures the temporal flow of the spatial features across the frames. CNN + RNN frameworks include two different types of models (i.e., a CNN followed by an RNN). The CNN model in a CNN + RNN architecture extracts the spatial features from the video frames and converts them to a one-dimensional latent representation such as a feature vector. The extracted latent representation from the CNN model can then be fed to the RNN model for the activity classification task using sequential pattern learning. Typically, an activity recognition framework with two different network architectures increases the parameter space (computational complexity) of the entire framework as well as the time complexity of the model for the task under consideration. Considering the parameter space and time complexity of such two-network architectures, 3DCNNs are considered a suitable candidate for the human activity recognition task.
Therefore, in this paper, we propose a computationally efficient residual 3DCNN architecture called the dual-attentional residual 3D convolutional neural network (DA-R3DCNN), with channel and spatial attention, for the human activity recognition task. The proposed DA-R3DCNN pairs channel and spatial attention layers with each residual block, which helps our model propagate salient features from the early layers to the later layers. This propagation of salient information significantly improves the performance of our model for the human activity recognition task. More precisely, the major contributions of this paper are as follows:
1. To overcome the issue of over-parameterization, we present a computationally efficient yet robust end-to-end residual 3DCNN model coupled with dual 3D attention and a residual 3D convolution mechanism, learning object- and motion-centric spatiotemporal representations of human actions in video sequences;
2. To prevent gradient vanishing, this work proposes a 3D residual convolution mechanism that allows the flow of learned representations from the early layers to the later layers of the network. Moreover, instead of using a plain shortcut path, we use a convoluted shortcut path having a 3D convolution layer of kernel size 1 × 1 × 1;
3. To efficiently extract spatial saliency from video frames, we utilize a dual 3D channel-spatial attention mechanism along with residual 3D skip connections. Our approach integrates the dual-attentional module after every two consecutive 3D convolutional layers within the 3DCNN model. This enables the extraction of discriminative features that are sensitive to object saliency, allowing for precise localization of action-specific regions in the video frames.
The remaining sections of this paper are organized as follows. Section 2 presents a concise overview of related works in the field of human activity recognition. Section 3 delves into a comprehensive discussion of the proposed DA-R3DCNN framework and its key components. The detailed experimental evaluation of the DA-R3DCNN framework, along with comparisons to the state-of-the-art human action recognition methods, is presented in Section 4. Finally, Section 5 concludes this paper and highlights potential future research directions in this domain.

Related Works
Over the past decade, there has been significant research on human activity recognition, with several advanced methods proposed to effectively tackle the problem of recognizing human actions. These approaches include two-stream 2DCNNs [9][10][11][12][13], CNN + LSTM [14][15][16][17][18], and 3DCNN-based methods [7,[19][20][21][22][23]. Typically, the two-stream 2DCNN paradigm uses two different CNN architectures for modeling human actions in video data. Both CNN architectures operate on the same input data; however, they extract different representations from the input video. One model extracts the discriminative spatial features (i.e., encoded visual representations), while the other network extracts temporal features (i.e., the temporal flow of spatial features) from the input video. For instance, Wang et al. [11] have proposed a two-stream CNN architecture for human activity recognition. Their approach utilizes two separate CNNs to extract spatial and temporal features from the input video frames. They have also introduced a video segmentation strategy, which divides the input video into three segments. The two-stream CNN architecture is then applied to these segments to perform segment classification. The segment classification scores are then combined using average pooling to perform video-level classification. Karpathy et al. [9] have proposed a dual-stream 2DCNN framework to model both spatial and temporal features from the given video frames. To expedite computation, they operate their dual-stream CNN model on two different resolutions of the video frames. The extracted features from both models are then fused to obtain a spatial-temporal representation of human actions in the video. In another work, Zhang et al. [13] have introduced a multi-task learning approach for human activity recognition in low-resolution videos. To improve the resolution of the input video, they have proposed two super-resolution techniques that transform the low-resolution input video into a high-resolution video. The transformed high-resolution video frames are then fed to their dual-stream classification network for the human activity recognition task.
Unlike the two-stream 2DCNN approaches, the CNN + LSTM paradigm uses two different types of networks for spatial and temporal feature representation learning. The first part of this paradigm uses a 2DCNN architecture to extract discriminative spatial features and convert them to a latent representation, whereas the latter part applies an RNN model to the extracted latent representation and learns the temporal hidden patterns in the spatial features. For instance, Srivastava et al. [15] have proposed unsupervised encoder and decoder long short-term memory (LSTM) networks for learning temporal modeling of human actions. The authors initially transformed the input video into a fixed-length representation of temporal features using an encoder LSTM network. Subsequently, they employed a decoder network to reconstruct the video from the latent representation, which facilitated human action predictions. Donahue et al. [14] have presented a recurrent convolution driven approach called the long-term recurrent convolutional network (LRCN) for recognizing human actions in videos. They have used a 2DCNN architecture to transform the input video frames to a 1D latent representation of spatial features. The extracted latent representations are then fed to an LSTM network to capture the temporal changes in the extracted spatial features across the sequence of frames. In the work presented in [18], Sudhakaran et al. have utilized a task-specific recurrent unit that incorporates a spatial attention mechanism. This mechanism enables the capture of salient features across sequences of video frames. The extracted salient features are then processed by an LSTM network to learn the temporal relations of the salient information, facilitating video-level activity recognition. Sharma et al. [16] have proposed the utilization of a deep multi-layer LSTM for the recursive estimation of visual attention maps. Their approach involves applying the multi-layer LSTM to RGB video frames, allowing for the computation of weighted attention maps through recursive operations. They have claimed that their weighted attention map mechanism enhances feature representation, which in turn improves the performance of the model for the activity recognition task.
Both the two-stream 2DCNN and CNN + LSTM based methods use two distinct architectures for capturing spatiotemporal features in video frames. The utilization of two different networks makes these approaches computationally inefficient, thereby increasing the overall computational complexity of the model for the activity recognition task. To alleviate the computational burden of the model, numerous studies have proposed unified end-to-end 3DCNN approaches that encapsulate the learning objective of both spatial and temporal features in a single model. Typically, 3DCNNs utilize the first two dimensions of the learnable kernels for capturing spatial features, whereas the last dimension captures the temporal flow of the spatial features across the sequence of input data samples (i.e., video frames in the case of human activity recognition). For example, Diba et al. [19] have introduced a modified version of the DenseNet [24] architecture called DenseNet-3D or temporal 3DCNN (T3D). They have achieved this by replacing 2D convolution and pooling kernels with 3D convolution and pooling kernels. Through their experiments, they have claimed that their T3D model demonstrates the potential to capture both short- and long-term spatiotemporal features within video sequences. In another study, Varol et al. [20] have presented a specialized variant of CNN known as long-term temporal convolution (LTC). They have extended the temporal depth of their 3DCNN convolutional layers and reduced the receptive field of the feature maps. These modifications have allowed the model to effectively learn long-term spatiotemporal patterns in video streams. To capture temporal-specific features, Hussain et al. [22] have proposed a multi-scale 3DCNN called Timeception that is designed to handle significant fluctuations in the temporal dimension by accommodating different temporal extents, which helps to effectively recognize long and intricate actions. The advent of the 3DCNN concept has allowed researchers to solve the sequential learning task using a unified approach instead of two different architectures. Although the reported 3DCNN-based approaches have shown noticeable improvement over two-stream 2DCNN and RNN-based approaches, these models are usually over-parameterized and can be optimized in terms of task-specific parameter reduction.
To address the limitations of existing 3DCNNs for the human activity recognition task, this paper proposes a residual 3DCNN architecture with encapsulated 3D channel and spatial attention mechanisms. The proposed DA-R3DCNN framework uses a residual 3DCNN architecture in which the convolution layers are stacked with channel and spatial attention modules that help our backbone model progressively learn salient features during training. In this way, the 3D channel and spatial attention module encourages the backbone residual 3DCNN to enhance the representation of salient information across the multiple 3D convolution layers and eliminates the contribution of sparse parameters in the learning process. By eliminating the sparsity of the parameter space, the proposed framework learns robust features while having a small parameter space.

Proposed DA-R3DCNN Human Activity Recognition Framework
This section presents a detailed overview of the proposed DA-R3DCNN architecture and its sub-components. The proposed framework incorporates three essential components: a 3DCNN architecture for learning spatiotemporal representations, a 3D convolution residual block, and a dual-attention module (comprising channel and spatial attention). These components work together to enable the residual 3DCNN to effectively capture salient features within video frames. For better understanding, we discuss these components in separate sections. First, we provide insights into the proposed residual 3DCNN, focusing on architecture details, and then present the technical details of the 3D convolution residual block. Finally, we present the detailed technical aspects of channel and spatial attention. The visual overview of our proposed DA-R3DCNN framework and its workflow is depicted in Figure 1.

DA-R3DCNN Architecture
In this work, we propose a residual 3DCNN model coupled with a dual channel-spatial attention mechanism. The proposed DA-R3DCNN network consists of eight 3D convolutional + batch normalization layers, four 3D max pooling layers, three residual blocks, and four dual-attention modules. The arrangement of convolution layers in our proposed DA-R3DCNN model was determined through empirical assessments. It is important to note that we maintained a consistent number of filters across all standard 3D convolution and residual 3D convolution layers, with a fixed value of 128 kernels per layer. Furthermore, all 3D convolution layers, including residual 3D convolution layers, are coupled with batch normalization layers. The architectural details of the proposed DA-R3DCNN model are given in Table 1. As given in Table 1, the first 3D convolutional layer applies 128 kernels of size 3 × 3 × 3 to the input frames, and its output is then down-sampled by the first 3D max pooling layer having a kernel size of 3 × 3 × 3. The second 3D convolution layer applies 128 kernels of size 3 × 3 × 3 to the output feature maps from the first 3D max pooling layer. The convoluted feature maps are then processed by the first attention module, which computes channel and spatial attention over the feature maps from the second 3D convolution layer, followed by the first residual block, which enhances feature representations using a residual convolution connection. The output feature maps from the first residual block are then down-sampled by the second 3D max pooling layer, followed by two consecutive 3D convolution layers (i.e., the third and fourth 3D convolution layers), which apply 128 kernels of size 3 × 3 × 3 to the output feature maps from the second 3D max pooling layer. The convoluted feature maps are then processed by the second attention module, followed by the second residual block. The output feature maps from the second residual block are then down-sampled by the third 3D max pooling layer having a kernel size of 3 × 3 × 3.
The intermediate pooled feature maps from the third 3D max pooling layer are then processed by two consecutive 3D convolution layers (i.e., the fifth and sixth 3D convolution layers) using 128 kernels of size 3 × 3 × 3. The convoluted feature maps from the fifth and sixth 3D convolution layers are then processed by the third attention module, followed by the third residual block. The resultant feature maps from the third residual block are further down-sampled by the fourth 3D max pooling layer having a kernel size of 3 × 3 × 3. The down-sampled feature maps from the fourth 3D max pooling layer are further convoluted by the seventh and eighth 3D convolution layers, having 128 kernels of size 3 × 3 × 3. The output feature maps from the seventh and eighth 3D convolution layers are then processed by the fourth attention module. The feature maps generated by the fourth attention module are further improved by passing them through the fourth residual block. Subsequently, these feature maps are down-sampled using the fifth 3D max pooling layer. The pooled feature maps are then converted to a 1D latent representation (i.e., of size 1 × n, where n represents the number of feature values) by a 3D global average pooling layer. The resultant 1D feature values are then processed by two consecutive fully connected layers (i.e., FC1 and FC2) having dimensions of 1 × 512. Finally, the output of the FC1 and FC2 layers (i.e., logits having negative and positive values) is passed to a softmax layer, which converts it to final probabilities (values between 0 and 1).
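To give a sense of the parameter budget implied by the description above, the following minimal sketch counts the trainable weights of the eight standard 3D convolution layers. The filter count (128) and kernel size (3 × 3 × 3) come from the text; the three-channel RGB input and the presence of a bias term are assumptions made for illustration, and batch normalization, residual-block, attention, and fully connected parameters are not included.

```python
def conv3d_params(kernel, in_channels, filters, bias=True):
    """Trainable weights of a single 3D convolution layer."""
    kt, kh, kw = kernel
    weights = kt * kh * kw * in_channels * filters
    return weights + (filters if bias else 0)


FILTERS = 128        # stated in the text
KERNEL = (3, 3, 3)   # stated in the text
RGB_CHANNELS = 3     # assumption: the first layer reads RGB frames

# The first layer reads the input clip; the other seven read
# 128-channel feature maps produced by earlier layers.
per_layer = [conv3d_params(KERNEL, RGB_CHANNELS, FILTERS)]
per_layer += [conv3d_params(KERNEL, FILTERS, FILTERS) for _ in range(7)]
total = sum(per_layer)

print(per_layer[0], per_layer[1], total)  # 10496 442496 3107968
```

Under these assumptions the standard convolution layers account for roughly 3.1 million weights, which illustrates why a fixed, moderate filter count keeps the backbone compact.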

3D Residual Convolution Block
To limit the propagation of vanishing gradients across the network layers, we used a 3D residual convolution mechanism inspired by the 2D residual convolution in [25], with a convoluted shortcut path. As shown in Figure 2, the utilized 3D residual convolution block consists of three convolution layers, with one additional convolution layer over the shortcut path, where each convolution layer is paired with a batch normalization layer. The second convolution layer in the residual block uses a 3 × 3 × 3 kernel, whereas the first, third, and shortcut path convolution layers use 1 × 1 × 1 kernels. Unlike the original residual block presented in [25], in this paper we used a 3D convolution block containing 3D convolution layers instead of 2D convolution layers. Further, instead of using a plain shortcut path as in [25], we used a 3D convoluted shortcut path to ensure the compatibility of input and output dimensions. The 3D residual convolution block used in this paper consists of two key components: the residual mapping and the shortcut path (skip connection). Mathematically, the utilized 3D residual convolution block can be expressed as follows:

$$y = g(x, w) + \tilde{x}$$

where $x$ is the input of the residual block and $g$ represents the mapping function (the convolution layers of the residual block), which learns the mapping between input and output using a set of weights represented by $w$. The term $\tilde{x}$ represents the convoluted shortcut path having a 1 × 1 × 1 convolution, which enables the gradients to flow more easily through the network layers, resulting in better performance. Finally, $y$ denotes the weighted mapping of input to output of the residual block.
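The residual block described above can be sketched in NumPy as the sum of a three-layer mapping and a convolved shortcut. This is an illustrative sketch under stated assumptions rather than the authors' implementation: batch normalization is omitted, the ReLU placement between layers is assumed, and all function names, weight shapes, and channel sizes are hypothetical.

```python
import numpy as np


def conv1x1x1(x, w):
    """1 x 1 x 1 convolution: a per-voxel linear map over channels.
    x: (T, H, W, C_in), w: (C_in, C_out)."""
    return x @ w


def conv3x3x3(x, w):
    """3 x 3 x 3 convolution, stride 1, zero 'same' padding.
    x: (T, H, W, C_in), w: (3, 3, 3, C_in, C_out)."""
    t, h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (1, 1), (0, 0)))
    out = np.zeros((t, h, wd, w.shape[-1]))
    for dt in range(3):
        for dh in range(3):
            for dw in range(3):
                out += xp[dt:dt + t, dh:dh + h, dw:dw + wd] @ w[dt, dh, dw]
    return out


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2, w3, ws):
    """Residual mapping (1x1x1 -> 3x3x3 -> 1x1x1) plus convolved shortcut."""
    h = relu(conv1x1x1(x, w1))   # first layer, 1 x 1 x 1
    h = relu(conv3x3x3(h, w2))   # second layer, 3 x 3 x 3
    h = conv1x1x1(h, w3)         # third layer, 1 x 1 x 1
    shortcut = conv1x1x1(x, ws)  # shortcut path, 1 x 1 x 1
    return h + shortcut


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 6, 8))             # (frames, height, width, channels)
w1 = rng.normal(size=(8, 4)) * 0.1
w2 = rng.normal(size=(3, 3, 3, 4, 4)) * 0.1
w3 = rng.normal(size=(4, 16)) * 0.1
ws = rng.normal(size=(8, 16)) * 0.1
y = residual_block(x, w1, w2, w3, ws)
print(y.shape)  # (4, 6, 6, 16)
```

The example deliberately changes the channel count from 8 to 16 to show why a convolved shortcut is useful: a plain identity shortcut could not be added to an output with a different number of channels.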

Dual Channel-Spatial Attention Module
Our proposed framework utilizes an attention-driven CNN architecture to selectively concentrate on the most significant regions within video frames. This method enables efficient and accurate localization of the salient regions while also enhancing the quality of the feature representation. The proposed attention mechanism is a modification of the convolutional block attention module (CBAM) [26]. To achieve this modification, the 7 × 7 convolution layer in CBAM is replaced with a 3D convolution layer having a kernel size of 3 × 3 × 3. Additionally, the spatial attention module is fused with the intermediate output of the channel attention module using an element-wise product operation. The resulting dual-attention block is visually represented in Figure 3. Our proposed approach employs a fusion of channel and spatial attention to efficiently extract important features from video frames while minimizing the number of parameters required. This design not only improves the representation of features but also reduces overhead. To implement this approach, we incorporated a stacked dual-attention module after every two consecutive convolutional layers in our network. The channel attention module in our proposed architecture calculates the weighted contribution of the feature-map channels by applying the intermediate channel attention $A_C$ to the output feature maps $F_M$ from the previous convolutional layer. This process results in the channel attention $Att_C$, which is used to enhance the overall feature representation. Once the channel attention module computes the channel attention feature maps $Att_C$, they are passed into the spatial attention module for further processing. The spatial attention module uses the channel attention maps to identify relevant object-specific regions within the video frames. To generate the refined feature maps $\hat{F}_M$, we fused the spatial attention feature maps $Att_S$ with the input feature maps $F_M$ using a residual skip connection by employing an element-wise addition operation. This approach enhances the quality of the feature representation, enabling more precise localization of salient regions. Mathematically, channel attention, spatial attention, and refined attention feature maps can be expressed as follows:

$$Att_C^{H \times W \times C} = A_C \otimes F_M^{H \times W \times C}$$

$$Att_S^{H \times W \times C} = A_S \otimes Att_C^{H \times W \times C}$$

$$\hat{F}_M^{H \times W \times C} = Att_S^{H \times W \times C} \oplus F_M^{H \times W \times C}$$

In the above equations, $H$, $W$, and $C$ represent the height, width, and number of channels of the feature maps, respectively, $A_C$ and $A_S$ represent the intermediate channel and spatial attentions, and $\otimes$ and $\oplus$ denote element-wise multiplication and addition. The refined feature maps, denoted as $\hat{F}_M$, are obtained by fusing the spatial attention feature maps $Att_S$ with the input feature maps $F_M$. This process allowed us to enhance the representation of features and improve the quality of the final output.

Channel Attention
In the context of image/object recognition problems, the contribution of each color channel is crucial in achieving accurate pattern recognition. CNN models leverage this information by constructing feature maps from the input image data and extracting deep discriminative features over multiple convolutional layers. However, certain channels may be more important than others in the recognition process, and often the object recognition model takes this into account during training. This approach allows an object recognition model to capture the most important visual features of the input images and improve recognition accuracy. Prior attention-based approaches in video analysis utilized either global max pooling or global average pooling layers. In contrast, the proposed DA-R3DCNN model combines both pooling methods to extract more effective features. The global max pooling layer selects the maximum value from the receptive field, emphasizing highly activated values, while the global average pooling layer estimates equally weighted feature maps for each channel. By leveraging the strengths of both pooling techniques, the model can capture and highlight the most important and discriminative features in videos. This results in improved performance in various video analysis tasks, including action recognition and spatio-temporal localization.
Once the pooled feature maps have been computed, they are fed into a shared multilayer perceptron (MLP), which comprises two fully connected layers, each with 512 nodes. The MLP uses a rectified linear unit (ReLU) activation function between the two fully connected layers to introduce non-linearity. The MLP then produces two distinct feature vectors, $V_{C\text{-}max}^{1 \times 1 \times C}$ and $V_{C\text{-}avg}^{1 \times 1 \times C}$, from the global max pooled and global average pooled features, respectively. These feature vectors capture the most salient information present in the feature maps. After computing the feature vectors from global max pooling and global average pooling, they are fused through element-wise addition and passed through a sigmoid activation function $\sigma$ to obtain the intermediate channel attention features $A_C^{1 \times 1 \times C}$. These features are then fused with the input feature maps $F_M^{H \times W \times C}$ through a residual skip connection using an element-wise multiplication operation, resulting in the final channel attention feature maps $Att_C^{H \times W \times C}$. Figure 4 provides a visual representation of this process. Mathematically, the channel attention and its key components can be formulated as follows:

$$V_{C\text{-}max}^{1 \times 1 \times C} = f_{c2}\left(\mathrm{ReLU}\left(f_{c1}\left(\mathrm{GMP}\left(F_M^{H \times W \times C}\right)\right)\right)\right)$$

$$V_{C\text{-}avg}^{1 \times 1 \times C} = f_{c2}\left(\mathrm{ReLU}\left(f_{c1}\left(\mathrm{GAP}\left(F_M^{H \times W \times C}\right)\right)\right)\right)$$

$$A_C^{1 \times 1 \times C} = \sigma\left(V_{C\text{-}max}^{1 \times 1 \times C} \oplus V_{C\text{-}avg}^{1 \times 1 \times C}\right)$$

$$Att_C^{H \times W \times C} = A_C^{1 \times 1 \times C} \otimes F_M^{H \times W \times C}$$

where $f_{c1}$ and $f_{c2}$ denote the first and the second fully connected layer, respectively, and GMP and GAP denote global max pooling and global average pooling.
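The channel attention steps described in this section (global max and average pooling, a shared two-layer MLP, element-wise addition, a sigmoid, and a channel-wise product with the input) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the feature maps are assumed to carry an extra temporal axis, and the second fully connected layer is assumed to project back to C channels so that the attention vector can be broadcast over the feature maps.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def shared_mlp(v, w1, b1, w2, b2):
    """Two fully connected layers with a ReLU in between (shared weights)."""
    return np.maximum(v @ w1 + b1, 0.0) @ w2 + b2


def channel_attention(f_m, w1, b1, w2, b2):
    """f_m: feature maps of shape (T, H, W, C).
    Returns the attended maps Att_C and the attention vector A_C."""
    v_max = shared_mlp(f_m.max(axis=(0, 1, 2)), w1, b1, w2, b2)   # global max pooling
    v_avg = shared_mlp(f_m.mean(axis=(0, 1, 2)), w1, b1, w2, b2)  # global average pooling
    a_c = sigmoid(v_max + v_avg)   # intermediate channel attention, shape (C,)
    return a_c * f_m, a_c          # broadcast over T, H, W


rng = np.random.default_rng(0)
C, HIDDEN = 8, 512                 # C is illustrative; 512 follows the text
f_m = rng.normal(size=(4, 6, 6, C))
w1, b1 = rng.normal(size=(C, HIDDEN)) * 0.05, np.zeros(HIDDEN)
w2, b2 = rng.normal(size=(HIDDEN, C)) * 0.05, np.zeros(C)   # assumption: output size C
att_c, a_c = channel_attention(f_m, w1, b1, w2, b2)
print(att_c.shape, a_c.shape)  # (4, 6, 6, 8) (8,)
```

Because the same MLP weights process both pooled descriptors, the module adds only two small weight matrices regardless of the spatial or temporal size of the feature maps.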

Spatial Attention
The spatial attention mechanism involves learning a weighting mechanism that assigns importance scores to different spatial locations within an image. These importance scores indicate the relevance or saliency of each location in relation to the task at hand. The mechanism typically consists of trainable parameters that are optimized during the training process. To highlight the salient object-specific regions in the feature maps, DA-R3DCNN takes advantage of inter-spatial features and their relationship between channels. This allows for more accurate tracing of the target object in the feature maps. DA-R3DCNN achieves this by computing the relation of inter-spatial features between channels through max pooling and average pooling applied to the input channel attention feature maps, resulting in max-pooled channel attention [...]. In the above equations, [...] denotes the concatenation operation, fusing $Att_C^{H \times W \times C}$ and $Att_S^{H \times W \times C}$.
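As a rough illustration of the spatial attention step, the sketch below follows the CBAM-style recipe that the paper says it modifies: channel-wise max and average pooling of the channel-attended maps, concatenation, a 3 × 3 × 3 convolution, a sigmoid, an element-wise product with the channel-attended maps, and finally the residual addition with the input feature maps. The exact formulation used by the authors is not recoverable from the text here, so the single-channel attention map, the function names, and the tensor layout are all assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv3x3x3(x, w):
    """3 x 3 x 3 convolution, stride 1, zero 'same' padding.
    x: (T, H, W, C_in), w: (3, 3, 3, C_in, C_out)."""
    t, h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (1, 1), (0, 0)))
    out = np.zeros((t, h, wd, w.shape[-1]))
    for dt in range(3):
        for dh in range(3):
            for dw in range(3):
                out += xp[dt:dt + t, dh:dh + h, dw:dw + wd] @ w[dt, dh, dw]
    return out


def spatial_attention(att_c, w):
    """att_c: channel-attended maps (T, H, W, C); w: (3, 3, 3, 2, 1)."""
    pooled = np.concatenate(
        [att_c.max(axis=-1, keepdims=True),    # max pooling across channels
         att_c.mean(axis=-1, keepdims=True)],  # average pooling across channels
        axis=-1)                               # concatenated: (T, H, W, 2)
    a_s = sigmoid(conv3x3x3(pooled, w))        # assumed attention map: (T, H, W, 1)
    return a_s * att_c                         # element-wise product with Att_C


def refine(f_m, att_c, w):
    """Residual fusion: spatially attended maps added to the input maps."""
    return spatial_attention(att_c, w) + f_m


rng = np.random.default_rng(0)
f_m = rng.normal(size=(4, 6, 6, 8))
att_c = 0.5 * f_m                              # stand-in for a channel attention output
w = rng.normal(size=(3, 3, 3, 2, 1)) * 0.1
refined = refine(f_m, att_c, w)
print(refined.shape)  # (4, 6, 6, 8)
```

The convolution here has only 3 × 3 × 3 × 2 = 54 weights, which is consistent with the text's claim that the attention module adds little parameter overhead.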

Results and Discussion
In this section, we provide a comprehensive experimental evaluation of our proposed framework on various human activity recognition datasets. This section begins by providing a brief overview of the datasets used, and the implementation details and tools utilized in this study. Afterwards, we present a detailed analysis of the experimental results obtained from the proposed framework, including a comparative analysis with the state-of-the-art human action recognition methods. Additionally, an ablation study is presented, where the proposed method was analyzed with different modifications to the network architecture. Lastly, we assess the runtime performance of our proposed framework using metrics such as seconds per frame (SPF) and frames per second (FPS). We compare the obtained runtime results with the runtime results of state-of-the-art methods.
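The two runtime metrics are reciprocals of each other: SPF is the average time spent per frame and FPS is the number of frames processed per second. A minimal timing helper might look like the following; the function name and the dummy `predict` callable are hypothetical and only stand in for a real model's inference call.

```python
import time


def measure_runtime(predict, frames):
    """Return (seconds per frame, frames per second) for a callable
    applied to each frame in turn."""
    start = time.perf_counter()
    for frame in frames:
        predict(frame)
    elapsed = time.perf_counter() - start
    spf = elapsed / len(frames)
    fps = len(frames) / elapsed
    return spf, fps


# Dummy workload standing in for model inference.
spf, fps = measure_runtime(lambda frame: sum(range(1000)), list(range(50)))
print(f"SPF = {spf:.6f} s, FPS = {fps:.1f}")
```

In practice such measurements are usually averaged over many frames after a few warm-up calls, since the first inference calls of a deep model tend to include one-time initialization cost.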

Datasets and Implementation Details
In this paper, we evaluate the performance of our DA-R3DCNN method on four publicly available benchmark datasets for human activity recognition tasks: UCF11 [27], HMDB51 [28], UCF50 [29], and UCF101 [30]. These datasets were created exclusively for the human activity recognition task and contain videos collected from different sources with different lengths, resolutions, and viewpoints of human actions. The UCF11 [27] dataset comprises 1640 videos collected from YouTube, which are categorized into 11 distinct classes of human actions. All videos in the dataset are annotated by action appearance, and each video has a spatial resolution of 320 × 240. The HMDB51 [28] dataset is a relatively large dataset, containing 6849 videos categorized into 51 categories. This dataset has a wide range of variation in camera motion, object scale, viewpoint, and background clutter, which makes it challenging for human action recognition tasks. Videos in this dataset are collected from different sources, including movies, YouTube, the Prelinger archive, and Google videos. The UCF50 [29] dataset consists of 6676 realistic videos collected from YouTube, containing human actions performed by different subjects in different environments with varying viewpoints. Videos in this dataset are divided into 50 distinct actions by action appearance in the video. Finally, UCF101 [30] is the largest of the above-mentioned datasets, containing 13,320 videos of different human actions. This dataset is the extended version of UCF50 [29], having comparatively more videos and a larger variation in actions, categorized into 101 action classes. The number of videos per class in each dataset is approximately 100 to 200, and the duration of the video clips is between 2 and 3 s, with a frame rate of 25 FPS.
For implementation, we used Python version 3 utilizing Keras with a TensorFlow 2.0 backend. We performed the experiments on a computer system equipped with an Intel(R) Xeon(R) CPU E5-2640, operating at a frequency of 2.50 GHz, and 32 GB of dedicated main memory (RAM). Additionally, we employed two dedicated Tesla GPUs with a compute capability of 7.5 as hardware resources, along with the Nvidia CUDA 11.0 library. To train the proposed DA-R3DCNN model, we used 70% of the data for training, 20% for validation, and 10% for testing the model performance after training. The same data splitting ratio was used for each dataset in the experiments of this paper. It is worth mentioning that each set of data (training, validation, and test) contained all classes, where each class consisted of videos as per the corresponding split ratios (training 70%, validation 20%, and test 10%). Further, for weight adjustment and convergence, we employed the Adam optimizer with a fixed learning rate of 0.0001 and utilized the categorical cross-entropy loss. We set the input sequence length to 16 frames, allowing the DA-R3DCNN model to extract spatiotemporal information by sliding multiple 3D kernels over the sequence of frames. To compare the performance of our DA-R3DCNN method with the state-of-the-art methods, we used two different types of evaluation: accuracy evaluation and runtime evaluation. For accuracy comparison, we compared the average accuracy of our model for each dataset with the state-of-the-art methods, whereas, for runtime performance comparison, we used two metrics: FPS and SPF.
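The class-wise 70/20/10 split described above can be sketched as a stratified split, in which each class is shuffled and divided separately so that every subset contains all classes. The helper below is an illustrative sketch with hypothetical names and synthetic video identifiers; the random seed and rounding rule are assumptions, not details from the paper.

```python
import random
from collections import defaultdict


def stratified_split(samples, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split (video_id, label) pairs into train/val/test per class."""
    by_class = defaultdict(list)
    for video, label in samples:
        by_class[label].append(video)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label, videos in by_class.items():
        rng.shuffle(videos)
        n_train = int(round(ratios[0] * len(videos)))
        n_val = int(round(ratios[1] * len(videos)))
        train += [(v, label) for v in videos[:n_train]]
        val += [(v, label) for v in videos[n_train:n_train + n_val]]
        test += [(v, label) for v in videos[n_train + n_val:]]   # remainder
    return train, val, test


# Synthetic example: 3 classes with 100 videos each.
samples = [(f"video_{c}_{i}", c) for c in range(3) for i in range(100)]
train, val, test = stratified_split(samples)
print(len(train), len(val), len(test))  # 210 60 30
```

Splitting per class rather than over the pooled list keeps the class proportions identical across the three subsets, which matters for datasets with uneven numbers of videos per class.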

Quantitative Evaluation
In this section, we present the performance evaluation of the proposed DA-R3DCNN framework. We conducted quantitative experiments on the four benchmark datasets: UCF11, UCF50, HMDB51, and UCF101. To better analyze the model performance for a specific class in each dataset, we computed the confusion matrix (reflecting true positive, false positive, true negative, and false negative predictions) based on the model predictions for each dataset. The obtained confusion matrices for the UCF11, UCF50, HMDB51, and UCF101 datasets are depicted in Figure 5. Further, the quantitative results obtained with and without the dual-attention mechanism are listed in Table 2. Based on the values in Table 2, it is evident that the proposed method performs best when combined with the dual-attention module. For instance, on the UCF11 dataset, the proposed method achieved an accuracy of 98.6% with the dual-attention module and 93.1% without it. On the HMDB51 dataset, the proposed method achieved an accuracy of 82.5% with the dual-attention module and 77.2% without it. Similarly, on the UCF101 dataset, the proposed method obtained an accuracy of 97.8% with the dual-attention module and 93.6% without it. The listed accuracy values show that the dual-attention module yielded improvements of 5.5%, 5.6%, 5.3%, and 4.2% for the UCF11, UCF50, HMDB51, and UCF101 datasets, respectively. These noticeable improvements in accuracy on each dataset validate the effectiveness of the dual-attention module for the activity recognition task.
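To make the confusion-matrix terminology above concrete, the following sketch builds a multi-class confusion matrix and derives the per-class true positive, false positive, true negative, and false negative counts from it. It is an illustrative example in plain Python with made-up labels; the function names are assumptions and do not come from the paper.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def per_class_counts(matrix, k):
    """Return (TP, FP, TN, FN) for class k in a one-vs-rest reading of the matrix."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                # class-k samples predicted as another class
    fp = sum(row[k] for row in matrix) - tp  # other samples predicted as class k
    tn = total - tp - fn - fp
    return tp, fp, tn, fn


def accuracy(matrix):
    """Overall accuracy: the diagonal sum divided by the number of samples."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Toy labels for 3 classes (made-up data, not taken from the experiments).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                       # prints: [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(per_class_counts(cm, 0))  # prints: (1, 1, 3, 1)
```

The average accuracy reported per dataset corresponds to `accuracy(cm)` in this sketch, while the off-diagonal entries show which classes are confused with each other.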

Comparison with the State-of-the-Art Methods
This section presents a comprehensive quantitative comparison between our proposed DA-R3DCNN model and state-of-the-art methods for human action recognition. The comparisons were based on average accuracy and were conducted on the UCF11, UCF50, HMDB51, and UCF101 datasets, as shown in Tables 3, 4, 5, and 6, respectively. The results in Table 3 indicate that our proposed DA-R3DCNN achieved the highest accuracy of 98.6%, surpassing all other methods. The Fusion-based discriminative features method [31] came in second place, with an accuracy of 97.8%. Among the comparative methods, the lowest accuracy on the UCF11 dataset was obtained by the Local-global features + QSVM method [32], which achieved an accuracy of 82.6%. The rest of the comparative methods included Multi-task hierarchical clustering [33], BT-LSTM [34], Deep autoencoder [35], Two-stream attention LSTM [36], Weighted entropy-variances-based feature selection [37], Dilated CNN + BiLSTM + RB [38], DS-GRU [39], Squeezed CNN [40], BS-2SCN [41], and 3DCNN [42]. These methods achieved accuracies of 89.7%, 85.3%, 96.2%, 96.9%, 94.5%, 89.0%, 97.1%, 87.4%, 90.1%, and 85.1%, respectively. Based on the comparative assessment, the proposed DA-R3DCNN achieved an average accuracy improvement of 8.47% as compared to the average results of the state-of-the-art methods on the UCF11 dataset.
For the UCF50 dataset, the results presented in Table 4 show that the proposed DA-R3DCNN framework achieved the best accuracy of 97.4%, followed by the Deep autoencoder [35] method, which obtained the runner-up accuracy of 96.4%. Among all the comparative methods on the UCF50 dataset, the Local-global features + QSVM method [32] achieved the lowest accuracy of 69.4%. The other methods in the comparison included Multi-task hierarchical clustering [33], Ensemble model with swarm-based optimization [43], DS-GRU [39], Hybrid Deep Evolving Neural Networks [44], ViT + Multi Layer LSTM [45], and 3DCNN [42], which achieved accuracies of 93.2%, 92.2%, 95.2%, 77.3%, 96.1%, and 82.6%, respectively. From the comparative results presented in Table 4, the proposed DA-R3DCNN exhibited an average accuracy improvement of 10.93% over the average results of the state-of-the-art methods on the UCF50 dataset.
Furthermore, to assess the performance generalization of our method, we conducted an analysis of confidence intervals, following the methodology outlined in [73]. This analysis was performed on each dataset used in this study, and the confidence intervals of our proposed method were compared with those of the state-of-the-art approaches. A confidence level of 95% was employed for estimating the confidence intervals of both our method and the state-of-the-art methods. The resulting confidence interval values are given in Table 7. Upon examining these values, we observe that our proposed method exhibited higher confidence bounds with narrower intervals on all the datasets when compared to the state-of-the-art methods. For instance, for the UCF11 dataset, the confidence interval of our proposed method spanned from 97.31 to 99.10, a range of only 1.79. In contrast, the average confidence interval of the state-of-the-art methods ranged from 87.74 to 94.20, a comparatively larger range of 6.46. Similarly, for the UCF50 dataset, our proposed method achieved a confidence interval between 96.78 and 98.42, a small range of 1.64, while the state-of-the-art methods had an average confidence interval ranging from 79.86 to 95.74, a larger range of 15.88. For the HMDB51 dataset, our proposed method had a confidence interval of 92.21 to 94.16, a narrow range of 1.95. In contrast, the state-of-the-art methods exhibited an average confidence interval ranging from 65.97 to 72.67, a comparatively larger range of 6.70. Lastly, for the UCF101 dataset, our proposed method had a confidence interval between 96.89 and 98.46, a small range of 1.57, whereas the state-of-the-art methods had an average confidence interval ranging from 88.93 to 93.57, a larger range of 4.64. Our proposed method thus consistently achieved higher confidence bounds across all datasets, with narrower intervals, in comparison to the state-of-the-art methods. This observation supports the effectiveness of our proposed method in surpassing existing approaches in terms of performance generalization.
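The interval ranges quoted above follow directly from the reported bounds (upper minus lower). The short sketch below recomputes them from the bounds stated in the text; it only reproduces this arithmetic and does not implement the estimation procedure of [73], which is not described here. The dictionary layout and the names `REPORTED_INTERVALS` and `interval_width` are assumptions for illustration.

```python
# (lower, upper) bounds in percent, copied from the text above.
REPORTED_INTERVALS = {
    "UCF11": {"proposed": (97.31, 99.10), "sota_avg": (87.74, 94.20)},
    "UCF50": {"proposed": (96.78, 98.42), "sota_avg": (79.86, 95.74)},
    "HMDB51": {"proposed": (92.21, 94.16), "sota_avg": (65.97, 72.67)},
    "UCF101": {"proposed": (96.89, 98.46), "sota_avg": (88.93, 93.57)},
}


def interval_width(lower, upper):
    """Width of a confidence interval, rounded to two decimals."""
    return round(upper - lower, 2)


for dataset, intervals in REPORTED_INTERVALS.items():
    proposed = interval_width(*intervals["proposed"])
    baseline = interval_width(*intervals["sota_avg"])
    print(f"{dataset}: proposed width {proposed:.2f}, state-of-the-art average width {baseline:.2f}")
```

A narrower interval at the same 95% level indicates less uncertainty in the estimated accuracy, which is the sense in which the comparison is made.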
The comparative assessments validate the effectiveness of the proposed DA-R3DCNN, based on the improvement obtained on each dataset used in this work. These results support the robustness of our proposed DA-R3DCNN framework over the state-of-the-art methods for the human action recognition task.

Run Time Analysis
In this section, we examine the inference time of our proposed DA-R3DCNN framework and assess its suitability for real-time human activity recognition tasks using the SPF and FPS metrics. To evaluate the overall run time performance, we measured the inference time of the proposed DA-R3DCNN model on both GPU and CPU computing platforms. These measurements were then compared with the run time results of the state-of-the-art human activity recognition methods, and the findings are presented in Table 8. The results in Table 8 highlight the inference efficiency of the proposed DA-R3DCNN model as compared to the state-of-the-art methods on both platforms. When utilizing GPU resources, our proposed DA-R3DCNN achieved the best SPF of 0.0045 and an FPS of 240. The runner-up method, OFF [61], achieved an SPF of 0.0048 and an FPS of 215. Conversely, the Videolstm [17] method exhibited the highest SPF of 0.0940 and the lowest FPS of 10.6, indicating the least favorable run time performance among all the comparative methods. On the CPU computing platform, the proposed DA-R3DCNN framework outperformed existing methods by a wide margin, achieving an SPF of 0.0058 and an FPS of 187. In comparison, the second-best performing method, Optical-flow + Multi-layer LSTM [47], achieved an SPF of 0.18 and an FPS of 3.5. The Deep autoencoder [35] method exhibited the poorest run time performance, with an SPF of 0.43 and an FPS of 1.5. These results further validate the run time efficiency of our proposed DA-R3DCNN framework when compared to alternative approaches on the CPU computing platform.
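As a rough illustration of how SPF and FPS can be measured, the sketch below times a prediction function over a set of clips and reports both metrics from the same timed run. It is a generic example under stated assumptions, not the measurement code used in the paper: `measure_runtime` and `dummy_predict` are hypothetical names, and the stand-in predictor does not run a neural network.

```python
import time


def measure_runtime(predict_fn, clips, frames_per_clip):
    """Return (SPF, FPS) for running `predict_fn` once on every clip."""
    start = time.perf_counter()
    for clip in clips:
        predict_fn(clip)
    elapsed = time.perf_counter() - start
    total_frames = len(clips) * frames_per_clip
    spf = elapsed / total_frames   # seconds per frame: lower is better
    fps = total_frames / elapsed   # frames per second: higher is better
    return spf, fps


def dummy_predict(clip):
    """Stand-in for a model's inference call (hypothetical, does no real inference)."""
    return sum(value * value for value in clip) % 11


# 50 made-up "clips"; each is treated as a 16-frame input sequence.
clips = [list(range(10_000)) for _ in range(50)]
spf, fps = measure_runtime(dummy_predict, clips, frames_per_clip=16)
print(f"SPF = {spf:.6f} s, FPS = {fps:.1f}")
```

In practice one would typically discard a few warm-up runs before timing, since the first calls on a GPU often include one-off initialization costs.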
Further, to ensure a fair comparison of the run time results obtained for both GPU and CPU platforms, we scaled the run time results of the state-of-the-art methods (as in [74]) to match the hardware resources utilized in our study (i.e., a 2.5 GHz CPU and a 585 MHz GPU). The scaled run time results are provided in the second section of Table 8, enabling an equitable comparison of the proposed DA-R3DCNN framework against existing methods. The scaled results in Table 8 show that scaling amplifies the advantages of the proposed DA-R3DCNN model in terms of the SPF and FPS metrics on both platforms. When utilizing GPU resources, the proposed DA-R3DCNN outperformed the other methods, with the best SPF of 0.0045 and an FPS of 240. The STPP + LSTM [46] method secured the second-best position, with SPF and FPS values of 0.0063 and 154.6, respectively. The Videolstm [17] method had the highest SPF of 0.1606 and the lowest FPS of 6.2, the worst run time results among all the comparative methods. On the CPU computing platform, the proposed DA-R3DCNN framework had the lowest SPF of 0.0058 and the highest FPS of 187, the best results among the comparative methods. Among the scaled results in Table 8, the Optical-flow + Multi-layer LSTM [47] method emerged as the runner-up, with an SPF of 0.23 and an FPS of 2.6 on the CPU computing platform. The Deep autoencoder [35] method exhibited the least favorable performance on CPU resources, with an SPF of 0.56 and an FPS of 1.1. These findings further support the run time advantage of the proposed DA-R3DCNN model over alternative methods. The best and runner-up SPF and FPS scores for GPU and CPU are highlighted in bold and italic text, respectively.
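One simple way to normalize run times across hardware is to scale them linearly by the ratio of clock frequencies. The sketch below illustrates that idea only; it is an assumption that the scaling of [74] takes this form, and the function name and the example frequencies are hypothetical.

```python
def scale_spf(spf, source_clock_mhz, target_clock_mhz):
    """Rescale seconds-per-frame measured at one clock speed to another.

    Assumes run time is inversely proportional to clock frequency, which
    ignores memory bandwidth, core count, and architectural differences.
    """
    return spf * source_clock_mhz / target_clock_mhz


# Made-up example: 0.1 s/frame measured on a 1000 MHz device,
# rescaled to a slower 500 MHz device.
scaled = scale_spf(0.1, source_clock_mhz=1000, target_clock_mhz=500)
print(f"scaled SPF = {scaled:.3f} s")  # prints: scaled SPF = 0.200 s
```

Under this model, results measured on hardware faster than the target get a larger SPF after scaling, which is why a method evaluated on the slower target hardware gains ground in the scaled comparison.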
It is evident from the scaled and non-scaled results in Table 8 that the proposed DA-R3DCNN provides significant improvement on both GPU and CPU computing platforms. For instance, for the non-scaled run time results, the proposed DA-R3DCNN provided an improvement of up to 7× for SPF and 3× for FPS when running on GPU resources. When running on CPU resources, the proposed DA-R3DCNN achieved an improvement of up to 52× for SPF and 74× for FPS for the non-scaled run time results. Similarly, for the scaled run time results, the proposed DA-R3DCNN provided an improvement of 13× for SPF and 5× for FPS when running on GPU resources. When running on CPU resources, the proposed DA-R3DCNN framework achieved an improvement of 68× for SPF and 100× for FPS. These results show the efficiency and applicability of the proposed DA-R3DCNN method for real-time human activity recognition in resource-constrained environments.

Conclusions and Future Research Directions
In this work, we have proposed an attention-driven 3DCNN with residual skip connections for recognizing human activities in videos. The proposed method combines the characteristics of dual channel-spatial attention and a residual 3D convolutional neural network (3DCNN) into a unified framework for efficient modeling of human actions and single instance training. The dual channel-spatial attention mechanism incorporates both channel and spatial attention, enabling the extraction of highly discriminative features from regions of interest related to the objects involved in the activities. This results in high-quality feature maps containing object saliency-aware features, which boost the overall learning process of the proposed residual 3DCNN network. By employing a residual 3DCNN coupled with dual attention, our method, DA-R3DCNN, effectively captures the temporal dynamics of human actions through the use of multiple 3D kernels. By leveraging the knowledge acquired from the immediately preceding frames within the input sequence, the model becomes capable of learning the spatial and temporal relationships within unseen frames. This enables the model to grasp the connections and patterns existing between different frames. The incorporation of attention-guided learning further enhances our method's capability to acquire spatial and temporal understanding of human actions, leading to improved learning performance during training and enhanced prediction accuracy during inference.
We have extensively evaluated the performance of our proposed DA-R3DCNN method on four widely recognized benchmark datasets for human action recognition: UCF11, UCF50, HMDB51, and UCF101. These datasets are well established in the research community and serve as reliable benchmarks for comparison. Through experimentation and comparison with the state-of-the-art approaches, we have demonstrated the advantage of our method in terms of model robustness and computational efficiency. The obtained results validate the efficacy of our approach in tackling the challenges of human action recognition across diverse datasets. Further, we have assessed the run time performance of our proposed framework in terms of seconds per frame (SPF) and frames per second (FPS) on both CPU and GPU execution environments. This analysis has allowed us to measure the computational efficiency of our method and to provide insights into the speed of frame processing and overall video processing capabilities across different hardware configurations. The run time assessment indicates that, when leveraging GPU resources, the proposed DA-R3DCNN method achieves an improvement of up to 13× in SPF and 5× in FPS as compared to the state-of-the-art methods. Additionally, even when limited to CPU resources, our approach achieves substantial gains, with SPF improving by 68× and FPS by 100× as compared to existing approaches. These findings indicate that the proposed DA-R3DCNN method is well suited for real-time human activity recognition on resource-constrained devices.
While our current implementation of the DA-R3DCNN method leverages a dual attention mechanism (channel and spatial attention), which has proven to be highly effective, in our future work we plan to incorporate a temporal attention mechanism for precise temporal localization of human activities within video scenes. Additionally, we are actively exploring the integration of multi-modal data, which holds significant potential for recognizing complex human activities in uncertain environments. These future advancements aim to further enhance the capabilities of our method in capturing both spatial and temporal dynamics for improved activity recognition performance.

Figure 1. The graphical abstract of our proposed DA-R3DCNN network architecture.

Figure 2. The visual overview of the utilized 3D residual convolutional block used in this study, with convoluted shortcut path.

Figure 3. The visual overview of the dual channel-spatial attention module.

Figure 4. Architecture of the dual channel-spatial attention module.
$Att_{C\text{-}max}^{H \times W \times 1}$ and average-pooled channel attention $Att_{C\text{-}avg}^{H \times W \times 1}$, respectively. The concatenated max-pooled channel attention $Att_{C\text{-}max}^{H \times W \times 1}$ and average-pooled channel attention $Att_{C\text{-}avg}^{H \times W \times 1}$ are passed through a 3 × 3 convolutional layer $Conv_{3 \times 3}$ to form single-channel convoluted feature maps. The resulting maps are then normalized by a sigmoid activation function to produce intermediate spatial attention features $A_{S}^{H \times W \times 1}$. These intermediate features are fused with the input channel attention feature maps $Att_{C}^{H \times W \times C}$ using a residual skip connection through element-wise multiplication operations to obtain the final spatial attention feature maps $Att_{S}^{H \times W \times C}$, as illustrated in Figure 4. Mathematically, spatial attention and its key components can be expressed as follows: $Att_{C\text{-}max}^{H \times W \times 1} = \mathrm{maxpool}(Att_{C}^{H \times W \times C}$
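The spatial attention steps described in this passage (channel-wise max and average pooling, concatenation, a 3 × 3 convolution, a sigmoid, and element-wise multiplication with the input) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' Keras implementation: the nested-list tensor format, the zero padding at the borders, and the kernel values are assumptions.

```python
import math


def spatial_attention(features, kernel, bias=0.0):
    """Apply spatial attention to an H x W x C feature map given as nested lists.

    `kernel` is a 3 x 3 x 2 list of convolution weights (hypothetical values);
    its last axis matches the two pooled maps (max, average).
    """
    height, width = len(features), len(features[0])
    # Channel-wise max and average pooling: two H x W x 1 maps, kept side by side.
    pooled = [[(max(px), sum(px) / len(px)) for px in row] for row in features]

    attended = []
    for i in range(height):
        out_row = []
        for j in range(width):
            # 3 x 3 convolution over the 2-channel pooled map, zero-padded borders.
            acc = bias
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < height and 0 <= x < width:
                        acc += kernel[di][dj][0] * pooled[y][x][0]
                        acc += kernel[di][dj][1] * pooled[y][x][1]
            gate = 1.0 / (1.0 + math.exp(-acc))  # sigmoid gives a weight in (0, 1)
            # Element-wise multiplication of the gate with every input channel.
            out_row.append([gate * value for value in features[i][j]])
        attended.append(out_row)
    return attended


# Toy 2 x 2 x 2 feature map and an all-zero kernel (made-up values).
features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
zero_kernel = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
output = spatial_attention(features, zero_kernel)
print(output[0][0])  # prints: [0.5, 1.0]
```

With an all-zero kernel the sigmoid gate equals 0.5 everywhere, so the output is half the input; trained kernel weights would instead raise the gate at salient locations and lower it elsewhere.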

The first value in the square brackets represents the lower bound and the second value represents the upper bound. Together, the lower and upper bounds represent the confidence interval.

Figure 6. The graphical overview of the conducted comparative analysis of our proposed DA-R3DCNN with the state-of-the-art methods on (a) UCF11 dataset, (b) UCF50 dataset, (c) HMDB51 dataset, and (d) UCF101 dataset.

Table 1. Architectural overview of our proposed DA-R3DCNN framework.

Table 2. The average accuracies obtained by our proposed framework with and without the dual-attention module on UCF11, UCF50, HMDB51, and UCF101 datasets.

Table 3. Comparative analysis of the proposed DA-R3DCNN with the state-of-the-art methods on the UCF11 dataset.

Table 4. Comparative analysis of the proposed DA-R3DCNN with the state-of-the-art methods on the UCF50 dataset. The bold value represents the best accuracy, whereas the italic value indicates the runner-up accuracy.

Table 5. Comparative analysis of the proposed DA-R3DCNN with the state-of-the-art methods on the HMDB51 dataset.

Table 7. The obtained confidence interval values (with 95% confidence) for our proposed method and state-of-the-art mainstream methods.

Table 8. Comparison of the run time performance between our proposed DA-R3DCNN framework and state-of-the-art human action recognition methods, considering both scaled and unscaled results.