Article

Human Action Recognition Based on 3D Convolution and Multi-Attention Transformer

1 College of Information Science & Technology, Qingdao University of Science & Technology, Qingdao 266101, China
2 College of Information Engineering, Xinjiang Institute of Engineering, Urumqi 830023, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2695; https://doi.org/10.3390/app15052695
Submission received: 3 January 2025 / Revised: 24 February 2025 / Accepted: 25 February 2025 / Published: 3 March 2025

Abstract:
To address the limitations of traditional two-stream networks, such as inadequate spatiotemporal information fusion, limited feature diversity, and insufficient accuracy, we propose an improved two-stream network for human action recognition based on a multi-scale attention Transformer and 3D convolution (C3D). In the temporal stream, the traditional 2D convolution is replaced with a C3D network to effectively capture temporal dynamics and spatial features. In the spatial stream, a multi-scale convolutional Transformer encoder is introduced to extract features. Leveraging the multi-scale attention mechanism, the model captures and enhances features at various scales, which are then adaptively fused using a weighted strategy to improve feature representation. Furthermore, through extensive experiments on feature fusion methods, the optimal fusion strategy for the two-stream network is identified. Experimental results on benchmark datasets such as UCF101 and HMDB51 demonstrate that the proposed model achieves superior performance in action recognition tasks.

1. Introduction

Human action recognition involves extracting and analyzing human movements and behavioral features from video sequences to determine the specific categories of actions being performed. This technology is essential for enabling machines to interpret human behavior and serves as a cornerstone in the field of computer vision. It has significant importance and broad application potential across various domains, including intelligent security, virtual reality, human–computer interaction, and intelligent caregiving, where it can deliver substantial economic and social benefits.
Currently, deep learning-based human action recognition methods have attracted considerable attention and produced numerous research results. Depending on the network architecture used, they can be categorized into four types: 3D convolutional networks (3D CNNs) [1,2,3], two-stream networks [4,5], Transformer models [6,7], and graph convolutional networks (GCNs) [8,9].
Three-dimensional convolution has the advantage of simultaneously modeling spatiotemporal information and has been widely used in action recognition tasks. For example, Duan [1] emphasized that 3D CNNs can better integrate multiple modalities, such as RGB and optical flow information, thus improving recognition results. Ou [2] proposed a 3D deformable convolution temporal reasoning network (DCTR) to model and reason about the dependency relationships among multiple entities in the spatiotemporal dimensions and enhance recognition performance. Although 3D CNN-based methods can model spatiotemporal features simultaneously, there is still room for improvement in handling long-term dependencies.
Two-stream networks extract spatiotemporal information through two separate branches, enabling richer feature representations and more precise extraction of motion information, leading to higher recognition accuracy. For example, Pang [4] improved recognition performance via contrastive learning in the two-stream network. Xie [5] designed a multimodal two-stream fusion network to enhance the saliency of action-related characteristics. Although two-stream networks can extract spatiotemporal features, the interaction and fusion strategies of the spatial and temporal streams, as well as the representation and learning methods of temporal information, still need improvement. Additionally, most methods using 2D CNN cannot effectively capture the temporal information between video frames.
In recent years, Transformers have been introduced into the field of video understanding. The application of the Transformer model provides a new paradigm for action recognition. Its powerful long-range dependency modeling, multimodal fusion ability, and good scalability make it particularly suitable for handling long sequence data and complex temporal contextual relationships in videos. For example, Zhao [6] proposed a spatiotemporal two-stream multi-scale Transformer (STDM Transformer) model for action recognition. Wu [7] proposed a Transformer-based multi-view spatiotemporal feature interaction fusion framework, which achieves effective action recognition through deep fusion of multi-view information. Although the performance of action recognition based on Transformer has greatly improved, Transformer models still face challenges such as large data requirements and insufficient local information capture.
The method based on a GCN can leverage the topological structure of skeletal data to capture the relationships between different joint points, providing a more natural representation of spatiotemporal features. For example, Xie [8] proposed a dynamic semantic-based graph convolutional network (DS-GCN) action recognition method. Liu [9] introduced an action recognition method based on graph prototype learning (ProtoGCN). Although graph convolutional networks (GCNs) perform well in modeling human skeleton sequences and joint relationships, they face challenges in fusing spatiotemporal features and handling long-term temporal data.
In summary, to fully leverage the advantages of various networks, we propose a two-stream network model for human action recognition based on a multi-scale attention Transformer and 3D convolution (MAT-C3D). By constructing a temporal network based on 3D convolution to capture the temporal dependencies between video frames, it can better understand the continuous changes of actions. By constructing a spatial network based on a multi-scale Transformer model, it can capture long-term dependencies and generate receptive fields of different sizes for feature fusion. Moreover, an adaptive weighting mechanism is employed for feature fusion, which enriches the diversity of features and better adapts to long-term, multi-scale action recognition tasks in complex environments.
The main work and contributions are summarized as follows:
1.
A brand-new two-stream network structure containing a C3D temporal network and a multi-scale Transformer spatial network has been constructed, which fully utilizes the correlation between spatial and temporal information to extract spatiotemporal features with interactive relationships, effectively improving the action recognition ability with long-term dependencies.
2.
A multi-scale convolution module is introduced to replace the original FFN module in the Transformer; after adding context-based relative position encoding to the input, adaptively integrated multi-scale features with stronger representation capabilities are obtained, effectively improving the recognition accuracy of the model in complex environments.
3.
A more efficient feature fusion strategy has been designed, and the spatiotemporal feature fusion method of the two-stream network has been improved by using weighted fusion, which enhances the model’s ability to model complex spatiotemporal interaction relationships and further improves the recognition accuracy in complex environments.
4.
Experimental results on the UCF101 and HMDB51 datasets demonstrate that the proposed algorithm achieves high accuracy in human action recognition.
The rest of this paper is organized as follows: Section 2 reviews related work in action recognition, including 3D convolution, two-stream networks, and Transformer models, and discusses the limitations of existing methods. Section 3 introduces the proposed two-stream action recognition network based on the Transformer and C3D, including the C3D time-stream convolutional network, the multi-scale convolution Transformer, positional encoding, and the multi-scale residual model. Section 4 discusses the feature fusion methods and identifies the optimal fusion strategy. Section 5 describes the experimental setup, including the software and hardware environments and the datasets used, and presents the experimental results, including the main results, ablation studies, and an extended analysis. Finally, Section 6 concludes the paper.

2. Related Work

2.1. Action Recognition Based on 3D Convolution

Compared to 2D convolution, 3D convolutional networks are better suited to learning spatiotemporal features and have been widely used in human action recognition. Tran [10] first applied 3D convolution to video (Convolutional 3D, C3D), streamlining the network structure and processing multiple frames at a time, thus improving computational efficiency. Qiu [11] used parallel and serial residual connections to construct a network based on small convolutional kernels, which was optimized in terms of running speed and parameter count but suffered from long training times and was prone to overfitting when the data distribution was uneven. Tran [12] decomposed the 3D convolutional filter into independent spatial and temporal components, which significantly improved accuracy but introduced a large number of parameters prone to overfitting and processed global information in the video insufficiently. Yang [13] proposed an integrated model combining a 3D convolutional network (C3D), LSTM, and a spatiotemporal attention mechanism, effectively enhancing recognition accuracy. The above methods directly use 3D networks to learn human action representations from videos. Owing to the limited video frame length, their ability to learn spatial contour information and temporal motion information is restricted, and such models cannot fully capture the spatiotemporal dependencies of long-term motion. To fully leverage the advantages of 3D convolution, we embed it into the time-stream branch of a two-stream network to better extract global contextual features.

2.2. Action Recognition Based on Two-Stream Networks

Feichtenhofer [14] proposed the SlowFast network for video action recognition, which uses a low frame rate to capture semantics and a high frame rate to capture motion, achieving higher accuracy with lower computational complexity, although its performance on small datasets is unstable. Tao [15] used residual frames instead of optical flow to capture motion information, avoiding complex optical flow computation with a simple structure and good results; however, this approach requires extensive tuning of the model configuration and parameters and has limitations in processing long temporal sequences. Li [16] proposed a Short-term Action Discriminant Attention (SADA) model based on a two-stream structure, which incorporates a fast–slow attention mechanism to emphasize the varying importance of short-term actions and enhance recognition performance. Zhou [17] introduced a highly efficient two-stream network utilizing multi-head attention for action recognition, which captures key action information through an attention mechanism to differentiate similar actions in videos. However, these methods employ simple feature fusion approaches that fail to fully integrate the interactive relationships between spatiotemporal features, overlooking the correlation between spatial and temporal information. To address these limitations, we improve the two-stream network by introducing multi-scale attention into the spatial stream and 3D convolution into the temporal stream, enriching the diversity of feature extraction. Furthermore, recognition performance is significantly enhanced through an adaptive weighted feature fusion approach.

2.3. Action Recognition Based on Transformer Model

With the tremendous success of the Transformer model in natural language processing, it has been widely applied to action recognition [18,19,20,21,22,23,24,25,26,27,28,29]. For example, to model spatiotemporal features, Pareek [18] improved motion detection and enhanced the understanding of human motion in adjacent frames by utilizing spatial and temporal attributes. Ma [19] proposed a relative position embedding model based on spatial and temporal decoupling (RPE-STDT) for action recognition; it replaces absolute position embedding with relative position embedding, effectively captures token sequences, reduces computational costs, and optimizes recognition performance. Qiu [20] proposed the STTFormer network, which utilizes a spatiotemporal tuple self-attention module to capture the relationships between different joints in consecutive frames. Do [21] proposed the SkateFormer network, which partitions joints and skeletons based on different types of skeletal–temporal relationships and performs self-attention within each partition, alleviating the large resource requirements of computing correlations between all joints. Liu [22] proposed the TranSkeleton action recognition method, which effectively captures long-range dependencies and subtle temporal structures. In addition, Sun [23] used k-NN attention instead of the original self-attention, optimizing the training process by ignoring irrelevant or noisy tokens in the input sequence and achieving efficient recognition; this modification makes the original Transformer model more suitable for action recognition tasks. Zhang [24] proposed ActionFormer, a single-stage, anchor-free temporal action localization model that uses a lightweight decoder to classify actions at every moment and estimate the corresponding action boundaries. Zheng [25] employed an attention mechanism and adopted a two-stage training strategy in which two motion estimates mutually enhance each other during training. Yang [26] proposed the Recurrent Vision Transformer (RViT) for spatiotemporal feature learning in video action recognition. However, the above methods all perform recognition at a single scale, ignoring the fact that feature scales change during recognition, which leads to uneven attention allocation and limited recognition performance. We therefore introduce a multi-scale Transformer model into the two-stream network to model spatial features. Compared with single-scale methods, the attention allocation is more balanced and accurate, and actions of different granularities can be better identified.

3. Transformer and C3D-Based Two-Stream Network Modeling

The original two-stream convolutional network uses two parallel branches: one branch learns static spatial features such as appearance and background, taking as input the RGB image of a single frame of the video sequence; the other branch uses a convolutional neural network of the same structure to process multiple frames of optical flow images and extract dynamic features in the temporal dimension. The final prediction for the video is obtained by fusing the results of the two branches. Because the same 2D CNN is used in both the temporal and spatial branches and only 2D feature maps are output after convolution, temporal information is limited and model accuracy is insufficient. In this paper, 3D convolution is introduced into the temporal-stream network, and the temporal modeling problem is addressed using the C3D network. In the spatial-stream network, a multi-scale convolutional Transformer is used to generate feature representations with residual-connected multi-head self-attention, further enhancing the feature representation capability of the model. The optimal network architecture is finally obtained by testing different spatiotemporal feature fusion strategies. The overall network architecture is shown in Figure 1.

3.1. C3D Time-Stream Convolutional Neural Network

Two-dimensional convolutional neural networks work well for feature extraction on single frames but cannot capture temporal information well in video-based tasks, which is why three-dimensional convolutional neural networks were introduced. Compared to 2D CNNs, 3D CNNs model temporal information better. The C3D convolutional structure is shown in Figure 2.
Figure 2a shows a two-dimensional convolution applied to a frame of a video sequence; after the convolution operation, the output is a two-dimensional feature map, which loses the temporal information. Figure 2b shows a three-dimensional convolution, whose output is a three-dimensional feature map that retains the temporal information of the video frame sequence. Using the C3D network for feature extraction in the temporal dimension enables better extraction of temporal features from continuous optical flow images. The C3D network structure is shown in Figure 2c; it employs five convolutional layers, five pooling layers, two fully connected layers, and a softmax classifier. All convolutional layers use 3 × 3 × 3 kernels; the first pooling kernel is 1 × 2 × 2 and the remaining pooling kernels are 2 × 2 × 2. The C3D network has a simple structure and trains quickly, enabling better extraction of temporal features.
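As a concrete illustration, the following PyTorch sketch assembles a C3D-style temporal-stream backbone with the layer counts and kernel/pooling sizes described above; the channel widths, the fully connected size, and the 16-frame 112 × 112 input clip are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal PyTorch sketch of a C3D-style temporal-stream backbone with five
# convolutional layers, five pooling layers, two fully connected layers, and a
# softmax classifier. Channel widths, FC size, and input clip shape are assumed.
import torch
import torch.nn as nn

class C3DStream(nn.Module):
    def __init__(self, num_classes: int = 101, in_channels: int = 3):
        super().__init__()
        chans = [64, 128, 256, 512, 512]              # assumed channel widths
        pools = [(1, 2, 2)] + [(2, 2, 2)] * 4         # first pool keeps the temporal dimension
        layers, c_in = [], in_channels
        for c_out, p in zip(chans, pools):
            layers += [
                nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),   # 3x3x3 kernels
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=p, stride=p),
            ]
            c_in = c_out
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),             # softmax is applied at inference time
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, T, H, W)
        return self.classifier(self.features(x))

clip = torch.randn(2, 3, 16, 112, 112)                     # a batch of 16-frame clips
logits = C3DStream()(clip)                                 # (2, 101)
```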

3.2. Multi-Scale Convolution Transformer

We improve the Transformer model and apply it to feature extraction in the spatial dimension of the two-stream network. The encoder module for feature extraction consists of two sublayers, multi-head self-attention and an adaptive-scale convolutional neural network, each followed by a residual connection and layer normalization. The overall structure is shown in Figure 3.
The multi-head self-attention layer generates a hidden representation from the input using the multi-head self-attention mechanism, which can directly integrate information from the entire input sequence, overcoming the long-term dependency problem of RNNs and LSTMs. Assume the input sequence is $X \in \mathbb{R}^{L \times d}$. Using the scaled dot-product attention mechanism, the query matrix $Q$, key matrix $K$, and value matrix $V$ are generated by linear projection:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where $W^Q \in \mathbb{R}^{d \times d_k}$, $W^K \in \mathbb{R}^{d \times d_k}$, and $W^V \in \mathbb{R}^{d \times d_v}$ are learnable projection matrices. The query is compared with the keys of dimension $d_k$; the weight assigned to each value is computed from the dot product of the query and the corresponding key, and the output is the weighted sum of the values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Scaling by $\sqrt{d_k}$ instead of $d_k$ prevents the dot products from becoming too large while smoothing the scaling factor. Multi-head self-attention runs multiple self-attention modules in parallel and combines the outputs of all attention heads into a final result. The input $X$ is passed to each of the $h$ self-attention modules, and the output matrices are concatenated and passed through a linear layer to obtain the final output:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$

where $W^O \in \mathbb{R}^{hd_v \times d}$ is the output projection matrix. To facilitate the residual connections, all sublayers use the same dimension as the input layer, i.e., $d_k = d_v = d/h$. In the traditional Transformer, the output of the multi-head self-attention layer is fed into a feed-forward network consisting of two fully connected layers. In this paper, a multi-scale convolutional module replaces the feed-forward network; it captures regional feature information at different scales by using convolutional kernels of different sizes and an adaptive scale-attention mechanism to balance the scales. Given a set of kernel sizes $K = \{k_1, \ldots, k_n\}$, $P_i$ is the feature matrix produced by the $i$th kernel $k_i$:
$$P_i = \mathrm{ReLU}(\mathrm{Dropout}(\mathrm{BN}(\mathrm{Conv}(W_{k_i}; X))))$$

where $X$ is the output feature of the multi-head self-attention module and $W_{k_i} \in \mathbb{R}^{d \times k_i \times d}$ is a filter bank with a specific receptive field, consisting of $d$ filters of size $k_i \times d$. The module is computed by four operations: convolution (Conv), batch normalization (BN), dropout, and a ReLU activation unit, with the Conv operation at its core. Feature $X$ is zero-padded on both sides by $(k_i - 1)/2$ so that the generated feature map keeps size $L \times d$. The outputs of the convolution blocks form $P = \{P_1, \ldots, P_n\}$. Because the individual single-scale features may be incomplete or redundant, an FFN-based adaptive-scale attention mechanism integrates them into a single feature vector, adaptively combining features at different scales. The integrated feature $P_{ens}$ is defined as follows:

$$P_{ens} = \alpha \cdot P = \sum_i \alpha_i P_i$$

where $\alpha_i$ is the attention score of $P_i$ at scale $k_i$ and $\alpha = (\alpha_1, \ldots, \alpha_n)$ is the vector of attention scores, computed as

$$\alpha = \mathrm{softmax}(\mathrm{FFN}(P))$$

where $\mathrm{FFN}(\cdot)$ is a two-layer fully connected feed-forward network:

$$\mathrm{FFN}(P_i) = \mathrm{ReLU}(P_i W_1 + b_1)W_2 + b_2$$

where $W_1 \in \mathbb{R}^{d \times d_h}$ and $W_2 \in \mathbb{R}^{d_h \times 1}$ are parameter matrices with hidden dimension $d_h$, and $b_1$ and $b_2$ are bias terms. Figure 4 gives an example in which convolution kernels with scales 1, 3, and 7 generate three feature vectors, which are integrated into the final feature through the adaptive-scale attention mechanism.
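To make the module concrete, the following PyTorch sketch implements the multi-scale convolution sublayer under stated assumptions: the mean pooling over the sequence used to obtain one attention score per scale, and the hidden width d_hidden, are our choices, since the text above does not fix them.

```python
# Sketch of the multi-scale convolution sublayer that replaces the Transformer
# FFN: one convolution per kernel scale, then adaptive scale attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleConv(nn.Module):
    def __init__(self, d_model: int, kernel_sizes=(1, 3, 7), d_hidden: int = 128, dropout: float = 0.1):
        super().__init__()
        # One length-preserving 1D convolution per scale; padding (k-1)//2 keeps length L.
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(d_model, d_model, k, padding=(k - 1) // 2),
                nn.BatchNorm1d(d_model),
                nn.Dropout(dropout),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        ])
        # Two-layer FFN that produces one attention score per scale.
        self.score_ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(inplace=True), nn.Linear(d_hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, L, d)
        x_c = x.transpose(1, 2)                            # (B, d, L) for Conv1d
        feats = [conv(x_c).transpose(1, 2) for conv in self.convs]
        P = torch.stack(feats, dim=1)                      # (B, n, L, d), one P_i per scale
        scores = self.score_ffn(P.mean(dim=2))             # (B, n, 1); mean over L is assumed
        alpha = F.softmax(scores, dim=1)                   # attention scores over scales
        return (alpha.unsqueeze(-1) * P).sum(dim=1)        # integrated feature P_ens: (B, L, d)

x = torch.randn(2, 49, 256)                                # e.g. 49 patch tokens, d = 256
p_ens = MultiScaleConv(256)(x)                             # (2, 49, 256)
```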

3.3. Positional Encoding

Since the Transformer processes the input data in parallel, information about the positions of the image patches is lost, so positional information about the sequence must be injected; positional encoding is therefore added at the bottom of the encoder. The original Transformer used absolute positional encoding, where only the absolute position in the sequence is input. Relative position encoding, in addition to the absolute positions of the input tokens, encodes the relative distances between tokens and learns pairwise relationships between them. In this paper, context-based relative position encoding is used in the attention mechanism to improve the expressiveness of the model [13]. Suppose the input sequence is $X = (X_1, \ldots, X_n)$ and the output sequence is $Z = (Z_1, \ldots, Z_n)$. Each output element $Z_i$ represents the relationship of the current position $i$ to all positions in the sequence and is the result of a linear weighting:

$$Z_i = \sum_{j=1}^{n} \beta_{ij}(X_j W^V)$$

where $\beta_{ij}$ is a weight obtained by the softmax function

$$\beta_{ij} = \mathrm{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k=1}^{n}\exp(e_{ik})}$$

where $e_{ij}$ is obtained by scaling the dot product of the two elements:

$$e_{ij} = \frac{(X_i W^Q)(X_j W^K)^{T}}{\sqrt{d_k}}$$

Context-based relative positional encoding takes into account the interactions with queries, keys, and values; its core idea is to add a learnable relative position term to $e_{ij}$:

$$e_{ij} = \frac{(X_i W^Q)(X_j W^K)^{T} + b_{ij}}{\sqrt{d_k}}$$

$$b_{ij} = (X_i W^Q)\, r_{ij}^{T}$$

where $r_{ij}$ is a trainable vector. To compute relative positions on a 2D image, the relative weight $r_{ij}$ is mapped directionally using a cross method, in which the position encodings in the horizontal and vertical directions are computed separately and combined as follows:

$$r_{ij} = p^{x}_{I_x(i,j)} + p^{y}_{I_y(i,j)}$$

$$I_x(i,j) = g(x_i - x_j)$$

$$I_y(i,j) = g(y_i - y_j)$$

where $p^{x}_{I_x(i,j)}$ and $p^{y}_{I_y(i,j)}$ are learnable vectors that store the weights of relative positions, and $g(\cdot)$ is a many-to-one piecewise indexing function that maps relative positions to weights; mapping multiple relative positions to the same encoding reduces the number of parameters introduced by the relative positions.
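The sketch below shows one possible realization of this context-based 2D relative position bias; treating $g(\cdot)$ as clipping of the relative offsets to ±max_dist is an assumption of ours, since the description above only requires that several offsets share one encoding.

```python
# One possible realization of the context-based 2D relative position bias:
# learnable per-offset vectors p^x, p^y are gathered by clipped relative
# offsets (assumed form of g) and combined with the queries to give b_ij.
import torch
import torch.nn as nn

class ContextualRelPosBias(nn.Module):
    def __init__(self, d_head: int, grid_h: int, grid_w: int, max_dist: int = 7):
        super().__init__()
        self.px = nn.Parameter(torch.zeros(2 * max_dist + 1, d_head))
        self.py = nn.Parameter(torch.zeros(2 * max_dist + 1, d_head))
        ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
        coords = torch.stack([xs.flatten(), ys.flatten()])        # (2, L)
        rel = coords[:, :, None] - coords[:, None, :]             # pairwise offsets (2, L, L)
        rel = rel.clamp(-max_dist, max_dist) + max_dist           # g(.): clip and shift to >= 0
        self.register_buffer("idx_x", rel[0])                     # (L, L) indices into px
        self.register_buffer("idx_y", rel[1])                     # (L, L) indices into py

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: (B, heads, L, d_head) holds the projected queries X W^Q.
        r = self.px[self.idx_x] + self.py[self.idx_y]             # r_ij: (L, L, d_head)
        # b_ij = q_i . r_ij, added to the attention logits before scaling by sqrt(d_k).
        return torch.einsum("bhid,ijd->bhij", q, r)               # (B, heads, L, L)

bias = ContextualRelPosBias(d_head=64, grid_h=7, grid_w=7)
b = bias(torch.randn(2, 8, 49, 64))                               # (2, 8, 49, 49)
```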

3.4. Multi-Scale Residual Model

The multi-scale residual model consists of two parts: residual connections and layer normalization. To ensure a better flow of multi-scale information, we adapt the residual module of the Transformer to process the outputs of the multi-head attention and multi-scale convolution modules separately. The residual connections are computed as follows:

$$Z_{mha} = X + \mathrm{MHA}(X)$$

$$Z_{msa} = X + \mathrm{MSA}(X)$$

where $X$ is the input of the multi-head attention layer or the multi-scale convolution module, $\mathrm{MHA}(X)$ is the output of the multi-head attention layer, $\mathrm{MSA}(X)$ is the output of the multi-scale convolution module, and $Z_{mha}$ and $Z_{msa}$ are the outputs of the residual connections. Each mini-batch is normalized by layer normalization so that the distribution of each dimension is consistent, improving the stability and learning speed of the network:

$$\mathrm{LayerNorm}(Z_{mha})$$

$$\mathrm{LayerNorm}(Z_{msa})$$

The outputs of the residual connections are normalized by Formulas (19) and (20) to obtain the final output. The residual connection and layer normalization operations are visualized in Figure 5.
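Putting the pieces together, a minimal sketch of one spatial-stream encoder block with the residual connections and layer normalization described above follows; nn.MultiheadAttention stands in for the multi-head self-attention layer (the relative position bias is omitted for brevity), and a length-preserving conv_sublayer such as the MultiScaleConv sketched in Section 3.2 is passed in.

```python
# Minimal sketch of one encoder block: post-norm residual connections around
# the multi-head self-attention sublayer and the multi-scale convolution sublayer.
import torch
import torch.nn as nn

class MultiScaleEncoderBlock(nn.Module):
    def __init__(self, conv_sublayer: nn.Module, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.msc = conv_sublayer                     # e.g. the MultiScaleConv sketched earlier
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, L, d)
        attn_out, _ = self.mha(x, x, x)              # multi-head self-attention
        z = self.norm1(x + attn_out)                 # LayerNorm(Z_mha)
        return self.norm2(z + self.msc(z))           # LayerNorm(Z_msa)

# Example with an identity sublayer to keep the sketch self-contained.
block = MultiScaleEncoderBlock(nn.Identity(), d_model=256, n_heads=8)
out = block(torch.randn(2, 49, 256))                 # (2, 49, 256)
```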

4. Feature Fusion

The original two-stream network classifies the temporal-stream and spatial-stream features separately after obtaining them and produces the final result by averaging or by an SVM. This approach ignores the relationship between the temporal and spatial features, and the recognition accuracy is not high. In this paper, multiple feature fusion methods are compared to assess the advantages of different fusion strategies; their accuracies are compared through several groups of experiments, and the best fusion method is determined.

4.1. Summing Fusion Approach

The pixel values of two feature maps with the same dimensions are summed at each corresponding position $(i, j)$, and the sum is taken as the fusion result:

$$y^{sum}_{i,j,d} = z^{S}_{i,j,d} + z^{T}_{i,j,d}$$

where $z^{S}_{i,j,d}$ and $z^{T}_{i,j,d}$ denote the spatial and temporal feature values in channel $d$ at pixel $(i, j)$, and $y^{sum}_{i,j,d}$ denotes the result of summation fusion.

4.2. Average Fusion Approach

The average of the pixel values of the two feature maps at each corresponding position $(i, j)$ is taken as the fusion result:

$$y^{ave}_{i,j,d} = \frac{z^{S}_{i,j,d} + z^{T}_{i,j,d}}{2}$$

where $y^{ave}_{i,j,d}$ denotes the result of average fusion.

4.3. Maximum Fusion Approach

The maximum of the pixel values of the two feature maps at each corresponding position $(i, j)$ is taken as the fusion result:

$$y^{max}_{i,j,d} = \max\{z^{S}_{i,j,d},\, z^{T}_{i,j,d}\}$$

where $y^{max}_{i,j,d}$ denotes the result of maximum fusion.

4.4. Weighted Fusion Approach

The pixel values of the two feature maps at each corresponding position $(i, j)$ are combined by a weighted sum, which is taken as the fusion result:

$$y^{we}_{i,j,d} = \sigma z^{S}_{i,j,d} + (1 - \sigma) z^{T}_{i,j,d}$$

where $y^{we}_{i,j,d}$ denotes the result of weighted fusion and $\sigma$ is the weighting coefficient.
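For reference, the four fusion rules above reduce to simple element-wise tensor operations; the sketch below assumes spatial and temporal feature tensors z_s and z_t of identical shape, and the value of σ shown is purely illustrative.

```python
# The four fusion rules as element-wise tensor operations on spatial-stream
# features z_s and temporal-stream features z_t of the same shape.
import torch

def sum_fusion(z_s: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    return z_s + z_t                                  # summation fusion

def avg_fusion(z_s: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    return (z_s + z_t) / 2                            # average fusion

def max_fusion(z_s: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    return torch.maximum(z_s, z_t)                    # element-wise maximum fusion

def weighted_fusion(z_s: torch.Tensor, z_t: torch.Tensor, sigma: float = 0.6) -> torch.Tensor:
    return sigma * z_s + (1 - sigma) * z_t            # weighted fusion; sigma is illustrative
```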

4.5. Fusion Mode Experiment

The RGB frames and optical flow maps are resized to 224 × 224. The RGB frames are fed into the pretrained spatial-stream network to extract spatial features, while stacks of 16 optical flow maps are fed into the temporal-stream network to extract temporal features, and the final result is output after feature fusion. The features were fused with the different fusion methods, and the resulting accuracies on the UCF101 dataset are shown in Table 1.
The weighted fusion approach achieves an accuracy of 88.6%, slightly higher than the other fusion approaches. The following experiments therefore use weighted fusion for feature fusion.
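A sketch of how such a comparison can be run is given below; spatial_net, temporal_net, classifier, and val_loader are assumed placeholders for the pretrained streams, the shared classification head, and the UCF101 validation loader, and only the fusion function changes between runs.

```python
# Illustrative comparison loop: the two pretrained streams and a shared
# classification head are evaluated once per fusion rule.
import torch

@torch.no_grad()
def evaluate(fusion_fn, spatial_net, temporal_net, classifier, val_loader):
    correct = total = 0
    for rgb, flow, label in val_loader:    # rgb: (B, 3, 224, 224), flow: stacked flow maps
        fused = fusion_fn(spatial_net(rgb), temporal_net(flow))
        pred = classifier(fused).argmax(dim=1)
        correct += (pred == label).sum().item()
        total += label.numel()
    return correct / total

# e.g. accuracies = {fn.__name__: evaluate(fn, s_net, t_net, head, loader)
#                    for fn in (sum_fusion, avg_fusion, max_fusion, weighted_fusion)}
```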

5. Experimental Results and Analyses

5.1. Datasets

We evaluate our approach on three widely used action recognition datasets: UCF101 [30], HMDB51 [31], and KTH [32]. To reduce the training difficulty of the model and enhance its performance, we employ a pretrained model on the Kinetics-400 [33] dataset.
UCF101: The UCF101 dataset is an expanded version of the UCF50 dataset, comprising over 13,000 videos sourced from YouTube and organized into 101 distinct action classes. This dataset offers a diverse range of actions captured from various angles, featuring complexities such as object appearance, varied viewpoints, camera motion, cluttered backgrounds, and different lighting conditions. The videos within each action class are divided into 25 groups, with each group containing 4 to 7 videos. The action classes can be broadly categorized into five types: human–object interaction, human–human interaction, body motion, sports, and playing musical instruments. The UCF101 dataset is partitioned into a training set with approximately 9500 videos and a test set with around 3700 videos.
HMDB51: The HMDB51 dataset comprises over 6000 videos, primarily sourced from internet movies. These videos are categorized into 51 action classes, with most actions representing everyday activities. Each action class contains at least 101 videos. The action classes can be broadly divided into two main categories: facial actions and body movements. All videos in the HMDB51 dataset are annotated with detailed information, including action class labels, video conditions, and metadata. The annotations specify the position of the body, the visibility of the body parts, and the number of objects involved in the actions. The HMDB51 dataset is divided into a training set with approximately 3500 videos and a test set with around 1500 videos.
KTH: The KTH action dataset, released by the Royal Institute of Technology (KTH) in Sweden in 2004, is a widely used video dataset for action recognition research. The dataset consists of 600 video sequences, each annotated with specific action categories and performer information. It includes six basic action categories, with each action performed by 25 different individuals across four distinct scenarios: outdoor, outdoor with scale variation, outdoor with different clothing, and indoor.
Kinetics-400: The Kinetics-400 dataset is a large-scale and meticulously labeled video dataset, encompassing 400 diverse action classes. It comprises 240,000 training videos, 40,000 testing videos, and 20,000 validation videos. Each action class contains over 600 video instances. The dataset contains a wide range of actions, including human–object interactions such as riding a bike and typing, as well as human–human interactions like shaking hands and salsa dancing.

5.2. Experimental Implementation Details

The experiments were conducted on the Ubuntu 20.04 platform using the MMAction2 toolbox and the PyTorch 2.1.1 deep learning framework with Python 3.10 on an NVIDIA GeForce RTX 3090 GPU. The video RGB frames and optical flow maps of the UCF101 and HMDB51 datasets were extracted by preprocessing in MATLAB R2024a. One RGB image is generated every four frames, and one optical flow image in each of the x and y directions is generated for every frame; the images are stored in the corresponding directories. To enhance the model's generalization capability, we applied data augmentation to enrich the diversity of the data: random cropping and random horizontal flipping with a probability of 0.5, which improves performance on UCF101 by 1.3%. The initial learning rate is 0.01 with a decay of 0.001, the batch size is 16, and the number of epochs is set to 100 and 80 on UCF101 and HMDB51, respectively. Training with a pretrained model takes about 4.5 h on UCF101. The training process on the UCF101 dataset is shown in Figure 6.
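The following sketch summarizes this training configuration in PyTorch; the RandomResizedCrop size of 224, the choice of SGD with momentum, and the interpretation of the 0.001 decay as weight decay are assumptions, since the exact optimizer is not specified above.

```python
# Training configuration sketch: data augmentation (random crop, horizontal
# flip with p = 0.5), lr 0.01, batch size 16, and 100/80 epochs on UCF101/HMDB51.
import torch
import torch.nn as nn
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),          # random cropping (size assumed)
    transforms.RandomHorizontalFlip(p=0.5),     # horizontal flip with probability 0.5
    transforms.ToTensor(),
])

def make_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    # SGD with momentum is an assumption; the 0.001 decay is applied as weight decay here.
    return torch.optim.SGD(model.parameters(), lr=0.01,
                           momentum=0.9, weight_decay=0.001)

BATCH_SIZE = 16
EPOCHS = {"UCF101": 100, "HMDB51": 80}
```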
Figure 6 shows the training accuracy and loss curves of the proposed algorithm. As training proceeds, the accuracy keeps increasing and the training loss keeps decreasing; the recognition result converges after about 60 epochs, gradually approaching the ground truth. The final recognition accuracy on each dataset is used as the evaluation criterion for the action recognition model.

5.3. Ablation Experiment

To verify the importance of the two proposed components, the C3D time-stream module and the multi-scale convolutional Transformer module, ablation experiments were conducted on the two-stream network. The recognition accuracy of the original two-stream network is compared with that of the network after adding each module, and the experimental results are shown in Table 2.
From the results in Table 2, it can be seen that the recognition accuracy on the dataset UCF101 is improved by 3.2% and 1.9% after adding the C3D module and the multi-scale convolutional Transformer module, respectively. By employing data augmentation and optimizing the model training strategies, the recognition accuracy of the model can be further enhanced. To further verify the effectiveness of the proposed module, we conducted ablation experiments on six categories of the KTH video dataset, and the results are shown in Figure 7. In the figure, TS is the abbreviation of two-stream, C3D-TS is the abbreviation of two-stream based on C3D, and C3D-MS-TS is the abbreviation of two-stream based on multi-scale Transformer module and C3D.
The results indicate that replacing 2D convolution with 3D convolution enhances the extraction of dynamic features, and that fusing multi-scale attention-integrated information strengthens the feature representation. Finally, our action recognition algorithm is compared with other action recognition algorithms on the UCF101 and HMDB51 datasets, and the comparison results are shown in Table 3.
As can be seen from Table 3, the accuracy of the two-stream algorithm with the C3D module and the multi-scale convolutional Transformer module is clearly higher than that of the other algorithms, indicating that the proposed method achieves a better recognition effect by strengthening the extraction of dynamic features and improving the accuracy of feature extraction.

6. Conclusions and Future Works

6.1. Conclusions

We proposed an improved two-stream convolutional neural network for human action recognition based on the C3D and Transformer models. First, the C3D network replaces 2D convolution in the temporal dimension, enhancing the extraction of dynamic features. Second, in the spatial dimension, a multi-scale Transformer encoder is used for feature extraction; after adding context-based relative position encoding to the input, features at different scales are adaptively integrated by the adaptive scale-attention mechanism, improving the quality of the extracted features. Finally, the optimal network architecture is obtained by experimenting with the fusion method of the two-stream features. The experimental results demonstrate that the proposed algorithm achieves superior recognition accuracy compared to the original two-stream network, the C3D network, and other state-of-the-art methods.

6.2. Future Works

The action recognition algorithm we proposed in this paper demonstrates superior recognition performance. However, the incorporation of 3D convolution and multi-scale Transformer has led to an increase in the model’s parameter count and complexity. In future research, we plan to delve into the application of channel pruning and knowledge distillation techniques to achieve network compression and mitigate the training burden. The challenge of recognizing actions when parts of the human body are occluded is a critical area for further investigation. In real-world applications, partial occlusion of the human body is a frequent occurrence, significantly complicating the task of action recognition. Consequently, developing new algorithms and methods to enhance the accuracy and robustness of action recognition systems under occlusion conditions will be a primary focus of subsequent research endeavors. Additionally, under the framework of a two-stream network architecture, leveraging the fusion of multimodal information for action recognition is another promising avenue for future exploration.

Author Contributions

Conceptualization, M.L. and B.H.; methodology, B.H. and M.L.; software, B.H. and W.L.; validation, B.H. and W.L.; formal analysis, W.L.; investigation, L.Q. and C.W.; resources, M.L. and B.H.; data curation, B.H. and C.W.; writing—original draft preparation, B.H. and W.L.; writing—review and editing, W.L., B.H. and L.Q.; visualization, B.H. and M.L.; supervision, M.L. and C.W.; project administration, B.H. and M.L.; funding acquisition, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the National Natural Science Foundation (61973180, 61472196); Key Research and Development Foundation of Shandong Province (2017GGX10133).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are available upon reasonable request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2969–2978. [Google Scholar]
  2. Ou, Y.; Chen, Z. 3D Deformable Convolution Temporal Reasoning network for action recognition. J. Vis. Commun. Image R. 2023, 93, 103804. [Google Scholar] [CrossRef]
  3. Bai, J.; Yang, Z.; Peng, B.; Li, W. Research on 3D convolutional neural network and its application to video understanding. J. Electron. Inf. Technol. 2023, 6, 2273–2283. [Google Scholar]
  4. Pang, C.; Lu, X.; Lyu, L. Skeleton-based action recognition through contrasting two-stream spatial temporal networks. IEEE Trans. Multimed. 2023, 25, 8699–8711. [Google Scholar] [CrossRef]
  5. Xie, Z.; Gong, Y.; Ji, J.; Ma, Z.; Xie, M. Mask guided two-stream network for end-to-end few-shot action recognition. Neurocomputing 2024, 583, 127582. [Google Scholar] [CrossRef]
  6. Zhao, Z.; Chen, Z.; Li, J.; Xie, X.; Chen, K.; Wang, X.; Shi, G. STDM-transformer: Spacetime dual multi-scale transformer network for skeleton based action recognition. Neurocomputing 2024, 563, 126903. [Google Scholar] [CrossRef]
  7. Wu, H.; Ma, X.; Li, Y. Transformer-based multiview spatiotemporal feature interactive fusion for human action recognition in depth videos. Signal Process. Image Commun. 2025, 131, 117244. [Google Scholar] [CrossRef]
  8. Xie, J.; Meng, Y.; Zhao, Y.; Nguyen, A.; Yang, X.; Zheng, Y. Dynamic Semantic-Based Spatial Graph Convolution Network for Skeleton-Based Human Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 6225–6233. [Google Scholar]
  9. Liu, H.; Liu, Y.; Ren, M.; Wang, H.; Wang, Y.; Sun, Z. Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition. arXiv 2024, arXiv:2411.18941. [Google Scholar]
  10. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; pp. 4489–4497. [Google Scholar]
  11. Qiu, Z.; Yao, T.; Mei, T. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5534–5542. [Google Scholar]
  12. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
  13. Yang, F.; Li, S.; Sun, C.; Li, X.; Xiao, Z. Action recognition in rehabilitation: Combining 3D convolution and LSTM with spatiotemporal attention. Front. Physiol. 2024, 15, 1472380. [Google Scholar] [CrossRef]
  14. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6201–6210. [Google Scholar]
  15. Tao, L.; Wang, X.; Yamasaki, T. Rethinking Motion Representation: Residual Frames With 3D ConvNets. IEEE Trans. Image Proces. 2020, 30, 9231–9244. [Google Scholar] [CrossRef]
  16. Li, S.; Wang, Z.; Liu, Y.; Zhang, Y.; Zhu, J.; Cui, X.; Liu, J. FSformer: Fast-Slow Transformer for video action recognition. Image Vis. Comput. 2023, 137, 104740. [Google Scholar] [CrossRef]
  17. Zhou, A.; Ma, Y.; Ji, W.; Zong, M.; Yang, P.; Wu, M.; Liu, M. Multi-head attention-based two-stream EfficientNet for action recognition. Multimed. Syst. 2023, 29, 487–498. [Google Scholar] [CrossRef]
  18. Pareek, G.; Nigam, S.; Singh, R. Modeling transformer architecture with attention layer for human activity recognition. Neural Comput. Appl. 2024, 36, 5515–5528. [Google Scholar] [CrossRef]
  19. Ma, Y.; Wang, R. Relative-position embedding based spatially and temporally decoupled Transformer for action recognition. Pattern Recognit. 2024, 145, 109905. [Google Scholar] [CrossRef]
  20. Qiu, H.; Hou, B.; Ren, B.; Zhang, X. Spatio-temporal tuples transformer for skeleton-based action recognition. arXiv 2022, arXiv:2201.02849. [Google Scholar]
  21. Do, J.; Kim, M. SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 9 September–4 October 2024; pp. 401–420. [Google Scholar]
  22. Liu, H.; Liu, Y.; Chen, Y.; Yuan, C.; Li, B.; Hu, W. TranSkeleton: Hierarchical Spatial–Temporal Transformer for Skeleton-Based Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 2023, 8, 4137–4148. [Google Scholar] [CrossRef]
  23. Sun, W.; Ma, Y.; Wang, R. k-NN attention-based video vision transformer for action recognition. Neurocomputing 2024, 574, 127256. [Google Scholar] [CrossRef]
  24. Zhang, C.; Wu, J.; Li, Y. ActionFormer: Localizing Moments of Actions with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 492–510. [Google Scholar]
  25. Zheng, Y.; Li, Z.; Wang, Z.; Wu, L. Transformer-Based Two-Stream Network for Global and Local Motion Estimation. In Proceedings of the IEEE International Conference on Pattern Recognition and Machine Learning (PRML), Urumqi, China, 4–6 August 2023; pp. 328–334. [Google Scholar]
  26. Yang, J.; Dong, X.; Liu, L.; Zhang, C.; Shen, J.; Yu, D. Recurring the Transformer for Video Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14043–14053. [Google Scholar]
  27. Neimark, D.; Bar, O.; Zohar, M.; Asselmann, D. Video Transformer Network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 3156–3165. [Google Scholar]
  28. Wang, X.; Chen, K.; Zhao, Z.; Shi, G.; Xie, X.; Jiang, X.; Yang, Y. Multi-Scale Adaptive Skeleton Transformer for action recognition. Comput. Vis. Image Underst. 2025, 250, 104229. [Google Scholar] [CrossRef]
  29. Kong, J.; Bian, Y.; Jiang, M. MTT: Multi-Scale Temporal Transformer for Skeleton-Based Action Recognition. IEEE Signal Process. Lett. 2022, 29, 528–532. [Google Scholar] [CrossRef]
  30. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
  31. Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar]
  32. Schuldt, C.; Laptev, I.; Caputo, B. Recognizing Human Actions: A Local SVM Approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR), Cambridge, UK, 23–26 August 2004; pp. 32–36. [Google Scholar]
  33. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
  34. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional Two-Stream Network Fusion for Video Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941. [Google Scholar]
  35. Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 3551–3558. [Google Scholar]
  36. Morgado, P.; Vasconcelos, N.; Misra, I. Audio-visual Instance Discrimination with Cross-Modal-Agreement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2021; pp. 12475–12486. [Google Scholar]
  37. Wang, L.; Tong, Z.; Ji, B.; Wu, G. TDN: Temporal Difference Networks for Efficient Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1895–1904. [Google Scholar]
  38. Liu, Z.; Li, Z.; Wang, R.; Zong, M.; Ji, W. Spatiotemporal saliency-based multi-stream networks with attention-aware LSTM for action recognition. Neural Comput. Appl. 2020, 18, 14593–14602. [Google Scholar] [CrossRef]
  39. Manh, N.D.D.; Hang, D.V.; Wang, J.C. YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition. arXiv 2024, arXiv:2408.02623v2. [Google Scholar]
  40. Wang, J.; Peng, X.; Qiao, Y. Cascade multi-head attention networks for action recognition. Comput. Vis. Image Underst. 2020, 192, 102898. [Google Scholar] [CrossRef]
Figure 1. Improved Two-Stream Network Based on C3D and Multi-scale Transformer.
Figure 2. C3D Convolutional Structure. (a) Two-dimensional convolution in multi-frame images; (b) Three-dimensional convolution in multi-frame images; (c) C3D network structure.
Figure 3. Spatial Stream Network Structure.
Figure 4. Multi-scale Convolution Module in the Transformer Encoder.
Figure 5. Multi-Scale Residual Model.
Figure 6. Training Accuracy and Loss on the UCF101.
Figure 7. Ablation Experiment on the KTH.
Table 1. Accuracy of different fusion methods on UCF101.
Fusion Approach       Recognition Accuracy (ACC)/%
Summation fusion      87.6
Average fusion        84.3
Maximum fusion        85.5
Weighted fusion       88.6
Table 2. Recognition accuracy of different modules.
Model                           Recognition Accuracy (ACC)/%
Two-stream                      88.6
C3D-Two-stream                  91.8
Multi-Transformer-Two-stream    90.5
C3D + Multi-Scale Transformer   92.9
Table 3. Recognition accuracy of different methods.
Model              UCF101 (ACC)/%    HMDB51 (ACC)/%
Two-stream [34]    87.6              59.4
P3D [4]            88.6              -
C3D [12]           85.4              56.8
VTN [27]           92.7              72.3
IDT [35]           85.9              57.2
AVID-CAM [36]      87.5              -
TDN [37]           91.5              75.3
STS [38]           90.1              62.4
YOWOv3 [39]        95.4              68.8
TDD [40]           90.3              63.2
Ours               92.9              74.6
Ours-DA            94.2              -
Ours-DA: ours with data augmentation.