Article

Tohjm-Trained Multiscale Spatial Temporal Graph Convolutional Neural Network for Semi-Supervised Skeletal Action Recognition

1 College of Cyber Security and Computer, Hebei University, Baoding 071002, China
2 Hebei Machine Vision Engineering Research Center, Baoding 071002, China
3 School of Management, Shanghai University of Engineering Science, Shanghai 200000, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(21), 3498; https://doi.org/10.3390/electronics11213498
Submission received: 4 October 2022 / Revised: 20 October 2022 / Accepted: 24 October 2022 / Published: 28 October 2022

Abstract:
In recent years, spatial-temporal graph convolutional networks have played an increasingly important role in skeleton-based human action recognition. However, most ST-GCN-based approaches still have three major limitations: (1) They use only a single joint scale to extract action features, or process joint and skeletal information separately, so action features cannot be extracted dynamically through the mutual directivity between scales. (2) They treat the contributions of all joints equally during training, ignoring the fact that joints whose loss is difficult to reduce are often the critical joints for network training. (3) They rely heavily on large amounts of labeled data, which remain costly to obtain. To address these problems, we propose a Tohjm-trained multiscale spatial-temporal graph convolutional neural network for semi-supervised action recognition, which contains three parts: an encoder, a decoder and a classifier. The encoder's core is a correlated joint-bone-body-part fusion spatial-temporal graph convolutional network that allows the network to learn more stable action features across coarse and fine scales. The decoder uses a self-supervised training method with a motion-prediction head, which enables the network to extract action features from unlabeled data and thus achieve semi-supervised learning. The network is also capable of fully supervised learning with the encoder, decoder and classifier. Our proposed time-level online hard joint mining strategy is applied during decoder training, which allows the network to focus on hard-to-train joints and improves overall performance. Experimental results on the NTU-RGB + D dataset and the Kinetics-skeleton dataset show that the improved model achieves good performance for action recognition based on semi-supervised training, and is also applicable to the fully supervised approach.

1. Introduction

Action recognition in video sequences is not only an important research topic in computer vision, but also a cross-disciplinary subject spanning machine vision, pattern recognition and artificial intelligence. It has widespread applications in video surveillance, human-computer interaction, intelligent robotics and virtual reality [1].
Human action recognition from video streams includes various approaches based on image sequences [2], depth image sequences [3], dual-stream fusion (e.g., RGB + optical flow) [4] and human skeleton sequences [5]. Among them, human skeleton data is a topological representation of human joints and bones, which has an inherent advantage in the face of complex backgrounds and changes in human scale, viewing angle and movement speed, and requires less computation than other data modalities. With the continuous development of depth sensors and human pose-estimation techniques, accurate human skeleton data can be obtained. Nevertheless, unlabeled data cannot be used directly by existing skeleton-based fully supervised action-recognition methods because they rely heavily on expensive manual annotations, and the lack of appearance information in skeleton data makes manual annotation of skeletal actions even more challenging. In contrast, semi-supervised learning can explore useful information from unlabeled data and is widely used to recognize human actions [6,7,8,9] in RGB data.
Traditional skeleton-based methods usually extract motion patterns from a particular skeleton sequence using handcrafted features, which perform well on some specific datasets but generalize poorly [10]. In recent years, with the development of deep learning in other computer vision applications, models built on skeletal structural data [11,12] have achieved promising results. Furthermore, many variants of ST-GCN [13] have been explored [14,15,16,17,18,19], usually introducing attention mechanisms, context-aware modules or semantic guidance modules to enhance the expressive power of the network. However, these ST-GCN-based methods have three main drawbacks: the use of only a single joint scale, the equal treatment of all joint contributions in training and the high cost of labeled data.
(1)
The use of only a single joint scale: Most methods consider joint information without skeletal information, or take joint and skeletal information into account separately. For example, Shi et al. [20] proposed a two-stream adaptive graph convolutional network in which the topological graph of skeleton joints is learned adaptively by back-propagation to increase the flexibility of graph construction. The two-stream network utilizes not only the first-order information of the skeleton data (joint information) but also the second-order information (length and orientation of the bones), which improves accuracy by about 7% on the NTU RGB+D dataset compared with the method of Yan et al. [13]. Although the two-stream network involves the bone scale when extracting action information, there is no information exchange between the joint and bone scales. Tu et al. [21] improved on this basis by fusing joint and bone information in a dual-stream network during feature extraction, which achieved some performance gains. However, the network fails to consider that the bone scale is instructive to the joint scale during human movement, which limits its ability to extract motion information. Considering the problem of fusing joint and skeletal scales, Zhang et al. [22] first proposed applying convolution to the edges of the skeleton graph as well, and designed two hybrid networks. Both networks use skeletal and joint information, which improves the accuracy of action recognition, but they fail to consider extracting action information at the body-part scale. Joints and bones actuate and constrain each other during movement; for example, lifting the elbow changes the position of the arm joints. This complex motor association between joints and bones constitutes a variety of human behaviors. Considering the potential functional correlation between joints and bones in action, we adopt a new representation of the human body: the joint scale, the bone scale and the body-part scale. The mutual directivity among joints, bones and body parts in space-time is utilized to better represent and recognize movements.
(2)
The equal treatment of all joint contributions in training: In human action recognition there are often situations, such as uneven sample categories, that make some joints harder to train than others. We therefore propose a time-level online hard key-point mining method, which benefits from the OHKM approach [23,24] to hard-sample mining. The method associates each human joint with its loss value: the larger the loss value, the harder that joint is for the network to train. In this way, hard joints can be mined effectively, the network can be trained well against these problematic joints and its recognition accuracy on difficult joints can be improved.
(3)
The high cost of labeled data: Unsupervised learning methods are mostly adopted when training with unlabeled data, owing to the expensive cost of annotation [25,26,27]. The network proposed in this paper can also perform semi-supervised training using unlabeled data, thus reducing the cost of labeled data.
Based on the above, this paper proposes a Tohjm-trained multiscale spatial-temporal graph convolutional neural network for semi-supervised action recognition (Tohjm-MSstgcn), which can mine the hard-to-train joints, fully consider the interaction of information among scales and perform semi-supervised learning using unlabeled data. In this framework, the scale-information fusion module (SIFM), as the core component of the encoder, is designed to extract action features from the given multiscale spatial-temporal graph. Each encoder consists of three steps: multiscale graph construction, action feature extraction and scale transform. Further, each step corresponds to an operation in the spatial and temporal domains. We propose new multiscale graph construction methods and multiscale fusion approaches to adaptively handle irregular graph structures and the mapping between body joints and body components. In the decoder, the newly proposed time-level online hard joint mining strategy (Tohjm) combined with the graph-gated recurrent unit (G-GRU) is used for training to speed up the convergence of the network and improve its accuracy; the more valuable joints are mined for the network according to their loss values. The classifier consists of a mean pooling layer and a fully connected layer to complete the action classification. To train Tohjm-MSstgcn, we introduce a loss function that combines the error between the predicted and actual joint positions with the cross-entropy loss. In summary, the main contributions of our method are three-fold:
Improvements to address issues one and two above:
(1)
We propose a Tohjm-trained multiscale spatial temporal graph convolutional neural network for semi-supervised action recognition (Tohjm-MSstgcn) to extract action features at different scales and achieve effective action classification. The encoder uses joint coordinate subtraction to obtain multiscale graphs and automatically converts graphs between scales with the scale transformation (St) module. A scale-information interaction module constructed with three-layer MLPs is used to obtain cross-scale information.
(2)
We propose a time-level online hard joint mining training strategy: the network dynamically selects the joints with the top-K losses in training and sets the loss of the other joints to 0. This strategy helps the model focus more on hard samples, effectively reduces prediction error and improves training on unlabeled samples. The newly constructed loss function is also used to train the overall network.
Improvement to address problem three above:
The decoder uses a self-supervised training method with a motion-prediction header, which allows the network to use unlabeled data to extract action features and reduce the cost of labeled data.
The rest of this paper is organized as follows. Section 2 introduces related work. In Section 3, Tohjm-MSstgcn is described in detail. The experimental results are given in Section 4. Finally, in Section 5, we summarize our work and look forward to future research.

2. Related Works

2.1. ST-GCN

The spatial-temporal graph convolution network (ST-GCN) was proposed by Yan et al. [13], who first applied graph convolutional networks to joint-based action recognition. ST-GCN extends the graph convolutional neural network to skeleton-based human action recognition. A spatial-temporal graph of the skeleton sequence is constructed from the natural connections of the human body and the connections of the same joints across consecutive frames. Including the spatial-temporal relationships between joints, which are essential for identifying human behavior, allows information to be integrated along both the graph and temporal dimensions. The network structure is shown in Figure 1. However, the ST-GCN model does not differentiate the contributions of the joints in the video frames, so the network does not focus on the joints that contribute more to action recognition when extracting action features; at the same time, some unnecessary noise is extracted, which affects the recognition results. Moreover, ST-GCN can only be trained with labeled data, but the labor cost of labeling data is high and there is a risk of mislabeling. Our proposed model can not only be trained using unlabeled data, but also enables the network to focus more on high-contribution joints.

2.2. OHKM

Online Hard Key-point Mining (OHKM) [23,24] ranks the L2 loss values of all key-points from largest to smallest; i.e., if there are N key-points to be estimated in total, then the T (T < N) key-points with the largest losses are selected for training. Unlike OHEM, which focuses more on specific instances, OHKM emphasizes information on higher-level features. Each method has advantages and disadvantages on different networks, and both can improve training accuracy with their different focuses. However, OHKM is applied only to a single image; in this paper, we improve the OHKM method and apply it to the network-training stage, so that the network can focus on high-contribution joints across time frames.
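As a concrete illustration, the following is a minimal sketch of the OHKM selection step, assuming the per-key-point L2 losses have already been computed; the tensor shapes and example values are our own assumptions rather than part of the original OHKM implementation.

```python
import torch

def ohkm_loss(per_keypoint_loss: torch.Tensor, top_t: int) -> torch.Tensor:
    """Online hard key-point mining for a single image: keep only the T largest
    of the N per-key-point L2 losses so that only hard key-points produce gradients."""
    top_losses, _ = torch.topk(per_keypoint_loss, k=top_t)
    return top_losses.mean()

# usage: L2 losses for 17 key-points, train on the 8 hardest ones only
losses = torch.rand(17, requires_grad=True)
loss = ohkm_loss(losses, top_t=8)
loss.backward()
```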

2.3. GRU Model

The function of the graph-gated recurrent unit (G-GRU) [27] is to learn and update the hidden states of graph vertices under the guidance of the graph structure. X(t) is treated as the initial state of the G-GRU, and H(t) denotes the online skeleton action feature. The role of GRU(X(t), H(t)) is as follows:
r_t = \sigma\big(r_{in}(SI_t) + r_{hid}(A_H h_{t-1} W_H)\big)   (1)
z_t = \sigma\big(z_{in}(SI_t) + z_{hid}(A_H h_{t-1} W_H)\big)   (2)
\tilde{h}_t = \tanh\big(h_{in}(SI_t) + r_t \odot h_{hid}(A_H h_{t-1} W_H)\big)   (3)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t   (4)
where $A_H \in \mathbb{R}^{M \times d}$ is initialized with the skeleton graph and the adjacency matrix of the built-in graph, $r_{in}(\cdot)$, $r_{hid}(\cdot)$, $z_{in}(\cdot)$, $z_{hid}(\cdot)$, $h_{in}(\cdot)$ and $h_{hid}(\cdot)$ are trainable linear mappings, and $W_H$ denotes trainable weights.
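For readers who prefer code, the following is a hedged PyTorch sketch of a graph-gated recurrent unit cell following Equations (1)-(4); the layer names, shapes and the way $A_H$ and $W_H$ are stored are assumptions, not the implementation used in [27] or in this paper.

```python
import torch
import torch.nn as nn

class GraphGRUCell(nn.Module):
    """Graph-gated recurrent unit following Equations (1)-(4); A_H propagates the
    previous hidden state over the skeleton graph before the usual GRU gating."""
    def __init__(self, in_dim, hid_dim, A_H):
        super().__init__()
        self.register_buffer("A_H", A_H)               # (M, M) graph matrix
        self.W_H = nn.Parameter(torch.eye(hid_dim))    # trainable weights W_H
        self.r_in, self.r_hid = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, hid_dim)
        self.z_in, self.z_hid = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, hid_dim)
        self.h_in, self.h_hid = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, hid_dim)

    def forward(self, x_t, h_prev):
        # x_t: (M, in_dim) current input SI_t; h_prev: (M, hid_dim) hidden state h_{t-1}
        g = self.A_H @ h_prev @ self.W_H                          # A_H h_{t-1} W_H
        r = torch.sigmoid(self.r_in(x_t) + self.r_hid(g))         # Eq. (1)
        z = torch.sigmoid(self.z_in(x_t) + self.z_hid(g))         # Eq. (2)
        h_tilde = torch.tanh(self.h_in(x_t) + r * self.h_hid(g))  # Eq. (3)
        return z * h_prev + (1 - z) * h_tilde                     # Eq. (4)

# usage: 25 graph vertices, 3-D inputs, 64-D hidden state, identity adjacency as a stand-in
cell = GraphGRUCell(3, 64, torch.eye(25))
h = cell(torch.randn(25, 3), torch.zeros(25, 64))
```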

3. Methodology

3.1. Problem Definition

In this paper, G = (V, B, C) denotes the skeleton graph, where V denotes the joints, B denotes the bones connected by joints and C denotes the body parts connected by bones. The adjacency matrix is defined as $A \in \{0,1\}^{n \times n}$, where $A_{i,j} = 1$ when joint i is connected to joint j, and 0 otherwise. The same partitioning strategy as ST-GCN [13] is used for all three scales of the adjacency matrix.
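A minimal sketch of how such a binary adjacency matrix could be built from an edge list is shown below; the five-joint chain in the example is purely illustrative and is not the NTU-RGB + D skeleton.

```python
import numpy as np

def build_adjacency(num_joints, edges):
    """Binary adjacency A in {0,1}^{n x n}: A[i, j] = 1 iff joints i and j are connected."""
    A = np.zeros((num_joints, num_joints), dtype=np.float32)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

# illustrative five-joint chain (NOT the real 25-joint NTU-RGB + D skeleton)
A_joint = build_adjacency(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
```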

3.2. Tohjm-MSstgcn

Tohjm-MSstgcn updates its weights with both unlabeled and labeled data in the training phase and predicts action classes in the testing phase. The main framework of Tohjm-MSstgcn is shown in Figure 2. The model consists of three main modules: the encoder that extracts action information, the decoder with Tohjm based on the graph GRU, and the action classifier. The backbone of the encoder consists of the coordinate-subtraction block (Cs) of the joints, the scale-information fusion module (SIFM) and the scale transformation (St). Its purpose is to extract richer semantic action information through the mutual guidance of different scales, and to output three feature sequences as input to the decoder and classifier. The encoder and the decoder with Tohjm take unlabeled data as input, and the weights of both networks are updated using the error between the future pose predicted by the decoder and the ground truth as the loss. The trained encoder and decoder, combined with the untrained classifier, then take the labeled data as input to further update the network weights and achieve accurate action classification. During training, the Tohjm method improves network performance by dynamically attending to the joints that contribute much to the network but are hard to train. A sketch of this two-stage training flow is given below.
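The following sketch outlines that two-stage flow under our own simplifying assumptions; the encoder, decoder, classifier and data loaders are placeholders, and the loss interfaces are illustrative rather than the exact ones used in the paper.

```python
import torch

def train_semi_supervised(encoder, decoder, classifier,
                          unlabeled_loader, labeled_loader,
                          tohjm_loss, ce_loss, lam=0.1, device="cuda"):
    """Two-stage training: (1) encoder + decoder on unlabeled motion prediction,
    (2) encoder + decoder + classifier on labeled clips with the combined loss."""
    # Stage 1: self-supervised motion prediction on unlabeled clips.
    opt1 = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()),
                           lr=0.1, momentum=0.9, nesterov=True, weight_decay=5e-4)
    for clip, future_pose in unlabeled_loader:
        pred_pose = decoder(encoder(clip.to(device)))
        loss = tohjm_loss(pred_pose, future_pose.to(device))   # position loss only
        opt1.zero_grad()
        loss.backward()
        opt1.step()

    # Stage 2: supervised fine-tuning with the classifier on labeled clips.
    params = (list(encoder.parameters()) + list(decoder.parameters())
              + list(classifier.parameters()))
    opt2 = torch.optim.SGD(params, lr=0.1, momentum=0.9, nesterov=True, weight_decay=5e-4)
    for clip, future_pose, label in labeled_loader:
        feats = encoder(clip.to(device))
        loss = (ce_loss(classifier(feats), label.to(device))
                + lam * tohjm_loss(decoder(feats), future_pose.to(device)))
        opt2.zero_grad()
        loss.backward()
        opt2.step()
```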
Encoder: Existing models either do not use skeletal information [13,15,19,28] or treat joints and bones separately and independently [14,20,29,30]. These methods have neither fully explored the movement transmission between joints and bones nor fully used the human body’s physical structure. It is well known that if we want to recognize someone’s movement, we need cooperation between large limbs, such as arms and legs, and some subtle movements, such as wrist rotation, so the scale-information interaction strategy fits our cognitive rules. Therefore, we draw inspiration from previous works [27,31]. First, we propose a new representation of the human body: the multiscale map. This differs from previous work as follows: we divide the human body joints into three scales according to the human skeleton and body parts, whereas previous multiscale maps only classify the joints. As shown in Figure 3, the single-scale diagrams S1, S2 and S3 provide a pyramidal representation of the human skeleton. Each cross-scale diagram is a bipartite diagram that connects one single-scale diagram to another. For example, the “arm” joint in the coarse-scale diagram can be connected to the “hand” and “elbow” joints in the fine-scale diagram. This multiscale graph is initialized by predefined physical connections, and is adaptively tuned to be sensitive to motion during training. The multiscale calculation in this paper relies on a coordinate subtraction block, obtained by subtracting the coordinates of the joints, which is not generated by erasing the excess joints on the bones or body parts, as in previous work. Our approach can retain more information about the movement without destroying the integrity of the information.
Second, inspired by Li et al. [27], we propose the scale-information fusion module (SIFM), shown in Figure 4, which consists of a single-scale feature-extraction block (ST-GCN) and a multiscale fusion block (FS-B). Unlike Li et al. [27], the SIFM block contains three single-scale extraction blocks. Similarly, in contrast to the complex interaction block of Li et al. [27], we design three-layer MLPs, as shown in Figure 5, to obtain the attention matrix Asisj for two adjacent scales, which speeds up computation and eases training. The process of generating Asisj can be described as:
h_{s_1} = f_{MLP_1}(F_{s_1})   (5)
h_{s_2} = f_{MLP_2}(F_{s_2})   (6)
A_{s_2 s_1} = \mathrm{softmax}(h_{s_1}^{T} h_{s_2})   (7)
where $f_{MLP_1}$ and $f_{MLP_2}$ denote the MLPs shown in Figure 5, $F_{s_1}$ and $F_{s_2}$ denote the features at scales s1 and s2, respectively, and $A_{s_2 s_1}$ is the attention matrix from s2 to s1. Thanks to $A_{s_2 s_1}$, we can adaptively explore cross-scale human correlations in different ways. Next, we bring the guidance information from the adjacent scale s2 to s1 using $A_{s_2 s_1}$. The features at s1 are updated as follows:
F_{s_1} \leftarrow A_{s_2 s_1} F_{s_2} + F_{s_1}   (8)
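A compact PyTorch sketch of this cross-scale fusion, assuming row-wise joint features and illustrative MLP sizes, could look as follows; it mirrors Equations (5)-(8) but is not the authors' implementation.

```python
import torch
import torch.nn as nn

def three_layer_mlp(c_in, c_hidden):
    return nn.Sequential(nn.Linear(c_in, c_hidden), nn.ReLU(),
                         nn.Linear(c_hidden, c_hidden), nn.ReLU(),
                         nn.Linear(c_hidden, c_hidden))

class CrossScaleFusion(nn.Module):
    """Fuses features of an adjacent coarser scale s2 into scale s1, Eqs. (5)-(8)."""
    def __init__(self, c_in, c_hidden=64):
        super().__init__()
        self.mlp_s1 = three_layer_mlp(c_in, c_hidden)
        self.mlp_s2 = three_layer_mlp(c_in, c_hidden)

    def forward(self, F_s1, F_s2):
        # F_s1: (N1, C) fine-scale features; F_s2: (N2, C) coarse-scale features
        h_s1 = self.mlp_s1(F_s1)                          # Eq. (5)
        h_s2 = self.mlp_s2(F_s2)                          # Eq. (6)
        A_s2s1 = torch.softmax(h_s1 @ h_s2.T, dim=-1)     # Eq. (7), shape (N1, N2)
        return A_s2s1 @ F_s2 + F_s1                       # Eq. (8)

# usage: 25 joints receiving guidance from 24 bones, 64-channel features
fused = CrossScaleFusion(64)(torch.randn(25, 64), torch.randn(24, 64))
```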
After the SIFM module, the bone-scale and body-part-scale features, as shown in Figure 6, are converted by the St module to the joint scale as input to the decoder or classifier. The bone-scale graph or the body-part-scale graph is converted to a joint-scale graph, e.g., one left-arm node is converted to the elbow, wrist and hand joints. One MLP is applied to each component, with 15 MLPs corresponding to the 15 components in S2 and S3, to achieve the scale conversion of the graph. Converting S2 back to S1 (the S3 operation is similar) can be described by Equation (9):
F_{i:j}^{S_2} = f_2(F_k^{S_2})   (9)
where $F_{i:j}^{S_2}$ is the feature at scale S2 mapped back to the joint arrangement, so that the number of joints is the same as in S1, and $F_k^{S_2}$ is the feature at the bone scale.
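The per-component MLP idea behind St could be sketched as below; the component-to-joint index lists and feature sizes are illustrative assumptions, not the actual NTU-RGB + D mapping.

```python
import torch
import torch.nn as nn

class ScaleTransform(nn.Module):
    """Maps bone/body-part-scale features back to joint-scale features (Eq. (9)).

    part_to_joints[k] lists the joint indices covered by component k; one MLP
    per component produces the features of exactly those joints."""
    def __init__(self, channels, part_to_joints):
        super().__init__()
        self.part_to_joints = part_to_joints
        self.mlps = nn.ModuleList(
            [nn.Linear(channels, channels * len(js)) for js in part_to_joints])

    def forward(self, F_parts, num_joints):
        # F_parts: (num_parts, C) -> (num_joints, C)
        C = F_parts.shape[-1]
        out = torch.zeros(num_joints, C, device=F_parts.device)
        for k, joints in enumerate(self.part_to_joints):
            out[joints] = self.mlps[k](F_parts[k]).view(len(joints), C)
        return out

# e.g., a "left arm" component expanded to (hypothetical) elbow, wrist and hand indices
st = ScaleTransform(channels=256, part_to_joints=[[5, 6, 7], [9, 10, 11]])
joint_feats = st(torch.randn(2, 256), num_joints=25)
```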
Decoder: The decoder adopts the G-GRU unit from Li et al. [27]; unlike that work, however, we decode and fuse the features extracted by the encoder sequentially from high scale to low scale. The larger scales provide information on the global evolution of the action, which indicates the general direction, speed and type of the action. The finer scales can predict the exact joint positions with this global guidance information. In the training phase, we use Tohjm training, which allows the network to focus on the joints whose loss is hard to reduce. This strategy helps the model focus more on hard samples and effectively reduces prediction errors. The structure of the decoder is shown in Figure 6.
Classifier: The classifier, which consists of a spatial-temporal average pooling layer (ST-AVG pool) and a fully connected layer (FC), classifies the human actions of the skeleton sequence based on the motion features extracted by the encoder. The output scores from the coarse scale to the fine scale are then summed to give the final action classification score. The structure is shown in Figure 7.
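A minimal sketch of such a classifier head, assuming all scales share the same channel dimension and that the per-scale features arrive as (batch, channels, time, joints) tensors, is given below.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """ST-AVG pooling followed by a fully connected layer; the class scores
    obtained at the different scales are summed for the final prediction."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feats_per_scale):
        scores = 0
        for f in feats_per_scale:            # each f: (batch, channels, T, V)
            pooled = f.mean(dim=(2, 3))      # average over time T and joints V
            scores = scores + self.fc(pooled)
        return scores                        # (batch, num_classes)

# usage with three scales of dummy features (60 NTU action classes assumed)
clf = ActionClassifier(channels=256, num_classes=60)
out = clf([torch.randn(2, 256, 30, 25), torch.randn(2, 256, 30, 15), torch.randn(2, 256, 30, 5)])
```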
Tohjm and Loss: Since the loss values of certain hard-to-identify joints are difficult to reduce during network training, the joints with the top-K loss values are dynamically considered in the loss function of the decoder. Compared with OHKM mining the same number of hard joints, Tohjm mines hard joints over time frames instead of single frames. It can therefore better mine the joints that are difficult to train across time frames but important to training of the whole network: if the hard joints appear in every single frame, they are all reflected in the ranking of the joint loss values over the time sequence and all participate in the gradient calculation; conversely, if the joints in every single frame are easy to detect, they are all filtered out after the ranking of the time-frame joint loss values. In general, Tohjm lets the network mine hard joints more reasonably during training, since hard joints are identified not within a single frame but among all joints across the time frames.
The primary step of the Tohjm method is to calculate the loss values of all the joints in the time window during training, sort this set of loss values from largest to smallest, and then select the largest K × M loss values for the gradient calculation that updates the network weights. Algorithm 1 is shown below.
Algorithm 1: Time-level online hard joint mining
Input:
M: number of frames in the training time window t;
N: number of human joints in a single frame;
M_i: the i-th frame (i-th image) of the M frames;
M_{ij}: the j-th joint in the i-th frame;
$\hat{p}_{j,m}$: the predicted position of joint j in frame m obtained from the network;
$p_{j,m}$: the true position of joint j in frame m;
$L_j$: the loss of joint j in frame m, $L_j = \| \hat{p}_{j,m} - p_{j,m} \|_2$.
Output:
$L_{tohjm}$: the loss value calculated by the time-level online hard joint mining algorithm.
Steps:
  • for M_i in M frames:
  •   for M_{ij} in N joints:
  •     calculate the loss $L_j = \| \hat{p}_{j,m} - p_{j,m} \|_2$ and place it in the set S: $S = \{ L_1, \ldots, L_{M \times N} \}$
  •   end for
  • end for
  • Sort the loss values in S from largest to smallest to obtain $S_{sort}$;
  • Select the top K × M loss values from $S_{sort}$ to obtain $S_{top}$;
  • Sum over $S_{top}$ to obtain $L_{tohjm} = \frac{1}{K \times M} \sum S_{top}$.
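Algorithm 1 translates almost directly into a few lines of PyTorch; the sketch below assumes predicted and ground-truth joint positions of shape (M, N, 3) and is intended only to illustrate the time-level pooling of joint losses.

```python
import torch

def tohjm_loss(pred, target, k):
    """Time-level online hard joint mining (Algorithm 1).

    pred, target: (M, N, 3) joint positions over M frames and N joints.
    The per-joint L2 losses of all M*N joints are pooled over the whole time
    window, and only the K*M largest ones contribute to the loss."""
    per_joint = torch.norm(pred - target, dim=-1)       # (M, N) L2 losses
    flat = per_joint.reshape(-1)                        # pool over the time window
    keep = min(k * pred.shape[0], flat.numel())
    top, _ = torch.topk(flat, k=keep)                   # S_top
    return top.mean()                                   # L_tohjm = (1/(K*M)) * sum(S_top)

# usage: 30 frames, 25 joints, keep the 12*30 hardest joint losses
loss = tohjm_loss(torch.randn(30, 25, 3), torch.randn(30, 25, 3), k=12)
```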
Thus, the loss function of Tohjm-MSstgcn becomes:
Loss_w = Loss_c + \lambda \, Loss_p   (10)
Loss_c = -\sum_{c=1}^{M} p_c \log(q_c)   (11)
where $Loss_w$ is the loss function of the whole network, $Loss_c$ is the cross-entropy loss (the sum in Equation (11) runs over the M action classes) and $Loss_p$ is the position loss of the joints. $Loss_p$ is given in Equation (12):
Loss_p = \frac{1}{K \times M} \sum_{j=1}^{K} \sum_{m=1}^{M} \left\| \hat{p}_{j,m} - p_{j,m} \right\|_2 = L_{tohjm}   (12)
where $\hat{p}_{j,m}$ is the predicted position of joint j in frame m and $p_{j,m}$ is its actual position.
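Assembling the overall objective of Equations (10)-(12) is then straightforward; the sketch below reuses the tohjm_loss function sketched above together with standard cross-entropy, with λ and K as hyperparameters.

```python
import torch.nn.functional as F

def total_loss(logits, labels, pred_pose, true_pose, k=12, lam=0.1):
    """Loss_w = Loss_c + lambda * Loss_p, Equations (10)-(12)."""
    loss_c = F.cross_entropy(logits, labels)        # Eq. (11), cross-entropy
    loss_p = tohjm_loss(pred_pose, true_pose, k)    # Eq. (12), sketched above
    return loss_c + lam * loss_p                    # Eq. (10)
```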

4. Results and Discussion

4.1. Experiment Datasets and Evaluation Metric

NTU-RGB + D: NTU-RGB + D is a large-scale dataset with annotated 3D joint coordinates of a human body for the task of human action recognition [32]. NTU-RGB + D includes 56,000 action videos with 60 action categories. These videos are captured indoors from 40 volunteers in different age groups ranging from 10 to 35. For each action, the videos are captured by three cameras from different viewpoints, and the 3D annotations of human body joints are given in the camera coordinate system. Every action video has no more than two subjects, and there are 25 key joints for each subject in the skeleton sequences.
The NTU-RGB + D dataset has two sub-sets: (1) the Cross-Subject (CS) sub-set, which consists of 40,320 training videos and 16,560 testing videos; the 40 subjects are split into two groups of 20, with one group used for training and the other for evaluation; (2) the Cross-View (CV) sub-set, which consists of 37,920 training videos and 18,960 testing videos; the video samples captured from camera viewpoints two and three are used for training, and those captured from camera viewpoint one are used for evaluation. We follow the conventional settings of [32] and report the Top-1 and Top-5 accuracy on both sub-sets. As in ASSL [25], in this paper we use 5%, 10%, 20% and 40% of the labeled training data on the X-View and X-Sub benchmarks.
The Kinetics-skeleton dataset was built on the large-scale action-recognition dataset Kinetics. Kinetics is the largest unconstrained action-recognition dataset, containing about 300,000 video clips retrieved from YouTube. The videos cover 400 human action classes, ranging from everyday activities and sports scenes to complex interactive actions, and each clip lasts about 10 s. Yan et al. built the Kinetics-skeleton dataset by running OpenPose on the Kinetics videos to obtain the two-dimensional coordinates (X, Y) and confidence scores C of 18 skeletal points in each frame, retaining the two persons with the highest average joint confidence in each frame and selecting 300 frames per action as the skeleton sequence. The dataset provides a training set of 240,000 clips and a validation set of 20,000 clips. For comparison, the model is trained on the training set and its performance is verified on the validation set. For experimental evaluation, we report both the Top-1 and Top-5 accuracy on the testing set, as in Yan et al. [13]. As in ASSL [25], this paper uses 5%, 10%, 20% and 40% of the labeled training data.

4.2. Implementation Details

The PyTorch 1.8.0 deep learning framework was employed with Python 3.8 under Windows; the GPUs are two NVIDIA RTX A4000s and the CUDA version is 11.2. We use four cascaded SIFMs with feature dimensions of 64, 64, 128 and 256, and the stochastic gradient descent (SGD) optimizer with Nesterov momentum (0.9). The weight decay is set to 0.0005, the batch size to 24 and the weighting factor to λ = 0.1, as in [25]. For the NTU-RGB + D dataset, the learning rate is set to 0.1 [25] and the number of training epochs to 60; the learning rate is decayed by a factor of 0.1 at the 30th, 40th and 50th epochs. For the Kinetics-skeleton dataset, the number of training epochs is set to 70 and the learning rate to 0.1 [25], decayed by a factor of 0.1 at the 40th, 50th and 60th epochs.
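For reference, the optimizer and step-decay schedule described above (NTU-RGB + D settings) can be written as the following hedged sketch; the linear layer is only a stand-in for the full Tohjm-MSstgcn model.

```python
import torch
import torch.nn as nn

model = nn.Linear(75, 60)   # stand-in for Tohjm-MSstgcn (25 joints x 3 coords -> 60 classes)

# SGD with Nesterov momentum (0.9), weight decay 0.0005, initial lr 0.1,
# decayed by a factor of 0.1 at epochs 30, 40 and 50 (NTU-RGB + D schedule).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 40, 50], gamma=0.1)

for epoch in range(60):
    # ... run one training epoch here (forward pass, loss, optimizer.step()) ...
    scheduler.step()
```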

4.3. Comparison with Other Models

On the NTU-RGB + D and Kinetics-skeleton datasets, we compared the proposed Tohjm-MSstgcn with state-of-the-art skeleton-based action-recognition methods. The methods selected for comparison include fully supervised methods based on the GCN algorithm [13,15,17,18,20,22,34,35,36,37,38,39] and semi-supervised methods [25,26,40,41]. Specifically, for comparison with the fully supervised approaches we use all labeled training data; for comparison with the semi-supervised approaches we use unlabeled data to train the encoder and then train the entire network (encoder, decoder and classifier) using partially annotated data from the training set.

4.3.1. Comparison Results of Semi-Supervised Methods on NTU-RGB + D Dataset

The semi-supervised training results were compared with those of the semi-supervised methods [25,26,40,41] on the X-view and X-sub benchmarks of the NTU-RGB + D dataset, as shown in Table 1. Our proposed Tohjm-MSstgcn model outperforms previous semi-supervised methods. Specifically, on the X-view benchmark, using 5%, 10%, 20% and 40% of the labeled training data, Tohjm-MSstgcn is 1.2%, 7.4%, 9.8% and 10.1% higher than the state-of-the-art ASSL [25] model, respectively. On the X-sub benchmark, Tohjm-MSstgcn improves accuracy over ASSL [25] by 3.6%, 5.3%, 9.5% and 10.4% using 5%, 10%, 20% and 40% of the labeled training data, respectively. Tohjm-MSstgcn is also more than 4.4% more accurate on the X-sub benchmark than the multitask method MS2L [40]. From the above experimental results, we conclude that our proposed Tohjm-MSstgcn model can effectively learn rich action features from unlabeled data.

4.3.2. Comparison Results of Fully Supervised Methods on NTU-RGB + D Dataset and Kinetics-Skeleton Dataset

After fully supervised training initialized with the weights from the semi-supervised training above, 50% of the labeled data were randomly selected for training and 100% of the labeled data were used in the testing phase; the results are shown in Table 2 and Table 3. On the NTU-RGB + D dataset, Tohjm-MSstgcn performs 1.2% better on the X-sub benchmark and 1.4% better on the X-view benchmark than the PB-GCN [37] model trained with body parts, because our Tohjm-MSstgcn model incorporates inputs from three scales (joints, bones and body parts) and the information from the three scales interacts throughout training. Tohjm-MSstgcn improves on the BPLHM [22] model, which uses combined joint and skeleton information as input, by 4.2% on the X-sub benchmark and 3.5% on the X-view benchmark of the NTU-RGB + D dataset, and by 2.3% and 2.6% in Top-1 and Top-5 accuracy, respectively, on the Kinetics-skeleton dataset. Tohjm-MSstgcn improves on the AS-GCN [15] model, which employs actional and structural links, by 0.9% on the X-sub benchmark and 0.4% on the X-view benchmark of the NTU-RGB + D dataset, and by 0.9% and 2.3% in Top-1 and Top-5 accuracy on the Kinetics-skeleton dataset. Tohjm-MSstgcn improves on the 2s-AGCN [20] model, which employs two streams of joint and skeleton information, by 0.5% on the X-sub benchmark and by 0.2% in Top-5 accuracy on the Kinetics-skeleton dataset. Because the model can perform both semi-supervised training on unlabeled data and fully supervised training, which allows the network to capture richer action features and achieve accurate classification, it outperforms most models: it is 6.2% higher on the X-sub benchmark and 16.3% higher on the X-view benchmark of the NTU-RGB + D dataset, as well as 5.2% higher in Top-1 and 6.0% higher in Top-5 accuracy on the Kinetics-skeleton dataset. The Tohjm-MSstgcn model also outperforms the state-of-the-art model PGCN-TCA [35] by 0.7% on the X-sub benchmark and by 1% on the X-view benchmark of the NTU-RGB + D dataset. On the Kinetics-skeleton dataset, the Tohjm-MSstgcn model is more accurate than the state-of-the-art models.
Our Tohjm-MSstgcn model also performs better than Tripool [39] by 1.8% and 2.6% in Top-1 and Top-5 accuracy, respectively. From the above analysis and the experimental results in Table 2 and Table 3, we conclude that our proposed Tohjm-MSstgcn model can effectively learn abundant action features from both unlabeled and labeled data.

4.4. Ablation Study

4.4.1. The Effect of SIFM

To verify the effect of the number of SIFM modules on model performance, we evaluated different numbers of SIFM modules on the Kinetics-skeleton dataset. As shown in Table 4, the network performs best with four SIFM modules: with fewer than four modules the network cannot extract rich action features, and with more than four modules performance decreases because more noise is introduced.

4.4.2. The Effect of K Value on Network Performance in Tohjm

In the decoder training, we use the time-level online hard joint mining method; i.e., the joints with the top-K losses are dynamically considered in training. The training effects of different K values are compared in Figure 8 and Figure 9. From Figure 8, the network performs best with K = 12 on the NTU-RGB + D dataset; from Figure 9, it performs best with K = 14 on the Kinetics-skeleton dataset. Because many actions in the datasets, such as calling, drinking, painting, hugging, kissing, shaking hands and mowing, do not require the complete skeleton, the action class can be recognized by the network focusing on specific joints. The time-level online hard joint mining method focuses the network's attention on hard-to-train joints and reduces the impact of noise, thus improving the network's performance.

5. Conclusions

To achieve accurate and robust human action recognition, we propose a Tohjm-trained multiscale spatial-temporal graph convolutional neural network (Tohjm-MSstgcn) for semi-supervised skeleton action recognition, together with a time-level online hard joint mining training method. Tohjm-MSstgcn divides the human joints into three scales, extracts single-scale information from each scale and obtains guidance information between the different scales. It also significantly enhances the feature-learning capability of the network by combining self-supervised learning via motion prediction in the decoder with supervised learning for action classification in the classifier, and by employing the time-level online hard joint mining training approach. Extensive experiments on two large-scale datasets show that the developed model outperforms advanced semi-supervised models and achieves good results under fully supervised learning. However, since the model uses the most basic ST-GCN to extract the initial features and there is no information exchange between the first three layers of the encoder, the extracted action information is not rich; further research is needed to improve the information-extraction capability of the first three layers of the encoder.

Author Contributions

Conceptualization, R.G., W.Y., Z.L., Y.Y. and A.L.; methodology, R.G.; software, R.G.; validation, R.G. and A.L.; formal analysis, R.G., W.Y., Z.L., Y.Y. and A.L.; investigation, R.G.; resources, R.G.; data curation, R.G.; writing—original draft preparation, R.G. and A.L.; writing—review and editing, R.G. and A.L.; visualization, R.G. and A.L.; supervision, W.Y., Z.L. and Y.Y.; project administration, R.G., Z.L. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

This work is supported by the Natural Science Foundation of Hebei Province under Grant F2022201003 and the Post-graduate’s Innovation Fund Project of Hebei University (HBU2022ss037), and the High-Performance Computing Center of Hebei University.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cai, Q.; Deng, Y.; Li, H.S. Review of human behavior recognition methods based on deep learning. Comput. Sci. 2020, 47, 85–93. (In Chinese) [Google Scholar]
  2. Wang, Y.; Xiao, Y.; Xiong, F.; Jiang, W.; Cao, Z.; Zhou, J.T.; Yuan, J. 3DV: 3D dynamic voxel for action recognition in depth video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 511–520. [Google Scholar]
  3. Sanchez-Caballero, A.; de López-Diz, S.; Fuentes-Jimenez, D.; Losada-Gutiérrez, C.; Marrón-Romera, M.; Casillas-Perez, D.; Sarker, M.I. 3dfcnn: Real-time action recognition using 3D deep neural networks with raw depth information. Multimed. Tools Appl. 2022, 81, 24119–24143. [Google Scholar] [CrossRef]
  4. Munro, J.; Damen, D. Multi-modal domain adaptation for fine-grained action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seoul, Korea, 27–28 October 2019; pp. 122–132. [Google Scholar]
  5. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 143–152. [Google Scholar]
  6. Peng, G.; Wang, S. Dual semi-supervised learning for facial action unit recognition. Proc. AAAI Conf. Artif. Intell. 2019, 33, 8827–8834. Available online: https://ojs.aaai.org//index.php/AAAI/article/view/4909 (accessed on 27 September 2022). [CrossRef] [Green Version]
  7. Xu, Z.; Hu, R.; Chen, J.; Chen, C.; Jiang, J.; Li, J.; Li, H. Semi supervised discriminant multi manifold analysis for action recognition. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 2951–2962. [Google Scholar] [CrossRef] [PubMed]
  8. Zhang, X.Y.; Li, C.; Shi, H.; Zhu, X.; Li, P.; Dong, J. Adapnet: Adaptability decomposing encoder-decoder network for weakly supervised action recognition and localization. IEEE Trans. Neural Netw. Learn. Syst. 2020, 1–17. [Google Scholar] [CrossRef] [PubMed]
  9. Zhang, J.; Han, Y.; Tang, J.; Hu, Q.; Jiang, J. Semi-supervised image-to-video adaptation for video action recognition. IEEE Trans. Cybern. 2016, 47, 960–973. [Google Scholar] [CrossRef] [PubMed]
  10. Huang, S. Research on Human Action Recognition Based on Skeleton; Shanghai Jiaotong University: Shanghai, China, 2014. (In Chinese) [Google Scholar]
  11. Wang, L.; Tong, Z.; Ji, B.; Wu, G. Tdn: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020; pp. 1895–1904. [Google Scholar]
  12. Su, K.; Liu, X.; Shlizerman, E. Predict & cluster: Unsupervised skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9631–9640. [Google Scholar]
  13. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence; 2018. Available online: https://ojs.aaai.org/index.php/AAAI/article/view/12328 (accessed on 27 September 2022).
  14. Cheng, K.; Zhang, Y.; Cao, C.; Shi, L.; Cheng, J.; Lu, H. Decoupling gcn with dropgraph module for skeleton-based action recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 536–553. [Google Scholar]
  15. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
  16. Si, C.; Jing, Y.; Wang, W.; Tan, T. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 103–118. [Google Scholar]
  17. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1112–1121. [Google Scholar]
  18. Zhang, X.; Xu, C.; Tao, D. Context aware graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14333–14342. [Google Scholar]
  19. Zhao, R.; Wang, K.; Su, H.; Ji, Q. Bayesian graph convolution lstm for skeleton based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6882–6892. [Google Scholar]
  20. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
  21. Tu, Z.; Zhang, J.; Li, H.; Chen, Y.; Yuan, J. Joint-bone fusion graph convolutional network for semi-supervised skeleton action recognition. IEEE Trans. Multimed. 2022, 1–13. [Google Scholar] [CrossRef]
  22. Zhang, X.; Xu, C.; Tian, X.; Tao, D. Graph edge convolutional neural networks for skeleton-based action recognition. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 3047–3060. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  23. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. IEEE Conf. Comput. Vis. Pattern Recognit. 2018, 7103–7112. [Google Scholar]
  24. Wang, C.; Wang, Y.; Huang, Z.; Chen, Z. Simple baseline for single human motion forecasting. IEEE/CVF Int. Conf. Comput. Vis. 2021, 2260–2265. [Google Scholar]
  25. Si, C.; Nie, X.; Wang, W.; Wang, L.; Tan, T.; Feng, J. Adversarial self-supervised learning for semi-supervised 3D action recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 35–51. [Google Scholar]
  26. Zheng, N.; Wen, J.; Liu, R.; Long, L.; Dai, J.; Gong, Z. Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence; 2018; pp. 1–8. Available online: https://ojs.aaai.org/index.php/AAAI/article/view/11853 (accessed on 27 September 2022).
  27. Li, M.; Chen, S.; Zhao, Y.; Zhang, Y.; Wang, Y.; Tian, Q. Dynamic multiscale graph neural networks for 3d skeleton based human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 214–223. [Google Scholar]
  28. Demisse, G.G.; Papadopoulos, K.; Aouada, D.; Ottersten, B. Pose encoding for robust skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 188–194. [Google Scholar]
  29. Wang, H.; Wang, L. Beyond joints: Learning representations from primitive geometries for skeleton-based action recognition and detection. IEEE Trans. Image Process. 2018, 4382–4394. [Google Scholar] [CrossRef] [PubMed]
  30. Zheng, W.; Li, L.; Zhang, Z.; Huang, Y.; Wang, L. Relational network for skeleton-based action recognition. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 826–831. [Google Scholar]
  31. Dang, L.; Nie, Y.; Long, C.; Zhang, Q.; Li, G. MSR-GCN: Multi-scale residual graph convolution networks for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 11467–11476. [Google Scholar]
  32. Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. Ntu rgb+ d: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
  33. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Zisserman, A. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
  34. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv 2018, arXiv:1804.06055. [Google Scholar]
  35. Yang, H.; Gu, Y.; Zhu, J.; Hu, K.; Zhang, X. PGCN-TCA: Pseudo graph convolutional network with temporal and channel-wise attention for skeleton-based action recognition. IEEE Access 2020, 10040–10047. [Google Scholar] [CrossRef]
  36. Wen, Y.H.; Gao, L.; Fu, H. Graph CNNs with motif and variable temporal block for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8989–8996. [Google Scholar]
  37. Thakkar, K.; Narayanan, P.J. Part-based graph convolutional network for action recognition. arXiv 2018, arXiv:1809.04983. [Google Scholar]
  38. Song, Y.F.; Zhang, Z.; Wang, L. Richly activated graph convolutional network for action recognition with incomplete skeletons. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1–5. [Google Scholar]
  39. Peng, W.; Hong, X.; Zhao, G. Tripool: Graph triplet pooling for 3D skeleton-based action recognition. Pattern Recognit. 2021, 115, 107921. [Google Scholar] [CrossRef]
  40. Lin, L.; Song, S.; Yang, W.; Liu, J. Ms2l: Multi-task self-supervised learning for skeleton based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2490–2498. [Google Scholar]
  41. Miyato, T.; Maeda, S.; Koyama, M.; Ishii, S. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 1979–1993. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The model of ST-GCN.
Figure 2. The framework of our method. In the training phase, we use both the annotated data and unannotated data to train the model. For the training data without action labels, we use a lossp to optimize the encoder and decoder. For the training data with action labels, we use a composite lossw to optimize the encoder, decoder and classifier. In the testing phase, the encoder and the classifier are utilized to classify human actions that are represented by the skeleton sequence.
Figure 3. Multiscale graphs.
Figure 4. The model of SIFM.
Figure 5. The model of St, which converts high-scale joints to low-scale joints by St. The figure shows the transfer of bone scale to joint scale.
Figure 6. The model of decoder, which updates the network parameters by decoding and fusing the scales sequentially.
Figure 7. Model of the classifier, which crosses from high-scale to low-scale to classify and sum up the final action category.
Figure 8. Effect of different K values on Top-1 accuracy on X-sub and X-view benchmarks of the NTU-RGB + D dataset.
Figure 9. Effect of different K values on Top-1 and Top-5 accuracy on the Kinetics-skeleton dataset.
Table 1. Comparison of Top-1 accuracy with semi-supervised methods on NTU-RGB + D dataset (bold text is best).
Methods | 5% X-View | 5% X-Sub | 10% X-View | 10% X-Sub | 20% X-View | 20% X-Sub | 40% X-View | 40% X-Sub
VAT [41] | 57.9 | 51.3 | 66.3 | 60.3 | 72.6 | 65.6 | 78.6 | 70.4
S4L [26] | 55.1 | 48.4 | 63.6 | 58.1 | 71.1 | 63.1 | 76.9 | 68.2
ASSL [25] | 63.6 | 57.3 | 69.8 | 64.3 | 74.7 | 68.0 | 80.0 | 72.3
MS2L [40] | - | - | - | 65.2 | - | - | - | -
Ours | 64.8 | 60.9 | 77.2 | 69.6 | 84.5 | 78.1 | 89.5 | 82.7
Table 2. Comparison of Top-1 accuracy with fully supervised methods on NTU-RGB + D dataset (bold text is best).
Methods | X-Sub (%) | X-View (%)
ST-GCN [13] | 81.5 | 88.3
HCN [34] | 86.5 | 91.1
PB-GCN [37] | 87.5 | 93.2
M-GCNs + VTDB [36] | 84.2 | 94.2
AS-GCN [15] | 86.8 | 94.2
2s-AGCN [20] | 88.2 | 94.9
BPLHM [22] | 84.5 | 91.1
CA-GCN [18] | 83.5 | 91.4
RA-GCN [38] | 85.9 | 93.5
PGCN-TCA [35] | 88.0 | 93.6
(Ours) | 88.7 | 94.6
Table 3. Comparison of Top-1 and Top-5 accuracies with fully supervised methods on the Kinetics-skeleton dataset (bold text is best).
Methods | Top-1 (%) | Top-5 (%)
ST-GCN [13] | 30.7 | 52.8
AS-GCN [15] | 34.8 | 56.5
2s-AGCN [20] | 35.9 | 58.6
BPLHM [22] | 33.4 | 56.2
CA-GCN [18] | 34.1 | 56.6
Tripool [39] | 34.1 | 56.2
(Ours) | 35.9 | 58.8
Table 4. Comparison of Top-1 and Top-5 accuracy on Kinetics-skeleton dataset with different number of SIFMs (bold text is best).
Module Number | Top-1 (%) | Top-5 (%)
1 | 30.7 | 52.8
2 | 34.8 | 56.5
3 | 35.7 | 58.6
4 | 35.9 | 58.8
5 | 33.4 | 56.2
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Gou, R.; Yang, W.; Luo, Z.; Yuan, Y.; Li, A. Tohjm-Trained Multiscale Spatial Temporal Graph Convolutional Neural Network for Semi-Supervised Skeletal Action Recognition. Electronics 2022, 11, 3498. https://doi.org/10.3390/electronics11213498

