Whole-Body Keypoint and Skeleton Augmented RGB Networks for Video Action Recognition

Abstract: Incorporating multi-modality data is an effective way to improve action recognition performance. Based on this idea, we investigate a new data modality in which Whole-Body Keypoint and Skeleton (WKS) labels are used to capture refined body information. Unlike directly aggregated multi-modality approaches, we leverage distillation to adapt an RGB network to classify actions with the feature-extraction ability of the WKS network, while it is fed only with RGB clips. Inspired by the success of transformers for vision tasks, we design an architecture that takes advantage of both three-dimensional (3D) convolutional neural networks (CNNs) and the Swin transformer to extract spatiotemporal features, resulting in advanced performance. Furthermore, considering the unequal discrimination among clips of a video, we also present a new method for aggregating the clip-level classification results, further improving the performance. The experimental results demonstrate that our framework achieves an advanced accuracy of 93.4% with only RGB input on the UCF-101 dataset.


Introduction
Action recognition has attracted a lot of attention because videos contain more complicated information than individual images [1][2][3][4][5][6][7]. Since the development of 3D CNNs [1] and two-stream CNNs [2], increasingly advanced deep learning methods have been proposed. In addition to direct operations on RGB frames, multi-modality features have also been employed for classification, such as optical flow [2], skeletons [8], motion vectors [9], and motion history images [10]. The application of multi-modality data is considered an effective measure for performance improvement.
Optical flow is the most common method for obtaining motion information and improving recognition accuracy. However, computing optical flow is too expensive for real-time applications, as it occupies most of the operation time [11]. Therefore, multi-modality data [9,12-15] are employed to replace the computationally expensive optical flow. For example, Shou, Lin [11] presented the discriminative motion cue (DMC) representation to reduce noise in motion vector estimation, and Chaudhary and Murala [13] proposed a Weber motion history image method for obtaining temporal features. Skeleton-based action recognition has also been used to obtain robust features; based on this approach, Chen et al. [16] proposed channel-wise topology refinement graph convolution (CTR-GC), which is able to adaptively study various topologies and aggregate joint features, achieving remarkable results.
In this work, we investigate a new data modality in which images are labeled with Whole-Body Keypoint and Skeleton (WKS) information, including the skeleton and keypoints of the face, hands, and body, as shown in Figure 1. The marked samples increase the brightness of the keypoints and skeleton and reduce the brightness of the background, thus increasing the contrast between the body and the background. In addition, irrelevant information such as clothing and appearance is further filtered out, making the model well-suited for classifying actions that require a focus on poses, such as the categories FrontCrawl and BreastStroke. Most advanced methods generally utilize multi-modality data as a separate stream, and the final results are obtained by aggregating the multi-modality results with the RGB stream. In our work, we explore using the information from the WKS stream to train the RGB network, enabling it to possess the body-information-capturing ability of the WKS stream. For this purpose, we first train a high-performing WKS network, which can make decisions based on human pose information. This means that the high-level features in the WKS network are produced with attention to body information. Therefore, if we take these high-level features as a teacher to instruct the learning process of the RGB network, the RGB network will be guided to classify actions with a focus on body information. The knowledge extraction is based on distillation [18]. To the best of our knowledge, this study is the first to transfer WKS information to the RGB stream for action recognition.
The shifted windows (Swin) transformer [19] has demonstrated great potential for vision tasks, achieving state-of-the-art performance in image classification, semantic segmentation, and object detection. However, it is difficult to extract motion information using the conventional 2D Swin transformer. To address this problem, we explore a novel method that takes advantage of both 3D CNNs and the Swin transformer.
Tran et al. [20] demonstrated that temporal modeling is a type of bottom-level operation; it can be extracted by 3D convolution at the bottom level, and plane calculations can be performed at a high level without accuracy reduction. Based on this theory, we use 3D convolution to extract spatiotemporal features in the bottom layers, then concatenate the temporal features into a spatial form that a transformer can process. Finally, the results are obtained by the Swin transformer, which operates on the produced features in the top layers. The structure of CNNs leads them to focus only on local information, making it hard to capture and store long-distance dependencies, while the self-attention mechanism in the transformer can effectively make up for this weakness. Therefore, the high-level spatiotemporal features produced by 3D CNNs can be well analyzed by a Swin transformer. To the best of our knowledge, this is the first work to attempt to combine the advantages of 3D CNNs and a transformer.
Deep learning-based methods often require expensive hardware resources, and it is infeasible to feed the model with a whole video due to limited computational resources. In our framework, we adopt a clip-based architecture to keep memory consumption manageable. This architecture aggregates clip-level results into video-level results, and an effective aggregation function is crucial for accurate classification. The most common aggregation techniques, such as average pooling and max pooling, are simple and data-independent. They are not well-suited for evaluating the unequal discrimination of each clip. Some more complicated aggregation methods [21][22][23][24] have also been proposed; for example, in study [23], a recurrent neural network (RNN) was designed to yield video-level scores. However, the confidence of clip-level results is not well considered in these methods. Thus, we introduce a new aggregation function called the confidence-weighting aggregation function (CWAF). We aggregate clip-level prediction results by analyzing the confidence of each result to determine the weights, finally improving the accuracy by approximately 0.5% compared to average pooling.
The contributions of this study are summarized as follows:
1. We introduce a novel framework that utilizes WKS-labeled images to train the RGB network to possess body-information-capturing ability. The knowledge transfer is based on the concept of distillation. The evaluations show that this new data modality effectively improves the recognition ability of the RGB network. As far as we know, this is the first work to transfer body-concerned feature-extraction ability to an RGB network.
2. We explore a novel architecture that concatenates 3D convolution and the Swin transformer, fully taking advantage of both architectures.
3. By analyzing the confidence of clip-level results, we design an aggregation method that assigns more rational weights to each clip output.
We organize our paper as follows. Related works are discussed in Section 2. In Section 3, we describe the details of our method. In Section 4, experiments and analysis on popular action recognition datasets are presented. Finally, the conclusion and discussion are presented in Section 5.

Related Works
This section presents the prior work related to ours; recent action classification methods are discussed first, then distillation strategies and aggregation functions related to our work are described, respectively.
Recent Action Recognition. In contrast to traditional hand-crafted action recognition methods [25][26][27], deep learning strategies, which have dominated the field of action recognition, have excellent modeling capacity and are capable of learning in an end-to-end manner [28]. Deep learning methods can be divided into two categories: 3D CNN frameworks and multi-stream frameworks.
Simonyan and Zisserman [2] proposed a two-stream CNN framework that employs precomputed optical flow to extract temporal information. Subsequently, Feichtenhofer et al. [29] improved the performance of the two-stream method by introducing a different fusion strategy. Shou, Lin [11] introduced the DMC representation to reduce noise in motion vectors, extracting motion information as an alternative to optical flow. Three-dimensional CNNs [1,30] have been employed as another efficient tool for modeling temporal information [21,31-33]. Additional convolution-based methods have also been introduced [20,34], such as R(2 + 1)D [20], which replaces 3D convolutions with separate spatial and temporal convolutions. Furthermore, a number of state-of-the-art strategies [3,5,32] have developed frameworks that take advantage of both mechanisms to achieve the best performance; however, although state-of-the-art accuracies have been achieved, this combination is computationally intensive.
In addition to this architectural research, some action recognition methods [35][36][37][38] attempt to extract more refined motion features with object detection, object tracking, and pose detection methods. Emerging advanced tracking and detection methods [39][40][41] make this possible. For example, Cao, Simon [39] proposed an advanced approach to detect the 2D poses of multiple people by associating body parts; Dewi, Chen [40] presented a YOLOv4-based high-performance detection framework; and Wachinger, Toews [41] introduced a novel whole-body segmentation method. With the help of these methods, more discriminative features are obtained, contributing to accurate action recognition. For example, Brehar, Muresan [35] proposed a novel framework in which classification is performed with information on pedestrian motion, distance, and position, and the final results are obtained by a Long Short-Term Memory-based model aggregating temporal features. Yan, Hu [36] proposed a real-time human rehabilitation action recognition method based on human pose, which fuses OpenPose with a Kalman filter to track human targets. Verma, Meenpal [37] improved the performance of multi-person interaction recognition by extracting distance and angular relation features based on body keypoints. Based on a pose detection algorithm, Pandurevic, Draga [38] developed a motion sequence analysis method to help train speed climbing athletes.
Distillation. The concept of distillation, first proposed by Hinton [18], is a training strategy that takes a complex model as a teacher to train a simple one. The category probabilities of the teacher model, called "soft targets", are the most common medium for transferring knowledge. The training of our method is based on this concept.
After distillation was proposed, generalized distillation [42] was also introduced; it was designed based on distillation and privileged information [43]. In the field of action classification, several works have proposed to utilize distillation. Garcia et al. [44] developed a hallucination network that applies both depth and RGB features during the learning process, while at inference the classifier is fed only with RGB clips. Garcia et al. [45] proposed the DMCL framework to leverage the complementary information of multiple modalities. Crasto, Weinzaepfel [31] introduced a framework that distills the knowledge of the flow stream into an RGB network, achieving state-of-the-art accuracy for the one-stream action recognition task. Our proposed framework is inspired by these advanced methods.
Aggregation function. The aggregation function is an important module for completing video-level classification and can directly influence the final results. Several approaches have been proposed to better utilize clip-level results. Kalfaoglu, Kalkan [21] concatenated a bidirectional encoder representations from transformers (BERT) layer at the end of a 3D convolutional architecture for aggregation, achieving promising results. In study [23], an RNN was designed to yield video-level scores considering all clip recognition results. Addressing the defect that linear weighting schemes lack consideration of features, Wang, Xiong [22] proposed an adaptive weighting method to automatically assign weights to clip-level results. Wang and Cherian [24] introduced the concept of a positive bag and a negative bag to find useful features. In our approach, confidence is judged by analyzing the form of the category probabilities, and the weight of each clip-level result is then determined by its confidence score.
Different from current detection-based or tracking-based action recognition methods, we attempt to guide the RGB network to pay attention to body information without an extra detection or tracking process. We integrate this ability into the RGB network using the distillation method. At inference, only RGB clips are taken for classification. Furthermore, as current clip-based action recognition methods lack consideration of the confidence differences among clip-level results, we directly take clip-level results to measure the discrimination of clips and, therefore, obtain a more rational aggregation function.

Proposed Method
We propose a framework to train an RGB network to learn the ability of a WKS network. An overview of this procedure is shown in Figure 2. In Section 3.1, we formally define the video-level and clip-level classification, then discuss the clip-level training and prediction course of our framework. The Swin transformer-based network architecture is described in Section 3.2. We describe the detailed training strategy in Section 3.3, and discuss the proposed CWAF in Section 3.4.

Overview of Clip-Level Training and Prediction
Directly feeding the classifier with a whole video is infeasible due to limited hardware resources. Therefore, common strategies divide the video v into a set of clips {c^(1), c^(2), c^(3), . . . , c^(i), . . . }, and the classifier is fed with clips instead of the whole video. Given a clip classifier g, the clip-level prediction can be denoted as g(c^(i); W), and the video-level prediction can be formulated as follows:

f(v) = H(g(c^(1); W); g(c^(2); W); . . . ; g(c^(n); W)),  (1)

where g(c^(i); W) represents the classifier g with weights W. We can infer from Equation (1) that the classification result of the given video is obtained by aggregating the clip-level results through the function H. The clip-level training and prediction course is shown in Figure 2. The classifier f operates on fixed-length clips of F frames with spatial resolution H × W, and predicts classification probabilities.
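As an illustration, the clip-splitting and aggregation pipeline of Equation (1) can be sketched as follows. This is a minimal sketch: the helper names, the clip length, and the use of average pooling for H are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def split_into_clips(video, clip_len=16):
    """Divide a video of shape (T, H, W, 3) into fixed-length clips
    of clip_len frames, dropping any trailing remainder."""
    n = len(video) // clip_len
    return [video[i * clip_len:(i + 1) * clip_len] for i in range(n)]

def video_prediction(video, clip_classifier, aggregate=np.mean):
    """Equation (1): apply the clip classifier g to every clip and
    aggregate the clip-level probabilities with a function H
    (average pooling here, as a simple stand-in)."""
    clip_probs = np.stack([clip_classifier(c) for c in split_into_clips(video)])
    return aggregate(clip_probs, axis=0)
```

Any data-dependent aggregation function, such as the CWAF, can be substituted for the `aggregate` argument without changing the clip-level pipeline.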
Our goal is to train the RGB classifier f : R^(F×3×H×W) → [0, 1]^C to possess a more robust classification ability by focusing on body information when fed with clips containing a noisy background. For this purpose, we first obtain a well-trained classifier f_WKS on the WKS-labeled dataset, which is then used as the teacher network for training the RGB network based on the distillation strategy. We use f_WKS to indicate the feature map of the target distillation layer; it is localized at a high layer of the classifier f_WKS and produced with full attention to body information. Normally, the extraction course from the input RGB frames to f_WKS would consist of two steps: locating the keypoints of the input frames, and applying the classification function f_WKS. In our method, this course is greatly simplified by using distillation, and the high-level features f_WKS can be produced directly from RGB frames. Similarly, we use f_distill to indicate the feature map of the distillation layer in the RGB network. During the training process, body-information-capturing ability is transferred to the RGB network as f_distill gradually approximates f_WKS. Finally, at inference, our RGB network can distinguish actions with attention to body information and does not require any pose detection process.
Moreover, transformers lack some of the inductive biases inherent to CNNs, and therefore do not generalize well when trained on small datasets [46]. Distillation is an effective method for improving the performance of transformers [47]. Based on these considerations, our proposed transformer-based framework will further benefit from this distillation training method.

Swin Transformer-Based RGB Network
The Swin transformer is a powerful framework for imaging tasks, but it has no ability to extract temporal features from videos. It first divides the input image of size H × W × 3 into non-overlapping patches of size H/4 × W/4 × 48; this part is referred to as "patch partition" in study [19]. Next, a linear projection is applied, and the input matrix is shifted to a sequence that is processed by a multi-head self-attention mechanism. To extract temporal information, we utilize a 3D convolutional module before the self-attention mechanism, yielding the module architecture shown in Figure 3. It is fed with fixed-length clips of F RGB frames with spatial resolution H × W. We set F to 16; thus, after two down-sampling operations, four feature maps {f_1, f_2, f_3, f_4} are produced in the temporal dimension.

During inference, two 3D convolution operations are performed on the input clip, and the temporal dimension is finally reduced to 1/4 of its original size. The convolution kernel sizes in the temporal dimension are 7 and 3, respectively. Thus, the receptive field in the temporal dimension of the final output feature map is of size 11. This means that the concatenated feature map f_W can access all of the temporal information.
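The receptive-field arithmetic above can be checked with a short sketch. The stride values are an assumption consistent with the two down-sampling operations described; the formula is the standard receptive-field rule for stacked convolutions.

```python
def temporal_receptive_field(kernels, strides):
    """Receptive field of stacked temporal convolutions:
    start at 1 and grow by (k - 1) times the cumulative stride."""
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= s             # later layers step over `jump` input frames
    return rf

# Kernels 7 and 3 in the temporal dimension, each followed by a 2x
# temporal down-sampling: 16 input frames -> 4 feature maps, and a
# final temporal receptive field of 11 frames, as stated in the text.
print(temporal_receptive_field([7, 3], [2, 2]))  # -> 11
```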
It was demonstrated in study [19] that the computed objects of the self-attention mechanism in the Swin transformer are unlike those in ViT [46]; its calculations are performed within a window. The self-attention computation in a window is defined as follows:

Attention(Q, K, V) = SoftMax(QK^T/√d + B)V,  (2)

where Q, K, V ∈ R^(M²×d) are the query, key, and value matrices; d is the query/key dimension; M² is the number of patches in a window; and B is the relative position bias. Q, K, and V are obtained from a linear function operating on the pixels in a window, and we can infer from Equation (2) that the spatial features in the window are fully compared, without the receptive-field constraints that occur in convolutional operations. The Swin transformer utilizes a window architecture to reduce the computational complexity, but this architecture lacks computation between windows. Therefore, a method that shifts the window at the next layer was proposed; by continuously changing the position of the window, features in different areas can be connected and compared. Benefitting from this architecture, in our situation, the temporal differences among {f_1, f_2, f_3, f_4} can be effectively compared and captured. In addition, because the size of the feature map is reduced in the top layers, the receptive field of each window is relatively expanded, so that the temporal features are more comprehensively extracted. The complete architecture of our proposed framework is illustrated in Figure 2, in which the produced features f_W are fed into the network formed by the stacked Swin transformer blocks and patch merging [19]. Considering that the features of the top layers represent high-level global information [31], we place the distillation layer f_distill after the last Swin transformer block. The logit obtained by a fully connected layer over f_distill is then converted by the SoftMax function, and the final category probabilities are obtained.
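A minimal numpy sketch of the windowed self-attention in Equation (2); the relative position bias is passed in as an optional argument, and all names are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def window_attention(q, k, v, bias=None):
    """Self-attention within one window: q, k, v have shape (M*M, d);
    bias, if given, is the relative position bias of shape (M*M, M*M)."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)  # scaled dot-product scores
    if bias is not None:
        logits = logits + bias
    return softmax(logits) @ v     # attention-weighted sum of values
```

With zero queries the attention weights are uniform, so each output row equals the mean of the value rows, which gives a quick sanity check of the computation.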

Training Strategy
The distillation strategy was first proposed by Hinton et al. [18], and aims to transfer knowledge from a cumbersome pretrained network to a simple lightweight network. We explore how to distill the prior knowledge of a teacher model into a student model at the target layer. As shown in Figure 2, f_distill denotes the distillation layer in the RGB classifier f; action recognition ability is transferred from f_WKS to f by applying a mean squared error (MSE) loss between f_WKS and f_distill:

L_WKS = (1/N) Σ_(i=1)^N ||f^i_distill − f^i_WKS||²,  (3)

where f^i_distill is obtained from f fed with the i-th RGB clip, and the same clip marked with WKS is fed to f_WKS, producing f^i_WKS. As L_WKS gradually decreases, f_distill approximates f_WKS, and accordingly, the feature extraction of f becomes more similar to that of f_WKS.
Although the well-trained WKS model can provide knowledge learned from refined body information, it can also introduce noise interference, such as improper, low-discriminative features, which can lead to confused recognition. Based on this consideration, we introduce a loss function to select the backpropagation path, which uses the following loss to update the parameters:

L = λ_1 L_WKS + λ_2 L_CE(f(c^(i)), ŷ).  (4)

In Equation (4), the RGB model f is fed with clip c^(i), ŷ is the ground truth label, and we use cross-entropy L_CE as our classification loss. Meanwhile, λ_1 and λ_2 adjust the weights of the two loss functions, and λ_1 is computed as follows:

λ_1 = θ_1 if f_WKS(c^(i)) = ŷ, and λ_1 = 0 otherwise.  (5)

We can infer from Equation (5) that only when the teacher model recognizes the action successfully, namely f_WKS(c^(i)) = ŷ, do the features of the teacher model participate in the backpropagation course, through the parameter θ_1. The updating of the parameters connected to the distillation layer f_distill can be described as follows:

W′_d = W_d − η ∂(λ_1 L_WKS + λ_2 L_CE(φ(f_distill), ŷ))/∂W_d,  (6)

where W_d are the parameters before updating, η is the learning rate, and the function mapping from the distillation layer to the final output is represented as φ : f_distill → f(c^(i)). As observed from Equation (6), the parameters are modified by L_WKS only when λ_1 > 0; in this case, f_WKS recognizes the action correctly and the target features are considered sufficiently discriminative. With this method, we can effectively prevent the model from learning noisy information.
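The gated training objective of Equations (3)-(5) can be sketched per clip as follows; this is a numpy illustration, and the array names and per-clip formulation are assumptions rather than the paper's exact implementation.

```python
import numpy as np

def gated_distill_loss(f_distill, f_wks, probs_rgb, probs_wks, label,
                       theta1=0.5, lam2=1.0):
    """Per-clip loss of Equation (4): cross-entropy plus MSE distillation,
    where lambda_1 follows Equation (5) -- the teacher's features join
    backpropagation only when the WKS teacher classifies the clip
    correctly."""
    lam1 = theta1 if np.argmax(probs_wks) == label else 0.0
    l_wks = np.mean((f_distill - f_wks) ** 2)   # MSE of Equation (3)
    l_ce = -np.log(probs_rgb[label] + 1e-12)    # cross-entropy loss
    return lam1 * l_wks + lam2 * l_ce
```

When the teacher misclassifies the clip, the loss reduces to the plain classification term, so noisy teacher features never flow back into the student.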

Confidence-Weighting Aggregation Function
We introduce the CWAF, a simple but effective method to adaptively assign weights to each output according to its confidence. The inspiration comes from the observation that more credible input is expected to produce more distinct probabilities over classes. A comparison of the outputs when the classifier makes a correct or an incorrect classification is shown in Figure 4. These examples were chosen from the videos v_Biking_g02_c02 and v_Haircut_g02_c04, which are classified correctly and incorrectly, respectively. We can easily see that the category probabilities corresponding to a correct recognition exhibit minor fluctuations, resulting in a more reliable output.

To assign a weight to each clip-level result, we first need to design an estimator h(f(c^(i))) ∈ [0, 1] to score the confidence; thus, we propose a confidence scoring network. The confidence of an output f(c^(i)) is mainly determined by the relative difference between the highest class score and the others. To remove the disturbance from the absolute positions of the values, we directly sort the output values in descending order, keeping only the relative difference information.
We designed the confidence scoring network to be composed of multiple layers of a fully connected neural network. Several architectures have been examined regarding the balance of speed and accuracy, and a simple four-layer network has been found to be completely competent for this task.
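A minimal sketch of such a scoring network is given below. The layer widths and random initialization are illustrative assumptions (the paper only specifies a four-layer fully connected network); the essential operations are sorting the class scores in descending order and mapping them through the MLP to a sigmoid-bounded confidence in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    # sizes, e.g. [101, 64, 32, 16, 1] -> four fully connected layers
    # (weights would be learned in practice; random here for illustration)
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def confidence_score(class_scores, params):
    # sort descending: keep only relative differences, not class identity
    x = np.sort(np.asarray(class_scores, dtype=float))[::-1]
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)           # ReLU on hidden layers
    return 1.0 / (1.0 + np.exp(-x[0]))       # sigmoid -> confidence in [0, 1]
```

Because of the sorting step, the score is invariant to which class holds the highest probability; only the shape of the score distribution matters.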
For a video consisting of N clips {c^(1), c^(2), c^(3), . . . , c^(N)}, we estimate each output f(c^(i)) with the confidence scoring network and produce scores {s_1, s_2, s_3, . . . , s_N} representing the confidence of the clip-level results. The video-level result is obtained by F(v) = Σ_{i=1}^{N} w_i f(c^(i)), where w_i is the weight of each clip-level result. We aim to assign higher weights to more confident clip-level results and to enlarge the margin between the weights of lower-ranked and higher-ranked clip results. Therefore, a quadratic nonlinear function is applied to shape the weight distribution: w_i = ((s_i − min_j s_j) / (max_j s_j − min_j s_j))^2, where the max and min functions take the maximum and minimum values of the score sequence, respectively. With this quadratic nonlinear function, the weights of lower-ranked and higher-ranked clip results differ by a larger margin. For training the confidence scoring network, we introduce a new dataset D, which captures the relationship between the classifier output f(c^(i)) and the prediction outcome. It consists of M samples {(f(c^(i)), y^(i))}_{i=1}^{M}, where y^(i) is a scalar denoting whether the classifier derives the correct result when fed with c^(i): y^(i) = 1 if the prediction on c^(i) is correct, and y^(i) = 0 otherwise. Note that the form of the clip-level result f(c^(i)) cannot strictly indicate whether the clip is predicted correctly; that is, no form is definitively conclusive. However, it can still roughly help discriminate the reliable results.
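The aggregation step above can be sketched directly. The quadratic min-max weighting follows the description in the text; the normalization of the weights to sum to one and the handling of the degenerate all-equal-scores case are assumptions for a self-contained example.

```python
import numpy as np

def cwaf_aggregate(clip_outputs, scores):
    """Confidence-Weighting Aggregation Function (sketch).
    clip_outputs: (N, C) clip-level class scores f(c^(i))
    scores:       (N,) confidence scores s_i from the scoring network
    Quadratic weighting: w_i = ((s_i - min s) / (max s - min s))**2
    """
    s = np.asarray(scores, dtype=float)
    spread = s.max() - s.min()
    if spread > 0:
        w = ((s - s.min()) / spread) ** 2
    else:
        w = np.ones_like(s)               # all clips equally confident
    w = w / w.sum()                       # normalize (assumption)
    return (np.asarray(clip_outputs, dtype=float) * w[:, None]).sum(axis=0)
```

A clip whose confidence equals the minimum receives weight zero, so a single noisy clip cannot drag down an otherwise confident video-level prediction.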

Experimental Results
This section presents the experimental results of our proposed methods. The details of the datasets are introduced first, followed by the experimental details. Next, we report the distillation evaluation results of the proposed framework on the UCF-101 dataset of human actions [48] and the HMDB-51 human motion database [49]. The influence of the CWAF is discussed in Section 4.4. Lastly, we compare our results with some advanced methods.

Datasets
Three datasets are used in our experiments: UCF-101, HMDB-51, and Kinetics-400 [50]. The UCF-101 dataset collects 101 action categories. These categories are very diverse, covering single-person to group actions, half-body to whole-body motions, and variations in body scale, motion speed, and camera viewpoint. The HMDB-51 dataset contains 6766 videos covering 51 categories, with videos drawn from the real world and from various sources, such as social websites and movies. Kinetics-400 collects 400 action categories to provide enough samples for training models; each class has more than 400 videos, and its volume is much larger than that of UCF-101 and HMDB-51.
Directly training on UCF-101 or HMDB-51 leads to serious overfitting due to their relatively small scale. Thus, the Kinetics-400 dataset is used to pretrain the RGB network for transfer learning, but it is not used for evaluation due to limited hardware resources. Both UCF-101 and HMDB-51 provide three official splits for training and testing. Our experiments are primarily performed on the widely used first split of UCF-101 and HMDB-51.

Implementation Details
We designed our RGB network based on two Swin transformer architectures: Swin-S and Swin-B. Table 1 lists the detailed architectural parameters of our network. Both architectures are applied as RGB networks for training and evaluation. For the WKS teacher model, only the Swin-B-based network is considered for training and distillation. As the scales of UCF-101 and HMDB-51 are relatively small, insufficient data could lead to serious overfitting. Therefore, we first trained the two RGB networks on the Kinetics-400 dataset and then fine-tuned them on UCF-101 and HMDB-51, avoiding overfitting by learning from the large dataset first. We set the distillation layer at the fc layer in both the Swin-B-based and Swin-S-based networks. When the Swin-S-based network is trained, the dimensions of the distillation features do not match: f_WKS ∈ R^1024 versus f_distill ∈ R^768. Therefore, a linear layer is concatenated after the fc layer to connect the 768-dim layer to the 1024-dim distillation layer, as shown in Table 1.
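The dimension-matching step can be sketched as a single learned linear projection. This is an illustrative stand-in, not the paper's implementation: the variable names are hypothetical, and a real training setup would make `W_proj`/`b_proj` trainable parameters of the student network.

```python
import numpy as np

rng = np.random.default_rng(0)

# student fc feature (Swin-S based): 768-dim; teacher distillation layer: 1024-dim
W_proj = rng.standard_normal((768, 1024)) * 0.02   # learnable in practice
b_proj = np.zeros(1024)

def project_student_feature(f_student):
    # map the 768-dim student feature into the 1024-dim teacher feature space
    # so the feature-matching distillation loss can be computed element-wise
    return f_student @ W_proj + b_proj

f_distill = project_student_feature(rng.standard_normal(768))
```

After projection, the student feature has the same dimensionality as f_WKS, so the two can be compared directly by the distillation loss.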
To obtain a well-trained WKS model, we first generated a WKS dataset by annotating the UCF-101 and HMDB-51 datasets with MMPose [17]. The annotation process followed the official instructions. Note that not all samples in the UCF-101 and HMDB-51 datasets can be annotated well, so we filtered out noisy samples whose keypoints' mean confidence scores were below 0.4. At the training stage, only the qualified samples were used.
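The filtering rule above amounts to a simple mean-confidence threshold. A minimal sketch follows; the sample representation (a clip identifier paired with its per-keypoint confidence scores) is an assumption for illustration, not MMPose's actual output format.

```python
def filter_noisy_samples(samples, threshold=0.4):
    """Keep only samples whose keypoints' mean confidence score
    reaches the threshold (0.4 in the paper).
    samples: iterable of (clip_id, [keypoint confidence scores])."""
    kept = []
    for clip_id, confs in samples:
        if sum(confs) / len(confs) >= threshold:
            kept.append(clip_id)
    return kept
```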
For clip-level training and inference, to satisfy memory constraints, we divided each video into a series of 16-frame clips and spatially resized the frames to 112 × 112. At the training stage, several data augmentation techniques were applied to minimize overfitting. From the input frames, we first chose a base position and a crop scale, then cropped the input frame according to that base position and crop scale. The scale was randomly sampled from the set {1, 2^(−1/4), 1/√2, 2^(−3/4), 1/2}. Next, we randomly flipped half of the clips. These tricks all operate in the spatial domain; in the temporal domain, we augmented the data by randomly sampling 16 consecutive frames of each video. These tricks were applied when training both the RGB and WKS networks. We trained the network with a batch size of 64 for 250 epochs at both the pretraining and training stages. OpenCV-Python was used to convert videos to image sequences at 25 fps. We used the AdamW optimizer with the weight decay set to 1 × 10^−3 and the initial learning rate set to 0.001. The proposed method was implemented in PyTorch 1.2 with Python 3.6, and all experiments were executed on a Tesla K40 GPU and an E5-2620 CPU.
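The augmentation parameters described above can be sampled as follows. This is a sketch of the parameter sampling only (the actual cropping and flipping of pixel data is omitted), and the dictionary-based return format is an assumption for illustration.

```python
import random

# spatial crop scales from the paper: {1, 2^-1/4, 1/sqrt(2), 2^-3/4, 1/2}
CROP_SCALES = [1.0, 2 ** -0.25, 2 ** -0.5, 2 ** -0.75, 0.5]

def sample_augmentation(num_frames, clip_len=16, seed=None):
    """Pick random augmentation parameters for one training clip:
    a crop scale, a horizontal-flip decision (half of the clips),
    and a random window of 16 consecutive frames."""
    rng = random.Random(seed)
    scale = rng.choice(CROP_SCALES)                 # spatial crop scale
    flip = rng.random() < 0.5                       # flip ~half of clips
    start = rng.randint(0, num_frames - clip_len)   # temporal crop start
    return {"scale": scale, "flip": flip,
            "frames": range(start, start + clip_len)}
```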
For training the confidence scoring network, we introduced a new dataset D, which consists of clip-level outputs f(c^(i)). Note that all clips c^(i) come from the training dataset; no information from the testing dataset is introduced. Because the classifier obtains remarkable classification accuracy on the training dataset, negative samples for dataset D are quite scarce. To address this issue, we adopted data augmentation techniques to enlarge the set of negative samples. The tricks applied were similar to those used for the RGB network, and we also deliberately reduced the number of positive samples to balance the two categories.

Evaluation of Distillation
We denote the Swin-B-based network and the Swin-S-based network as 3D-ConSwin-B and 3D-ConSwin-S, respectively. The evaluation results on the UCF-101 and HMDB-51 datasets are listed in Table 2. All accuracies are obtained on split-1 of UCF-101 and HMDB-51. We compared the Swin transformer-based models to advanced 3D CNNs, such as 3D-ResNext-101 [51], 3D-DenseNet-121 [51], and 3D-ResNet-152 [51], and our network achieved competitive results. Note that all our models are pretrained on Kinetics-400 and fine-tuned on the target dataset. We observed that a single stream of the WKS network or the RGB network does not show remarkable recognition ability, but the combination of the two streams achieved an impressive accuracy of 94.1%. By utilizing the distillation method, the accuracy on UCF-101 is effectively boosted by approximately 2.1% and 1.4% for 3D-ConSwin-B and 3D-ConSwin-S, respectively. We conducted an experiment to explore how the distillation method changes the RGB network, as illustrated in Figure 5. Due to the "black box" characteristic of deep learning, it is uncertain which features are extracted for recognition, and the extracted features may not be discriminative for the target action. For example, in the video v_ApplyLipstick_g04_c01, the RGB network may take the face of a female as a key feature, leading to confusion between the Apply Lipstick and Apply Eye Makeup categories, as both contain a female face. However, if the hand pose and face keypoints are clearly marked, the network can capture the discriminative feature of the hand and lips overlapping and thus make a more precise classification. As observed in Figure 5, after the pose-capturing ability is distilled into the RGB network, categories that are easily confused by the RGB network, such as Blow Dry Hair and Brushing Teeth, or Nunchucks and Archery, can be better distinguished using pose features.
With distillation, the WKS-augmented RGB network is able to classify these categories more accurately.
A comparison of the confusion matrices is shown in Figure 6, which illustrates the confusion matrices of the twenty categories with the lowest accuracy on the UCF-101 dataset; 3D-ConSwin-B (RGB) and 3D-ConSwin-B (RGB) with distillation are presented at the top and the bottom, respectively. We observed that after distillation, some confused categories have improved accuracy. As the distilled RGB network can extract features like the WKS network, it focuses more on the pose and pose motion of the input clips. Categories that share similar backgrounds are easily confused, but with the benefit of WKS labeling, they can be classified more effectively. For example, the FrontCrawl category obtains an accuracy of only 45.9% with the RGB network, as it is easily confused with BreastStroke; both have swimming pools as their background. After distillation, the accuracy improves to 59.5% because the pose is represented more clearly.

Evaluation of CWAF
We explore the CWAF method to aggregate clip-level results more rationally. The results of this strategy are listed in Table 3. The video mean score metric indicates the average confidence score for a given video. Table 3 compares the accuracy and the proportion of videos whose mean scores fall in the intervals (0.8, 1], (0.6, 0.8], (0.4, 0.6], and [0, 0.4]. From this comparison, we see that a substantial portion of the videos obtained high confidence scores, meaning they are very likely to be classified successfully. The experiment also verifies that videos with mean scores above 0.8 are recognized almost entirely correctly, whereas videos containing several noisy clips often receive low scores and are misclassified. Some videos that cannot be validly classified by simple averaging can be successfully recognized by lowering the weights of the noisy clip-level results. This experiment also demonstrates that our confidence scoring network effectively assigns proper confidence scores to clip-level results; that is, videos with higher mean scores obtain higher accuracy.
The experimental results for the recognition accuracies of each category are shown in Figure 7. The red bars represent accuracy improved by our CWAF, and the green bars denote reduced accuracy. We sort the category-level accuracies from low to high, divide the results into two parts, and show the fifty categories with the lowest accuracies. For the categories in which 3D-ConSwin-B already makes correct classifications, no significant improvement is obtained from our aggregation function. However, we achieved good performance in the lower-accuracy categories, which can be observed in Figure 6; in these categories, weighting the clips has a more remarkable effect. During the training of the confidence scoring network, we introduced a new training set D composed of the large number of negative samples that 3D-ConSwin-B falsely classified on the UCF-101 training dataset. This ensures that our confidence scoring network achieves better results on the error-prone categories.

Comparison with Existing Methods
We compare our proposed approach with advanced methods on two challenging datasets, HMDB-51 and UCF-101. The experimental results are shown in Table 4, where we compare with both one-stream-based and multi-stream-based methods. One-stream-based methods adopt only RGB frames as input, while multi-stream-based methods operate on RGB frames plus stacked optical flows or other data. We denote the classifier without distillation and CWAF as 3D-ConSwin-B-base, and 3D-ConSwin-B-full represents the results obtained with the distillation and CWAF methods. Our framework achieves accuracies of 93.4% and 67.2% on UCF-101 and HMDB-51, respectively. Compared with existing one-stream methods, our method achieved competitive accuracy. Our models are pretrained on the Kinetics-400 dataset; among the compared methods, 3D-ResNet-152 is also pretrained on Kinetics. Our framework exceeds 3D-ResNet-152 by about 3.8%.
The multi-stream-based approaches achieve higher classification accuracy but also incur a higher time cost. According to the results in [31], the average per-video time cost of computing optical flow is approximately 130 times that of using the RGB input alone. We present results obtained by classifying only the input RGB clips; in contrast to the multi-stream-based approaches, our method saves a great deal of computational cost.

Conclusions and Discussion
We introduced a new data modality in which frames are marked with WKS labels, strengthening the distinction between the body and the background. We also introduced a novel framework that uses whole-body labeled images for model training and transfers this recognition ability to the RGB network. The experimental results show that the data provide more refined body information for classification, and by distilling the high-level semantic features based on this refined body information into the RGB model, the original RGB model can capture more comprehensive action information. Irrelevant information such as human clothing and appearance can be further filtered out, which makes the model well-suited for classifying actions that focus on pose. Categories that share a similar background, such as FrontCrawl and BreastStroke, can be classified more accurately by focusing on pose features.
Inspired by the success of transformers for vision tasks, we designed an architecture that takes advantage of both 3D CNNs and the Swin transformer to extract spatiotemporal features, which demonstrated advanced performance. By analyzing the confidence of the clip-level results, we designed an aggregation method that assigns more rational weights to each clip output. The experimental results demonstrate that our methods achieve favorable accuracy on the UCF-101 and HMDB-51 datasets. Of note, with limited hardware resources, we are currently unable to run our experiments on large datasets such as Kinetics; we will investigate the performance of our strategy on them in future work.

Conflicts of Interest:
The authors declare no conflict of interest.