MEST: An Action Recognition Network with Motion Encoder and Spatio-Temporal Module

As a sub-field of video content analysis, action recognition, which aims to recognize human actions in videos, has received extensive attention in recent years. Compared with a single image, a video has a temporal dimension; therefore, extracting spatio-temporal information from videos is of great significance for action recognition. In this paper, an efficient network that extracts spatio-temporal information with a relatively low computational load (dubbed MEST) is proposed. First, a motion encoder is developed to capture short-term motion cues between consecutive frames, followed by a channel-wise spatio-temporal module to model long-term feature information. Moreover, the weight standardization method is applied to the convolution layers followed by batch normalization layers to expedite the training process and facilitate convergence. Experiments are conducted on five public action recognition datasets, Something-Something-V1 and -V2, Jester, UCF101 and HMDB51, where MEST exhibits competitive performance compared with other popular methods. The results demonstrate the effectiveness of our network in terms of accuracy, computational cost and network scale.


Introduction
Videos contain richer information than single images, including temporal correlations and motion cues between adjacent frames. As a result, temporal modeling becomes a critical step in video action recognition [1]. With the booming development of deep learning, convolutional neural networks (CNNs) have achieved astonishing success in image classification due to their powerful feature learning and reasoning abilities. Albeit effective on images, they cannot be directly applied to time-series signals (e.g., video). To remedy this deficiency, various works have been published to exploit the signal in the temporal dimension, which can roughly be divided into three categories: two-stream architectures, 3D CNNs and their variants, and 2D CNNs with temporal modules.
As the name implies, a typical two-stream ConvNet architecture has two streams, a spatial stream and a temporal stream: the former distills appearance information from individual RGB frames, while the latter takes optical flow as input to extract motion information. Both streams are implemented with deep networks, and the final result is a fusion of the two. It turns out that appearance and motion information can be effectively integrated by the two-stream architecture [2]. However, the calculation of dense optical flows across adjacent frames in a video sequence is computationally heavy, and end-to-end action recognition cannot be implemented on a two-stream structure [2]. In this light, 3D CNNs were developed to capture both appearance and temporal information from videos at the same time [3]. Unfortunately, 3D CNNs are extremely difficult to train, due to their large number of parameters, overfitting and slow convergence issues, making them hard to deploy on ordinary hardware platforms. In view of this, a lightweight architecture is required to avoid heavy computation. Recently, various works have used 2D CNNs as backbones with additional temporal modeling modules; Tran et al. [4], for example, factorized 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution. Following this line of work, this paper proposes MEST, whose main contributions are as follows: (1) A motion encoder (abbreviated as ME) is presented to capture motion cues across frames without using pixel-level optical flows as additional input. (2) A spatio-temporal module (dubbed SAT) is developed to extract and aggregate long-term spatio-temporal information. The SAT module consists of a channel-wise temporal convolution followed by a 2D spatial convolution, which is used to replace the 3D convolution blocks. (3) The weight standardization (WS) method is applied to the convolution layers in MEST followed by batch normalization (BN) to speed up the training process and convergence.
The rest of the paper is organized as follows: related works are introduced in Section 2. Our method is described in Section 3. The experimental results with ablation studies are shown in Section 4 with thorough analysis. A final conclusion is drawn in Section 5.

Related Works
Deep learning has played a dominant role in the field of action recognition in recent years [12]. Simonyan et al. [2] laid a solid foundation for the two-stream architecture, which trains single RGB frames and dense optical flows separately with CNNs to produce a weighted average score. The significance of the two-stream architecture lies in its exploration of motion information between adjacent frames. To construct long-range temporal features, Wang et al. [11] proposed a temporal segment network (TSN) with a sparse sampling strategy to extract short snippets over long videos, which enables efficient learning through entire videos without limiting the sequence length. Although it achieved great performance, it relied heavily on pre-computed optical flows.
Another strategy is to use 3D CNNs and 3D pooling to extract spatio-temporal features. Wang et al. [13] developed an OF-PCANet method for micro-expression recognition using a spatio-temporal feature learning strategy based on optical flow sequences. Tran et al. [14] learned both appearance and dynamic features via the 3D convolution operation (C3D), an improvement achieved through an end-to-end architecture. However, the number of parameters involved was extremely large compared with 2D convolutions. Furthermore, it was difficult to train on small datasets such as UCF101, suffering from over-fitting. Under such circumstances, Carreira et al. [9] created a large-scale dataset (Kinetics) to pre-train 3D models. Tran et al. [15] advocated Res3D, which has fewer parameters than C3D with even better results. Huang et al. [16] outlined a differential residual model along with a new loss function to model the movement of eyes. In addition, Qiu et al. [17] introduced Pseudo-3D residual networks, decomposing a 3 × 3 × 3 convolution into a 1 × 3 × 3 filter and a 3 × 1 × 1 filter, which is applicable to spatio-temporal tasks. Dong et al. [18] put forward two types of 3D CNNs (residual and attention residual) to improve the performance of existing 3D CNNs.
Although 3D convolution-based methods obtain good performance, they are computationally heavy. ECO [8] pointed out the major limitations of existing methods and provided a novel structure combining historical long-term content with a sampling strategy to exploit the spatio-temporal relations between adjacent frames. Zhang et al. [19] eliminated camera noise in motion using a novel recurrent attention neural network architecture. Zhang et al. [20] mitigated the impact of noisy samples using an auto-augmented Siamese neural network (ASNet). Guo et al. [21] analyzed the multimodal optimization problems in action recognition and presented a dual-ensemble class imbalance learning method, in which two ensemble learning models are nested with each other. Ji et al. [5] came up with a generic and effective module that shifts part of the channels along the temporal dimension for information exchange among successive frames. It achieved comparable performance to 3D CNNs while maintaining the complexity of 2D CNNs. GST [7] put forth the idea of grouped convolutions in designing an efficient architecture that separates hierarchical features across channels. It took advantage of 3D convolutions to extract features while keeping the cost as low as 2D CNNs. Partially motivated by these works, 2D CNNs are used as the backbone of our network. However, unlike most existing methods (which only capture certain kinds of temporal information), both short-term and long-term temporal information are extracted and combined with motion cues.
As discussed earlier, although successful in action recognition, deep networks are difficult to train. To facilitate training and convergence, we utilize batch normalization (BN) [22]. BN regulates certain distributions during training on the basis of data normalization and model initialization so as to avoid degeneracy of the normalization effects. It greatly improves the training process by performing normalization along the batch dimension. Nevertheless, the results of BN largely depend on the batch size: when the batch size decreases, the performance degrades dramatically. For this reason, layer normalization [23] transforms BN into layer normalization by computing the mean and variance from the summed inputs to the neurons in a layer during training, and instance normalization [24] implements BN for each sample individually. The above normalization methods are all activation-based. Instead, weight standardization (WS) [25] is utilized to further elevate the performance of BN. The reason is provided in Section 3.4 along with the implementation details.

Algorithm Description
In this section, we describe MEST with its core components in detail. The overall structure of MEST is shown in Figure 2 below. As explained in Section 1, instead of working on a single frame (or stacked frames), the input video is divided into T segments {S_1, S_2, ..., S_T} of equal length. Then, one frame from each segment is randomly selected to produce T frames in total via a sparse sampling strategy [11]. An initial prediction of the action category is obtained from each snippet through our network, and the video-level action prediction is generated based on a consensus over the series of snippets.
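The segment-based sparse sampling described above can be sketched in a few lines of pure Python (the function name is ours, and for simplicity the sketch assumes the frame count divides evenly into segments):

```python
import random

def sparse_sample(num_frames, num_segments):
    """Split a video of num_frames frames into num_segments equal segments
    and randomly pick one frame index from each segment (sparse sampling,
    as in TSN). Assumes num_frames is divisible by num_segments."""
    seg_len = num_frames // num_segments
    indices = []
    for s in range(num_segments):
        start = s * seg_len
        # one random frame from this segment
        indices.append(start + random.randrange(seg_len))
    return indices
```

Because one frame is drawn per segment regardless of video length, the cost of a forward pass is fixed at T frames while the samples still cover the whole video.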
Our network architecture consists of four main components: a temporal shift (TS) module, a spatial and temporal (SAT) module, a motion encoder (ME), and the WS method applied in each convolution layer, since spatial and temporal information as well as appearance features are all crucial for action recognition. ResNet-50 is chosen as the backbone for two reasons: on the one hand, the representation of spatio-temporal information is enhanced by the TS + ME + SAT structure; on the other hand, the original frame-wise appearance features are preserved by the residual structure. The 2nd to the 5th layers of ResNet-50 are all constructed in the same way, as shown on the right-hand side of Figure 2. First, the input features are processed by the TS module (followed by a 1 × 1 Conv layer) to complete the partial information exchange across frames. Then, the motion features are captured by the ME module. Next, a SAT module is attached to extract spatio-temporal information. Finally, the WS method is embedded in each convolution layer to expedite training and convergence.

Temporal Shift (TS) Module
TS shifts the feature map along the temporal dimension to realize information exchange between adjacent frames. The shift operation is illustrated in Figure 3, where a tensor contains C channels with T frames. The features extracted at different times are identified by different colors in each row. Along the temporal axis, we shift 1/8 of the channels by −1, shift another 1/8 by +1, and keep the remaining 3/4 unchanged. TS achieves a temporal modeling ability comparable to that of 3D CNNs, while maintaining the complexity of 2D CNNs.
The TS module is attached to the residual branch to reserve the features of the current frame without weakening the spatial learning ability of the original 2D CNN backbone.
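For intuition, the shift operation can be mimicked on plain Python lists, ignoring the batch and spatial dimensions (a toy sketch, not the actual framework implementation; zero padding at the temporal boundaries is assumed):

```python
def temporal_shift(x, fold_div=8):
    """Shift channels along time: the first C//fold_div channels take their
    value from the next frame (shift by -1), the next C//fold_div from the
    previous frame (shift by +1), and the rest stay unchanged.
    x: list of T frames, each a list of C channel values (spatial dims omitted)."""
    T, C = len(x), len(x[0])
    fold = C // fold_div
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            if c < fold:            # pull feature from the next frame
                out[t][c] = x[t + 1][c] if t + 1 < T else 0.0
            elif c < 2 * fold:      # pull feature from the previous frame
                out[t][c] = x[t - 1][c] if t - 1 >= 0 else 0.0
            else:                   # keep the current frame's feature
                out[t][c] = x[t][c]
    return out
```

Because only indices change and no weights are involved, the shift adds essentially zero FLOPs to the 2D backbone.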

Motion Encoder
Motions refer to the movement displacements across frames, reflecting the occurrence of actions. Previous methods depicted motion patterns in the form of optical flow, making the learning of motions independent of the spatio-temporal features [11,13]. Although these methods are proven effective, calculating optical flows from the sequence of images is extremely time consuming.
To alleviate this problem, the motion encoder (ME) is presented, the design intention of which comes from the fact that different channels carry distinct information, some of which tend to model background scenes, while other channels describe dynamic motion patterns. Because of this, it is beneficial to explore motion-sensitive channels.
The architecture of the proposed motion encoder is drawn in Figure 4a. The input feature is represented as a 5D tensor X ∈ R^(N×T×C×H×W). We follow the same squeeze-and-unsqueeze strategy as in the SAT module by placing two 1 × 1 2D convolution layers to obtain channel-wise information. After the squeeze operation, we obtain a feature X_1 ∈ R^(N×T×C/r×H×W), where r is a scaling factor (empirically, r = 16 in this module). The motion feature at time t is represented by the difference between adjacent frames X_1(t) and X_1(t + 1). Instead of subtracting the original features directly, we apply a channel-wise transformation to the feature vectors before subtraction, which is written as

F(t) = C_t(X_1(t + 1)) − X_1(t), 1 ≤ t ≤ T − 1

Here, C_t represents a 3 × 3 2D channel-wise convolution layer, implementing a transformation for each channel; X_1(t) refers to the squeezed input at time t; and F(t) ∈ R^(N×1×C/r×H×W) denotes the motion feature at time t. The motion features among adjacent frames are concatenated along the temporal dimension. Since no subsequent frame exists at the last moment, its motion feature is set to 0 (i.e., F(T) = 0), so the final motion matrix can be written as F = [F(1), F(2), ..., F(T − 1), 0]. Since our main purpose is to find and excite the motion-sensitive channels regardless of the detailed spatial layout, the motion feature F is processed by spatial average pooling:

F_s = AvgPool(F), F_s ∈ R^(N×T×C/r×1×1)

Next, a 1 × 1 2D convolution layer is applied to expand the channel dimension of the motion feature back to the original dimension C. The shape of the processed feature F_s' is [N, T, C, 1, 1], and we feed it to a Sigmoid activation function to obtain the mask M:

M = Sigmoid(F_s')

Finally, a residual connection is utilized to preserve the original background information:

Y = X + X ⊙ M

where X is the input, Y is the final output of this module with dimension [N, T, C, H, W], and ⊙ stands for channel-wise multiplication.
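The core idea of ME (excite the channels whose activations change across frames) can be illustrated with a toy pure-Python sketch. The 1 × 1 squeeze/unsqueeze convolutions, the 3 × 3 channel-wise transform and the spatial pooling of the actual module are omitted here, and each frame is reduced to one value per channel:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def motion_excitation(x):
    """Toy sketch of motion excitation.
    x: list of T frames, each a list of C channel activations."""
    T, C = len(x), len(x[0])
    # forward differences approximate motion between adjacent frames
    F = [[x[t + 1][c] - x[t][c] for c in range(C)] for t in range(T - 1)]
    F.append([0.0] * C)  # the motion feature at the last moment is zero
    # squash the motion response into a per-(frame, channel) mask
    M = [[sigmoid(F[t][c]) for c in range(C)] for t in range(T)]
    # residual excitation: Y = X + X * M, channel-wise
    return [[x[t][c] * (1.0 + M[t][c]) for c in range(C)] for t in range(T)]
```

Channels whose values change between frames receive a mask closer to 1 and are amplified, while static (background-like) channels are only mildly boosted and, through the residual term, never suppressed to zero.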

(2 + 1)D Spatial and Temporal (SAT) Module
The (2 + 1)D SAT module is designed to learn rich spatial and temporal features by focusing on the main part of action interactions (rather than the background or other objects). Although stacking 3D convolution blocks in a deep structure is an effective means for temporal modeling, the computational cost increases rapidly. Instead of using 3D convolution blocks, the SAT module captures both temporal and spatial information using decoupled 1D + 2D blocks, which simulate the function of 3D convolution with much less computation. Figure 4b shows the detailed structure of SAT. Suppose that the input feature X is written as a tensor [N, T, C, H, W], where N denotes the batch size, T the temporal dimension, C the number of channels, and H and W the resolution of the input. The channel number of the input tensor is first squeezed by a scale ratio δ, where δ equals the channel number C of the input tensor X, i.e., the feature is squeezed to a single channel:

Y = C_1(X), Y ∈ R^(N×T×1×H×W)

where C_1 is a 2D convolution layer with kernel size 1 × 1 and Y is its output. Next, the feature Y is reshaped to Y* ∈ R^(NHW×1×T) for temporal encoding. Since the semantic information varies across channels, a temporal convolution C_2 with kernel size 3 is utilized to characterize the temporal information of the channel-wise features:

Z = C_2(Y*), Z ∈ R^(NHW×1×T)

Then, we reshape Z into Z* ∈ R^(N×T×1×H×W) and model the spatial information using a 2D convolution C_3 with kernel size 3 × 3:

Z_s = C_3(Z*), Z_s ∈ R^(N×T×1×H×W)

Finally, a 1 × 1 2D convolution C_4 is utilized to unsqueeze the number of channels, yielding the spatio-temporal feature X_st = C_4(Z_s) ∈ R^(N×T×C×H×W). We implant this operation inside a residual block, and the final output is expressed as

X_o = X + X_st (8)
In (8), the original frame-level representation and the enhanced spatio-temporal feature are combined via a residual connection. Compared with a standard 3D convolution operation, SAT is computationally more efficient, since the channel dimension of the input feature is reduced by the 1 × 1 convolution and the 3D convolution is decomposed into a 1D temporal convolution followed by a 2D spatial convolution. By inserting the SAT module into ResNet-50, the network is capable of learning long-term spatio-temporal features by focusing on the main part of an action. The encoding of motion cues and the representation of spatio-temporal information are merged into a unified structure through the integration of the ME and SAT modules. Different combinations of ME and SAT are studied and analyzed in Section 4.4 below.
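A rough, back-of-the-envelope way to see the savings of the squeezed (2 + 1)D decomposition over a plain 3D convolution is to count the weights (biases and the surrounding backbone layers are ignored; the function names below are ours, for illustration):

```python
def sat_params(c):
    """Weight count of the SAT block for c input/output channels:
    1x1 squeeze (c -> 1), kernel-3 temporal conv (1 channel),
    3x3 spatial conv (1 channel), 1x1 unsqueeze (1 -> c)."""
    return c * 1 + 3 * 1 + 3 * 3 * 1 + 1 * c   # = 2*c + 12

def conv3d_params(c):
    """Weight count of a plain 3x3x3 3D convolution with c input
    and c output channels."""
    return c * c * 27
```

For c = 64 this gives 140 weights for SAT against 110,592 for the 3D convolution, which is the kind of gap that lets the module be inserted throughout ResNet-50 at near-2D cost.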

Application of Weight Standardization in Convolution Layer
It was revealed that batch normalization (BN) has a huge impact on network training by making the landscape of the optimization problem smoother. As a result, many existing works adopted BN for faster training and better convergence. In particular, BN stabilizes the training process by restricting the 1st and 2nd moments of the distribution of the outputs in each mini-batch, which is beneficial for training deep structures.
In the meantime, it was observed that BN smooths the landscape with respect to the activations, instead of acting on the weights directly [26]. In view of this, the weight standardization (WS) method is employed in our network to further smooth the landscape. As shown in Figure 5 below, for each convolution layer (with n filters), a new set of filters is created by the WS method (i.e., a normalized convolution layer), which directly standardizes the weights of the convolution layers. Given an input image, a feature map is produced using the new set of filters. Then a batch normalization (BN) operation is performed to yield the normalized feature map, which is sent to the activation function. Here, the WS method is used to adjust the weights of the filters, while BN is used to process the input data to the activation function. They are deployed in different places and are complementary in expediting convergence.
A standard convolution layer with zero bias is written as

Y_0 = W * X_0, W ∈ R^(O×I) (9)

where X_0 and Y_0 denote the input and output features, respectively. W denotes the weights in the convolution layer, where I represents the number of input channels within the kernel region of each output channel and O stands for the number of output channels. Instead of optimizing the loss on the weights Ŵ used in the convolution directly, WS re-parameterizes Ŵ as a function of W and optimizes the loss on W using stochastic gradient descent (SGD). The relation between Ŵ and W is established as

Ŵ_{i,j} = (W_{i,j} − μ_{W_i,·}) / σ_{W_i,·}

where μ_{W_i,·} is the mean and σ_{W_i,·} the standard deviation of the weights W_{i,·} of the i-th output channel. As a result, WS restricts the 1st and 2nd moments of the weights of each output channel in the convolution layers, and standardizes the weights via gradient normalization during back-propagation (BP). Considering that BN normalizes the convolution layers again, we do not implement any affine transformations on Ŵ, which would slow down training. Instead, we only insert WS into convolution layers that are followed by BN layers during training.
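The re-parameterization itself is straightforward; a minimal pure-Python sketch over the flattened per-output-channel weights (an eps term is added for numerical stability, as is common in practice) might look like:

```python
def weight_standardize(w, eps=1e-5):
    """Standardize each output channel's weights to zero mean and
    (near) unit variance.
    w: list of O rows, each row the flattened weights (length I)
    of one output channel."""
    out = []
    for row in w:
        mu = sum(row) / len(row)
        var = sum((v - mu) ** 2 for v in row) / len(row)
        sigma = (var + eps) ** 0.5
        out.append([(v - mu) / sigma for v in row])
    return out
```

In a real network this transform is applied to the kernel on every forward pass, while the gradients flow back through it to the underlying weights W.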

The Unique Features of MEST
The overall architecture is illustrated in Figure 2 in Section 3. We adopt the ImageNet pre-trained ResNet-50 as the backbone, followed by the proposed motion encoder and spatio-temporal modules, and utilize the WS method to boost training and convergence.
In general, our work has the following distinct points: (1) Existing methods either capture simple motion cues or process appearance features and motion information separately, which results in less satisfactory performance. In comparison, we extract rich frame-wise appearance features and spatio-temporal information with the designed TS + ME + SAT structure and combine them into a unified representation. (2) MEST is a 2D structure with limited computational cost, which does not involve any 3D convolutions or optical flow computation. (3) Some works adopt the batch normalization method during training, but its performance is affected by the batch size. We make use of the weight standardization (WS) method to achieve faster training and better convergence.

Experiment and Analysis
In this section, extensive experiments are carried out on five popular public datasets. First, the datasets and implementation details are introduced. Then, we compare our results with other popular methods. After that, ablation studies are conducted to verify the effectiveness of each module in MEST. Finally, comparative and visualization results are presented to validate our design.

Datasets
We evaluate the performance of our network on three time-related datasets, Something-Something-V1 and -V2 [27] and Jester [28], and two scene-related datasets, UCF101 [29] and HMDB51 [30]. The number of categories and samples of each dataset are listed in Table 1 below. Something-Something-V1 [27] is a large-scale labeled video dataset recording human actions in daily life. It consists of 108,499 videos covering 174 fine-grained actions. We divide the dataset into a training set (86,017 videos), a validation set (11,522 videos) and a test set (10,960 videos) following the official guideline. The general performance is reported on the validation set.
Something-Something-V2 [27] is the second release of V1 with updates in four aspects: (1) The number of videos is expanded to 220,847; (2) V2 provides object annotations in the training and validation sets. For example, V2 annotates the action as "Putting an apple on the table", instead of "Putting [something] on [something]" as in V1. In total, there are 30,408 objects with 318,572 annotations; (3) A crowd-sourcing method is used in V2 to test the video quality and verify the correct answers for each video; (4) The height of each video is increased to 240 px (it was only 100 px in V1).
Jester [28] is a third-person-view gesture dataset with potential use for human-computer interaction. It has 27 categories, with 118,562 training videos, 14,787 validation videos and 14,743 testing videos. UCF101 [29] is a classic action recognition dataset. It contains 13,320 video sequences from YouTube, covering a total of 101 categories.
HMDB51 [30] contains 6766 videos covering a total of 51 action categories. Most of the videos in HMDB51 are taken in real scenes and include a large number of facial and limb movements.
Both UCF101 and HMDB51 are scene-related datasets, wherein most of the actions can be inferred from the background information, for instance, brushing teeth or applying makeup. However, a strong temporal modeling ability is required to recognize the actions in the time-related datasets (Something-Something-V1 and -V2 and Jester), since some symmetrical actions cannot be identified from an individual frame (e.g., "pushing something from left to right" vs. "pushing something from right to left").
In Figure 6, the label for the first row is "closing the dishwasher". However, if we reverse the order of the frames, the action changes to "opening the dishwasher". Therefore, the recognition results on the time-related datasets strongly reflect the temporal modeling ability of our model and demonstrate the efficiency of our method. Our research mainly focuses on these time-related datasets, though our method is also effective on scene-related datasets.

Implementation Details
We utilize ResNet-50 as the backbone in our experiments and sample 8 or 16 frames from each video following the sparse sampling strategy. The length of the short side of the frames is fixed to 256. Both corner cropping and random scaling are employed for data augmentation during training, and each cropped region is finally resized to a patch of 224 × 224. For Something-Something-V1 and -V2, we pre-train our model on ImageNet. The batch size is 22, and the initial learning rate is set to 0.01 (60 epochs in total, decayed by 0.1 at epochs 20, 40 and 50). We train our network using the SGD algorithm with weight decay 1 × 10−4 and dropout 0.5. For Jester, the batch size is 22, and the initial learning rate is set to 0.01 (30 epochs in total, decayed by 0.1 at epochs 10, 20 and 25) with weight decay 1 × 10−4 and dropout 0.5. For UCF101 and HMDB51, we pre-train our model on Kinetics-400. The batch size is 22 and the initial learning rate is 0.001 (30 epochs in total, decayed by 0.1 at epochs 10, 20 and 25) with weight decay 1 × 10−4. Considering that UCF101 and HMDB51 are small-scale datasets, the dropout rate is set to 0.8 to prevent over-fitting. We train our model on an NVIDIA RTX 3090 GPU.
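The sparse sampling strategy mentioned above follows the TSN convention: each video is split into equal-length segments and one frame is taken per segment, at a random offset during training and at the segment center for testing. A minimal sketch (function and variable names are our own, not from the paper's code):

```python
import random

def sparse_sample(num_video_frames, num_segments=8, training=True, rng=random):
    """Return one frame index per segment of the video.

    Training picks a random frame inside each segment; evaluation picks
    the center frame, making the sampling deterministic.
    """
    seg_len = num_video_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(seg_len * s)
        end = max(start, int(seg_len * (s + 1)) - 1)  # inclusive segment bounds
        if training:
            indices.append(rng.randint(start, end))
        else:
            indices.append((start + end) // 2)
    return indices

# Deterministic center sampling: 8 indices spread evenly over a 64-frame video.
idx = sparse_sample(64, num_segments=8, training=False)
```

Because one frame is drawn per segment rather than per consecutive window, 8 or 16 sampled frames still cover the whole duration of the video.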
Theoretically, accuracy and computational cost are the main concerns for an action recognition task. However, different application scenarios have different priorities. When accuracy is the primary concern, we follow the configuration suggested by [10], sampling two clips per video and using full-resolution input with shorter side 256 for evaluation. Conversely, when computational cost is the main concern, we use only one clip per video and the center 224 × 224 patch for evaluation.
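When more than one clip per video is sampled, a common protocol, and the one we assume here, is to average the per-class scores over the clips and then take the arg-max as the video-level prediction. A small sketch:

```python
def average_clip_scores(clip_scores):
    """Fuse per-clip class scores into one video-level prediction.

    `clip_scores` is a list of clips, each a list of per-class scores.
    Returns the index of the class with the highest averaged score.
    """
    num_classes = len(clip_scores[0])
    avg = [sum(clip[c] for clip in clip_scores) / len(clip_scores)
           for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: avg[c])

# Two clips disagree; averaging their scores settles the prediction on class 1.
pred = average_clip_scores([[0.6, 0.4], [0.1, 0.9]])
```

This is why the 2-clip protocol trades extra FLOPs for accuracy: each additional clip gives the classifier another temporal view of the same video before the decision is made.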

Experiment Results and Comparison with the State-of-the-Art Methods
In this section, we compare the performance of our network with TSN [11] and TSM [5] on all three time-related datasets (Something-Something-V1 and -V2 and Jester). For a fair comparison, all three methods use eight frames as input, sample one clip per video and use center 224 × 224 crops for evaluation. Moreover, we also compare MEST with TSN and TSM on the scene-related datasets (UCF101 and HMDB51) using 16 frames as input.
As shown in Table 2 below, TSN yields poor results due to its lack of temporal modeling ability. TSM extracts temporal information from the sequential signal but is weak in explicit temporal modeling, while our network exceeds both in spatio-temporal and motion modeling by a large margin. Compared with TSM, MEST improves Top-1 accuracy by 2.2% on Something-Something-V1, 1.0% on Something-Something-V2 and 0.9% on Jester, respectively. On the scene-related datasets, MEST also improves Top-1 accuracy over TSM by 0.9% on UCF101 and 2.7% on HMDB51.
Next, we compare our method with the state-of-the-art methods on the five time-related and scene-related datasets. Five metrics are used to measure overall performance: Top-1 and Top-5 accuracy, the number of frames required, FLOPs and network scale. The number of frames used by each method is reported together with the corresponding computational load (FLOPs) and model size. As expected, sampling more frames increases accuracy but inevitably increases FLOPs as well.
Table 3 summarizes the results on Something-Something-V1, covering both 2D CNN and 3D CNN methods. We report the performance of our method for 8-frame input, 16-frame input and their ensemble. First, we compare MEST with 2D-CNN-based baselines [13,31] that use late fusion for long-range temporal modeling; the comparative results show that MEST goes far beyond these baselines on Something-Something-V1. Then, we compare with recent 3D-CNN-based methods, including I3D [9] and non-local I3D [10]. MEST yields better results than the 3D-CNN-based models with much lower computational cost and a smaller model size; in particular, it achieves much higher accuracy than non-local I3D [10] with far fewer FLOPs on the validation set.
Finally, we compare with other methods with temporal modules (e.g., TSM [5] and TANet [32]), and MEST again outshines them in terms of accuracy and computational complexity, which validates the effectiveness of our design. ECO [8] uses an early-2D + late-3D architecture to realize medium-level temporal fusion; MEST outperforms ECO by a large margin with a much smaller model size. For instance, MEST achieves 47.8% Top-1 accuracy using only 8 frames as input, which is still 1.4% higher than ECO with 92 frames as input. TSM [5], TANet [32] and SmallBigNet [33] obtain 49.7%, 50.6% and 50.4% Top-1 accuracy, respectively, when they combine the results of the 8-frame and 16-frame inputs, while MEST obtains a better performance of 52.8% Top-1 accuracy.
Table 4 summarizes the results on Something-Something-V2 in comparison with state-of-the-art methods. Using only eight frames as input, our model produces 60.1% Top-1 accuracy, which again surpasses other methods sampling eight or more frames as input. With 16 frames as input, the accuracy increases to 61.3%. Finally, the ensemble of the 8-frame and 16-frame inputs achieves 64.1% Top-1 accuracy. Table 5 shows the performance on Jester, where MEST obtains 96.6% Top-1 accuracy with 8-frame input using 2 clips per video for evaluation.
Finally, we report the comparison results on the scene-related datasets HMDB51 and UCF101 in Table 6. MEST achieves 96.8% and 73.4% Top-1 accuracy on UCF101 and HMDB51, respectively. These two datasets are relatively small and prone to over-fitting, so we pre-train MEST on Kinetics-400 and transfer the learned models to UCF101 and HMDB51. We compare MEST with previous state-of-the-art methods, including the 2D baseline TSN, the 3D CNNs C3D and P3D and other temporal modeling methods [5,31], and MEST shows impressive results.
The reason behind these results is three-fold: firstly, MEST extracts diverse and abundant spatio-temporal information, whereas many existing methods merely capture appearance features; secondly, MEST explores motion-sensitive channels to describe motion patterns; thirdly, MEST merges these two types of information into a unified framework to achieve better performance than previous methods.
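For reference, the Top-1/Top-5 accuracy reported throughout these tables counts a sample as correct when its true label appears among the k highest-scoring classes. A minimal sketch of the metric (names are illustrative):

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    correct = 0
    for sample_scores, label in zip(scores, labels):
        # indices of the k largest scores, highest first
        topk = sorted(range(len(sample_scores)),
                      key=lambda c: sample_scores[c], reverse=True)[:k]
        correct += label in topk
    return correct / len(labels)

# Toy example: the first sample is classified correctly, the second is not,
# but the second sample's true label is its runner-up class.
scores = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
labels = [0, 1]
top1 = topk_accuracy(scores, labels, k=1)  # 0.5
top2 = topk_accuracy(scores, labels, k=2)  # 1.0
```

This also explains why Top-5 accuracy is always at least as high as Top-1: enlarging k can only admit more correct predictions.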

Ablation Study
In this section, we perform several ablation studies on the Something-Something-V1 and Jester datasets to verify the effectiveness of each module in MEST. We apply the 1-clip and center crop testing method for evaluation and report the Top-1 result, where eight frames are sampled from each video of the training set as input to the network.
Investigating the functions of the two modules and the WS method: To validate the contribution of each component (the SAT module, the ME module and the WS method) in our network, the results of individual modules and combinations of modules are listed and compared in Table 7. The contribution of each module to the overall performance is quite obvious. Here, we choose ResNet-50 with the temporal-shift module as the baseline, and our method achieves better performance than this baseline. Specifically, with only the SAT module, we obtain 45.9% Top-1 accuracy on Something-Something-V1 and 95.1% on Jester, at the cost of 0.19 M additional parameters; with only the ME module, we achieve 46.0% on Something-Something-V1 and 95.6% on Jester with an additional 0.85 M parameters. With ME + SAT, the accuracy increases to 46.6% and 96.2% with 25.73 M parameters. Finally, with the WS method, the accuracy is further increased to 47.8% and 96.6%, respectively, which verifies its functionality.
Investigating different ways of integrating the two modules: In this section, ME and SAT are deployed in both parallel and serial ways (Modes 1 to 4) to test their combined effectiveness: (1) First, the SAT and ME modules are integrated via an element-wise addition and deployed after the first Conv1 × 1 of each bottleneck layer (shown in Figure 7a); (2) Second, the SAT and ME modules are again integrated via element-wise addition, but appended after the Conv3 × 3 (shown in Figure 7b); (3) Third, the SAT module is placed in each bottleneck layer (after the first Conv1 × 1) and the ME module is appended after the Conv3 × 3 of each bottleneck layer (shown in Figure 7c); (4) Finally, the positions of SAT and ME are swapped on the basis of Mode 3 to create Mode 4 (shown in Figure 7d). Experimental results of the above four modes are listed and compared in Table 8 below. It turns out that we obtain better performance by connecting the modules in a serial way. In particular, by placing ME before SAT, we achieve the highest Top-1 accuracy. We believe the reason is that ME calculates short-term frame-wise motion cues, which are more suitable for the early stage of the network, while SAT excels in long-term temporal modeling, which is more suitable for late-stage processing. Testifying the impact of WS in facilitating training and convergence: We propose the WS method (working together with BN) to accelerate the training process and expedite convergence.
In this section, we use Something-Something-V1 as the dataset, sampling eight frames as the network input, and compare the training loss over iterations between the network with and without WS to demonstrate the effect of the WS method. The results are shown in Figure 8 below. The blue curve shows the loss values without WS, while the orange curve shows the values with WS. The orange curve decreases faster than the blue curve, and its value at the 55th epoch is lower than that of the blue curve. Since the loss value directly reflects the progress of training, we conclude that WS indeed facilitates training and convergence.
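Returning to the integration ablation above, the parallel and serial strategies compared in Table 8 can be sketched schematically. Here `me` and `sat` are toy stand-ins (our own placeholders, not the actual modules), used only to show how the two data paths are combined:

```python
def parallel_block(x, me, sat):
    """Modes with element-wise addition: ME and SAT see the same input
    and their outputs are fused by summation."""
    return [a + b for a, b in zip(me(x), sat(x))]

def serial_block(x, me, sat):
    """The serial ordering the ablation found best: ME first to capture
    short-term motion cues, then SAT for long-term temporal modeling."""
    return sat(me(x))

# Toy stand-ins for the real modules (assumptions for illustration only):
me = lambda x: [v * 2 for v in x]
sat = lambda x: [v + 1 for v in x]
serial = serial_block([1.0, 2.0], me, sat)      # -> [3.0, 5.0]
parallel = parallel_block([1.0, 2.0], me, sat)  # -> [4.0, 7.0]
```

The structural difference is the key point: in the serial design the SAT module operates on ME's output rather than on the raw features, which is what lets short-term motion cues feed the long-term temporal model.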


Confusion Matrix of the Proposed Method
Confusion matrix is a standard format for accuracy evaluation. Figure 9 shows the confusion matrix created by MEST on Jester. Each column of the confusion matrix represents the predicted label of a video, and each row represents its true label. Since Jester has 27 categories, we number them 1~27; for example, number 21 stands for "thumbs up" and number 22 stands for "thumbs down". The values on the diagonal of the matrix represent the proportion of video samples that are correctly classified; therefore, the more predictions fall on the diagonal, the better the recognition performance. Almost all categories on Jester are correctly classified, which indicates the strong discrimination ability of MEST. Figure 10 shows the visualization results on Something-Something-V1 produced by Grad-CAM [36]. For simplicity, we only generate the activation maps for the center frames, taking eight frames as the network input. The left column shows the raw frames sampled from the videos, the middle column shows the activation results of TSM, and the right column displays our results.
As shown in Figure 10, the activation maps reflect the fact that TSM (the baseline) simply focuses on the objects, while MEST precisely focuses on the motion-related regions and the interaction between human hands and objects, owing to the strong temporal modeling ability of the TS + ME + SAT structure.
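A row-normalized confusion matrix like the one in Figure 9 can be computed as follows (a minimal sketch; names are illustrative):

```python
def confusion_matrix(true_labels, pred_labels, num_classes):
    """Row-normalized confusion matrix: rows are true classes, columns are
    predictions; entry [i][j] is the fraction of class-i samples predicted as j."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Guard against classes with no samples to avoid dividing by zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Three samples of class 0 (one misclassified as class 1) and one of class 1:
cm = confusion_matrix([0, 0, 0, 1], [0, 0, 1, 1], num_classes=2)
```

With this normalization each row sums to 1, so the diagonal entries directly read as per-class recall, which is why a strong diagonal in Figure 9 indicates good discrimination.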

Conclusions
Aiming at the extraction of both spatio-temporal and appearance information for an efficient action recognition network, we propose MEST in this paper. First, a (2 + 1)D spatial and temporal module (called SAT) is proposed by factorizing the 3D block into a 1D + 2D structure to circumvent heavy computation. Then, a motion encoder (ME) is employed to capture motion cues, which are integrated with the spatio-temporal information into a unified framework. Finally, in addition to the BN used by some previous methods during training, the WS method is utilized to further improve training and convergence. Our network is simple yet efficient and does not involve any 3D blocks or optical flow operations. Extensive experiments were conducted on five mainstream datasets to compare the overall performance of our approach with other popular methods. The reported results validate the efficacy of our network in reaching high accuracy while maintaining low computational cost.
Nevertheless, some shortcomings remain to be solved. First, the sparse sampling algorithm could be improved: instead of randomly selecting one frame from each segment, we should highlight the frames with changing motions. Additionally, despite no 3D blocks being used, our network structure is still riddled with redundancies. In the future, we will pursue network pruning to create more lightweight models for action recognition.