Article

Hierarchical Motion Excitation Network for Few-Shot Video Recognition

1 School of Integrated Circuits and Electronics, Beijing Institute of Technology, Beijing 100081, China
2 Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing 314019, China
3 Beijing Institute of Technology Chongqing Center for Microelectronics and Microsystems, Chongqing 401332, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(5), 1090; https://doi.org/10.3390/electronics12051090
Submission received: 29 December 2022 / Revised: 16 February 2023 / Accepted: 21 February 2023 / Published: 22 February 2023

Abstract

Most existing deep learning algorithms are supervised and rely on a tremendous number of manually labeled samples. However, in most domains, the scarcity of samples or the excessive cost of labeling makes it impracticable to provide numerous labeled training samples to the network. In this paper, a few-shot video classification network termed Hierarchical Motion Excitation Network (HME-Net) is proposed from the perspective of accumulated feature-level motion information. An HME module composed of Motion Excitation (ME) and Interval Frame Motion Excitation (IFME) is designed to extract feature-level motion patterns from adjacent frames and interval frames. The HME module can discover and enhance the feature-level motion-sensitive information in the original features. The accumulative time window is expanded to four frames in a hierarchical manner, which increases the receptive field. Extensive experiments demonstrate that HME-Net consistently outperforms existing few-shot video classification models. On the UCF101 and HMDB51 datasets, our method establishes a new state of the art for the five-way three-shot and five-way five-shot settings of video recognition.

1. Introduction

Watching and sharing videos on social platforms has become an essential part of life for people worldwide, and better understanding and classifying videos has become an urgent issue. Such research can provide strong technical support for tasks such as personalized recommendation and video content review. In previous studies, a tremendous number of manually labeled video samples had to be provided to the neural network, and when confronted with novel classes, the network had to modify the existing model and retrain the classifier. However, in most real-life cases, labels are likely to be lacking because of insufficient samples or the high cost of manual labeling, and it is not realistic to provide a large number of labeled training samples to the network. Traditional network models therefore cannot complete the video classification task in such a case. Humans, by contrast, need only a few videos to complete the entire learning process. Consequently, it is of great significance to study how to perform effective few-shot action recognition under label-free conditions or with a small number of labels.
Deep learning is one of the standard methods for solving the few-shot problem [1]. Research work on few-shot learning originated in the 1990s. Later, the research boom in deep learning led scholars to apply neural networks to few-shot learning. Many deep learning networks for few-shot scenarios have been proposed. Much research work [2,3,4,5,6,7,8] has been conducted in the field of few-shot image classification, which can be transferred to few-shot video action recognition. The most classic few-shot image classification networks include Siamese Network [2], Matching Network [3], Prototype Network [4], and Relation Network [5]. However, since research on few-shot video classification is still in its infancy, it needs continuous exploration in the future.
The few-shot video action recognition pipeline can be roughly summarized as follows: a problem formulation and a network training mode specific to the few-shot setting are defined, the backbone network extracts the spatio-temporal features of the video, and the few-shot classifier then completes the recognition. Some basic problem formulations and classifiers proposed for few-shot image classification can be directly applied to few-shot video action recognition. The performance of few-shot video classification hinges on feature extraction, which mainly consists of extracting the appearance of video frames and the temporal relationships between them. Appearance feature extraction from video frames is indispensable and comparatively easy, while temporal feature extraction between video frames is more difficult. Temporal modeling can be divided, according to the time scale, into short-range temporal extraction between adjacent frames and long-range temporal extraction between distant frames.
The key to action recognition is the extraction of motion information from video. Depending on how it is extracted, motion information can be divided into three types: feature-level, frame-level, and video-level motion information. Existing few-shot video classification methods obtain motion information at the frame and video levels. Most state-of-the-art few-shot video classification algorithms feed video frames into a 2D ResNet backbone, with the last fully connected layer removed, to extract frame-level features of the video. The video sequences are then aligned by different modules, such as OTAM [9] and TRX [10], for classification. However, these networks do not consider feature-level motion representation between video frames; they only encode each frame in isolation, losing the short-range temporal relationship from the feature-level perspective.
Our research is inspired by the TEA [11] network. We propose a Hierarchical Motion Excitation (HME) module that extracts short-range motion information from the feature-level perspective. The targets of this feature-level extraction are 3D feature maps: the difference between the two 3D feature maps corresponding to two frames is computed and integrated along the channel dimension, forcing the network to discover and enhance feature-level motion-sensitive information in the original features.
To summarize, the main contributions of this paper are as follows:
(1) To the best of our knowledge, in the field of few-shot action recognition, we are the first to propose enhancing the captured short-range spatial and temporal features from the perspective of feature-level motion information.
(2) We propose the Hierarchical Motion Excitation (HME) module, composed of Motion Excitation and Interval Frame Motion Excitation. A hierarchical technique using a time window of four frames is used to obtain short-range feature-level motion information.
(3) We conducted extensive experiments on two widely used benchmark datasets, i.e., UCF101 and HMDB51. Our proposed HME allows us to establish a new state-of-the-art technique for the few-shot settings of five-way three-shot and five-way five-shot video recognition on UCF101 and HMDB51. These experiments demonstrate the effectiveness of our proposed framework.

2. Related Work

2.1. Two-Dimensional CNN-Based Video Classification

Using a 2D CNN backbone to extract frame-level features and temporal max-pooling/avg-pooling to obtain a representation of the entire video is the traditional method of extracting long-range temporal features. The authors of [12] use LSTM to model temporal relationships by aggregating 2D CNN features. In this type of method, temporal modeling is only performed on high-level semantic features, and the spatio-temporal correlations among low-level semantic features (corners or edges) are not fully utilized. The two-stream architecture [13] processes the RGB stream and the optical-flow stream separately. Optical flow extracts features at the pixel level, but optical-flow maps only capture short-range temporal features between adjacent frames, and long-range temporal features are only fused at the last layer. The well-known TSN [14,15] introduces a sparse temporal sampling strategy to capture long-range video features. Subsequent works use convolutional fusion [16], spatio-temporal attention mechanisms [17], and convolutional activation coding [18,19] to further extend the two-stream structure.
If each video frame is regarded as a single image and features are extracted only with a 2D CNN, the network ignores the dynamic temporal information of the video. Many scholars have therefore started to capture the interdependence of features from the channel perspective; typical networks include SENet [20], TEA [11], and ACTION-Net [21].

2.2. Three-Dimensional CNN-Based Video Classification

C3D [22] uses 3D convolution to process a local time window and learns video features from 16-frame clips on large-scale supervised video datasets. Subsequent works have proposed P3D [23], I3D [24], R(2+1)D [25], and the aggregated residual network [26], which further improve the performance of 3D CNNs. This kind of method mainly relies on the repeated stacking of local convolutions: while short-range temporal features are extracted, long-range temporal dependence can only be obtained indirectly. This process loses a lot of information, and the repeated operations make parameter optimization difficult. A connection between two distant frames can only be established after a large number of local operations, which causes vanishing gradients. Moreover, these methods require additional time and storage.

2.3. Few-Shot Video Classification

The purpose of few-shot learning is to use a small number of labeled samples to identify unseen concepts. Many studies [2,3,4,5,6,7,8] have been performed in the image field. These works can be transferred to research on few-shot video classification.
The earliest work includes the compound key–value memory network proposed in [27], which uses a multi-saliency embedding function for feature extraction. The researchers behind the CMN algorithm have more recently proposed a label-independent memory module [28]. Ref. [29] proposes the ProtoGAN framework, where class prototype vectors condition a conditional generative adversarial network to generate additional samples for the novel categories. The authors of [30] use an attention mechanism to achieve temporal alignment of videos and introduce Relation Network for similarity comparison, while the authors of [9] use dynamic programming to align two video sequences. Furthermore, ref. [31] was the first to introduce a jigsaw-based self-supervision mechanism on public few-shot action recognition datasets. Additionally, the authors of [1] combine a depth network and an RGB network into a two-stream network and use a depth-guided adaptive fusion module to fuse the RGB and depth features. The authors of [10] use the Transformer attention mechanism and propose a tuple-based temporal–relational crosstransformer, where frame features are permuted and combined to obtain information between video frames.
Inspired by TEA [11], we propose an HME module to extract feature-level motion information for few-shot video classification. The main difference between the HME module and ME of TEA is that the HME module can accumulate feature-level motion differences among multi-frames. The feature-level frame difference operation of the HME module is performed to add the dynamic temporal pattern. The feature-level motion attention mechanism can make the channel more focused on motion differences. The hierarchical technique can accumulate and expand the receptive field of the feature-level motion pattern.

3. Method

A few-shot video classification network termed Hierarchical Motion Excitation Network (HME-Net) is proposed from the perspective of accumulated feature-level motion information. Figure 1 shows the outline of HME-Net. A ResNet-50 [32] network and our Hierarchical Motion Excitation (HME) module compose the backbone, which can accumulate short-range multi-frame feature-level motion features. The accumulated time window is four frames. The HME module discovers and enhances feature-level motion-sensitive information in the original features. The classifier uses a combination of TRX [10] and Prototype Network [4] to align temporal sequences and complete few-shot action recognition.

3.1. Problem Formulation

HME-Net is implemented according to the principles of meta-learning and metric learning. The network learns meta-knowledge in the meta-training stage, which is used in the meta-testing stage to complete few-shot video classification. The full set D_all is divided into three sets: Training Meta-set D_tra, Validation Meta-set D_val, and Test Meta-set D_test. Each sub-set consists of a large number of non-intersecting video categories, with each corresponding to one action category, i.e.,
C_{tra} \cup C_{val} \cup C_{test} = C_{all}, \quad C_{tra} \cap C_{val} = C_{tra} \cap C_{test} = C_{val} \cap C_{test} = \varnothing,
where C_all, C_tra, C_val, and C_test represent the sample categories in D_all, D_tra, D_val, and D_test, respectively. Following the method of meta-learning, the processing procedure of the network is divided into two stages: meta-training and meta-testing. In the meta-training stage, the Training Meta-set is used to train our model, and the Validation Meta-set is used to validate it. In the meta-testing stage, the Test Meta-set is used to distinguish novel classes in the target domain.
This paper adopts the few-shot problem definition from [3] and the episode-based few-shot learning strategy from [4,27,33]. The problem is formulated as N-way K-shot classification. Each sub-set generates its own task set T, which consists of multiple task instances T_i. A task instance T_i consists of a support set and a query set and contains N video categories; the support set and the query set are denoted S and Q, respectively. Each video category in support set S has K videos, and each video category in query set Q has M videos. During the meta-testing stage, the network must learn to distinguish the N × M label-free videos in the query set with the help of the N × K labeled videos in the support set. Each episode is defined as one learning process, i.e., the network completes the few-shot classification of one task instance T_i. Support set S and query set Q are expressed as
S = \{(s_n^k, c_n^k) \mid c_n^k \in \hat{C},\ k \in [1, K],\ n \in [1, N]\}, \quad Q = \{(q_n^m, c_n^m) \mid c_n^m \in \hat{C},\ m \in [1, M],\ n \in [1, N]\}, \quad \hat{C} = \{c_n \mid n \in [1, N]\},
where s_n^k represents an example video and c_n^k represents the class of s_n^k; the same applies to q_n^m. Ĉ denotes all classes of the task, and M is the number of examples for each class in the query set. Following TRX [10], one video per class is sampled for the query set in each episode, and the query set of a task contains M videos per class.
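To make the episode construction concrete, the following Python sketch samples one N-way K-shot task instance T_i from a meta-set. The dictionary layout of the meta-set and the function name are illustrative assumptions, not part of the original implementation.

```python
import random

def sample_episode(meta_set, n_way=5, k_shot=3, m_query=1):
    """Sample one N-way K-shot task instance T_i from a meta-set.

    meta_set: dict mapping class label -> list of video paths (hypothetical layout).
    Returns a support set of N*K labeled videos and a query set of N*M videos.
    """
    classes = random.sample(list(meta_set.keys()), n_way)    # the N classes of this episode
    support, query = [], []
    for label in classes:
        videos = random.sample(meta_set[label], k_shot + m_query)
        support += [(v, label) for v in videos[:k_shot]]      # K labeled videos per class
        query += [(v, label) for v in videos[k_shot:]]        # M query videos per class
    return support, query
```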

3.2. Hierarchical Motion Excitation Module

We propose a Hierarchical Motion Excitation (HME) module with an accumulative time window of four frames to capture feature-level dynamic information between frames. The HME module consists of Motion Excitation [11] and Interval Frame Motion Excitation. They are used to extract the feature-level motion pattern of adjacent frames and interval frames, respectively. The backbone is obtained through the integration of the HME module and ResNet50, and the visible time window is four frames. The schematic illustration of the HME module is shown in Figure 2.
The sparse temporal sampling strategy of TSN [14] is used to extract an input sequence V_i = {v_1, v_2, ..., v_T} from the i-th video, where T is the number of sampled frames. Concretely, given the i-th video, we divide it into T segments {S_1, S_2, ..., S_T} of equal duration, and a frame sequence V_i = {v_1, v_2, ..., v_T} is generated by randomly sampling each video frame v_i from its corresponding segment S_i. This sampling method makes the time interval between frames in the sequence uncertain. For two video frames v_a and v_b with b > a, if b − a ≤ T/2, the two frames are considered to be at a short-range distance; if b − a > T/2, they are at a long-range distance. In our experiments, we selected a video sequence containing 8 frames; therefore, two video frames with b − a ≤ 4 are considered to be at a short-range distance. We define b − a as the frame order difference and complete the model design by increasing its value. We use the backbone to process each frame in the sequence individually. The feature F_in of V_i processed by the residual blocks from conv1_x to conv4_x of ResNet is
F_{in} = \phi_{conv4}(V_i), \quad F_{in} \in \mathbb{R}^{T \times C \times H \times W},
where T, C, H, and W denote temporal dimension, feature channel, height, and width, respectively. The feature is input into the HME module to obtain accumulative feature-level motion information.
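As an illustration of the sparse sampling described above, the sketch below picks one frame per segment in the TSN manner; the decoded-frame list and function name are assumptions used only for illustration.

```python
import random

def sparse_sample(frames, t_segments=8, train=True):
    """TSN-style sparse sampling: split the video into T equal segments and pick
    one frame per segment (random offset when training, centre frame at test time)."""
    n = len(frames)
    seg_len = n / t_segments
    indices = []
    for s in range(t_segments):
        start = int(s * seg_len)
        end = max(int((s + 1) * seg_len), start + 1)
        idx = random.randrange(start, end) if train else (start + end - 1) // 2
        indices.append(min(idx, n - 1))
    return [frames[i] for i in indices]
```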

3.2.1. Motion Excitation

ME and MTA were originally designed for the TEA block [11], which discards the pixel-level motion pattern and models motion information from a feature-level perspective. Our work differs from TEA, which places a whole block in every residual block: since we need to capture short-range temporal features at the feature level, the ME part is disassembled from TEA as the first level of HME and used once after conv4_x.
In the lower part of Figure 2, two 1 × 1 2D convolutional layers are used to achieve the squeezing and unsqueezing steps. The feature-level motion feature is modeled on the frame difference between adjacent squeezed features
F_{me\_diff} = \phi_{me\_conv3\times3}(F_{me\_seq}[t+1, :, :, :]) - F_{me\_seq}[t, :, :, :], \quad F_{me\_diff}, F_{me\_seq} \in \mathbb{R}^{T \times \frac{C}{r} \times H \times W},
where F_me_seq is the output feature of the 1 × 1 2D convolutional layer used for channel squeezing, F_me_diff represents the frame difference of the squeezed features between adjacent frames, r is the channel reduction ratio, and r = 16 is set according to TEA [11]. φ_me_conv3×3 is a 3 × 3 2D convolutional layer used to adjust the features of frame t + 1. The channel attention weight W_me ∈ R^{T×C×1×1} is obtained by successively applying average pooling, deconvolution, and an activation function to F_me_diff. The output feature of ME is
F_{me\_out} = W_{me} \odot F_{in} + F_{in}, \quad F_{me\_out} \in \mathbb{R}^{T \times C \times H \times W}.
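The following PyTorch sketch summarizes the ME computation under stated assumptions: the "deconvolution" step is realized as a 1 × 1 expansion convolution and the last frame of the difference sequence is zero-padded, following the TEA design; layer and variable names are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Sketch of the ME branch (used once after conv4_x), following the TEA-style
    formulation described in the text. Input and output shape: (N, T, C, H, W)."""
    def __init__(self, channels, r=16):
        super().__init__()
        cr = channels // r
        self.squeeze = nn.Conv2d(channels, cr, 1)     # 1x1 channel squeeze
        self.align = nn.Conv2d(cr, cr, 3, padding=1)  # 3x3 conv adjusting frame t+1
        self.expand = nn.Conv2d(cr, channels, 1)      # 1x1 expansion ("deconvolution" step, assumed)
        self.pool = nn.AdaptiveAvgPool2d(1)           # spatial global average pooling

    def forward(self, f_in):
        n, t, c, h, w = f_in.shape
        x = self.squeeze(f_in.reshape(n * t, c, h, w)).reshape(n, t, -1, h, w)
        nxt = self.align(x[:, 1:].reshape(n * (t - 1), -1, h, w)).reshape(n, t - 1, -1, h, w)
        diff = nxt - x[:, :-1]                                           # F_me_diff for frames 1..T-1
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)   # zero-pad the last frame
        attn = torch.sigmoid(self.expand(self.pool(diff.reshape(n * t, -1, h, w))))
        w_me = attn.reshape(n, t, c, 1, 1)                               # channel attention W_me
        return w_me * f_in + f_in                                        # excite motion channels, keep residual
```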

3.2.2. Interval Frame Motion Excitation

Our second-level Interval Frame Motion Excitation (IFME) has the ability to capture interval frame differences. The input feature F_ime_in ∈ R^{T × 2C × H/2 × W/2} of IFME is obtained by performing the residual calculation of conv5_x on F_me_out. The processing steps of IFME are shown in the upper part of Figure 2.
In order to reduce the complexity of the module, a 1 × 1 2D convolutional layer is used to compress the input channel, which can be represented as
F_{ime\_seq} = \phi_{ime\_sconv1\times1}(F_{ime\_in}), \quad F_{ime\_seq} \in \mathbb{R}^{T \times \frac{2C}{r} \times \frac{H}{2} \times \frac{W}{2}},
where φ_ime_sconv1×1 is a 1 × 1 2D convolutional layer used to squeeze the features of frame t, r is the channel reduction ratio, and r = 16 is set according to TEA [11].
The compressed features are segmented according to the time dimension. Feature-level motion differences are obtained by calculating the frame difference between the compressed features of the interval frames. Before the frame difference operation, a 3 × 3 2D convolution operation needs to be performed in order to align the channel features of t + 2 with t. The frame difference calculation of the interval frames is expressed as
F_{ime\_diff} = \phi_{ime\_conv3\times3}(F_{ime\_seq}[t+2, :, :, :]) - F_{ime\_seq}[t, :, :, :], \quad F_{ime\_diff} \in \mathbb{R}^{T \times \frac{2C}{r} \times \frac{H}{2} \times \frac{W}{2}}.
It should be noted that when t ∈ [T − 1, T], the index t + 2 > T falls outside the video sequence [1, T]. Therefore, we use the squeezed features of the last two frames to pad the time series, expanding the time dimension of the minuend to [1, T + 2].
Spatial global average pooling is performed on the obtained interval frame difference feature F i m e _ d i f f to access global spatial information
F_{ime\_gap} = \frac{1}{\frac{H}{2} \times \frac{W}{2}} \sum_{i=1}^{H/2} \sum_{j=1}^{W/2} F_{ime\_diff}[:, :, i, j], \quad F_{ime\_gap} \in \mathbb{R}^{T \times \frac{2C}{r} \times 1 \times 1}.
Then, the number of channels of F_ime_gap is re-expanded to 2C using a 1 × 1 2D convolutional layer. The output features W_ime are activated by a Sigmoid function σ to obtain feature-level motion-attentive weights; the steps are as follows:
W_{ime} = \sigma(\phi_{ime\_uconv1\times1}(F_{ime\_gap})), \quad W_{ime} \in \mathbb{R}^{T \times 2C \times 1 \times 1},
F_{out} = W_{ime} \odot F_{ime\_in} + F_{ime\_in}, \quad F_{out} \in \mathbb{R}^{T \times 2C \times \frac{H}{2} \times \frac{W}{2}}.
We again use the channel-wise multiplication and residual connection to obtain the output F_out of the HME module, since this not only activates and enhances the feature-level motion information but also retains the static background features.
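A corresponding PyTorch sketch of IFME is given below. The tail padding repeats the squeezed features of the last two frames, which is one reading of the padding described above; as with the ME sketch, the layer names and structure are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class IntervalFrameMotionExcitation(nn.Module):
    """Sketch of the IFME branch (used after conv5_x). Same structure as ME, but the
    frame difference is taken between frame t+2 and frame t.
    Input and output shape: (N, T, 2C, H/2, W/2)."""
    def __init__(self, channels, r=16):
        super().__init__()
        cr = channels // r
        self.squeeze = nn.Conv2d(channels, cr, 1)     # 1x1 channel squeeze
        self.align = nn.Conv2d(cr, cr, 3, padding=1)  # 3x3 conv aligning frame t+2 with frame t
        self.expand = nn.Conv2d(cr, channels, 1)      # 1x1 expansion back to 2C channels
        self.pool = nn.AdaptiveAvgPool2d(1)           # spatial global average pooling

    def forward(self, f_in):
        n, t, c, h, w = f_in.shape
        x = self.squeeze(f_in.reshape(n * t, c, h, w)).reshape(n, t, -1, h, w)
        x_pad = torch.cat([x, x[:, -2:]], dim=1)      # pad by repeating the last two squeezed frames
        nxt = self.align(x_pad[:, 2:].reshape(n * t, -1, h, w)).reshape(n, t, -1, h, w)
        diff = nxt - x                                # F_ime_diff between frame t+2 and frame t
        attn = torch.sigmoid(self.expand(self.pool(diff.reshape(n * t, -1, h, w))))
        w_ime = attn.reshape(n, t, c, 1, 1)           # channel attention W_ime
        return w_ime * f_in + f_in                    # residual connection keeps static background
```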

3.2.3. Module Aggregation for Accumulated Feature-Level Motion Differences among Multi-Frames

Next, we explain how the HME module, which hierarchically integrates ME and IFME, accumulates and expands the receptive field of the feature-level motion pattern. The principle of HME is illustrated in Figure 3, where we assume that eight frames are extracted from each video.
First, the output features of the conv4_x residual block are input into the ME module to extract the feature-level motion information of adjacent frames. The frame difference features of adjacent frames v_{i+1} and v_i are calculated and integrated into the 3D feature maps of video frame v_i. At this point, the 3D feature maps of frame v_i contain the feature-level motion information between frames v_{i+1} and v_i, which is denoted v_i(v_{i+1}). This is shown from the first row to the second row in Figure 3. After ME processing, the time window of the Motion Excitation module is two frames long, i.e., v_i and v_{i+1}. Similarly, the 3D feature maps of frame v_{i+2} are calculated, and the feature-level motion information between frames v_{i+3} and v_{i+2} is denoted v_{i+2}(v_{i+3}).
Second, the feature-level motion information of adjacent frames is input into the IFME module, where interval frames are selected for the frame difference. The frame difference features of interval frames v_{i+2}(v_{i+3}) and v_i(v_{i+1}) are calculated and integrated into the 3D feature maps of frame v_i, which is denoted v_i(v_{i+1}, v_{i+2}, v_{i+3}). That is, the 3D feature maps of frame v_i contain feature-level motion information among frames v_i, v_{i+1}, v_{i+2}, and v_{i+3}. This is shown from the second row to the third row in Figure 3. After IFME processing, the time window of the HME module is expanded to four frames, i.e., v_i, v_{i+1}, v_{i+2}, and v_{i+3}.
The time window of the network model obtained by using the ME and IFME modules together is four frames long. In this work, we stop at a frame order difference of two and a four-frame time window; our future work will continue to expand the frame order difference and study new modeling methods for extracting feature-level motion information over long-range distances.

3.3. Few-Shot Classifier

Our few-shot classifier is derived from TRX [10] and Prototype Network [4]. Since our HME module is proposed for short-range feature-level motion information, we use TRX to extract short-range and long-range frame-level motion information, as shown in Figure 1.
The output features of the HME module undergo spatial global average pooling to obtain F_trx_in ∈ R^{T×D}. Following the processing of the Transformer [34], positional encoding is added to F_trx_in. We use Ω = {2, 3} from TRX, where the possible tuples for ω_i ∈ Ω, i = 0, 1, are
\Theta_{\omega_i} = \{(f_1, \ldots, f_{\omega_i}) \mid \forall j: 1 \le f_j < f_{j+1} \le T\},
and F_trx_in is reorganized into Ĥ ∈ R^{K̂ × ω_i × D} → R^{K̂ × ω_i D} according to Θ_{ω_i}, assuming that ω_i yields K̂ possible tuples.
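For illustration, the ordered tuples Θ_{ω_i} and the flattened tuple features Ĥ can be enumerated as in the following sketch; the function and variable names are hypothetical.

```python
from itertools import combinations
import torch

def build_tuples(f_trx_in, omega):
    """Build the ordered frame tuples Theta_omega and the flattened tuple features H.
    f_trx_in: (T, D) per-video frame features after global average pooling.
    omega: tuple cardinality, e.g. 2 or 3 (Omega = {2, 3} in TRX)."""
    t = f_trx_in.shape[0]
    idx = list(combinations(range(t), omega))                       # all 1 <= f_1 < ... < f_omega <= T
    h = torch.stack([f_trx_in[list(i)].reshape(-1) for i in idx])   # (K_hat, omega * D)
    return h
```

With T = 8 sampled frames, this yields K̂ = 28 tuples for ω = 2 and K̂ = 56 tuples for ω = 3.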
TRX includes two linear mappings, fc1 and fc2. Based on CrossTransformer [35], φ_hme(s_n^k) → Ĥ_s and φ_hme(q_i) → Ĥ_q are input into fc1 to obtain the key Γ ∈ R^{K̂ × d_fc1} and the query Y ∈ R^{K̂ × d_fc1}, respectively, where s_n^k ∈ S and {q_i, i ∈ [1, 2, ..., NM]} ⊂ Q are defined in Section 3.1. Similarly, we obtain the values Λ_q ∈ R^{K̂ × d_fc2} and Λ_s ∈ R^{K̂ × d_fc2} after fc2. The attention weight is obtained using
A = \mathrm{Softmax}(\mathrm{LN}(\Gamma) \cdot \mathrm{LN}(Y)^{T}), \quad A \in \mathbb{R}^{\hat{K} \times \hat{K}},
where LN(·) is standard layer normalization [36]. In order to make the temporal features of the labeled videos correspond to the query video, matrix multiplication is performed as follows:
\tilde{\Gamma} = A \cdot \Lambda_q, \quad \tilde{\Gamma} \in \mathbb{R}^{\hat{K} \times d_{fc2}}, \qquad \tilde{Y} = A \cdot \Lambda_s, \quad \tilde{Y} \in \mathbb{R}^{\hat{K} \times d_{fc2}},
where Γ̃ and Ỹ obtained from the above formula are measured with the Prototype Network.
Let us assume that Γ̃_n^k corresponds to the k-th video of the c_n-th class in the support set. Then, for each {c_n, n ∈ [1, 2, ..., N]}, its prototype p_{c_n} is
p_{c_n} = \frac{1}{K} \sum_{k=1}^{K} \tilde{\Gamma}_n^k.
Then, let us define {Ỹ_i, i ∈ [1, 2, ..., NM]} as the i-th video in the query set. By comparing Ỹ_i and p_{c_n}, the predicted probability that q_i belongs to class c_n is
P(c_{pre} = c_n \mid q_i) = \mathrm{Softmax}\left(-\left\| \tilde{Y}_i - p_{c_n} \right\|^2\right).
We use Euclidean distance to obtain the distance similarity and use cross-entropy to calculate the loss for backpropagation.
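The prototype averaging and the softmax over negative squared Euclidean distances can be sketched as follows. The tensor shapes are simplified for illustration (in TRX, the attended support features are computed per query), so this is not the exact implementation.

```python
import torch
import torch.nn.functional as F

def prototype_predict(gamma_support, y_query):
    """Prototype-based prediction over attended tuple features.
    gamma_support: (N, K, K_hat, d) attended support features per class and shot.
    y_query: (K_hat, d) attended features of one query video.
    Returns the probability that the query belongs to each of the N classes."""
    prototypes = gamma_support.mean(dim=1)                                    # p_{c_n}: mean over the K shots
    dists = ((y_query.unsqueeze(0) - prototypes) ** 2).sum(dim=(-2, -1))      # squared Euclidean distance per class
    return F.softmax(-dists, dim=0)                                           # closer prototype -> higher probability
```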

4. Experiments

4.1. Datasets

To demonstrate the utility of our method, we conducted experiments on two few-shot video benchmark datasets, which are UCF101 and HMDB51.

4.1.1. UCF101

The UCF101 dataset [37] has 101 action classes, each performed by 25 groups of people with four to seven clips per group, for a total of 13,320 videos. We followed the few-shot split proposed in [31] and divided the dataset into 70, 10, and 21 disjoint classes used for training, validation, and testing, respectively.

4.1.2. HMDB51

HMDB51 [38] contains 51 action classes and 6849 videos with a resolution of 320 × 240. We also followed the few-shot split proposed in [31] and divided the dataset into 31, 10, and 10 disjoint classes used for training, validation, and testing, respectively.

4.2. Implementation Details

4.2.1. Pre-Processing

In the pre-processing stage of the video frames, we first resized the picture of each frame to 256 × 256 and then randomly cropped a region of 224 × 224 , in addition to adding a random horizontal flipping data augmentation method. We used center cropping to obtain a region of 224 × 224 in the testing phase. Our experiments used the same strategy mentioned in [10] to randomly sample eight frames from the video.
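A minimal torchvision sketch of this pre-processing pipeline is given below; normalization constants are not stated in the text and are therefore omitted here.

```python
from torchvision import transforms

# Training-time frame pre-processing as described above; test time uses centre cropping.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),        # resize each frame to 256x256
    transforms.RandomCrop(224),           # random 224x224 crop
    transforms.RandomHorizontalFlip(),    # horizontal-flip augmentation
    transforms.ToTensor(),
])

test_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),           # centre 224x224 crop for testing
    transforms.ToTensor(),
])
```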

4.2.2. Training Details

We trained our HME-Net end to end. The size of the input provided to the model was N × T × 3 × 224 × 224 , where N is the batch size and T is the number of frames. The HME module was added to the ResNet50 to form a backbone network. The backbone network was pre-trained on the Kinetics400 dataset. Concretely, we averaged the features of each frame extracted by the backbone network to obtain a video-level representation and used a fully connected layer containing 400 neurons as a classifier.
The ME and IFME modules were placed after the conv4_x and conv5_x residual blocks, respectively. The reduction ratio of the HME module was r = 16, and the feature-level motion-attentive features were fused using channel-wise multiplication and a residual connection. For the few-shot classifier, we used the parameter settings from [10]: d_k = d_v = 1152 and Ω = {2, 3}. Euclidean distance was selected for the Prototype Network [4]. The experiments were performed on four Nvidia 2080Ti GPUs. We used the SGD optimizer with a learning rate of 0.001, and the network was trained for 100,000 episodes on the UCF101 and HMDB51 datasets.
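A schematic training loop consistent with the stated settings (SGD, learning rate 0.001, 100,000 training episodes) is sketched below; `model`, `sample_episode`, and `train_meta_set` are placeholders rather than the authors' actual code.

```python
import torch

# Placeholder objects: `model` is the end-to-end HME-Net, `sample_episode` is a
# hypothetical episode sampler returning batched support/query tensors and query labels.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

for episode in range(100_000):
    support, query, query_labels = sample_episode(train_meta_set)
    logits = model(support, query)            # per-class scores for the query videos
    loss = criterion(logits, query_labels)    # cross-entropy loss for backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```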

4.2.3. Evaluation Details

We used the same model parameters and settings as in the training phase. We randomly selected 10,000 episodes from the Test Meta-set and report the average accuracy. All sample categories in the Validation Meta-set were different from those in the Training Meta-set. Note that the aim of the proposed network is to obtain a measurement criterion for few-shot video classification by quantifying intra-class consistency and inter-class difference. During the validation stage, we evaluated the effectiveness of this criterion by transferring it to the Validation Meta-set, which validates its generalization ability. Although the frame length varied across sample videos, the frame lengths of all videos in each category were independently and identically distributed, and all sample videos in each category of the Validation Meta-set were used for validation regardless of their length, so that the performance of the updated measurement criterion could be evaluated on sample videos of various qualities.

4.3. Experimental Results

In this part, we compare our method with several currently available state-of-the-art works. Table 1 shows our comparative results on UCF101 and HMDB51. The accuracy marked in bold represents the highest classification accuracy.
Several existing methods can be compared with our results. For the UCF101 and HMDB51 datasets, GenApp [39] mainly adopts a generative model based on attribute learning. ProtoGAN [29] uses a GAN to generate additional novel samples. ARN [31] uses a variety of techniques, including a spatio-temporal attention mechanism and a self-supervised training mode. RGB Basenet, based on the embodied learning reported in [40], uses ResNet50 as the feature extractor and the Prototype Network as the classifier. RGB Basenet++ and AmeFu-Net are reported in the same research work [1]; the difference between RGB Basenet++ and RGB Basenet is that RGB Basenet++ adopts a temporal synchronization augmentation mechanism. AmeFu-Net [1] introduces depth information to enhance background features. TRX [10] uses permutations and combinations of frame features to obtain long- and short-range temporal information.
Table 1 compares the experimental results on UCF101 and HMDB51. The last row shows the classification accuracy of our proposed HME-Net, averaged over 10,000 episodes. Our method outperforms all the competitors and establishes a new state of the art, with 94.8% and 96.8% accuracy on UCF101 and 72.2% and 77.1% on HMDB51 in the five-way three-shot and five-way five-shot settings, respectively. The reason for the accuracy improvement is that HME-Net excels at extracting feature-level motion information and channel interdependence.

4.4. Ablation Study

We conducted ablation experiments to prove the superiority of our final model on the UCF101 and HMDB51 datasets. We mainly explored the impact of different model structures and hyperparameter changes on the precision of video classification. We specifically considered three aspects: the effectiveness of each component of HME (Section 4.4.1), different r values (Section 4.4.2), and the five-way K-shot setting where K ≥ 1 (Section 4.4.3). The results show that extracting feature-level motion information is essential. The results of the experiments are reported in Table 2, Table 3 and Table 4.

4.4.1. Effectiveness of Each Component of HME

To more clearly explore the effectiveness of each component of HME, we designed a baseline without pre-training, called “BaseNet”. In the “BaseNet” model, the frame-level features extracted by ResNet50 with random initialization are averaged to obtain video-level features for a video sample, and the Prototype Network is used as a few-shot classifier. We added Motion Excitation (Section 3.2.1) to “BaseNet”, termed “BaseNet + ME”; then, we added Interval Frame Motion Excitation (Section 3.2.2) based on “BaseNet + ME”, termed “BaseNet + HME”. The difference between the “BaseNet” model and the “BaseNet + ME” model is the extraction of the feature-level motion information of adjacent frames. The difference between the “BaseNet + ME” model and the “BaseNet + HME” model is the extraction of the feature-level motion information of interval frames.
The results of few-shot experiments with five-way three-shot and five-shot settings are reported in Table 2 with hyperparameter r = 8 . By comparing the results of the first three rows in Table 2, it can be seen that the classification precision of “BaseNet + HME” is higher than that of “BaseNet + ME”, and the classification precision of “BaseNet + ME” is higher than that of “BaseNet”. The extracted feature-level motion information of adjacent and interval frames can effectively improve the classification ability of modeling. The ME and IFME modules allow the neural network to enhance the feature-level motion-sensitive information in the original features.

4.4.2. HME-Net with Different r Values

We independently evaluated the experimental results for r ∈ {2, 4, 8, 16, 32, 64} on UCF101. We selected the best hyperparameter r value and fixed the HME parameters. The accuracy in Table 1 corresponds to r = 16. The experimental results are shown in Table 3. It can be observed in Table 3 that as r increased, the classification accuracy of the video first increased and then decreased. When r = 16, the highest recognition result was obtained. The result obtained with r = 16 is 1.7% higher than that obtained with r = 2 and 0.4% higher than that obtained with r = 64.
When the appropriate r value is selected, the HME module can better capture the feature-level motion information of multiple video frames. If the value of r is too small, the degree of channel compression is significantly reduced, which results in more parameters being used in channel attention calculation. The risk of overfitting is greatly increased. On the other hand, an overly large value of r increases the compression of the channel, leading to the loss of some helpful information. The ability to capture feature-level motion characteristics is strongly reduced. Therefore, it is necessary to balance the relationship between the two when using the HME module. The best performance of the network was obtained when r = 16 .

4.4.3. Five-Way K-Shot Results

For the completeness of the experiments, we provide the experimental results of the HME-Net model when different values of hyperparameter K were selected. The results of the ablation study with hyperparameter K are shown in Table 4. We conducted one-, two-, three-, four-, and five-shot few-shot video classification on the UCF101 and HMDB51 datasets, respectively. As we predicted, the larger the K value was, the higher the classification accuracy was.

4.5. Visualization

4.5.1. Visualization of Feature-Level Motion Information

We randomly selected a video of the class “YoYo” from the UCF101 dataset and used the Grad-CAM [41] method to visualize the Class Activation Map (CAM) [42], with the goal of providing a clearer understanding of feature-level motion information. The visualization results are shown in Figure 4. Rows 2 to 4 correspond to the outputs of conv4_x, conv5_x, and the IFME in the HME module. It can be seen that the output features of HME focus on the position of human motion while ignoring background interference.

4.5.2. Visualization for Few-Shot Classification

As shown in Figure 5, t-SNE [43] was used to visualize the features learned by the backbone networks of BaseNet and HME-Net. The network parameters obtained with the five-way three-shot setting were used on the UCF101 dataset. Each color in the figure represents a class. From left to right, the sub-graphs in Figure 5 become increasingly clustered, and the clusters of HME-Net show a clear demarcation.
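A short sketch of the kind of t-SNE visualization used here, based on scikit-learn and matplotlib; the feature and label arrays are assumed inputs extracted beforehand from the backbone.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    """Project video-level embeddings to 2D with t-SNE and colour them by class.
    features: (num_videos, D) array of backbone outputs; labels: (num_videos,) class ids."""
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=10)
    plt.axis("off")
    plt.show()
```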

5. Conclusions

In this paper, we have explored and exploited a Hierarchical Motion Excitation Network (HME-Net) to recognize actions with very few labeled training videos. We use the HME module to extract feature-level motion information from the video. The expanded accumulation time window can capture feature-level motion transformations of adjacent frames and interval frames. An attention mechanism is combined with Hierarchical Motion Excitation so that the feature-level motion pattern is activated in the channel dimension. To ensure that the background information is not completely eliminated, the HME module adopts a residual connection. The experimental results demonstrate that the improved performance of the model is due to feature-level motion information extraction. Our extensive experiments establish our method as a new state of the art for the five-way three-shot and five-way five-shot settings of video recognition on the UCF101 and HMDB51 datasets.

Author Contributions

Conceptualization, B.W. and Y.S.; methodology, B.W. and Y.S.; software, X.W.; validation, S.R.; formal analysis, X.W.; investigation, W.W.; resources, S.R.; data curation, Y.S.; writing—original draft preparation, B.W. and Y.S.; writing—review and editing, Y.S. and X.W.; visualization, W.W.; supervision, S.R.; project administration, X.W. and S.R.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Graduate Interdisciplinary Innovation Project of the Yangtze Delta Region Academy of Beijing Institute of Technology (Jiaxing), grant number GIIP2021-016.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Fu, Y.; Zhang, L.; Wang, J.; Fu, Y.; Jiang, Y.G. Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1142–1151.
2. Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the ICML Deep Learning Workshop, Lille, France, 6–11 July 2015; Volume 2.
3. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29, 3630–3638.
4. Snell, J.; Swersky, K.; Zemel, R.S. Prototypical networks for few-shot learning. arXiv 2017, arXiv:1703.05175.
5. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1199–1208.
6. Chen, Z.; Fu, Y.; Wang, Y.X.; Ma, L.; Liu, W.; Hebert, M. Image deformation meta-networks for one-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8680–8689.
7. Wang, Y.; Xu, C.; Liu, C.; Zhang, L.; Fu, Y. Instance Credibility Inference for Few-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
8. Wang, Y.; Zhang, L.; Yao, Y.; Fu, Y. How to trust unlabeled data? Instance Credibility Inference for Few-Shot Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6240–6253.
9. Cao, K.; Ji, J.; Cao, Z.; Chang, C.Y.; Niebles, J.C. Few-shot video classification via temporal alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10618–10627.
10. Perrett, T.; Masullo, A.; Burghardt, T.; Mirmehdi, M.; Damen, D. Temporal-Relational CrossTransformers for Few-Shot Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 475–484.
11. Li, Y.; Ji, B.; Shi, X.; Zhang, J.; Kang, B.; Wang, L. TEA: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 909–918.
12. Yue-Hei Ng, J.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4694–4702.
13. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. arXiv 2014, arXiv:1406.2199.
14. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 20–36.
15. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks for action recognition in videos. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2740–2755.
16. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941.
17. Li, D.; Yao, T.; Duan, L.Y.; Mei, T.; Rui, Y. Unified spatio-temporal attention networks for action recognition in videos. IEEE Trans. Multimed. 2018, 21, 416–428.
18. Diba, A.; Sharma, V.; Van Gool, L. Deep temporal linear encoding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2329–2338.
19. Qiu, Z.; Yao, T.; Mei, T. Deep quantization: Encoding convolutional activations with deep generative model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6759–6768.
20. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
21. Wang, Z.; She, Q.; Smolic, A. ACTION-Net: Multipath Excitation for Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13214–13223.
22. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
23. Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5533–5541.
24. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
25. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459.
26. Hara, K.; Kataoka, H.; Satoh, Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6546–6555.
27. Zhu, L.; Yang, Y. Compound memory networks for few-shot video classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 751–766.
28. Zhu, L.; Yang, Y. Label independent memory for semi-supervised few-shot video classification. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 273–285.
29. Kumar Dwivedi, S.; Gupta, V.; Mitra, R.; Ahmed, S.; Jain, A. ProtoGAN: Towards few shot learning for action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea, 27 October–2 November 2019.
30. Bishay, M.; Zoumpourlis, G.; Patras, I. TARN: Temporal attentive relation network for few-shot and zero-shot action recognition. arXiv 2019, arXiv:1907.09021.
31. Zhang, H.; Zhang, L.; Qi, X.; Li, H.; Torr, P.H.; Koniusz, P. Few-shot action recognition with permutation-invariant attention. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part V; Springer: Berlin/Heidelberg, Germany, 2020; pp. 525–542.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
33. Sung, F.; Zhang, L.; Xiang, T.; Hospedales, T.; Yang, Y. Learning to learn: Meta-critic networks for sample efficient learning. arXiv 2017, arXiv:1706.09529.
34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
35. Doersch, C.; Gupta, A.; Zisserman, A. CrossTransformers: Spatially-aware few-shot transfer. arXiv 2020, arXiv:2007.11498.
36. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
37. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402.
38. Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2556–2563.
39. Mishra, A.; Verma, V.K.; Reddy, M.S.K.; Arulkumar, S.; Rai, P.; Mittal, A. A generative approach to zero-shot and few-shot action recognition. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 372–380.
40. Fu, Y.; Wang, C.; Fu, Y.; Wang, Y.X.; Bai, C.; Xue, X.; Jiang, Y.G. Embodied one-shot video recognition: Learning from actions of a virtual embodied agent. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 411–419.
41. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
42. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929.
43. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
Figure 1. Schematic outline of our Hierarchical Motion Excitation Network. A task T_i containing the support set and query set is extracted from the total task set T. The video clips are obtained from the video with the sparse temporal sampling strategy. The backbone network is composed of ResNet50 and our HME module, which includes ME and IFME. The video clips are processed by the backbone network to obtain the feature F_θ(V_i) of the support set and the feature F_θ(V_q) of the query set. The features at this stage include feature-level motion information between video frames. When extracting features, the parameters of the backbone network are shared. The few-shot video classifier is composed of TRX and the Prototype Network. F_θ(V_i) and F_θ(V_q) are processed by the classifier to obtain the predicted probabilities. We use the cross-entropy function to calculate the loss and use SGD to complete the network update.
Figure 2. Internal processing of the HME module. It includes two parts: Motion Excitation (ME) and Interval Frame Motion Excitation (IFME). t, t + 1, and t + 2 represent the current frame, adjacent frame, and interval frame, respectively. GAP represents the average pooling operation. σ represents the Sigmoid activation function. F_in is obtained through the processing of conv1_x to conv4_x in ResNet50. F_in first passes through the ME structure to obtain the feature-level motion pattern of adjacent frames. The input feature F_ime_in of the IFME module is obtained by performing the residual operation of conv5_x in ResNet50 on F_me_out. In the IFME structure, the aligned feature of the (t + 2)-th frame and the t-th frame are used to perform the frame difference calculation. The channel attention mechanism is used to fuse the feature-level motion transformation of the interval frame into the original feature.
Figure 3. Graphic illustration of the HME module principle. The flow of arrows from the first row to the second row indicates the process of the ME module. Similarly, the flow of arrows from the second row to the third row represents the calculation process of the IFME module. The colors in the figure correspond to the colors in Figure 2. t(l_1) indicates that the feature of the t-th frame includes the feature-level motion information of the l_1-th frame; t(l_1, l_2) and t(l_1, l_2, l_3) are defined analogously.
Figure 4. Visualization of significant features extracted by HME-Net for the action “YoYo” from UCF101. Features are visualized using the Class Activation Map (CAM) [42]. The second to fourth rows in the figure correspond to the output features of conv4_x, conv5_x, and the HME module, respectively. The most obvious image comparisons are marked by the red box. Comparing these sub-graphs, it can be noticed that the HME module was able to extract features that are related to movements. This is because the HME module captures feature-level motion information.
Figure 5. Visualization of few-shot classification using t-SNE [43]. Different colors represent different action classes. Five action classes are visualized in total. We compare our HME-Net with our baseline BaseNet on UCF101. The randomly selected categories include “PommelHorse”, “Surfing”, “CuttingInKitchen”, “Diving”, and “VolleyballSpiking” from the test set.
Table 1. Comparison of our method and current state-of-the-art works on UCF101 and HMDB51. We show the classification accuracy (%) in 5-way 3-shot and 5-way 5-shot video recognition. The classification results of 10,000 episodes are averaged as the final result in the Test Meta-set.
| Model             | UCF101 3-Shot | UCF101 5-Shot | HMDB51 3-Shot | HMDB51 5-Shot |
|-------------------|---------------|---------------|---------------|---------------|
| GenApp [39]       | 73.4          | 78.6          | 47.5          | 52.5          |
| ProtoGAN [29]     | 75.3          | 80.2          | 49.1          | 54.0          |
| ARN [31]          | -             | 84.8          | -             | 59.1          |
| RGB Basenet [40]  | 88.7          | 92.1          | 62.4          | 67.8          |
| RGB Basenet++ [1] | 90.0          | 92.9          | 63.0          | 68.2          |
| AmeFu-Net [1]     | 93.1          | 95.5          | 71.5          | 75.5          |
| TRX [10]          | -             | 96.1          | -             | 75.6          |
| HME-Net           | **94.8**      | **96.8**      | **72.2**      | **77.1**      |
Table 2. The ablation experiments explored the impact of different structures on classification accuracy. The “BaseNet”, “BaseNet + ME”, and “BaseNet + HME” models were used to probe the effectiveness of each component of HME (Section 4.4.1).
| Setting       | UCF101 3-Shot | UCF101 5-Shot | HMDB51 3-Shot | HMDB51 5-Shot |
|---------------|---------------|---------------|---------------|---------------|
| BaseNet       | 60.1          | 64.4          | 42.1          | 45.8          |
| BaseNet + ME  | 64.2          | 67.4          | 43.3          | 47.7          |
| BaseNet + HME | 65.8          | 68.6          | 44.9          | 49.7          |
Table 3. The ablation experiments explored the impact of different r values (Section 4.4.2) on classification accuracy. Results of 5-way 3-shot setting on UCF101.
| Cardinalities | 3-Shot | Cardinalities | 3-Shot |
|---------------|--------|---------------|--------|
| r = 2         | 92.1   | r = 16        | 93.8   |
| r = 4         | 92.5   | r = 32        | 93.4   |
| r = 8         | 93.0   | r = 64        | 93.4   |
Table 4. Classification accuracy of 5-way K-shot, where K ≥ 1 (Section 4.4.3), on UCF101 and HMDB51.
| Dataset | Model   | K = 1 | K = 2 | K = 3 | K = 4 | K = 5 |
|---------|---------|-------|-------|-------|-------|-------|
| UCF101  | HME-Net | 83.3  | 91.4  | 94.8  | 96.0  | 96.8  |
| HMDB51  | HME-Net | 53.1  | 65.4  | 72.2  | 75.0  | 77.1  |