3D-CNN-Based Fused Feature Maps with LSTM Applied to Action Recognition

: Human activity recognition is an active ﬁeld of research in computer vision with numerous applications. Recently, deep convolutional networks and recurrent neural networks (RNN) have received increasing attention in multimedia studies, and have yielded state-of-the-art results. In this research work, we propose a new framework which intelligently combines 3D-CNN and LSTM networks. First, we integrate discriminative information from a video into a map called a ‘motion map’ by using a deep 3-dimensional convolutional network (C3D). A motion map and the next video frame can be integrated into a new motion map, and this technique can be trained by increasing the training video length iteratively; then, the ﬁnal acquired network can be used for generating the motion map of the whole video. Next, a linear weighted fusion scheme is used to fuse the network feature maps into spatio-temporal features. Finally, we use a Long-Short-Term-Memory (LSTM) encoder-decoder for ﬁnal predictions. This method is simple to implement and retains discriminative and dynamic information. The improved results on benchmark public datasets prove the effectiveness and practicability of the


Introduction
Human activity recognition (HAR) is one of the enabling technologies behind human-computer interactions, video surveillance and video scene understanding [1].To date, it imposes significant challenges such as the frequent presence of background clutter, view point changes, irregular motion, intra-class variations and camera motion.In addition, the huge information redundancy in video requires large amounts of memory, and also, the discovery of discriminative information from video frames is very complex and slow process.
The result of various research studies indicates that the success of action recognition problems depends on an appropriate feature extraction process.The appropriate feature extraction is very important in distinguishing samples and variations in the frames.Considerable progress has been made to address this problem by employing various specific solutions.Many local space-time visual representations have been proposed to overcome these issues in action recognition tasks.Laptev [2] detected sparse-time interest points and computed a histogram of the detected local points.Hessian [3], local trinary patterns (LTP) [4], Cuboids [5], and 3-D SIFT [6]) have also shown promising levels of HAR effectiveness, mainly thanks to their robustness against partial occlusions and noise.To facilitate a more effective usage of motion information, many trajectory-based feature extraction approaches have been proposed, such as KLT-tracker [7], SIFT matching [8], DTF [9], improved DTF [10].However, there are number of weaknesses in these models, such as the presence of irrelevant and redundant trajectories, computational complexity and blending of unnecessary motion.
The ideal video representation method must be efficient to compute and simple to implement instead of using complicated and labor-intensive feature extraction and encoding methods.The extraction of spatiotemporal features from video frame sequences is widely used for recognizing human actions.Due to the advancement of digital camera technology, it has become possible to capture depth information, which can be embodied into a single motion map.Compared to dynamic and conventional images, motion maps can provide 3D information, and can be insensitive to changes in light conditions.Much research efforts has used depth imagery such as dynamic images [11] and depth maps [12] in the context of action recognition.These methods are able to process temporal information, but are insufficient to capture dense and discriminative information in terms of shape, appearance and motion.
Recently, deep convolutional neural networks (DCNNs) and Long Recurrent Convolutional networks (LRCNs) have shown great potential in many areas, and have yielded promising results for many computer vision tasks.These approaches have the ability to accurately identify the hidden pattern in visual data by back propagation, so features are auto-extracted without any artificial selection.It has been proven empirically that features learned from deep neural networks are much better than hand-crafted features.
In light of the above analysis, this research article examines the issue of human action recognition by using motion maps and intelligently incorporating a C3D network with a Long Recurrent Convolutional network (LRCN) network.We utilize a 3D convolutional neural network (C3D) [13] to acquire and integrate the temporal information.The C3D can model appearance and motion information simultaneously.Our model integrates a motion map of the previous frames with the next frame to generate a new motion map.We can get a motion map of the whole video after the repetitive integration of the next frame for various-length videos.We use a linear weighted fusion method to fuse feature maps to take advantage of spatiotemporal features.Finally, we use LSTM for feature encoding and action classification.The proposed method is simple to implement and acquires temporal information effectively, integrating it into a map without losing the discriminative information of videos.The proposed method shows significantly improved results over some baseline methods when applied to the various benchmark video datasets.It is worth highlighting the following contributions: 1.
We propose an iterative training method for our neural network to generate a motion map from input video, which can integrate information into a motion map from each video frame.

2.
We intelligently incorporate C3D and LSTM networks and capture long-range spatial and temporal dynamics.C3D features on video shots contain richer motion information; LSTM can explore the temporal relationship between video shots.

3.
We introduce an effective fusion technique i.e., a linear weighted fusion method which can fuse correspondence between spatial and temporal features and boost recognition accuracy.

4.
The effectiveness of our approach is evaluated on benchmark datasets, in which it obtained state-of-the-art recognition results.
The remainder of this article is organized as follows: Section 2 reviews related works.In Section 3, we present our proposed approach in detail.We demonstrate the experimental evaluation in Section 4. Finally, a conclusion is presented in Section 5.

Related Work
Over the last decade, researchers have presented many hand-crafted and deep-net-based methods for action recognition.Earlier works were based on hand-crafted features for non-realistic actions videos.Since the proposed method is based on deep neural network (DNN), in this section, we will only review related works based on DNN.
In recent years, different variants of deep learning models have been proposed for human activity recognition in videos, and have achieved great performance for computer vision tasks.Ji et al. [14] applied 3D convolutional kernels on video frames in a time axis to capture both spatial and temporal information.Karpathy et al. [15] directly applied CNNs to multiple frames in each sequence and obtained the temporal relations by pooling, using single, late, early and slow fusion; however, the results of this scheme were just marginally better than those of a single frame baseline.Simonyan and Zisserman [16] used a two-stream CNN framework to incorporate both feature types, with one stream taking RGB image frames as the input and the other taking pre-computed stacked optical flows.Since optical flow contains only short-term motion information, adding it does not enable CNNs to learn long-term motion transitions.The additional stream significantly improved action recognition accuracy, indicating the importance of motion features.Tran et al. [13] avoided the need for pre-computing optical flow features through their 3D convolution (C3D) framework, which allows deep networks to learn temporal features in an end-to-end manner.However, C3D only covers a short range of the sequence.Wang et al. [17] introduced a temporal segment network (TSN) architecture, where a sparse temporal sampling strategy is adopted to model long-term temporal structures.In [18], Feichtenhofer et al. study a number of ways of fusing CNN towers in order to take advantage of this spatial-temporal information from the appearance and optical flow networks.However, the CNN-based method only extracts visual appearance features, and lacks the long-range temporal modeling capabilities.Moreover, the CNN-based method ignores the intrinsic difference between spatial and temporal domains.
Some researchers have also presented methods by uniting the benefits of both hand-crafted and deep learned features, such as [19,20], and obtained good results.They integrate the key factors from two successful video representations, namely improved trajectories [10] and two-stream ConvNets [18].How to combine the benefits of these two kinds of features to design good descriptors has been an active research area.Some research efforts have been carried out using depth imagery such as dynamic images and depth maps.Bilen et al. [11] introduced the dynamic image network to generate dynamic images for action videos.The order of video frames is used as the supervisory information; however, this method loses some discrimination information.Chen et al. [12] represented a model in the form of depth maps in the context of action recognition.These contributions showed good action recognition results but were insufficient to capture dense and discriminative information in terms of shape, appearance and motion.Taylor et al. [21] used a convolutional gated restricted Boltzmann machine to generate a flow field of the adjacent two frames in the video for action recognition, but this model could not generate a single map to represent a video.Rank pooling [22] and Fisher Vector [23] made an attempt to generate the desire length motion map.However, these methods are unable to model temporal dynamics among video frames.
In order to model the temporal dynamics among video frames, RNNs have been considered for video-based HAR.RNN networks provide strength to find and process hidden patterns in time-space data.In these kind of systems, data is processed in a sequential way, such that at each time t, it gets input from the previous hidden state s t−1 and obtains new data x t .Most of the state-of-the-art methods [24][25][26][27][28][29] have proposed their own recurrent networks by leveraging CNNs and RNNs for action recognition, and have achieved impressive performance.However, due to the large number of calculations of parameters, and negligence of effect of initial input after few layers, vanishing gradient problems occurred.The solution to this problem is LSTM [25,27,30], which has the ability to capture long-term dependencies and preserve sequence information over time by integrating memory units.LSTM was first introduced by [31]; it has been successfully adapted to many sequential modelling tasks such as speech recognition, visual description and machine translation, and has achieved encouraging performance.In most of these networks, the inputs to the LSTM are the high level features captured from a fully-connected layer of CNN.LSTM units use multiplicative gates to control access to the error signal propagating through the networks.
In this paper, we propose a 3Dconv-based iterative training method to generate the motion map, enabling the use of existing CNN models directly on video data with fine-tuning.Our model efficiently integrates the temporal information of the motion map and video frames and generates the arbitrary length of the motion map.The Combination of CNN-RNN provides effective representation for long-term motion and modeling of the sequential data, each of which has a time relationship with adjacent points.(RNN uses the extracted C3D features as inputs and models more robust, longer-range features.)The C3D network is able to encode local temporal features within each video unit; it cannot model across the multiple units of a video sequence.We thus introduce LSTM to capture global sequence dependencies of the input video and cues on motion information.The fused spatio-temporal features are processed by LSTM, which helps recognizing complex frame-to-frame hidden sequential patterns.After conducting extensive experiments, we observed that our method is very effective for videos of various lengths, and shows significant improvement in action recognition.

The Proposed Approach
In this section, the proposed approach and its related components are discussed.The process of action recognition is divided into two parts: The first part is related to the extraction of spatiotemporal fused features, so we discuss this within the relevant subsequent sections, e.g., the generation of motion maps and the training of motion map networks.Finally, we explain the encoding of the extracted features and the action classification part in the main subsequent section.

Generation of Motion Map
A motion map is a powerful and compact representation of a video which can be useful in computer vision tasks.The motion map can visualize motion information in good manner, and can remove a large amount of information redundancy of the video, thereby revealing discriminative information.The calculation of the motion map is fast, and takes up fewer memory resources.Hence, using a map to represent the video has realistic requirements.Our propose model is very simple to implement and can be trained by increasing the training video length iteratively.Mainly, it is very helpful to solve the problem of videos of various lengths to get the same effect of the map representation, and also to integrate the temporal information into a map without losing the discriminative information of the video.Another advantage of this method is that we can extract a constant number of video frames per second, which improves the generalization performance of the network.We can utilize a 3D-convolutional neural network for the extraction of the motion map.3D convolution and 3D pooling operations are adopted in 3D ConvNets.Three-dimensional convolution is the extension of 2D convolution.The output of the 2D convolution are two-dimensional feature maps, while the output volume of 3D convolution can have multiple-dimensions.Each feature map of the convolutional layer is connected with some successive adjacent frames in upper layer.As a result, the temporal information is not lost and the motion of the human body can be efficiently captured.Hence, multiple 3D convolutional layers can be used to handle the spatial and temporal information of the inputs in a hierarchal way.
For a video V with N frames, we define the video frames as fi, i {1 . . . . . .N}. F i denotes the motion map from f 1 to f i .In order to retain appearance and action information, we introduce an iterative method to generate a new motion map F i+1 using Equation (1) by combining the current motion map F i with the future video frame f i+1 by using MMN.Symbol ⊕ is the pixel-wise addition between motion map and video frame.The process of generating our first and final motion map is shown in Figure 1a,b respective Future Internet 2019, 10, x FOR PEER REVIEW 5 of 17 violin are the main features, while the rest of the image is diluted.This shows the relationship between the arm movements and playing the violin.It reflects the motion relationship between the actor and the object, and proves that the dynamic information which is available in different sequences of the video can be retained and embodied in the motion map.

C3D Network Architecture
The C3D network has the ability to learn visual patterns directly from pixels without any preprocessing step.The architecture of C3D comprises trainable filters and local pool operations, which is very useful to find hidden patterns in a video frame, and captures all changes in terms of spatial and temporal information.
The architecture of the C3D network is given in Figure 3. Table 1 illustrates the different parameter settings of each convolutional and pooling layer.We set the 3D Convolution and pooling kernel size as d × k × k, where d is kernel temporal depth and k is kernel spatial size.The 3D convolution is achieved by convolving a 3D kernel to the cube formed by stacking multiple contiguous frames together.By this construction, the feature maps in the convolution layer are connected to multiple contiguous frames in the previous layer, thereby capturing motion information.Intuitively, these different layers describe the visual content at different level, each of which is complementary to each other for the task of recognition.The C3D network has 5 convolution layers and 5 pooling layers (each convolution layer is immediately followed by a pooling layer), 2 fully connected layers and softmax loss layer.The number of channels (filters) for 5 convolution layers from 1 to 5 is 64, 128, 256, 512, and 512 respectively.The ratio represents the spatial map size ratio.In both spatial and temporal dimensions, all convolutional layers have 3 × 3 × 3 convolution filters with stride 1 × 1 × 1.All pooling layers from pool2 to pool5 (except for the first layer) have 2 × In the last iteration of our MMN network, we obtain the final motion map F N of video V, in which discriminative information is embodied and can be applied for action recognition tasks.Some motion maps generated by our C3D network are listed in Figure 2.Each map highlights the static object with its main features, and the superposed silhouette incarnates the different locations and postures of the actor and objects.For example, the action category playing violin shows that the actor, arms and violin are the main features, while the rest of the image is diluted.This shows the relationship between the arm movements and playing the violin.It reflects the motion relationship between the actor and the object, and proves that the dynamic information which is available in different sequences of the video can be retained and embodied in the motion map.

C3D Network Architecture
The C3D network has the ability to learn visual patterns directly from pixels without any pre-processing step.The architecture of C3D comprises trainable filters and local pool operations, which is very useful to find hidden patterns in a video frame, and captures all changes in terms of spatial and temporal information.
The architecture of the C3D network is given in Figure 3. Table 1 illustrates the different parameter settings of each convolutional and pooling layer.We set the 3D Convolution and pooling kernel size as d × k × k, where d is kernel temporal depth and k is kernel spatial size.The 3D convolution is achieved by convolving a 3D kernel to the cube formed by stacking multiple contiguous frames together.By this construction, the feature maps in the convolution layer are connected to multiple contiguous frames in the previous layer, thereby capturing motion information.Intuitively, these different layers describe the visual content at different level, each of which is complementary to each other for the task of recognition.The C3D network has 5 convolution layers and 5 pooling layers (each convolution layer is immediately followed by a pooling layer), 2 fully connected layers and softmax loss layer.The number of channels (filters) for 5 convolution layers from 1 to 5 is 64, 128, 256, 512, and 512 respectively.The ratio represents the spatial map size ratio.In both spatial and temporal dimensions, all convolutional layers have 3 × 3 × 3 convolution filters with stride 1 × 1 × 1.All pooling layers from pool2 to pool5 (except for the first layer) have 2 × 2 × 2 pooling kernels with stride 2 × 2 × 2, which means the size of the output signal is reduced by a factor of 8 compared with input signal.The first pooling layer, i.e., the pool1 layer, has a kernel size of 1 × 2 × 2, with the goal of not merging the temporal signal and preserving the temporal information in the early phases.The output of each convolution-al layer is a kind of volume in the form of feature maps.All pooling layers lead to the same number of feature maps as convolution layers but with reduced spatial resolution; also, these pooling layers introduce scale-invariant features.The two fully connected layers have 2048 outputs, and finally, a softmax layer is used to predict action labels.Table 1.The convolutional and pooling layers of the C3D architecture.4 illustrates the single iteration process of our Motion Map Network (MMN).The input to our network is frame-by-frame RGB clip.A motion map and the next video frame are combined into a video frame sequence as input, and a single 2D-feature map is extracted as output.At this stage, it is very important to mention that the feature maps are extracted in a frame-by-frame manner.We compute feature maps of layer conv5b from the input videos, and the rest of the pool5 and fullconnected layers are abandoned in our scheme.C3D conv5b feature maps have the highest activation projected back to the image space.In each iteration, the output of the conv5b layer generates two feature maps, each with a size of 7 × 7 × 512, where 7 × 7 is the spatial size of the feature maps with 512 channels.So, we build only one feature map of 7 × 7 × 512 by taking the maximum value for each position of the both feature maps from conv5b.This process is applied for all iterations for our pipeline, except the last iteration, because the output of the last iteration is again with two feature maps.We will apply linear weighted fusion to the last two feature maps by taking advantage of spatial-temporal features to obtain our final feature map.The discriminative information embodied in the final motion map can be applied to human action recognition tasks.
Two Feature Table 1.The convolutional and pooling layers of the C3D architecture.4 illustrates the single iteration process of our Motion Map Network (MMN).The input to our network is frame-by-frame RGB clip.A motion map and the next video frame are combined into a video frame sequence as input, and a single 2D-feature map is extracted as output.At this stage, it is very important to mention that the feature maps are extracted in a frame-by-frame manner.We compute feature maps of layer conv5b from the input videos, and the rest of the pool5 and full-connected layers are abandoned in our scheme.C3D conv5b feature maps have the highest activation projected back to the image space.In each iteration, the output of the conv5b layer generates two feature maps, each with a size of 7 × 7 × 512, where 7 × 7 is the spatial size of the feature maps with 512 channels.So, we build only one feature map of 7 × 7 × 512 by taking the maximum value for each position of the both feature maps from conv5b.This process is applied for all iterations for our pipeline, except the last iteration, because the output of the last iteration is again with two feature maps.We will apply linear weighted fusion to the last two feature maps by taking advantage of spatial-temporal features to obtain our final feature map.The discriminative information embodied in the final motion map can be applied to human action recognition tasks.
feature maps, each with a size of 7 × 7 × 512, where 7 × 7 is the spatial size of the feature maps with 512 channels.So, we build only one feature map of 7 × 7 × 512 by taking the maximum value for each position of the both feature maps from conv5b.This process is applied for all iterations for our pipeline, except the last iteration, because the output of the last iteration is again with two feature maps.We will apply linear weighted fusion to the last two feature maps by taking advantage of spatial-temporal features to obtain our final feature map.The discriminative information embodied in the final motion map can be applied to human action recognition tasks.

Training of Motion Map Network (MMN)
Since the network can only handle video frames, the videos need to be processed as video frames.Some methods directly split the videos, ignoring different frame rates, so dynamic information may be inconsistent in time.Therefore, we extract a constant number video frames per second, which improves the generalization performance of the network.For the network to better capture the changes in the action, we extract two frames per second.As for some short videos, we loop the extracted frames, and fill up 16 frames per video.
We introduce an iterative method to train the Motion Map Network (MMN).We use video V with N frames and video labels L to train our MMN.S is defined as maximum training iteration length.We train MMN using training length s from 2 to S. The training length s is the round of iteration.We cut the training video V i into s-length clips C i j (j ∈ 1 . . ..N i /s) with overlap 0.

Fusion Method
The motion of the object can be observed via changes in both appearance and semantics.Based on this, we follow a feature fusion strategy to combine spatial and temporal information.Given, X s ∈ R H×W×T , X t ∈ R H×W×T are the extracted frame level spatial and temporal features, where H and W are the height and width of the feature maps, T is the number of frames.Before the fusion operation, we have to reshape both features maps (spatial and temporal) into vectors, which can be given as: Now, we perform a pixel-wise addition which is known as linear weighted fusion between X s and X t to compute a single feature map F.
where, X ∈ R H×W×T , ⊕ is a matrix addition, w s and w t are weights of appearance and motion for spatial and temporal features maps, respectively.The weights are used to measure the significance of spatial and temporal features.After performing the fusion operation, we can define the new representative features as x f,t for the video clip.So, for the input video, a set of fused features (x f , 1 , . . . . . ., x f,t , . . . . . ., x f,N ) can be generated.Finally, we apply LSTM on these generated features to perform temporal encoding for human activity prediction.

Encoding and Activity Classification
This is the second and final part of our approach, which starts from detailed on LSTM features and its architecture; then, we present the encoding and activity classification method.

Long Short-Term Memory (LSTM)
To analyze the hidden sequential patterns, it is natural choice to use RNN to encode the temporal structure of extracted sequential features.In video, visual information is represented in many frames which help in understanding the context of an action.RNN can interpret such sequences, but in cases of long term sequences, it usually forgets the earlier input sequence.LSTM has been designed to mitigate the vanishing problem and to learn long-term contextual information from temporal sequences.LSTM is a kind of recurrent network which can capture long-term dynamics, and which preserves sequence information over time.In addition, the LSTM gradient does not tend to vanish when trained with back propagation through time.Its special structure with input, output and, control gates control long-term sequence pattern identification.The gates are adjusted by a sigmoid unit that learns during training when to open and close.We adopt LSTM for encoding and decoding to recognize human actions.
The architecture of a LSTM unit is depicted in Figure 5. x t , c t , h t and y t stand for input vector, cell state, hidden state and output at the t-th state, respectively.The output y t depends on hidden state h t , while h t depends on not only the cell state c t , but also on its previous state.Intuitively, the LSTM has the capacity to read and write to its internal memory, and hence, to maintain and process information over time.The LSTM neuron contains an input gate i t , a memory cell c t , a forget gate f t , and an output gate o t .At each time step t, it can choose to write, read or reset the memory cell through these three gates.This strategy helps LSTM to access and memorize information in many steps.Equations ( 4)-( 9) demonstrate the operation of temporal modelling performed in LSTM unit.
W and b are the parameters of the LSTM known as weights of the input vector and bias term.S means a sigmoid function, tanh is the activation function and ⊗ is the element-wise multiplication.The cell state and output are computed step by step to extract long-term dependencies.The input to LSTM is x t , which is the feature vector.A forget gate is used to clear the information from the memory unit, and an output gate keeps the information about the upcoming step.We also have g t , which is computed from the input of the current frame and state of the previous state h t-1 .The hidden state of LSTM step is computed by using a tanh activation function and memory cell c t .

Encoding and Classification Process by LSTM
The generated fused features xf,t are fed into LSTM as inputs to conduct encoding and decoding for activity prediction.LSTM can be jointly trained, and our proposed model provides a trainable platform which is ideal for large-scale cognitive intelligence.The unique feature of LSTM is that it processes variable length inputs and produces high-level variable length predictions (Output).As shown in Figure 6, LSTM consists of an encoder and decoder; the encoder transforms input data xt to a corresponding activation h.The decoder in the output layer is trained to reconstruct an approximation y of the input from activation h.In general, the LSTM model has parameters i.e., W and b, which denotes the weights and the biases of input layer and the hidden layer respectively, generates an output zt of given input xt and a previous hidden state at time step t − 1 i.e., ht−1, and also updates the current hidden state ht.
( ) The next step is decoding, which is similar as the encoding step given in Equation (10), where W2 and b2 denotes the weights and biases of the hidden layer and the output layer.
( ) The final single label prediction for a video can be produced by using softmax classifier.A Softmax layer can be utilized to achieve the M-way class scores for a given video sequence.This single prediction can be achieved by averaging the label probabilities, which is the output of our decoder, and can be represented by the Equation (11).

Encoding and Classification Process by LSTM
The generated fused features x f,t are fed into LSTM as inputs to conduct encoding and decoding for activity prediction.LSTM can be jointly trained, and our proposed model provides a trainable platform which is ideal for large-scale cognitive intelligence.The unique feature of LSTM is that it processes variable length inputs and produces high-level variable length predictions (Output).As shown in Figure 6, LSTM consists of an encoder and decoder; the encoder transforms input data x t to a corresponding activation h.The decoder in the output layer is trained to reconstruct an approximation y of the input from activation h.

Encoding and Classification Process by LSTM
The generated fused features xf,t are fed into LSTM as inputs to conduct encoding and decoding for activity prediction.LSTM can be jointly trained, and our proposed model provides a trainable platform which is ideal for large-scale cognitive intelligence.The unique feature of LSTM is that it processes variable length inputs and produces high-level variable length predictions (Output).As shown in Figure 6, LSTM consists of an encoder and decoder; the encoder transforms input data xt to a corresponding activation h.The decoder in the output layer is trained to reconstruct an approximation y of the input from activation h.In general, the LSTM model has parameters i.e., W and b, which denotes the weights and the biases of input layer and the hidden layer respectively, generates an output zt of given input xt and a previous hidden state at time step t − 1 i.e., ht−1, and also updates the current hidden state ht.
( ) The next step is decoding, which is similar as the encoding step given in Equation (10), where W2 and b2 denotes the weights and biases of the hidden layer and the output layer.
( ) The final single label prediction for a video can be produced by using softmax classifier.A Softmax layer can be utilized to achieve the M-way class scores for a given video sequence.This single prediction can be achieved by averaging the label probabilities, which is the output of our decoder, and can be represented by the Equation (11).In general, the LSTM model has parameters i.e., W and b, which denotes the weights and the biases of input layer and the hidden layer respectively, generates an output z t of given input x t and a previous hidden state at time step t − 1 i.e., h t−1 , and also updates the current hidden state h t .
The next step is decoding, which is similar as the encoding step given in Equation (10), where W 2 and b 2 denotes the weights and biases of the hidden layer and the output layer.
The final single label prediction for a video can be produced by using softmax classifier.A Softmax layer can be utilized to achieve the M-way class scores for a given video sequence.This single prediction can be achieved by averaging the label probabilities, which is the output of our decoder, and can be represented by the Equation (11).
where W and b t are the trained parameters of the LSTM model, q ∈ Q is the prediction and t is the current time step.

Experiments
We conduct several experiments to validate the effectiveness of our system.Three well-known benchmark human action datasets, UCF101 [32], HDMB51 [33], and UCF Sports [34], have been used.The description of datasets with their validation schemes, experimental setup, results and comparative analysis are presented in subsequent sections.

Datasets
The UCF101 dataset is the extension of UCF50; it contains 101 different action categories.Each action category consists of at least 100 video.There are 13,320 video clips in total.Most of the video clips are realistic, clean and user-uploaded videos with cluttered background, illumination and camera motion.The dataset is divided into a training set containing 9.5 K videos and testing set containing 3.8 K videos.We adopt the evaluation scheme of the THUMOS13 challenge [35] and follow the three testing/training split for performance evaluation by reporting average recognition accuracy over these three splits.
The HDMB51 dataset comprises of variety of realistic videos collected from YouTube and Google video.There are 6766 manually annotated video clips of 51 different action classes and each action class containing about 100 video clips.For experimental setting, we follow the original evaluation guidelines using three-test splits, and each split with an action class has 30 sequences for testing and 70 sequences for training.The average accuracy over these three splits is used to measure the final performance.
The UCF sports dataset encompasses 150 videos from 10 action classes.These videos were recorded in real sports environments, taken from different television channels.This dataset exhibiting the occlusion, illumination conditions and variations in background make it a complex and challenging dataset.The average accuracy is used to measure the final performance.Some sample frames from three datasets are given in Figure 7.
the final performance.
The UCF sports dataset encompasses 150 videos from 10 action classes.These videos were recorded in real sports environments, taken from different television channels.This dataset exhibiting the occlusion, illumination conditions and variations in background make it a complex and challenging dataset.The average accuracy is used to measure the final performance.Some sample frames from three datasets are given in Figure 7.

Experimental Setup and Implementation Details
As UCF101 is the largest dataset among the three datasets, we use it to train the C3D model initially, and then transfer it to the learnt model to HMDB51 and UCF sports for feature extraction.RGB clips are resized to have a frame size of 128 × 171.On training, we randomly crop input clips

Experimental Setup and Implementation Details
As UCF101 is the largest dataset among the three datasets, we use it to train the C3D model initially, and then transfer it to the learnt model to HMDB51 and UCF sports for feature extraction.RGB clips are resized to have a frame size of 128 × 171.On training, we randomly crop input clips into 16 × 112 × 112 crops.We also horizontally flip them with 55% probability.We fine-tune the model parameters on the UCF101 dataset, where the initial learning rate is set as 0.003, which is divided by 2 every 150 K iterations.The optimization is stopped at 1.9 M iterations.
Since the network can only handle video frames, the videos need to be processed as video frames.Therefore, we extract a constant number video frames per second, which improves the generalization performance of the network.For the network to better capture the changes in the action, we extract two frames per second (fps) and loop the extracted frames, and fill up to 16 frames per video.

Results and Comparison Analysis
We conduct extensive experiments to evaluate the performance of our proposed method.In this section, we presented relevant experimental results and performance analysis.

Effect of Different Feature Fusion Techniques
In this section, we analyze the effect of different early fusion methods such as element-wise sum, concatenation, element-wise max and linear weighted fusion on our proposed framework.We show the recognition accuracy for UCF Sports dataset and also each split of UCF 101 dataset, and the average recognition accuracy over the three splits.The results are reported in Table 2.We observe that linear weighted fusion enhances the recognition accuracy of our approach by a fair margin, compared to other fusion methods.This enhancement may be due to the fact that the linear weighted fusion method efficiently fuses spatial and motion features.Therefore, we choose the linear weighted fusion method as our fusion scheme to fuse spatio-temporal features.This section computes the class-wise accuracy for action recognition.We investigate the recognition accuracy of our method by making a confusion matrix and considering 10 action classes of the UCF Sports dataset.Table 3 demonstrates the accuracy of each action category, the x-axis denotes the predicted labels and the y-axis represents the ground truth labels.The intensity of the true score is high (diagonal) for each category, and our method achieves 94% for all 10 classes.It is interesting to note that some of categories with similar actions are more easily confused with each other, such as golf swing, kicking, running, swing bench and walking; these categories interfere with each other and yield low scores.A possible reason for this is the similarity of the features and representations among actions.In addition, the number of training samples is too small, so the result is confusing and misclassification occurs.The confusion matrix of HMDB51 dataset is shown in Figure 8, which is well diagonalized.However, some categories are easily misclassified; nonetheless, our proposed approach still performs well with most action categories.In this section, we further verify the effectiveness of our model, and compare our proposed approach to different existing state-of-the-art Human Action Recognition approaches on UCF101 and HDMB51 benchmark datasets.The comparison of results is reported in Table 4.We organize these baseline methods into different categories with respect to the type of features and network being used, including traditional, deep-learned features, very deep-learned features and hybrid features.[27] 2015 88.6 -LRCN [25] 2015 82.9 -VideoLSTM [44] 2016 89.2 56.4 3D Convolution [14] 2016 91.8 64.6 STPP-LSTM [45] 2017 91.6 69.0 FCNs-16 [46] 2017 90.5 63.4 Hidden-Two-stream [47] 2017 90.3 58.9 Multi-LSTM [48] 2018 90.8 -

Hybrid-Model
TDD-iDT [19] 2015 91.5 65.9 C3D-iDT [13] 2015 90.4 -TSN [17] 2016 94.2 69.4 3D conv + iDT [14] 2016 93.5 69.2 SCLSTM [49] 2017 84.0 55.1 LTC-iDT [43] 2018 92.7 67.2 Ours LSTM-3D ConvNet -92.9 70.1 Compared to traditional methods, our model performs the best by 4.5% on both datasets, Compared with RNN-based methods such as (LRCN) [25] and (LSTM) [27], our model outperforms these two methods by 4.3% and 10% on UCF101 datasets respectively.Different experiments indicated that our approach possesses higher discriminative power, even using fewer parameters.It can be also seen that some methods with both features such as TSN [17] and 3D conv-iDT [14] lead to a performance gain by a minimal margin on the UCF101 dataset.We can explain the decrease in prediction rate by fact that this dataset contains action classes with cluttered backgrounds and illumination changes, and TSN is pre-trained on the large-scale ImageNet dataset, which provides large scale size and diversity.Our approach is based on C3D, which is pre-trained on the UCF101 dataset.However, our introduced method outperformed the 3D conv-iDT by 0.9% and the TSN method by 0.7% on the HDMB51 dataset, and showed the highest recognition rate on small-scale datasets.A possible reason for this higher recognition accuracy is that our model is based on a hybrid deep learning model, and the introduction of LSTM temporally works well by capturing the long-term dependencies and boosting the recognition accuracy for complex action categories in the HDMB51 dataset.We can conclude that a combination of LSTM with a 3D convolutional network for the spatiotemporal stream achieves better results and obtains recognition rates of 92.9% and 70.1% on UCF101 and HDMB51 datasets respectively.This shows that there is a degree of complimentary between LSTM and convolutional neural network.

Conclusions
In this paper, we proposed an action recognition framework by utilizing frame-level deep features of the 3D-CNN and processing it through LSTM.First, we introduced a 3Dconv-based model MMN and its iterative training method to integrate the discriminative information of a video into motion maps.Three-dimensional convolutional components extract compact and efficient spatiotemporal features from the input video in the form of feature maps.Moreover, we design a linear weighted fusion method to effectively fuse spatial and temporal feature maps.Finally, we adopt LSTM encoder/decoder to obtain video level representations to conduct video classification.According to the experimental results, our model takes the complementary information contained in multiple features (both spatial and motion features).It is also proof that the motion maps generated by our model intuitively integrate the dynamic information in an efficient manner, and that they retain more discriminative aspects.Moreover, our fusion method makes the features more detailed and specific.To verify the effectiveness of our framework, extensive experiments have been carried out on benchmark datasets, and the obtained results showed that our approach achieves promising performance.

Figures 1 .Figure 2 .
Figures 1. (a,b) Generation of our first and final Motion Map.

Figure 1 .
Figure 1.(a,b) Generation of our first and final Motion Map.

Figures 1 .Figure 2 .
Figures 1. (a,b) Generation of our first and final Motion Map.

Figure 2 .
Figure 2. Output Motion map generated by our network, illustrating the discriminative information integrated into a single motion map to classify the video category.

Future
Internet 2019, 10, x FOR PEER REVIEW 6 of 17 2 × 2 pooling kernels with stride 2 × 2 × 2, which means the size of the output signal is reduced by a factor of 8 compared with input signal.The first pooling layer, i.e., the pool1 layer, has a kernel size of 1 × 2 × 2, with the goal of not merging the temporal signal and preserving the temporal information in the early phases.The output of each convolution-al layer is a kind of volume in the form of feature maps.All pooling layers lead to the same number of feature maps as convolution layers but with reduced spatial resolution; also, these pooling layers introduce scale-invariant features.The two fully connected layers have 2048 outputs, and finally, a softmax layer is used to predict action labels.

Fi fi+1 Fi+1Figure 4 .Figure 4 .
Figure 4.The structure of our Motion Map Network, which illustrates the single iteration to generate Motion Map.

Algorithm 1 . 5 : 7 :L i end for 9 :
7 and assign the labels L i to clip C i j .We define the MMN as a function Z θ s (I a , I b ), where θ s denotes the parameters of MMN after the iteration of training length s.The initial parameters of the network are defined as θ 1 .For each s, we generate the motion map F c i j 1∼s−1 using Z θ s−1 , and train the MMN using the motion map F clip label L i .Finally, we can get θ s which is the parameter of our trained motion map network.The detail of the training steps is summarized in Algorithm 1. Training of Our Motion Map Network Input: V is Video dataset; Frame number of video dataset, N; Video labels, L; Maximum training iteration length, S; Parameters of our model, θ 1 ; Output: Final parameters of Network, θ s ; 1: Initialize the parameter θ 1 for our model; 2: for each s∈2,3, . . .,S do 3: cut V i into s-length clips C j i (j ∈ 1 . . .N i /s) with overlap 0.7; 4: Extract the video frames from C j i as f C j i ; for each j ∈1, 2 . . .N/s do 6: for k ∈1, 2 . . .s − 1 do Generate the motion map F Get the MMN parameters θ s ; end for

Figures 1 .Figure 2 .
Figures 1. (a,b) Generation of our first and final Motion Map.

Figure 8 .
Figure 8. Confusion matrix on the HMDB51 dataset using our model.

Table 2 .
Effect of different earlier fusion methods on our model.The accuracy (%) is computed on a UCF Sports dataset, and all three splits and their average on UCF101.

Table 3 .
Confusion matric of UCF sports dataset.

Table 4 .
Comparison to the state-of-the-art methods.