Agglomerative Clustering and Residual-VLAD Encoding for Human Action Recognition

Abstract: Human action recognition has gathered significant attention in recent years due to its high demand in various application domains. In this work, we propose a novel codebook generation and hybrid encoding scheme for classification of action videos. The proposed scheme develops a discriminative codebook and a hybrid feature vector by encoding the features extracted from CNNs (convolutional neural networks). We explore different CNN architectures for extracting spatio-temporal features. We employ an agglomerative clustering approach for codebook generation, which intends to combine the advantages of global and class-specific codebooks. We propose a Residual Vector of Locally Aggregated Descriptors (R-VLAD) and fuse it with locality-based coding to form a hybrid feature vector. It provides a compact representation along with high order statistics. We evaluated our work on two publicly available standard benchmark datasets, HMDB51 and UCF101. The proposed method achieves accuracies of 72.6% and 96.2% on HMDB51 and UCF101, respectively. We conclude that the proposed scheme is able to boost recognition accuracy for human action recognition.


Introduction
Human action recognition [1] is one of the active areas of research in computer vision. Action recognition systems can be used in many applications such as surveillance, human-computer interaction, content-based retrieval systems and video indexing. It involves the recognition of human actions from video sequences. This task can be challenging when posed with problems such as background clutter, partial occlusion, and variation of scale, appearance and lighting. An important aspect of action recognition is to find meaningful information from videos in the form of feature vectors. These feature vectors provide representations that help in discriminating different human actions. Most of the earlier research in action recognition focused on hand-crafted features such as local space-time features [2], spatio-temporal features [3] and Motion Boundary Histograms (MBH) [4].
These local spatio-temporal features have shown promising results and gave direction to future research. Moreover, spatio-temporal feature detector 3D Harris [5] and 3-D scale invariant feature transform [6] developed from their 2D counterparts provide interest point based video descriptors. These techniques usually do not require human detection algorithms. However, Heng et al. [7] have stated that dense sampling methods outperform the interest point based methods based on their evaluation of several interest point based descriptors and detectors. These spatio-temporal points extract information from short temporal durations, which can result in distortion of long-term temporal actions. Hence, techniques such as feature trajectories [8,9] are used to track these interest points in order to get a robust and accurate representation. These methods have made significant progress and have shown the ability of handling some recognition challenges such as noise, lighting changes and background clutter.
Recently, deep learning algorithms [10,11] have been widely used for feature learning and classification tasks. These deep architectures build high-level features on top of low-level features to provide an optimal representation and have proved to be quite successful in image classification. Inspired by their success on images, researchers have proposed a variety of architectures for video classification. Some of these architectures include spatio-temporal convolutions [10], recurrent neural networks [12] and two-stream networks [13], which train two individual streams on RGB data and motion-based optical flow data. These architectures, however, often fail to register or model the motion patterns and temporal information concurrently because both streams are trained independently with no link between them. To overcome this limitation, 3D convolutional neural network (CNN) architectures [14,15] have been proposed. These networks perform 3D convolutions and pooling, which enables the flow of temporal information across all layers. These architectures provide simultaneous feature learning for motion and appearance data, but have not proven to be effective due to their large number of parameters. More recently, Choutas et al. [16] have presented state-of-the-art results using 3D CNNs and an encoding of human pose motion, but their method requires the extraction of joint heatmaps for key point localization.
The action recognition approaches discussed earlier provide action representations in the form of embeddings or feature vectors that are either classified directly using a SoftMax layer or rely on a Bag-of-Words (BoW) pipeline. The BoW framework [17][18][19] has been widely used for action recognition, as it provides a simplified representation for classification tasks. It consists of the following steps: (i) feature extraction, (ii) vocabulary generation, (iii) feature encoding and classification. First, the feature vectors are extracted using a feature descriptor, for example, 3D-SIFT [6] or a ConvNet. The obtained features are quantized using a clustering algorithm [20] such as k-means or Gaussian mixture models (GMMs) to form a visual vocabulary. The entries in the vocabulary are known as visual words. These visual words form the basis for feature encoding and play an important role in recognition performance. Therefore, an optimal codebook generation algorithm that can provide discriminative visual words is highly desirable. The third step involves the encoding of features for each individual video. Many researchers have worked on feature encoding [21][22][23] to improve the final representation. This encoded vector is classified in a final stage, which is most commonly performed by support vector machines.
In this paper, the focus is to develop a discriminative codebook and a feature encoding method, which provide an all-encompassing representation for action videos. Codebook generation and encoding are the two most crucial steps of a BoW pipeline. The overall framework of the proposed method can be seen in Figure 1. In the first step, spatio-temporal features are extracted from ConvNets. The saliency of these features greatly impacts the overall recognition performance. Although hand-crafted features have had some success, such features do not generalize well to large scale video datasets [1]. Hence, we use the spatio-temporal two-stream ResNet [24] and a single stream 3D ResNet [25] architecture to extract feature descriptors. In the second step, these features are used for learning an agglomerative codebook. In the third step, the proposed agglomerative codebook is used for feature encoding. Finally, a support vector machine classifier is used for classification. The contribution of this work is three-fold:
• First, we propose an agglomerative codebook. The aim of this codebook is to agglomerate the benefits of a global codebook representation with class-specific codebook representations. The final agglomerative codebook provides discriminative codewords for the feature encoding step.
• Second, we propose a modified Vector of Locally Aggregated Descriptors (VLAD) vector, known as Residual-VLAD (R-VLAD), which computes higher order statistics by finding a difference vector between the feature descriptor and the mean of the nearest codeword. It requires fewer computational resources than VLAD due to the reduced size of the encoded vector.
• Third, to further enhance the capability of the R-VLAD vector, a hybrid feature vector is formed by fusing the locality-based descriptor with R-VLAD. The final encoded vectors are L2 normalized and serve as the input to the classifier.
The rest of the paper is organized as follows: In Section 2, we explain the proposed methodology; Section 3 describes the detailed experimentation and comparisons with the state-of-the-art. Finally, the paper is concluded in Section 4.

Proposed Methodology
The aim of our proposed framework is to generate a discriminative codebook and provide a better representation that results in improved recognition performance. The proposed scheme for codebook generation and feature encoding is given in Figure 1. As discussed earlier, ConvNets have outperformed traditional hand-crafted features; therefore, we extract features using ConvNets. Details of feature extraction are given in Section 2.1. These deep features are used for codebook generation and feature encoding. Sections 2.2 and 2.3 describe the codebook learning and the proposed feature encoding methodology, respectively.

Deep Feature Extraction
Deep CNN architectures when trained from scratch tend to overfit on small datasets like HMDB51 and UCF101. Therefore, we have used architectures that are pre-trained using a larger dataset, for example, ImageNet, Kinetics, etc. The video data contains the information related to appearance and flow of appearance, hence, we have used two architectures: a spatio-temporal 2D residual network [24] and a 3D architecture ResNeXt-101 [25]. The ResNet architectures have been successfully used in image classifications [24,25]. These architectures offer shortcut connections for bypassing the layers in the network. This allows the gradients to flow from later layers to earlier ones, which can be beneficial in training very deep architectures.
The ST-ResNet [24] is a two-stream convolutional ResNet with a spatial and a flow stream. Both motion and appearance streams use a ResNet-50 model that is pre-trained on the ImageNet dataset. We fine-tuned the model for the video datasets, i.e., HMDB51 and UCF101. One of the drawbacks of two-stream architectures is the difficulty of registering motion and appearance information. To overcome this issue, spatio-temporal cues are learned by the network at different scales. This is achieved by introducing residual connections between the motion and appearance streams. There are several ways of connecting these streams, just as there are various kinds of shortcut connections. The 2048-dimensional features for both streams are extracted after the pooling layer of the last convolutional layer.

ResNeXt-101
ResNeXt-101 is a 3D convolutional neural network with 101 layers. ResNeXt was originally proposed by Xie et al. [26], and Hara et al. [25] have further extended the work to 3D convolutions. These networks introduce the concept of cardinality, which denotes the size of the set of transformations. Increasing cardinality proved to be more effective than making networks wider or deeper. Convolutional groups are introduced in the ResNeXt block; these groups split the feature maps into smaller groups. The network used in this study has a cardinality of 32. We have used a Kinetics pre-trained network, provided by Hara et al. [25].

Agglomerative Codebook Generation
The codebook formation is an essential step in any BoW pipeline as discussed in Section 1. The codebook is a compact representation of the dominant features from the training data. This codebook is used for encoding of both training and testing data. Figure 2 depicts the proposed agglomerative clustering approach (see Algorithm 1).

Algorithm 1: Agglomerative Clustering
Input: A feature set F for N classes
Output: Agglomerative codebook C
Step 1: Build the global codebook $C_G$
    $C_G$ ← Cluster(F, number of clusters)
Step 2: Build the class-specific codebook $C_{cs}$
Step 3: Cluster $C_G$ and $C_{cs}$ to obtain the agglomerative codebook
    C ← Cluster($C_{cs} \cup C_G$, number of clusters)

Global Codebook Generation
A feature set F is extracted using a CNN architecture [24,25] for N action classes, i.e., $F = [F_1, F_2, \ldots, F_N]$. A global codebook $C_G$ is generated by clustering all these features. Features having similar patterns will be grouped into the same clusters. The k-means++ algorithm [27] is used as the cluster initialization method. Cluster optimization is performed by Lloyd's method [28]. These methods provide good results compared to plain k-means. The same setting is used in class-specific and final codebook generation. The global codebook with M clusters is represented as $C_G = \{c_m \mid m = 1, \ldots, M\}$. Because unlabelled data is used, the global codebook can cause discriminative features from different classes to be grouped into the same cluster. This codebook provides a global simplified representation of the entire training data.

Class-Specific Codebook
The second phase of codebook generation is learning a class-specific codebook. In a class-specific codebook, the features from each class are clustered separately. For a given class i with n extracted features, the class-specific codebook is given by $c_i = \{c_k \mid k = 1, \ldots, K\}$, where K represents the total number of clusters for class i. The final class-specific codebook is formed by appending the class-specific codebooks of all classes, $C_{cs} = \{c_j \mid j = 1, \ldots, (K \times N)\}$, where N represents the total number of classes. The codewords in this codebook are class-specific, meaning that each individual class of features makes an equal contribution to the codebook. However, this codebook lacks the global representation that is present in the global codebook.
The final phase of clustering involves the clustering of codewords. To combine the advantage of global representation along with class-specific representation, we introduce the idea of clustering these global and class-specific codewords. k-means clustering is applied on centroids of the global C G and the class-specific C cs codebooks. This step helps in finding linkages between the codewords having similar information in the global and the class-specific codebooks. The nearest codewords are combined, which results in a better representation containing information of global as well as class-specific visual words. The combination of two different codebooks proves to be beneficial for encoding of features. The size of the codebooks is chosen very cautiously, as it affects the overall performance of the BoW approach. Large sizes can cause processing overheads, redundant codewords and less generalizability. A small size can cause distinct features to be clustered under the same codeword, which can reduce the discriminative power of the codebook. Different codebook sizes have been tested experimentally with details given in Section 3.3.1.
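The three clustering phases described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names `kmeans` and `agglomerative_codebook`, the iteration count, and the toy cluster sizes are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """k-means++ seeding followed by Lloyd iterations; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # k-means++: pick the next seed with probability proportional to the
        # squared distance to the closest seed chosen so far.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:
            centroids.append(list(rng.choice(points)))
        else:
            centroids.append(list(rng.choices(points, weights=d2, k=1)[0]))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def agglomerative_codebook(features_by_class, m_global, k_per_class, m_final):
    """features_by_class: one list of feature vectors per action class."""
    # Step 1: global codebook from the features of all classes together.
    all_features = [f for feats in features_by_class for f in feats]
    c_global = kmeans(all_features, m_global)
    # Step 2: class-specific codebook, clustering each class separately
    # and appending the resulting codewords.
    c_cs = []
    for feats in features_by_class:
        c_cs.extend(kmeans(feats, k_per_class))
    # Step 3: cluster the codewords of both codebooks.
    return kmeans(c_global + c_cs, m_final)
```

In this sketch the final clustering operates only on the M + K × N codewords rather than on the raw features, so the third phase is cheap compared with the first two.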

Feature Encoding
The proposed feature encoding provides a simplified R-VLAD encoding and a fusion-based representation for deep features, discussed in Sections 2.3.4 and 2.3.5, respectively. First, the features are extracted using a ConvNet, as discussed in Section 2.1. These features are then used to construct the agglomerative codebook. The codebook thus formed is used in the subsequent encoding stage. The sections below first review the most commonly used encoding schemes: Vector Quantization (VQ), VLAD and locality-based coding; the proposed method is then discussed in Section 2.3.4.

Vector Quantization (VQ)
Let X be the set of D-dimensional descriptors extracted from a video, with N descriptors in total, i.e., X = [x_1, x_2, . . . where contains the set of codes and $\|v_i\|_0 = 1$ means that each code $v_i$ contains only one non-zero element. Only one codeword is voted for each descriptor x. Vector quantization is also known as hard quantization; it causes information loss due to large quantization errors.
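As an illustration of hard assignment, the sketch below builds a histogram of codeword occurrences for one video. The function names and the toy codebook are hypothetical and chosen only for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_codeword(x, codebook):
    """Index of the codeword closest to descriptor x."""
    return min(range(len(codebook)), key=lambda m: sq_dist(x, codebook[m]))

def vq_encode(descriptors, codebook):
    """Each descriptor votes for exactly one codeword (hard assignment)."""
    histogram = [0] * len(codebook)
    for x in descriptors:
        histogram[nearest_codeword(x, codebook)] += 1
    return histogram

codebook = [[0, 0], [10, 10]]
descriptors = [[1, 0], [9, 9], [0, 1]]
print(vq_encode(descriptors, codebook))  # [2, 1]
```

The histogram discards how far each descriptor lies from its codeword, which is the quantization error referred to above.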

VLAD Super Vector
VLAD [23] is among the most widely used super vector methods and has shown promising results in a number of tasks. It can be regarded as a simplified version of Fisher vector encoding.
The final representation is a higher dimensional super vector, which contains zeros and the difference vector computed between nearest centroid and descriptor. The final representation has dimension of M × D.
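A minimal sketch of standard VLAD aggregation is given below: residuals to the nearest centroid are accumulated in one D-dimensional block per codeword, and blocks of unused codewords stay zero. Function names and the toy data are assumptions for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_codeword(x, codebook):
    """Index of the codeword closest to descriptor x."""
    return min(range(len(codebook)), key=lambda m: sq_dist(x, codebook[m]))

def vlad_encode(descriptors, codebook):
    """Return the flattened M x D vector of per-codeword residual sums."""
    dim = len(codebook[0])
    blocks = [[0.0] * dim for _ in codebook]
    for x in descriptors:
        m = nearest_codeword(x, codebook)
        for i in range(dim):
            blocks[m][i] += x[i] - codebook[m][i]
    return [value for block in blocks for value in block]

codebook = [[0, 0], [10, 10]]
descriptors = [[1, 0], [9, 9], [0, 1]]
print(vlad_encode(descriptors, codebook))  # [1.0, 1.0, -1.0, -1.0]
```

With M = 2 codewords and D = 2 dimensions the output has M × D = 4 entries, matching the dimension stated above.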

Locality Constrained Coding
For the success of many recognition applications, data locality proves to be an important factor. In [29], the authors have shown that sparse information is not enough to handle large occlusions. Local information is useful in the reconstruction of samples for query data by using nearest codewords, whereas in sparse coding the farthest entries from codebook are used for reconstruction, which is undesirable. Contrary to sparse coding, locality-based coding ensures that codewords are similar for a given class of query data. This inspires us to use a locality-based features encoding approach to achieve good representation. Locality constrained linear coding is a fast implementation for local coordinate coding. The query features are encoded by projecting features into a local coordinate system. Consider a video descriptor X, i.e., X = [x 1 , x 2 , . . .
where $e = \exp\left(\mathrm{dist}(x, D)/\sigma\right)$ and $\sigma$ is a constant that regulates the weight decay speed of the locality adaptor. The constraint $\mathbf{1}^T s = 1$ is a requirement for the final encoded vector.
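For reference, the objective of locality-constrained linear coding in its commonly cited form is written out below. The symbol $\lambda$ for the regularization weight is an assumption of this note (the experimental section refers to a regularization parameter β, which may play the same role), and $D$ here denotes the codebook, as in $\mathrm{dist}(x, D)$, rather than the descriptor dimension.

```latex
\min_{S} \sum_{i=1}^{N} \left\| x_i - D s_i \right\|^2
  + \lambda \left\| e_i \odot s_i \right\|^2
\quad \text{s.t.} \quad \mathbf{1}^T s_i = 1, \ \forall i,
\qquad
e_i = \exp\!\left( \frac{\mathrm{dist}(x_i, D)}{\sigma} \right)
```

The first term is the reconstruction error of descriptor $x_i$ from the codebook, and the second term penalizes coefficients on codewords that are far from $x_i$, which is what enforces locality.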

Proposed Residual-VLAD (R-VLAD)
This section describes the proposed encoding method inspired by the VQ and VLAD encoding schemes. The goal is to find a simple yet meaningful representation for the given spatio-temporal descriptors of a video. The first step is to find the nearest codeword c for each $f_i$ in $F = [f_1, f_2, \ldots, f_N] \in \mathbb{R}^{D \times N}$ of a video. These nearest codeword assignments are similar to VQ, but instead of a hard assignment into a histogram, a residual vector similar to VLAD is obtained by finding the difference between the descriptor $f_i$ and the mean of the assigned codeword. This helps in describing the distribution of feature vectors with respect to the center. The mean of each assigned codeword is a unique value, which is important statistical information. The final encoded vector is given by: where c represents the nearest codeword and $f_j$ represents the jth component of the feature descriptor F. This difference vector brings complementary information related to the assigned codewords, which proves to be beneficial for the classifier. The final dimension of the encoded vector is the same as that of the feature descriptor, represented by D (the feature dimensionality). Due to this reduced size we call it the Residual VLAD or R-VLAD feature vector. In comparison, VLAD has a dimension of M × D (number of clusters × feature dimensionality).
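A minimal sketch of R-VLAD pooling follows. It rests on two assumptions that should be checked against the original equation: that "the mean of the assigned codeword" denotes the centroid vector of the nearest cluster, and that the residuals of all descriptors are summed into a single D-dimensional vector. The function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_codeword(x, codebook):
    """Index of the codeword closest to descriptor x."""
    return min(range(len(codebook)), key=lambda m: sq_dist(x, codebook[m]))

def rvlad_encode(descriptors, codebook):
    """Pool residuals to the nearest codeword into ONE D-dimensional vector.

    Assumption: the residual is taken against the centroid vector itself.
    """
    dim = len(codebook[0])
    pooled = [0.0] * dim
    for f in descriptors:
        c = codebook[nearest_codeword(f, codebook)]
        for i in range(dim):
            pooled[i] += f[i] - c[i]
    return pooled

codebook = [[0, 0], [10, 10]]
descriptors = [[2, 1], [9, 10], [0, 1]]
print(rvlad_encode(descriptors, codebook))  # [1.0, 2.0]
```

Unlike VLAD, the output length does not grow with the number of codewords: it stays at D regardless of the codebook size.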

Proposed Hybrid Feature Vector
For a fusion-based representation, the LLC [29] code is used along with the modified VLAD vector. The final form (see Algorithm 2) of the hybrid feature vector (HFV) is the concatenation of the LLC and the modified VLAD represented by: where ψ(s) is the LLC descriptor, c represents the nearest codeword and $f_j$ represents the jth component of the feature descriptor F. This residual vector and locality-based code form the final vector with a dimension of D + M (feature dimensionality + number of clusters). This vector contains the aggregation of higher order statistics while maintaining a low dimension, unlike VLAD, which generates an M × D dimensional super vector. To suppress the large values within this vector we employ intra-normalization and L2 normalize each residual vector.
Algorithm 2, Step 3: Fusing R-VLAD and LLC descriptors
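The fusion step can be sketched as below, assuming the LLC code (length M) and the R-VLAD vector (length D) have already been computed for a video. The function names are hypothetical, and the order of operations (normalize the residual, concatenate, then L2 normalize the result) is one reading of the description above.

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def hybrid_feature_vector(llc_code, rvlad):
    """Concatenate the LLC code with the normalized R-VLAD residual."""
    fused = list(llc_code) + l2_normalize(rvlad)
    return l2_normalize(fused)  # final vector passed to the classifier

llc_code = [0.5, 0.5]   # toy LLC code for M = 2 codewords (sums to 1)
rvlad = [3.0, 4.0]      # toy R-VLAD vector for D = 2
hfv = hybrid_feature_vector(llc_code, rvlad)
print(len(hfv))  # 4
```

The resulting length is M + D, here 2 + 2 = 4.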

Experimental Results and Discussion
In this section, experimental evaluation of the proposed methodology is presented. First, the publicly available datasets that have been used for experimentation are discussed in Section 3.1. The implementation details for the proposed scheme are summarized in Section 3.2. Next, the performance evaluation is given in Section 3.3 and finally a comparison with the state-of-the-art is provided in Section 3.5.

Datasets
Experimentation has been carried out on two widely used action recognition datasets: HMDB51 and UCF101. HMDB51 was developed by Serre Lab of Brown University [30]. The videos are taken from YouTube and movies. It contains 51 different categories with 6800 videos approximately.
Each action category contains at least 100 videos with reasonable discrepancy. The actions in the dataset can be grouped into five categories: facial movements (chew, laugh, talk, smile), body movements (clap hands, walk, wave, jump, pull up, cartwheel, etc.), facial movement with objects (smoke, eat, drink), body movements with objects (brush hair, sword, pour, catch, ride bike, etc.) and human interactions (fencing, kick, shake hands, etc.). Figure 3 shows some images from these action categories. The HMDB51 is considered as one of the most challenging datasets due to poor quality videos and substantial camera motions. UCF101 is an extension of the UCF50 dataset [31], developed by the University of Central Florida. The videos are taken from YouTube. The dataset contains 101 realistic human action categories. With more than 13,000 videos in the dataset, the UCF101 offers a large diversity in terms of actions. The action classes in the dataset are divided into five types: body motion, sports, human to object interaction, human-human interaction and playing of musical instruments. Each action category is divided into 25 groups each having four to seven clips. The mean clip duration is around 7.21 s. The resolution of videos is 320 × 240 with a frame rate of 25 fps. Some sample frames from these videos are given in Figure 4. For these two datasets three training and testing splits are provided, which are used as a standard evaluation scheme. The average accuracy of these training-testing splits is used for performance evaluation.

Implementation Details
The experiments were performed on a workstation with an Intel Xeon 2687, 64GB RAM and an Nvidia P5000 GPU. For feature extraction two ConvNets have been used, as discussed in Section 2.1. The spatio-temporal ResNet architecture was taken from Feichtenhofer et al. [24]. It was pre-trained on ImageNet and fine-tuned on our datasets. It has residual connections between the motion and appearance streams to incorporate the registration of streams. RGB and flow features are extracted from the pooling layer after the conv_5 block from both streams. The ResNeXt model provided by Hara et al. [25] was pre-trained on the large video dataset Kinetics. The top layers of the network, conv_5 and the FC layer, were fine-tuned for the HMDB51 and UCF101 datasets. The model used for fine-tuning was trained on 64-frame input. The input size for ResNeXt-101 is 3 × 64 × 112 × 112. The videos are first converted into frames using the ffmpeg-python library. The frames are resized to 112 × 112 pixels. Videos having fewer than 64 frames are looped as many times as necessary to reach 64 frames. Details regarding training parameters and network architecture can be seen in [25]. The training data used in training these networks were taken from the standard training splits provided with the datasets [30,31]. For the HMDB51 dataset, each training split includes around 70 videos from each action class, and for the UCF101 dataset, 18 groups out of 25 are used for training and the rest are used as testing data. The networks were fine-tuned for each of the three splits for both datasets.
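The looping of short videos can be illustrated with a small helper. This is only a sketch: the name `loop_to_length` is hypothetical, frames are represented as arbitrary list items, and clips that already have at least 64 frames are returned unchanged because the text does not say how longer clips are sampled.

```python
def loop_to_length(frames, target=64):
    """Repeat a short clip from its start until it has `target` frames."""
    if not frames:
        raise ValueError("video has no frames")
    if len(frames) >= target:
        return list(frames)
    repeats = -(-target // len(frames))  # ceiling division
    return (list(frames) * repeats)[:target]

print(loop_to_length([0, 1, 2], target=7))  # [0, 1, 2, 0, 1, 2, 0]
```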
A discriminative codebook is learned using a three-step clustering approach, as discussed earlier. The k-means++ cluster initialization and Lloyd's optimization scheme were set with default parameters as they performed best in preliminary experimentation. First, a global codebook is learned considering features from all the classes. The number of clusters is varied to achieve the best results. Secondly, a class-specific codebook is formed using the features from individual classes. Finally, clusters from both codebooks are clustered together to form a final discriminative codebook. The final step in the pipeline is feature encoding. The features from each video are encoded using the proposed hybrid feature vector. These encoded vectors are L2 normalized. The final normalized vectors act as the training data for classifier, using a one-vs-all SVM classifier.

Performance Evaluation
This section discusses the parameters that are tuned to achieve optimal performance. The codebook size is a notable parameter for optimal codebook generation. In the encoding part, we have varied the number of neighbors and beta parameters for locality constraint coding. Finally, we explain the impact of normalization.

Codebook Size
The codebook size affects the performance of the BoW pipeline. In the proposed scheme, three codebooks are formed: the global codebook using features from all the classes, a class-specific codebook using data from individual classes, and the final codebook obtained by combining the two prior codebooks. Different combinations of sizes have been tried, for example, varying the size of the class-specific codebook while keeping the global codebook size fixed and vice versa. The size of the final codebook was generally kept at half of the combined codebooks. The number of clusters for the global codebook was varied from 50 to 500, while the number of clusters for the class-specific codebook was fixed to two clusters per class, as suggested by empirical evaluations. Figure 5 shows the results for HMDB51 split 1 using features from ST-ResNet and ResNeXt-101. The best results are produced when the global codebook size is 200 for ResNeXt and 150 for ST-ResNet.
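As a worked example of these sizes for HMDB51 (51 classes, two clusters per class), the helper below computes the class-specific, combined and final codebook sizes. The function name is hypothetical, and rounding "half" down to an integer is an assumption.

```python
def codebook_sizes(num_classes, clusters_per_class, global_size):
    """Return (class-specific size, combined size, final size)."""
    class_specific = num_classes * clusters_per_class
    combined = global_size + class_specific
    final = combined // 2  # "half of the combined codebooks", rounded down
    return class_specific, combined, final

print(codebook_sizes(51, 2, 200))  # ResNeXt-101 setting: (102, 302, 151)
print(codebook_sizes(51, 2, 150))  # ST-ResNet setting:   (102, 252, 126)
```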

The Encoding Parameters and Normalization Scheme
In the proposed hybrid feature vector, the number of nearest neighbors and the value of the β parameter (used for regularization) for locality-constrained coding have been evaluated empirically while fixing the size of the codebook as mentioned previously. The number of neighbors was varied from 1 to 10; the best results were achieved with five nearest neighbors, and the value of β was set to 0.2. Normalization also influences the recognition performance of the proposed scheme, as learned during the experimentation. Five different normalization schemes have been evaluated, namely: L1, L2, Power Normalization (PN), PN+L1 and PN+L2. The α parameter for Power Normalization has been set to 0.5. The results for these normalization methods are given in Figure 6.
The results are computed on HMDB51-split 1, where L1 and PN performed poorly. The best results are computed using L2 normalization. The combination of PN+L2 provides almost similar results but we rely only on L2 to avoid extra processing.
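The five normalization schemes can be sketched as follows. Power normalization is written in its common signed form, sign(x)·|x|^α, which is an assumption about the exact variant used here; the function names and the `SCHEMES` table are hypothetical.

```python
import math

def l1_normalize(v):
    """Divide by the sum of absolute values (zero vector left unchanged)."""
    total = sum(abs(x) for x in v)
    return [x / total for x in v] if total > 0 else list(v)

def l2_normalize(v):
    """Divide by the Euclidean norm (zero vector left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def power_normalize(v, alpha=0.5):
    """Signed power normalization: sign(x) * |x| ** alpha."""
    return [math.copysign(abs(x) ** alpha, x) for x in v]

SCHEMES = {
    "L1": l1_normalize,
    "L2": l2_normalize,
    "PN": power_normalize,
    "PN+L1": lambda v: l1_normalize(power_normalize(v)),
    "PN+L2": lambda v: l2_normalize(power_normalize(v)),
}

print(SCHEMES["PN"]([4.0, -9.0]))  # [2.0, -3.0]
```

Power normalization compresses large magnitudes, while L1 and L2 only rescale the vector, so the combined schemes apply the compression first and the rescaling second.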

Comparison with Other Encoding Schemes
The comparison of the proposed R-VLAD vector and Hybrid Feature Vector (HFV) encoding method with other encoding schemes is presented in Table 1. The parameters for the proposed encoding scheme are fixed as discussed previously. Vector quantization follows hard assignment, and a histogram of codeword occurrences is formed. The results show that these assignments can cause loss of information, which ultimately affects recognition performance. Soft assignment (SA) [32] is another encoding method, which considers all the codewords for the final voting. The β parameter is used for controlling the softness. We set β to 1 in the evaluation. An improvement over VQ is seen, as more information related to other codewords is encoded. VLAD and Fisher vectors are used in standard form with a codebook size of 256. The R-VLAD vector outperforms the standard VLAD representation despite being a lower dimensional (more compact) representation, as its residual vector still contains high order statistics, as discussed earlier. Fisher vector encoding performed poorly, which shows that super vectors do not necessarily improve accuracy. The best accuracy is obtained on both datasets with the proposed HFV, which combines the R-VLAD with the LLC descriptor.
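For comparison with hard assignment, a sketch of soft assignment for a single descriptor is shown below. It assumes the common Gaussian-kernel form, with weights proportional to exp(−β · squared distance) and normalized to sum to one; whether [32] uses exactly this form is not stated here, and the function names are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def soft_assign(x, codebook, beta=1.0):
    """Weights over ALL codewords; larger beta gives a harder assignment."""
    d2 = [sq_dist(x, c) for c in codebook]
    shift = min(d2)  # subtracting the minimum avoids underflow in exp
    weights = [math.exp(-beta * (d - shift)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]

codebook = [[0, 0], [10, 10]]
print(soft_assign([5, 5], codebook))  # [0.5, 0.5]
```

A descriptor that is equally distant from two codewords votes equally for both, whereas vector quantization would have to pick one of them.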

Comparison with State-of-the-Art
A comparison with the relevant state-of-the-art is given in Table 2. The results are computed by calculating the average accuracy over the three splits of the UCF101 and HMDB51 datasets. First, the proposed scheme is compared with two-stream CNNs. Most of these methods combine the encoding schemes with improved dense trajectories. The proposed scheme performs better than these methods except in the case of iDT + VLMPF [34]. Those results are computed by fusion of five different features, namely, SCN, TCN, C3D, HMG and iDT. However, the computational complexity involved in extracting these deep features and dense trajectories makes this method inefficient. Subsequently, the results for 3D ConvNets are compared. The results suggest that 3D CNNs trained on large video datasets outperform the ones trained on large image datasets. The proposed scheme improves the performance of existing 3D ConvNets without the addition of more expensive dense trajectories. Furthermore, the proposed approach can be extended to two-stream 3D ResNet architectures, given that computational complexity is not a matter of concern. These two streams consist of RGB and flow data, which proves to be effective in terms of higher accuracy. As the results suggest, accuracy can be improved with a traditional BoW pipeline, but at the cost of additional computational overhead.

Conclusions
In this work, we addressed two important aspects of the Bag-of-Words framework: optimal codebook generation and feature encoding. Two deep ConvNet architectures were explored for feature extraction, a two-stream 2D ResNet and a 3D ResNet. We have proposed an effective agglomerative clustering approach for codebook formation. In the empirical tests conducted on global and class-specific codebooks, this approach provided the most discriminative codewords. The limitation of this approach is that it takes three steps to generate the final codebook. The proposed R-VLAD feature encoding offers a compact representation. The R-VLAD is combined with a locality-based descriptor to form a hybrid feature vector, which offers an inclusive spatio-temporal representation. The 3D ResNets pre-trained on large video datasets have shown competitive performance compared to the 2D ResNets.
While this study focused on a 2D ResNet and a single stream 3D ResNet, in the future, two-stream 3D architectures will be explored for learning better video representations. In addition, we plan to determine the fusion weights for individual streams and to improve computational efficiency by embedding the encoding in an end-to-end trainable ConvNet.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: