Feature Encodings and Poolings for Action and Event Recognition: A Comprehensive Survey

Abstract: Action and event recognition in multimedia collections is relevant to progress in cross-disciplinary research areas including computer vision, computational optimization, statistical learning, and nonlinear dynamics. Over the past two decades, action and event recognition has evolved from early intervening strategies under controlled environments to recent automatic solutions under dynamic environments, creating an imperative requirement to effectively organize spatiotemporal deep features. Consequently, resorting to feature encodings and poolings for action and event recognition in complex multimedia collections is an inevitable trend. The purpose of this paper is to offer a comprehensive survey of the most popular feature encoding and pooling approaches in action and event recognition in recent years, systematically summarizing both the underlying theoretical principles and the original experimental conclusions of those approaches based on an approach-based taxonomy, so as to provide impetus for future relevant studies.


Introduction
More and more research efforts within the computer vision community have focused on recognizing actions and events from uncontrolled videos over the past two decades. There are many promising applications for action and event recognition, such as abnormal action and event recognition in surveillance applications [1][2][3][4], interaction action and event recognition in entertainment applications [5][6][7][8], and home-based rehabilitation action and event recognition in healthcare applications [9][10][11][12], as well as many other analogous applications such as those in [13][14][15][16][17][18]. According to the definition given by NIST [19], an event is a complex activity occurring at a specific place and time, which involves people interacting with other people and/or objects, and consists of a number of human actions, processes, and activities. Feature representation approaches, pattern recognition models, and performance evaluation strategies are the three key components of action and event recognition [20]. Compared to the other two components, feature representation approaches, which should capture robust appearance and motion information, play a more critical role in video analysis. Feature representation itself comprises three key components: feature extraction, feature encoding, and feature pooling.
Feature extraction is mainly concerned with how to extract the required features from specified multimedia collections. On the one hand, many approaches have been proposed for feature extraction using specified sensors. For example, Lara et al. [21] surveyed the state of the art in wearable sensor-based human activity recognition, and categorized feature extraction from time series data into statistical and structural approaches, with measured attributes of acceleration, environmental signals and vital signs. Jalal et al. [22] developed a life logging system, where both magnitude features and direction angle features were extracted from depth silhouettes of human activities captured by a depth camera. Yang et al. [23] applied a low-bandwidth wearable motion sensor network to recognize human actions distributed on individual sensor nodes and a base station computer, based on a set of LDA features. Song et al. [24] proposed a robust feature approach, namely the body surface context, which encodes the cylindrical angle of the difference vector according to the characteristics of the human body, for action recognition from depth camera videos. Jalal et al. [25] presented a methodology for a human activity recognition-based smart home application, extracting both magnitude and direction features from human silhouettes captured by a depth camera. Althloothi et al. [26] extracted two sets of 3D spatiotemporal features via a Kinect sensor for human activity recognition, where the shape features were derived from surface points using spherical harmonics coefficients, and the motion features were derived from the end points of the distal limb segments. Besides, similar research work in the area can be found in references [27][28][29].
On the other hand, a number of approaches have also been proposed for extracting multimedia features, including audio features, visual features, text features, and hybrid features. For example, Li et al. [30] proposed the extraction of deep audio features for acoustic event detection by a multi-stream hierarchical deep neural network. Kumar et al. [31] proposed a unified approach that adopted strongly and weakly labeled data for audio event and scene recognition based on mel-cepstral coefficient features. Farooq et al. [32] constructed a feature-structured framework using skin joint features and a self-organizing map for 3D human activity detection, tracking and recognition from RGB-D video sequences. Siswanto et al. [33] verified by experiments that the PCA-based Eigenface outperformed the LDA-based Fisherface in facial recognition for biometrics-based time attendance purposes. Manwatkar et al. [34] designed an automatic image-text recognition system using matrix features and Kohonen neural networks. Chang et al. [35] invented a source-concept bi-level semantic representation analysis framework for multimedia event detection. Jalal et al. [36] proposed a hybrid feature representation approach called depth silhouettes context, which fuses invariant features, depth sequential silhouette features and spatiotemporal body joint features, for human activity recognition based on embedded Hidden Markov Models. Kamal et al. [37] presented a framework for 3D human body detection, tracking and recognition from depth video sequences using spatiotemporal features and a modified HMM. Jalal et al. [38] proposed a hybrid multi-fused spatiotemporal feature representation approach that concatenates four skeleton joint features and one body shape feature to recognize human activity from depth video.
Feature encodings and poolings are primarily concerned with how to organize the extracted features effectively, so that overall performance in action and event recognition, such as recognition precision, recognition robustness and computational complexity, can be further improved. The difference between action representation and event representation lies mainly in their feature extraction approaches, whereas the two representations usually share the same feature encodings and poolings. Due to the complex spatiotemporal relationships between actions and events in multimedia collections, feature encodings and poolings are becoming increasingly important in feature representation approaches. Many survey papers are available on feature representation approaches for action and event recognition [39][40][41][42][43][44][45][46][47][48]. However, most of them focused on reviewing feature extraction approaches, and thus did not take comprehensive account of feature encoding and pooling approaches, which have been widely used in both image and video analysis. The purpose of this paper is to conduct a complete survey of the most popular feature encoding and pooling approaches in action and event recognition from recent years.

Feature Encoding and Pooling Taxonomy
The hierarchical taxonomy of this survey is shown in Figure 1. The methodologies of feature encodings and poolings for action and event recognition proposed in recent years can be classified into four categories: 2D encodings, 3D encodings, general poolings, and particular poolings. Most of the methodologies surveyed in this paper are from papers in either top conferences or top journals in computer vision and pattern recognition.
The rest of the survey paper is organized as follows. Firstly, popular approaches for 2D encodings are surveyed in Section 3. Then, Section 4 covers popular approaches for 3D encodings. In addition, Section 5 presents popular approaches for both general pooling strategies and particular pooling strategies. Finally, Section 6 concludes the survey paper.

2D Encodings
There are eight popular approaches surveyed for 2D encodings, as shown in Table 1.

Standard Bag-of-Features Encoding
Since most classifiers require fixed-length feature vectors, the ever-changing number of local features in images or videos, e.g., SIFT [87,88] and HOG3D [89], poses difficulties for further pattern recognition tasks. This issue can be solved by the most popular orderless tiling method, called bag-of-keypoints [49] or bag-of-visual-words (BOV) [50], which was inspired by the well-known bag-of-words representation used in text categorization.
BOV treats an image or video frame as a document, the visual vocabulary, generated by clustering a large set of local features extracted from patches around detected interest points, as the word vocabulary, and the cluster centers as the visual words. A BOV representation is typically a high-dimensional, sparse visual-word occurrence histogram.
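The encoding described above reduces to a nearest-word assignment followed by a normalized histogram. A minimal numpy sketch, where the random vectors stand in for SIFT-like descriptors and a pre-clustered vocabulary (all values are illustrative):

```python
import numpy as np

# Toy bag-of-visual-words encoding: assign each local descriptor to its
# nearest visual word, then build the normalized word-occurrence histogram.
rng = np.random.default_rng(0)
vocabulary = rng.normal(size=(5, 8))     # 5 visual words, 8-D "descriptors"
descriptors = rng.normal(size=(100, 8))  # 100 local features from one image

# Hard-assign each descriptor to its nearest cluster center (visual word).
dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

# BOV encoding: L1-normalized occurrence histogram over the vocabulary.
bov = np.bincount(assignments, minlength=5).astype(float)
bov /= bov.sum()
```

The resulting fixed-length vector (here 5-dimensional) is what downstream classifiers consume, regardless of how many local features the image produced.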
BOV has been widely used in event and action recognition. Wang and Schmid improved the dense trajectories approach [90] by taking camera motion into account and proposed an efficient state-of-the-art action recognition approach, called Improved Trajectories, where they observed an improvement due to their motion-stabilized descriptors when encoding the extracted features with BOV. Lan et al. [91] adopted spatial BOV as one of three encodings to combine more than 10 kinds of features for the TRECVID 2013 multimedia event detection and multimedia event recounting tasks. Ye et al. [92] developed the Raytheon BBN Technologies (BBN)-led VISER system for the same TRECVID 2013 tasks, which is based on the BOV approach built on low-level features extracted from pixel patterns in videos.
Although performing surprisingly well, BOV-based approaches are unable to understand the deep semantic contents of videos, such as their hierarchical components, which is a common issue in high-level event and action recognition [20]. Another drawback is that important spatial information is lost in this coarse representation.

Fisher Vector Encoding
Based on the Fisher kernel framework [93], the Fisher vector [51][52][53][54] is an extension of the bag-of-visual-words (BOV) that encodes the feature vectors of patches within one specific image by a gradient vector derived from the log-likelihood of a universal generative Gaussian mixture model (GMM) [94][95][96][97][98][99], which can be regarded as a probabilistic visual vocabulary.
Let X = {x_t | 1 ≤ t ≤ T} be the set of T local low-level D-dimensional feature vectors extracted from an image, e.g., a set of 128-dimensional SIFT feature vectors of interest patches in the image. Let u_Θ(x) = Σ_{i=1}^{N} w_i g(x | μ^(i), Σ^(i)) be an N-component GMM with parameters Θ = {w_i, μ^(i), Σ^(i) | 1 ≤ i ≤ N}, where w_i are the mixture weights, μ^(i) ∈ R^D are the mean vectors, Σ^(i) are the D × D symmetric positive definite covariance matrices, which are assumed to be diagonal with variance vector σ_i^2 = diag(Σ^(i)), and g(x | μ^(i), Σ^(i)) are the component Gaussian density functions.
Then, the Fisher vector encoding adopted in [53] follows these steps: (1) The GMM parameters Θ are trained on a large number of images using the expectation-maximization (EM)-based maximum likelihood estimation (MLE) of [100][101][102]. (2) For each Gaussian component i, normalized gradient vectors Φ_μ^(i)(X) and Φ_σ^(i)(X) of the log-likelihood of X with respect to the mean μ^(i) and the standard deviation σ^(i) are computed. (3) The final Fisher vector Φ_norm(X) of the image X is the concatenation of the normalized gradient vectors over all N components. Thus, the dimension of the Fisher vector Φ_norm(X) is 2ND. Furthermore, the authors stated that linear classifiers provide almost the same results as computationally expensive kernel classifiers when using the Fisher vector encoding, which yields high classification accuracy and is efficient for large-scale processing.
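The gradient computation can be illustrated with a hedged numpy sketch. The GMM parameters below are toy values rather than a trained vocabulary, and only the standard mean and variance gradients (which give the 2ND dimensionality) are computed, without the normalization refinements of the original method:

```python
import numpy as np

# Toy Fisher vector: gradients of the GMM log-likelihood w.r.t. the means
# and (diagonal) standard deviations, for N Gaussians in D dimensions.
rng = np.random.default_rng(1)
N, D, T = 3, 4, 50                    # N components, D dims, T local features
weights = np.full(N, 1.0 / N)
means = rng.normal(size=(N, D))
sigmas = np.ones((N, D))              # diagonal standard deviations
X = rng.normal(size=(T, D))           # local descriptors of one image

# Posterior (soft assignment) of each feature to each Gaussian; constant
# factors of the densities cancel in the normalization.
diff = (X[:, None, :] - means[None, :, :]) / sigmas[None, :, :]
log_g = -0.5 * (diff ** 2).sum(axis=2) - np.log(sigmas).sum(axis=1)
post = np.exp(log_g) * weights
post /= post.sum(axis=1, keepdims=True)

# Gradients w.r.t. means and variances, normalized by T and mixture weights.
g_mu = (post[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(weights)[:, None])
g_sig = (post[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * weights)[:, None])
fisher_vector = np.concatenate([g_mu.ravel(), g_sig.ravel()])
```

With N = 3 and D = 4, the sketch produces the expected 2ND = 24-dimensional encoding.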

VLAD Encoding
For indexation and categorization applications, a high-dimensional BOV representation usually shows better and more robust results. However, the efficiency becomes unbearable when those applications are performed on a large amount of data. Jégou et al. [55] proposed a representation, called VLAD (vector of locally aggregated descriptors), for image search on a very large scale. This approach jointly considers the accuracy, the efficiency, and the computational complexity of image search. Suppose X = {x_i | 1 ≤ i ≤ n} is the set of local patch descriptors of an image, and {w_1, ..., w_k} is a visual vocabulary of k words, where w_j denotes the visual word nearest to a local patch x_i. Then, the VLAD encoding of the image X can be denoted by f_VLAD(X) = [v_1, ..., v_k], where each v_j = Σ_{x_i : NN(x_i) = w_j} (x_i − w_j) accumulates the differences between the visual word w_j and the multiple image patches nearest to it, 1 ≤ j ≤ k. Jégou et al. [55] combined the normalized f_VLAD(X) with dimension-reduction-oriented principal component analysis and indexing-oriented approximate nearest neighbor search [103], resulting in an accurate, efficient, yet memory-friendly image search.
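The residual accumulation above can be sketched in a few lines of numpy (toy codebook and descriptors; the L2 normalization mirrors common practice rather than any one variant):

```python
import numpy as np

# Toy VLAD: accumulate residuals between each descriptor and its nearest
# visual word, concatenate per-word residual sums, then L2-normalize.
rng = np.random.default_rng(2)
k, d = 4, 6
codebook = rng.normal(size=(k, d))      # k visual words
X = rng.normal(size=(30, d))            # 30 local descriptors of one image

nearest = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2).argmin(axis=1)
vlad = np.zeros((k, d))
for j in range(k):
    members = X[nearest == j]
    if len(members):
        vlad[j] = (members - codebook[j]).sum(axis=0)  # residual sum for word j
vlad = vlad.ravel()
vlad /= np.linalg.norm(vlad)            # global L2 normalization
```

Unlike the BOV histogram, which only counts assignments, each of the k blocks here retains the d-dimensional direction of the residuals, giving a k × d dimensional code.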

Locality-Constrained Linear Encoding
Wang et al. [56] pointed out that, in order to achieve desirable performance in classification tasks, the most common combinational strategies adopt either linear encodings plus nonlinear classifiers, such as BOV [50] plus a χ2-SVM, or nonlinear encodings plus linear classifiers, such as ScSPM [83] plus a linear SVM. Usually, either the nonlinear encodings or the nonlinear classifiers are computationally expensive. As a result, Wang et al. proposed a fast linear encoding approach, called Locality-constrained Linear Coding (LLC) [56], for real-time classification applications.
Suppose X = {x_i | 1 ≤ i ≤ n, x_i ∈ R^d} is the set of local descriptors to be encoded, and D ∈ R^{d×m} is the encoding dictionary with m words. Then, the fast linear LLC encoding proceeds as follows: (1) For each x_i, find its k nearest neighbors in the dictionary D using the k-NN method, denoted B_i. (2) Then, for each x_i, its LLC code c_i is obtained from the objective function min_{c_i} ||x_i − B_i c_i||^2, subject to ||c_i||_0 = k and 1^T c_i = 1. The authors showed that the promising LLC approach can improve not only encoding speed, but also classification performance, even with a linear classifier, compared with several other approaches.
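A sketch of the approximated (k-NN) solution, following the analytical least-squares form commonly used for the fast LLC variant; the data, dictionary size, and regularization constant are all illustrative:

```python
import numpy as np

# Toy approximated LLC: code a descriptor only over its k nearest dictionary
# words by solving a small constrained least-squares problem (codes sum to 1).
rng = np.random.default_rng(3)
m, d, k = 8, 5, 3
dictionary = rng.normal(size=(m, d))    # m words, d dims
x = rng.normal(size=d)                  # descriptor to encode

idx = np.linalg.norm(dictionary - x, axis=1).argsort()[:k]  # k nearest words
B = dictionary[idx]
z = B - x                               # shift coordinate origin to x
C = z @ z.T + 1e-6 * np.eye(k)          # local covariance, regularized
w = np.linalg.solve(C, np.ones(k))
w /= w.sum()                            # enforce the 1^T c = 1 constraint

code = np.zeros(m)
code[idx] = w                           # sparse LLC code over the dictionary
```

Because only a small k × k system is solved per descriptor, the encoding stays linear-time in practice, which is the point the authors emphasize.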

Super Vector Encoding
Zhou et al. [57] proposed a nonlinear high-dimensional sparse coding approach, called Super Vector (SV) encoding, for image representation and classification.The approach is an extended version of the well-known Vector Quantization (VQ) encoding [104].
Let x ∈ R^d be the original d-dimensional feature vector of an interest patch to be encoded, and D ⊂ R^d be the encoding dictionary with l words. Then, the SV vector encoded from x can be denoted by f_SV(x) = [α δ(x|w), δ(x|w)(x − w)]_{w∈D}, where α is a nonnegative constant, w represents a word, δ(x|w) = 1 if w = argmin_{w'∈D} ||x − w'||, and δ(x|w) = 0 otherwise. Thus, the final sparse SV vector is l(d + 1)-dimensional.
The authors stated that the SV encoding could achieve a lower function approximation error than the VQ encoding, and their proposed classification method achieved state-of-the-art accuracy on the PASCAL dataset.

Kernel Codebook Encoding
In addition to the loss of spatial structure information, there are another two major drawbacks [58], i.e., codeword uncertainty and codeword plausibility, to the traditional BOV encoding model, due to the hard assignment of visual patches to a single codeword. Van Gemert et al. [58] proposed a kernel density estimation approach, called kernel codebook encoding, to solve these two issues for scene categorization. Suppose X = {x_j | 1 ≤ j ≤ n, x_j ∈ R^d} is the set of n local low-level d-dimensional feature vectors extracted from n interest patches of an image, and D = {v_1, v_2, ..., v_l} ⊂ R^d is the encoding dictionary with l words. Then, the kernel codebook encoding of the image X can be denoted by an l-dimensional vector with components f_kcb(v_i) = (1/n) Σ_{j=1}^{n} K_σ(||x_j − v_i||), where K_σ(x) = exp(−x^2 / (2σ^2)) / (√(2π) σ) is the Gaussian kernel with scale parameter σ.
Meanwhile, van Gemert et al. [58] concluded that: (1) the encoding f_kcb(X) integrates both codeword uncertainty and codeword plausibility; (2) the kernel codebook encoding can improve categorization performance with either a high-dimensional image descriptor or a smaller dictionary.
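The soft assignment can be sketched as follows (toy data; σ is chosen arbitrarily): every descriptor contributes to every word through the Gaussian kernel, instead of casting a single hard vote.

```python
import numpy as np

# Toy kernel codebook encoding: soft-assign each descriptor to all words with
# a Gaussian kernel on the descriptor-to-word distance, then average over the
# descriptors to get one response per codeword.
rng = np.random.default_rng(5)
l, d, n, sigma = 4, 6, 40, 1.0
dictionary = rng.normal(size=(l, d))
X = rng.normal(size=(n, d))

dists = np.linalg.norm(X[:, None, :] - dictionary[None, :, :], axis=2)
K = np.exp(-dists ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
kcb = K.mean(axis=0)    # l-dimensional kernel codebook encoding
```

Distant words still receive a small but non-zero response, which is exactly how the encoding models codeword plausibility rather than forcing an all-or-nothing assignment.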

Spatial Pyramid Encoding
Disregarding the spatial layouts of interest patches, bag-of-features-based methods are severely limited in their ability to capture shape information. Lazebnik et al. [59,60] proposed a spatial pyramid encoding and matching approach for scene recognition.
The approach involves repeatedly partitioning the scene image or frame into increasingly fine cells, computing histograms of local features in each cell, and measuring similarity with a pyramid match kernel. Specifically: (1) A 400-dimensional visual vocabulary is formed by performing k-means clustering on a random subset of interest patches, such as 16 × 16 sized SIFT patches, from the training dataset. (2) A 4-level spatial pyramid is applied to the scene image or frame X, such that there are 2^l cells along each dimension of level l, where 0 ≤ l ≤ 3, resulting in a total of 85 cells in the pyramid. (3) For each cell, a 400-dimensional bag-of-features histogram is computed. The histograms from all 85 cells are then concatenated and normalized to generate a 34,000-dimensional feature vector SP_X as the spatial pyramid encoding of X. (4) The pyramid match kernel K(SP_X, SP_Y) defined in [59,60] is applied to measure the similarity between scene images or frames X and Y for scene recognition.
Researchers also found that the recognition performance from steps 1-3 plus an SVM classifier with standard kernels is similar to that from steps 1-3 plus the pyramid match kernel. The spatial pyramid encoding has proved effective and is now widely adopted in many applications, such as scene recognition.
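Steps 2 and 3 above can be sketched with a pyramid shrunk to 2 levels and a 10-word vocabulary for brevity (patch positions and word labels are random toy data): per-cell BOV histograms are simply concatenated and normalized.

```python
import numpy as np

# Toy spatial pyramid over levels 0 and 1: 1 + 4 = 5 cells, so the final
# vector is 5 * V dimensional (the original setup has 85 cells and V = 400).
rng = np.random.default_rng(6)
V = 10                                  # vocabulary size
H, W = 64, 64                           # image size
xs = rng.integers(0, W, 200)            # patch x-positions
ys = rng.integers(0, H, 200)            # patch y-positions
words = rng.integers(0, V, 200)         # visual-word label of each patch

hists = []
for level in range(2):                  # 2^level cells per dimension
    cells = 2 ** level
    for cy in range(cells):
        for cx in range(cells):
            in_cell = (xs * cells // W == cx) & (ys * cells // H == cy)
            hists.append(np.bincount(words[in_cell], minlength=V))
sp = np.concatenate(hists).astype(float)
sp /= sp.sum()                          # normalized spatial pyramid encoding
```

Adding the two remaining levels of the 4-level pyramid only appends more cells to the same loop, which is how the 85-cell, 34,000-dimensional encoding of the original setup arises.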

Jensen-Shannon Tiling Encoding
Research has shown that representations that consider dynamic salient spatial layouts can outperform those with predefined spatial layouts on many recognition or detection tasks [105][106][107]. However, both the layout of the spatial bag-of-words [49,50] encoding and the layout of the spatial pyramid [59,60] encoding are predefined and data independent, which can lead to suboptimal representations.
Jiang et al. [61] proposed a Jensen-Shannon (JS) tiling approach, based on efficiently learned layouts, to encode feature vectors derived from the target image or frame by concatenating histograms from the corresponding tiles. Two steps are involved in the JS tiling approach: (1) It systematically generates all possible tilings, i.e., all vectors κ = [κ_1, κ_2, ..., κ_|S|], from a base mask S = [t_1, t_2, ..., t_|S|] by the proposed Algorithm 1, such that the resulting masks can be denoted by S' = T_κ(S), where κ_i ∈ Z and T_κ is the tiling operator.
(2) It selects the best tiling T*_κ as the encoding layout, such that T*_κ = argmin_κ cost(T_κ), on the assumption that an optimal tiling tends to separate positive and negative samples with the maximum distance, i.e., JS divergence. Here the cost is obtained from the JS divergence between D_j^+ and D_j^−, the average histograms over the j-th tile of all positive and all negative samples, respectively, where n^+ (or n^−) denotes the total number of positive (or negative) samples.
The authors demonstrated that the JS tiling, as a much faster method, is especially appropriate for large-scale datasets, but with comparable or even better classification results.

3D Encodings
There are three popular approaches surveyed for 3D encodings, as shown in Table 2.

Spatiotemporal Grids Encoding
Although spatial 2D encodings, such as the spatial pyramid and JS tiling, can improve recognition accuracy by considering the spatial layouts of interest patches, they are designed mainly for describing one image or one frame, and are thus not sufficient for describing spatiotemporal targets, such as actions and events.
Laptev et al. [62,63] proposed a spatiotemporal grids (or spatiotemporal bag-of-features) encoding approach for action recognition, which extends the spatial pyramid encoding to the spatiotemporal domain. The pipeline involves: (1) A K-dimensional, e.g., K = 4000, visual vocabulary is constructed by clustering interest patches sampled from the training videos, such as HOG or HOF patches around detected interest points, with the k-means algorithm. (2) For a given test video clip V, 3D interest points are first obtained by a spatiotemporal detector, such as the Harris 3D detector. Then, spatiotemporal features, such as HOG or HOF, are extracted from patches around those interest points. (3) The whole test video clip V is divided by a σ1 × σ2 × τ sized grid into cuboids, e.g., spatial sizes σ1 = 3, σ2 = 1, and temporal size τ = 2. For each cuboid, a K-dimensional bag-of-features histogram is formed.
(4) All histograms are further concatenated and normalized to form a K × σ 1 × σ 2 × τ dimensional feature vector to represent the video clip V.
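The binning in steps 3-4 can be sketched as follows, using the example grid sizes from the text and random toy interest points with precomputed word labels (K is shrunk to 20 for brevity):

```python
import numpy as np

# Toy spatiotemporal grid encoding: bin (x, y, t) interest points into a
# sigma1 x sigma2 x tau grid of cuboids and concatenate per-cuboid histograms.
rng = np.random.default_rng(7)
K, W, H, T = 20, 160, 120, 60           # vocabulary size and clip dimensions
s1, s2, tau = 3, 1, 2                   # grid sizes from the example above
pts = np.column_stack([rng.integers(0, W, 500),
                       rng.integers(0, H, 500),
                       rng.integers(0, T, 500)])
words = rng.integers(0, K, 500)         # visual-word label of each point

cx_all = pts[:, 0] * s1 // W            # cuboid index along x
cy_all = pts[:, 1] * s2 // H            # cuboid index along y
ct_all = pts[:, 2] * tau // T           # cuboid index along t
hists = []
for cx in range(s1):
    for cy in range(s2):
        for ct in range(tau):
            mask = (cx_all == cx) & (cy_all == cy) & (ct_all == ct)
            hists.append(np.bincount(words[mask], minlength=K))
grid_code = np.concatenate(hists).astype(float)
grid_code /= grid_code.sum()            # K * s1 * s2 * tau dimensional vector
```

The temporal axis is what distinguishes this from the 2D spatial pyramid: each cuboid histogram captures when, not just where, the interest points occur.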
The authors concluded that the spatiotemporal grids give a significant gain over the standard bag-of-features methods for action recognition. Besides, the same spatiotemporal grids-based pipeline has been widely used and has shown promising results from several research groups [108][109][110][111] for multimedia event detection on the TRECVID dataset.

Spatiotemporal Laplacian Pyramid Encoding
Shao et al. mentioned in paper [64] that local interest points-based sparse representations are not able to preserve adequate spatiotemporal action structures, and that traditional tracking or spatial/temporal alignment-based holistic representations are sensitive to background variations. To overcome these defects, they designed a spatiotemporal laplacian pyramid coding (STLPC) for action recognition. Differing from representation-first and encoding-or-pooling-later feature extraction pipelines, Shao et al. proposed to extract action features by successively applying the frame differences-based STLPC encoding, 3D Gabor filtering, and max pooling, where the STLPC encoding involves three primary operations, as follows.
(1) Frame difference volume building. For the original video sequence volume V_O, the frame difference approach is applied as a preprocessing step to generate its frame difference volume V_D. (2) Spatiotemporal Gaussian pyramid building. Firstly, generate a four-level frame difference volume pyramid P_VD by subsampling the volume V_D at (1, 1/2, 1/4, 1/8) resolutions. Then, generate a four-level spatiotemporal Gaussian pyramid P_g by convolving a 3D Gaussian function with each level of the pyramid P_VD.
(3) Spatiotemporal Laplacian pyramid building. In the beginning, generate a four-level revised pyramid P_G by expanding each level of P_g to the same size as the bottom level. After that, generate a three-level spatiotemporal Laplacian pyramid P_L by differencing consecutive levels of the revised P_G with P_L^i = P_G^i − P_G^{i+1}, where 1 ≤ i ≤ 3 is the level number.
Evaluation experiments [64] on four typical action datasets illustrated that the spatiotemporal STLPC, which performed well even with coarse bounding boxes, is an effective yet efficient global encoding for complex human action recognition.

Spatiotemporal VLAD Encoding
It is well known that spatiotemporal feature extraction is crucial for action recognition. However, spatiotemporal encoding, another key factor in action recognition, has not received enough attention. In order to combine both spatial and temporal information in the traditional 2D VLAD encoding, Duta et al. [65] proposed a spatiotemporal VLAD (ST-VLAD) encoding for human action recognition in videos. The pipeline of the ST-VLAD encoding is as follows.
(1) Spatiotemporal deep video feature extraction using a two-stream ConvNet or Improved Dense Trajectories (iDT). For the two-stream ConvNet, frames are resized to 224 × 224. Then, a VGG19 ConvNet pretrained on ImageNet is adopted to extract 49 feature vectors of 512 dimensions each per frame in the spatial stream, and a VGG16 ConvNet pretrained on UCF101 is adopted to extract 49 feature vectors of 512 dimensions each per ten frames in the temporal stream. For iDT, the resulting dimensionality is 96 for HOG, MBHx and MBHy, and 108 for HOF. Finally, all extracted features are reduced by PCA.
(2) VLAD encoding. Suppose X = {x_i | 1 ≤ i ≤ n, x_i ∈ R^d} is a set of n local low-level d-dimensional spatiotemporal deep video features extracted with the approaches in the first step, and D_a = {va_1, va_2, ..., va_l} ⊂ R^d is the appearance encoding dictionary with l words. Then, the VLAD encoding is defined as a d × l dimensional feature vector f_VLAD(X) = [f_VLAD(X|va_1), ..., f_VLAD(X|va_l)], where f_VLAD(X|va_j) = Σ_{x_i ∈ NN(va_j)} (x_i − va_j) / |NN(va_j)|, and NN(va_j) is the set of features whose nearest word is va_j.
(3) Spatiotemporal encoding. Suppose pos(·) is a function generating a three-dimensional normalized position vector for each feature, and D_p = {vp_1, vp_2, ..., vp_m} is the spatiotemporal encoding dictionary with m three-dimensional words. Then, the ST encoding is defined as an m × (d + l) dimensional feature vector that accumulates, for each position word vp_j, both the position differences pos(x_i) − vp_j and the membership vectors MS_i of the features assigned to vp_j.
(4) ST-VLAD encoding. The final representation of a video, i.e., ST-VLAD, concatenates the VLAD encoding and the ST encoding into a (d × l + m × (d + l))-dimensional vector.
The authors verified that, combined with powerful deep features, the proposed ST-VLAD encoding can obtain state-of-the-art performance on three major challenging action recognition datasets.

Pooling Strategies
In computer vision and image processing, there are mainly two applications of pooling strategies: one pools the encoding features, and the other pools the convolutional features. In this section, let I be an input image. For encoding pooling, suppose X = {x_i | 1 ≤ i ≤ n, x_i ∈ R^r} is the set of n local low-level r-dimensional feature vectors extracted from I, E = {e_j | 1 ≤ j ≤ m, e_j ∈ R^s} is the set of encoding feature vectors of I, and PE = {pe_k | 1 ≤ k ≤ s, pe_k ∈ R} is the pooling feature vector of I.
For convolutional pooling, suppose C = {c_{i,j} | 1 ≤ i ≤ row_c, 1 ≤ j ≤ col_c, c_{i,j} ∈ R} is the convolutional feature map of I or of its upper pooling layer, and PC = {pc_{a,b} | 1 ≤ a ≤ row_p, 1 ≤ b ≤ col_p, pc_{a,b} ∈ R} is the pooling feature map of C.

General Poolings
There are three popular approaches surveyed for general poolings, as shown in Table 3.

Sum Pooling
For encoding pooling, the mathematical representation of sum pooling can be expressed as pe_k = Σ_{j=1}^{m} e_{j,k}. Although sum pooling is an intuitive pooling strategy, a number of papers use this method. For example, Peng et al. [66] analyzed action recognition performance among several bag-of-visual-words and fusion methods, adopting sum pooling and power l_2-normalization as the pooling and normalization strategy. Zhang et al. [67] gave a probabilistic interpretation of why max pooling was usually better than sum pooling in the context of the sparse coding framework for image retrieval applications, since max pooling tends to increase the discrimination of the similarity measurement more than sum pooling does. Besides, they proposed a modified sum pooling method that improved retrieval accuracy significantly over the max pooling strategy.
For convolutional pooling, sum pooling can be derived by pc_{a,b} = Σ_{(i,j)∈R_{a,b}} c_{i,j}, where R_{a,b} is a pooling region. There are some attempts at using sum pooling in convolutional neural network (CNN)-based applications. For instance, Gao et al. [68] proposed a compact bilinear pooling method for image classification based on a kernelized analysis of bilinear sum pooling. They verified that the method can reduce the feature dimensionality by two orders of magnitude with little loss in performance, and that the CNN back-propagation can be computed efficiently. LeCun et al. [69] developed a typical CNN, called LeNet-5, for isolated character recognition; the LeNet-5 architecture consists of two convolution layers, two sum-pooling layers, and several fully connected layers. However, Mohedano et al. [70] found that, for instance retrieval tasks, even bag-of-words aggregation can outperform techniques using sum pooling when combined with local CNN features on the challenging TRECVID INS benchmark.
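Both variants can be written in a few lines of numpy (toy inputs): the encoding form sums over the m encoding vectors component-wise, while the convolutional form sums each non-overlapping 2 × 2 region of a feature map.

```python
import numpy as np

# Encoding sum pooling: pe_k = sum_j e_{j,k} over m = 3 encoding vectors.
E = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
pe = E.sum(axis=0)                           # -> [9, 12]

# Convolutional sum pooling: sum over non-overlapping 2x2 regions of a
# 4x4 feature map, giving a 2x2 pooled map.
C = np.arange(16.0).reshape(4, 4)
pc = C.reshape(2, 2, 2, 2).sum(axis=(1, 3))
```

The reshape trick splits the map into row/column blocks so the region sum is a single vectorized reduction; strided or overlapping regions would need an explicit window loop instead.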

Average Pooling
For encoding pooling, the average pooling strategy can be represented by pe_k = Σ_{j=1}^{m} e_{j,k} / m.
There are many works using average pooling. For instance, Pinto et al. [71] constructed a biologically inspired object recognition system, i.e., a simple V1-like model, based on average pooling [72]. This model outperforms state-of-the-art object recognition systems on a standard natural image recognition test. However, many researchers have shown that, in most encoding pooling-based vision applications, average pooling is usually not the best pooling strategy. For example, Boureau et al. [73] provided theoretical and empirical insight into the performance of max pooling versus average pooling. They pointed out that max pooling almost always outperformed average pooling, often dramatically so when using a linear SVM. For convolutional pooling, average pooling can be denoted as pc_{a,b} = Σ_{(i,j)∈R_{a,b}} c_{i,j} / |R_{a,b}|, where R_{a,b} is a pooling region and |R_{a,b}| is the number of activations in it. He et al. [74] introduced a deep residual net, called ResNet, for large-scale image recognition. The ResNet ends with a global average pooling layer and a fully connected layer with softmax. The authors won first place in several tracks of the ILSVRC & COCO 2015 competitions using the ResNet. For convolutional pooling-based vision applications, average pooling is likewise not always the best choice. For example, Sainath et al. [75] explored four pooling strategies in frequency only for an LVCSR speech task, and concluded that either l_p pooling or stochastic pooling can address the issues of max pooling and average pooling. Yu et al. [76] invented a mixed pooling method to regularize CNNs. They demonstrated that the mixed pooling method is superior to both max pooling and average pooling, since the latter may largely diminish the feature map response when there are many zero elements. In the 2016 TRECVID competition, the CMU Informedia team [112] adopted average pooling for video representation in the multimedia event detection, ad-hoc video search, and surveillance event detection tasks. Besides, they also adopted average pooling for document representation in the video hyperlinking task. Their experiments indicated that average pooling is an ideal discriminative strategy for hybrid representations.
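The two average pooling formulas can be sketched with the same toy setup as before: a component-wise mean over the encoding vectors, and a mean over non-overlapping 2 × 2 regions of a feature map.

```python
import numpy as np

# Encoding average pooling: pe_k = (1/m) * sum_j e_{j,k} over m = 2 vectors.
E = np.array([[2.0, 4.0], [4.0, 8.0]])
pe = E.mean(axis=0)                           # -> [3, 6]

# Convolutional average pooling over non-overlapping 2x2 regions of a
# 4x4 feature map.
C = np.arange(16.0).reshape(4, 4)
pc = C.reshape(2, 2, 2, 2).mean(axis=(1, 3))
```

The mean variant also makes the zero-dilution issue raised by Yu et al. concrete: a region of mostly zeros drags the pooled value toward zero even when one activation is large.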

Max Pooling
As one of the most popular pooling strategies in vision-related tasks, max pooling has been widely used in both encoding-based and convolutional-based pooling applications. For encoding pooling, the max pooling strategy has the following formalization: pe_k = max_{1≤j≤m} e_{j,k}. This max pooling has been empirically justified by many algorithms. For example, Serre et al. [77] developed a biologically motivated framework. The framework adopts only two major kinds of computations, i.e., template matching and max pooling, to obtain a set of scale- and translation-invariant C2 features for robust object recognition. Boureau et al. [72] conducted a theoretical analysis of average pooling and max pooling for visual recognition tasks. The authors concluded that recognition performance using pooling strategies can be influenced by many factors, such as sample cardinality, resolution, and codebook size, and that max pooling performs no worse than average pooling in most cases. For convolutional pooling, the max pooling strategy can be formalized as pc_{a,b} = max_{(i,j)∈R_{a,b}} c_{i,j}, where R_{a,b} is a pooling region. There is also much research using max pooling in CNNs to generate deep features. For example, Sainath et al. [78] explored applying CNNs to large-vocabulary speech tasks, and showed that their convolutional network architecture, which consisted of a convolutional and max-pooling layer, was an improved CNN. Scherer et al. [79] evaluated two pooling operations in convolutional architectures for object recognition, and showed that a maximum pooling operation significantly outperformed a subsampling operation. Wei et al. [80] presented a flexible CNN framework, which can be pre-trained well on large-scale single-label image datasets, for multi-label image classification. The framework generates its ultimate multi-label predictions with a cross-hypothesis max-pooling operation on confidence vectors obtained from the input hypotheses using the shared CNN.
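Completing the trio, both max pooling formulas look like this on toy data: a component-wise maximum over encoding vectors, and the maximum activation in each non-overlapping 2 × 2 region.

```python
import numpy as np

# Encoding max pooling: pe_k = max_j e_{j,k} over m = 3 encoding vectors.
E = np.array([[1.0, 9.0], [5.0, 2.0], [3.0, 4.0]])
pe = E.max(axis=0)                            # -> [5, 9]

# Convolutional max pooling over non-overlapping 2x2 regions of a
# 4x4 feature map.
C = np.arange(16.0).reshape(4, 4)
pc = C.reshape(2, 2, 2, 2).max(axis=(1, 3))
```

Keeping only the strongest response per region is what gives max pooling its discriminative edge over the average in sparse-coding settings, as noted above.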

Particular Poolings
There are three popular approaches surveyed for particular poolings, as shown in Table 4.

Stochastic Pooling
For large-scale convolutional neural networks, simultaneously reducing computational complexity and preserving visual invariance during training has become an important issue. Conventionally, researchers have addressed this issue by adding several extra average pooling or max pooling layers. However, both types of pooling have drawbacks: average pooling diminishes the pooled responses, and max pooling is prone to over-fitting. Thus, Zeiler et al. [81] proposed a stochastic pooling approach for regularization of deep convolutional neural networks.
First, they derive a probability map PM = {pm_{i,j}} from the convolutional feature map C using pm_{i,j} = c_{i,j} / Σ_{(p,q)∈R_{i,j}} c_{p,q}, where R_{i,j} is the pooling region containing c_{i,j}. Second, they compute the pooled feature map PC = {pc_{a,b}} using pc_{a,b} = c_{i*,j*}, s.t. (i*, j*) ∼ P_MN(pm_{i,j} | (i,j) ∈ R_{i,j}), where (i*, j*) is the location of the pooled activation and P_MN(·) is the multinomial distribution over the pooling region. Thus, the convolutional feature map in a large CNN can be randomly pooled according to the multinomial distribution. The authors also stated that this simple yet effective stochastic pooling strategy can be combined with any other form of regularization to prevent over-fitting and reduce computational complexity in applications based on deep convolutional neural networks.
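The sampling step can be sketched as follows (a minimal NumPy illustration, assuming non-negative activations, e.g., post-ReLU, and non-overlapping square regions; the helper name and the all-zero-region handling are our illustrative choices, not from [81]):

```python
import numpy as np

def stochastic_pool(C, size=2, rng=None):
    """Sketch of stochastic pooling: within each region, sample one
    activation with probability proportional to its magnitude."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = C.shape
    out = np.empty((h // size, w // size))
    for a in range(h // size):
        for b in range(w // size):
            region = C[a*size:(a+1)*size, b*size:(b+1)*size].ravel()
            total = region.sum()
            if total == 0:           # all-zero region: every pm_{i,j} is 0
                out[a, b] = 0.0
                continue
            pm = region / total      # pm_{i,j} = c_{i,j} / sum over R_{i,j}
            idx = rng.choice(region.size, p=pm)  # (i*, j*) ~ P_MN(pm)
            out[a, b] = region[idx]
    return out
```

Because larger activations are sampled more often, the expected pooled value interpolates between average-like and max-like behavior while injecting noise that acts as a regularizer.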

Semantic Pooling
For complex event detection in long internet videos with few relevant shots, traditional pooling strategies usually treat each shot equally and cannot aggregate the shots based on their relevance to the event of interest [82]. Chang et al. [82] proposed a semantic pooling approach to prioritize CNN shot outputs according to their semantic saliency.
First, shot-based CNN feature extraction. Specifically, compute the average number of key frames m over all videos in the experimental dataset, and adopt the color-histogram-difference-based shot boundary algorithm to divide each video into m shots, denoted by V = {SH_i | 1 ≤ i ≤ m}. Then, randomly select one frame in each shot as its key frame, and extract CNN features on all key frames, denoted by XC = {xc_i | 1 ≤ i ≤ m}. Second, concept-probability-based feature extraction. Specifically, apply common datasets of action and event recognition to pre-train a large number of semantic auxiliary concept detectors, i.e., c = 1534, and generate a c-dimensional probability vector for each shot by concatenating the responses of all concept detectors on the shot, denoted by XV = {xv_i | 1 ≤ i ≤ m, xv_i ∈ R^c}.
Third, concept-relevance-based feature extraction. Specifically, use the English Wikipedia dump to pre-train a skip-gram model, and employ the skip-gram model together with the Fisher vector encoding approach to vectorize both the textual event descriptions and the c concept names. Then, compute the cosine distance between the overall textual description vector and each concept name vector, resulting in a concept relevance vector XR = {xr_i | 1 ≤ i ≤ c, xr_i ∈ R}.
Subsequently, saliency-based semantic pooling. For each shot i, compute a semantic saliency score sa_i as the inner product between its corresponding row vector xv_i of the concept probability matrix XV and the concept relevance vector XR, producing a semantic saliency score vector SA = {sa_i | sa_i = XR^T xv_i, 1 ≤ i ≤ m, sa_i ∈ R}. Then, rank all CNN feature vectors xc_i in descending order of their saliency scores sa_i, and concatenate the ranked feature vectors successively as the final deep representation.
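The saliency scoring and ranking steps above can be sketched as follows (an illustrative NumPy fragment, not the authors' code; the matrix shapes follow the notation above, with XC as the m x d stack of shot features):

```python
import numpy as np

def semantic_pool(XC, XV, XR):
    """Sketch of semantic pooling (after Chang et al. [82]).
    XC: m x d CNN shot features; XV: m x c concept probabilities;
    XR: length-c concept relevance vector.
    Ranks shots by saliency sa_i = XR^T xv_i (descending) and
    concatenates the ranked shot features into one representation."""
    sa = XV @ XR                # semantic saliency score per shot
    order = np.argsort(-sa)     # descending saliency
    return np.concatenate([XC[i] for i in order]), sa
```

Shots whose detected concepts best match the textual event description thus appear first in the concatenated representation, which is what the subsequent nearly isotonic classifier exploits.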
To exploit the ordering information in the semantic-pooling-based deep features, the authors [82] also designed a nearly isotonic classification approach, and verified through a number of experiments that the combination of the flexible deep representation and the sophisticated classifier exhibited higher discriminative power in the event analysis tasks of event detection, event recognition, and event recounting.

Multi-Scale Pooling
For encoding pooling, Yang et al. [83] proposed a multi-scale spatial max pooling approach to generate nonlinear features based on sparse coding, which they treated as a generalized vector quantization, for fast yet accurate image classification.
For convolutional pooling, Gong et al. [84] proposed a multi-scale orderless pooling approach to improve geometric invariance in CNNs for classification and matching of highly variable scenes. He et al. [85] proposed a spatial pyramid pooling approach to generate fixed-length representations for CNN-based visual recognition. The major contribution of the spatial pyramid pooling approach is its additional three-scale max-pooling layer. Szegedy et al. [86] proposed a 22-layer deep model, namely GoogLeNet, for classification and detection in the ImageNet challenge competition. GoogLeNet employs a parallel multi-scale hybrid pooling architecture to reduce computing resources in deeper convolutions.
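In the spirit of the spatial pyramid pooling idea of He et al. [85], the fixed-length property can be sketched as follows (an illustrative single-channel NumPy version; the pyramid levels and function name are our assumptions, not the paper's exact configuration):

```python
import numpy as np

def spatial_pyramid_max_pool(C, levels=(1, 2, 4)):
    """Sketch of spatial pyramid max pooling: max-pool the map C over an
    l x l grid at each pyramid level and concatenate the results, giving
    a fixed-length vector regardless of the input spatial size."""
    h, w = C.shape
    feats = []
    for l in levels:
        for a in range(l):
            for b in range(l):
                # Integer bin boundaries cover the map without gaps.
                r0, r1 = (a * h) // l, ((a + 1) * h) // l
                c0, c1 = (b * w) // l, ((b + 1) * w) // l
                feats.append(C[r0:r1, c0:c1].max())
    return np.array(feats)  # length 1 + 4 + 16 = 21 for levels (1, 2, 4)
```

Because the grid is defined relative to the input size, a 6x6 map and an 8x8 map both yield a 21-dimensional vector, which is precisely what allows a fully connected layer to follow convolutions over arbitrarily sized inputs.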

Conclusions
In this paper, we have comprehensively surveyed the encoding and pooling approaches previously studied for feature representation in action and event recognition on uncontrolled video clips, and systematically summarized both the underlying theoretical principles and the original experimental conclusions of those approaches. Furthermore, we have designed an approach-based taxonomy that categorizes the most popular prior work on encodings and poolings into 2D encodings, 3D encodings, general poolings, and particular poolings. As mentioned above, feature encoding and feature pooling are only two of the three key components of a feature representation approach, and the feature representation approach is in turn only one of the three key components of action and event recognition. In the future, we will conduct three further surveys for the remaining key components, namely a survey on feature extraction approaches, a survey on pattern recognition models, and a survey on performance evaluation strategies, for the recognition of complex actions and events.

Figure 1. The hierarchical feature encoding and pooling taxonomy of the paper.
Let {x_1, x_2, …, x_n} ⊂ R^d be the set of n local low-level d-dimensional feature vectors extracted from n interest patches of an image, D = {w_1, w_2, …, w_k} ⊂ R^{d×k} be the codebook learned with k-means, and w^* = argmin_{w_j ∈ D, 1 ≤ j ≤ k} ||x − w_j||_2 be the codeword nearest to a local feature x.

Table 1. Popular approaches and corresponding references for 2D encodings.


Table 2. Popular approaches and corresponding references for 3D encodings.

Table 3. Popular approaches and corresponding references for general poolings.

Table 4. Popular approaches and corresponding references for particular poolings.