Histogram of Oriented Gradient-Based Fusion of Features for Human Action Recognition in Action Video Sequences

Human Action Recognition (HAR) is the classification of an action performed by a human. The goal of this study was to recognize human actions in action video sequences. We present a novel feature descriptor for HAR that combines multiple features using a fusion technique. The major focus of the feature descriptor is to exploit action dissimilarities. The key contribution of the proposed approach is a robust feature descriptor that works for the underlying video sequences and across various classification models. To achieve this objective, HAR is performed as follows. First, the moving object is detected and segmented from the background. Features are then calculated using the histogram of oriented gradients (HOG) from the segmented moving object. To reduce the size of the feature descriptor, the HOG features are averaged across non-overlapping video frames. For frequency-domain information, regional features are calculated from the Fourier HOG. Moreover, the velocity and displacement of the moving object are also included. Finally, a fusion technique is used to combine these features. Once the feature descriptor is prepared, it is provided to the classifier. Here, we use well-known classifiers such as artificial neural networks (ANNs), support vector machines (SVMs), multiple kernel learning (MKL), the Meta-cognitive Neural Network (McNN), and late fusion methods. The main objective of the proposed approach is to prepare a robust feature descriptor and to demonstrate its diversity: although we use five different classifiers, our feature descriptor performs consistently well across all of them.
The proposed approach is evaluated and compared with state-of-the-art methods for action recognition on two publicly available benchmark datasets (KTH and Weizmann) and for cross-validation on the UCF11, HMDB51, and UCF101 datasets. Results of control experiments, such as a change in the SVM classifier and the effect of a second hidden layer in the ANN, are also reported. The results demonstrate that the proposed method performs favorably compared with the majority of existing state-of-the-art methods, including convolutional neural network-based feature extractors.


Introduction
In machine vision, automatic understanding of video data (e.g., action recognition) remains a difficult but important challenge. The method of recognizing human actions that occur in a video sequence is defined as human action recognition (HAR). In video understanding, it is difficult to differentiate routine life actions, such as running, jogging, and walking, using an executable script. There has been an increasing interest in HAR over the past decade, and it is still an open field for many researchers. The domain of HAR has developed considerably with significant application in human motion analysis [1,2], identification of familiar people and gender [3], motion capture and animation [4], video editing [5], unusual activity detection [6], video search and indexing (useful for TV production, entertainment, social studies, security) [7], video2text (auto-scripting) [8], video annotation, and video mining [9].
Human action recognition is a challenging multi-class classification problem due to the high intra-class variability within each class. To overcome this variability, we propose a scheme to design a feature descriptor that is highly invariant to the fluctuations present within the classes. In other words, the proposed feature descriptor fuses various diverse features. In addition, this paper addresses various challenges in HAR, such as variation in the background (outdoor or indoor), the gender of the action performer, variation in the clothes worn, and scale variation. We deal with constrained video sequences that involve a moving background and multiple actions in a single video sequence.
Our contributions in this paper can be summarized as follows. First, for moving object detection, we use a novel technique that incorporates the human visual attention model [10], making it background-independent; its computational complexity is therefore much lower than that of algorithms which update the background at regular intervals for moving object detection. Second, we propose a feature-description preparation layer, which includes HOG features computed with a non-overlapping windowing concept; averaging the features reduces the size of the feature descriptor. In addition to the HOG, we also use the object displacement, which is crucial for differentiating actions performed at the same location, i.e., with zero displacement (boxing, hand waving, clapping, etc.), from actions performed at various locations, i.e., with non-zero displacement (walking, running, etc.). Furthermore, a velocity feature is used to further discriminate among overlapping actions with non-zero displacement (such as walking and running); this is based on the observation that speed varies among such actions, so incorporating a velocity feature can aid classification. To consider the spatial context in terms of the boundaries and smooth shapes of the human body, regional features from Fourier HOG are employed. Finally, we propose six different models for classification to demonstrate the effectiveness of the proposed feature descriptor across different classifier families.
The rest of the paper is organized as follows. Section 2 discusses the existing literature on HAR. Section 3 outlines the motivation for feature fusion, briefly describes the HOG, support vector machines (SVMs), artificial neural networks (ANNs), multiple kernel learning (MKL), and the Meta-cognitive Neural Network (McNN), and presents the proposed techniques for fusing features. Section 4 describes the proposed approach for HAR and presents and discusses the experimental results. Finally, we conclude the paper in Section 5.

Existing Methods
In the last two decades, most research on human action recognition has concentrated on two levels: (1) feature extraction and (2) feature classification. One feature extraction method is the Dense Trajectories approach [11], which extracts features at multiple scales. These features are sampled for each frame, and actions are classified based on the displacement information from a dense optical flow field. In [12], an extension to Dense Trajectories was proposed that replaces the Scale-Invariant Feature Transform (SIFT) feature with the Speeded Up Robust Features (SURF) feature to estimate camera motion.
The advantage of these trajectory representations is that they are robust to fast irregular motions and the boundaries of human actions. However, they cannot handle local motion in actions that involve significant movement of the hands, arms, and legs, and therefore do not provide enough information for action discrimination. This problem is addressed by exploiting important motion parts in the Motion Part Regularization Framework (MPRF) [13]. This framework uses spatio-temporal grouping of densely extracted trajectories to generate motion parts; an objective function for the sparse selection of these trajectory groups is optimized, and the learned motion parts are represented by Fisher vectors. Lan et al. [14] further point out that the local motion of body parts results in small intensity changes, and thus in low-frequency action information. If this low-frequency information is not included in the feature preparation layer, the resulting feature descriptors cannot capture enough detail for action classification. To address this problem, the Multi-skIp Feature Stacking (MIFS) approach was proposed. This approach stacks features extracted with differential operators at various scales, which makes action recognition invariant to the speed and range of motion of the human subject. However, the consideration of multiple scales at the feature-building stage increases the computational complexity of this approach.
Traditionally, distinct hand-crafted features are derived to represent human actions. Liu et al. [15], however, proposed a human action recognition system that extracts spatio-temporal and motion features automatically using an evolutionary algorithm, namely genetic programming. These features are scale- and shift-invariant and also extract color information from optical flow sequences. Classification is finally performed using an SVM, but the automatic learning requires a time-consuming training process. The approach in [16] defined a Fisher vector model based on spatio-temporal local features. Because conventional dictionary learning approaches are not appropriate for Fisher vectors extracted from features, the authors of [16] proposed Multiple Instance Discriminative Dictionary Learning (MIDDL) for human action recognition. Recently, a frequency-domain representation of multi-scale trajectories was proposed [17]. Critical points are extracted from the optical flow field of each frame; multi-scale trajectories are then generated from these points and transformed into the frequency domain, and this frequency information is finally combined with other cues such as motion orientation and shape. The computational complexity of this method is high due to the use of optical flow. The authors of [18] represented skeleton information as a directed acyclic graph that captures the kinematic dependence between the joints and bones of the human body.
The recently proposed Deep Convolutional Generative Adversarial Network (DCGAN) [19] bridges the gap between supervised and unsupervised learning. The authors proposed a semi-supervised framework for action recognition that uses the trained discriminator from the GAN model. However, the method evaluates features based on the appearance of the human and does not account for motion at the feature-building stage. In [20], actions are represented in terms of distinct action sketches. Sketches are formed using fast edge detection, and the person in each frame is detected by an R-CNN; ranking and pooling are then deployed to design the distinct action sketch. Improved dense trajectories and pooled fused features are provided to an SVM classifier for action recognition. VideoLSTM, a new recurrent neural network architecture, was proposed in [21]. This architecture can adaptively fit the requirements of a given video; it exploits a new spatial layout, motion-based attention for relevant spatio-temporal locations, and action localization derived from VideoLSTM. In addition, several other methods have been proposed over the decades [22].

Proposed Framework
The proposed HAR framework is shown in Figure 1 and involves three parts: moving object detection, feature extraction, and action classification.

Moving Object Detection
Moving object detection plays a crucial role in many computer vision applications. The process classifies the pixels of each frame of a video stream as background or foreground, and a model representing the background is generated. The background is then removed from each frame to enable moving object detection; this process is referred to as background subtraction. Popular background subtraction techniques include frame differencing [23,24], shadow removal [25], the Gaussian mixture model (GMM) [26], and CNN-based background removal [27]. However, an algorithm for moving object detection without any background modeling was presented in [28-30], and the detailed procedure is given below.
First, an average filter is applied to the video sequence I(m, n, t) of mask size X × Y at time t,

I_A(m, n, t) = A ⊗ I(m, n, t),

where A represents the average filter of mask size X × Y, and ⊗ represents the convolution between two images. Next, a Gaussian low-pass filter G is employed on the image,

I_G(m, n, t) = G ⊗ I(m, n, t).

The saliency value calculated at each pixel (m, n) is given as

S(m, n) = dist(I_A(m, n, t), I_G(m, n, t)),

where dist represents the distance between the respective filtered images. S(m, n) highlights the moving object in the given video. In the proposed approach, moving object detection is performed by thresholding the saliency map,

FG(m, n) = 1 if S(m, n) ≥ T, and 0 otherwise,

where FG(m, n) defines the moving object from the video sequence I. The moving object detection is therefore fast and computationally efficient, as the method is background-independent; in other words, the time-consuming process of updating the background at regular intervals is not needed.

The moving object detection performance of the method is depicted in Figures 2 and 3 for two different video sequences. The first column shows snapshots from the two videos, the second column shows the saliency map, the third column shows the silhouette created using morphological operations, and the fourth column shows the detected moving objects.
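As a rough illustration, the saliency-based detection above can be sketched in Python with NumPy. This is a minimal single-frame sketch; the filter sizes, the per-pixel absolute difference used as the distance, and the threshold T are illustrative assumptions, not the exact settings of [28-30].

```python
import numpy as np

def box_filter(img, size=9):
    """Average (box) filter implemented as two separable 1-D passes."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    kernel = np.ones(size) / size
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, out)
    return out

def gaussian_filter(img, sigma=1.5):
    """Separable Gaussian low-pass filter built from a 1-D kernel."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    padded = np.pad(img, radius, mode="edge")
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, out)
    return out

def detect_moving_object(frame, threshold=0.1):
    """Saliency = per-pixel distance between the average- and
    Gaussian-filtered frames; thresholding yields the foreground mask FG."""
    i_avg = box_filter(frame)
    i_gauss = gaussian_filter(frame)
    saliency = np.abs(i_avg - i_gauss)   # dist(I_A, I_G)
    return (saliency >= threshold).astype(np.uint8)
```

In a full pipeline, the resulting mask would then be cleaned with morphological operations to produce the silhouette shown in the third column of Figures 2 and 3.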

Feature Extraction
The procedure for extracting feature descriptors from a segmented object is shown in Figure 4, which represents action in a compact three-dimensional space associated with an object, background scene, and variation that appears in the object over time. After detecting and segmenting moving objects from each video sequence, compact features are extracted. In the proposed approach, we calculate the following features.
• HOG over 10 non-overlapping frames (HOGAVG10): We use the HOG, which was proposed by Dalal and Triggs [31] in 2005 and is still a highly effective human detection feature. The segmented object is converted to a fixed size (e.g., 128 × 64). The HOG features extracted from the resized segmented object (per frame) have a dimensionality of 3780, as explained in Figure 4. Each video has 120 frames; therefore, the final descriptor for each video containing one action is 3780 × 120. Such a descriptor contains redundant data, making the computational cost of training and testing excessive. In the proposed approach, we therefore average the HOG features over windows of 10 non-overlapping frames (HOGAVG10), because the object does not change considerably over these frames, as shown in Figure 5. This yields a considerable reduction in redundant data.
• Displacement in Object Position (OBJ_DISP): To evaluate the displacement of an object, the centroid (center of mass) of the silhouette corresponding to the object is calculated as the arithmetic mean of its pixel coordinates. Suppose that the centroid of the present frame is C(x_t, y_t, t) and that of the past frame is C(x_{t-1}, y_{t-1}, t - 1). Then, the displacement (OBJ_DISP) D(x_t, y_t, t) can be approximated as the difference between the two centroids.
• Velocity of Object (OBJ_VELO): Similar to the displacement feature, the extraction of the velocity feature also requires the centroid of the detected moving object. The displacement and velocity features estimate the motion of the moving object; they increase the inter-class distance, which subsequently increases the accuracy of the overall proposed framework. The velocity OBJ_VELO(x_t, y_t, t) of the object is estimated as OBJ_DISP divided by ∆t, where ∆t = t_{i+10} - t_i (i.e., ∆t = 10 in our proposed approach).
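The centroid, displacement, and velocity features above can be sketched in plain Python as follows. This is a minimal sketch that assumes the silhouette is given as a list of (x, y) pixel coordinates; the frame gap of 10 matches the non-overlapping window used for HOGAVG10.

```python
def centroid(silhouette):
    """Center of mass of a binary silhouette given as (x, y) pixel coordinates."""
    xs = [p[0] for p in silhouette]
    ys = [p[1] for p in silhouette]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def displacement(c_prev, c_curr):
    """OBJ_DISP: per-axis offset of the centroid between two frames."""
    return (c_curr[0] - c_prev[0], c_curr[1] - c_prev[1])

def velocity(disp, dt=10):
    """OBJ_VELO: displacement divided by the frame gap (dt = 10 here,
    matching the 10-frame non-overlapping window)."""
    return (disp[0] / dt, disp[1] / dt)
```

For a walking sequence the displacement between windows is non-zero, while for boxing or hand waving it stays near zero, which is exactly the discriminative property the fused descriptor exploits.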
• Regional Features from Fourier HOG [32] (R_FHOG): In this work, we extend the regional features from Fourier HOG proposed in [32] to action recognition. In the Cartesian coordinate system, a two-dimensional function is represented by f(x, y) ∈ R². The polar representation of the same function is defined over [r, θ], where r is the radius and θ the angle. The relation between polar and Cartesian coordinates is defined as r = √(x² + y²) and θ = arctan(y, x) ∈ [0, 2π).
In the polar coordinate system, the Fourier transform is a combination of radial and angular parts. The basis function B for the Fourier transform in polar coordinates is defined as

B_{k,m}(r, θ) = J_m(kr) Φ_m(θ),

where k is a non-negative value that defines the scale of the pattern, J_m(kr) is the mth-order Bessel function, and Φ_m(θ) = (1/√(2π)) e^{imθ}. The value k can be continuous or discrete, depending on whether the region is infinite or finite. Considering a finite region r ≤ a, the basis function reduces to

B_{n,m}(r, θ) = √k R_{nm}(r) Φ_m(θ), (12)

where R_{nm}(r) denotes the radial part. The basis functions in (12) are orthogonal and orthonormal in nature. For B_{n,m}(r, θ), m is the number of cycles in the angular direction, and n - 1 is the number of zero crossings in the radial direction.
As the values of m and n increase, finer details can be extracted from the image. Generally, the evaluation of HOG features involves three steps namely gradient orientation binning, spatial aggregation, and magnitude normalization, which are followed in the Fourier domain as well.
Step 1: Gradient Orientation Binning: The gradient of the image I(x, y) ∈ R² is defined as G(x, y) = [G_x, G_y], and its polar representation is given by the magnitude |G| = √(G_x² + G_y²) and the angle ∠G = arctan(G_y, G_x) ∈ [0, 2π). Gradient orientations are stored in histogram bins using a distribution function h at each pixel. Suppose that the gradient at a pixel is represented as G = [G_x, G_y] ∈ R². The angular part of G is Φ(G), and the distribution function h for each pixel is a Dirac delta function at Φ(G) weighted by the gradient magnitude. In this work, the explicit histogram basis has been replaced with the Fourier coefficients f̂_m. In the standard HOG, the magnitude contribution of each gradient vector is split among the closest bins, which can be considered a triangular interpolation; in Fourier space, a 1D triangular kernel can be employed to implement this gradient orientation binning. However, this step does not noticeably affect the results and has therefore not been considered in the proposed work.
Step 2: Spatial Aggregation: To achieve spatial aggregation, a convolution is performed between a Gaussian (or other isotropic) kernel and the Fourier coefficients obtained in Step 1.
Step 3: Local Normalization: An isotropic kernel is convolved with the Fourier coefficients to achieve normalization of the gradient magnitude. Steps 2 and 3 are performed using two kernels: the first kernel, K₁ : R² → R, is used for spatial aggregation, and the second kernel, K₂ : R² → R, is used for local normalization. Finally, the Fourier HOG descriptor F̂ is obtained by combining the aggregated and normalized coefficients. Regional descriptor using Fourier HOG: To obtain the regional descriptor, a convolution is performed between the aggregated coefficients and the Fourier basis functions B_{n,m}(r, θ) (in polar representation).
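Steps 1 and 2 can be sketched with NumPy as follows. This is a simplified sketch under stated assumptions: the orientation binning is expressed directly as the complex coefficient |G| e^{-imθ} per pixel, the aggregation kernel is a plain separable Gaussian, and the orders and sigma are illustrative rather than the settings of [32].

```python
import numpy as np

def fourier_hog_coefficients(img, orders=(1, 2)):
    """Step 1 (orientation binning in the Fourier domain): for each order m,
    the per-pixel coefficient is |G| * exp(-i * m * angle(G))."""
    gy, gx = np.gradient(img.astype(float))   # gradients along rows (y) and cols (x)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)
    return {m: mag * np.exp(-1j * m * ang) for m in orders}

def spatial_aggregate(coeff, sigma=2.0):
    """Step 2: convolve a complex coefficient map with an isotropic Gaussian
    kernel (separable 1-D passes on the real and imaginary parts)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    def smooth(a):
        p = np.pad(a, radius, mode="edge")
        out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, p)
        return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, out)
    return smooth(coeff.real) + 1j * smooth(coeff.imag)
```

The magnitude of the smoothed coefficients then serves as the per-order response map from which the regional descriptor is built by projecting onto the basis functions B_{n,m}.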
The graphical illustration of the calculation of the R_FHOG features is provided in Figure 6. Figure 7 depicts a positive result by showing R_FHOG (i.e., R_{n,m}) for the segmented object. To speed up the process, redundant data have not been considered; we select only the region features that give a maximum response on the human region. The final template is formed from region features with scale ∈ {1}, order ∈ {1, -1}, and degree ∈ {1, 2}. The template is shown in Figure 8.
Figure 6. The generation process of the Region Feature Description.

Fusion of Features
The motivation behind fusing features is to capture diverse, complementary information about each class and thus improve classification.
• HOGAVG10 + OBJ_DISP: Here, we fuse HOGAVG10 with OBJ_DISP. The importance of the displacement feature is to differentiate between actions performed at a static location (e.g., boxing, hand waving, and hand clapping) and actions performed at a dynamic location (e.g., walking, jogging, and running). We therefore gain inter-class discriminative power by combining these two features. Because the position of an object does not change drastically between consecutive frames, we employ the window concept to investigate the object motion over that period, and we average the positions to reduce the feature set. This feature is important as it provides the inter-frame offset of the object position. The displacement values for all classes are shown in Tables 1 and 2.
The R_FHOG feature is effective at splitting the gradient into frequency bands, subsequently emphasizing the human action region. In other words, R_FHOG represents crucial information regarding boundaries and smoothed shapes, and it also provides information regarding the spatial context of the human subject.

Formal Description
This section presents the proposed fusion techniques in detail. Fusion is performed at both the feature and classifier levels, referred to as early and late fusion, respectively.

Early Fusion
Feature fusion is performed using basic techniques such as concatenating the feature vectors one after another, as shown in Figure 9.
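As a minimal sketch, early fusion by concatenation amounts to appending the individual feature vectors (here HOGAVG10, OBJ_DISP, and OBJ_VELO are stand-in lists of the sizes used for illustration only):

```python
def early_fusion(feature_sets):
    """Concatenate per-video feature vectors (e.g., HOGAVG10, OBJ_DISP,
    OBJ_VELO, R_FHOG) into one flat descriptor."""
    fused = []
    for features in feature_sets:
        fused.extend(features)
    return fused
```

The fused descriptor is then fed, unchanged, to whichever classifier (ANN, SVM, MKL, or McNN) is being evaluated.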


Late Fusion
Late fusion is used in this work to achieve fusion at the classifier level. The two different late fusion approaches used in the current study are the Decision Combination Neural Network (DCNN) and Sugeno's fuzzy integral.

Decision Combination Neural Network (DCNN)
The Decision Combination Neural Network (DCNN) [33] is a neural network architecture with no hidden layers. Accordingly, the DCNN defines a linear relation between the input and output nodes. The highest response among the output layer nodes is taken as the decision, i.e., the class label for action recognition. Details of the DCNN follow.
As shown in Figure 10, this neural network contains two layers: an input layer (S) and an output layer (Z). The outputs of the M classifiers are fed to the input layer, and there are N input nodes associated with the classes. The connections between the nodes of the input layer and the output layer are associated with weights w. Each input node receives a score s_ik, where i denotes the ith classifier and k denotes the kth class. If the input s_ik is connected to output node j, the weight of this connection is denoted w_ijk. The maximum response at the output layer nodes is taken as the action recognition decision.

A sigmoid activation function is used in each output node; the response of this late fusion approach is defined as

z_j = σ( Σ_{i=1}^{M} Σ_{k=1}^{N} w_ijk s_ik ),

where σ denotes the sigmoid function.
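A minimal sketch of the DCNN decision rule, assuming plain Python lists for the scores and a fixed weight tensor w[i][j][k] (in practice these weights would be learned; the identity-style weights in the usage below are purely illustrative):

```python
import math

def dcnn_decision(scores, weights):
    """scores[i][k]   : score of classifier i for class k.
    weights[i][j][k]  : weight from input score s_ik to output node j.
    Returns the index of the output node with the highest sigmoid response."""
    n_classes = len(scores[0])
    responses = []
    for j in range(n_classes):
        net = sum(weights[i][j][k] * scores[i][k]
                  for i in range(len(scores))
                  for k in range(n_classes))
        responses.append(1.0 / (1.0 + math.exp(-net)))  # sigmoid activation
    return max(range(n_classes), key=lambda j: responses[j])
```

Because the sigmoid is monotonic, the argmax is determined by the weighted sums alone; the activation matters only if the responses themselves are reported as calibrated scores.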

Sugeno's Fuzzy Integral
The assumption underlying a simple weighted-average scheme is that all classifiers are mutually independent. In practice, however, classifiers are correlated. To remove the need for such an assumption, the fuzzy integral was introduced by the authors of [34,35]; it is a nonlinear mapping function defined with respect to a fuzzy measure. A fuzzy integral is a fuzzy average of classifier scores. Definitions of the fuzzy measure and fuzzy integral are given below. Definition 1. Let X be a finite set defined as {x_1, x_2, . . . , x_n}. A fuzzy measure µ defined on X is a set function µ : 2^X → [0, 1] satisfying µ(∅) = 0, µ(X) = 1, and µ(A) ≤ µ(B) whenever A ⊆ B. The fuzzy measure we adopt in this work is the Sugeno measure.

Definition 2.
Let µ be a fuzzy measure on X. The discrete Sugeno integral of a function f : X → [0, 1] with respect to µ is defined as

S_µ(f) = max_{i=1..n} min( f(x_(i)), µ(A_(i)) ),

where (i) indicates that the indices have been permuted so that f(x_(1)) ≥ f(x_(2)) ≥ · · · ≥ f(x_(n)), and A_(i) = {x_(1), . . . , x_(i)}. The fuzzy measure µ is a µ_λ-fuzzy measure and is calculated using Sugeno's λ-measure. The value of µ(A_(i)) is calculated recursively as

µ(A_(1)) = µ_1,  µ(A_(i)) = µ_i + µ(A_(i-1)) + λ µ_i µ(A_(i-1)),

and the value of λ is obtained by solving

λ + 1 = Π_{i=1}^{n} (1 + λ µ_i),

where λ ∈ (-1, +∞) and λ ≠ 0. This can be computed by solving an (n - 1)st-degree polynomial and selecting the unique root greater than -1. The fuzzy integral is used in the proposed work as a late fusion method for combining classifier scores. Suppose that C = {c_1, c_2, . . . , c_n} is the set of action classes of interest, X = {x_1, x_2, . . . , x_n} is the set of classifiers, and A is an input pattern considered for action recognition. Let f_k : X → [0, 1] be the evaluation of the pattern A for class c_k; that is, f_k(x_i) indicates the confidence in the classification of the input pattern A into class c_k using classifier x_i. A value of 1 for f_k(x_i) denotes absolute certainty that the pattern A belongs to class c_k, and 0 denotes absolute certainty that it does not. Knowledge of the density function is needed to compute the fuzzy integral; the ith density µ_i is interpreted as the degree of importance of the source x_i in the final decision. The fuzzy integral then represents the maximal grade of agreement between the evidence and the expectation. In the proposed approach, the densities µ are estimated from the training data provided to the classifiers. Algorithm 1 defines the late fusion approach for decision fusion.
Algorithm 1: Late fusion using the Sugeno fuzzy integral.
procedure LateFusion
    for each action class c_k do
        Calculate the fuzzy integral for action class c_k
    end for
    Find the action class label with the maximal fuzzy integral
end procedure
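The λ-measure construction and the Sugeno integral can be sketched in Python as follows. This is an illustrative sketch, not the exact implementation: the root of the λ equation is found numerically by bisection rather than by explicit polynomial root-finding, and the densities in the usage are hypothetical.

```python
def sugeno_lambda(densities, tol=1e-9):
    """Solve  lambda + 1 = prod(1 + lambda * mu_i)  for the unique root
    with lambda > -1, lambda != 0, via bisection on
    g(lambda) = prod(1 + lambda * mu_i) - lambda - 1."""
    def g(lam):
        prod = 1.0
        for mu in densities:
            prod *= 1.0 + lam * mu
        return prod - lam - 1.0
    # If sum(mu) > 1 the root lies in (-1, 0); if sum(mu) < 1 it lies in (0, inf).
    if sum(densities) > 1.0:
        lo, hi = -1.0 + tol, -tol
    else:
        lo, hi = tol, 1.0
        while g(hi) < 0:
            hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

def sugeno_integral(scores, densities):
    """Discrete Sugeno integral of classifier scores f(x_i) with respect to
    the lambda-fuzzy measure built recursively from the densities mu_i."""
    lam = sugeno_lambda(densities)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    g_prev = 0.0   # mu(A_(0)) = 0
    best = 0.0
    for idx in order:
        # mu(A_(i)) = mu_i + mu(A_(i-1)) + lambda * mu_i * mu(A_(i-1))
        g_prev = densities[idx] + g_prev + lam * densities[idx] * g_prev
        best = max(best, min(scores[idx], min(g_prev, 1.0)))
    return best
```

Running `sugeno_integral` once per action class over the classifier scores f_k(x_i), and taking the class with the maximal integral, reproduces the loop of Algorithm 1.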

Classifier
Various classifiers have been used to evaluate the performance of the proposed approach. The parameters and their respective values are summarized in Table 3. For the SVM, we consider the kernel function with degree (d), the kernel parameter gamma (γ), and the regularization parameter (c); polynomial and radial basis kernel functions are used. Table 3. Parameter settings for the SVM and the respective levels evaluated in the experimentation [36]. The parameters of the ANN are the number of hidden layer neurons (n), the learning rate (lr), the momentum constant (mc), and the number of epochs (ep). To determine the values of these parameters efficiently, ten levels of n, nine levels of mc, and ten levels of ep are evaluated in the parameter setting experiments; the value of lr is initially fixed at 0.1. The values of these parameters and their respective levels are given in Table 4. Table 4. Parameter settings for the neural network and the respective levels evaluated in the experimentation [36].

Meta-Cognitive Neural Network (McNN) Classifier
A neural network provides a self-learning mechanism, whereas the meta-cognitive phenomenon comprises self-regulated learning. Self-regulation makes the learning process more effective; therefore, there is a need to move from single or simple learning to collaborative learning. Collaborative learning can be achieved using a cognitive component, which interprets knowledge, and a meta-cognitive component, which represents the dynamic model of the cognitive component.
Self-regulated learning is a key factor of meta-cognition. It is a threefold mechanism: it plans, monitors, and manages feedback. According to Flavell [37], meta-cognition is awareness and knowledge of the mental processes for monitoring, regulating, and directing oneself toward a desired goal. We adopt Nelson and Narens' meta-cognitive model [38]. The cognitive component and the meta-cognitive component are the prime entities of the McNN. A detailed architecture of the Meta-cognitive Neural Network is shown in Figure 11.
The predicted output of the jth output neuron for input x_i is

ŷ_j^i = α_j0 + Σ_{k=1}^{K} α_jk φ_k(x_i),

where α_j0 is the bias of the jth output neuron, α_jk is the weight connecting the kth hidden neuron to the jth output neuron, and φ_k(x_i) is the output of the kth Gaussian neuron for the excitation x_i, represented as

φ_k(x_i) = exp( -||x_i - µ_k^l||² / (σ_k^l)² ),

where µ_k^l is the mean, σ_k^l is the variation in the mean value of the kth hidden neuron, and l represents the hidden layer class.

Measures:
The meta-cognitive component of the McNN uses four measures for regulating learning:
1. Predicted class label: the class label ĉ estimated from the maximum output of the cognitive component.
2. Maximum hinge error: The hinge error estimates the posterior probability more precisely than the mean-square error function. The error e_j^i between the predicted output ŷ_j^i and the actual output y_j^i under the hinge loss is defined as

e_j^i = 0 if y_j^i ŷ_j^i > 1, and e_j^i = y_j^i - ŷ_j^i otherwise.

The maximum absolute hinge error is E = max_j |e_j^i|.

3. Confidence of classifier: The classifier confidence is given as the estimated posterior probability p̂(c_j | x_i), obtained by mapping the predicted output ŷ_j^i into [0, 1].

4. Class-wise significance: The input feature is mapped to a higher-dimensional space S using the Gaussian activation functions applied by the hidden layer neurons; the mapped features can therefore be considered to lie on a hyper-dimensional sphere. The feature space S is described by the means µ and the variations σ of the Gaussian neurons. The steps for calculating the spherical potential ψ are shown in [39]. In a classification problem, each class distribution is crucial and significantly affects the accuracy of the classifier. Therefore, we measure the spherical potential of a new training sample x belonging to class c with respect to the neurons of the same class, i.e., l = c. The class-wise significance ψ_c is calculated as

ψ_c = (1/K_c) Σ_{k=1}^{K_c} φ_k(x),

where K_c is the number of neurons associated with class c. Whether the sample contains relevant information depends on ψ_c: a low value indicates that the sample is novel.

Learning Strategy:
Based on these measures, the meta-cognitive component has different learning strategies, which implement the basic rules of self-regulated learning. These strategies manage the sequential learning process by applying one of them to each new training sample.

1. Sample delete strategy: This strategy reduces the computational time consumed by the learning process. It reduces redundancy in the training samples, i.e., it prevents similar samples from being learned by the cognitive component. The measures used for this strategy are the predicted class label and the confidence level. When the actual and predicted class labels of the new training sample are equal and the confidence score is greater than the expected value, the new training sample is redundant and is deleted.
2. Neuron growth strategy: This strategy decides whether a new hidden neuron should be added to the cognitive component. When a new training sample contains substantial information and the estimated class label differs from the actual class label, a new hidden neuron is added to capture the knowledge.
3. Parameter update strategy: The parameters of the cognitive component are updated using the new training sample. The parameter values change when the actual class label is the same as the predicted class label and the maximum hinge error is greater than a threshold set for adaptive parameter updating.
4. Sample reserve strategy: Training samples that carry some information but are not sufficiently relevant are reserved and used later to fine-tune the parameters of the cognitive component.
The parameters of McNN are updated when the desired class equals the actual class. The maximum hinge error E lies between 1.2 and 1.5 for the neuron growth strategy and between 0.3 and 0.8 for the parameter update strategy. For the parameter update strategy, a value close to 1 prevents the system from using any sample, whereas a value close to 0 causes all samples to be used for updating. In the neuron growth strategy, a value of E close to 1 lets all misclassified samples trigger neuron addition, whereas a value of 2 causes few neurons to be added. The other parameters are selected accordingly; the ranges of the parameter values are shown in Table 5.
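The selection among the four strategies can be sketched as follows. This is a minimal illustration with threshold values chosen inside the ranges reported above; the function and threshold names are ours, not the paper's:

```python
def choose_strategy(pred_label, true_label, confidence, hinge_error, psi_c,
                    E_update=0.5, psi_novel=0.3, conf_delete=0.9):
    """Meta-cognitive strategy selection sketch. Thresholds are illustrative:
    E_update in [0.3, 0.8], conf_delete is the expected confidence level,
    psi_novel is the novelty threshold on the class-wise significance."""
    if pred_label == true_label and confidence >= conf_delete:
        return "delete"   # correctly classified with high confidence: redundant
    if pred_label != true_label and psi_c < psi_novel:
        return "grow"     # novel and misclassified: add a hidden neuron
    if pred_label == true_label and hinge_error > E_update:
        return "update"   # correct but large hinge error: refine parameters
    return "reserve"      # keep the sample for later fine tuning
```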

Performance Evaluation
A performance evaluation of the proposed work was conducted through extensive experiments on standard datasets using a comprehensive set of performance parameters, as described below.

Database Used
The proposed approach was applied primarily to two datasets: KTH [40] and Weizmann [41]. These datasets are popular benchmarks for action recognition in constrained video sequences: each frame contains only one action against a static background. The unconstrained UCF11, HMDB51, and UCF101 datasets, described below, were also used.

KTH Dataset
The KTH dataset contains action clips with variations in the background, object, and scale, and was thus useful for determining the accuracy of our proposed method. The video sequences contain six different types of human actions (i.e., walking, jogging, running, boxing, hand waving, and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with scale variation (zooming), outdoors with different clothes (appearance), and indoors, as illustrated below. Static and homogeneous backgrounds are considered in all sequences, where the frame rate is 25 frames per second. The resolution of these videos is 160 × 120 pixels, and the duration of the videos is four seconds on average. There are 25 videos for each action in the four different categories. Certain snapshots of video sequences from the KTH dataset are shown in Figure 12.

Weizmann Dataset
The Weizmann database [41] is a collection of 90 low-resolution (180 × 144, de-interlaced 50 frames per second) video sequences. The dataset contains nine different humans, each one performing ten natural actions: run, walk, skip, jumping-jack (or shortly jack), jump forward on two legs (or jump), jump in place on two legs (or pjump), gallop sideways (or side), wave two hands (or wave2), wave one hand (or wave1), or bend. Snapshots of the Weizmann dataset are shown in Figure 13.

HMDB51 Dataset
The HMDB51 dataset [43] was built from video clips collected from YouTube, movies, and various other sources to cover unconstrained environments. The dataset comprises 6849 video clips in 51 action categories, with at least 101 clips per class.

UCF101 Dataset
UCF101 [44] is a dataset of 13,320 videos covering 101 action classes. It exhibits large diversity in the appearance of the humans performing the actions, scale, object viewpoint, background clutter, and illumination, making it one of the most challenging datasets. It serves as a bridge toward real-world action recognition.

The Testing Strategy
The KTH dataset contains 600 video samples of six types of human actions. The dataset was divided into two parts: 80% and 20%. A 10-fold leave-one-out cross-validation scheme was used on the 80% part, with the remaining 20% held out for testing. In this scheme, nine splits are used for training and the remaining split serves as the validation set, which optimizes the parameters of each classifier. The same testing strategy was applied to the Weizmann dataset. Leave-one-group-out cross-validation was used for the UCF11 dataset. For the HMDB51 dataset, the same cross-validation strategy as in [43] was used: the dataset is divided into three splits, each containing 70 training and 30 testing video clips per class. For UCF101, the standard three-split protocol was used for training and testing.
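The hold-out-then-cross-validate protocol used for KTH and Weizmann can be sketched as follows (a simplified illustration with our own function name; 20% of the indices are held out for testing and the remaining 80% are partitioned into k folds):

```python
import numpy as np

def holdout_then_kfold(n_samples, test_frac=0.2, k=10, seed=0):
    """Hold out test_frac of the samples for testing, then build k
    (train, validation) splits over the remainder."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(test_frac * n_samples))
    test_idx, train_val = idx[:n_test], idx[n_test:]
    folds = np.array_split(train_val, k)
    splits = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        splits.append((train, val))
    return test_idx, splits
```

For the 600 KTH videos this yields 120 test samples and ten 432/48 train/validation splits.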

Experimental Setup
Experiments were performed on an Intel(R) Core(TM) i5 (2nd Gen) 2430M CPU @ 2.5 GHz with 6 GB of RAM and a 64-bit operating system. The names of the parameters and the values used in this proposed work are listed in Tables 3 and 4, respectively. In this section, we examine the performance of our proposed approach and compare it with the state-of-the-art methods. We also compare different classifier performances with our feature extraction technique for the proposed framework. All confusion matrices address the average accuracy of all features for the SVM classifier with different kernel functions, as well as for the ANN with different numbers of hidden layers.
In this experiment, both early and late fusion techniques were considered, and five different fusion strategies were employed in the proposed work. Figures 14-19 present the early-fusion and late-fusion models used in our experiments. In Figure 14, early fusion is applied to the features, which are then fed to the ANN classifier; early fusion of features fed to the SVM classifier is shown in Figure 15. Features provided to MKL with ANN as the base learner, and to MKL with SVM as the base learner, are shown in Figures 16 and 17, respectively. Figure 18 shows the combination of classifier scores using a late-fusion technique, where SVM classifiers are used. The meta-cognitive neural network used with all proposed features is shown in Figure 19.
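The distinction between the two fusion families can be sketched as follows. This is a minimal illustration: early fusion concatenates per-feature descriptors before classification, while late fusion combines per-classifier score vectors (here by weighted averaging; the paper's late fusion additionally uses a fuzzy-integral combination rather than plain averaging):

```python
import numpy as np

def early_fusion(feature_sets):
    """Early fusion: concatenate per-modality feature vectors into one
    descriptor before the classifier sees them."""
    return np.concatenate(feature_sets, axis=-1)

def late_fusion(score_sets, weights=None):
    """Late fusion: combine class-score vectors from multiple classifiers
    by a weighted average and return the predicted class index."""
    scores = np.stack(score_sets)          # shape: (n_classifiers, n_classes)
    if weights is None:
        weights = np.ones(len(score_sets)) / len(score_sets)
    fused = np.tensordot(weights, scores, axes=1)
    return int(np.argmax(fused))
```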

Empirical Analysis
The confusion matrices for different combinations of feature extraction and classifier techniques on the KTH dataset are shown in Tables 6 and 7; Tables 8-10 show the results for the Weizmann dataset. We considered linear, polynomial, and radial basis kernel functions for SVM classification. The results show that we obtain a good result (97%) with the radial basis function SVM and the best result (99.98%) with late fusion using the fuzzy integral approach, compared with the other proposed approaches. Ambiguity arises among classes such as boxing, hand waving, and hand clapping; furthermore, running, walking, and jogging are misclassified by all classifiers. The confusion matrix for the radial basis function SVM (RBF SVM) on the UCF11 dataset is shown in Table 11. We achieved 77.05% accuracy on the UCF11 dataset with the SVM parameters listed in Table 4. Tables 12-14 show the confusion matrices for the KTH, Weizmann, and UCF11 datasets using McNN. Because the UCF11 dataset has unconstrained environments and contains various challenges in its video sequences, the proposed feature extraction technique is not adequate for describing the actions performed by the human object; consequently, many actions are misclassified, e.g., Shoot is misclassified as Swing. The accuracy obtained for the KTH dataset is 99.19% with late fusion using a DCNN and 99.98% with late fusion using the fuzzy integral, showing that the fuzzy integral is the more effective late-fusion technique, as reported in Table 15.
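For reference, the per-class and overall recognition rates read from such confusion matrices can be computed as follows (a small sketch assuming the usual convention that rows are true classes and columns are predicted classes):

```python
import numpy as np

def per_class_accuracy(cm):
    """Per-class recognition rate: diagonal count divided by the row total."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

def overall_accuracy(cm):
    """Overall accuracy: trace divided by the total number of samples."""
    cm = np.asarray(cm, dtype=float)
    return float(np.trace(cm) / cm.sum())
```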
Moreover, the performance on the five broad groups is evaluated in this work using the model shown in Figure 20. The recognition rate was calculated for all group categories. The largest share of the performance comes from the sports category, although all other categories also perform impressively.
Tables 15 and 16 compare our results with the state-of-the-art methods. Table 15 compares our proposed approach with 21 other approaches on the KTH dataset. Our approach obtained an accuracy of 100%, outperforming the state-of-the-art methods. The comparison with state-of-the-art methods on the Weizmann dataset is shown in Table 16; the results show that our method outperforms the other methods. These comparisons demonstrate that the proposed approach is effective and superior in classifying actions. In Table 20, we compare our approach with various convolutional neural network architectures. For this comparison, the average accuracy was calculated over three splits, as in the original setting. On the UCF101 dataset, our McNN with the proposed features performed well compared with state-of-the-art methods, yielding a 1% improvement in classification accuracy. Our result on the HMDB51 dataset is not the best, but the improvement in accuracy is considerable.

Conclusions
In this paper, we have presented a novel feature-fusion approach for HAR. The HOG, R_FHOG, displacement, and velocity features are combined to form the feature descriptor. The classifiers used to classify human actions are an ANN, an SVM, MKL, a late-fusion approach, and McNN. The experimental results demonstrate that the proposed approach can easily recognize actions such as running, walking, and jumping, and that McNN outperforms the other classifiers. The proposed approach performs reasonably well compared with the majority of existing state-of-the-art methods. On the KTH dataset, our approach outperforms existing methods, and on the Weizmann dataset it performs similarly to the standard available methods. We also evaluated the system on the unconstrained UCF11, HMDB51, and UCF101 datasets, where its performance approaches that of the state-of-the-art methods.
In the future, an overlapping window can be utilized in the feature extraction technique to increase the accuracy of the proposed method. The proposed work focuses only on constrained videos; however, the proposed feature set could also be used for unconstrained videos, in which more than one object performs the same action or multiple actions are performed. The traditional neural network could be replaced by a convolutional neural network for further enhancement. We conclude that fusion of features is a vital idea for enhancing classifier performance when a large, complex set of features is available. Late fusion was found to be better than early fusion, as the features are exploited by multiple competing classifiers before their scores are combined.

Acknowledgments: The authors would like to thank the reviewers for their valuable suggestions, which helped improve the quality of this paper.

Conflicts of Interest:
The authors declare no conflict of interest.