Video-Based Human Activity Recognition Using Multilevel Wavelet Decomposition and Stepwise Linear Discriminant Analysis

Video-based human activity recognition (HAR) is the analysis of human motions and behaviors from low-level sensor data. Over the last decade, automatic HAR has been a demanding research area and is considered a significant concern in computer vision and pattern recognition. In this paper, we present a robust and accurate activity recognition system called WS-HAR that consists of a wavelet transform coupled with stepwise linear discriminant analysis (SWLDA), followed by a hidden Markov model (HMM). The symlet wavelet is employed to extract features from the activity frames. The most prominent features are then selected with SWLDA, which focuses on selecting localized features from the activity frames and discriminating their classes based on regression values (i.e., partial F-test values). Finally, a well-known sequential classifier, the HMM, assigns the appropriate labels to the activities. To validate the performance of WS-HAR, we used two publicly available standard datasets under two experimental settings: an n-fold cross-validation scheme based on subjects, and a set of experiments designed to show the effectiveness of each approach. The weighted average recognition rate of WS-HAR was 97% across the two datasets, which is a significant improvement in classification accuracy over existing well-known statistical and state-of-the-art methods.

(GMMs) [20,21], and hidden Markov models (HMMs) [22][23][24] have been utilized for the purpose of recognition. Among them, the HMM is widely used for sequence-based classification [25] in FER systems, because HMMs have an advantage in handling sequential data when frame-level features are used, whereas vector-based classifiers such as GMMs, ANNs, and SVMs fail to learn the sequence of the feature vectors.
The objective of this paper is to propose a new feature extraction technique based on the wavelet transform, specifically the symlet wavelet. To obtain the feature vectors, the symlet wavelet family was tested, with each image decomposed up to four levels. To select the most prominent features, we also propose the use of a robust feature selection technique called stepwise linear discriminant analysis (SWLDA). SWLDA is easy to explain, has good predictive ability, and is computationally less expensive than other existing methods [26]. Some limitations of existing works, such as sensitivity to illumination change, do not affect the performance of SWLDA. SWLDA chooses only a small subset of features from the large feature set by employing forward and backward regression models. In the forward process, the most correlated features are selected from the feature space based on partial F-test values. In the backward process, the least significant features, i.e., those with the lowest F-test values, are removed from the regression model. In both processes, the F-test values are calculated on the basis of the defined class labels. The advantage of this method is that it is very efficient at finding localized features.
Some related work in this field has been discussed above. The rest of the paper is organized as follows: Section 2 provides an overview of WS-HAR. The experimental setup is described in Section 3. Section 4 presents the experimental results and discussion. Finally, Section 5 concludes the paper with some future directions.

Materials and Methods
The WS-HAR system consists of the following modules.

Preprocessing
In most activity datasets, the activity frames have various resolutions and backgrounds and were taken under varying lighting conditions; a preprocessing module is therefore necessary to improve the quality of the frames. At this stage, the background information, illumination noise, and unnecessary details are reduced for fast and easy processing. The output of this module is a sequence of images with normalized intensity, size, and shape. In the preprocessing module of the WS-HAR system, we employed histogram equalization to compensate for lighting effects. Moreover, we extracted the human bodies by subtracting the empty frames from the activity frames, as shown in Figure 1.
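The two preprocessing steps named above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: frames are assumed to be 8-bit grayscale images stored as lists of rows, and the function names and the difference threshold of 30 are hypothetical choices made only for this example.

```python
def equalize_histogram(frame, levels=256):
    """Histogram-equalize a grayscale frame (list of rows of ints in [0, levels))."""
    pixels = [p for row in frame for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution of the intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in frame]
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[p] - cdf_min) * scale) for p in row] for row in frame]


def subtract_background(frame, empty_frame, threshold=30):
    """Binary silhouette: 1 where the frame differs from the empty frame.

    The threshold is an assumed value, not one reported in the paper.
    """
    return [[1 if abs(p - b) > threshold else 0 for p, b in zip(f_row, b_row)]
            for f_row, b_row in zip(frame, empty_frame)]
```

After equalization, the darkest occupied intensity maps to 0 and the brightest to 255, which reduces the effect of global lighting differences between clips before the silhouette is extracted.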

Feature Extraction
Feature extraction obtains the distinguishable features from each human body shape and quantizes them as discrete symbols. In WS-HAR, we propose a robust feature extraction technique, described below.

Wavelet Transform
After a set of body silhouettes has been segmented from a sequence of images, the wavelet transform is applied for feature extraction. The decomposition was performed on grayscale video frames; the frames were converted from RGB to grayscale to improve the efficiency of the proposed algorithm. The wavelet decomposition can be interpreted as a decomposition of the signal into a set of independent feature vectors. Each vector consists of sub-vectors as given in Equation (1), where V represents the 2D feature vector. A 2D activity frame X is decomposed into orthogonal sub-images corresponding to different visualizations. Equation (2) shows one level of decomposition, where X indicates the decomposed image and A_1 and D_1 are the approximation and detail coefficient vectors, respectively. If the activity frame is decomposed over multiple levels, Equation (2) can be written as Equation (3), where j represents the level of decomposition. The detail coefficients mostly consist of noise; therefore, only the approximation coefficients were used for feature extraction. Each frame is decomposed up to four levels, i.e., j = 4, because beyond j = 4 the image loses so much information that the informative coefficients cannot be detected properly, which might cause misclassification. The detail coefficients further consist of three sub-coefficients, so Equation (3) can be written as Equation (4), or more simply as Equation (5), where D_h, D_v, and D_d indicate the horizontal, vertical, and diagonal coefficients, respectively. It can be observed from Equation (4) or Equation (5) that all the coefficients are connected with each other like a chain, through which the prominent features can be extracted easily. These coefficients are represented graphically and image-wise in Figures 2 and 3, respectively.
All the coefficients are connected one after another, as in the head-to-tail rule of vector addition, which produces a one-dimensional matrix from which the coefficients are easily extracted.
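The displayed equations referred to in this subsection are not reproduced in the text. Based only on the definitions given here, Equations (2) to (4) plausibly take the standard additive form below. The notation is an assumption, and Equations (1) and (5) cannot be recovered from the surrounding description.

```latex
\begin{align}
X &= A_1 + D_1 \tag{2} \\
X &= A_j + \sum_{i=1}^{j} D_i \tag{3} \\
X &= A_j + \sum_{i=1}^{j} \left[ (D_h)_i + (D_v)_i + (D_d)_i \right] \tag{4}
\end{align}
```

With j = 4, only the term A_4 is retained as the feature source, since the detail terms are treated as noise.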
In each decomposition step, the approximation and detail coefficient vectors are obtained by passing the signal through the low-pass and high-pass filters.
Figure: Decomposition of a frame along with its corresponding coefficients after using the proposed feature extraction algorithm. The blue arc shows the detail coefficients, which further consist of three sub-coefficients: horizontal, vertical, and diagonal, respectively.
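The filtering scheme can be illustrated with a small multilevel 2D decomposition. The sketch below deliberately uses the Haar wavelet as a stand-in, because its low-pass and high-pass filters fit in a few lines; the paper itself uses a symlet wavelet, whose longer filters would normally come from a wavelet library. The function names are hypothetical, and the image side lengths are assumed to be divisible by 2 at every level (a 60 × 60 frame would need padding or resizing under this simplification).

```python
def haar_level(image):
    """One 2D Haar step: returns (approximation, (horizontal, vertical, diagonal))."""
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            p, q = image[r][c], image[r][c + 1]
            s, t = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((p + q + s + t) / 2)  # low-pass along rows and columns
            h_row.append((p + q - s - t) / 2)  # difference between the two rows
            v_row.append((p - q + s - t) / 2)  # difference between the two columns
            d_row.append((p - q - s + t) / 2)  # diagonal difference
        approx.append(a_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return approx, (horiz, vert, diag)


def decompose(image, levels=4):
    """Repeatedly decompose the approximation; returns (A_j, [D_1, ..., D_j])."""
    approx, details = image, []
    for _ in range(levels):
        approx, detail = haar_level(approx)
        details.append(detail)
    return approx, details


def approximation_features(image, levels=4):
    """Flatten the final approximation coefficients; detail coefficients are dropped."""
    approx, _ = decompose(image, levels)
    return [value for row in approx for value in row]
```

Keeping only the level-4 approximation reduces a 64 × 64 frame to 16 coefficients in this sketch, which shows why deeper decompositions discard too much information to remain discriminative.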
After the decomposition process, the feature vector is created by taking the average of the frequencies of the activity frames. Within a specified time window, the frequency of each activity frame is estimated by analyzing the frame with the wavelet transform [27]. In this transform, a_i is the scale of the wavelet, taken between the lower and upper frequency bounds to refine the frequency estimate; b_j is the position of the wavelet, running from the start to the end of the time window in steps of the signal sampling period; t is the time; ψ_f.e is the wavelet function used for frequency estimation; and C(a_i, b_j) are the wavelet coefficients for the specified scale and position parameters. These coefficients are converted to the mode frequency using f_a(ψ_f.e), the average frequency of the wavelet function, and ∆, the signal sampling period. The feature vector is then obtained by averaging the frame frequencies over each activity, giving f_Act, the average frequency of each activity, which is the feature vector for that activity; K is the last frame of the current activity, and N is the total number of frames in each activity.
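The conversion from scale to frequency and the per-activity averaging can be sketched as below. The relation f = f_a / (a · ∆) is the usual scale-to-frequency rule for wavelets and is assumed here to be the form of the missing equation; the function names are hypothetical.

```python
def scale_to_frequency(scale, average_frequency, sampling_period):
    """Frequency associated with a wavelet scale: f = f_a / (a * delta)."""
    return average_frequency / (scale * sampling_period)


def activity_frequency(frame_frequencies):
    """f_Act: mean of the per-frame frequencies over the N frames of one activity."""
    return sum(frame_frequencies) / len(frame_frequencies)
```

For example, a wavelet with average frequency 0.5 at scale 2 and sampling period 0.1 corresponds to a frequency of about 2.5 under this relation.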

Feature Selection
The feature selection module selects, from the large number of features extracted from the input data, a subset of relevant features that contain the information needed to distinguish one class from the others. Some human activities, such as running and walking or skipping and jumping, have quite similar feature values in the feature space, which can result in a high misclassification rate. This similarity also results in high within-class variance and low between-class variance. Therefore, a method is required that not only provides dimension reduction but also increases the low between-class variance, improving class separation before the features are fed to the classifier.
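To make the role of the two variances concrete, the sketch below computes the between-class and within-class scatter of a single scalar feature and their ratio (the Fisher criterion). It is an illustration of the concept only; the function names and the toy values are hypothetical and are not taken from the datasets used in the paper.

```python
from statistics import mean


def class_scatter(samples_by_class):
    """Return (between-class, within-class) scatter for one scalar feature."""
    all_values = [x for values in samples_by_class.values() for x in values]
    grand_mean = mean(all_values)
    between = sum(len(values) * (mean(values) - grand_mean) ** 2
                  for values in samples_by_class.values())
    within = sum((x - mean(values)) ** 2
                 for values in samples_by_class.values() for x in values)
    return between, within


def fisher_ratio(samples_by_class):
    """Larger values mean the classes are easier to separate on this feature."""
    between, within = class_scatter(samples_by_class)
    return between / within


separated = {"walk": [1.0, 1.2, 0.8], "run": [5.0, 5.2, 4.8]}
overlapping = {"walk": [1.0, 3.0, 5.0], "run": [2.0, 4.0, 6.0]}
print(fisher_ratio(separated) > fisher_ratio(overlapping))
```

A feature on which walking and running overlap has a ratio close to zero, which is the situation a feature selection step is meant to avoid.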
In order to solve this problem, several methods have been discussed in the machine learning literature, such as kernel discriminant analysis (KDA) [28], generalized discriminant analysis (GDA) [29], and linear discriminant analysis (LDA) [30]. Among these, LDA has been most widely employed in HAR systems.
However, LDA has two major limitations. First, it relies on the mixture model containing the correct number of components. Second, it is a linear technique that is limited in flexibility when applied to more complex datasets. Moreover, the assumption made by LDA that all classes share the same within-class covariance matrix is not valid. Additionally, large amounts of data are necessary to generate robust transforms for LDA, and there may be insufficient data to robustly estimate transforms to separate the classes. For more details on LDA, please refer to a previous study [31].
In sum, we believe that the use of LDA will not necessarily improve the performance of an HAR system, and that LDA cannot provide a better classification rate because of the aforementioned limitations. Therefore, we propose the use of a robust technique, SWLDA [26], that does not suffer from these limitations. To the best of our knowledge, this is the first time that SWLDA has been used as a feature selection technique for HAR systems.

Stepwise Linear Discriminant Analysis (SWLDA)
Fisher's linear discriminant (FLD) is a well-known linear classification method that finds the optimal separation between two classes [28]. For two classes that have Gaussian distributions with identical covariance, FLD is more robust than other linear classifiers with regard to optimal separation. For binary tasks, FLD and least-squares regression are equivalent, and the feature weights are computed from M and Y, where M is the matrix of observed feature vectors and Y is the vector of class labels. FLD provides the best classification solution for linear data; however, it does not provide a good solution when the data are non-linear. Therefore, we propose the use of SWLDA, a stepwise technique that has been reported to discriminate P300 Speller responses [26]. In short, SWLDA is an extended version of FLD that performs two operations in parallel: reducing the feature space by extracting informative features, and removing irrelevant features.
As mentioned before, SWLDA extracts and selects the best features by using two algorithms, a forward and a backward algorithm, that work in parallel. Because there is no initial model at the start, the most significant predictor variable, i.e., one with a p-value < 0.2, is entered into the model first. After each new variable is entered by the forward technique, the backward algorithm is used to remove irrelevant variables (i.e., those that have a p-value > 0.25). This entry and removal procedure continues until the predefined criteria are satisfied, and the resulting function is restricted to a maximum of 200 features.
In the regression procedure, the best variable, say X, is selected first, and further variables are then added when they are significant. The entry and selection of the best variables are based on F-test values, which determine which variable should be entered first, second, and so on. The partial F-value of each candidate is then compared with the selected value. This whole process constitutes the forward technique. In the next step, the deletion process is carried out using a backward regression technique (known as backward deletion), in which the test values for all predictor variables already present in the model are calculated. The lowest test value, V_L, is compared with the pre-selected value, P_S, to decide whether the corresponding variable is removed. For more details on SWLDA, please refer to a previous study [26].
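The equation relating M and Y is not reproduced in the text. Assuming it is the ordinary least-squares solution, which is the standard form when a binary discriminant is written as a regression, the weight vector w (a symbol introduced here for illustration) would be:

```latex
w = \left( M^{\top} M \right)^{-1} M^{\top} Y
```

A new feature vector x is then scored as w^T x, and the sign or threshold of that score gives the binary class.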
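The forward entry and backward deletion steps can be sketched as a stepwise regression driven by partial F statistics. This is a simplified illustration, not the authors' implementation: it compares F values against the assumed thresholds f_enter and f_remove instead of the p-value thresholds quoted above (converting F to p requires the F distribution from a statistics library), it never reconsiders a feature once removed, and all function names are hypothetical. Only the cap of 200 features comes from the text.

```python
def _rss(X, y, cols):
    """Residual sum of squares of a least-squares fit of y on X[:, cols] plus intercept."""
    A = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(A[0])
    # Augmented normal equations [A^T A | A^T y], solved by Gauss-Jordan elimination.
    G = [[sum(a[i] * a[j] for a in A) for j in range(k)]
         + [sum(a[i] * t for a, t in zip(A, y))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(G[r][i]))
        G[i], G[pivot] = G[pivot], G[i]
        for r in range(k):
            if r != i:
                factor = G[r][i] / G[i][i]
                G[r] = [gr - factor * gi for gr, gi in zip(G[r], G[i])]
    beta = [G[i][k] / G[i][i] for i in range(k)]
    return sum((t - sum(b * a for b, a in zip(beta, row))) ** 2
               for row, t in zip(A, y))


def partial_f(X, y, cols, candidate):
    """Partial F statistic of `candidate` given the other features in `cols`."""
    full = sorted(set(cols) | {candidate})
    reduced = [c for c in full if c != candidate]
    rss_full, rss_reduced = _rss(X, y, full), _rss(X, y, reduced)
    dof = len(y) - len(full) - 1
    return (rss_reduced - rss_full) / (rss_full / dof)


def stepwise_select(X, y, f_enter=4.0, f_remove=3.0, max_features=200):
    """Forward entry followed by backward deletion; thresholds are assumed values."""
    selected, remaining = [], set(range(len(X[0])))
    while remaining and len(selected) < max_features:
        # Forward step: enter the candidate with the largest partial F value.
        best = max(remaining, key=lambda c: partial_f(X, y, selected, c))
        if partial_f(X, y, selected, best) < f_enter:
            break
        selected.append(best)
        remaining.discard(best)
        # Backward step: delete features whose partial F value has become too small.
        while len(selected) > 1:
            worst = min(selected, key=lambda c: partial_f(X, y, selected, c))
            if partial_f(X, y, selected, worst) >= f_remove:
                break
            selected.remove(worst)  # simplification: removed features are not re-entered
    return sorted(selected)
```

On data where one feature tracks the class label and the others are uncorrelated with it, only that feature survives, which is the dimension reduction the selection step is meant to deliver.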

Recognition
In the recognition module, a classifier such as a hidden Markov model (HMM), Gaussian mixture model (GMM), or support vector machine (SVM) is first trained with training data and then used to generate the labels of the human activities contained in the incoming video data.

Hidden Markov Model (HMM)
As described before, the HMM is well suited to the classification of sequential data (video-based activities) and provides a statistical model λ for a set of observation sequences. In the HAR domain, these observations are the frames. Suppose there is a sequence of observations of length T, denoted O_1, O_2, ..., O_T. The HMM also consists of a set of states S = S_1, S_2, ..., S_N, where N is the number of states in the model, and the state at each time t is denoted Q = q_1, q_2, ..., q_N. The likelihood P(O|λ) can be evaluated by summing over all possible state sequences. A simple procedure for finding the parameters λ that maximize this likelihood, introduced in [32], depends on the forward and backward variables α_t(j) = P(O_1, O_2, ..., O_t, q_t = j | λ) and β_t(j) = P(O_(t+1), ..., O_T | q_t = j, λ), respectively, which can be computed inductively in three steps. During testing, the appropriate HMM is determined by likelihood estimation for the observation sequence O, calculated with each trained λ. The maximum likelihood for the observations provided by the trained HMMs indicates the recognized label. For more details on HMMs, please refer to [33]. The HMM (λ) is modeled in terms of O, Q, and π, where O is the sequence of observations, e.g., O_1, O_2, ..., O_T; each state is denoted by Q, such that Q = q_1, q_2, ..., q_N, where N is the number of states in the model; and π is the initial state probabilities. The parameters used to model the HMM (λ) in all experiments were 44, 4, and 4, respectively. These values were selected by performing multiple experiments.
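The likelihood evaluation and the maximum-likelihood labeling can be sketched with the forward algorithm for a discrete-observation HMM. This is a minimal illustration under stated assumptions: observations are integer symbol indices, each trained model is a hypothetical tuple (initial, transition, emission), and no scaling is applied, so very long sequences would need log-space arithmetic to avoid underflow.

```python
def forward_likelihood(observations, initial, transition, emission):
    """P(O | lambda) computed with the forward variable alpha_t(j)."""
    n_states = len(initial)
    # Initialization: alpha_1(j) = pi_j * b_j(O_1)
    alpha = [initial[j] * emission[j][observations[0]] for j in range(n_states)]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(O_{t+1})
    for obs in observations[1:]:
        alpha = [sum(alpha[i] * transition[i][j] for i in range(n_states))
                 * emission[j][obs] for j in range(n_states)]
    # Termination: P(O | lambda) = sum_j alpha_T(j)
    return sum(alpha)


def classify(observations, models):
    """Return the label of the trained model with the highest likelihood."""
    return max(models,
               key=lambda label: forward_likelihood(observations, *models[label]))


# Two toy activity models sharing a left-to-right transition structure.
transition = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "walk": ([1.0, 0.0], transition, [[0.9, 0.1], [0.1, 0.9]]),
    "run": ([1.0, 0.0], transition, [[0.1, 0.9], [0.9, 0.1]]),
}
print(classify([0, 1], models))
```

One model is trained per activity, and a test sequence receives the label of whichever model explains it best.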

Experimental Setup
Some action datasets are pose-based, such as the Weizmann action dataset [5] and the KTH action dataset [34], while others are spontaneous, such as RGBD-HuDaAct [35], UCF Youtube [36], Hollywood2 [37], HMDB51 [38], and ASLAN [39]. Most of the activity frames in pose-based datasets contain only one subject performing the activity, whereas the spontaneous action datasets have more than one subject in each activity clip. WS-HAR may not work on spontaneous action datasets because more than one subject appears in the activity frames, which is one of the limitations of the WS-HAR system. Therefore, the performance of WS-HAR was tested and validated only on pose-based action datasets, namely the Weizmann and KTH action datasets. Each dataset is described below.
• Weizmann Action Dataset: In this dataset, 9 subjects performed 10 actions such as bending, walking, running, skipping, jumping forward, place-jumping, side-movement, one-hand-waving, and two-hand-waving. There were 90 video clips in the dataset, and the average number of frames in each clip was 15. The size of each frame was 144 × 180.
• KTH Action Dataset: We also employed the KTH activity recognition dataset. In this dataset, 25 subjects performed six activities (walking, jogging, running, boxing, hand-waving, and hand-clapping) in four different scenarios. There were 2391 sequences in total, taken over homogeneous backgrounds with a static camera. The frame size was 160 × 120.
During all the experiments, the size of each input frame was 60 × 60, where the images were first converted to a zero-mean vector of size 1 × 3600 for feature extraction. For a thorough validation, four experiments were performed.
• In the first experiment, an n-fold cross-validation scheme based on subjects was used for each dataset: out of n subjects, the data from a single subject were held out as validation data for testing WS-HAR, whereas the data from the remaining n − 1 subjects were used as training data. This process was repeated n times, with the data from each subject used exactly once as validation data. The value of n varied according to the dataset used. The benefit of this scheme is that each activity was used for both training and testing.
• In the second experiment, the sub-components of WS-HAR, i.e., feature extraction (symlet wavelet transform) and feature selection (SWLDA), were analyzed separately.
• In the third experiment, the performance of WS-HAR was compared with previous state-of-the-art methods.
• In the fourth experiment, the performance of different combinations of existing approaches was analyzed on both datasets.
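The subject-based n-fold scheme of the first experiment amounts to leave-one-subject-out cross-validation and can be sketched as follows. The clip representation, a tuple of (subject, label, data), and the function name are assumptions made for this example.

```python
def leave_one_subject_out(clips):
    """Yield (held_out_subject, train, test) with each subject held out exactly once."""
    subjects = sorted({subject for subject, _, _ in clips})
    for held_out in subjects:
        train = [clip for clip in clips if clip[0] != held_out]
        test = [clip for clip in clips if clip[0] == held_out]
        yield held_out, train, test


# Toy usage: 3 subjects, 2 actions each; the data field is left empty here.
clips = [(s, a, None) for s in ("s1", "s2", "s3") for a in ("walk", "run")]
for held_out, train, test in leave_one_subject_out(clips):
    print(held_out, len(train), len(test))
```

Splitting by subject rather than by clip keeps every clip of the held-out person out of training, so the reported rate reflects performance on unseen people.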

Experimental Results of WS-HAR Based on Subjects
In this experiment, the WS-HAR (wavelet transform and stepwise linear discriminant analysis (SWLDA)-based human activity recognition) system was trained and tested separately on each dataset, with the symlet wavelet transform, SWLDA, and HMM applied collectively. The overall results of WS-HAR on the Weizmann and KTH action datasets are shown in Tables 1 and 2, respectively. It can be seen from Tables 1 and 2 that WS-HAR consistently achieved a high recognition rate when applied to these datasets separately: 97.11% for the Weizmann action dataset and 97.16% for the KTH action dataset.
Table 1. The recognition rate of WS-HAR using Weizmann action dataset. It can be seen that the WS-HAR showed better classification rate (Unit: %).

Experimental Results of WS-HAR under the Absence of Each Module
In this experiment, a set of sub-experiments was performed to assess the efficacy of each module of WS-HAR (feature extraction and feature selection) separately. The experiment was repeated twice, and the classification rate was analyzed under two different settings. In the first setting, an existing feature extraction technique, ICA, was employed instead of the proposed feature extraction technique (wavelet transform). In the second setting, a well-known feature selection technique, PCA, was used instead of the proposed feature selection method (SWLDA). The results for the two settings on the Weizmann and KTH action datasets are given in Tables 3-6. As shown in Tables 3 and 4, the proposed feature extraction method (symlet wavelet transform) is an important part of WS-HAR. This is because the symlet wavelet can extract the most prominent information, in the form of frequency, from activity frames, and because it is a compactly supported wavelet with the least asymmetry and the highest number of vanishing moments for a given support width. The symlet wavelet is able to support the orthogonal, biorthogonal, and reverse biorthogonal characteristics of grayscale images, which is why it provides better classification results.
The frequency-based assumption is supported by our experiments, in which we measured the statistical dependency of the wavelet coefficients for all activity frames. The joint probability of a frame is computed by collecting geometrically aligned frames of the activity for each wavelet coefficient. The mutual information of the wavelet coefficients, computed from these distributions, is used to estimate the strength of the statistical dependency between two frames. Moreover, the symlet wavelet transform is capable of extracting prominent features from activity frames with the aid of locality in frequency, orientation, and space. Since the wavelet transform is a multi-resolution technique, it helps to analyze the images efficiently in a coarse-to-fine way.
Similarly, Tables 5 and 6 show that the proposed feature selection method, SWLDA, also contributes substantially to WS-HAR. Without SWLDA, the WS-HAR system was unable to achieve an adequate classification rate. This indicates that SWLDA is a robust feature selection method that addresses the limitations of previous feature selection techniques, especially PCA and LDA. The reason behind the better performance of SWLDA is apparent in Tables 5 and 6: SWLDA not only provides dimension reduction but also increases the low between-class variance, improving class separation before the features are fed to the classifier. The low within-class variance and high between-class variance are achieved because of the forward and backward regression models in SWLDA.

Comparison of the WS-HAR with State-of-the-Art Methods
In this experiment, we compared the performance of WS-HAR with some state-of-the-art methods [40][41][42][43][44][45][46] on both datasets, i.e., the Weizmann and KTH action datasets. Some of these methods recognize activities with frame-based classification, while others use sequence-based classification. All of these methods were implemented by us following the instructions provided in their respective papers. For each dataset, the n-fold cross-validation scheme (based on subjects) was used, as described in Section 3. The average recognition rate of each method, along with that of WS-HAR, is shown in Table 7. It can be seen from Table 7 that WS-HAR outperformed the existing state-of-the-art methods. Thus, the WS-HAR system shows significant potential in its ability to accurately and robustly recognize human activities from video data.
Table 5. Confusion matrix for the WS-HAR using Weizmann action dataset, while removing the proposed feature selection method (SWLDA) (Unit: %).

Experimental Results of Existing Well-Known Statistical Methods
In this experiment, a set of sub-experiments was performed using different combinations of previously used feature extraction and classification approaches on the two datasets. The overall results of these experiments are shown in Tables 8-19. Comparing Tables 1 and 2 with these tables, one can see that WS-HAR performs much better than the different combinations of the previously explored methods.
Moreover, to show the efficacy of the proposed approaches, we compared their weighted recognition rate with that of some recent well-known feature extraction methods: motion history image (MHI) [47,48], spatio-temporal interest points [7,49], and dense motion trajectories [50]. The overall results, along with those of the proposed approaches, are shown in Table 20.
Table 15. The recognition rate of ICA + LDA and HMM using KTH action dataset (Unit: %).
It can be seen from Table 20 that the proposed approaches outperformed the recent existing feature extraction methods. These methods (shown in Table 20) have their own limitations. For example, scalability is one of the major limitations of motion history image-based methods, because they analyze the lateral motion of the gesture [51]; they might also recognize only actions at an angle of 180 degrees [52]. Spatio-temporal interest point features commonly require well-segmented silhouettes, and these methods are very sensitive to viewpoint and occlusion [53]. Although methods based on spatio-temporal interest point features recognize activities well, they often discard the time information [54]. Likewise, dense motion trajectory-based methods typically lose the underlying sequential information provided by the ordering of the words when the activities are represented as bags of words [55]. On the other hand, the proposed approaches overcame the limitations of the aforementioned feature extraction techniques and achieved a higher recognition rate than the others. The details are described in Section 4.2.

Conclusions
The aim of video-based activity recognition systems is to automatically recognize a human activity from a sequence of images (video frames). Over the last decade, HAR systems have received a great deal of attention from the research community because of their application in many areas of pattern recognition and computer vision. However, accurately recognizing activities is still a major concern for most of them. This lack of accuracy can be attributed to various causes, such as the failure to extract the prominent features and the high similarity among different activities caused by low between-class variance in the feature space.
Accordingly, the purpose of this study was to propose an accurate and robust HAR system, called WS-HAR that is capable of exhibiting high recognition rate. The WS-HAR uses symlet wavelet transform, SWLDA, and HMM as its feature extraction, feature selection, and classification techniques respectively. Symlet wavelet can extract the most prominent information in the form of frequency from activity frames, and also it is a compactly supported wavelet on frames with the least asymmetry and highest number of vanishing moments for a given support width. Similarly, SWLDA helps the system in selecting the most significant features thereby reducing the high within class variance and increasing the low between class variance. HMM then uses these features to accurately classify the human activities. This model is capable of approximating the complex distributions using a mixture of full covariance Gaussian density functions.
The proposed WS-HAR system has been validated using two publicly available standard datasets (the Weizmann and KTH action datasets). The Weizmann action dataset consisted of nine activities, while the KTH action dataset consisted of six activities. Each activity clip was composed of several sequences of activity frames. All of these experiments were performed in the laboratory using offline validation. Although the system recognized each of the activities in all of these experiments with very high accuracy, its performance in a real environment is yet to be investigated. The system performance could degrade in real-life tests, especially with various viewing angles, dynamic backgrounds, and clutter (unnecessary objects in a test image). Therefore, further study is needed to address these issues in a real-time environment.
As mentioned before, we applied the WS-HAR system to two publicly available standard action datasets that are pose-based. In these datasets, every activity clip has only one subject performing the activity. However, the WS-HAR system may not work on real-world datasets such as UCF Youtube, Hollywood2, HMDB51, and ASLAN, because most of these datasets have more than one subject in each activity clip. Therefore, further research is needed to extend WS-HAR to such real-world datasets.