Machine Vision-Based Human Action Recognition Using Spatio-Temporal Motion Features (STMF) with Difference Intensity Distance Group Pattern (DIDGP)

In recent years, human action recognition has been modeled as a spatio-temporal video volume. Such approaches have expanded greatly due to their explosively evolving real-world uses, such as visual surveillance, autonomous driving, and entertainment. Specifically, the spatio-temporal interest points (STIPs) approach has been widely and efficiently used in action representation for recognition. In this work, a novel approach based on STIPs is proposed for action descriptors, i.e., the Two-Dimensional Difference Intensity Distance Group Pattern (2D-DIDGP) and Three-Dimensional Difference Intensity Distance Group Pattern (3D-DIDGP), for representing and recognizing human actions in video sequences. Initially, this approach captures the local motion in a video that is invariant to size and shape changes. The approach extends further to build unique and discriminative feature description methods to enhance the action recognition rate. Transformation methods, such as the discrete cosine transform (DCT), the discrete wavelet transform (DWT), and the hybrid DWT+DCT, are utilized. The proposed approach is validated on the UT-Interaction dataset, which has been extensively studied by past researchers. Classification methods, namely Support Vector Machine (SVM) and Random Forest (RF) classifiers, are then exploited. From the observed results, it is perceived that the proposed descriptors, especially the DIDGP-based descriptors, yield promising results on action recognition. Notably, 3D-DIDGP predominantly outperforms the state-of-the-art algorithms.


Introduction
Video acquisition technologies are becoming pervasive in our daily lives. Powerful digital cameras used in social media, traffic, security, and emergency monitoring are capable of capturing high-level details of people's faces and body posture for activity recognition [1]. However, these approaches employ machine learning architectures requiring high computational power, disregarding real-time performance and integration into embedded devices. Human action recognition is extensively investigated in real time and plays a vital role in many applications, such as video surveillance, human-computer interaction, and video content retrieval [2]. Human motion, typically a combination of translational and rotational motions of each body joint, contains much information inherent to humans [3,4]. In particular, motion similarity can be exploited by analyzing human motions in various applications. For example, motion similarity can be used for action recognition, to conclude whether a task is performed well, or to identify abnormal behavior [5]. A motion estimation system helps to match a target person across different cameras for re-identification. While human motion plays an imperative role in the tasks mentioned, motion similarity research has attracted less attention so far for the following reasons. Firstly, measuring motion similarity is a challenging problem: different camera views or human body structures produce different 2D joint coordinates even for similar motions, which makes it difficult to measure similarity using the joint coordinates directly. Secondly, the availability of large-scale datasets for learning motion similarity is limited. Lastly, there are few human motion datasets available for assessing the performance of different motion similarity computation methods.
Compared with traditional images, spatio-temporal action recognition has drawn increasingly more attention since it is robust against cluttered backgrounds and camera viewpoints.
Human body motions are represented as a sequence of 2D or 3D spatial coordinates, and they provide a good representation for describing human actions with motion or texture features [6]. The spatial data can be easily obtained by video cameras or pose estimation algorithms. The motion information of the joints is also an important cue to recognize the underlying action. Some actions, such as "hugging" and "punching", are challenging to recognize from spatial information alone, and the movements of body joints can be exploited to help the recognition. Since the spatial data are represented as joint coordinates, joint motion is easily calculated as the difference of coordinates along the temporal dimension. Figure 1 shows the three categories of human motion analysis. Hence, this work establishes transform-based approaches for human action recognition. Several studies have reported feature descriptor, representation, and classification methods for robust action recognition [7][8][9][10][11]. Understanding human behaviors remains a challenge due to diverse complex variables, such as perspective, size, rotation, shifts in tempo, differences in anthropometry, and cluttered contexts. The human body usually varies dramatically in size, physique, and appearance across different groups of actors performing the same action.

The biggest challenge in recognizing human activity is identifying and extracting the right and significant features. Recently developed deep learning techniques can extract, as well as select, the relevant features. The convolutional neural network (CNN) is one of the many deep learning techniques that have the benefits of local dependency and scale invariance and are appropriate for temporal data. The CNN excels at handling temporal data when compared to traditional machine learning techniques, which demand domain-specific knowledge [12,13].

Related Work
Several studies [1-3,14,15] have been conducted in recent years in the field of human behavior recognition to reduce manual effort and increase computational performance. Laptev [16] and Dollar [17] proposed space-time interest point detectors for action recognition, and these feature points showed discriminative properties such as appearance and position. Feature extraction methods for identifying behavior can generally be divided into four categories: geometry-based [18,19], motion-based [20,21], appearance-based [22,23], and space-time feature-based [17,24-26]. The geometry-based approaches use geometric points from the human body structure, which normally makes object segmentation and tracking difficult and time-consuming. The motion-based approaches model optical flow for action recognition, but foreground segmentation is required to reduce the effects of background flows. Motion patterns have also been considered an important cue for action recognition. The appearance-based approaches use silhouette information to recognize actions, but they are weak against cluttered backgrounds. The space-time feature-based approaches use space-time interest points to distinguish action categories. The majority of the work in human action recognition has been conducted on standard benchmark datasets, such as the KTH dataset [24] and the Weizmann dataset [27], which consist of various kinematic activities performed by a single actor against homogeneous backgrounds; some datasets, such as HMDB51 [28] and Hollywood2 [29], comprise realistic environments and have also been used for evaluation. Furthermore, another type of benchmark dataset, such as UT-Interaction [30], covers human-human interaction activities.
Laptev [16] used the idea of the Harris and Forstner interest-point operators to identify local structures with significant deviations in both the space and time domains. Dollar [17] used linear separable filters for detecting interest points in the local region, which respond to strong motion and space-time corners. Moreover, a Hough transform-based voting framework was proposed for action recognition that used spatio-temporal voting with extracted local X-Y-T features [31]. They performed recognition by voting with a collection of randomly learned trees in Hough space. In other work [32], an SVM classification method with χ² kernels was built on dense multiscale trajectories, extracting the dense trajectory (DT) form, gradient orientation histogram (HoG), flow orientation histogram (HoF), and motion boundary histogram (MBH) from image data. Finally, the visual codebook was learned from the training models. There are two well-known standard methods in the frequency domain: the discrete cosine transform (DCT) and the discrete wavelet transform (DWT). The DCT [33] has been extensively used in digital image processing, including image compression, various enhancement techniques, and segmentation. Similar to the discrete Fourier transform (DFT), it concentrates the large DCT coefficients in the low-frequency region and has excellent energy-compaction characteristics. The DWT [34] uses an orthogonal rule that can be applied to separate finite data into different frequency components: an approximation coefficient matrix (cA) and detail coefficient matrices, namely the horizontal (cH), vertical (cV), and diagonal (cD) coefficients.
Some work [35] presented a texture-based classification method using the mean and variance computed from the absolute DCT coefficient values of the entire image. A content-based image retrieval method was also proposed based on a quad-tree structure, using DCT coefficients as quad-tree nodes to represent the image features [36]. Here, wavelets helped in detecting significant points by representing the local properties of images. Multi-resolution wavelet decomposition was used with intensity-hue-saturation (IHS) and principal component analysis (PCA) to implement spatial detail in wavelet-based image fusion [37]. The transformation methods eliminate redundancy between neighboring pixels, which offers the advantage of determining uncorrelated transform coefficients. The existing works highlight that the geometry-based approaches are time-consuming for object segmentation and for tracking spatio-temporal interest points in three-dimensional patterns. Previous researchers limited their work to 3-4 actions, since many databases do not offer a wide variety of action sequences. Previous human activity recognition approaches also lack pre-processing steps that effectively filter data and increase classifier accuracy. Hence, this paper proposes a novel approach based on STIPs for action descriptors, representing and recognizing human actions in video sequences.

Contributions
This work addresses the action recognition problem by extracting spatio-temporal interest points. Initially, the approach captures the local motion in a video that is invariant to size and shape changes. Then, classification methods, such as Support Vector Machine (SVM) and Random Forest (RF) classifiers, are exploited. The approach extends further to build unique and discriminative feature description methods, along with PCA for feature dimension reduction, in order to enhance the action recognition rate. In this work, a novel approach based on STIPs is proposed for action descriptors, i.e., the Two-Dimensional Difference Intensity Distance Group Pattern (2D-DIDGP) and Three-Dimensional Difference Intensity Distance Group Pattern (3D-DIDGP), for representing and recognizing human actions in video sequences. In addition, transformation methods such as the DCT, DWT, and hybrid DWT+DCT are utilized. Predominantly, the 3D-DIDGP method outperforms the state-of-the-art algorithms.

Organization
The article begins with the background, a literature review, and the contributions relating to the proposed work. Following this, the proposed method is described in detail in Section 3. Further, this work presents a detailed experimental setup for the study in Section 4. Then, Section 5 illustrates the test results under different methods of operation. Subsequently, a comparative analysis of the observed results is carried out in Section 6. Finally, Section 7 concludes the article with the key observations.

Proposed Method
Motion patterns in the field of action recognition are identified based on changes in a subject's location over time, since motion information is an important cue for describing an action. Initially, the input video is converted to grayscale, and noise is removed to obtain fine features. All frames are smoothed by Gaussian convolution with a 5 × 5 kernel for successful feature extraction and classification. The frame-difference method is adopted to extract the motion features. In this work, difference intensity distance group pattern (DIDGP)-based 2D/3D cuboid extraction and transform-based descriptors (DCT, DWT, and hybrid DWT+DCT) are applied at each spatio-temporal interest point of an action sequence. Moreover, principal component analysis (PCA) is adopted to select the most discriminatory motion features to improve action recognition performance. Finally, Support Vector Machine (SVM) and Random Forest classifiers are used to classify the actions.

Identifying Motion by Frame Differencing
In video analysis research, object detection still remains an open problem. Since the objects in a video are generally moving, if an object moves relative to the camera viewpoint, its images may differ dramatically. This change may arise due to variation in target pose, variation in illumination, and partial or total occlusion of the target.
Initially, the input video is converted into frames, and the extracted color frames are converted to grayscale using the simple average method, adding the pixel values of the red (R), green (G), and blue (B) channels and dividing by three: (R + G + B)/3. Secondly, a 2D Gaussian smoothing operator with a 5 × 5 kernel is used to smooth the images and remove detail and noise; in this sense it is similar to a mean filter. Third, frame differencing is defined by the difference between consecutive frames in time: instead of subtracting a predefined or estimated background on the fly, the frame subtraction method considers every pair of frames at times t and t + 1 and extracts any motion between them. To find a region of interest or moving object in a video frame, the current frame is simply subtracted from the previous frame on a pixel-by-pixel basis. The difference image at time t between two consecutive frames is given by

D_t(x, y) = |I_{t+1}(x, y) − I_t(x, y)|,

with a pixel marked as motion when D_t(x, y) exceeds a threshold T, and 0 otherwise.

Overview of the Proposed Human Action Recognition Framework

Figure 2 shows the description of the proposed method. The approach begins by extracting the motion information, followed by identifying the interest points from the training video sequences. Depending on the structural distribution of interest points, each sequence generates the descriptors in the cuboid. Moreover, another set of features, called DIDGP in 2D and 3D, is extracted from the training sequences. The above measures are repeated during the process. For classifying the test sequences into the suitable kind of behavior based on the model developed in the training stage, SVM and Random Forest classifiers are adopted.
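The pre-processing steps above (average-grayscale conversion, 5 × 5 Gaussian smoothing, and thresholded frame differencing) can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation; the threshold value and the Gaussian sigma are assumptions.

```python
import numpy as np

def to_gray(frame_rgb):
    """Simple-average grayscale: (R + G + B) / 3, as in the text."""
    return frame_rgb.astype(np.float64).mean(axis=2)

def gaussian_kernel(size=5, sigma=1.0):
    """5 x 5 Gaussian smoothing kernel (sigma is an assumption)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def smooth(img, kernel):
    """Direct 2D convolution with zero padding; fine for a sketch."""
    ks = kernel.shape[0]
    pad = ks // 2
    padded = np.pad(img, pad)
    out = np.zeros_like(img, dtype=np.float64)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i + ks, j:j + ks] * kernel).sum()
    return out

def frame_difference(frame_t, frame_t1, threshold=25.0):
    """Binary motion mask: 1 where |I_{t+1} - I_t| > T, else 0."""
    d = np.abs(frame_t1 - frame_t)
    return (d > threshold).astype(np.uint8)
```

In use, each incoming frame would pass through `to_gray` and `smooth` before `frame_difference` is applied to consecutive smoothed frames.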


Interest Point Identification
This work applies the Harris interest point detector [38] due to its strong in-variance to rotation, scaling, illumination, and noise. Interest points in a video are constrained along both spatial and temporal dimensions.
The Harris corner detector is based on the local auto-correlation function. At a corner, the image intensity changes largely in multiple directions. For an image I, the algorithm calculates the change of intensity for a shift [u, v] as follows:

E(u, v) = Σ_{x,y} w(x, y) [I(x + u, y + v) − I(x, y)]²,

where w(x, y) is the window function at (x, y), I(x, y) is the intensity at (x, y), and I(x + u, y + v) is the intensity of the moved window at (x + u, y + v). It is required to capture corners with a maximum variation in intensity. Hence, the shifted image is approximated by a Taylor expansion, yielding a 2 × 2 structure matrix M, and finally a score is calculated to determine an interest point:

R = det(M) − k (trace(M))²,

where det(M) = λ₁λ₂ and trace(M) = λ₁ + λ₂. A window with a score R greater than a threshold is considered an 'interest point'. Figure 3 shows the interest points detected in a 'hug' sequence from the UT-Interaction dataset. The highlighted points, corresponding to local maxima of the response function, are known as spatio-temporal interest points. Cuboids obtained from the various actions are shown in Figure 4. It is also evident from Figure 5 that the actions can be clearly distinguished from these cuboids.
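A dense Harris response map can be sketched as below. This is a minimal numpy sketch under assumptions the paper does not specify: central-difference gradients, a 3 × 3 box window in place of the usual Gaussian window, and k = 0.04.

```python
import numpy as np

def harris_response(img, k=0.04):
    """Per-pixel Harris score R = det(M) - k * trace(M)^2, where M
    accumulates products of image gradients over a local window."""
    img = img.astype(np.float64)
    # Central-difference image gradients
    Ix = np.gradient(img, axis=1)
    Iy = np.gradient(img, axis=0)
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    def window_sum(a):
        # 3 x 3 box window w(x, y) via zero padding (a Gaussian
        # window is the more common choice)
        p = np.pad(a, 1)
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3))

    Sxx, Syy, Sxy = window_sum(Ixx), window_sum(Iyy), window_sum(Ixy)
    det_M = Sxx * Syy - Sxy * Sxy
    trace_M = Sxx + Syy
    return det_M - k * trace_M ** 2
```

Thresholding this map and taking local maxima would give the interest points described above.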
As seen in Figure 6, the cuboid (a spatio-temporal video patch) is extracted around each interest point and contains spatio-temporally windowed pixel values. The size of the cuboids in the space-time volume is set to four configurations (i.e., 49 (space) × 49 (space) × 14 (time), 49 × 49 × 21, 49 × 49 × 28, and 49 × 49 × 35) during detection. Using the information in each cuboid, it is easy to describe and build a valid action recognition model.
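Slicing such a cuboid out of a video volume can be sketched as follows. The centring convention and the lack of border handling are assumptions; real code would pad the volume or skip interest points near the boundary.

```python
import numpy as np

def extract_cuboid(video, x, y, t, size_xy=49, size_t=14):
    """Slice a (size_xy x size_xy x size_t) spatio-temporal patch
    centred spatially on interest point (x, y) and starting at
    frame t, from a video volume shaped (height, width, frames)."""
    half = size_xy // 2
    return video[y - half: y + half + 1,
                 x - half: x + half + 1,
                 t: t + size_t]

# One cuboid per configured temporal depth, as in the text
TEMPORAL_DEPTHS = [14, 21, 28, 35]
```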


Feature Extraction Procedure
Once the cuboids of spatio-temporal interest points are identified, feature extraction is performed, and various features, such as Difference Intensity Distance Group Patterns (DIDGP) in 2D/3D, discrete cosine transform (DCT) derivatives, and the hybrid DWT+DCT, are evaluated. As discussed above, the center block b(0, 0) is kept at the interest point detected using the procedure discussed in Section 3.3.1. A sampled patch of size 49 × 49 is positioned on the interest point, and the DIDGP features are extracted.

Distance Relationship Calculation
This approach finds the distance relationship between two blocks in the 7 × 7 block area of the extracted cuboid. Distances are calculated with respect to the center block b(0, 0), also called the reference block. The concept of neighboring pixels is applied to the blocks. The block b(x, y) at b(0, 0) has two horizontal and two vertical neighbors, given as (x + 1, y), (x − 1, y), (x, y + 1), and (x, y − 1) at positions b(1, 0), b(−1, 0), b(0, 1), and b(0, −1), respectively, as shown in Figure 7a,b. Every such block is at a unit block distance, as seen in Figure 7a. The four diagonal neighbors of b(x, y) at b(0, 0) are given by (x + 1, y + 1), (x + 1, y − 1), (x − 1, y + 1), and (x − 1, y − 1) at positions b(1, 1), b(1, −1), b(−1, 1), and b(−1, −1).

They are at a Euclidean distance of √2 from b(0, 0). In this work, the sampled patch size is M = N = 49, and each patch is divided into a 7 × 7 grid of blocks with m = n = 7 pixels.

Distance Group Pattern
It is denoted by the different pattern groups organized by their distance from the central block, as shown in Figure 8.

To summarize, the block distance is calculated from the central block, and eight different distance groups are computed from the 7 × 7 sub-block region. Two dominant and different edge directions in a local region define a corner. An interest point has a well-defined position in an image, at a local intensity maximum or minimum where the curvature is locally maximal. The Harris 2D detector was utilized, and the extracted features improve the performance of the activity recognition approach. The Harris corner detector uses a 2D Gaussian filter and 1D Gabor filters in the spatial and temporal directions, respectively. A response value is given at every position. Initially, various experiments are performed to fix the number of interest points for computation. The number of interest points is varied as n = 2, 3, 5, and 7. Good performance is obtained with n = 5, and increasing the number of interest points increases the computational complexity. Thus, for further analysis, the number of interest points is fixed at 5. The highlighted spatio-temporal interest points correspond to local maxima of the response function, as shown in Figure 3, for the local patches placed on the detected interest points in an action sequence. The 2D/3D DIDGP features are extracted from these patches, and the information contained in each patch is utilized to describe and build a valid model for action recognition.
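The block-distance grouping described above can be sketched as follows. How the paper merges the rings of a 7 × 7 grid into exactly eight groups is not fully specified, so this sketch simply keeps each distinct Euclidean distance as its own group.

```python
import numpy as np

def distance_groups(grid=7):
    """Group the blocks of a grid x grid neighbourhood by their
    Euclidean distance (in block units) from the central block
    b(0, 0). Returns {distance: [(dx, dy), ...]}."""
    half = grid // 2
    groups = {}
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            d = round(np.hypot(dx, dy), 6)  # quantise float distances
            groups.setdefault(d, []).append((dx, dy))
    return groups
```

For example, the horizontal/vertical neighbors fall in the distance-1 group and the diagonal neighbors in the √2 group, matching the neighbor definitions above.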

Signal Transformation Descriptors
The technique of time-frequency transformation helps in the conversion of the signal into various frequency components, making the features more accurate for the representation of the action. Further, it enriches the recognition ratio, and it has extensive usage of the image processing applications.
The discrete cosine transform (DCT) [35] characterizes an image data, sum of sinusoids of shifting magnitude, and frequency coefficients that are encoded individually for compression efficiency without dropping the information. It is mainly useful in the area of image processing, has many good properties like decorrelation and Energy compaction that removes the redundancy between neighboring pixels and discards the coefficients with relatively small amplitudes.
The 2D basis functions can be generated by multiplying the horizontally oriented 1D basis functions with a vertically oriented set of the same functions; for N = 8, the basis functions exhibit a progressive increase in frequency in both the vertical and horizontal directions. The top-left basis function assumes a constant value and is referred to as the DC coefficient; all other transform coefficients are called AC coefficients. The DCT is applied to the entire image to obtain a frequency coefficient matrix of the same dimension. In general, the DCT coefficients are divided into three bands, namely low, middle, and high frequencies. A transformation scheme must be able to pack the input data into as few coefficients as possible, which allows the quantizer to discard coefficients with relatively small amplitudes without introducing visual distortion in the reconstructed image. The DCT exhibits excellent energy compaction for highly correlated images: an uncorrelated image has its energy spread out, whereas the energy of a correlated image is packed into the low-frequency (top-left) region. Conventional coefficient selection approaches select fixed elements of the DCT coefficient matrix. For an M × N action frame, the feature extraction consists of two phases. In the first phase, the DCT is applied to the entire frame to obtain the DCT coefficients; a deterministic approach called zigzag scanning is used for coefficient selection in our work. The dimension of the DCT coefficient matrix is the same as that of the input frame.
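The DCT-plus-zigzag selection described above can be sketched as follows, assuming square blocks and the classic JPEG-style zigzag order; the number of retained coefficients is an assumption.

```python
import numpy as np
from scipy.fft import dctn

def zigzag_indices(n):
    """(row, col) visiting order of the classic zigzag scan for an
    n x n block: low-frequency coefficients come first."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

def dct_features(block, n_coeffs=10):
    """2D DCT of a block, then keep the first n_coeffs coefficients
    in zigzag order, as the text describes."""
    c = dctn(block.astype(np.float64), norm='ortho')
    order = zigzag_indices(block.shape[0])[:n_coeffs]
    return np.array([c[i, j] for i, j in order])
```

The first retained value is the DC coefficient; for a highly correlated block, most of the energy lands in these first few zigzag positions.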
The discrete wavelet transform (DWT) has gained widespread acceptance in signal processing and image compression. Wavelets differ from Fourier analysis in that they avoid pure sine and cosine transforms. As part of the harmonic-analysis wavelet family, the DWT breaks the signal into a series of basis functions and models complex phenomena effectively. The wavelet transform is determined independently at various frequencies for different segments in their respective time intervals. Multiresolution analyses are designed to provide good time resolution and poor frequency resolution at high frequencies and, in turn, good frequency resolution and poor time resolution at low frequencies; this suits high-frequency components of short duration and low-frequency components of long duration. Here, wavelets are produced from a single prototype wavelet ψ(t), called the mother wavelet, by dilations and shifts, and the resulting family of wavelets is given by [39]:
ψ_{a,b}(x) = (1/√a) ψ((x − b)/a),

where a is the scaling parameter and b is the shifting parameter, and the mother wavelet ψ_{a,b}(x) satisfies the admissibility condition. The DWT transforms grayscale images into the spatial and frequency domains at the same time: a signal x is decomposed by low-pass and high-pass filters q_l and q_h, each with half the cut-off frequency of the previous stage, and the transformation is applied recursively on the low-pass series until the desired number of iterations is reached. In the frequency domain, when an image is decomposed using the two-dimensional wavelet transform, four sub-regions are obtained: one low-frequency region, LL (approximate component), and three high-frequency regions, namely LH (horizontal component), HL (vertical component), and HH (diagonal component). The LL image is generated by two successive low-pass filters; HL is filtered by a high-pass filter first and a low-pass filter later; LH is created using a low-pass filter followed by a high-pass filter; and HH is generated by two successive high-pass filters. Subsequent levels of decomposition follow the same procedure by decomposing the LL sub-image of the previous level. Since the LL part contains the most important information and discards the effect of noise and irrelevant parts, the LL part is adopted for further analysis. In the proposed work, two-level 2D discrete wavelet decomposition is performed on the motion images. The dimension of the DWT coefficient matrix is the same as that of the input frame.
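The two-level LL-only decomposition described above can be sketched with a Haar wavelet, chosen here purely for simplicity (the paper does not name its wavelet, so the filter choice is an assumption). Input dimensions are assumed even.

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2D Haar DWT. Returns (LL, LH, HL, HH); each
    subband is half the input size in each dimension."""
    a = img[0::2, 0::2].astype(np.float64)
    b = img[0::2, 1::2].astype(np.float64)
    c = img[1::2, 0::2].astype(np.float64)
    d = img[1::2, 1::2].astype(np.float64)
    LL = (a + b + c + d) / 2.0   # approximation
    LH = (a - b + c - d) / 2.0   # horizontal detail
    HL = (a + b - c - d) / 2.0   # vertical detail
    HH = (a - b - c + d) / 2.0   # diagonal detail
    return LL, LH, HL, HH

def two_level_ll(img):
    """Two-level decomposition keeping only the LL band each time,
    as the proposed work does for motion images."""
    LL1, _, _, _ = haar_dwt2(img)
    LL2, _, _, _ = haar_dwt2(LL1)
    return LL2
```

On a constant image, all detail bands come out zero and only the LL band carries energy, which illustrates why the LL part is retained for further analysis.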
Hybrid DWT+DCT. The process for feature extraction using the hybrid DWT+DCT is as follows: the input motion frame of size M × N is transformed with the 2D-DWT to obtain the LL band; the LL band is then divided into 32 narrow-width bands, each of size 16. It is observed that dividing the image into 32 bands gives better results in terms of recognition accuracy. The 2D-DCT is applied on each band to obtain a small number of DCT coefficients with high energy compaction at the lower frequencies; the dominant magnitudes are obtained by arranging the coefficients in descending order within each band. The first dominant magnitude in each band carries different characteristics compared with the other magnitudes in that band. All hybrid DWT+DCT coefficients corresponding to the first dominant magnitude in each band of the motion frame are taken together as the feature vector.
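The band-splitting and dominant-magnitude selection can be sketched as follows. This is a hedged illustration (the function names and the use of an orthonormal 1-D DCT-II are assumptions of this sketch), not the exact pipeline used in the experiments:

```python
import numpy as np

def dct2_1d(x):
    """Orthonormal DCT-II of a 1-D signal (matrix form)."""
    N = len(x)
    n = np.arange(N)
    C = np.cos(np.pi * (2 * n + 1) * n[:, None] / (2 * N))  # C[k, n]
    coef = 2 * C @ np.asarray(x, dtype=float)
    scale = np.full(N, np.sqrt(1.0 / (2 * N)))
    scale[0] = np.sqrt(1.0 / (4 * N))                       # DC scaling
    return coef * scale

def hybrid_dwt_dct_features(ll_band, n_bands=32, band_size=16):
    """Split the flattened LL band into narrow bands, apply the DCT to
    each band, and keep the single dominant-magnitude coefficient per
    band, giving one feature per band."""
    flat = np.asarray(ll_band, dtype=float).ravel()[: n_bands * band_size]
    bands = flat.reshape(n_bands, band_size)
    feats = []
    for band in bands:
        coeffs = dct2_1d(band)
        feats.append(coeffs[np.argmax(np.abs(coeffs))])  # dominant magnitude
    return np.array(feats)
```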

Principal Component Analysis (PCA)
PCA is a powerful and significant statistical approach, used in many fields to identify patterns in high-dimensional data. PCA 'combines' the essence of the attributes by producing a replacement: a smaller set of variables [40,41], onto which the original data are projected. Assume that x_1, x_2, ..., x_P are P training vectors, each belonging to one of the N classes {ζ_1, ζ_2, ..., ζ_N}. A training vector x_p can be projected onto a lower-dimensional vector y_p using an orthonormal linear transform, y_p = W^T x_p. The transformation matrix W is constructed from the eigenvalues and eigenvectors of the covariance matrix Σ of the input data, which can be computed as

Σ = (1/P) Σ_{p=1}^{P} (x_p − µ)(x_p − µ)^T,

where µ is the mean vector of all the sample training images.
The covariance matrix has eigenvectors e_1, e_2, ..., e_K associated with the eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_K, where K is the feature vector dimension. The eigenvectors corresponding to the D largest eigenvalues form the transformation matrix W = [e_1 e_2 ... e_D]. A given test sample is projected onto the lower-dimensional space as a vector t, and a distance-match algorithm assigns it the class of the training feature vector x_{i_0}, where i_0 = arg min_{1≤i≤P} ||t − y_i|| and ||·|| denotes the Euclidean distance in R^D. The first eigenvector represents the direction of maximum variance of the data; the second eigenvector is orthogonal to the first and corresponds to the direction of the next largest variance.
In this work, we utilized the Principal Component Analysis (PCA) approach for linear feature extraction and unsupervised feature selection based on eigenvector analysis, to identify the original features that are critical for the principal components. This can dramatically impact the performance of machine learning.
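The PCA projection and nearest-neighbor class assignment described above can be sketched with NumPy (a minimal sketch; the function names are illustrative, and the covariance eigendecomposition follows the standard formulation):

```python
import numpy as np

def pca_fit(X, d):
    """Fit PCA: rows of X are training vectors. Returns (mean, W),
    where the columns of W are the d eigenvectors of the covariance
    matrix with the largest eigenvalues."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:d]      # indices of the d largest
    return mu, vecs[:, order]

def pca_classify(t, mu, W, Y, labels):
    """Project a test vector t and assign the label of the nearest
    projected training vector (Euclidean distance), as in the text."""
    y_t = W.T @ (t - mu)
    i0 = np.argmin(np.linalg.norm(Y - y_t, axis=1))
    return labels[i0]
```

Here `Y = (X - mu) @ W` holds the projected training vectors y_i.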

Classification Methods
In this work, the SVM and Random Forest classifiers are used to test classification efficacy on the UT-Interaction dataset. The classifiers applied are as follows:

Support Vector Machines
Support Vector Machine (SVM) is a widely used approach for classifying visual patterns in pattern recognition [42,43]. It achieves strong performance grounded in optimization theory [44], and is a central instance of kernel learning algorithms. Typically, the classification task involves training and testing samples. The training data (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) are separated into two classes, where x_i ∈ R^N is an N-dimensional feature vector and y_i ∈ {+1, −1} is the class label. SVM aims to predict the target values of the testing set by building a model. In binary classification, the hyperplane w·x + b = 0, where w ∈ R^n and b ∈ R, separates the two classes in the feature space Z [45]. The maximum margin is M = 2/||w||, as shown in Figure 9. Lagrange multipliers α_i (i = 1, 2, ..., m) help resolve the minimization problem; in turn, the optimal values of w and b are obtained from Equation (12).
In order to maximize the margin and minimize the training error, non-negative slack variables ξ_i are used. The soft-margin classifier is acquired by solving the optimization problem of Equations (13) and (14):

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^{m} ξ_i,   (13)

subject to y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., m.   (14)
If the training samples are not linearly separable, the input space is mapped to a high-dimensional space with a kernel function K(x_i, x_j) = φ(x_i)·φ(x_j) [42]. Some of the main characteristics of kernel functions are listed in Table 1, where γ and r are parameters of the inner kernel.
Multiclass SVM is handled by constructing N binary classifiers, each isolating one class from the rest. The ith class training set comprises positive labels for class i and negative labels for all other classes. The ith SVM solves the ith decision function given in Equation (12).
A grid search technique is used with the RBF kernel to select the values of the C and γ parameters in LIBSVM [45]. The ideal values of C and γ are not known in advance, and finding them is essential for the optimal performance of the classifier, where C is the penalty parameter that weights the slack variables and γ defines the curvature of the decision boundary.
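The grid search itself is straightforward to sketch. Below, `score_fn` is a hypothetical stand-in for the k-fold cross-validation accuracy of an RBF-SVM at a given (C, γ) pair; the helper name is an assumption of this sketch:

```python
import itertools

def grid_search(score_fn, C_grid, gamma_grid):
    """Exhaustively evaluate every (C, gamma) pair with a scoring
    function and return the best-scoring pair."""
    return max(itertools.product(C_grid, gamma_grid),
               key=lambda p: score_fn(*p))
```

LIBSVM-style searches typically use exponentially spaced grids, e.g. C ∈ {2⁻⁵, ..., 2¹⁵} and γ ∈ {2⁻¹⁵, ..., 2³}.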

Table 1. SVM inner product kernel types.

Types of Kernel    Inner Product Kernel
Linear             x_i^T x_j
RBF                exp(−γ ||x_i − x_j||²)
Sigmoid            tanh(γ x_i^T x_j + r)
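For reference, the inner-product kernels of Table 1 can be written directly (a minimal sketch; the RBF form, used for the experiments in this work, is included alongside the linear and sigmoid forms):

```python
import numpy as np

def linear_kernel(xi, xj):
    """K(xi, xj) = xi . xj"""
    return float(np.dot(xi, xj))

def rbf_kernel(xi, xj, gamma=0.5):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2)"""
    d = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-gamma * np.dot(d, d)))

def sigmoid_kernel(xi, xj, gamma=0.5, r=0.0):
    """K(xi, xj) = tanh(gamma * xi . xj + r)"""
    return float(np.tanh(gamma * np.dot(xi, xj) + r))
```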

Random Forest
Leo Breiman's [46] Random Forest is a collection of non-pruned classification trees, each built on a random sample of the training data. During tree induction, a random subset of features is selected at each split. Aggregation is performed by majority vote of the tree predictions for classification, and by averaging for regression. Breiman's Random Forest uniquely combines the random subspace method and bagging, using decision trees as the base classifier. The approach is not well suited to handling large numbers of dissimilar features, and overfitting can occur in noisy classification/regression tasks on certain datasets. In this work, the maximum number of trees is 100 and the depth is fixed at 50.

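The aggregation rule described above can be sketched in a few lines (an illustrative helper, not the implementation used in the experiments):

```python
from collections import Counter

def aggregate(tree_predictions, task="classification"):
    """Combine per-tree predictions: majority vote for classification,
    averaging for regression, as in Breiman's Random Forest."""
    if task == "classification":
        return Counter(tree_predictions).most_common(1)[0][0]
    return sum(tree_predictions) / len(tree_predictions)
```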

Experimental Setup
The evaluation of the proposed work is assessed using the UT-Interaction dataset on Set 1 and Set 2 (Figure 10). The experiments are performed with MATLAB R2019 on a Windows 10 computer with an Intel i7 processor and 16 GB RAM. In the feature detection step, cuboids are extracted around the interest points, which provide the information needed for the transform-based and DIDGP features. Moreover, this model is tested with real-time benchmark video sequences and processes 10 frames per second. When the computational power is increased to GPU-based systems, the algorithm can run at 25 to 30 frames per second.

Table 2. Confusion matrix (rows: predicted values; columns: actual values).

                     Actual Positive    Actual Negative
Predicted Positive   TP                 FP
Predicted Negative   FN                 TN

From the confusion matrix, statistical metrics (Precision, Recall, and F-measure) can be extracted to measure the performance of classification systems, defined as follows. Precision (P), or detection rate, is the ratio between correctly labelled instances and the total labelled instances. It is the percentage of positive predictions for a specific class that are correct, and is defined by

P = TP / (TP + FP),   (15)

where TP and FP are the numbers of true positive and false positive predictions for the particular class.
Recall (R), or sensitivity, is the ratio between correctly labelled instances and the total instances in the class. It measures the coverage of the prediction model and is also called the true positive rate. It is defined by

R = TP / (TP + FN),   (16)

where TP and FN are the numbers of true positive and false negative predictions for the particular class; TP + FN is the total number of test examples of the particular class.

3D-DIDGP
The spatio-temporal information generated within each cuboid is concatenated into a vector whose length matches the number of sub-blocks. The descriptor vectors are then projected to a lower-dimensional space over the cuboid duration.

2D-DIDGP
The spatio-temporal information generated by the DIDGP descriptor at each interest point is concatenated, and the resulting descriptor vectors are projected to a lower-dimensional space.

DCT
In the DCT feature detection step, the 2D pixel window extracted around each spatio-temporal interest point of the difference image has length 49 × 49 = 2401. This window is divided into sub-blocks of size 8 × 8 (64 pixels each), and the DCT is applied on each 8 × 8 sub-block to obtain its coefficients. The extracted descriptors of length 64 are projected to a lower-dimensional space, varying the dimension in steps of 5, to achieve the best recognition accuracy.
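The 8 × 8 DCT step can be illustrated with the DCT-II in matrix form (a sketch under the assumption of orthonormal normalization; function names are illustrative):

```python
import numpy as np

def dct_matrix(N=8):
    """Orthonormal DCT-II basis matrix."""
    n = np.arange(N)
    C = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    C *= np.sqrt(2.0 / N)
    C[0, :] /= np.sqrt(2.0)   # DC row scaling for orthonormality
    return C

def dct2(block):
    """2-D DCT of a square block, computed as C @ block @ C.T."""
    C = dct_matrix(block.shape[0])
    return C @ block @ C.T
```

For a constant block all the energy compacts into the single DC coefficient, which is the behavior the descriptor exploits.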

DWT
The Daubechies wavelet transform separates finite data into different frequency components: an approximation coefficient matrix cA and detail coefficient matrices cH (horizontal), cV (vertical), and cD (diagonal). As with the discrete cosine transform, the DWT is applied and results in a compact descriptor of length 48.

Hybrid DWT+DCT
In this method, DWT and DCT act together: the DCT is applied to the DWT output, which helps minimize redundancy and discriminate between actions more efficiently. Finally, the classification process uses the leave-one-out cross-validation (LOOCV) method for performance assessment of the non-linear support vector machine (SVM) with the RBF kernel. Here, the best parameters are chosen by 10-fold cross-validation in a grid search on the training data, and the Random Forest also uses the LOOCV approach for performance evaluation.
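LOOCV partitioning is simple to sketch: for n samples there are n folds, each holding out exactly one sample for testing (illustrative helper, name assumed):

```python
def loocv_splits(n):
    """Yield (train_indices, test_index) pairs for leave-one-out
    cross-validation over n samples."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i
```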

Dataset
The UT-Interaction dataset contains video sequences of six human-human interaction classes: shake-hands, point, hug, push, kick, and punch. There are a total of 20 video sequences of approximately 60 s in length. Each video includes at least one execution per interaction, providing an average of eight human activities per video. Participants with more than 15 different clothing styles perform in the videos. The videos are taken at a resolution of 720 × 480 and 30 fps, and a person's height is approximately 200 pixels in the video. The dataset is divided into two parts. Set 1 consists of 10 video sequences filmed in a parking lot with varying zoom rates; their backgrounds are almost static with minimal camera jitter. Set 2 (i.e., 10 additional sequences) is taken on a lawn on a windy day, with a slightly moving background and additional camera jitter. Each set has a different context, scale, and lighting.

Evaluation Metrics
The F-measure is the harmonic mean of precision and recall, and attempts to give a single measure of performance; a good classifier provides both high recall and high precision. The F-measure is defined as

F_β = ((1 + β²) · P · R) / (β² · P + R),

where β is the weighting factor. Here, β = 1, that is, precision and recall are equally weighted, giving the F_β-score known as the F1-measure.
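The three metrics follow directly from the confusion-matrix counts; a minimal sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 (the F-measure with beta = 1)
    from true positive, false positive and false negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```

For example, 8 true positives with 2 false positives and 2 false negatives gives P = R = F1 = 0.8.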

Transform Based Descriptor
A modern discriminative feature descriptor approach based on time-frequency transformation techniques is used to identify features more reliably for action representation. It also increases the detection rate and is commonly used in the field of image processing. Various transformation methods, namely the Discrete Cosine Transform (DCT), the Discrete Wavelet Transform (DWT), and their combination (Hybrid DWT+DCT), are employed for human action recognition owing to their excellent performance in image and video processing. Figure 11 illustrates the performance comparison of the transform-based descriptor methods using the RBF-kernel SVM and Random Forest: the results classified by SVM are shown in Figure 11a,b, and the results for Set 1 and Set 2 using the Random Forest classifier are shown in Figure 11c,d. The Hybrid DWT+DCT approach outperforms both the DCT and DWT techniques, reaching 81.04% and 71.69% on the two sets at dimension 35, respectively. Recognition accuracy increases with dimension up to a certain point, beyond which further dimensional increases bring no improvement. The SVM classification gives greater recognition accuracy than Random Forest; for the Hybrid DWT+DCT method, the SVM results exceed the Random Forest results by approximately 17% to 20% on both sets of UT-Interaction.


2D-DIDGP
The 2D-DIDGP features are extracted at each spatio-temporal interest point. PCA is applied for dimensionality reduction, and the resulting features are supplied to the SVM and Random Forest classifiers. Figure 12 displays the results of the projected 2D-DIDGP. Figure 12a exhibits the SVM results with the RBF kernel on the UT-Interaction dataset (Set 1 and Set 2): average accuracies of 89.87% and 87.59% were obtained with 2D-DIDGP at dimension 18 on Set 1 and Set 2, respectively. The performance improves up to a certain point and then decreases slightly; the best results were achieved with a dimension of 18. Figure 12b shows the results of the Random Forest classifier, which obtained 87.83% and 86.18% with 2D-DIDGP at dimension 15 on Set 1 and Set 2, respectively. Tables 3 and 4 show the classification results of the UT-Interaction dataset with the RBF-kernel SVM on Set 1 and Set 2; the correct responses lie along the diagonal entries of the tables. Most classes are predicted well, with some confusion between punch and push on Set 1. On Set 2, the results are predicted less clearly, and the actions point and punch are likely to be confused with push. In both cases the hand-based actions (point, punch, and push) are similar, and the confusion is therefore expected due to the similarity of the poses. Tables 5 and 6 show the confusion tables of the UT-Interaction dataset on Set 1 and Set 2 using the Random Forest; the action punch is confused with hug and push. On Set 2, most actions are not classified well, except handshake.

3D-DIDGP
The 3D-DIDGP cuboid features were extracted from each spatio-temporal interest point. Dimensionality reduction with PCA is performed by varying the dimension in steps of 15, and the projected features are fed to the SVM and Random Forest classifiers for a fair comparison. Figure 13 displays the results of the projected 3D-DIDGP under different temporal cuboid sizes (14, 21, 28, and 35). Figure 13a,b show the results of SVM with the RBF kernel on the UT-Interaction dataset (Set 1 and Set 2): average accuracies of 96.32% and 92.03% were obtained with 3D-DIDGP28 at dimension 30 on Set 1 and Set 2, respectively.
The accuracy hardly improves as the cuboid size increases. Figure 13c,d show the results of the Random Forest classifier, which obtained 90.86% and 82.60% with 3D-DIDGP28 at dimension 30 on Set 1 and Set 2, respectively. The cuboid size of 28 outperforms the other cuboid sizes, and the best result is achieved at a projected dimension of 30 for both classifiers.
The recognition capability of the 3D-DIDGP method with the RBF-kernel SVM is demonstrated by the confusion matrices for Set 1 and Set 2 given in Tables 7 and 8, respectively. On Set 1, 3D-DIDGP has outstanding recognition performance on point, punch, push, and hug; however, handshake, hug, and kick show some confusion. On Set 2, point and push are recognized well, whereas kick, hug, and handshake are likely to be confused with push. This is because, in the UT-Interaction dataset, certain sequences such as push and punch are correlated, and they are very difficult to differentiate even with human eyes.

The corresponding Random Forest confusion matrix for Set 1 is given in Table 9, where the majority of actions are appropriately categorized, but the discrimination of the handshake and kick actions is comparatively less effective. Table 10 shows the results on Set 2: most of the actions are not predicted well except the point action, and the greatest confusion is between hug, kick, handshake, push, and punch, which are difficult to segregate reliably. The 3D-DIDGP method with the RBF-kernel SVM outperforms the Random Forest classifier by 5.46% and 9.43% on Set 1 and Set 2, respectively.

Performance Analysis of Different Methods
The overall results reported by the SVM and Random Forest classifiers are shown in Figure 14a,b. The 3D-DIDGP approach outperformed the 2D-DIDGP and transform-based methods. For the 3D-DIDGP method, the result reported by SVM exceeds that of Random Forest by about 7-15%. These findings are not limited to the Random Forest classifier.

To measure the efficiency of the proposed method, its results are quantitatively compared with the most advanced results, as shown in Table 11. Based on the comparison, the proposed method shows good results on the UT-Interaction dataset. The experimental results validate its accuracy and efficiency in human action recognition and also indicate the potential of the proposed techniques. Moreover, the results indicate that 3D-DIDGP is quite promising and convincing for human action recognition in surveillance videos, as it extracts more reliable feature information than 2D-DIDGP and the transform-based descriptors. The 3D-DIDGP-based method outperforms the state-of-the-art recognition algorithms, obtaining 96.32% and 92.03% on the UT-Interaction dataset (Set 1 and Set 2). Table 11. State-of-the-art recognition accuracy (%) for the UT-Interaction (Set 1 and Set 2) datasets.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.