Automatic Recognition of Human Interaction via Hybrid Descriptors and Maximum Entropy Markov Model Using Depth Sensors

Automatic identification of human interaction from video sequences is a challenging task, especially in dynamic environments with cluttered backgrounds. Advancements in computer vision sensor technologies provide powerful capabilities for human interaction recognition (HIR) during routine daily life. In this paper, we propose a novel feature extraction method that incorporates robust entropy optimization and an efficient maximum entropy Markov model (MEMM) for HIR via multiple vision sensors. The main objectives of the proposed methodology are: (1) to propose a hybrid of four novel features, i.e., spatio-temporal features, energy-based features, shape-based angular and geometric features, and a motion-orthogonal histogram of oriented gradient (MO-HOG); (2) to encode the hybrid feature descriptors using a codebook, a Gaussian mixture model (GMM) and Fisher encoding; (3) to optimize the encoded features using a cross entropy optimization function; (4) to apply an MEMM classification algorithm that examines empirical expectations and the highest entropy, measuring pattern variances to achieve superior HIR accuracy. Our system is tested over three well-known datasets: the SBU Kinect Interaction, UoL 3D Social Activity and UT-Interaction datasets. Through extensive experimentation, the proposed feature extraction algorithm, along with cross entropy optimization, achieved average accuracy rates of 91.25% on SBU, 90.4% on UoL and 87.4% on UT-Interaction. The proposed HIR system is applicable to a wide variety of man-machine interfaces, such as public-place surveillance, future medical applications, virtual reality, fitness exercises and 3D interactive gaming.


Introduction
Human interaction recognition (HIR) deals with the understanding of communication taking place between a human and an object or other persons [1]. HIR includes an understanding of various actions, such as social interaction, person to person talking, meeting or greeting in the form of a handshake or a hug and the performance of inappropriate actions, such as fighting, kicking or punching each other. There are many different kinds of interactions that can easily be identified by human observations. However, in many situations, personal human observation of some actions is impractical due to the cost of resources and also to hazardous environments. For example, in the case of smart rehabilitation, it is more suitable for a machine to monitor a patient's daily routine rather than for a human to constantly observe a patient (24/7) [2]. Similarly, in the case of video surveillance, it is more appropriate to monitor human actions via sensor devices, especially in places where risk factors and suspicious activities are involved.

• Space and time based, i.e., spatio-temporal, features, in which displacement measurements between key human body points are taken as temporal features and intensity changes along the curved body points of silhouettes are taken as spatial features.
• Motion-orthogonal histogram of oriented gradient (MO-HOG) features, which are based on three different views of the human silhouette; these views are projected in orthogonal form and then HOG is applied.
• Shape-based angular and geometric features, which include angular measurements over two types of shapes, i.e., inter-silhouette and intra-silhouette shapes.
• Energy-based features, which examine the energy distribution of distinct body parts within a silhouette.
These hybrid descriptors are fed into a Gaussian mixture model (GMM) and Fisher encoding for codebook generation and proper discrimination among the various activity classes. We then apply a cross entropy algorithm, which yields an optimized distribution of feature matrices. Finally, the maximum entropy Markov model (MEMM) is embodied in the proposed HIR system to estimate the empirical expectation and highest entropy of different human interactions and achieve significant accuracy. Four experiments were performed using a leave-one-out cross validation method on three well-known datasets. Our proposed method achieved significant performance compared to well-known state-of-the-art methods. The major contributions of this paper can be highlighted as follows:
1. We propose hybrid descriptor features combining spatio-temporal characteristics, invariant properties, view orientation, displacement and intra-/inter-silhouette angular values to distinguish human interactions.
2. We introduce a combination of GMM and Fisher encoding for codebook generation and optimal discrimination of features.
3. We design a cross entropy optimization and an MEMM to analyze contextual information and better classify complex human interactions.
4. We perform experiments on three publicly available datasets, fully validating the efficacy of the proposed method, which outperforms other state-of-the-art methods, including deep learning.
The rest of the paper is organized as follows: Section 2 reviews related work in the field of HIR. Section 3 presents details of our proposed methodology. Section 4 reports the experimental setup, dataset descriptions and results. Section 5 presents a discussion of the overall paper. Finally, Section 6 concludes the proposed research work with some future directions.

Related Work
Recently, a great deal of work has been done by researchers on the development of HIR using multiple types of sensors. On the basis of the methods used to capture human interactions, we categorize the sensors in our related work into three major types: (1) wearable sensor-based HIR; (2) vision sensor-based HIR; and (3) marker sensor-based HIR.

Wearable Sensor-Based HIR Systems
In wearable sensor-based technology, many sensors (e.g., accelerometers, gyroscopes and magnetometers) are attached to the subject's limbs and body in order to examine interactions with the surroundings [26][27][28]. In [29], A. Howedi et al. proposed a unique HIR methodology based on different entropy measures, such as fuzzy, sample and approximate entropy. They achieved significant accuracy in entropy measurements for the detection and identification of human interactions. In [30], M. Ehatisham et al. designed an action recognition system based on K-nearest neighbors and SVM via multiple sensors, including RGB cameras, depth sensors and wearable sensors, for accurate recognition of human behaviors. H. Xu et al. [31] developed a wearable sensor-based HIR system that extracted various feature values via the Hilbert-Huang transform (HHT). The HHT spectrum features include frequency, amplitude, mean and energy values, which were tested on the PAMAP2 wearable sensor dataset. Experimental results showed that the multi-feature approach achieved better performance for HIR.
Motivated by the application of wearable sensors in health departments, a human motion detection system based on accelerometer measurements was proposed by A. Jalal et al. [32]. To extract features for each activity class, the axial components of the accelerometer are taken. After feature extraction, a random forest is applied to classify interactions, resulting in good performance in human motion detection. In order to recognize the physical activities of humans, wearable sensors are used by M. Batool et al. in [33], drawing on both gyroscope and accelerometer data.
They extracted statistical and Mel-frequency cepstral coefficient features. A combination of particle swarm optimization (PSO) and support vector machine (SVM) resulted in a better recognition rate. In order to solve the problem of feature selection and classification of sensor data, a genetic algorithm-based approach was used by M.A. Quaid et al. [34]. Statistical and acoustic features are extracted and then reweighted; after reweighting, biological crossover and mutation operations are applied. One self-annotated dataset is proposed in this work, and experiments on three benchmark datasets proved the efficiency of the proposed human behavior analysis system. Motivated by applications using wearable sensors for elderly care, S. Badar et al. proposed a wearable sensor-based activity monitoring system [35]. This system consists of inertial and motion-node sensors. Three types of features (binary, wavelet and statistical) are extracted. To optimize the features, adaptive moment estimation (Adam) and AdaDelta are applied. Experiments on two datasets were used for system evaluation, and the results showed better performance compared to other state-of-the-art systems.
In order to recognize daily activity, a smartphone with a built-in accelerometer was used by A.M. Khan et al. [36]. Two types of features, autoregressive coefficients and signal magnitude area, were extracted. Kernel discriminant analysis and an artificial neural network (ANN) were then used to accurately identify the activity class. Inspired by the applications of sensors embedded in smartphones, N.A. Capela et al. [37] proposed a human activity recognition system. In this work, sensor data were taken from patients and elderly people. Seventy-six signal features were extracted and then selected on the basis of feature selection methods. Three classifiers were used to evaluate the proposed methodology and the results reveal a better rate of accuracy. Motivated by healthcare and rehabilitation-based applications of human activity recognition, W. Jiang et al. proposed a wearable sensor-based method [38]. They collected signals from sensors in the form of activity images and applied a deep CNN for feature extraction. Evaluation on three benchmark datasets validated the performance of their system. However, these technologies face several limitations in HIR, such as discomfort and restricted movement for subjects due to the many wires and wearable sensors attached to their bodies [39]. Similarly, capturing full-body movements requires multiple sensors attached to the human body, which causes computational complexity in the system. Background noise picked up by wearable sensors during measurements is also incorporated into the data, resulting in numerous false predictions that affect decision making [40]. Therefore, instead of relying on wearable devices, vision-based sensor technologies have started gaining global attention as a solution in HIR studies.

Vision-Based HIR Systems
In vision-based HIR systems, video cameras are mounted for automated inspection of human interactions in various public areas (i.e., shopping malls, parks and roads). In [41], M. Sharif et al. proposed a human activity monitoring system. They used a fused feature algorithm that combines HOG, Haralick and binary-pattern features. Then, a novel joint entropy-based feature selection algorithm is used along with a recognizer engine (i.e., a multi-class SVM) to examine HIR behavior. In [42], O. Ouyed et al. extracted motion features from the joints of two persons involved in an interaction. They used multinomial kernel logistic regression to evaluate HIR using Set I of the UT-Interaction dataset. In [43], X. Ji et al. presented a vision-based HIR system using multi-stage probability fusion. They divided an interaction between two persons into a start state, an execution state and an end state. Through weighted fusion, better recognition accuracy rates were obtained.
S. Bibi et al. [44] proposed a multi-feature model along with median compound local binary patterns for an HIR system. They monitored individual actions through multi-view cameras and showed better human-human interaction recognition rates. N. Cho et al. [45] described a novel system for identifying complex human interactions. Their feature descriptors contained movements at the global, local and individual levels, and they detected points of significance in order to identify human motion. Experiments on two publicly available datasets with an SVM classifier showed that their system produced a better accuracy rate. A human activity recognition system based on depth sensors was proposed by O.F. Ince et al. [46]. Their system, which is based on joint-angle features, can detect activities in 3D space; a Haar-wavelet transform and a dimension reduction algorithm are also applied, and K-nearest neighbor (KNN) is used to recognize human actions. For human interaction recognition and tracking, a wise human interaction and tracking model was proposed by M. Mahmood et al. [47]. They extracted spatio-temporal and angular-geometric features. They evaluated their system on three benchmark datasets and, as a result, the performance of the system was better than many state-of-the-art systems.
In order to recognize human interactions in both indoor and outdoor environments, an RGB-based HIR system was introduced by Jalal [48]. Multiple features are proposed in this work and a convolutional neural network (CNN) was applied; the CNN proved better than other state-of-the-art classifiers. N. Nguyen et al. proposed an HIR system motivated by the performance of deep learning methods [49]. Hierarchical invariant features are extracted using independent subspace analysis (ISA) via a three-layer CNN. Through experimentation, they showed that their three-layer approach recognizes human interaction in complex environments better than other approaches. Motivated by the success of bag-of-words, an automated recognition system was proposed by K.N. Slimani et al. [50]. They extracted a 3D volume of spatio-temporal features, and each interaction between two persons is represented by the co-occurrence of words through their frequency. Inspired by the applications of information technology (IT) in the education sector, Jalal et al. proposed a student behavior recognition (SBR) system [51]. They extracted spatio-temporal features for identifying student-student interaction and tested their system on one self-annotated dataset and one RGB dataset. In [52], depth map-based person-person interaction is recognized: an interaction is divided into body-part interactions, regression-based learning is used to process each camera view, and then features from multiple views are combined. The efficacy of the system was evaluated on three public depth-based datasets.
These methods mentioned above are either implemented on single RGB data or have used a very small set of features. On the other hand, we propose a vision based HIR system that consists of hybrid features having generic properties for RGB as well as depth images. For experimental validation, we use two depth datasets and one RGB dataset that consist of complex interactions over indoor-outdoor environments.

Marker Sensor-Based HIR Systems
In marker-based HIR systems, different markers, such as light emitting diodes, infrared or reflective spheres, are attached to the human body in order to capture motion information [53]. These sensors are attached to targeted body regions, such as the joints or limbs of the human body. Many researchers have used marker sensors for human activity analysis, clinical diagnosis and in rehabilitation centers. For example, M.H. Khan proposed a marker sensor-based system to provide home-based therapy for patients [54]. Markers of different colors are attached to the individual's joints and motion information is recorded. Experiments were conducted on 10 patients, which validated the performance of the proposed system. In [55], color markers are used to track foot positions. The motions of different body parts are tracked with the help of marker sensors and then interaction information between the person and virtual surroundings is obtained. In order to analyze the upper limb function of patients with abnormal limbs, a combination of a hand skateboard device, an IR camera and an infrared emitter is used [56]. Experiments showed that this system is easy to use and that it delivers results immediately.
In order to perform biomechanical examinations and to capture motion in sports activities, marker-based optical sensors are used [57]; for system evaluation, data from collegiate and elite baseball pitchers were used. A Trunk Motion System (TMS) was developed by M.I. Esfahani [58]. They used body-worn sensors (BWS) to measure the 3D motions of the trunk. Their system is very lightweight, with 12 body-worn sensors attached to stretchable clothing, and seven actions were measured. Another marker-based system was proposed by J. Michael et al. [59]. In order to measure the physical activities of humans, an innovative wireless system was proposed by N. Golestani et al. [60]. They proposed a magnetic induction system to track human actions, with markers attached to the joints. Successful evaluations were performed using laboratory measurements and deep recurrent neural network monitoring.
These sensors provide very accurate position information, but they lack effectiveness for high-speed motion because they cannot read and produce data on factors such as acceleration, velocity and torque. Marker sensors generate precise results and have performed well in many clinical studies [61]. However, their performance is affected by surroundings, such as dust, temperature changes and vibrations [62].

Proposed System Methodology
In this section, we describe details of each process involved in the proposed HIR system. Firstly, raw image (i.e., RGB and depth) sequences are preprocessed to remove noise. Then the segmentation algorithms are applied to extract the foregrounds from the backgrounds. Secondly, after segmentation, four different types of features are extracted as hybrid descriptor features. These feature descriptors are then fed into a codebook generation algorithm. Thirdly, cross entropy algorithms are applied to optimize the quantized codebook. Finally, experiments are performed and MEMM is used to determine each interaction class. Figure 1 shows the complete system architecture of the proposed HIR methodology.
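The codebook-generation step mentioned above can be illustrated with a first-order Fisher-vector encoding of local descriptors under a diagonal-covariance GMM. This is a rough numpy sketch only: the component count, descriptor dimensionality and GMM parameters below are hypothetical placeholders, not the paper's actual configuration, and a fitted GMM is assumed to be given rather than trained here.

```python
import numpy as np

def gaussian_pdf(X, mu, var):
    # Diagonal-covariance Gaussian density evaluated at each row of X.
    d = X.shape[1]
    diff = X - mu
    expo = -0.5 * np.sum(diff ** 2 / var, axis=1)
    norm = np.sqrt((2 * np.pi) ** d * np.prod(var))
    return np.exp(expo) / norm

def fisher_vector(X, weights, means, variances):
    """First-order Fisher vector of descriptors X under a diagonal GMM."""
    n, d = X.shape
    k = len(weights)
    # Soft-assignment posteriors gamma[i, j] for descriptor i, component j.
    lik = np.stack([w * gaussian_pdf(X, m, v)
                    for w, m, v in zip(weights, means, variances)], axis=1)
    gamma = lik / lik.sum(axis=1, keepdims=True)
    fv = []
    for j in range(k):
        diff = (X - means[j]) / np.sqrt(variances[j])
        g = gamma[:, j:j + 1] * diff
        fv.append(g.sum(axis=0) / (n * np.sqrt(weights[j])))
    return np.concatenate(fv)          # length k * d

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # 200 toy hybrid descriptors of dim 4
weights = np.array([0.5, 0.5])         # assumed 2-component codebook
means = np.array([[-1.0] * 4, [1.0] * 4])
variances = np.ones((2, 4))
fv = fisher_vector(X, weights, means, variances)
print(fv.shape)                        # (8,)
```

In practice, second-order (variance) gradients and power/L2 normalization are usually appended as well; only the mean-gradient part is shown here.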


Image Acquisition
During image acquisition, we start with video normalization to extract human silhouette representations, applying various techniques for noise removal, scale handling and contrast adjustment. For these purposes, all image sequences are first cropped to a fixed dimension to remove unnecessary areas. To enhance image quality, the brightness and contrast distributions of both RGB and depth images are adjusted via histogram equalization to make the images clearer. Then, a mean smoothing filter [63] is applied, which replaces each pixel with the mean value of the current pixel and its neighboring pixels. The mean filter of an input image x is given through Equation (1):

y(i, j) = (1/M) Σ_{(m,n)∈W_ij} x(m, n), (1)

where y is the smoothed image, i and j are pixel coordinates, W_ij is the window centered on (i, j) and M is the window size, i.e., the number of pixels in the window.
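A direct numpy sketch of this mean filter follows. Here M denotes the side length of a square window (so the window contains M × M pixels), and edge-replicated padding is assumed at the borders, since the paper does not specify its border handling:

```python
import numpy as np

def mean_filter(x, M=3):
    """Smooth image x with an M x M mean (box) filter, edge-replicated padding."""
    pad = M // 2
    xp = np.pad(x.astype(float), pad, mode="edge")
    y = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # Average the M x M neighborhood around pixel (i, j).
            y[i, j] = xp[i:i + M, j:j + M].mean()
    return y

img = np.array([[10, 10, 10],
                [10, 100, 10],
                [10, 10, 10]], dtype=float)
print(mean_filter(img)[1, 1])   # 20.0  (mean of all nine pixels)
```

The double loop keeps the sketch transparent; a production version would use a separable or integral-image formulation.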

Silhouette Representation
For robust HIR, actual human interaction areas need to be extracted and target silhouettes distinguished from clutter [64]. To extract an efficient silhouette representation, we rely mainly on connected components, skin tone, region growing and color spacing [65]. Various algorithms are used for both RGB and RGB-D silhouette segmentation to improve the performance of the proposed system. We discuss these below.

Silhouette Segmentation of RGB Images
RGB silhouette segmentation is performed on the basis of pixel connectivity analysis and skin detection [66]. Initially, we detect human silhouettes by finding connected components in an image using 8-connected pixel analysis. This technique identifies horizontal, vertical and diagonal connections between pixels. Human silhouettes are then delimited by auto-generated bounding boxes based on height and width parameters. Next, to segment silhouettes from a noisy background, we apply coloring algorithms to identify all light-intensity colors, such as yellow, skin color and white. These pixels are then converted from RGB to the luminance-chrominance (YCbCr) color space, which is formulated as:

Y = 16 + 65.481R + 128.553G + 24.966B,
Cb = 128 − 37.797R − 74.203G + 112.000B, (2)
Cr = 128 + 112.000R − 93.786G − 18.214B,

where Y is luminance, Cb and Cr represent the blue-difference and red-difference chrominance, and R, G and B are scaled to [0, 1]. After identification, the light-intensity colors are converted to black. Then, we apply threshold-based segmentation, which works as region growing, to segment humans from the background. The full procedure of RGB silhouette identification and segmentation is shown in Figure 2.
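The RGB-to-YCbCr conversion above (the standard 8-bit ITU-R BT.601 form, which we assume is what the paper uses) can be sketched in numpy as a single matrix multiply plus offset:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """ITU-R BT.601 RGB -> 8-bit YCbCr for RGB values scaled to [0, 1]."""
    m = np.array([[ 65.481, 128.553,  24.966],
                  [-37.797, -74.203, 112.000],
                  [112.000, -93.786, -18.214]])
    offset = np.array([16.0, 128.0, 128.0])
    return rgb @ m.T + offset          # works per-pixel or on whole images

# A pure-white pixel maps to maximum luminance and neutral chrominance.
y, cb, cr = rgb_to_ycbcr(np.array([1.0, 1.0, 1.0]))
print(round(y), round(cb), round(cr))   # 235 128 128
```

Because the last axis is the channel axis, the same function accepts an (H, W, 3) image array, after which light-intensity pixels can be masked by thresholding Y.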


Silhouette Segmentation of Depth Images
For silhouette segmentation of depth images, we used Otsu's thresholding method, in which an image is divided into two classes, i.e., a background class and a foreground class [67]. In this method, multiple iterations over possible threshold values T are performed and the single value of T that best separates foreground and background pixels is chosen. To calculate T, both inter-class and intra-class variance are analyzed. The intra-class variance should be as low as possible, so it is minimized through Equation (3):

σ_w²(T) = w₀(T)σ₀²(T) + w₁(T)σ₁²(T), (3)

where w₀ and w₁ are the probabilities of the two classes separated by T, and σ₀² and σ₁² are the variances of the two classes. On the other hand, the variance between the two classes, i.e., the inter-class variance, should be as high as possible, as shown in Equation (4):

σ_b²(T) = w₀(T)w₁(T)[μ₀(T) − μ₁(T)]², (4)

where μ₀ and μ₁ are the class means. In this way, depth silhouettes are separated from their background. Figure 3 demonstrates an example of the depth silhouette segmentation of a kicking interaction from the SBU dataset.
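A minimal numpy implementation of Otsu's search follows, maximizing the inter-class variance of Equation (4) over candidate thresholds (which is equivalent to minimizing the intra-class variance of Equation (3)). The toy bimodal depth map is illustrative:

```python
import numpy as np

def otsu_threshold(img, levels=256):
    """Return T maximizing the inter-class variance w0*w1*(mu0 - mu1)^2."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()              # normalized gray-level histogram
    best_t, best_var = 0, -1.0
    for t in range(1, levels):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:         # skip degenerate splits
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0
        mu1 = (np.arange(t, levels) * p[t:]).sum() / w1
        var_b = w0 * w1 * (mu0 - mu1) ** 2
        if var_b > best_var:
            best_t, best_var = t, var_b
    return best_t

# Toy bimodal "depth map": half the pixels near, half far.
depth = np.concatenate([np.full(500, 40), np.full(500, 200)])
t = otsu_threshold(depth)
print(40 < t <= 200)   # True
```

For this two-valued image, any threshold between the modes gives the same inter-class variance, and the search returns the first such value.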

Hybrid Feature Extraction
After the extraction of silhouettes from complex backgrounds, we propose a novel hybrid feature extraction method. This method is a fusion of key-body point features and full-silhouette features. The spatio-temporal and angular-geometric features are based on key-body points, while the motion-orthogonal HOG and energy-based features are based on full silhouettes. These four novel features are extracted and discussed in the sub-sections below.

Spatio-Temporal Feature
Spatial features give information regarding changes with respect to space, location or position [68]. For spatial features, we measured intensity changes along the curve points of the body using the 8-directional Freeman chain code algorithm. These features are extracted along the boundary of the human silhouette because a small change in the position of a human silhouette results in changes in the curves of the silhouette. So, to extract spatial features, we first identified the boundaries of the two human silhouettes involved in the interaction. Then, all the curve points along the contours of both silhouettes were identified and represented using the 8-directional Freeman chain code. If we suppose that the boundary b is represented by n points, then the curve points C_b along the boundary, from starting point C₀, are

C_b = {C₀, C₁, ..., C_{n−1}}.

Starting from curve point C₀, we move in a clockwise direction along the boundary until we observe a change in direction. Suppose C₀ is the current curve point and C₁ is the next point: if the directions of C₀ and C₁ are the same, we move to the next curve point C₂; if they are not the same, we consider C₁ a feature point f (see Figure 4a). So, a curve point is taken as a feature point f if the difference between the chain codes of the current and next curve points is not zero. In this way, the spatial features cover almost all body parts of a human silhouette (see Figure 4b). Figure 4 demonstrates the overall procedure for finding feature points using the 8-directional Freeman chain code.
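The feature-point rule described above can be sketched on a toy clockwise contour. The 8-direction table and the square contour below are illustrative, not taken from the paper:

```python
# 8-directional Freeman chain code: list index = code, value = (drow, dcol).
DIRS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def chain_code(boundary):
    """Freeman codes between consecutive boundary points."""
    codes = []
    for (r0, c0), (r1, c1) in zip(boundary, boundary[1:]):
        codes.append(DIRS.index((r1 - r0, c1 - c0)))
    return codes

def feature_points(boundary):
    """A boundary point is a feature point when the chain-code direction changes."""
    codes = chain_code(boundary)
    return [boundary[i + 1] for i in range(len(codes) - 1)
            if codes[i + 1] != codes[i]]

# Toy closed square contour: corners are detected as direction changes.
square = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0), (0, 0)]
print(feature_points(square))   # [(0, 2), (2, 2), (2, 0)]
```

On a real silhouette the contour would come from a boundary-tracing step, and adjacent points always differ by one of the eight direction vectors, so `DIRS.index` is well defined.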
In order to find a feature point, we considered eight cases of 45° and four cases of 90° changes in the direction of each curve point. Figure 5 describes a few of the 45° and 90° direction changes, in which yellow arrows show the current curve point direction while blue arrows show the subsequent curve point direction.
Temporal features give information about changes with respect to time. To extract temporal features, displacement measurements between eight key-body points [69,70] are considered. Initially, our system tracked eight key-body points (head, left shoulder, right shoulder, left arm, right arm, left foot, right foot and torso) on the detected RGB and depth silhouettes. These silhouettes were converted to binary and their outer boundaries were identified. Then, positions such as the topmost, right-most, left-most, bottom-left-most, bottom-right-most and center points of a human silhouette were identified. Algorithm 1 presents the overall procedure used for the key-body point detection of human silhouettes.

Algorithm 1 Detection of key-body points of a human silhouette
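A minimal numpy sketch of the extreme-point part of this detection follows, assuming a clean binary silhouette mask; the paper's full algorithm also derives points such as the shoulders, arms and torso from these extremes, which are omitted here:

```python
import numpy as np

def key_body_points(mask):
    """Extreme points of a binary silhouette mask ((row, col) indexing)."""
    rows, cols = np.nonzero(mask)
    bottom = rows.max()
    bottom_cols = cols[rows == bottom]
    return {
        "topmost":           (int(rows.min()), int(cols[rows.argmin()])),
        "left_most":         (int(rows[cols.argmin()]), int(cols.min())),
        "right_most":        (int(rows[cols.argmax()]), int(cols.max())),
        "bottom_left_most":  (int(bottom), int(bottom_cols.min())),
        "bottom_right_most": (int(bottom), int(bottom_cols.max())),
        "center":            (int(rows.mean()), int(cols.mean())),
    }

mask = np.zeros((5, 5), dtype=int)
mask[1:5, 1:4] = 1                     # crude rectangular "silhouette"
p = key_body_points(mask)
print(p["topmost"], p["bottom_left_most"])   # (1, 1) (4, 1)
```

Ties (e.g., several pixels sharing the minimum row) are resolved by the first pixel in row-major order; a real implementation would pick, for example, the midpoint of the tied run.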
Algorithm 1 Detection of key-body points human silhouette Temporal features give information about changes with respect to time. In order to extract temporal features, critical displacement measurements between eight key-body points [69,70] are considered. Initially, our system tracked eight key-body points (head, left shoulder, right shoulder, left arm, right arm, left foot, right foot and torso) on detected RGB and depth silhouettes. These silhouettes were converted to binary and then the outer boundaries of silhouettes were identified. Then, different positions, such as the topmost, right most, left most, bottom left most, bottom right most and center point of a human silhouette, are identified. Algorithm 1 presents the overall procedure used for the key-body point detection of human silhouettes.  After identifying key-body points, position displacement measurement between all key-body points of the first person's silhouette (silhouette of person on left side) and all key-body points of the second person's silhouette (silhouette of person on right side) are measured as shown in Equation (5): where D(p, q) is Euclidian distance, p x and p y are x, y coordinates of the key body points of the first person's silhouette and q x and q y are x, y are coordinates of the second person's silhouette. As a person moves or performs any interaction, the distance between these key-body points may increase or decrease in values. Key-body points for both RGB and depth images are shown in Figure 6.
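A minimal sketch of the displacement measurement of Equation (5), assuming each silhouette's key-body points are given as a dictionary from a label to (x, y) coordinates (the labels and the dictionary layout are illustrative assumptions):

```python
import math

# Hypothetical labels for the eight tracked key-body points.
KEY_POINTS = ["head", "l_shoulder", "r_shoulder", "l_arm",
              "r_arm", "l_foot", "r_foot", "torso"]

def displacement(p, q):
    """Equation (5): Euclidean distance D(p, q) between two key-body points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def pairwise_displacements(first, second):
    """Distances between every key point of the first silhouette and every
    key point of the second (first, second: dicts of label -> (x, y))."""
    return {(a, b): displacement(first[a], second[b])
            for a in first for b in second}
```

As the interaction unfolds, re-computing these distances frame by frame yields the temporal displacement signal described above.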

Angular-Geometric Features
An angular-geometric feature is a shape-based entity defined over key-body point features. In order to extract angular and geometric features, seven extreme body points (head, left shoulder, right shoulder, left arm, right arm, left foot, right foot) are first identified. Then, three geometric shapes, i.e., pentagon, quadrilateral and triangle, are made by joining these extreme points. In angular features, we measure changes in the angular values between extreme point positions in consecutive frames. Two types of geometric shapes are made by joining these extreme points: inter-silhouette shapes and intra-silhouette shapes. Table 1 shows a detailed overview of the number of inter-silhouette and intra-silhouette shapes and angles that are made by joining the extreme body points of the silhouettes.

Table 1. Properties of inter-silhouette and intra-silhouette geometrical shapes. For each type of geometrical shape, the table lists the connected extreme points, the number of angles and a diagrammatical representation. The shape types are the Inter-Silhouette Triangle, Inter-Silhouette Quadrangular and Inter-Silhouette Pentagon, together with their intra-silhouette counterparts; in total, 6 types give 32 geometrical shapes and 136 angles. Extreme point labels: 1 Head of first (left) silhouette, 2 Right Shoulder of first silhouette, 3 Left Shoulder of first silhouette, 4 Right Arm of first silhouette, 5 Left Arm of first silhouette, 6 Right Foot of first silhouette, 7 Left Foot of first silhouette, 8 Head of second (right) silhouette, 9 Right Shoulder of second silhouette, 10 Left Shoulder of second silhouette, 11 Right Arm of second silhouette, 12 Left Arm of second silhouette, 13 Right Foot of second silhouette and 14 Left Foot of second silhouette.
Inter-silhouette shapes are made within each silhouette. These are geometric shapes made by connecting the extreme points of each silhouette individually. Intra-silhouette shapes are made between two silhouettes by connecting the extreme points of one silhouette with the extreme points of the second silhouette within each frame. After both types of geometric shapes are complete, the inverse cosine angle is measured for all these shapes, as shown in Equation (6):

θ = cos⁻¹((u · v) / (|u| |v|))     (6)

where u and v are the two vectors between which the angle is measured. After measuring the angles, the areas of all the inter-silhouette and intra-silhouette triangles are calculated. The area of a triangle is measured through Equation (7):

Area = √(S(S − a)(S − b)(S − c))     (7)

where a, b and c are the three sides of the triangle whose vectors are joined together, i.e., the three extreme points that make a triangle, and S = (a + b + c)/2 is the semi-perimeter of the triangle, i.e., half the length of the triangle's perimeter.
With the movement of each extreme point during an interaction, the area of each geometric shape may increase or decrease. So, angular and geometric features measure changes in the angles as well as changes in the area of each shape between consecutive frames. The rates of change of the angles and areas are more evident in interactions like fighting and kicking, because these involve rapid movements of the extreme points, compared to approaching and departing interactions, which involve less pronounced movements of the extreme points.
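The two measurements of Equations (6) and (7) can be sketched directly; this is an illustrative implementation for 2D vectors, not the paper's exact code:

```python
import math

def inverse_cosine_angle(u, v):
    """Equation (6): angle (in degrees) between vectors u and v
    via the inverse cosine of their normalized dot product."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(*u) * math.hypot(*v)
    return math.degrees(math.acos(dot / norm))

def triangle_area(a, b, c):
    """Equation (7): Heron's formula, with S the semi-perimeter."""
    s = (a + b + c) / 2.0
    return math.sqrt(s * (s - a) * (s - b) * (s - c))
```

For instance, two perpendicular vectors give an angle of 90°, and a 3-4-5 triangle has area 6, so tracking these quantities frame to frame captures the shape deformations described above.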

Motion-Orthogonal Histogram of Oriented Gradient (MO-HOG)
MO-HOG is a motion-based feature applied over full silhouettes. It was observed that, in most of the interactions, the postures of both humans' silhouettes remain similar. For example, in approaching, departing, pushing and talking, the front views of both humans look as if they are standing with only slight movements. Interactions like exchanging an object and shaking hands are hardly distinguishable from each other, and punching and pushing also involve similar body movements. Therefore, our system proposes a novel multi-view approach that includes the front, side and top views of both RGB and depth silhouettes, using a 3D Cartesian planes approach [71]. In order to incorporate motion data, we created RGB and depth differential silhouettes (DS) by taking differences between the top t, front f and side s views of two consecutive frames, as defined by Equation (8):

DS_v = F_c(v) − F_p(v),  v ∈ {t, f, s}     (8)

where F_c is the current frame and F_p is the previous frame. After taking the DS of the multi-view silhouettes, they are projected onto 3-dimensional (3D) Cartesian planes in the form of orthogonal shapes, as shown in Figure 7. These multi-view DS are fed into HOG to extract orientation features [72], which calculates the magnitude and gradient by dividing the image into 8 × 8 cells that are stored in a 9-bin histogram. A bar graph in Figure 8 shows the magnitude and orientation bins of different interactions.
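The two steps described here, differencing consecutive views per Equation (8) and binning gradient orientations per 8 × 8 cell, can be sketched as follows. This assumes each view is a grayscale NumPy array; the central-difference gradient and unsigned 0-180° binning are common HOG conventions, used here for illustration rather than as the paper's exact implementation:

```python
import numpy as np

def differential_silhouette(f_c, f_p):
    """Equation (8): difference between the same view (top, front or side)
    of the current frame Fc and the previous frame Fp."""
    return np.abs(f_c.astype(np.int16) - f_p.astype(np.int16)).astype(np.uint8)

def hog_9bin(cell):
    """9-bin histogram of oriented gradients for one 8 x 8 cell
    (unsigned orientations over 0-180 degrees, magnitude-weighted)."""
    c = cell.astype(float)
    gx = np.zeros_like(c)
    gy = np.zeros_like(c)
    gx[:, 1:-1] = c[:, 2:] - c[:, :-2]   # horizontal central differences
    gy[1:-1, :] = c[2:, :] - c[:-2, :]   # vertical central differences
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(ang, bins=9, range=(0, 180), weights=mag)
    return hist
```

A cell containing a vertical edge, for example, places all of its gradient magnitude in the 0° bin, which is the kind of per-interaction orientation signature visualized in Figure 8.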

Energy-Based Features
In energy-based features, the movements of human body parts are captured in the form of Energy Maps (EMs). EMs distribute the energy matrix over a set of [0-8000] index values across the detected silhouette. After energy distribution, a threshold-based technique is used in which only the higher energy index values, i.e., those greater than a specified threshold, are extracted into a 1D vector. Energy distribution is represented by Equation (9), where ER(v) is the 1D energy vector, N is the index number and In_R is the RGB value at index N. Energy distributions over some interactions of the SBU dataset are shown in Figure 9. In Figure 9a, most of the energy is distributed in the region of the hands. In Figure 9b, most of the energy is distributed in the left foot and the left shoulder because, when a person kicks, the upper body moves a little backward. Lastly, in Figure 9c, when the right silhouette starts punching, it moves forward, while the left silhouette moves backward as a reaction. Thus, energy distribution occurs at the hands and the head of the right silhouette and around the whole body of the left silhouette. These energy maps show the energy of the body parts involved in the interaction in red or darker colors, while those parts of the human body not involved during the interaction are in blue or lighter colors. Algorithm 2 explains the hybrid feature extraction algorithm.
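The thresholded extraction into a 1D vector can be sketched as below. The accumulation of absolute inter-frame differences as an energy map is an assumption made for this illustration; the paper's exact energy distribution over the [0-8000] index range is not specified here:

```python
import numpy as np

def energy_map(frames):
    """One plausible energy map: accumulated absolute inter-frame differences
    (an assumption for illustration; frames are grayscale NumPy arrays)."""
    fs = [f.astype(float) for f in frames]
    return sum(np.abs(b - a) for a, b in zip(fs, fs[1:]))

def energy_vector(em, threshold):
    """Flatten the energy map and keep only the index values whose energy
    exceeds the specified threshold, yielding the 1D vector ER(v)."""
    flat = em.ravel()
    return flat[flat > threshold]
```

Body parts that move (hands in a handshake, the kicking foot) accumulate large differences and survive the threshold, matching the red high-energy regions of Figure 9.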

Codebook Generation
After extracting the hybrid features of both RGB and depth images, the feature descriptors of all image sequences are combined to form a matrix representation. This matrix representation is so assorted and complex that it needs to be represented in a sorted and simpler way. Therefore, we applied Fisher vector encoding (FVC), based on a GMM, for codebook generation [73]. Initially, we applied the GMM to compute the mean and covariance of each class separately [74]. Based on the computed values, clusters of each class are generated. Thus, the probability density function (pdf) of the cluster of a d-dimensional vector x is defined through Equation (10):

p(x | θ) = Σ_{k=1}^{K} w_k N(x; µ_k, Σ_k)     (10)

where θ = {w_k, µ_k, Σ_k | k = 1, 2, . . . , K}, w_k is the weight of the kth Gaussian component, K is the total number of clusters, the mean value is represented by µ_k, the covariance matrix is given by Σ_k and N represents a d-dimensional Gaussian distribution. In addition, Expectation Maximization estimates the maximum likelihood of the GMM parameters. During Expectation Maximization, the soft assignment of each vector x_t to its Gaussian cluster k is learned through Equation (11):

γ_t(k) = w_k N(x_t; µ_k, Σ_k) / Σ_{j=1}^{K} w_j N(x_t; µ_j, Σ_j)     (11)

After applying the GMM, FVC is performed on the feature descriptors. X = {x_t, t = 1, 2, . . . , T} is a given feature set, while the gradient of the log likelihood ∇_θ of X with GMM parameters θ is given through Equation (12):

F_X = ∇_θ log p(X | θ)     (12)

where F_X is the feature vector. Now, the gradient vector is computed with respect to each mean µ_k and covariance σ_k, defined by Equations (13) and (14), respectively:

G_µk = (1 / (T√w_k)) Σ_{t=1}^{T} γ_t(k) (x_t − µ_k) / σ_k     (13)

G_σk = (1 / (T√(2w_k))) Σ_{t=1}^{T} γ_t(k) [((x_t − µ_k) / σ_k)² − 1]     (14)
Finally, all computed gradient vectors for the K components are combined to form the final encoded feature vector of dimension 2KD. Hence, the Fisher vector reduces the intra-cluster gap and increases the inter-cluster gap, which gives a more precise discrimination of each cluster. Figure 10 demonstrates the clusters formed by each interaction as a result of FVC over the SBU and UoL 3D datasets.
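The encoding pipeline of Equations (10)-(14) can be sketched with a diagonal-covariance GMM; here scikit-learn's `GaussianMixture` stands in for the EM fitting step, and the gradient formulas follow the standard Fisher vector form, so treat this as an illustrative sketch rather than the paper's exact implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Fisher vector of descriptor set X under a diagonal-covariance GMM:
    soft assignments (Equation (11)), then gradients w.r.t. each mean and
    variance (Equations (13) and (14)), concatenated into a 2*K*D vector."""
    T, D = X.shape
    q = gmm.predict_proba(X)                      # soft assignments gamma_t(k)
    mu, sigma2, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(sigma2)[None, :, :]
    g_mu = (q[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_sig = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    return np.hstack([g_mu.ravel(), g_sig.ravel()])

# Usage: fit the codebook on training descriptors, then encode each sequence.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8))
gmm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(train)
fv = fisher_vector(rng.normal(size=(60, 8)), gmm)
```

With K = 4 components and D = 8 dimensional descriptors, the encoded vector has the expected 2KD = 64 dimensions.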

Cross Entropy Optimization

In order to reduce the complexity of the Fisher encoded vectors, a cross entropy technique is implemented [75]. In cross entropy, a sample of a specified size is initially generated from the Fisher encoded vectors of each interaction class and an objective function is applied to that sample [76]. Then, more samples are extracted from the encoded vectors and their objective functions are compared. This process continues until the maximum number of iterations is reached or the best sample is obtained. The best sample of descriptors represents an interaction class with the optimal set of descriptors, so several iterations are performed until an optimal sample is generated. Cross entropy is measured between two probability distributions p and q, as represented through Equation (15):

H(p, q) = −Σ_x p_x log q_x     (15)

where p_x and q_x are the probabilities of event x (i.e., p_x is the actual or true probability and q_x is the predicted probability). Meanwhile, the Kullback-Leibler divergence D between the true and predicted probabilities is calculated by Equation (16):

D(p ‖ q) = Σ_x p_x log(p_x / q_x)     (16)

In this way, the difference between the true and the predicted probability of a given sample is calculated. The cross entropy between the predicted and true probability distributions of each class of the SBU dataset is shown in Figure 11.
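The two quantities compared during this sample-selection loop can be computed as below; a minimal sketch assuming p and q are discrete distributions over the same events, with all q_x > 0:

```python
import numpy as np

def cross_entropy(p, q):
    """Equation (15): H(p, q) = -sum_x p_x * log(q_x)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """Equation (16): D(p || q) = H(p, q) - H(p), the gap between
    the true and the predicted distribution."""
    p = np.asarray(p, float)
    entropy = -np.sum(p * np.log(p))
    return cross_entropy(p, q) - entropy
```

When the predicted distribution matches the true one, the divergence is zero, so minimizing it over candidate samples drives the iterations toward the optimal descriptor set described above.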

Classification Via MEMM
After obtaining an optimized representation of the vectors, they are fed into a maximum entropy-based classifier in order to determine the different interaction classes (see Algorithm 3). MEMM is a combination of both the HMM and the maximum entropy model [77]. It is a discriminative model in which a conditional probability, represented as P(S|S', X), is used to predict the interaction class. Each transition between state and observation in the MEMM is given through a log-linear model, which is represented in Equation (17):

P(S | S', X) = (1 / Z(X, S')) exp(Σ_k λ_k f_k(X, S))     (17)

where S is the current state, S' is the next state, X is an observation, f_k is a feature function of X and the possible S', Z(X, S') is a normalization factor that ensures the distribution sums to one and λ_k is the weight to be learned, associated with feature f_k. From these observations, it is clear that the MEMM depends not only on the current observation but also on the previously predicted interaction. Figure 12 shows the overall procedure of the MEMM over the different interaction classes of the SBU Kinect interaction dataset.
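A single MEMM transition under this log-linear model can be sketched as a normalized exponential of weighted feature sums; the per-state weight layout here is a simplifying assumption for illustration, not the paper's trained parameterization:

```python
import numpy as np

def memm_transition(weights, features, states):
    """Log-linear transition distribution over next states:
    exponentiated weighted feature sums, normalized by Z(X, S')
    so the probabilities over states sum to one."""
    scores = np.array([sum(l * f for l, f in zip(weights[s], features))
                       for s in states])
    exp = np.exp(scores - scores.max())     # subtract max for numerical stability
    return dict(zip(states, exp / exp.sum()))  # Z(X, S') is the denominator
```

Chaining these distributions frame by frame, seeded by the previously predicted state, gives the dependence on past predictions described above.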

SBU Kinect Interaction Dataset
The SBU Kinect interaction dataset [78] consists of RGB, depth and skeletal information for two persons performing interactions, collected by Microsoft Kinect sensors in an indoor environment. Eight types of interactions are performed: Approaching, Departing, Kicking, Punching, Pushing, Shaking Hands, Exchanging Object and Hugging. The overall dataset is quite challenging to interpret due to the similarity or close proximity of movements in the different interaction classes. The sizes of both RGB and depth images are 640 × 480. Additionally, the dataset has a total of 21 folders, where each folder consists of all eight interaction classes performed by a different combination of seven actors. The ground truth labels of each interaction class are also provided. Videos are segmented at the rate of 15 frames per second (fps). Figure 13 shows some examples of human interaction classes of the SBU dataset.

UoL 3D Dataset
The UoL 3D dataset combines three types of interaction: casual daily life, harmful and assisted living interactions [79]. It includes interactions such as handshake, hug, help walk, help stand-up, fight, push, conversation and call attention, performed by four males and two females. RGB, depth and skeletal information for each interaction is captured through a Kinect 2 sensor. Each folder contains 24-bit RGB images, depth images of both 8-bit and 16-bit resolution and skeletal information with 15 joints. There are ten different sessions of eight interactions performed by two subjects (in pairs), recorded in an indoor environment with 40-60 repetitions. This is a very challenging dataset and consists of over 120,000 data frames. Some snapshots of interactions from this dataset are shown in Figure 14.

UT Interaction Dataset
The UT interaction dataset [80] consists of only RGB data. It has six interaction classes: point, push, shake hands, hug, kick and punch, performed by several participants with different appearances. This dataset is divided into two sets, named UT-Interaction Set 1 and UT-Interaction Set 2. The environment of Set 1 is a parking lot and that of Set 2 is a windy lawn. Video is captured at a resolution of 720 × 480 at 30 fps. There are 20 videos per interaction, providing a total of 120 videos across the six interactions. Figure 15 demonstrates some examples of interaction classes from the UT-Interaction dataset.

Performance Parameters and Evaluation
In order to validate the methodology of the proposed HIR system, four different types of experiments with various performance parameters, i.e., recognition accuracy, precision, recall, F-score, computational time and comparison with state-of-the-art methods, were performed. Details and observations for each experiment are discussed in the following sub-sections.

First Experiment
In the first experiment, the optimized feature vectors are subjected to the MEMM in order to evaluate the average accuracy of the proposed system. We used the n-fold cross validation method for training/testing over the three benchmark datasets. Tables 2 and 3 show the accuracy of the interactions of the SBU and UoL datasets in the form of confusion matrices. Similarly, the recognition accuracies of UT-Interaction Set 1 and Set 2 are shown in Tables 4 and 5, respectively. The mean accuracy over the SBU dataset is 91.25%, the accuracy over UoL is 90.4% and the combined accuracy over UT-Interaction Set 1 and Set 2 is 87.4%. From the experimental results, it is observed that our hybrid features methodology, along with cross entropy optimization and the MEMM, recognizes human interactions clearly better. However, some confusion is observed between pairs of similar interactions, such as shaking hands and exchanging object, and punching and pushing, in the SBU dataset. In the UoL dataset, confusion is observed between the handshake and help stand-up interactions. Such confusion is due to the similarity in the body movements involved in these interactions. In the UT-Interaction dataset, there is confusion between shaking hands and point, and between push and punch, due to the similarities of these interactions. In addition, it is also observed that, when combinations of RGB and depth vectors were fed into the MEMM, we achieved better recognition rates than with RGB alone. The recognition rate over the RGB-only dataset, i.e., UT-Interaction (87.4%), is lower than those over the SBU and UoL datasets, which are 91.25% and 90.4%, respectively. Thus, incorporating depth information improves the accuracy rate.

Second Experiment
In the second experiment, the precision, recall and F1 score for each interaction class of the three datasets are evaluated, as shown in Table 6. It is observed that, in the SBU dataset, the approaching interaction has the lowest precision rate, 88%, and also the highest rate of false positives. This is because many periodic actions of other interactions, such as departing, shaking hands and exchanging object, are similar to the approaching interaction. On the other hand, the kicking interaction gives the most precise results, with a low false positive ratio of 3%. In the UoL dataset, the hug interaction gives the most precise result, 95%, because the periodic actions performed during the hug interaction differ from those of the other interactions in this dataset. The handshake and conversation interactions have the highest false positive ratios, 13% and 14%, respectively, because the body movements of the silhouettes during these two interactions are similar to many other interactions. Overall, comparing the three datasets, the precision, recall and F1 score ratios of both sets of the UT-Interaction dataset are lower than those of the SBU and UoL datasets.
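The per-class scores reported here follow the standard definitions; a minimal sketch computing them from per-class true positive, false positive and false negative counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a class with 8 true positives, 2 false positives and 2 false negatives scores 0.8 on all three measures; a high false positive count, as for the approaching interaction, lowers precision specifically.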

Third Experiment
In the third experiment, nine sub-experiments for each dataset were performed. Different combinations of two parameters (i.e., the number of states and the number of observations) were used to evaluate the performance of the MEMM, and comparisons are made in terms of time complexity and recognition accuracy. In the MEMM, each transition depends not only on the current state but also on the previous state. Therefore, increasing the number of states and observations affects the performance rate of HIR. Tables 7-9 compare the numbers of states and observations in terms of time complexity and recognition accuracy over the SBU, UoL 3D and UT-Interaction datasets. In Table 7, using four states and changing the number of observations from 10 to 30 gradually increased both computational time and recognition accuracy. These experiments were repeated for five and six states. Similarly, Table 8 uses 4-6 states and shows significant results for computational time and recognition accuracy with 15 to 35 observations. Table 9 presents the results of these experiments on Set 1 of the UT-Interaction dataset.
It is concluded from the third experiment that reducing the number of states to two reduces both recognition accuracy and computational time. On the other hand, increasing the number of states to six increases computational time with no change in accuracy. Similar patterns are noticed across Tables 7-9 (i.e., increasing the number of states and observations increases both computational time and accuracy).

Fourth Experiment
In the fourth experiment, we compared our proposed system in two parts. In the first part, the hybrid descriptor-based MEMM classifier is compared with other commonly used classifiers. In the second part, the proposed system is compared with well-known state-of-the-art HIR systems.
In the first part, the quantized feature vectors are given to the most commonly used classifiers (i.e., ANN, HMM and Conditional Random Field (CRF)) and compared with the MEMM to find the HIR accuracy rates for the interactions of each dataset. Figure 16 shows a comparison of recognition accuracies for each interaction class of the SBU dataset using all four classifiers. From Figure 16, it can be seen that the mean recognition accuracy for ANN is 87.3%, CRF is 90%, HMM is 85.3% and MEMM is 91.25%. It is observed that, in some interactions, such as exchanging object and shaking hands, CRF performed better than the MEMM. Additionally, ANN performed better in a few interactions, such as kicking and punching. Overall accuracy using the MEMM was higher than for the other classifiers.
Figure 17 shows the comparison of recognition accuracies for each interaction class using the UoL dataset. From Figure 17, the mean recognition accuracy of ANN is 82.75%, CRF is 88.5%, HMM is 86.37% and MEMM is 90.4%. It is observed that some interactions, such as fight in the case of ANN, handshake in the case of CRF and help walk in the case of HMM, achieved better recognition accuracy than the MEMM; however, the overall recognition rate was still higher with the MEMM.
Figure 18 shows the comparison of the four classifiers over the interaction classes of UT-Interaction Set 1 and Set 2. From Figure 18a,b, it is observed that the mean recognition accuracy rates of Set 1 and Set 2 of the UT-Interaction dataset are lower than those of the depth datasets. The mean accuracies on Set 1 of the UT-Interaction dataset are 79.16% with ANN, 84.3% with CRF, 82.8% with HMM and 88% with the MEMM classifier. Mean accuracies are further reduced on Set 2 of the UT-Interaction dataset due to the cluttered background of a windy lawn: the mean accuracy for ANN is 77%, CRF is 82.7%, HMM is 80.7% and MEMM is 86.8%.
Meanwhile, it is observed that the patterns of recognition accuracies for Set 1 and Set 2 are similar to those of the UoL and SBU datasets: the MEMM has the highest accuracy rate while ANN has the lowest, and the accuracy rates of the MEMM and CRF are comparable. Moreover, CRF, HMM and MEMM performed better in most of the interaction classes except for the fight interaction, where ANN has a better or nearly similar recognition rate. However, overall, the MEMM has the best recognition rates. Thus, it is concluded that MEMM-based classification performs best for HIR.
In the second part of this experiment, the proposed HIR system is compared with other well-known state-of-the-art systems. Table 10 presents a comparison of results for the SBU, UoL and UT-Interaction datasets.

Discussion
A unique HIR system is proposed in this research work. Four unique features are extracted from both RGB and depth silhouettes, and the efficiency of the proposed model is demonstrated through four types of experiments. However, certain challenges were faced during this research work. In the silhouette detection phase of the RGB frames, a connected components algorithm was used to identify connected objects. However, this algorithm does not always give the best results, as it confuses the white or light-colored clothes of an individual with a white wall background. Therefore, to tackle this problem, we applied a human skin detection algorithm as well as a pre-specified measurement ratio (i.e., height to width) of the human performers. This ratio is compared with the height-to-width ratio of the bounding boxes of the connected components. Even so, a fixed height-to-width ratio can still cause silhouette detection to fail due to frequent changes in the scale of the human posture.
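The bounding-box ratio check described above can be sketched as follows: label the connected foreground components of a binary mask and keep only those whose bounding box has a plausibly person-like height-to-width ratio. The ratio bounds and the small test mask are illustrative assumptions, not values from the paper.

```python
from collections import deque

def filter_components(mask, min_ratio=1.5, max_ratio=4.0):
    """Label 4-connected foreground components of a binary mask and keep
    only those whose bounding-box height/width ratio falls inside a
    plausible range for a standing person.  Returns (top, left, h, w)
    boxes of the kept components."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    kept = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                # BFS to collect one connected component.
                q = deque([(r, c)])
                seen[r][c] = True
                rows, cols = [r], [c]
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            rows.append(ny)
                            cols.append(nx)
                            q.append((ny, nx))
                bh = max(rows) - min(rows) + 1
                bw = max(cols) - min(cols) + 1
                if min_ratio <= bh / bw <= max_ratio:
                    kept.append((min(rows), min(cols), bh, bw))
    return kept

# One tall (person-like) blob and one wide (wall-like) blob.
mask = [[0] * 10 for _ in range(8)]
for r in range(1, 7):
    mask[r][1] = mask[r][2] = 1          # 6x2 blob, ratio 3.0 -> kept
for r in range(1, 3):
    for c in range(5, 10):
        mask[r][c] = 1                   # 2x5 blob, ratio 0.4 -> rejected
print(filter_components(mask))           # → [(1, 1, 6, 2)]
```

This also illustrates the failure mode noted above: a fixed ratio window rejects valid silhouettes whenever posture or camera distance pushes the box outside the assumed bounds.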

Conclusions and Future Work
In this paper, we have proposed a novel HIR system to recognize human interactions in both RGB and depth environmental settings. The main accomplishments of this research work are: (1) adequate silhouette segmentation; (2) identification of key human body parts; (3) extraction of four novel features, i.e., spatio-temporal, MO-HOG, angular-geometric and energy-based features; (4) cross-entropy optimization and recognition of each interaction via the MEMM. In the first phase, RGB and depth silhouettes are identified separately. For RGB silhouette segmentation, skin color detection, connected components and binary thresholding methods are applied to separate humans from their background. After the extraction of silhouettes, the spatio-temporal features are extracted, in which the displacement between key body points is measured via Euclidean distance. In the angular-geometric features, various geometric shapes are formed by connecting the extreme points of the silhouettes, and the angles of these shapes are measured for each interaction class. After that, MO-HOG features are extracted, in which differential silhouettes are projected from three different views and HOG is applied. Finally, unique energy features are extracted from each interaction class. A hybrid of these feature descriptors results in a very complex vector representation; to reduce this complexity, GMM-based Fisher vector coding is applied and then cross-entropy optimization is performed.
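The Euclidean-displacement idea behind the spatio-temporal features can be sketched in a few lines: for each key body point, measure how far it moves between consecutive frames. The key-point names and coordinates below are illustrative, not the paper's exact key-point set.

```python
from math import dist  # Python 3.8+

def displacement_features(frame_a, frame_b):
    """Spatio-temporal displacement sketch: Euclidean distance travelled
    by each key body point between two consecutive frames.  Both frames
    map point names to (x, y) pixel coordinates."""
    return {name: dist(frame_a[name], frame_b[name]) for name in frame_a}

# Two illustrative consecutive frames of one performer's key points.
f0 = {"head": (50, 20), "hand_r": (70, 60), "foot_r": (55, 120)}
f1 = {"head": (53, 20), "hand_r": (74, 57), "foot_r": (55, 120)}
print(displacement_features(f0, f1))
# head moved 3 px, hand_r moved 5 px, foot_r stayed still
```

Concatenating such per-point displacements over a sequence yields the kind of high-dimensional descriptor that motivates the GMM-based Fisher vector coding step described above.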
During experimental testing, four different types of experiments were conducted on three benchmark datasets to validate the performance of the proposed system. In the first experiment, recognition accuracies for the interaction classes of each dataset were measured. In the second experiment, the F1 score, precision and recall of each interaction class were measured and compared. In the third experiment, computation time and accuracy were measured by changing the number of states and observations of the MEMM classifier. Finally, in the fourth experiment, recognition accuracies for the interaction classes of each dataset were measured via the most commonly used classifiers (i.e., ANN, HMM and CRF) and compared with the MEMM. Results showed better performance, with average recognition rates of 87.4% for the UT-Interaction, 90.4% for the UoL and 91.25% for the SBU datasets, validating the efficacy of the proposed system. The proposed system is applicable to various real-life scenarios, such as security monitoring, smart homes, healthcare and content-based video indexing and retrieval.
In the future, we plan to extend the proposed method to group interactions as well as human-object interactions. We also intend to explore entropy-based features and to evaluate the system on more challenging datasets.