Modeling Two-Person Segmentation and Locomotion for Stereoscopic Action Identiﬁcation: A Sustainable Video Surveillance System

Abstract: Due to the constantly increasing demand for automatic tracking and recognition systems, there is a need for more proficient, intelligent and sustainable human activity tracking. The main purpose of this study is to develop an accurate and sustainable human action tracking system that is capable of reliably identifying human movements irrespective of the environment in which those actions are performed. Therefore, in this paper we propose a stereoscopic Human Action Recognition (HAR) system based on the fusion of RGB (red, green, blue) and depth sensors. These sensors provide additional depth information, which enables three-dimensional (3D) tracking of every movement performed by humans. Human actions are tracked according to four features, namely, (1) geodesic distance, (2) 3D Cartesian-plane features, (3) joints Motion Capture (MOCAP) features and (4) way-point trajectory generation. In order to represent these features in an optimized form, Particle Swarm Optimization (PSO) is applied. After optimization, a neuro-fuzzy classifier is used for classification and recognition. Extensive experimentation is performed on three challenging datasets: the Nanyang Technological University (NTU) RGB+D dataset, the University of Lincoln (UoL) 3D social activity dataset and the Collective Activity Dataset (CAD). Evaluation of the proposed system showed that fusing vision sensors with our unique features is an efficient approach towards developing a robust HAR system, achieving mean accuracies of 93.5% on the NTU RGB+D dataset, 92.2% on the UoL dataset and 89.6% on the Collective Activity dataset. The developed system can play a significant role in many computer vision-based applications, such as intelligent homes, offices and hospitals, and surveillance systems.


Introduction
Vision-based Human-Computer Interaction (HCI) is a broad field covering many areas of computer vision, such as human action tracking, face recognition, gesture recognition, human-robot interaction and many more [1]. In our proposed methodology, we focus on vision-based human motion analysis and representation for Human Action Recognition (HAR). HAR can be precisely defined as tracking the motion of each observable body part involved in performing human actions and identifying the activities performed [2]. HAR further subdivides into atomic actions, two-person interactions, multiperson interactions, human-object interactions, human-robot interactions, etc. [3,4]. In the proposed system, we focus on two-person interactions, i.e., human-human interaction. Extensive research has been carried out in the field of vision-based HAR systems, but there remains a need for an adaptive and sustainable system. Table 1 summarizes the phases of the proposed system.

Table 1. Phases of the proposed Human Action Recognition (HAR) system.

Phase: Silhouette Segmentation
Techniques: Background subtraction and morphological operations
Description: Efficient silhouette segmentation is executed on the RGB and depth frames via frame differencing and morphological operations, respectively.

Phase: Feature Descriptors Mining
Techniques: Geodesic distance; 3D Cartesian plane; joints MOCAP; way-point trajectory generation
Description: Geodesic maps are generated based on the shortest distance from the center points of the two human silhouettes towards the outer boundary. RGB and depth silhouettes are projected onto an altered Cartesian plane to represent features from different views. The geometrical properties of each human joint are taken to record human locomotion. The shape and motion information of each way-point trajectory generated from subsets of skeletal joints is recorded with each changing frame; both inter-silhouette and intra-silhouette trajectory generation are implemented.

Phase: Optimization
Techniques: Particle Swarm Optimization (PSO)
Description: The tracked motion descriptors of each action class are represented in an optimized form via the PSO algorithm, which is applied for feature selection and size reduction, removing redundant features to increase classification performance and to save computation time and space.

Phase: Classification
Techniques: Neuro-Fuzzy Classifier (NFC)
Description: An extensive experimental evaluation on challenging RGB-D datasets is performed with the NFC. System validity is demonstrated with varying numbers of membership functions.
The main contribution of each phase of this research work is as follows:

1. Segmentation of both RGB and depth silhouettes is achieved via background subtraction and a series of morphological operations.

2. Feature extraction from full human silhouettes is performed via geodesic maps and 3D Cartesian planes, while feature extraction from skeletal joints is performed via way-point trajectories and orientation angles. These features record each movement performed by the two interacting human silhouettes.

3. Feature selection is performed on the combined feature descriptors of the four proposed features via PSO.

4. Extensive experimentation is performed to prove the system's validity via classification with a neuro-fuzzy inference system, the effects of different numbers of membership functions, and sensitivity, specificity and error measures.
In the rest of the paper, Section 2 presents related work in the field of HAR. Section 3 provides the details of each phase of the proposed methodology. Section 4 explains the experiments performed and the results generated. Finally, Section 5 concludes the paper.

Literature Review
This section gives an extensive review of the different methodologies that have been adopted in recent years for human action tracking and recognition [35]. Vision-based human activity tracking can be subdivided at two stages: (1) on the basis of the source of input and (2) at the feature extraction and recognition stage.

Devices for HAR Data Acquisition
On the basis of data acquisition, vision-based HAR systems are divided into two categories: (1) RGB-based HAR and (2) RGB-D-based HAR.

RGB-Based HAR Systems
Many HAR systems that only work on RGB datasets for experimentation and validation have been proposed in recent years [36][37][38]. Table 2 summarizes the authors, datasets and research work relevant to these systems.

Table 2. RGB-based HAR methods.

Xiaobin et al. [39] (CAD; Choi's dataset): A learning-based methodology was proposed in which the interaction matrix of each activity was represented and a multitask interaction response (MIR) was computed for each class separately. A Support Vector Machine (SVM) was used as a baseline and MIR for classification. Mean accuracies of 83.3% with CAD and 80.3% with Choi's dataset were achieved.

Qing et al. [40] (UT-Interaction dataset): A global feature-based approach was presented in which a combination of Gaussian time-phase features was used. Multifeature fusion was performed with ResNet (Residual Network) and parallel inception. Experiments were performed via SVM with Kalman tracking; an overall recognition rate of 91.7% was achieved.

Amir et al. [41] (UCF YouTube action dataset; IM-DailyRGBEvents dataset): Spatiotemporal multidimensional features were used for both body-part detection and action recognition. With a maximum entropy Markov model, recognition rates of 89.09% with the UCF dataset and 88.26% with the IM Event dataset were achieved.

Kishore et al. [42] (UCF 50): A scene context approach was applied; motion features were used along with the fusion of descriptors at early and late stages.

(UT-Interaction dataset): After identifying the starting and ending frames, spatiotemporal features were extracted from human key body points as well as from full-body silhouettes. With an Artificial Neural Network (ANN) and one-third training validation testing, better recognition was achieved in six classes, with average accuracies of 83.5% on Set 1 and 72.5% on Set 2.

RGB-D-Based HAR Systems
Many HAR systems are based on datasets that combine both RGB color and depth information [44]. RGB-D sensors also provide skeletal information [45]. Table 3 shows the details of authors, datasets and research work based on RGB-D sensors using the combination of both RGB and depth images.

Table 3. HAR systems based on RGB-D sensors.

Rawya et al. [46] (MSR Daily Activity 3D dataset; Online RGBD action dataset): Spatio-temporal features were extracted using a Bag-of-Features (BoF) approach; points of interest were detected and motion history images were created. Using K-means clustering and a multiclass SVM, average recognition rates of 91.1% with the MSR dataset and 92.8% with the Online RGBD dataset proved the efficacy of the system.

Jalal et al. [47] (MSRAction3D): Shape and motion features were extracted from human silhouettes using temporal continuity constraints. As a result of experimentation with a Hidden Markov Model (HMM), this approach proved effective, with a mean recognition rate of 82.10%.

Xiaofei et al. [48] (SBU Kinect interaction dataset; UT-Interaction dataset): Interaction was divided into three stages, namely, start, middle and end, and probability fusion-based features were extracted. Extensive experiments via HMM yielded 91.7% accuracy on the SBU dataset and 80% on the UT-Interaction dataset.

Meng et al. [49] (NTU RGB+D; SBU Kinect interaction dataset; M2I dataset): With the help of skeletal and depth data, pairwise feature learning was introduced and the relative movement between body parts was extracted. With a linear SVM classifier, all activity classes were recognized with higher accuracy rates than many state-of-the-art systems.

Claudio et al. [50] (UoL 3D social activity dataset): Social activity was determined via statistical and geometrical features such as skeletal positions and motion features. The proposed novel features with HMM proved very effective in social interaction recognition, with a mean accuracy of 85.5%.

Division on the Basis of Feature Extraction and Recognition
Some researchers have applied hand-crafted features and machine learning methods for feature extraction and recognition, respectively, in vision-based systems. On the other hand, some researchers have applied deep learning approaches for both feature learning and activity recognition [51]. So, on the basis of feature extraction methods, HAR can be divided into two methodologies: (1) machine learning-based HAR systems and (2) deep learning-based HAR systems.

Machine Learning-Based HAR Systems
In this section, HAR systems based on machine learning approaches (supervised, unsupervised and semi-supervised) are presented. Table 4 shows details of authors, datasets and research work based on hand-crafted features and machine learning approaches. In some HAR systems, features are learned and actions are recognized automatically through deep learning models; Table 5 shows details of authors, datasets and research work based on feature learning and activity recognition via deep learning approaches. In these works, comparisons with four baselines and state-of-the-art methods were performed, and the high accuracies achieved with three datasets proved the validity of the novel approaches presented.

Materials and Methods
A comprehensive description of each phase is given in this section. It is organized into the following phases:

• In the preprocessing phase, the human silhouettes of each RGB and depth image are segmented from their backgrounds.
• In the feature descriptor generation stage, four features (geodesic distance, 3D Cartesian plane, way-point trajectory and joints MOCAP) are mined from each RGB and depth image, and feature descriptors are generated.
• The optimization phase produces an optimized representation of the feature descriptors via PSO.
• In the final stage, each human action is classified via a neuro-fuzzy inference system.

Figure 1 shows the flow diagram of the proposed human action surveillance system.


Foreground Extraction
Prior to any processing, all RGB and depth image sequences are subjected to an image normalization technique to improve image quality [62,63]. Image contrast is adjusted, and intensity values are uniformly distributed through the entire image via histogram equalization [64,65]. After that, in order to remove noise from the image, a median filter is applied, in which each pixel is replaced by the median of its neighboring pixels [66,67]. The most important step in any HAR system is to define and mine Regions of Interest (ROI) [68]. In our work, an ROI consists of the two persons involved in an interaction in the RGB-D images. These ROIs are first segmented from their background. The methods adopted to segment the human silhouettes are given separately in the following subsections.

Background Subtraction
RGB silhouette extraction for all three datasets is achieved through a background subtraction method [69]. A frame difference technique is used in which the current frames of each interaction class are subtracted from a background frame [70]. Pixels of the current frame I(t) at time t, denoted by P[I(t)], are subtracted from the pixels of a background frame, denoted by P[B], as given in Equation (1):

P[F(t)] = P[I(t)] − P[B], (1)

where P[F(t)] is the frame obtained after subtraction. The subtracted image, i.e., the image containing the human silhouettes, is further processed for better foreground detection by specifying a threshold value T, as given in Equation (2):

P[F(t)] = 1 if |P[I(t)] − P[B]| > T, and 0 otherwise. (2)

The T value is selected automatically for each subtracted image via Otsu's thresholding method [71]. In this method, the subtracted frame is first converted to a grayscale image and the best T value (the value that best separates the black background pixels from the white foreground pixels) is obtained through an iterative process. This T value is then used to convert the subtracted grayscale image to a binary image, yielding a binary silhouette. Examples of RGB silhouette extraction from the NTU RGB+D and UoL datasets are shown in Figure 2.
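The frame-differencing and Otsu-thresholding steps above can be sketched as follows. This is a minimal numpy illustration, not the authors' pipeline code; `otsu_threshold` and `segment_silhouette` are hypothetical helper names.

```python
import numpy as np

def otsu_threshold(gray):
    """Exhaustively search the threshold T that maximizes the
    between-class variance, as in Otsu's method."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0          # background mean
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1     # foreground mean
        var = w0 * w1 * (mu0 - mu1) ** 2                    # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def segment_silhouette(frame, background):
    """Frame differencing followed by automatic Otsu thresholding:
    returns a binary silhouette mask."""
    diff = np.abs(frame.astype(int) - background.astype(int)).astype(np.uint8)
    t = otsu_threshold(diff)
    return (diff > t).astype(np.uint8)
```

In practice the same effect is obtained with an image library's built-in Otsu threshold; the explicit search is shown here only to make the iterative selection of T concrete.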
Sustainability 2021, 13, x FOR PEER REVIEW

Morphological Operations
In order to extract depth silhouettes, threshold-based segmentation [72] is first used to obtain a binary image from the original image. These segmented images are closed morphologically using binary dilation followed by a binary erosion operation [73]. Binary dilation works by adding pixels to the human edges, while erosion works by removing boundary pixels. Binary dilation and erosion are given by Equations (3) and (4), respectively:

A ⊕ B = {z | (B̂)z ∩ A ≠ ∅}, (3)

A ⊖ B = {z | (B)z ⊆ A}, (4)

where z is the set of pixel locations at which the structuring element B, or its reflection B̂, overlaps the pixels of the foreground element A during translation to z. In this way, only the shape of the main objects in the image is maintained. Finally, Canny edge detection is applied to separate foreground pixels from the background. After edge detection, small-area objects are removed from the binary image, resulting in human silhouette detection. The silhouette segmentation of the depth images from the UoL dataset is shown in Figure 3.
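A minimal sketch of the morphological closing used above (dilation followed by erosion), assuming numpy and a square (2k+1) × (2k+1) structuring element; the function names are illustrative, not the authors' implementation.

```python
import numpy as np

def dilate(mask, k=1):
    """Binary dilation: a pixel becomes foreground if any neighbour
    within the (2k+1)x(2k+1) structuring element is set."""
    mask = mask.astype(bool)
    padded = np.pad(mask, k)
    out = np.zeros_like(mask)
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out |= padded[k + dy : k + dy + mask.shape[0],
                          k + dx : k + dx + mask.shape[1]]
    return out

def erode(mask, k=1):
    """Binary erosion: a pixel survives only if the whole structuring
    element fits inside the foreground."""
    mask = mask.astype(bool)
    padded = np.pad(mask, k)
    out = np.ones_like(mask)
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out &= padded[k + dy : k + dy + mask.shape[0],
                          k + dx : k + dx + mask.shape[1]]
    return out

def close_silhouette(mask):
    """Morphological closing: dilation followed by erosion,
    filling small holes inside the silhouette."""
    return erode(dilate(mask))
```

`scipy.ndimage.binary_closing` provides the same operation off the shelf; the explicit loops are shown only to mirror Equations (3) and (4).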


Feature Descriptors Mining
Segmented RGB-D silhouettes are then used for feature mining. Unique features are extracted from full silhouettes and from the skeleton joints. Two features, namely, geodesic and 3D Cartesian-plane, are applied over full human silhouettes. Two features, namely, way-point trajectory generation and joints MOCAP are applied to the skeleton joints. Each feature is explained in detail in the following subsections.

Geodesic Distance
In this feature, actions between two interacting humans are represented via geodesic wave maps. These maps are generated by calculating the geodesic distance (the smallest distance), which is found by a Fast Marching Algorithm (FMA) [74]. Firstly, the center point s of the two human silhouettes is located and given a distance value d(s) = 0. Point s is the starting point and is marked as visited. All other pixel points p on the human silhouettes are given a distance value d(p) = ∞ and are marked as unvisited. Each unvisited point p is taken from the neighbors of s and its distance from s is measured. In this way, each neighboring pixel is taken in each iteration until all pixel points are marked as visited. The distance calculated at each iteration is compared with the distance of the previous iteration. A priority queue is maintained in which the shortest distances are given priority [75]. The updated distance D(k, ℓ) at grid point (k, ℓ) solves

max(D(k, ℓ) − dx, 0)² + max(D(k, ℓ) − dy, 0)² = 1,

where dx = min(D(k+1, ℓ), D(k−1, ℓ)) and dy = min(D(k, ℓ+1), D(k, ℓ−1)). Figure 4 demonstrates the wave propagation of the geodesic distance via FMA.
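The visited/unvisited sweep described above can be illustrated with a Dijkstra-style traversal over silhouette pixels. This is a simplified stand-in for the Fast Marching update (4-neighbour unit steps instead of the quadratic eikonal solve), with illustrative names.

```python
import heapq
import numpy as np

def geodesic_map(mask, seed):
    """Propagate distances outward from `seed` over True pixels of `mask`,
    always expanding the smallest tentative distance first (priority queue)."""
    dist = np.full(mask.shape, np.inf)
    dist[seed] = 0.0
    pq = [(0.0, seed)]
    while pq:
        d, (y, x) = heapq.heappop(pq)
        if d > dist[y, x]:
            continue  # stale queue entry, already improved
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                    and mask[ny, nx]):
                nd = d + 1.0  # unit step; FMA would solve the quadratic update here
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    heapq.heappush(pq, (nd, (ny, nx)))
    return dist
```

Pixels outside the silhouette keep an infinite distance, so the resulting map is exactly the "wave" that expands from the silhouette center to its outer boundary.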


3D Cartesian-Plane Features
In this feature, shape as well as motion information from the two human silhouettes is captured [76]. 3D shapes of the segmented RGB-D silhouettes are created by projecting them onto a 3D Cartesian plane. Motion information from the two interacting persons is retained via a frame differencing technique that takes the differences between the 3D shapes of two consecutive frames. After creating the 3D shapes, a Histogram of Oriented Gradients (HOG) is applied to extract unique features. In order to apply HOG [77], all images are first preprocessed to a size of 64 × 128 pixels. Bounding boxes are drawn around each human in the image and the gradient of each human is calculated separately. These features measure both the position and the direction of change at each pixel. The magnitude is given as:

g = √(g_x² + g_y²), (7)

where g_x and g_y are the gradients, i.e., the changes in the x and y directions at each pixel, from which the directional angle is also obtained. Pseudo code for the full-body feature extraction techniques (geodesic and 3D Cartesian plane) is given in Algorithm 1. The direction of change is shown through red marks on the 3D shapes in Figure 5.

Algorithm 1. Full-body feature extraction (geodesic and 3D Cartesian-plane features):
6: for all the other (unvisited) points on the human silhouette, initialize d(x) = ∞
7: initialize a queue Q = X of unvisited points
8: while Q ≠ ø
9: Step 1: locate the vertex with the smallest distance d as x = argmin_{x ∈ Q} d(x)
10: Step 2: for each neighboring unvisited vertex x′ ∈ N(x) ∩ Q, update its distance
11: Step 3: remove x from Q
13: end while
14: return the distance vector d(x_i) = d_L(x_0, x_i)
//3D Cartesian-plane features//
15: project each frame f in F onto the 3D Cartesian plane yz
16: for each projected 3D frame, subtract the current frame f_yz from the successor frame (f + 1)_yz to get the differential frame diff ← (f + 1)_yz − f_yz
17: end for
18: for each differential frame diff, calculate the HOG vector from the gradient, magnitude, orientation and histogram as:
19: Gradient (diff, grad_x, grad_y)
20: Magnitude (grad_x, grad_y, mag)
21: Orientation (grad_x, grad_y, orient)
22: Histogram (orient, mag, hist)
23: Normalization (hist, normhist)
24: HOG vector ← normhist
25: end for
26: compute the full-body feature descriptor V for each frame f as V ← concatenate (distance vector, HOG vector)
27: end for
28: return the full-body feature descriptors (V_1, V_2, . . . , V_n)
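The gradient, magnitude, orientation, histogram and normalization steps of Algorithm 1 (lines 19 to 23) can be sketched for a single image cell as follows. This is an illustrative, simplified HOG cell computation in numpy, not the authors' implementation; `hog_cell_descriptor` is a hypothetical name.

```python
import numpy as np

def hog_cell_descriptor(cell, n_bins=9):
    """Compute a normalized orientation histogram for one image cell:
    central-difference gradients, magnitude g = sqrt(gx^2 + gy^2),
    unsigned orientation, magnitude-weighted histogram, L2 normalization."""
    gx = np.zeros_like(cell, dtype=float)
    gy = np.zeros_like(cell, dtype=float)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]       # horizontal gradient
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]       # vertical gradient
    mag = np.sqrt(gx ** 2 + gy ** 2)               # Eq. (7)
    orient = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned 0..180 degrees
    bins = np.minimum((orient / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```

A full 64 × 128 HOG descriptor is obtained by tiling the image into cells, computing one such histogram per cell and concatenating the block-normalized histograms.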

Joints MOCAP
Joints MOCAP features are used to track the movements of human joints because joints are the most significant parts involved in human movements [78]. We represent the skeleton as S = {J_k | k = 1, 2, . . . n}, where n consists of sixteen major human joints, namely, head, neck, right shoulder, left shoulder, right elbow, left elbow, right hand, left hand, spine-mid, spine-base, left hip, right hip, right knee, left knee, right foot and left foot. A joint is represented as J_k = (x_j, y_j), which specifies its coordinate location in the RGB-D silhouettes. After locating all joint positions in both human silhouettes, geometrical properties are measured between a joint J_i and the remaining joints J_k, where k ≠ i. A total of thirty-two joints (sixteen per person in an interaction) are tracked with each changing frame over time t. Two types of angular measurements are taken to track skeletal joint movements with each changing frame:

• Upper body angles: human motion caused by the rotation of the spine-mid joint with respect to (w.r.t.) all the upper body joints, namely, head, neck, left shoulder, right shoulder, left elbow, right elbow, left hand and right hand, is tracked. Four upper body angles per person, i.e., eight per frame, are tracked. The angle of the tangent between the spine-mid joint and two other joints taken from the joint set S is calculated. The inverse tangent is found from the dot product of the two lines v_1 and v_2, as represented by Equation (8):

θ = tan⁻¹(‖v_1 × v_2‖ / (v_1 · v_2)). (8)

• Lower body angles: the angles of the tangent from the spine-base joint to all the lower body joints (left hip, right hip, left knee, right knee, left foot and right foot) are calculated. Three lower body angles per person, i.e., six per frame, are tracked.
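The inverse-tangent angle between two joint vectors can be computed as below; a small numpy sketch with illustrative names, assuming 2D joint coordinates.

```python
import numpy as np

def tangent_angle(center, j_a, j_b):
    """Angle (degrees) at `center` between the lines toward joints j_a and j_b,
    via the inverse tangent of |cross product| over dot product."""
    v1 = np.asarray(j_a, dtype=float) - np.asarray(center, dtype=float)
    v2 = np.asarray(j_b, dtype=float) - np.asarray(center, dtype=float)
    cross = v1[0] * v2[1] - v1[1] * v2[0]   # 2D cross product (scalar)
    return np.degrees(np.arctan2(abs(cross), np.dot(v1, v2)))
```

Using `arctan2` instead of dividing and calling `arctan` keeps the computation stable when the two lines are nearly perpendicular (dot product close to zero).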

Way-Point Trajectory Generation
A lot of research has been done on dense trajectories [79] and localized trajectories [80]. We, however, introduce the concept of new intra-silhouette and inter-silhouette localized way-point trajectories. In both trajectory types, a subset S containing a different number of human joints is given as way-points to generate trajectories. Curved trajectories are generated at a specified orientation. First of all, two joint sets, J1 and J2, are created, where J1 = {j1, j2, … jn} is constructed from the n = 16 joints (head, neck, right shoulder, left shoulder, right elbow, left elbow, right hand, left hand, spine-mid, spine-base, left hip, right hip, right knee, left knee, right foot and left foot) of the first (left) person in an interaction, and J2 = {j1, j2, … jm} is constructed from the m = 16 joints of the second (right) person. In intra-silhouette trajectory generation, the subset S consists of way-points from a single joint set, i.e., either J1 or J2. In inter-silhouette trajectory generation, the subset S consists of way-points from both joint sets J1 and J2. Table 6 gives a detailed description of each intra-silhouette and inter-silhouette way-point trajectory clustered around the human joints.

After construction of all the trajectories over the human joints, two types of features are extracted from each trajectory [81]. Shape descriptors are obtained by calculating changes in displacement along the length T of the trajectory over time t. These changes are measured along the coordinate positions x and y of the joints with each changing frame, given as ∆l_t = (x_{t+1} − x_t, y_{t+1} − y_t). The normalized displacement vector is given as:

d′ = ∆l_t / Σ_j ‖∆l_j‖,

where the sum runs over the displacements along the trajectory. Motion descriptors are computed by tracking changes in velocity w.r.t. time. Velocity is measured by changes in position (i.e., displacement) of the trajectories over time t, so first- and second-order derivatives of the trajectory coordinates are taken as (x′_t, y′_t) and (x″_t, y″_t), respectively. The final curvature C over the space-time coordinates x and y is defined as:

C = (x′_t y″_t − y′_t x″_t) / (x′_t² + y′_t²)^{3/2}.

Pseudo code of the feature extraction from skeletal joints is given in Algorithm 2. Figure 7 displays the curved intra-silhouette and inter-silhouette way-point trajectories over the human joints.

Table 6. Intra-silhouette way-points trajectory generation.
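The shape (normalized displacement) and motion (curvature) descriptors described above can be sketched as follows. A minimal numpy illustration; the derivative scheme (`np.gradient`) and the exact normalization are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def trajectory_descriptor(xs, ys):
    """Shape descriptor: displacement vectors normalized by the sum of their
    magnitudes. Motion descriptor: curvature from first- and second-order
    derivatives of the trajectory coordinates."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    dx, dy = np.diff(xs), np.diff(ys)          # delta_l_t = (x_{t+1}-x_t, y_{t+1}-y_t)
    total = np.sum(np.hypot(dx, dy))
    shape = np.stack([dx, dy], axis=1)
    if total > 0:
        shape = shape / total                  # normalized displacement vectors
    x1, y1 = np.gradient(xs), np.gradient(ys)  # first derivatives
    x2, y2 = np.gradient(x1), np.gradient(y1)  # second derivatives
    denom = (x1 ** 2 + y1 ** 2) ** 1.5
    curvature = np.zeros_like(denom)
    nz = denom > 0
    curvature[nz] = (x1 * y2 - y1 * x2)[nz] / denom[nz]
    return shape, curvature
```

A straight-line trajectory yields zero curvature everywhere, while turns in the way-point path show up as curvature peaks, which is what makes this a useful motion descriptor.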


Algorithm 2 (excerpt): Feature extraction from skeletal joints.
  for j = 1 : n
      calculate angle of tangent θ_up from the spine mid joint to all the upper body joints
      calculate angle of tangent θ_low from the spine base joint to all the lower body joints
      compute joints MOCAP feature descriptor J_MOCAP ← concatenate(θ_up, θ_low)
  end for
  // way-point trajectory feature descriptors //
  for i = 1 : n
      compute subsets Sub_3, Sub_4 and Sub_5 consisting of sets of three, four and five joints, respectively
      generate trajectories as three-way T_3 from Sub_3, four-way T_4 from Sub_4 and five-way T_5 from Sub_5
      compute displacement d_(x,y) and motion C_t vectors from trajectories T_3, T_4 and T_5

Particle Swarm Optimization (PSO)
After combining the RGB-D descriptors to recognize human activities, a very complex representation is generated. So, for efficient time and space computation, PSO is applied for feature selection and dimensionality reduction. PSO belongs to the category of stochastic optimization techniques [82]. The algorithm is based on the communal behavior of groups of animals such as birds, insects and fish [83]. At first, optimization is initialized by randomly selecting a swarm, i.e., a sample of candidate solutions from the feature descriptors. The position of each particle of this swarm in the D-dimensional search space is constantly regulated by a position vector →x_i and a velocity vector →v_i defined as:

→x_i = (x_{i1}, x_{i2}, . . . , x_{iD}),   →v_i = (v_{i1}, v_{i2}, . . . , v_{iD}),

where i = 1, 2, 3 . . . N and N is the total number of particles. A movement of this selected swarm is initialized, and the direction of this movement is toward the best position found in the search space. During this whole optimization process, three variables are retained by every candidate of optimization: current velocity, current position and personal best position. The personal best position, called pbest, is maintained in a vector p_i = (p_{i1}, p_{i2}, . . . , p_{iD}) and gives the optimal fitness value. The global best position (gbest) is also maintained in a vector p_g = (p_{g1}, p_{g2}, . . . , p_{gD}) and gives the best particle from all the N particles. Both the position and velocity of particles are updated in the search space according to the new best position, thus:

→v_i(t + 1) = →v_i(t) + ϕ_1 (→p_i − →x_i(t)) + ϕ_2 (→p_g − →x_i(t)),
→x_i(t + 1) = →x_i(t) + →v_i(t + 1),

where ϕ_1 and ϕ_2 are random numbers. All the particles finally converge to local minima after calculating the best values. This is an iterative process which continues until a best solution is learned. The original dimension of the NTU RGB+D feature descriptors is 5360 × 550; for UoL it is 5360 × 400 and for CAD it is 5360 × 250.
The length of the combined feature vector of all four proposed features is 5360, which is reduced so that the descriptors become 4796 × 550 for NTU RGB+D, 4796 × 400 for UoL and 4796 × 250 for the CAD dataset. At the end, all the particles are assigned the best place in the search space. The movement of each particle is influenced by both the local best and global best positions. All the swarm particles try to move towards and get closer to the global best position. The movement of swarm particles towards gbest is displayed in Figure 8.
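The pbest/gbest update scheme described above can be sketched as follows. This is a minimal illustration on a toy sphere function, not the paper's actual feature-selection fitness; the inertia weight (0.7) and acceleration coefficients (1.5) are assumed values, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso(fitness, dim, n_particles=30, iters=100):
    """Minimal PSO sketch following the pbest/gbest scheme above."""
    x = rng.uniform(-5, 5, (n_particles, dim))    # particle positions
    v = np.zeros((n_particles, dim))              # particle velocities
    pbest = x.copy()                              # personal best positions
    pbest_val = np.apply_along_axis(fitness, 1, x)
    g = pbest[np.argmin(pbest_val)]               # global best position
    for _ in range(iters):
        phi1, phi2 = rng.random((2, n_particles, 1))   # random numbers ϕ1, ϕ2
        # velocity update: inertia + pull towards pbest and gbest
        v = 0.7 * v + 1.5 * phi1 * (pbest - x) + 1.5 * phi2 * (g - x)
        x = x + v                                 # position update
        val = np.apply_along_axis(fitness, 1, x)
        improved = val < pbest_val                # keep better personal bests
        pbest[improved], pbest_val[improved] = x[improved], val[improved]
        g = pbest[np.argmin(pbest_val)]           # refresh global best
    return g, pbest_val.min()

# toy fitness: sum of squares, minimized at the origin
best, best_val = pso(lambda z: np.sum(z**2), dim=5)
```

In the paper's setting the fitness would instead score a candidate feature subset, so that the swarm converges on a reduced 4796-dimensional descriptor.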

Neuro-Fuzzy Classifier (NFC)
In order to accelerate the recognition rate of human actions, NFC, i.e., a hybrid of fuzzy set theory and Artificial Neural Networks (ANN), is applied. This hybrid classifier results in an intelligent inference system which is capable of both reasoning and self-learning [84]. Many action recognition systems based on NFC have been proposed in recent years [85]. It is a six-layer architecture, as displayed in Figure 9.

First of all, we feed our training data to the input layer as pairs (x_k, c_k), where x_k = [x_{k1}, x_{k2}, . . . , x_{km}] ∈ R^m is a vector of dimension m, c_k ∈ {1, 2, . . . , n} is the label of the class to which it belongs and n is the total number of classes in the training dataset. The second layer is the membership layer, in which the membership function of each input vector is computed.
We applied a Gaussian membership function. The membership function u_ij for sample x_sj with mean c_ij and standard deviation σ_ij is defined by:

u_ij(x_sj) = exp(−(x_sj − c_ij)² / (2σ_ij²)),  (15)

where j, i and s represent the feature, the rule of the corresponding feature and the sample, respectively. The results of the Gaussian function for each input are fed into the third layer, i.e., the fuzzification layer [86]. The firing strength of each generated fuzzy rule w.r.t. each corresponding sample x_s over all N features is calculated in this layer as:

α_is = ∏_{j=1}^{N} u_ij(x_sj),

where α_is is the firing strength of the i-th rule. The fourth layer is called the defuzzification layer. The nodes in this layer are equal to the total number of action classes in the training data. In this layer, output is generated by integrating the results of the preceding layers, i.e., the firing strengths α_is with the weight values w_ik. The output y_sk for a sample s from class k is generated by:

y_sk = Σ_{i=1}^{M} α_is w_ik,

where the weighted summation runs from rule i = 1 to the total number of generated rules M. At the last layer, all the output values are normalized by dividing the output of each sample s for each class k by the sum of the outputs over all K classes. The normalized output o_sk is given as:

o_sk = y_sk / Σ_{l=1}^{K} y_sl.

In this way, the class label of each sample s is obtained from the maximum o_sk value. When the testing vectors are fed to the NFC architecture, the resultant output predicts the class label for each input vector. The steps involved in predicting class labels from input data are given in Algorithm 3.

Algorithm 3 (excerpt): Class label prediction with NFC.
  Step 1: for each input sample compute the Gaussian membership
      u_ij ← gaussMF(x, σ, c)
  end for
  Step 2: for each node in the second layer calculate the fuzzy firing strength as the product of each sample with the antecedents of the previous layer
      α_is ← rule-layer(u_ij)
  end for
  Step 3: for each node in the third layer defuzzify each node and generate output by the weighted sum of firing strengths
      y_sk ← sum(α_is, w_ik)
  end for
  Step 4: for each node in this layer normalize each output class by dividing it by the sum of all output classes
      o_sk ← y_sk / sum(y_sl)
  end for
  Step 5: assign a class label to each sample: C_s ← max{o_sk}
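The layer-by-layer computation can be sketched as a single forward pass. This is an illustrative toy, not the trained classifier: the rule centers, widths and weights are random placeholder values:

```python
import numpy as np

def nfc_forward(x, centers, sigmas, w):
    """Forward pass through the six-layer NFC described above.

    x: (m,) input feature vector
    centers, sigmas: (M, m) Gaussian membership parameters per rule
    w: (M, K) defuzzification weights for K classes
    """
    u = np.exp(-((x - centers) ** 2) / (2 * sigmas ** 2))  # layer 2: memberships u_ij
    alpha = np.prod(u, axis=1)                             # layer 3: firing strengths α_i
    y = alpha @ w                                          # layer 4: weighted sums y_k
    o = y / y.sum()                                        # layer 5: normalized outputs o_k
    return np.argmax(o), o                                 # layer 6: class label

rng = np.random.default_rng(1)
centers = rng.normal(size=(4, 3))   # 4 rules over 3 features (toy values)
sigmas = np.ones((4, 3))
w = rng.random((4, 2))              # 2 classes
label, o = nfc_forward(centers[0], centers, sigmas, w)
```

Feeding a sample at a rule center maximizes that rule's firing strength, so the predicted class is dominated by that rule's weights.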

System Validation and Experimentation
This section first gives a brief description of the datasets used for training and testing the proposed system. Then, the parameters used for evaluating the system and the generated results are given. All the experiments are performed in MATLAB (R2017a). Four parameters are used to validate system performance. The first is the recognition rate of the individual activities of all three datasets. The second is the effect of the number of membership functions on evaluation time and performance. The third comprises system sensitivity, specificity and error measures. The fourth is the comparison of the proposed system with other systems reported in recent years. Each parameter is explained in a detailed subsection. Table 7 gives the name, type of input data and description of each dataset used for the training and testing of the proposed system.

Recognition Accuracy
In order to validate the system's performance, descriptors of action classes from each dataset are given individually to NFC to identify the recognition rate. The accuracy percentages for each class are given separately in the form of a confusion matrix. Each activity class of all three datasets used for experimentation achieved very good performance results with the proposed system. Table 8 shows the confusion matrix of action classes of the NTU RGB+D dataset (class legend, in part: . . . shaking hands, 10 walking towards, 11 walking apart).
It is inferred from Table 8 that the average recognition rate for the NTU RGB+D dataset is 93.5%. Each activity class is recognized with a high recognition accuracy. Due to our robust features set, the proposed system has achieved excellent accuracies of 98%, 97% and 96% with slap/punch, kicking and pushing interactions, respectively. Thus, it is proved that our system is capable of detecting anomalous activities from environment. Regular activities like pointing the finger and hugging are also recognized with very high accuracy rates. The lowest accuracy rates are observed in activities such as pat on back, point finger and giving object due to the repetition of similar movements involved in these activities. For example, the actions giving object and shaking hands are performed with similar movements of the same body parts (the hands). Table 9 shows the confusion matrix for action classes of the UoL dataset.
When a testing set of action classes from the UoL dataset is given to NFC, an average recognition rate of 92.2% is achieved. It is inferred from Table 9 that anomalous activities from this dataset are also detected with excellent recognition rates. This is because of the strong set of skeletal joint data and full-body features which enable our system to detect continuous activities such as hug and handshake with very high accuracy rates. However, a slight drop in the recognition rate is observed with the conversation and call attention activities due to similarities in human body gestures and postures (Mean Accuracy = 92.2%). Table 10 shows the confusion matrix for action classes of the CAD dataset.

Table 10. Confusion matrix for action classes of the CAD dataset (rows: actual action classes; columns: predicted classes).

            crossing  talking  walking  queueing  waiting
  crossing     88        0       10        2         0
  talking       0       92        3        0         5
  walking       8        0       84        3         5
  queueing      0        2        4       90         4
  waiting       1        0        1        4        94

Mean Accuracy = 89.6%

CAD is a very challenging outdoor dataset with highly occluded backgrounds. The average recognition rate with CAD is slightly lower than with the NTU RGB+D and UoL datasets. Nevertheless, our system is capable of recognizing activities such as talking and waiting with 92% and 94% recognition rates, respectively. The actions involved in all classes of the CAD dataset are strongly related to each other, so a confusion rate as high as 10% is observed between activities such as crossing and walking. A mean performance rate of 89.6% is achieved with the CAD dataset. In summary, this experiment proved the effectiveness of the proposed system, which achieved high recognition rates with all three datasets.

The Effects of Numbers of Membership Functions
In this experiment, the effect of different numbers of Membership Functions (MF) on computation time, Root Mean Square Error (RMSE) and accuracy is observed. A Gaussian membership function is used. During experimentation, the number of MFs is varied among 2, 3, 5 and 8, and the number of epochs among 200, 300 and 500. This experiment is performed with all three datasets. Tables 11-13 demonstrate the effects of different numbers of MFs on the NTU RGB+D, UoL and CAD datasets, respectively. It is observed from the results given in Tables 11-13 that the number of membership functions affects both the performance and the computation time of the system. Increasing the number of MFs up to a certain point improves performance. However, beyond that limit, additional MFs result in the repetition of similar patterns. For example, in Table 11, increasing the number of MFs from five to eight increases the RMSE and decreases the system's recognition rate. This is because increasing the number of MFs beyond a certain limit produces more fuzzy rules and the problem of overfitting occurs. Conversely, if very few MFs are used, fewer fuzzy rules are compared and system performance decreases. The minimum RMSE is observed with five MFs, at the cost of computation time, with the NTU RGB+D, UoL and CAD datasets. It is also observed that increasing the number of MFs and iterations increases computation time. However, increasing the number of iterations above a certain limit starts to produce results similar to previous iterations. The best performance is achieved with 300 iterations. Thus, it is inferred that the numbers of MFs and iterations affect the performance of the system.

Sensitivity, Specificity and Error Measures
For an in-depth evaluation and validation of the proposed system, we calculated sensitivity, specificity and error measures. Sensitivity measures the probability of detection, i.e., the True Positive Rate (TPR), while specificity measures the True Negative Rate (TNR). In order to represent false classifications, the False Positive Rate (FPR), or fall-out rate, and the False Negative Rate (FNR), or miss rate, are calculated. FPR and FNR identify errors or misclassification rates. Sensitivity, specificity, FPR and FNR for each activity class of the NTU RGB+D, UoL and CAD datasets are displayed in the form of bar graphs in Figures 10-12, respectively.


It is observed from Figure 10 that the sensitivity ratio of all action classes of the NTU RGB+D dataset is very high. The proposed system can clearly distinguish between the classes and accurately predict the true label of the class to which each sample belongs. The overall sensitivity over all action classes is as high as 93.5%, 92.2% and 89.6% for the NTU RGB+D, UoL and CAD datasets, respectively. This shows that the system has a very small failure rate. The mean true negative rate of the system is 99.3% with the NTU RGB+D dataset, 98.8% with the UoL dataset and 97.3% with the CAD dataset. Most of the specificity ratios that we obtained with all three datasets are as high as 99%. Thus, our system has a high ability to reject a testing sample if it does not belong to a specific class.
When the FPR of action classes for all three datasets are compared, it is observed that the mean FPR of NTU RGB+D, UoL and CAD datasets is 0.61%, 1.06% and 2.58%, respectively, as seen in Figures 10-12. On the other hand, the FNR of all three datasets is 6.43% with NTU RGB+D, 7.78% with UoL and 10.4% with CAD datasets. Hence, the error rates are very low, compared to sensitivity and specificity. So, a robust system is produced with high TPR and TNR, and low FPR and FNR.
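All four measures can be derived directly from a confusion matrix. The sketch below does this for the CAD matrix of Table 10 (per-class one-vs-rest counts; the variable names are ours):

```python
import numpy as np

# Confusion matrix for the CAD dataset (Table 10); rows = actual classes,
# columns = predicted classes: crossing, talking, walking, queueing, waiting
cm = np.array([[88,  0, 10,  2,  0],
               [ 0, 92,  3,  0,  5],
               [ 8,  0, 84,  3,  5],
               [ 0,  2,  4, 90,  4],
               [ 1,  0,  1,  4, 94]], dtype=float)

tp = np.diag(cm)                  # correctly classified samples per class
fn = cm.sum(axis=1) - tp          # misses per actual class
fp = cm.sum(axis=0) - tp          # false alarms per predicted class
tn = cm.sum() - tp - fn - fp      # everything else, one-vs-rest

tpr = tp / (tp + fn)              # sensitivity
tnr = tn / (tn + fp)              # specificity
fpr = fp / (fp + tn)              # fall-out rate
fnr = fn / (fn + tp)              # miss rate

mean_accuracy = tp.sum() / cm.sum() * 100
```

For this matrix the mean accuracy works out to 89.6% and the mean miss rate to 10.4%, matching the values reported for the CAD dataset.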

Computational Time Complexity
In order to demonstrate the efficiency of the system, an experiment is performed to compute its computational time. This experiment investigates the running time with respect to the given input in the form of frames. A Core i5-4300U CPU (Central Processing Unit, 1.90 GHz) with MATLAB (R2017a) is used to compute the running time. The testing set of a single activity class consists of 30 frames per action. When a testing sample of each activity class was given to the system, it took 3.3 s to recognize the action and assign a class label to the given input. For one frame, the computational time for recognition of the human action was 0.11 s. So, our system is capable of providing real-time human action recognition at approximately 9-10 frames per second. Furthermore, in this experiment, the computational time of the proposed system was compared with Artificial Neural Networks (ANN) as the classifier. First, the action classes from all three datasets were given individually as input to the proposed system and the computational time in which the system classified all the action classes was measured. Then the action classes from all three datasets were classified via ANN instead of NFC. The proposed system with NFC provided results faster than the ANN approach. Figure 13 shows the computational time for the action classes of the NTU RGB+D, UoL and CAD datasets classified via NFC and ANN.



Comparison with Other Recently Proposed Systems
Finally, we compared the performance of the proposed system with other recently developed systems, using the same activities from all three datasets. In [49], each interaction is divided into interactions of different body parts, and two unique features called RLTP (Relative Location Temporal Pyramid) and PCTP (Physical Contact Temporal Pyramid) are produced. In [90], two libraries, OpenPose and 3D-baseline, are used to extract human joints, while a Convolutional Neural Network (CNN) is used to classify the results. In [57], a feature factorization network is proposed, and for better classification a sparsity learning machine is produced. In [91], a system is proposed for both pose estimation and action recognition; in this method, action heat maps are generated via CNN. In [92], a scale- and translation-invariant transformation of skeletal images to color images is performed, and a multi-scale CNN is used to adjust frequency.
In [50], a method to temporally detect human actions by tracking the movements of upper body joints is presented. A Hidden Markov Model (HMM) is used to detect and classify human actions, together with proxemics theory, which depends on the usage of space during social interactions. In [93], spatiotemporal features are extracted from each person in an action class, while social features are extracted between the two interacting persons. In [94], human actions are tracked with skeletal data and human body postures; Support Vector Machines (SVM) and X-means are used in the training and testing phases. In [95], a method based on the fusion of RGB, depth and wearable inertial sensors is presented, with Histogram of Oriented Gradients (HOG) and statistical features extracted to record human actions. In [39], the connections between atomic activities are measured and interaction responses are formulated; a multi-task interaction response (MIR) is computed for each class separately. In [61], the inter-related dynamics among different persons are identified via Long Short-Term Memory (LSTM) networks: first the static features of one person are given to a single-person LSTM and then its output is given to a concurrent LSTM. In [96], a graphical model is used to find relationships between the individual persons in an interaction, and structured learning is introduced to connect inputs with the right output. In [97], the relationships among individuals as well as the atomic activity performed by each individual are measured. Table 14 shows the comparison of performances on the NTU RGB+D, UoL and CAD datasets.
It is observed from Table 14 that the proposed system performed better than many action recognition systems of recent years. The proposed system works well for HAR because of the features used to track each and every movement made by both persons involved in an interaction. The incorporation of depth sensors makes it possible to predict even complex human-to-human actions accurately. The data obtained after feature extraction are in a structured form for decision making, which improves the performance of the system. In a very short time, our system can give results with high sensitivity and accuracy. On the other hand, the deep learning methods presented in the comparison consist of very complex data models that take more computational time to predict results, and a large amount of data is required to train them before they can predict the right outcome. By contrast, the proposed system can be used as a real-time surveillance system which learns from a small number of data samples and produces results in less time.

Discussion
A sustainable system with high stability and uniformity in the face of different performance challenges is proposed in this research work. We used three challenging datasets covering both indoor and outdoor environments. Our system produced uniformly good performance with all three datasets, tackling varying environmental conditions such as changes in brightness and lighting thanks to the incorporation of depth sensors. The actions in all three datasets are very complex because the movements involved in performing most of the actions are quite similar to each other. For example, walking towards, shaking hands and giving an object are all actions in which two persons move towards each other. However, our system remained stable and reliable in differentiating all similar actions due to the robust set of features, which resulted in high accuracy, sensitivity and specificity ratios.
The challenge of silhouette overlapping was faced during the system's execution. Silhouette segmentation of both RGB and depth images is not affected by overlapping silhouettes. However, in the feature extraction phase, there are some images where the silhouette of one person either slightly or completely overlaps the silhouette of the other person. For example, in classes such as shaking hands and giving an object, the hands of the two silhouettes do not overlap at the beginning of the interaction; at the end of the interaction, however, the hands of the two persons overlap and it is difficult to distinguish and mark the hand joints of each person. In the case of slight silhouette overlapping, blob extraction is performed through connected component analysis and by specifying the height and width of each human. Through blob extraction, the silhouettes of both humans are extracted individually and then feature extraction is performed, so the performance for these actions is not greatly affected. In some actions, such as pat on the back, performance is affected by the overlapping of the silhouettes and is slightly lower (89%) than for other classes. This is because, in this action class, the hand of one person constantly overlaps the shoulder of the other person from the start of the interaction until the end. Moreover, in instances where one silhouette overlaps the other more significantly (e.g., the hugging interaction), connected component analysis fails. In such instances, full-body features such as geodesic distance and 3D Cartesian-plane features are still computed, which is why recognition accuracy is not greatly affected. For example, in the hugging interaction, the geodesic maps are created by taking a single point of origin, i.e., a single centroid is used for both persons. However, in the skeletal joint features, the joints of one person are detected on the silhouette of the other person, and the skeleton is deformed.
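The connected component analysis used for blob extraction can be sketched with a simple flood-fill labeling of a binary silhouette mask. This is an illustrative stand-in for the actual pipeline; the function and the toy mask are our own:

```python
import numpy as np
from collections import deque

def extract_blobs(mask):
    """Label 4-connected foreground blobs in a binary silhouette mask.

    Returns (labels, n) where each separated silhouette gets its own
    integer id in `labels`, mimicking the blob extraction step above.
    """
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for r, c in zip(*np.nonzero(mask)):
        if labels[r, c]:
            continue                       # already assigned to a blob
        count += 1
        labels[r, c] = count
        q = deque([(r, c)])
        while q:                           # breadth-first flood fill
            y, x = q.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = count
                    q.append((ny, nx))
    return labels, count

# two non-touching silhouettes are separated into two blobs
mask = np.zeros((6, 10), dtype=bool)
mask[1:5, 1:3] = True    # person one
mask[1:5, 6:8] = True    # person two
labels, n = extract_blobs(mask)   # n == 2
```

Once the silhouettes touch (e.g., during a hug), the flood fill merges them into a single blob, which is exactly the failure case discussed above.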
As shown in Figure 14, human joints are not identified in the correct positions.

Conclusions
In this research work, an efficient and sustainable human surveillance system is proposed. We developed an action recognition system that performs well across varying environments thanks to the deployment of RGB-D sensors. We proposed four novel features that track each movement made by humans. Two of these, geodesic distance and 3D Cartesian-plane features, are extracted from full human body images; the other two, joints MOCAP and way-point trajectory features, are extracted from human skeletal joints. By combining all four feature descriptors, we are able to track human locomotion. These feature descriptors are optimized via PSO and, finally, a hybrid neuro-fuzzy classifier is used to recognize human actions.
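The PSO step can be illustrated with a minimal particle swarm that minimizes an objective over a feature-weight vector. The inertia and acceleration coefficients, swarm size, iteration count and search bounds below are generic textbook choices, not the parameters used in this work.

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=100, bounds=(-5.0, 5.0), seed=0):
    """Minimal particle swarm optimization; returns the best position found."""
    rnd = random.Random(seed)
    lo, hi = bounds
    pos = [[rnd.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]          # personal best positions
    pbest_f = [f(p) for p in pos]
    gi = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[gi][:], pbest_f[gi]   # global best
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients (generic values)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rnd.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rnd.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fi = f(pos[i])
            if fi < pbest_f[i]:
                pbest_f[i], pbest[i] = fi, pos[i][:]
                if fi < gbest_f:
                    gbest_f, gbest = fi, pos[i][:]
    return gbest
```

In a feature-optimization setting, `f` would score a candidate feature weighting by the resulting classification error rather than a synthetic test function.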
The proposed system has been validated via extensive experimentation. First, the feature descriptors of each dataset are fed separately into the NFC and the mean recognition rate for each action class is calculated. Mean recognition accuracy is 93.5% with the NTU RGB+D dataset, 92.2% with the UoL dataset and 89.6% with CAD. We also evaluated our system with different numbers of membership functions over different numbers of iterations; the best performance was obtained with five membership functions, but at the cost of computation time. The resulting RMSE values at 300 iterations are 0.55 with the NTU RGB+D dataset, 0.056 with the UoL dataset and 0.096 with CAD. Sensitivity and specificity were measured for each activity class to assess system performance via the true positive rate and true negative rate, respectively. Overall, the true positive rate across all action classes was 93.5%, 92.2% and 89.6% with the NTU RGB+D, UoL and CAD datasets, respectively, and the overall true negative rate was 99.3%, 98.8% and 97.3%. Finally, the performance of the proposed system was compared with other systems, and these comparisons showed that the proposed system outperforms many state-of-the-art systems.
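The per-class sensitivity (true positive rate) and specificity (true negative rate) quoted above can be computed one-vs-rest from a multiclass confusion matrix. The sketch below shows the standard calculation; the 2-class matrix in the usage comment is illustrative, not the paper's data.

```python
def sensitivity_specificity(cm):
    """Per-class (sensitivity, specificity) from a confusion matrix,
    where cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class c missed
        fp = sum(cm[r][c] for r in range(n)) - tp   # other classes predicted as c
        tn = total - tp - fn - fp
        metrics.append((tp / (tp + fn), tn / (tn + fp)))
    return metrics

# e.g., sensitivity_specificity([[8, 2], [1, 9]]) gives
# class 0: sensitivity 0.8, specificity 0.9
```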
In the future, we plan to further evaluate our model with deep learning techniques on more challenging human action datasets, as well as for group interaction recognition.