Accurate Physical Activity Recognition using Multidimensional Features and Markov Model for Smart Health Fitness

Abstract: Recent developments in sensor technologies enable physical activity recognition (PAR) as an essential tool for smart health monitoring and for fitness exercises. For efficient PAR, model representation and training are significant factors contributing to the ultimate success of recognition systems, because body parts and physical activities cannot be accurately detected and distinguished if the system is not well trained. This paper provides a unified framework that explores multidimensional features with the help of a fusion of body part models and quadratic discriminant analysis, which uses these features for markerless human pose estimation. Multilevel features are extracted as displacement parameters to work as spatiotemporal properties. These properties represent the respective positions of the body parts with respect to time. Finally, these features are processed by a maximum entropy Markov model as a recognition engine based on transition and emission probability values. Experimental results demonstrate that the proposed model produces more accurate results than state-of-the-art methods for both body part detection and physical activity recognition. The accuracy of the proposed method for body part detection is 90.91% on the University of Central Florida (UCF) sports action dataset. For activity recognition, accuracy is 89.09% on the UCF YouTube action dataset and 88.26% on the IM-DailyRGBEvents dataset.


Introduction
Assistive technologies for human locomotion tracking provide independent mobility, social participation and health benefits [1]. These benefits have emerged as a major research gain in worldly application domains such as violence detection, home automation systems, customer surveillance, virtual reality and physical fitness [2,3]. However, the tracking and recognition of people's physical activities remain problematic due to the human body's articulated nature, degrees of freedom between joints, partial occlusion and varying scales normalization [4]. Several modules such as rigid body configuration, body-part landmarks, homograph estimation, and optimal feature descriptors are introduced to minimize these difficulties.
Although researchers have put considerable effort into physical activity recognition (PAR), some challenges remain unresolved, as described below:
1. Shape and height variations: human size and shape appear smaller when individuals are further away from the camera and larger when they are closer. In addition, human bodies vary considerably in shape and size.
2. Feature selection approach: there are a huge number of feature selection approaches, and choosing an appropriate one for PAR is a critical issue.
3. Occlusion: a human body or any part of a particular body may be hidden due to occlusion.
4. Hardware problems: many approaches for PAR use expensive hardware in their research, making it difficult to incorporate these systems in real life.
5. Illumination variations: the same image can look entirely different in different lighting situations.
In recent years, the interest of researchers in PAR has increased due to its numerous applications. Major application areas of PAR include video surveillance, virtual reality gaming, human-object interaction, e-learning, healthcare systems and human behavior analysis. In security surveillance systems, if a person walks normally under video surveillance and suddenly behaves suspiciously, there is a chance that abnormal events such as threats, fighting, domestic violence or agitation have occurred. Such abnormal activities are automatically detected by PAR to initiate the securing of the areas under surveillance. Similarly, health-exercise systems use automatic action-recognition technologies that can guide patients to exercise properly and assist them in their daily routines. In addition, PAR can make sports and games more attractive and entertaining due to the prediction of future players' actions and the expected scores of each team.
In this paper, we present a novel technique to estimate pose and to identify the specific physical activities based on a 12-point skeletal model and multidimensional features. These features are further reduced, optimized and classified by quadratic discriminant analysis (QDA), along with a maximum entropy Markov model (MEMM). The proposed method was tested on University of Central Florida's UCF sports actions, UCF YouTube actions and IM-DailyRGBEvent datasets (a collection of video sequences containing different common human actions) for body-part detection along with PAR. The test results achieved remarkable performance scores.
The rest of the paper is organized as follows. Section 2 consists of related work in the field of PAR. In Section 3, the framework is outlined, including system design, preprocessing stage, feature generation, and activity training/recognition. In Section 4, experimental results for body part detection and PAR are described. Finally, Section 5 describes the conclusion of our proposed work and future directions.

Related Work
Two types of sensors are commonly used in PAR. The first type is inertial sensors. Some gadgets (e.g., a smart watch) are worn by users at different locations on their body. Here, various accelerometers are embedded in the smart watch to measure acceleration forces. As activities are carried out, different acceleration forces are stored as data, preprocessing steps are performed and all activities are categorized. The second type of PAR sensors is vision sensors that recognize activities based on captured images. For PAR-based vision sensors, image sequences or video are captured by still/movable cameras and fed into the detector engine. Research has been done in PAR using both types of sensors and it is discussed in the following subsections.

Body Worn Sensors for PAR Systems
In body worn inertial sensor based work, Trung et al. [5] recognized the similar types of actions that are usually difficult to classify. They used inter-class relationships to improve the overall performance of the method. In [6], Trung et al., used inertial sensors to recognize actions and, instead of using interclass relationships, they used a scale space method to segment the action signals properly. They also tackled the problem of inconsistency in sensor orientation by adjusting the tilt of the sensors. In both [5] and [6], despite using different methodologies, accuracy was low, so in [7], Hawang et al. suggested a new method for physical activity recognition by fusing an inertial sensor with a vision sensor to overcome the problem of unreliability of the inertial sensors. They claimed that merging these two types of sensors could help overcome deficiencies in both types of systems. In [8], Irvin and Angelica, by contrast with [7], preferred to attach more inertial sensors to the user's body for PAR over the fusion of vision and inertial sensors. They estimated the angle of lower and upper limbs to extract features related to movement of the limbs. Dawar et al. [9] supported the fusion of sensors as in [7]. Additionally, they used convolutional neural networks to detect the data that was captured by vision sensors as well as another network "short term memory" with accelerometer data.

Vision Sensors for PAR Systems
In vision sensors, Fang et al. [10] developed a classical statistical sampling scheme along with deep learning representation of individual silhouettes to identify complex human actions. In [11], Silambarasi et al. developed a 3D spatio-temporal plane which locates human movements from different views via motion history tracing. They extracted histograms of oriented gradients and directional gradients and magnitudes to recognize physical activities. In [12], Shehzed et al. presented a new multiperson tracking system that included body-part labeling, Kalman filter and Gaussian mapping for crowd counting and action detection. In [13], Han et al. proposed a global spatial attention (GSA) model that explored different skeletal joints and adopted an accumulative learning curve to distinguish and recognize various action types. However, these articles [10][11][12][13] still have major issues such as uncontrolled lighting, dynamic postures, rotational views and motion ambiguities, which result in low performance. Therefore, to overcome these limitations, we developed a novel methodology for PAR.

Proposed System Methodology
Video sensors capture the raw data. During preprocessing, human silhouettes are extracted using two models, saliency-based segmentation and skin tone detection. These silhouettes are then used to extract multilevel features, including displacement parameter values. Finally, the features are quantified using quadratic discriminant analysis (QDA) to get the best matching unit, and the maximum entropy of each activity class is found via a Markov model. Figure 1 depicts the proposed framework of our PAR system.

Figure 1. Overview of the proposed system architecture.

Preprocessing Stage
During vision-based image preprocessing, we applied two methods to extract reasonable human silhouettes. First, we extracted a silhouette using a salient region detection technique, and then a separate silhouette was extracted using a skin tone segmentation technique. The two results were then merged to get robust and accurate silhouettes from the given image. Saliency-based segmentation [14] was used to distinguish an object (i.e., the silhouette) by saliency values calculated from its surroundings. The saliency $S_R$ for pixel $(i, j)$ was computed as

$$S_R(i, j) = \sum_{(x, y) \in N} d\left(R_{i,j}, Q_{x,y}\right) \quad (1)$$

where $N$ is defined as an area near the saliency pixel at location $(x, y)$ and $d$ is defined as the position difference between pixel vectors $R$ and $Q$. After determining the saliency values for all regions of the image, a standard threshold saliency value was used to differentiate foreground from background. Figure 2a shows the results of the saliency method.
The silhouette extracted with the skin tone approach [15] was used to improve the results of the saliency method. In the skin tone approach, certain color ranges with specified decision borders were perceived. Specifically, RGB and hue, saturation, value (HSV) threshold values were used to separate the skin region from the nonskin region. Equation (2) represents the RGB threshold, while Equation (3) represents the HSV threshold for the skin tone segmentation.
Here, $r_r$ and $g_r$ represent the ranges of the red and green channels of RGB, respectively, whereas $H_r$, $S_r$ and $V_r$ represent the ranges of the HSV color model. After distinguishing the skin regions from nonskin regions, the skin regions were grown using the geometrical characteristics of the human body. To grow these regions, it was assumed that skin was usually visible only on the face, arms and lower legs. If two vertically aligned regions were detected with the skin tone segmentation method, it was most likely that one of them was the face and the other was the lower legs. In that case, the system connected these regions. Likewise, if further skin regions were found on the left or right side of the linked region, they were likely to be hands or arms (see Figure 2b).
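The neighbourhood-based saliency computation and thresholding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the image is a grayscale list of lists, the area $N$ is taken to be a square window of hypothetical radius `radius`, the difference $d$ is taken to be the absolute intensity difference, and the foreground threshold defaults to the mean saliency because the actual threshold value is not stated.

```python
def saliency_map(image, radius=1):
    """Sum of intensity differences between each pixel and its neighbours.

    `image` is a list of rows of grayscale values. The window radius and
    the use of absolute intensity difference are illustrative assumptions.
    """
    rows, cols = len(image), len(image[0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            # Visit every pixel (x, y) inside the window N around (i, j).
            for x in range(max(0, i - radius), min(rows, i + radius + 1)):
                for y in range(max(0, j - radius), min(cols, j + radius + 1)):
                    total += abs(image[i][j] - image[x][y])
            saliency[i][j] = total
    return saliency


def threshold_foreground(saliency, threshold=None):
    """Binary mask: 1 where saliency exceeds the threshold, else 0.

    When no threshold is given, the mean saliency is used (an assumption).
    """
    values = [v for row in saliency for v in row]
    if threshold is None:
        threshold = sum(values) / len(values)
    return [[1 if v > threshold else 0 for v in row] for row in saliency]
```

A single bright pixel on a dark background receives the highest saliency and is the only pixel kept as foreground by the default threshold.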
After extracting the silhouettes with each method, the two results were merged to get a more accurate silhouette. To combine both extraction methods, Algorithm 1 was formulated as:

Algorithm 1: Extraction of human silhouette
Input: Y: result of saliency method, result of skin tone method
Output: Merged silhouette
/* Calculating position of human in image */
/* β denotes the nonzero region */
/* µ denotes the merged silhouette */
/* Ω denotes the human shape */
Step 1: Until the biggest regions in both inputs are searched.
Step 2: /* Compare β of both images */
    For all pixels in β of both images
        If pixel is matching Ω
            µ =
        End
    End
End
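One possible reading of Algorithm 1 is sketched below. It is an assumption-laden illustration rather than the exact procedure: both inputs are taken to be binary masks of equal size, the "biggest region" is taken to be the largest 4-connected nonzero component, and the merged silhouette is assumed to be the pixelwise union of the two largest regions (the merge rule itself is not fully specified above). The function names are hypothetical.

```python
from collections import deque


def largest_region(mask):
    """Return a mask holding only the largest 4-connected nonzero region."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # Breadth-first search collects one connected region.
                region, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(region) > len(best):
                    best = region
    out = [[0] * cols for _ in range(rows)]
    for y, x in best:
        out[y][x] = 1
    return out


def merge_silhouettes(saliency_mask, skin_mask):
    """Union of the largest regions of both masks (assumed merge rule)."""
    a = largest_region(saliency_mask)
    b = largest_region(skin_mask)
    return [[1 if (pa or pb) else 0 for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]
```

Restricting each input to its largest region first discards small isolated detections before the two masks are combined.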

Pose Estimation: Body Parts Detection
During pose estimation, the initial pose was considered as a T-shape, with the arms extended straight out from the neck, for human body configuration. Initially, five basic body parts were detected [16]: the two hands, the head, and the two feet. An inflection-based method was incorporated in the system; it used the 2D kappa mechanism, which is closely associated with object silhouette refining [17]. The Kappa function is defined as:
After the silhouette was refined with the Kappa function, the gap between the highest and the lowest pixel of the silhouette was measured. The head diameter was standardized as 1/4.5 of the height to estimate the individual's height and width. In addition, taking pixelwise digging into consideration, the head diameter was calculated by measuring the altitude of the silhouette. To detect the head position, the following formula was used:

where $P^f_H$ is the head position at any given frame $f$. The position of the limb was needed to estimate the locations of the hips and the feet (see Figure 3). The following equation was used to determine the limb position:

where $P^f_i$ is the limb position in frame $i$. The positions of the hands and feet were determined via the lower and upper limb positions and the geometrical features of the silhouette [18]. The torso point was at the center of the upper head point and between the feet [19]. The torso position was adjusted with the help of the following equation:

Equation (8) was used to identify the location of the knees, which was usually at the center point between the feet and the hip joints.
Thus we explored twelve body parts, being five basic body parts and seven body sub parts. As these images were in sequence we could track these parts and get optimal positions for each body part. Figure 4 gives a few examples of the detection of the 12 body parts.
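The geometric rules stated above (head diameter as 1/4.5 of the silhouette height, torso between the upper head point and the feet, knees midway between feet and hips) can be illustrated with a short sketch. The function names and the `(x, y)` tuple representation are assumptions made for this example, and the frame-wise adjustment equations are not reproduced.

```python
def head_diameter(top_y, bottom_y):
    """Head diameter taken as 1/4.5 of the silhouette height in pixels."""
    return (bottom_y - top_y) / 4.5


def midpoint(p, q):
    """Midpoint of two (x, y) points."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)


def torso_point(head_top, left_foot, right_foot):
    """Torso estimate: centre between the head top and the point between the feet."""
    return midpoint(head_top, midpoint(left_foot, right_foot))


def knee_point(foot, hip):
    """Knee estimate: centre point between a foot and the matching hip joint."""
    return midpoint(foot, hip)
```

For a silhouette spanning rows 10 to 190, `head_diameter(10, 190)` gives a head diameter of 40 pixels.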

Multidimensional Features Generation
Once the twelve body parts were detected from a human posture, we applied multilevel features. These included six-dimensional torso features, eight-dimensional first-degree features and eight-dimensional second-degree features. Algorithm 2 explains the overall definition of the multidimensional features.
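As a rough illustration of displacement parameters used as spatiotemporal features, the sketch below computes the frame-to-frame displacement magnitude of each tracked body part. It does not reproduce the six- and eight-dimensional feature definitions of Algorithm 2; the dictionary layout, the part names and the use of Euclidean displacement are all assumptions made for this example.

```python
import math


def displacement_features(frames):
    """Per-part displacement magnitude between consecutive frames.

    `frames` is a list of dicts mapping a body-part name to its (x, y)
    position. Parts are ordered alphabetically in each feature row.
    """
    features = []
    for prev, curr in zip(frames, frames[1:]):
        row = []
        for part in sorted(curr):
            dx = curr[part][0] - prev[part][0]
            dy = curr[part][1] - prev[part][1]
            row.append(math.hypot(dx, dy))
        features.append(row)
    return features
```

A sequence of T frames yields T - 1 feature rows, one per pair of consecutive frames.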

Features Discrimination
QDA [20] was used to evaluate which feature values can distinguish between all activity classes in labeled datasets. Each class was assumed to be normally distributed [21], and therefore a quantification function for quadratic discriminant analysis was applied as:

where $S_j$ is the covariance matrix and $j = 1, 2, \dots, k$. To distinguish between features, we examined if $D_i^2(x)$ is the smallest for any class $x$. Figure 5 shows a 3D plot with clear discrimination of the 11 different classes of the UCF YouTube action dataset.
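The idea of assigning a feature vector to the class with the smallest quadratic discriminant score can be sketched as follows. The score used here is the textbook form, a Mahalanobis term plus the log-determinant of the class covariance minus twice the log prior, which is assumed rather than taken from the expression above. The example is restricted to two-dimensional features so that the covariance inverse has a closed form, and all function names are hypothetical.

```python
import math


def class_stats(samples):
    """Mean vector and 2x2 sample covariance matrix of 2-D samples."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def quadratic_score(x, mean, cov, prior):
    """Textbook quadratic discriminant score; smaller means a better match."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Mahalanobis term via the closed-form inverse of a 2x2 matrix.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return maha + math.log(det) - 2.0 * math.log(prior)


def classify(x, model):
    """Pick the label whose (mean, cov, prior) gives the smallest score."""
    return min(model, key=lambda label: quadratic_score(x, *model[label]))
```

Here `model` maps each activity label to a `(mean, covariance, prior)` tuple, for example built as `class_stats(samples) + (prior,)`.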

Figure 6. Discrimination of QDA over all classes of the IM-DailyRGBEvents dataset.

Recognition Engine: MEMM
For activity classification, the conditional probability of the observation sequence was used to estimate the state sequence [22] via MEMM. According to the MEMM model [23], the activity classes were declared as the state $P(S_i \mid S_{i-1}, O_i)$ with entropy adjustments, formulated as Equation (10),

where $\delta_k$ is the feature value and $\beta_k$ is the adjustable weight for the given observation in the sequence. The conditional entropy of a distribution $P(S \mid O)$ is estimated by maximum entropy theory. It was inferred by the log-linear model as:

where $Z(O, S')$ is a normalization factor and $\lambda_m$ is the multiplier parameter with multi-level features. Figure 7 shows how probability is estimated during MEMM classification over the different activities of walking, swinging and T-jumping in the YouTube action dataset.
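A generic maximum entropy Markov model can be sketched in a few lines. The example below assumes the standard log-linear transition, in which the probability of a state given the previous state and the current observation is the exponential of a weighted feature sum divided by a normalization factor, and decodes the most likely state sequence with the Viterbi recursion. The feature function, the weights and the `START` state are hypothetical and chosen only for illustration; they are not the trained parameters of the proposed system.

```python
import math


def transition_probs(prev_state, obs, states, weights, feature_fn):
    """P(state | prev_state, obs) under a log-linear (maximum entropy) model."""
    scores = {
        s: math.exp(sum(w * f for w, f in zip(weights, feature_fn(prev_state, obs, s))))
        for s in states
    }
    z = sum(scores.values())  # normalization factor for this (obs, prev_state)
    return {s: v / z for s, v in scores.items()}


def viterbi(observations, states, weights, feature_fn, start="START"):
    """Most probable state sequence for the given observations."""
    best = {start: (0.0, [])}  # state -> (log probability, path so far)
    for obs in observations:
        nxt = {}
        for prev, (logp, path) in best.items():
            probs = transition_probs(prev, obs, states, weights, feature_fn)
            for s, p in probs.items():
                cand = logp + math.log(p)
                if s not in nxt or cand > nxt[s][0]:
                    nxt[s] = (cand, path + [s])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]
```

Because each transition distribution is normalized per previous state, the model conditions directly on the observation instead of modelling an emission probability for it.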

Experimental Results
In this section, we first describe three challenging benchmark datasets and then present four types of experimental results. In the first experiment, we explored body part detection accuracies with respect to ground truth. In the second experiment, action recognition accuracies are presented. In the third experiment, we compared the proposed technique with well-known machine learning algorithms. Finally, in the fourth experiment, we compared body part detection accuracies as well as action recognition accuracies with other well-known statistical state-of-the-art methods.

Datasets Description
In the UCF sports actions dataset [24], a set of action classes was gathered from different games usually shown on TV channels such as the BBC and ESPN. The actions included diving, golf swing, kicking, lifting, horse riding, running, skateboarding, swing-bench, swing-side and walking, at 720 × 480 resolution. The dataset is available as videos comprising more than a hundred sequences. Figure 8 shows some samples of the UCF sports actions dataset.
In the UCF YouTube action dataset [25], there were 11 different classes of action: swing, diving, T-jumping, walking, basketball, volleyball, soccer juggling, G-swing, horse-riding, biking, and T-swing. The clips were combined into 25 groups per category, containing a minimum of four actions per clip. Videos of the same category shared common characteristics such as the same performer, common context and specific point of view. Figure 9 shows some samples from the UCF YouTube action dataset.
In our self-annotated IM-DailyRGBEvents dataset [26], there were 15 classes of actions performed by 15 subjects (i.e., 13 males, 2 females). There were more than seventy video sequences for each action. Figure 10 shows some images from the IM-DailyRGBEvents dataset.

Experimentation I: Body Parts Detection Accuracies
To calculate the effectiveness and accuracy of body part detection, the distance from the ground truth (GT) was calculated with the help of the following equation:

Here, $J$ is the GT and $I$ is the location of the detected body part. A threshold of 15 was set to identify accuracy between the detected data and the GT data. With the help of Equation (13), the percentage of the detected parts that lie within the threshold range of the labeled dataset was determined.
In Table 1, column 2 is the distance from the ground truth and column 3 shows body part detection accuracy over the UCF sports action dataset.
Observations: In Table 1, it can be observed that the head and feet were more reliably identified by the proposed system, because the head was often at the top of the silhouette and the feet were at the bottom. These parts of the body were easier to detect than other parts such as the hips and knees, which are in more complex relationships with the other body parts, especially when in motion.
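The evaluation described above can be illustrated with a short sketch that measures the distance between each detected body part and its ground truth location and reports the percentage of detections within the threshold. It assumes that the distance is Euclidean and measured in pixels and that a detection exactly at the threshold counts as correct; neither assumption is confirmed by the text, and the function name is hypothetical.

```python
import math


def detection_accuracy(detected, ground_truth, threshold=15.0):
    """Mean distance to ground truth and percentage of parts within threshold.

    `detected` and `ground_truth` are equally long lists of (x, y) points.
    Euclidean distance and an inclusive threshold are assumptions.
    """
    distances = [
        math.hypot(dx - gx, dy - gy)
        for (dx, dy), (gx, gy) in zip(detected, ground_truth)
    ]
    hits = sum(1 for d in distances if d <= threshold)
    mean_distance = sum(distances) / len(distances)
    return mean_distance, 100.0 * hits / len(distances)
```

Both values correspond to the two quantities reported per body part, namely the distance from the ground truth and the detection accuracy.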

Experimentation II: Activity Recognition Accuracies
For calculating action recognition accuracies, the proposed system was evaluated with the leave-one-out (LOO) cross-validation method for training and testing data. Table 2 presents the confusion matrix of PAR over the UCF YouTube action dataset and Table 3 presents the recognition accuracies for the IM-DailyRGBEvents dataset.
Observations: In Table 2, it is observed that a few activities such as walking and G-swing lowered accuracy due to similarities in patterns with other activities. However, the overall confusion matrix shows a mean recognition accuracy of 89.09%. In Table 3, the clapping activity shows higher recognition accuracy as it is easily differentiable. On the other hand, recognition accuracy for both hands waving and right hand waving was relatively low due to the similar motions in these activities. The mean recognition accuracy for the IM-DailyRGBEvents dataset was 88.26%.
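The evaluation protocol can be summarized with a small sketch: leave-one-out splitting holds out one sample at a time for testing, and the mean recognition accuracy is the average of the per-class accuracies on the diagonal of the confusion matrix. The function names are hypothetical, the confusion matrix is assumed to have true classes as rows, and whether one sample or one subject is held out is not specified above.

```python
def leave_one_out_splits(n_samples):
    """Yield (training indices, held-out index) for each sample in turn."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out


def mean_class_accuracy(confusion):
    """Mean of per-class accuracies (in percent) from a confusion matrix.

    Rows are assumed to be true classes and columns predicted classes.
    """
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return 100.0 * sum(per_class) / len(per_class)
```

With a confusion matrix of raw counts such as `[[9, 1], [2, 8]]`, the per-class accuracies are 90% and 80%, so the mean is 85%.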

Experimentation III: Comparison of the Proposed System with Well-Known Machine Learning Algorithms
In the third experiment, the results of our proposed system were compared with the results of more commonly used machine learning algorithms. The first algorithm chosen for comparison was the support vector machine (SVM) and the second was the decision tree. The convolutional neural network (CNN) has gained much popularity for body parts detection and activity recognition due to its effectiveness, so we also selected it for comparison. In Figure 11, body parts detection results were compared with common machine learning techniques using the UCF sports action dataset. The proposed method's accuracy was 90.91%, which was better than CNN's 83%, decision tree's 80% and SVM's 78%. Figures 12 and 13 illustrate the activity recognition results for the UCF YouTube action dataset and the IM-DailyRGBEvents dataset, respectively. For the UCF YouTube action dataset, the proposed method's accuracy was 89.09%, which was better than CNN's 83%, decision tree's 79% and SVM's 80%. For the IM-DailyRGBEvents dataset, the proposed method's accuracy was 88.26%, which was better than CNN's 84%, decision tree's 81% and SVM's 77%.

Figure 11. Comparison of body parts detection accuracies with common machine learning algorithms on the UCF sports action dataset.

Figure 12. Comparison of activity recognition accuracies with common machine learning algorithms on the UCF YouTube action dataset.

The IM-DailyRGBEvents activity labels are abbreviated as follows: boxing = BX; clapping = CP; take an object = TO; throwing = TH; reading an article = RA; phone conversation = PC; cleaning = CN; kicking = KG; eating = ET; sitting down = SD; bending = BD; right hand waving = RW; both hands waving = BW; exercise = EX and standing up = SU.

Observations: In Figure 11, it can be observed that our method performs better than the other techniques. The performance of CNN is slightly below our method, but in some cases the detection accuracy of SVM is better than that of decision tree and CNN. Similarly, in the case of activity recognition as shown in Figures 12 and 13, our method has the best overall result. However, in a few cases, e.g., both hands waving in the IM-DailyRGBEvents dataset, the accuracy rates of decision tree and CNN were slightly better than that of our proposed technique. Similarly, for reading an article in the IM-DailyRGBEvents dataset, the accuracy rate of our proposed system was a little below that of CNN. In conclusion, the overall accuracies of our proposed system for body part detection as well as for action recognition were satisfactory.

Experimentation IV: Comparison of our Proposed System with State-of-the-Art Techniques
Table 4 compares the body parts detection accuracy of the proposed multidimensional features method with other state-of-the-art methods using the UCF sports action dataset. It was observed that the proposed method achieved a better detection accuracy rate of 90.91% compared to the others.

Table 4. Comparison of the proposed body parts detection accuracies method with other state-of-the-art methods.

Methods                            UCF Sports Actions dataset (%)
Physical Sports Movements [27]     86.67
HOIRM feature fusion [28]          88.25
Hybrid deep learning model [29]    89.01
Proposed method                    90.91

A comparison of overall results shows that the proposed method achieved a significant improvement over other methods, with recognition results as high as 89.09% and 88.26%, as shown in Table 5.

Table 5. Result comparison of the state-of-the-art methods with proposed physical activity recognition (PAR) method.

Conclusions
We proposed a novel technique that combines multidimensional features with MEMM to detect daily life-log activities in smart indoor/outdoor environments. These features were extracted by robust body part models having 12 tracked key points, with an overall body part detection accuracy of 90.91%. The QDA and Markov models were then used for optimal discrimination and efficient classification of the extracted features. Experimental results showed recognition accuracies of 89.09% for the UCF YouTube action dataset and 88.26% for the IM-DailyRGBEvents dataset, demonstrating that MEMM can be used for successful recognition modelling. In future work, we will apply our work to local hospital, fitness gymnasium and kindergarten environments to increase the experimental datasets and make the proposed model more universally applicable.