Depth-Camera Based Energy Expenditure Estimation System for Physical Activity Using Posture Classification Algorithm

Insufficient physical activity is common in modern society. By estimating the energy expenditure (EE) of different physical activities, people can develop suitable exercise plans to improve their quality of life. However, several limitations still exist in the related works. Therefore, the aim of this study is to propose an accurate EE estimation model based on depth camera data with physical activity classification to overcome the limitations of previous research. To determine the best location and number of cameras for EE estimation, three depth cameras were set at three locations, namely the side, rear side, and rear views, to obtain the kinematic data for EE estimation. A support vector machine was used for physical activity classification. Three EE estimation models, namely linear regression, multilayer perceptron (MLP), and convolutional neural network (CNN) models, were compared to determine the model with optimal performance in different experimental settings. The results show that if only one depth camera is available, optimal EE estimation can be obtained using the side view and the MLP model. The mean absolute error (MAE), mean square error (MSE), and root MSE (RMSE) of the estimation results under the aforementioned settings were 0.55, 0.66, and 0.81, respectively. If higher accuracy is required, two depth cameras can be set at the side and rear views, the CNN model can be used for light-to-moderate activities, and the MLP model can be used for vigorous activities. The RMSEs for estimating the EEs of standing, walking, and running were 0.19, 0.57, and 0.96, respectively. By applying different models to different numbers of cameras, optimal performance can be obtained; this is also the first study to discuss this issue.


Introduction
Sedentary lifestyles are common in modern society. Insufficient physical activity increases the risk of noncommunicable diseases (NCDs) [1], such as cardiovascular diseases, respiratory diseases, cancers, stroke, and diabetes. Studies have indicated that NCDs account for more than 70% of global deaths [2]. To address the problem of insufficient physical activity, physical activities must be quantified for the design of appropriate exercise plans. Energy expenditure (EE) is one factor used for quantifying physical activities. However, some elderly people are unable to go outside to meet the basic daily requirements of physical activity because of health issues. For elderly people who can only stay at home, it is necessary to estimate the EE of their indoor activities and ensure that they meet the basic requirement of physical activity. Conventionally, EE is measured using a portable metabolic analyzer.
The proposed study can be regarded as an extension of the study of Lin et al. Its motivation is to propose a Kinect-based EE estimation system that solves the problems and limitations associated with the previous research of Lin et al. This study investigated whether applying a physical activity classification algorithm to the EE estimation system can improve the accuracy of EE estimation, which has not been confirmed in previous studies. Moreover, the proposed study investigated the best model for each type of physical activity under different camera settings. The study adopted a physical activity recognition algorithm to classify the postures captured by Kinect depth cameras into three activities, namely standing, walking, and running. Only 18 velocity-based features were used for EE estimation in this study to lower the computational complexity of feature extraction and selection. Three types of models were adopted to develop an independent model for each activity, and the model with the best performance for each activity was selected. The hypothesis was that using the posture classification algorithm can improve the accuracy of EE estimation and yield the optimal model for each physical activity under different camera settings.

Figure 1 depicts the system architecture used in this study. The system performs five tasks: data acquisition, data preprocessing, feature extraction, physical activity classification, and EE estimation. The performance of three models, namely linear regression (LR), multilayer perceptron (MLP), and convolutional neural network (CNN) models, for estimating the EE of each activity was compared. After physical activity classification, the independent models for activity prediction were used for training and EE estimation.

Methodology
Lin et al. [12] proposed an EE estimation system by using Kinect depth cameras. The cameras were installed at three locations, namely the rear view, rear side view, and side view. Moreover, five regression-based models were compared to determine the model with the highest accuracy. Lin et al. estimated the EEs of only moderate-to-vigorous activities. Light activities, such as standing, were not considered; however, standing is an everyday physical activity. Another limitation of the aforementioned study was the high computational time required because of the complicated feature extraction process.

Data Acquisition
The dataset used in this study was obtained from a study conducted in 2019 [12]. Three XBOX 360 Kinect cameras (Microsoft Corp., Redmond, WA, USA) were placed along three directions, namely the side, rear side, and rear view directions. The positions of the Kinect cameras are illustrated in Figure 2. Each Kinect camera was installed and fixed to a tripod 0.9 m high. The distance between each camera and the participant was set to approximately 2 m, which ensured that the Kinect cameras could capture the participant's entire body. The sampling frequency of Kinect is 30 Hz.

During the experiments, participants were asked to wear a portable metabolic analyzer (K4b2, COSMED, Rome, Italy) to record the pulmonary gas exchange breath by breath with an accuracy of ±0.02% O2 and ±0.01% CO2.
A total of 21 subjects (10 men and 11 women) were recruited for the experiment. Their gender, ages, weights, and body fat ratios are listed in Table 1. The mean and standard deviation (SD) of the age of the participants is 21.90 ± 1.55 years old. The mean and SD of the weight is 60.20 ± 7.60 kg. The mean and SD of the body fat rate is 20.69 ± 7.37%. People with infectious or chronic diseases were excluded from this study. All participants provided informed consent before participating in the experiments. The experimental procedures used in this study were approved by the Institutional Review Board (IRB) of Cathay General Hospital, Taipei, Taiwan (IRB code: CGH-NTPU105001).
Experimental details for each activity are listed in Table 2. Various physical activities, including standing, walking, and running at various speeds, were included in the experiments. Standing, walking, and running are regarded as the light, moderate, and vigorous activities in this study. The walking speeds considered were 4.8, 5.6, and 6.4 km/h, and the running speeds considered were 8.0 and 8.3 km/h; all speeds were set on the treadmill. Participants performed each activity for 5 min on a treadmill. In the pilot study, a treadmill was used instead of free-living conditions because a stable movement speed can be maintained with the treadmill. The resting times between light-to-moderate activities and between moderate-to-vigorous activities were 5 and 10 min, respectively. The skeleton sequential data were collected with Kinect cameras while the participants were performing the physical activity on the treadmill. The first 3 min of each activity represented a non-steady state, and data from this period were discarded in this study [14]. Two minutes of steady-state data were used for physical activity classification and EE prediction.
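The trimming of each 5-min recording to its 2-min steady-state segment can be sketched as follows (a minimal sketch assuming continuous capture at the stated 30 Hz, with no dropped frames):

```python
import numpy as np

FS = 30  # Kinect sampling frequency (Hz), as stated in the text

def steady_state_segment(frames: np.ndarray) -> np.ndarray:
    """Discard the first 3 min (non-steady state) of a 5-min recording
    and keep the following 2 min of steady-state frames."""
    start = 3 * 60 * FS          # 5400 frames discarded
    stop = start + 2 * 60 * FS   # 3600 frames kept
    return frames[start:stop]

# A 5-min recording: 9000 frames, each with 6 joints x 3 coordinates.
recording = np.zeros((5 * 60 * FS, 6, 3))
segment = steady_state_segment(recording)
print(segment.shape[0])  # 3600 frames = 2 min at 30 Hz
```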
To develop an EE estimation model, the expenditure data for each physical activity should be collected as the ground truth. In this study, the metabolic equivalent of task (MET) was regarded as the measurement unit. The MET is calculated as the ratio of the rate of energy expended during physical activity to the mass of a person. This ratio is used by many aerobic training organizations to estimate exercise intensity and EE.

Data Preprocessing
Data preprocessing can be divided into two parts: coordinate system transformation and noise reduction. Coordinate system transformation transforms the reference coordinate of each skeletal value to the body center. The initial reference coordinate of the skeletal values was the center of a Kinect camera, but varying distances between the participant and the camera would cause problems for action recognition. For example, when the participants were walking or running on the treadmill, their bodies moved back and forth, which caused unstable variations in the skeletal data. According to a previous study, the reference coordinate should be translated from the Kinect camera to the origin of the body center, which is the shoulder center [15]. In this study, the reference coordinate was therefore set to the shoulder center, as the previous study had suggested. The coordinates of the other joints were transformed by using the relative distance between each joint and the shoulder center. The coordinate system transformation can be expressed as follows:

$$(x_i, y_i, z_i) = (x_i^c - x_r, \; y_i^c - y_r, \; z_i^c - z_r), \quad i = 1, \ldots, j, \tag{1}$$

where $(x_r, y_r, z_r)$ is the coordinate of the shoulder center, $(x_i^c, y_i^c, z_i^c)$ is the raw coordinate of joint $i$ in the camera frame, $i = 1, \ldots, j$ is the index of the joint, $j$ is the total number of selected joints (six, as explained in the following section), and $(x_i, y_i, z_i)$ denotes the skeleton coordinates after transformation.

Kinect cameras are sensitive to backgrounds, light, and surrounding objects, so noise occurs during skeletal tracking [16]. Therefore, a moving average filter, which averages the coordinates of the two previous and two subsequent frames together with the current frame [17], was used to remove the noise:

$$(x_k^*, y_k^*, z_k^*) = \frac{1}{5} \sum_{n=k-2}^{k+2} (x_n, y_n, z_n), \tag{2}$$

where $k$ represents the current frame, $k-2$ to $k+2$ represent the two previous and two subsequent frames, and $(x_k^*, y_k^*, z_k^*)$ is the new skeleton coordinate after processing with the moving average filter.
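The two preprocessing steps — translating joints to the shoulder-center origin and smoothing with a five-point moving average — can be sketched in numpy (edge handling at the start and end of a track is my own choice; the paper does not specify it):

```python
import numpy as np

def to_body_frame(joints: np.ndarray, shoulder_center: np.ndarray) -> np.ndarray:
    """Translate joint coordinates so the shoulder center becomes the
    origin of the reference frame."""
    return joints - shoulder_center

def moving_average(track: np.ndarray) -> np.ndarray:
    """Five-point moving average over frames: each frame is replaced by
    the mean of itself, the two previous, and the two subsequent frames;
    edge frames use whatever neighbors exist."""
    smoothed = np.empty_like(track, dtype=float)
    for k in range(len(track)):
        lo, hi = max(0, k - 2), min(len(track), k + 3)
        smoothed[k] = track[lo:hi].mean(axis=0)
    return smoothed

# Two joints in one frame; shoulder center at (1, 2, 3) in camera space.
joints = np.array([[1.0, 2.0, 3.0], [1.5, 1.0, 3.2]])
centered = to_body_frame(joints, np.array([1.0, 2.0, 3.0]))
print(centered[0])  # [0. 0. 0.] -- the shoulder center maps to the origin

# A constant 6-frame track is unchanged by the smoothing filter.
track = np.tile(centered[1], (6, 1))
print(np.allclose(moving_average(track), centered[1]))  # True
```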

Feature Extraction
Because the speed of each activity varies, the mean velocity between every two consecutive frames might be an important factor for discriminating the activities. The velocity of a joint is calculated from the difference in position between two consecutive frames [18]:

$$v_k^x = \frac{x_{k+1}^* - x_k^*}{\Delta t}, \quad v_k^y = \frac{y_{k+1}^* - y_k^*}{\Delta t}, \quad v_k^z = \frac{z_{k+1}^* - z_k^*}{\Delta t}, \tag{3}$$

where $\Delta t = 1/30$ s is the frame interval. $v_k^x$, $v_k^y$, and $v_k^z$ denote the velocities in the $x$, $y$, and $z$ directions, and their means were used to train the models. The accuracy of the joint positions obtained from Kinect was discussed and validated in a previous study, which indicates that the data used in this study are reliable [19].
According to the related research [20], treadmill workouts, such as standing, walking, and running, involve periodic and symmetric movements. Thus, to reduce the computational complexity, only half of the joints in the body were used to analyze physical activities. The six joints selected in this study were the shoulder, elbow, wrist, hip, knee, and ankle on the left side (red circles in Figure 3). The six selected joints were imaged in three directions simultaneously by using the three cameras. Three-dimensional data were obtained for each joint. Therefore, 18 velocities were obtained as features for further analysis.
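The construction of the 18 velocity features (6 joints × 3 axes) can be sketched as follows; whether the per-frame velocities are averaged as signed values or as magnitudes is not stated in the paper, so the use of absolute differences here is an assumption:

```python
import numpy as np

FS = 30  # Kinect sampling frequency (Hz)

def velocity_features(track: np.ndarray) -> np.ndarray:
    """Mean per-axis velocity for each joint.

    `track` has shape (frames, 6 joints, 3 axes). The per-frame
    velocity is the positional difference between consecutive frames
    divided by the frame interval 1/FS; averaging over frames and
    flattening the 6 x 3 grid yields 18 features.
    """
    per_frame = np.abs(np.diff(track, axis=0)) * FS  # (frames-1, 6, 3)
    return per_frame.mean(axis=0).reshape(-1)        # 18 features

track = np.random.default_rng(0).normal(size=(60, 6, 3))
print(velocity_features(track).shape)  # (18,)
```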
Sensors 2021, 21, 4216

Principal component (PC) analysis (PCA) is used to reduce feature dimensionality. In this study, PCs that cumulatively accounted for more than 90% of the overall variance were retained [21]. The numbers of feature subsets from the side, rear side, and rear views were 10, 9, and 7, respectively. The cumulative explained variance of the retained PCs is plotted in Figure 4.

Physical Activity Classification
To improve the accuracy of EE estimation, physical activities were classified as standing, walking, or running. An SVM algorithm was adopted to classify the physical activities [22]. SVM is a widely used method for solving classification problems. It maps the raw data from a lower-dimensional to a higher-dimensional space by using kernel functions. After this mapping, the optimal hyperplane that separates the data from different classes with the maximum margin can be found. The advantage of SVM is that it can solve linearly inseparable problems after projecting the raw data to a higher dimension. In this study, three SVM classifiers were built to classify the physical activities into three classes with the one-against-all approach, and the radial basis function was used as the kernel function.
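The one-against-all scheme with an RBF kernel can be sketched with scikit-learn (an assumption — the paper does not name its SVM implementation); the synthetic clusters below stand in for the 18-dimensional velocity features of the three activities:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic, well-separated stand-ins for the 18 velocity features.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(30, 18))
               for m in (0.0, 2.0, 4.0)])
y = np.repeat(["standing", "walking", "running"], 30)

# One binary RBF-kernel SVM per activity, each separating that class
# from the rest, as in the one-against-all approach described above.
clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
print(clf.predict(np.full((1, 18), 4.0))[0])  # running
```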

EE Prediction
To obtain the optimal model for estimating the EE during physical activities, the estimation results obtained using three models, namely LR, MLP, and CNN, were compared. These three models were tested because they are among the most common models in the statistical, machine learning, and deep learning fields, respectively. The LR model can be used to determine whether the ground truth and estimated output are highly correlated. The MLP model is a type of feedforward artificial neural network, which comprises an input layer, an output layer, and several hidden layers [23]. The MLP model used in this study contained five layers: one input layer, three hidden layers, and one output layer, as presented in Figure 5. Different numbers of nodes n were tested to obtain the optimal number of neurons in the hidden layers and achieve optimal regression results. The logistic function was used as the activation function, and the mean square error (MSE) was used to evaluate the performance for each number of nodes. The number of nodes varied from 10 to 90 in increments of 10 neurons. CNN is widely used to recognize objects and patterns in images [24]. In this study, a CNN was used to automatically extract the features of the kinematic data.
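The MLP node sweep described above can be sketched with scikit-learn's MLPRegressor (an assumption — the paper does not name its implementation): three logistic hidden layers of n neurons each, with n swept from 10 to 90 and MSE as the selection criterion. The data are synthetic stand-ins, and for brevity the selection here uses training MSE rather than a held-out set:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-ins for the 18 velocity features and the MET target.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 18))
y = X @ rng.normal(size=18)

# Sweep the per-layer width n over 10..90 in steps of 10.
best_n, best_mse = None, np.inf
for n in range(10, 100, 10):
    mlp = MLPRegressor(hidden_layer_sizes=(n, n, n), activation="logistic",
                       max_iter=300, random_state=0).fit(X, y)
    fit_mse = np.mean((mlp.predict(X) - y) ** 2)
    if fit_mse < best_mse:
        best_n, best_mse = n, fit_mse
print(best_n in range(10, 100, 10))  # True
```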
The structure of the constructed CNN model is depicted in Figure 6. The CNN model used in this study comprised one input layer, two convolutional layers, two max-pooling layers, and one dense layer. The kernel size of each convolutional layer was 3 × 3. The rectified linear unit activation function was used in each convolutional layer [25]. The 18 velocity values were fed to the model as an 18 × 1 input layer. The two convolutional layers were used to detect the features of the input data and create the feature maps automatically. The fully connected layers comprised two hidden layers with 16 hidden units per layer. The final output layer outputs the estimated MET.
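One convolution/ReLU/max-pooling stage can be illustrated with a minimal numpy forward pass. This is an illustrative sketch, not the trained model; treating the 3 × 3 kernel as a length-3 1-D filter on the 18 × 1 input is an assumption about the implementation:

```python
import numpy as np

def conv1d(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Valid 1-D convolution (cross-correlation) with a short kernel."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

def maxpool(x: np.ndarray, size: int = 2) -> np.ndarray:
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

# Forward pass through one conv/ReLU/pool stage for an 18 x 1 input.
x = np.linspace(-1.0, 1.0, 18)      # stand-in velocity vector
w = np.array([0.5, -1.0, 0.5])      # one hypothetical 3-tap kernel
out = maxpool(relu(conv1d(x, w)))
print(out.shape)  # (8,)
```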
Figure 6. Structure of the constructed CNN model.
To evaluate the performance of EE estimation and compare it with that reported in related research, three indicators used in the related research [5-7,10,12], namely the mean absolute error (MAE), MSE, and root MSE (RMSE), were adopted:

$$\text{MAE} = \frac{1}{N}\sum_{k=1}^{N} \left| \hat{Y}_k - Y_k \right|, \tag{4}$$

$$\text{MSE} = \frac{1}{N}\sum_{k=1}^{N} \left( \hat{Y}_k - Y_k \right)^2, \tag{5}$$

$$\text{RMSE} = \sqrt{\text{MSE}}, \tag{6}$$

where $\hat{Y}_k$ denotes the estimated EE from each model, $Y_k$ denotes the actual EE from the K4b2 portable calorimetry system, $k = 1, \ldots, N$ is the index of the data point, and $N$ is the number of data points. The unit for the MAE and RMSE is the MET. The ranges of the MAE, MSE, and RMSE are $[0, \infty)$. Ten-fold cross-validation was used to evaluate the accuracy of the different EE predictive models [26]. The final accuracy of each model was the average accuracy obtained through 10-fold cross-validation.
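The three indicators reduce to a few lines of numpy; a worked example with hand-checkable values follows, together with a sketch of how 10-fold indices partition the data:

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # actual MET values
y_pred = np.array([1.5, 2.0, 2.5, 4.0])   # estimated MET values
print(mae(y_true, y_pred))   # 0.25
print(mse(y_true, y_pred))   # 0.125
print(rmse(y_true, y_pred))  # ~0.3536

# 10-fold split sketch: indices are partitioned into 10 held-out folds.
folds = np.array_split(np.arange(40), 10)
print(sum(len(f) for f in folds))  # 40
```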

Results
For the MLP model, different numbers of nodes were tested to obtain the optimal model for the further comparisons. Figure 7 indicates that the lowest MSE was achieved when 70 neurons per hidden layer were used. The MLP with three hidden layers of 70 neurons per layer was therefore adopted for the further comparisons.

The performance evaluation of the models was divided into three parts. The EE estimation performance of (i) a general model for all physical activities, (ii) the physical activity classification method, and (iii) independent models for different physical activities was evaluated.

EE Estimation Performance of the General Model
A general model was built for estimating the energy expenditure of all physical activities for two purposes. The first is to enable comparison with the independent models based on physical activity classification. The second is that, if only one Kinect camera is available, the general model at the optimal camera position can be used.
Data for different physical activities were used to evaluate the EE estimation performance of the LR, MLP, and CNN models. The performance of the models with and without PCA was also examined to evaluate whether PCA improves the performance of the general model. The performance of the models with and without PCA is presented in Tables 3-5. According to Tables 3-5, when data from all physical activities were used, the models without PCA outperformed those with PCA. The MLP and CNN models exhibited smaller errors than did the LR model. Moreover, the smallest estimation error was obtained when a Kinect camera was set along the side view and the MLP model without PCA was adopted. The MAE, MSE, and RMSE under the aforementioned settings were 0.55, 0.66, and 0.81, respectively.

Accuracy of Physical Activity Classification
Because the EE estimation method proposed in this study is mainly based on physical activity classification, the accuracy of physical activity classification must be examined. The performance of physical activity classification with and without PCA is presented in Table 6. The performance differences in physical activity classification with and without PCA were nonsignificant. Moreover, the accuracy of physical activity classification without PCA was higher than that with PCA. Accuracy without PCA could reach 99.55%.
Therefore, PCA was not adopted for further comparisons of the models and experimental settings for different activities. Tables 7-9 indicate the EE estimation performance with physical activity classification. The smallest error in the EE estimation for standing was achieved when a Kinect camera was installed along the rear view and the CNN model was used. The MAE, MSE, and RMSE of the EE estimation for standing were 0.15, 0.04, and 0.19, respectively. The smallest error in the EE estimation for walking was achieved when a Kinect camera was set along the side view and the CNN model was used. The MAE, MSE, and RMSE of the EE estimation for walking were 0.45, 0.33, and 0.57, respectively. The smallest error in the EE estimation for running was obtained when a Kinect camera was set along the rear view and the MLP model was used. The MAE, MSE, and RMSE of the EE estimation for running were 0.66, 0.94, and 0.96, respectively.

Discussion
In the proposed method, physical activity classification is performed before the EE estimation model is developed. To compare the performance across different models and experimental settings, the performance evaluation was divided into three parts. First, the performance of a general model for all physical activities was investigated. According to the results in Tables 3-5, the smallest error in EE was obtained when the Kinect camera was set along the side view and the MLP model was used. The MAE, MSE, and RMSE in the EE under the aforementioned settings were 0.55, 0.66, and 0.81, respectively. Thus, when only one Kinect camera is available, optimal EE estimation performance is obtained when the camera is set along the side view and the MLP model is used. For an individual who weighs 60 kg and exercises for 1 h, the error between the EE estimated with the general model and the actual EE is 33 kcal.
Second, the accuracy of physical activity classification was examined with and without PCA. The classification accuracies with and without PCA were examined because the primary aim of this study was to determine whether physical activity classification improves EE estimation accuracy; evaluating the accuracy of physical activity classification was therefore crucial. As presented in Table 6, the classification accuracy along the side view was 99.55% without PCA and 98.49% with PCA. Therefore, superior classification performance was obtained without PCA. A possible reason for this finding is the loss of velocity characteristics after PCA, because PCA projects the data from a high-dimensional to a low-dimensional space. Although the accuracy with PCA was lower in this study, PCA may still be helpful when larger datasets are collected in future work: a larger dataset increases the computational cost of the model, and PCA can reduce the complexity of the classification problem and make the predictive model more stable.
Third, the performance of EE estimation with physical activity classification was investigated. Tables 7-9 present the EE estimation performance of different models under different experimental settings for three physical activities. The results indicated that the CNN model provided the highest EE estimation accuracy for light and moderate-intensity activities, such as standing and walking. The MLP model exhibited the highest EE estimation accuracy for vigorous activities such as running. The optimal setting for achieving optimal EE estimation performance with physical activity classification is illustrated in Figure 8. The optimal setting involves two Kinect cameras along the side and rear views as well as a combination of MLP and CNN models. After physical activity classification, if the physical activity is classified as standing, walking, or running, then the EE is estimated using the CNN model for standing, CNN model for walking, or MLP model for running, respectively. The combination of various models reduced computational complexity because fewer data were required for independent model training than for general model training.
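The optimal per-activity routing described above (CNN for standing and walking, MLP for running) amounts to a simple dispatch step after classification. A minimal sketch follows; `_ConstModel` and the `models` dictionary are hypothetical stand-ins for the trained regressors:

```python
class _ConstModel:
    """Hypothetical stand-in for a trained EE regressor with .predict()."""
    def __init__(self, value: float):
        self.value = value

    def predict(self, features) -> float:
        return self.value

# Optimal model per activity, per the setting found in this study.
OPTIMAL_MODEL = {"standing": "CNN", "walking": "CNN", "running": "MLP"}

def estimate_ee(activity: str, features, models: dict) -> float:
    """Route the classified activity to its best-performing EE model."""
    return models[OPTIMAL_MODEL[activity]].predict(features)

models = {"CNN": _ConstModel(3.0), "MLP": _ConstModel(8.0)}
print(estimate_ee("running", None, models))  # 8.0
```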
The data from the rear side view did not provide optimal results for any experimental setting, possibly because the data provided by the camera on the rear side view have already been projected on the other two planes. Therefore, the skeletal data from the Kinect cameras along only the side and rear views should be used in future studies. Moreover, the results show that applying physical activity classification improves EE estimation performance for light-to-moderate activities, such as standing and walking, but does not improve the performance for vigorous activities. Table 10 presents a comprehensive comparison of the proposed method with the methods presented in related studies. Several criteria, namely sensors, accuracy, indoor/outdoor usage, battery-life limitations, and the requirement of wearable devices, are compared among the listed studies in Table 10. The proposed method can estimate the EE of light-to-vigorous activities. Only Kinect cameras were used in this study. Of all the compared methods, only the proposed method does not require personal data, such as weight, heart rate, and gender.
The proposed method does not have the problem of battery life but can be used for indoor activities only. None of the compared camera-based systems require wearable devices. The accuracy of the proposed general model is the same as that of the system proposed by Cvetković et al. [5] but lower than that of the system proposed by Lin et al. [12]. However, in the EE estimation for standing and walking, the proposed system outperforms that of Lin et al. This result is acceptable because standing and walking account for most of the exercising time. The proposed method uses only 18 velocity-based features for EE estimation; thus, its computational complexity is lower than that of the other methods. This study has certain limitations. First, not all models considered in this study were fine-tuned to their optimal parameters; only the number of neurons in the hidden layers of the MLP model was tuned. Second, two Kinect cameras are still required to obtain the best EE prediction performance, whereas a practical solution requiring only one Kinect camera would be more appropriate.
To address the aforementioned limitations, the optimal parameters for all models will be investigated in future studies. Deep learning approaches for sequential data, such as recurrent neural networks [27,28] and long short-term memory networks [29], will also be considered for EE estimation. Moreover, a robust model using only one Kinect camera is necessary for more practical usage. Approaches for converting the coordinates of a Kinect camera from one viewing direction to another will be considered to improve the performance and efficiency of the EE estimation model for any direction.
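The coordinate-conversion idea mentioned above can be illustrated with a rigid rotation of skeleton joint coordinates about the vertical axis; the 90° angle (for a side-to-rear conversion) and the axis convention are assumptions for illustration, not calibrated values from the study.

```python
# Sketch: converting (N, 3) Kinect joint coordinates between viewing
# directions via a rotation about the vertical (y) axis.
import numpy as np

def rotate_about_y(joints: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rotate an (N, 3) array of joint coordinates about the y axis."""
    t = np.deg2rad(angle_deg)
    R = np.array([[ np.cos(t), 0.0, np.sin(t)],
                  [ 0.0,       1.0, 0.0      ],
                  [-np.sin(t), 0.0, np.cos(t)]])
    return joints @ R.T

joints_side = np.array([[1.0, 0.5, 2.0]])        # one joint in the side-view frame
joints_rear = rotate_about_y(joints_side, 90.0)  # assumed 90-degree side-to-rear rotation
```

A real conversion would also need the cameras' relative translation and a calibrated angle; this sketch covers only the rotational part.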

Conclusions
This paper proposed an EE estimation system based on physical activity classification. Three depth cameras were set at three locations, namely the side, rear side, and rear views, to obtain the kinematic data and test the performance of EE estimation. Three EE estimation models, namely LR, MLP, and CNN models, were compared to determine the model with the optimal performance in varying experimental settings.
The experimental results indicated that when only one camera is available, the optimal solution involves setting the camera along the side view and estimating EE with the MLP model. The MAE, MSE, and RMSE of the EE estimation under the aforementioned settings were 0.55, 0.66, and 0.81, respectively. With physical activity classification, the highest EE estimation accuracy for light-to-moderate activities, such as standing and walking, was obtained when Kinect cameras were set along the side and rear views and the CNN model was used. The highest EE estimation accuracy for vigorous activities, such as running, was obtained when Kinect cameras were set along the side and rear views and the MLP model was used. The RMSEs for estimating the EEs of standing, walking, and running were 0.19, 0.57, and 0.96, respectively. The camera on the rear side view did not yield optimal performance in any of the experimental settings in this study. The results indicated that after applying physical activity classification, the performance of estimating the EE of standing and walking was improved compared with that of the general model; however, the performance for running was not improved. By using the proposed system, the EE of physical activities can be accurately estimated.
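For reference, the three reported error metrics can be computed as follows; the EE values shown are illustrative numbers, not the study's measurements.

```python
# Sketch: MAE, MSE, and RMSE between predicted and ground-truth EE values.
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmse(y_true, y_pred):
    return mse(y_true, y_pred) ** 0.5

# Illustrative EE values (e.g., in METs), not data from the study
y_true = [2.0, 3.5, 8.0]
y_pred = [2.5, 3.0, 9.0]
errors = (mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```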
The main innovation of this research is that this is the first study to discuss the performance of EE estimation in different experimental settings, including one and two cameras,