Ensemble Learning for Skeleton-Based Body Mass Index Classification

Abstract: In this study, we performed skeleton-based body mass index (BMI) classification by developing a unique ensemble learning method for human healthcare. Traditionally, anthropometric features, including the average length of each body part and the average height, have been utilized for this kind of classification. Because the length of body parts and the subject's height vary over time as a result of the inaccuracy in pose estimation, average values are generally calculated over all frames; thus, traditionally, anthropometric features are measured over a long period. In contrast, we control the window used to measure anthropometric features over short/mid/long-term periods. This approach enables our proposed ensemble model to obtain robust and accurate BMI classification results. To produce the final results, the proposed ensemble model utilizes multiple k-nearest neighbor classifiers trained using anthropometric features measured over several different time periods. To verify the effectiveness of the proposed model, we evaluated it using a public dataset. The simulation results demonstrate that the proposed model achieves state-of-the-art performance compared with benchmark methods.


Introduction
Over the past several decades, the percentage of the population that is obese has steadily increased. Since obesity can cause various diseases, its prevalence has become a major social issue. To identify its adverse effects, many studies have investigated the relationship between obesity and various diseases. According to the results presented in the literature, obesity is related to several diseases, such as diabetes [1], high blood pressure [2], hyperlipidemia [3], cholelithiasis [4], hypopnea [5], arthritis [6] and mental disorders [7], which explains the relatively high mortality rate among obese individuals.
The World Health Organization (WHO) defines obesity as excessive fat accumulation that may threaten health. Additionally, the WHO uses the body mass index (BMI) to identify obesity in people. The BMI is one of the factors used to estimate obesity levels and is defined as the ratio of a person's body weight in kilograms to the square of their height in meters (i.e., the unit of the BMI is kg/m²). In general, a person is classified as "overweight" when their BMI is greater than 25.
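As a minimal illustration (the 18.5 underweight cut-off is the commonly cited WHO threshold, an assumption added beyond this paragraph), the BMI computation and the three-class grouping used later in this paper can be sketched as:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """BMI = body weight (kg) divided by the square of height (m)."""
    return weight_kg / height_m ** 2

def bmi_class(b: float) -> str:
    """Three-class grouping; 18.5 and 25 are the commonly cited WHO cut-offs."""
    if b < 18.5:
        return "underweight"
    if b < 25.0:
        return "normal weight"
    return "overweight"
```

For example, a person weighing 80 kg at 1.80 m has a BMI of about 24.7 and falls into the normal weight class.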
Traditionally, if people want to know their BMI, they must first measure their body weight and height using a weight scale and stadiometer, respectively. Their BMI can then be calculated according to the definition above. Without devices to measure body weight and height, the BMI cannot be calculated. To make the acquisition of BMIs simpler, i.e., possible without such devices, many researchers in the field of computer vision have explored various methods of BMI prediction/classification. Most have focused on calculating the BMI of an individual from a facial image [8][9][10]. In the literature, face image databases built under controlled conditions, such as a frontal view, clean background and alignment according to interpupillary distance, have been used. However, in real-world situations, facial images are generally captured in uncontrolled conditions. For example, the facial image of a person can be taken from a side view as well as a frontal view and can have a complex background. In addition, in some cases, a person may wear a mask that conceals parts of their face. In particular, since most of the features used in [8][9][10] can be extracted only from front-view face images without occlusions, the feasibility and effectiveness of these methods for real-world applications still need to be demonstrated.
Some studies on BMI classification have been conducted using human full-body images [11][12][13]. In these studies, each individual was asked to stand up straight while their full-body image was taken. In addition, the image was captured from a frontal view. Most of the features proposed in [11][12][13] can be obtained from only frontal full-body images. Therefore, in a real-world situation, if the methods proposed in [11][12][13] are used for BMI classification, it is necessary to have the cooperation of an individual to enable the extraction of features from their frontal full-body images.
Recently, based on advancements in human pose estimation techniques, the acquisition of three-dimensional (3D) human motion data has become quicker, simpler and more precise [14][15][16][17]. Therefore, many studies have exploited human motion data for various applications; human action recognition [18][19][20][21], animation generation [22][23][24][25] and gait recognition [26,27] are prime examples of such research. Compared with conventional facial and full-body image-based BMI classification, skeleton-based BMI classification has the advantage of requiring no cooperation from the user, who is only asked to walk in their usual way. Because the user's gait data can be obtained from video filmed from various viewpoints as well as a frontal view, the BMI classification process can be performed in such a way that the user is unaware of it and does not feel uncomfortable. However, to the best of our knowledge, there has only been one study [28] to date in which a method for classifying BMIs using 3D human motion sequences has been investigated. According to the results presented in [28], the classification accuracy reached only 40%. Motivated by these results, in this study, we investigate a method that improves the classification accuracy of BMI. A block diagram of the proposed BMI classification method is presented in Figure 1.
The main contributions of this study can be summarized as follows.
• We determine why the BMI classification accuracy in [28] was poor. Because the dataset considered had a class imbalance problem, the authors of [28] used an undersampling technique to balance the classes. Our investigation determined that the performance degradation stemmed from the use of this technique. Therefore, in terms of performance, it is better to use an oversampling technique than an undersampling technique.

• We demonstrate the usefulness of anthropometric features in the BMI classification problem. During this study, we observed that feature values calculated for each frame are not consistent over time as a result of the inaccuracy in pose estimation. Furthermore, for some motion sequences, a large variance in feature values was observed over specific periods. Therefore, instead of using traditional anthropometric features, we employ anthropometric features measured over several different time periods (i.e., long/mid/short-term periods). The use of these features enables our ensemble model to obtain robust and accurate classification results.

• We propose an ensemble model capable of classifying BMIs from 3D human motion sequences. In the proposed model, multiple k-nearest neighbor (k-NN) classifiers are trained separately using anthropometric features calculated from different time windows. A generalized result is then obtained by integrating the classification results of the individual k-NN classifiers. The evaluation results on the dataset considered in [28] demonstrate that the proposed model achieves state-of-the-art performance.

The remainder of this paper is organized as follows. Section 2 presents a review of related work. The motivation for our study is discussed in Section 3. Our methodology for classifying BMIs from 3D human motion sequences is described in Section 4. The experimental setup is described in Section 5. The results and discussion are presented in Sections 6 and 7, respectively. Conclusions are presented in Section 8.

Related Work
Previous studies on BMI classification have mainly focused on calculating human weight or BMI from facial images [8][9][10]29]. For example, Coetzee et al. [8] demonstrated that three facial features, namely the width-to-height ratio, perimeter-to-area ratio and cheek-to-jaw-width ratio, are highly related to both human weight and BMI. Additionally, they proposed the use of these three features to judge the body weight and BMI of an individual. Motivated by this study, Pham et al. [9] assessed the relationship between these facial features and BMIs for facial images of 911 subjects. They demonstrated that the eye size and average eyebrow height are also related to the BMI. Wen and Guo [10] investigated a method for predicting the BMIs of people from face images. To achieve this goal, they extracted the facial features proposed in [8,9] from passport-style frontal face images with clean backgrounds. They then trained a support vector regression (SVR) model using these features and demonstrated the feasibility of using SVR to predict BMIs. Kocabey et al. [29] explored a method for predicting the BMI using low-quality facial images collected from social media. To achieve this goal, they used the convolutional neural network (CNN) proposed in [30]. They then used the outputs from the sixth fully connected layer of the CNN as features for the BMI prediction. They trained an SVR model using these features and evaluated their model using a publicly available social media face dataset. The evaluation results demonstrated the feasibility of BMI prediction using low-quality facial images.
BMI prediction/classification based on full-body person images has also been studied [11][12][13]. For example, Bipembi et al. [11] proposed a method for obtaining the BMI of an individual from their full-body silhouette image. To achieve this goal, the authors captured frontal full-body images of standing individuals. They then converted these images into silhouette images. Based on a silhouette image, they calculated the area and height of an individual. The area was defined as the number of pixels occupied by a silhouette, and the height was defined as the distance between the highest and lowest pixels in a silhouette. Finally, the BMI was calculated using the extracted values of the area and height. Amador et al. [12] also used human silhouette images for BMI classification. A total of 57 shape features introduced in [31] were extracted from images. Four machine learning algorithms, namely logistic regression, a Bayesian classifier, an artificial neural network and support vector machine (SVM), were tested on their own dataset consisting of 122 subjects. According to their results, the SVM outperformed the other algorithms and achieved an accuracy of approximately 72.16%. Madariaga et al. [13] proposed the calculation of human height from full-body images using the vanishing point and camera height information. For body weight calculation, a sensor called the Wheatstone bridge load cell was used. Using the estimated height and measured body weight information, the BMI was calculated.
Nahavandi et al. [32] investigated a method for estimating BMIs from depth images of people. To achieve this goal, the deep residual network (ResNet) proposed in [33] was used. To train the ResNet, 3D human polygon models were generated using the MakeHuman open-source software. The body surface area (BSA) of each model was calculated. By using the formula BSA = 1/6(body weight × height) 0.5 [34], the body weight of each model was estimated. For each polygon model, a BMI was then obtained using the height and estimated body weight. These were then used as ground truth BMI values to train the ResNet. Additionally, 3D depth images were obtained from 3D human polygon models using a rendering software and encoded using red-green-blue (RGB) colorization. These colorized depth images were then used as training data for the ResNet. The authors tested their proposed method on their own dataset. According to their results, the ResNet achieved an accuracy of 95%.
The concept of using anthropometric features for BMI classification was first proposed by Andersson et al. [28]. For BMI classification, they used 20 anthropometric features, including the average length of each limb and the average height of the 3D human skeleton. To test the performance of their model, they constructed a dataset consisting of 112 subjects. Three classes of "underweight", "normal weight" and "overweight" were used to classify BMIs. However, in their dataset, the number of people categorized as underweight was too small. Most people were categorized as normal weight. It is well known that if a classifier is trained using a dataset with an unequal distribution of classes, that classifier will exhibit poor classification performance for minority classes [35][36][37]. Therefore, in their experiments, Andersson et al. used an undersampling technique that randomly removed some samples from the majority classes to alleviate the imbalance in the dataset. They applied three machine learning algorithms, namely the SVM, k-NN model and multi-layer perceptron (MLP), to the undersampled dataset. According to their results, the BMI classification accuracies of the SVM, k-NN model and MLP were 40%, 25% and 35%, respectively. Because the accuracies of all algorithms were poor, the authors concluded that the anthropometric features were unhelpful for classifying BMIs.

Motivation
According to the results of some studies [38][39][40], undersampling techniques do not provide a good solution to the class imbalance problem because they may remove important and useful data from the majority classes. This loss of data can degrade a classifier's performance for both majority and minority classes. Therefore, the drop in classification accuracy in [28] may stem from the use of undersampling rather than from the unreliability of anthropometric features. To address this issue, we generated a training dataset by employing the synthetic minority oversampling technique (SMOTE) proposed in [41] to overcome the lack of minority samples. Additionally, for comparison, we generated a training dataset by employing the undersampling technique used in [28]. Figure 2 illustrates how the number of samples changes according to the sampling technique used. As shown in Figure 2, the number of samples belonging to the underweight class is significantly lower than the numbers of samples belonging to the other two classes; that is, the original dataset has a class imbalance problem. When the undersampling technique is employed, some samples are removed from the majority classes. As a result, as shown in Figure 2, the normal weight and overweight classes have the same number of samples as the underweight class. Although this alleviates the class imbalance problem, the total number of samples in the undersampled training dataset is significantly reduced compared to that in the original dataset. Instead of removing samples from the majority classes, SMOTE augments the minority classes by generating new synthetic samples. As a result, as shown in Figure 2, the number of samples in each minority class becomes the same as that in the majority class.
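The core idea of SMOTE, interpolating between a minority-class sample and one of its k nearest minority-class neighbours, can be sketched in plain NumPy; this is an illustrative re-implementation, not the reference code of [41]:

```python
import numpy as np

def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic minority samples: pick a sample, pick one of
    its k nearest minority-class neighbours, and interpolate between them."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every minority sample.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Each synthetic sample lies on the line segment between two real minority samples, so the minority class is grown without discarding any majority-class data.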
Using the training dataset generated by SMOTE, we trained three machine learning algorithms: C-SVM, nu-SVM and k-NN. We then tested these algorithms on the testing dataset. It is noteworthy that the performance of average BMI classification for all three schemes improved significantly. k-NN achieved an accuracy of 92.97%, outperforming the other algorithms.
Therefore, we can confirm that anthropometric features are valid for the classification of BMIs. Next, we investigated a method for further improving BMI classification accuracy. To this end, we carefully reviewed the generation process for anthropometric features and measured the effect of each feature on classification. Traditionally, anthropometric features have been calculated using the length of each limb and the height of an individual averaged over the given frames. Figure 3 presents the body height of an individual over multiple frames (the height of the fourth subject in the dataset proposed by Andersson et al. [28]). As shown in this figure, the values are not consistent between frames as a result of the inaccuracy in pose estimation. Furthermore, there is a specific period during which a large variance among the values can be observed.
Therefore, if anthropometric features are simply extracted, ignoring such large variance over time, BMI classification accuracy can be significantly reduced. To overcome this problem, it is desirable to use anthropometric features localized over several different time periods rather than features averaged over all frames. Therefore, in this paper, we propose an ensemble model for BMI classification. In the next section, we discuss our methodology in detail.

Proposed Method

Figure 4 presents a schematic overview of the proposed method for skeleton-based BMI classification. A 3D human skeleton sequence is used as the input and is divided into several segments of equal length. In this study, we divide an original sequence into segments of T different lengths. Specifically, let W be the total length of the input motion sequence. In the frame division process, T · (T + 1)/2 segments are generated (one of length W, two of length W/2, three of length W/3, four of length W/4, five of length W/5, · · · , and T of length W/T). For each segment described above, anthropometric features are calculated. For clarification, suppose that a 3D human skeleton model consists of N joints and M limbs. Let J = {1, · · · , N} be the set of joint indices and L = {(i, j) | i ≠ j, i ∈ J, j ∈ J} be the set of joint pairs constructing limbs. The values of N and M depend on the product specifications of the motion capture sensor. For the Microsoft Kinect sensor [42], N and M are 20 and 19, respectively, as shown in Figure 5.
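The frame division step can be sketched as follows; np.array_split is used so that lengths that are not exact multiples of t are still handled (an implementation choice not specified in the paper):

```python
import numpy as np

def divide_frames(seq: np.ndarray, T: int) -> list:
    """Split a motion sequence of W frames into T*(T+1)/2 segments:
    one of length W, two of length ~W/2, ..., T of length ~W/T."""
    segments = []
    for t in range(1, T + 1):
        segments.extend(np.array_split(seq, t))
    return segments
```

For T = 5, a sequence is divided into 1 + 2 + 3 + 4 + 5 = 15 segments, matching the 15 k-NN models used later.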

Ensemble for BMI Classification
Let p_j[k] = (x_j[k], y_j[k], z_j[k]) be the position of the jth joint at the kth frame in the world coordinate system, where x_j[k], y_j[k] and z_j[k] are the coordinates of the jth joint at the kth frame on the X, Y and Z axes, respectively. The Euclidean distance between the ith and jth joints at the kth frame is then calculated as

d_(i,j)[k] = √((x_i[k] − x_j[k])² + (y_i[k] − y_j[k])² + (z_i[k] − z_j[k])²).  (1)

In [43][44][45][46][47][48][49][50], the average length of each limb and the average height of the target individual were used as anthropometric features. In this study, we used these features for BMI classification. To obtain the anthropometric features for each segment, the length of each limb and the height of the individual are measured in each frame. The average values are then used as the anthropometric features for the corresponding segment.
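Equation (1) is a per-frame Euclidean norm; a minimal NumPy sketch:

```python
import numpy as np

def limb_length(p_i: np.ndarray, p_j: np.ndarray) -> np.ndarray:
    """Per-frame Euclidean distance d_(i,j)[k] between joints i and j.
    p_i and p_j have shape (frames, 3), holding (x, y, z) per frame."""
    return np.linalg.norm(p_i - p_j, axis=1)
```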
For clarity, let F ∈ {W, W/2, W/3, · · · , W/(T − 1), W/T} be the frame length of a segmented sequence. The average length of each limb over F is then calculated as

AL_(i,j) = (1/F) Σ_{k=1}^{F} d_(i,j)[k].  (2)

For simplicity, we define a feature vector AF_1 as

AF_1 = [AL_(i,j)]_{(i,j)∈L}.  (3)

Because we consider the 19 limbs presented in Figure 5, the dimension of AF_1 in (3) is 19.
In general, human height is measured as the distance from the bottom of the feet to the top of the head. According to this definition, the height of a 3D human skeleton can also be calculated as the distance from the foot joints to the head joint. However, to measure the height in this traditional way, the user must stand up straight during the measurement; if the user moves, the measured height may be inaccurate. In [43], Araujo et al. proposed a method that is capable of estimating the height of a person even while the person moves. This method is widely used to estimate the height of human skeletons [44][45][46][47][48][49][50], and we adopted it in this study. According to the method proposed in [43], the target individual's height is calculated as the sum of their neck length, upper and lower spine lengths and the average lengths of the right and left hips, thighs and shanks. For clarity, let H[k] be the subject's height at the kth frame. Then, H[k] is defined as

H[k] = d_neck[k] + d_upper spine[k] + d_lower spine[k] + (d_right hip[k] + d_left hip[k])/2 + (d_right thigh[k] + d_left thigh[k])/2 + (d_right shank[k] + d_left shank[k])/2.  (4)

Based on (4), the average height over F is obtained as

AF_2 = (1/F) Σ_{k=1}^{F} H[k].  (5)

After the anthropometric features are obtained using (2) and (5), these features are concatenated into a feature vector as

AF = concat(AF_1, AF_2),  (6)

where concat(AF_1, AF_2) is an operator that concatenates AF_2 onto the end of AF_1. As a result, the dimension of AF in (6) is 20.
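Putting (2) to (6) together, the 20-dimensional feature vector for one segment can be sketched as follows; the head-to-foot chain used for the height below is a hypothetical placeholder, whereas the paper follows the specific limb combination of Araujo et al. [43]:

```python
import numpy as np

def anthropometric_features(joints: np.ndarray, limbs: list) -> np.ndarray:
    """joints: (frames, N, 3) joint positions; limbs: 19 (i, j) joint-index pairs.
    Returns AF = concat(AF_1, AF_2): 19 average limb lengths plus average height."""
    # AF_1: average length of each limb over the segment (eqs. (2)-(3)).
    af1 = [np.linalg.norm(joints[:, i] - joints[:, j], axis=1).mean()
           for (i, j) in limbs]
    # Height per frame as a sum of limb lengths along a chain (eq. (4));
    # the chain below is a placeholder, not the exact chain of [43].
    chain = limbs[:5]
    h = sum(np.linalg.norm(joints[:, i] - joints[:, j], axis=1) for (i, j) in chain)
    af2 = h.mean()  # average height over the segment (eq. (5))
    return np.array(af1 + [af2])  # eq. (6): dimension 19 + 1 = 20
```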
We present the pseudo-code for the proposed ensemble learning algorithm in Algorithm 1. As shown in Figure 4, in the proposed method, the ensemble model consists of multiple (i.e., T · (T + 1)/2) k-NNs. For simplicity, hereafter, we refer to the T · (T + 1)/2 k-NNs as k-NN 1 , k-NN 2 , k-NN 3 , · · · , k-NN T·(T+1)/2−1 , and k-NN T·(T+1)/2 . In the training (testing) phase, each AF extracted from the T · (T + 1)/2 sequence segments is inputted into the corresponding k-NN. This can be explained as follows. As shown in Figure 3, the lengths of body parts and the height of the target individual vary over time as a result of the inaccuracy in pose estimation. Since, traditionally, the average values of anthropometric features over all frames (i.e., W) are calculated, the performance of classifiers trained using them can be affected by the variance of features.
In contrast, in the proposed ensemble learning method, anthropometric feature vectors are calculated as averages over long/mid/short-term periods (i.e., W, W/2, W/3, · · · , W/T). Through the frame division process, the periods in which the variance exists are divided into sub-periods. As a result, the variance for each of the T · (T + 1)/2 segments is reduced. Additionally, the AFs calculated for T · (T + 1)/2 segments are used to train the k-NNs in the ensemble model. The use of such AFs helps each k-NN to obtain robust and accurate classification results (i.e., BMI classes).
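A hedged sketch of this stage in Python (a plain NumPy k-NN standing in for MATLAB's fitcknn, without the hyperparameter optimization described later):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Minimal Euclidean k-NN classifier."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(d)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

def collect_votes(train_sets, y_train, test_sets, k=3):
    """Run one k-NN per segment feature set; row q holds k-NN_q's predictions."""
    return np.array([knn_predict(X_tr, y_train, X_te, k)
                     for X_tr, X_te in zip(train_sets, test_sets)])
```

The rows of the returned vote matrix correspond to the classification results C_q that are subsequently combined by majority voting.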
Let Q = {1, 2, 3, · · · , T · (T + 1)/2 − 1, T · (T + 1)/2} be the set of indices of the k-NNs and q be the index of a k-NN (i.e., q ∈ Q). Additionally, let C_q be the classification result of k-NN_q. For simplicity, we define a classification result vector C_Q as

C_Q = [C_1, C_2, · · · , C_{T·(T+1)/2}].  (7)

To derive the final classification result from C_Q in (7), the majority voting algorithm was adopted in this study. In Algorithm 1, for each of the T · (T + 1)/2 segments, AF is extracted from the segmented data using (6) and passed to the corresponding k-NN. The pseudo-code for the majority voting algorithm is presented in Algorithm 2:

1: Obtain the set of classes S and the result vector C_Q; # Here, elements in S indicate BMI classes.
2: Create a 1-by-N_C vector of zeros V = [0, 0, · · · , 0];
3: Initialize the majority class C_M to 0;
   # Count the frequency of each class in C_Q.
4: for q = 1:1:size(C_Q) do
5:     V[C_q] = V[C_q] + 1;
6: end for
7: Set C_M to the index that corresponds to the maximum value in V;
8: return C_M;
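As an illustration, the majority voting algorithm (Algorithm 2) can be transcribed into Python (classes are assumed to be labelled 1 to N_C, as in our experiments):

```python
import numpy as np

def majority_vote(C_Q, n_classes):
    """Count the frequency of each class in C_Q and return the majority
    class C_M (classes labelled 1..n_classes)."""
    V = np.zeros(n_classes, dtype=int)  # step 2
    for c in C_Q:                       # steps 4-6
        V[c - 1] += 1
    return int(np.argmax(V)) + 1        # steps 7-8
```

Note that ties are broken in favour of the lowest class index, a detail the pseudo-code leaves unspecified.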

Experimental Setup
We evaluated the proposed method on an existing publicly available BMI classification dataset [28], which is unique in that it contains 3D human skeleton sequences as well as the body weights and heights of the subjects. To assess the performance of the proposed ensemble model, we used five-fold cross-validation and leave-one-person-out cross-validation.

Five-Fold Cross Validation
The dataset consists of 112 people with five skeleton sequences for each person. However, for one individual, denoted as "Person158", only four sequences exist. To implement five-fold cross-validation, we decided to eliminate this individual from the dataset. Additionally, there are six sequences for the following four individuals: "Person034", "Person036", "Person053" and "Person096". After examining the six sequences for each of these individuals, we discarded the noisiest sequence for each individual. As a result, the first sequence for "Person034", the third sequence for "Person036", the fifth sequence for "Person053" and the sixth sequence for "Person096" were excluded from the dataset. As a result, the final dataset contained a total of 555 sequences (equaling 111 people × 5 sequences per person).
By using the body weights and heights of the 111 people mentioned above, we calculated the BMI of each individual and categorized the people according to their nutritional statuses. As shown in Table 1, only one person was categorized as "obesity class II", and no people were categorized as "obesity class III". We therefore grouped "pre-obesity", "obesity class I", "obesity class II" and "obesity class III" into one class, called "overweight". As a result, the number of people included in the overweight class was 32 (= 25 + 6 + 1). In our experiments, the people included in the "underweight" class were labeled as "1", and the people included in the "normal weight" and "overweight" classes were labeled as "2" and "3", respectively. In each cross-validation fold, for each person, four sequences were selected from the five available sequences. The selected 444 sequences (i.e., 111 people × 4 sequences per person) were used to train the proposed model, and the remaining 111 (i.e., 555 − 444) sequences were used to test it.
Although the number of subjects in class 3 (overweight) was increased by consolidation, the three classes were still imbalanced. As shown in Table 1, the number of subjects in class 1 (underweight) was too low. The number of subjects in class 2 (normal weight) was greater than the numbers of subjects in the other classes. During the training phase of the proposed model, we used the SMOTE algorithm proposed in [41] to balance the three classes. For class 2, the number of anthropometric feature vectors was 292 (i.e., 73 people × 4 sequences per person). Therefore, according to the application of SMOTE, the numbers of anthropometric feature vectors for classes 1 and 3 were also 292. As a result, in each cross-validation fold, the model was trained using 876 anthropometric feature vectors (i.e., 292 vectors per class × 3 classes).

Leave-One-Person-Out Cross-Validation
Unlike five-fold cross-validation, all the sequences for 112 people were used in leave-one-person-out cross-validation. As a result, the dataset contained a total of 563 sequences (i.e., 107 people × 5 sequences per person + 1 person × 4 sequences per person + 4 people × 6 sequences per person). In addition, since the BMI of Person158 was greater than 25, Person158 was categorized as class 3 (overweight). As a result, the number of subjects included in classes 1, 2 and 3 were 6, 73 and 33, respectively. In each validation round, the skeleton sequences for one person were used to test the model, while the sequences for the remaining 111 people were used for training. In addition, since the three classes were imbalanced, we used the SMOTE algorithm to balance the three classes during the training phase of the proposed model.
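A minimal sketch of the leave-one-person-out protocol (plain Python, with sequence indices grouped by person ID):

```python
def leave_one_person_out(person_ids):
    """Yield (train_indices, test_indices) per person: all sequences of one
    person are held out while the remaining sequences are used for training."""
    for pid in sorted(set(person_ids)):
        test = [i for i, p in enumerate(person_ids) if p == pid]
        train = [i for i, p in enumerate(person_ids) if p != pid]
        yield train, test
```

Splitting by person rather than by sequence ensures that no sequence of the test subject leaks into the training set.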

Performance Measurement
In each cross-validation fold (round), to evaluate the performance of the proposed ensemble model, we used the following measures: the true positive rate (TPR), positive predictive value (PPV) and F_1 score. Here, for each class, the TPR, which is also called recall, is defined as

TPR = True positive / (True positive + False negative).  (8)

For each class, the PPV, which is also called precision, is defined as

PPV = True positive / (True positive + False positive).  (9)

Based on (8) and (9), the F_1 score for each class is calculated as

F_1 = 2 · (TPR · PPV) / (TPR + PPV).  (10)

Additionally, the accuracy becomes

Accuracy = (Σ True positive + Σ True negative) / Σ Total population.  (11)
Furthermore, we computed the macro-average values of the TPR, PPV and F_1 score for each cross-validation fold as follows:

TPR_macro = (1/B) Σ_{b=1}^{B} TPR_b,  (12)

PPV_macro = (1/B) Σ_{b=1}^{B} PPV_b,  (13)

F_1,macro = (1/B) Σ_{b=1}^{B} F_1,b,  (14)

where B is the number of classes. Here, B = 3 because there are three classes: underweight (class 1), normal weight (class 2) and overweight (class 3).
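Equations (8) to (14) can be sketched as follows:

```python
import numpy as np

def classification_metrics(y_true, y_pred, classes):
    """Per-class TPR, PPV and F_1 (eqs. (8)-(10)), their macro averages
    (eqs. (12)-(14)) and the overall accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr, ppv, f1 = [], [], []
    for c in classes:
        tp = np.sum((y_true == c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        fp = np.sum((y_true != c) & (y_pred == c))
        r = tp / (tp + fn) if tp + fn else 0.0
        p = tp / (tp + fp) if tp + fp else 0.0
        tpr.append(r)
        ppv.append(p)
        f1.append(2 * r * p / (r + p) if r + p else 0.0)
    macro = (np.mean(tpr), np.mean(ppv), np.mean(f1))
    accuracy = np.mean(y_true == y_pred)
    return tpr, ppv, f1, macro, accuracy
```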

Results
We investigated the main factors behind the low BMI classification accuracy reported in [28]. The main reason for this low accuracy was that Andersson et al. [28] used an undersampling technique to overcome the class imbalance problem in their dataset. For benchmarking, we tested five machine learning algorithms, namely C-SVM, nu-SVM, k-NN, Naive Bayes (NB) and decision tree (DT), in conjunction with the undersampling technique. C-SVM and nu-SVM were implemented using LIBSVM (version 3.21), an open-source library for SVMs [51]. Additionally, following the recommendations in [52], the radial basis function (RBF) kernel was used for the C-SVM and nu-SVM models. C-SVM has a cost parameter, denoted as c, whose value ranges from zero to infinity. nu-SVM has a regularization parameter, denoted as ν, whose value lies within [0, 1]. The RBF kernel has a gamma parameter denoted as γ. The grid search method was used to find the best parameter combination for the C-SVM and nu-SVM models. The remaining k-NN, NB and DT models were implemented using the MATLAB functions "fitcknn", "fitcnb" and "fitctree", respectively, from the Statistics and Machine Learning Toolbox. To determine the best hyperparameter configuration for each model, we executed the hyperparameter optimization processes supported by these functions. The models were implemented and tested in MATLAB R2018a (9.4.0.813654).

Results of Five-Fold Cross-Validation
During the development of the proposed ensemble model, determining the optimal number of k-NN models to be used was a significant challenge. To this end, we constructed several ensemble models with different numbers of k-NN models and compared the resulting performance metrics.
To implement each ensemble model, we used the MATLAB function "fitcknn" from the Statistics and Machine Learning Toolbox. To determine the optimal hyperparameter configuration for each k-NN model, we performed hyperparameter optimization using the aforementioned function. In this process, there were five optimizable parameters: the value of k, the distance metric, the distance weighting function, the Minkowski distance exponent and a flag to standardize predictors. During the training phase of each model, these five parameters were optimized using the hyperparameter optimization process. The proposed ensemble model was implemented and tested in MATLAB R2018a (9.4.0.813654). The MATLAB code is available at https://sites.google.com/view/beomkwon/bmi-classification. Table 2 lists the performance metrics for the ensemble models with different numbers of k-NN models, and a detailed description of the experimental setup is given in Table 3. In Table 2, the proposed ensemble model consists of 15 k-NN_q models (q ∈ {1, 2, · · · , 15}), while comparison models #1 to #9 consist of 3, 6, 10, 16, 17, 18, 19, 20 and 21 k-NN_q models, respectively. One can see that the performance of the ensemble model increased as the number of k-NN models increased.
However, when the number of k-NN models was greater than 15, the performance of the ensemble model was not improved. Based on these results, we selected an ensemble model consisting of 15 k-NN models for additional testing. However, as shown in Table 2, since the average running time of the ensemble model increased as the number of k-NN models increased, the trade-off between classification accuracy and running time needs to be considered in real-world applications. Table 2. Five-fold cross-validation performance comparisons between ensemble models with different numbers of k-NN models. TPR: true positive rate; PPV: positive predictive value.
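The hyperparameter optimization described above can be sketched in reduced form; here only k is tuned on a held-out validation split with a plain Euclidean k-NN, whereas fitcknn optimizes all five parameters jointly:

```python
import numpy as np

def tune_k(X_train, y_train, X_val, y_val, k_values=(1, 3, 5, 7)):
    """Pick the k that maximises validation accuracy (a reduced stand-in
    for fitcknn's hyperparameter optimization)."""
    def predict(x, k):
        d = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(d)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        return vals[np.argmax(counts)]
    best_k, best_acc = k_values[0], -1.0
    for k in k_values:
        acc = np.mean([predict(x, k) == y for x, y in zip(X_val, y_val)])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```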

Table 3. Experimental setup.
Item: Explanation
Operating System: Windows 10 Pro
Processor: Intel(R) Core(TM) i5-8250U CPU @ 1.60 GHz
Memory: 8.00 GB
GPU: Intel(R) UHD Graphics 620

Table 4 lists the optimized parameter settings for the ensemble model consisting of 15 k-NN models in each cross-validation fold. Note: Elements in brackets indicate k, the distance metric, the distance weighting function, the Minkowski distance exponent, and the flag to standardize predictors, respectively. In this table, the distance metrics "cityblock", "correlation", "cosine", "euclidean", "minkowski" and "seuclidean" are represented as "c_1", "c_2", "c_3", "e", "m" and "s", respectively. The distance weighting functions "equal", "inverse" and "squaredinverse" are denoted as "eq", "in" and "sq", respectively. k-NN: k-nearest neighbor. Table 5 lists the five-fold cross-validation results of the five benchmark methods discussed above. Here, instead of SMOTE, the undersampling technique was used for performance evaluation. Therefore, during the training phase of each cross-validation fold, for class 2, 268 (i.e., 73 × 4 − 6 × 4) anthropometric feature vectors were randomly selected from among the 292 total vectors and removed from the training dataset. For class 3, 104 (i.e., 32 × 4 − 6 × 4) vectors were randomly selected from among the 128 total vectors and removed from the training dataset. The remaining 72 vectors (i.e., 24 anthropometric feature vectors per class × 3 classes) were used to train each method. As shown in the table, for each method, there were significant variations in the TPR, PPV and F_1 values over the three classes. These results demonstrate that the undersampling technique used in [28] is not effective at overcoming the class imbalance problem. Table 6 lists the five-fold cross-validation results of the five benchmark methods when SMOTE is used to alleviate the class imbalance problem.
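The random-undersampling step described above (keeping 24 vectors per class) can be sketched as follows. The helper function and class layout are illustrative assumptions; only the class sizes (24/292/128 per training fold) come from the text.

```python
# Sketch of the random-undersampling step: for each class, keep at most
# n_keep randomly chosen feature vectors. Class sizes follow the paper's
# training folds (24 / 292 / 128); the feature values are synthetic.
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, y, n_keep):
    """Keep at most n_keep randomly chosen samples of each class."""
    keep_idx = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        if len(idx) > n_keep:
            idx = rng.choice(idx, size=n_keep, replace=False)
        keep_idx.extend(idx.tolist())
    keep_idx = np.sort(keep_idx)
    return X[keep_idx], y[keep_idx]

y = np.repeat([1, 2, 3], [24, 292, 128])      # class labels per vector
X = rng.normal(size=(len(y), 20))             # 20-D anthropometric features
Xb, yb = undersample(X, y, n_keep=24)
print(len(yb))  # 72 vectors remain: 24 per class x 3 classes
```

SMOTE, by contrast, oversamples the minority class by interpolating between neighboring minority samples rather than discarding majority samples.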
The results in this table demonstrate that SMOTE improves the TPR, PPV and F_1 score values of the five methods compared to the results in Table 5. Using (12) to (14), we computed the average metric values over five-fold cross-validation, as shown in Table 7. The results in this table demonstrate that it is better to use SMOTE than the undersampling technique to alleviate the class imbalance problem: for all benchmark methods, average performance improvements are achieved when SMOTE is applied. In particular, the k-NN model outperforms the other methods, achieving TPR = 0.9276, PPV = 0.8512, F_1 = 0.8798 and accuracy = 0.9279. Based on these results, we decided to use k-NN models to construct the proposed ensemble model. Table 7. Average performance comparisons of five benchmark methods over five-fold cross-validation. Based on the confusion matrices in Figure 6, we computed the TPR, PPV and F_1 score values of each class for five-fold cross-validation. Additionally, we computed the macro-average values of these metrics and the classification accuracy. The performance evaluation results for the proposed ensemble model are summarized in Table 8. As shown in this table, the proposed model exhibits robust and accurate classification performance for both the minority and majority classes, achieving approximately 98.2% average classification accuracy. Table 9 lists the macro-average values of the TPR, PPV and F_1 scores, as well as the accuracy of each method over five-fold cross-validation. One can see that the proposed ensemble model performs best in terms of all evaluation metrics. This is because the ensemble model is trained using anthropometric features calculated over long/mid/short-term periods. In other words, the use of such features enables the ensemble model to be trained effectively by minimizing the adverse effects of variance in the extracted features.
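The macro-averaged metrics used throughout these tables, in the spirit of Equations (12) to (14), can be sketched from a confusion matrix. The matrix entries below are made-up values; only the per-class test fold sizes (24, 73 and 32) echo the text.

```python
# Sketch of macro-averaged TPR, PPV and F1 from a confusion matrix.
# Rows are true classes, columns are predicted classes; values are made up.
import numpy as np

cm = np.array([[23,  1,  0],
               [ 2, 70,  1],
               [ 0,  2, 30]])

tpr = np.diag(cm) / cm.sum(axis=1)   # per-class recall (TPR)
ppv = np.diag(cm) / cm.sum(axis=0)   # per-class precision (PPV)
f1 = 2 * tpr * ppv / (tpr + ppv)     # per-class F1 score

macro_tpr, macro_ppv, macro_f1 = tpr.mean(), ppv.mean(), f1.mean()
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 4))
```

Macro-averaging weights each class equally, which is why it is the appropriate summary under class imbalance: a classifier that ignores the minority class is penalized even though its overall accuracy stays high.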
As a result, the proposed model can achieve robust and accurate BMI classification performance. Among the considered benchmark methods, the k-NN model achieves the best performance, whereas DT achieves the worst. The classification accuracy of the proposed model is approximately 5.23% greater than that of a single k-NN model. To verify the benefit of using anthropometric features calculated over various periods in the BMI classification task, we analyzed the standard deviations of the anthropometric features. For explanation, let AF(r) be the AF for the rth skeleton sequence in the dataset. According to the definition of AF in (6), the dimension of AF(r) is 20. In addition, let AF(r)[u], u ∈ {1, 2, · · · , 20}, be the uth element of AF(r). Then, the average value of AF(r)[u] over all sequences can be obtained as

AF_avg[u] = (1/R) ∑_{r=1}^{R} AF(r)[u],  (15)
where R is the total number of skeleton sequences. Based on (15), the standard deviation of each of the 20 anthropometric features is calculated as

σ[u] = sqrt( (1/R) ∑_{r=1}^{R} (AF(r)[u] − AF_avg[u])² ).  (16)

Figure 7 presents the standard deviations of the 20 anthropometric features for five different periods (i.e., W, W/2, W/3, W/4 and W/5). In this figure, for the cases of W/2, W/3, W/4 and W/5, we calculated the average of the standard deviations. For example, for W/5, there are five equal-length segments according to the frame division process of the proposed method; we calculated the standard deviation in (16) for each of the segments and then averaged the results over the five segments. As shown in Figure 7a, the features have high standard deviations when they are calculated over all frames (i.e., W). In contrast, the features calculated for W/2, W/3, W/4 and W/5 have relatively low standard deviations; in particular, the features calculated for W/5 have the lowest values. A low standard deviation indicates that the features are clustered around the average, whereas a high standard deviation means that they are spread out over a wide range. Because the anthropometric features are calculated as average lengths over a given segment sequence, the high standard deviations shown in Figure 7a may adversely affect the performance of machine learning algorithms. In contrast, in our ensemble learning method, anthropometric features with low standard deviations are used to train/test multiple k-NN models. The use of these features enables the proposed ensemble model to be trained effectively by minimizing the adverse effects of variance in the extracted features, resulting in state-of-the-art performance. Figure 8 shows the leave-one-person-out cross-validation process used in this work. As shown in the figure, in each validation round, the skeleton sequences for one person were used to test the classifier, and the sequences for the remaining 111 people were used for training.
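Equations (15) and (16) above amount to a per-feature mean and population standard deviation over the R sequences. A minimal sketch, using a synthetic R × 20 feature matrix in place of real anthropometric data:

```python
# Sketch of Eqs. (15) and (16): the average of each anthropometric feature
# over all R sequences, and its standard deviation. AF is a synthetic
# R x 20 matrix (one 20-D feature vector per skeleton sequence).
import numpy as np

rng = np.random.default_rng(0)
R = 100
AF = rng.normal(loc=0.4, scale=0.05, size=(R, 20))   # synthetic features

AF_avg = AF.mean(axis=0)                             # Eq. (15)
sigma = np.sqrt(((AF - AF_avg) ** 2).mean(axis=0))   # Eq. (16)

assert np.allclose(sigma, AF.std(axis=0))  # matches NumPy's population std
print(sigma.shape)  # one standard deviation per anthropometric feature
```

For the W/q cases, the same computation would be repeated once per segment, after splitting each sequence into q equal-length segments and averaging the per-frame lengths within each segment.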
In each validation round, the predicted results (i.e., predicted BMI classes) for the test skeleton sequences were obtained. After all 112 validation rounds were completed, we calculated a confusion matrix using all predicted results. Based on this confusion matrix, we calculated the TPR, PPV and F_1 values over the three classes to evaluate the performance of the classifier. Table 10 shows the leave-one-person-out cross-validation results of the five benchmark methods when the undersampling technique was used, whereas Table 11 shows the results when SMOTE was used. Using (12) to (14), we computed the macro-average values of these metrics over the three classes. In addition, we calculated the classification accuracy of each benchmark method. The performance evaluation results for the five benchmark methods are summarized in Table 12. From the table, it can be seen that SMOTE improves the macro-average TPR, PPV and F_1 of each benchmark method compared with the undersampling results. The BMI classification accuracy of each method also improved when SMOTE was used. These results demonstrate that SMOTE is more effective at overcoming the class imbalance problem in the dataset used in [28] than the undersampling technique. To find the optimal number of k-NN classifiers in the proposed ensemble model, we evaluated the performance of ensemble models with different numbers of k-NN classifiers. Table 13 lists the performance metrics for each model. From this table, it can be seen that the best performance was achieved when the ensemble model consisted of 15 k-NN_q models (q ∈ {1, 2, · · · , 15}). Based on these results, we selected an ensemble model consisting of 15 k-NN models for additional testing. Table 14 lists the macro-average values of the TPR, PPV and F_1 scores, as well as the accuracy of each method over leave-one-person-out cross-validation.
As shown in the table, the proposed method outperforms the other methods on all evaluation metrics, achieving a BMI classification accuracy of 73%. In the proposed method, anthropometric features calculated over long/mid/short-term periods were used to train the ensemble model. The use of such features in the training phase reduces the adverse effects of variance in the extracted features. As a result, the model could be trained effectively and achieved the best performance among the considered methods.
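The leave-one-person-out protocol described above can be sketched with scikit-learn's LeaveOneGroupOut splitter. This is an illustrative assumption (the paper's experiments were run in MATLAB); the data are synthetic, with far fewer people than the 112 in the actual dataset.

```python
# Sketch of leave-one-person-out cross-validation: each person's sequences
# form one test fold, all other people's sequences form the training set,
# and all predictions are pooled into a single confusion matrix.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_people, seqs_per_person = 8, 3          # 112 people in the actual dataset
groups = np.repeat(np.arange(n_people), seqs_per_person)
X = rng.normal(size=(len(groups), 20))    # 20-D anthropometric features
y = rng.integers(0, 3, size=len(groups))  # three BMI classes

y_pred = np.empty_like(y)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = clf.predict(X[test_idx])

# Pool all per-round predictions into one confusion matrix, as in the paper.
cm = np.zeros((3, 3), dtype=int)
for t, p in zip(y, y_pred):
    cm[t, p] += 1
print(cm.sum())  # equals the total number of sequences
```

Grouping the splits by person, rather than by sequence, prevents sequences from the same subject from appearing in both the training and test sets, which would otherwise inflate the reported accuracy.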

Discussion
In this study, we used human body joints estimated from depth images captured by the Kinect (v1) sensor [42]. These estimated joints generally contain noise and errors introduced by uncertainty, the majority of which is caused by occlusion and depth ambiguity [17]. Due to this uncertainty, the values of anthropometric features vary with measurement time, even if they are extracted from the same parts of the human body. In conventional methods, for each anthropometric feature, the average value is calculated over the whole period, and these average values are then used as inputs to classifiers. The performance of classifiers trained using these values can be affected by the variance of the features. To minimize this adverse effect, we proposed the use of anthropometric features calculated over several different time periods. The Kinect v1 sensor was released in 2010 and was followed by the Kinect v2 sensor in 2014. Wang et al. [53] evaluated the human body joint estimation accuracies of the Kinect v1 and v2 sensors; according to their results, the Kinect v2 sensor estimated joints more accurately and was more robust to occlusion and body rotation. The newest version of the Kinect sensor, called Azure Kinect, was released in 2019. Albert et al. [54] reviewed the improved hardware implementation and motion tracking algorithm of the Azure Kinect sensor and evaluated the motion tracking performance of the Kinect v2 and Azure Kinect sensors. Their results demonstrated that the Azure Kinect sensor achieves higher accuracy in human motion tracking than the Kinect v2 sensor. With continuing advances in camera technology, new sensors with better hardware and motion tracking algorithms will keep being released.
If such state-of-the-art sensors reduce the noise and errors in the estimated joints to the point where the anthropometric features extracted at every time instant become exactly equal, then the feature vectors calculated by our method over different time periods will also become equal. In that case, the usefulness and effectiveness of the proposed method could be limited.

Conclusions
In general, there are variances in anthropometric features calculated over all frames of a sequence as a result of inaccuracy in pose estimation. These variances adversely affect the classification performance of machine learning algorithms. Therefore, to minimize these adverse effects, we proposed the use of anthropometric features calculated over long/mid/short-term periods. Additionally, we proposed a novel ensemble model consisting of 15 k-NN models trained using features calculated over several different time periods. Experimental results demonstrated that the proposed ensemble model outperforms five benchmark methods and achieves state-of-the-art performance on a publicly available BMI classification dataset. In practical situations, some joints can be invisible due to occlusions. If a joint is invisible over the whole time interval, the corresponding anthropometric features cannot be obtained. In this case, the anthropometric feature vector may contain "not a number" (NaN) values. As occlusions become more severe, the number of NaN values in the anthropometric feature vector increases, and the use of such vectors can adversely affect the classification accuracy of the ensemble model. However, in this study, we considered only the condition in which there was no occlusion and all 20 joints were visible during the whole time interval. Therefore, as part of our future work, we plan to extend our method to classify the BMIs of people for whom some body parts are occluded.
Author Contributions: Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, and writing-original draft preparation, B.K.; writing-review and editing, B.K. and S.L.; supervision and funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.