Missing Value Imputation in Stature Estimation by Learning Algorithms Using Anthropometric Data: A Comparative Study

Abstract: Estimating stature is essential in the process of personal identification. Because it is difficult to find human remains intact at crime scenes and disaster sites, for instance, methods are needed for estimating stature based on different body parts. For instance, the upper and lower limbs may vary depending on ancestry and sex, and it is of great importance to design adequate methodology for incorporating these in estimating stature. In addition, it is necessary to use machine learning rather than simple linear regression to improve the accuracy of stature estimation. In this study, the accuracy of statures estimated based on anthropometric data was compared using three imputation methods. In addition, by comparing the accuracy among linear and nonlinear classification methods, the best method was derived for estimating stature based on anthropometric data. For both sexes, multiple imputation was superior when the missing data ratio was low, and mean imputation performed well when the ratio was high. The support vector machine recorded the highest accuracy in all ratios of missing data. The findings of this study showed appropriate imputation methods for estimating stature with missing anthropometric data. In particular, the machine learning algorithms can be effectively used for estimating stature in humans.


Introduction
One of the major limitations in attempting to estimate human information such as sex, stature, and age at crime and disaster scenes is that necessary anthropometric measurements can be missing [1][2][3]; previous researchers have shown that estimating the biological information of a human body using a variety of anthropometric measurements such as of the upper and lower limbs is effective [4][5][6]. However, many previous researchers have shown that estimates of biological information vary widely across different ancestry groups and sexes [7][8][9]. Therefore, it is important to identify anthropometric measurements that can best estimate the biological information of a specific ancestry group. In addition, investigators have developed and applied several statistical techniques for estimating human biological information. Most previous researchers used regression analysis based on principles of linearity in body parts [10][11][12][13], but recently, efforts have been made to improve the accuracy through nonlinear analysis methods such as artificial neural networks [14][15][16][17].
There are documented methods of estimating human physical information based on measurement data [18,19], but most of the relevant studies are based on complete bodies including all parts [20][21][22]. In the real world however, human remains are damaged, whether intentionally or naturally. Therefore, it is difficult to extrapolate human biological information from human remains found in the field, where damage is not contained but is instead manifested differently in different situations and environments.
The purpose of this study was to compare imputation methods of managing missing values in anthropometric data in the process of estimating biological information for damaged remains at crime and disaster sites. For this purpose, we first examined the differences in accuracy between different imputation methods. Second, to compare the differences in accuracy according to the learning algorithm, we selected the optimal algorithm according to the missing ratio of data. Finally, we compared the accuracy of machine learning algorithms by dividing body parts into upper versus lower limbs.
The remainder of this paper is organized as follows. Section 2 reviews previous studies related to this work. In Section 3, we describe the participants, measurements, experimental procedure, and data processing methods. Section 4 provides the results of three imputation methods for four learning algorithms. Finally, in Section 5, we discuss the results of each learning algorithm by comparing them with the results from previous studies; we also provide future research directions.

Human Biological Information
Estimated stature is known to be one of the most important factors in profiles of human biometric information [23]. Researchers over recent decades have studied methods of predicting human stature by measuring various parts of the human body and have developed and utilized various estimation methodologies [24]. Developed methods include measuring body parts or bones of ashes, and these methods have been used to estimate height in various countries [24,25]. The methods of estimating stature and the measurement variables of interest have differed according to sex and ancestry, and research on Koreans is insufficient. Because previous studies are lacking, improvements in the accuracy of stature estimation may be limited, and results may diverge from the current standards of forensic anthropology.
Human stature is a polymorphic result of a combination of genetic, environmental, and biological elements, and it is essential to develop a method that can be applied universally to different ancestry groups and countries. Most of the research related to estimating stature in Koreans has been conducted on only upper or lower limbs, and the necessity for research based on integrating body parts has been steadily raised. Researchers have highlighted that the accuracy of estimating biological information such as stature, sex, and age with a part of the human body has been lower than the accuracy of estimates for other ancestry groups and countries.
One of the major limitations of previous studies that estimate human stature is that the researchers assumed intact human bodies when they developed their estimation models. In the fields of anthropology and forensic science, however, biological information is estimated using body parts that have been damaged in some way. Therefore, a model is needed that can effectively estimate stature even from damaged body parts.

Imputation Method for Handling Missing Values
Missing values are one of the most frequent issues in data analysis; they can occur for many reasons such as malfunctioning sensing systems or survey questions left blank. Imputation, the process of replacing missing values, has been extensively studied, and for this study, we compared three methods, mean imputation, nearest neighbor imputation, and multiple imputation.
Mean imputation is one of the simplest methods; it entails filling in missing values with the corresponding means [26]. Medians can be used instead of means for robustness. For categorical variables, the missing values are usually replaced with the most frequent values. Although this approach is simple and can be powerful, it has a limitation that feature variances are underestimated.
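The variance limitation noted above can be seen in a minimal sketch. The function name `mean_impute` and the knee-height values are hypothetical and serve only as an illustration; they are not taken from the study's data.

```python
from statistics import mean, pvariance

def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

# Hypothetical knee heights in cm; None marks a missing measurement.
knee_height = [44.1, None, 47.3, 45.0, None, 49.2]
imputed = mean_impute(knee_height)

# Every imputed value sits exactly at the mean, so the spread shrinks.
observed_var = pvariance([v for v in knee_height if v is not None])
imputed_var = pvariance(imputed)
print(observed_var, imputed_var)
```

Because the filled-in values add nothing to the sum of squared deviations while increasing the sample count, the variance of the imputed column is smaller than that of the observed values alone.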
Nearest neighbor (NN) imputation, or hot-deck imputation, replaces missing values with the corresponding variables in the closest instance [27,28]. Because imputation based on a single nearest instance may not be robust, there have been several studies to improve NN imputation by using multiple nearest points [29][30][31].
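As a rough sketch of the idea, the following code fills a missing value from the k closest complete instances, measuring distance only on the features that are observed in the incomplete instance. The function `knn_impute`, the Euclidean distance, the averaging over neighbors, and the foot measurements are all assumptions made for illustration; the cited methods differ in their details.

```python
import math

def knn_impute(rows, k=1):
    """Fill None entries using the k nearest complete rows.

    Distance is Euclidean and uses only the features observed in the
    incomplete row. With k > 1 the neighbors' values are averaged.
    """
    complete = [r for r in rows if None not in r]
    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue
        obs = [j for j, v in enumerate(row) if v is not None]
        nearest = sorted(
            complete,
            key=lambda c: math.sqrt(sum((c[j] - row[j]) ** 2 for j in obs)),
        )[:k]
        filled = [
            v if v is not None else sum(c[j] for c in nearest) / len(nearest)
            for j, v in enumerate(row)
        ]
        result.append(filled)
    return result

# Hypothetical (foot length, foot breadth) pairs in cm.
rows = [[25.0, 9.5], [23.0, 8.8], [27.0, 10.4], [24.8, None]]
single = knn_impute(rows, k=1)   # copies the breadth of the closest row
double = knn_impute(rows, k=2)   # averages the two closest rows
```

With k = 1 this behaves like hot-deck imputation; larger k trades fidelity to the closest instance for robustness.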
Unlike mean and NN imputation, in which a missing value is replaced by a single value, multiple imputation samples a missing value multiple times from a predefined distribution [32]. Multiple data sets are thus generated by random sampling of the missing values, and the final result is obtained as an ensemble of the results from each data set. When the parameters of the distribution cannot be found analytically, they can be estimated by the expectation-maximization algorithm [33,34] or by Markov chain Monte Carlo methods [35].
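The sampling step can be sketched for the simplest case of two jointly Gaussian variables, one observed and one missing. The function `conditional_draws`, the parameter values, and the pooling by a plain average are hypothetical choices for illustration only; a real analysis would fit a model to each completed data set and combine those results.

```python
import math
import random
from statistics import mean

def conditional_draws(x1, mu, var1, var2, cov12, m=5, seed=0):
    """Draw m values of a missing x2 given an observed x1.

    Assumes (x1, x2) is bivariate Gaussian with means mu = (mu1, mu2),
    variances var1 and var2, and covariance cov12.
    """
    rng = random.Random(seed)
    cond_mean = mu[1] + cov12 / var1 * (x1 - mu[0])
    cond_var = var2 - cov12 ** 2 / var1
    return [rng.gauss(cond_mean, math.sqrt(cond_var)) for _ in range(m)]

# Hypothetical parameters: x1 = knee height, x2 = foot length (cm).
draws = conditional_draws(x1=49.0, mu=(46.0, 25.0),
                          var1=9.0, var2=4.0, cov12=4.8, m=5)

# Each draw yields one completed data set; here the per-data-set
# "result" is just the drawn value, pooled by averaging.
pooled = mean(draws)
```

Unlike a single conditional-mean fill, the m draws differ from one another, so the uncertainty about the missing value is carried into the downstream results.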

Machine Learning Classifier
Recently, machine learning algorithms have been widely used in areas including business and finance, health care, and production due to their superior performance in sophisticated tasks [36][37][38][39][40]. We employed four widely used machine learning classifiers to predict the stature. The brief descriptions of these classifiers are as follows.
Logistic regression is a basic classifier that models the logarithm of the ratio between the probability of the positive class and that of the negative class as a linear combination of the independent variables, as in Equation (1):

$$\log \frac{P(y = 1 \mid \mathbf{x})}{P(y = 0 \mid \mathbf{x})} = \beta_0 + \sum_{i=1}^{p} \beta_i x_i \qquad (1)$$

Because Equation (1) cannot be solved analytically for general cases, it is usually solved by iteratively reweighted least squares. It can be extended to multiclass classification by setting the log odds, the logarithms of the ratios between the probability of a certain class and that of the reference class, as linear combinations of the independent variables, as in Equation (2):

$$\log \frac{P(y = k \mid \mathbf{x})}{P(y = K \mid \mathbf{x})} = \beta_{k0} + \sum_{i=1}^{p} \beta_{ki} x_i, \quad k = 1, \ldots, K - 1 \qquad (2)$$

where K denotes the number of classes.

Naïve Bayes classifier (NB) is a probabilistic classifier based on Bayes' theorem [41]. It usually assumes that all features are conditionally independent of one another given the class of an instance; thus, the classifier becomes as follows:

$$P(y = k \mid x_1, \ldots, x_p) = \frac{1}{Z} P(y = k) \prod_{i=1}^{p} P(x_i \mid y = k)$$

where $x_i$ is the i-th feature of an instance, $P(y = k)$ is the prior probability that the instance belongs to class k, and $Z = \sum_{k=1}^{K} P(y = k) \prod_{i=1}^{p} P(x_i \mid y = k)$ is a normalization factor, usually called the evidence. In general, prior probabilities are defined before training as proportional to the number of instances belonging to each class. Thus, in the training phase, a model finds the parameters of the likelihood distribution, $P(x_i \mid y = k)$, that best fit the training instances.
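To make the naïve Bayes computation concrete, the following sketch evaluates the class posteriors for one instance, including the normalization by the evidence. The Gaussian form of the likelihood, the function names, and all numerical values are assumptions for illustration; the text above does not fix a particular likelihood distribution.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def naive_bayes_posterior(x, priors, params):
    """Posterior P(y = k | x) for every class k.

    priors[k]    = P(y = k)
    params[k][i] = (mean, std) of feature i in class k (assumed Gaussian)
    """
    joint = []
    for prior, feature_params in zip(priors, params):
        likelihood = math.prod(
            gaussian_pdf(xi, mu, sigma)
            for xi, (mu, sigma) in zip(x, feature_params)
        )
        joint.append(prior * likelihood)
    evidence = sum(joint)  # the normalization factor Z
    return [j / evidence for j in joint]

# Two hypothetical stature classes described by two features (cm).
priors = [0.5, 0.5]
params = [[(44.0, 2.0), (24.0, 1.0)],
          [(48.0, 2.0), (26.0, 1.0)]]
posterior = naive_bayes_posterior([47.5, 25.8], priors, params)
predicted_class = max(range(len(posterior)), key=posterior.__getitem__)
```

The per-feature product is what the conditional independence assumption buys: each likelihood term needs only the parameters of a single feature within a single class.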
Artificial neural networks (ANNs) are one of the most famous machine learning models today because they encompass deep learning, the most powerful algorithms in many applications. Neural networks were originally inspired by the human brain, which consists of several interconnected neurons [42]. In a neural network algorithm, as in a central nervous system, each neuron receives signals from other neurons, processes them into a new signal, and transmits it to others. The output of each neuron is calculated as follows:

$$y = g\left(\sum_{i} w_i x_i\right)$$

where the $x_i$ are inputs from other neurons, the $w_i$ are weight parameters, and $g(\cdot)$ is an activation function that gives a neural network model nonlinearity; rectified linear unit, sigmoid, and hyperbolic tangent functions are typical choices for the activation function. In a multilayer perceptron, a neural network model for regression and classification, layers containing a number of neurons are arranged sequentially, and the neurons in one layer receive inputs from the neurons in the previous layer and transmit outputs to those in the next layer. After the final output values are calculated in the output layer, the weight parameters between layers are trained to minimize the cost function by backpropagation [43].

Support vector machine (SVM), proposed by [44], is one of the most famous kernel-based classifiers and has advantages in both sparsity and robustness. It finds a hyperplane, a decision boundary, which maximizes the margin, the distance between a decision boundary and the closest data point, in the feature space, a high-dimensional space mapped from the original space. By mapping from the original space to the high-dimensional feature space, the separating hyperplane can be found even when it does not exist in the original space. Mapping to the high-dimensional space increases the calculation time for most algorithms, and sometimes it fails to find a solution within a reasonable time. However, solving the dual problem of SVM only requires the inner product of two instances in the feature space, the kernel function. This "kernel trick" dramatically reduces the calculation for training SVM and makes the algorithm scalable. Because the original SVM is designed to perform binary classification, one-versus-one or one-versus-all settings are adopted for multiclass classification tasks. In this paper, we used a one-versus-all scheme for multiclass classification.
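The kernel trick can be illustrated with a Gaussian kernel, which returns the feature-space inner product of two instances without ever constructing the feature map. The parameterization exp(-γ‖x − z‖²), the function names, and the sample points are assumptions made for this sketch.

```python
import math

def gaussian_kernel(x, z, gamma):
    """Inner product of x and z in the implicit feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(points, gamma):
    """All pairwise kernel values; this is what the dual problem needs."""
    return [[gaussian_kernel(p, q, gamma) for q in points] for p in points]

# Three hypothetical two-feature instances (standardized values).
points = [[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0]]
gram = gram_matrix(points, gamma=0.1)
```

The matrix has one entry per pair of training instances, so its size depends on the number of instances rather than on the dimension of the feature space.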

Participants
The measurements were performed by SizeKorea (Korean Agency for Technology and Standards) in South Korea (https://sizekorea.kr/page/data/1_2). The sixth survey of anthropometric dimensions of Koreans was conducted from March to November 2010, and the total number of participants was 14,016 (7532 males and 6484 females) recruited from various regions of South Korea. The participants' ages ranged from 7 to 69; the average age was 22.00 for the men and 23.74 for the women. All subjects were measured in the morning because human stature changes throughout the day.

Measurements
In this study, the upper and lower limbs were defined with reference to previous studies [45,46], and all measurement dimensions used in this research are explained in Table 1. For consistency, only the upper and lower limbs on the right side were measured, and a Martin anthropometer, a Martin-type caliper, and a plastic tapeline were used for the body measurements. All units of measurement are centimeters and are rounded off at the third decimal place. Stature was measured as the vertical distance from the floor surface to the vertex of the head. The subject stood parallel to the anthropometer with the gaze fixed straight ahead, and the measurer recorded the subject's stature as displayed on the anthropometer.
The human upper limb is defined as the region from the deltoid to the hand and is commonly composed of the shoulder, upper/lower arm, wrist, hand, and fingers. In this study, 10 measurement variables related to the upper limb were selected (see Table 1). For the upper limbs, the variables related to the arm and elbow were measured with a plastic tapeline, and the variables related to the hand were measured with a Martin-type caliper. The human lower limb consists of the thigh, the leg (or upper/lower leg), and the foot. The researchers selected 15 measurement variables related to the length, width, and circumference of the upper/lower leg and foot. For the lower limbs, all variables except foot length and foot breadth were measured with the Martin anthropometer.

Table 1 (lower limb entries):
Iliac Spine Height: The vertical distance between a standing surface and anterior superior iliac spine
Knee Height: The vertical distance between a standing surface and the tibia
Thigh Vertical Length: The distance between gluteal fold and popliteal fossa
Outside Leg Length: The vertical distance between a standing surface and side waist band (half the distance between the tenth rib and iliac crest)
Foot Breadth: The horizontal length between metatarsophalangeal V and metatarsophalangeal I
Foot Length: The straight length between pternion and acropodion
Lateral Malleolus Height: The vertical distance between a standing surface and lateral malleolus
Thigh Circumference: The horizontal circumference at gluteal fold
Knee Circumference: The horizontal circumference at mid-patella
Ankle Circumference: The maximum circumference over lateral malleolus and medial malleolus

Experimental Procedure
The overview of the research flow and analysis of this study is shown in Figure 1. First, we generated input data sets for each experiment. After choosing the input features, upper limbs, lower limbs, or both, and sexes, male, female, or both, that would be utilized for the experiment, we randomly made missing values from an input data set according to the missing ratio ranging from 0.2 to 0.8. Then, we employed three imputation methods, mean, nearest neighbor, and multiple, to impute the missing values. For multiple imputation, we assumed that the joint distribution of input variables followed Gaussian distribution and sampled missing values from the conditional distribution five times.
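The step of generating missing values can be sketched as follows. Hiding individual cells uniformly at random is an assumption of this illustration (the text states only that values were removed randomly according to the missing ratio), and the function `mask_at_random` and the sample rows are hypothetical.

```python
import random

def mask_at_random(rows, missing_ratio, seed=0):
    """Hide a fixed fraction of cells, chosen uniformly at random."""
    rng = random.Random(seed)
    n_rows, n_cols = len(rows), len(rows[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n_missing = round(missing_ratio * len(cells))
    hidden = set(rng.sample(cells, n_missing))
    return [
        [None if (i, j) in hidden else rows[i][j] for j in range(n_cols)]
        for i in range(n_rows)
    ]

# Five hypothetical instances with four measurements each (cm).
complete = [[44.1, 25.0, 9.5, 36.2],
            [47.3, 26.1, 9.9, 38.0],
            [45.0, 24.2, 9.1, 35.5],
            [49.2, 27.0, 10.4, 39.1],
            [46.4, 25.5, 9.6, 37.3]]
masked = mask_at_random(complete, missing_ratio=0.2)
n_hidden = sum(v is None for row in masked for v in row)
```

Repeating the call with ratios from 0.2 to 0.8 and different seeds would give the kind of input data sets described above.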
Then, following Miguel-Hurtado and colleagues, we transformed the target variable, stature, into seven classes [15]. The classes were equally spaced, so the boundary values for all instances were (1047.0, 1173.9, 1300.7, 1427.6, 1554.4, 1681.3, 1808.1, 1935.0). Because both the maximum and minimum values occurred in male cases, the boundary values for males were the same as those for all instances. The boundary values for female cases were (1057.0, 1159.9, 1262.7, 1365.6, 1468.4, 1571.3, 1674.1, 1777.0).
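The equally spaced boundaries can be reproduced from the minimum and maximum statures alone, as the following sketch shows. The function names and the convention that each class includes its lower boundary (with the top boundary assigned to the last class) are assumptions; the text does not state how values on a boundary were handled.

```python
def class_boundaries(lo, hi, n_classes):
    """Return n_classes + 1 equally spaced boundary values."""
    width = (hi - lo) / n_classes
    return [lo + i * width for i in range(n_classes + 1)]

def stature_class(stature, boundaries):
    """Return a class index from 0 to n_classes - 1.

    Assumed convention: a class includes its lower boundary, and the
    maximum value falls into the last class.
    """
    n_classes = len(boundaries) - 1
    for k in range(n_classes - 1):
        if stature < boundaries[k + 1]:
            return k
    return n_classes - 1

all_bounds = class_boundaries(1047.0, 1935.0, 7)
female_bounds = class_boundaries(1057.0, 1777.0, 7)
print([round(b, 1) for b in all_bounds])
print([round(b, 1) for b in female_bounds])
```

Rounded to one decimal place, both lists match the boundary values reported above, with class widths of about 126.9 for all instances and about 102.9 for females.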
Because the combination of missing value imputation methods and machine learning classifiers for anthropometric data has not yet been studied, we selected four conventional machine learning classifiers (logistic regression, naïve Bayes, neural network, and support vector machine) for the stature classification tasks with the imputed data sets; these classifiers have already been employed for anthropometry as well as other applications [15,47]. We also applied five-fold cross validation to find the best hyperparameters for the classifiers. For the neural network classifier, we controlled two hyperparameters, the number of layers and the number of nodes in each hidden layer. We changed the number of layers from one to three and the number of hidden nodes from 10 to 50. For SVM, we employed a Gaussian kernel. There were also two hyperparameters: γ, which controlled the bandwidth of the kernel function, and C, which balanced the errors for misclassified instances against the regularization of the classifier, that is, the margin maximization. In this study, we varied γ from 0.01 to 100 and C from 0.05 to 10. For each of the three imputation methods, the parameters for imputation, including mean values and covariance matrices, were estimated only with the training data set, and the imputation for the validation set was also conducted based on these parameters. We repeated the whole procedure 10 times for every case, and we reported averaged cross validation errors for comparison.
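The rule that imputation parameters come from the training data only can be sketched with mean imputation inside a five-fold split. The helper names, the interleaved fold assignment, and the toy rows are hypothetical; the sketch also assumes every column has at least one observed value in the training rows.

```python
import random
from statistics import mean

def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def impute_with_train_means(train, valid):
    """Estimate column means on train only, then fill both sets."""
    n_cols = len(train[0])
    col_means = [
        mean(r[j] for r in train if r[j] is not None) for j in range(n_cols)
    ]

    def fill(rows):
        return [
            [col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows
        ]

    return fill(train), fill(valid), col_means

# The validation row is filled with the training mean (2.0), so no
# information from the validation set leaks into the imputation.
train_rows = [[1.0, None], [3.0, 4.0]]
valid_rows = [[None, 100.0]]
train_filled, valid_filled, col_means = impute_with_train_means(
    train_rows, valid_rows
)

folds = k_fold_indices(10, k=5)
```

In a full experiment, each fold in turn would serve as the validation set, and a classifier would be trained and scored on the filled rows for each hyperparameter setting.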

Stature Classification: Upper Limb
For each learning algorithm, Table 2 shows the relationships between the missing ratio and the accuracy according to the three imputation methods based on variables for the upper limb for both males and females. There was no statistical significance at the 95% confidence level in the one-sample t-test for the results of accuracy obtained through 10 repeated trials for the imputation methods and learning algorithms for both sexes.
First, for both sexes, when the missing ratio was 0.2, multiple imputation had the highest accuracy in all algorithms except NB. In mean and multiple imputation, the accuracy of NB changed less than it did with other algorithms when we increased the missing ratio. Among the three imputation methods, NN imputation showed the lowest accuracy at all missing ratios; all algorithms showed lower accuracy as the missing ratio increased. In addition, when the missing ratio was 0.6 or more, mean imputation had higher accuracy than multiple imputation; this was because the accuracy of multiple imputation fell more steeply than that of the other two methods as the missing ratio increased. SVM showed the highest accuracy among the four machine learning algorithms (0.756 at a missing ratio of 0.2).
Second, for females, when the missing ratio was 0.2 to 0.4, multiple imputation using SVM showed the highest accuracy. In contrast, when the missing ratio was 0.5 to 0.8, mean imputation using SVM showed the highest accuracy. The results of the experiment confirmed that the accuracy derived through SVM was the highest at all missing ratios. In addition, among the imputation methods, NN imputation showed the lowest accuracy at all missing ratios.
Finally, for males, when the missing ratio was 0.2 to 0.4, multiple imputation using SVM showed the highest accuracy, as with females. However, unlike with the female cases, when missing ratio was 0.5 or 0.6, the accuracy of SVM and ANN was the same, and when the missing ratio was larger than 0.7, mean imputation was more accurate than multiple.

Stature Classification: Lower Limb
The relationship between the missing ratio and the accuracy according to the three imputation methods based on variables for the lower limb is shown in Table 3 by male and female. There was no statistical significance at the 95% confidence level in the one-sample t-test for the results of accuracy obtained through 10 repeated trials for the imputation methods and learning algorithms for both sexes.
First, for both sexes, when the missing ratio was 0.2, multiple imputation had the highest accuracy with all learning algorithms except NB. With mean and multiple imputation, the accuracy of NB changed less than did accuracy with the other algorithms when the missing ratio increased. Among the three imputation methods, as in the upper limb, NN showed the lowest accuracy at all missing ratios. All learning algorithms were less accurate as the missing ratio increased, and when the missing ratio was 0.5 or more, mean imputation was more accurate than multiple imputation. This was because the accuracy of multiple imputation decreased more than did accuracy with the other two methods when the missing ratio increased. The highest accuracy among the four machine learning algorithms was with SVM (0.837 at a missing ratio of 0.2).
Second, with females, when the missing ratio was 0.2 to 0.6, multiple imputation using SVM showed the highest accuracy, and when the missing ratio was 0.5 or 0.6, accuracy was the same for SVM and ANN. In contrast, when the missing ratio was 0.7 to 0.8, mean imputation using SVM showed the highest accuracy. Among all methods, NN imputation showed the lowest accuracy at all missing ratios.
Finally, in the case of males, when the missing ratio was 0.2 to 0.5, multiple imputation using SVM showed the highest accuracy. Unlike with the female cases, when the missing ratio was 0.5, the accuracy of SVM and ANN was the same, and when the missing ratio was larger than 0.7, mean imputation was more accurate than multiple.

Stature Classification: Both
For each learning algorithm, Table 4 shows the relationship between the missing ratio and the accuracy according to the three imputation methods based on variables for both limbs by male and female. There was no statistical significance at the 95% confidence level in the one-sample t-test for the results of accuracy obtained through 10 repeated trials for the imputation methods and learning algorithms for both sexes.
First, for both sexes, when the missing ratio was 0.2, multiple imputation had the highest accuracy with all algorithms. Among the three imputation methods, NN was the least accurate at all missing ratios, and all algorithms were less accurate as the missing ratio increased. In addition, when the missing ratio was 0.6 or more, mean imputation was more accurate than multiple imputation. This was because the accuracy of multiple imputation decreased more than that of the other methods as the missing ratio increased. Among the four machine learning algorithms, SVM was the most accurate (0.857 at a missing ratio of 0.2).
Second, for females, when the missing ratio was 0.2 to 0.6, multiple imputation using SVM showed the highest accuracy, whereas when the ratio was 0.7 or 0.8, mean imputation using SVM was the most accurate. The results of the experiment confirmed that the accuracy derived through SVM was the highest at all missing ratios. In addition, among the learning algorithms, NB was the least accurate at all missing ratios.
Finally, with males, when the missing ratio was 0.2 to 0.6, multiple imputation using SVM showed the highest accuracy, as with the female cases, but when the missing ratio was larger than 0.7, accuracy was higher with mean rather than multiple imputation.

Discussion and Future Work
The purpose of this study was to investigate the optimal missing value imputation and statistical methods for estimating demographic features through anthropometric measurements. We examined general imputation methods with machine learning algorithms to estimate sex and stature using anthropometric measurements related to the upper and lower limbs. In this study, we compared three ways to impute missing values, and within our machine learning classification analysis, we used seven classes to classify statures. Estimates of these classes are significant for constructing biometric profiles of humans using various kinds of anthropometric data. In addition, this study provides a baseline of comparison for researchers who conduct studies that estimate human biological information based on anthropometric measurements by country and ethnicity.
First, we confirmed through upper and lower limbs that there were differences in the accuracy of the stature estimates for Korean males versus females; specifically, the stature estimates for the men were more accurate than the estimates of Korean women's stature. Previous researchers obtained similar results for Koreans [10,45], but these results were not unique to Koreans: Other researchers found the same results in multiple studies on estimating stature across different countries [4,5,48,49,50].
Second, through our experiments on imputing missing values, we confirmed that multiple imputation was the most accurate in all cases of estimating biological information based on the upper and lower limbs except for high missing ratios. The multiple imputation used in this study assumed a Gaussian distribution, which imputes missing data based on the covariance between variables. Therefore, this method is potentially more accurate than others but has the disadvantage that it requires a larger minimum data set than do other methods for estimating the parameters of the Gaussian distribution. We confirmed similar results in this study: As the missing ratio increased, the accuracy of multiple imputation decreased rapidly; when the missing ratio was over 0.7, mean imputation showed the highest accuracy in all cases. With mean imputation, the missing data are estimated based on the averages of the totals without considering relationships among features, so that the average for each missing ratio in the overall data does not change significantly. The other imputation methods, multiple and nearest neighbor imputation, can be overfitted to the small amount of observed information. In this study, among the three imputation methods, the accuracy of mean imputation decreased the least when the missing ratio increased. Therefore, when estimating stature through anthropometric measurements in Koreans, if the victim's body is severely damaged and it is difficult to obtain measurements for each body part, the missing anthropometric measurements should be filled in using mean imputation.
Third, from the perspective of the learning algorithm, we used two types of linear classification (logistic, NB) and two types of nonlinear classification (SVM, ANN) in this study. In the previous studies of estimating stature through anthropometric data, researchers primarily performed linear regression and classification analysis based on the linearity of the human body. However, recently researchers have confirmed the accuracy of estimating stature using nonlinear or machine learning methods [51]. We also determined that in the context of missing data, which is the main contribution of this study, nonlinear classification was more accurate for measuring stature. Therefore, it is necessary to expand the methodology based on machine learning in research to estimate or classify biological information of humans through anthropometry.
The limitations of this study and directions for future research are as follows. First, this study focused on deriving the best algorithm for estimating stature based on anthropometric measurements of Koreans over a wider range of ages compared to previous studies using statistical methods. However, it did not include the elderly population over the age of 69. The aging of Korean society is progressing, and the population of the elderly is growing rapidly. Therefore, in future studies, it seems necessary to collect anthropometric data of elderly people over 70 years old and propose a more general methodology for estimating stature. Second, for this study we assumed that all missing data occurred randomly, whereas in the fields of anthropology and forensic science, damage to human body parts is correlated. For example, in a corpse without an arm, the probability that the hand is damaged is extremely high. In addition, measurements related to the same body part, such as the foot breadth and the foot length, have a high probability of being missing simultaneously. Therefore, based on the results of this study, it is necessary to carry out additional studies in consideration of missing data patterns specific to humans. Third, it is possible to conduct research to improve accuracy by examining various anthropometric variables that we did not measure in this study.
Since this study was conducted on living people, it can be used when estimating the stature of suspects. However, since a corpse's body measurements change, it is difficult to apply it directly to the identification of victims of events such as crimes or natural disasters. Therefore, in the future, it is necessary to conduct research to find a method for accurately estimating the stature of Korean corpses by additionally considering data on corpses. In addition, because the type of missingness can vary by situation, imputation methods for anthropometric data with structural missing values can be studied. Finally, more sophisticated machine learning classifiers, such as random forests and deeper neural networks, can be used for similar tasks to improve prediction performance.