A Comparative Study of Various Methods for Handling Missing Data in UNSODA

UNSODA, a free international soil database, is very popular and has been used in many fields. However, missing soil property data have limited the utility of this dataset, especially for data-driven models. Here, three machine learning-based methods, i.e., random forest (RF) regression, support vector (SVR) regression, and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean and multiple imputation (MI), were used to impute the missing soil property data, including pH, saturated hydraulic conductivity (SHC), organic matter content (OMC), porosity (PO), and particle density (PD). The missing upper depths (DU) and lower depths (DL) for the sampling locations were also imputed. Before imputing the missing values in UNSODA, a missing value simulation was performed and evaluated quantitatively. Next, nonparametric tests and multiple linear regression were performed to qualitatively evaluate the reliability of these five imputation methods. Results showed that the RMSEs and MAEs of all features fluctuated within acceptable ranges. RF imputation and MI presented the lowest RMSEs and MAEs; both methods are good at explaining the variability of data. The standard error, coefficient of variation, and standard deviation decreased significantly after imputation, and there were no significant differences before and after imputation. Together, DU, pH, SHC, OMC, PO, and PD explained 91.0%, 63.9%, 88.5%, 59.4%, and 90.2% of the variation in BD using RF, SVR, ANN, mean, and MI, respectively; and this value was 99.8% when missing values were discarded. This study suggests that the RF and MI methods may be better for imputing the missing data in UNSODA.


Introduction
Soil properties, including bulk density (BD), water content (WC), particle density (PD), porosity (PO), saturated hydraulic conductivity (SHC), organic matter content (OMC), and pH, can be divided into physical properties and chemical properties [1]. OMC and pH are the most important chemical properties. They have significant effects on plant growth [2][3][4]. Physical properties, such as SHC, BD, WC, PD, and PO are frequently measured to calculate soil's hydraulic properties [5][6][7][8][9] or to characterize soil compaction [10,11]. BD is also widely used as an essential parameter for soil weight-to-volume conversion, especially when calculating the carbon and nutrient contents of a soil layer [12]. SHC is also used to calculate water flux in a soil profile and to design irrigation and drainage systems [13].
In theory, data on these soil properties can be obtained directly through experiments, but in practice, direct measurements are difficult and labor-intensive, and the data are highly variable, particularly for properties such as BD [12,[14][15][16][17]. In addition, the measurement of soil-water characteristic curves and unsaturated soil hydraulic conductivity is also time-consuming and labor-intensive [9,18]. For these reasons, methods have been developed to use property data with low variability to estimate soil properties which are difficult to measure directly; these estimation methods are called pedotransfer functions (PTFs) [19,20], and they are commonly data-driven. One of the advantages of these data-driven models is that they are usually far more flexible than standard statistical models and can capture higher-order interactions in the data, resulting in better predictions. For this reason, the role of soil property data is becoming increasingly important, and a number of soil property datasets have been established. Among them, the Unsaturated Soil Database (UNSODA) [21,22], the European Database of Soil Hydraulic Properties (HYPRES) [23], SoilVision, and the Harmonized World Soil Database (HWSD) [24] are the most representative. The UNSODA database has been widely used because it provides a large amount of information free of charge. For example, Huang and Zhang [25], Hwang and Powers [26], Hwang et al. [27], Mohammadi and Vanclooster [28], and Chang and Cheng [29] predicted SWCCs using particle size distribution data; Ghanbarian-Alavijeh et al. [30] used soil texture data; and Haverkamp et al. [31], Seki [32], Ghanbarian and Hunt [33], Pham et al. [34], and Vaz et al. [35] used soil properties such as BD, PO, and others.
However, there are missing soil property data in UNSODA for pH [pH] (pH), saturated hydraulic conductivity [k_sat] (SHC), organic matter content [OM_content] (OMC), porosity [porosity] (PO), particle density [particle_density] (PD), and bulk density [bulk_density] (BD). Square brackets denote the feature names in the original tables, and parentheses denote the abbreviations used in this study.
Missing data, a real-world problem often encountered in scientific settings, are problematic because many statistical analyses require complete data. Researchers who want to perform a statistical analysis that requires complete data are forced to choose between imputing data and discarding missing values; the latter is the most common way of using UNSODA. However, discarding missing data may not be reasonable when the proportion of missing data is not small, as valuable information may be lost and inferential power compromised [36]. According to Strike et al. [37] and Raymond and Roberts [38], when the dataset contains a very small amount of missing data, e.g., a missing rate of less than 10% or 15% across the whole dataset, missing data can simply be removed without loss of valuable information. However, when the missing rate exceeds 15%, the missing information may reduce insights into the data [39], especially when extracting knowledge from a given dataset; therefore, careful consideration should be given to the handling of missing data. Missing values are of different types, some of which are discussed below [40]: (i) Missing completely at random (MCAR): the probability that a value is missing is unrelated to both the observed and the unobserved values; the missing data can be treated as a random subsample of the complete data. (ii) Missing at random (MAR): the probability that a value is missing depends on already observed values but not on the missing value itself. (iii) Not missing at random (NMAR): the probability that a value is missing depends on the missing value itself.
These different types of anomaly generally arise from different sources. Data MCAR may arise from sensor recording failure, and there may be no other data dependent on the missing data. By contrast, MAR may arise when some survey questions are not answered, yet there are other questions related to the unanswered items.
The data in UNSODA were mainly contributed by individual scientists, and some of the datasets were taken from the literature. A questionnaire based on suggestions from participants at an international workshop on soil hydraulic properties held in Riverside in 1989 was also used to request information for UNSODA [22,41].
The above discussion explains how UNSODA was created, but we still cannot confirm the type(s) of missing data it contains. For imputation purposes, the missing values in UNSODA were assumed to be MAR.
The objective of this paper was to impute missing values in UNSODA; to our knowledge, this work has not been undertaken previously. The main missing features, such as pH, SHC, OMC, PO, and PD, were all included. After reviewing existing missing value imputation techniques, we used the random forest (RF) regression method to impute the missing values in UNSODA. Its performance was then compared with the performances of both machine learning-based methods, i.e., support vector (SVR) regression and artificial neural network (ANN) regression, and statistics-based methods, i.e., the mean and multiple imputation (MI) methods.

A Brief Review of Existing Missing Value Imputation Techniques
Imputation methods involve replacing one missing value with another value that has been estimated based on data mining of available information in the dataset [39]. Imputation methods can be divided into single and multiple imputation methods based on the number of values imputed [39,42]. According to the construction approach used for data imputation, these technologies can also be classified into statistics-based and machine learning-based (or model-based) methods [39]; the details of these approaches are listed in Table 1.
Statistics-based methods are a popular approach for missing data imputation in which a statistic (such as the mean) is calculated for each column, and all missing values in that column are replaced with the statistic. The MI method is another widely used statistics-based method, first proposed by Rubin in the late 1970s [43]. Instead of imputing a single value for each missing datum, multiple imputation creates many completed candidate datasets consistent with the missing-data pattern and then combines these candidate datasets into one estimate for the missing data. Machine learning techniques, such as the k-nearest neighbor (KNN), RF, ANN, and SVR methods, have been widely employed in the last 20 years [44]. It should be noted that KNN approaches tend to perform poorly in high-dimensional and large-scale data settings.
The numbers of available samples for the main features were as follows: code, 790; series, 790; texture, 776; structure, 347; depth_upper, 685; depth_lower, 685; horizon, 651; depth_water, 132; Location, 758; site_ID, 688; annual_rain, 438; avetemp_jan, 308; avetemp_jul, 309; data, 613; publication_ID, 790; keyword, 735; and contact person_ID.
Furthermore, according to Tukey's rule, 176 outliers, whose values were either higher than Q3 + 1.5IQR or lower than Q1 − 1.5IQR (where IQR = Q3 − Q1 is the interquartile range of the dataset, and Q1 and Q3 are the first and third quartiles, respectively), were detected for DU, DL, pH, SHC, OMC, PO, PD, and BD, as shown in Figure 3. It should be noted that most of the outliers were observed for SHC, i.e., 43 out of the 176 outliers.
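Tukey's rule as applied above can be sketched in a few lines; this is a minimal illustration with made-up sample data, not the study's actual detection script:

```python
import numpy as np

def tukey_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's rule)."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]                      # ignore missing entries
    q1, q3 = np.percentile(v, [25, 75])      # first and third quartiles
    iqr = q3 - q1                            # interquartile range
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return v[(v < lower) | (v > upper)]

# A skewed sample, as SHC tends to be, produces outliers under this rule
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 50]
print(tukey_outliers(sample))  # the extreme value 50 is flagged
```

Applying this per feature (DU, DL, pH, SHC, OMC, PO, PD, BD) reproduces the kind of count reported above.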

Procedure for Missing Values Imputation
The experimental procedure for missing value imputation is shown in Figure 4. Before imputing the missing values in the original incomplete dataset, we first considered one complete dataset (Dataset I) (i.e., with missing values discarded). Once missing values were discarded, the number of datasets decreased significantly (n = 109). Second, a missing value simulation was performed. That is, dataset I was simulated with different missing rates (e.g., 3%, 7%, 11%, 15%, 19%, and 23%) using an MAR approach. Different incomplete datasets were produced with different proportions of missing data. The purpose of this design was to simulate and compare quantitatively the advantages and the drawbacks of the different imputation methods.
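The MAR-style deletion step can be sketched as follows; the deletion mechanism (missingness probability tied to another observed column) and all variable names are illustrative assumptions, since the paper does not specify its exact simulation procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mar(X, target_col, driver_col, missing_rate):
    """Delete entries of `target_col` with probability tied to the observed
    `driver_col` (a missing-at-random mechanism), hitting `missing_rate`
    on average. X is an (n, p) float array; returns a copy with NaNs."""
    Xm = X.copy()
    driver = X[:, driver_col]
    # higher driver values -> higher deletion probability (assumed MAR mechanism)
    w = (driver - driver.min()) / (np.ptp(driver) + 1e-12)
    p = missing_rate * w / w.mean()          # scale so the mean probability is missing_rate
    Xm[rng.random(len(X)) < np.clip(p, 0, 1), target_col] = np.nan
    return Xm

X = rng.normal(size=(500, 3))
Xm = simulate_mar(X, target_col=0, driver_col=1, missing_rate=0.15)
print(np.isnan(Xm[:, 0]).mean())  # roughly 0.15
```

Running this with rates 0.03, 0.07, 0.11, 0.15, 0.19, and 0.23 yields the family of incomplete datasets described above.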

Methodology
In this section, we introduce and describe the methods used to impute the dataset with values deleted at random (Dataset I) and the original incomplete dataset (Dataset II):
(i) Statistics-based methods, including mean and MI. (ii) Machine learning-based methods, including RF, SVR, and ANN.

General Notation
Let X be our n × p matrix of predictors that requires imputation [57]. For a variable X_s (the s-th column of X) containing missing values, the dataset can be separated into four parts: the observed values of X_s, denoted y_obs^(s); the missing values of X_s, denoted y_mis^(s); the other variables in the rows where X_s is observed, denoted x_obs^(s); and the other variables in the rows where X_s is missing, denoted x_mis^(s). This notation is used in the descriptions of the imputation methods below.

Mean Imputation
Mean imputation is a simple imputation technique that replaces the missing values y_mis^(s) with the mean of the observed values y_obs^(s). The mean imputation method is easy to perform and simple in process; however, it is sensitive to extreme values of the variable and artificially reduces the variance of the imputed feature.
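Column-wise mean imputation is a one-liner with scikit-learn (the library used in this study, per the parameter calibration section); the toy matrix here is illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan],
              [4.0, 6.0]])

# Each NaN is replaced with the mean of the observed values in its column
imp = SimpleImputer(strategy="mean")
X_filled = imp.fit_transform(X)
print(X_filled)
# column 0: (1 + 7 + 4) / 3 = 4.0; column 1: (2 + 4 + 6) / 3 = 4.0
```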

Multiple Imputation
As shown in Figure 5, the MI method does not attempt to provide an accurate estimate for the missing data, but rather tries to represent a random sample of the missing data by constructing valid statistical inferences that properly reflect the uncertainty due to missing data. Hence, it retains the advantages of single imputation while allowing the data analyst to obtain valid assessments of uncertainty. In this study, we used SPSS 22.0, whose MI algorithm is predictive mean matching (PMM), to impute the missing values.
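PMM itself was run in SPSS here, and scikit-learn has no PMM implementation; still, the core MI idea, several stochastic completions pooled into one estimate, can be sketched with `IterativeImputer` drawing from the posterior predictive distribution (a rough analogue under that assumption, not the paper's PMM procedure):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[rng.random(200) < 0.2, 2] = np.nan     # 20% missing in one column

# m stochastic completions (sample_posterior=True draws imputations from the
# predictive distribution), then pool by averaging -- the essence of MI
completions = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
X_pooled = np.mean(completions, axis=0)
print(np.isnan(X_pooled).sum())  # 0
```

In a full MI analysis one would analyze each completion separately and pool the estimates (Rubin's rules) rather than averaging the data, but the sketch conveys the candidate-datasets step of Figure 5.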

Figure 5. Schematic of the MI method: a dataset with missing data yields several "complete" candidate datasets, which are analyzed and combined into a final estimation.

RF Imputation Method
RF is an ensemble technique capable of performing regression and classification with the use of multiple decision trees and a technique called bootstrap aggregation, commonly known as bagging, which involves training each decision tree on a different data sample, where sampling is performed with replacement [58,59].
We used binary decision trees (i.e., CART) as the base learner for the RF, as shown in Figure 6. It was necessary to consider how to choose split variables (features) and split points, and how to estimate the quality of a candidate split. The quality of splitting on variable x_i at value v_ij was calculated as:

G(x_i, v_ij) = (n_left / N_s) H(X_left) + (n_right / N_s) H(X_right)  (2)

where x_i is the split variable, v_ij is the value of the split variable, n_left is the sample size of the left node, n_right is the sample size of the right node, and N_s is the total sample size at the node. H(X) is the impurity function; for regression trees it is the mean squared error:

H(X) = (1/n) Σ_{i=1}^{n} (y_i − ȳ)²  (3)

where ȳ is the mean of the response values y_i in the node. By substituting Equation (3) into Equation (2), we obtain the split quality in closed form (4), and the training process of a node in the decision tree is mathematically equivalent to the following optimization problem:

(x_i*, v_ij*) = argmin_{(x_i, v_ij)} G(x_i, v_ij)  (5)

The above discussion explains how to train a CART; it should be noted that the prediction of the RF is the average of the predictions of all CARTs. The RF can be used to estimate missing values by fitting an RF to predict the non-missing values of X_s, i.e., y_obs^(s), and using this model to predict the missing values of X_s, i.e., y_mis^(s). BD was considered to be the label because it had the lowest missing ratio (0.0354). The RF imputation (Algorithm 1) can be described as follows:

Algorithm 1: The RF imputation.
1. Make an initial guess for all missing categorical/numeric values (e.g., mean, median);
2. k ← vector of column indices in X, sorted in ascending order of % missing;
3. while the stopping criterion γ is not met do
4.   X_imp_old ← store the previously imputed matrix;
5.   for s in k do
6.     Fit an RF that predicts the non-missing values of X_s: y_obs^(s) from x_obs^(s);
7.     Predict the missing values y_mis^(s) using x_mis^(s);
8.     X_imp_new ← update the imputed matrix with the predicted y_mis^(s);
9.   end for
10. end while
11. Return the final imputed matrix.

The stopping criterion γ is met as soon as the difference between the newly and previously imputed matrices increases for the first time [57].
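A scikit-learn analogue of Algorithm 1 is `IterativeImputer` with an RF estimator, which likewise starts from an initial fill and iteratively regresses each incomplete column on the others; this is a sketch of the missForest-style idea on synthetic data, not the paper's exact implementation:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
X[rng.random(300) < 0.15, 3] = np.nan    # 15% missing in one feature

# Initial mean fill, then round-robin RF regressions over incomplete columns
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    initial_strategy="mean",
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
print(np.isnan(X_imputed).sum())  # 0
```

Note that `IterativeImputer` uses a fixed iteration/tolerance schedule rather than missForest's γ criterion, but the column-by-column RF fitting matches steps 5-8 of Algorithm 1.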

SVR Imputation Method
SVR is an adaptation of the support vector machine (SVM) algorithm for regression problems [60]. The SVR can be divided into two types, i.e., hard margin and soft margin [61]. To illustrate the basic idea behind the SVR, we first introduce the case of linear functions, as shown in Figure 7a,b. In the hard margin model, there are no points outside the shaded region; the parameter ε affects the number of support vectors used in the regression function. That is, the smaller the value of ε, the greater the number of support vectors that will be selected. However, this may not be the case, or we may also want to allow for some errors, as shown in Figure 7b; therefore, the soft margin SVR model was proposed. In the soft margin model, another important parameter, the cost parameter (C), is involved; it determines the tolerance for deviations larger than ε from the real value. That is, only smaller deviations are tolerated for larger values of C. The training process of the SVR is mathematically equivalent to the following optimization problem:

Minimize: (1/2)‖w‖² + C Σ_{i=1}^{n} (ξ_i + ξ_i*)

Subject to: y_i − (w·x_i + b) ≤ ε + ξ_i; (w·x_i + b) − y_i ≤ ε + ξ_i*; ξ_i, ξ_i* ≥ 0

where C is the cost parameter (C ≥ 0), w is the coefficient matrix, b is the bias, and ξ_i and ξ_i* are slack variables measuring deviations beyond ε. The SVR formulation described above yields a linear decision boundary fitted to the training dataset. Kernel functions are commonly used for non-linear SVR; they transform the data into a higher-dimensional feature space to enable linear fitting. Two commonly used kernel functions are the polynomial and Gaussian radial basis functions.
The SVR can be used to estimate missing values by fitting an SVR to predict the non-missing values of X_s, i.e., y_obs^(s), and using this model to predict the missing values of X_s, i.e., y_mis^(s). The SVR imputation algorithm is similar to that of the RF.
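The fit-on-observed, predict-on-missing step for one feature X_s can be sketched directly with scikit-learn's `SVR`; kernel and hyperparameter choices here are illustrative defaults, not the calibrated values from Table 3:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(200) < 0.1, 0] = np.nan     # the incomplete feature X_s

miss = np.isnan(X[:, 0])
X_other = X[:, 1:]                        # the remaining features

# Fit on rows where X_s is observed (y_obs), predict where it is missing (y_mis)
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_other[~miss], X[~miss, 0])
X[miss, 0] = model.predict(X_other[miss])
print(np.isnan(X).sum())  # 0
```

Looping this over every incomplete column, with re-fitting each pass, gives the RF-style iterative scheme with SVR as the base learner.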

ANN Imputation Method
An artificial neural network (ANN) is a computing system designed to simulate the way that the human brain analyzes and processes information. It is a foundation of artificial intelligence and solves problems that would be impossible or difficult with human or standard statistical methods. Feed-forward multi-layer perceptron ANNs are frequently used in engineering applications [62], as shown in Figure 8. Here, we use a standard ANN with one hidden layer as an example. The weights of the first layer connect the input data variables to the H hidden units (neurons), and the second layer's weights connect these hidden neurons to the output units. First, given a p-dimensional input vector x, the H hidden neuron outputs are computed in the form [44]:

z_h = σ( Σ_{j=1}^{p} w_hj^(1) x_j + w_h0^(1) ),  h = 1, 2, ..., H

where z_h is the hidden neuron output, w_hj^(1) is a first-layer weight, w_h0^(1) is the corresponding bias parameter, the superscript (1) indicates that the parameters belong to the first layer, and σ is the activation function, chosen as the ReLU function in this study. By analogy, treating z_h as another input, the output of the network can be calculated.
The ANN can be used to estimate missing values by fitting an ANN to predict the non-missing values of X_s, i.e., y_obs^(s), and using this model to predict the missing values of X_s, i.e., y_mis^(s). The ANN imputation algorithm is also similar to that of the RF.
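The same per-feature scheme with a one-hidden-layer ReLU network can be sketched using scikit-learn's `MLPRegressor`; the hidden-layer size and iteration budget here are illustrative, not the calibrated values from Table 3:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[rng.random(300) < 0.1, 0] = np.nan     # the incomplete feature X_s

miss = np.isnan(X[:, 0])
# One hidden layer with ReLU activation, as described above
net = MLPRegressor(hidden_layer_sizes=(16,), activation="relu",
                   max_iter=2000, random_state=0)
net.fit(X[~miss][:, 1:], X[~miss, 0])    # fit on y_obs
X[miss, 0] = net.predict(X[miss][:, 1:])  # predict y_mis
print(np.isnan(X).sum())  # 0
```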

Figure 8. A schematic of an artificial neural network (ANN) with an input layer, two hidden layers, an output layer, and a single neuron output.

Quantitative Evaluation
To assess the quality of the RF, SVR, ANN, mean, and MI predictions on the complete dataset (Dataset I), it was essential to establish metrics that allow comparison of the different methods. This evaluation consisted of a comparison between the predicted and actual values. We used two common statistical measurements, the root mean square error (RMSE) and the mean absolute error (MAE):

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (x_i − x̂_i)² )

MAE = (1/n) Σ_{i=1}^{n} |x_i − x̂_i|

where x_i is the actual value, x̂_i is the predicted value, and n is the total number of missing values.
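The two metrics can be written directly from their definitions; the small worked example shows the arithmetic:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error: sqrt of the mean squared difference."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.sqrt(np.mean((a - p) ** 2))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(a - p))

actual = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.0]
print(rmse(actual, predicted))  # sqrt((0.25 + 0 + 1) / 3) ≈ 0.645
print(mae(actual, predicted))   # (0.5 + 0 + 1) / 3 = 0.5
```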

Qualitative Evaluation
The RMSE and MAE were used to determine which method performed better at imputing the simulated missing values. For the original incomplete dataset, however, the real values of the missing entries are unknown, and we therefore used nonparametric tests (the reasons for not using ANOVA follow) to analyze whether there were any statistically significant differences in each feature before and after imputation. As the missing values in UNSODA were assumed to be MAR, the data before and after imputation were expected not to show significant differences. Before an ANOVA could be considered, all features before and after imputation had to be tested for normality; in this study, the values of skewness and kurtosis were used. In addition, we used a multiple linear regression model to determine quantitatively which imputation method performed better for UNSODA.
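The normality screen and the subsequent nonparametric comparison can be sketched with SciPy; the Mann-Whitney U test is one common choice here (the paper does not name its exact test), and the synthetic lognormal data stands in for a skewed feature such as SHC:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.lognormal(size=500)                 # skewed raw feature
# e.g. a mean-imputed version: observed values plus fills at the sample mean
after = np.concatenate([before, np.full(50, before.mean())])

# Normality screen via skewness and kurtosis, as in this study;
# strongly non-zero values argue against ANOVA's normality assumption
print(stats.skew(before), stats.kurtosis(before))

# Non-normal, so compare the distributions with a nonparametric test
stat, p = stats.mannwhitneyu(before, after)
print(p)   # p > 0.05 would indicate no significant before/after difference
```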

Parameter Determination and Sensitivity Analysis
In this study, scikit-learn (version 0.22), an open-source machine learning library, was used to perform the model training. We take the RF as an example to explain how the parameters were calibrated.
As discussed in Section 4.3.1, RF is a meta estimator that fits a number of CARTs on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
The main parameters include the number of trees in the forest (n_estimators), the maximum depth of a tree (max_depth), and the minimum number of samples required to split an internal node (min_samples_split, default = 2). It should be noted that the numbers of features and the sample sizes in Datasets I and II were limited; therefore, max_depth was set to None, meaning that nodes were expanded until all leaves were pure or all leaves contained fewer than min_samples_split samples.
According to Bisong [63] and Pham et al. [64], the number of trees in the forest has a significant effect on model accuracy, and the RF converges to a lower generalization error as the number of trees increases. Dataset I with a missing ratio of 0.15 was used to estimate the number of trees. As shown in Figure 9, the mean squared errors (MSEs) of pH and SHC did not decline further once the numbers of trees exceeded 31 and 41, respectively. However, considering the sample size and the number of features, n_estimators was set to 100. By analogy, the parameters of the SVR and ANN can be calibrated; the details are listed in Table 3.
Table 4 summarizes statistical measurements of the performances of the RF, SVR, ANN, mean, and MI methods on Dataset I. The results revealed that the missing proportion has a potential effect on the performances of the imputation methods. RF and MI outperformed the other imputation methods examined in this study; the mean and ANN methods had inferior performances.
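The tree-count calibration described above, watching cross-validated MSE flatten as n_estimators grows, can be sketched as follows on synthetic regression data (the data and grid are illustrative, not Dataset I):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

# Out-of-sample MSE for an increasing number of trees; the curve typically
# flattens once the forest is large enough (n_estimators = 100 was used here)
for n in [10, 30, 50, 100]:
    rf = RandomForestRegressor(n_estimators=n, max_depth=None, random_state=0)
    mse = -cross_val_score(rf, X, y, cv=3,
                           scoring="neg_mean_squared_error").mean()
    print(n, round(mse, 3))
```

Picking the smallest n past the plateau trades accuracy against training cost; setting it somewhat higher (100 here) is a conservative choice.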
For further discussion, Figure 10 presents the relationship between the missing proportion and the statistical measurements for each imputation method. For RF imputation, when the missing proportion increased from 0.03 to 0.11, the RMSE of DU increased from 5.99 to 33.32, and the MAE increased from 6.61 to 19.80; when the missing proportion increased further from 0.11 to 0.15, the errors fluctuated rather than increasing monotonically.
The above discussion cannot clearly establish which methods performed better for Dataset I, because the RMSEs and MAEs of all features fluctuated within acceptable ranges, except for SHC. It should be noted that the RMSE and MAE of SHC were high for every imputation method, probably because of the outliers in the raw data, as shown in Figure 3d. These results indicate that the RF, SVR, ANN, mean, and MI methods are adequate and could be used to impute the missing values in the original incomplete dataset (Dataset II).

Qualitative Measures for Dataset II
After imputing missing values in Dataset II, we took the features (or independent variables) DU, DL, pH, SHC, OMC, PO, and PD as the x-axis variables and the label (or dependent variable) BD as the y-axis variable. We denote by XXR a feature after RF imputation, XXS after SVR imputation, XXA after ANN imputation, XXM after mean imputation, XXMI after MI imputation, and XXO before imputation (i.e., the raw data). Figure 11 shows the distributions of DU, DL, pH, SHC, OMC, PO, and PD before and after imputation. The results demonstrate that the data for each feature were distributed well in the major quadrants after RF, SVR, ANN, mean, and MI imputation, indicating that these imputation methods are feasible. The sample size, minimum value, maximum value, mean value, standard deviation, standard error, median, skewness, kurtosis, coefficient of variation, and variance of each feature before and after imputation are listed in Table 5. The mean value was computed from the sample data. It should be noted that the standard deviation differs from the standard error: the standard deviation indicates approximately how far individual values are from the sample mean, whereas the standard error estimates the variability of the sample mean, i.e., approximately how far it is from the population mean [65].
The standard errors of DU, DL, pH, SHC, OMC, PO, and PD decreased after imputation, for example, to 1.358, 4.197, 0.033, 52.325, 0.244, 0.003, and 0.003 after mean imputation and to 1.395, 4.209, 0.039, 57.427, 0.270, 0.003, and 0.004 after MI imputation, indicating that the sample means became closer to the population means. The decreased coefficients of variation and standard deviations indicated that individual values were closer to the sample means.
Table 5 also shows that the maximum values of DUO, DLO, SHCO, and OMCO were three standard deviations away from the mean values, and the maximum values of DUR, DUS, DUA, DLR, DLS, DLA, SHCR, SHCS, SHCA, OMCR, OMCS, OMCA, DUM, DLM, SHCM, OMCM, DUMI, DLMI, SHCMI, and OMCMI were still three standard deviations away from the mean values, indicating that the raw data fluctuate greatly. To visualize this finding more clearly, the means, standard deviations, and standard errors of all features before and after imputation are plotted in Figure 12.
Figure 13 also shows that mean imputation generated many outliers, for example for pH and PO, which indicates that mean imputation should not be used for pH and PO. It is worth noting that ANN imputation also generated outliers for DL.
The skewness and kurtosis values in Table 5 show that none of these features strictly met the criterion for normality. Therefore, nonparametric tests rather than ANOVA were used to investigate whether there were differences before and after imputation; the results are listed in Table 6. Table 6 suggests that there were no significant differences before and after imputation, except between SHCO and SHCR, SHCO and SHCM, SHCO and SHCMI, POO and POR, POO and POM, POO and POMI, pHO and pHA, OMCO and OMCA, OMCO and OMCM, and OMCO and OMCMI. As discussed above, this result may have been caused by the raw data. Although there was a difference in SHC and PO before and after RF imputation, we consider the RF imputation still valid based on observations of the mean and median. The pH and OMC metrics before and after ANN imputation are similar.
After the nonparametric tests, we used a multiple linear regression model to determine quantitatively which imputation method performed better for UNSODA. In the multiple linear regression model, DU, DL, pH, SHC, OMC, PO, and PD were again considered the independent variables and BD the dependent variable. However, DU and DL are collinear; therefore, only one of the two should be considered, and DU was used in this study. For comparison, multiple linear regression was also performed for these features with missing data imputed by zero.
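The regression-plus-collinearity check used below (R², F-test, VIF) can be sketched with scikit-learn; the synthetic features here merely stand in for DU, pH, SHC, OMC, PO, and PD, and the VIF is computed from its definition rather than a dedicated library call:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
features = rng.normal(size=(n, 4))               # stand-ins for the predictors
# a BD-like response driven mainly by two of the features, plus noise
bd = 1.5 - 2.4 * features[:, 2] + 0.4 * features[:, 3] \
     + rng.normal(scale=0.1, size=n)

model = LinearRegression().fit(features, bd)
r2 = model.score(features, bd)                   # share of BD variance explained
print(round(r2, 3))

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) from
    regressing that column on all the other columns."""
    others = np.delete(X, j, axis=1)
    r2_j = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2_j)

print([round(vif(features, j), 2) for j in range(features.shape[1])])
# independent predictors give VIFs near 1, well below the threshold of 5
```

Dropping DL before fitting, as done above, is exactly what keeps all VIFs below the usual threshold of five.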
The results of multiple linear regression after RF, SVR, ANN, mean, and zero imputations are presented in Table 7. Table 7 shows that R² was 0.910 for the RF imputation, which means that DU, pH, SHC, OMC, PO, and PD could explain 91.0% of the variation in BD. The model passed the F-test (F = 1273.712, p = 0.000 < 0.05), indicating that at least one of DU, pH, SHC, OMC, PO, and PD affects BD, and the model's formula was: BD = 1.483 + (0.000 × DU) − (0.001 × pH) + (0.000 × SHC) + (0.002 × OMC) − (2.481 × PO) + (0.410 × PD). A test for multicollinearity indicated that all variance inflation factor (VIF) values in the model were less than five, which means there was no collinearity problem [66]. The results for the SVR, ANN, mean, MI, and zero imputations were similar to those for RF imputation, but their R² values (0.639 for SVR imputation, 0.885 for ANN imputation, 0.594 for mean imputation, 0.902 for the MI method, and 0.379 for zero imputation) were smaller. It should be noted that the R² value for MI imputation is close to that of the RF imputation method, and the MI model also passed the F-test (F = 1154.513, p = 0.000 < 0.05), indicating that the performance of the MI method was close to that of the RF imputation. We also considered the effect of discarding missing values, which decreased the number of datasets substantially (n = 109); the corresponding analysis results are also presented in Table 7. In that case, the R² value was 0.998, which means that DU, pH, SHC, OMC, PO, and PD could explain 99.8% of the variation in BD. This model also passed the F-test (F = 6932.797, p = 0.000 < 0.05), indicating that at least one of DU, pH, SHC, OMC, PO, and PD affects BD, and the model's formula was: BD = 1.069 + (0.000 × DU) − (0.002 × pH) + (0.000 × SHC) + (0.002 × OMC) − (2.551 × PO) + (0.581 × PD).
Although the R² values of the RF, SVR, and ANN imputations were smaller than the value obtained after discarding missing values, the above discussion suggests that these imputation methods are feasible, and that the RF and MI methods may be better than the SVR, ANN, mean, and zero imputations. The actual and predicted BD values are shown in Figure 14.

Discussion
The reason for using regression to impute missing values is that a regression algorithm assumes there is some relationship between the feature matrix and the label. That is, just as we can use features X1, X2, X3, and X4 to predict the label Y, we can also use Y, X2, X3, and X4 to predict X1.
Before imputing the missing values in the original incomplete dataset (Dataset II), Dataset I was simulated with different missing proportions (e.g., 3%, 7%, 11%, 15%, 19%, and 23%) using an MAR approach. It should be noted that the maximum simulated missing proportion was 23%, which is far below the average level of missing data in Dataset II. In fact, it would be difficult to simulate a higher missing proportion given the sample size of Dataset I: because the sample size is insufficient, the ANN imputation fails to converge as the missing proportion increases.
In this study, we mainly focused on comparing various methods of handling missing data in UNSODA; the imputed dataset was not used for modeling or applied in a case study. The results would be more convincing if the quality of a real case study could be shown to improve after imputing the missing data using the proposed methods. From the multiple linear regression, we can infer that the data after RF and MI imputation can be used in a case study. In the future, we will use the imputed Dataset II to predict BD according to the PTFs proposed by Yi et al. [12], Curtis and Post [14], Adams [67], and Rawls [68].
In addition, the outliers were not deleted or replaced before imputation, which may have caused the large RMSEs and MAEs for SHC. In a subsequent study, we will further explore how to predict SHC well.

Conclusions
Three machine learning-based methods, i.e., random forest (RF) regression, support vector (SVR) regression, and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean and multiple imputation (MI), were used to impute the missing data for DU, DL, pH, SHC, OMC, PO, and PD in UNSODA. Both quantitative and qualitative methods were used to evaluate their feasibility. Based on the results, we can conclude the following: (1) The RMSEs and MAEs of DU, DL, pH, SHC, OMC, PO, and PD for the complete dataset indicate that the RF, SVR, ANN, mean, and MI methods are appropriate for imputing the missing values in UNSODA. (2) The standard error decreased significantly after imputation, indicating that the sample means became closer to the population means; the decreased coefficients of variation and standard deviations indicate that individual data points were closer to the sample means. (3) There were no significant differences before and after imputation for DU, DL, pH, SHC, OMC, PO, and PD. (4) DU, pH, SHC, OMC, PO, and PD explained 91.0%, 63.9%, 88.5%, 59.4%, and 90.2% of the variation in BD after RF, SVR, ANN, mean, and MI imputation, respectively; this value was 99.8% when missing values were discarded. (5) This study suggests that the RF and MI methods may be best for imputing the missing data in UNSODA.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The raw data used during this study were obtained from a public dataset (https://doi.org/10.15482/USDA.ADC/1173246). The datasets analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: