Machine Learning Data Imputation and Prediction of Foraging Group Size in a Kleptoparasitic Spider

: Cost–beneﬁt analysis is widely used to elucidate the association between foraging group size and resource size. Despite advances in the development of theoretical frameworks, however, the empirical systems used for testing are hindered by the vagaries of ﬁeld surveys and incomplete data. This study developed the three approaches to data imputation based on machine learning (ML) algorithms with the aim of rescuing valuable ﬁeld data. Using 163 host spider webs (132 complete data and 31 incomplete data), our results indicated that the data imputation based on random forest algo-rithm outperformed classiﬁcation and regression trees, the k -nearest neighbor, and other conventional approaches (Wilcoxon signed-rank test and correlation difference have p -value from < 0.001–0.030). We then used rescued data based on a natural system involving kleptoparasitic spiders from Taiwan and Vietnam ( Argyrodes miniaceus , Theridiidae) to test the occurrence and group size of kleptopara-sites in natural populations. Our partial least-squares path modelling (PLS-PM) results demonstrated that the size of the host web ( T = 6.890, p = 0.000) is a signiﬁcant feature affecting group size. The resource size ( T = 2.590, p = 0.010) and the microclimate ( T = 3.230, p = 0.001) are signiﬁcant features affecting the presence of kleptoparasites. The test of conformation of group size distribution to the ideal free distribution (IFD) model revealed that predictions pertaining to per-capita resource size were underestimated (bootstrap resampling mean slopes <IFD predicted slopes, p < 0.001). These ﬁndings highlight the importance of applying appropriate ML methods to the handling of missing ﬁeld data.


Introduction
In natural populations, it is common to see multiple conspecific individuals foraging in a single resource patch. The simplest explanation for this phenomenon is the gathering of individuals in the vicinity of resources that are patchily distributed [1]. Theoretical models of foraging behavior have been developed to facilitate the prediction of the foraging group size [2]. Field ecologists frequently conduct ecological surveys of resource utility strategies in natural populations; however, the data they bring back are often incomplete due to instrument failure, human error, or weather conditions [3]. Researchers require a large number of samples to reveal features that could be used to predict the number of individuals in a resource patch and to assess the fitness costs and benefits of remaining webs follows the IFD model; (2) the presence/absence of kleptoparasites and the group size are determined primarily by features related to resource availability; (3) the densities of hosts and kleptoparasites and the physical conditions of microhabitats are key features in predicting the occurrence of kleptoparasites.

Field Site Selection and Surveys
This study surveyed three populations of A. miniaceus in Taiwan, including Huoyenshan (June 2008), Green Island (June 2019; November 2018), Orchid Island (November 2018), as well as Hanoi in Vietnam (September 2018). Using the transect line method, we surveyed 163 host webs of Nephila pilipes (Nephilidae), which are linearly distributed along light gaps (i.e., trails) in forests. The length and width of each transect were based on local populations of hosts within an area measuring 1.0 × 0.1 km 2 . We searched exhaustively to find all potential host webs in the target area (with and without kleptoparasites), determined the coordinates of each web using a hand-held GPS device (Garmin eTrex Summit HC, Taipei, Taiwan), and counted the number of kleptoparasites present in each web. We also measured features related to resource size, microclimate, and resource density at each web site (see below).

Kleptoparasite Foraging Group Size and Occurrence
We examined features that could potentially be used to predict the size of kleptoparasites groups in a given web (Klepto NUM) and the occurrence of kleptoparasites (Klepto Y/N) for > 40 webs/transect. The putative features fell into three categories related to resource size, microclimate, and resource density. Resource size was determined by the body length of host (BLH), vertical length of web (VLW), horizontal length of web (HLW), and web area (WA). Microclimate was characterized by instant wind speed (IWS), height of the web from ground (HWG), humidity (H%), luminance (Lux), and temperature (Temp). Resource density estimates were based on the minimum distance to the next host (Min D Host) and minimum distance to the next kleptoparasitic foraging group (Min D Klepto) on a different web. VLW and HLW were used to calculate the web area using the formula for ovals ( VLW 2 × HLW 2 × π). Note that only web area data were used for analysis. All data were obtained on the same clear day within a single, four-hour time interval. Distances to the nearest host web and distances to the nearest host web with kleptoparasites were estimated using GPS data. A hand-held mini-environmental meter (LM-8000, Lutron, Taipei, Taiwan) was used to measure the environmental features. We counted the numbers of female, male, and juvenile kleptoparasites in each web, the three of which were summed to give the group size.
(2) We divided the original dataset (N = 163 samples) into a set with missing values (N = 31 samples) and a set without missing values (N = 132 samples). The latter dataset was termed the RAW dataset. (3) The mean absolute error (MAE) and root mean squared error (RMSE) were estimated using the leave-one-out cross-validation (LOOCV) method, which involves sequentially selecting one sample from the RAW dataset. The trained ML-based models were used for imputing missing data. The final results were compared with a RAW dataset (i.e., the total dataset minus the missing data) (Supplementary Materials: Table S1). MAE was calculated as follows: where m indicates the number of missing values, y i is a real value from the RAW dataset, andŷ i is the imputed value. RMSE was calculated as follows: where m indicates the number of missing values, y i is a real value from the RAW dataset, andŷ i is the imputed value. Figure 2 presents the workflow used to evaluate the effectiveness of the data imputation methods. One to three features (HWG, HLW, VLW, BLH, Lux, H%, Temp, and IWS) are then sequentially deleted from the sample in question. The deleted data points are used as a testing set, and the remaining (i.e., non-deleted) data points are used as a training set. These data imputation methods were ranked in terms of MAE and RMSE.
Mathematics 2021, 9,415 4 of 16 samples). The latter dataset was termed the RAW dataset. (3) The mean absolute error (MAE) and root mean squared error (RMSE) were estimated using the leave-one-out cross-validation (LOOCV) method, which involves sequentially selecting one sample from the RAW dataset. The trained ML-based models were used for imputing missing data. The final results were compared with a RAW dataset (i.e., the total dataset minus the missing data) (Supplementary Materials: Table S1). MAE was calculated as follows: where m indicates the number of missing values, is a real value from the RAW dataset, and is the imputed value. RMSE was calculated as follows: where m indicates the number of missing values, is a real value from the RAW dataset, and is the imputed value. Figure 2 presents the workflow used to evaluate the effectiveness of the data imputation methods. One to three features (HWG, HLW, VLW, BLH, Lux, H%, Temp, and IWS) are then sequentially deleted from the sample in question. The deleted data points are used as a testing set, and the remaining (i.e., non-deleted) data points are used as a training set. These data imputation methods were ranked in terms of MAE and RMSE.    Figure 2. The procedures of simulation experiment design and the process of validation of different data imputation methods. The area shaded in gray is the process of numerical data standardization. We validated the qualities of imputation results using the leave-one-out cross-validation (LOOCV) method using test samples by evaluating the deviation of mean absolute error (MAE) and root mean squared error (RMSE) (shaded in green). For the missing set, the new sub-sets (pink blocks) are generated after the imputation by different methods. After integrating the new sub-set with the non-missing dataset, the Pearson correlation and Wilcoxon signed-rank test are used to evaluate the different methods deviation of the correlation. The positive ranks indicate the lower deviation, whereas the negative ranks indicate the larger deviation. The lower MAE and RMSE are the optimal imputation results. The application of the ecological machine-learning imputation tool (Eco-MIT) method is higlighted in purple.
We determined the values for data points in the missing set using a variety of methods and then substituted those values into the RAW dataset to create an imputed dataset. Finally, we estimated the degree of correlation between each feature from the imputation datasets using the Pearson correlation coefficient (Supplementary Materials: Table S2). We calculated the differences between the correlation coefficients of feature pairs in the RAW dataset for each data imputation method (Supplementary Materials: Table S3). The differences between correlation coefficients were calculated as follows: Figure 2. The procedures of simulation experiment design and the process of validation of different data imputation methods. The area shaded in gray is the process of numerical data standardization. We validated the qualities of imputation results using the leave-one-out cross-validation (LOOCV) method using test samples by evaluating the deviation of mean absolute error (MAE) and root mean squared error (RMSE) (shaded in green). For the missing set, the new sub-sets (pink blocks) are generated after the imputation by different methods. After integrating the new sub-set with the non-missing dataset, the Pearson correlation and Wilcoxon signed-rank test are used to evaluate the different methods deviation of the correlation. The positive ranks indicate the lower deviation, whereas the negative ranks indicate the larger deviation. The lower MAE and RMSE are the optimal imputation results. The application of the ecological machine-learning imputation tool (Eco-MIT) method is higlighted in purple.
We determined the values for data points in the missing set using a variety of methods and then substituted those values into the RAW dataset to create an imputed dataset. Finally, we estimated the degree of correlation between each feature from the imputation datasets using the Pearson correlation coefficient (Supplementary Materials: Table S2). We calculated the differences between the correlation coefficients of feature pairs in the RAW dataset for each data imputation method (Supplementary Materials: Table S3). The differences between correlation coefficients were calculated as follows: where I is the correlation coefficient of one feature pair from the imputation dataset, and N is from the RAW dataset. A smaller value means that there is less deviation between the imputation method and the RAW dataset, thereby indicating a superior imputation method. Calculations were performed using the Wilcoxon signed-rank test in SPSS version 22.

Partial Least-Squares Path Modelling (PLS-PM) of Population Size Features
Partial least-squares path modelling (PLS-PM) was used to evaluate the significance of latent features and sup-features relative to response features. This analysis was applied to all of the datasets. The results were then compared with the RAW dataset.
PLS-PM is one approach to structural equation modelling aimed at evaluating the direct and indirect regression relationships between features. Before conducting PLS-PM, we conducted pairwise correlation analysis between features using the R package GGally [30]. We then removed one of the collected features in cases where the correlation coefficient was >0.7 (Supplementary Materials: Figure S1). The retained features were used to construct the PLS-PM model using the R package PLSPM [31]. Two sub-models were used for PLS-PM: an inner model (path correlations between response features and latent features) and an outer model (path correlations between latent features and its block of sub-features). In the inner model, we defined two response features: Klepto NUM and Klepto Y/N of A. miniaceus foraging groups [31], and set resource size, microclimate, and resource density as latent features. Each latent feature was clustered according to its reflective/formative sup-features, called the outer model: (1) resource density was clustered in terms of reflective sup-features, as Min D Host and Min D Klepto. (2) Resource size was clustered in terms of formative sup-features, as WA and BLH. (3) We defined all environmental features as formative sup-features of microclimate. We used the loading (for reflective model) and weight (for formative model) >|0.7| to determine the significance of a feature in the outer models. The importance of the latent features was evaluated using R 2 values. All statistical analysis was conducted in the R environment [32].
The Python package RFPIMP [4] (a Scikit-learn random forest approach, [4] RFPIMP: https://github.com/parrt/random-forest-importances, accessed on 1 December 2020) was used to evaluate the importance ranking of features to response features (Klepto NUM and Klepto Y/N). Using out-of-bag data, we bootstrapped the number of trees = 10,000 and set the minimum sample size at a leaf node of 1 with the other settings at their default values. For Klepto Y/N (discrete feature), we used the loss of mean accuracy. For Klepto NUM (continuous feature), we used the change in R 2 . Evaluations were performed by removing the data of one feature and measuring the resulting statistical changes. The features were ranked in terms of their contribution to variations in the response features.

IFD Test: Group Size and Web Area
Tests were conducted to determine whether the theory of IFD held within our sample population. In this context, web area represents resource size. We calculated the sum of all web areas (WA)/total population size (N) (i.e., the total number of kleptoparasites in all samples) from the RAW and best-imputation datasets. This sum represents the per-capita web area (WA/N), or slope, under the assumption of IFD. We bootstrapped 50% of the data 10 times and 100 times to generate the distributions of the subsets. We repeated this process 100 times to generate a probability distribution for the slope mean (i.e., linear regression coefficients). The one-sample Mann Whitney U-Test was implemented in SciPy ver1.3.1 [33] to test the significance of the difference between theoretical values (from the RAW and best-imputation datasets) and the bootstrapped value in the Python package pandas ver. 0.25.1 (Pandas development team, 2020).

Results
In the original data (163 host spider webs, N = 132 complete data, and N = 31 incomplete data), the number of missing samples ranged from 2.45% to 17.79% of the values pertaining to each feature (mean = 4.17 ± 5.09%). Among the 163 samples, 43 had no kleptoparasites (26.71%). Among the 118 webs with kleptoparasites, the mean group size was 4.84 ± 8.19 individuals per web and the largest group included 44 individuals.
We also used the Wilcoxon signed-rank test and the Pearson correlation coefficient to determine whether the RF method statistically outperformed the other methods. We compared the differences in the correlation coefficients between the RAW dataset and the datasets established using the various imputation methods (RAW-RF, RAW-ZERO, RAW-MEAN, RAW-KNN, and RAW-CART). Overall, RAW-RF significantly outperformed all of the other methods (Table 2).

Results
In the original data (163 host spider webs, N = 132 complete data, and N = 31 incomplete data), the number of missing samples ranged from 2.45% to 17.79% of the values pertaining to each feature (mean = 4.17 ± 5.09%). Among the 163 samples, 43 had no kleptoparasites (26.71%). Among the 118 webs with kleptoparasites, the mean group size was 4.84 ± 8.19 individuals per web and the largest group included 44 individuals.

Comparison of Data Imputation Strategies
As shown in Figure 3a We also used the Wilcoxon signed-rank test and the Pearson correlation coefficient to determine whether the RF method statistically outperformed the other methods. We compared the differences in the correlation coefficients between the RAW dataset and the datasets established using the various imputation methods (RAW-RF, RAW-ZERO, RAW-MEAN, RAW-KNN, and RAW-CART). Overall, RAW-RF significantly outperformed all of the other methods (Table 2).   To evaluate the effect of each method on the addition of missing datasets, we calculated the Pearson correlation difference between the imputation dataset and RAW. The Wilcoxon signed-rank test evaluated the result differences between RAW-RF and other methods. By comparing the difference of 66 Pearson correlations, results show that RAW-RF has more R + than other methods, which indicates that the Pearson correlation deviation of the dataset is smaller after the RF method supplement. RF is stable for the imputation of the dataset. The correlation difference between RAW-RF and another method was evaluated using a two-tailed significance test, in which the p-values were less than 0.05, indicating a significant improvement. A comparison of the PLS-PM outcomes with the imputation dataset and the RAW dataset (Table 3) made it possible to determine whether our imputation method affected statistical interpretation. The effects of the two inner models were as follows: (1) the path correlations between Klepto NUM and latent features revealed that the trends produced by all of the methods matched the trend of the RAW dataset; i.e., resource size was significantly positively correlated with group size. The outer model of resource size produced larger weights for web area (WA weight = 0.775 to 0.886) than for the body length of hosts (BLH weight = 0.195 to 0.338), regardless of the imputation method. Unlike the RAW datasets where WA weight = 0.527 and BLH weight = 0.562, all of the imputation results identified web area as a more important factor of group size. The PLS-PM outcomes of KNN, CART, and RF data conformed to the RAW dataset in terms of resource density and microclimate. However, the PLS-PM outcomes of MEAN and ZERO data presented a trend opposite to that of the RAW dataset in terms of temperature weights. (2) The path correlations between Klepto Y/N and latent features presented the same trend for the imputed datasets and RAW dataset. For the outer models of each latent feature, resource size (with the feature WA weight = 0.621 to 1.137 from all datasets) and microclimate (with the feature HWG weights = 0.684 to 0.759) presented significantly positive regression to Klepto Y/N. Nonetheless, the sup-feature BLH from ZERO data presented a trend opposite to that of the other datasets.

Biological Significance of Features Related to Group Size
All 163 samples from the RF dataset ( Figure 4) were subjected to PLS-PM analysis to identify features capable of explaining the variation in the two response features (Klepto NUM and Klepto Y/N). In the inner model, (1) resource size (T = 2.590, p = 0.010) and microclimate (T = 3.230, p = 0.001) were both significantly correlated with Klepto Y/N (Figure 4a). For the outer model of each latent feature, the weight of WA was > 0.7 (weight = 0.713) with resource size, and the weight of WHG was > 0.7 (weight = 0.720) with microclimate ( Figure 4a). (2) Resource size (T = 6.890, p = 0.000) was significantly correlated with Klepto NUM. The weight of WA was > 0.7 (weight = 0.775) with resource size (Figure 4b). The PLS-PM results revealed that resource size was significantly correlated with Klepto NUM and Klepto Y/N. Furthermore, the microclimate of the web was significantly correlated with Klepto Y/N. The results of RF ranking matched those obtained using PLS-PM ( Figure 5). In the predictive results for Klepto Y/N (Figure 5a), Min D Klepto, WHG, and BLH led to a > 20% increase in MSE the original data were randomized. In the predictive results of Klepto NUM (Figure 5b), WA and BLH led to a > 20% increase in MSE the original data were randomized. The results of RF ranking revealed that the microclimate (height of the web), resource size (area size and host size), and density of competitors (minimum distance to the kleptoparasites in another web) may also affect the occurrence and group size of kleptoparasites. The results of RF ranking matched those obtained using PLS-PM ( Figure 5). In the predictive results for Klepto Y/N (Figure 5a), Min D Klepto, WHG, and BLH led to a > 20% increase in MSE the original data were randomized. In the predictive results of Klepto NUM (Figure 5b), WA and BLH led to a > 20% increase in MSE the original data were randomized. The results of RF ranking revealed that the microclimate (height of the web), resource size (area size and host size), and density of competitors (minimum distance to the kleptoparasites in another web) may also affect the occurrence and group size of kleptoparasites. The IFD test showed that the mean slopes of the bootstrapped per-capita web area (cm 2 , area/individual), which represents the resource area that an individual can occupy, are less than the prediction of IFD. This pattern was observed in the RAW dataset ( Figure  6a) as well as the RF imputed dataset (Figure 6b), regardless of bootstrapping. The mean slopes obtained from the bootstrapped data were significantly smaller (p < 0.001 for most of the resampling) than the slopes predicted under the IFD theory. The IFD test showed that the mean slopes of the bootstrapped per-capita web area (cm 2 , area/individual), which represents the resource area that an individual can occupy, are less than the prediction of IFD. This pattern was observed in the RAW dataset (Figure 6a) as well as the RF imputed dataset (Figure 6b), regardless of bootstrapping. The mean slopes obtained from the bootstrapped data were significantly smaller (p < 0.001 for most of the resampling) than the slopes predicted under the IFD theory.

Discussion
The prediction and modelling of foraging group size is an important aspect of the theories related to foraging [34]. In this study, the feature with the most pronounced effect on foraging group sizes was resource size ( Figure 5). The dispersion of kleptoparasites did not conform to the predictions of the IFD theory, in which overloading is expected in some resource patches ( Figure 6). Note that the presence of kleptoparasites depended on resource size, as well as the density of neighboring competitors, and environmental conditions (humidity and elevation of the web above the ground). Moreover, the proposed data imputation method using the RF approach could be of considerable benefit to field ecologists examining the relationship between population dispersion and environmental factors.
Our results are in line with those of previous studies, indicating that the host area is crucial to the size of kleptoparasite groups [11,13]. The positive correlation between host areas and group sizes has previously been cited as evidence that the dispersion of kleptoparasites follows IFD [11,13]; however, our results revealed a smaller per-capita resource

Discussion
The prediction and modelling of foraging group size is an important aspect of the theories related to foraging [34]. In this study, the feature with the most pronounced effect on foraging group sizes was resource size ( Figure 5). The dispersion of kleptoparasites did not conform to the predictions of the IFD theory, in which overloading is expected in some resource patches ( Figure 6). Note that the presence of kleptoparasites depended on resource size, as well as the density of neighboring competitors, and environmental conditions (humidity and elevation of the web above the ground). Moreover, the proposed data imputation method using the RF approach could be of considerable benefit to field ecologists examining the relationship between population dispersion and environmental factors.
Our results are in line with those of previous studies, indicating that the host area is crucial to the size of kleptoparasite groups [11,13]. The positive correlation between host areas and group sizes has previously been cited as evidence that the dispersion of kleptoparasites follows IFD [11,13]; however, our results revealed a smaller per-capita resource occupation scenario. The case study of kleptoparasitic spiders conforms to a continuous-input IFD model [35,36] because the host webs have a constant influx of prey, which is rapidly consumed by kleptoparasites or the host. It has been proposed that the group size in a resource patch can be predicted by the size of the resource (or "dispersion economy") [2]. The deviation from IFD in our case study could be attributed to free access to other resource patches. This indicates that access to other resources could be used as an additional criterion for IFD [35,36]. In this study, the overloading of some resource patches (smaller per-capita resources than expected) showed evidence of impeded dispersal among webs or lower within-group competition. Impeded dispersal would lead to the depletion of resources as the group size increases; the reduction in competition could be attributed to the high conspecific tolerance of A. miniaceus [9].
In our PLS-PM results, the predictive power of resource size for group size was greater than those of resource density and microclimate (Figure 4a). However, microclimate significantly affects the occurrence of kleptoparasites (Figures 4b and 5b), and the density of kleptoparasites (Min D Klepto) outweighed resource size (Figure 5b) in our RF ranking. These results demonstrate the importance of accounting for resource patch conditions and the density of neighboring competitors when modelling foraging group size, especially under the IFD theory. Our results empirically demonstrate that in addition to resource size, dispersal probability, degree of competition, and interspecies competition [37], it is important to model various aspects of the ambient community.
The ZERO and MEAN algorithms are commonly used to fill in missing values; however, our results indicate that ZERO and MEAN have large errors in terms of MAE and RMSE. The KNN, CART, and RF are the classification algorithms that demonstrated more accuracy for data imputation. The differences in the correlation coefficients between the RAW dataset and the imputed dataset using imputation algorithms established that KNN, CART, and RF outperformed ZERO and MEAN. Classification algorithms can train the model and detect samples that are similar to a sample with missing values. Therefore, a sample with the missing values can refer to similar samples to determine values to fill in missing values. KNN is widely discussed and applied to pattern recognition, however, when the sample size is not large enough, KNN is no longer optimal, especially when compared to the inherent dimensionality of the feature space. CART can handle highly skewed or multi-modal numerical data, as well as classification predictors with ordinal or non-ordinal structure. However, if the regression tree in CART does not show significantly different segments, i.e., homogeneous, CART cannot perform well. RF is a special type of simple regression tree set, which is based on the majority vote (in the case of classification) or average (in the case of ensemble) to predict each tree in the set using some input data. RF is based on regression trees and allows for non-linear links and instability of variable influence across different segments, and it does not require a detailed model description. RF is provided by considering the differences in the set of missing value determinants in a column. The prediction of the imputed dataset is in the same range as the prediction of the RAW, which can prevent the trend opposite due to excessive deviation. Consequently, RF can be effectively used in data imputation. Table 4 shows the time and space complexities of MEAN, ZERO, KNN, CART, and RF. Although the time and space complexities of RF is higher when compared to the rest of the classifiers, however, RF can implement within an acceptable time and obtain the superior data imputation than MEAN, ZERO, KNN, and CART.
Mixed-type data and outlier data are common in situations that rely on self-collection and recording. These issues can be caused by clerical mistakes or data categorization. Numerous methods based on integrated learning have been developed to deal with this issue [38]. The selection of an appropriate learner depends on the type of data, the size of the dataset, and the research topic. In this study, the random forest method outperformed all other imputation methods in dealing with continuous and discrete data as well as noise. This is hardly surprising considering the sensitivity of the arithmetic means to extreme values, which can lead to large errors in data imputation. Note that this approach also eliminates the need for data pre-processing when dealing with mixed-type and outlier data. The degree of variance is inversely proportional to the number of trees, which means that increasing the number of trees can reduce the likelihood of overfitting, albeit with an increase in computational complexity. The parameters of the random forest are easily adjusted, and no additional verification set is required. This is particularly beneficial in situations where datasets are difficult to obtain [39].

Conclusions
In this study, we proposed ML-based data imputation processes for imputing missing data. We analyzed the performance of MEAN, ZERO, KNN, CART, and RF methods in terms of performance of data imputation, time, and space complexities. After data imputation using the proposed ML-based data imputation processes, we determined that the RF method is superior to other imputation methods in dealing with continuous and discrete data, noise, and outlier data. Furthermore, the results of RF ranking were indicated to match those obtained using PLS-PM. Our results demonstrated that the size of the host web and the ambient environment are significant features of group size in natural populations of A. miniaceus in Taiwan and Vietnam. We argued that among group-living kleptoparasites, social interactions provide benefits that favor remaining in groups [39], which could be the reason underlying the deviation of our results from IFD theory. In the future, the proposed method could be applied to the ecological surveys of the endangered populations, or the surveys of the disease vector populations, where missing data might have a strong impact on the accuracy of the ecological inferences. This method, ECO-MIT, could also be improved when updated machine learning algorithms are available.
Supplementary Materials: The following are available online at https://www.mdpi.com/2227-7 390/9/4/415/s1, Figure S1. Pearson correlation between features pair from RF dataset. Figure  S2. The results of PLS-PM for (a) the occurrence and (b) group size of kleptoparasites from RAW dataset. Figure S3. The results of PLS-PM for (a) the occurrence and (b) group size of kleptoparasites from ZERO dataset. Figure S4. The results of PLS-PM for (a) the occurrence and (b) group size of kleptoparasites from MEAN dataset. Figure S5. The results of PLS-PM for (a) the occurrence and (b) group size of kleptoparasites from KNN dataset. Figure S6. The results of PLS-PM for (a) the occurrence and (b) group size of kleptoparasites from CART dataset. Table S1. All data matrices used in this article. Table S2. The data matrix of the Pearson correlation coefficient between each feature from the imputation datasets. Table S3. The data matrix of the difference between the correlation coefficient of feature pairs in the non-missing set and each data imputation method.