Identification of Soil Properties Associated with the Incidence of Banana Wilt Using Supervised Methods

Over the last few decades, a growing incidence of Banana Wilt (BW) has been detected in the banana-producing areas of the central zone of Venezuela. This disease is thought to be caused by a fungal–bacterial complex, coupled with the influence of specific soil properties. However, until now, there has been no consensus on the soil characteristics associated with a high incidence of BW. The objective of this study was to identify the soil properties potentially associated with BW incidence, using supervised methods. Soil samples from banana plant lots in Venezuela showing low (n = 29) and high (n = 49) incidence of BW were collected during two consecutive years (2016 and 2017). In those soils, sixteen soil variables were determined: the percentages of sand, silt and clay, pH, electrical conductivity, organic matter, and the available contents of K, Na, Mg, Ca, Mn, Fe, Zn, Cu, S and P. The Wilcoxon test identified significant differences in the soil variables between the two groups of BW incidence. In addition, Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) and the Random Forest (RF) algorithm were applied to find soil variables capable of distinguishing banana lots showing high or low BW incidence. The OPLS-DA model showed a proper fit of the data (R2Y: 0.61, p value < 0.01) and exhibited good predictive power (Q2: 0.50, p value < 0.01). The analysis of the Receiver Operating Characteristic (ROC) curves by RF revealed that the combination of Zn, Fe, Ca, K, Mn and Clay correctly differentiated 84.1% of the banana lots, with a sensitivity of 89.80% and a specificity of 72.40%. To our knowledge, this is the first study that identifies these six soil variables as possible new indicators associated with BW incidence in soils of lacustrine origin in Venezuela.


Introduction
Bananas (Musa spp.) represent an important crop for Venezuela's economy, which is predominantly based on oil. During the last 20 years, banana production has undergone slight reductions, reaching 650,051 tons in 2019, with a cultivated area of around 41,708 ha, partially due to the shortage of agricultural inputs (fertilizers and agrochemicals), problems of access to foreign currency to meet domestic demand, the inadequate management of agricultural policies and the impact of drought, pests and diseases [1].
part dies. Often one to four upper leaves remain green, but they are smaller and their development stops. New leaf growth can occur, but the bunches in this case are generally small, with short and thin bananas, which generates economic losses due to the rejection of the fruit in the market.
All of the lots evaluated (n = 78) in the study area had BW disease. The lots with a low incidence (<1.90%) of BW accounted for 37.18% (n = 29), while the lots with a high incidence (≥1.90%) represented 62.82% (n = 49) (Figure 2). The highest incidence values were found in lot 36 (8.47%), lot 32 (5.97%) and lot 34 (5.13%) in 2017 (Figure 2b), while in 2016 the maximum incidence values were registered in lots 38 and 45, with 5.57% and 5.03%, respectively. On the other hand, lots 12, 13 and 17 presented low incidence values that did not exceed 1.0% in either year of evaluation. For the entire dataset, the mean incidence was 2.17 ± 1.40% with a P50 of 1.90% (Figure 2a).
However, there were no significant differences in the BW incidence according to the date on which the different banana lots were established within the study area, according to the Kruskal-Wallis test (p value: 0.107).
Figure 3 shows the heat map of the soil data, classified into the high and low incidence groups. The heat map provides an intuitive visualization of the data used; each colored cell in the map corresponds to a concentration value in the data table, with the soil properties in the rows and the 78 banana lots in the columns. In general, the soils with a high incidence of BW had loam to silt loam textures, with a predominance of particles with an equivalent diameter between 2 and 50 µm. In these soils, the banana lots classified as having a high incidence of BW showed high values of Na, Fe and Mg, with slightly higher pH values (Figure 3). On the other hand, the characteristics of the parent material of these soils produced very high levels of Ca. The limitations for root development in these soils with a high incidence of BW could be associated with chemical conditions, such as a high CaCO3 content, with the Ca/Mg and Ca/K ratios being limiting (data not shown). The sodium levels were high in most of the lots with a high incidence of BW, which could generate toxicity problems for the plants and low structural stability in the soils. Likewise, low levels of Cu were observed in the lots with a low incidence of BW. The metabolic nature of these elements means that their deficiency can greatly affect the development of the crop. It is important to highlight that in some of the lots with a high incidence of BW, high levels of P were present in the surface layer, possibly due to overfertilization.

Description of Soil Properties in Experimental Lots
In the very loamy soils, with low permeability and limited drainage, and with a nutrient imbalance, BW disease was more frequent. Additionally, in the soils showing a high incidence of BW, the clay content was slightly higher, whereas the K and Zn contents were slightly lower. Moreover, a high incidence of BW occurred in those plant lots where the Ca content was higher and the soils were more saline at depth.

Wilcoxon Rank Test
For a direct comparison of the soil variables' levels, the Wilcoxon test was used to identify the variables that differed significantly between the groups with a low and a high incidence of BW. The analysis revealed a total of six significant soil variables (adjusted p value < 0.05) (Table 1): Zn, Ca, Fe, Clay, Mn and K. In our study, a small fraction of false positives was considered acceptable in exchange for a substantial increase in the total number of discoveries; therefore, controlling the false discovery rate (FDR) was appropriate and useful. The FDR is the expected proportion of features called significant that are actually null. The significant and most important soil variables responsible for the observed differentiation between the two BW incidence groups are shown in Figure 4.
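The idea of an FDR-adjusted p value can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg step-up procedure, which the text does not name explicitly; the function name `bh_adjust`, the variable names and the raw p values are hypothetical and are not the study's data.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p values (assumed FDR procedure)."""
    m = len(p_values)
    # Indices of the p values sorted from largest to smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, idx in enumerate(order):
        rank = m - rank_from_top  # 1-based rank in ascending order
        candidate = p_values[idx] * m / rank
        # Enforce monotonicity: an adjusted value never exceeds a larger one.
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values for four soil variables (illustration only).
raw = {"var_a": 0.001, "var_b": 0.02, "var_c": 0.03, "var_d": 0.60}
adj = dict(zip(raw, bh_adjust(list(raw.values()))))
significant = [name for name, p in adj.items() if p < 0.05]
```

In this toy case, three of the four hypothetical variables remain below the 0.05 threshold after adjustment.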

Identification of Important Soil Variables
The results of the descriptive analysis (Table 2) indicated important differences between the characteristics of the soils of the sampled banana lots. The variable importance in the projection (VIP) values were obtained from the OPLS-DA model. The VIP was used for selection, and those variables with a VIP > 1 were considered as possible candidate variables for the group discrimination (Table 2). Accordingly, the analysis revealed prominent values in three variables: K, Fe and Zn. On the other hand, as shown in Figure 5a, the OPLS-DA allowed us to analyze the information collected in the predictive component independently from the orthogonal components. That is, it allowed the separation of the variability responsible for the discrimination from the noise generated by the uncorrelated variability. For this reason, the OPLS-DA was the method chosen for the selection of the relevant variables in the discrimination of groups. In addition, based on absolute loading values > 0.2, the OPLS-DA identified six critical variables: Clay, Mn, K, Ca, Fe and Zn (Figure 5b). Moreover, the OPLS-DA model showed a proper fit of the data (R2Y = 0.61, p value < 0.01) and exhibited good predictive power (Q2 = 0.50, p value < 0.01) (Figure 5c).
Table 3 shows the measures of the importance of the soil variables selected by the RF model. The results establish the frequency with which an independent variable is selected with an importance greater than or equal to a defined threshold (0.5). The Mean Decrease Accuracy (MDA) shows the relative impact on the performance of the RF classifier of removing each specific soil variable.
Figure 6 shows the classification results after the RF analysis; the receiver operating characteristic (ROC) curve of the best-performing model indicated an area under the curve (AUC) of 0.91 (95% confidence interval, CI: 0.80 to 0.99) (Figure 6a). The scores plot (Figure 6b) shows the predicted class probabilities for all of the samples included in the analysis, indicating the correct classification of 44 banana lots out of 49 with a high incidence of BW, and 21 banana lots out of 29 with a low incidence. Our results showed the ability of the RF classifier to correctly differentiate the banana lots with a high or low BW incidence. Furthermore, our proposed system reached 89.80% sensitivity and 72.40% specificity in the test dataset, which implies that most of the banana lots with a high BW incidence were correctly classified, with a false negative (FN) rate of 5/49, and most of the banana lots with a low BW incidence were also correctly classified, with a false positive (FP) rate of 8/29 (Figure 6c).
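The sensitivity and specificity quoted above follow directly from the classification counts given in the text. The short sketch below recomputes them, treating "high incidence" as the positive class; the function name `classification_metrics` is a hypothetical helper, not part of the study's workflow.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Counts stated in the text: 44 of 49 high-incidence lots and
# 21 of 29 low-incidence lots were classified correctly.
sens, spec, acc = classification_metrics(tp=44, fn=5, tn=21, fp=8)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} accuracy={acc:.4f}")
```

These counts reproduce the reported sensitivity (44/49) and specificity (21/29). The same single table gives an overall accuracy of 65/78; a figure averaged over cross-validation splits can differ slightly from this value.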

Classifier Performance and Accuracy Assessment
Finally, the McNemar test was used to determine whether the observed and predicted proportions of banana lots with a high and low incidence of BW differed. The p value of the McNemar test (0.41) was greater than 0.05, so there is no evidence to reject the null hypothesis; we conclude that there are no significant differences in the proportions of banana lots with a high and low incidence of BW between the observed data and the RF predictions.
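As an illustration of how such a p value can arise, the sketch below applies the McNemar chi-square statistic without continuity correction to the two discordant counts from the RF results (5 high-incidence lots predicted as low, 8 low-incidence lots predicted as high). The choice of the uncorrected variant and the mapping of these counts to the test are assumptions; the text does not state which form was used.

```python
import math


def mcnemar_uncorrected(b, c):
    """McNemar chi-square test (1 df, no continuity correction).

    b and c are the two discordant counts of a paired 2x2 table.
    """
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Assumed discordant counts taken from the misclassified lots.
stat, p = mcnemar_uncorrected(b=5, c=8)
print(f"chi2={stat:.3f}, p={p:.2f}")
```

Under these assumptions the p value is about 0.41, consistent with the value reported above; a continuity-corrected or exact version would give a larger p value and the same conclusion.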

Discussion
Banana Wilt is a disease of unknown etiology that has not yet been properly studied. Indeed, the incidence of BW has only been assessed in a few countries, including Costa Rica, where a BW incidence of 7.3% was reported [16]; in Colombia, where an incidence of 0.31% was reported in some of the banana-producing areas with a prevalence of 4.30% [17]; and in Indonesia, where the average incidence of BW in 15 provinces was as high as 24% [18].
In the case of the banana areas located in the Aragua state of Venezuela, Martínez et al. [19], Ramírez et al. [20] and Rey et al. [8] reported incidences of BW ranging from 0.32% to 11.41% in different plant lots. These values are similar to those obtained in our study, where the vast majority of the foci showing an incidence of BW were centralized between lots 31 to 46 of the farm sampled and for both of the years evaluated. This could suggest that the spread of the disease may be linked to specific soil physical-chemical characteristics, combined in some degree with poor agronomic management (inappropriate fertilization) that generates a significant nutritional imbalance in the soil.
The identification of the symptoms associated with BW represented the first step in understanding and identifying the causes of the disease in the field and distinguishing the areas affected by the disease, to later perform a classification based on certain previously established statistical, economic and agronomic management parameters. In our study, we established two levels (low and high) for describing the incidence of BW, based on previous experience in banana field plots in Venezuela with similar soil types and agronomic practices (J. C. Rey, personal communication). The threshold incidence value of 1.90% was selected as the level that induces severe yield loss.
Previous studies indicated that soil factors, specifically physical and chemical properties, are closely associated with the occurrence of BW in bananas [7,8,14,15]. In the present study, using an RF model, we identified differences in six soil variables (i.e., Zn, Fe, Ca, K, Mn and Clay) between the zones with different levels of BW incidence. The K contents were highest (5.6-984.0 mg kg⁻¹) in the group of lots with a low incidence of BW. However, the Ca contents were excessively high in both groups, with the concentrations being more notable in the lots with a high incidence of BW (6472-16,648 mg kg⁻¹), due to the lacustrine origin of the soils, which can generate K and Mg deficiencies in the plants. In relation to the microelements, Fe (0.06-78.40 mg kg⁻¹) and Mn (0.8-58.4 mg kg⁻¹) were present at high levels in the group of lots with a high incidence of BW, while Zn was at low levels (0.3-30.4 mg kg⁻¹) (Table 3). These high Fe and Mn contents could be associated with a higher clay content, which can generate drainage problems. Under these conditions of excess humidity, the solubility of Fe²⁺ and Mn²⁺ increases [21].
Regarding Zn, in the Canary Islands, the authors of [22] demonstrated that the application of Zn to the soil notably reduced the incidence and severity of BW because this type of soil shows a Zn deficiency. Therefore, in our study, conducted in the soils of Aragua, Venezuela, the low levels of this element in the plant lots with a high incidence of BW may have favored the appearance of the BW symptoms. According to Domínguez et al. [4], the banana soils in the Canary Islands that presented severe BW problems showed a tendency toward the formation of stable clay aggregates, which, with excess irrigation, favored anaerobiosis in the soil and high concentrations of Fe, and caused compaction when the soil became dry. These relationships of the clay content (1-40%) with water, together with the detrimental effect of compaction in banana soils, result in a decrease in productivity and plant height, and a reduction in the number of offspring plants in the banana production unit. Additionally, according to the results of Dorel [23] and Sabadell [14], the most significant effect would be related to the reduction in the absorption of N, P, K and Ca and the massive absorption of Mn.
The results of our analysis established that the heavy texture in the lots with a high incidence of BW favored the appearance of symptoms, agreeing with the other studies that found that this disease developed in the presence of soils with a heavy texture [24] and poor drainage [25], in conditions of high humidity, favoring infection by deleterious microorganisms in the lateral rootlets.
The study by Rey et al. [8] established that the variables that showed the highest significant correlation with the incidence of BW were the sand and silt content, organic carbon, exchangeable Mg content and the Ca/Mg ratio. The authors found a positive correlation between BW incidence and the silt content and the Ca and Mg levels in the banana soils of Aragua, indicating that in very silty soils with low permeability and limited drainage, a high incidence of BW was more frequent. Likewise, they found that the C/N ratio, the K content, the nutritional relationships between the exchangeable cations (Ca, Mg and K) and the Zn content were the variables that had the greatest importance in the differentiation between the field areas, coinciding with the results of this study.
Our results also showed that the incidence of the disease was not uniform throughout the farm; the most affected areas had very silty soils with drainage problems, certain nutrient deficiencies and nutritional imbalances, related to the natural condition of the lacustrine soils and, surely, the lack of appropriate fertilization cycles in recent years [8].
In recent times, modern approaches, such as machine learning and deep learning algorithms, have been employed to identify the characteristics of banana agroecosystems that could be affecting productivity and the appearance of diseases in the field. Several investigations have been carried out in the field of machine learning for the detection and diagnosis of banana diseases, using RF [11,12,26-28], artificial neural networks [11], support vector machines (SVM) [10,11,29,30] and decision trees [26], among others. This study aimed to use an RF model analysis strategy to determine the soil variables that could favor the development of BW disease, with the final aim of helping to avoid using those soils or promoting the application of the appropriate corrective fertilization treatments.
In those studies, reported above, the machine learning analysis approaches were used to detect Fusarium wilt and Black Sigatoka diseases using aerial images, but none of them used in situ soil data to predict the occurrence of a banana disease, as is the case in our study. This evidences the existence of an information gap regarding the application of these novel algorithm-based techniques, using data from the sampled soils. Our study is a pioneer in showing results from the application of supervised methods, such as OPLS-DA and the RF algorithm, to identify the soil variables associated with BW incidence. According to our results, it is reported for the first time that soil variables, such as Zn, Fe, Ca, K, Mn and Clay content, could be promising new soil indicators to classify the lots of bananas prone to show a higher incidence of BW disease in the lacustrine soils in Venezuela.
The RF classifier achieved a significant advantage over the classifiers used in previous works [11,12,28]. The characteristics of the RF classifier, and the way in which the most important soil variables are selected through the OPLS-DA, determine the performance of the RF classifier. However, the precision of classifying the banana lots with different levels of BW incidence can be affected by many different factors, such as the quality and representativeness of the information obtained, the performance of the characteristic extraction algorithm, and the subsets used for training and testing purposes, as established by the studies of [11,12]. The results of our study showed that RF performed well in differentiating the banana lots with a high or low BW incidence. More interestingly, our model provides an easy, fast and inexpensive method to accurately identify the risk of incidence of BW in bananas.
Nevertheless, we are aware that it is not only the soil properties that may be directly related to the plants that develop BW, since it is a disease caused by a fungal-bacterial complex. Consequently, it is logical to think that the climatic variables of the site, other than the physical and biological soil properties, and the physiological and agronomic management of the plantation, among other factors, could also have an important influence on the manifestation of the disease. However, all of those factors were not the object of this study; so, it would be necessary to establish additional methods of analysis that would allow for the analysis of the complexity of this type of disease, to obtain findings that do not depend on a single method of analysis and to explore other potential factors that may influence the development of BW.

Study Area
The study was carried out in a banana plantation located in the Aragua state, with 205 ha planted with Cavendish cv. Pineo Gigante (67.58° W, 10.14° N; Figure 7). These plants had at the time of sampling: (i) a leaf number from 16 to 18; (ii) height values ranging from 3.5 to 4.5 m; and (iii) a growth period from 9 to 10 months. This region is characterized by a Tropical Savanna climate (Aw). The annual mean rainfall is 980 mm [31] and shows a marked seasonal pattern, with a wet season from May to October. The mean annual temperature is 26.2 °C, whereas the mean annual relative humidity is 70.0% [32]. The terrain relief is mostly flat (slope ranging 0-2%). The predominant types of soil are Mollisol and Entisol, which are mostly of lacustrine origin, with medium textures, high nutrient availability, moderate to good drainage, soil pH varying from neutral to alkaline, good fertility and high soil organic matter content [33,34].

Soil Sampling
A systematic soil sampling was carried out in 39 banana lots sampled during January 2016 and 2017 (total banana lots sampled, n = 78) (Figure 7). These lots were established at different periods at the time of disease monitoring (<6 years, 6 to 12 years, and >12 years) [8]. The sampling was conducted following the guidelines of Lozano et al. [35], with an approximate distance of 150 m between the sampling sites. The composite soil samples were obtained in each of the banana lots, in the first horizon at a depth of 0 to 20.0 ± 5.0 cm. The samples were subjected to soil analysis for fertility characterization purposes; in total, 16 soil variables were determined including: percentage of sand, silt and clay [36]; soil reaction (pH); electrical conductivity (EC, dS m⁻¹) in suspension 1:2 (soil:water) [37]; organic matter (OM, %) [38]; available contents of potassium (K, mg kg⁻¹); sodium (Na, mg kg⁻¹); magnesium (Mg, mg kg⁻¹); calcium (Ca, mg kg⁻¹); manganese (Mn, mg kg⁻¹); iron

Banana Wilt Incidence
Before the beginning of the study, the plants with typical symptoms of BW disease were located and identified in all of the lots of the farm, and tissue samples were taken from their pseudostems and roots for the identification of the pathogenic microorganisms. Isolation was performed on PDA culture medium and in a humid chamber, in the laboratory of the Faculty of Agronomy of the Central University of Venezuela.
For the identification of the BW incidence in the field, each banana plant in each banana lot was individually inspected on a monthly basis for the presence of symptoms compatible with BW. The banana plants showing BW symptoms were eliminated in each lot and each evaluation period. Therefore, in the next monthly inspection, only the plants with new BW symptoms to that date were counted. The cumulative incidence of BW was determined in each of the 78 banana lots sampled during 2016 and 2017, using the guidelines by Bosman [40]. The main aim of the continuous monitoring of BW incidence was to determine the new cases of BW that occurred in the total population of plants in each banana lot at a given plot and sampling time and for all of the physiological plant stages growing simultaneously. The fruit was harvested throughout the year (a staggered harvest), so that within the same lot the plants can be in different phenological phases (vegetative, floral and fruiting); for this reason, the annual accumulated incidence was used, to prevent the incidence of BW from being confounded with plant age. Within a banana lot, a plant grows for a maximum of 11 to 12 months, after which the fruit is harvested and the mother plant removed. Hence, the cumulative incidence rate is calculated as the sum of the monthly BW incidence values, in percentage, of all of the plants at different phenological stages for each banana lot in a particular year, according to Equation (1).
In the scientific literature, there is no information describing threshold values to establish the categories of BW incidence for the study area, nor for any other banana areas of Venezuela. Percentiles have been established in agronomy as an important alternative for disease incidence indicators in bananas [41,42].
In this sense, the 50th percentile (P50), or median, that is, the value below which half of the observations fall, was selected. In this study, the P50 (and thus also the percentile rank classes) offers an alternative to mean-based ratios for defining the disease incidence classes. This measure of statistical position was chosen because it is less influenced by the extreme values of the distribution than the mean, and because it does not depend on the choice of a specific probability density function, whereas the arithmetic mean presumes normally distributed data [43].
The two percentile-rank classes are defined as follows: low incidence of BW, < 1.90% (incidence values of BW below the P50); and high incidence of BW, ≥ 1.90% (incidence values of BW at or above the P50). This high incidence value would represent a yield decrease of up to 13,300 kg ha⁻¹ year⁻¹ in those banana lots showing an incidence of BW of 1.90%, and was selected based on the information provided by J.C. Rey (personal communication, 28 September 2019) and several years of experience observing yield losses associated with BW.
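The two steps described in this section, summing monthly incidences into an annual cumulative value and splitting lots at the median, can be sketched as follows. This is a minimal illustration under one stated assumption: the monthly incidence is taken as the number of newly symptomatic plants divided by the total number of plants in the lot, which the text does not spell out. The function names and all numbers are hypothetical.

```python
from statistics import median


def cumulative_incidence(monthly_new_cases, total_plants):
    """Annual cumulative incidence (%) as the sum of monthly incidences.

    Assumption: monthly incidence = new symptomatic plants / total plants.
    """
    return sum(100.0 * cases / total_plants for cases in monthly_new_cases)


def classify_by_median(incidences):
    """Label each lot 'low' (< P50) or 'high' (>= P50)."""
    p50 = median(incidences)
    labels = ["high" if value >= p50 else "low" for value in incidences]
    return p50, labels


# Hypothetical lot: 1000 plants, new symptomatic plants counted each month.
lot_incidence = cumulative_incidence([2, 0, 1, 3, 0, 0, 4, 1, 0, 2, 3, 3], 1000)

# Hypothetical cumulative incidences (%) for five lots.
p50, labels = classify_by_median([0.8, 1.2, 1.9, 3.4, 5.6])
```

Because values equal to the median are assigned to the high class, the two groups need not be the same size, as with the 29 low and 49 high lots in this study.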

Data Analysis
Before the data analysis, we checked the data integrity. The normalization of the soil variables was carried out using the statistical package in R software version 4.0.2 (R Core Team, Austria) [44] based on the geometric mean, and a generalized logarithmic transformation using the "glog" function in R was performed to make the variables comparable among themselves, given the differences in their units of measurement [45,46]. Figure 8 shows the general scheme of the data analysis procedures followed in this work.
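For readers unfamiliar with the generalized logarithm, the sketch below shows one common definition, log2((x + sqrt(x² + a²)) / 2). It is an assumption that this form matches the "glog" function used here; the tuning constant `a` and the example values are hypothetical. Unlike the ordinary logarithm, this transformation is defined at zero.

```python
import math


def glog(x, a=1.0):
    """Generalized log: log2((x + sqrt(x^2 + a^2)) / 2).

    'a' is an assumed tuning constant; with a = 0 this equals log2(x).
    """
    return math.log2((x + math.sqrt(x * x + a * a)) / 2.0)


# Hypothetical concentrations spanning several orders of magnitude.
values = [0.0, 0.3, 30.4, 16648.0]
transformed = [glog(v) for v in values]
```

The transformation compresses large values while preserving their order, which makes variables measured on very different scales easier to compare.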

Identification of Important Soil Variables
For the identification of the relevant soil variables characterizing the incidence of BW, a Wilcoxon rank sum test was performed to find the most important soil variables at a threshold p value < 0.05 [45], showing the differences between the groups of banana lots with a low and a high incidence of BW. Next, an Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) was used to reduce the number of soil variables in the high-dimensional data, to produce a robust and easy-to-interpret model, and to identify the main soil characteristics that drive the separation of the plant lots based on BW incidence (low or high). This multivariate statistical analysis was carried out using the "ropls" R package [47].
The variable importance in projection (VIP) > 1 and the corresponding |loading values| > 0.2 in the model were used to identify the variables responsible for distinguishing the two BW categories [48]. Furthermore, a permutation test with 100 permutations was employed to validate the performance of the OPLS-DA model. As quality criteria for the OPLS-DA model, we required R2Y (goodness-of-fit parameter) and Q2 (predictive ability parameter) values > 0.5 [49].

Classifier Performance and Accuracy Assessment
The random forest (RF) algorithm was used as a machine-learning approach for classifying the lots with a high and low incidence of BW [50]. The RF models allow for the prediction of unknown samples (i.e., a test dataset) after training on a known dataset (i.e., a training dataset). The receiver operating characteristic (ROC) curves were generated by Monte Carlo cross validation (MCCV) [51], that is, a cross validation approach which creates multiple random splits of the dataset into training and validation data. In each MCCV, two thirds of the samples were used to evaluate the feature importance, and the remaining third was used to validate the model created in the first step [52,53].
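The repeated random splitting that defines MCCV can be sketched in a few lines. This is an illustration only: the function name `mccv_splits`, the seed and the number of splits are hypothetical choices, and the sketch covers the splitting step alone, not the RF training performed in this work.

```python
import random


def mccv_splits(n_samples, n_splits, train_fraction=2 / 3, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random splits."""
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    indices = list(range(n_samples))
    for _ in range(n_splits):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # First two thirds for training, remaining third for validation.
        yield sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


# 78 lots as in the study; the number of splits (5) is an arbitrary choice.
splits = list(mccv_splits(n_samples=78, n_splits=5))
train_idx, test_idx = splits[0]
```

Unlike k-fold cross validation, each split is drawn independently, so a given lot may appear in the validation set of several splits or of none.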
To determine the predictive performance of the model, ROC curves were used. A ROC curve plots the sensitivity (the ratio between the number of positives, P, correctly classified and the total P observed) against "1 − specificity" (the specificity is the ratio between the number of negatives, N, correctly classified and the total N observed). A model has a high predictive performance if a high sensitivity is obtained at low values of "1 − specificity", that is, if it correctly classifies P with a low number of false positives. This yields a curve closer to the upper left corner [54]. The area under the ROC curve (AUC) quantifies this relationship, so that a model is considered acceptable if the AUC ≥ 0.7, excellent if the AUC ≥ 0.8 and outstanding if the AUC ≥ 0.9.
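The AUC has a convenient interpretation: it equals the probability that a randomly chosen positive lot receives a higher predicted score than a randomly chosen negative lot. The sketch below computes it that way and applies the qualitative thresholds quoted above; the function names and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    labels: 1 = positive (high incidence), 0 = negative; ties count 0.5.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def auc_grade(auc):
    """Qualitative grade using the thresholds quoted in the text."""
    if auc >= 0.9:
        return "outstanding"
    if auc >= 0.8:
        return "excellent"
    if auc >= 0.7:
        return "acceptable"
    return "below acceptable"


# Hypothetical predicted probabilities for six lots.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(y_true, y_score)
```

In this toy example, eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9 and falls in the "excellent" band.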

Conclusions
This study focused on an analysis of the key soil properties that play an important role in the incidence of BW. So far, crop-disease detection models have primarily focused on leaf symptoms through image recognition technology. This means that the diseases can be detected only after they have appeared. In the present study, by using a random forest analysis approach, we identified that the risk of low or high incidence of BW in a banana farm in Venezuela could be associated with differences in six key soil variables: Zn, Fe, K, Ca, Mn and Clay content. The findings may contribute to increasing our understanding of the basic mechanisms and progression of BW incidence, and indicate that these soil variables are potentially the determining factors of a risk of high BW incidence in the tropical lacustrine soils of Venezuela.
Although the Random Forest analysis performed well in this particular study, its performance in other banana areas in Venezuela has not yet been proven. Nevertheless, we consider that this machine learning algorithm, using the soil properties as indicators, has the potential to be further explored as a simple and effective tool in banana areas with the risk of developing BW.
Our results open the field for further research in which we could quantitatively predict the risk of BW in banana fields based on available, or relatively easy to gather, information, which in turn could allow farm managers to implement preventive measures to minimize BW risk and target other techniques (e.g., plant sampling, withdrawal of infested material) on the areas where there is maximum risk.
In the future, new research can be improved through the systematic use of new locations to obtain a much larger database of BW-affected plants, and also to take into consideration various environmental, physiological and agronomic variables, among others, and apply new and different statistical analyses that may help to identify other factors potentially associated with BW development.