Humboldtian Diagnosis of Peach Tree (Prunus persica) Nutrition Using Machine-Learning and Compositional Methods

Regional nutrient ranges are commonly used to diagnose plant nutrient status. In contrast, local diagnosis compares unhealthy with healthy compositional entities in comparable surroundings. Robust local diagnosis requires well-documented data sets processed by machine learning and compositional methods. Our objective was to customize nutrient diagnosis of peach (Prunus persica) trees at local scale. We collected 472 observations from commercial orchards and fertilizer trials across eleven cultivars of Prunus persica and six rootstocks in the state of Rio Grande do Sul (RS), Brazil. The random forest classification model returned an area under curve exceeding 0.80 and classification accuracy of 80% about a yield cutoff of 16 Mg ha⁻¹. Centered log ratios (clr) of foliar defective compositions have appropriate geometry to compute Euclidean distances from the closest successful compositions in “enchanting islands”. Successful specimens closest to defective specimens, as shown by Euclidean distance, allow reaching trustful fruit yields using site-specific corrective measures. Comparing tissue composition of low-yielding orchards to that of the closest successful neighbors in two major Brazilian peach-producing regions, regional diagnosis differed from local diagnosis, indicating that regional standards may fail to fit local conditions. Local diagnosis requires well-documented Humboldtian data sets that can be acquired through ethical collaboration between researchers and stakeholders.


Introduction
In 2017, peaches and nectarines were produced on 1.5 × 10⁶ ha worldwide [1]. Mainland China accounted for 51.0% of total area, followed by Spain (5.5%) and Italy (4.4%). Brazil ranked 13th in area with 17,116 ha, and 21st in total production. The states of Rio Grande do Sul (RS), Santa Catarina and Paraná accounted for 72% of Brazilian production [2]. Average Brazilian yield was half that of the USA and Europe, and this was attributed in part to regional nutrient guidelines based on a limited number of fertilizer experiments that may not fit local conditions.
The performance of Brazilian peach orchards could be improved by tackling local yield-limiting factors. Because the plant explores the soil in deeper layers than the arable layer sampled for soil testing [3,4], tissue tests are generally more closely related to crop performance than soil tests [5]. Indeed, the plant integrates site-specific growth-impacting genetic, managerial, and environmental factors [6]. Yield, fruit quality and tissue nutrient composition depend on cultivar, rootstock, phenological stage, yield, pedoclimatic conditions and crop management [7,8]. To address several factors simultaneously, Humboldtian system-based data sets integrate records at large scale and as much quantitative data as possible instead of focusing on data collected in isolated studies [9].
Well-documented data sets relating features to crop performance can be processed by machine learning methods of artificial intelligence [10]. On the other hand, compositional features are intrinsically multivariate, making it necessary to address information redundancy and the closure problem of compositions using log ratio transformation methods [11]. Acknowledging the multivariate nature of tissue compositions, Lagatu et al. [12] drew a diagnostic yield contour map within an interactive N × P × K ternary diagram. Holland [13] proposed using multivariate data analysis to diagnose tissue nutrients holistically rather than separately, but neither demonstrated it explicitly nor addressed nutrient interactions. Assuming data additivity and function reflectivity, Beaufils [14] suggested adding up standardized nutrient ratios into nutrient indices to conduct regional nutrient diagnosis, ignoring the large range of mathematically robust multivariate statistical analysis methods and biasing nutrient standards by non-normal data distribution patterns and false positive specimens.
Aitchison [11] and Egozcue et al. [15] developed the concepts of compositional log ratios amenable to multivariate analyses. Log ratios are additive in the n-dimensional Euclidean space and address compositions as unique combinations of parts or entities rather than parts taken in isolation [16]. Using log ratio techniques, multivariate distances can be computed as Euclidean or Mahalanobis distances between defective and successful compositional entities for diagnostic purposes [17,18].
Machine learning and compositional data analysis methods provide unprecedented tools to conduct nutrient diagnosis of tissue compositional entities at local scale if supported by large data sets. We hypothesized that (1) machine learning methods return accurate classification relating yield to combinations of features that influence the performance of peach orchards, and (2) regional diagnosis using state standards differs from local diagnosis that compares the tissue compositional entities of defective and successful specimens. The objective of this paper was to customize nutrient diagnosis of peach orchards using site-specific information.
The tree training method was open vase. Plantation density ranged between 770 and 1358 trees ha⁻¹. Orchards were managed following national standards for integrated peach orchard management [19,20]. One of the most sensitive features controlling peach yield is the winter chill hour requirement for bud break [21], measured in Brazil as the number of hours where temperature is lower than 7.2 °C during the winter period. Meteorological data were obtained from regional weather stations ([22] in Bento Gonçalves, Farroupilha, Flores da Cunha and Caxias do Sul, [23] in Pelotas and [24] in Eldorado do Sul). Orchards were not irrigated. Soils were Typic Hapludalf and Udorthent [25]. Soil between rows was covered by vegetation year-round.
Fertilization followed the Brazilian guidelines for mature (≥4 years old) peach orchards [26] based on tissue tests (0-110 kg N ha⁻¹, 0-52 kg P ha⁻¹ and 0-83 kg K ha⁻¹). Hence, the effects of fertilizer dosage and tissue tests were confounded. Nitrogen was applied on three occasions: 50% at bud break, 25% at fruit thinning, and 25% after harvest (except in years of low production or excessive vigor). Phosphorus was applied once, together with the first application of nitrogen. Potassium was applied once, except on coarse-textured soils where K was split-applied. The crop was harvested yearly from November to February. Yield was measured on three central trees in experimental areas.

Soil and Tissue Analyses
Soil K, P, Cu, Zn and Mn were extracted using the Mehlich-1 method [27]. Ca, Mg and Na were extracted using 1 M KCl. Fe was extracted using DTPA. Exchangeable acidity was measured using the Shoemaker-McLean-Pratt (SMP) method. Organic matter content was determined by oxidation in a sulfo-chromic solution. Soil pH was measured in water. Clay content was determined by sedimentation.
Diagnostic leaves were collected in June from the middle tier of annual growth, dried at approximately 65 °C, and ground to pass through a 1 mm sieve. A subsample was digested using sulfuric acid and quantified for N by micro-Kjeldahl [27]. Another subsample was digested in a mixture of nitric and perchloric acids and quantified by ICP-OES for S, P, K, Ca, Mg, Zn, Cu, Mn, Fe and B concentrations. Fruit quality was measured as fruit weight, dimension (average length and width), firmness, Brix index and acidity using 30 fruits per experimental unit [28].

Isometric Log-Ratio Transformation
Isometric log ratios (ilr) are orthogonal arrangements of D components into D − 1 balances, the exact number of degrees of freedom available in a D-part compositional entity [29]. Isometric log ratios were computed as follows [30]:

$$ilr = \sqrt{\frac{r s}{r + s}} \, \ln\left(\frac{G_N}{G_D}\right) \qquad (1)$$

where r and s are the numbers of components at numerator and denominator, respectively, and G_N and G_D are the geometric means of components at numerator and denominator, respectively. Components were arranged as meaningful balances in a sequential binary partition or SBP (Table 1). We first contrasted nutrients against the filling value computed by difference between measurement unit and the sum of quantified components. While N, K, Mg, P, S, Cl and Na are phloem-mobile, Fe, Zn, Cu, B and Mo have intermediate mobility, and Ca and Mn are relatively immobile [31]. Concentrations of Cu, Zn and Mn may vary widely due to fungicide applications [32]. Orthonormal balances allowed computing Mahalanobis distance as follows:

$$\mathcal{M}^2 = (ilr_i - ilr_i^*)^T \, COV^{-1} \, (ilr_i - ilr_i^*) \qquad (2)$$

where ilr_i and ilr_i* are orthonormal balances for the specimen under diagnosis and reference balances, respectively, COV is the covariance matrix, and T indicates that the ilr vector is transposed. M² is distributed like a χ² variable.
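As an illustration of Equations (1) and (2), the following minimal Python sketch computes one balance and a squared Mahalanobis distance. The function names, the nutrient concentrations, the chosen balance [N, P, K | Ca, Mg], and the covariance matrix are hypothetical and are not taken from Table 1 or Table 7. The linear system is solved by plain Gaussian elimination so that the sketch needs the standard library only.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def ilr_balance(numerator, denominator):
    """One isometric log-ratio balance: sqrt(rs/(r+s)) * ln(G_N / G_D)."""
    r, s = len(numerator), len(denominator)
    ratio = geometric_mean(numerator) / geometric_mean(denominator)
    return math.sqrt(r * s / (r + s)) * math.log(ratio)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def mahalanobis_sq(ilr, ilr_ref, cov):
    """Squared Mahalanobis distance (ilr - ilr*)^T COV^-1 (ilr - ilr*)."""
    diff = [a - b for a, b in zip(ilr, ilr_ref)]
    weighted = solve(cov, diff)  # equals COV^-1 (ilr - ilr*)
    return sum(d * w for d, w in zip(diff, weighted))


# Hypothetical leaf concentrations (g/kg), for illustration only.
leaf = {"N": 30.0, "P": 2.5, "K": 20.0, "Ca": 18.0, "Mg": 5.0}
balance = ilr_balance([leaf["N"], leaf["P"], leaf["K"]],
                      [leaf["Ca"], leaf["Mg"]])
print(f"[N, P, K | Ca, Mg] balance = {balance:.3f}")

# Hypothetical two-balance specimen, reference and covariance matrix.
m2 = mahalanobis_sq([0.40, -0.10], [0.25, 0.05], [[0.020, 0.005],
                                                  [0.005, 0.010]])
print(f"Squared Mahalanobis distance = {m2:.2f}")
```

With an identity covariance matrix, `mahalanobis_sq` reduces to the squared Euclidean distance between the two balance vectors.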

Centered Log-Ratio Transformation
The centered log ratio (clr) integrates all pairwise log ratios in a composition [11], as follows for N:

$$clr_N = \ln\left(\frac{N}{G}\right) \qquad (3)$$

where G is the geometric mean across components including Fv, and Fv is the filling value computed by difference between measurement unit and the sum of quantified nutrients. The clr transformation has Euclidean geometry. The Euclidean distance between two D-part compositions of equal length is computed at local scale as follows:

$$\text{Euclidean distance} = \sqrt{(clr_i - clr_i^*)^T \, I^{-1} \, (clr_i - clr_i^*)} \qquad (4)$$

where clr_i is the clr transformation of the diagnosed composition, I is the identity matrix and clr_i* is the clr transformation for reference local compositions. Nutrients are classified in the order of their limitation to yield along the clr_i − clr_i* gradient and illustrated in histograms. Nutrient diagnosis can be conducted as Mahalanobis distance at regional scale as follows, assuming independence among clr variables [33]:

$$\mathcal{M}^2 = (clr_i - clr_i^*)^T \, VAR^{-1} \, (clr_i - clr_i^*) \qquad (5)$$

where clr_i and clr_i* refer to diagnosed and reference compositions, and VAR is the variance matrix excluding the clr value for the filling value to avoid generating a singular matrix. Hence, the reference compositions, the weighting of clr differences (unweighted in Equation (4), variance-weighted in Equation (5)) and the underlying assumptions differed between local and regional diagnoses.
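The clr-based local and regional computations can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names, the macronutrient concentrations (g/kg, so that the measurement unit is 1000), and the decision to leave Fv out of the nutrient ranking are hypothetical choices and are not taken from the study's data.

```python
import math


def with_filling_value(nutrients, unit=1000.0):
    """Close a composition by adding Fv = unit - sum of quantified nutrients."""
    closed = dict(nutrients)
    closed["Fv"] = unit - sum(nutrients.values())
    return closed


def clr(composition):
    """Centered log ratios ln(x / G), G being the geometric mean of all parts."""
    logs = {part: math.log(value) for part, value in composition.items()}
    mean_log = sum(logs.values()) / len(logs)  # ln(G)
    return {part: value - mean_log for part, value in logs.items()}


def euclidean_distance(clr_a, clr_b):
    """Local-scale distance: identity-weighted clr differences."""
    return math.sqrt(sum((clr_a[p] - clr_b[p]) ** 2 for p in clr_a))


def rank_nutrients(clr_defective, clr_successful):
    """Sort nutrients from largest relative shortage to largest relative excess."""
    diffs = {p: clr_defective[p] - clr_successful[p]
             for p in clr_defective if p != "Fv"}
    return sorted(diffs.items(), key=lambda item: item[1])


def regional_mahalanobis_sq(clr_i, clr_ref, variances):
    """Regional-scale distance with a diagonal variance matrix (Fv excluded)."""
    return sum((clr_i[p] - clr_ref[p]) ** 2 / variances[p] for p in variances)


# Hypothetical defective and successful leaf compositions (g/kg).
defective = with_filling_value(
    {"N": 28.0, "P": 2.2, "K": 14.0, "Ca": 19.0, "Mg": 6.0})
successful = with_filling_value(
    {"N": 31.0, "P": 2.3, "K": 21.0, "Ca": 18.0, "Mg": 4.5})

clr_def, clr_suc = clr(defective), clr(successful)
distance = euclidean_distance(clr_def, clr_suc)
ranking = rank_nutrients(clr_def, clr_suc)

print(f"Euclidean distance = {distance:.3f}")
for nutrient, diff in ranking:
    status = "relative shortage" if diff < 0 else "relative excess"
    print(f"{nutrient}: {diff:+.3f} ({status})")
```

In this made-up example K shows the largest relative shortage and Mg the largest relative excess, which is the kind of ordering the histograms are meant to convey.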
Machine learning (ML) analysis was run using the freeware Orange 3.24. Fruit yield categories were separated at a cutoff yield of 16 Mg ha⁻¹, the world average in 2017, which is above the 14.5 Mg ha⁻¹ average in Brazil [1]. Exploratory analysis was conducted using the classification tree algorithm and the tree viewer. The survey data set was split into training (70%) and testing (30%) sets to test precision, and was also cross-validated across the whole data set to select a subset of balanced specimens. Precision metrics were accuracy (proportion of instances predicted as true negative or true positive) and area under curve (AUC) [17]. We expected AUC of 70-90% [34]. In exploratory analysis, random forest (RF), support vector machine, neural network, AdaBoost, and stochastic gradient descent models returned similar accuracies in cross-validation (data not shown). We selected RF to deal with over-fitting of partition trees, although RF may be affected by data transformation [35]. The significance of the partition in the confusion matrix for the testing data set was assessed as a χ² homogeneity test with Yates' correction. Classification prediction and risk analysis for independent specimens were provided by the prediction module of Orange 3.24 using the same features as in the training set. Descriptive statistics were computed using Microsoft Excel (Microsoft 365).
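The two precision metrics can be illustrated with a short Python sketch. It does not reproduce the Orange workflow: the yields and classifier scores below are hypothetical, the function names are illustrative, and AUC is computed by its rank interpretation (the probability that a randomly chosen low-yield specimen receives a higher score than a randomly chosen high-yield specimen). Following the paper's convention, low-yielding specimens are treated as the positive class.

```python
def label_low_yield(yield_mg_ha, cutoff=16.0):
    """Positive class (1) = low yield, below the 16 Mg/ha cutoff."""
    return 1 if yield_mg_ha < cutoff else 0


def accuracy(labels, predictions):
    """Proportion of specimens classified as true negative or true positive."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical yields (Mg/ha) and predicted probabilities of low yield.
yields = [8.9, 12.0, 15.5, 29.5, 30.4, 10.2, 22.0, 18.3]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.7, 0.2, 0.5]

labels = [label_low_yield(y) for y in yields]
predictions = [1 if s >= 0.5 else 0 for s in scores]

print(f"accuracy = {accuracy(labels, predictions):.3f}")
print(f"AUC = {auc(labels, scores):.3f}")
```

Accuracy depends on the 0.5 probability threshold chosen here, whereas AUC summarizes performance across all thresholds, which is one reason to report both.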

Features
The meteorological indices and soil and tissue tests used to run machine learning models are presented in Tables 2-4. Properties varied widely, allowing fruit yields to be modeled across a large range of features. Soil pH varied from 5.0 to 5.9 with a median value of 5.3. High soil P, Cu and Zn contents are due in part to the application of organic residues of diverse origins. Foliar nutrient composition is presented by cultivar in Table 4. Exploratory analysis using the classification tree algorithm indicated that the number of chilling hours, the cultivar and tissue K were driving variables at high yield level (data not shown), indicating genetic-environment-management interactions at local scale.

Model Precision
The AUC of the RF model varied between 0.834 and 0.844 in test and between 0.894 and 0.901 in cross-validation, in the range of 0.7-0.9 considered acceptable by Delacour et al. [34] for diagnostic purposes (Table 5). Classification accuracy was close to 80%, as reached for most fruit crops [36]. There was no apparent advantage in using log-ratio transformations before processing compositional data with RF. At the model-building step, raw compositions were thus preferable because they did not require the full-length compositions needed to log-ratio transform the data, hence avoiding the need to impute data, replace values lower than detection limits or remove observations.
The confusion matrix showed 142 true negative (high-yield, well-balanced) specimens producing more than 16 Mg ha⁻¹, providing a diversity of factor combinations leading to high performance of peach orchards. There were 254 true positive (low-yield, imbalanced), 39 false negative (low-yield, well-balanced, indicating yield-limiting factors other than nutrients) and 37 false positive (high-yield, imbalanced due to luxury consumption or contamination) specimens. The partition was significant at p = 0.01 according to the χ² homogeneity test with Yates' correction.
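The arithmetic behind a confusion matrix can be checked with a few lines of Python. The sketch below uses the four counts quoted in the text and the standard closed-form χ² statistic with Yates' continuity correction for a 2 × 2 table. It is illustrative only: the variable names are ours, and the result is not claimed to reproduce the statistic obtained in the study, which was assessed on the testing data set.

```python
# Counts quoted in the text.
tn, tp, fn, fp = 142, 254, 39, 37

total = tn + tp + fn + fp
accuracy = (tn + tp) / total
balanced = tn + fn  # specimens classified as nutritionally balanced

# 2 x 2 table: rows = observed yield class, columns = predicted class.
#                predicted balanced   predicted imbalanced
# high yield            a = TN               b = FP
# low yield             c = FN               d = TP
a, b, c, d = tn, fp, fn, tp

# Chi-square homogeneity test with Yates' continuity correction (1 df).
numerator = total * (abs(a * d - b * c) - total / 2) ** 2
denominator = (a + b) * (c + d) * (a + c) * (b + d)
chi2_yates = numerator / denominator

critical_value_p01 = 6.635  # chi-square critical value, 1 df, p = 0.01

print(f"n = {total}, accuracy = {accuracy:.3f}, balanced = {balanced}")
print(f"chi-square (Yates) = {chi2_yates:.1f}")
print("significant at p = 0.01:", chi2_yates > critical_value_p01)
```

The four counts add up to the 472 observations of the survey, and the TN + FN total matches the 181 balanced specimens used later as the reference subset.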
Boxplots of foliar macro- and micro-nutrient concentrations and of centered log ratios of true negative specimens are presented in Figures 1 and 2. There were some outliers among P and Ca expressions. The number of outliers was larger among micronutrient expressions, likely due to variable soil composition, site-specific applications of organic amendments, and different timings between tissue sampling and fungicide applications (Zn and Mn in carbamate formulations, copper sulfate). Ranges of nutrient concentrations, centered log ratios and isometric log ratios in boxplots are presented in Table 6. Among macronutrients, the lower and upper limits of N boxplots differed the most from Brazilian standards, which could lead to N over-fertilization. The P ranges were similar between standards and boxplots, while the ranges of K, Mg and Ca concentrations were wider. Macronutrients, which showed narrower ranges of concentrations compared to micronutrients, were diagnosed as a separate subset to facilitate comparison with Brazilian standards (Table 6). The means and covariance matrix across 181 balanced (TN + FN) specimens are presented in Table 7.

Regional vs. Local Diagnosis
The ilr values of state standards [26] and the Mahalanobis distance from regional standards computed in the present study were measured using state standard median (M), first quartile (Q1), third quartile (Q3) and six sequential combinations thereof as Q1M, MQ3, Q3M, Q1Q3 and Q3Q1 (Table 7). The Q1Q3 sequence (N Q1, P Q3, K Q1, Mg Q3, Ca Q1) that showed the smallest Mahalanobis distance was retained to compare regional to local diagnosis (Table 8).
The closest successful Euclidean neighbors were detected by comparing foliar compositions and other features of TN specimens (municipality, cultivar, rootstock, number of chilling hours and some soil analyses where available in the data set) to those of the diagnosed specimens. For the compared defective and successful peach orchards at Bento Gonçalves and Pelotas, soil texture and classification, clay and organic matter contents, and number of chilling hours were similar, but yield, cultivar, rootstock and tissue composition differed.
A defective specimen of "Chimarrita" grafted on "Aldrighi" (8.9 Mg ha⁻¹) was grown in Bento Gonçalves. The closest successful orchards (29.5-30.4 Mg ha⁻¹) were "Chimarrita" and "Maciel" grafted on "Nemaguard". Because both successful orchards returned similar diagnosis, the "Chimarrita" orchard was selected as the closest successful neighbor. Regional diagnosis across factors indicated N, K and Mg shortage and P sufficiency (Table 8). The diagnosed tissue specimen was classified as true positive with a χ² value (squared Mahalanobis distance, 5 degrees of freedom) of 17.47 across ilr variables and a highly significant probability to respond to a more appropriate fertilization regime. At local scale, the nearest neighbor returned an inverse K and Mg diagnosis (Figure 3), indicating site-specific factor interactions that were not depicted by nutrient standards at regional scale. While the apparent K:Mg imbalance detected at local scale may also be attributed to different rootstocks ("Aldrighi" vs. "Nemaguard"), a comparable rootstock for successful "Chimarrita" was not available, emphasizing the importance of acquiring larger data sets.

Figure note: The Euclidean distance is computed across clr differences. Negative differences between defective and successful specimens indicate relative shortage. Positive differences indicate relative excess.
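Because the squared Mahalanobis distance is distributed like a χ² variable, the value of 17.47 reported for the Bento Gonçalves specimen can be converted into a tail probability. The following Python sketch does so with the closed-form recurrence for the χ² upper tail at integer degrees of freedom. The function name `chi2_sf` is ours, and the calculation is offered as a check on the order of magnitude, not as the procedure used in the study.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df.

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1) with z = x / 2,
    starting from Q(1/2, z) = erfc(sqrt(z)) for odd df
    or Q(1, z) = exp(-z) for even df.
    """
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


p_value = chi2_sf(17.47, 5)
print(f"P(chi-square with 5 df > 17.47) = {p_value:.4f}")
print("below 0.01:", p_value < 0.01)
```

The 1% critical value for 5 degrees of freedom is about 15.09, so a squared distance of 17.47 falls beyond it, consistent with the "highly significant" wording.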
In Pelotas, state standards indicated considerable nutrient imbalance (Figure 4). Hence, regional diagnosis classified the specimen as true positive, potentially responsive to K and Ca additions and to a reduction in N, P and Mg fertilization. At local scale, in contrast, the nearest successful neighbor, where other features were close to those of the diagnosed specimen, indicated negligible nutrient imbalance. Hence, other factors likely limited yield at local scale, but this was not indicated by the regional diagnosis.

Compositions as Separate Parts or Interactive Systems?
Johnson et al. [37] reported that nutrient concentration ranges, the Diagnosis and Recommendation Integrated System or DRIS [14] and the deviation from optimum percentage or DOP [38] have been used with some success to diagnose the nutrient status of peach trees. Normally distributed nutrient concentration ranges addressed simultaneously to diagnose tissue nutrient status collapse in the ellipsoidal multivariate hyperspace of nutrients and are thus useless as the number of diagnosed nutrients increases [39]. On the other hand, dual ratios such as P:Zn [40], N:P, N:S and S:P [41] are important in peach tree nutrition. Nutrient interactions such as N × P synergism and K × Mg antagonism [32,42] should also be considered. However, D-part tissue compositions can return up to D × (D − 1)/2 dual ratios, most of them being redundant, hence useless for correlation analysis with yield. For example, the N:P, N:S and S:P ratios are redundant because N/P = (N/S) × (S/P). Tissue compositions should rather be viewed as entities (i.e., unique combinations of nutrients). Aitchison [11] integrated pairwise log ratios into centered log ratios to secure the unique character of combinations of components in a composition (Equation (3)). The tissue nutrient clr variables are multivariate in nature and affected by farm nutrient management, climate and soil composition that vary widely regionally. Moreover, because adding one nutrient through fertilization may affect several others by resonance within the compositional space of tissue dry matter, other nutrients are also impacted by fertilization. Downscaling regional clr descriptive statistics (mean, variance) to site-specific level may be hazardous.
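The redundancy argument can be made concrete with a short Python sketch. It counts the dual ratios available among the eleven nutrients quantified in leaf tissue in this study and verifies, on hypothetical concentrations, that ln(N/P) carries no information beyond ln(N/S) and ln(S/P). The concentration values are made up for illustration.

```python
import math
from itertools import combinations

# Nutrients quantified in leaf tissue in this study (D = 11 parts).
nutrients = ["N", "S", "P", "K", "Ca", "Mg", "Zn", "Cu", "Mn", "Fe", "B"]
D = len(nutrients)

dual_ratios = list(combinations(nutrients, 2))
print(f"D = {D}: {len(dual_ratios)} dual ratios, "
      f"but only {D - 1} degrees of freedom")

# Hypothetical concentrations (g/kg) to illustrate the redundancy.
conc = {"N": 30.0, "S": 2.0, "P": 2.5}
log_np = math.log(conc["N"] / conc["P"])
log_ns = math.log(conc["N"] / conc["S"])
log_sp = math.log(conc["S"] / conc["P"])

# N/P = (N/S) x (S/P), so the three log ratios are linearly dependent.
print(f"ln(N/P) = {log_np:.4f}")
print(f"ln(N/S) + ln(S/P) = {log_ns + log_sp:.4f}")
```

Adding the filling value as a twelfth part raises the count of dual ratios further, while the number of independent balances only grows by one.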
Direct comparison between two equal-length compositions (Equation (4)), lumped into the Euclidean distance [11] at local scale where soil, management and meteorological factors are comparable, thus appeared to be a more appropriate diagnostic method than computing clr indices using means and standard deviations at regional scale. In addition, differences between clr values adding up to the Euclidean distance allow classifying nutrients numerically in the order of their apparent limitation to yield [16]. The perturbation vector between two compositions, computed as nutrient-wise ratios (X_i,defective / X_i,successful) between defective and successful compositions, is an alternative expression to rank nutrients in a numerical order at local scale [43].
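The link between the two rankings can be checked numerically. In the Python sketch below, which relies on hypothetical compositions (g/kg, closed to 1000 with Fv) and illustrative function names, the clr difference of each part equals the log of its perturbation ratio minus the mean log ratio across parts, so both expressions order the parts identically.

```python
import math


def clr(composition):
    """Centered log ratios of a dict of strictly positive parts."""
    logs = {p: math.log(v) for p, v in composition.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {p: v - mean_log for p, v in logs.items()}


def perturbation_ratios(defective, successful):
    """Part-wise ratios X_defective / X_successful."""
    return {p: defective[p] / successful[p] for p in defective}


# Hypothetical compositions (g/kg), each summing to 1000 with Fv.
defective = {"N": 28.0, "P": 2.2, "K": 14.0, "Ca": 19.0, "Mg": 6.0,
             "Fv": 930.8}
successful = {"N": 31.0, "P": 2.3, "K": 21.0, "Ca": 18.0, "Mg": 4.5,
              "Fv": 923.2}

ratios = perturbation_ratios(defective, successful)
clr_def, clr_suc = clr(defective), clr(successful)
clr_diff = {p: clr_def[p] - clr_suc[p] for p in defective}

# clr difference = ln(ratio) - mean of ln(ratio) across parts.
mean_log_ratio = sum(math.log(r) for r in ratios.values()) / len(ratios)

order_by_ratio = sorted(ratios, key=ratios.get)
order_by_clr = sorted(clr_diff, key=clr_diff.get)

print("ranked by perturbation ratio:", order_by_ratio)
print("ranked by clr difference:   ", order_by_clr)
```

Ratios below 1 correspond to relative shortage in the defective specimen and ratios above 1 to relative excess, mirroring the sign convention of the clr differences.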
The compositional data distribution of successful specimens need not have a specific shape. Successful specimens in "enchanting islands" [10,43] may even be harbored close to the composition of defective specimens but outside the regional critical hyper-ellipsoid. Regional and local diagnoses thus involve different references (regional centroids vs. local enchanting islands) and weighting matrices (identity, variance, covariance) that may lead to contrasting nutrient diagnoses (Figures 3 and 4). An additional benefit of selecting the closest successful neighbors (smallest Euclidean distance) is to provide reliable means to correct controllable growth-limiting factors and reach the trustful potential yields recorded in comparable surroundings.

From Regional to Local Diagnosis
In the early 1800s, Alexander von Humboldt championed the principles of quantitative biogeography [9] that illuminated a cascade of key concepts in agronomy and soil science such as Boussingault's nutrient budgets [44], Sprengel's law of the minimum, Liebscher's law of the optimum, Mitscherlich's law of diminishing returns [45], as well as Dokuchaev's morphogenetic soil classification system. Bernhard Baule combined interactive nutrients into a multiplicative law of diminishing returns that was later extended by Wallace and Wallace [46] to ≈70 multiplicative growth factors, a concept known as the law of the maximum. At local scale, it is unlikely that 70 growth factors can reach non-limiting conditions, except in some illusory "Gardens of Eden". However, several near-optimum conditions could be reached in enchanting islands showing uncontrollable factors comparable to those found in defective orchards.
Difficulties in fitting deterministic models to facts and data and in deriving economically optimum nutrient dosage led to the development of empirical polynomial models by economists [47]. However, to make predictions, calculations required not only response models but also assumptions on the likelihood of future events and of uncontrollable and controllable factors [48,49]. Kyveryga et al. [48,50,51] showed that the historical difficulties in tackling optimum nutrient rates using a limited number of fertilizer experiments could be alleviated by collecting large amounts of on-farm data.
Natale et al. [7] emphasized the importance of considering local conditions for nutrient management of orchards. The low-performing "Chimarrita" in Bento Gonçalves was grafted on "Aldrighi" and the high-performing one on "Nemaguard", indicating possible nutrient imbalance attributable either to inadequate nutrient management or to difference in rootstock. Mestre et al. [52] and Jimenez et al. [53] showed that rootstock could regulate the nutrition of peach trees. In contrast, Mayer et al. [54] did not find any difference in leaf nutrient content of "Maciel" grafted on "Nemaguard" and "Aldrighi", although nutrient levels were generally below state standards. Galarça et al. [55] concluded, from field trials on "Chimarrita" and "Maciel" grafted on "Aldrighi", "Capdeboscq", "Flordaguard" or "Nemaguard", that scion, rootstock and soil nutrient supply can impact the leaf nutrient content of peach trees but not necessarily tree performance. Only well-documented data sets can fully capture the combined effects of yield-driving variables at local scale.
It may be argued that nutrient dosage has been optimized by curve fitting in a few well-conducted fertilizer experiments but has not been optimized in growers' enchanting islands. Successful specimens provide nutrient dosage at local scale where uncontrollable and controllable growth factors interact and where controllable factors have been combined successfully. Growers commonly compare unhealthy and healthy specimens on their own property and in comparable surroundings. Trustful data sets and effective data-processing methods can allow growers to compare defective to well-documented successful specimens, avoiding extra analytical costs. Proximity between defective and successful specimens makes corrective measures more trustful. However, regional standards provide more protection against outliers that may contaminate some unsupervised enchanting islands (several enchanting islands in the TN data subset should be compared as compositional references for defective specimens). Nutrient diagnosis at regional scale then becomes a subsidiary tool in the decision-making process. Because soil fertility classes are established across growth-limiting factors, such as soil texture, compaction, and stoniness as well as soil conservation measures, regional guidelines could be upgraded or downgraded to adjust fertilization to local conditions.

Machine Learning and Big Data
In this study, we compared defective and successful compositional entities at local scale where all factors but the ones limiting yield were comparable. While the yield cutoff was fixed at 16 Mg fruit ha⁻¹, local organizations may select another yield cutoff and a minimum set of features to run their own machine learning and compositional models. As data sets build up, machine learning methods could assess more accurately the contribution of each feature to crop yield by removing features sequentially to test parsimoniously their impact on yield prediction (Occam's razor). To facilitate data collection, minimum data sets can be selected from meaningful quantitative and qualitative data easily available at farm level.
Nowadays, large data sets can be processed by machine learning and compositional data analysis methods directly from data input rather than being supervised by deterministic response models to conduct nutrient diagnosis. Given local uncontrollable factors such as climate, soil depth, stoniness and texture, enchanting islands may be documented where controllable factors have been already addressed successfully by local growers. It should be noted that any change in fertilization regimes of peach orchards may take more than one season to be effective because nutrient reserves accumulated in off years can be remobilized in large amounts in fruiting years [31,56] and at high rate [57].

Citizen Science and Precision Farming
The concept of site-specific nutrient management has been developed to increase crop yield and quality at local scale [54,58] and to minimize environmental damage from unwise fertilization [39,59]. Precision maps indicated that fruit quality may decrease at high-yield level [60]. Reference [61] demonstrated the importance of adopting profitable site-specific nutrient management and disease control in Brazilian fruit orchards.
Citizen science to collect high-quality data is challenging because it requires close and ethical collaboration between researchers and stakeholders to build trustful and informative data sets [62,63]. Data sets are developing rapidly in North America from continental [64] to regional [9,51,65,66] and local [50] scales. A recent survey showed a positive attitude of American fruit growers toward precision agriculture if supported by research and extension programs [67]. Researchers and growers can document, store and track analytical and managerial records. Spectroscopic techniques may facilitate collecting proximate soil analyses at low cost [68-70]. While plant tissue analysis has long been non-competitive with soil analysis in terms of price and ease of data collection and interpretation, high-throughput inductively-coupled plasma (ICP) technology [71], low-cost visible-infrared-ultraviolet (VIS-IR-UV) spectroscopy [68] and laser-induced breakdown technology [72] may increase the use of both plant and soil analysis in the near future. Large data sets can be processed rapidly by machine learning and compositional methods to tackle local production problems. The larger and more diversified the data set, the more accurate the prediction. Our study combined the efforts of Brazilian growers and research institutions to build knowledge on the site-specific nutrient management of peach orchards.

Conclusions
Increasing the production of high-quality fruits by improving nutrient management of orchards at local scale is a great challenge in Brazil and many other fruit-producing countries. Until now, regional nutrient standards based on field trials have been used to interpret the results of soil and tissue analyses. In the present study, machine learning models relating fruit yield to tissue composition returned classification accuracy >80% from a set of growth-impacting features at a yield cutoff of 16 Mg ha⁻¹. The collection of state-wide data sets from experimental farms and commercial orchards allowed setting apart nutritionally balanced specimens to provide updated tissue nutrient standards from ever-growing data sets.
At regional scale, site attributes are assumed to be equal and yield targets are not documented. At local scale, several attributes are reported, and trustful yield targets and corrective measures are provided in close enchanting islands. Nutrient imbalance diagnosis at regional scale may thus differ from local diagnosis. Such discrepancy may explain in part why several Brazilian peach orchards produced disappointing yields using the present regional standards. Due to the high cost of field trials, local diagnosis requires close and ethical collaboration between researchers and stakeholders to acquire large and diversified sets of high-quality trustful data. As data sets mature in size and diversity, machine learning and compositional methods could solve more complex and subtle factor interactions at local scale. This will be possible only by combining the efforts of researchers, extension specialists, crop advisers and growers.