Assessing the Information Potential of MIR Spectral Signatures for Prediction of Multiple Soil Properties Based on Data from the AfSIS Phase I Project

The aim of the study was to assess the predictive potential of mid-infrared (MIR) spectral response in the estimation of 60 soil properties. It is important to know the accuracy limitations in estimating various soil characteristics using various models in conditions of high spatial variability of the environment. To fully assess this potential, three types of algorithms were used in modeling, i.e., partial least squares (PLSR), one-dimensional convolutional neural network (1DCNN), and generalized regression neural network (GRNN). The research used data from 19 sub-Saharan African countries collected as part of the Africa Soil Information Service (AfSIS) Phase I project. The repositories provide 18,250 MIR reflectance recordings and nearly two thousand analytical data records from the determination of many soil properties by reference methods. The modeled subset of these properties included texture (three variables), bulk density, moisture content at soil water characteristic curves (SWCC, 4 variables), total and organic C and total N content (3 variables), total elemental content (32 variables), elemental content in bioavailable forms (12 variables), electrical conductivity, exchangeable acidity, exchangeable bases, pH, and phosphorus sorption index. It is not possible to indicate a universal optimal prediction model for all soil variables. The best prediction results are provided by all regression models for total and organic C, total Fe, total Al and bioavailable Al content, and pH. For bulk density, total N and total K content satisfactory results are provided by specific model type. Many other properties, i.e., texture, SWCC, total Ga, Rb, Na, Ca, Cu, Pb, Hg content, and bioavailable Ca and K content, can be predicted with accuracies sufficient for some less demanding tasks.


Introduction
The interpretation of short-range spectral recordings under field or laboratory conditions attracts a lot of attention [1][2][3][4]. The importance of such methods is related to the three-dimensional variability of soils and their profile variation, which are impossible to recognize by recording spectral reflectance of the ground surface. Most studies to date relating to spectral analysis of soils have used the visible near-infrared (Vis-NIR) range covering the 350-2500 nm range of electromagnetic radiation [3,5,6]. Currently, there is increasing interest in investigating the suitability of the mid-infrared (MIR) spectrum covering the 2500-25,000 nm range [7][8][9].
The analysis of the usefulness of MIR signatures in soil research is of increasing interest. Margenot et al. analyzed the technical aspects of using this spectral range in the estimation of soil properties, mainly on the basis of a literature review [10]. They indicate the need to standardize the method of considering the impact of soil moisture on disturbances in the estimation of soil organic matter (SOM). Nath et al. analyze the usefulness of MIR signatures in estimating biological properties [11], pointing to the high usefulness of this 2 of 22 approach in determining the concentration of C and N. They believe that the estimation of the concentration of enzymes and other organic components requires more sophisticated estimation methods. In a study of alfisols in eastern India [12], using a relatively small (approximately 330 samples) sample set, the authors concluded that the PLSR and SVM model provide a better estimate of SOC, pH, and clay than the random forest model. They point to the need to strengthen these conclusions on the basis of more territorially extensive research. An extensive analysis of the estimates of over 100 soil properties, based on the MIR Kellogg Laboratory's spectral library, is contained in the publication by Ng et al. [8]. The results obtained confirmed the usefulness of MBL models for the estimation of many properties of soils from the territory of the USA.
A very insightful analysis of prediction results using NIR and MIR was conducted by Bellon-Maurel and McBratney [5]. The review analyzed the effect of the method of measurement implementation, data preprocessing, and algorithms used on soil organic carbon (SOC) prediction. According to one of the review's conclusions, NIR and MIR allow the prediction of SOC with similar accuracy, with a slight advantage to MIR as it provides better reproducibility, with a lesser error (10-40% when compared to NIR). In addition, it was found that, regardless of the range of the spectrum, homogeneity of data for calibration and validation is essential. It should be noted that comparison of modeling results based on spectral response is justified only for soil samples with similar properties, geology, and climatic conditions. The importance of the range of variability of the data, the number of cases for calibration, preprocessing, and the type of model should also be emphasized.
In order to use spectral analysis in the soil survey, it is necessary to know the potential possibilities and limitations of this approach in estimating the values of various soil characteristics. The problem is complex. The model error, apart from its architecture, depends on the distribution and differentiation of data, the range of their variability, the number of observations, the method of validation, etc. The work of Nduwamungu et al. [13] summarizes the results of modeling tests based on the NIR spectrum of soil properties such as texture, SOC, pH, and cation exchange capacity (CEC) presented in the work of many authors, whose statistics differed dramatically: the R 2 spread was between 0.01 and 0.99. The availability of possibly extensive data with high variability increases the credibility of the information potential of spectral data under various soil conditions. The knowledge of the potential usefulness of a spectrum for determining various soil properties is of significant practical importance for the selection of variables used to estimate soil quality indices [14,15].
Soils are described by a number of properties subject to evaluation from the point of view of their cultivation, economic, and engineering suitability. As evaluation criteria, depending on the use under consideration, various properties of soils are treated as more or less important. Classification of soils, in the context of their use in agriculture or forestry, requires characterization of basic physical (texture, water holding capacity, compactness, permeability, and many other potential physical indicators) and chemical (concentration of macro-and micronutrients, pH, sorption capacity, buffering capacity, and many others) properties. The list of elements to be assessed is constantly growing; for example, by including rare earth elements in the assessment of soils [16,17]-important because of their role as micronutrients and potential pollutants. Given this fact, it is important to appreciate the possibility of assessing the information potential contained in the spectral response of soil samples as a source to support, at least roughly, the assessment of environmentally or economically significant properties. It is important to note that the quality of models linking the spectral response to individual soil properties depends largely on the type of prediction algorithm used as well as the statistical distribution of the modeled properties. For this reason, it is advantageous to confront modeling approaches and validation datasets in order to identify features, prediction methods, and conditions for their use.
The aim of this research is to determine prediction accuracy for many soil variables based on the analysis of the spectral response of soils in the proximal sensing mode in the MIR range. The literature generally presents attempts to model several variables-usually The aim of this research is to determine prediction accuracy for many soil variables based on the analysis of the spectral response of soils in the proximal sensing mode in the MIR range. The literature generally presents attempts to model several variables-usually C, N, P, K, pH, and texture, contents of some heavy metals, such as Zn, Hg, Cr, Cu, and Pb [18,19]-and occasionally other properties, such as soilwater characteristics curve, CEC, EC, or exchangeable bases [8]. AfSIS data are characterized by a very extensive list of variables marked with reference methods along with the spectral response [20]. They allow estimation of the modeling errors of many soil variables from samples taken in highly varied soil and climatic conditions. Determined validation statistics allow for the assessment of the usefulness and limitations of MIR models and data for estimating the values of variables characterizing soils.

AfSiS Phase I Data
AfSIS Phase I data were collected as part of a project financed by the Melinda and Bill Gates Foundation, implemented with the participation of five research institutions: The Tropical Soil Biology and Fertility Institute of The International Center for Tropical Agriculture, The International Soil Reference and Information Center-World Soil Information, The Center for International Earth Science Information Network, The Earth Institute at Columbia University, and World Agroforestry (Center for Research in Agroforestry-ICRAF). These data include the results of analyses of samples from soil profiles collected at points determined according to the methodology of the Land Degradation Surveillance Framework [20] in 19 countries of sub-Saharan Africa: Angola, Botswana, Burkina Faso, Cameroon, Ethiopia, Ghana, Guinea, Kenya, Madagascar, Malawi, Mali, Mozambique, Niger, Nigeria, South Africa, Tanzania, Uganda, Zambia, and Zimbabwe. More details regarding the AfSIS database can be found in Vågen et al. [21].
The data provided by the repositories (Amazon AWS, GitHub, and the Agroforestry portal) came from topsoil (0-20 cm) and subsoil (20-50 cm) layers. Soil sampling was carried out in accordance with the Land Degradation Surveillance Framework (LDSF) procedure, taking into account the local and spatial characteristics of the environment. The sampling areas represent a random stratification of the African landscapes south of the Sahara desert ( Figure 1). The samples were taken in the areas surrounding the cluster center, which allowed consideration of the spatial variability of soils [22]. A detailed method for determining the location of sampling areas, sampling, and analytical procedures is included in the report written by Leenaars et al. [23].  AfSIS data were the basis for studies on the analysis of the spatial variability of soil chemical composition [24] and the characteristics of soil profiles [23]. On this foundation, cartographic documentation of soils with a resolution of 30 m was developed and prepared [25]. The repositories provide 18,250 MIR reflectance recordings [26,27]. A single spectral signature covers 1749 recorded reflectance values in the range from 2500 to 16,666 nm (wavelength 4000 to 600 cm −1 with a step of 1.9 cm −1 ). The average density of the spectral registration points was about 8.1 nm, and it ranged from 1.2 nm to 53 nm. The GitHub repository also provides nearly two thousand analytical data records from determinations of soil samples by reference methods. Georeferenced data on the location of sampling points are also available. The analytical data of the original dataset are split at the highest level into two groups, i.e., DryChemistry and WetChemistry. The DryChemistry subset includes the results of X-ray determinations of the content of 32 elements. The WetChemistry subset [28] includes data, which include soil texture, water characteristics of samples, and many soil chemical properties associated with the use of the Mehlich 3 methodology [29], from three different laboratories.
There are slight variations in the amount of data from different laboratories, which resulted in the need for separate modeling for individual parts of the set. Modeling of chemical data for which the reference data were less than 500 records were omitted. In addition, properties for which there are strict relationships allowing for their direct calculation on the basis of the modeled values were omitted in the modeling. Moreover, modeling of soil organic N was not carried out, as the method of determination of reference values was different from the commonly accepted standard. Each of the group of features for which modeling was performed included at least 1860 soil samples data, of which a randomly selected 70% was used for training and the remaining 30% was used for validation. The training and validation sets were kept the same for each model throughout the study.
A total of 60 variables were modeled, with the data divided into groups: "Texture" (3 variables, 1307 training, and 560 validation samples), separated into three particle-size fractions, i.e., sand, silt, or clay; "Water" (5 variables, 1306 training, and 560 validation samples), i.e., saturation at 0 (Sat. at 0), pF2.0, pF2.5, pF4. For a description of the reference methods, in addition to standard operating procedures, see Vågen et al. [22]. The texture of the soil was determined with laser diffraction particle size analysis. For AfSIS reference samples, soil moisture release curves are determined using soil fines in pressure plate apparatus. High throughput total x-ray fluorescence spectroscopy (TXRF) was used in the analysis of the chemical composition (macro and microelements). Total and organic C and total N were determined with combustion method. Extractable Al, Ca, Mg, P, K, Na, S, Fe, Mn, Zn, Cu, B, Mo, H, and other bases were determined with use of ICP analysis of Mehlich 3 extracts. P sorption index was measured by single-point P addition. Figure 2 and Appendix A (Tables A1-A5) present selected statistics for all variables in these data groups. For each data group, prediction models were created for all soil properties included in the group.

Models
Three different groups of algorithms were used in this study: partial least squ regression (PLSR), one-dimensional convolutional neural network (1DCNN), and g alized regression neural network (GRNN).
The PLSR algorithm has been the most widely described and applied to predi of the state of the environment from remotely sensed data. The deterministic PLSR rithm [30,31] produces a linear model preceded by an orthogonalization operation o input data taking into account their effect on the variance of the outputs. The experim used a module available in the MATLAB package [32]. Depending on the numb model outputs, the number of PLS coefficients was set in range from 25 to 100. The s input vector consisted of 1749 reflectance values recorded for one soil sample.
A one-dimensional convolutional neural network [33] represents the so-called a tive deep learning by processing the input data through successive layers of the algor All conducted computational experiments used a network with two convolutional la counting 32 and 64 filters, or 64 and 128 filters with widths of three, six, and ten i features separated by MaxPooling layers. A fully connected layer consisted of 15 units. The number of outputs corresponded to the number of modeled properties

Models
Three different groups of algorithms were used in this study: partial least squares regression (PLSR), one-dimensional convolutional neural network (1DCNN), and generalized regression neural network (GRNN).
The PLSR algorithm has been the most widely described and applied to prediction of the state of the environment from remotely sensed data. The deterministic PLSR algorithm [30,31] produces a linear model preceded by an orthogonalization operation of the input data taking into account their effect on the variance of the outputs. The experiments used a module available in the MATLAB package [32]. Depending on the number of model outputs, the number of PLS coefficients was set in range from 25 to 100. The single input vector consisted of 1749 reflectance values recorded for one soil sample.
A one-dimensional convolutional neural network [33] represents the so-called adaptive deep learning by processing the input data through successive layers of the algorithm. All conducted computational experiments used a network with two convolutional layers, counting 32 and 64 filters, or 64 and 128 filters with widths of three, six, and ten input features separated by MaxPooling layers. A fully connected layer consisted of 15 to 50 units. The number of outputs corresponded to the number of modeled properties. The spectra were standard normal variate (SNV)-transformed [34] and, like the outputs, were normalized to the numerical interval <0; 1> according to the methodology described in Gruszczyński and Gruszczyński [35]. The calculations associated with this algorithm were performed in the Python language environment using the Keras system and TensorFlow libraries [36,37].
A generalized regression neural network is a memory-based local nonparametric linear and nonlinear regression algorithm [38]. The GRNN algorithm is susceptible to the curse of dimensionality; therefore, the spectral data from the AfSIS project were transformed into vectors of 50 components each using the PCA algorithm, considering all available spectra (18,250). In order to optimize the spread of RBF units, a modeling cycle was performed with its values between 0.5 and 2.5 (with a step of 0.05). Figure 3 shows the dependence of linear correlation coefficients of data and prediction as a function of spread. In the construction of models, the spread generating the highest average linear correlation coefficient of the prediction with the observed (r) data was used. Optimal spread values ranged from 0.75 for the "Mehlich" data group to 2.0 for variables C and N. Calculations associated with the GRNN model were performed in the MATLAB environment [32]. spectra were standard normal variate (SNV)-transformed [34] and, like the outputs, were normalized to the numerical interval <0; 1> according to the methodology described in Gruszczyński and Gruszczyński [35]. The calculations associated with this algorithm were performed in the Python language environment using the Keras system and TensorFlow libraries [36,37]. A generalized regression neural network is a memory-based local nonparametric linear and nonlinear regression algorithm [38]. The GRNN algorithm is susceptible to the curse of dimensionality; therefore, the spectral data from the AfSIS project were transformed into vectors of 50 components each using the PCA algorithm, considering all available spectra (18,250). In order to optimize the spread of RBF units, a modeling cycle was performed with its values between 0.5 and 2.5 (with a step of 0.05). Figure 3 shows the dependence of linear correlation coefficients of data and prediction as a function of spread. In the construction of models, the spread generating the highest average linear correlation coefficient of the prediction with the observed ( ) data was used. Optimal spread values ranged from 0.75 for the "Mehlich" data group to 2.0 for variables C and N. Calculations associated with the GRNN model were performed in the MATLAB environment [32]. The evaluation criteria for each model were the following validation statistics: coefficient of determination (R 2 ), root mean square error (RMSE), ratio of performance to interquartile distance (RPIQ), bias, standardized bias (Sb), and Lin's concordance correlation coefficient (LCCC).
The coefficient of determination was calculated according to the equation: where SSRv is a sum of squares of differences of modeled value and its prediction in the validation set, and SSTv is a sum of squares of differences of modeled value and its mean in the validation set. The coefficient of determination is considered to be a numerical estimate of the degree to which the variance in a modeled variable is explained. Root mean square error was calculated according to the equation:
The coefficient of determination was calculated according to the equation: where SSRv is a sum of squares of differences of modeled value and its prediction in the validation set, and SSTv is a sum of squares of differences of modeled value and its mean in the validation set. The coefficient of determination is considered to be a numerical estimate of the degree to which the variance in a modeled variable is explained. Root mean square error was calculated according to the equation: where n is the size of the validation set. Ratio of performance to interquartile distance was calculated according to the equation: where IQR is an interquartile distance, and Q 3 and Q 1 are third and first quartile of validation data. The bias is calculated according to equation: where x i and y i are ith observed value and its prediction. Standardized bias is calculated according to equation: Lin's concordance correlation coefficient LCCC [39] is calculated according to equation: where s x , s y are standard deviations of observed values and their predictions, and x and y are the average of the observed values and their predictions. The LCCC value indicates the degree of agreement between the observed data and their prediction.

Hierarchical Summary of Prediction Positions
The relatively large number of modeled variables and the use of three different modeling algorithms justify the need to hierarchize the prediction accuracy of the variables in order to compare their practical utility. The hierarchization procedure was implemented using the technique for order of preference by similarity to ideal solution (TOPSIS) methodology [40]. This technique is used to rank objects described by multidimensional criteria through determination of their distance in multidimensional space from the potentially most favorable and least favorable variants. These variants are determined by identifying, for each criterion, the highest and lowest rated values. As a result, two (not necessarily existing in the dataset) variants are created with the best and the worst properties. The distance, in the multidimensional space, from both of them is a measure of the quality of prediction of the variable, according to the formula: where TIndex is a ranking coefficient, DistWORST is a euclidean distance in multidimensional space from the worst variant, and DistBEST is a euclidean distance in multidimensional space from the most favorable variant. The criteria used in TOPSIS method to characterize the prediction quality should be independent of the values scale. Thus, R 2 , RPIQ, Sb, and LCCC statistics scaled to the interval <0; 1> were used. The prediction accuracy of the model is highly dependent on the data number, distribution, and range of variability. For this reason, the hierarchization performed is of relative importance, depending on the type and homogeneity of the data.

Importance of Spectral Bands to Soil Property Estimation
The multidimensionality of spectral data is a significant obstacle in some modeling algorithms. The reduction of the dimensionality of the input data is a component of linear models (PCA, PLS). It is always associated with determining the parts of the spectrum that are important for obtaining a good estimation of the sought relationship between the spectral response and soil properties.
In PLSR models, the significance of individual parts of the spectrum for the estimation of a soil variable can be determined by calculating the indicator referred to as (VIP scores) variable importance in projection [30,32]. VIP scores, as a vector of numerical values with a length equal to the number of spectral registration points, can be calculated a posteriori. The sum of squares of the VIP scores for the spectrum is equal to the number of spectrum components. VIP scores less than 1.0 indicate a lower than average importance of a particular input value. Components with VIP scores greater than 1.0 are important for modeling. To assess the importance of spectral bands, calculations of VIP scores for 11 selected soil features (SOC, total N, Ca, Al, Fe, Cr, Cu, Pb, Zn, and Hg contents) were performed during the creation of PLSR models for these properties.
An independent analysis of the significance of spectral bands was performed with the neighborhood components analysis (NCA) algorithm. This algorithm, not related to a specific modeling method, is used to select the number of input variables a priori, especially in the case of a limited number of data [32, [41][42][43]. In order to compare the results of this approach with VIP scores, calculations were performed on the same dataset.

Results of Modeling of Texture
The search for an effective model to estimate soil texture is a frequent topic in soil science research [44]. The estimates of the content of the different fractions vary in terms of RMSE values (Table 1). Validation statistics for all algorithm groups analyzed are relatively weak (R 2 < 0.75) and (RPIQ < 2.0); R 2 , RPIQ, and LCCC signal significant agreement between predictions and observations, but RMSE is relatively high. In our study, the individual models unanimously indicate that sand has the largest prediction error, while silt and clay have much smaller errors. This is an indirect indication that when soil texture needs to be determined, it is safest to determine it using these two variables. However, it should be assumed that this result is specific to the AfSiS dataset and it results from a much larger concentration spread of Sand compared to the range of silt and clay. With similar R 2 values, the RMSEs for silt and clay fractions are significantly smaller. Moreover, the distribution of sand content is characterized by negative asymmetry.

Results of Modeling of Properties from "Water" Group
The complicated and lengthy process of laboratory determination of water characteristics of soils using sand box and ceramic plates justifies the search for alternative means of estimating them. The validation statistics of the prediction models for the "Water" group indicate that the models based on MIR provide a reasonably good basis for estimation of these quantities ( Table 2). RPIQs exceeding 2.0 signal a relatively good approximation of all relevant properties. Saturation at 0 and pF2.0 were estimated with the least error by the GRNN model; soil moisture at pF2.5 and pF4.2, by the 1DCNN model; while the volumetric density estimate with the highest accuracy is provided by the PLSR model.

Results of Modeling of C and N Contents
C and N concentrations in soils are the soil properties most commonly modeled, based on NIR and MIR spectral response [2,3]. This is due to the important role of both elements in shaping other soil properties, the role of N as a nutrient, and the importance of C sequestration in soils under conditions of increasing CO 2 content in ambient air. Validation statistics (Table 3, Figure 4) signal that, in the AfSIS dataset, the concentration of total C and N, as well as SOC, can be determined with relatively high accuracy, even when the compound content of both elements in soils is low. RMSE estimates of 0.33% for C and 0.03% for N provide good approximations of the true values. Validation results indicate that (among those tested) the best model for estimating C and N is 1DCNN. This provides reason to believe that the relationship between the spectrum and SOC concentration (and associated N) is nonlinear, as indicated by the weaker result of the PLSR model. The result of the GRNN model is also relatively weak.

Results of Modeling of C and N Contents
C and N concentrations in soils are the soil properties most commonly mode based on NIR and MIR spectral response [2,3]. This is due to the important role of b elements in shaping other soil properties, the role of N as a nutrient, and the importa of C sequestration in soils under conditions of increasing CO2 content in ambient air. V idation statistics (Table 3, Figure 4) signal that, in the AfSIS dataset, the concentratio total C and N, as well as SOC, can be determined with relatively high accuracy, even w the compound content of both elements in soils is low. RMSE estimates of 0.33% for C 0.03% for N provide good approximations of the true values. Validation results indi that (among those tested) the best model for estimating C and N is 1DCNN. This provi reason to believe that the relationship between the spectrum and SOC concentration ( associated N) is nonlinear, as indicated by the weaker result of the PLSR model. The re of the GRNN model is also relatively weak.   Table 4 contains validation statistics for the prediction models of the 32 elements. The validation results vary widely, although there are several elements whose concentrations can be estimated with relatively high accuracy. The RPIQ value indicates that a small relative error characterizes the prediction of Fe, Al, Ga, and Rb (RPIQ > 2.0). However, analyzing the RMSE level shows that, despite low values of RPIQ or R 2 , the content of other macroelements can also be predicted with sufficient accuracy, taking into account the scale of data variability (data spread). Comparison of the validation results from the three models indicates that the GRNN model dominates in terms of prediction quality. There are six exceptions to this rule: Na, K, Ca, and Se, for which better prediction results were obtained with the PLSR method; and Se and Bi, for which a slightly better result was obtained with the 1DCNN model.  Table 5 shows the prediction statistics of chemical properties of soils from the "Mehlich" group. Apart from Al content and pH value, the RPIQ of other characteristics does not exceed a value of 2.0. A relatively good validation result is shown by the model of Ca and K content (PLSR), and Mg content (1DCNN).  Table 6 contains hierarchical summaries of the relative prediction positions of soil variables using the TOPSIS method considering the R 2 , RPIQ, LCCC, and Sb criteria. A high position in the hierarchy indicates a prediction result more consistent with the observations. Hierarchization takes into account all validation results. A higher Tindex value for the same predicted element indicates an algorithm that more accurately approximates the variables. Despite some differences in the position of individual variables, a relatively narrow group of data can be predicted with the greatest precision: SOC, Al and M3 Al, Fe, K, pH, and total N. Ranks of prediction of soil texture elements and water properties are relatively low. However, this highly formalized hierarchical procedure does not invalidate the usefulness of lower-ranked models, which should be further evaluated. This applies, for example, to the prediction of Ca or pF, especially at higher values of these variables.

Relative Prediction Positions
Spectral response is used for the estimation of soil properties in digital mapping of the environment as well as in the management of field operations. The use of spectral analysis in laboratory conditions should be included in a similar category as it allows rapid, inexpensive, and approximate estimation of some soil properties in the whole soil profile [45,46]. An important problem is that of identifying the optimal modeling algorithm, as well as the limiting prediction precision, taking into account the minimum documentation requirements.

Impact of Relationships between Properties
It can be assumed that in the field of spectral analysis used for quantitative estimation of soil properties, there are three categories of variables: those that affect the spectral response directly to the extent that they can be quantitatively identified; those that are correlated with properties that affect the spectral response and thus can be quantitatively estimated; and those that do not affect the spectral response and are not correlated with properties that affect the spectral response. Results of previous studies indicate that, for soils, the first category includes carbon organic compounds [47,48], when using both NIR and MIR spectra. For other features, the possibility of belonging to each category has to be evaluated experimentally. The linkages between soil properties that enable or interfere with the quantitative identification of some components can be traced by confronting the statistics of the prediction results with the cluster system generated by the variable agglomeration algorithm. Figure 5 shows the dendrogram of associations between soil properties obtained by Ward's agglomeration procedure using the 1-r 2 metric [49,50]. The algorithm groups statistically related variables using a metric that expresses the similarity or dissimilarity between variables. The value of the association distance indicates the degree of similarity of the variables in the dataset. A cluster composed of mutually correlated variables Ti, Bi, V, Cu, Fe, Th, Pb, and Mn in models using MIR is characterized by coefficients of determination of, respectively: 0.67, 0.35, 0.63, 0.59, 0.84, 0.76, 0.64, 0.62. The highest R 2 value of the Fe prediction leader gives reason to believe that the variability of its content reflects to some extent the variability of the other elements of the cluster, supporting their prediction. The cluster related to Fe is in turn related to the cluster including M3 Cu, Pr, Ni, Cr, M3 Fe, Ex. Acid., PSI, M3 Al, and Al (R 2 : 0.50, 0.33, 0.66, 0.68, 0.49, 0.39, 0.78, 0.86, 0.79, respectively). There is also a cluster of variables related to soil pH, including pH, M3 Mg, M3 K, M3 Ca, Ex. Bas., and Ca. A characteristic group is formed by variables shaping water properties including: bulk density, pF2.0, pF2.5, pF4.2, SOC, total C, and total N. In this case, one might suspect that the supporting factor for quantitative identification in this group is total C, or SOC. some extent the variability of the other elements of the cluster, supporting their prediction. The cluster related to Fe is in turn related to the cluster including M3 Cu, Pr, Ni, Cr, M3 Fe, Ex. Acid., PSI, M3 Al, and Al (R 2 : 0.50, 0.33, 0.66, 0.68, 0.49, 0.39, 0.78, 0.86, 0.79, respectively). There is also a cluster of variables related to soil pH, including pH, M3 Mg, M3 K, M3 Ca, Ex. Bas., and Ca. A characteristic group is formed by variables shaping water properties including: bulk density, pF2.0, pF2.5, pF4.2, SOC, total C, and total N. In this case, one might suspect that the supporting factor for quantitative identification in this group is total C, or SOC.  The graphs in Figure 6 illustrate the statistical relationships between Fe content and the concentration of other metals (Al, V, Mn, Cu, Pb, and Th). It can be assumed that the model of these elements' content is to some extent related to the content of Fe, which is well quantified by the models. It cannot be excluded that the chain of connections between particular properties is more complex and may lead to correct estimation of features poorly influencing the spectral response or not influencing it at all. The graphs in Figure 6 illustrate the statistical relationships between Fe content and the concentration of other metals (Al, V, Mn, Cu, Pb, and Th). It can be assumed that the model of these elements' content is to some extent related to the content of Fe, which is well quantified by the models. It cannot be excluded that the chain of connections between particular properties is more complex and may lead to correct estimation of features poorly influencing the spectral response or not influencing it at all. Similar relationships exist between the exchangeable bases and the elements included in them (Figure 7). The exchangeable bases value is shaped by Ca to the largest extent, although the association with K and Mg is also significant. Calcium content, fairly well identified through the MIR spectral response, is nonlinearly related to pH. It is diffi- Figure 6. Scatter plots of the content of selected elements as a function of Fe content. Similar relationships exist between the exchangeable bases and the elements included in them (Figure 7). The exchangeable bases value is shaped by Ca to the largest extent, although the association with K and Mg is also significant. Calcium content, fairly well identified through the MIR spectral response, is nonlinearly related to pH. It is difficult to indicate which of these elements is the primary basis for correct modeling of soil sorption elements.  Figure 8 shows the diagrams of squares of VIP scores and NCA score value soil characteristics, including the concentration of heavy metals. The diagrams ref properties of the PLSR and NCA algorithms. In the case of PLSR, most orthogo inputs are used in the construction of the model. In the selection, in the NCA pro only spectral sections with appropriately high NCA scores are considered. In the an cases, the number of values indicated by this procedure ranged from 12 to 54.

Importance of Spectral Bands in Prediction of Selected Properties
It can be seen that SOC and total N is determined by taking the same spect ments into account. It can also be concluded that the concentration of Al, and esp Fe, is related to sections of the spectrum of heavy metals.  Figure 8 shows the diagrams of squares of VIP scores and NCA score values for 11 soil characteristics, including the concentration of heavy metals. The diagrams reflect the properties of the PLSR and NCA algorithms. In the case of PLSR, most orthogonalized inputs are used in the construction of the model. In the selection, in the NCA procedure, only spectral sections with appropriately high NCA scores are considered. In the analyzed cases, the number of values indicated by this procedure ranged from 12 to 54.

Importance of Spectral Bands in Prediction of Selected Properties
It can be seen that SOC and total N is determined by taking the same spectral segments into account. It can also be concluded that the concentration of Al, and especially Fe, is related to sections of the spectrum of heavy metals.

Discussion
Most soil data, for classification purposes or for making economic decisions, are interpreted in a semi-fuzzy way, or at least a tendency is observed leading to a blurring of the boundaries between classification criteria. The evaluation of modeling results of many soil characteristics must allow for a certain level of acceptable errors. In this context, the estimation of soil texture components, such as silt and clay contents, using the models described in this paper, can be considered as sufficiently precise. Similarly, the correctness of the models of water properties of soils, which are extremely time-consuming to determine in the laboratory using the classical approaches, should be assessed.
The modeling of SOC concentration based on NIR and MIR signatures is understandably of the greatest interest to researchers [2][3][4]51]. It can be assumed that, as an important component of soil organic compounds, SOC and total N content can be estimated with

Discussion
Most soil data, for classification purposes or for making economic decisions, are interpreted in a semi-fuzzy way, or at least a tendency is observed leading to a blurring of the boundaries between classification criteria. The evaluation of modeling results of many soil characteristics must allow for a certain level of acceptable errors. In this context, the estimation of soil texture components, such as silt and clay contents, using the models described in this paper, can be considered as sufficiently precise. Similarly, the correctness of the models of water properties of soils, which are extremely time-consuming to determine in the laboratory using the classical approaches, should be assessed.
The modeling of SOC concentration based on NIR and MIR signatures is understandably of the greatest interest to researchers [2][3][4]51]. It can be assumed that, as an important component of soil organic compounds, SOC and total N content can be estimated with satisfactory accuracy. In terms of total element prediction, only a few elements can be estimated with reasonable accuracy: Al, Fe, Ca, and K, along with several elements usually found in soils in low concentrations, namely Se, Rb, Ga, Zr, La, and Th. The contents of Cr, Ti, Mn, Ni, and Pb are estimated with slightly lower accuracy.
In modeling the "Mehlich" data group (17 variables), the 1DCNN (10 variables with the best prediction statistics) and GRNN (7 variables with the best prediction statistics) algorithms performed best. The PLSR model was inferior to the other two in every case. Particularly good prediction scores apply to: M3 Al, M3 B, M3 Mg, and pH (1DCNN model); and to total exchangeable bases, M3 Ca, M3 K, M3 Na, and PSI (GRNN model).
The results presented indicate that, among the examined models, there is no single optimal one that would provide the best predictions for all variables. Each of the models shows advantages in predicting a certain group of variables and at the same time is less useful in predicting others. This has certainly been influenced by the heterogeneity of data from many different countries, which is evident through the very large spread of variable values [24]. Compared to results with less dispersed distributions, more than 100 variables across the United States from the Kellogg Soil Survey Laboratory [8] show, in many cases, similar values of determination coefficients, but are associated, in the case of AfSIS data, with a much larger RMSE values. Comparisons between the two sets of results are difficult, mainly due to the smaller spread-often by several times-of the Kellogg Laboratory data relative to AfSIS, and also due to the slightly wider spectral range they used. Data from the USA were also used by Wijewardane et al. [9], who indicated satisfactory results for prediction of C, N, CEC, pH, and clay content, while poor results were produced for extractable forms of phosphorus and potassium. In this case, the results (RMSE) are similar to those obtained from the AfSIS data, presumably also due to less restrictive data selection conditions that did not limit the spread of the data. AfSIS observational data have been used in very advanced digital mapping attempts. Hengl et al. [25] presented a modeling methodology to generate a map of soil properties, among other things, from data collected in the AfSIS project. They used an ensemble of machine learning regression models whose inputs were topographic, hydrological, meteorological, and vegetation data, as well as spectral reflectance images collected by the Sentinel satellites (Copernicus program). Nineteen variables were modeled, mainly Mehlich 3 extractable components, pH, texture, CEC, C, and N contents. Results of modeling based on spectra recorded from the satellite ceiling were slightly worse than those obtained by proximal sensing. The disadvantage of this approach is the limitation of quantitative identification of variables to the surface layer. The indisputable advantage, however, is the possibility of relatively precise representation of many soil properties with good spatial resolution over a very large area.
If we arbitrarily take (R 2 ≥ 0.75) and (RPIQ ≥ 2.0) as the criteria of correct prediction, then for PLSR models they are fulfilled for 10 of the studied soil properties. For nine variables, these criteria are met when 1DCNN is used, while among predictions using the GRNN model, the adopted criteria are met for seven variables. For six variables, the assumed criteria are met by all tested algorithms. These variables are SOC, total C, Al, Fe, M3 Al, and pH.
The decision to use a model based on spectral analysis instead of the reference method is inevitably subjective. Using validation statistics, it is possible to estimate the risk of error based, for example, on R 2 and RMSE values. Acceptance or rejection of the model as an approximation for classification or mapping of soils is determined by the mapping scale (spatial resolution) and the way in which the variability of soils is reflected-either as a continuous image or discrete representation with information about the mean value of properties and the range of their variability.
Assuming a symmetrical distribution of deviations from observed values, properties such as sand, clay; total C, N, and SOC contents; and bulk density can be estimated with sufficient precision for large-scale soil mapping. In addition, the Ca content error of 0.5% relative to sample weight is small enough even for assessing the cultivation needs of soils, although the K content (error of 0.7%) is estimated with relatively low accuracy.
Fe compounds rarely show signs of deficiency; therefore, an error of 1% is not significant in practice. Similarly, Al concentration itself does not pose a significant problem; it only becomes important once the soil pH is below 5.0, when the presence of aluminum compounds is a threat to crops. The accuracy of estimates of other variables must be considered depending on the purpose and scale of the documentation required. For example, in the context of investigating threats to soils with high metal contents, even estimates of relatively low accuracy may be sufficient to determine the extent of transformation.
For large soil datasets, especially those from diverse geological settings, a significant asymmetry in the distribution of soil characteristics, usually right-handed, is characteristic. The aforementioned data from the US territory analyzed by Ng et al. [8] possess such a characteristic. Towett et al. [24], characterizing the chemical properties of the AfSIS data, noted the strong skewness of the distributions of many soil variables. Relative to the distribution of US data, the spread of AfSIS observations is, in most cases, much larger (in extreme cases, several tens of times). This has a significant impact on the level of prediction errors, especially at low values of the variables. In such circumstances, one of the solutions used is to exclude extremely high values (outliers) from the dataset. Such an operation, in the context of the real occurrence of high values of variables, is controversial. Figure 9 illustrates, based on the analyzed data, the relationship between the prediction error of the model with the most favorable characteristics (RMSE) and the prediction errors after removing the values considered as outliers, i.e., exceeding the assumed cut-off value (HL): The graph shows two versions of this approach: the RMSE ratio after removing extreme values from the validation data after prediction with a model derived from the entire data (RMSE OUT ); and the RMSE ratio after removing extreme values from the training and validation data before training the model (RMSE modOUT ). although the K content (error of 0.7%) is estimated with relatively low accuracy. Fe compounds rarely show signs of deficiency; therefore, an error of 1% is not significant in practice. Similarly, Al concentration itself does not pose a significant problem; it only becomes important once the soil pH is below 5.0, when the presence of aluminum compounds is a threat to crops. The accuracy of estimates of other variables must be considered depending on the purpose and scale of the documentation required. For example, in the context of investigating threats to soils with high metal contents, even estimates of relatively low accuracy may be sufficient to determine the extent of transformation. For large soil datasets, especially those from diverse geological settings, a significant asymmetry in the distribution of soil characteristics, usually right-handed, is characteristic. The aforementioned data from the US territory analyzed by Ng et al. [8] possess such a characteristic. Towett et al. [24], characterizing the chemical properties of the AfSIS data, noted the strong skewness of the distributions of many soil variables. Relative to the distribution of US data, the spread of AfSIS observations is, in most cases, much larger (in extreme cases, several tens of times). This has a significant impact on the level of prediction errors, especially at low values of the variables. In such circumstances, one of the solutions used is to exclude extremely high values (outliers) from the dataset. Such an operation, in the context of the real occurrence of high values of variables, is controversial. Figure 9 illustrates, based on the analyzed data, the relationship between the prediction error of the model with the most favorable characteristics (RMSE) and the prediction errors after removing the values considered as outliers, i.e., exceeding the assumed cut-off value ( ): The graph shows two versions of this approach: the RMSE ratio after removing extreme values from the validation data after prediction with a model derived from the entire data (RMSEOUT); and the RMSE ratio after removing extreme values from the training and validation data before training the model (RMSEmodOUT). A significant variation in the ratios of RMSE values can be seen. As a rule, the error after removing outliers before training is smaller than the RMSE for the full data and the RMSE after removing outliers from the validation data after training on the full data. The A significant variation in the ratios of RMSE values can be seen. As a rule, the error after removing outliers before training is smaller than the RMSE for the full data and the RMSE after removing outliers from the validation data after training on the full data. The exceptions are clay, Sat. at 0, K, M3 Al, and pH, for which limiting the range of variables in the learning set worsened the validation results. For sand and total N content, the exclusion of extreme feature values did not significantly affect the validation error.
The largest decreases in RMSE are for M3 Na, M3 K, M3 B, Hg, and Zr. Other RMSEs are closer to the variant with removal of outliers after training on the whole data. The problem of outliers for prediction of soil property values is debatable. The increase in accuracy due to truncation of large data values limits the applicability of the model, especially for memory-based models. In extreme cases, the construction of two different models should be considered [21]: one for data close to the modal and mean value, and a separate one for very high values of the variables. However, for models built for high property values, the problem will be the small number of learning examples.

Conclusions
Each of the models examined possesses a specific suitability for prediction of different soil variables. It should be assumed that differences in the correctness of estimation of particular variables depend on the type of relationship between the variable and the spectral response. This means that, due to the specificity of the relationship between the MIR spectrum and soil properties, it is not possible to indicate a universal optimal prediction model for all soil variables. Because of the wide range of the variable distributions, this result should be treated as specific for particular geological, morphological, and climatic conditions.
The best prediction results are provided by all regression models for total C, SOC, total Fe, Al, M3 Al, and pH. The clustering and modeling results indicate that many other soil properties do not directly affect the MIR spectral response, but there are correlations of features that can support such a prediction with limited accuracy.
Asymmetry in the distribution of the modeled variables makes it difficult to build a valid regression model. Removing outliers from the training set results in a significant reduction of the prediction error in many cases. However, this operation limits the universality of the model to a range relatively close to the median. Consideration of building two different models for different ranges of data variability is conditioned by the disproportion of the number of examples close to the median and outliers.

Conflicts of Interest:
The authors declare no conflict of interest. Table A1. Descriptive statistics of properties in the "Texture" group.