Modelling Diverse Soil Attributes with Visible to Longwave Infrared Spectroscopy Using PLSR Employed by an Automatic Modelling Engine

The study tested a data mining engine (PARACUDA®) to predict various soil attributes (BC, CEC, BS, pH, Corg, Pb, Hg, As, Zn and Cu) using reflectance data acquired for both optical and thermal infrared regions. The engine was designed to utilize large data in parallel and automatic processing to build and process hundreds of diverse models in a unified manner while avoiding bias and deviations caused by the operator(s). The system is able to systematically assess the effect of diverse preprocessing techniques; additionally, it analyses other parameters, such as different spectral resolutions and spectral coverages that affect soil properties. Accordingly, the system was used to extract models across both optical and thermal infrared spectral regions, which holds significant chromophores. In total, 2880 models were evaluated where each model was generated with a different preprocessing scheme of the input spectral data. The models were assessed using statistical parameters such as coefficient of determination (R2), square error of prediction (SEP), relative percentage difference (RPD) and by physical explanation (spectral assignments). It was found that the smoothing procedure is the most beneficial preprocessing stage, especially when combined with spectral derivation (1st or 2nd derivatives). Automatically and without the need of an operator, the data mining engine enabled the best prediction models to be found from all the combinations tested. Furthermore, the data mining approach used in this study and its processing scheme proved to be efficient tools for getting a better understanding of the geochemical properties of the samples studied (e.g., mineral associations).


Introduction
Soil spectroscopy has proven to be a fast, environmentally-friendly, reproducible, and repeatable analytical technique that has been increasingly used for rapid, non-destructive and cost-effective soil analyses.Spectroscopy, covering the optical (reflected solar radiation) and thermal infrared (earth tsurface emitted radiation) regions across the 0.4-15 µm spectral range, can be used to determine a wide range of soil properties [1] such as organic carbon (OC) [2], texture [3], cationic exchange capacity (CEC) [4], total phosphorus (P) [5], exchangeable potassium (K) [6], electrical conductivity (EC) [7][8][9], total concentration of potential pollutant metals/metalloids [10] and mineral content [11,12].Aside from the fundamental interaction of electromagnetic radiation with matter, indirect interaction can be found and provide additional quantitative information of the soil in question [13].The chemical or physical phenomenon that interacts with electromagnetic radiation is termed a chromophore.
The interaction between matter and electromagnetic radiation has been studied from both the theoretical and practical points of view.The spectral information is represented by a spectrum which consists of visible as well as non-visible information to the naked eye that further, when extracted, can spotlight the material in question in both quantitative and qualitative ways.The previously mentioned spectral regions can be divided into sub-regions: visible (VIS, 0.4-0.7 µm), near infrared (NIR, 0.7-1.0µm), shortwave infrared (SWIR, 1.0-3.0µm), mid-wave infrared (MWIR, 3.0-7.0µm) and longwave infrared (LWIR, 7.0-15 µm).
Analysing the spectral data can yield quantitative information about the material's chemical composition.This is because the spectral characteristics are correlated with the direct and indirect chromophores across all regions.In the VIS and NIR regions, iron oxides (due to electronic transitions) and organic matter are the main chromophores.Across the SWIR region the main active chromophores are OH in free water and the clay mineral lattice, organic matter, carbonates and salts [13].In the MWIR and LWIR spectral regions, absorption features, resulting from fundamental molecular vibration modes, show additional information about soil constituents, such as Si-bearing minerals (mainly quartz and clay minerals), carbonates, organic matter and gypsum [1,14].
High-dimensional spectral datasets, which are typically obtained in the solid and liquid phase, are used as inputs for quantitative modelling in spectroscopy.The data mining stage for extracting a valid model requires adequate statistical analysis techniques, which are able to deal with many highly collinear spectral frequencies (predictor variables) from relatively few observations.In this regard, multivariate analysis techniques, such as Multiple Linear Regression (MLR) [15], Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR) analyses [16], are capable of extracting reasonable models with overlapping information.Recently, different approaches have been used for spectroscopic data modelling including the artificial neural network (ANN, [17]), support vector machine regression (SVM, [18]) and random forests (RF) regression [19]; however PLSR is still probably the most widely used analysis technique over all, as it is an adequate technique to handle the difficulties inherited from the interpreted overtones by extracting the response variable relevant information from the spectra, while ignoring redundant information [12,15,20].
Prior to employing statistical modelling to detect the relationship between spectroscopy and the chemical properties of the material, different types of preprocessing techniques can be applied to the input spectral information (e.g., [18,[21][22][23]) either to minimize the noise or normalize the input spectroscopic data.Among these methods, the most common are: smoothing techniques, spectral derivatives, transfer to a logarithmic scale or continuum removal (CR).However, a literature search revealed that a study assessing how these pretreatment techniques (employed alone or in combination) affect the final model validity has yet to be done.
Soil reflectance, in all spectral domains, has a proven capability to detect several soil properties, however it is still unclear what effects on the final models' validity the input spectral data have (e.g., sensor specifications: spectra resolution, spectral range/ranges covered and signal to noise ratio (SNR)).Whereas most of the chemometrics work has been devoted to the reflective part of solar radiation (the optical range; VIS-NIR-SWIR) (e.g., [24][25][26]), the wavelength range based on earth radiation (the thermal range; MWIR and LWIR) has been also used with quite good success [12,27].Merging both ranges (optical and thermal) in chemometric analyses of soil is still not so common [8, 12,28] and using optical and thermal ranges together has usually been employed mainly only for organic matter modelling [29,30].Therefore, the question as to whether this spectral merging would result in more valid prediction models than using just each range alone (optical vs. thermal), has yet to be sufficiently answered.
To fill these gaps in soil spectroscopic modelling, powerful statistical modelling, which can cope with many pre-treatment stages automatically to assess the best prediction models, was used and tested.This engine is termed PARACUDA ® and was designed to utilize parallel and automatic processing to build and process hundreds of diverse models in order to prevent errors or biases caused Remote Sens. 2017, 9, 134 3 of 21 by an operator when taking the model setting decision.The following questions were asked when taking advantage of this innovative data mining approach:

•
To what extent does the several preprocessing method improve the modelling?

•
What is the contribution of the thermal infrared region (MWIR and in particular of the LWIR, as it is used in remote sensing applications) to the predictive capability of complex attributes that have no direct chromophores in the VIS-NIR-SWIR as well as MWIR and LWIR regions such as pH, base saturation and heavy metals?

•
What is the influence of spectrometer parameters (e.g., different spectral resolution and region coverage) on the final model results?

Study Sites and Soil Samples
The study area was located in the Sokolov basin in the western part of the Czech Republic (Figure 1a), in a region affected by long-term extensive lignite mining.Due to the mining activities and coal burning power plants that were built in the immediate vicinity of the mined area, this region is one of the most contaminated areas of the Czech Republic where high levels of trace elements have been detected [31,32].The soil was sampled from natural Norway spruce forest stands which surround the open-cast lignite mines in Sokolov, but have not been directly affected by the mining activities.However, the soil in all of the stands exhibits low pH [33].
Different bedrock characterizes the selected sites (Table 1).Metamorphic rocks, such as paragneisses, phyllites or mica-schist, underlie the sampling stands within the Habártov and Kovářská study site.In contrast, it is mainly intrusive rocks, such as granites, that underlie the sampling sites within the Mezihorská site and sedimentary rocks, such as sandstones, at the stands located in the Erika site.Both study sites have the same main soil type-Cryptopodzol/podzol.Detailed field investigations preceded sample collection, In addition, trace element and heavy metal gradients were studied in situ using a portable Innov-x Alpha RFA spectrometer to ensure that the selected soil samples cover the whole gradient ranges found at the sites.At each forest stand (4 stands, Table 1), 10 soil samples were collected (40 samples in total).Tree litter was excluded and material was collected from the organo-mineral horizons (A + AB, depth of 10-30 cm).The soil material collected was air dried prior to sieving and then transported to the certified laboratories of the Czech Geological Survey.Exchangeable cations (Ca, Mg, K, Na) and Al were analysed in 0.1 M BaCl 2 -extracts using atomic absorption spectrophotometry (AAS, Perkin-Elmer A Analyst 100).Soil pH was determined in distilled water and in 1 M CaCl 2 (ISO 103900).To measure Total Exchangeable Acidity (TEA), BaCl 2 -extracts were titrated using 0.025 M NaOH to pH = 7.8.To measure the organic carbon (Corg) and selected trace elements (Cu, Zn, As, Hg) the samples were sieved <2 mm and homogenized, trace elements were then determined using flame atomic absorption spectrometry (FAAS) and, in the case of Hg, atomic absorption spectrometer (AMA254).Organic carbon (Corg) was determined by sulfochromic oxidation (ISO 14235).Cation exchange capacity (CEC) was calculated as the sum of exchangeable base cations (BC = Ca + Mg + K + Na) and TEA.Base saturation (BS) was determined as the fraction of CEC associated with BC.Chemical analysis of the 40 samples within the dataset shows that the pH ranges were rather low-between 3.03 and 3.83.The other soil attributes showed high variations in the studied attributes (Table 2).Naturally, significant positive correlations (Pearson correlation coefficient r higher than 0.7) were found for Hg-CORG, Hg-Pb, As-Pb, Cu-Pb, Zn-Cu, BC-BS, Hg-CEC, Hg-BC, CORG-CEC, CORG-BC (Figure 2).Chemical analysis of the 40 samples within the dataset shows that the pH ranges were rather low-between 3.03 and 3.83.The other soil attributes showed high variations in the studied attributes (Table 2).Naturally, significant positive correlations (Pearson correlation coefficient r higher than 0.7) were found for Hg-C ORG , Hg-Pb, As-Pb, Cu-Pb, Zn-Cu, BC-BS, Hg-CEC, Hg-BC, C ORG -CEC, C ORG -BC (Figure 2).Chemical analysis of the 40 samples within the dataset shows that the pH ranges were rather low-between 3.03 and 3.83.The other soil attributes showed high variations in the studied attributes (Table 2).Naturally, significant positive correlations (Pearson correlation coefficient r higher than 0.7) were found for Hg-CORG, Hg-Pb, As-Pb, Cu-Pb, Zn-Cu, BC-BS, Hg-CEC, Hg-BC, CORG-CEC, CORG-BC (Figure 2).

Spectral Measurements
Soil reflectance was measured at the laboratory using two different spectrometers: (i) a broad-band Full Spectral Range (FSR) reflectometer (designed by ABB Bomem, Québec City, QC, Canada) [34]-measured the reflectance of the soil samples across the optical and thermal regions (15,506-748 cm −1 ; 0.64-13.36µm) with a spectral resolution of 16 cm −1 ; (ii) ASD FieldSpec3 spectroradiometer-measured the reflectance of the soil samples across the whole optical VIS-NIR-SWIR region (0.35-2.5 µm).The parameters of both spectrometers are shown in Table 3.Both measurements were done using a high intensity contact probe.The ASD measurements were carried out according to [35].A calibrated gold panel is built into the FSR instrument [36] and the FSR measurements were done under the following routine [34]: each soil sample was placed in a sampling cup, placed under the input port of the reflectometer and its reflectance spectrum was recorded.The reflectance spectra of the soil samples acquired from both spectrometers are shown in Figure 3.

Spectral Measurements
Soil reflectance was measured at the laboratory using two different spectrometers: (i) a broad-band Full Spectral Range (FSR) reflectometer (designed by ABB Bomem, Québec City, QC, Canada) [34]-measured the reflectance of the soil samples across the optical and thermal regions (15,506-748 cm −1 ; 0.64-13.36µm) with a spectral resolution of 16 cm −1 ; (ii) ASD FieldSpec3 spectroradiometer-measured the reflectance of the soil samples across the whole optical VIS-NIR-SWIR region (0.35-2.5 µm).The parameters of both spectrometers are shown in Table 3.Both measurements were done using a high intensity contact probe.The ASD measurements were carried out according to [35].A calibrated gold panel is built into the FSR instrument [36] and the FSR measurements were done under the following routine [34]: each soil sample was placed in a sampling cup, placed under the input port of the reflectometer and its reflectance spectrum was recorded.The reflectance spectra of the soil samples acquired from both spectrometers are shown in Figure 3.  <1 nm @ 640-1100 nm From 1 to 5 nm @ 1050-2500 nm From 5 to 136 nm @ 2500-13360 nm  <1 nm @ 640-1100 nm From 1 to 5 nm @ 1050-2500 nm From 5 to 136 nm @ 2500-13,360 nm

Statistical Modelling Interface-PARACUDA
For this study a new computing engine that has been developed at RSL-TAU termed PARACUDA ® was used.This engine is an automated computing system developed to manage and deploy thousands of chemometric models and preprocessing combination methods in order to extract the best prediction model from a big data archive.It replaces the manual operation scheme usually deployed in chemometrics using other dedicated software such as the Unscrambler ® [37].The system can be considered as data mining software for finding and utilizing hidden patterns and relationships within large and complicated databases with no human interaction.PARACUDA ® excels especially in handling large (spectral) data and modelling spectral measurements against chemical constitutes.This allows users to generate robust prediction models for soil properties (although it can work with other general databases).
One of the most advantageous characters of the abovementioned data mining system is the 'all-possibilities' approach (APA) scheme, where the system automatically applies linear and nonlinear modelling algorithms combined with several preprocessing methods to the data.The preprocessing algorithms include averaging, centring, smoothing, standardization, normalization and transformations among others.The modelling algorithms applied to the data are Artificial Neural Networks (ANN) and Partial Least Squares (PLS).Under the APA concept, the machine evaluates all preprocessing methods and their combinations and develops a unique model for every possible sequence resulting in up to 120 different combinations.This is an extremely computer power-consuming method and thus it runs on a grid based supercomputer with many processing cores for rapid analysis.
It must be stated that the APA approach was firstly tested by Zhao et al. ( 2015) [38] who proved the importance of considering multiple pathways for modelling spectral data.However, there are a few differences between the two approaches while the main one is the initial preprocessing or pretreatments procedure.Zhao et al. (2015) [38] tested five combinations of pretreatments: while this study analyzed a set of eight potential algorithms and evaluated each in different mutual combinations resulting in up to 120 valid combinations to be applied separately to the spectroscopic data.
The system is easy to operate via an excel plug-in and a web interface that enables fast and easy data transfers to the PARACUDA ® servers, changes to the modelling parameters for advanced users (a full automatic mode is the default) as well as controlling current jobs and monitoring their progress.The main PARACUDA ® excel plug-in screen enables dependent and independent datasets to be selected, the problem type to be selected (Function fitting or classification), notes to be entered for future reference and to finalize fine tuning of the PARACUDA operation including the data division process for validation, dimension reduction and modelling methodologies.Furthermore, preprocessing options and other advanced parameters are also available through the excel plugin.
For reliable results that represent the entire dataset as best as possible, the Conditioned Latin Hypercube Sampling (cLHS) method is used in the system [37].The cLHS is a stratified random procedure that provides an efficient way of sampling variables from their multivariate distributions.Up to 1,000,000 semi-random (quantile) divisions are created and the distributions of the modelled values are examined.The variability of values within each sub-group is evaluated for each grouping iteration.The grouping with the most variability within each group is chosen as the best representation of the dataset and continues to the following steps.For this process it is important to make sure there is similarity in the values range and distribution in the calibration and validation groups.
PARACUDA ® pre-processes spectral datasets prior to analysis in order to remove any irrelevant information which cannot be handled properly by the modelling techniques.The system uses some of the most common preprocessing techniques: Savitsky-Golay Initial Smoothing [39]; Multiplicative Scatter Correction (MSC); Standard Normal Variate (SNV); Pseudo Absorptance (log(1/R) [40]; Continuum Removal [41]; First Derivative [42]; Second Derivative [43]; and Final Smoothing.The system applies possible combinations of the calculations in a specific order, optionally creating 120 new datasets.Each new dataset is the outcome of the preprocessing sequence applied to the data repetitively.For example, an optional sequence can be: Smoothing, Continuum Removal, First Derivative, and again Smoothing.All mathematically valid combinations applied to the original data result in a new archive of data-sets which are ready to be evaluated and modelled.
The system enables the user to adjust the training/test grouping ratio for the population.In this case the system was set to use 75% of the samples for training and 25% of the samples for testing.As previously mentioned, the data was divided into these groups based on the cLHS algorithm that ensured appropriate representation of the datasets.The results of all of the PLS models were then consolidated to statistically quantify the effects of the preprocessing method, as well as the training/test group selection process on the modelling results.

PARACUDA Model Setting
Due to the limited number of samples, PARACUDA was only used for the PLSR modelling.The PLSR models were built for diverse soil properties (BC, CEC, BS, pH, Corg, Pb, Hg, As, Zn and Cu) using three different spectral scenarios (see Figure 3):

•
using the soil spectra acquired by the FSR spectrometer, which covers the whole optical and thermal range (0.6-13.3 µm, Scenario 1: Sc1) • using the soil spectra acquired by the FSR spectrometer selecting the optical range only (0.6-2.5 µm, Scenario 2: Sc2) • using the soil spectra acquired by the ASD spectrometer: the whole optical range (0.4-2.5 µm, Scenario 3: Sc3) The 40 samples from the original dataset were automatically divided into a training dataset consisting of 30 samples, and a test set of 10 samples by the software, as described earlier.In order to avoid over fitting, an internal validation method (full cross-validation) was first run on the entire 40 samples to estimate the optimal number of Latent Variables (LV) on the PLSR models.
As previously discussed in this study, eight different preprocessing techniques and their combinations were tested under the PLSR modelling: Smoothing (SM), multiplicative Scattering Correction (MSC), Signal Normal Variate (SNV), Absorbance (ABS), Continuum Removal (CR), 1st and 2nd Derivative (FD and SD, respectively) and final smoothing (FSM).PLSR modelling was employed combining all logical combinations among the eight preprocessing techniques, while PLSR models were set to run identically (Figure 4) and employed on the three different scenarios described above (utilizing different spectral settings Figure 3).
Remote Sens. 2017, 9, 134 7 of 21 applied to the data repetitively.For example, an optional sequence can be: Smoothing, Continuum Removal, First Derivative, and again Smoothing.All mathematically valid combinations applied to the original data result in a new archive of data-sets which are ready to be evaluated and modelled.
The system enables the user to adjust the training/test grouping ratio for the population.In this case the system was set to use 75% of the samples for training and 25% of the samples for testing.As previously mentioned, the data was divided into these groups based on the cLHS algorithm that ensured appropriate representation of the datasets.The results of all of the PLS models were then consolidated to statistically quantify the effects of the preprocessing method, as well as the training/test group selection process on the modelling results.

PARACUDA Model Setting
Due to the limited number of samples, PARACUDA was only used for the PLSR modelling.The PLSR models were built for diverse soil properties (BC, CEC, BS, pH, Corg, Pb, Hg, As, Zn and Cu) using three different spectral scenarios (see Figure 3):

•
using the soil spectra acquired by the FSR spectrometer, which covers the whole optical and thermal range (0.6-13.3 µm, Scenario 1: Sc1) • using the soil spectra acquired by the FSR spectrometer selecting the optical range only (0.6-2.5 µm, Scenario 2: Sc2) • using the soil spectra acquired by the ASD spectrometer: the whole optical range (0.4-2.5 µm, Scenario 3: Sc3) The 40 samples from the original dataset were automatically divided into a training dataset consisting of 30 samples, and a test set of 10 samples by the software, as described earlier.In order to avoid over fitting, an internal validation method (full cross-validation) was first run on the entire 40 samples to estimate the optimal number of Latent Variables (LV) on the PLSR models.
As previously discussed in this study, eight different preprocessing techniques and their combinations were tested under the PLSR modelling: Smoothing (SM), multiplicative Scattering Correction (MSC), Signal Normal Variate (SNV), Absorbance (ABS), Continuum Removal (CR), 1st and 2nd Derivative (FD and SD, respectively) and final smoothing (FSM).PLSR modelling was employed combining all logical combinations among the eight preprocessing techniques, while PLSR models were set to run identically (Figure 4) and employed on the three different scenarios described above (utilizing different spectral settings Figure 3).After processing, the selection of the best models out of all the options were evaluated based on the following parameters: lowest number of factors as a measure of model robustness, the lowest root mean square error of prediction (RMSEP) as a measure of expected error in future predictions, the coefficient of determination (R 2 ) and the Relative Percentage Difference (RPD).
As a result, for each of the parameters, 96 models were run for each spectral set (each spectral scenario), thus 288 PLSR models (3 × 96) were processed in all for each soil parameter (Tables 4-6).Therefore, for 10 different soil attributes 2880 models were generated and evaluated.It should be emphasized that this arduous work was run automatically and took 60 min.An expert/operator consideration was only entered after the best model was selected in order to explain the spectral assignment of the model as will be discussed later.

PLSR Prediction Model Assessment
Tables 4-6 present the results of the best analytical runs.To select the number of the most significant components the eigenvalue plots were generated for all the soil variables studied.It was found that the first five PC's explained most of the data variance.Therefore, the PLSR models selected employed a maximum of up to five PC's, reached a RPD higher than 1.5 and achieved the highest R 2 for the test data set (R 2 test).In some cases it was not possible to find any model that showed RPD higher than 1.5 and these models were marked with * in Tables 4-6.
Table 4 shows the best prediction models found, respectively the preprocessing techniques employed on spectral data prior to modelling, either alone or in all possible mutual combinations.Summarizing these results it can be said that preprocessing spectral data prior to modelling is an important step.For all of the parameters, the models achieving the best results employed some kind of preprocessing techniques (usually a combination of two or three of them).In other words, no model without preprocessing achieved better results than those employing the preprocessing techniques.For most of the models which were run, the smoothing (smoothing and/or final smoothing) was found to be beneficial, especially in combination with derivative transformation (1st or 2nd derivatives).If the optical spectral data are used as an input (Sc2 and Sc3), the transformation to absorbance was also a preferable preprocessing step that resulted in increasing the model's quality.
Tables 5 and 6 show the best models with statistics as follow: number of PC's, R 2 test, Root Mean Square Errors (RMSE), Standard Deviations (SD) and Regression Point Displacements (RPD's).Based on the guidelines given by [44], the prediction models obtained were categorized into five groups which are then summarized in Table 7: score 1 (excellent prediction): RPD and R 2 test higher than 3.0 and 0.9 respectively, score 2 (good predication): RPD values from 2.5-3.0 and R 2 test between 0.82-0.90,score 3 (approximate prediction): RPD values from 2.0-2.5 and R 2 test between 0.66-0.81,score 4 (possibility to distinguish between high and low values): RPD values from 1.5-2.0 and R 2 test between 0.50-0.65,score 5 (unsuccessful prediction): RPD values lower than 1.5 or R 2 test lower than 0.5.

Soil
For some variables it was possible to find more than one model having high RPD's (Tables 5 and 6): • when the FSR scenario (Sc1) was employed: BC and Pb (three models), Corg (two models) • when the ASD scenario (Sc3) was employed: Cu (two models) On the other hand, CEC was found to be difficult to predict even when using the PARACUDA ® and being able to assess 288 different models (in Table 4 marked with *).
The information from the thermal region (Sc1) brought significant benefits to the modelling of base cations (BS), base saturation (BS) and organic C (C org ) content as well as for some heavy metals such as: Pb, Hg, Cu and As (Tables 6 and 7).On the other hand, the high spectral resolution across the optical VIS-NIR-SWIR range provided by the ASD spectrometer (Sc3) was more beneficial for modelling pH and Zn (Tables 6 and 7).

Spectral Sensitivity Assessment
As described in Section 3.1, the statistics revealed the following results:

•
High spectral resolution covering the whole optical range (the spectral information acquired by the ASD spectrometer), was found to be more beneficial for modelling pH, Zn, As and C org (Sc 3)

•
The information in the thermal region (MWIR and LWIR) improved the predictive capabilities for BC, BS and C org content as well as for such heavy metals as Pb, Hg and Cu (Sc1).
The intention was to further test if the above described trends (the cases when a high-spectral-resolution optical range gives better predictions vs. the cases when a lower spectral resolution covering optical and thermal ranges gives better predictions than the latter one) can be explained in terms of spectral assignments.Spectral assignment is very important information in chemometrics [45], especially when a small data set is analysed.If spectral assignment cannot be found, all models might remain questionable, as no physical chemical basis is encountered.The version of the Paracuda used at the time of processing did not provide an interface to visualize spectral implications and assignments (later versions will include this function).Accordingly, to be able to explain which particular wavelengths are the most important for predictions and to visualize the spectral implications of the studied soil attributes the Pearson's correlation coefficients (r) were calculated between the attributes and the measured reflectance values (Figures 5-7) using the statistical software (SW) SPSS.

Demonstrating the Benefit of Scenario 3 When the Whole Optical Region (0.4-2.5 µm) Is Covered at a High Spectral Resolution
As already discussed, the high spectral resolution within the wider VIS-NIR-SWIR optical range (the spectral information obtained by the ASD spectrometer), was more beneficial for modelling pH, Zn and As.For those attributes the Person's correlation coefficients (r) calculated over the optical ranges of both spectrometers, ASD and FSR, are displayed in Figure 5.In addition to the above attributes, the C org is also displayed to demonstrate the spectral assignments of the organic component.To be able to discuss and compare these results with those previously published, Table 8 shows band assignments of identified VIS-NIR-SWIR wavelengths, which were compiled from [45,46].
The key wavelengths for pH were located in the VIS and NIR regions, the highest negative correlations (r between −0.4 and −0.3) were found for the spectral region between 0.40-0.53µm and then between 0.82-1.00µm (r between −0.3 and −0.27).These wavelengths are usually assigned to electronic transitions of Fe-oxides [14,[48][49][50] which are highly correlated with pH.The inter-correlation of the chromophoric property (in this case Fe-oxides) to the non-chromophoric one (in this case pH) is discussed in [44].They demonstrated significant model prediction of non-chromophoric attributes in soils by the inter-correlation mechanism.It is well known that the VIS-NIR region can be used to differentiate between diverse iron oxides which are stable under different pH ranges [28,[51][52][53] and accordingly to serve as a spectral assignment indicator for pH.As demonstrated in Figure 5, the ASD spectrometer, characterized by a wider optical range while having a high spectral resolution, is more advantageous in this case than the FSR spectrometer, which is characterized by a shorter optical range in the VIS-NIR region.As a result the FSR is more limited in detecting and discriminating diverse Fe-oxides and thus not optimal for predicting pH.Similarly, the important wavelengths of Zn are assigned to the same VIS-NIR regions as pH, 0.40-0.53µm (r between −0.54 and −0.44) and 0.84-1.00µm (r between −0.435 and −0.453) again indicating the inter-correlation mechanism by the Fe-oxides assignments.In addition, wavelengths between 2.35 and 2.366 µm show higher negative correlations (below −0.4).This spectral region can be assigned to C biomass [14,54] showing that Zn can also be associated with C biomass in the soil samples under study (again via the chromophoric property).Also in this case, the better spectral coverage of the ASD spectrometer (e.g., spectral range) allowed better determination of both iron oxides and C biomass.The As follows a similar pattern as described for Zn, except that the assignments between 0.63 and 0.73 µm characterizing C org are better pronounced.With the examples of Zn and As, the advantage of the better spectral coverage of the ASD spectrometer, allowing better differentiation between iron oxides and the organic component, is also demonstrated.
VIS-NIR region can be used to differentiate between diverse iron oxides which are stable under different pH ranges [28,[51][52][53] and accordingly to serve as a spectral assignment indicator for pH.As demonstrated in Figure 5, the ASD spectrometer, characterized by a wider optical range while having a high spectral resolution, is more advantageous in this case than the FSR spectrometer, which is characterized by a shorter optical range in the VIS-NIR region.As a result the FSR is more limited in detecting and discriminating diverse Fe-oxides and thus not optimal for predicting pH.
Similarly, the important wavelengths of Zn are assigned to the same VIS-NIR regions as pH, 0.40-0.53µm (r between −0.54 and −0.44) and 0.84-1.00µm (r between −0.435 and −0.453) again indicating the inter-correlation mechanism by the Fe-oxides assignments.In addition, wavelengths between 2.35 and 2.366 µm show higher negative correlations (below −0.4).This spectral region can be assigned to C biomass [14,54] showing that Zn can also be associated with C biomass in the soil samples under study (again via the chromophoric property).Also in this case, the better spectral coverage of the ASD spectrometer (e.g., spectral range) allowed better determination of both iron oxides and C biomass.The As follows a similar pattern as described for Zn, except that the assignments between 0.63 and 0.73 µm characterizing Corg are better pronounced.With the examples of Zn and As, the advantage of the better spectral coverage of the ASD spectrometer, allowing better differentiation between iron oxides and the organic component, is also demonstrated.When comparing the sensitivity results obtained by the ASD and FSR spectrometers for the same attributes, a common general trend was found (lowest negative correlations found for Corg followed by Zn, As and pH).However, for the spectrum obtained by the FSR spectrometer, lower correlations were found (Figure 5b) (between −0.45 and −0.5) as compared to the ASD spectrum (between −0.2 and −0.8) (Figure 5a).These results demonstrate that the spectral optical coverage of the FSR spectrometer (0.6-2.5 µm) does not allow such detailed differentiation among iron minerals and organic matter as is possible with the ASD spectrometer (0.4-2.5 µm).

Wavelengths (µm)
Possible Assignment Remark 0.400-1.200Electronic transitions Iron-bearing minerals 0.450, 0.520-0.533,0.560-0.575,0.630-0.640,0.660-0.When comparing the sensitivity results obtained by the ASD and FSR spectrometers for the same attributes, a common general trend was found (lowest negative correlations found for C org followed by Zn, As and pH).However, for the spectrum obtained by the FSR spectrometer, lower correlations were found (Figure 5b) (between −0.45 and −0.5) as compared to the ASD spectrum (between −0.2 and −0.8) (Figure 5a).These results demonstrate that the spectral optical coverage of the FSR spectrometer (0.6-2.5 µm) does not allow such detailed differentiation among iron minerals and organic matter as is possible with the ASD spectrometer (0.4-2.5 µm).

Wavelengths (µm)
Possible Assignment Remark Combining the optical and thermal region together in Scenario 1 brought significant benefits to the modelling of base cations (BC), base saturation (BS) and organic C (C org ) as well as to heavy metals such as Pb, Hg and Cu.For these soil attributes the Person's correlation coefficients (r) calculated over the VIS-NIR-SWIR and MWIR-LWIR regions are displayed in Figures 6 and 7, respectively.
Similarly as in the case of As, Zn and pH, the important wavelengths in the VIS-NIR (Figure 6) indicate inter-correlation with Fe-oxides, in addition, the absorptions in the SWIR region at 2.21 µm and 2.34 µm indicate inter-correlations with clay minerals and C biomass, respectively.The most important ranges in the MWIR-LWIR region were identified as follows: 2.77-3.83µm (3600-2580 cm −1 ), 5.75-6.40µm (1740-1560 cm −1 ), 8.15-9.52 µm (1226-1050 cm −1 ) and 10.90-12.20 µm (920-825 cm −1 ) (Figure 7).The positive correlation peaks, which are highest around 5.2 µm (1900 cm −1 ) and 7.7 µm (1300 cm −1 ), characterize the absorption shoulder positions and demonstrate that the higher the shoulder is, the higher the contents of the studied soil attributes are.On the other hand, absorption wavelengths that correlate with the soil attribute contents are characterized by negative correlations, the stronger the relationship between the soil attribute and the absorption depth, the more significant the negative correlation and thus a lower r is obtained.In Table 9 the key wavelength ranges identified within the MWIR-LWIR regions when using the sensitivity analyses are assigned.To compare the results, Table 10 shows band assignments of identified wavenumbers compiled from [46,[53][54][55][56][57][58].Combining the optical and thermal region together in Scenario 1 brought significant benefits to the modelling of base cations (BC), base saturation (BS) and organic C (Corg) as well as to heavy metals such as Pb, Hg and Cu.For these soil attributes the Person's correlation coefficients (r) calculated over the VIS-NIR-SWIR and MWIR-LWIR regions are displayed in Figures 6 and 7, respectively.
Similarly as in the case of As, Zn and pH, the important wavelengths in the VIS-NIR (Figure 6) indicate inter-correlation with Fe-oxides, in addition, the absorptions in the SWIR region at 2.21 µm and 2.34 µm indicate inter-correlations with clay minerals and C biomass, respectively.The most important ranges in the MWIR-LWIR region were identified as follows: 2.77-3.83µm (3600-2580 cm −1 ), 5.75-6.40µm (1740-1560 cm −1 ), 8.15-9.52 µm (1226-1050 cm −1 ) and 10.90-12.20 µm (920-825 cm −1 ) (Figure 7).The positive correlation peaks, which are highest around 5.2 µm (1900 cm −1 ) and 7.7 µm (1300 cm −1 ), characterize the absorption shoulder positions and demonstrate that the higher the shoulder is, the higher the contents of the studied soil attributes are.On the other hand, absorption wavelengths that correlate with the soil attribute contents are characterized by negative correlations, the stronger the relationship between the soil attribute and the absorption depth, the more significant the negative correlation and thus a lower r is obtained.In Table 9 the key wavelength ranges identified within the MWIR-LWIR regions when using the sensitivity analyses are assigned.To compare the results, Table 10 shows band assignments of identified wavenumbers compiled from [46,[53][54][55][56][57][58].Table 9. Key wavelength ranges within MWIR-LWIR identified in this study employing a sensitivity analysis.As the results published by the scientific community commonly use the wavelength (µm) and/or wavenumber (cm −1 ) scales, both scales are used to make the results comparable and universal.Table 9. Key wavelength ranges within MWIR-LWIR identified in this study employing a sensitivity analysis.As the results published by the scientific community commonly use the wavelength (µm) and/or wavenumber (cm −1 ) scales, both scales are used to make the results comparable and universal.The key wavelengths for BS as well as BC were found for a region between 2.76 and 3.20 µm (3625-3120 cm −1 ) (for both r lower than −0.4) and then 11.8-12.0µm (843-833 cm −1 ) (r for BS lower than 0.4 and for BC lower than −0.35 respectively).Both these ranges are assigned to phyllosilicates (Table 10) showing that for these attributes it is beneficial if such a mineral phase can be spectrally detected.
When analysing the correlation across the MWIR-LWIR wavelengths for Cu and Pb, a wide range between 2.77-3.40µm (3600-2923 cm −1 ) exhibits correlations lower than −0.6 (possibly assigned to phyllosilicates).Apart from this, the range between 8.26 and 9.30 µm (1210-1070 cm −1 ) shows negative correlations lower than 0.6 for Cu and lower than 0.55 for Pb.This demonstrates that Cu and Pb are associated with phyllosilicates and quartz as well as with the organic phases in the studied soils.

Discussion
The PARACUDA ® engine, due to its automatic processing, allowed a systematic assessment of what effects preprocessing techniques have on prediction model validity.Some preprocessing strategies have been tested before; Gholizadeh et al., 2015 [10] achieved the best results to predict potentially toxic elements (As, Cd, Cu) using Support Vector Machine Regression (SVMR) when employing Sawitzky-Golay smoothing together with derivative analysis as a pretreatment technique.Nocita et al., 2014 [59] tested several preprocessing techniques (CR, SNV, Sawitzky-Golay smoothing filter and the 1st and 2nd derivative) prior to employing PLSR modelling of soil organic content (SOC) when using the Lucas dataset and concluded that the Sawitzky-Golay smoothing and 1st derivative improved model prediction validity while the 1st derivative analyses improved the model's performance for organic soil the most.In this study, the evaluation of 2880 models confirmed that the preprocessing of spectral data prior to modelling is an important step.all of the soil attributes, the models achieved the best results when some kind of preprocessing techniques were employed (usually a combination of two or three of them).For most of the models run, smoothing (smoothing and/or final smoothing) was beneficial, especially in combination with derivative transformation (1st or 2nd derivatives).
PARACUDA ® was used to find the best estimation models of diverse soil attributes reaching the highest validity.A similar systematic approach was tested by Zhao et al., 2015 when using a processing trajectory to optimize non-systematic parameters (e.g., spectral pretreatment, latent factors and variable selection) and demonstrating better efficiency of using this approach to model the relative content of water in corn.It can be concluded that the data mining engine presented here can serve as an efficient tool to find the best prediction models.Apart from CEC, it was possible to find satisfactory predictions for the other modelled attributes (Cu considered as excellent prediction: RPD and R 2 test higher than 3.0 and 0.9 respectively; C org , BC, BS, Pb, Hg models denoted as good prediction: RPD values from 2.5-3.0 and R 2 test between 0.82 and 0.90).Approximate quantitative predictions were achieved for As, Zn and pH (RPD values from 2.0 to 2.5 and R 2 test between 0.66 and 0.81).The values of CEC generally characterizing the studied Norway spruce forest soil dataset were high (mean: 109.8 mmol(+)/kg, standard deviation: 32.7 mmol(+)/kg) when compared to the studies published by [60,61].This can be the reason why the CEC prediction was worse (RPD = 1.48,R 2 test = 0.62).For heavy metals, the results are comparable or even better than those published by [10] when employing SVM modelling together with the 1st derivative as a pretreatment technique or by Bray et al. (2009) [62] who used an ordinal logistic regression technique to predict Zn, Cu, Pb and Cd.
However, it needs to be noted that the results published by different groups or researchers are not consistent in the manner of defining the accuracy that can be achieved in predicting different heavy metals or other attributes which do not show any direct spectral chromophores.As they do not show any direct spectral chromophores, the mechanism of detecting them using infrared spectroscopy is thus attributed to their relationship with other soil components that have direct spectral chromophores.Therefore, for different case studies, such attributes can be predicted at different accuracy levels.For instance, heavy metals can be modelled due their associations with soil components such as: organic matter, clay minerals and Fe/Al oxides.These indirect relationships have previously been explored in pedotransfer functions (PTFs) [28].For instance [63,64] demonstrated that various PTFs can be used to predict the adsorption of metals as a function of soil carbon content, CEC and pH.Furthermore, Choe et al. (2008) [65] proposed a binding mechanism based on the surface complex model where metals can bind to the hydroxylated mineral surface thus affecting the absorptions characteristic for Fe, Al, Mn oxides and clay minerals or organic matter.Therefore, in each case study, the results depend on the local geochemical conditions and thus the results are highly site-specific.
Due to the limited number of samples analysed (40 samples) and due to the fact that they were collected within one region, these results are also site-specific.However, it shows how the processing scheme presented here and the PARACUDA ® engine can be used to find the best prediction models as well as to get a better understanding of the geochemical properties of the samples studied.Employing statistical modelling with the aid of the system, allowed two groups or mineral associations to be defined: First group-Zn, As and pH-which can be predicted in a more accurate way using high spectral resolution spectra covering the whole optical range (ASD reflectance, 0.4-2.5 µm, Scenario 3)-and the second group-BC, BS, Pb, Hg, and Cu-for which more reliable predictions can be achieved when working with the spectrum covering shorter optical range and the thermal region (MWIR and LWIR, Scenario 1).Basically the same groups were defined when analysing the linear correlations among their chemical concentrations (Figure 2).It has been explained that pH, as well as Zn and As, show indirect spectral indications to iron oxides, organic matter and clay minerals (Figure 5).This relationship was also described in [33] where multivariate statistics were employed on a chemical analysis characterizing diverse horizons within the soil profile in order to define a conceptual model for the chemical and biochemical processes taking place in the studied soil-tree system.On the other hand, the second group (BC, BS, Pb, Hg, and Cu) has a stronger relationship to phyllosilicate content together with organic component (C org ).Logically, different TPHs can be used to predict these two groups (pH, Zn, As vs. BC, BS, Pb, Hg, and Cu).It has been demonstrated that the statistical analysis of spectral assignments can add important and reliable information about the non-chromophoric properties, just as laboratory methods provide.Furthermore, the indirect relationships to diagnostic soil constituent absorption features can be analysed to help to resolve the heavy metal's abundances and geochemical form (e.g., speciation) despite the argument some scientists have raised [66].This is key information for understanding the chemical behaviour of heavy metals in the environment studied.
Based on these results, it is possible to support the suggestion given by [61], that the use of portable VIS-NIR-SWIR together with MWIR-LWIR instruments has a potential to replace many conventional techniques of soil analysis.Whereas optical field spectrometers are becoming very popular and more and more frequently used, the price for a MWIR-LWIR portable instrument is still too high and remains unaffordable for most researchers.Nonetheless, as recently also demonstrated by others [12,67] these results strengthen the idea that future satellite systems should be designed to work in both the optical and thermal regions (e.g., HyspIRI) in order to better evaluate heavy metals on the Earth's surface as well as to improve the modelling capability of attributes in complex soil systems.This is also relevant for field and laboratory work.

Conclusions
The APA data mining engine termed PARACUDA ® proved to be a powerful tool to assess the spectral performances of different spectrometers in order to achieve the best prediction models.This task cannot be achieved by the traditional and regular method that uses manual analyses and a subjective consideration by an expert/operator.For 10 soil attributes, the evaluation of 2880 models proved that preprocessing spectral data prior to modelling is an important step.For all of the attributes, the models achieving the best results employed some kind of preprocessing techniques (usually a combination of two or three of them).For most of the models run, smoothing (smoothing and/or final smoothing) was beneficial, especially in combination with derivative transformation (1st or 2nd derivatives).
Using the engine it was possible to find excellent (Cu) or good models (C org , BC, BS, Pb, Hg) for most of the modelled attributes.Approximate quantitative predictions were achieved for As, Zn and pH.On the other hand, for the tested soils, the CEC was found to be difficult to predict even when using the PARACUDA ® and being able to assess 288 different models for this soil attribute.This can be explained by the rather high CEC values characterizing this forest soil dataset.
The information from the thermal region (MWIR and LWIR) brought significant benefits to modelling base cations (BS), base saturation (BS) and organic C (C org ) as well as such heavy metals as Pb, Hg, Cu and As.On the other hand, the wider optical-range coverage (ASD spectrometer) was more beneficial for modelling pH and Zn.These two attributes showed a high affinity to iron oxides, thus, in this case, the spectra acquired by the ASD spectrometer, which covers the optical range between 0.35-2.5 µm at a high spectral resolution, resulted in better predictions.The sensitivity analyses allowed identification of the wavelengths across the optical and thermal regions that were found to be important to predict selected soil attributes.Furthermore, it has been shown how the processing scheme presented here and PARACUDA ® can be used to obtain a better understanding of the geochemical properties of the samples studied (e.g., mineral associations).This is key information for understanding the chemical behaviour of heavy metals in the environment studied.

Figure 2 .
Figure 2. Colour-coded correlation matrix calculated for the soil attributes studied.(Note: ** Correlation is significant at the 0.01 level, * correlation is significant at the 0.05 level).

Figure 2 .
Figure 2. Colour-coded correlation matrix calculated for the soil attributes studied.(Note: ** Correlation is significant at the 0.01 level, * correlation is significant at the 0.05 level).

Figure 2 .
Figure 2. Colour-coded correlation matrix calculated for the soil attributes studied.(Note: ** Correlation is significant at the 0.01 level, * correlation is significant at the 0.05 level).

Figure 4 .
Figure 4. Simplified processing scheme (in total 2880 PLSR models were computed and assessed for 10 soil attributes using three spectral scenarios while testing eight selected pretreatment techniques).

Figure 4 .
Figure 4. Simplified processing scheme (in total 2880 PLSR models were computed and assessed for 10 soil attributes using three spectral scenarios while testing eight selected pretreatment techniques).

Figure 6 .
Figure 6.Spectral sensitivity calculated for the FSR spectra: VIS-NIR-SWIR range 0.6-2.5 µm (y-axes showing the Person's correlation coefficients).Enhanced: the important wavelengths in the VIS-NIR indicating the inter-correlation with Fe-oxides and the absorptions in the SWIR region at 2.21 µm and 2.34 µm which indicate inter-correlations with clay minerals and organics, respectively.

Figure 6 .
Figure 6.Spectral sensitivity calculated for the FSR spectra: VIS-NIR-SWIR range 0.6-2.5 µm (y-axes showing the Person's correlation coefficients).Enhanced: the important wavelengths in the VIS-NIR indicating the inter-correlation with Fe-oxides and the absorptions in the SWIR region at 2.21 µm and 2.34 µm which indicate inter-correlations with clay minerals and organics, respectively.

Figure 7 .
Figure 7. Spectral sensitivity calculated for the FSR spectra MWIR-LWIR range 2.5-13.6 µm/4000-700 cm −1 (y-axes showing the Person's correlation coefficients).As results published by the scientific community are commonly published on wavelength (µm) and wavenumber (cm −1 ) scales, both scales are used together in this study to make the results comparable and universal.

Figure 7 .
Figure 7. Spectral sensitivity calculated for the FSR spectra MWIR-LWIR range 2.5-13.6 µm/4000-700 cm −1 (y-axes showing the Person's correlation coefficients).As results published by the scientific community are commonly published on wavelength (µm) and wavenumber (cm −1 ) scales, both scales are used together in this study to make the results comparable and universal.

Table 2 .
Descriptive statistics of the soil parameters studied.

Table 2 .
Descriptive statistics of the soil parameters studied.

Table 2 .
Descriptive statistics of the soil parameters studied.

Table 3 .
Technical specifications of the two spectrometers used.

Table 3 .
Technical specifications of the two spectrometers used.

Table 4 .
Statistics on the spectral preprocessing techniques of the best models found for the selected soil attributes while using PLSR modelling (SM: smoothing, FSM: final smoothing, MSC: Multiplicative Scatter Correction, SNV: Signal Normal Variate, ABS: absorbance, CR: continuum removal, FD: first derivative, SD: second derivative, (n): number of models found fulfilling the selection criteria: employed maximum of up to five PC's, reached RPD higher than 1.5 and achieved the highest R 2 test, * model RPD was <1.5).
Demonstrating the Benefit of Scenario 1 When Optical and Thermal Regions (0.6-13.3 µm) Are Combined