Machine Learning Based On-Line Prediction of Soil Organic Carbon after Removal of Soil Moisture Effect

Abstract: It is well documented in visible and near-infrared reflectance spectroscopy (VNIRS) studies that soil moisture content (SMC) negatively affects the prediction accuracy of soil attributes. This work was undertaken to remove the negative effect of SMC on the on-line prediction of soil organic carbon (SOC). A mobile VNIR spectrophotometer with a spectral range of 305–1700 nm and a spectral resolution of 1 nm (CompactSpec, Tec5 Technology, Germany) was used for the spectral measurements at four farms in Flanders, Belgium. A total of 381 fresh soil samples were collected and divided into a calibration set (264) and a validation set (117). The validation samples were processed (air-dried and ground) and scanned with the same spectrophotometer in the laboratory. Three SMC correction methods, namely, external parameter orthogonalization (EPO), piecewise direct standardization (PDS), and orthogonal signal correction (OSC), were used to correct the on-line fresh spectra based on their corresponding laboratory spectra. Then, the Cubist machine learning method was used to develop calibration models of SOC using the corrected on-line spectra of the calibration set. Results indicated that EPO-Cubist outperformed PDS-Cubist and OSC-Cubist, with considerable improvements in the prediction of SOC (coefficient of determination (R²) = 0.76, ratio of performance to deviation (RPD) = 2.08, and root mean square error of prediction (RMSEP) = 0.12%) compared with the corresponding uncorrected on-line spectra (R² = 0.55, RPD = 1.24, and RMSEP = 0.20%). It can be concluded that SOC can be accurately predicted on-line using the Cubist machine learning method after removing the negative effect of SMC with the EPO method.


Introduction
Organic matter, and consequently soil organic carbon (SOC), is a key component of soil that affects its physicochemical properties such as soil structure, water holding capacity, and cation exchange capacity (CEC) [1], in addition to its direct influence on soil resistance to erosion [2]. Therefore, the spatial measurement of SOC content is essential for a wide range of environmental and agricultural applications [3]. Traditional laboratory procedures for determining SOC are costly, destructive, and time-consuming. Hence, there is an increasing need for rapid, cost-effective, nondestructive, and sufficiently accurate approaches for predicting SOC under field conditions using either portable or on-line sensing infrastructure [4,5].
Visible and near-infrared reflectance spectroscopy (VNIRS) is reported to be a promising technology for soil analysis [4,6]. Due to the availability of robust and portable detectors, VNIRS has been widely used for the in situ off-line and on-line prediction of various soil properties [7–9], including SOC [10,11].

Study area
The study area comprised four farms with a total area of 105 ha at Melle (50° 59′ 6″ N, 3° 49′ 8″ E), Veurne (51° 1′ 18″ N, 2° 35′ 10″ E), Huldenberg (50° 48′ 38″ N, 4° 34′ 47″ E), and Landen (50° 45′ 7″ N, 5° 6′ 4″ E) in Flanders, Belgium (Figure 1). The study area is characterized by a temperate maritime climate with a mean annual temperature between 6 and 10 °C and annual precipitation between 750 and 1000 mm. The Melle farm included one flat field of about 6 ha, with an elevation between 4 and 5 m asl and a soil texture varying from clay to clay loam. The Veurne farm had three fields with a total area of about 20 ha, an elevation between 2 and 3 m asl, and a soil texture varying from clay to clay loam. This farm is affected by salinity, as it is located very close to the North Sea and is subject to salt-water intrusion. The Huldenberg farm (35 ha) had four fields with a relatively large elevation variation of 85 to 90 m asl and a soil texture varying from sandy loam to loam. The Landen farm included three fields of about 44 ha that were almost flat, except for the smallest field, where the elevation is higher in the middle part. The soil texture of this farm varies from sandy loam to loam. All farms are cultivated with wheat (or barley), maize, and potato crops in rotation.

On-line Vis-NIR Measurements and Soil Sampling
An on-line spectral survey was carried out using the on-line soil sensing platform developed by Mouazen [37]. It consists of a medium-deep subsoiler attached to a metal frame, a differential global positioning system (DGPS), and a rugged computer. A description of this sensing platform can be found in Mouazen et al. [7] and Nawar and Mouazen [9]. The spectral survey was performed using a CompactSpec mobile, fibre-type VNIR spectrophotometer (305–1700 nm) with a sampling interval of 1 nm (Tec5 Technology, Germany). A 50-watt halogen lamp was used as the light source. Light was transferred to the soil by means of a dual optical fibre, and the diffusely reflected light was collected by the same fibre. An optical probe containing a lens holder and protected by mild steel was attached to the back of the subsoiler chisel. The soil spectra were collected in diffuse reflectance mode from the smoothed bottom of the trench (15–25 cm deep) opened by the subsoiler itself, the smoothing being due to the downward vertical forces acting on the chisel. The subsoiler with the retrofitted optical probe was attached to a frame, which was mounted onto the three-point linkage of a tractor (Figure 2). A white Spectralon disc with about 98% reflectance was used for calibration once every 30 min. The positions of the spectra were recorded using a DGPS (Trimble AG25, USA).

Soil spectra together with GPS data were logged on a rugged laptop computer using a standard data acquisition system. The on-line sensing for all farms was carried out along parallel transects 12 m apart at a travel speed of around 3.5 km/h. The soil scanning was carried out in summer (August to October) 2018, when the weather conditions were extremely warm and relatively dry.
Figure 2. On-line soil sensing platform [37], showing the main components (right) and the on-line spectral data acquisition (left).


Soil Samples and the Experiment
The fresh samples (381) were divided into a calibration dataset (264 samples), collected from Huldenberg, Veurne, and Melle, and an independent validation set (117 samples), collected from Landen (Table 1). The fresh samples were mixed and reduced to 300 g per sample using the quartering method. Non-soil materials such as stones, gravel, grass, and roots were manually removed. The fresh samples of the validation set were then ground, air-dried, and passed through a 2 mm sieve, after which they were scanned in the laboratory with the same spectrophotometer. Three Petri dishes, 5 cm in diameter and 2 cm deep, were used for each soil sample. After the samples were placed into the [...]
SOC was determined in the laboratory using the dry combustion method, following the Dumas principle (ISO 10694; CMA/2/II/A.7; BOC). Before the SOC determination, total inorganic carbon (TIC) compounds were removed by treating the soil samples with hydrochloric acid.

Spectra Pretreatments
The three datasets (calibration, validation, and the transfer set of wet and dry spectra) were subjected to the same spectral pretreatment. First, the noisy parts at both ends of the spectra were cut, retaining the 400–1675 nm range for spectral analysis and modeling. Next, reflectance was converted to absorbance (log(1/reflectance)) and smoothed with the Savitzky-Golay algorithm [38] using a window size of 23 and a second-order polynomial, the settings that provided the best predictions. Finally, the standard normal variate (SNV) transformation [39] was applied to each spectrum to remove baseline effects and bring the spectra onto a common, comparable scale.
Figure 3 depicts the flow chart of the steps taken during model calibration and validation. First, the fresh spectra of both the calibration and the on-line validation sets were treated identically and used to calibrate and validate the Cubist model for SOC prediction without correction for SMC; these results are referred to as the noncorrected prediction of SOC. Then, the three correction methods (EPO, PDS, and OSC) were used to develop transformation matrices based on the on-line fresh spectra and the spectra of the corresponding dry samples (117 samples). The transformation matrices were then applied to the fresh calibration and on-line validation spectra, after which the EPO-Cubist, PDS-Cubist, and OSC-Cubist models were developed and validated. The output of these models is referred to as the corrected SOC prediction. Finally, the corrected models were compared with the noncorrected Cubist model to evaluate their performance.
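The pretreatment chain described above (trimming to 400–1675 nm, conversion to absorbance, Savitzky-Golay smoothing, and SNV) can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation, which used R packages. The function name `preprocess`, the synthetic input, and the choice of a base-10 logarithm for absorbance are assumptions made for the example.

```python
import numpy as np
from scipy.signal import savgol_filter


def preprocess(reflectance, wavelengths, lo=400.0, hi=1675.0):
    """Trim, convert to absorbance, smooth, and SNV-normalize spectra.

    reflectance: (n_samples, n_wavelengths) array with values in (0, 1].
    wavelengths: (n_wavelengths,) array in nm.
    """
    # 1) Cut the noisy ends, keeping the 400-1675 nm range.
    keep = (wavelengths >= lo) & (wavelengths <= hi)
    refl = reflectance[:, keep]
    # 2) Absorbance as log(1/R); base 10 is assumed here.
    absorbance = np.log10(1.0 / refl)
    # 3) Savitzky-Golay smoothing: window of 23 points, 2nd-order polynomial.
    smoothed = savgol_filter(absorbance, window_length=23, polyorder=2, axis=1)
    # 4) SNV: center and scale each spectrum by its own mean and SD.
    mean = smoothed.mean(axis=1, keepdims=True)
    std = smoothed.std(axis=1, ddof=1, keepdims=True)
    return wavelengths[keep], (smoothed - mean) / std


# Synthetic stand-in for spectra recorded at 1 nm steps from 305 to 1700 nm.
rng = np.random.default_rng(0)
wl = np.arange(305, 1701, dtype=float)
raw = rng.uniform(0.2, 0.6, size=(5, wl.size))
wl_kept, spectra = preprocess(raw, wl)
```

The same function would be applied unchanged to the calibration, validation, and transfer spectra so that all three sets share one pretreatment.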

External Parameter Orthogonalization (EPO)
The concept of the EPO algorithm is to eliminate the effects of external parameters by projecting the spectral data onto the space orthogonal to the subspace in which the variations generated by these parameters occur [19]. The mathematical description of EPO can be found in the literature [15,19]. In EPO, the spectral matrix X is decomposed into three components: a useful component (XP) related to the chemical response, a parasitic component (XQ) generated by the external parameters, and the spectral noise N, as shown in Equation (1):
$$X = XP + XQ + N \qquad (1)$$
The aim is to isolate the useful component XP by means of the difference matrix D, calculated as the difference between the spectra with the external effect (on-line spectra) and without it (dry spectra). P and Q are the projection matrices of the useful and parasitic components of the spectra, respectively. Q can be calculated through a singular value decomposition (SVD) of D, and the projection matrix P is then calculated as P = I − Q, where I is the identity matrix. The number of EPO components g is an essential parameter that must be defined during EPO development [15,19]. It can be determined from the cross-validation results of PLSR on the transformed spectra. In this research, the optimal value of g was defined based on PLSR cross-validation using 1 to 6 latent variables (LVs).
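A minimal numerical sketch of the EPO projection described above is given below, assuming paired wet (on-line) and dry (laboratory) spectra of the same samples. The function name `epo_projection` and the synthetic data are hypothetical. Q is built here from the first g right singular vectors of D, which is one common way of realising the SVD step.

```python
import numpy as np


def epo_projection(x_wet, x_dry, g):
    """Return the EPO projection matrix P = I - Q.

    x_wet, x_dry: (n_samples, n_wavelengths) paired spectra.
    g: number of EPO components (parasitic directions) to remove.
    """
    d = x_wet - x_dry                       # difference matrix D
    # Right singular vectors of D span the moisture-induced directions.
    _, _, vt = np.linalg.svd(d, full_matrices=False)
    v_g = vt[:g].T                          # (n_wavelengths, g)
    q = v_g @ v_g.T                         # projector onto parasitic subspace
    return np.eye(d.shape[1]) - q           # P = I - Q


# Synthetic example: a single "moisture" signature m added with varying strength.
rng = np.random.default_rng(0)
x_dry = rng.normal(size=(6, 20))
m = rng.normal(size=20)
strength = rng.uniform(0.5, 1.5, size=6)
x_wet = x_dry + np.outer(strength, m)

p = epo_projection(x_wet, x_dry, g=1)
x_wet_corrected = x_wet @ p                 # corrected spectra are X P
x_dry_corrected = x_dry @ p
```

In this idealised rank-one case the projected wet and dry spectra coincide. With real spectra the agreement depends on the choice of g, which is why the text selects it by cross-validation.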


The Piecewise Direct Standardization Algorithm (PDS)
Piecewise direct standardization (PDS) [21] is a common method for relating each wavelength in the master spectra (e.g., dry spectra) to those of the secondary spectra (e.g., field spectra). PDS has two advantages: it requires only a small number of samples in the transfer set, and its multivariate nature provides a noise-filtering effect. In this study, the transfer parameters of PDS were determined by establishing a linear relationship between the transfer samples (dry) and the corresponding on-line fresh samples (validation). The absorbance of the dry spectra at each wavelength was related to the absorbance of the on-line spectra at the wavelengths located in a predefined small window around that wavelength [40]. The on-line spectra of both the calibration and validation sets were then standardized using the PDS parameters, allowing a direct comparison with the dry spectra. Applying PDS requires choosing the number of PLSR LVs and the size of the wavelength window (SW). More details about the PDS algorithm can be found in the literature [21,22]. In this work, PDS was tested with different window sizes (SW = 3, 5, 11, 21, 31, 41) and numbers of PLSR LVs (NF = 1 to 10).
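The windowed regression at the core of PDS can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation used in the study: the local models are fitted by principal component regression (a truncated-SVD pseudo-inverse) instead of PLSR, and the names `fit_pds`, `apply_pds`, and `half_window` are hypothetical. The window size SW in the text corresponds to `2 * half_window + 1`.

```python
import numpy as np


def fit_pds(master, slave, half_window=2, n_components=2):
    """Fit a banded transfer matrix mapping slave spectra onto master spectra.

    master, slave: (n_samples, n_wavelengths) paired transfer spectra
    (e.g., dry laboratory spectra and fresh on-line spectra).
    """
    m_mean, s_mean = master.mean(axis=0), slave.mean(axis=0)
    mc, sc = master - m_mean, slave - s_mean
    n_wl = master.shape[1]
    f = np.zeros((n_wl, n_wl))
    for j in range(n_wl):
        lo, hi = max(0, j - half_window), min(n_wl, j + half_window + 1)
        z = sc[:, lo:hi]                      # slave window around wavelength j
        u, s, vt = np.linalg.svd(z, full_matrices=False)
        # Keep at most n_components non-negligible singular directions.
        k = min(n_components, int(np.sum(s > 1e-12 * s[0])))
        # Regress master wavelength j on the slave window (PCR solution).
        b = vt[:k].T @ ((u[:, :k].T @ mc[:, j]) / s[:k])
        f[lo:hi, j] = b
    return f, s_mean, m_mean


def apply_pds(spectra, f, s_mean, m_mean):
    """Standardize new slave-type spectra with a fitted PDS model."""
    return (spectra - s_mean) @ f + m_mean


# Synthetic check: slave spectra are a scaled and shifted copy of the master.
rng = np.random.default_rng(0)
master = rng.normal(size=(10, 8))
slave = 0.8 * master + 0.1
f, s_mean, m_mean = fit_pds(master, slave, half_window=1, n_components=3)
standardized = apply_pds(slave, f, s_mean, m_mean)
```

Once fitted on the transfer set, the same `f`, `s_mean`, and `m_mean` would be applied to any other spectra recorded under the secondary (on-line) condition.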

Orthogonal Signal Correction (OSC)
Orthogonal signal correction (OSC) aims to correct a signal by removing information from the spectral data that is irrelevant to the targeted response variable [23]; that is, the spectral information orthogonal to the response variable is removed [41]. The optimal number of OSC components to be removed is normally defined based on PLSR cross-validation, while the matrices X and Y are decomposed using the nonlinear iterative partial least squares (NIPALS) algorithm with the criterion of minimizing the calibration error. The samples used to develop OSC models (the transfer set) comprise samples measured under the various conditions (e.g., different moisture contents) for which the correction is to be carried out. In this work, the optimal number of OSC components to be removed was defined based on PLSR cross-validation using a maximum of 5 LVs. The transfer set used to develop the OSC models consisted of the validation samples measured under laboratory (dry spectra) and field (fresh on-line spectra) conditions.
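The idea of removing spectral variation that is orthogonal to the response can be illustrated with the simplified sketch below. It follows the general iterative scheme of OSC (start from a principal component score, orthogonalize it against y, re-express it as a combination of the spectral variables, then deflate), but it replaces the PLSR inner step with a minimum-norm least-squares solution via the pseudo-inverse. The names `fit_osc` and `apply_osc` are hypothetical, and the sketch should not be read as the procedure used in the study.

```python
import numpy as np


def fit_osc(x, y, n_components=1, n_iter=50, tol=1e-10):
    """Fit a simplified OSC model; returns the mean, weights, and loadings."""
    x_mean = x.mean(axis=0)
    xc = x - x_mean
    yc = y - y.mean()
    weights, loadings = [], []
    for _component in range(n_components):
        # Start from the first principal component score of the current X.
        u, s, _vt = np.linalg.svd(xc, full_matrices=False)
        t = u[:, 0] * s[0]
        x_pinv = np.linalg.pinv(xc)
        for _iteration in range(n_iter):
            # Remove the part of the score that is correlated with y.
            t_orth = t - yc * (yc @ t) / (yc @ yc)
            # Express the orthogonalized score as X w (least-squares weights).
            w = x_pinv @ t_orth
            t_new = xc @ w
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        p = xc.T @ t / (t @ t)               # loading of the OSC component
        xc = xc - np.outer(t, p)             # deflate: remove the component
        weights.append(w)
        loadings.append(p)
    return x_mean, np.column_stack(weights), np.column_stack(loadings)


def apply_osc(x_new, x_mean, weights, loadings):
    """Remove the fitted OSC components from new (mean-centered) spectra."""
    xc = x_new - x_mean
    for w, p in zip(weights.T, loadings.T):
        xc = xc - np.outer(xc @ w, p)
    return xc


rng = np.random.default_rng(0)
x = rng.normal(size=(12, 30))
y = rng.normal(size=12)
x_mean, w_osc, p_osc = fit_osc(x, y, n_components=2)
x_corrected = apply_osc(x, x_mean, w_osc, p_osc)
```

Each removed score is orthogonal to the centered response, so the deflation discards variation that carries no linear information about y in the fitting set.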

Principal Component Analysis (PCA)
Principal component analysis (PCA) was used to explore the differences between the three datasets that resulted from the three correction methods. PCA concentrates the total variation in a dataset in only a few principal components (PCs), with each successive PC explaining a decreasing amount of the variance. This analysis made it possible to identify spectral variations due to the effect of SMC while preserving the majority of the information in the spectral data. The PCA similarity maps of PC1 and PC2 were used to show differences between the dry samples and the corresponding fresh samples after correction.
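A short sketch of how PC1 and PC2 scores and their explained variance could be computed for such similarity maps is shown below. It uses an SVD of the mean-centered spectra; the function names and the three synthetic groups (standing in for the calibration, on-line validation, and dry validation spectra) are assumptions for illustration only.

```python
import numpy as np


def fit_pca(x, n_components=2):
    """Return the mean, loadings, and explained-variance ratios of a PCA."""
    mean = x.mean(axis=0)
    _, s, vt = np.linalg.svd(x - mean, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # decreasing share of the variance
    return mean, vt[:n_components], explained[:n_components]


def pca_scores(x, mean, components):
    """Project spectra onto the fitted principal components."""
    return (x - mean) @ components.T


# Three synthetic groups; the third is shifted to mimic a moisture offset.
rng = np.random.default_rng(0)
calibration = rng.normal(size=(30, 50))
validation_fresh = rng.normal(size=(15, 50))
validation_dry = rng.normal(size=(15, 50)) + 2.0

mean, components, explained = fit_pca(calibration, n_components=2)
scores = {
    "calibration": pca_scores(calibration, mean, components),
    "validation_fresh": pca_scores(validation_fresh, mean, components),
    "validation_dry": pca_scores(validation_dry, mean, components),
}
```

Plotting the two score columns of each group against each other would give a similarity map in which a moisture-related offset appears as a separated cluster.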

Modeling with Cubist
The spectral measurements obtained during the on-line and laboratory (dry) scanning modes were used to build predictive models with Cubist [30], before and after spectral correction with EPO, PDS, and OSC. In principle, the Cubist algorithm constructs a regression tree in which intermediate linear models provide the prediction at each step. The algorithm divides the original data into subsets of similar samples and develops multilinear regression rules by choosing the optimal predictor variables among all of the spectral variables. These rules are connected, and each rule takes the form of a conditional sequence: "if [condition is true] then [regress rule], else [apply the next rule]". If a condition is true, the corresponding regression gives the predicted value; if not, the sequence of if, then, and else is repeated with the next rule [42]. In this study, it is assumed that the Cubist algorithm is capable of recognizing the effective spectral features for constructing a robust multivariate regression model to predict SOC. Cubist was run through the caret R package [43], tuning the two hyper-parameters most likely to have the largest effect on the final model performance (committees and neighbors).
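The "if, then, else" rule form described above can be illustrated with the toy sketch below. It is not the Cubist implementation: the rules, thresholds, and coefficients are invented for the example, and refinements such as committees and neighbor-based adjustment are left out. It only shows how an ordered list of conditions, each paired with a linear model, produces a prediction.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Rule:
    """One rule: a condition on the predictors plus a linear model."""
    condition: Callable[[Sequence[float]], bool]
    intercept: float
    coefficients: Sequence[float]

    def predict(self, x: Sequence[float]) -> float:
        return self.intercept + sum(c * v for c, v in zip(self.coefficients, x))


def predict_with_rules(rules: Sequence[Rule], default: float,
                       x: Sequence[float]) -> float:
    """Apply the first rule whose condition holds; otherwise use the default."""
    for rule in rules:
        if rule.condition(x):        # if [condition is true]
            return rule.predict(x)   # then [regress rule]
        # else: fall through to the next rule
    return default


# Hypothetical rules on two predictors (e.g., absorbance at two wavelengths).
rules = [
    Rule(lambda x: x[0] > 0.5, intercept=1.0, coefficients=(2.0, 0.0)),
    Rule(lambda x: x[1] <= 0.2, intercept=0.5, coefficients=(0.0, 1.0)),
]
prediction = predict_with_rules(rules, default=1.3, x=(1.0, 0.0))
```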
To evaluate model performance, four parameters were used: the root mean squared error (RMSE), the coefficient of determination (R²), the ratio of performance to deviation (RPD), and the ratio of performance to the inter-quartile range (RPIQ) [44]. The spectral data processing and modeling were performed using the R packages pls [45], prospectr [46], and caret [43].
Table 2 shows the summary statistics of SOC and SMC in the calibration and on-line prediction datasets. SOC ranged between 0.86% and 2.40% for the calibration set and between 0.96% and 2.04% for the validation set, with median and mean values of 1.28% and 1.34% for the former and 1.27% and 1.33% for the latter. The standard deviation (SD) was 0.33% for the calibration set and 0.25% for the validation set. These data confirm that the SOC range of the validation set falls within that of the calibration set, which is necessary to ensure the validity of the model over the range covered by the validation set.
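The four evaluation metrics named above can be computed as in the following sketch. The definitions used here are the commonly used ones and are assumptions as far as this study is concerned: RPD is taken as the SD of the observed values divided by the RMSE, RPIQ as the inter-quartile range divided by the RMSE, and R² as one minus the ratio of residual to total sum of squares (some studies report the squared correlation instead). As a quick check against the reported figures, dividing the validation SD of 0.25% by an RMSEP of 0.12% gives about 2.08, which agrees with the RPD reported for EPO-Cubist.

```python
import numpy as np


def regression_metrics(observed, predicted):
    """Return RMSE, R2, RPD, and RPIQ for paired observed/predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = observed - predicted
    rmse = float(np.sqrt(np.mean(residuals**2)))
    ss_res = float(np.sum(residuals**2))
    ss_tot = float(np.sum((observed - observed.mean())**2))
    q1, q3 = np.percentile(observed, [25, 75])
    return {
        "rmse": rmse,
        "r2": 1.0 - ss_res / ss_tot,
        "rpd": float(observed.std(ddof=1)) / rmse,   # SD / RMSE
        "rpiq": float(q3 - q1) / rmse,               # IQR / RMSE
    }


# Small made-up example (values in % SOC).
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])

# Consistency check with the reported validation SD and RMSEP.
rpd_from_reported_values = 0.25 / 0.12   # about 2.08
```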

Spectral Data and Correction Methods
SMC for the calibration set ranged between 2.28% and 24.59%, with a mean and median of 12.28% and 13.03%, respectively. The SMC of the on-line validation set ranged between 11.27% and 25.03%, with a mean and median of 19.40% and 20.29%, respectively. SMC was thus relatively high at the time of on-line measurement (Table 2). Figure 4 shows the spectra of the three datasets before (Figure 4a) and after (Figure 4b-d) applying the three correction methods for the SMC effect. After EPO, which is derived from subtracting the spectra of the dry samples from the corresponding spectra of the on-line fresh samples, only minor differences between the spectra remain (Figure 4b), indicating that the effect of SMC has been completely removed. With PDS and OSC (Figure 4c,d, respectively), the moisture effect has not been completely eliminated, in particular for OSC, where the variation between the three sets of spectra remains clear compared with the results of EPO and, to some extent, PDS.
Figure 5 compares the principal component similarity maps of the first two principal components (PC1 and PC2), derived from the PCA carried out on the on-line calibration spectra, the laboratory dry validation spectra, and the on-line validation spectra. These components accounted for 55.3% and 35.5%, respectively, of the total variation in the calibration set of the uncorrected data (Figure 5a). The influence of SMC on the grouping and separation of the three sets, namely, the fresh on-line calibration, the fresh on-line validation, and the dry laboratory validation spectra, can be clearly observed. The separation is particularly clear for the dry validation spectra, which show only a minor overlap with the calibration spectra. After correcting for the effect of SMC by applying EPO to all three datasets (Figure 5b), the three groups of spectra overlap, indicating that the SMC effect has indeed been eliminated from the corrected spectra.

Principal Component Space of EPO, PDS, and OSC Datasets
The projection of the calibration and validation sets in PC space showed different patterns depending on the correction method applied. Without correction, the convex hulls of the fresh sets (both the calibration and on-line validation sets) differ noticeably from that of the dry (laboratory validation) set (Figure 6a). When projecting the EPO-corrected fresh and dry spectra in PC space, the convex hulls of the on-line and laboratory validation sets coincided with each other, with both deviating from that of the on-line calibration set by almost 90°. The centroid of the convex hull of the on-line validation (fresh) spectra overlaid that of the laboratory dry spectra (Figure 6b), whereas the centroid of the calibration set deviated from both validation centroids. With the PDS correction, the convex hulls of the three sets coincided well, with a small deviation observed for the on-line calibration set, and the centroids of the convex hulls of the three datasets almost overlaid each other (Figure 6c). The OSC correction gave the worst results, as shown by the deviation between the convex hulls of the three sets. Here, the centroids of the convex hulls of the three datasets did not match (Figure 6d), similar to the uncorrected spectra (Figure 6a).

Cubist Modeling after Spectral Correction
In both cross-validation and prediction, the EPO-Cubist model outperformed the PDS-Cubist and OSC-Cubist models. For the cross-validation, EPO-Cubist showed a modest improvement over Cubist without spectral correction, with RMSE, R², RPD, and RPIQ of 0.11%, 0.89, 2.95, and 3.393, respectively (Table 3 and Figure 7b). PDS showed a smaller improvement (RMSE = 0.12%, R² = 0.87, RPD = 2.73, and RPIQ = 3.64) than EPO, but a slightly better performance than OSC (RMSE = 0.12%, R² = 0.84, RPD = 2.66, and RPIQ = 3.55) (Table 3; Figure 7c,d).
The same trend can be observed for the on-line prediction, with the best performance obtained with EPO (RMSE = 0.12%, R² = 0.76, RPD = 2.08, and RPIQ = 2.83), followed by PDS and then OSC.
Table 3 and Figure 7a show that the Cubist cross-validation without spectral correction resulted in a good performance, with RMSE, R², RPD, and RPIQ of 0.15%, 0.74, 1.99, and 3.23, respectively. The on-line prediction was less accurate (RMSE = 0.20%, R² = 0.55, RPD = 1.24, and RPIQ = 1.69).
Table 3. Quality of prediction models of soil organic carbon (SOC) obtained from the Cubist modeling for uncorrected spectra (Cubist) and for spectra corrected for soil moisture content (SMC) using external parameter orthogonalization (EPO-Cubist), piecewise direct standardization (PDS-Cubist), and orthogonal signal correction (OSC-Cubist).


Variable Importance before and after Spectra Correction
The heat map of the variable importance analysis indicates that the same variables are important for all models developed in this research (Figure 8). The spectral regions at 406-436, 566-576, 656-666, 786-836, 1026-1036, 1406-1456, 1498-1536, and 1576-1606 nm are the most important bands for predicting SOC. In the VIS range, the bands at 406-436, 566-576, and 656-666 nm are located between the red absorption band (680 nm) and the blue band (450 nm) and are attributed to the electron transitions associated with soil colour [47]. In the NIR range, the 786-836 nm band is associated with the C-H bond at 825 nm, and the 1026-1036 nm band is near the absorption feature at 1035 nm associated with the aromatic hydrocarbon (C-H) bond [26]. The band at 1406-1426 nm lies near the absorption peak at about 1400 nm, which is related to the second overtone of O-H absorption at 1450 nm [48]. The bands at 1498-1536 and 1576-1606 nm are related to the first overtone of C-H, O-H, and N-H bonds [47].

Table 1 indicated that the SD and the range of SOC are comparable for the calibration and validation sets. The concentration range or SD of the target soil property can influence the model prediction accuracy [48]. For good prediction, the range of the validation set should be within the range of the calibration set [5]. However, a larger range or SD will result not only in a higher R² and RPD, but also in a higher RMSEP [48]. Indeed, the narrow range of SOC of both the validation and calibration sets (0.68 to 2.40%) influenced the prediction accuracy obtained in this study.

Remote Sens. 2020, 12, 1308

Soil and Spectral Data Analysis
SMC was relatively high, particularly in the on-line validation set. Consequently, the effect of SMC on the spectra is potentially high. Although alterations in soil reflectance can be related to variations in SMC, SOC, and texture [49], on-line data acquisition can also induce spectral variability due to machine vibration, ambient light, and variation in the sensor-to-soil distance and angle [7]. The effect of SMC on soil VNIR spectra has been well reported in earlier studies [15,50,51], whose findings are consistent with our results. Figure 4 demonstrates that the albedo of the on-line spectrum is generally lower than that of the laboratory spectrum, although the absorption peak of the second O-H overtone at 1450 nm is larger. The lower albedo of the on-line spectrum might be attributed to the illumination conditions, plant debris, and variation in the sensor-to-soil distance and inclination [52–54]. Therefore, the main difference between the uncorrected on-line and laboratory spectra can be attributed to the spectral intensity and not to the spectral signature. This difference is due to the different SMC and other ambient conditions encountered during the on-line measurement. It was therefore assumed that the on-line data are of sufficient quality for further spectral analysis.

The Performance of EPO, PDS, and OSC for Spectral Correction
The results of spectral correction indicate that EPO outperformed both PDS and OSC. EPO performed well in removing the variation in soil absorbance caused by moisture, since the EPO transformation resulted in spectra identical to those of the dry sample (Figure 4). The PC projection plot confirmed the best performance of EPO, as the centroids of the convex hulls for the on-line validation set were surrounded by the convex hulls of their corresponding dry spectra. The convex hulls of the on-line and laboratory dry spectra coincided well, with minor deviation (Figure 6). These results are in line with the findings of Chakraborty et al. [20] for EPO. Similarly, PDS proved capable of correcting the spectra for the moisture effect, although it performed less well than EPO. Examining the PC projection, a notable match between the three convex hulls can be observed with only slight deviations, which might be attributed to the noise at the two ends of the transformed spectra (Figure 4c), as PDS works with a moving window of data [22]. The poor match between the convex hulls of the laboratory dry spectra, the on-line spectra, and the on-line spectra corrected by OSC, shown in Figure 6d, explains the poorest results of OSC in predicting SOC. In this case, the centroids of the convex hulls of the three datasets were dispersed without any matching tendency. Taken together, these results confirm that the EPO transformation successfully corrected the spectra for the moisture effect, indicating the potential of EPO to result in the best Cubist model prediction accuracy for SOC.
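The principle behind the EPO correction can be sketched as follows. This is a minimal illustration on synthetic absorbance data, not the authors' implementation: the function name epo_projection, the choice of a single component, and the simulated moisture effect (a baseline shift plus a peak near 1450 nm) are assumptions. The sketch follows the commonly described EPO formulation, in which the difference between paired moist and dry spectra is decomposed by SVD and the leading directions are projected out.

```python
import numpy as np

def epo_projection(x_wet, x_dry, n_components):
    """Build an EPO projection matrix from paired moist and dry spectra.

    x_wet, x_dry: arrays of shape (n_samples, n_bands), row-paired.
    Returns P of shape (n_bands, n_bands); corrected spectra are X @ P.
    """
    d = np.asarray(x_wet, dtype=float) - np.asarray(x_dry, dtype=float)
    # Right singular vectors of D span the directions of moisture variation.
    _, _, vt = np.linalg.svd(d, full_matrices=False)
    v = vt[:n_components].T                  # (n_bands, n_components)
    return np.eye(d.shape[1]) - v @ v.T      # project onto the complement

# Synthetic example (assumed values): 20 paired spectra, 130 bands.
wavelengths = np.arange(400, 1700, 10)
rng = np.random.default_rng(1)
x_dry = 0.3 + 0.05 * rng.standard_normal((20, wavelengths.size))
moisture_shape = 0.1 + 0.5 * np.exp(-((wavelengths - 1450) / 40.0) ** 2)
moisture_level = rng.uniform(0.5, 1.5, size=(20, 1))
x_wet = x_dry + moisture_level * moisture_shape

p = epo_projection(x_wet, x_dry, n_components=1)
gap_before = np.abs(x_wet - x_dry).max()
gap_after = np.abs(x_wet @ p - x_dry @ p).max()
print(f"max moist-dry gap before: {gap_before:.3f}, after: {gap_after:.2e}")
```

In this idealized case the moisture effect is exactly rank one, so one component removes it completely. With real spectra the number of components is a tuning parameter, and the same projection is applied to every spectrum used for calibration and prediction.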

Performance of Cubist Models before Moisture Correction
The predictive performance of the Cubist model without spectral correction in this research is considered poor (Table 3). A larger RMSEP of 0.31% (0.203% in the present work) was reported by Nawar and Mouazen [9] for the on-line measurement of SOC, using 529 samples combined with multivariate adaptive regression splines (MARS). Kuang and Mouazen [55] estimated SOC with a PLSR model, using a European dataset (425 soil samples) spiked with local samples, and obtained a result (RMSEP = 0.19%) similar to that reported in the present work. The poor prediction performance of Cubist in this research can be attributed to the effect of SMC on the VNIR spectra [51,56], which is in agreement with the literature stating that the prediction of SOC from fresh field spectra without appropriate correction is inaccurate [15,22,55]. This is indeed supported by the similarity map of PC1 and PC2 in Figure 5a, which shows a clear separation between the validation and calibration sets. The laboratory spectra occupied a separate spectral space from the corresponding on-line spectra, with no overlap observed. Another reason might be that the on-line spectra are influenced by other external factors (e.g., noise due to vibration, sensor-to-soil distance variation, ambient light) in addition to SMC.
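The kind of separation seen in a PC1 and PC2 similarity map can be reproduced with a short sketch. It is illustrative only: the function name pca_scores, the synthetic spectra, and the constant offset standing in for the difference between laboratory and on-line spectra are assumptions, and the PCA is computed by SVD of the mean-centred data rather than with whatever software was used in the study.

```python
import numpy as np

def pca_scores(x, n_components=2):
    """Return PCA scores of the rows of x via SVD of the centred data."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic spectra (assumed values): 30 laboratory and 30 on-line, 50 bands.
rng = np.random.default_rng(2)
lab = rng.normal(0.0, 0.02, size=(30, 50))
online = lab + 0.2            # constant offset mimics an intensity difference

scores = pca_scores(np.vstack([lab, online]))
lab_centroid = scores[:30].mean(axis=0)
online_centroid = scores[30:].mean(axis=0)
separation = np.linalg.norm(lab_centroid - online_centroid)
print(f"centroid separation in the PC1-PC2 plane: {separation:.2f}")
```

A large centroid distance relative to the within-group scatter corresponds to the non-overlapping clouds described above; after a successful correction the two centroids would be expected to move close together.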
In general, the variability range of SOC is a fundamental factor that affects model prediction performance [48]. Thus, regression tends to be more successful when the target soil attribute varies widely than when its variability is small. The reason for the rather poor performance of estimating SOC based on the Cubist method in this study may be the narrow range of SOC in the calibration (1.54%) and prediction (1.08%) datasets (Table 2). However, the RMSE values obtained in the current study are not substantially higher than those in the literature, e.g., using random forest [5]. The predictive performance of the current work is of similar accuracy to that reported by Kuang and Mouazen [55] for on-line prediction of SOC at the farm scale using the PLSR technique, with R² of 0.12-0.96 and RPD of 1.07-4.95. Moreover, numerous studies reported SOC prediction results similar to ours [57][58][59], with R² values ranging from 0.55 to 0.79 and RPD from 1.80 to 2.01. The large differences observed in the accuracy of the SOC estimates may be related to the high SOC variability and the SMC effect. Although the calibration set in the present study is based on the on-line collected spectra, which are highly affected by SMC and cover a narrow range of SOC, the prediction accuracies are reasonable, which can be attributed to the capability of Cubist to handle the nonlinearity between the SOC concentration and spectra.
The most effective bands in the VIS range were 406-436 and 656-666 nm, which are located, respectively, around the blue band (450 nm) [17] and the red band (680 nm) associated with electron transitions. It is well-documented that the darker the soil color, the larger the SOC content [47]. In the NIR range, the most effective bands were 786-836, 1026-1036, 1406-1456, 1498-1536, and 1576-1606 nm. The 786-836 nm band is characterized by a broad region around 825 nm, which is associated with aromatic C-H and organic matter [26]. The band 956-1036 nm is associated with the third overtone of O-H (950 nm) [26]. The band 1406-1456 nm is associated with the second overtone of the water absorption band around 1450 nm [7,60]. The 1498-1536 and 1576-1606 nm bands are associated with the first overtone of C-H, O-H, and N-H bonds [47], and are consequently related to the concentration of SOC in the samples.
For SOC estimation in this work, there was no rule for the best fitting of the data, and the prediction was based on the whole VNIR spectral range. It can be clearly observed in the heat map shown in Figure 8 that the NIR spectral range contributed more to the prediction of SOC than the VIS spectral range. This result is in line with previous findings, e.g., [60], in which the NIR spectral range was reported to provide considerably better predictions of SOC than the VIS range. The prediction accuracy of SOC using the whole VNIR spectral range was better than the corresponding accuracy reported for the NIR spectral range only [7].

Performance of the Cubist Models after Correction for Moisture
The algorithms used to eliminate the effect of SMC from the spectral data enhanced the performance of the SOC models. EPO-Cubist yielded a 40% reduction in RMSE for the on-line prediction, which is in agreement with the finding of Ackerson et al. [61], who obtained an error reduction of 63% using fresh field spectra. Ge et al. [55], using rewetted samples, reported an error reduction of 60%. The smaller improvement in the on-line prediction of SOC achieved in this work, compared to that reported elsewhere, can be attributed to the smaller difference in SMC between the on-line validation set and the on-line calibration set (Figure 5a). Nevertheless, the correction methods, in particular EPO, provided reasonable accuracy for the on-line scanned dataset and can be recommended for future research on on-line measurement, not only of SOC but also of other soil properties.
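The 40% figure follows directly from the RMSEP values reported in the abstract (0.20% before correction and 0.12% with EPO-Cubist). A small sketch of the arithmetic, in which the helper name relative_reduction is hypothetical:

```python
def relative_reduction(before, after):
    """Fractional reduction of an error metric relative to its initial value."""
    return (before - after) / before

# RMSEP (%) for uncorrected on-line spectra and for EPO-Cubist (from the abstract).
reduction = relative_reduction(0.20, 0.12)
print(f"RMSEP reduction: {reduction:.0%}")
```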
The finding that EPO-Cubist was the best method to predict SOC suggests that it is not obligatory to use air-dried legacy samples for developing the calibration models, an important conclusion that can ultimately reduce time-consuming laboratory processing. Instead, the on-line collected fresh spectra, which have a wide range of SMC, can be used to estimate SOC after correction for the SMC effect [60]. Both Ackerson et al. [16] and Wijewardane et al. [62] demonstrated that the utilization of EPO-based in situ spectra is essential for generating the initial EPO. Our EPO correction results suggest that the projection matrix based on the on-line spectra and the corresponding air-dry spectra, when applied to an on-line spectral library with varied moisture content, can decrease logistical requirements by efficiently removing the effect of SMC from the spectra and, therefore, improve the prediction accuracy of SOC. The applicability of these results to moisture correction for the on-line prediction of other soil properties having direct or indirect spectral responses in VNIR spectroscopy should be tested further.

Conclusions
This study investigated the use of the Cubist algorithm combined with spectral correction algorithms to remove the effect of soil moisture content (SMC) from on-line collected visible and near-infrared (VNIR) spectra and improve the soil organic carbon (SOC) prediction accuracy of spectra collected from multiple fields in Belgium. Three correction methods, namely, external parameter orthogonalization (EPO), piecewise direct standardization (PDS), and orthogonal signal correction (OSC), were used to remove the effect of SMC from the on-line spectra. The results showed that the EPO method outperformed both the PDS and OSC methods in eliminating the influence of differential moisture on soil VNIR spectra. The EPO-Cubist model provided the best SOC prediction accuracy. It can be concluded that the use of on-line scanned spectra for developing calibration models for the prediction of SOC is possible and reliable, which reduces the effort related to preprocessing of samples in the laboratory, e.g., drying, grinding, and sieving. As EPO was found to be the best performing method, its projection matrix can be applied directly to effectively reduce the influence of SMC on the on-line spectra, supporting sensor-based variable rate applications and providing solutions to speed up on-line soil mapping at field scale. Further work is suggested to test whether the success obtained in the present work can be extended to other soil properties when using the on-line data collection mode.