Estimation of Secondary Soil Properties by Fusion of Laboratory and On-Line Measured Vis–NIR Spectra

Visible and near infrared (vis–NIR) diffuse reflectance spectroscopy has made invaluable contributions to the accurate estimation of soil properties having direct and indirect spectral responses in NIR spectroscopy, with measurements made in the laboratory, in situ, or using on-line (while the sensor is moving) platforms. Measurement accuracy varies with measurement type; for example, it is higher for laboratory than for on-line modes. On-line measurement accuracy deteriorates further for secondary (having indirect spectral response) soil properties. Therefore, the aim of this study is to improve the on-line measurement accuracy of secondary properties by fusion of laboratory and on-line scanned spectra. Six arable fields were scanned using an on-line sensing platform coupled with a vis–NIR spectrophotometer (CompactSpec by Tec5 Technology for spectroscopy, Germany), with a spectral range of 305–1700 nm. A total of 138 soil samples were collected and used to develop five calibration models: (i) standard, using 100 laboratory scanned samples; (ii) hybrid-1, using 75 laboratory and 25 on-line samples; (iii) hybrid-2, using 50 laboratory and 50 on-line samples; (iv) hybrid-3, using 25 laboratory and 75 on-line samples; and (v) real-time, using 100 on-line samples. Partial least squares regression (PLSR) models were developed for soil pH, available potassium (K), magnesium (Mg), calcium (Ca), and sodium (Na), and the quality of the models was validated using an independent prediction dataset (38 samples). Validation results showed that the standard models with laboratory scanned spectra provided poor to moderate accuracy for on-line prediction, and the hybrid-3 and real-time models provided the best prediction results, although the hybrid-2 model with 50% on-line spectra provided equally good results for all properties except for pH and Na.
These results suggest that either the real-time model with exclusively on-line spectra or the hybrid model with fusion up to 50% (except for pH and Na) and 75% on-line scanned spectra allows significant improvement of on-line prediction accuracy for secondary soil properties using vis–NIR spectroscopy.


Introduction
Accurate and high-resolution data on soil properties are essential for optimal site-specific soil management, with the aim of maximizing land production at a minimum environmental footprint. Proximal soil sensing has made soil analysis more convenient, easier, cheaper and faster [1,2] than the conventional laboratory soil analysis methods, which are laborious, costly, time consuming and destructive [3,4]. One of the best proximal soil sensors is visible and near infrared (vis-NIR) diffuse reflectance spectroscopy, which is a simple, non-destructive and rapid technique, needs no sample preparation for field applications, and can be used in off-line (the sensor is in a fixed position in the laboratory or the field) [5][6][7][8] and on-line (the sensor is on the move during measurement) [9][10][11][12] measurement modes. The unique feature of the on-line mode is that it offers high sampling resolution

Experimental Sites
Study sites included six fields (shown in Figure 1), namely, Bottelare (5 ha), Thierry (3 ha), Watermachine (6 ha), Gingelomse (11 ha), Kattestraat (5 ha), and Dal (6 ha), which belong to four different commercial farms at Melle (50. Table 1) varied across different fields, with light to heavy loam for Gingelomse, Kattestraat and Dal, sandy to sandy-loam for Thierry, clay to clay-loam for Watermachine, and clay to loam for Bottelare. All fields were rather flat except Gingelomse, where the elevation is higher in the middle part than in the remaining parts of the field. Watermachine might have a problem associated with salt-water intrusion, since this field is located very close to the North Sea. All fields have an annual crop rotation of wheat/barley, maize, potato, and sugar beets with a short duration intermediate cover crop.

On-Line Sensing Platform, Soil Scanning, and Sampling
On-line soil sensing surveys were carried out on different dates in 2018 using the on-line sensing platform developed and patented by Mouazen [31]. It consists of a subsoiler fitted to a frame, which is attached to the three-point linkage of a tractor. The subsoiler makes a 15 to 25 cm deep trench in the soil, the bottom of which is smoothened by the subsoiler itself, due to the downwards vertical forces acting mainly on the chisel (see supplement). An optical probe hosted in a mild steel lens holder was appended to the backside of the subsoiler chisel to measure soil spectra in diffuse reflectance mode from the smoothened bottom of the trench. A mobile, fibre type, vis-NIR spectrophotometer (CompactSpec from Tec5 Technology for spectroscopy, Germany) with a spectral range of 305-1700 nm was used to record on-line soil spectra. A differential global positioning system (DGPS) (Trimble AG25, USA) recorded the position of the spectra, which were logged together with the GPS readings at a frequency of 1 Hz, using a semi-rugged laptop computer (Toughbook, Panasonic UK Ltd., Bracknell, UK) through a standard data logging and acquisition system called MultiSpec pro-II software (Tec5 Technology for spectroscopy, Germany). A 100% ceramic disc was used as the white reference, which was scanned once every 30 min. Field sensing was carried out along 12 m parallel transects at an average forward travel speed of around 3.5 km/h. During the on-line sensing, a total of 138 soil samples were collected randomly from the bottom of the trenches created by the subsoiler chisel, at an average sampling frequency of 3.83 samples per ha.
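The reported sampling frequency can be reproduced from the field areas listed under Experimental Sites. The short Python sketch below is only an illustrative check (the authors worked in R); the variable names are hypothetical.

```python
# Field areas in hectares, as listed in the Experimental Sites section.
field_area_ha = {
    "Bottelare": 5,
    "Thierry": 3,
    "Watermachine": 6,
    "Gingelomse": 11,
    "Kattestraat": 5,
    "Dal": 6,
}

n_samples = 138  # soil samples collected during on-line sensing

total_area_ha = sum(field_area_ha.values())
sampling_density = n_samples / total_area_ha  # samples per hectare

print(f"Total area: {total_area_ha} ha")
print(f"Sampling density: {sampling_density:.2f} samples per ha")
```

The result agrees with the stated average of 3.83 samples per ha over the six fields combined; the per-field density may differ, since the number of samples per field is not given here.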

Laboratory Optical Scanning and Chemical Analyses
Each soil sample was well mixed and the sample size was reduced to around 300 g by following the standard coning and quartering method [32]. The fresh soil samples were cleaned manually by removing debris such as grass, stubble, stone/gravel, and any other foreign objects. Each sample was divided into two parts of about 150 g each, with one part used for laboratory optical measurement and the other for laboratory chemical analyses. The first part of each sample was placed into three Petri dishes of 2 cm in diameter and 1 cm deep. The soil in each Petri dish was levelled with a spatula and then pressed gently, which was necessary as a smooth surface ensures maximum diffuse reflection and a high signal-to-noise ratio [33]. The soil samples were scanned in diffuse reflectance mode using the same mobile, fibre type, vis-NIR spectrophotometer (CompactSpec from Tec5 Technology for spectroscopy, Germany) used for the on-line soil measurement. The same 100% white reference was used before scanning, and this reference measurement was repeated every 30 min. Ten spectra were collected per Petri dish, and these were averaged into one spectrum.
The second portion of each sample was delivered to the Soil Survey of Belgium (BDB, Heverlee, Belgium) for chemical analysis. The soil pH was measured in the supernatant, after shaking and equilibration for 2 h in mol/l potassium chloride solution (KCl), using a 1:2.5 soil:solution ratio. The available K, Mg, Ca, and Na were measured in ammonium lactate extract with inductively coupled plasma atomic emission spectroscopy (ISO 11885; CMA 2/I/B1).

Spectral Pre-Processing and Characterization
The same pre-processing was carried out for both the on-line and laboratory measured spectra using the prospectr package in RStudio [34]. Several combinations of spectra pre-processing were tested, including smoothing, scatter corrections, first derivative, standard normal variate (SNV), and de-trending (DT). The best performing pre-processing per individual soil property was retained (Table 2).
Table 2. Spectra pre-processing combined steps followed for different soil properties.

Pre-Processing
Firstly, the raw spectra were reduced to a spectral range of 405-1660 nm. The spectral jump at the joining point of the two detectors at 1045 nm was corrected according to Mouazen et al. [25]. The moving average was used for reducing spectral noise, while the maximum normalization that followed brings the spectra onto the same 0 to 1 scale and creates an even distribution of variances. SNV de-trending is used as a means of base-line correction [35] after comparable data scaling. Savitzky-Golay [36] and gap-segment derivatives [37] were used to reduce noise and improve the signal-to-noise ratio [9]. The moving average was used for all sets as the control pre-processing, while smoothing with Savitzky-Golay was also used for all sets except for set-3, as shown in Table 2.
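As an illustration of three of the steps named above (moving average, maximum normalization, and SNV), the following Python sketch applies them in sequence to a single toy spectrum. It is a simplified stand-in, not the prospectr implementation used in the study: the window size, the division-by-maximum form of the normalization, and the toy reflectance values are all assumptions made for the example.

```python
import math


def moving_average(spectrum, window=3):
    """Smooth with a centred moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(spectrum)):
        lo = max(0, i - half)
        hi = min(len(spectrum), i + half + 1)
        smoothed.append(sum(spectrum[lo:hi]) / (hi - lo))
    return smoothed


def max_normalize(spectrum):
    """Divide by the maximum so that the largest value becomes 1 (assumed form)."""
    peak = max(spectrum)
    return [value / peak for value in spectrum]


def snv(spectrum):
    """Standard normal variate: centre and scale a spectrum by its own mean and SD."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((value - mean) ** 2 for value in spectrum) / (n - 1))
    return [(value - mean) / sd for value in spectrum]


# Hypothetical reflectance values at eight consecutive wavebands.
raw = [0.20, 0.22, 0.27, 0.25, 0.31, 0.35, 0.33, 0.40]

smoothed = moving_average(raw, window=3)
scaled = max_normalize(smoothed)
standardized = snv(scaled)

print([round(value, 3) for value in standardized])
```

After SNV each spectrum has zero mean and unit standard deviation, which is what makes laboratory and on-line spectra comparable in scale before further processing.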
After spectra cut and jump removal, principal component analysis (PCA) was conducted to investigate spectral discrepancy between laboratory and on-line scanned spectra. The analyses were done on raw and pre-processed spectra to investigate whether or not this spectra discrepancy can be minimised by spectra pre-processing. The PCA provides a set of explanatory orthogonal vectors, known as principal components (PCs) with regard to the proportion of variance explained [38]. We considered the first two principal components (PC1 and PC2) in this study, as they captured the majority of variance in the spectral data.

Datasets Assigning, Model Building, and Quality Assessment
Five different calibration models were developed, differing in scanning mode (laboratory and on-line) and in the ratio of samples from each mode (Table 3): (i) standard, (ii) hybrid-1, (iii) hybrid-2, (iv) hybrid-3, and (v) real-time. The Kennard-Stone (KS) algorithm [39] was used with the argument metric "mahal" for selecting the calibration (72%) and prediction (28%) datasets with 100 and 38 samples, respectively. For fair comparison among the different calibration models, the same prediction dataset was used for the five models. The standard and real-time calibrations involve only laboratory (100 samples) and on-line (100 samples) scanned spectra, respectively. Hybrid calibration datasets were created from the fusion of laboratory and on-line scanned samples with three different ratios of 25%, 50%, and 75% on-line scanned samples, as described in Table 3. Figure 2 illustrates the different steps considered during the development of calibration models and the validation of these models for on-line predictions. The standard calibration is illustrated in Figure 2 by a solid line, whereas the proposed hybrid and real-time calibrations are illustrated by dotted lines.
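The selection logic of the Kennard-Stone algorithm can be sketched in a few lines of Python. This is a minimal illustration and not the routine used in the study: it uses Euclidean distance in place of the Mahalanobis metric named above, and the function names and the six toy points are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kennard_stone(points, n_select):
    """Return the indices of n_select points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]

    # Step 1: start with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]

    # Step 2: repeatedly add the candidate whose nearest selected point is farthest away.
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Hypothetical two-dimensional scores for six samples.
scores = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.5, 0.5), (0.9, 1.0), (0.0, 1.0)]

calibration_idx = kennard_stone(scores, n_select=4)
prediction_idx = [i for i in range(len(scores)) if i not in calibration_idx]

print("calibration:", calibration_idx)
print("prediction:", prediction_idx)
```

Because the rule always picks the sample farthest from those already chosen, the calibration set spans the extremes of the data, and the samples left for prediction tend to fall inside that range. In the study the same idea is applied to split 138 samples into 100 calibration and 38 prediction samples.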
Partial least squares regression (PLSR) models were developed with leave-one-out cross validation (LOOCV) using the pls package [40] in the R software, and the models were also validated using the prediction set. The number of latent variables (LV) was selected based on the plot of LOOCV residual variance against the number of LV. The performance of the models was evaluated using the coefficient of determination (R²), root mean square error of prediction (RMSEP), residual prediction deviation (RPD), and the ratio of performance to inter-quartile range (RPIQ). Based on the RPD value, models can be ranked into six categories: (i) excellent (RPD > 2.5), (ii) very good (RPD = 2.5-2.0), (iii) good (RPD = 2.0-1.8), (iv) fair (RPD = 1.8-1.4), (v) poor (RPD = 1.4-1.0), and (vi) very poor (RPD < 1.0) performing models [41]. The current study adopted the above criterion to compare the quality of different models in both cross-validation and prediction.
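The evaluation metrics and the RPD ranking above can be written out as a short Python sketch. The formulas are the conventional ones and are assumptions here, since the text names the metrics without defining them: RPD is taken as the standard deviation of the observed values divided by RMSEP, RPIQ as the inter-quartile range divided by RMSEP, and R² as one minus the ratio of residual to total sum of squares. How values that fall exactly on a class boundary are assigned is also an assumption, and the sample data are hypothetical.

```python
import math
import statistics


def rmsep(observed, predicted):
    """Root mean square error of prediction."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SSE/SST (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - sse / sst


def rpd(observed, predicted):
    """Residual prediction deviation: SD of observed values over RMSEP."""
    return statistics.stdev(observed) / rmsep(observed, predicted)


def rpiq(observed, predicted):
    """Ratio of performance to inter-quartile range: IQR over RMSEP."""
    q1, _, q3 = statistics.quantiles(observed, n=4)
    return (q3 - q1) / rmsep(observed, predicted)


def classify_rpd(value):
    """Map an RPD value to the six categories listed in the text."""
    if value > 2.5:
        return "excellent"
    if value >= 2.0:
        return "very good"
    if value >= 1.8:
        return "good"
    if value >= 1.4:
        return "fair"
    if value >= 1.0:
        return "poor"
    return "very poor"


# Hypothetical observed and predicted values for eight samples.
observed = [12.0, 15.5, 9.8, 20.1, 17.3, 11.2, 14.8, 18.9]
predicted = [13.1, 14.9, 10.5, 18.7, 16.8, 12.4, 15.6, 17.5]

value = rpd(observed, predicted)
print(f"RMSEP = {rmsep(observed, predicted):.2f}")
print(f"R2 = {r_squared(observed, predicted):.2f}")
print(f"RPD = {value:.2f} ({classify_rpd(value)})")
print(f"RPIQ = {rpiq(observed, predicted):.2f}")
```

Under this ranking, an RPD of 2.35 is classed as very good and an RPD of 1.09 as poor.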

Laboratory Measured Soil Data
Distribution and size of the calibration dataset critically influence the overall quality of calibration models for soil measurement, and the range of variation in the prediction set has to be approximately equal to, or lie within, that of the calibration set [42]. Figure 3 illustrates the descriptive statistics, Pearson correlations (r) with scatter plot matrices, and density distributions of laboratory measured soil pH, K, Mg, Ca, and Na both for the calibration (n = 100 soil samples) and prediction (n = 38 soil samples) sets. The diagonal of Figure 3 illustrates the density plots with descriptive statistics; the area above the diagonal shows the correlation matrices with gradient colour ramps, while the area below it shows scatter plots between properties. It shows that the data ranges of all individual properties in the prediction set are similar to the corresponding data ranges of the calibration set, though slightly smaller ranges are noticeable for the prediction set. The highest range is observed for soil Ca, followed successively by Mg, K, Na, and pH. Since a difference between the mean and median values indicates a non-normal data distribution, the differences in this study indicate slight to moderate positively skewed distributions for Mg, Ca, and Na, both in the calibration and prediction sets. The soil K data show a relatively close to normal (mean ≈ median) distribution, while the data for pH are negatively skewed. Biological observations from soil data frequently show skewed distributions [43], which is clearly visible in the density distribution charts. The correlation matrix shows a similar correlation trend both in the calibration and prediction sets, where a good correlation is found between Mg and Ca (r ≈ 0.80) and between Mg and pH (r ≈ 0.60). In the calibration set, soil pH shows almost no correlation with K (r = 0.042) and a weak but positive correlation with Na (r = 0.235).
Na does not show correlation with Mg (r = 0.031) but is weakly correlated with K (r = 0.346) and negatively correlated with Ca (r = -0.124). Moreover, K is negatively correlated with Mg (r = -0.348) and Ca (r = -0.499).
Figure 4 presents score plots of on-line measured spectra (138 samples) against respective laboratory samples over the projected space obtained from the first two principal components (PCs) of PCA carried out before and after data pre-processing. The first two PCs (PC1 and PC2) cumulatively explain more than 99% of data variances for raw spectra, whereas smaller cumulative variances are explained for the pre-processed spectra, in ascending order: pre-processing set-4 (85.10%), set-2 (86.59%), set-1 (91.97%), and set-3 (94.02%) (see the meaning of sets in Table 2). The scattering of points over the PC space indicates that the current dataset contained significant variations, including featured information. The individual grouping of each field confirms homogeneity within that field and heterogeneity among different fields. In this context, soils in Dal, Thierry, and Watermachine are highly heterogeneous while Bottelare, Kattestraat, and Gingelomse soils are moderately diverse. Before spectral pre-processing, it was hard to find a similar grouping pattern between on-line and laboratory measured spectra, since the on-line groups are located far from the laboratory groups for all fields except Watermachine. Figure 4i reveals that laboratory measured samples are generally located more homogeneously around the origin of the PC plot, whereas on-line samples are more scattered and randomly distributed over the PC space.
On-line samples are more scattered, possibly because of spectral alterations caused by external factors such as stones and roots in the soil, and ambient light and temperature [23][24][25]. In the plot for the raw spectra, the overlap of samples between different fields is considerable. The on-line and laboratory samples for the Watermachine field are located very closely with a great deal of overlap, which may be attributed to smaller influences of the ambient conditions during on-line measurement. The highest spectral discrepancy is seen for the Thierry field, followed by the Dal field.

Discrepancy between Laboratory and On-Line Scanned Vis-NIR Spectra
After spectral pre-processing, the overlap of spectra from different fields is smaller than for the raw spectra. It seems logical that each field conveys distinguishing features originating from its own pedogenesis, e.g., soil mineralogy and soil matrix characteristics. The separation is particularly clear for Watermachine samples from those of the other fields, which may be attributed to the heavy clay texture of Watermachine containing a very high percentage of Ca, since it is located very close to the North Sea. Although Dal, Kattestraat, and Gingelomse are expected to be similar in soil characteristics, since they are from the same farm, the latter two fields have more similar spectral characteristics than Dal, whose samples are perfectly separated from those of the other two fields. This perfect separation might be due to the very dry soil conditions during the on-line measurement that took place in summer 2018 (average moisture content (MC) = 8.75%). On-line spectra show more dispersion than the corresponding laboratory spectra. For example, laboratory spectra from the Gingelomse field are located in one or two quadrants, while on-line spectra are spread out over the four quadrants. The highest spectral dispersion can be observed in the case of the Gingelomse and Kattestraat fields, whereas the lowest discrepancy is observed for Watermachine. The degree of discrepancy is highly influenced by MC, which is why the highest spectral differences (between the on-line and laboratory spectra) are observed for Gingelomse (average MC = 22.79%) and Kattestraat (average MC = 23.02%), as these two fields were measured under very wet soil conditions (Figure 4). However, during sample preparation for laboratory scanning, MC is lost, explaining the considerable difference between laboratory and on-line scanning for these two fields.
Since Watermachine was measured on-line under dry soil conditions (MC = 8.75%), and the field has a heavy clay soil texture, this potentially resulted in the lowest reduction in MC during laboratory scanning, explaining the smallest differences between the laboratory and on-line scanned spectra. Comparing all the pre-processing sets suggests that pre-processing can reduce external influences to some degree, but cannot neutralize the entire impact of external factors during the on-line spectral measurements, which causes differences with the laboratory measurements.
Different degrees of discrepancy between laboratory and on-line scanned spectra are also noticeable for both the raw and pre-processed spectra (Figure 5). Raw spectra revealed higher variability of on-line spectra than of laboratory measurements, evidenced by the higher mean, SD, and median values of the former compared with the latter. All the plots for set-1 to 4 in Figure 5 also complement the conclusion from the PCA score plots that spectra pre-processing can reduce external influences only partially, which is supported by the different mean, SD, and median values at some specific wavebands.
Distinguishable absorption peaks at 420, 575, 600, 650, 930, 1125, 1400, and 1500 nm are more prominent for the laboratory scanning mode (Figure 5 (set-4 (ii))). Specifically, the differences in the spectra are clearly visible at 420 and 575 nm, which are associated with the absorption of the blue band, strongly linked with OC; and at 1400 nm associated with O-H absorption at 1450 nm [41]. This difference between the laboratory and on-line spectra may result in errors in estimation of not only soil MC and OC, but also those properties having possible covariation with O-H absorption and OC, such as pH and P [21].
Figure 4. Characterization of the spectral discrepancy between laboratory and on-line measured samples that resulted from principal component analysis (PCA). PCA similarity maps are shown between principal component 1 (PC1) and 2 (PC2) for (i) before spectral pre-processing and (ii) after spectral pre-processing of different sets (set-1, set-2, set-3, and set-4), as described in Table 2.
Figure 5. Spectral discrepancy between laboratory and on-line scanning modes before (raw spectra) and after spectra pre-processing (for sets-1 to 4), shown with respect to the mean, standard deviation (SD), and median of spectra. A detailed illustration is shown for pre-processing set-4 (ii), as an example, to highlight particular wavebands where a prominent discrepancy occurs. Red and green lines in the plot of set-4 (ii) stand for mean laboratory and on-line spectra, respectively, while black lines represent the entire dataset. L: Laboratory; O: On-line.
Remote Sens. 2019, 11, x FOR PEER REVIEW

PLSR Coefficients
Figure 6 illustrates PLSR coefficients obtained from the standard, hybrid, and real-time calibrations. Important absorption peaks are evenly distributed across both the visible (400-780 nm) and NIR (780-1700 nm) regions. Only a few key absorption peaks are observed for pH, K, and Ca, though several smaller peaks are observed for Mg and Na. Important wavelengths of 455, 772, 1361, and 1424 nm are observed for pH. Mouazen et al. [21] reported several correlation features for pH in the visible and NIR ranges, which were probably associated with amine N-H (751, 1000, and 1500 nm), hydroxyl O-H (950, 1450, and 1950 nm), and aromatic C-H (825, 1100, and 1650 nm) bonds [14]. Similarly, the 772 nm wavelength in the present study can be attributed to N-H absorption at 751 nm. The wavelengths of 1361 and 1424 nm can be associated with the second overtone of O-H absorption, whereas the 455 nm wavelength can be associated with the blue colour absorption band typically found at 450 nm, which can be attributed to OC and water. This suggests that pH is, in part, successfully measured through covariation with water and OC. For K, a moderate absorption band (456 nm), two wide peaks at around 645 and 1158 nm and a stronger absorption peak at 1425 nm are recorded. Similar to pH, the 456 and 1425 nm wavelengths are attributed to the blue colour (associated with overall changes in soil albedo related to water and organic matter) and O-H second overtone absorptions, respectively. The absorptions at 645 and 1158 nm can be attributed, respectively, to red colour (at 680 nm, associated with soil mineralogy and iron oxides in particular) and aromatic C-H, typically found around 1100 nm, according to Viscarra Rossel and Behrens [44]. Among the wavelengths of 460, 571, 810, 1056, 1405, and 1500 nm, contributing to the successful prediction of Mg, significant sharp absorption features are found at 460 and 1405 nm, which can be attributed to the blue band (attributed to OC and water) and the second overtone of O-H absorptions (similar to pH and K), respectively. Fewer absorption features for Na are observable at 560, 770, 1400, and 1510 nm. The significant wavelengths at 1500 nm for Mg and 1510 nm for Na can be linked with the absorption of amine N-H bonds. Four absorption features are seen for Ca, which are almost identical to those for pH (455, 790, 1360, and 1424 nm), and these may well be attributed to calcium carbonate. Our findings of key absorption wavelengths that have significant features at around 1400-1450 nm reveal possible covariations of pH, K, Mg, Na, and Ca with soil MC, which is attributed to the prominent O-H absorption in the second overtone region.
Both Mg and K show possible co-variations with aromatic C-H (1056 nm for Mg, and 1158 for K) and amine N-H bonds for Mg at 810 and 1500 nm, while Na shows covariations with the N-H bond only, attributed to peak coefficient at 1500 nm.
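Most of the band assignments in this section pair an observed coefficient peak with the closest reference absorption band quoted in the text. The Python sketch below encodes those reference wavelengths and returns the nearest one for a given peak. It is only a rough heuristic offered for illustration: the function and dictionary names are hypothetical, and a nearest-band rule does not reproduce every assignment made above (the 810 nm peak for Mg, for instance, lies closer to the 825 nm C-H band than to the 751 nm N-H band).

```python
# Reference absorption bands (nm) quoted in the text.
REFERENCE_BANDS_NM = {
    "blue colour": [450],
    "red colour": [680],
    "amine N-H": [751, 1000, 1500],
    "hydroxyl O-H": [950, 1450, 1950],
    "aromatic C-H": [825, 1100, 1650],
}


def nearest_band(wavelength_nm):
    """Return (label, reference_nm, offset_nm) for the closest reference band."""
    label, reference = min(
        ((name, ref) for name, refs in REFERENCE_BANDS_NM.items() for ref in refs),
        key=lambda item: abs(item[1] - wavelength_nm),
    )
    return label, reference, wavelength_nm - reference


# Key wavelengths reported for pH and K in the PLSR coefficient plots.
for peak in (455, 772, 1361, 1424, 645, 1158):
    label, reference, offset = nearest_band(peak)
    print(f"{peak} nm -> {label} at {reference} nm (offset {offset:+d} nm)")
```

The offsets make explicit how far each reported peak sits from its reference band, which is useful when judging how firm an individual assignment is.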
Nevertheless, the general trend of PLSR coefficients indicates that the key absorption peaks discussed above occur at exactly the same wavelengths for the standard, hybrid and real-time calibrations. This indicates overall consistency of significant spectral features associated with respective properties across the models developed.
Table 4 shows the prediction results for the studied soil properties using PLSR models developed for the five datasets explained above. It can be observed that the prediction accuracy is dependent on the soil property. For example, for the standard calibration based on laboratory scanned soil samples, the best on-line prediction is observed for Mg (R² = 0.48, RMSEP = 10.42 mg/100 g, RPD = 1.41, RPIQ = 0.55), followed in descending order by Na, pH, K, and Ca. Compared to the standard calibration using laboratory spectra only, hybrid calibrations performed better for the prediction of all the investigated properties. The best on-line prediction result (R² = 0.81, RMSEP = 6.25 mg/100 g, RPD = 2.35, and RPIQ = 0.92) is obtained for Mg using the hybrid-3 model (Table 4 and Figure 7). The real-time calibration shows prediction quality similar to that of the hybrid calibrations, hybrid-2 and hybrid-3 in particular. The best on-line prediction result for the real-time model was found for Mg (R² = 0.81, RMSEP = 6.38 mg/100 g, RPD = 2.30, RPIQ = 0.90), followed by Ca (R² = 0.75, RMSEP = 436.46 mg/100 g, RPD = 2.02, RPIQ = 0.43). The remaining models can be sorted in descending order as pH (R² = 0.74, RMSEP = 0.39, RPD = 1.97, RPIQ = 2.25), Na (R² = 0.65, RMSEP = 3.14 mg/100 g, RPD = 1.72, RPIQ = 1.43), and K (R² = 0.54, RMSEP = 6.85 mg/100 g, RPD = 1.50, RPIQ = 2.00). Figure 7 illustrates the influence of the fusion ratio of on-line versus laboratory collected spectra on the on-line prediction quality of soil pH, K, Mg, Ca, and Na in comparison with the standard and real-time calibrations.
Results show that increasing the percentage of on-line collected spectra in the calibration set brings a corresponding improvement in on-line prediction. Among the three hybrid models, hybrid-1 (25%) performs worst, followed by hybrid-2 (50%) and hybrid-3 (75%). Hybrid-2 and hybrid-3 provided comparable results, except for pH and Na, where hybrid-3 clearly outperformed hybrid-2. Hybrid-1 resulted in slight improvements in the prediction of pH, K, and Mg, whereas a significant improvement can already be observed for Ca (R² = 0.69, RMSEP = 483.81 mg/100 g, RPD = 1.82, RPIQ = 0.39), compared with the standard calibration (R² = 0.13, RMSEP = 809.13 mg/100 g, RPD = 1.09, RPIQ = 0.23). The hybrid-2, hybrid-3, and real-time models all provide the best prediction results for Mg (R² = 0.81, RMSEP = 6.25-6.38 mg/100 g, RPD = 2.30-2.35, and RPIQ = 0.90-0.92), whereas the second most accurate prediction is found with hybrid-3 for Ca (R² = 0.77, RMSEP = 412.23 mg/100 g, RPD = 2.13, and RPIQ = 0.45). The hybrid-3 model provides the best prediction results for Na (Table 4). K is best predicted by the hybrid-2 model (R² = 0.58, RMSEP = 6.60 mg/100 g, RPD = 1.56, and RPIQ = 2.08), whereas pH (R² = 0.74, RMSEP = 0.39, RPD = 1.97, RPIQ = 2.25) is best predicted by the real-time model. Results indicate that the proposed hybrid-2 (50% on-line spectra) and hybrid-3 (75% on-line spectra) models both perform as well as the real-time model, except for the underperformance of hybrid-2 for pH and Na (Table 4). The on-line prediction is classified as good performing for pH (hybrid-3 and real-time models; RPD = 1.96-1.97), fair for K (hybrid-2, hybrid-3, and real-time models; RPD = 1.48-1.56), very good for Mg and Ca (hybrid-2, hybrid-3, and real-time models; RPD = 2.02-2.35), and good for Na (hybrid-3 and real-time models; RPD = 1.72-1.82), which are improved well beyond the results reported in the literature.
Therefore, they can be used successfully for the prediction of the named soil properties, except for hybrid-2 for pH and Na (Table 4).
Table 4. Quality of on-line prediction of soil pH, potassium (K), magnesium (Mg), calcium (Ca), and sodium (Na), obtained from partial least squares regression (PLSR) models developed for (i) standard, (ii) hybrid-1, (iii) hybrid-2, (iv) hybrid-3, and (v) real-time calibrations.
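The four quality indicators quoted throughout this section (R², RMSEP, RPD, and RPIQ) can be reproduced from paired measured and predicted values. The following is a minimal sketch, not the authors' implementation. It assumes R² is the coefficient of determination, RPD uses the sample standard deviation of the measured values, and RPIQ uses inclusive quartiles; the study may have used other conventions. The function name `prediction_metrics` is hypothetical.

```python
import math
import statistics


def prediction_metrics(measured, predicted):
    """Return R2, RMSEP, RPD and RPIQ for paired measured/predicted values.

    Assumed conventions (not confirmed by the source text):
    - R2   = 1 - SS_res / SS_tot
    - RPD  = sample standard deviation of measured values / RMSEP
    - RPIQ = interquartile range of measured values / RMSEP
    """
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    n = len(measured)
    residuals = [m - p for m, p in zip(measured, predicted)]
    ss_res = sum(r * r for r in residuals)
    mean_measured = statistics.fmean(measured)
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    if ss_res == 0 or ss_tot == 0:
        raise ValueError("metrics are undefined for perfect fits or constant data")

    rmsep = math.sqrt(ss_res / n)
    # Quartiles of the measured (reference) values; quantile conventions vary.
    q1, _, q3 = statistics.quantiles(measured, n=4, method="inclusive")

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmsep": rmsep,
        "rpd": statistics.stdev(measured) / rmsep,
        "rpiq": (q3 - q1) / rmsep,
    }


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not data from the study.
    reference = [1.0, 2.0, 3.0, 4.0, 5.0]
    estimate = [1.5, 2.5, 2.5, 4.5, 4.5]
    for name, value in prediction_metrics(reference, estimate).items():
        print(f"{name}: {value:.3f}")
```

Because RPD and RPIQ both divide a spread measure by RMSEP, a property with a skewed distribution can show a high RPD together with a low RPIQ, as seen for Ca in the results above.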

Discussions
The current study hypothesises that on-line measured vis-NIR spectra deviate from laboratory spectra collected for the same soil samples, owing to external factors that affect on-line scanning, such as ambient conditions, mechanical vibrations, and sensor-to-soil distance variations. As a consequence, it is assumed that on-line measured spectra can, if included in the calibration set, improve the prediction accuracy of on-line measured soil properties (i.e., pH, K, Mg, Ca, and Na) having indirect spectral responses in near infrared spectroscopy. However, the influence of the percentage of on-line spectra added to the calibration set is unknown, and requires the investigation carried out in this work.
From the spectra analysis discussed above, one can conclude that spectral differences between laboratory and on-line scanning methods do exist, both at the level of individual spectra for the same sample and at the level of groups of spectra. The PCA similarity maps obtained from PC1 and PC2 showed clear differences between laboratory and on-line measured spectra for both the raw and pre-processed spectra. The differences become smaller after implementing the different pre-processing steps considered in the present work, indicating that spectra pre-processing can at least partially remove these differences. Overlap between the laboratory and on-line samples was observed in the PCA similarity maps after spectral pre-processing, particularly for fields with relatively low MC. However, due to the significant difference in the raw spectra, the same data pre-processing might not work for both the laboratory and on-line collected spectra; hence, different pre-processing is proposed to resolve this issue.
Current findings indicate that soil MC is one of the dominant properties responsible for spectra differences [23,44]. During laboratory preparation and scanning of samples collected during on-line measurement, soil samples may lose MC, introducing spectral differences between laboratory and on-line scanned spectra. The discrepancy is more prominent in particular spectral bands, shown in Figure 6. These differences may well be removed with proper spectra pre-processing before calibration. Similarly, the influence of noise due to vibration can be removed by gentle smoothing, while the effect of variation in sensor-to-soil distance can be removed by an algorithm suggested by Mouazen et al. [25]. However, aggressive spectra pre-processing may also lead to the loss of important feature information necessary for the successful prediction of the studied soil properties. Above all, the on-line prediction results of the standard models were of rather poor quality compared to the hybrid and real-time models, suggesting that spectra pre-processing is not sufficient to remove all sources of discrepancy, and that different ratios of on-line spectra should be included in the calibration set. This has resulted in improved prediction accuracy, and the degree of improvement was proportional to the ratio of on-line spectra added.
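The fusion ratios compared in this study (0%, 25%, 50%, 75%, and 100% on-line spectra in a 100-sample calibration set) can be illustrated as follows. This is a minimal sketch under stated assumptions: it assumes every calibration sample has both a laboratory and an on-line spectrum, and that the samples contributing on-line spectra are drawn at random. The text does not specify the selection rule, and the names `FUSION_RATIOS` and `assign_scan_modes` are hypothetical.

```python
import random

# On-line fraction of each calibration set, as described in the study.
FUSION_RATIOS = {
    "standard": 0.00,
    "hybrid-1": 0.25,
    "hybrid-2": 0.50,
    "hybrid-3": 0.75,
    "real-time": 1.00,
}


def assign_scan_modes(sample_ids, online_fraction, seed=0):
    """Decide, per calibration sample, which spectrum enters the model.

    Returns a dict mapping each sample id to "online" or "lab".
    Random selection is an assumption made for illustration only.
    """
    if not 0.0 <= online_fraction <= 1.0:
        raise ValueError("online_fraction must lie in [0, 1]")

    ids = list(sample_ids)
    n_online = round(len(ids) * online_fraction)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    online_ids = set(rng.sample(ids, n_online))
    return {sid: ("online" if sid in online_ids else "lab") for sid in ids}


if __name__ == "__main__":
    calibration_ids = range(100)  # 100 calibration samples, as in the study
    for name, fraction in FUSION_RATIOS.items():
        modes = assign_scan_modes(calibration_ids, fraction)
        n_online = sum(mode == "online" for mode in modes.values())
        print(f"{name}: {len(modes) - n_online} laboratory + {n_online} on-line")
```

The resulting assignment would then be used to pick the matching spectrum for each sample before fitting the PLSR model, with the 38 validation samples kept out of the draw.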
As the results in Table 4 reveal, the real-time calibration performs almost as well as the hybrid-2 (except for pH and Na) and hybrid-3 models, while it outperforms the standard calibration and hybrid-1 models for all the studied soil properties. When comparing the hybrid against the standard calibrations, all hybrid models outperformed the standard model for on-line measurement of the studied soil properties (Table 4 and Figure 7). The hybrid calibration improved the on-line prediction quality for Mg and Ca from poor to very good, for pH and Na from poor to good, and for K from poor to fair, according to the RPD classes proposed by Viscarra Rossel et al. [41]. This means that it is necessary to include on-line scanned spectra in the calibration set, because they have special features that do not exist in the laboratory scanned spectra. These features persist even after the spectra pre-processing methods detailed in Table 2. These results suggest that either the real-time model with on-line spectra or the hybrid model with fusion up to 50% (except for pH and Na) and 75% on-line scanned spectra allows significant improvement of the prediction accuracy for soil properties having indirect spectral responses in NIR spectroscopy when scanning in on-line vis-NIR mode.
Calibration models derived from laboratory spectra can predict secondary soil properties in on-line mode with relatively low accuracy. In this study, we examined whether or not there is a need to include on-line scanned spectra in the calibration data set for on-line prediction of the studied secondary soil properties. While laboratory scanning is essential to build a spectral library that is needed for future laboratory scanning-based calibrations, one should bear in mind that according to the results achieved in this work, there is a need to establish two different spectral libraries, one for laboratory scanning conditions and one for on-line scanning conditions.
Although the proposed hybrid approach can improve the overall accuracy, it has to be admitted that vis-NIR spectroscopy is limited [15] in providing excellent prediction results for secondary soil properties, unlike for properties with direct spectral responses, e.g., OC [45,46] and MC [8,47]. The on-line prediction was good for pH (hybrid-3 and real-time models; RPD = 1.96-1.97), fair for K (hybrid-2, hybrid-3, and real-time models; RPD = 1.48-1.56), very good for Mg and Ca (hybrid-2, hybrid-3, and real-time models; RPD = 2.02-2.35), and good for Na (hybrid-3 and real-time models; RPD = 1.72-1.82); these results are better than those reported in the literature.
Based on laboratory calibrations, Chang et al. [15] reported that vis-NIR is a limited technique for the estimation of soil K, Mg, and Ca, with the lowest prediction accuracy obtained for Ca (R² < 0.5 and RPD < 1.4). A study by Dunn et al. [17] reported that NIR spectroscopy showed a high level of prediction accuracy, with R² = 0.67, 0.91, 0.87, and 0.69 for K, Mg, Ca, and Na, respectively, for top soil (0 to 10 cm) using spectral data collected in the laboratory. In addition, using a full range spectrophotometer (350 to 2500 nm) in the laboratory, Mouazen et al. [19] found good measurement accuracy for Ca (R² = 0.77 and RPD = 2.10), whereas poor results were reported for Mg (R² = 0.59 and RPD = 1.56), Na (R² = 0.40 and RPD = 1.29), and K (R² = 0.33 and RPD = 1.21). Qiao and Zhang [48] developed NIR calibration models using laboratory spectra, achieving better accuracy for K (R² = 0.69 and RMSE = 0.69%) than that of Mouazen et al. [19]. Taken together, the above studies show that laboratory-based vis-NIR spectroscopy gives fluctuating results for the measurement of secondary soil properties, and that the prediction performance depends on the sample set available for each study. The conclusion also applies to on-line prediction of secondary soil properties, although only three studies can be found in the literature [16,49,50].
Since secondary soil properties have indirect spectral responses in NIR spectroscopy, it is not possible to determine key wavebands that contribute directly to the parameter estimations. That is the reason why previous studies [15,16,48] did not attempt to identify important wavebands for secondary soil properties. However, significant bands were identified in the present study that are associated with the successful prediction of pH (455, 772, 1361, 1424 nm), K (456, 645, 1158, 1425 nm), Mg (571, 810, 1066, 1405, 1500 nm), Ca (455, 770, 1360, 1424 nm), and Na (560, 770, 1400, 1510 nm). These featured wavelengths may nevertheless vary for different datasets, depending on the parent material of the soil, weathering conditions, soil texture, colour, and mineralogy. It was possible to explain the association of these bands or groups of bands with soil properties having direct spectral responses in NIR spectroscopy. These include, among others, bands associated with OC, MC, and blue and red colour absorptions. Spectroscopy bands associated with molecular bonds such as O-H, aromatic C-H, and amine N-H were assigned for each soil property studied, and a considerable degree of overlap was observed for a few properties. For example, the same four absorption features were observed for both pH and Ca (455, 790, 1360, and 1424 nm), which was attributed to calcium carbonate. Among these four features, three similar features at 460, 810, and 1405 nm were observed for Mg. This indicates similarity of the significant spectral features for these three properties. The laboratory chemical analysis results show a similar trend of data for these three properties.
For example, they are all positively skewed (Figure 3), with a strong correlation between Mg and Ca (r ≈ 0.80) and good correlations between Mg and pH (r ≈ 0.60) and between Ca and pH (r ≈ 0.55), indicating that the spectral correlations obtained from the PLS regression coefficient plots (Figure 6) contain real information about the chemical background of the data set. This also shows that links between spectral and chemical data can be made to understand why the secondary soil properties can be measured successfully with vis-NIR spectroscopy. However, further study is needed for in-depth evaluation of the relationship between the spectral and chemical data, to quantify the individual contribution of significant bands associated with the primary soil properties, including soil colour in the visible range, to the prediction of the different secondary properties studied.
It is interesting to discuss the advantages of the modelling approach used in this study over the conventional methods (i.e., EPO, DS, and OSC) used for removing external effects from soil spectra. Several studies have used EPO to remove the influences of known external factors, e.g., MC [6,51], soil roughness, aggregation, and ambient temperature [52]. However, neither EPO nor DS has been reported to remove the influences of unknown factors, e.g., those frequently encountered during on-line soil sensing such as noise and the presence of stones, plant roots, and residues. Both DS and EPO require a transfer sample set, consisting of identical samples measured under different measurement conditions, to account for the known influences [28,52], e.g., dry versus wet conditions when the external factor under consideration is moisture content. OSC is more advantageous than the EPO and DS methods, since it does not require a transfer sample set [28]; hence, it can theoretically tackle unknown external influences. However, it has been proven mathematically that OSC, when coupled with PLSR (OSC-PLSR), could not improve prediction quality but rather improved model interpretability [29]. A recent study [28] reported inconsistent performance of OSC-PLSR models, with slight improvement in the prediction quality of on-line measured pH, Ca, CEC, and lime requirement in a clay soil, whereas no improvement was reported for soil organic matter, Mg, potential acidity, sum of bases, percent base saturation, and MC. This contradictory performance was difficult to explain.
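To show why EPO depends on a transfer sample set, the projection is commonly written as below. This is a sketch of the usual textbook form, not a formula taken from this study or the cited works. The symbols are our own illustrative choices: $X_A$ and $X_B$ are the spectra of the same $n$ transfer samples measured under two conditions (each $n \times p$), and $g$ is the number of components removed.

```latex
% Illustrative notation (an assumption), not taken from the cited studies.
\begin{align}
  D &= X_A - X_B
      && \text{difference spectra of the transfer set } (n \times p) \\
  D^{\top} D &= V \Lambda V^{\top}
      && \text{eigendecomposition; columns of } V \text{ span the external effect} \\
  P &= I_p - V_g V_g^{\top}
      && V_g = \text{first } g \text{ columns of } V \\
  X^{*} &= X P
      && \text{spectra projected orthogonally to the external effect}
\end{align}
```

Without paired measurements of identical samples, $D$ cannot be formed, which is why a transfer set is needed. The fusion approach discussed in this section does not need one.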
Our approach, based on fusion of laboratory and on-line scanned spectra, can handle the influences of unknown external factors present in on-line spectra without the need to create a transfer data set. This is supported by the fact that the results achieved in the present work are much better than those recently reported by Franceschini et al. [28] for on-line measurement of soil properties, obtained after spectra transformation with EPO, DS, and OSC techniques. It would be interesting in future work to combine the approach of the present study with EPO, DS, and OSC to enable removal of both the known and unknown external influences, hence maximising the prediction accuracy for the on-line measurement of secondary soil properties. This approach can be tested further for primary soil properties having direct spectral responses in NIR spectroscopy.

Conclusions
The current study introduced a novel calibration approach to model vis-NIR spectra for on-line prediction of secondary soil properties, namely, pH, available K (potassium), Mg (magnesium), Ca (calcium), and Na (sodium). It compares the performance of partial least squares regression (PLSR) models developed using 100% laboratory scanned spectra (standard model), 100% on-line measured spectra (real-time model) and hybrid-1, hybrid-2, and hybrid-3 models, having 25%, 50%, and 75% on-line measured spectra fused with laboratory spectra of 75%, 50%, and 25%, respectively. Results obtained suggest the following conclusions:
a. For a particular soil sample, laboratory and on-line spectra are rarely identical, and spectra pre-treatments can reduce the discrepancies to some extent but cannot remove them completely. Therefore, the laboratory scanned spectra-based calibration models predict on-line soil properties with low accuracy.
b. Inclusion of on-line collected spectra in the calibration set is necessary and has resulted in improved prediction accuracy. The degree of improvement was proportional to the ratio of on-line spectra added. The real-time calibration performed almost as well as the hybrid-2 model (except for pH and Na) and the hybrid-3 model (for all the soil properties investigated). Furthermore, the three hybrid models outperformed the standard calibration. Thus, either the real-time, the hybrid-2 (excluding pH and Na) or the hybrid-3 models should be used for successful on-line prediction of the secondary soil properties considered in this study.
c. The current study identified key absorption wavelengths significantly contributing to the predictions of soil pH, K, Mg, Ca, and Na. These wavelengths are associated with the absorption band of the blue colour, second overtone of O-H absorption, aromatic C-H, and amine (N-H) absorptions, depending on the soil property.
To sum up, the proposed modelling approach can be successfully used for on-line measurement of secondary soil properties (e.g., pH, K, Mg, Ca, and Na). This approach is of practical use for different end users, e.g., precision farming practitioners and soil scientists, who are interested in high resolution data that are acquired rapidly, accurately, and cost-effectively. However, further study is needed to confirm that the current modelling approach is applicable to different spectral data sets having a wider range of variability in the soil attributes, compared to the current dataset collected from four different farms in Belgium.