1. Introduction
Soil quality depends mainly on the chemical and physical properties of the soil, which are determined by the cumulative effects of the natural factors involved in its formation, including climate, topography, parent material, biological activity, and time [1]. The development of precision agriculture requires a quantitative analysis technology that can accurately and quickly characterize the physicochemical properties of soils over a large area. Routine soil testing is recognized as a basic technique for estimating soil properties [2]; however, traditional soil testing and granulometric analyses are relatively slow and expensive, because a large number of soil samples is needed to map the spatial variation in the managed field [3].
The visible and near-infrared reflectance analysis (Vis-NIRA) technique has emerged as a possible supplement to, or replacement for, traditional soil testing methods. Using the Vis-NIRA technique, many researchers have related soil properties to spectroscopic soil reflectance data [4,5,6,7,8,9,10,11]. However, identifying the critical spectral features that estimate the soil properties is a difficult task, for two reasons: the volume of hyper-spectral data is extremely large, and visible and near-infrared hyper-spectra are difficult to interpret directly, since they contain overlapping weak overtones and combinations of fundamental vibrational bands.
Besides containing many redundancies, soil reflectance hyper-spectral data are affected by the soil surface roughness, soil moisture, environmental noise, and various other factors [12]. Therefore, to provide a good dataset for soil reflectance spectroscopy, the noise should be removed as far as possible while preserving the spectral details. The discrete wavelet transform (DWT) is a wavelet transform technique that extracts the features of hyper-spectral data, and has been widely applied to spectroscopic analysis [7,13,14,15,16,17]. Although the wavelet transform is an integral transformation, the DWT represents a signal by a discrete set of coefficients. In this way, the DWT reorganizes the original data as a combination of mathematical building blocks (basis functions). The coefficients of the detailed and approximate signals retain the vast majority of the original data characteristics, which makes the DWT an important way of reducing the data dimensions [7,15]. The DWT is considered an excellent method for predicting multiple soil properties. However, although the scale of wavelet analysis has been discussed [14], the impact of different wavelet functions on the prediction of soil properties has not been well studied. The widely varying properties of the different wavelet functions can be expected to affect the predicted soil properties. Therefore, this paper explores how different wavelet functions affect the prediction model of soil properties.
Predictive models are generally calibrated using the measured spectral datasets of soils with known properties, and such datasets are assembled into spectral libraries [18]. The most popular spectroscopic analysis methods are multiple linear regression (MLR), principal component regression (PCR), partial least-squares regression (PLS), and artificial neural networks (ANN) [6,19,20,21]. Partial least-squares regression and multiple linear stepwise regression are considered the most appropriate regression methods for spectral calibration and prediction of soil properties [22,23,24]. PLS has the same overall framework as principal component regression, and includes multiple regression and canonical correlation analyses [25]. As such, it can determine a suitable number of variables and suppress noise interference, retaining the useful data for traditional linear regression. Moreover, PLS can extract the main determinant variables from soil reflectance spectroscopy data, reducing the spectral dimension and enhancing the robustness of the established model. However, the PLS method is poorly suited to data containing much useless information, because some of the independent variables can be mistakenly treated as having explanatory power and incorporated into the regression equation, reducing the model accuracy [26]. Hyper-spectral data, which include a large volume of redundant information and noise-distorted spectral shapes, fall into this category. The spectral noise is introduced by sensor limitations and particle-size differences [27]. Hence, improving the PLS method for statistical analysis of spectral data has become a main research focus [28,29,30]. Multivariate stepwise linear regression (MSLR) finds and selects the variables exerting the most significant influences on the dependent variables and outperforms ordinary meta-regression; however, this method cannot remove the multi-collinearity between independent variables. In contrast, stepwise multiple regression combined with the partial least-squares method can preliminarily remove the information unrelated to the response vector by an algebraic algorithm.
The main objectives of this study were as follows: (a) to characterize the representativeness of different wavelet functions and different decomposition scales in the prediction model, and (b) to assess the appropriateness of the improved regression analysis (Stepwise-PLS) for improving the prediction accuracy of soil properties.
2. Materials and Methods
2.1. Soil Sample Preparation and Laboratory Analysis
For this study, 193 soil samples of Burozem and Cinnamon soil were collected from the Experimental Station of Qingdao Agricultural University (Shandong Province, China) in 2014. The main parent materials of Cinnamon soil consist of loess and lime, and its mineral composition is primarily hydromica, montmorillonite, and kaolinite. The main parent materials of Burozem soil are non-calcareous eluvial slope deposits and earthy deposits, and its mineral composition is primarily hydromica, kaolinite, and vermiculite. Within each soil type, the soil samples have similar parent materials, mineral composition, and texture. The collected soil samples were air-dried for 72 h and then passed through a 4.75-mm aperture square-hole sieve to remove the coarse matter and organic debris. The physical and chemical properties of the soil were measured as described by I.S.S.C.A.S. (1978) [31]. Briefly, the soil organic matter (SOM) was estimated by K₂Cr₂O₇ oxidation at 180 °C, and the cation exchange capacity (CEC) was estimated by displacing the exchangeable cations on the soil particle surfaces with NH₄⁺. Each soil sample was measured in duplicate, and the main soil properties are summarized in Table 1.
The hyper-spectral reflectance data were obtained from 350 to 2500 nm by an Analytical Spectral Devices spectroradiometer with a spectral resolution of 1.4 nm. The collected soil samples were measured in a dark room in the laboratory. The field-of-view was 8°, and the illumination was provided by a 1000-W halogen lamp fixed 100 cm directly above the soil plane. To eliminate the effects of soil moisture as much as possible, the soil samples were naturally air dried before the hyper-spectral measurements. The soil samples were placed inside a circular black capsule with a diameter of 10 cm and depth of 1 cm, and were leveled to a smooth surface with the edge of a spatula. In the present laboratory experiments, the reflectance was standardized between measurements using a white Spectralon reference panel [32]. The visible and near-infrared spectra of the soil samples were converted to spectral reflectance by dividing them by the spectrum of the Spectralon reference panel. First derivative transformations are known to minimize the sample variations caused by changes in the grinding and optical settings [21]. To limit the noise in the first derivative spectra, we applied Savitzky–Golay smoothing [33] to the original reflectance spectra. In addition, like most hyper-spectral prediction models of soil properties, this study takes the first derivative of the spectral reflectance as the independent variable. Prior to the data analysis, six soil samples that yielded negative hyper-spectral data were discarded from our analysis.
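The smoothing and derivative steps described above can be sketched as follows. This is only an illustration: the source does not state the Savitzky–Golay window length or polynomial order, so the 5-point quadratic filter, the central-difference derivative, and the function names `savgol_smooth5`, `first_derivative`, and `preprocess` are all assumptions made here.

```python
def savgol_smooth5(values):
    """Smooth with a 5-point quadratic Savitzky-Golay filter.

    The window length (5) and polynomial order (2) are illustrative
    assumptions; the two points at each end are returned unchanged.
    """
    coeffs = (-3, 12, 17, 12, -3)  # standard 5-point quadratic weights (sum = 35)
    smoothed = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        smoothed[i] = sum(c * v for c, v in zip(coeffs, window)) / 35
    return smoothed


def first_derivative(wavelengths, reflectance):
    """Central-difference first derivative at the interior wavelengths."""
    return [
        (reflectance[i + 1] - reflectance[i - 1])
        / (wavelengths[i + 1] - wavelengths[i - 1])
        for i in range(1, len(reflectance) - 1)
    ]


def preprocess(wavelengths, reflectance):
    """Smooth the reflectance curve first, then differentiate it."""
    return first_derivative(wavelengths, savgol_smooth5(reflectance))
```

A quadratic filter of this kind leaves locally quadratic spectral shapes unchanged, so it suppresses high-frequency noise without flattening broad absorption features.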
We classified the original soil samples prior to data analysis. Previous studies indicate a definite masking effect between soil properties. For example, Rossel’s studies [34] have confirmed that the masking of other constituents by organic matter in hyper-spectral reflectance is weakened at soil organic matter contents below 10 g/kg. Therefore, classifying the soil samples is an essential preliminary step for avoiding the masking effect between soil properties. The soil samples were selected on the condition that the soil property to be predicted varied significantly (with a greater coefficient of variation (CV)), whereas the other properties changed only slightly (with lower CV values). In this study, we grouped the soil samples with soil organic matter contents below 10 g/kg into the A group, which was used as the dataset for estimating CEC. The soil samples with CEC values below 20 cmol/kg were placed in the B group, which was used as the dataset for estimating SOM.
The soil samples were randomly selected from groups A and B at a 7:3 ratio to get calibration and validation data sets, respectively. The A group dataset (69 soil samples) was split into 49 randomly selected samples for calibration, and the remaining 20 soil samples were reserved for validation. Similarly, the B group dataset (114 soil samples) was split into 90 randomly selected samples for calibration, and 34 samples for validation. The soil properties in the soil samples are statistically described in Table 2.
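As a rough illustration of the grouping criterion and the random 7:3 split, the sketch below computes a coefficient of variation and divides a list of sample identifiers into calibration and validation subsets. The function names, the fixed seed, and the choice to round the calibration size up are assumptions for this example; the source does not describe how the split was implemented.

```python
import math
import random
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


def split_calibration_validation(sample_ids, calibration_fraction=0.7, seed=0):
    """Randomly split sample identifiers into calibration and validation sets.

    Rounding the calibration size up is an assumption made here.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_cal = math.ceil(calibration_fraction * len(shuffled))
    return shuffled[:n_cal], shuffled[n_cal:]
```

With 69 identifiers, as in group A, this rounding yields 49 calibration and 20 validation samples.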
2.2. Discrete Wavelet Transform (DWT)
Wavelet analysis theory has been extensively reported in previous literature, so it is only briefly introduced here. DWT is ideally suited to spectral feature extraction, most fundamentally because it performs a multi-scale analysis of the signal [35]. Mathematically, the DWT can be expressed as the inner product of a finite-length sequence and a discrete wavelet basis, where each inner product is a DWT coefficient [13]. The DWT coefficients are expressed as follows:

$$W_{j,k} = \langle f(t), \psi_{j,k}(t) \rangle = \sum_{t=0}^{n-1} f(t)\, \psi_{j,k}^{*}(t) \qquad (1)$$

where $W_{j,k}$ is the value of the DWT coefficient and $f(t)$ is a sequence of length $n$. The discrete wavelet basis is given by

$$\psi_{j,k}(t) = s_0^{-j/2}\, \psi\left(s_0^{-j} t - k \tau_0\right) \qquad (2)$$

where $s_0^{j}$ and $k \tau_0$ correspond to the discrete wavelet scale and the translation parameters, respectively, and the superscript * denotes the complex conjugate. For a binary wavelet, $s_0$ = 2. To separate the detailed information from the approximate information at each scale, the hyper-spectral signal is projected onto a wavelet function. Unlike the Fourier transform, whose basis functions are fixed, the DWT can be carried out with many different wavelet functions, which differ considerably in waveform, support length, and regularity. Processing the same signal with different wavelet functions often yields markedly different results, which inevitably affects the final processing results. Therefore, the wavelet functions must be appropriately chosen for hyper-spectral pre-processing. The basic properties of the different wavelet functions available for DWT are given in Table 3.
Based on some previous studies [7,15], we selected seven wavelet functions (Haar, Bior1.3, Bior2.4, Db4, Db8, Sym4, and Sym8), and explored their effects on the accuracy of the prediction model. Here, the DWT was performed with the Wavelet Toolbox in MATLAB Release 12b (Matrix Laboratory, MathWorks, Natick, MA, USA).
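To make the multi-level decomposition concrete, the following minimal sketch implements the DWT for the Haar wavelet only, the simplest of the seven functions listed above. It is not the MATLAB Wavelet Toolbox routine used in the study; the function names `haar_step` and `haar_dwt` are hypothetical, and the other wavelets (Bior, Db, Sym) would need longer filters and boundary handling that are omitted here.

```python
import math


def haar_step(signal):
    """One level of the Haar DWT: pairwise scaled sums and differences."""
    scale = 1 / math.sqrt(2)  # orthonormal scaling keeps the signal energy
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * scale for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * scale for i in pairs]
    return approx, detail


def haar_dwt(signal, levels):
    """Multi-level Haar DWT.

    Returns two lists indexed by level - 1: the approximation
    coefficients and the detail coefficients at each level.
    """
    approximations, details = [], []
    current = list(signal)
    for _ in range(levels):
        if len(current) < 2 or len(current) % 2:
            raise ValueError("signal length must be divisible by 2**levels")
        current, detail = haar_step(current)
        approximations.append(current)
        details.append(detail)
    return approximations, details
```

Each level halves the number of coefficients, so a spectrum of 2048 points decomposed to eight levels leaves 8 approximation coefficients at the deepest level, which is how the transform reduces the data dimension.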
2.3. Stepwise-Partial Regression Analysis
To eliminate the redundancy and noise in the hyper-spectral data, this study improves the partial least-squares fitting. The error terms in the partial least-squares regression model are not normally distributed, and their exact distribution is particularly difficult to discern. Therefore, the parameter significances, which determine the choice of variables, cannot be tested by evaluating the parameter statistics. Here we improve the partial least-squares model by introducing a variable selection method, in the manner of MSLR, that is based on the fitting error. We refer to this method as Stepwise-PLS. The idea behind the improvement is outlined in Figure 1.
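In this improved algorithm, all of the original variables are used in the PLS model fitting, and the root-mean-square deviation (RMSE) is calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2} \qquad (3)$$

where $y_i$ and $\hat{y}_i$ are the actual and forecasted values, respectively, and $N$ is the number of samples.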
The RMSE calculated by Equation (3) is set to E0. Next, each of the original variables is removed in turn, and the remaining variables are reinserted into the PLS model. The RMSE is calculated in each instance, and the minimum of these values is set to E (E = min(RMSE)) and compared with E0. If E < E0, the variable corresponding to the minimum error is removed, on the presumption that its removal improves the model precision. Conversely, if E > E0, then all of the unfavorable variables have been rejected, and all remaining variables are deemed suitable for the PLS model. This procedure was coded and performed with the Statistics Toolbox in MATLAB Release 12b (Matrix Laboratory, MathWorks, Natick, MA, USA).
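The elimination loop described above can be sketched in a few lines. This is a hedged illustration rather than the authors' MATLAB code: the single-response NIPALS PLS fit, the fixed number of components, the use of the calibration-set RMSE as the fitting error, and the repetition of the removal step with E0 updated after each removal are all assumptions, and the names `pls1_fit_rmse` and `stepwise_pls` are hypothetical.

```python
import math


def pls1_fit_rmse(X, y, n_components=2):
    """RMSE of a single-response PLS (NIPALS) fit on the calibration data."""
    n, m = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(m)]
    E = [[row[j] - col_means[j] for j in range(m)] for row in X]  # centred X
    y_mean = sum(y) / n
    f = [v - y_mean for v in y]  # centred y, deflated component by component
    fitted = [y_mean] * n
    for _ in range(min(n_components, m)):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        if tt < 1e-12:
            break
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        for i in range(n):
            for j in range(m):
                E[i][j] -= t[i] * p[j]  # deflate X
            f[i] -= q * t[i]  # deflate y
            fitted[i] += q * t[i]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, fitted)) / n)


def stepwise_pls(X, y, n_components=2):
    """Backward elimination: drop a variable while doing so lowers the RMSE."""
    keep = list(range(len(X[0])))

    def rmse_for(cols):
        return pls1_fit_rmse([[row[j] for j in cols] for row in X], y, n_components)

    e0 = rmse_for(keep)  # E0: error with all variables
    while len(keep) > 1:
        # E: smallest error obtained by leaving out one variable at a time.
        e, drop = min((rmse_for([c for c in keep if c != d]), d) for d in keep)
        if e >= e0:
            break  # no removal improves the fit; keep the remaining variables
        keep.remove(drop)
        e0 = e
    return keep, e0
```

Because each pass refits the model once per remaining variable, the cost grows roughly quadratically with the number of variables, which is one practical reason to run the selection on wavelet coefficients rather than on the full spectrum.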
2.4. Regression Analysis Based on Discrete Wavelet Transform with Stepwise-Partial Least-Squares
To reduce the effects of the edge noise, only the hyper-spectral measurements between 390 and 2437 nm were used. The first step computes the approximation and detailed coefficients of the wavelet decomposition, which are used in the Stepwise-PLS regression analysis. Each soil reflectance spectrum was subjected to eight levels of DWT. Combining the detailed and approximate signals gives more complete coverage of the spectral information than the approximate or detailed signals alone. The approximate coefficients are severely distorted when the decomposition level is too high; conversely, when the decomposition level is too low, the information in the approximate coefficients is too redundant. After several trials, level 4 was selected for the approximate signal. Hence, after obtaining the detailed coefficients at levels 3, 4, 5, 6, 7, and 8, the original reflectance spectrum was replaced with the approximation coefficients at level 4. The spectral signal was decomposed with the different wavelet functions to level 6, and single branches were reconstructed from the approximated DWT coefficients at level 4, as well as from the detailed DWT coefficients at level 6 (Figure 2). The approximate coefficients were nearly independent of the level, but the signals of the detailed coefficient reconstruction were influenced by the different wavelet functions. Therefore, one must try different wavelet functions in the prediction model and find the best wavelet function, as well as its appropriate decomposition level. Meanwhile, to compare the stepwise and standard PLS methods, we regressed the wavelet coefficients of the various wavelet functions and their decomposition levels using the two PLS methods. When performed by PLS and Stepwise-PLS, these approaches were named DWT–PLS and DWT–Stepwise-PLS, respectively.
With reference to previous studies, the predictive ability of the DWT–PLS and DWT–Stepwise-PLS methods was assessed by four indicators: the coefficient of determination $R^2$, the $p$ value, the root-mean-square deviation (RMSE), and the relative percentage deviation (RPD). $R^2$ represented the calibration measures of the models, and $R_v^2$, RMSE, and RPD represented the validation measures of the models. Along with the $p$ value, $R^2$ also evaluates the degree of correlation between the predicted and true values. As the RMSE is a parameter in the Stepwise-PLS model, it was excluded as an evaluation indicator of the model’s calibration results, but was added as an effective evaluation indicator of the model’s validation results. The RMSE and RPD quantify the error between the predicted and true values. Mathematically, RPD is the ratio of the sample standard deviation (StD) to the RMSE (StD/RMSE). According to related research [36], models with RPD values below 1.0 are very poor predictive performers, and their use is not recommended. Models with RPD values between 1.0 and 1.4 are also poor performers, and can distinguish only high and low values. When the RPD value is between 1.4 and 1.8, the model is a fair performer, and is useful for assessment and correlation. Models with RPD values between 1.8 and 2.0 are sufficiently accurate for quantitative predictions, and those with RPD values between 2.0 and 2.5 are high-quality quantitative predictors. Any model with an RPD value above 2.5 makes excellent predictions.
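The validation indicators and the RPD classes listed above can be computed as in the following sketch. Several details are assumptions for illustration: $R^2$ is taken as one minus the ratio of residual to total sum of squares, the standard deviation uses the n − 1 denominator, each class includes its lower bound, and the short class labels and function names are shorthand introduced here.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root-mean-square deviation between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination, taken here as 1 - SS_res / SS_tot."""
    mean = statistics.mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rpd(actual, predicted):
    """Ratio of the sample standard deviation (StD) to the RMSE."""
    return statistics.stdev(actual) / rmse(actual, predicted)


def rpd_category(value):
    """Map an RPD value to the classes given in the text (labels are shorthand)."""
    if value < 1.0:
        return "very poor"
    if value < 1.4:
        return "poor"
    if value < 1.8:
        return "fair"
    if value < 2.0:
        return "good"
    if value <= 2.5:
        return "very good"
    return "excellent"
```

For example, `rpd_category(rpd(measured, predicted))` gives the class of a validation set, where `measured` and `predicted` are equal-length lists of laboratory and model values.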
Furthermore, to compare the DWT–PLS and DWT–Stepwise-PLS models directly, the difference between the $R^2$ values of DWT–PLS and DWT–Stepwise-PLS was used to compare the model efficiencies, and was defined as the $R^2$ D-value. Additionally, the difference between the RMSE of DWT–PLS and DWT–Stepwise-PLS was defined as the RMSE D-value. Some statistical parameters of RPD for the DWT–PLS and DWT–Stepwise-PLS models were also calculated: the maximum, minimum, and average values of RPD at different decomposition levels of the same wavelet function; and the difference (D-value) of RPD between DWT–Stepwise-PLS and DWT–PLS.
4. Conclusions
This study compared the appropriateness of inserting different wavelet functions with different decomposition levels in the discrete wavelet transform with partial least-squares (DWT–PLS) method into the conventional visible and near infrared reflectance analysis method for estimating soil properties. The reliability and accuracy of the soil properties estimated by discrete wavelet transform-based visible and near infrared reflectance analysis was enhanced by an improved partial least-squares method called Stepwise-PLS. The main conclusions of this study are summarized below:
(1) In a feasibility study, the discrete wavelet transform with partial least-squares method was applied to the quantitative analysis of soil properties. Varying the wavelet functions and their decomposition levels, we found that the discrete wavelet transform with partial least-squares estimation of cation exchange capacity benefitted most from the fifth, sixth, and seventh levels of the Haar wavelet function, which maximized the R2 and RPD values. However, no wavelet function showed a clear advantage over the other functions in the discrete wavelet transform with partial least-squares estimation of soil organic matter.
(2) The discrete wavelet transform with stepwise-partial least-squares method, with various wavelet functions and decomposition levels, effectively improved the prediction accuracy of the cation exchange capacity and soil organic matter estimated by the Vis-NIRA method. Further analysis of the results confirmed that the appropriateness of the discrete wavelet transform with stepwise-partial least-squares method is not significantly related to the RPD values of the predicted soil properties, but improves at the fifth, sixth, and seventh levels of the DWT. Therefore, the Stepwise-PLS method is more likely to give poorer results at lower and higher decomposition levels than at intermediate to moderately high levels.