Hyperspectral Estimation of Soil Organic Matter Content using Different Spectral Preprocessing Techniques and PLSR Method

Abstract: Soil organic matter (SOM) is the main source of soil nutrients, which are essential for the growth and development of agricultural crops. Hyperspectral remote sensing is one of the most efficient ways of estimating the SOM content. Visible, near infrared, and mid-infrared reflectance spectroscopy, combined with the partial least squares regression (PLSR) method, is considered to be an effective way of determining soil properties. In this study, we used 54 different spectral pretreatments to preprocess soil spectral data. These pretreatments were composed of three denoising methods, six data transformations, and three dimensionality reduction methods. The three denoising methods were no denoising (ND), Savitzky-Golay denoising (SGD), and wavelet packet denoising (WPD). The six data transformations were the original spectral data, R; reciprocal, 1/R; logarithmic, log(R); reciprocal logarithmic, log(1/R); first derivative, R'; and first derivative of the reciprocal, (1/R)'. The three dimensionality reduction methods were no dimensionality reduction (NDR), sensitive waveband dimensionality reduction (SWDR), and principal component analysis (PCA) dimensionality reduction (PCADR). The processed spectra were then employed to construct PLSR models for predicting the SOM content. The main results were as follows: (1) The WPD-R' and WPD-(1/R)' data showed stronger correlations with the SOM content. Furthermore, these methods could effectively limit the correlation between adjacent bands and, thus, prevent "overfitting". (2) Of the 54 pretreatments investigated, WPD-(1/R)'-PCADR yielded the model with the highest accuracy and stability. (3) For the same denoising method and spectral transformation, the accuracy of the SOM content estimation model based on SWDR was higher than that of the model based on NDR, and the accuracy with PCADR was higher than that with SWDR. (4) Dimensionality reduction was effective in preventing data overfitting. (5) The quality of the spectral data and the accuracy of the SOM content estimation model could be improved effectively by using appropriate preprocessing methods (in this study, the combination of WPD and PCADR).


Introduction
Soil organic matter (SOM) is an important indicator of soil fertility [1]. It is rich in a variety of organic acids and humic acids, which have a certain ability to dissolve soil mineral and can promote the absorption of nutrients. SOM not only provides the nutrients necessary for crops and improves soil physical structure but also helps with water and fertilizer retention [2]. Therefore, it is of great significance to be able to estimate the SOM content rapidly and accurately, in order to increase grain production and aid the sustainable development of agriculture. However, conventional SOM content estimation methods are costly, time consuming, and laborious and, thus, do not meet the needs of the current production management. Fortunately, the development of hyperspectral remote sensing technology has yielded several new methods for soil analysis. One can readily obtain large amounts of remote sensing data through satellites, radar and field and laboratory spectrometers. Soil spectral information has been used to infer soil properties by many researchers [3][4][5][6][7][8]. In addition, modeling methods are increasingly being used in SOM hyperspectral analysis, and the accuracy of most of these models is high. Some researchers used ground spectral information to analyze soil properties. Alexakis et al. estimated the soil parameters related to soil erosion using integrated satellite remote sensing data, artificial neural networks, and field spectroscopy and GIS data [3]. Kawamura et al. suggested that the soil oxalate-extractable P content can be predicted using visible-near infrared (NIR) spectroscopy [4]. They also stated that the genetic algorithm-partial least squares (GA-PLS) regression method is suitable to find the optimal bands for the PLS regression, contributing to a better predictive ability. Gholizadeh et al. 
evaluated the potential of the new datamining engine, PARACUDA-II, by comparing its performance in predicting the content of oxidizable carbon in soil, against that of other common datamining algorithms [5]. Kopáčková et al. evaluated the performance of a datamining engine (PARACUDA) in predicting various soil attributes, using reflectance data corresponding to the visible and thermal infrared regions [6]. Rossel et al. assessed various soil properties simultaneously, by using visible, NIR, mid-infrared (MIR), and combined diffuse reflectance spectroscopies [7]. Others have analyzed soil properties using satellite hyperspectral imagery data. Peón et al. predicted the organic carbon content of topsoil using airborne and satellite hyperspectral imagery [8]. Soil spectral information is not only related to the characteristics of the chemical components of the soil, such as its SOM content, iron oxide content, and soil moisture but also to the physical properties of the soil such as its particle size, density, and surface roughness [9]. Due to the limitations of the measuring instruments, methods, and environment, the spectral reflectance data of soil usually contain noise. Therefore, researchers have used some types of data preprocessing methods as they analyzed soil spectral information. Liu et al. applied several spectral data pretreatments during sample selection to construct models for predicting the SOM content using visible and NIR spectroscopy [10]. Zhang et al. constructed a SOM estimation model based on the PLS regression (PLSR) method, using neural networks and spectral data subjected to four transformations (first-order differential, FDR; second-order differential, SDR; continuum removal, CR; continuous wavelet transform, CWT) [11]. Vohland et al. used different methods to select the spectral variables for improving model accuracy and assessing the indicators of arable soil quality [12]. 
Moreover, hyperspectral data are characterized by a large number of bands, large data volumes, and data redundancy. These factors increase both the workload and the complexity of data processing and modeling. Therefore, to improve the accuracy of the models, it is very important to select appropriate preprocessing methods for the hyperspectral data before modeling, such as denoising, dimensionality reduction, and data transformation. Most of the studies described above performed some type of data preprocessing before modeling. There are three main kinds of spectral data preprocessing methods: denoising, data transformation, and dimensionality reduction. Denoising methods include wavelet packet denoising, SG filtering denoising, etc. Data transformations include 1/R, R', etc. [10]. Dimensionality reduction methods include PCA dimensionality reduction, continuum removal, etc. It has been shown that denoising can reduce the noise in the spectral data [11]. Some data transformations, especially the first derivative, can improve the correlation between some bands and soil or vegetation parameters [11], and dimensionality reduction can reduce data redundancy [13]. However, when assessing the SOM content based on soil spectra, different researchers prepare soil samples in different ways, such as grinding or sieving the soil. Furthermore, the pretreatments performed on the spectral data also differ, making it hard to compare the obtained results. The pretreatment methods (e.g., SG) selected in this paper are common and are considered to be "effective" by many researchers. Combinations of these single pretreatment methods were compared and used for spectral data processing.
The aims of this study were as follows: (i) to evaluate the practicability of using visible, NIR, and MIR spectroscopy for assessing the SOM content; (ii) to elucidate the correlation between the SOM content and the differently processed soil spectra; (iii) to predict the SOM content based on spectral data subjected to different pretreatments, compare the SOM estimation results for the different spectral data pretreatments, and select the most effective preprocessing method for predicting the SOM content based on the PLSR approach; and (iv) to explore whether denoising can reduce the correlation between adjacent bands. The overall aim was to improve the accuracy of hyperspectral SOM estimation approaches.

Study Area and Soil Sample Collection
The research area was Yitong County, Jilin Province, China (Figure 1). This county is located in the south-central part of Jilin Province, between 124°49′ and 125°46′ E and between 43°3′ and 43°38′ N. The samples were collected from 21 April to 23 April 2017. The soil sampling points were laid out on a 1 × 1 km grid (Figure 1c), and the sampling depth was 0−5 cm. Only one sample was collected per grid cell. All samples were taken from corn-cultivated land. The survey area falls in the black soil region, and the soil types included meadow soil, black soil, white soil, and paddy soil, according to the Chinese genetic soil classification system. A collector of diameter 10 cm and length 5 cm was used to vertically remove undisturbed soil samples at the collection points. The extracted samples were placed in large aluminum boxes of diameter 10 cm and length 5 cm, such that their original structure was maintained for the indoor spectral measurements. A total of 213 soil samples were collected. After the spectral measurements, we analyzed the SOM content (%) of the soil samples in the laboratory using the Walkley-Black method [14]. The conversion factor used to calculate the SOM content from the soil organic carbon (SOC) content was 1.724.

Spectral Measurements
An Analytical Spectral Devices (ASD) FieldSpec 4 High-Res spectroradiometer was used to perform the indoor spectral measurements on the undisturbed soil samples, which were stored in aluminum boxes and were not pretreated (Figure 2). The surface of the soil samples was also not processed. The spectral measurements were performed on the same day as the soil sample collection. The ASD spectroradiometer, whose spectral resolution was 3 nm at 700 nm and 8 nm at 1400/2100 nm, had a wavelength range of 350−2500 nm. During the measurements, a 50 W halogen lamp was placed beside the soil sample, such that the incident angle of the light source was 60° (zenith angle of 30°), and the distance between the lamp and the soil sample was 30 cm. The aluminum box was surrounded by black flannel. The probe was placed 15 cm vertically above the soil sample. Each soil sample was measured four times: the aluminum box was rotated by 90° after each measurement, for a total of three rotations. A total of 10 spectral curves were collected automatically during each measurement, and the arithmetic average of the curves was used as the spectral data. Standard whiteboard calibration was performed before each measurement. The ViewSpec Pro™ software, produced by ASD, was used to correct the breakpoints of the original indoor spectral data (GAP window of 5 × 5) and obtain the average of the spectra. The spectral resolution was set to 1 nm, for a total of 2151 bands. The noise caused by the instability of the equipment at 350−400 nm was removed. In order to reduce data redundancy, the indoor spectral data were resampled in the same manner as HyMap airborne hyperspectral images (acquired between 30 April and 1 May 2017) provided by a second-level project unit (the spectral resolution for 400−905 nm was 15 nm, while that for 880−2500 nm was 18 nm). After resampling, each spectrum had 135 bands. A total of 213 indoor spectral curves were obtained.

Description of Sample Set
It has been reported that the spectral reflectance of soil decreases with an increase in its SOM content [15,16]. In this study, in view of the sample quality, 15 samples with abnormal data in which the soil surface did not strictly maintain the original shape, were not included. Additionally, their spectral curves were significantly different from the others. The remaining 198 samples were grouped into two categories in a ratio of 4:1, which were used to develop and validate the model. The 198 samples were sorted in the ascending order of their SOM content. Starting from the fifth sample, one sample was selected every four samples and allocated to the validation dataset, which contained a total of 40 samples. The remaining 158 samples were used as the training dataset for the model. The statistical information regarding the SOM contents of the various datasets is shown in Table 1.
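The sorted, systematic split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the `step` and `first` parameters are hypothetical, and the exact offset the authors used is not given. Note that taking every fifth sample starting from the fifth position yields 39 validation samples out of 198, not the reported 40, so the original rule must differ slightly from this reading.

```python
def split_by_sorted_som(som_values, step=5, first=4):
    """Split sample indices into training and validation sets.

    Samples are ranked by SOM content (ascending). Every `step`-th ranked
    sample, starting at 0-based rank `first`, goes to the validation set.
    Both parameters are illustrative assumptions, not values from the paper.
    """
    # Indices of the samples, ordered by increasing SOM content.
    order = sorted(range(len(som_values)), key=lambda i: som_values[i])
    valid = order[first::step]
    valid_set = set(valid)
    train = [i for i in order if i not in valid_set]
    return train, valid


# Small usage example with made-up SOM values (%).
som = [3.0, 1.0, 2.0, 5.0, 4.0, 6.0, 0.5, 7.0, 8.0, 9.0]
train_idx, valid_idx = split_by_sorted_som(som)
```

Because the validation samples are spread evenly over the ranked SOM values, both subsets cover roughly the same SOM range, which is the purpose of sorting before splitting.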

Preprocessing Methods
In this study, the original soil spectral data were preprocessed using different methods (i.e., different denoising methods, different data transformations, and different dimensionality reduction approaches), in order to obtain an accurate model for the SOM content estimation based on the PLSR method. Three different denoising methods were used, namely, no denoising (ND), Savitzky-Golay denoising (SGD) [17], and wavelet packet denoising (WPD). Further, six different data transformations were employed. These included using the original spectral data, R; its logarithm, log(R); its first derivative, R'; its reciprocal, 1/R; the logarithm of its reciprocal, log (1/R), and the first derivative of its reciprocal, (1/R)'. Finally, three different dimensionality reduction approaches were used. These were, no dimensionality reduction (NDR), sensitive waveband dimensionality reduction (SWDR), and principal component analysis (PCA) dimensionality reduction (PCADR) (see Table 2). All the algorithms involved were implemented in the Python programming language (version 3.7).
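The count of 54 pretreatments follows from combining the three denoising methods, six transformations, and three dimensionality reduction approaches. A minimal sketch that enumerates the combinations is given below; the label format (for example `WPD-(1/R)'-PCADR`) follows the naming used in the text, while the variable and function names are hypothetical.

```python
from itertools import product

DENOISING = ["ND", "SGD", "WPD"]
TRANSFORMS = ["R", "1/R", "log(R)", "log(1/R)", "R'", "(1/R)'"]
REDUCTIONS = ["NDR", "SWDR", "PCADR"]


def pretreatment_labels():
    """Return one label per denoising/transformation/reduction combination."""
    return ["-".join(combo) for combo in product(DENOISING, TRANSFORMS, REDUCTIONS)]


labels = pretreatment_labels()
print(len(labels))  # 3 * 6 * 3 combinations
```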

Savitzky-Golay Denoising
The SG smoothing filter is a popular filter for pretreating soil spectra [18]. It is a low-pass filter that smooths the spectra by attenuating high-frequency noise while allowing the low-frequency signal to pass [19]. SG filtering is based on the local least-squares fitting of a polynomial to the curve and is applied as a weighted average over a moving window. However, its weighting coefficients are not constant across the window; they are obtained from the least-squares fit of a polynomial of a given order within the sliding window [20]. Its underlying idea is to make the reconstructed curve approximate the upper envelope of the original curve, gradually, through iteration [21]. Smooth filtering-based denoising using the SG method could improve the smoothness of the spectrum and reduce noise interference. The expression for SG filtering is as follows:

$$Y_j^{*} = \frac{\sum_{i=-m}^{m} C_i \, Y_{j+i}}{N}$$

where $Y_j^{*}$ is the smoothed spectral value, $C_i$ is the filter coefficient for the $i$th point in the window, $Y$ is the original spectral data, $N$ is the number of datapoints in the sliding window ($N = 2m+1$), and $2m+1$ is the window width. In practical applications, SG filtering requires two parameters: the filter window width and the order of the polynomial used for the smooth fitting process. The filter window width affects the smoothing results, in that the wider the window, the smoother the resulting spectrum. The order of the fitting polynomial also affects the filtering results [22]. For a given window width, a lower order gives a smoother result, whereas a higher order follows the data more closely and retains more of the fine detail and noise. In this study, the size of the filter window was set to 21, and the order of the fitting polynomial was taken to be 2.
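As an illustration of the moving-window least-squares idea, the sketch below computes SG weights directly with NumPy and applies them to a spectrum, using the window width (21) and polynomial order (2) stated above. It is a minimal sketch, not the authors' implementation: the function names are hypothetical, the edge handling (repeating the end values) is an assumption, and the iterative envelope-fitting variant mentioned in the text is not included. The weights returned here already include the normalisation by the window length.

```python
import numpy as np


def savgol_coefficients(window=21, order=2):
    """Weights that give the least-squares polynomial fit at the window centre."""
    m = window // 2
    offsets = np.arange(-m, m + 1)
    # Columns are offsets**0, offsets**1, ..., offsets**order.
    design = np.vander(offsets, order + 1, increasing=True)
    # Row 0 of the pseudo-inverse maps the window values to the constant term
    # of the fitted polynomial, i.e. the fitted value at offset 0.
    return np.linalg.pinv(design)[0]


def savgol_smooth(spectrum, window=21, order=2):
    """Smooth a 1-D spectrum; ends are padded by repeating the edge values."""
    spectrum = np.asarray(spectrum, dtype=float)
    weights = savgol_coefficients(window, order)
    m = window // 2
    padded = np.pad(spectrum, m, mode="edge")
    # Reversing the weights turns the convolution into a sliding weighted sum.
    return np.convolve(padded, weights[::-1], mode="valid")
```

A quadratic curve is reproduced exactly away from the edges when the polynomial order is 2, which is a convenient sanity check for the weights.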

Wavelet Packet Denoising
Daubechies and others have shown that wavelet packets take into account both the high-frequency and low-frequency components of the signal and can effectively extract useful information in each frequency band. As a result, the denoising effect is strong [23,24]. When using wavelet packets to denoise a signal, the choice of the wavelet basis function and the number of decomposition levels are of particular importance. WPD decomposes the original signal into high-frequency and low-frequency signals. The high-frequency signal includes the noise information, while the low-frequency signal is an approximation of the original signal. In this study, the db2 wavelet basis function was used for a two-level wavelet packet decomposition, and the soft threshold function was used to denoise the high-frequency signal node, d, in the leaf layer after signal decomposition. The signal was then reconstructed after threshold denoising. The threshold determination formula was as follows:

$$thr = \sigma \sqrt{2 \ln N}$$

where σ is the median of the absolute values of all the coefficients in the high-frequency signal, d, divided by 0.6745, and N is the number of datapoints in d. The threshold proposed by Donoho was considered to be the maximum noise value. Further, thr/2 was taken to be the denoising threshold for the noise signal in this study.
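The thresholding step can be sketched as follows, assuming the threshold attributed to Donoho is the usual universal threshold, σ·sqrt(2 ln N), with σ estimated as the median absolute coefficient divided by 0.6745, and that half of this value is applied with a soft threshold as stated above. The function names are hypothetical, and the wavelet packet decomposition and reconstruction themselves (db2, two levels) are not shown; they would come from a wavelet library.

```python
import numpy as np


def universal_threshold(detail):
    """sigma * sqrt(2 ln N), with sigma = median(|d|) / 0.6745."""
    detail = np.asarray(detail, dtype=float)
    sigma = np.median(np.abs(detail)) / 0.6745
    return sigma * np.sqrt(2.0 * np.log(detail.size))


def soft_threshold(detail, thr):
    """Shrink coefficients towards zero by thr; smaller ones become zero."""
    detail = np.asarray(detail, dtype=float)
    return np.sign(detail) * np.maximum(np.abs(detail) - thr, 0.0)


def denoise_detail(detail, scale=0.5):
    """Apply a soft threshold at scale * universal threshold (thr/2 by default)."""
    return soft_threshold(detail, scale * universal_threshold(detail))
```

Soft thresholding shrinks every retained coefficient by the threshold, which avoids the discontinuity that hard thresholding introduces at the cut-off.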

Mathematical Transformations of Spectral Reflectance Data
As mentioned above, six different mathematical transformations were performed on the spectral data: R, 1/R, log(R), log(1/R), R', and (1/R)'. Since the spectrometer collected discrete data, R' was calculated numerically as a finite-difference approximation of the derivative with respect to wavelength.
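The six transformations can be sketched with NumPy as shown below. Two choices here are assumptions rather than details taken from the paper: the logarithm is taken to base 10, and the first derivative is approximated with `numpy.gradient` (central differences in the interior, one-sided differences at the two ends), since the exact difference formula used by the authors is not available in this text. The function name is hypothetical.

```python
import numpy as np


def transform_spectra(reflectance, wavelengths):
    """Return the six transformed versions of a (samples x bands) array."""
    R = np.asarray(reflectance, dtype=float)
    wl = np.asarray(wavelengths, dtype=float)
    inv = 1.0 / R
    return {
        "R": R,
        "1/R": inv,
        "log(R)": np.log10(R),        # base 10 is an assumption
        "log(1/R)": np.log10(inv),
        # Finite-difference derivative along the band axis (assumed form).
        "R'": np.gradient(R, wl, axis=-1),
        "(1/R)'": np.gradient(inv, wl, axis=-1),
    }


# Usage with one made-up spectrum sampled every 15 nm.
spectra = transform_spectra([[0.1, 0.2, 0.4, 0.8]], [400, 415, 430, 445])
```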

PCA Dimensionality Reduction
PCA is a commonly used dimensionality reduction method that has been employed widely in hyperspectral remote sensing. Extracting meaningful features (or components) from multidimensional data is typically done using the canonical PCA [25,26]. The purpose of the PCA transformation is to determine the set of the optimal unit orthogonal vector bases (i.e., principal components) through a linear transformation, and to minimize the error of the mean square deviation of the original sample through a linear combination [13]. In PCA, data were transformed from the original coordinate system to a new one, and the choice of the new coordinate system was determined by the data itself. The first new coordinate axis was along the direction with the largest variance in the original data, while the second new coordinate axis was along the direction with the largest variance orthogonal to the first coordinate axis, and so on. Most variances were accounted for in the first few new coordinate axes; finally, one had to choose several coordinate axes. In other words, data were reduced to several dimensions.
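The projection onto the first few axes of largest variance can be sketched with a singular value decomposition of the mean-centred data. This is a generic PCA sketch, not the authors' code; the function name and return values are hypothetical.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project (samples x bands) data onto its first principal components.

    Returns the scores, the component vectors (one per row), the band means,
    and the fraction of total variance explained by each kept component.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    centred = X - mean
    # Rows of vt are orthonormal directions, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:n_components]
    scores = centred @ components.T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, components, mean, explained[:n_components]
```

When a validation set is used, the band means and components obtained from the training spectra would normally be reused to project the validation spectra, so that both sets share the same coordinate system.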

Sensitive Band Dimensionality Reduction
The original spectra were first subjected to denoising and then to a data transformation before being processed for dimensionality reduction. The correlation between the preprocessed spectral data and the SOM content was determined, and the wavebands for which the coefficient of determination, $r^2$ (the square of the correlation coefficient, $r$), was greater than or equal to 0.25 were selected as the sensitive wavebands for each set of preprocessed spectra. The expression for calculating the correlation coefficient, $r$, was as follows:

$$r_i = \frac{\sum_{n=1}^{N}\left(R_{ni}-\bar{R}_i\right)\left(SOM_n-\overline{SOM}\right)}{\sqrt{\sum_{n=1}^{N}\left(R_{ni}-\bar{R}_i\right)^2 \sum_{n=1}^{N}\left(SOM_n-\overline{SOM}\right)^2}}$$

where $r_i$ is the correlation coefficient between the spectral data of the $i$th band and the SOM content, $R_{ni}$ is the corresponding spectral data value of the $i$th band of the $n$th sample, $\bar{R}_i$ is the average value of the corresponding spectral data of the $i$th band, $SOM_n$ is the SOM content of the $n$th sample, $\overline{SOM}$ is the average value of the SOM contents of all the samples, and $N$ is the number of samples.

PLSR is a popular technique used to correlate soil spectra with the SOM content [27,28]. The method combines the characteristics of multiple linear regression analysis, canonical correlation analysis, and principal component analysis, so that it not only provides a suitable regression model but also expresses information more comprehensively. It is underpinned by the assumption that the dependent variable can be estimated via a linear combination of explanatory variables [29]. It provides a many-to-many linear regression modeling method. In particular, when the two sets of variables are large and strongly intercorrelated while the sample size is small, as in this case, a model established by PLSR has advantages that classical regression analysis lacks. When solving a many-to-many linear regression problem, multiple linear regression tends to overfit because of the correlation between the independent variables. PLSR instead finds new variables that are linearly independent of one another and uses them in place of the original, correlated independent variables.
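The two steps described in this section, selecting sensitive bands by their squared Pearson correlation with the SOM content and fitting a PLSR model, can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the regression uses the single-response NIPALS form of PLS (PLS1), and nothing here is claimed to match the authors' implementation or their choice of the number of latent variables.

```python
import numpy as np


def sensitive_bands(X, y, r2_min=0.25):
    """Return (indices of bands with r**2 >= r2_min, Pearson r per band)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # A constant band gives 0/0 = nan and is therefore never selected.
    with np.errstate(invalid="ignore", divide="ignore"):
        r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        keep = np.flatnonzero(r ** 2 >= r2_min)
    return keep, r


def pls1_fit(X, y, n_components):
    """PLS regression for one response (NIPALS); returns (coef, intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                      # direction of maximum covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                 # nothing left that covaries with y
            break
        w = w / norm
        t = E @ w                        # latent variable (score)
        tt = t @ t
        p = E.T @ t / tt                 # X loading
        c = (f @ t) / tt                 # y loading
        E = E - np.outer(t, p)           # deflate X
        f = f - c * t                    # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, y_mean - x_mean @ coef


def pls1_predict(X, coef, intercept):
    """Predict the response for new (samples x bands) data."""
    return np.asarray(X, dtype=float) @ coef + intercept
```

The successive scores `t` are mutually orthogonal, which is how the method sidesteps the collinearity between neighbouring bands that destabilises ordinary multiple linear regression.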

Metrics for Evaluating Model Performance
The parameters used in this study for evaluating the model performance included the coefficient of determination for the training set, $R^2_T$; the coefficient of determination for the validation set, $R^2_V$; the root-mean-square error of the training set (RMSET); the root-mean-square error of the validation set (RMSEV); and the ratio of performance to deviation (RPD). The larger the $R^2$ value, the greater the accuracy of the model. On the other hand, the RMSET and RMSEV values should be as small as possible. Furthermore, the more similar they were, the higher the estimation accuracy and stability of the model. Finally, the RPD values could generally be divided into three categories: when RPD was equal to or more than 2.0, the model was suitable for estimating the SOM content from hyperspectral data; when RPD was between 1.4 and 2.0, the reliability of the model could be improved by fine-tuning the model; and when RPD was equal to or less than 1.4, the model was unreliable [30]. The RPD was calculated as follows:

$$RPD = \frac{SD}{RMSEV}$$

where SD is the standard deviation of the SOM contents for the samples in the validation set.
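The evaluation metrics can be sketched in plain Python as shown below, assuming that RPD is the standard deviation of the validation SOM values divided by the RMSE of the validation predictions. Two details are assumptions because the text does not specify them: the sample standard deviation (n − 1 denominator) is used, and the coefficient of determination is computed as 1 − SS_res/SS_tot rather than as a squared correlation. All function names are hypothetical.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def rpd(y_true, y_pred):
    """Ratio of performance to deviation: SD of measured values / RMSE."""
    return statistics.stdev(y_true) / rmse(y_true, y_pred)


def rpd_category(value):
    """Map an RPD value to the three categories described in the text."""
    if value >= 2.0:
        return "suitable"
    if value > 1.4:
        return "improvable"
    return "unreliable"


measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 3.5, 4.5, 5.5]
print(rmse(measured, predicted), r_squared(measured, predicted), rpd(measured, predicted))
```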

Correlation between SOM Content and Reflectance Data Subjected to Different Pretreatments
Correlation analysis is a classical and reliable method for analyzing the correlativity between independent and dependent variables [31]. We performed a correlation analysis between the SOM content and the reflectance spectra subjected to different pretreatments. A total of 18 pretreatments, combining the three denoising methods (ND, SGD, and WPD) with the six spectral data transformations (R, 1/R, log(R), log(1/R), R', and (1/R)'), were used. The 18 pretreatments are listed in Table 3.

Table 3. Different combinations of pretreatments used in this study.

Data transformation    ND             SGD             WPD
R                      ND-R           SGD-R           WPD-R
1/R                    ND-1/R         SGD-1/R         WPD-1/R
log(R)                 ND-log(R)      SGD-log(R)      WPD-log(R)
log(1/R)               ND-log(1/R)    SGD-log(1/R)    WPD-log(1/R)
R'                     ND-R'          SGD-R'          WPD-R'
(1/R)'                 ND-(1/R)'      SGD-(1/R)'      WPD-(1/R)'

The degree of correlation was expressed by the Pearson coefficient, r. Figure 3 shows the changes in the correlation coefficient with the wavelength. It can be seen from Figure 3 (a), Figure 3 (b), and Figure 3 (c), that irrespective of the denoising method used (ND, SGD, or WPD), there was a strong negative correlation between the SOM content and R in the range of 400−2500 nm. The curves for the correlation coefficients were almost smooth and horizontal. Furthermore, the fact that the variations in the correlation coefficient were small meant that the correlation coefficients corresponding to the adjacent bands were similar. Although the trend in the curves for the correlation coefficient for the SOM content and log(R) was similar to that in the case of the SOM content and R, the former curves were higher over the entire investigated bandwidth range. Furthermore, the curve for the correlation coefficient for the SOM content and 1/R was almost a mirror image of the curve for the SOM content and R, exhibiting a similar trend about the x-axis in the positive quadrant. In addition, the correlation (represented by the absolute value of the correlation coefficient) between the SOM content and 1/R was stronger in the visible band but weaker in the NIR and MIR band. Next, the correlation between the SOM content and log (1/R) was basically the same as that between the SOM content and 1/R; however, the former curve was a little higher over the entire wavelength range. In contrast, the curve for the correlation between the SOM content and the first derivative of R showed significant variations over the entire wavelength region. More precisely, it exhibited approximately the same trend as the other correlation curves for wavelengths smaller than 1300 nm but then rose and fell sharply. In addition, the correlation between the SOM content and R' in the range of 400−640 nm was stronger than those between the SOM content and the other data. 
The same was also true for the correlation between the SOM content and (1/R)' in the range of 840−1300 nm. Thus, the derivative spectra were likely to provide sensitive bands in these ranges. At the same time, it can be seen from Figure 3 that, regardless of the denoising method used (ND, SGD, or WPD), the bands with high correlation (where the absolute value of the correlation coefficient was greater than 0.5) were concentrated in the range of 610−1300 nm for R, 1/R, log(R), and log(1/R). However, the bands with high correlation were concentrated in the ranges of 466−640 nm, 665 nm, and 704−767 nm for ND-R'; 716−1300 nm for ND-(1/R)'; 466−494 nm, 523−640 nm, 678 nm, 704−729 nm, and 767−780 nm for WPD-R'; and 704 nm, 730 nm, 755−1300 nm, and 1341 nm for WPD-(1/R)'.
In summary, there was a strong correlation between the SOM content and R. The log(R) transformation improved the correlation with the SOM content, the 1/R transformation improved the correlation in the visible range, and the R' and (1/R)' transformations increased the correlation for some bands, although these bands were somewhat scattered.
There were also some differences between Figure 3 (a), Figure 3 (b), and Figure 3 (c). It can be seen from Figure 3 (b) and Figure 3 (c) that the correlation curves between the SOM content and R, 1/R, log(R), and log(1/R) were basically the same after SGD or WPD. However, after SGD, the correlation curves between the SOM content and R' and (1/R)' were smoother than those after ND. Furthermore, after WPD, the correlation curves between the SOM content and R' and (1/R)' showed more variations than those after ND. Moreover, the greater the variations in the curves, the more dispersed the distribution of the bands with high correlation coefficients; this was beneficial for the subsequent dimensionality reduction operation, as it helped to effectively reduce the correlation between the bands and prevent the problem of "overfitting" caused by the strong correlation between them.

Determination of Optimal Parameter Value for PCA
Data redundancy is a disadvantage of hyperspectral data, and the key to processing hyperspectral data is the extraction of the useful information present in large datasets [32]. The PCA method was selected to reduce the dimensions of the remote sensing data used in this study. Figure 4 shows how the model accuracy varied with the number of dimensions retained after dimensionality reduction. When the dimension was 25, both the RPD value and the $R^2_V$ value of the model were maximized, and the model showed the highest accuracy. Thus, 25 was chosen as the optimal dimension for PCA-based dimensionality reduction.

Accuracy Analysis of the Hyperspectral Estimation Model of the SOM Content based on PLSR
The results of the PLSR-based SOM content estimation models for the 54 different spectral pretreatments are listed in Table 4. Those results that show clear regularities are analyzed below.

Comparison of the Modeling Results based on Original Data and Data Obtained after Effective Pretreatment
With respect to the SOM estimation model based on PLSR, the best performance was observed after the pretreatment WPD-(1/R)'-PCADR. Table 5 shows the PLSR-based estimation accuracy of the SOM content when using the untreated data and the data subjected to the pretreatment WPD-(1/R)'-PCADR. Figure 5 shows the scatterplot of the SOM content estimation results based on PLSR when unprocessed data were used [33]. The corresponding values of RMSEV, $R^2_V$, and RPD were 1.200, 0.007, and 0.400, respectively. Figure 6 shows the scatterplot of the SOM content estimation results based on PLSR after the pretreatment WPD-(1/R)'-PCADR. The corresponding values of RMSEV, $R^2_V$, and RPD were 0.280, 0.713, and 1.712, respectively. Thus, the $R^2_V$ and RPD values for the latter case (i.e., after the pretreatment WPD-(1/R)'-PCADR) were improved by 0.706 and 1.312, respectively. When the PLSR model was used on the untreated data, the estimation results for the training and validation sets were significantly different.

Comparison of Modeling Results for Different Dimensionality Reduction Methods
It was observed that, for the same denoising method and the same spectral data transformation, the models based on SWDR performed better than those based on NDR. Furthermore, the models based on PCADR performed better than those based on SWDR. Table 6 compares the SOM content estimation results for the spectral data subjected to the 1/R transformation, the WPD method, and different dimensionality reduction treatments. Figure 7 shows the scatterplot of the SOM content estimation results for WPD-1/R-NDR, Figure 8 shows the scatterplot for WPD-1/R-SWDR, and Figure 9 shows the scatterplot for WPD-1/R-PCADR. It can be seen clearly that the accuracy of the SOM estimation model based on PLSR was higher when the data were subjected to dimensionality reduction using SWDR than when NDR was used; the $R^2_V$ and RPD values obtained with SWDR were 0.380 and 0.626 higher, respectively, than those obtained with NDR. The same was true for PCADR compared with SWDR; the $R^2_V$ and RPD values obtained with PCADR were 0.181 and 0.482 higher, respectively, than those obtained with SWDR.
In addition, it could be seen that when dimensionality reduction was not performed, the model training resulted in significant overfitting, with the difference between the relative analysis errors for the training and validation sets being 0.742. On the other hand, after the dimensionality reduction of the sensitive bands, the problem of overfitting was resolved, and the difference in the errors of the two datasets reduced to 0.113. Furthermore, after dimensionality reduction using PCA, overfitting was eliminated completely, and the error difference was only 0.051. These results confirmed that dimensionality reduction could effectively prevent overfitting and improve the accuracy and stability of the model. The results for the other denoising methods and spectral data transformations were similar and hence were not included.

Comparison of Modeling Results for Different Denoising Methods
Next, some of the data subjected to different denoising methods were treated using the same spectral transformations and dimensionality reduction methods. This improved the estimation accuracy of the model (see Table 7). For example, in the case of the SWDR process, when the R spectral data were denoised using the SG method, the RPD value of the model was 0.102 higher than that without denoising. Furthermore, when the R data were denoised by the WPD method, the RPD was 0.059 higher than that without denoising. Figure 10 shows the scatterplot of the SOM content estimation results for ND-R-SWDR, Figure 11 shows the scatterplot for SGD-R-SWDR, and Figure 12 shows the scatterplot for WPD-R-SWDR.

Discussion of Different Preprocessing Techniques for Soil Hyperspectral Data
It can be seen from Table 4 that, for the same denoising method and same spectral data transformation, the dimensionality reduction performance of SWDR and PCADA were better than that of NDR. However, the estimation accuracy was relatively low when the data were subjected to any of the spectral transformations and the SWDR method, with the RPD value being mostly less than 1.4. In contrast, the estimation accuracy improved greatly when the PCADR method was used. The RPD value was basically greater than 1.4 in this case, with the R 2 V value being as high as 0.713. The low estimation accuracy of the SWDR was probably because the sensitive bands selected were not the optimal bands for soil spectrum processing. In this study, the wavebands with high correlation coefficients (r > 0.5) were selected as the sensitive bands, and most of these lay in the visible region (Figure 3). In the visible-wavelength range of 400−700 nm, the variations in the spectral reflectance of soil are closely related to the presence of SOM and minerals such as iron oxides [34]. Rossel et al. showed that the reflectance spectrum at 410 nm is related to the SOM content [11]. In addition, several studies have reported that SOM can reduce the spectral reflectance in the visible region and that the SOM content is strongly correlated to the reflectance in the 550-680 nm range [35][36][37]. Mouazen et al. found that the reflectance of soil in this wavelength range is related to the soil color, which is determined by the electronic transitions [38]. In the NIR and MIR region, the peaks at wavelengths of 1853, 1000, and 2412 nm are mainly caused by the absorbance of the O-H bonds of the free moisture in the soil, as well as the absorbance of the other O-H groups existing in the soil, such as the clay minerals [39][40][41]. Thus, the sensitive bands selected based on only the Pearson coefficient might not be truly representative of the actual reflectance data. 
Moreover, we found that having a large number of bands with r values greater than 0.5 did not result in a better model. Thus, the sensitive bands related to SOM should be selected with care in order to estimate the SOM content accurately, as this minimizes the errors caused by data redundancy.
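The correlation-based band selection discussed above can be sketched as follows. This is a minimal illustration on synthetic data, not the implementation used in this study: the function name `select_sensitive_bands`, the array shapes, and the use of the absolute value |r| (rather than r itself) for the 0.5 threshold are assumptions.

```python
import numpy as np

def select_sensitive_bands(spectra, som, threshold=0.5):
    """Return (indices, r): bands whose |Pearson r| with SOM exceeds threshold.

    spectra: (n_samples, n_bands) array; som: (n_samples,) array.
    Thresholding |r| instead of r is an assumption of this sketch.
    """
    spectra = np.asarray(spectra, dtype=float)
    som = np.asarray(som, dtype=float)
    xc = spectra - spectra.mean(axis=0)   # centre each band
    yc = som - som.mean()                 # centre the SOM values
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = (xc.T @ yc) / denom               # Pearson r, one value per band
    return np.flatnonzero(np.abs(r) > threshold), r

# Synthetic stand-in data: 60 samples, 6 bands (all values hypothetical).
rng = np.random.default_rng(0)
som = rng.uniform(1.0, 5.0, size=60)
spectra = rng.normal(0.3, 0.02, size=(60, 6))
# Band 0 is constructed so that reflectance decreases as SOM increases.
spectra[:, 0] = 0.4 - 0.05 * som + rng.normal(0.0, 0.01, size=60)

selected, r = select_sensitive_bands(spectra, som)
print("selected band indices:", selected)
print("r per band:", np.round(r, 3))
```

A threshold of this kind scores each band independently, so neighbouring bands that pass it can still carry nearly the same information, which is one way to read the observation that more selected bands did not give a better model.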
Additionally, for the same spectral transformation and dimensionality reduction methods, the SGD method offered no clear advantage. The accuracy of the SOM estimation model built on SGD data was higher than that of the corresponding ND model in some cases (e.g., SGD-R-SWDR) but lower in others (e.g., SGD-log(R)-PCADR). The SGD method did not improve the correlation between the spectra and the SOM content (Figure 3). Barnes et al. suggested that determining the appropriate smoothing window size is essential for processing spectral data [42]. The low correlation coefficients obtained for the SG-preprocessed data might be attributable to over-smoothing, which probably resulted in information loss [43].
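The effect of the smoothing window on a narrow spectral feature can be illustrated with a short sketch based on `scipy.signal.savgol_filter`. The synthetic spectrum, the window lengths (5 and 101 points), and the polynomial order (2) are illustrative assumptions and are not the settings used in this study.

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical spectrum sampled every 1 nm from 400 to 2400 nm:
# a flat baseline with one narrow absorption dip centred at 1400 nm.
wavelengths = np.arange(400.0, 2401.0, 1.0)
spectrum = 0.5 - 0.1 * np.exp(-0.5 * ((wavelengths - 1400.0) / 5.0) ** 2)

centre = int(np.argmin(np.abs(wavelengths - 1400.0)))

def dip_depth(y):
    """Depth of the absorption feature relative to the 0.5 baseline."""
    return 0.5 - y[centre]

# Same polynomial order, two different window lengths (must be odd here).
narrow = savgol_filter(spectrum, window_length=5, polyorder=2)
wide = savgol_filter(spectrum, window_length=101, polyorder=2)

depth_raw = dip_depth(spectrum)
depth_narrow = dip_depth(narrow)
depth_wide = dip_depth(wide)
print(f"raw: {depth_raw:.4f}  window 5: {depth_narrow:.4f}  window 101: {depth_wide:.4f}")
```

In this synthetic case the short window leaves the dip almost unchanged, whereas the long window flattens most of it, which is the kind of information loss that over-smoothing can cause.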
In addition, when WPD and PCADR were used together, the advantages of the derivative method were obvious. Oldham et al. and Li et al. reported that derivatives are not only a powerful tool for analyzing spectral data but also help overcome several collinearity problems [44,45]. The derivative method strongly affects the local peaks in the spectrum and can therefore be used to enhance the sensitivity of the analysis and the spectral resolution. To a certain degree, it also helps in removing noise. The first derivative (FD) and second derivative (SD) indicate the slope and the change in the slope, respectively, of the reflectance spectrum. The peak absorption of the SD spectrum is greater than that of the FD spectrum, with the reflectance value in the former being lower. Although the SD can separate a greater number of absorption peaks, it can also introduce noise and might cause errors. Both the SD and the FD lead to significant changes in the spectrum, resulting in sharp peaks. Fractional derivatives can limit the extent of these changes and ensure that the shape characteristics of the original spectrum are preserved. Thus, they are more advantageous than full derivatives (FD and SD) [32]. By extending the derivative order to non-integer values, fractional derivatives might extract more useful information from remote sensing data and add more detail to the spectra than integer-order derivatives. In this study, only the FD was used to estimate the SOM content. In the future, however, we plan to use higher-order full derivatives and fractional derivatives to preprocess the spectral data.
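The derivative transformations referred to above, R', (1/R)', and the second derivative, can be approximated numerically with finite differences. The following is a minimal sketch using `numpy.gradient` on a synthetic, smoothly increasing reflectance curve; the helper name `first_derivative`, the 10 nm sampling, and the curve itself are assumptions, and the finite-difference scheme is not necessarily the one used in this study.

```python
import numpy as np

def first_derivative(values, wavelengths):
    """Finite-difference derivative of a spectrum with respect to wavelength."""
    return np.gradient(values, wavelengths)

# Hypothetical reflectance: a quadratic in wavelength, sampled every 10 nm.
wavelengths = np.linspace(400.0, 2400.0, 201)
offset = wavelengths - 400.0
reflectance = 0.2 + 1e-4 * offset + 2e-8 * offset ** 2

fd = first_derivative(reflectance, wavelengths)         # R': slope
sd = first_derivative(fd, wavelengths)                  # R'': change in slope
rfd = first_derivative(1.0 / reflectance, wavelengths)  # (1/R)'

print("R'    at 1400 nm:", fd[100])
print("R''   at 1400 nm:", sd[100])
print("(1/R)' at 1400 nm:", rfd[100])
```

Because each differentiation amplifies point-to-point fluctuations, applying this to noisy measured spectra without prior denoising would be expected to magnify the noise, in line with the remark above on the SD.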

Conclusions
We collected 213 soil samples from Yitong County, Jilin Province, China, measured their spectral data, subjected the data to different preprocessing treatments, and subsequently used them to determine the SOM content using the PLSR method. The aim was to establish an accurate and efficient model for predicting the SOM content based on hyperspectral reflectance data. The conclusions of the study can be summarized as follows. (1) After the WPD-R' and WPD-(1/R)' pretreatments, the bands with stronger correlations with the SOM content became more dispersed; this was beneficial to the subsequent dimensionality reduction operation, as it effectively reduced the correlation between adjacent bands and prevented "overfitting." (2) In the Yitong area of Jilin Province, of the 54 pretreatments investigated in this study, WPD-(1/R)'-PCADR yielded the model with the highest accuracy for estimating the SOM content. (3) For the same denoising method and spectral data transformation, the accuracy of the PLSR-based SOM estimation model was higher with SWDR than with NDR: the R²V and RPD values increased by 0.380 and 0.626, respectively. Likewise, the accuracy was higher with PCADR than with SWDR: the R²V and RPD values increased by 0.181 and 0.482, respectively. (4) Dimensionality reduction was effective in preventing data overfitting. (5) The quality of the spectral data could be improved, and the accuracy and stability of the SOM content estimation model could be enhanced effectively, by using appropriate preprocessing methods (a combination of WPD and PCADR in this case). For example, the RPD of the model based on the data preprocessed using WPD-(1/R)'-PCADR was higher than that of the model based on untreated data by 1.312.