Exploring Appropriate Preprocessing Techniques for Hyperspectral Soil Organic Matter Content Estimation in Black Soil Area

: Black soil in northeast China is gradually degraded and soil organic matter (SOM) content decreases at a rate of 0.5% per year because of the long-term cultivation. SOM content can be obtained rapidly by visible and near-infrared (Vis–NIR) spectroscopy. It is critical to select appropriate preprocessing techniques for SOM content estimation through Vis–NIR spectroscopy. This study explored three categories of preprocessing techniques to improve the accuracy of SOM content estimation in black soil area, and a total of 496 ground samples were collected from the typical black soil area at 0–15 cm in Hai Lun City, Heilongjiang Province, northeast of China. Three categories of preprocessing include denoising, data transformation and dimensionality reduction. For denoising, Svitzky-Golay ﬁlter (SGF), wavelet packet transform (WPT), multiplicative scatter correction (MSC), and none (N) were applied to spectrum of ground samples. For data transformation, fractional derivatives were allowed to vary from 0 to 2 with an increment of 0.2 at each step. For dimensionality reduction, multidimensional scaling (MDS) and locally linear embedding (LLE) were introduced and compared with principal component analysis (PCA), which was commonly used for dimensionality reduction of soil spectrum. After spectral pretreatments, a total of 132 partial least squares regression (PLSR) models were constructed for SOM content estimation. Results showed that SGF performed better than the other three denoising methods. Low-order derivatives can accentuate spectral features of soil for SOM content estimation; as the order increases from 0.8, the spectrum were more susceptible to spectral noise interferences. In most cases, 0.2–0.8 order derivatives exhibited the best estimation performance. Furthermore, PCA yielded the optimal predictability, the mean residual predictive deviation (RPD) and maximum RPD of the models using PCA were 1.79 and 2.60, respectively. The application of appropriate preprocessing techniques could improve the e ﬃ ciency and accuracy of SOM content estimation, which is important for the protection of ecological and agricultural environment in black soil area.


Introduction
Soil organic matter (SOM) is one of the major factors for soil quality [1], and it not only affects soil physical structure, but is also significant for ecosystem service, ecological environment, sustainable development of agriculture [2][3][4][5]. In addition, the reduction of SOM could lead to the decrease of soil fertility [2,6]. Therefore, for agricultural and environmental management, it is necessary to determine the content of SOM. Black soil area in northeast China is one of the four black soil regions in the world and is famous for its high content of soil organic matter (SOM) [7]. Cropland has become the dominant

Study Area
The study area is located in Hai Lun City, Heilongjiang Province, northeast of China (46.961-47.838N, 126.221-127.749E), as shown in Figure 1a. The total area of it is 4667 km 2 . Hai Lun City lies on the northeast of Songnen Plain. The study area is the center of China's black soil area. The average annual temperature in the study area was 1.5 • C, the annual rainfall was 500-600 mm, and the annual sunshine hours were 2600-2800 h. The dominant land use are cropland and woodland. Due to human activities, the region has undergone dynamic change in land use, and woodland has been constantly transformed into cropland. The meadow vegetation in the study area lasts about 6 months every year. During the rest of the year, the land use changed from meadow vegetation to bare soil, and the long duration and low temperature in winter severely limited the soil microbial activity. Therefore, the study area formed a deep black soil layer, and contains high content of soil humus. The main types of soil are Luvisols and Phaeozems, according to the Soil Taxonomy of World Reference Base (WRB) [35].

Study Area
The study area is located in Hai Lun City, Heilongjiang Province, northeast of China (46.961-47.838N, 126.221-127.749E), as shown in Figure 1a. The total area of it is 4667 km 2 . Hai Lun City lies on the northeast of Songnen Plain. The study area is the center of China's black soil area. The average annual temperature in the study area was 1.5 °C , the annual rainfall was 500-600 mm, and the annual sunshine hours were 2600-2800 h. The dominant land use are cropland and woodland. Due to human activities, the region has undergone dynamic change in land use, and woodland has been constantly transformed into cropland. The meadow vegetation in the study area lasts about 6 months every year. During the rest of the year, the land use changed from meadow vegetation to bare soil, and the long duration and low temperature in winter severely limited the soil microbial activity. Therefore, the study area formed a deep black soil layer, and contains high content of soil humus. The main types of soil are Luvisols and Phaeozems, according to the Soil Taxonomy of World Reference Base (WRB) [35].

Sample Collection and Statistics
A total of 496 soil samples were collected in October 2013 at 0-15 cm. The distribution of the samples is shown in Figure 1b. All samples were put into sealed plastic sample bags and sent to the laboratory. After the soil samples were air-dried and the sundries are separated, let the soil samples pass through a 2 mm sieve until all the soil samples pass through the sieve hole. Soil organic carbon (SOC) content in soil samples was determined by the potassium dichromate volumetric method. The conversion factor of 1.724 was used to convert SOC content to SOM. The reflectivity of soil is a cumulative property formed by the heterogeneous combination of different substances in soil, including organic matter. In this study, the SOM content of the collected samples covers a wide range, which can reflect the reflectivity of soil with different organic matter content. The 496 samples collected were arranged in ascending order. The first sample was selected, and then one sample was selected every three samples as the validation set, and the remaining samples were used as the training set. After the classification, the original sample set was divided into training set and validation set in a ratio of 3:1. The training set was used to build models, and the validation set was used to test the accuracy of models. The statistics of each sample set for SOM content were summarized in Table 1. The maximum and minimum SOM content of the training

Sample Collection and Statistics
A total of 496 soil samples were collected in October 2013 at 0-15 cm. The distribution of the samples is shown in Figure 1b. All samples were put into sealed plastic sample bags and sent to the laboratory. After the soil samples were air-dried and the sundries are separated, let the soil samples pass through a 2 mm sieve until all the soil samples pass through the sieve hole. Soil organic carbon (SOC) content in soil samples was determined by the potassium dichromate volumetric method. The conversion factor of 1.724 was used to convert SOC content to SOM.
The reflectivity of soil is a cumulative property formed by the heterogeneous combination of different substances in soil, including organic matter. In this study, the SOM content of the collected samples covers a wide range, which can reflect the reflectivity of soil with different organic matter content. The 496 samples collected were arranged in ascending order. The first sample was selected, and then one sample was selected every three samples as the validation set, and the remaining samples were used as the training set. After the classification, the original sample set was divided into training set and validation set in a ratio of 3:1. The training set was used to build models, and the validation set was used to test the accuracy of models. The statistics of each sample set for SOM content were summarized in Table 1. The maximum and minimum SOM content of the training sample set are 4.73 and 0.69, respectively, and the maximum and minimum SOM content of the validation sample set are 4.45 and 0.49, respectively. The average values and the standard deviation of training sample set are essentially the same as those of validation sample set.

Spectral Measurement
The spectral data of the soil were collected in a dark laboratory using an ASD FieldSpec ® 3 portable spectro-radiometer (Analytical Spectral Devices Inc., Boulder, CO, USA) with a spectral range of 350-2500 nm and a resample interval of 1 nm. The ground soil samples are put into a black container with a diameter of 10 cm and a depth of 2 cm. After filling, the surface of soil samples is scraped flat. A 50 W halogen lamp was used as the light source and the zenith Angle was 30 • . The field of view Angle of the probe is set as 15 • , and the distance from the probe to the surface of soil samples is 5 cm. Before each spectral determination, the whiteboard calibration was carried out. The spectrum was measured for 10 times repeatedly for each sample, and the average reflectance was taken as the actual spectral reflection data.

Preprocessing Methods
In order to reduce the influence of noise, marginal bands with low signal-noise ratio (350-400 nm and 2401-2500 nm) were eliminated before preprocessing. Various spectral preprocessing techniques were used in this study. First, spectral denoising includes four methods: none (N), Svitzky-Golay filter (SGF), multiplicative scatter correction (MSC), wavelet packet transform (WPT), as shown in Appendix A.1. After spectral denoising, fractional derivatives, FD and SD were used to process the spectrum, as shown in Appendix A.2. Then, spectral dimensionality reduction is carried out. Dimensionality reduction methods include principal component analysis (PCA), multidimensional scaling (MDS), locally linear embedding (LLE), as shown in Appendix A.3. All methods are listed in Table 2. In this study, the coefficient of determination (R 2 ), root mean squared error (RMSE) and residual predictive deviation (RPD) of the validation set were used as the indicators to evaluate the model accuracy.

Correlation Analysis Between SOM Content and Reflectance Data
In this study, four different methods were used for denoising, including N, SGF, MSC, and WPT. Then, the data was processed with order 0 to 2 derivatives at 0.2 intervals. Finally, the dimensionality Remote Sens. 2020, 12, 3765 5 of 18 reduction was carried out for the data after derivatives, and then models were established. In order to observe the effect of denoising and fractional derivatives on the original spectrum, the correlation analysis between SOM content and reflectivity is conducted in the following two parts. The first part is the change of correlation curve after using different denoising methods. The second part is the change of the correlation curve after using derivatives of different orders. Figure 2 shows the correlation curve between reflectance and SOM content after using different denoising methods. It can be seen that there is a negative correlation between the original reflectance and SOM content at 400-2400 nm. After WPT denoising, the correlation curve did not change obviously, and it basically coincides with the original correlation curve. The overall correlation coefficient of reflectance and SOM content after SGF improved about 0.2. The shape of the correlation curve is similar to the original correlation curve. Furthermore, the correlation between reflectance and SOM content after MSC denoising varied significantly between 400 nm and 2400 nm. The correlation curve shows a significant positive correlation between 400 nm and 577 nm (correlation coefficient > 0.6), followed by a rapid decline. Moreover, at 736-1354 nm, there is a significant negative correlation (correlation coefficient is less than −0.6), and then it fluctuates and rises, showing a more obvious positive correlation again.

Correlation Analysis Between SOM Content and Reflectance Data
In this study, four different methods were used for denoising, including N, SGF, MSC, and WPT. Then, the data was processed with order 0 to 2 derivatives at 0.2 intervals. Finally, the dimensionality reduction was carried out for the data after derivatives, and then models were established. In order to observe the effect of denoising and fractional derivatives on the original spectrum, the correlation analysis between SOM content and reflectivity is conducted in the following two parts. The first part is the change of correlation curve after using different denoising methods. The second part is the change of the correlation curve after using derivatives of different orders. Figure 2 shows the correlation curve between reflectance and SOM content after using different denoising methods. It can be seen that there is a negative correlation between the original reflectance and SOM content at 400-2400 nm. After WPT denoising, the correlation curve did not change obviously, and it basically coincides with the original correlation curve. The overall correlation coefficient of reflectance and SOM content after SGF improved about 0.2. The shape of the correlation curve is similar to the original correlation curve. Furthermore, the correlation between reflectance and SOM content after MSC denoising varied significantly between 400 nm and 2400 nm. The correlation curve shows a significant positive correlation between 400 nm and 577 nm (correlation coefficient > 0.6), followed by a rapid decline. Moreover, at 736-1354 nm, there is a significant negative correlation (correlation coefficient is less than −0.6), and then it fluctuates and rises, showing a more obvious positive correlation again. The correlation curves between reflectance and SOM content after fractional derivatives are shown in Figure 3. As the order increases from 0 to 0.6, the region with high correlation in the correlation curve gradually moves to the visible range. At order 0, the band range with correlation coefficient greater than 0.5 is 514-1350 nm. At order 0.6, the band range with correlation coefficient greater than 0.5 is 446-1085 nm. From order 0 to 0.6, the peak of correlation coefficient increased from 0.6726 to 0.6934, and the band range with correlation coefficient greater than 0.67 moved from 807-905 nm to 539-787 nm. Starting from order 0.8, the correlation curve produces irregular fluctuations. As the order increases from 0.8 to 2, the fluctuation amplitude of the correlation curve increases rapidly, and the overall correlation coefficient tends to be between −0.2-0.2 gradually. The correlation curves between reflectance and SOM content after fractional derivatives are shown in Figure 3. As the order increases from 0 to 0.6, the region with high correlation in the correlation curve gradually moves to the visible range. At order 0, the band range with correlation coefficient greater than 0.5 is 514-1350 nm. At order 0.6, the band range with correlation coefficient greater than 0.5 is 446-1085 nm. From order 0 to 0.6, the peak of correlation coefficient increased from 0.6726 to 0.6934, and the band range with correlation coefficient greater than 0.67 moved from 807-905 nm to 539-787 nm. Starting from order 0.8, the correlation curve produces irregular fluctuations. As the order increases from 0.8 to 2, the fluctuation amplitude of the correlation curve increases rapidly, and the overall correlation coefficient tends to be between −0.2-0.2 gradually.

Accuracy Analysis of SOM Content Estimation Models
In this study, PCA, MDS, and LLE were used to reduce the dimension of the spectrum. For data that has been processed differently, the dimensions best suited for modeling are different. Therefore, the data after different processing was reduced to 0-200 dimensions. After dimensionality reduction, 0-200 dimensional models were established by partial least squares regression (PLSR) separately, the explanation of PLSR is shown in Appendix A.4, and the model with the largest 2 is selected as the modeling result.

Accuracy Analysis of SOM Content Estimation Models
In this study, PCA, MDS, and LLE were used to reduce the dimension of the spectrum. For data that has been processed differently, the dimensions best suited for modeling are different. Therefore, the data after different processing was reduced to 0-200 dimensions. After dimensionality reduction, 0-200 dimensional models were established by partial least squares regression (PLSR) separately, the explanation of PLSR is shown in Appendix A.4, and the model with the largest R 2 is selected as the modeling result.
All SOM content estimation model results are shown in Appendix B. After PCA dimensionality reduction, the SOM content estimation model results are listed in Table A1. After MDS dimensionality reduction, the SOM content estimation model results are listed in Table A2. After LLE dimensionality reduction, the SOM content estimation model results are listed in Table A3.

Comparison of Modeling Results for Denoising Methods
The models without fractional derivatives are divided into three categories according to dimensionality reduction methods. The accuracy of the models is shown in Table 3. For models with PCA dimensionality reduction, R 2 and RPD of the model with SGF were 0.15 and 0.57 higher than that without denoising, R 2 and RPD of the model with MSC were 0.04 and 0.06 lower than that without denoising, and R 2 and RPD of the model with WPT were 0.07 and 0.17 higher than that without denoising. For models with MDS dimensionality reduction, R 2 and RPD of the model with SGF were 0.23 and 0.71 higher than that without denoising, R 2 and RPD of the model with MSC were 0.01 and 0.02 lower than that without denoising, and R 2 and RPD of the model with WPT were 0.08 and 0.15 higher than that without denoising. For models with LLE dimensionality reduction, R 2 and RPD of the model with SGF were 0.11 and 0.13 higher than that without denoising, R 2 and RPD of the model with MSC were 0.01 and 0.02 higher than that without denoising, and R 2 and RPD of the model with WPT were same as that without denoising. According to the above comparison, the best results were obtained using SGF.

Comparison of Modeling Results for Fractional Derivatives of Different Orders
The models using the same denoising method and dimensionality reduction method were selected as a combination, and 12 combinations were formed. The RPD of models with 12 combinations were drawn in line charts, respectively (Figure 4). Among all models, RPD of the model after SGF-0.6 order-PCA achieved the maximum value of 2.60. The RPD of all combinations gradually decreased after first derivative, and the RPD of 8 combinations gradually decreased after 0.8 derivative. This is Remote Sens. 2020, 12, 3765 8 of 18 consistent with the change of the correlation curve between the reflectance and SOM content after fractional derivatives of different orders. 10 combinations obtained the maximum RPD at order 0.2-0.8, and two combinations obtained the maximum RPD at order 1. This indicates that the spectral data can show a more detailed variation trend after fractional derivative and reduce the accuracy decline caused by information loss.

Comparison of Modeling Results for Fractional Derivatives of Different Orders
The models using the same denoising method and dimensionality reduction method were selected as a combination, and 12 combinations were formed. The RPD of models with 12 combinations were drawn in line charts, respectively (Figure 4). Among all models, RPD of the model after SGF-0.6 order-PCA achieved the maximum value of 2.60. The RPD of all combinations gradually decreased after first derivative, and the RPD of 8 combinations gradually decreased after 0.8 derivative. This is consistent with the change of the correlation curve between the reflectance and SOM content after fractional derivatives of different orders. 10 combinations obtained the maximum RPD at order 0.2-0.8, and two combinations obtained the maximum RPD at order 1. This indicates that the spectral data can show a more detailed variation trend after fractional derivative and reduce the accuracy decline caused by information loss. The analyses of variance (ANOVA) was performed with RMSE as the dependent variable and with the fractional derivatives as the independent variable ( Table 4). The result showed that fractional derivatives exerted large effect on the accuracy of the SOM content models (p < 0.01). To further elucidate the effect of fractional derivatives, the RPD of SOM content estimation models using different order derivatives was counted to form the boxplot, as shown in Figure 5. The 0.2-0.8 derivatives improved the mean value, median value, and quartile of RPD. In addition, as the order increases from the first derivative, the accuracy of the models was reduced obviously.

Comparison of Modeling Results for Dimensionality Reduction Methods
A total of 132 models are divided into three categories according to dimensionality reduction methods. The boxplot of RPD is shown in Figure 6. The mean RPD of the models with PCA dimensionality reduction method reached 1.79, which was 0.34 higher than that with MDS dimensionality reduction method, and 0.22 higher than that with LLE dimensionality reduction method. In addition, the quartile, maximum and minimum values of RPD of the models processed with PCA were significantly improved compared with those with MDS and LLE. There are two outliers in RPD of the models using LLE. Furthermore, the mean value and quartile of RPD of the models processed with LLE are greater than that with MDS. By comparing the MDS section and LLE section in the boxplot, the scope of the latter is smaller and more concentrated. This suggests that the accuracy fluctuation of the models processed with LLE is less.
Among the models with the same denoising method, derivative of the same order and different dimensionality reduction methods, the model achieving the largest RPD is denoted as the advantage

Comparison of Modeling Results for Dimensionality Reduction Methods
A total of 132 models are divided into three categories according to dimensionality reduction methods. The boxplot of RPD is shown in Figure 6. The mean RPD of the models with PCA dimensionality reduction method reached 1.79, which was 0.34 higher than that with MDS dimensionality reduction method, and 0.22 higher than that with LLE dimensionality reduction method. In addition, the quartile, maximum and minimum values of RPD of the models processed with PCA were significantly improved compared with those with MDS and LLE. There are two outliers in RPD of the models using LLE. Furthermore, the mean value and quartile of RPD of the models processed with LLE are greater than that with MDS. By comparing the MDS section and LLE section in the boxplot, the scope of the latter is smaller and more concentrated. This suggests that the accuracy fluctuation of the models processed with LLE is less.
Remote Sens. 2020, 12, x FOR PEER REVIEW 10 of 19 model; 44 advantage models were obtained by statistics. The model achieving the largest RPD among the models with the same dimensionality reduction method is denoted as the optimal model. There were three optimal models were obtained by statistics. According to the statistical results, the table as shown in Table 5 is formed. Most of the advantage models come from models using PCA. In addition, the optimal model using PCA was 0.01 and 0.12 higher than that using MDS in 2 and RPD and was 0.02 and 0.19 higher than that using LLE in 2 and RPD.   Among the models with the same denoising method, derivative of the same order and different dimensionality reduction methods, the model achieving the largest RPD is denoted as the advantage model; 44 advantage models were obtained by statistics. The model achieving the largest RPD among the models with the same dimensionality reduction method is denoted as the optimal model. There were three optimal models were obtained by statistics. According to the statistical results, the table as shown in Table 5 is formed. Most of the advantage models come from models using PCA. In addition, the optimal model using PCA was 0.01 and 0.12 higher than that using MDS in R 2 and RPD and was 0.02 and 0.19 higher than that using LLE in R 2 and RPD.

Denoising Methods for SOM Content Estimation in Black Soil Area
Different denoising methods have varied effects on the spectrum. In this study, N, SG, MSC, and WPT were used. SGF can effectively remove noise such as baseline-drift, tilt, and reverse [36]. The larger the window size of SGF, the greater for spectral smoothing. This is because large window size can effectively remove high frequency noise [37], but the information loss will be caused using oversize window. MSC can be used to remove the effect of scattering on the spectrum [38]. WPT achieves the signal-noise separation by decomposing the low-frequency information and high-frequency noise in the spectrum [39]. With the same derivative order and dimensionality reduction method, the accuracy of most models using SGF is higher than that using the other three denoising methods. In Figure 2, the correlation curves between SOM content and reflectance shows that SGF can effectively improve the correlation, while the correlation curve after WPT basically coincides with the original correlation curve. The advantages of the SGF were not apparent in the results obtained by Shen et al. [25]. Determining appropriate smoothing window size was essential for removing noise and retain information of spectral data [37]. In the study of Shen et al., the smoothing window of 22 was selected, which might cause over-smoothing phenomenon. Furthermore, MSC has little effect on SOM content estimation according to model accuracy variation. Before spectral measurement, the sundries and large particles in samples have been separated. Therefore, the influence of inhomogeneous scattering on the spectrum is not significant, and MSC has less effect. For the field measured spectrum of soil, due to the presence of impurities, MSC might achieve better performance in SOM content estimation, which requires further study.

Fractional Derivatives for SOM Content Estimation in Black Soil Area
SOM content has a key effect on soil reflectivity [40]. Several studies have reported that SOM content is strongly correlated to the reflectance in visible region [41][42][43][44]. In addition, the results obtained by Peng et al. showed that 570-630 nm spectrum were more significantly affected by SOM through comparing the soil reflectance before and after the removal of SOM [45]. In the study, according to the statistics of original spectrum, the band range with correlation coefficient greater than 0.5 is 514-1350 nm. After 0.6 order derivative of the spectrum, the band range with correlation coefficient greater than 0.5 moves to 446-1085 nm, and peaks of correlation shift to the visible range significantly. Moreover, the peak of correlation coefficient increased from 0.67 to 0.69 and the band range with correlation coefficient greater than 0.67 moves from 807-905 nm to 539-787 nm. The range of SOM sensitive bands after 0.6 order derivative of spectrum in this study is consistent with previous studies [41][42][43][44][45]. This indicates that fractional derivatives can accentuate spectral features of SOM. It is significant for SOM content estimation using Vis-NIR spectroscopy.
In Figure 3, the study compares the change of the correlation curve between the original spectrum after fractional derivatives and SOM content. By observing Figure 3, it can be seen that the correlation curve between the spectrum and SOM content has shown slight irregular fluctuation at order 0.8, and the irregular fluctuation becomes more obvious at order 1.2. As the order goes up, an increasing number of noise is introduced to the correlation curve, causing the correlation decreases. Figure 4 shows the variation of model accuracy after fractional derivatives among 12 combinations of preprocessing techniques. In most combinations, the commonly used FD and SD cannot significantly improve the model accuracy. After SD, the accuracy of the models is lower than that without derivative. This is related to the noise introduced by SD while separating the absorption peak. Moreover, the spectrum after FD and SD forms obvious sharp peaks [25]. Among 12 combinations, 10 combinations achieved the optimal performance at order 0.2-0.8. In addition, Figure 5 showed that 0.2-0.8 order derivatives could improve accuracy of the models. In previous studies, the optimal results of SOM content estimation were obtained at 0.25-order in the southeast part of Iowa State [15], 1.25-order and 1.5-order in the east of Jianghan plain (Hubei Province, China) [29], and 1.2-order in Ebinur Lake Wetland National Nature Reserve (Xinjiang, China) [30]. Because soil types varied in different study areas, the appropriate orders for SOM content estimation might be different. Therefore, 0.2-0.8 order derivatives have more advantages in hyperspectral SOM content estimation in black soil area, especially 0.6 order and 0.8 order. Moreover, refinement of the order variation of fractional derivatives to order 0.1 or lower may further improve the modeling effect. The influence of refining the order variation of fractional derivatives on hyperspectral SOM content estimation needs to be further studied.

Dimensionality Reduction Methods for SOM Content Estimation in Black Soil Area
In this study, PCA, MDS, and LLE were used for dimensionality reduction. It can be concluded from Figure 6 that PCA achieves the optimal performance in each evaluation index compared with the other two methods. In addition, Table 5 shows that most advantage models come from models using PCA. Among three optimal models, the model using PCA achieves the highest accuracy and 31 of 44 advantage models come from the models using PCA. Compared with PCA, MDS, and LLE did not show superiority in SOM content estimation. This is related with the characteristic of the three dimensionality methods. PCA as a linear dimensionality reduction method is more adaptive for the dependent variables with a linear relation to the reflectance [31]. There is obvious trend on soil reflectance when SOM content varies. The reflectivity of soil decreases as the increase of SOM content [40]. In the case that the relation between the reflectance and the dependent variable is not significant, LLE and MDS are more applicable [31]. Therefore, LLE and MDS might achieve better performance in estimating the content of other components in the soil, which requires further study.

Conclusions
In this study, the spectral data of 496 soil samples was preprocessed in three parts: denoising, fractional derivatives and dimensionality reduction, and PLSR was used for modeling. A total of 132 models using combinations of different preprocessing techniques were obtained. Compared with MSC and WPT, SGF can enhance the correlation between soil spectrum and SOM content more effectively. In addition, the advantages of the SGF were apparent in improving the accuracy of SOM content estimation. The use of low-order fractional derivatives can accentuate spectral features of SOM, causing the correlation peak increasing, the bands around the correlation peak moving to visible range, and the number of bands around the correlation peak increasing. However, as the order increasing from 0.8, the spectrum was susceptible to spectral noise interferences. The results indicated that the 0.2-0.8 order derivatives significantly enhanced the accuracy of the SOM content estimation models. Among dimensionality reduction methods, PCA exhibited better performance than LEE and MDS.
According to this study, for three categories of preprocessing techniques, we found that SGF, 0.2-0.8 order derivative and PCA are appropriate methods to estimate SOM content in black soil area. Despite the above conclusion has been drawn, SOM content estimation in black soil area remains to be further studied. In the future, the researches should be focus on spectrum at different scales (e.g., unmanned aerial vehicle hyperspectral image, and satellite hyperspectral image), in order to improve the efficiency and accuracy of SOM content estimation in black soil area.
Author Contributions: All of the authors contributed to the study. X.X. conceived and designed the experiments, analyzed the data, and wrote the manuscript. X.X., S.Z., and R.D. processed the data. Z.X. contributed greatly to data collection. S.C. and Y.Y. reviewed the manuscript. All authors have read and agreed to the published version of the manuscript. Acknowledgments: We thank the editors and the reviewers for their constructive suggestions and insightful comments, which helped us greatly to improve this manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A.1 Denoising Methods
SGF removes the high frequency noise points by the least square fitting to internal elements of moving window and replacing the original value with the fitting value [46]. SGF can not only achieve expected denoising effect, but also retain useful information in the spectrum [47]. Therefore, SGF is often used in the preprocessing of spectral data. The method has two parameters, filter window size and polynomial fitting order. The higher the polynomial fitting order, the more details are retained, and the larger the window size, the better the smoothing effect [48]. In this study, the filter window size is 15 and the polynomial fitting order is two.
MSC corrected the difference of scattering information in different sample spectrum to the same level to minimize the interference information independent of sample composition caused by spectral scattering [49]. Therefore, the main purpose of MSC is to remove the influence caused by scattering [50]. MSC calculates the n × p dimension matrix of the sample set, where n is the number of samples and p is the number of bands. Formulas (A1), (A2) and (A3) are used for calculation.
where n is the number of samples, i and j respectively represent the i sample and the j band, and x i,j represents the average spectrum of all samples in each band. m i represents the relative offset coefficient of each sample spectrum, and b i represents the translation variable of each sample spectrum. Wavelet packet transform is proposed on the basis of wavelet transform. Wavelet packet transform can achieve more accurate information extraction through further decomposing the high frequency information. Wavelet packet transform has been widely used in various fields related to signal processing [39,51]. Wavelet packet denoising consists of four parts: wavelet packet decomposition, optimal wavelet packet basis function determination, threshold selection, and signal wavelet packet reconstruction. In this study, db2 wavelet basis function was selected to perform two-layer wavelet packet decomposition for the spectrum. The use of hard threshold processing may cause the reconstructed signal to vibrate [52]. Therefore, the continuous soft threshold function is selected, and then the signal is reconstructed by wavelet packet transform. The threshold value is calculated by formula (A4): where σ is the median of all absolute coefficients in the high-frequency signal divided by 0.6745, N is the number of data in the high-frequency signal, and tar/2 is selected as the denoising threshold [22].

Appendix A.2 Fractional Derivative
Fractional derivatives extend the concept of integer derivatives and have been applied to image processing [53], fractal theory [54], and other fields. There are three main expressions of fractional derivatives: Riemann Liouville (R-L), Grunwald-Letnikov (G-L), and Caputo [55,56]. Due to the uncomplicated formula of G-L and the simple coefficient, G-L is often used in signal processing [57]. The fractional derivatives of function f (x) is shown in formula (A5) [30]: where v is the order.

Appendix A.3 Dimensionality Reduction Methods
Principal component analysis (PCA) is the most widely used linear dimensionality reduction technique. PCA is a second order method based on covariance matrix. For PCA, dimensionality reduction is achieved by coordinate transformation. Coordinate transformation is to project high-dimensional data to the direction with the greatest variance and satisfy the conditions of minimum error and maximum variance simultaneously [58]. Each principal component obtained through PCA is independent of each other. Therefore, PCA can achieve dimension reduction and solve the problem of information redundancy in spectral data.
Multidimensional scaling (MDS) is a nonlinear analysis method that measures distance and reflects the anisotropy between data points. MDS was first proposed by Torgerson [59]. Torgerson converts the difference of things evaluated by different people into target distance. According to this, points are drawn in the figure to make the distance between points fit the target distance. Kruskal improved Torgerson's approach [60]. The improved method is now the basis of MDS research. The main idea of MDS dimensionality reduction is to map the observed data in high dimensional space to low dimensional space. After mapping, Euclidean distance between two points should be kept unchanged as far as possible to keep the difference of samples in the low-dimensional space unchanged.
Local linear embedding (LLE) is a local feature preserving algorithm. LLE is a kind of manifold learning. The basic idea is to assume that each sample point and its neighbor point in a high dimensional space are located in a local neighborhood of a linear manifold. Under this assumption, each point can be represented linearly by an adjacent point. The reconstruction weight is obtained by minimizing the reconstruction error. Then the corresponding internal low-dimensional space coordinate representation is obtained by keeping the same reconstruction weight, and achieve the effect of dimensionality reduction [61][62][63]. When LLE is used to process spectral data, the number of adjacent points and dimension has great influence on dimensionality reduction results and model accuracy [33,34]. Therefore, in order to determine the number of adjacent points, the number of adjacent points is set from 10 to 60, and the range of dimensionality reduction is set from 0 to 200. With this setting, the original spectral data was used for modeling. According to model accuracy, the number of adjacent points was selected. Figure A1 shows the model accuracy curve. When the number of adjacent points is 15, the model accuracy reaches the maximum value. Therefore, in this study, the number of adjacent points of LLE was taken as 15. coordinate representation is obtained by keeping the same reconstruction weight, and achieve the effect of dimensionality reduction [61][62][63]. When LLE is used to process spectral data, the number of adjacent points and dimension has great influence on dimensionality reduction results and model accuracy [33,34]. Therefore, in order to determine the number of adjacent points, the number of adjacent points is set from 10 to 60, and the range of dimensionality reduction is set from 0 to 200. With this setting, the original spectral data was used for modeling. According to model accuracy, the number of adjacent points was selected. Figure A1 shows the model accuracy curve. When the number of adjacent points is 15, the model accuracy reaches the maximum value. Therefore, in this study, the number of adjacent points of LLE was taken as 15. Figure A1. Number of adjacent points in different dimension number versus RPD. The dimension number is determined by the dimension corresponding to the optimal RPD among 0-200 dimensions in the case that the number of adjacent points is the same.

Appendix A.4. Partial Least Squares Regression (PLSR)
The conventional multiple linear regression methods face the problem of multi-collinearity among variables. PLSR integrates the advantages of principal component analysis, canonical correlation analysis and linear regression analysis. Because of the characteristic, PLSR can avoid the problem of strong correlation between variables, and fully extract effective information of samples [64]. It can be seen that PLSR is able to reveal the dominant band of soil chemical variable change from the spectral data and achieve regression modeling, so that PLSR is potential for spectral Figure A1. Number of adjacent points in different dimension number versus RPD. The dimension number is determined by the dimension corresponding to the optimal RPD among 0-200 dimensions in the case that the number of adjacent points is the same.

Appendix A.4 Partial Least Squares Regression (PLSR)
The conventional multiple linear regression methods face the problem of multi-collinearity among variables. PLSR integrates the advantages of principal component analysis, canonical correlation analysis and linear regression analysis. Because of the characteristic, PLSR can avoid the problem of strong correlation between variables, and fully extract effective information of samples [64]. It can be seen that PLSR is able to reveal the dominant band of soil chemical variable change from the spectral data and achieve regression modeling, so that PLSR is potential for spectral information mining. Moreover, PLSR is suitable for linear regression modeling of multi-dependent variables and multi-independent variables [65]. Therefore, PLSR has been widely used to find the correlation between soil spectrum and SOM [22,25,66].

Appendix B
The results of 132 models are divided into three parts depending on the dimensionality reduction method, as shown in Tables A1-A3.   Note: R 2 , coefficient of determination; RMSE, root mean squared error; RPD, residual predictive deviation; MSC, multiplicative scatter correction; SGF, Svitzky-Golay filter; WPT, wavelet packet transform; N, none.