Soil Moisture, Organic Carbon, and Nitrogen Content Prediction with Hyperspectral Data Using Regression Models

Predicting soil moisture, soil organic carbon, and nitrogen content is an important field of study because these properties are directly related to plant health and food production. Direct estimation of these soil properties with traditional methods, for example, the oven-drying technique and chemical analysis, is time- and resource-consuming and can cover only small areas. With the significant development of remote sensing and hyperspectral (HS) imaging technologies, soil moisture, carbon, and nitrogen can be estimated over vast areas. This paper presents a generalized approach to predicting three essential soil contents through a comprehensive study of various machine learning (ML) models combined with dimensionality reduction of the feature space. In this study, we have used three popular benchmark HS datasets captured in Germany and Sweden. The efficacy of different ML algorithms is evaluated to predict soil content, and significant improvement is obtained when a specific range of bands is selected. The performance of the ML models is further improved by applying principal component analysis (PCA), an unsupervised dimensionality reduction method. The effect of soil temperature on soil moisture prediction is also evaluated, and the results show that adding soil temperature to the HS bands does not improve soil moisture prediction accuracy. However, the combined effect of band selection and feature transformation using PCA significantly enhances the prediction accuracy for soil moisture, carbon, and nitrogen content. This study presents a comprehensive analysis of a wide range of established ML regression models using data preprocessing, effective band selection, and data dimension reduction, and attempts to understand which feature combinations provide the best accuracy.
The outcomes of several ML models are verified with validation techniques and the best- and worst-case scenarios in terms of soil content are noted. The proposed approach outperforms existing estimation techniques.


Introduction
Soil moisture (SM), soil organic carbon (SOC), and nitrogen content (NC) are fundamental soil properties that provide habitat for a broad range of life forms and are important for healthy food production [1][2][3][4]. SM contributes to plant development and deterioration, climate change, and carbon formation, and significantly controls the filtration, overflow, drought monitoring, and evaporation rates [5][6][7]. SOC enhances the water holding capacity of the soil and nutrient production for plants, leading to plant growth [8,9]. SM and SOC act to regulate water level and energy exchange rate, directly influencing plant health and the hydrosphere beneath [10]. NC develops plants' structure, metabolism, and creation of chlorophyll, contributing to plant growth and food production [11].
These near-infrared reflectance (NIR) spectroscopy with a back-propagation (BP) neural network. The authors of [48] demonstrated soil nitrogen prediction in subsided land using the local correlation maximization-complementary superiority (LCMCS) method. Several researchers have used partial least squares regression models to predict soil nitrogen [49,50]. In [51], the authors compared three methods, stepwise multiple linear regression (SMLR), partial least squares regression (PLSR), and support vector machine regression (SVMR), to predict nitrogen content with visible/near-infrared spectroscopy. PLSR was used to predict various soil parameters (organic, inorganic, and total carbon, CEC, pH, texture, moisture), including nitrogen, in [52].
Although satisfactory results have been achieved in predicting SM, SOC, and NC using HS remote sensing technology, a comprehensive study of different machine learning models is needed in order to develop a generalized approach to predicting different soil contents with the help of dimensionality reduction. A number of studies have used PCA to predict SM [53][54][55], SOC [56][57][58], and NC [59][60][61]. However, they all used raw HS data and a small number of ML algorithms. Different types of machine learning algorithms have different strengths. Here, we explore a wide range of machine learning algorithms in order to determine whether their performance is better or worse than indicated in the current literature. Furthermore, we investigate whether particular bands affect moisture, carbon, and nitrogen prediction, along with the combined effects of different machine learning strategies and feature transformation by PCA with effective band selection. Hence, all the comprehensive experiments in this paper are novel in comparison to the existing literature.
This study uses three different HS datasets to predict SM, SOC, and NC. The HS feature data used in the experiments were extracted from captured HS images of soil samples. These datasets were built from HS camera images by averaging the reflectance/absorbance values and storing them in a CSV file. These reflectance/absorbance values were then used as features for training and testing. The SM dataset was captured by the authors of [62], and contains 125 HS bands ranging from 454 nm to 950 nm. Two HS datasets from LUCAS, each containing 4200 HS bands from 400 nm to 2499.5 nm, were used to predict SOC and NC. As each band may not contribute equally to predicting SM, SOC, and NC [63], selecting the most influential bands is essential to improve prediction performance, minimize computational time, and decrease data dimension. Choosing effective bands eliminates negative influences, allowing soil parameters to be predicted more accurately.
On the other hand, soil temperature can be considered a good feature for predicting SM. It is comparatively easier to measure surface soil temperature than soil moisture [64]. Figure 1 shows the plot of soil temperature and corresponding soil moisture in our considered dataset. From this figure, it can be seen that SM has an inverse relationship with soil temperature. We calculated the Pearson correlation coefficient between SM and soil temperature, and found that these two parameters are negatively correlated (−0.79). Therefore, soil temperature has a noticeable relationship with SM.
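As an illustration of the correlation check described above, the following minimal sketch computes the Pearson correlation coefficient with the Python standard library only. The function name `pearson_r` and the five temperature/moisture readings are hypothetical and are not taken from the dataset in [62]; they only mimic the inverse trend seen in Figure 1.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings (assumption), NOT values from the dataset in [62].
soil_temperature_c = [26.0, 30.5, 35.0, 39.5, 44.0]
soil_moisture_pct = [41.0, 38.5, 33.0, 30.0, 26.5]

r = pearson_r(soil_temperature_c, soil_moisture_pct)
print(f"Pearson r = {r:.3f}")  # negative value: moisture falls as temperature rises
```

A coefficient close to −1 indicates a strong inverse linear relationship; it does not by itself establish that temperature improves a regression model, which is why the effect is tested experimentally later in the paper.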
Additionally, raw data or original features may contribute little to predicting SM, SOC, and NC. Hence, PCA can be adopted to extract features more effectively [65]. PCA exploits the correlation among all the input features and produces principal components that are uncorrelated with one another. Regression algorithms are faster with PCA-preprocessed data, as it substantially reduces the size of the dataset and eliminates variables that are less significant to decision-making [66]. PCA can transform the essential features from raw data, thereby reducing the feature space significantly and helping to reduce over-fitting, which improves SM and SOC prediction performance.
To predict SM, SOC, and NC more accurately and precisely from HS data, effective band selection, feature transformation, and high-dimensional data reduction techniques (for example, PCA) should be considered. In this study, we have used two dimensionality reduction techniques: the first extracts the most crucial feature bands, and the second reduces dimensionality using PCA. We explore the combined effect of using PCA and effective band selection on the prediction performance for SM, SOC, and NC.
Our study's second contribution is finding a generalized approach that can predict soil content using our proposed methodology. Thus, we explore band selection and feature transformation in order to understand the estimation performance with several established machine learning techniques. The objectives include:
• Proposing a generalized approach that can predict soil content more accurately and efficiently;
• Evaluating the accuracy of different ML models and comparing the results with the proposed method.
The rest of the paper is organized as follows. The HS remote sensing data and ground truth SM, SOC, and NC data are described in Section 2. The step-by-step workflow of this paper is provided in Section 3. Section 4 illustrates the results in predicting SM, SOC, and NC for different algorithms, and a comparative study is presented with validation and evaluation. We critically discuss the outcomes of this research in Section 5. Finally, this paper is concluded in Section 6 with the presentation of guidelines for SM, SOC, and NC prediction methodology in HS remote sensing.
For this study, we used the dataset captured in [62] during a five-day field campaign in May 2017 in the area of Karlsruhe, Germany. The dataset is freely available and open for research purposes under the license (available online: https://www.gnu.org/licenses/gpl-2.0.html (accessed on 8 May 2012)). The field study was performed on undisturbed bare soil of the clayey-silt type with no vegetation. The SM was measured using a TRIME-PICO time domain reflectometry (TDR) sensor, which can measure SM to a depth of 2 to 18 cm. However, the dataset we used listed the SM of soil to a depth of 2 cm, which is considered a subsurface SM. This SM value was considered the ground truth for our study. The SM ranged from 25% to 42% (Figure 2), and the corresponding soil temperature at the same depth ranged from 25.5 °C to 44.5 °C (Figure 3).

Soil Moisture Hyperspectral Data
The hyperspectral data were captured using a Cubert UHD 285 hyperspectral snapshot camera with a spectral range of 450 nm to 950 nm. The camera was mounted on a tripod 1.7 m in height. This camera can record 50 × 50 pixel images with 4 nm spectral resolution and 125 spectral bands. The dataset consisted of 679 high-dimensional data points, each including 125 hyperspectral bands. Figure 4 shows the reflectance of the HS bands. For simplicity, we considered only four soil samples with different SM values, including the samples with maximum and minimum SM. This figure shows that higher SM corresponds to lower HS reflectance, and vice versa. The HS band reflectance data were used to predict SM in our study.

LUCAS Topsoil Data
LUCAS was established by the Statistical Office of the European Union (EUROSTAT) in 2001 to create a pan-European database on landscape parameters relevant for agricultural and environmental coverage development and evaluation [67]. For non-commercial purposes, the LUCAS topsoil dataset is available from the European Soil Data Centre (ESDAC) website. The land survey has been performed every three years for a 2 × 2 km area of land in all European member states beginning in 2006 [68]. In 2009, an extension to the periodic LUCAS was granted to provide a regular, coherent, and harmonized topsoil database for Europe [67].
In this campaign, about 20,000 soil samples were collected using a multi-level stratified random sampling technique to represent the proportion of different land use types in Europe [67]. Five topsoil samples (0-20 cm) were taken and blended into a composite sample for every sampling point. These samples were then analyzed for their physical, chemical, and reflectance properties using a standardized technique within the same laboratory [68].
After laboratory analysis of each sample, the absorbance from 400-2499.5 nm was recorded using a FOSS XDS Rapid Content Analyzer (FOSS NIRSystems Inc., Denmark) [34]. This yields 4200 absorbance bands at 0.5 nm intervals. For the purpose of this study, we considered only the Swedish dataset, which consisted of 1891 soil samples. The dataset lists each soil sample (point ID) with different properties such as clay, silt, and sand content, pH, coarse fragments, SOC, NC, etc. The box plots in Figures 5 and 6 show the interquartile range and the outlier ranges of SOC and NC, respectively. Figures 7 and 8 represent the absorbance curves for different values of SOC and NC, respectively. For simplicity, only the lowest, highest, and average values of SOC and NC were considered. Figure 7 shows that the absorbance increases with increasing SOC and decreases with decreasing SOC in the HS band range from 400 nm to 1000 nm. However, this trend does not hold from 1000 nm to 2500 nm. On the other hand, the absorbance increases with increasing NC; however, when the value of NC is high (36.7 g/kg), the absorbance curve becomes different. The absorbance curve of NC is shown in Figure 8.

Methodology
The work was divided into two steps, namely, the preprocessing of HS bands and the regression models used to predict SM, SOC, and NC. Figure 9 represents the workflow of this study.

Data Preprocessing
This study considered four preprocessing steps in order to handle the extensive dimensionality of the HS data.

Data Filtering and Mapping
Data cleaning, filtering, and mapping are essential steps for the LUCAS dataset, as it contains inhomogeneous data [69]. The HS data and the corresponding ground truth of SOC and NC were provided in different datasets. First, according to point ID (the unique ID of an individual soil sample), the soil sample HS data were mapped into two datasets for SOC and NC for Sweden. The ground truth data (SOC and NC) of certain soil samples were missing. Therefore, these samples and their corresponding HS data were filtered from the dataset manually. All other properties were removed to make the dataset simpler to use for training and testing. The SM dataset, in contrast, had already been cleaned and mapped to the corresponding HS data.
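The mapping and filtering step can be sketched with pandas as follows. This is only an illustration under stated assumptions: the column names (`point_id`, `soc`, `nc`, `ph`, `band_…`), the helper `build_dataset`, and all values are hypothetical, and the real LUCAS files use their own schema. The study filtered missing samples manually; the sketch automates the same idea.

```python
import pandas as pd

# Hypothetical spectra table: one row per soil sample, one column per band.
spectra = pd.DataFrame({
    "point_id": [101, 102, 103, 104],
    "band_400.0": [0.41, 0.39, 0.45, 0.43],
    "band_400.5": [0.42, 0.40, 0.46, 0.44],
})

# Hypothetical property table with some missing ground truth values.
properties = pd.DataFrame({
    "point_id": [101, 102, 103, 105],
    "soc": [12.3, None, 48.1, 20.0],
    "nc": [1.1, 0.9, None, 2.0],
    "ph": [5.6, 6.1, 4.9, 5.2],  # extra property, not kept in the result
})


def build_dataset(spectra, properties, target):
    """Join spectra to one target by point ID and drop samples without ground truth."""
    merged = spectra.merge(properties[["point_id", target]], on="point_id", how="inner")
    return merged.dropna(subset=[target]).reset_index(drop=True)


soc_dataset = build_dataset(spectra, properties, "soc")
nc_dataset = build_dataset(spectra, properties, "nc")
print(soc_dataset)
print(nc_dataset)
```

Building one table per target keeps a sample that lacks only SOC available for NC prediction, and vice versa.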

Feature Scaling
Next, feature scaling was performed on the three HS datasets to standardize all the input data before training our models. Feature scaling is a mathematical transformation of the features (independent variables) that balances their contributions and thereby improves prediction performance. In this study, we considered the standard scaling method. The HS band data were scaled according to the transformation formula provided in Equation (1):

$$z_{score} = \frac{x - \bar{u}}{sd} \qquad (1)$$

where $z_{score}$ is the standard score, $x$ is the training sample, and $\bar{u}$ and $sd$ are the mean and standard deviation of the training sample, respectively.
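A minimal sketch of standard scaling with scikit-learn's `StandardScaler` is shown below, assuming synthetic reflectance values rather than the study's data. The scaler is fitted on the training split only and then applied to the test split, and the result is compared against a manual per-band mean and standard deviation computation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic reflectance values (assumption): 5 bands, 50 training and 10 test samples.
X_train = rng.uniform(0.1, 0.6, size=(50, 5))
X_test = rng.uniform(0.1, 0.6, size=(10, 5))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and sd per band, then scale
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# Manual per-band standard score for comparison (population standard deviation).
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_train_scaled, manual))
```

Reusing the training statistics for the test split avoids leaking information from the test samples into the preprocessing.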

Hyperspectral Band Selection
In the third step, we selected an effective HS band range for each of the three HS datasets. For large volumes of data, band selection becomes more important because it saves computational time and effort. The most effective SM HS bands were selected by considering a small portion of the HS bands at a time and noting the SM prediction performance for that portion. Several experiments were performed to understand how well each band range estimates SM. We used a trial-and-error method and recorded which band ranges were more significant in predicting SM. The best SM prediction was provided by the HS bands ranging from 454 nm to 742 nm; therefore, this portion of the spectrum was considered the most influential band range, and the bands ranging from 746 nm to 950 nm were eliminated.
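The trial-and-error search over band ranges can be sketched as a loop that scores a regressor on each candidate wavelength window. Everything below is an assumption made for illustration: the data are synthetic (signal is planted in the bands up to 742 nm), the regressor is a ridge model rather than one of the study's tuned models, and the helper `score_range` and the three candidate windows are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
wavelengths = np.arange(454, 951, 4)            # 125 band centres, 454 nm to 950 nm
n_samples = 200
moisture = rng.uniform(25, 42, n_samples)       # synthetic SM in percent

# Synthetic reflectance: bands up to 742 nm carry signal, the rest is pure noise.
X = rng.normal(0.0, 1.0, (n_samples, wavelengths.size))
informative = wavelengths <= 742
X[:, informative] += -0.2 * moisture[:, None]   # reflectance drops as SM rises


def score_range(lo_nm, hi_nm):
    """Mean cross-validated R^2 using only the bands inside [lo_nm, hi_nm]."""
    mask = (wavelengths >= lo_nm) & (wavelengths <= hi_nm)
    model = Ridge(alpha=1.0)
    return cross_val_score(model, X[:, mask], moisture, cv=5, scoring="r2").mean()


candidates = [(454, 950), (454, 742), (746, 950)]
scores = {c: score_range(*c) for c in candidates}
best = max(scores, key=scores.get)
for (lo, hi), s in scores.items():
    print(f"{lo}-{hi} nm: mean R^2 = {s:.3f}")
print("best window:", best)
```

In practice the candidate list would cover many more windows, and each window would be scored with the same regressors and validation scheme used for the final results.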
On the other hand, because the SOC and NC datasets contain a large number of HS absorbance bands, selecting and eliminating specific band ranges by trial and error became impractical. Hence, the least absolute shrinkage and selection operator (Lasso) algorithm was applied to determine the significant band ranges [70]. For SOC prediction, we selected the 575.5-1062 nm, 1100 nm, 1852-1885 nm, 1945-2017.5 nm, 2053-2208 nm, and 2454-2499.5 nm HS bands. Of the 4200 HS bands, only 1591 were selected and 2609 were eliminated.
Similarly, for NC prediction the Lasso algorithm was applied to select the most influential bands. Of the 4200 bands, only 252 were selected as the most significant. The effective band ranges were 594-616.5 nm, 646-675.5 nm, 1052.5-1108.5 nm, and 2302-2489 nm. By applying the Lasso algorithm, 3948 bands were eliminated, significantly reducing computational cost and time.
This algorithm regularizes the model by shrinking the regression coefficients and reducing a number of the less essential coefficients to 0. After shrinkage, only the features with non-zero coefficients were used as selected features to train the model. Therefore, significant numbers of weak features were eliminated, improving model prediction performance and minimizing both bias and variance.
The Lasso algorithm works with the following cost function:

$$J = \sum_{i=1}^{N_{sample}} \left( y_{true,i} - y_{observed,i} \right)^2 + \alpha \sum_{k} |a_k| \qquad (2)$$

where $a_k$ is the k-th feature coefficient, $\alpha$ is a hyperparameter, and $y_{true}$ and $y_{observed}$ are the ground truth and predicted data, respectively. The value of the cost function increases with the magnitude of each feature coefficient. Therefore, the Lasso algorithm minimizes the cost function by keeping $|a_k|$ small. As the penalty coefficient $\alpha$ becomes larger, more coefficients are forced to 0.
The algorithm becomes an ordinary least squares regression when $\alpha$ is 0. On the other hand, when $\alpha$ increases, the variance decreases significantly and the bias increases.
In this way, the Lasso algorithm eliminates irrelevant variables that do not contribute to prediction performance.
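The selection mechanism can be illustrated with scikit-learn's `Lasso` on synthetic data. This is a sketch under stated assumptions, not the study's configuration: the 40 "bands", the four planted informative indices, and `alpha=0.1` are all hypothetical. Note that scikit-learn scales the squared-error term by 1/(2·n_samples), so its `alpha` is not numerically comparable with the α of an unscaled cost function.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_bands = 150, 40
X = rng.normal(size=(n_samples, n_bands))        # synthetic band values

# Only four bands truly influence the target (assumption for illustration).
true_coef = np.zeros(n_bands)
true_coef[[3, 4, 5, 20]] = [2.0, -1.5, 1.0, 3.0]
y = X @ true_coef + rng.normal(scale=0.1, size=n_samples)

X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)

# Bands whose coefficients were not shrunk to zero are kept as features.
selected = np.flatnonzero(lasso.coef_)
X_selected = X_scaled[:, selected]
print("selected band indices:", selected.tolist())
print("bands kept:", selected.size, "of", n_bands)
```

The reduced matrix `X_selected` would then be passed on to PCA and the regression models; a larger `alpha` keeps fewer bands.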

Data Dimension Reduction
Finally, we considered the Principal Component Analysis (PCA) technique, which is widely used to handle high-dimensional datasets and effectively decreases the dimensionality of data. Instead of using all HS bands, we relied only on the first seven principal components, which captured almost 99.97% of the variance in the features for all three datasets.
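The reduction to seven principal components can be sketched with scikit-learn's `PCA`. The data below are synthetic and assumed for illustration only: 73 strongly correlated "bands" generated from three latent factors, which mimics the high redundancy typical of neighbouring HS bands. The explained-variance figure it prints is a property of this toy data, not a reproduction of the value reported in the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_bands = 300, 73

# Synthetic spectra: a few latent factors drive all bands, plus small noise.
latent = rng.normal(size=(n_samples, 3))
loadings = rng.normal(size=(3, n_bands))
X = latent @ loadings + rng.normal(scale=0.01, size=(n_samples, n_bands))

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=7)                 # seven components, as in the text
X_reduced = pca.fit_transform(X_scaled)   # shape: (n_samples, 7)

explained = pca.explained_variance_ratio_.sum()
print("reduced shape:", X_reduced.shape)
print(f"variance captured by 7 components: {explained:.4%}")
```

The regressors are then trained on `X_reduced` instead of the full band matrix.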

Regression Model
Different ML regression models were studied to predict SM, SOC, and NC from the HS data: Linear Regression (LR) [71], Random Forest (RF) [72], Decision Tree (DT) [73], Gradient Boosting (GB) [74], Support Vector Regression (SVR) [75], Self-Organizing Map (SOM) [30], K-Nearest Neighbors (KNN) [76], and Artificial Neural Network (ANN) [77]. Most of the ML regression models were taken from the well-known scikit-learn library, except for SOM, which was implemented with the SuSi library. We used the SOM model already implemented by the authors of [30]. Most of the regression models follow a supervised learning algorithm. However, the SOM framework consists of an unsupervised SOM followed by a supervised SOM.
In order to achieve good prediction accuracy, the machine learning regression models were tuned during the training process using the grid-search approach; their hyper-parameters are described in Table 1. LR, DT, and GB, however, provided satisfactory performance without tuning, so for these models we relied on the default settings of the scikit-learn library [78]. After completing the tuning of the ML regression models and the training phase, the testing phase was started.
Table 1. Hyperparameter setup for different machine learning regression models.

Model | Library Package | Hyper-Parameter
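The grid-search tuning mentioned above can be sketched with scikit-learn's `GridSearchCV`, here applied to an SVR model. The parameter grid, the synthetic data, and the five-fold inner validation are assumptions chosen for illustration; they are not the values from Table 1.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic features standing in for seven principal components (assumption).
X = rng.normal(size=(120, 7))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)

# Hypothetical search grid; the study's actual grid is not reproduced here.
param_grid = {
    "C": [1, 10, 100],
    "gamma": ["scale", 0.01],
    "epsilon": [0.01, 0.1],
}

search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_model = search.best_estimator_   # refitted on all data with the best setting
print("best parameters:", search.best_params_)
print(f"best mean CV R^2: {search.best_score_:.3f}")
```

Every combination in the grid is scored by cross-validation, and the best-scoring setting is refitted on the full training data.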

Evaluation Parameter
The efficacy of each ML model was evaluated by computing the coefficient of determination (R²), mean absolute error (MAE), and root mean squared error (RMSE). R² measures the proportion of the variation in the test data that is explained by the predicted data, MAE is the mean absolute difference between the predicted output and the ground truth value, and RMSE describes the standard deviation of the residuals (the differences between the model predictions and the actual values). The value of R² ranges from 0 to 1; the closer the value is to 1, the better the model describes the correlation between the actual and predicted values. For MAE and RMSE, a lower value indicates better prediction accuracy [79].
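The mathematical expressions of these terms are provided below:

$$R^2 = 1 - \frac{\sum_{i=1}^{N_{sample}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N_{sample}} (y_i - \bar{y})^2} \qquad (3)$$

$$MAE = \frac{1}{N_{sample}} \sum_{i=1}^{N_{sample}} |y_i - \hat{y}_i| \qquad (4)$$

$$RMSE = \sqrt{\frac{1}{N_{sample}} \sum_{i=1}^{N_{sample}} (y_i - \hat{y}_i)^2} \qquad (5)$$

where $y_i$ and $\hat{y}_i$ are the original and predicted values for the i-th sample, respectively, $\bar{y}$ is the average of the original values, and $N_{sample}$ is the number of samples.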
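The three metrics can be computed directly from their standard definitions, as in the following pure-Python sketch. The function name `regression_metrics` and the four sample values are hypothetical and serve only as a small worked example.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, MAE, RMSE) for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse


# Hypothetical soil moisture values in percent (assumption).
y_true = [30.0, 35.0, 40.0, 25.0]
y_pred = [31.0, 34.0, 39.0, 27.0]

r2, mae, rmse = regression_metrics(y_true, y_pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:.2f}, RMSE = {rmse:.3f}")
```

For this example the residuals are −1, 1, 1, and −2, so the residual sum of squares is 7, the total sum of squares is 125, and R² = 1 − 7/125 = 0.944, with MAE = 1.25 and RMSE = √1.75 ≈ 1.323. Because MAE and RMSE are expressed in the units of the target, they grow with the spread of the target values even when R² is high.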

Results of Model Evaluation and Validation
This section presents the performance and comparison of regression models to predict SM, SOC, and NC from three different HS datasets. The main aim of developing a model is to achieve good performance on unseen data. A good model provides accurate predictions for both seen and unseen data, that is, it neither overfits nor underfits. To assess this, k-fold cross-validation can be used. We considered ten-fold cross-validation in this study to evaluate the model efficacy. The whole dataset was divided into ten groups; in each run, nine groups were used to train the model and the remaining group was used to test it, as shown in Figure 10. The process was repeated ten times, with model performance recorded each time for a different set of testing data. Finally, the mean prediction accuracy (Equation (6)) was derived for each ML model. The experiment was carried out for the three HS datasets to predict SM, SOC, and NC.
[Figure 10: ten-fold cross-validation; legend entries: Testing Data, Training Data]
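The ten-fold procedure described above can be sketched with scikit-learn's `KFold` and `cross_validate`. The data are synthetic and a plain linear regression is used as a stand-in for the studied regressors; the shuffling and `random_state` are assumptions made so that the example is reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)
# Synthetic features and target (assumption), e.g. seven principal components.
X = rng.normal(size=(200, 7))
y = X @ rng.normal(size=7) + rng.normal(scale=0.2, size=200)

# Ten folds: each fold serves once as test data while the other nine train the model.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
results = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=cv,
    scoring=("r2", "neg_mean_absolute_error", "neg_root_mean_squared_error"),
)

# scikit-learn reports errors as negated scores, so the sign is flipped back.
mean_r2 = results["test_r2"].mean()
mean_mae = -results["test_neg_mean_absolute_error"].mean()
mean_rmse = -results["test_neg_root_mean_squared_error"].mean()
print(f"mean R^2 = {mean_r2:.3f}, mean MAE = {mean_mae:.3f}, mean RMSE = {mean_rmse:.3f}")
```

The per-fold arrays in `results` also give the best and worst fold for each model, which is the information summarized by the box plots in the following subsections.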

Soil Moisture Prediction
In this study, the models were developed to predict SM from the HS data with eight feature combinations. The regression results are shown in Table 2. In the first case, all HS bands (AHSB) ranging from 454 nm to 950 nm were considered and the prediction performance was noted for eight different ML regressors. In this scenario, SVR performed the best (R² = 95.43%, MAE = 0.49, and RMSE = 0.80). After that, soil temperature was considered together with AHSB, and improved results were obtained for the LR, RF, GB, and ANN regressors. The best result was noted for RF, with 92.85% prediction accuracy.
In the next step, instead of using the AHSB range, the effect of the selected bands (SB) ranging from 454 nm to 742 nm was considered to predict SM. After eliminating 52 bands we obtained satisfactory results, with the highest accuracy of 94.31% for the SVR model and MAE and RMSE values of 0.56 and 0.88, respectively. The effect of soil temperature with SB was evaluated in the next step; GB performed best, with 92.91% accuracy.
In order to handle the extensive dimensionality of the data, PCA was performed in four cases. First, PCA was applied to AHSB, which improved prediction accuracy compared with raw AHSB for most of the regressor models, with the exception of DT and SVR. Then, PCA was performed for SB only; again, good prediction performance was noted. Finally, considering the soil-temperature effect, PCA was performed with AHSB and SB in the third and fourth cases, respectively. From Table 2, it can be seen that PCA has a good impact on predicting SM. The best results were obtained for KNN, with more than 93% prediction accuracy in both cases.
Finally, the average performance of each feature combination was calculated. It is clear that SB with PCA provides the best average prediction accuracy (91.62%) in terms of R². Figure 11 shows the comparison box plot of eight different ML models considering three criteria: i. AHSB, ii. PCA on SB, and iii. PCA on SB, including soil temperature. This figure is drawn from the ten-fold cross-validation results and indicates the best, worst, mean, and median performance of each ML model for predicting SM. Therefore, this box plot reflects the results of the ten-fold cross-validation used to validate our model. The circle and the cross-line on each box show the mean and median, respectively. When PCA on SB is considered with and without the effect of temperature, the SM prediction accuracy improves, and the range between the best and worst predictions becomes shorter compared with AHSB. This prediction improvement is noted for all the ML regressors we considered. SVR provides the best average prediction accuracy, with minimal fluctuation in the prediction range.

Soil Organic Carbon Prediction
The possibility of predicting SOC from the LUCAS dataset (Sweden) was investigated with our proposed methodology. Four feature combinations (AHSB, SB, PCA of AHSB, and PCA of SB) were studied for all the ML regressors. Table 3 presents the ten-fold cross-validation results of SOC prediction accuracy in terms of R², MAE, and RMSE. When AHSB was considered, the RF model provided 83.98% accuracy in terms of R², and the MAE and RMSE were 35.13 and 62.46, respectively. However, when only the effective bands were considered, the SVR model performed best (R² = 90.52%, MAE = 26.00, and RMSE = 48.36). The prediction efficiency in terms of R² was improved when PCA was applied to SB. Considering all ML models, the best average R², with 83.14% prediction accuracy, was obtained when PCA was performed on SB. However, the values of MAE and RMSE were large due to the inhomogeneity of the data samples [69]. It is important to check the data homogeneity before beginning machine learning or statistical operations. Homogeneous data should follow a constant trend under the changing parameters that may affect the data. However, in practice, this is almost impossible to obtain for soil data samples. From Table 3, it can be noticed that the prediction accuracy in terms of R² is satisfactory, whereas the MAE and RMSE are high due to the wide variation of SOC among the soil samples. Figure 12 considers AHSB and PCA of SB in order to better understand the fluctuation range of the R² prediction, illustrating that PCA with SB shows less fluctuation and better SOC prediction.

Soil Nitrogen Content Prediction
The LUCAS (Sweden) dataset with soil nitrogen as a ground truth was used to understand the possibility of predicting NC from the HS dataset. Table 4 shows the prediction performance of different ML regressors with ten-fold cross-validation.
When we considered AHSB ranging from 400 nm to 2499.5 nm, SOM performed the best, with R² = 74.71%, MAE = 1.87, and RMSE = 3.01, whereas DT recorded the lowest R² (56.60%). In the next step, the performance of the SB range was investigated for NC prediction. The performance of SB was satisfactory, with the best prediction accuracy provided by ANN (R² = 79.05%, MAE = 1.73, and RMSE = 2.74).
The next step recorded improved prediction performance when PCA was performed with AHSB and SB. The average R² results show that PCA with SB provides the best NC prediction accuracy (75.69%). Figure 13 shows the comparison box plot between AHSB and PCA (SB-Lasso), illustrating each ML model's best, worst, mean, and median prediction performance. The figure is drawn based on the outcomes of ten-fold cross-validation to predict NC. This figure shows that PCA with SB performs comparatively well and shortens the range between the best and worst results.

Discussion
The main aim of this paper is to determine the most effective methodology to predict SM, SOC, and NC with reasonable accuracy from the HS data. During the experiments, all the ML tuning parameters, the ratio of testing to training data, and all other parameters remained constant in order to understand the influence of different features in predicting SM, SOC, and NC more accurately and precisely. The outcomes of this study (Tables 2-4), considering the average performance of eight different machine learning algorithms, show that the combined effect of PCA with selected bands provides the best prediction accuracy for SM, SOC, and NC. The best average prediction accuracies for SM, SOC, and NC in terms of R² are 91.62%, 83.14%, and 75.69%, respectively. Therefore, it is clear that PCA on SB is the feature combination that provides the best prediction for the studied soil contents.
After a critical analysis of eight different ML regression models and according to their average prediction performance, the following conclusions can be drawn.

• The HS band can be used effectively to predict SM with good prediction accuracy; when AHSB is considered, the SVR algorithm performs best (R² = 95.43%, MAE = 0.49, RMSE = 0.80).
• The SOC prediction accuracy in terms of R² is satisfactory, as it defines the normalized difference between actual and predicted data.
• However, the MAE and RMSE are high because the variation among the soil samples is high. As MAE and RMSE measure the absolute difference between the original and predicted values, these errors appear large even though the correlation is good.

Soil Nitrogen Content Prediction
• Soil NC can be predicted with reasonable accuracy from HS data. When the AHSB range is considered, SOM provides the best prediction accuracy (R² = 74.71%, MAE = 1.87, RMSE = 3.01);
• The prediction accuracy for all of the ML regressors is further improved when effective band selection via the Lasso algorithm is considered; the average prediction accuracy improves from 70.56% to 73.37% in terms of R² value.
• PCA plays a vital role in further improving prediction accuracy, with the average prediction accuracy increasing to 73.31% when PCA is applied on AHSB.
• The best result is obtained when PCA is performed on effective SB for the KNN regressor, with 77.80% prediction accuracy. From the value of the average result (75.69%), it can be observed that PCA on SB is the most important feature for predicting NC from HS data.
After critically analyzing the prediction performance of SM, SOC, and NC from three different HS datasets, Table 5 summarizes the best ML regression model according to the best performance accuracy in terms of R². The best SM prediction is obtained by the GB regressor when PCA is performed on AHSB, with 95.98% prediction accuracy. On the other hand, the SVR model performs best (90.52%) for SOC when only SB is considered. The LR model provides the best NC prediction with PCA on SB, with 79.23% prediction accuracy.

Conclusions
In this study, we have addressed the prediction of SM, SOC, and NC from HS data considering different ML frameworks and have listed performance comparisons. While the existing methods provide good results, the proposed method provides the best results. The importance of particular HS band selection for SM, SOC, and NC prediction from the three different HS datasets is evaluated. Additionally, the effect of soil temperature on SM prediction is considered. The study was conducted using the PCA dimensionality reduction technique. Significant improvement is noted for all ML algorithms when PCA is combined with effectively selected HS bands. This study proposes a generalized approach to predict soil content more accurately and efficiently. The proposed approach saves significant computational time and provides good prediction performance using the important features. In future work, we intend to understand the physical interpretation of HS bands and the behavior of satellite HS images to predict soil components using our proposed methodology.