1. Introduction
Hyperspectral imaging (HSI) is a powerful tool for precision agriculture that utilizes sensors to capture images across numerous narrow spectral bands. Each organic component of soil interacts differently with specific wavelengths, providing detailed insights into soil composition and health. Under changing climatic conditions, accurately assessing soil nutrient content remains a major challenge, particularly in regions where maintaining food security is critical. Hyperspectral imaging supports the early detection of nutrient deficiencies, pests, and diseases, enabling precise nutrient management and crop classification, thereby enhancing agricultural efficiency and yield. However, conventional soil analysis methods, although accurate, are labor-intensive, time-consuming, and expensive. These methods require physical sampling, transportation, and laboratory testing, which limit their scalability and may degrade sample quality. Traditional reflectance measurements produce datasets representing organic and nutrient compositions; however, extracting meaningful information from these high-dimensional datasets requires advanced analytical models.
Machine learning (ML) and deep learning (DL) approaches have increasingly been adopted to identify the spectral bands most relevant for predicting soil organic carbon (SOC) and other nutrients. Optimization techniques such as the genetic algorithm (GA), particle swarm optimization (PSO), and ant colony optimization (ACO) have been applied to band selection. Although effective to some extent, these algorithms often suffer from redundant feature selection, slow convergence, and reduced stability when applied to large hyperspectral datasets. These limitations emphasize the need for more adaptive and computationally efficient optimization strategies. To address these challenges, this study employs the FOX algorithm, a recent nature-inspired metaheuristic designed to balance exploration and exploitation more effectively. FOX enhances feature selection by minimizing redundancy, improving convergence stability, and achieving higher predictive accuracy in complex spectral environments. By integrating FOX with regression models such as partial least squares regression (PLSR), random forest (RF), and gradient boosting (XGBoost), this study aims to improve the reliability and efficiency of soil nutrient estimation. The main objectives of this study are as follows:
Implement the FOX algorithm to select the most relevant hyperspectral bands for soil nutrient estimation.
Utilize PRISMA satellite hyperspectral imagery to evaluate the feasibility of large-scale soil nutrient mapping.
Compare FOX with traditional optimization algorithms (GA, ACO, PSO) to demonstrate its efficiency in feature selection and predictive performance.
2. Related Works
Spectroscopic measurements and imaging of soil color have been used for field-scale estimation of soil organic carbon. One study used a digital camera and Sentinel-2 remote-sensing data to estimate soil organic carbon content through soil color analysis [1]. Random forest and advanced feature selection algorithms were used to optimize the prediction models. The advantages of this approach include better extraction of spectral information, improved accuracy, and the potential for quantitative estimation. However, its drawbacks include discrete sampling, point-to-point data collection, and the influence of external factors on model robustness. Another study combined efficient signal preprocessing with an optimal band-combination algorithm, using 233 soil samples and nine spectral preprocessing methods, to predict soil organic matter (SOM) from visible and near-infrared spectra and improve prediction accuracy [2].
The Savitzky–Golay filter was found to be the most effective preprocessing method. However, this approach has limitations, including difficulty in accurately representing complex soil compositions, sensitivity to environmental conditions, and restricted applicability to specific soil types and regions. Hyperspectral imaging, integrated with multivariate analysis and variable selection, has also been investigated for high-resolution mapping of soil carbon fractions within intact paddy soil profiles [3]. The authors compared linear and nonlinear multivariate techniques and applied a spectral variable-selection technique known as competitive adaptive reweighted sampling (CARS) to simplify models by selecting the most relevant variables.
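As a rough illustration of the kind of Savitzky–Golay smoothing referred to above, the sketch below applies the standard 5-point quadratic filter to a reflectance spectrum in plain Python. The window length, the polynomial order, the edge handling, the function name `savgol5`, and the sample data are all assumptions made for illustration; the settings used in the cited study are not stated here.

```python
# Standard 5-point, quadratic Savitzky-Golay smoothing coefficients.
SG5_COEFFS = (-3, 12, 17, 12, -3)
SG5_NORM = 35


def savgol5(spectrum):
    """Smooth a 1-D reflectance spectrum with a 5-point quadratic filter.

    The two points at each edge are left unchanged (an assumed, simple
    edge policy; other policies are possible).
    """
    values = list(spectrum)
    smoothed = values[:]
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        smoothed[i] = sum(c * v for c, v in zip(SG5_COEFFS, window)) / SG5_NORM
    return smoothed


# Illustrative data: a smooth quadratic trend plus one noisy spike at index 4.
clean = [0.10 + 0.01 * k * k for k in range(9)]
noisy = clean[:]
noisy[4] += 0.05
smoothed = savgol5(noisy)
```

A quadratic trend passes through this filter unchanged, while the isolated spike is attenuated, which is the property that makes the filter attractive for noisy spectra.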
The accuracy and robustness of the CARS-SVMR model may require validation in other regions. The primary objective of that study was to enhance processing speed and efficiency for hyperspectral imaging (HSI) applications. Field and imaging spectroscopy techniques have been evaluated to improve the estimation of soil organic carbon (SOC) and soil nitrogen (SN) under laboratory conditions [4]. One study assessed the performance of two hyperspectral sensors, the SVC HR-1024i field radiometer (FS) and the Specim IQ imaging spectrometer (IS), using 157 soil samples collected from the Taita Hills, Kenya. The results showed better predictive accuracy in the full-wavelength and shortwave-infrared (SWIR) regions, suggesting that the FS was best for SOC and SN estimation when the SWIR region was included. Another work aimed to support regenerative agriculture initiatives by developing soil organic content prediction models based on discrete wavelet analysis of hyperspectral satellite data. That thesis introduced a noise-removal method for satellite-based hyperspectral soil data, using the discrete wavelet transform to reconstruct both the original and first-derivative reflectance [5]. Appropriate preprocessing techniques for hyperspectral soil organic matter content estimation in the black soil area have also been explored [6]. However, potential drawbacks include complexity and sensitivity to specific parameter settings. Another study aimed to improve SOM estimation efficiency [7]. A further study explored the use of hyperspectral data to estimate soil nutrient content in order to monitor soil status and support sustainable agricultural development [8]. Techniques such as PLSR, PCC, LASSO, and GBDT have been used to find the optimal screening algorithm for estimating total nitrogen, total phosphorus, and total potassium content in soil. Linear and nonlinear machine learning techniques have also been employed [9]. Airborne hyperspectral imaging data were used to identify the spatial variability of soil nitrogen content, which is essential for agricultural development. This study focused on two areas in the Czech Republic, using laboratory and handheld spectrometers to assess soil nitrogen while excluding other nutrients such as potassium, phosphorus, and carbon. Incorporating advanced geomorphic features and algorithms significantly improves the accuracy and transferability of regional-scale hyperspectral prediction models of SOC. Techniques such as fractional-order derivatives, robust denoising methods, and preprocessed mid-infrared spectroscopy have proven effective in improving the prediction of SOC content [10]. These models use spectral reflectance data, derived features, and pretrained weights to improve accuracy and overcome challenges associated with mapping SOC stock using hyperspectral and time-series multispectral remote sensing images in low-relief agricultural areas, which is crucial for effective land management. Agricultural land contributes significantly to global soil carbon storage, supporting crop growth and reducing greenhouse gas emissions. Hyperspectral and multispectral images have been used for digital soil mapping, with PLSR and ELM models used to predict soil organic carbon stock and properties [11].
Recent studies have focused on improving the accuracy of soil organic carbon content prediction based on visible and near-infrared (Vis–NIR) spectroscopy and machine learning. Vis–NIR diffuse reflectance spectroscopy is a rapid and nondestructive method for estimating soil organic matter distribution and properties, saving time and reducing the cost of collecting soil sample data. Various calibration methods, including partial least squares regression, support vector machines (SVMs), and artificial neural networks (ANNs), have been used to predict SOC content. A study in southern Hangzhou Bay, Zhejiang Province, China, compared the performance of different calibration methods and preprocessing approaches for SOC estimation; SVM regression combined with the first derivatives of the reflectance provided the best prediction results. Hyperspectral technology, particularly Vis–NIR–IR hyperspectral technology, has emerged as a rapid, accurate, economical, and nondestructive method for soil analysis. Principal component analysis combined with deep learning techniques was used to enhance data quality and model robustness, and these methods were applied to soil samples obtained from the Eastern Junggar coalfield in China. The results showed improved TabNet and CNN regression predictions, demonstrating the effectiveness of NIR hyperspectral imaging for identifying heavy metal pollution. However, the limited sample size may restrict generalizability. Soil organic matter content has also been estimated using selected spectral subsets of hyperspectral data. Loss-on-ignition (LOI) is a reliable method for determining SOC content in soil samples. The findings indicate that using informative spectral subsets offers a promising approach to estimating SOC content [12]. Additionally, leveraging hyperspectral reflectance data from soil libraries combined with machine learning techniques demonstrates the potential of airborne and spaceborne optical soil sensing for accurate SOC predictions. One study used a public soil spectral library and machine learning algorithms to predict SOC concentrations. The prediction models were partial least squares regression (PLSR), random forest (RF), and a convolutional neural network (CNN). The Fast Line-of-Sight Atmospheric Analysis of Spectral Hypercubes (FLAASH) module was employed for radiometric calibration and atmospheric correction. Another study developed regional models for predicting SOC using multivariate analysis of mid-infrared hyperspectral data collected from the Indo-Gangetic plains of India. Preprocessing techniques were applied to enhance the spectral data and improve the accuracy of SOC predictions. Four multivariate methods were used to develop predictive models, with higher RPD values indicating better performance. MIR spectroscopy is quick, accurate, and consistent across different laboratories [13]. This integrated approach enhances the reliability of SOC predictions by generating more accurate and detailed spatial maps [14]. Hyperspectral imaging (HSI) captures spectral and spatial data to estimate soil properties. Traditional methods such as thermogravimetric analysis and loss-on-ignition are also discussed. However, the requirement for high-quality hyperspectral data may limit the applicability of these methods [15]. Machine learning models were used to predict SOC, and the predictions were validated against ground-truth samples using metrics such as R² and RMSE. This method reduces the need for field sampling, shows strong prediction accuracy, and can be applied across different regions and soil types in the future.
The main advantage of the FOX algorithm over other optimization algorithms applied to hyperspectral band selection, such as the genetic algorithm (GA), particle swarm optimization (PSO), and ant colony optimization (ACO), is its faster convergence rate. The other algorithms are sensitive to parameter tuning and tend to converge to local optima. Although the GA provides good global search capability, it is computationally expensive. PSO converges faster but loses diversity in later iterations. ACO is best suited to discrete problems and is slower and less efficient for high-dimensional data. FOX addresses these issues by balancing exploration and exploitation: it randomizes its target, so that a given step performs either a global or a local search. This ability to adapt the search behavior dynamically reduces the risk of being trapped in local minima and improves diversity in the search space. Additionally, FOX requires fewer parameters to tune. However, the FOX algorithm is relatively new and must still be tested across a broader range of datasets to confirm its consistency. Overall, FOX is a more stable and efficient alternative for hyperspectral band selection.
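The idea of randomly switching between global and local search during band selection can be sketched as follows. This is a generic random search written for illustration only; it is not the FOX update rule, which this section does not spell out. The fitness function (relevance to the target minus redundancy among selected bands), the 50/50 switching probability, the function names, and the synthetic data are all assumptions.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def fitness(bands, spectra, target):
    """Assumed objective: mean relevance to the target minus mean pairwise redundancy."""
    cols = [[row[b] for row in spectra] for b in bands]
    relevance = mean(abs(pearson(c, target)) for c in cols)
    pairs = [(i, j) for i in range(len(cols)) for j in range(i + 1, len(cols))]
    redundancy = mean(abs(pearson(cols[i], cols[j])) for i, j in pairs) if pairs else 0.0
    return relevance - redundancy


def select_bands(spectra, target, k, iterations=200, seed=0):
    """Random search alternating global and local moves (a stand-in, not FOX itself)."""
    n_bands = len(spectra[0])
    if not 0 < k < n_bands:
        raise ValueError("k must be between 1 and n_bands - 1")
    rng = random.Random(seed)
    best = rng.sample(range(n_bands), k)
    best_fit = fitness(best, spectra, target)
    for _ in range(iterations):
        if rng.random() < 0.5:
            # Exploration: draw a completely new subset of bands.
            candidate = rng.sample(range(n_bands), k)
        else:
            # Exploitation: swap one band of the current best subset.
            candidate = best[:]
            unused = [b for b in range(n_bands) if b not in candidate]
            candidate[rng.randrange(k)] = rng.choice(unused)
        cand_fit = fitness(candidate, spectra, target)
        if cand_fit > best_fit:
            best, best_fit = candidate, cand_fit
    return sorted(best), best_fit


# Synthetic data (hypothetical): 30 samples, 8 bands, target driven by bands 1 and 5.
data_rng = random.Random(1)
spectra = [[data_rng.random() for _ in range(8)] for _ in range(30)]
target = [2.0 * row[1] - 1.5 * row[5] for row in spectra]
bands, score = select_bands(spectra, target, k=2)
```

Penalizing correlation among the selected bands is one simple way to discourage the redundant selections that the text attributes to GA, PSO, and ACO; a wrapper fitness based on a regression model's validation error would be an equally plausible choice.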
4. Results and Discussions
Based on the observations, the fox-inspired optimization (FOX) algorithm consistently achieved higher coefficients of determination (R²) and lower error metrics than the other baseline optimization algorithms. The unique capability of the FOX algorithm to dynamically balance exploration and exploitation enables it to avoid premature convergence to local minima, thereby providing a more efficient and stable search process. Consequently, FOX is particularly well-suited for identifying relevant spectral bands within fewer iterations [38]. Although particle swarm optimization (PSO) and the genetic algorithm (GA) occasionally achieved comparable or slightly higher R² values, they converged less stably and were more sensitive to parameter tuning than FOX. In summary, the FOX algorithm provides the most effective trade-off among accuracy, computational efficiency, and stability, making it the most appropriate technique for spectral band selection before regression modeling [39].
After identifying the relevant bands using the FOX technique, it was observed that each regression model demonstrated distinct performance trends for different soil nutrients. The random forest (RF) model achieved its highest R² value, up to 0.74, for organic carbon under bat optimization (BO), while its corresponding R² under the FOX algorithm was approximately 0.40. The partial least squares regression (PLSR) model performed consistently poorly across all nutrient types and optimization methods, with R² values generally ranging between 0.3 and 0.4. However, the hybrid PLSR + XGBoost model achieved the best overall predictive accuracy, with R² values of 0.97 for organic carbon and 0.95 for phosphorus based on the PRISMA-derived data, outperforming all single-model approaches. These findings confirm that combining FOX-based band selection with ensemble or hybrid regression models significantly enhances the accuracy and robustness of soil nutrient prediction.
The soil nutrient levels, classified according to the set criteria, are shown in Table 4. The four important nutrients (organic carbon, nitrogen, phosphorus, and potassium) are classified into three concentration levels (low, medium, and high), which are essential indicators of soil health and fertility and strongly affect crop growth and agricultural yield. For hyperspectral data analysis, this table serves as a reference for correlating spectral reflectance measurements with nutrient content. By identifying the spectral bands that are most sensitive to variations in nutrient levels in the samples, researchers can use hyperspectral imaging techniques to estimate and map soil nutrient content over large areas. After band optimization and regression, the spectral data can be used to predict whether a soil sample has low, medium, or high levels of each nutrient.
In practice, these classifications support precision-agriculture methods, including site-specific fertilizer application and improved soil-management practices. The use of hyperspectral data for remotely sensing nutrient levels reduces the need for extensive soil samples and laboratory analyses, making farming operations more efficient, cost-effective, and environmentally friendly.
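Mapping a predicted nutrient value to a low, medium, or high class can be sketched as a simple threshold lookup. The boundary values, units, and names below are placeholders chosen for illustration and are not the criteria from Table 4.

```python
# Hypothetical class boundaries: (lower edge of "medium", lower edge of "high").
# These numbers are placeholders, NOT the criteria listed in Table 4.
THRESHOLDS = {
    "organic_carbon": (0.50, 0.75),
    "nitrogen": (280.0, 560.0),
    "phosphorus": (10.0, 25.0),
    "potassium": (110.0, 280.0),
}


def classify(nutrient, value):
    """Return 'low', 'medium', or 'high' for a predicted nutrient value."""
    medium_min, high_min = THRESHOLDS[nutrient]
    if value < medium_min:
        return "low"
    if value < high_min:
        return "medium"
    return "high"


# Example: class labels for a few hypothetical predictions.
labels = [classify("organic_carbon", v) for v in (0.40, 0.60, 0.90)]
```

In a site-specific fertilizer workflow, labels of this kind would be computed per sample or per pixel and then mapped.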
4.1. Datasets Description
The workflow of this study, shown in Figure 1, combines hyperspectral measurement of soil organic content with machine learning-based regression analysis. The initial step involves collecting soil samples from agricultural fields. In this study, soil samples were obtained from Radhapuram, located in the Tirunelveli district, and their spectral data were acquired using a spectroradiometer. A total of 65 soil samples were collected from representative agricultural plots, ensuring variation in soil texture, moisture, and nutrient composition. The spectral signature of each sample was recorded over 1849 wavelengths ranging from 350 to 2500 nm, forming a 1849 × 65 dataset. To improve prediction accuracy, feature selection was performed to choose the bands that contributed most to the prediction of soil nutrient content. This step identified the most informative wavelengths or wavelength ranges and their significance for predicting the soil properties in question. The processed data were then used to train machine learning regression models, which capture the relationship between the selected spectral bands and the soil properties under investigation. The trained models were validated and used to predict the soil organic content of new samples. The resulting model can provide farmers with information for planning and improving agricultural yield. Multiple models were developed to conduct a comparative study and determine the most efficient one.
Figure 4 shows the hyperspectral radiometer data collected from Radhapuram in the Tirunelveli district of Tamil Nadu. The many small dots on the map mark the specific locations where data were gathered; these are likely the sampling points at which the spectral properties of the soil were measured [16]. The x-axis of the graph is labeled “Wavelength” and ranges from 350 to 2500 nanometers, and the y-axis is labeled “Spectral Reflectance” and ranges from 0 to 0.4; the data were collected with the Figspec FS23 hyperspectral spectroradiometer. The graph shows how different levels of soil nutrients, such as phosphorus, affect the reflectance measured by the radiometer. Here, the red points are the training samples, and the black points are the testing samples.
4.2. Accurate Calculation and Assessment of Soil Nutrient Concentrations at Soil Reflectance Sites
The independent variables in this study were the selected spectral parameters, and the dependent variables were the soil nutrient parameters SOC, N, P, and K. PLSR was used to predict the SOC, N, P, and K contents of the soil. The coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE) were calculated between the predicted and actual concentrations. HSI measures the reflectance intensity across various wavelengths of light, and each nutrient interacts differently with specific wavelengths owing to its chemical bonds and functional groups. Naturally occurring organic carbon is present as functional groups and their variants, and the key bands of light that interact with these compounds lie in the range of 1700 to 2200 nm. Nitrogen occurs in bonds in the soil, and specific bands are effective for its detection because of the stretching and vibrations associated with proteins, amides, and amino compounds. Nitrogen does not have a strong direct absorption feature, but it is correlated with soil organic carbon, and its absorption features often overlap with those of the SOC bands.
Similarly, phosphorus is bound to iron and aluminum oxides and to clay minerals and other organic matter, where potassium may also occur. Materials such as illite, mica, and feldspar contain potassium, and their reflectance features are primarily caused by bending vibrations of bonds in the compounds formed between potassium and other elements. In conclusion, the reflectance of hyperspectral light by soil nutrients arises from their presence in compounds rather than as isolated species. The bond angles and vibrational modes resulting from the transfer of energy at different wavelengths give a unique spectral characteristic to each nutrient.
The regression equations summarized in Table 5 represent the optimal spectral band combinations derived using the fox-inspired optimization (FOX) algorithm for predicting soil nutrients. Each equation corresponds to a specific nutrient concentration class (low, medium, or high) and combines the most informative wavelengths identified during the band selection process.
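A regression equation of this kind can be read as an intercept plus a weighted sum of reflectance values at the selected wavelengths. The sketch below evaluates such an equation; the wavelengths, coefficients, intercept, and function name are hypothetical and are not taken from Table 5.

```python
def predict_from_bands(reflectance, intercept, coefficients):
    """Evaluate intercept + sum(weight * reflectance at wavelength).

    reflectance:  dict mapping wavelength in nm -> measured reflectance
    coefficients: dict mapping wavelength in nm -> regression weight
    """
    return intercept + sum(w * reflectance[wl] for wl, w in coefficients.items())


# Hypothetical equation for one nutrient class (values are placeholders).
example_intercept = 0.20
example_coefficients = {1720: 1.5, 2200: -0.8}

# Hypothetical reflectance readings for one soil sample.
sample = {1720: 0.30, 2200: 0.25, 550: 0.12}
prediction = predict_from_bands(sample, example_intercept, example_coefficients)
```

Only the wavelengths named in the coefficient dictionary are read from the sample, which mirrors how band selection reduces the full spectrum to a few informative bands.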
4.3. Regression Results After Optimization
Five optimization techniques were used to find the best bands, and four regression types were used to evaluate the R² score, mean absolute error (MAE), and root mean square error (RMSE).
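For reference, the three evaluation metrics can be computed from true and predicted values as in the following minimal sketch. The function names and the sample values are illustrative assumptions and do not reproduce any result from the tables.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the mean squared difference."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Illustrative values only (not taken from the study's data).
y_true = [0.40, 0.50, 0.60, 0.70]
y_pred = [0.45, 0.50, 0.55, 0.70]
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_square_error(y_true, y_pred)
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE on the same predictions, which gives a quick sanity check when reading tabulated results.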
Table 6 shows the results of organic carbon (OC) prediction using different optimization techniques. Evidently, the particle swarm optimization (PSO) algorithm provides the best overall performance across all OC levels. For the OC Low group, the PSO–Linear model achieved the highest R² value of 0.7838 and the lowest RMSE of 0.096, indicating strong predictive accuracy. Similarly, for OC Medium and OC High, PSO maintained superior results, with R² values of 0.8091 and 0.7403, respectively. In comparison, FOX and ant colony optimization (ACO) showed lower R² values (below 0.31) and higher MAE, whereas bat optimization (BO) produced competitive but slightly less accurate predictions. The genetic algorithm (GA) performed the weakest, showing lower R² values and higher RMSE across all OC ranges.
Table 7 presents the results for phosphorus (P) prediction. As shown in the table, bat optimization with the random forest model achieved the best results, particularly for medium and high P levels, where R² values of 0.5718 and 0.6549 were obtained, with low RMSE values of 6.5105 and 5.845, respectively. FOX and ACO achieved moderate accuracy, whereas PSO and GA demonstrated higher error values. This indicates that BO–RF is the most effective combination for estimating phosphorus.
Table 8 presents the potassium (K) prediction results. As shown in the table, FOX provided the most reliable outcomes, particularly when combined with random forest. For K Medium, FOX–RF achieved R² = 0.6534 with the lowest RMSE (5.8576) and an MAE of 34.31, outperforming all other optimization approaches. For K High, FOX–Linear also performed well, with R² = 0.4715, whereas PSO and BO showed high RMSE values, indicating poor prediction stability.
Table 9 shows the nitrogen (N) prediction performance using different optimization methods. The results indicate that ACO–RF yielded the best explained variance for low nitrogen levels (R² = 0.5269, RMSE = 6.8435). Although the GA and BO combinations yielded lower MAE values (approximately 24–31), their R² values remained low, suggesting less consistency. For the second nitrogen dataset (N Low1), ACO again provided moderate accuracy, whereas the other algorithms showed high variability.
Overall, the results presented in Table 6, Table 7, Table 8, and Table 9 demonstrate that PSO performs best for organic carbon estimation, BO excels for phosphorus, FOX shows greater stability for potassium, and ACO performs well for nitrogen. The genetic algorithm generally produces lower accuracy and higher error rates across all nutrient types.
4.4. Regression Result Curves
Table 10 illustrates the true versus predicted values for organic carbon (OC) at low, medium, and high concentration levels using different regression models, with the test samples shown in blue. For the OC Low dataset, the partial least squares regression (PLSR) model attained an R² value of 0.7838, an RMSE of 0.0960, and an MAE of 0.0749, indicating a strong correlation between the predicted and true values. For OC Medium, the PLSR model achieved slightly better accuracy, with the highest R² value of 0.8091 and the lowest RMSE of 0.0902, confirming improved predictive performance at this concentration level. For OC High, the linear regression model recorded an R² of 0.7672, an RMSE of 0.0996, and an MAE of 0.0729, showing consistent and reliable predictions, although with slightly lower accuracy than for the OC Medium dataset. Across all three datasets, the predicted values closely followed the red dashed 1:1 line, confirming that both PLSR and linear models provide robust prediction capabilities for organic carbon estimation. Overall, PLSR performed slightly better than linear regression, with the best predictive fit observed for the OC Medium dataset.
Table 11 shows the predicted versus actual phosphorus (P) values for low, medium, and high concentrations using different regression models. Here, the blue dots show the actual data points (true vs. predicted values), while the red line represents the model’s ideal linear fit showing the expected prediction trend. For the P Low dataset, the partial least squares regression (PLSR) model achieved an R² value of 0.3686, with an RMSE of 7.9057 and an MSE of 62.4998, indicating a moderate correlation between the predicted and actual phosphorus levels. For the P Medium dataset, the Laplace regression model slightly improved the coefficient of determination, with a lower RMSE of 7.8431, demonstrating slightly better prediction consistency. For the P High dataset, the PLSR model achieved the best performance among the three, with the highest R² value of 0.4185, an RMSE of 7.5869, and an MSE of 57.5612.
Table 12 shows the predicted versus actual values for the Nitrogen Low and Low1 datasets using the random forest model. Here, the blue dots show the actual data points (true vs. predicted values), while the red line represents the model’s ideal linear fit showing the expected prediction trend. For the N Low dataset, the model achieved a coefficient of determination (R²) of 0.2337, indicating that approximately 23% of the variance in the true nitrogen values was explained by the model. The root mean square error (RMSE) of 8.7098 and the mean squared error (MSE) of 75.8599 indicate a moderate prediction error. Although some data points align near the red dashed 1:1 line, a noticeable spread exists, implying that the model underestimates some values and overestimates others. Overall, the random forest model provided only limited predictive accuracy for the N Low dataset. For the Low1 dataset, the model achieved a higher R² value of 0.5269, indicating that approximately 52.7% of the variance in the observed data was captured by the model.
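Since RMSE is the square root of MSE, the RMSE and MSE pairs quoted above for phosphorus and nitrogen can be cross-checked with a few lines of code. The values below are copied from the text; the tolerance and the helper name are assumptions.

```python
import math

# (dataset, reported RMSE, reported MSE) as quoted in this section.
REPORTED = [
    ("P Low", 7.9057, 62.4998),
    ("P High", 7.5869, 57.5612),
    ("N Low", 8.7098, 75.8599),
]


def rmse_matches_mse(rmse, mse, tol=1e-3):
    """True when sqrt(MSE) agrees with the reported RMSE within tol."""
    return abs(math.sqrt(mse) - rmse) <= tol


checks = {name: rmse_matches_mse(rmse, mse) for name, rmse, mse in REPORTED}
```

All three pairs agree within the rounding of the reported figures, so the RMSE and MSE columns carry the same information for these datasets.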
Table 13 shows the relationship between the true and predicted values for the potassium (K) High dataset using the random forest model. Here, the blue dots show the actual data points (true vs. predicted values), while the red line represents the model’s ideal linear fit showing the expected prediction trend. The coefficient of determination (R²) was 0.2217, indicating that approximately 22% of the variance in the observed potassium values was explained by the model. The root mean square error (RMSE) value of 202.35 and the mean absolute error (MAE) of 150.83 suggest a high level of prediction error and bias in the model’s output.
4.5. Regression Results for PRISMA Data
For the PRISMA dataset, a hybrid regression model combining PLSR and XGBoost was used, as shown in Table 14. The regression results indicate that the genetic algorithm (GA) consistently delivered the highest prediction accuracy across all soil nutrients, with the highest R² values (up to 0.9970) and the lowest MSE (as low as 0.0001). ACO and BO showed moderate performance, with BO performing particularly well for phosphorus (R² of 0.9595). PSO showed similar trends but with slightly higher errors, especially for nitrogen and potassium. In contrast, FOX recorded the lowest accuracy, with reduced R² values, such as 0.6531 for organic carbon and 0.5886 for nitrogen, and higher MSE. Overall, these trends indicate that GA offers the most reliable and precise soil nutrient estimation from PRISMA hyperspectral imagery.
5. Conclusions
This study was motivated by the need to develop sustainable agricultural practices, particularly in the tropical agricultural regions of Tamil Nadu, India. Traditional soil analysis methods are accurate but laborious, time-consuming, and expensive. Soil sampling, transportation, and laboratory testing can degrade sample quality and prolong the collection period. Hyperspectral imagery addresses this problem because it enables remote sensing, preserving sample quality while providing preliminary information on soil health. In this study, hyperspectral data were acquired and interpreted to estimate soil nutrients, including organic carbon, nitrogen, phosphorus, and potassium. The hyperspectral imaging technique has proven to be an effective method for precision agriculture in soil and crop analysis and has applications in horticulture and food analysis. Its potential extends to the livestock sector, where animal health, welfare, and feed quality can also be analyzed with great accuracy. Additionally, natural resource management benefits from HSI-based monitoring of wildlife in both terrestrial and marine ecosystems.
Future Scope
Among the limitations of HSI, the acquisition of hypercubes is a major one. The capture of multiple images with different bands of light is time-consuming, delaying field deployment. To tackle this problem, future work must focus on developing multispectral imaging systems that only consider relevant bands for specific applications. By using only the most informative wavelength data, faster scanning can be achieved while maintaining analytical accuracy, making it a practical tool for real-time analysis.
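Restricting a hypercube to a handful of informative bands can be sketched as a simple indexing step. The nested-list layout `cube[row][col][band]`, the band indices, the values, and the function name are assumptions for illustration; a real multispectral system would acquire only those bands rather than discard the rest afterwards.

```python
def select_cube_bands(cube, band_indices):
    """Keep only the listed bands for every pixel of cube[row][col][band]."""
    return [[[pixel[b] for b in band_indices] for pixel in row] for row in cube]


# Hypothetical 2 x 2 image with 6 bands per pixel.
cube = [
    [[0.10, 0.12, 0.15, 0.20, 0.22, 0.25], [0.11, 0.13, 0.16, 0.21, 0.23, 0.26]],
    [[0.09, 0.11, 0.14, 0.19, 0.21, 0.24], [0.12, 0.14, 0.17, 0.22, 0.24, 0.27]],
]

# Hypothetical indices of the bands judged most informative.
reduced = select_cube_bands(cube, [1, 4])
```

Here each pixel shrinks from six values to two, which illustrates how a band-limited acquisition reduces the data volume per scan.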