1. Introduction
Soil, as an essential component of Earth’s ecosystems, serves not only as the foundational medium for agricultural production [1], but also as a critical mediator sustaining biodiversity and facilitating carbon cycling processes [2]. Under the dual pressures of climate change and human activities, global soil degradation has exhibited an alarming trend [1,3,4,5,6]. This situation highlights the urgent need for precise and efficient soil monitoring technologies to achieve the “Zero Hunger” and “Life on Land” objectives outlined in the United Nations Sustainable Development Goals (SDGs) [7,8,9,10,11,12]. Visible and near-infrared (Vis–NIR) spectroscopy, characterized by its rapid and non-destructive analytical capabilities, has become an integral tool in contemporary soil analysis. Utilizing spectral response information within the wavelength range of 400–2500 nm, Vis–NIR spectroscopy enables the effective assessment of critical soil parameters, such as organic matter and heavy metals, thus offering technological feasibility for large-scale soil surveys [13,14,15,16,17,18].
In terms of methodological frameworks for soil spectral modeling, traditional machine learning techniques have established diversified approaches including partial least squares regression (PLSR), support vector machine regression (SVR), random forest (RF), ridge regression, and gradient-boosted trees (XGBoost). Numerous scholars have conducted extensive research in the field of soil hyperspectral inversion. For instance, P Jia et al. utilized the extremely randomized tree (ERT) model to predict soil electrical conductivity in northwestern China [19], while L Jia et al. employed the marine predators algorithm to optimize random forest models for predicting the soil organic matter content [20]. Wu B et al. applied an optimized XGBoost model to retrieve the soil copper content [21], while Zhang M et al. compared linear models (GWR, PLSR) with nonlinear models (RF, SVM) to predict the arsenic concentration in soils from Pingtan Island [22]. Z Gao et al. inverted the total nitrogen content in apple orchard soils during fertilization using hyperspectral data and various machine learning regression methods [23]. Q Song et al. leveraged UAV hyperspectral data and compared PLSR with ensemble learning models for the inversion of soil textures (sand, silt, clay) [24]. Zhou W et al. combined laboratory-based spectral data with random forest and Bayesian data fusion methods to estimate the soil organic carbon in the Three-River Headwater Region [25]. Zhong Q et al. utilized hyperspectral data in conjunction with extreme learning machine (ELM) and support vector machine (SVM) for urban soil nickel concentration inversion [26]. Subi X et al. developed hyperspectral models for soil organic matter (SOM) in arid regions of northwest China, comparing multiple linear regression and machine learning approaches [27]. Chen S et al. applied continuous wavelet transform (CWT) coupled with extreme learning machine (ELM) for the rapid inversion of soil moisture content [28]. While the above studies have demonstrated considerable success in soil hyperspectral inversion using machine learning, three major issues remain. (1) Existing machine learning models often rely on principal component analysis (PCA) or manual band selection for dimensionality reduction when dealing with large-scale datasets such as LUCAS 2009. However, linear dimensionality reduction methods compromise spectral continuity (e.g., absorption peak shapes and adjacent-band relationships), leading to diminished sensitivity to subtle spectral signals (e.g., heavy metal feature peaks). (2) Even when nonlinear models (e.g., RF, XGBoost) are used, their tree-based feature splitting mechanisms essentially produce piecewise-constant approximations, thus failing to adequately characterize the complex nonlinear coupling between soil elements and spectral features. (3) Existing methodologies predominantly adopt single-scale modeling and are unable to simultaneously capture local spectral details and global trends, thereby resulting in fragmented cross-band relational information.
Within the methodological framework of deep learning, approaches such as convolutional neural networks (CNN) and long short-term memory networks (LSTM) exhibit significant advantages compared with traditional machine learning paradigms. Specifically, deep learning circumvents limitations associated with manual feature engineering by leveraging autonomous feature learning mechanisms, effectively captures complex higher-order response relationships through deep nonlinear mappings, and enhances practical applicability via end-to-end data-driven modeling frameworks. From a theoretical perspective, these methods demonstrate superiority in feature representational capacity (representation learning), efficiency in processing high-dimensional data (dimensional invariance), multi-task generalization (parameter sharing), and robustness to noise (distributed representations), thereby providing a novel paradigm for modeling complex soil–spectral interactions. Empirically, the suitability and advantages of deep architectures have been validated by previous studies: for instance, Sheng Wang et al. [29] employed an LSTM-based framework to capture dependencies within spectral sequences; Wang H et al. [30] proposed a CNN-LSTM hybrid architecture for the joint extraction of spatial and temporal features; and Li H’s group [31] developed a dual-branch CNN architecture effectively integrating heterogeneous features, achieving breakthroughs in various application scenarios.
Addressing the core issues inherent in traditional methods—specifically the difficulties in processing high-dimensional data, insufficient nonlinear representation, and multi-scale fragmentation—this study introduces a novel deep-learning-based model termed as the multi-scale attention residual network (ReSE-AP Net). Building upon the residual convolutional neural network (ResNet) structure, the proposed model incorporates innovations across multiple dimensions. (1) Channel attention mechanism: By embedding a squeeze-and-excitation (SE) module, global average pooling is employed to capture statistical channel-wise feature responses, and a two-layer fully connected network dynamically calibrates feature channels, significantly enhancing the representation of critical spectral regions such as heavy-metal-sensitive bands. (2) Multi-scale feature pyramid construction: An atrous spatial pyramid pooling (ASPP) module based on dilated convolutions is designed to simultaneously capture the local spectral details and global spectral trends through parallel convolutional branches with varying receptive fields (dilation rates of 1, 2, and 4). (3) Hierarchical feature fusion: Employing residual skip-connections to facilitate cross-layer information interaction, local textural features from shallow layers (e.g., baseline reflectance fluctuations) and abstracted nonlinear spectral combinations from deep layers are integrated, creating a multi-granularity feature representation system.
To validate the effectiveness of the proposed model, rigorous comparative experiments were conducted using the LUCAS2009 benchmark dataset. The experiments included traditional machine learning models (PLSR, Ridge, SVR, RF, XGBoost) and mainstream deep learning models (VGG, ResNet, temporal convolutional network (TCN), Transformer). Evaluation metrics employed were the coefficient of determination (R2) and root mean square error (RMSE). Results indicated that the ReSE-AP Net model significantly outperformed traditional machine learning methods across all elements, achieving improvements of 2.8–36.5% in R2 and reductions of 14.2–69.2% in RMSE. Compared with contemporary deep learning models commonly used in the field, the ReSE-AP Net achieved a superior R2 performance for more than half of the soil elements, improving by approximately 0.2–25.5% while maintaining a comparable performance with the best deep learning models for the remaining elements. Moreover, the proposed model consistently exhibited superior RMSE performance, outperforming all other deep learning models except for matching the TCN performance on pH (H2O), demonstrating improvements of approximately 0.7–39.0%, thus confirming its excellent predictive accuracy and generalization capability.
2. Materials and Methods
2.1. Data Sets and Data Processing
The LUCAS 2009 dataset [32], a flagship outcome of the EU-led Land Use/Cover Area frame Survey, is recognized as one of the most representative continental-scale environmental databases in Europe due to its stringent soil sampling and analytical standards. Implementing a systematic 2 km × 2 km grid design, the survey spans the 25 EU Member States and adjacent regions, encompassing approximately 19,000 locations where topsoil (0–20 cm) was sampled with fine granularity. At each site, composite sampling was rigorously applied: five sub-samples were collected in a cross pattern within a 2 m radius centered on the geo-referenced point and subsequently pooled to form a 0.5 kg topsoil sample, thereby objectively representing the soil properties of roughly 4 m² of land [33]. The sampling network covers diverse land-use categories, including arable land, grassland, and forest, with a particularly high proportion of agricultural sites.
During laboratory analysis, each soil sample underwent standardized pre-treatments, including air-drying, homogenization, and quality control, before being systematically characterized for fifteen core parameters: texture fractions (clay, silt, sand), chemical properties (pH, organic carbon, carbonates, total nitrogen, available phosphorus, and potassium), physical structure (coarse fragment content), and functional attributes (cation exchange capacity). Multispectral reflectance spectra were additionally acquired for a subset of samples, providing multi-dimensional inputs for subsequent soil-health assessments and carbon-stock modeling. Although the dataset’s sampling density (one point per 4 km2) and its emphasis on agricultural land impose constraints on fine-scale ecological studies and analyses of non-agricultural soils, its rigorous stratified sampling scheme, harmonized analytical protocols, and open-access policy render it an indispensable benchmark for evaluating EU agricultural policies, investigating the soil-degradation–climate interactions, and validating remote-sensing inversion models. To date, it continues to play an irreplaceable role in environmental science, agricultural management, and carbon-cycle research.
The rigorous hierarchical sampling framework, standardized analytical methods, and open-access nature of the LUCAS dataset provide a robust foundation for this study. This methodological integrity ensures the reliability of our model’s performance evaluation while minimizing the propagation of errors originating from data inaccuracies. A critical consideration, however, is the dataset’s pronounced skew toward agricultural land cover. Accordingly, the generalizability of our findings to extensive non-agricultural ecosystems warrants further investigation, representing a clear and compelling avenue for subsequent research efforts.
2.1.1. Dataset Statistics and Division
The LUCAS dataset employed in this study comprised 19,036 samples, but individual feature columns exhibited varying degrees of missingness. To mitigate the impact of missing data on model training without incurring excessive data loss, the following strategy was adopted: for any target variable under prediction, only those rows with missing values in that specific column were removed. This approach balances sample size with data integrity, thereby enhancing model robustness and generalization. The remaining data were partitioned into a training set (66.6%) and an independent test set (33.3%), with fivefold cross-validation applied to the training set and 20% of the training samples withheld as a validation subset.
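The per-target filtering and partitioning described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the toy rows, the random seed, and the use of `None` to mark a missing value are all hypothetical, and only a single hold-out validation subset is shown rather than all five cross-validation folds.

```python
import random

def rows_for_target(rows, target):
    """Keep only rows whose value for `target` is present (not None)."""
    return [r for r in rows if r.get(target) is not None]

def split_rows(rows, test_frac=1 / 3, val_frac=0.2, seed=0):
    """Shuffle, hold out a test set, then withhold part of the rest for validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    test, train_full = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(train_full) * val_frac)
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test

# Toy table with hypothetical values; P is missing in every fifth row.
rows = [{"id": i, "pH": 6.0 + i % 3, "P": None if i % 5 == 0 else float(i)}
        for i in range(30)]
kept = rows_for_target(rows, "P")      # rows are dropped only for the target P
train, val, test = split_rows(kept)
```

Because filtering is applied per target, a model for pH would still see all 30 toy rows, while the model for P sees only the 24 rows in which P was measured.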
Descriptive statistics were computed for each split, including the size (number of observations), mean (arithmetic average, reflecting central tendency), std (standard deviation, measuring dispersion), median (middle value, an alternative measure of central tendency), mode (most frequent value), kurtosis (peakedness, indicating tail heaviness), and Iqr (inter-quartile range, a robust measure of spread). The results are summarized in Table 1.
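As a rough illustration of how these summary statistics can be computed, the sketch below uses only the Python standard library. The kurtosis convention (excess kurtosis from population moments), the sample standard deviation, and the quartile method are assumptions, since the text does not state which definitions were used, and the input values are hypothetical.

```python
import statistics

def describe(values):
    """Summary statistics for one soil property in one data split."""
    n = len(values)
    mean = statistics.fmean(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    # Population central moments, used for excess kurtosis (normal -> 0).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "size": n,
        "mean": mean,
        "std": statistics.stdev(values),      # sample standard deviation
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "kurtosis": m4 / m2 ** 2 - 3.0,
        "iqr": q3 - q1,
    }

oc = [1.0, 2.0, 2.0, 3.0, 4.0, 10.0]  # hypothetical right-skewed values
summary = describe(oc)
```

For a right-skewed variable such as the toy `oc` list, the mean exceeds the median, which is the pattern discussed for organic carbon.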
From a statistical perspective, nutrient-related variables such as organic carbon (OC), CaCO3, total nitrogen (N), phosphorus (P), and potassium (K) display pronounced right-skewness; P and K additionally exhibit leptokurtic, heavy-tailed distributions, implying a prevalence of low values interspersed with a few extreme highs. High coefficients of variation for cation-exchange capacity (CEC) and K highlight marked spatial heterogeneity in soil fertility.
With respect to soil texture, the mean fractions of clay (18.88%), silt (38.23%), and sand (42.88%) indicate that the study region is dominated by sandy loam. The close agreement between the median (37%) and mean for silt suggests an approximately symmetric distribution, whereas the wide Iqr for sand denotes substantial variability in sand content. The large divergence between the mean (49.92) and median (20.80) of OC, together with a kurtosis of 13.53, reveals a mixture of high-organic soils and typical arable soils. Collectively, the dataset captures the pronounced heterogeneity of European soils, posing a non-trivial challenge for predictive modeling. Nevertheless, the training and test sets exhibit strong concordance in key statistics (mean, standard deviation, median), and apart from a minor discrepancy in the Iqr of K, all other parameters maintain stable Iqr values across splits. This indicates a sound data partitioning strategy with no evidence of a distribution mismatch between the splits, thus providing a solid foundation for subsequent model training and evaluation.
2.1.2. Data Preprocessing
For data preprocessing, this study employed piecewise pooling averaging (PPA), also known as the bin-averaging method. PPA is a dimensionality-reduction technique that applies local mean pooling to high-dimensional spectral data: the spectrum is partitioned into fixed intervals, and the mean of each interval is computed to generate a compressed feature set. This procedure preserves global trend information while effectively suppressing random noise and lowering computational cost. Given that Vis–NIR spectra typically comprise thousands of bands, PPA was used to condense the original 4200-dimensional spectral vectors to 128 dimensions, substantially improving both computational efficiency and training speed without sacrificing predictive accuracy. Let the original spectral matrix be $X \in \mathbb{R}^{m \times n}$, where $m$ is the number of samples and $n$ is the feature dimension. For a target dimension $d$, the bin width $w$ can be obtained by Formula (1):

$$w = \left\lfloor \frac{n}{d} \right\rfloor \tag{1}$$

The binning operation of PPA can be expressed as Formula (2):

$$X'_{i,j} = \frac{1}{w} \sum_{k=(j-1)w+1}^{jw} X_{i,k} \tag{2}$$

where $i = 1, \dots, m$ and $j = 1, \dots, d$, and the final output is $X' \in \mathbb{R}^{m \times d}$.
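The bin-averaging step can be sketched for a single spectrum as follows. This is a minimal sketch rather than the authors' code: the function name `ppa` is hypothetical, and the handling of bands left over when the spectrum length is not a multiple of the target dimension (here they are simply dropped) is an assumption, since the text does not specify it.

```python
def ppa(spectrum, out_dim=128):
    """Piecewise pooling averaging: mean of consecutive fixed-width bins."""
    width = len(spectrum) // out_dim          # bin width
    if width < 1:
        raise ValueError("spectrum is shorter than the target dimension")
    # Assumption: trailing bands beyond out_dim * width are discarded.
    return [sum(spectrum[j * width:(j + 1) * width]) / width
            for j in range(out_dim)]

# A synthetic 4200-band "spectrum" whose reflectance equals the band index.
compressed = ppa([float(i) for i in range(4200)], out_dim=128)
```

With 4200 input bands and 128 output features, the bin width is 32, so under this assumption the first 4096 bands contribute and the last 104 do not.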
2.2. Modeling Method
This study proposes a multi-scale attention residual network (ReSE-AP Net) that synergistically integrates residual architecture, channel attention, and multi-scale feature fusion to efficiently decode complex spectral information. Centered on residual convolutions, the network employs skip connections to merge local spectral details with global abstract features, thereby alleviating gradient-vanishing issues. Within each residual block, a squeeze-and-excitation (SE) attention mechanism dynamically enhances responses at critical spectral bands through global feature statistics while suppressing noise. The model further incorporates atrous spatial pyramid pooling (ASPP) to extract spectral features at multiple scales in parallel, simultaneously capturing fine structures of weak absorption peaks and overarching trends of broad spectral ranges. Ultimately, feature fusion followed by nonlinear mapping enables end-to-end prediction, furnishing a robust deep-learning framework for spectral analysis.
2.2.1. Overall Model Structure
The overall architecture of the model is depicted in Figure 1. Training data were first partitioned with a batch size of 320, and piecewise pooling averaging (PPA) was employed to compress the 4200-dimensional spectra to 128 dimensions. After preprocessing, the input tensor had the shape [320, 1, 128], corresponding to [batch_size, channel, seq_length].
An initial convolutional module was placed at the network front end to extract low-level features; this consists of a convolutional layer (kernel size = 3, padding = 1) followed by a ReLU activation function, thereby introducing nonlinearity. The resulting features are forwarded to a residual network augmented with a squeeze-and-excitation (SE) channel-attention mechanism. This residual network contains two residual blocks, each comprising two convolutional layers—the first expanding the channel dimension and the second maintaining it—together with batch normalization and ReLU activation. Within the main branch of each block, the features produced by the two convolutions are re-weighted by the SE attention to dynamically enhance informative spectral bands and suppress noise. The attention-refined output is then added to the shortcut pathway, forming the residual connection. The shortcut both mitigates gradient vanishing and network degradation and enables lower-level information to flow directly to deeper layers, fostering feature reuse and preventing information loss.
The output of the residual network is subsequently fed into an atrous spatial pyramid pooling (ASPP) module. ASPP comprises three parallel atrous-convolution branches with dilation rates of 1, 2, and 4, respectively, to capture multi-scale features with varying receptive fields. Concurrently, a global-average-pooling branch compresses the sequence dimension to obtain global statistics, which are then restored to the original sequence length via nearest-neighbor upsampling to align with the atrous branches. Features from the atrous convolutions and the global branch are merged in a fusion layer, yielding a composite representation that integrates local, intermediate, large-scale, and global information.
The fused features are further downsampled by a max-pooling layer (kernel size = 2, stride = 2), flattened, and passed through a fully connected layer to produce the final output, enabling end-to-end mapping from spectra to soil-element predictions.
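To make the data flow concrete, the sequence length can be traced through the layers with the standard output-length formula for 1D convolution and pooling. The sketch below is illustrative only: the kernel size and padding of the residual and ASPP convolutions and the channel count of 64 are assumptions not stated in the text.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution or pooling layer."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def trace_lengths(seq_len=128, aspp_rates=(1, 2, 4)):
    trace = {"input": seq_len}
    # Stem convolution: kernel 3, padding 1 (as stated in the text).
    trace["stem"] = conv1d_out_len(seq_len, kernel=3, padding=1)
    # Assumed: residual convolutions also use kernel 3 with padding 1.
    trace["residual"] = conv1d_out_len(trace["stem"], kernel=3, padding=1)
    # Assumed: ASPP branches use kernel 3 with padding equal to the dilation.
    trace["aspp"] = [conv1d_out_len(trace["residual"], 3, padding=r, dilation=r)
                     for r in aspp_rates]
    # Max pooling: kernel 2, stride 2 (as stated in the text).
    trace["pooled"] = conv1d_out_len(trace["aspp"][0], kernel=2, stride=2)
    return trace

trace = trace_lengths()
flat_features = 64 * trace["pooled"]  # 64 channels is a hypothetical value
```

Under these assumptions every convolution preserves the length of 128, and only the final max-pooling layer halves it to 64 before flattening.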
2.2.2. SE Attention Mechanism and Residual Convolutional Network
The squeeze-and-excitation (SE) attention mechanism constitutes a canonical form of channel attention, designed to augment the representational capacity of convolutional neural networks at little additional training overhead. It comprises two principal operations: squeeze and excitation. In the squeeze phase, global average pooling is applied to each channel feature map, collapsing its spatial dimensions to generate a channel descriptor that encapsulates the channel’s global response. This operation is formalized in Equation (3):

$$z_{b,c} = \frac{1}{N} \sum_{i=1}^{N} x_{b,c,i} \tag{3}$$

where $x_{b,c,i}$ represents the value of channel $c$ at position $i$ in the $b$-th sample of the input batch, $N$ is the total number of elements in this channel, and $z_{b,c}$ is the value of channel $c$ after compression. During the excitation phase, the squeezed descriptors are passed through a nonlinear transformation to produce a weight vector whose length equals the number of channels, with each element quantifying the importance of its corresponding channel. This operation can be formulated as Equation (4):

$$s = \sigma\left(W_2\,\delta(W_1 z + b_1) + b_2\right) \tag{4}$$

where $z$ is the compressed feature vector, $W_1$ and $W_2$ are the weight parameters of the fully connected layers, $b_1$ and $b_2$ are the bias terms, and $\delta$ and $\sigma$ are the activation functions (ReLU and sigmoid in the standard SE design). Subsequently, the channel-wise weights generated in the excitation phase are applied to the original feature maps via channel-specific multiplication, thereby re-calibrating the features. This operation can be represented by Equation (5):

$$\tilde{x}_c = s_c \cdot x_c \tag{5}$$

where $x$ is the original feature and $\tilde{x}$ is the feature after channel weighting. The SE attention mechanism thus recalibrates each channel feature map in a convolutional neural network through two successive operations, squeeze and excitation, thereby markedly enhancing the representational capacity. A schematic of the SE module adopted in this study is illustrated in Figure 2, where B denotes the batch size, C is the number of channels, L is the sequence length, and r is the reduction (compression) ratio.
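The squeeze, excitation, and re-weighting steps can be sketched for a single sample in plain Python, so that each operation is visible without a deep-learning framework. This is a hedged illustration rather than the study's PyTorch module: the function name, the toy weights, and the choice of ReLU followed by sigmoid (the usual SE design) are assumptions.

```python
import math

def se_recalibrate(x, w1, b1, w2, b2):
    """x: list of C channels, each a list of L values.
    w1: (C/r) x C weights, w2: C x (C/r) weights of the two dense layers."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(ch) / len(ch) for ch in x]
    # Excitation, layer 1: dense + ReLU (reduces C to C/r).
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z)) + b)
              for row, b in zip(w1, b1)]
    # Excitation, layer 2: dense + sigmoid (restores C, weights in (0, 1)).
    s = [1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
         for row, b in zip(w2, b2)]
    # Scale: multiply every value of a channel by that channel's weight.
    return [[sc * v for v in ch] for sc, ch in zip(s, x)], s

# Two channels, reduction ratio r = 2, hypothetical weights.
x = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
y, s = se_recalibrate(x, w1=[[0.5, 0.5]], b1=[0.0],
                      w2=[[1.0], [-1.0]], b2=[0.0, 0.0])
```

In this toy case the first channel receives a weight close to 1 and the second a weight close to 0, showing how the module can amplify one channel while suppressing another.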
As a paradigmatic deep-network architecture, the residual neural network alleviates the vanishing-gradient problem commonly encountered during the training of very deep models by incorporating residual learning and cross-layer identity mappings, thereby substantially enhancing the feature representation and generalization capabilities. Each residual block in a ResNet can be formulated as Equation (6):

$$y_l = h(x_l) + F(x_l, W_l) \tag{6}$$

where $y_l$ represents the output of the $l$-th residual block, $x_l$ is the input of the $l$-th residual block, $h(x_l)$ is the skip connection, $F$ is the residual function, and $W_l$ is the weight parameter. In ReSE-AP Net, $F$ is expressed as Equation (7):

$$F(x_l, W_l) = s \odot \left(W_{l,2} * \delta\left(W_{l,1} * x_l + b_{l,1}\right) + b_{l,2}\right) \tag{7}$$

where $W_{l,1}$ and $W_{l,2}$ are the weight parameters of the two convolutional layers, $b_{l,1}$ and $b_{l,2}$ are the bias terms, $*$ denotes convolution, $\delta$ is the ReLU activation, and $s$ is the channel weight vector calculated by the SE module. Finally, taking $h$ as the identity and $x_{l+1} = y_l$, the total expression of the residual network weighted by SE attention can be obtained as Equation (8):

$$x_L = x_1 + \sum_{l=1}^{L-1} F(x_l, W_l) \tag{8}$$

where $x_L$ is the input of the last residual block, and $L$ is the total number of residual blocks. Experimental results indicate that excessively deep architectures (e.g., ResNet-152) deteriorate performance in the target task rather than improving it. Detailed analysis attributes this degradation to two primary factors: (i) the inherent parameter redundancy of very deep networks results in a mismatch between model complexity and dataset size, thereby inducing severe overfitting; and (ii) over-parameterized models exhibit gradient instability during back-propagation, substantially complicating training. In response, a streamlined shallow residual architecture is proposed. As illustrated in Figure 3 (where C1, C2, and C3 denote different channel dimensions), the network consists of only two residual blocks, striking a judicious balance between model capacity and computational efficiency. Empirical evidence demonstrates that relative to deeper residual networks, this shallow design preserves the feature-extraction capability while markedly reducing complexity, consequently shortening the per-iteration training time and facilitating rapid model updates.
This work innovatively integrates the squeeze-and-excitation (SE) channel-attention mechanism into the residual network for two principal reasons:
Heterogeneous channel importance. Conventional convolutions treat all channels equally; however, their contributions to the target task vary substantially, and some channels even convey redundant or noisy information. The SE mechanism adaptively learns channel-specific weights, suppressing less informative channels and amplifying pivotal ones, thereby improving feature utilization.
Explicit modeling of inter-channel dependencies. While residual networks mitigate gradient vanishing via skip connections, they do not explicitly model relationships among channels. By employing nonlinear mapping to capture such dependencies, the SE attention further augments representational power.
The formal mathematical definitions and computational procedures of this module are provided in Equations (9)–(13).

$$u = \delta\left(\mathrm{BN}\left(\mathrm{Conv1d}(x; W_1)\right)\right) \tag{9}$$

$$v = \mathrm{BN}\left(\mathrm{Conv1d}(u; W_2)\right) \tag{10}$$

$$z = \mathrm{GAP}(v) \tag{11}$$

$$s = \sigma\left(W_{s2}\,\delta(W_{s1} z)\right) \tag{12}$$

$$y = (s \otimes v) + h(x) \tag{13}$$

where $\mathrm{BN}$, $\delta$, and $\sigma$ represent batch normalization and the two activation functions, respectively; $\mathrm{Conv1d}$ represents one-dimensional convolution; $W_1$ and $W_2$ represent the different weight parameters of the two convolution operations; $W_{s1}$ and $W_{s2}$ are the two weight parameters of the SE attention mechanism used in calculating the channel weighting; $\mathrm{GAP}$ represents global average pooling (the symbolic expression of Formula (3)); $s$ represents the weight calculated by the channel attention mechanism; $\otimes$ represents channel-wise multiplication; and $h(x)$ denotes the shortcut branch. Moreover, incorporating the SE channel-attention mechanism adds only a negligible number of parameters and incurs a minimal computational overhead, imparting an inherently lightweight nature that helps maintain training efficiency. In the proposed design, the SE module is inserted at the end of the main branch of each residual block, immediately before the residual summation; performing channel-wise recalibration prior to feature fusion yields a more discriminative combined representation. Because the shortcut branch primarily serves as an unobstructed gradient-flow pathway to alleviate vanishing gradients, no SE module is applied to this branch. Given that the two residual blocks produce feature maps with different channel dimensions, separate SE modules, each matched to its respective dimensionality, are deployed, thereby ensuring dimensional compatibility and preventing cross-interference among the attention weights.
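The placement of the SE module (on the main branch, immediately before the summation, with the shortcut left untouched) can be illustrated with a small sketch in which the sub-modules are passed in as plain functions. The function name and the toy stand-ins for the convolutional branch, the SE weights, and the shortcut are hypothetical, and whether an activation follows the summation is not stated in the text, so none is applied here.

```python
def se_residual_block(x, main_branch, se_weights, shortcut):
    """x: list of channels, each a list of values along the spectrum."""
    f = main_branch(x)      # stands in for the two conv + BN layers
    s = se_weights(f)       # one attention weight per output channel
    skip = shortcut(x)      # shortcut branch, no attention applied
    # Channel-wise recalibration first, residual summation second.
    return [[sc * fv + kv for fv, kv in zip(f_ch, k_ch)]
            for sc, f_ch, k_ch in zip(s, f, skip)]

# Toy stand-ins: the main branch doubles its input, the shortcut is the identity.
x = [[1.0, 2.0], [3.0, 4.0]]
double = lambda t: [[2 * v for v in ch] for ch in t]
y = se_residual_block(x, double, lambda f: [0.5, 0.0], lambda t: t)
```

With an attention weight of 0 on the second channel, only the shortcut signal survives there, which shows why leaving the shortcut without attention keeps an unobstructed path through the block.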
2.2.3. Atrous Spatial Pyramid Pooling
Atrous spatial pyramid pooling (ASPP) is a multi-scale feature-extraction strategy that constructs a pyramid of atrous-convolution branches with distinct dilation rates within a convolutional neural network. By substantially enlarging the effective receptive field without a significant increase in parameters, ASPP enables the network to capture contextual information at multiple spatial scales while preserving feature-map resolution, thereby enhancing its ability to recognize objects of varying sizes. Atrous convolution expands the kernel’s field of view through sparse sampling, circumventing the detail loss usually caused by downsampling, whereas the parallel multi-branch design endows the model with rich scale awareness. The convolution operations corresponding to the different dilation rates are formally defined in Equations (14)–(16).
$$y_k[i] = \delta\left(\sum_{j} W_k[j]\, x[i + r_k j] + b_k\right), \quad k = 1, 2, 3 \tag{14–16}$$

where $W_k$ are the weight parameters of the different convolutional layers, $b_k$ are the bias terms, $r_k$ are the dilation rates, $x$ is the input feature sequence, and $\delta$ is the ReLU activation function. The global-average-pooling branch compresses the input feature map into a global feature vector, refines the channel dimensionality via a 1 × 1 convolution, and subsequently restores the spatial resolution through upsampling. This process provides global contextual information, thereby compensating for the limited receptive field of local convolutions. The corresponding mathematical formulations are presented in Equations (17)–(19).

$$g_{b,c} = \frac{1}{L} \sum_{i=1}^{L} x_{b,c,i} \tag{17}$$

$$g' = \mathrm{Conv}_{1 \times 1}(g) \tag{18}$$

$$G = g'\,\mathbf{1}^{\top} \tag{19}$$

where $x_{b,c,i}$ represents the value of channel $c$ at position $i$ in the $b$-th sample of the input batch, $L$ is the total length of the vector, $g_{b,c}$ is the value of channel $c$ after compression, $g'$ is the result of the 1 × 1 convolution, and $\mathbf{1} \in \mathbb{R}^{L}$ is the column vector of all 1s, so that $G$ repeats $g'$ at every position. Subsequently, the features extracted from the multi-scale atrous-convolution branches and the global-average-pooling branch are concatenated along the channel dimension and fused via a 1 × 1 convolution to realize cross-scale interaction and compression, as formulated in Equations (20) and (21).

$$Y_{\mathrm{cat}} = \mathrm{Concat}(y_1, y_2, y_3, G) \tag{20}$$

$$Y = \mathrm{Conv}_{1 \times 1}(Y_{\mathrm{cat}}) \tag{21}$$

where $Y_{\mathrm{cat}}$ represents the result obtained by concatenating the features obtained by the convolutions with different receptive fields with the features obtained by global pooling, and $Y$ is the cross-scale feature output after convolution fusion.
In the present task, distinct spectral bands in hyperspectral data correspond to characteristic absorption features of various substances. The ASPP module, with its parallel multi-branch design, concurrently captures local details (small dilation rate), medium- to long-range dependencies (large dilation rate), and global context (pooling branch). The fused multi-scale features enhance the model’s robustness to spectral noise and local occlusions, rendering ASPP particularly well-suited to the high spectral dimensionality of hyperspectral data. The ASPP architecture implemented in this study is illustrated in Figure 4, where B denotes the batch size, C is the number of channels, and L is the sequence length. Within the overall framework, the ResNet backbone extracts deep representations via residual connections but may overlook cross-scale contextual information; the ASPP module refines these high-level features at multiple scales. Simultaneously, the SE module in the residual network focuses on channel-wise importance, whereas ASPP emphasizes spatial multi-scale information. Their combination realizes “channel-spatial” dual-attention, markedly enhancing the expression of salient features. The complete mathematical formulation of this module is provided in Equations (22)–(26).

$$A_k = \mathrm{Conv1d}(x_{\mathrm{in}}; W_k, r_k), \quad k = 1, 2, 3 \tag{22–24}$$

$$A_g = \mathrm{Up}\left(\mathrm{Conv1d}\left(\mathrm{GAP}(x_{\mathrm{in}})\right)\right) \tag{25}$$

$$Y = \mathrm{Fuse}(A_1, A_2, A_3, A_g) \tag{26}$$

where $A_1$, $A_2$, and $A_3$ represent the convolution outputs of the three different dilation rates $r_k$, $x_{\mathrm{in}}$ represents the input received by ASPP, $W_k$ represent the weight parameters of the three different convolutions, $\mathrm{Up}$ represents upsampling, $\mathrm{Conv1d}$ represents one-dimensional convolution, $\mathrm{GAP}$ represents global average pooling, and $\mathrm{Fuse}$ represents the concatenation and fusion of $A_1$, $A_2$, $A_3$, and $A_g$.
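A single-channel sketch in plain Python can show how the parallel branches see the spectrum at different scales. It is an illustration under assumptions, not the study's implementation: a 3-tap kernel with zero padding equal to the dilation rate is assumed, the branch activations are omitted, the 1 × 1 fusion is reduced to one output channel, and all function names are hypothetical.

```python
def dilated_conv1d(x, kernel, dilation):
    """'Same'-length dilated convolution of one channel with zero padding."""
    n, half = len(x), len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation   # taps are spaced `dilation` apart
            if 0 <= j < n:                  # positions outside act as zeros
                acc += w * x[j]
        out.append(acc)
    return out

def aspp_branches(x, kernels, rates=(1, 2, 4)):
    """Three atrous branches plus a global-average branch, stacked as channels."""
    branches = [dilated_conv1d(x, k, r) for k, r in zip(kernels, rates)]
    g = sum(x) / len(x)
    branches.append([g] * len(x))  # nearest-neighbour upsampling of one value
    return branches

def fuse(branches, weights):
    """1x1 convolution to one channel: per-position weighted sum over channels."""
    return [sum(w * ch[i] for w, ch in zip(weights, branches))
            for i in range(len(branches[0]))]

def receptive_field(kernel_size, dilation):
    return dilation * (kernel_size - 1) + 1

impulse = [0.0] * 4 + [1.0] + [0.0] * 4   # a single narrow "absorption peak"
branches = aspp_branches(impulse, kernels=[[1.0, 1.0, 1.0]] * 3)
fused = fuse(branches, weights=[1.0, 1.0, 1.0, 0.0])
```

Feeding in an impulse makes the dilation visible: the same peak is spread to neighbours 1, 2, and 4 bands away in the three branches, while all branches keep the original length so they can be concatenated.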
2.2.4. Model Evaluation
In this study, model performance was assessed using the coefficient of determination (R²) and the root mean square error (RMSE). The coefficient of determination quantifies the degree of correspondence between the predicted and observed values, representing the proportion of variance in the response variable that is accounted for by the predictive model; an R² value approaching 1 indicates a superior goodness of fit. RMSE measures the average discrepancy between the predicted and observed values, thereby reflecting the overall predictive accuracy; a lower RMSE denotes reduced error and more precise predictions. The mathematical formulations of R² and RMSE are provided in Equations (27) and (28), respectively.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{27}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{28}$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the observed values, and $n$ is the number of samples.
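Both metrics follow directly from their standard definitions and can be sketched in a few lines of plain Python. The function names and the example values are illustrative only.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error between observations and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

observed = [1.0, 2.0, 3.0, 4.0]
baseline = [2.5, 2.5, 2.5, 2.5]   # always predicting the observed mean
```

A model that always predicts the observed mean, such as `baseline`, scores an R² of 0, which is why R² is often read as the improvement over that trivial predictor.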
2.2.5. Experimental Setup
The experiments were conducted on a system equipped with a 14th-generation Intel Core i7-14700HX processor (20 cores/28 threads), Intel, Santa Clara, CA, USA and an NVIDIA GeForce RTX 4060 GPU, NVIDIA, Santa Clara, CA, USA. The software environment comprised Windows 11 as the operating system, Python 3.11 as the programming language, and PyTorch 2.3.0 as the deep-learning framework. Based on the above configuration, each round of training during the model training process took approximately 40 s and occupied about 14 GB of memory.
4. Further Evaluation and Discussion
To further quantify and evaluate the model’s fitting quality and overall performance, scatter plots of the observed versus predicted values were generated (Figure 7). Each plot contained numerous points and two curves: the red line denoted y = x, representing the ideal scenario in which the predicted values perfectly match the observations, whereas the blue curve corresponded to the regression fit of the model’s predictions. Point density was color-coded, with deeper (reddish) hues indicating higher concentrations of samples. In every plot, the red and blue curves intersected; this intersection marks the point at which the predicted and observed values are equal. When the intersection lay within the region of highest point density, the model exhibited a superior fitting performance and robustness within the principal data distribution. The results show that for most soil-element predictions, the intersection of the curves for ReSE-AP Net fell within the densest region, confirming its strong predictive capability and generalization. Nonetheless, for the pH and sand indicators, the intersection only appeared in relatively dense regions rather than the densest area, suggesting that the model’s performance on these two attributes could be further improved.
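The intersection discussed above can be located analytically if the blue curve is taken to be an ordinary least-squares line, which is an assumption here since the type of regression fit is not specified. For a fitted line ŷ = a·x + b, the crossing with y = x lies at x = b / (1 − a). The function names and data below are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def identity_crossing(slope, intercept):
    """Value at which the fitted line meets y = x, or None if they are parallel."""
    if slope == 1.0:
        return None
    return intercept / (1.0 - slope)

observed = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 1.5, 2.0, 2.5, 3.0]   # compressed towards the mean
slope, intercept = fit_line(observed, predicted)
crossing = identity_crossing(slope, intercept)
```

A slope below 1, as in this toy case, means that low values are over-predicted and high values under-predicted, with the two lines meeting near the centre of the data.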
It is pertinent to contrast our ReSE-AP Net with recent related works that also leverage ASPP-like structures for hyperspectral data analysis, notably the contributions from Liu et al. [34] and Liu et al. [35].
Liu et al. [34] ingeniously adapted the ResNet-50 architecture for weed detection by replacing its latter stages with an ASPP module. While this design proved effective for their specific task, our preliminary experiments indicated that deeper networks, such as ResNet-152, did not necessarily yield a superior performance in our soil property prediction context, suggesting that the optimal network depth is task-dependent. Furthermore, their model lacked an explicit attention mechanism, which we identified as a key component for refining spectral features. In contrast, ReSE-AP Net is architecturally optimized in two ways: first, it employs a residual network of a deliberately chosen, more moderate depth to prevent overfitting and capture salient features effectively, and second, it integrates the SE channel attention mechanism within the feature extraction backbone, enabling progressive feature refinement and noise suppression.
In another relevant study, Liu et al. [35] proposed RAANet for semantic segmentation, which incorporates a residual structure within the ASPP module itself and deploys a dense arrangement of attention modules both inside and outside the ASPP. While this approach is novel and effective, its primary focus is on a complex, attention-augmented ASPP, with comparatively less emphasis on the initial deep feature extraction process. This risks underutilizing the rich information embedded in the original hyperspectral data. ReSE-AP Net adopts a different strategy by prioritizing front-end feature extraction: it combines residual connections with SE attention so that features are comprehensively extracted and purified before they are passed to the multi-scale analysis stage.
In summary, the novelty of ReSE-AP Net, when contrasted with these related models, is threefold:
- (1) Endogenous refinement through deeply embedded attention: We embed channel attention within each fundamental building block of the feature extraction backbone. This facilitates a progressive, layer-by-layer purification of spectral features, enhancing the quality of the feature maps that are subsequently fed into the multi-scale analysis module.
- (2) A two-stage architecture with a clear division of labor: We propose a front-end network dedicated to feature purification and a back-end module focused on multi-scale fusion. This differs from existing models that either lack a purification stage or conflate it with multi-scale analysis.
- (3) ASPP for one-dimensional spectra: We adapted the ASPP module, a technique predominantly used in 2D image processing, to the task of one-dimensional hyperspectral inversion and validated its efficacy. Our results confirm that ASPP is an effective tool for capturing multi-scale contextual information within 1D spectral data.
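The two-stage design summarized above can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the convolution layers inside the residual block are omitted, and every weight, shape, kernel, and dilation rate below is a hypothetical placeholder chosen only to show how SE channel gating, an identity shortcut, and parallel dilated 1D convolutions fit together.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def se_reweight(feature_map, w_reduce, w_expand):
    """Squeeze-and-excitation over the channels of a 1D feature map.

    feature_map: C channels, each a list of L floats.
    w_reduce: (C/r) x C weights; w_expand: C x (C/r) weights (hypothetical).
    """
    # Squeeze: global average pooling along the spectral axis.
    squeezed = [sum(ch) / len(ch) for ch in feature_map]
    # Excitation: bottleneck with ReLU, then one sigmoid gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden)))
             for row in w_expand]
    # Scale: multiply each channel by its gate.
    return [[g * v for v in ch] for g, ch in zip(gates, feature_map)]


def residual_se(feature_map, w_reduce, w_expand):
    """Identity shortcut added to the SE-reweighted features."""
    refined = se_reweight(feature_map, w_reduce, w_expand)
    return [[x + r for x, r in zip(ch, rch)]
            for ch, rch in zip(feature_map, refined)]


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D dilated convolution (odd kernel length)."""
    half = (len(kernel) // 2) * dilation
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i - half + k * dilation
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def aspp1d(signal, kernel, dilations):
    """Parallel dilated branches over one channel, stacked as new channels."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]


# Front end: purify a toy two-channel feature map.
features = residual_se([[1.0, 3.0], [2.0, 2.0]], [[0.5, 0.5]], [[0.0], [1.0]])
# Back end: view an impulse at three receptive-field sizes.
branches = aspp1d([0, 0, 0, 1, 0, 0, 0], [1, 1, 1], (1, 2, 3))
```

With the impulse input, each branch spreads the single non-zero band over a progressively wider neighborhood, which is the multi-scale context that the ASPP stage contributes for 1D spectra.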
5. Conclusions
This study first elucidated the significance of soil-element prediction and its relevance to sustainable agriculture. The publicly available LUCAS 2009 soil dataset was then introduced, outliers were removed, and a suite of descriptive statistics (sample size, mean, standard deviation, median, mode, kurtosis, and IQR) was computed and interpreted to demonstrate the soundness of the data partitioning strategy. After data cleansing, piecewise pooling averaging (PPA) was applied to reduce the dimensionality of the spectral inputs. Building on these preparations, a multi-scale attention residual network based on spatial pyramid pooling (ReSE-AP Net) was proposed and employed for visible–near-infrared (Vis–NIR) hyperspectral inversion of multiple soil elements on the LUCAS 2009 dataset. The model extracts initial features via a front-end convolutional layer, propagates salient information through residual blocks augmented with SE channel attention, and enhances predictive accuracy and robustness through multi-scale feature extraction and fusion within the ASPP module. Experimental results showed that ReSE-AP Net outperformed all mainstream traditional machine learning models and equaled or surpassed widely used deep learning architectures, with particularly strong performance in terms of RMSE; its success on a publicly available dataset further supports its generalization capability.
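The PPA step recalled above can be sketched in a few lines. This is a minimal illustration, assuming that PPA averages consecutive, non-overlapping windows of a fixed width and that a shorter final window is averaged over the bands it contains; the window width and the handling of the remainder are assumptions, since they are not stated in this section.

```python
def piecewise_pool_average(spectrum, window):
    """Average consecutive non-overlapping windows of a 1D spectrum.

    The last window may be shorter than `window` (assumed behavior).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    pooled = []
    for start in range(0, len(spectrum), window):
        segment = spectrum[start:start + window]
        pooled.append(sum(segment) / len(segment))
    return pooled


# Toy reflectance values for illustration only.
reduced = piecewise_pool_average([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], 3)
print(reduced)
```

Under this reading, a spectrum of length n is reduced to ceil(n / window) values, and averaging within each window also smooths band-to-band noise.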
Despite the demonstrated robustness and high performance of the proposed ReSE-AP Net, we acknowledge several limitations that warrant discussion and outline clear avenues for future research.
- (1) The reliance on the LUCAS 2009 dataset, while ensuring high data quality and standardization, introduces a potential bias. The dataset is geographically confined to Europe and is predominantly composed of agricultural soils. Consequently, the model’s generalizability to other geographical regions, diverse land-use types (e.g., forests, wetlands), or less standardized, private datasets remains an open question requiring empirical validation. Future work will therefore focus on acquiring such heterogeneous datasets and testing the model on them to assess its real-world applicability.
- (2) Although ReSE-AP Net surpassed all baseline models in terms of RMSE across all soil properties, its R² for pH, N, and K was merely on par with that of the Transformer architecture. We hypothesize two complementary reasons for this observation. The first pertains to the inductive biases of the models: the Transformer’s self-attention mechanism may be more adept at capturing the global, long-range spectral dependencies on which the prediction of these particular properties relies, whereas our CNN-based model excels at leveraging local features. The second, suggested by the superior RMSE of our model, is that ReSE-AP Net achieves high accuracy on the majority of samples within the central data distribution but may be less effective than the Transformer at fitting the extreme values that heavily influence the R² score. This indicates a clear opportunity for refinement.
To address this, our immediate future work will concentrate on enhancing the model architecture. A primary strategy will be to introduce an adaptive weighting mechanism within the ASPP module. This mechanism will be designed to dynamically assign weights during the fusion of multi-scale convolutional features, thereby amplifying salient feature information while suppressing irrelevant noise. In principle, such a modification should augment the feature fusion capability of the ASPP module, leading to an overall improvement in the model’s predictive power, especially in capturing the full variance of the data. This promising direction is currently under active investigation.
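One possible form of such an adaptive weighting is a softmax-normalized weighted sum over the ASPP branch outputs. The sketch below is purely illustrative of this planned direction and does not describe an implemented component: the function names are hypothetical, and the branch scores are fixed numbers here, whereas in a trained model they would be learned or computed from the input.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(branches, scores):
    """Fuse equally long branch outputs with softmax-normalized weights."""
    weights = softmax(scores)
    length = len(branches[0])
    return [
        sum(w * branch[i] for w, branch in zip(weights, branches))
        for i in range(length)
    ]


# Two toy branch outputs; equal scores reduce the fusion to a plain average.
fused = weighted_fusion([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
print(fused)
```

Raising the score of one branch shifts the fused output toward that branch, which is the mechanism by which salient scales could be amplified and less informative ones suppressed.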
In conclusion, while acknowledging these areas for further improvement, the ReSE-AP Net model, as presented, demonstrates strong predictive capabilities for a wide range of soil elements, offering a valuable and high-performance benchmark for the field of soil spectroscopy.