A CNN-LSTM-XGBoost Hybrid Framework for Interpretable Nitrogen Stress Classification Using Multimodal UAV Imagery

Kuang, Xiaohui; Hou, Xinyue; Wang, Dawei; Mao, Bohan; Li, Yafeng; Chen, Deshan; Fu, Wanna; Cheng, Qian; Duan, Fuyi; Li, Hao; Chen, Zhen

doi:10.3390/rs18040538


by Xiaohui Kuang ^1,2,3, Xinyue Hou ^1,*, Dawei Wang ^1, Bohan Mao ^2,3, Yafeng Li ^2, Deshan Chen ^2, Wanna Fu ^2, Qian Cheng ^2, Fuyi Duan ^2, Hao Li ^3 and Zhen Chen ^2

^1 Heilongjiang Provincial Hydraulic Research Institute, Harbin 150080, China
^2 Institute of Farmland Irrigation, Chinese Academy of Agricultural Sciences, Xinxiang 453002, China
^3 State Key Laboratory of Crop Stress Adaptation and Improvement, College of Agriculture, School of Life Sciences, Henan University, Kaifeng 475004, China
^* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(4), 538; https://doi.org/10.3390/rs18040538

Submission received: 8 January 2026 / Revised: 28 January 2026 / Accepted: 3 February 2026 / Published: 7 February 2026 / Corrected: 9 April 2026


Highlights

What are the main findings?

A CNN–LSTM–XGBoost hybrid framework is proposed to classify nitrogen stress by integrating spatial–temporal representations from multimodal UAV imagery.
The framework improves robustness with multimodal feature fusion and provides transparent, feature-level interpretation for nitrogen stress decisions.

What are the implications of the main findings?

The method enables rapid, field-scale nitrogen status screening to support timely and variable-rate fertilization decisions.
The interpretable hybrid design is adaptable to other crop stress monitoring tasks and sensors, facilitating practical deployment in precision agriculture.

Abstract

Accurate diagnosis of nitrogen status is essential for precision fertilization in winter wheat. Single-modal or single-temporal remote sensing often fails to capture the multidimensional crop responses to nitrogen stress. In this study, we propose a hybrid framework based on CNN-LSTM-XGBoost for interpretable classification of wheat nitrogen stress gradients using multimodal unmanned aerial vehicle (UAV) multispectral and thermal infrared (TIR) imagery. Field experiments were conducted at the Xinxiang base in Henan Province during the 2023–2024 growing season, following a randomized block design involving 10 cultivars, four nitrogen levels, and four water treatments. Multisource UAV images acquired at jointing, heading, and filling stages were used to construct a multimodal feature set consisting of manual features (spectral bands, vegetation indices (VIs), TIR, and their interaction terms) and seven temporal statistical features. A deep learning model (CNN-LSTM) was utilized to further extract deep spatiotemporal features, and its performance was systematically compared with traditional machine learning models. The results show that multimodal feature fusion significantly enhanced classification performance. The CNN-LSTM model achieved an accuracy of 89.38% with fused multimodal features, outperforming all traditional machine learning models. Incorporating multi-temporal features improved the F1_macro of the XGBoost model to 0.9131, a 9.42 percentage-point increase over using the single heading stage alone. The hybrid model (CNN-LSTM-XGBoost) achieved the highest overall performance (Accuracy = 0.9208; F1_macro = 0.9212; AUC_macro = 0.9879; Kappa = 0.8944). SHAP analysis identified TIR × NDRE as the most influential indicator, reflecting the coupled physiological response of reduced chlorophyll content and increased canopy temperature under nitrogen deficiency. The proposed multimodal, multi-temporal, and interpretable framework provides a robust technical foundation for UAV-assisted precision nitrogen management.

Keywords:

winter wheat; nitrogen stress; UAV remote sensing; multimodal fusion; deep learning; interpretability

1. Introduction

Nitrogen is a key limiting nutrient for the growth, development, and yield formation of winter wheat [1,2]. Its supply level directly affects chlorophyll synthesis [3], photosynthetic efficiency [4] and dry matter accumulation. Increasing nitrogen use efficiency can significantly improve cereal crop yields [5]. Meanwhile, long-term excessive nitrogen application leads to soil nitrate leaching and increased greenhouse gas emissions, posing significant environmental challenges that hinder sustainable agricultural development [6]. Therefore, reliable, rapid, and scalable nitrogen diagnostic techniques are essential for improving nitrogen-use efficiency and ensuring stable wheat production.

Early monitoring of nitrogen status in winter wheat primarily relied on field sampling and laboratory analysis, such as quantifying leaf nitrogen content (LNC) through chemical digestion [7], or indirectly assessing nitrogen status by measuring relative chlorophyll content [8]. Although these methods can provide accurate nitrogen data, they are time-consuming, labor-intensive, destructive, and offer limited spatial representativeness, making them unsuitable for large-scale, rapid field monitoring [9]. In recent years, with the rapid advancement of precision agriculture technologies, unmanned aerial vehicle (UAV) remote sensing has emerged as an attractive alternative for crop stress monitoring due to its high spatiotemporal resolution, non-destructiveness, and operational flexibility, offering a novel solution to the efficiency bottlenecks inherent in traditional diagnostic approaches [10,11].

UAV platforms can be equipped with various sensors to capture multidimensional information on crop growth [12]. Among them, the combination of multispectral and thermal infrared (TIR) sensors is particularly common [13]. Multispectral imagery provides key biochemical indicators of the crop canopy, such as pigment content, biomass, and photosynthetic activity [14], while TIR imagery reflects canopy water status and stomatal regulation through temperature variations [13]. Owing to these distinct physiological sensitivities, the two data sources are inherently complementary, and their integration enables a more comprehensive characterization of the crop’s physiological responses to nitrogen stress [15]. However, fully exploiting the advantages of multi-source data requires more than data collection alone. The development of an appropriate analytical model is essential for achieving accurate nitrogen stress diagnosis [16]. The model should be capable of capturing complex nonlinear relationships to ensure high predictive performance. At the same time, it should offer a certain degree of interpretability, allowing agronomists to understand and validate its decision-making process, thereby enhancing reliability in practical applications [17]. Despite the theoretical benefits of multimodal data fusion and its potential to improve diagnostic performance, current UAV-based nitrogen stress research still faces three major challenges. First, data acquisition is often limited to a single growth stage or a single flight mission. Physiological responses to nitrogen stress exhibit strong temporal dynamics, and spectral and thermal features vary considerably across growth stages. Relying solely on single-stage data fails to capture these temporal patterns, which restricts model generalization and stability [18]. Second, many current fusion strategies remain technically superficial. 
Differences in scale, spatial resolution, and noise characteristics between multispectral and TIR data are often overlooked, and simple stacking is frequently used. Such approaches do not fully exploit the physiological complementarities of different modalities, thereby limiting the effectiveness of data fusion. Third, the interpretability of existing models remains insufficient. Although deep learning models excel at feature extraction and nonlinear representation, their lack of interpretability makes it difficult to understand how predictions are generated. Even when high accuracy is achieved, the absence of clear explanations regarding key features, decision pathways, or variable contributions reduces the credibility and applicability of the results [19]. In summary, developing a nitrogen stress diagnosis model that integrates multi-stage and multimodal UAV data while achieving both high accuracy and strong interpretability remains a critical scientific challenge. Recent studies have further highlighted the value of spectral–texture fusion and multimodal integration for improving robustness in crop nutrient and stress monitoring and reducing uncertainties in complex field conditions [20,21].

Addressing these challenges, this study developed a multimodal, multi-temporal hybrid deep learning framework for winter wheat stress identification. Different from conventional methods that simply stack data or parallelize models, our approach realizes deep integration of data, features and models, balancing identification accuracy and result credibility. Therefore, the main objective of this study is to develop an accurate and interpretable hybrid framework for nitrogen-related stress classification in winter wheat by integrating UAV-based multispectral and thermal infrared (TIR) imagery across multiple growth stages. To achieve this objective, the key contributions of this work are summarized as follows: (1) we propose a CNN-LSTM-XGBoost hybrid architecture that leverages deep multitemporal feature learning while maintaining interpretability through gradient-boosted decision trees; (2) we construct a multimodal feature system by combining spectral bands, nitrogen-sensitive vegetation indices, canopy temperature and their interaction terms, together with temporal statistical descriptors across jointing, heading and filling stages; (3) we conduct systematic comparisons among conventional machine learning models and deep learning baselines, and perform ablation experiments to quantify the contributions of modality and temporal information; (4) we adopt a nested cross-validation strategy with comprehensive metrics to provide unbiased evaluation and reliable performance estimation.

2. Materials and Methods

2.1. Study Area and Experimental Design

The experiment was conducted at the Xinxiang Experimental Base (35.2°N, 113.8°E) of the Chinese Academy of Agricultural Sciences during 2023–2024. The study area is located in Xinxiang City, Henan Province, China, characterized by a warm temperate continental monsoon climate with dry and windy springs. The region has a mean annual temperature of 14 °C and an average annual precipitation of approximately 548.3 mm, primarily concentrated in July and August. The climatic setting typifies the ecological conditions of winter wheat production in the Huang–Huai Plain. The experimental field is characterized by a typical alluvial loam soil (fluvo-aquic soil) in the Huang-Huai-Hai Plain, which is representative of local winter wheat production systems. Uniform field management was applied to minimize within-field variability in baseline soil conditions. To simulate winter wheat growth under varying nitrogen supply conditions and introduce environmental gradients to enhance model robustness, a three-factor randomized complete block design was adopted. The experiment included 10 winter wheat varieties and four nitrogen fertilizer levels, with four parallel soil moisture gradients established under each nitrogen level as background interference, thereby creating a highly heterogeneous experimental environment. Nitrogen treatments were determined based on locally recommended application rates for winter wheat production: N1: 0 kg/ha, N2: 90 kg/ha, N3: 210 kg/ha, and N4: 330 kg/ha. This gradient established a continuous nitrogen supply spectrum from no stress to severe stress, covering common fertilization management scenarios in actual production. All treatment combinations were replicated three times, resulting in a total of 480 independent experimental plots (Figure 1). Each plot measured 6 m² (2 m × 3 m) with a row spacing of 15 cm. To minimize cross-treatment interference, adjacent plots were separated by 0.5 m on the sides and 1 m along the ends. 
Winter wheat yield was measured in early June 2024. Grains harvested from each plot were placed in labeled bags, air-dried under ventilation to approximately 12.5% moisture content, and precisely weighed using an electronic scale. The yield per unit area (kg/ha) was then calculated based on plot area.
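
As a quick check on the plot count, the factorial layout described above can be enumerated directly. The sketch below is illustrative only: the cultivar and water-treatment identifiers (`V1`, `W1`, and so on) are hypothetical placeholders, since only the number of levels per factor is given, and the stress labels follow the Level 0–3 convention used in the Results (Level 0 = N4, Level 3 = N1).

```python
from itertools import product

# Hypothetical identifiers; only the number of levels per factor is known.
cultivars = [f"V{i}" for i in range(1, 11)]          # 10 cultivars
nitrogen_kg_ha = {"N1": 0, "N2": 90, "N3": 210, "N4": 330}
water_treatments = ["W1", "W2", "W3", "W4"]          # 4 soil moisture gradients
replicates = [1, 2, 3]

# Stress labels as defined in the Results: Level 0 = N4 ... Level 3 = N1.
stress_level = {"N4": 0, "N3": 1, "N2": 2, "N1": 3}

plots = [
    {"cultivar": c, "nitrogen": n, "water": w, "rep": r, "label": stress_level[n]}
    for c, n, w, r in product(cultivars, nitrogen_kg_ha, water_treatments, replicates)
]

# Number of plots assigned to each stress level.
plots_per_level = {
    level: sum(p["label"] == level for p in plots) for level in range(4)
}
print(len(plots), plots_per_level)
```

The enumeration gives 10 × 4 × 4 × 3 = 480 plots, with 120 plots in each of the four stress levels, so the classes are balanced by design.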

2.2. UAV Data Acquisition and Preprocessing

A DJI M210 UAV (SZ DJI Technology Co., Shenzhen, China) equipped with a MicaSense RedEdge MX multispectral camera (RedEdge MX, MicaSense Inc., Seattle, WA, USA) and a Zenmuse XT2 thermal infrared (TIR) camera (SZ DJI Technology Co., Shenzhen, China) was used to acquire multispectral and thermal imagery (Figure 2). The main specifications of the sensors are summarized in Table 1. Flights were conducted under low-wind conditions to avoid canopy motion and thermal image blur. All data collection was conducted under cloudless conditions with stable illumination between 11:00 and 14:00 to minimize the influence of solar angle variations and ensure consistent lighting. Flights were specifically carried out on 29 March (jointing stage), 27 April (heading stage), and 27 May (filling stage). Before and after each flight, radiometric calibration was performed to mitigate radiometric drift and enable the conversion of digital number (DN) values to reflectance in the multispectral data. This was done by positioning the camera approximately 1 m (±0.1 m) above a calibration reflectance panel placed in an open, shadow-free area. Prior to each flight, the thermal sensor was stabilized and non-uniformity correction (NUC) was performed. Flight routes were planned using the DJI PILOT software (v2.5.1.17; DJI Technology Co., Shenzhen, China). The camera was maintained in a nadir-looking orientation, and images were captured at fixed time intervals. The flight altitude was set at 30 m, with forward and side overlap rates of 85% and 80%, respectively, to ensure high-quality image registration and avoid data gaps. Both sensors were integrated with a Global Navigation Satellite System (GNSS) receiver providing millimeter-level accuracy to acquire high-precision spatial positioning information. Additionally, ground control points (GCPs) measured by differential GNSS were deployed throughout the experimental area. The precise coordinates of these GCPs were used during the post-processing stage for geometric correction and orthorectification, thereby enhancing the spatial accuracy and geometric consistency of the final imagery.

This study utilized Pix4Dmapper software (v4.5.6; Pix4D, Lausanne, Switzerland) to process the large volume of multispectral and TIR images acquired by the UAV, generating orthomosaics with high geometric accuracy. To improve registration and positioning precision, the measured coordinates of GCPs were incorporated as geometric constraints during processing. The software employed a Structure-from-Motion (SfM) algorithm and photogrammetric workflow to perform key point matching, sparse point cloud reconstruction, and image alignment, followed by the generation of dense point clouds, digital surface models (DSM), and final orthomosaics. After radiometric calibration using reflectance panels, the DN of multispectral images were converted to surface apparent reflectance. TIR images were converted to land surface temperature (°C) based on sensor calibration coefficients and radiative transfer models. Through this workflow, multispectral reflectance orthomosaics and corresponding TIR temperature orthomosaics for different growth stages were obtained, providing a consistent data foundation for subsequent feature extraction. Subsequently, using ArcMap 10.8 software (Environmental Systems Research Institute, Inc., Redlands, CA, USA), plot vector boundaries were created by manually digitizing polygon shapefiles for all 480 plots according to the experimental layout. These boundaries were overlaid on the orthomosaics to accomplish plot-level segmentation. For each plot, the average reflectance values from five multispectral bands (Blue, Green, Red, Red Edge, NIR) and the average TIR temperature value were extracted. Finally, a structured data table was constructed, where each row corresponds to one plot sample and includes the plot ID (plot_id), growth stage, multispectral band reflectance values, TIR, and the assigned stress level label (label, 0–3). 
To minimize soil and background interference, especially at the early growth stage and along plot boundaries, an NDVI-based vegetation masking procedure was applied to the multispectral orthomosaics before plot-level feature extraction. Pixels with NDVI lower than a fixed threshold (NDVI ≤ 0.20) were treated as non-vegetation and excluded from subsequent calculations. All plot-level spectral reflectance values, vegetation indices, and canopy temperature statistics were then computed using only the remaining vegetation pixels within each plot polygon. This preprocessing step helps ensure that the models primarily capture canopy-level physiological signals rather than background noise.
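
The masking step can be illustrated with a minimal sketch. It assumes that each pixel inside a plot polygon is available as a dictionary of co-registered band values; this data layout, the function names, and the sample values are hypothetical and are not taken from the study's processing chain.

```python
def ndvi(nir, red):
    """NDVI = (NIR - R) / (NIR + R); returns 0.0 if the denominator is zero."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0


def masked_plot_means(pixels, threshold=0.20):
    """Average every band over vegetation pixels only.

    Pixels with NDVI <= threshold are treated as non-vegetation and dropped,
    following the fixed threshold stated in the text. Returns None when no
    vegetation pixel remains in the plot.
    """
    kept = [p for p in pixels if ndvi(p["nir"], p["red"]) > threshold]
    if not kept:
        return None
    return {band: sum(p[band] for p in kept) / len(kept) for band in kept[0]}


# Hypothetical pixels of one plot: two canopy pixels and one bare-soil pixel.
plot_pixels = [
    {"red": 0.05, "nir": 0.45, "tir": 24.0},   # NDVI = 0.80, kept
    {"red": 0.06, "nir": 0.54, "tir": 25.0},   # NDVI = 0.80, kept
    {"red": 0.20, "nir": 0.25, "tir": 33.0},   # NDVI ~ 0.11, masked out
]
means = masked_plot_means(plot_pixels)
print(means)
```

In this toy case the warm soil pixel is excluded, so the plot-level canopy temperature is 24.5 °C rather than the unmasked mean of about 27.3 °C.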

2.3. Multimodal Feature Extraction

To comprehensively characterize the spectral, physiological, and temporal responses of winter wheat under different nitrogen levels, this study constructed an integrated feature set, primarily consisting of manual features and temporal statistical features (Table 2). The manual features were directly extracted from the multispectral and TIR imagery to represent the fundamental spectral reflectance properties and physiological status of the crop canopy. These specifically included: (1) Spectral Bands: Blue, Green, Red, Red Edge, and NIR bands, which capture the basic reflectance characteristics of the crop canopy; (2) VIs: Nitrogen-sensitive VIs were selected to enhance the characterization of chlorophyll content and plant nitrogen status; (3) Thermal Indicators: Canopy temperature derived from TIR imagery and its interaction terms with key VIs. These complement the spectral indices by helping to reveal transpiration regulation effects induced by variations in nitrogen supply. The temporal statistical features were designed to describe the dynamic trends of the above manual features across the three growth stages. To quantify the temporal patterns of these features over the growth cycle, and considering the nearly equal intervals (approximately 30 days) between stages, the three stages were treated as equidistant time points (t₁ = 1, t₂ = 2, t₃ = 3, corresponding to jointing, heading, and filling stages). For each feature sequence (x₁, x₂, x₃), a suite of temporal statistical metrics was extracted: Mean, Maximum (Max), Minimum (Min), Slope, Coefficient of Variation (CV), Difference, and Area Under the Curve (Area). These metrics collectively reveal the evolutionary characteristics of the crop’s spectral and thermal responses. The relevant calculation formulas are as follows:

Mean:

Mean = \bar{x} = \frac{x_{1} + x_{2} + x_{3}}{3}

Maximum, Minimum:

Max = \max (x_{1}, x_{2}, x_{3}), Min = \min (x_{1}, x_{2}, x_{3})

Linear Regression Slope:

β = \frac{n \sum t_{i} x_{i} - \sum t_{i} \sum x_{i}}{n \sum t_{i}^{2} - {(\sum t_{i})}^{2}}

Substituting t = (1, 2, 3) and n = 3 gives a numerator of 3(x_{3} - x_{1}) and a denominator of 6, so the slope simplifies to:

Slope = \frac{x_{3} - x_{1}}{2}

Coefficient of Variation:

CV = \frac{σ}{\bar{x}}

σ = \sqrt{\frac{{(x_{1} - \bar{x})}^{2} + {(x_{2} - \bar{x})}^{2} + {(x_{3} - \bar{x})}^{2}}{3}}

Difference:

Δ_{1 \to 2} = x_{2} - x_{1}, Δ_{2 \to 3} = x_{3} - x_{2}

Area:

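Area = \frac{x_{1} + x_{2}}{2} (t_{2} - t_{1}) + \frac{x_{2} + x_{3}}{2} (t_{3} - t_{2})

With unit spacing between stages (t_{2} - t_{1} = t_{3} - t_{2} = 1), the trapezoidal area simplifies to:

Area = \frac{x_{1} + 2 x_{2} + x_{3}}{2}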
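
The formulas above translate directly into a small helper. The following is a minimal sketch rather than the authors' code: the function names and the sample NDVI values are hypothetical, and returning NaN for a zero mean in the CV is an added assumption. The general least-squares slope is included only to confirm the simplified expression.

```python
import math


def temporal_stats(x1, x2, x3):
    """Temporal descriptors of one feature at jointing, heading and filling."""
    xs = (x1, x2, x3)
    mean = sum(xs) / 3
    # Population standard deviation (divisor 3), as in the formula for sigma.
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / 3)
    return {
        "mean": mean,
        "max": max(xs),
        "min": min(xs),
        "slope": (x3 - x1) / 2,
        "cv": sigma / mean if mean != 0 else float("nan"),
        "diff_1_2": x2 - x1,
        "diff_2_3": x3 - x2,
        "area": (x1 + 2 * x2 + x3) / 2,
    }


def ols_slope(ts, xs):
    """General least-squares slope, used here to check the simplified form."""
    n = len(ts)
    num = n * sum(t * x for t, x in zip(ts, xs)) - sum(ts) * sum(xs)
    den = n * sum(t * t for t in ts) - sum(ts) ** 2
    return num / den


# Hypothetical NDVI values of one plot at the three stages.
stats = temporal_stats(0.60, 0.80, 0.70)
general = ols_slope([1, 2, 3], [0.60, 0.80, 0.70])
print(stats["slope"], general, stats["area"])
```

Both slope computations agree (about 0.05 for these values), which reflects that the middle observation does not affect the fitted slope when the three time points are equally spaced.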

Table 2. Multimodal feature list.

Features	Name	Formula
R, G, B, RE, NIR	Red, Green, Blue, Red Edge, Near Infrared	The raw value of each band
MCARI	Modified Chlorophyll Absorption in Reflectance Index	((RE − R) − 0.2(RE − G))/(RE/R) [22]
TCARI	Transformed Chlorophyll Absorption in Reflectance Index	3((RE − R) − 0.2(RE − G) × (RE/R)) [23]
CIRE	Chlorophyll Index Red Edge	(NIR/RE) − 1 [24]
MTCI	MERIS Terrestrial Chlorophyll Index	(NIR − RE)/(RE − R) [25]
NDRE	Normalized Difference Red Edge	(NIR − RE)/(NIR + RE) [26]
NDVI	Normalized Difference Vegetation Index	(NIR − R)/(NIR + R) [27]
OSAVI	Optimized Soil Adjusted Vegetation Index	(NIR − R)/(NIR + R + L) (L = 0.16) [28]
GNDVI	Green Normalized Difference Vegetation Index	(NIR − G)/(NIR + G) [29]
PRI	Photochemical Reflectance Index	(B − G)/(B + G) [30]
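TIR	Thermal Infrared Temperature	Canopy temperature derived from TIR imagery
Interaction terms	Interactions between TIR and VIs	TIR/NDVI, TIR × (1 − NDVI), TIR × NDRE, (TIR − NDVI)/(TIR + NDVI)
Statistical features	Temporal statistical features	Mean, Maximum, Minimum, Slope, Coefficient of Variation, Difference, Area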
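
A subset of the features in Table 2 can be computed from plot-level band means as sketched below. The function names and sample values are hypothetical, and NDRE is written with the conventional (NIR + RE) denominator; the remaining indices would follow the same pattern.

```python
def vegetation_indices(b, g, r, re, nir):
    """Selected indices from Table 2, computed from plot-mean reflectance."""
    return {
        "NDVI": (nir - r) / (nir + r),
        "GNDVI": (nir - g) / (nir + g),
        "NDRE": (nir - re) / (nir + re),   # conventional red-edge form
        "CIRE": nir / re - 1,
        "OSAVI": (nir - r) / (nir + r + 0.16),
        "PRI": (b - g) / (b + g),
    }


def interaction_terms(tir, vi):
    """Thermal-spectral interaction terms listed in Table 2."""
    ndvi, ndre = vi["NDVI"], vi["NDRE"]
    return {
        "TIR/NDVI": tir / ndvi,
        "TIR*(1-NDVI)": tir * (1 - ndvi),
        "TIR*NDRE": tir * ndre,
        "(TIR-NDVI)/(TIR+NDVI)": (tir - ndvi) / (tir + ndvi),
    }


# Hypothetical plot-mean reflectances and canopy temperature (deg C).
vi = vegetation_indices(b=0.03, g=0.08, r=0.05, re=0.25, nir=0.45)
inter = interaction_terms(tir=25.0, vi=vi)
print(vi["NDVI"], vi["NDRE"], inter["TIR*NDRE"])
```

Because TIR is expressed in degrees Celsius and the indices are dimensionless, terms such as TIR × NDRE carry the temperature scale, which is one reason tree-based classifiers that are insensitive to feature scaling are convenient for this feature set.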

2.4. Model Development

To develop a model for identifying nitrogen stress in winter wheat, this study designed models based on two methodological approaches: traditional machine learning and deep learning. All models used the manual features constructed earlier as base inputs. Furthermore, the deep learning model leveraged its hierarchical architecture to automatically extract high-level feature representations across modalities and time, thereby enhancing its capacity to characterize complex, non-linear dynamic patterns. Specifically, the CNN-LSTM hybrid architecture employed in this study was responsible for extracting spatio-temporal features from the input data. The hidden state output from its penultimate layer was defined as deep features for subsequent feature fusion and classification tasks. Given the complementary nature of the information encapsulated by the manual features and the deep features, this study constructed four model architectures with progressive relationships (Figure 3) to systematically evaluate the contribution of each component: C1: A temporal model using LSTM. C2: A CNN-LSTM hybrid model combining convolutional and temporal modeling. C3: A model where deep features are used for classification by XGBoost. C4: A hybrid model (CNN-LSTM-XGBoost) that fuses manual features with deep features and performs classification based on XGBoost. To improve reproducibility, the overall CNN–LSTM–XGBoost hybrid pipeline is described as follows: the multitemporal input sequence is first encoded by the 1D-CNN module to learn local temporal patterns; the extracted feature maps are then fed into the BiLSTM module to model long-range dependencies across growth stages. The final hidden representation is used as a compact deep embedding, which is concatenated with the manually engineered spectral/thermal and temporal statistical features to form a unified feature vector. 
This fused feature vector is finally used to train the XGBoost classifier, enabling high predictive performance with interpretable feature importance.

2.4.1. Traditional Machine Learning Model

Random Forest

The Random Forest (RF) algorithm, introduced by Leo Breiman and Adele Cutler in 2001 [31], is an ensemble learning method that enhances classification performance by integrating multiple decision trees. Each decision tree is trained on a randomly selected subset of the data (bootstrap sample) and a random subset of features. This approach introduces diversity among the individual learners and effectively reduces the risk of overfitting. During the training process of each tree, only a random subset of features is considered for node splitting, further promoting model diversity and mitigating overfitting. The final classification result is determined by aggregating the predictions of all individual trees through majority voting, thereby improving the model’s stability and generalization capability. Key hyperparameters tuned for the RF model included: the number of trees in the forest (n_estimators), the maximum depth of each tree (max_depth), the minimum number of samples required to split an internal node (min_samples_split), and the minimum number of samples required to be at a leaf node (min_samples_leaf).

Extreme Gradient Boosting

Extreme Gradient Boosting (XGBoost) is an efficient ensemble algorithm based on the gradient boosting framework, which integrates multiple decision trees [32]. Unlike traditional bagging methods, XGBoost adopts an additive modeling strategy, sequentially constructing multiple weak learners to iteratively correct the prediction residuals from the previous model. This approach systematically reduces model bias and enhances prediction accuracy. XGBoost introduces several key innovations beyond traditional gradient boosting: it incorporates a regularization term to control model complexity, utilizes second-order Taylor expansion for optimized loss function calculation, employs parallel processing to accelerate training, and implements more efficient tree pruning strategies. These improvements collectively grant XGBoost significant advantages in both computational efficiency and generalization performance. In this study, the primary parameters tuned for the XGBoost model included: the number of boosting trees (n_estimators), the learning rate (learning_rate), the maximum tree depth (max_depth), the subsample ratio of training instances (subsample), the subsample ratio of features (colsample_bytree), and the regularization parameter (gamma).
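
The additive, residual-correcting idea can be shown with a toy example. The sketch below is plain gradient boosting with one-split regression stumps and squared-error loss on a single feature. It is not XGBoost itself: regularization, second-order gradients, and column subsampling are all omitted, and the data and function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares stump on one feature: (threshold, left_value, right_value)."""
    best = None
    for thr in sorted(set(xs))[:-1]:          # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1:]


def boost(xs, ys, n_rounds=10, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds it to the model."""
    pred = [sum(ys) / len(ys)] * len(xs)      # start from the mean prediction
    mse_history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        thr, lv, rv = fit_stump(xs, residuals)
        pred = [p + learning_rate * (lv if x <= thr else rv)
                for p, x in zip(pred, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys))
    return pred, mse_history


# Hypothetical one-dimensional training data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 2.0, 2.1, 5.0, 5.2, 6.0, 6.1]
pred, mse_history = boost(xs, ys)
print([round(m, 4) for m in mse_history])
```

The training error shrinks from round to round because every new stump is fitted to what the previous ensemble still gets wrong; the learning rate controls how much of each correction is applied.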

Support Vector Machine

The Support Vector Machine (SVM) is a supervised learning algorithm widely used for classification in high-dimensional feature spaces [33]. Its core principle is to find the optimal hyperplane that maximizes the margin between different classes in the feature space, thereby ensuring excellent classification robustness and generalization capability. By employing the structural risk minimization principle and regularization techniques, SVM effectively balances model complexity and training error. The decision boundary of an SVM is determined solely by a few critical sample points, known as support vectors. To address linearly non-separable problems, SVM applies the kernel trick to map the original features into a higher-dimensional space where linear separation becomes possible. In this study, the key hyperparameters optimized for the SVM model were the kernel type (kernel) and the regularization parameter (C).

2.4.2. Deep Learning Models

Convolutional Neural Network (CNN) Module

To effectively capture localized response features induced by nitrogen stress within multispectral and TIR sequences, this study introduced a 1-Dimensional Convolutional Neural Network (1D-CNN) as the front-end feature extraction module [34]. The 1D-CNN is an extension of the 2D CNN for sequential data, with its theoretical foundation established by LeCun et al. in the 1990s. Its core concept relies on convolutional kernels with local connectivity and weight sharing to automatically extract local patterns and translation-invariant features from input sequences. In this study, this module served as the key feature extractor. The input sequences were processed by a two-layer convolutional block, with the default number of filters set to 16 and a kernel size of 2. Each convolutional layer was followed by a Batch Normalization operation to accelerate training convergence and improve stability, and used the ReLU activation function to introduce non-linearity. To enhance the model’s generalization ability, a Dropout layer (dropout rate of 0.3) was added at the end of the convolutional block, randomly disabling a portion of neurons to mitigate overfitting.
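
To make the role of the kernel size concrete, the sketch below runs a single 1D convolution with ReLU over a three-stage sequence. It assumes "valid" padding, which the text does not specify, uses two hand-picked filters instead of 16 learned ones, and leaves out Batch Normalization and Dropout; all names and values are hypothetical.

```python
def conv1d_relu(seq, kernels, biases):
    """1D convolution over time followed by ReLU ('valid' padding assumed).

    seq:     list of T time steps, each a list of F feature values
    kernels: list of filters, each shaped [k][F] (k = kernel size)
    biases:  one bias per filter
    Returns a list of T - k + 1 time steps, each with one value per filter.
    """
    k = len(kernels[0])
    out = []
    for t in range(len(seq) - k + 1):
        window = seq[t:t + k]
        step = []
        for w, b in zip(kernels, biases):
            z = b + sum(w[i][f] * window[i][f]
                        for i in range(k) for f in range(len(window[i])))
            step.append(max(0.0, z))          # ReLU
        out.append(step)
    return out


# Three growth stages, two features per stage (hypothetical values).
seq = [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]]
kernels = [
    [[-1.0, 0.0], [1.0, 0.0]],    # stage-to-stage increase of feature 0
    [[0.0, -1.0], [0.0, -1.0]],   # negative sum of feature 1 (zeroed by ReLU)
]
features = conv1d_relu(seq, kernels, biases=[0.0, 0.0])
print(features)   # [[1.0, 0.0], [1.0, 0.0]]
```

With a kernel size of 2, each output position combines two adjacent growth stages (jointing with heading, heading with filling), so the convolution encodes local stage-to-stage changes before the recurrent layer models the full sequence.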

Long Short-Term Memory (LSTM) Module

The Long Short-Term Memory (LSTM) network is a specialized variant of Recurrent Neural Networks (RNNs) [35]. It addresses the issues of vanishing or exploding gradients during the training of traditional RNNs by introducing input gates, forget gates, output gates, and memory cells (Figure 4). In this study, we adopted a Bidirectional LSTM architecture which contains two independent LSTM layers enabling the integration of information from both temporal directions. The number of LSTM units was set to 32, and both kernel and recurrent regularizers (L2 coefficient of 1 × 10⁻⁴) were applied to improve model generalization. The key formulas are as follows:

Forget Gate:

F_{t} = σ (W_{f} \cdot [H_{t - 1}, X_{t}] + b_{f})

Input Gate:

I_{t} = σ (W_{i} \cdot [H_{t - 1}, X_{t}] + b_{i})

Candidate Cell State:

\tilde{C_{t}} = \tanh (W_{c} \cdot [H_{t - 1}, X_{t}] + b_{c})

Cell State Update:

C_{t} = F_{t} \cdot C_{t - 1} + I_{t} \cdot \tilde{C_{t}}

Output Gate:

O_{t} = σ (W_{o} \cdot [H_{t - 1}, X_{t}] + b_{o})

Hidden State Output:

H_{t} = O_{t} \cdot \tanh (C_{t})

Figure 4. LSTM architecture. t: the current timestep; X_{t}: the current input; C_{t−1}: the cell state from the previous timestep; H_{t−1}: the hidden state from the previous timestep; “×” and “+” indicate element-wise multiplication and addition; arrows show the information flow within time step t.
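
The gate equations can be traced with a scalar, single-direction LSTM step. This is a minimal sketch for illustration: the weights are hypothetical, inputs and states are scalars rather than vectors, and the bidirectional wrapper and L2 regularization used in the study are not reproduced.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for a scalar input and scalar state.

    params maps each gate name ("f", "i", "c", "o") to (w_h, w_x, b): the
    weights applied to [H_{t-1}, X_t] and the bias of that gate.
    """
    def affine(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(affine("f"))            # forget gate
    i_t = sigmoid(affine("i"))            # input gate
    c_tilde = math.tanh(affine("c"))      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # cell state update
    o_t = sigmoid(affine("o"))            # output gate
    h_t = o_t * math.tanh(c_t)            # hidden state output
    return h_t, c_t


# Hypothetical weights; a trained network would learn these values.
params = {"f": (0.1, 0.5, 0.0), "i": (0.2, 0.4, 0.0),
          "c": (0.3, 0.6, 0.0), "o": (0.1, 0.3, 0.0)}

h, c = 0.0, 0.0
for x in (0.62, 0.81, 0.70):   # one feature at jointing, heading, filling
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

A bidirectional layer would run a second, independent pass over the reversed sequence and concatenate the two final hidden states.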

2.5. Model Training and Evaluation

2.5.1. Training Parameters

All deep learning models were trained using the Adam optimizer with a learning rate of 1 × 10⁻⁴ and the categorical cross-entropy loss function. To prevent overfitting, an early stopping mechanism (patience = 10) was employed during training, with the maximum number of epochs set to 40. For the traditional machine learning models, hyperparameters were optimized via grid search, with the corresponding parameter search ranges provided in Table 3.

2.5.2. Evaluation Strategy and Metrics

Cross-validation is a standard technique for assessing model generalization ability and is widely adopted in machine learning performance evaluation [36]. To objectively evaluate the classification performance of each model, this study employed a Nested Cross-Validation strategy (Figure 5). This framework consists of two layers: an outer loop with 5 folds for performance estimation, and an inner loop with 3 folds dedicated to model selection and hyperparameter tuning. In each outer-loop split, the data were divided into an 80% training set and a 20% test set. This design effectively prevents information leakage from the test set during the model selection process, ensuring the unbiased and reliable nature of the evaluation results. Model performance was comprehensively assessed using four metrics: Accuracy, F1_macro, Area Under the Curve (AUC_macro), and Cohen’s Kappa coefficient (Kappa). These metrics collectively reflect model capability from different perspectives, including overall correctness, robustness to class imbalance, class discriminability, and prediction agreement, respectively. All samples were defined at the plot level as independent experimental units, and each plot was predicted only in its corresponding outer-loop validation fold, ensuring that the reported results are strictly out-of-fold estimates without information leakage.
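
The index bookkeeping of the nested scheme can be sketched as follows. This is an assumption-laden illustration: the folds are drawn by simple shuffling, whereas the study may have used stratified folds, and the function names and seeds are hypothetical. Model fitting and grid search are left out; only the splitting logic is shown.

```python
import random


def kfold_indices(indices, k, seed):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def nested_cv(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    Inner splits are drawn from the outer training indices only, so the
    outer test fold never takes part in hyperparameter selection.
    """
    for outer_train, outer_test in kfold_indices(range(n_samples), outer_k, seed):
        inner_splits = list(kfold_indices(outer_train, inner_k, seed + 1))
        yield outer_train, outer_test, inner_splits


splits = list(nested_cv(480))
for outer_train, outer_test, inner_splits in splits:
    print(len(outer_train), len(outer_test),
          [(len(tr), len(va)) for tr, va in inner_splits])
```

For 480 plots, every outer fold holds out 96 plots (20%) and tunes on the remaining 384, which the inner loop splits further into 256 training and 128 validation plots.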

Accuracy:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

F1_macro:

F1_{macro} = \frac{1}{C} \sum_{c = 1}^{C} F1_{c}

F1_{c} = \frac{2 P_{c} \times R_{c}}{P_{c} + R_{c}}

P_{c} = \frac{TP_{c}}{TP_{c} + FP_{c}}, R_{c} = \frac{TP_{c}}{TP_{c} + FN_{c}}

AUC_macro:

{AUC}_{macro} = \frac{1}{C} \sum_{c = 1}^{C} {AUC}_{c}

Kappa:

Kappa = \frac{p_{o} - p_{e}}{1 - p_{e}}

In the above formulas, for a given class, True Positive (TP) is the number of samples that belong to that class and were correctly predicted as such; True Negative (TN) is the number of samples that do not belong to the class and were correctly predicted as not belonging to it; False Positive (FP) is the number of samples that do not belong to the class but were predicted to belong to it; and False Negative (FN) is the number of samples that belong to the class but were predicted as another class. P_c and R_c are the precision and recall of class c. p_o is the observed agreement rate, and p_e is the agreement rate expected by chance. C is the total number of classes, and macro indicates that the metric is calculated separately for each class and then averaged.
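
Three of the four metrics can be computed from a multiclass confusion matrix, as in the sketch below. The function names and the toy labels are hypothetical; accuracy is taken as the fraction of correctly classified samples, and AUC_macro is omitted because it requires predicted class probabilities rather than hard labels.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def accuracy(cm):
    return sum(cm[c][c] for c in range(len(cm))) / sum(map(sum, cm))


def f1_macro(cm):
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true class differs
        fn = sum(cm[c]) - tp                        # true c, predicted otherwise
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / n


def cohen_kappa(cm):
    n, total = len(cm), sum(map(sum, cm))
    p_o = sum(cm[c][c] for c in range(n)) / total
    # Chance agreement from the row (true) and column (predicted) marginals.
    p_e = sum(sum(cm[c]) * sum(cm[r][c] for r in range(n))
              for c in range(n)) / total ** 2
    return (p_o - p_e) / (1 - p_e)


# Toy labels for the four stress levels (hypothetical).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 1]
cm = confusion_matrix(y_true, y_pred, n_classes=4)
print(accuracy(cm), f1_macro(cm), cohen_kappa(cm))
```

When the true classes are balanced, as in the four equally sized stress levels here, p_e equals 0.25 regardless of the predictions, so Kappa reduces to (Accuracy − 0.25)/0.75. For the reported accuracy of 0.9208 this gives approximately 0.8944, in line with the reported Kappa.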

3. Results

3.1. Statistical Validation of the Wheat Nitrogen Stress Classification Gradient

To validate the scientific rationale and discriminative capacity of the four-level nitrogen stress classification system predefined in this study based on controlled nitrogen application rates (N1–N4), we conducted systematic statistical tests and visualization analysis on the yield distribution across different stress levels (Figure 6). The results demonstrate that this classification system (Level 0: N4 = 330 kg/ha; Level 1: N3 = 210 kg/ha; Level 2: N2 = 90 kg/ha; Level 3: N1 = 0 kg/ha) exhibits a clear gradient structure, aligns well with crop yield response patterns, and can thus serve as a reliable label basis for subsequent modeling. First, the Kruskal–Wallis test revealed highly significant differences among the four nitrogen stress levels (H(3) = 100.65, p < 0.001). The effect size, ε² = 0.205, indicated a moderate-to-large magnitude, suggesting that the stress level explains approximately 20.5% of the yield variation and possesses good discriminative power. In the box plot, the yield distributions showed a clear decreasing trend across levels: Level 0 had a mean yield of 25,120 ± 860 kg/ha, Level 1 of 22,850 ± 720 kg/ha, Level 2 of 21,530 ± 680 kg/ha, and Level 3 of 20,077 ± 590 kg/ha. From Level 0 to Level 3, yield consistently decreased, with a maximum difference of 5043 kg/ha, representing a reduction of approximately 20.1% relative to Level 0. This trend is consistent with the physiological mechanism whereby nitrogen deficiency inhibits photosynthetic capacity and dry matter accumulation in crops. Second, multiple comparison results indicated that the yield of Level 0 was significantly higher than all other levels (p < 0.0001). The difference between Level 1 and Level 2 was not statistically significant (p = 1.0000), but both were significantly higher than Level 3 (p < 0.0001). This pattern was also reflected in the Cumulative Distribution Function (CDF) curves, where the yield curves shifted leftward with increasing stress levels.
Combined with the gradient observed in the bar chart of mean yields, this further supports the stability of the statistical differences among categories. In summary, the nitrogen stress classification structure not only demonstrates clear statistical separability but also aligns with the known physiological effects of nitrogen on winter wheat growth and yield formation, making it a reliable labeling system for subsequent model training and validation.
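For readers who want to reproduce this kind of test, the following is a minimal, dependency-free sketch of the Kruskal–Wallis H statistic and a rank-based effect size. It is illustrative only: the toy groups are hypothetical, no tie correction is applied, and the estimator ε² = H / (n − 1) is one common definition that the text does not confirm as the one used in this study.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic using average ranks for ties (no tie correction)."""
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)

    i = 0
    while i < n_total:
        # Find the run of tied values starting at position i.
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        i = j

    weighted = sum(r * r / len(values) for r, values in zip(rank_sums, groups))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


def epsilon_squared(h, n_total):
    """Rank-based effect size, assuming the definition H / (n - 1)."""
    return h / (n_total - 1)


# Toy yield groups (hypothetical values, three groups of three plots).
toy_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_stat = kruskal_wallis_h(toy_groups)
effect = epsilon_squared(h_stat, sum(len(g) for g in toy_groups))
print(round(h_stat, 3), round(effect, 3))
```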

3.2. Overall Comparison of Model Performance

To systematically evaluate the classification efficacy of different modeling approaches in identifying wheat nitrogen stress gradients, this study compared three traditional machine learning models (RF, SVM, XGBoost) against a deep learning model (CNN-LSTM) using multi-stage fused features (Table 4). The traditional machine learning models achieved classification accuracy ranging from 0.69 to 0.79. XGBoost showed the best overall performance among the traditional models, with Accuracy, F1_macro, AUC_macro, and Kappa reaching 0.7931, 0.7933, 0.9388, and 0.7241, respectively. RF performed comparably, while SVM exhibited relatively lower metrics across all evaluated dimensions. In comparison, the CNN-LSTM model showed substantially higher classification capability, achieving an Accuracy of 0.8938, F1_macro of 0.8940, AUC_macro of 0.9698, and Kappa of 0.8583. All metrics showed marked improvement over those of the traditional models. Figure 7 further illustrates the performance footprint of each model, revealing that CNN-LSTM achieves the largest and most balanced coverage across all evaluation dimensions, thereby reflecting its comprehensive classification capability.
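As a consistency check on the values reported above, Kappa can be related directly to Accuracy under one assumption that is not stated in this section: that the four nitrogen levels are equally represented among the plots, as a balanced factorial design would imply. In that case the chance agreement is p_{e} = 1/4 regardless of how the predictions are distributed, and the worked arithmetic is:

```latex
\mathrm{Kappa} = \frac{p_{o} - p_{e}}{1 - p_{e}} = \frac{\mathrm{Accuracy} - 0.25}{0.75}

\text{CNN-LSTM:} \quad \frac{0.8938 - 0.25}{0.75} \approx 0.858

\text{XGBoost:} \quad \frac{0.7931 - 0.25}{0.75} \approx 0.724
```

Both results agree with the reported Kappa values (0.8583 and 0.7241) to three decimal places, which is consistent with the balanced-class assumption.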

3.3. The Impact of Input Modalities on Classification Performance

To evaluate the influence of different input feature modalities on classification performance, this study designed three input combinations for comparative analysis: A1: spectral bands and VIs; A2: TIR features; A3: spectral bands, VIs, and TIR features. The experimental results indicate clear differences in model performance across these modalities (Table 5). When using only A1 features, the F1_macro of the traditional models ranged between 0.56 and 0.76. XGBoost and the deep learning model performed best, achieving F1_macro scores of 0.7620 and 0.8334, respectively, while RF was slightly lower (F1_macro = 0.7423). SVM showed relatively low performance across all metrics, with an Accuracy of only 0.5708. Using only A2 features led to a general performance decline in the traditional models, whose F1_macro scores ranged between 0.56 and 0.70. In contrast, CNN-LSTM maintained relatively high performance, achieving an F1_macro of 0.8840 and an Accuracy of 0.8833. With the A3 feature set, the classification performance of all models improved. The Accuracy of XGBoost and CNN-LSTM reached 0.7931 and 0.8938, respectively. RF also showed improvement (F1_macro = 0.7768), while the enhancement for SVM was more limited (F1_macro = 0.6960). The confusion matrix of the deep learning model under the A3 condition (Figure 8) revealed a notable reduction in misclassification for the moderate stress category, indicating that multimodal information effectively enhances inter-class discriminability.

3.4. The Impact of Different Stages on Classification Performance

To investigate the impact of temporal information from different growth stages on the classification of wheat nitrogen stress gradients, this study compared the performance using the A3 fused modality as input across the jointing stage, heading stage, filling stage, and multi-stage fusion (Table 6). The results demonstrate that integrating multi-stage temporal information improved the classification performance of all models compared to using any single growth stage. Specifically, with the inclusion of multi-stage temporal features, the XGBoost model achieved the best performance, with an Accuracy of 0.9125, F1_macro of 0.9131, AUC_macro of 0.9876, and the highest Kappa value of 0.8833. The RF model reached an Accuracy of 0.8875, F1_macro of 0.8877, AUC_macro of 0.9792, and a Kappa value of 0.8500. Under multi-temporal feature input, the performance of the SVM model was close to that of RF (F1_macro = 0.8924) but still slightly lower than XGBoost overall. Regarding single-stage performance, the jointing stage yielded moderate classification results, with RF achieving an F1_macro of 0.7727 and an AUC_macro of 0.9278, outperforming both XGBoost and SVM. The heading stage showed the best single-stage performance, where XGBoost performed the best with an F1_macro of 0.8189. The filling stage exhibited relatively lower overall performance, though XGBoost still maintained comparatively good results with an F1_macro of 0.7577. Furthermore, the ROC curves of the XGBoost model (Figure 9) showed that the multi-stage fusion achieved an AUC_macro of 0.9876, higher than that of any of the three individual stages.
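The gain of multi-stage fusion over the best single stage (heading) for XGBoost can be expressed either as an absolute difference in percentage points or as a relative change. The short worked calculation below uses only the F1_macro values quoted above; the symbol \Delta F1_{macro} is introduced here for illustration.

```latex
\Delta F1_{macro} = 0.9131 - 0.8189 = 0.0942 \quad (9.42 \text{ percentage points})

\frac{\Delta F1_{macro}}{F1_{macro}^{heading}} = \frac{0.0942}{0.8189} \approx 0.115 \quad (\text{a relative gain of about } 11.5\%)
```

The two figures answer different questions, so it helps to state explicitly which one is being reported.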

3.5. The Impact of Different Components on Classification Performance

To further quantify the impact of internal components within the deep learning architecture on classification performance, this study conducted a component ablation experiment based on the four model structures (C1–C4) described above. The experimental results (Table 7) indicate significant differences in classification performance among the various components. The C4 model significantly outperformed all other architectures across all evaluation metrics, achieving an Accuracy of 0.9208, F1_macro of 0.9212, AUC_macro of 0.9879, and Kappa of 0.8944. The C1 (LSTM) model demonstrated the lowest performance, with an Accuracy of 0.7000. The C2 (CNN-LSTM) model showed improved performance, raising Accuracy to 0.7500. The C3 model exhibited a further increase in Accuracy to 0.8250, with an AUC_macro of 0.9422. For a more detailed comparison of the classification capabilities of the different architectures, ROC curves and confusion matrices for models C1 to C4 were plotted (Figure 10 and Figure 11). The results (Figure 10) show that the ROC curve of the C4 model is positioned closest to the top-left corner overall, achieving an AUC_macro of 0.9879, which indicates its superior ability to discriminate between different stress levels. The confusion matrices (Figure 11) reveal that most misclassifications occurred within the moderate stress categories. Notably, the confusion matrix of the C4 model demonstrates a significant improvement in classification accuracy for these moderate stress classes.

3.6. Interpretability and Visualization of Model Outputs

To gain deeper insights into the decision-making rationale of the hybrid model, this study conducted a comprehensive evaluation from multiple perspectives, integrating feature contribution analysis, deep feature visualization, and error statistics (Figure 12). The SHAP beeswarm plot revealed that features such as TIR × NDRE, rededge_T2, and TIR_T2 exhibited high contribution magnitudes within the overall feature space. Global SHAP summary rankings further quantified the importance of individual features, identifying TIR × NDRE as having the highest mean |SHAP| value, followed by TIR, GNDVI, PRI, and NDRE indices. The feature importance structure demonstrated a relatively stable overall pattern. The t-SNE visualization results showed that the four nitrogen stress levels formed relatively concentrated clusters in the two-dimensional embedding space, with a discernible degree of separation between the different classes. The distribution of prediction errors indicated that the majority of samples had an error below 0.05. The mean error was 0.065, the median error was 0.007, and the error distribution was relatively concentrated.
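The global ranking by mean |SHAP| can be sketched as follows. This is a simplified illustration with hypothetical SHAP values and feature names. For a multi-class model, per-class SHAP values would first need to be aggregated, a step that this section does not describe.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across samples.

    shap_rows: one list of per-feature SHAP values for each sample.
    Returns (feature, mean |SHAP|) pairs sorted from most to least influential.
    """
    n_samples = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n_samples
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, mean_abs), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for three samples and three features.
feature_names = ["TIR_x_NDRE", "TIR", "GNDVI"]
shap_rows = [
    [0.40, -0.10, 0.05],
    [-0.30, 0.20, -0.05],
    [0.50, -0.15, 0.10],
]
ranking = rank_by_mean_abs_shap(shap_rows, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging matters: a feature that pushes predictions strongly in both directions would otherwise average out to near zero.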

4. Discussion

4.1. Synergistic Mechanisms of Multimodal and Multi-Temporal Fusion

This study demonstrates that the multimodal fusion of spectral bands, VIs, and TIR features significantly improves the accuracy of nitrogen stress classification, a finding consistent with previous crop stress monitoring research utilizing multisource remote sensing [37]. Spectral bands and VIs effectively capture the decline in chlorophyll content induced by nitrogen deficiency [38]; however, these features are susceptible to confounding factors such as soil background effects and variations in illumination. In contrast, TIR serves as an indirect indicator of crop water status and physiological activity [39]. Under nitrogen stress, the root uptake capacity of wheat is compromised [40] and photosynthetic activity weakens, leading to reduced leaf water content and diminished transpiration [41], which consequently elevates canopy temperature [42]. The complementary physiological sensitivity of these modalities explains the consistent performance gains observed for multimodal fusion in our experiments and aligns with previous UAV-based studies integrating spectral and thermal signals. For instance, Yu et al. demonstrated that combining multispectral and thermal information improved the estimation accuracy of leaf chlorophyll content and LAI under different management conditions, highlighting the added value of thermal cues in capturing crop physiological responses [43]. Moreover, Wang et al. showed that incorporating thermal temperature indices into multisensor UAV frameworks can further improve winter wheat nitrogen monitoring, confirming the benefit of thermal–spectral complementarity across growth stages [44]. Recent studies have corroborated this synergy: UAV-based thermal imaging has been directly linked to declines in stomatal conductance under nitrogen stress [45], and red edge indices such as NDRE are widely recognized as reliable indicators of nitrogen status [46,47].
Critically, the fusion of thermal and spectral data has been shown to significantly improve the estimation of crop nitrogen status compared to using either modality alone [47]. In this study, the TIR × NDRE interaction feature was identified by SHAP analysis as the highest-contributing variable, indicating its capacity to simultaneously capture the dual signals of red edge shift and canopy warming, thereby amplifying the nitrogen stress signal. This result suggests that the coupled response chain, in which chlorophyll reduction weakens physiological cooling capacity and thereby elevates canopy temperature, provides a critical basis for the model’s accurate stress level identification.
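To make the interaction feature concrete, the sketch below computes NDRE from near-infrared and red-edge reflectance using its standard definition, then multiplies it by canopy temperature. The plain product, the lack of any feature scaling, and all input values are assumptions for illustration; the exact construction used in the study is defined in its methods rather than here.

```python
def ndre(nir, red_edge):
    """Normalized Difference Red Edge index: (NIR - RE) / (NIR + RE)."""
    return (nir - red_edge) / (nir + red_edge)


def tir_ndre_interaction(canopy_temp, nir, red_edge):
    """Interaction term formed as canopy temperature times NDRE (assumed form)."""
    return canopy_temp * ndre(nir, red_edge)


# Hypothetical plot-level values: reflectance is unitless, temperature in deg C.
well_fertilized = tir_ndre_interaction(canopy_temp=24.0, nir=0.50, red_edge=0.25)
nitrogen_stressed = tir_ndre_interaction(canopy_temp=28.0, nir=0.40, red_edge=0.30)
print(round(well_fertilized, 3), round(nitrogen_stressed, 3))
```

Because NDRE falls while canopy temperature rises under nitrogen deficiency, a single product term encodes both responses in one number, which a tree-based classifier can then split on directly.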

Multi-temporal feature fusion also demonstrated substantial advantages in this study. Compared to the single-stage F1_macro value of 0.8189 during the heading stage, the metric increased to 0.9131 after multi-stage fusion, an improvement of 9.42 percentage points. This aligns with established understanding of nitrogen dynamics in wheat: the jointing stage represents a critical period for nitrogen demand in wheat, where nitrogen primarily influences chlorophyll accumulation and plant growth vigor [48]; at the heading stage, nitrogen status directly affects spike development and the supply of photoassimilates, making spectral features most discriminative [44]; during the filling stage, nitrogen translocation efficiency determines grain filling and influences canopy temperature and spectral indices [49]. Multi-stage data capture the dynamic progression of nitrogen stress, reducing classification errors that may arise from transient environmental disturbances in single-stage observations, corroborating recent findings on the importance of dynamic crop monitoring [50]. In summary, multimodal information captures leaf biochemical properties and canopy thermal status, while multi-temporal information reflects the spatiotemporal progression of nitrogen stress. Their integration substantially enhances the model’s stability and generalization capability.

4.2. Advantages and Applicability of the Hybrid Model Architecture

In this study, the CNN-LSTM-XGBoost hybrid framework achieved the best classification performance, primarily attributable to the synergistic interaction of the following two mechanisms. First, the CNN-LSTM module captures local spectral correlations through convolutional operations and models temporal dependencies via the bidirectional LSTM, automatically extracting complex non-linear patterns that are difficult to represent with manual features. Second, the manual features, constructed based on crop physiology, carry clear biophysical and agronomic significance, which reduces the model’s dependency on large datasets and enhances its stability [51]. The XGBoost classifier integrates these two distinct types of features. Its gradient-boosted tree architecture is well-suited for handling heterogeneous feature sets and providing feature importance interpretations, while its built-in regularization and subsampling mechanisms further suppress potential noise from high-dimensional inputs [52]. The component ablation experiment further validated the incremental contribution of each module. The model using only LSTM (C1) achieved an accuracy of merely 70.00%, struggling to adequately capture local spectral variations. Incorporating the CNN module to form the CNN-LSTM structure (C2) increased the accuracy to 75.00%, demonstrating the critical role of local feature extraction. Subsequently, feeding the deep features into XGBoost for classification (C3) raised the accuracy further to 82.50%, indicating that the ensemble method utilizes high-level abstract features more effectively. Ultimately, the hybrid model (C4) that fused deep features with manual features delivered the best performance (92.08% accuracy), highlighting the complementary value of multisource features in terms of representational capacity and interpretability.
This hybrid architecture maintains the powerful representational capabilities of deep learning while incorporating the controllability and stability of traditional machine learning methods, offering a transferable solution for processing multimodal, multi-temporal agricultural remote sensing data. Representative state-of-the-art modeling paradigms reported in UAV-based crop stress monitoring, including end-to-end deep sequence models and recent hybrid fusion strategies, were not benchmarked here. The current experiments focus on widely adopted baselines (RF, SVM, XGBoost, and CNN–LSTM), which provide a practical and fair evaluation under the same dataset; future work will incorporate additional benchmark architectures (e.g., attention-based temporal models and transformer-style fusion) when larger multi-year datasets become available.
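The fusion step that distinguishes C4 from C3 amounts to concatenating, for each plot, the deep feature vector with the manual feature vector before classification. A minimal sketch under stated assumptions follows: the function name, the feature dimensions, and the values are hypothetical, and the downstream gradient-boosted classifier is intentionally not shown.

```python
def fuse_features(deep_features, manual_features):
    """Concatenate per-plot deep and manual feature vectors row by row."""
    if len(deep_features) != len(manual_features):
        raise ValueError("Both feature sets must describe the same plots.")
    return [list(deep) + list(manual) for deep, manual in zip(deep_features, manual_features)]


# Hypothetical inputs: 2 plots, 3 deep features and 2 manual features each.
deep = [[0.12, -0.40, 0.88], [0.05, 0.31, -0.27]]
manual = [[0.33, 24.0], [0.14, 28.0]]  # e.g. an NDRE value and a canopy temperature
fused = fuse_features(deep, manual)
print(len(fused), len(fused[0]))
```

Keeping the manual features as named columns in the fused matrix is what allows feature-level attribution methods such as SHAP to report physically meaningful variables alongside the learned ones.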

4.3. Spatial Generalization Capability and Agronomic Application Potential

To validate the spatial generalization capability of the proposed fusion model (C4) under real-field conditions, it was applied to orthomosaic imagery of the entire experimental area to generate a spatial distribution map of nitrogen stress levels (Figure 13). The results revealed a distinct spatial structure in the model predictions. High and low nitrogen plots formed clear and continuous spatial patches, aligning closely with the spatial layout of different fertilizer treatments in the experimental design. This strongly demonstrates the model’s effectiveness in capturing crop physiological and spectral responses induced by varying nitrogen application rates, confirming its robust spatial discriminative ability and agronomic rationality. Minor prediction deviations occurred primarily for intermediate nitrogen levels (Level 1 and Level 2), which may be attributed to field micro-topography and variations in soil texture affecting local water and nutrient retention capacities.

From an agronomic perspective, the spatial prediction map generated by the model can directly inform variable-rate nitrogen application decisions. For instance, in Level 3 (severe stress) areas, considering the nitrogen demand pattern of wheat, supplemental topdressing could be recommended. For Level 1 (mild stress) areas, the current fertilization regimen could be maintained or nitrogen application moderately reduced during jointing, aiming to maintain yield while mitigating the risk of nitrogen leaching [53]. Compared to traditional uniform fertilization, a precision management strategy based on spatial heterogeneity can potentially improve nitrogen use efficiency and reduce environmental pollution risks, aligning with sustainable agricultural goals [54]. Therefore, the spatial prediction map serves not only as an intuitive visualization of model performance but also as a critical validation of its agronomic utility, demonstrating the potential for translating this research framework into practical precision agriculture operations.

4.4. Study Limitations and Future Perspectives

Despite the encouraging results, several limitations should be acknowledged. First, the dataset was collected from a single experimental site within one growing season (2023–2024). Although the factorial design introduced substantial variability (e.g., multiple cultivars, nitrogen gradients, and soil moisture treatments), the transferability of the proposed framework to other climates, years, and soil types still requires further empirical validation. Future work will therefore evaluate the model using multi-year and multi-location datasets to better quantify its robustness and applicability under diverse environments [53]. Second, the current analysis focused on three agronomically critical growth stages (jointing, heading, and filling). While these stages capture key physiological transitions, higher-frequency UAV monitoring across the entire growing season may better characterize continuous stress dynamics and improve early diagnosis under fluctuating field conditions [55]. Third, the proposed framework primarily relies on canopy-level spectral, thermal, and temporal features and does not explicitly incorporate auxiliary environmental covariates such as baseline soil nutrient status, soil texture, or meteorological variables (e.g., precipitation and temperature). Integrating these data sources is expected to reduce prediction uncertainty—particularly for intermediate nitrogen levels (Level 1 and Level 2), where spectral separability is often limited—and to strengthen the causal interpretation of nitrogen stress responses in complex field environments [51]. In future work, we will explore multimodal fusion strategies that integrate UAV imagery with soil and weather information to further improve robustness and transferability across years and environments [21]. 
Furthermore, although the hybrid framework provides a degree of interpretability through the gradient-boosting component, the physiological meaning associated with deep representations warrants further investigation. Future studies will combine deep features with field-measured physiological traits to validate the underlying mechanisms. Finally, to better reflect real deployment conditions and reduce potential spatial or temporal dependence among training and validation samples, more stringent validation protocols will be adopted in future evaluations.

5. Conclusions

This study developed an interpretable multimodal and multi-temporal deep learning framework for classifying wheat nitrogen stress gradients using UAV-based multispectral and TIR imagery. Based on comprehensive ablation experiments and interpretability analysis, this study elucidates the following mechanisms and conclusions:

Multimodal feature fusion significantly enhances nitrogen stress classification accuracy. The CNN-LSTM model achieved an Accuracy of 89.38% and an F1_macro of 0.8940 under the fused modality (A3), representing a 6.05 percentage point improvement over using spectral features alone (A1).

Multi-temporal information improves model stability and generalization capability. Integrating data from multiple growth stages significantly boosted the performance of traditional models. With multi-temporal input, the XGBoost model performed best, achieving an Accuracy of 91.25%, F1_macro of 0.9131, AUC_macro of 0.9876, and Kappa of 0.8833. This represents a 9.42 percentage point improvement in F1_macro over using the single heading stage, underscoring the importance of capturing the dynamic progression of nitrogen stress.

Combined convolutional and temporal modeling outperforms individual architectures, confirming the advantage of the hybrid design. In the deep learning component ablation study, the LSTM model (C1) achieved an accuracy of 70.00%. Integrating the CNN module for local spectral feature extraction (C2) increased accuracy to 75.00%, validating the critical role of convolutional operations in extracting discriminative spectral patterns.

The CNN-LSTM-XGBoost (C4) framework has the best predictive performance. After concatenating the features extracted by the deep model with the manual features and performing ensemble learning with XGBoost, C4 achieved the best overall performance (Accuracy = 0.9208, F1_macro = 0.9212, AUC_macro = 0.9879, Kappa = 0.8944). This result outperformed the standalone deep learning and machine learning methods, combining high accuracy with interpretability.

The model demonstrates sound interpretability and spatial generalization capability. SHAP analysis clearly identified and ranked the key features for nitrogen stress classification. t-SNE visualization revealed distinct clustering of the four nitrogen stress levels. Furthermore, the spatial prediction maps generated for the field accurately matched the actual fertilizer application layout, demonstrating the model’s potential for enabling precision nitrogen management.

Author Contributions

Conceptualization, X.K.; methodology, X.K., Q.C. and D.W.; software, X.K. and B.M.; validation, X.K., F.D., H.L. and Z.C.; formal analysis, X.K.; writing—original draft preparation, X.K.; writing—review and editing, X.K., X.H., Z.C. and Q.C.; visualization, X.K. and F.D.; supervision, Z.C., F.D., X.H. and H.L.; project administration, X.K., B.M., Y.L., W.F. and D.C.; funding acquisition, Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2023YFD1900705), the Central Public-interest Scientific Institution Basal Research Fund (No. IFI2024-01), and the Science and Technology Innovation Project of the Chinese Academy of Agricultural Sciences.

Data Availability Statement

Data available on request from the corresponding authors.

Acknowledgments

The authors would like to thank the anonymous reviewers for their kind suggestions and constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Luo, C.S.; Guo, Z.P.; Xiao, J.X.; Dong, K.; Dong, Y. Effects of Applied Ratio of Nitrogen on the Light Environment in the Canopy and Growth, Development and Yield of Wheat When Intercropped. Front. Plant Sci. 2021, 12, 719850. [Google Scholar] [CrossRef]
Fathi, A. Role of nitrogen (N) in plant growth, photosynthesis pigments, and N use efficiency: A review. Agrisost 2022, 28, 1–8. [Google Scholar] [CrossRef]
Houlès, V.; Guérif, M.; Mary, B. Elaboration of a nitrogen nutrition indicator for winter wheat based on leaf area index and chlorophyll content for making nitrogen recommendations. Eur. J. Agron. 2007, 27, 1–11. [Google Scholar] [CrossRef]
Qiang, B.B.; Zhou, W.X.; Zhong, X.J.; Fu, C.Y.; Cao, L.; Zhang, Y.X.; Jin, X.J. Effect of nitrogen application levels on photosynthetic nitrogen distribution and use efficiency in soybean seedling leaves. J. Plant Physiol. 2023, 287, 154051. [Google Scholar] [CrossRef] [PubMed]
Coggins, S.; McDonald, A.J.; Silva, J.V.; Urfels, A.; Nayak, H.S.; Sherpa, S.R.; Jat, M.L.; Jat, H.S.; Krupnik, T.; Kumar, V.; et al. Data-driven strategies to improve nitrogen use efficiency of rice farming in South Asia. Nat. Sustain. 2025, 8, 22–33. [Google Scholar] [CrossRef]
Yokamo, S.; Irfan, M.; Huan, W.W.; Wang, B.; Wang, Y.L.; Ishfaq, M.; Lu, D.J.; Chen, X.Q.; Cai, Q.L.; Wang, H.Y. Global evaluation of key factors influencing nitrogen fertilization efficiency in wheat: A recent meta-analysis (2000–2022). Front. Plant Sci. 2023, 14, 1272098. [Google Scholar] [CrossRef]
Wang, Z.H.; Skidmore, A.K.; Darvishzadeh, R.; Heiden, U.; Heurich, M.; Wang, T.J. Leaf Nitrogen Content Indirectly Estimated by Leaf Traits Derived from the PROSPECT Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 3172–3182. [Google Scholar] [CrossRef]
Kang, M.; Zhang, L.; Qin, T.Y.; An, J.W.; Wang, C.K.; Wang, S.Y.; Ali, I.; Liu, B.; Liu, L.L.; Tang, L.; et al. Bridging chlorophyll content and vertical nitrogen distribution for accurate canopy photosynthesis simulation. Comput. Electron. Agric. 2025, 239, 110885. [Google Scholar] [CrossRef]
Cai, Y.P.; Guan, K.Y.; Nafziger, E.; Chowdhary, G.; Peng, B.; Jin, Z.N.; Wang, S.W.; Wang, S.B. Detecting In-Season Crop Nitrogen Stress of Corn for Field Trials Using UAV- and CubeSat-Based Multispectral Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 5153–5166. [Google Scholar] [CrossRef]
Zhang, H.D.; Wang, L.Q.; Tian, T.; Yin, J.H. A Review of Unmanned Aerial Vehicle Low-Altitude Remote Sensing (UAV-LARS) Use in Agricultural Monitoring in China. Remote Sens. 2021, 13, 1221. [Google Scholar] [CrossRef]
Maes, W.H.; Steppe, K. Perspectives for Remote Sensing with Unmanned Aerial Vehicles in Precision Agriculture. Trends Plant Sci. 2019, 24, 152–164. [Google Scholar] [CrossRef]
Krienke, B.; Ferguson, R.B.; Schlemmer, M.; Holland, K.; Marx, D.; Eskridge, K. Using an unmanned aerial vehicle to evaluate nitrogen variability and height effect with an active crop canopy sensor. Precis. Agric. 2017, 18, 900–915. [Google Scholar] [CrossRef]
Burchard-Levine, V.; Guerra, J.G.; Borra-Serrano, I.; Nieto, H.; Mesías-Ruiz, G.; Dorado, J.; de Castro, A.I.; Herrezuelo, M.; Mary, B.; Aguirre, E.P.; et al. Evaluating the utility of combining high resolution thermal, multispectral and 3D imagery from unmanned aerial vehicles to monitor water stress in vineyards. Precis. Agric. 2024, 25, 2447–2476. [Google Scholar] [CrossRef]
Sharma, V.; Honkavaara, E.; Hayden, M.; Kant, S. UAV remote sensing phenotyping of wheat collection for response to water stress and yield prediction using machine learning. Plant Stress 2024, 12, 100464. [Google Scholar] [CrossRef]
Liu, Q.S.; Wu, Z.J.; Cui, N.B.; Zheng, S.S.; Jiang, S.Z.; Wang, Z.H.; Gong, D.Z.; Wang, Y.S.; Zhao, L.; Wei, R.J. Estimating stomatal conductance of citrus orchard based on UAV multi-modal information in Southwest China. Agric. Water Manag. 2025, 307, 109253. [Google Scholar] [CrossRef]
Yang, M.D.; Hsu, Y.C.; Chen, Y.H.; Yang, C.Y.; Li, K.Y. Precision monitoring of rice nitrogen fertilizer levels based on machine learning and UAV multispectral imagery. Comput. Electron. Agric. 2025, 237, 110523. [Google Scholar] [CrossRef]
Atanassova, S.; Petrova, A.; Yorgov, D.; Mineva, R.; Veleva, P. Visible and Near-Infrared Spectroscopy for Investigation of Water and Nitrogen Stress in Tomato Plants. Agriengineering 2025, 7, 155. [Google Scholar] [CrossRef]
Berger, K.; Verrelst, J.; Féret, J.B.; Wang, Z.H.; Wocher, M.; Strathmann, M.; Danner, M.; Mauser, W.; Hank, T. Crop nitrogen monitoring: Recent progress and principal developments in the context of imaging spectroscopy missions. Remote Sens. Environ. 2020, 242, 111758. [Google Scholar] [CrossRef]
Goldman, C.V.; Baltaxe, M.; Chakraborty, D.; Arinez, J.; Diaz, C.E. Interpreting learning models in manufacturing processes: Towards explainable AI methods to improve trust in classifier predictions. J. Ind. Inf. Integr. 2023, 33, 100439. [Google Scholar] [CrossRef]
Zhang, X.P.; Hu, Y.T.; Li, X.F.; Wang, P.; Guo, S.K.; Wang, L.; Zhang, C.Y.; Ge, X. Estimation of Rice Leaf Nitrogen Content Using UAV-Based Spectral-Texture Fusion Indices (STFIs) and Two-Stage Feature Selection. Remote Sens. 2025, 17, 2499.
Jiang, C.B.; Guo, X.S.; Li, Y.F.; Lai, N.; Peng, L.; Geng, Q.L. Multimodal Deep Learning Models in Precision Agriculture: Cotton Yield Prediction Based on Unmanned Aerial Vehicle Imagery and Meteorological Data. Agronomy 2025, 15, 1217.
Zhao, C.J.; Wang, J.H.; Huang, W.J.; Zhou, Q.F. Spectral indices sensitively discriminating wheat genotypes of different canopy architectures. Precis. Agric. 2010, 11, 557–567.
Wu, B.; Huang, W.J.; Ye, H.C.; Luo, P.L.; Ren, Y.; Kong, W.P. Using Multi-Angular Hyperspectral Data to Estimate the Vertical Distribution of Leaf Chlorophyll Content in Wheat. Remote Sens. 2021, 13, 1501.
Wang, W.H.; Zheng, H.B.; Wu, Y.P.; Yao, X.; Zhu, Y.; Cao, W.X.; Cheng, T. An assessment of background removal approaches for improved estimation of rice leaf nitrogen concentration with unmanned aerial vehicle multispectral imagery at various observation times. Field Crops Res. 2022, 283, 10854.
Dash, J.; Curran, P.J. Evaluation of the MERIS terrestrial chlorophyll index (MTCI). Adv. Space Res. 2007, 39, 100–104.
Liu, Y.J.; Zhu, X.D. Tracking mangrove light use efficiency using normalized difference red edge index. Ecol. Indic. 2024, 168, 112774.
Tucker, C.J. Red and photographic infrared linear combinations for monitoring vegetation. Remote Sens. Environ. 1979, 8, 127–150.
Rondeaux, G.; Steven, M.; Baret, F. Optimization of soil-adjusted vegetation indices. Remote Sens. Environ. 1996, 55, 95–107.
Burns, B.W.; Green, V.S.; Hashem, A.A.; Massey, J.H.; Shew, A.M.; Adviento-Borbe, M.A.A.; Milad, M. Determining nitrogen deficiencies for maize using various remote sensing indices. Precis. Agric. 2022, 23, 791–811.
Sasagawa, T.; Akitsu, T.K.; Ide, R.; Takagi, K.; Takanashi, S.; Nakaji, T.; Nasahara, K.N. Accuracy Assessment of Photochemical Reflectance Index (PRI) and Chlorophyll Carotenoid Index (CCI) Derived from GCOM-C/SGLI with In Situ Data. Remote Sens. 2022, 14, 5352.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Li, Y.C.; Zeng, H.W.; Zhang, M.; Wu, B.F.; Qin, X.L. Global de-trending significantly improves the accuracy of XGBoost-based county-level maize and soybean yield prediction in the Midwestern United States. GIScience Remote Sens. 2024, 61, 2349341.
Olgun, M.; Onarcan, A.O.; Özkan, K.; Isik, S.; Sezer, O.; Özgisi, K.; Ayter, N.G.; Basçiftçi, Z.B.; Ardiç, M.; Koyuncu, O. Wheat grain classification by using dense SIFT features with SVM classifier. Comput. Electron. Agric. 2016, 122, 185–190.
Zhao, J.; Li, H.; Liu, J.P.; Zhao, M.L.; Yan, Z.X. Integration of spatial attention mechanism into 1D-CNN for prediction of wheat leaf nitrogen concentration from UAV-borne hyperspectral imagery. Comput. Electron. Agric. 2025, 239, 110906.
Zhao, Y.X.; He, J.Y.; Yao, X.; Cheng, T.; Zhu, Y.; Cao, W.X.; Tian, Y.C. Wheat Yield Robust Prediction in the Huang-Huai-Hai Plain by Coupling Multi-Source Data with Ensemble Model under Different Irrigation and Extreme Weather Events. Remote Sens. 2024, 16, 1259.
Venter, J.H.; Snyman, J.L.J. A Note on The Generalized Cross-Validation Criterion in Linear-Model Selection. Biometrika 1995, 82, 215–219.
Cho, S.B.; Soleh, H.M.; Choi, J.W.; Hwang, W.H.; Lee, H.; Cho, Y.S.; Cho, B.K.; Kim, M.S.; Baek, I.; Kim, G. Recent Methods for Evaluating Crop Water Stress Using AI Techniques: A Review. Sensors 2024, 24, 6313.
Gao, S.; Yan, K.; Liu, J.X.; Pu, J.B.; Zou, D.X.; Qi, J.B.; Mu, X.H.; Yan, G.J. Assessment of remote-sensed vegetation indices for estimating forest chlorophyll concentration. Ecol. Indic. 2024, 162, 112001.
Xu, K.; Yang, W.; Ye, H. Thermal infrared reflectance characteristics of natural leaves in 8–14 μm region: Mechanistic modeling and relationships with leaf water content. Remote Sens. Environ. 2023, 294, 113631.
Mu, X.; Chen, Q.; Chen, F.; Yuan, L.; Mi, G. Within-Leaf Nitrogen Allocation in Adaptation to Low Nitrogen Supply in Maize during Grain-Filling Stage. Front. Plant Sci. 2016, 7, 699.
Zhao, D.L.; Reddy, K.R.; Kakani, V.G.; Reddy, V.R. Nitrogen deficiency effects on plant growth, leaf photosynthesis, and hyperspectral reflectance properties of sorghum. Eur. J. Agron. 2005, 22, 391–403.
Lopes, M.S.; Reynolds, M.P. Partitioning of assimilates to deeper roots is associated with cooler canopies and increased yield under drought in wheat. Funct. Plant Biol. 2010, 37, 147–156.
Yu, X.J.; Huo, X.F.; Qian, L.; Du, Y.Y.; Liu, D.K.; Cao, Q.; Wang, W.; Hu, X.T.; Yang, X.F.; Fan, S.S. Combining UAV Multispectral and Thermal Infrared Data for Maize Growth Parameter Estimation. Agriculture 2024, 14, 2004.
Wang, J.J.; Wang, W.T.; Liu, S.Y.; Hui, X.; Zhang, H.H.; Yan, H.J.; Maes, W.H. UAV-Based Multiple Sensors for Enhanced Data Fusion and Nitrogen Monitoring in Winter Wheat Across Growth Seasons. Remote Sens. 2025, 17, 498.
Guimaraes, N.; Sousa, J.J.; Couto, P.; Bento, A.; Padua, L. Combining UAV-Based Multispectral and Thermal Infrared Data with Regression Modeling and SHAP Analysis for Predicting Stomatal Conductance in Almond Orchards. Remote Sens. 2024, 16, 2467.
Pandey, P.; Singh, S.; Khan, M.S.; Semwal, M. Non-invasive Estimation of Foliar Nitrogen Concentration Using Spectral Characteristics of Menthol Mint (Mentha arvensis L.). Front. Plant Sci. 2022, 13, 680282.
Sahoo, M.M.; Tarshish, R.; Tubul, Y.; Sabag, I.; Gadri, Y.; Morota, G.; Peleg, Z.; Alchanatis, V.; Herrmann, I. Multimodal ensemble of UAV-borne hyperspectral, thermal, and RGB imagery to identify combined nitrogen and water deficiencies in field-grown sesame. ISPRS J. Photogramm. Remote Sens. 2025, 222, 33–53.
Gao, D.H.; Qiao, L.; An, L.L.; Zhao, R.M.; Sun, H.; Li, M.Z.; Tang, W.J.; Wang, N. Estimation of spectral responses and chlorophyll based on growth stage effects explored by machine learning methods. Crop J. 2022, 10, 1292–1302.
Szczepaniak, W.; Grzebisz, W.; Potarzycki, J. Yield Predictive Worth of Pre-Flowering and Post-Flowering Indicators of Nitrogen Economy in High Yielding Winter Wheat. Agronomy 2023, 13, 122.
Weiss, M.; Jacob, F.; Duveiller, G. Remote sensing for agricultural applications: A meta-review. Remote Sens. Environ. 2020, 236, 111402.
Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat, F. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204.
Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Lobell, D.B.; Thau, D.; Seifert, C.; Engle, E.; Little, B. A scalable satellite-based crop yield mapper. Remote Sens. Environ. 2015, 164, 324–333.
Basso, B.; Antle, J. Digital agriculture to design sustainable agricultural systems. Nat. Sustain. 2020, 3, 254–256.
Yang, K.L.; Gong, Y.; Fang, S.H.; Duan, B.; Yuan, N.G.; Peng, Y.; Wu, X.T.; Zhu, R.S. Combining Spectral and Texture Features of UAV Images for the Remote Estimation of Rice LAI throughout the Entire Growing Season. Remote Sens. 2021, 13, 3001.

Figure 1. Study area and experimental design.

Figure 2. The UAV and spectral sensors. (a) DJI M210, (b) RedEdge MX multispectral camera, (c) Zenmuse XT2 thermal infrared camera.

Figure 3. Architectures of the four compared models. C1: LSTM; C2: CNN-LSTM; C3: Deep Features-XGBoost; and C4: CNN-LSTM-XGBoost hybrid framework.

Figure 5. Nested cross-validation process for model training and evaluation.

Figure 6. Yield statistics of winter wheat under different stress conditions. (a) Yield distribution across nitrogen stress levels. (b) CDF curves of grain yield. (c) Mean yield and variability among stress levels. Asterisks indicate significance levels: *** p < 0.001. Red lines indicate medians, and circles denote individual data points.

Figure 7. Radar chart of performance for different models.

Figure 8. Confusion matrix of CNN-LSTM under different modalities. (a) Confusion matrix of A1; (b) Confusion matrix of A2; (c) Confusion matrix of A3.

Figure 9. ROC of XGBoost at different stages. The dashed diagonal line indicates the no-discrimination baseline (random classifier; AUC = 0.5).

Figure 10. Confusion matrices of different models. (a) C1: LSTM; (b) C2: CNN-LSTM; (c) C3: Deep Features-XGBoost; (d) C4: CNN-LSTM-XGBoost hybrid framework.

Figure 11. ROC curves of different models. The dashed diagonal line indicates the no-discrimination baseline (random classifier; AUC = 0.5).

Figure 12. Visualization and interpretability analysis. (a) SHAP local contribution distributions of the top 20 features. (b) Global feature importance ranking. (c) Prediction error distribution. (d) t-SNE visualization of deep features.

Figure 13. Spatial prediction map of nitrogen stress based on UAV orthophotos.

Table 1. RedEdge MX and Zenmuse XT2 main parameters.

Parameter	Multispectral Sensor	Thermal Infrared Sensor
Weight	232 g	629 g
Size	87 mm × 59 mm × 45.4 mm	123.7 mm × 112.6 mm × 127.1 mm
Image Resolution	1280 × 960	640 × 512
Band (nm) (Center wavelength, bandwidth)	Blue (475, 32)	Thermal Infrared (7500–13,500)
	Green (560, 27)
	Red (668, 16)
	Red Edge (717, 12)
	Near Infrared (842, 57)

Table 3. Training parameters of different machine learning models.

Model	Parameter	Range
RF	n_estimators	100, 300
	max_depth	None, 10
	min_samples_split	2, 5
	min_samples_leaf	1, 2
XGBoost	n_estimators	300, 500
	learning_rate	0.05, 0.1
	max_depth	3, 5
	subsample	0.7, 0.9
	colsample_bytree	0.7, 1.0
SVM	gamma	0, 1
	C	1, 10
	kernel	linear, rbf
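
As a rough illustration, the candidate values in Table 3 can be written as plain Python dictionaries and enumerated. This is only a sketch: it assumes that the two entries in each "Range" cell are discrete candidates rather than interval endpoints, and that every combination is tried, neither of which the table itself confirms. The names `PARAM_GRIDS` and `enumerate_grid` are hypothetical and do not come from the article.

```python
from itertools import product

# Candidate values transcribed from Table 3.
# Assumption: each cell lists discrete options, not the ends of an interval.
PARAM_GRIDS = {
    "RF": {
        "n_estimators": [100, 300],
        "max_depth": [None, 10],
        "min_samples_split": [2, 5],
        "min_samples_leaf": [1, 2],
    },
    "XGBoost": {
        "n_estimators": [300, 500],
        "learning_rate": [0.05, 0.1],
        "max_depth": [3, 5],
        "subsample": [0.7, 0.9],
        "colsample_bytree": [0.7, 1.0],
    },
    "SVM": {
        "gamma": [0, 1],
        "C": [1, 10],
        "kernel": ["linear", "rbf"],
    },
}


def enumerate_grid(grid):
    """Return every parameter combination of a grid as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


if __name__ == "__main__":
    for model, grid in PARAM_GRIDS.items():
        # Number of candidate configurations per model under an exhaustive search.
        print(model, len(enumerate_grid(grid)))
```

Under these assumptions the search space is small (16 combinations for RF, 32 for XGBoost, 8 for SVM), which is compatible with running the search inside each fold of a nested cross-validation.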

Table 4. Comparison of classification performance of different models based on fused features.

Model	Accuracy	F1_macro	AUC_macro	Kappa
RF	0.7771	0.7768	0.9332	0.7028
XGBoost	0.7931	0.7933	0.9388	0.7241
SVM	0.6944	0.6960	0.8901	0.5926
CNN-LSTM	0.8938	0.8940	0.9698	0.8583

Table 5. Classification performance of machine learning and deep learning models with different modal inputs.

Modality	Model	Accuracy	F1_macro	AUC_macro	Kappa
A1	RF	0.7424	0.7423	0.9106	0.6565
	XGBoost	0.7611	0.7620	0.9195	0.6815
	SVM	0.5708	0.5620	0.8243	0.4278
	CNN-LSTM	0.8333	0.8334	0.9513	0.7778
A2	RF	0.6958	0.6961	0.8884	0.5944
	XGBoost	0.6958	0.6967	0.8904	0.5944
	SVM	0.5722	0.5592	0.8281	0.4296
	CNN-LSTM	0.8833	0.8840	0.9652	0.8444
A3	RF	0.7771	0.7768	0.9332	0.7028
	XGBoost	0.7931	0.7933	0.9388	0.7241
	SVM	0.6944	0.6960	0.8901	0.5926
	CNN-LSTM	0.8938	0.8940	0.9698	0.8583

Table 6. Comparison of classification performance between single-stage and multi-stage.

Stage	Model	Accuracy	F1_macro	AUC_macro	Kappa
jointing	RF	0.7729	0.7727	0.9278	0.6972
	XGBoost	0.7708	0.7710	0.9269	0.6944
	SVM	0.7625	0.7611	0.9185	0.6833
filling	RF	0.7333	0.7334	0.9138	0.6444
	XGBoost	0.7563	0.7577	0.9108	0.6750
	SVM	0.7167	0.7075	0.9066	0.6222
heading	RF	0.8125	0.8131	0.9490	0.7500
	XGBoost	0.8188	0.8189	0.9466	0.7583
	SVM	0.7958	0.7964	0.9344	0.7278
multi-stage	RF	0.8875	0.8877	0.9792	0.8500
	XGBoost	0.9125	0.9131	0.9876	0.8833
	SVM	0.8917	0.8924	0.9796	0.8556
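
The multi-stage gain quoted in the abstract can be recomputed directly from the F1_macro column of Table 6. The sketch below only restates that arithmetic; the dictionary layout and the helper name `gain_in_points` are hypothetical.

```python
# F1_macro values transcribed from Table 6 (heading and multi-stage rows only).
F1_MACRO = {
    "heading": {"RF": 0.8131, "XGBoost": 0.8189, "SVM": 0.7964},
    "multi-stage": {"RF": 0.8877, "XGBoost": 0.9131, "SVM": 0.8924},
}


def gain_in_points(model, baseline="heading", target="multi-stage"):
    """Absolute F1_macro gain in percentage points, rounded to two decimals."""
    diff = F1_MACRO[target][model] - F1_MACRO[baseline][model]
    return round(diff * 100, 2)


if __name__ == "__main__":
    for model in F1_MACRO["heading"]:
        print(model, gain_in_points(model))
```

For XGBoost this gives 0.9131 − 0.8189 = 0.0942, that is, the 9.42 percentage-point increase over the heading stage alone; the corresponding gains for RF and SVM are 7.46 and 9.60 points.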

Table 7. Ablation study of deep learning model components.

Component	Accuracy	F1_macro	AUC_macro	Kappa
C1	0.7000	0.6934	0.8565	0.6000
C2	0.7500	0.7400	0.8785	0.6667
C3	0.8250	0.8244	0.9422	0.7667
C4	0.9208	0.9212	0.9879	0.8944
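
The Accuracy and Kappa columns of Table 7 can be cross-checked against each other. Cohen's kappa is (p_o − p_e) / (1 − p_e), where p_o is the observed accuracy and p_e the chance agreement. If the four nitrogen classes are equally frequent in the evaluation data (an assumption made here, not a statement from the article), p_e equals 1/4 whatever the distribution of predictions. The helper name `kappa_from_accuracy` is hypothetical.

```python
# (Accuracy, Kappa) pairs transcribed from Table 7.
TABLE7 = {
    "C1": (0.7000, 0.6000),
    "C2": (0.7500, 0.6667),
    "C3": (0.8250, 0.7667),
    "C4": (0.9208, 0.8944),
}

N_CLASSES = 4  # four nitrogen levels; equal class frequencies are assumed


def kappa_from_accuracy(accuracy, n_classes=N_CLASSES):
    """Cohen's kappa when the true classes are balanced (p_e = 1 / n_classes)."""
    p_e = 1.0 / n_classes
    return (accuracy - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    for name, (accuracy, reported_kappa) in TABLE7.items():
        derived = kappa_from_accuracy(accuracy)
        # Print the derived value next to the one reported in the table.
        print(name, round(derived, 4), reported_kappa)
```

Under the balanced-class assumption the derived values agree with the reported Kappa to four decimals for all four configurations, so the two columns carry essentially the same information at different scales.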

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kuang, X.; Hou, X.; Wang, D.; Mao, B.; Li, Y.; Chen, D.; Fu, W.; Cheng, Q.; Duan, F.; Li, H.; et al. A CNN-LSTM-XGBoost Hybrid Framework for Interpretable Nitrogen Stress Classification Using Multimodal UAV Imagery. Remote Sens. 2026, 18, 538. https://doi.org/10.3390/rs18040538
