Deep Feature Fusion with Vegetation Indices for Wheat Lodging Monitoring Using UAV Multi-Spectral Imagery

Zhou, Wei; Guo, Yahui; Fu, Yongshuo H.; Hao, Fanghua; Zhang, Xuan; Xu, Le; He, Yuhong

doi:10.3390/rs18111860

by Wei Zhou ^1,2,3,†, Yahui Guo ^1,2,3,*,†, Yongshuo H. Fu ⁴, Fanghua Hao ^1,3,4, Xuan Zhang ⁴, Le Xu ⁵ and Yuhong He ⁶

¹ Key Laboratory for Geographical Process Analysis & Simulation of Hubei Province, Central China Normal University, Wuhan 430079, China
² Academy of Frontier Interdisciplinary Research, Central China Normal University, Wuhan 430079, China
³ College of Urban and Environmental Sciences, Central China Normal University, Wuhan 430079, China
⁴ College of Water Sciences, Beijing Normal University, Beijing 100875, China
⁵ The National Key Laboratory of Smart Farm Technology and Systems, Northeast Agricultural University, Harbin 150030, China
⁶ Department of Geography, Geomatics and Environment, University of Toronto, 3359 Mississauga Road, Mississauga, ON L5L 1C6, Canada
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Remote Sens. 2026, 18(11), 1860; https://doi.org/10.3390/rs18111860

Submission received: 22 April 2026 / Revised: 25 May 2026 / Accepted: 3 June 2026 / Published: 5 June 2026

(This article belongs to the Special Issue Advanced Remote Sensing Techniques in Agriculture and Artificial Intelligence)

Highlights

What are the main findings?

Spectral indices (VARI, EXG, and MCARI) showed strong capability for monitoring wheat lodging.
Deep features, particularly YOLO12 combined with BP, significantly improved monitoring accuracy, achieving the best performance.

What are the implications of the main findings?

Deep features substantially enhance the discriminative power of UAV-based crop monitoring beyond traditional spectral methods.
Integrating spectral and deep features provides a scalable framework for precision agriculture and intelligent crop monitoring.

Abstract

Lodging is a major agricultural hazard that can substantially reduce crop yields. Timely and accurate monitoring of winter wheat lodging is important for assessing potential yield losses, guiding field management, and mitigating further lodging damage. Recent advances in unmanned aerial vehicle (UAV) remote sensing and artificial intelligence have provided new opportunities for lodging assessment. In this study, a novel monitoring framework was proposed by integrating deep features extracted from UAV multi-spectral images with machine learning algorithms. Sensitivity analysis was conducted to identify vegetation indices (VIs), which are highly correlated with lodging. These sensitive VIs were combined with original multi-spectral bands, and YOLOv8, YOLO12, SAM1, and SAM2 were used for feature extraction. The SHAP method was applied to analyze feature importance and model interpretability. The results indicated that VARI, EXG, and MCARI were the most effective VIs for lodging monitoring. Furthermore, three feature representations, including a spectral feature set, deep features, and fused features, were evaluated. The highest accuracy was achieved using YOLO12 deep features combined with a BP classifier, reaching an accuracy of 98.20%, a precision of 98.38%, a recall of 98.56%, and an F1-score of 98.56%. Overall, incorporating deep features significantly improved monitoring performance. The proposed framework provides an accurate and effective approach for crop lodging monitoring using UAV multi-spectral imagery.

Keywords:

UAV remote sensing; lodging monitoring; deep feature; machine learning; SHAP analysis

1. Introduction

Wheat (Triticum aestivum L.) is one of the world’s three major staple crops and serves as an important source of dietary energy for the global population, playing a pivotal role in global food security [1,2]. In China, wheat constitutes a cornerstone of the national grain production system [3]. The North China Plain (NCP) is one of the main wheat-producing regions in China, and the stability of its wheat production is of great significance for ensuring regional food security [4]. In recent years, climate change has intensified the frequency and severity of extreme weather events, including strong winds, heavy rainfall, and typhoons, thereby elevating the risk of winter wheat lodging during critical growth stages from jointing to grain filling [5]. Lodging, typically classified into stem lodging and root lodging, severely disrupts water and nutrient transport within plants, diminishes photosynthetic efficiency, and impedes grain filling [6]. These physiological perturbations ultimately culminate in substantial yield reductions [7]. Consequently, timely and accurate lodging monitoring is imperative for disaster assessment, yield prediction, agricultural insurance indemnification, and evidence-based governmental decision-making pertaining to disaster mitigation and relief [8,9].

Traditional wheat lodging monitoring has relied primarily on manual field surveys, which are time-consuming, labor-intensive, and costly [10]. In contrast, remote sensing techniques enable rapid, efficient, and large-scale monitoring of lodging distribution and provide a macroscopic perspective that is unattainable through conventional ground-based observations [11,12]. In particular, satellite remote sensing offers long-term and wide-area image time series, making it well-suited for continuous monitoring of crop lodging over extended temporal scales [13]. Guan et al. (2022) proposed a gridded lodging percentage estimation method based on Sentinel-2 imagery and analyzed the spectral responses corresponding to different lodging levels [14]. The approach achieved an R² of 0.64 and an RMSE of 25.24 for lodging percentage estimation on the test dataset [14]. Tang et al. (2022) developed a PTCNet deep learning semantic segmentation model using high-resolution GF-2 imagery [15]. By integrating multi-scale features with vegetation indices and edge information, large-scale wheat lodging areas were extracted, achieving an F1-score of 85.31% and an intersection-over-union (IoU) of 74.38% [15]. Although satellite remote sensing offers clear advantages for large-scale crop lodging monitoring, it is highly susceptible to cloud cover and precipitation, which frequently lead to data gaps [16].

Unmanned aerial vehicles (UAVs), as highly flexible remote sensing platforms, can acquire centimeter-level high-resolution imagery, enabling near real-time observations at the field scale [17,18]. In particular, UAVs equipped with multi-spectral sensors can capture richer spectral information than conventional RGB cameras. Numerous studies have investigated crop lodging monitoring by extracting spectral features, vegetation indices, color features, and texture features from UAV multi-spectral imagery [19,20]. Other studies have directly used multi-spectral imagery or vegetation indices derived from it as inputs for deep learning models. For example, Yang et al. (2022) employed a ResNet-50 model to classify maize lodging severity using UAV multi-spectral imagery, achieving an overall accuracy of 96.32% and a Kappa coefficient of 0.95 [21]. Similarly, Zhao et al. (2025) combined 22 multi-spectral vegetation indices with a Swin-T model to identify maize lodging type, severity, and direction, obtaining an overall accuracy of 96.02% and a Kappa coefficient of 0.95 [22]. Despite these advances, vegetation indices, color features, and texture features are generally regarded as shallow features and have limited capacity to fully exploit the information content of multi-spectral imagery. As a result, lodging monitoring performance based solely on these features may be suboptimal under complex field conditions [23].

Deep features are abstract representations that are automatically learned by deep learning models through nonlinear transformations within hierarchical network structures and are used to characterize the semantic information of input data [24]. As the network depth increases, deep features become increasingly abstract, exhibiting stronger representational capacity and improved generalization performance. Deep features have been widely applied in crop disease and pest identification tasks. Sethy et al. (2020) extracted deep features of rice leaf diseases using 11 convolutional neural networks, including AlexNet, VGG series, GoogLeNet, and ResNet series, and combined these features with an SVM for classification [25]. The results showed that the combination of ResNet50-derived deep features and SVM achieved the best performance, with an F1-score of 98.38% [25]. Similarly, Dash et al. (2023) extracted deep features of maize leaf diseases using DenseNet201 and performed classification using a Bayesian-optimized SVM [26]. Their approach achieved a classification accuracy of 94.6%, outperforming an SVM model without deep feature integration. These studies indicate that a two-stage strategy, in which deep features are first extracted using deep learning models and then modeled with machine learning algorithms, can yield high performance comparable to that of end-to-end deep learning approaches. Moreover, compared with the “black-box” nature of deep learning models, this strategy exploits the strengths of machine learning in feature contribution analysis, thereby enhancing model interpretability. By applying interpretability methods such as SHapley Additive exPlanations (SHAP), which are based on Shapley values, the contribution of each input feature to the model output can be quantitatively assessed [27]. Han et al. (2022) extracted multi-source features from RGB and multi-spectral UAV imagery to develop a SMOTE-ENN-XGBoost model for maize lodging monitoring and employed SHAP for feature contribution analysis, achieving a high F1-score of 0.93 under imbalanced data conditions [28].

Deep learning models used for feature extraction have traditionally been based on early convolutional neural network (CNN) architectures, such as AlexNet, VGG, GoogLeNet, and ResNet [29]. These models are generally characterized by relatively shallow network depths and simple structural designs, which limit their feature representation capacity and restrict their ability to capture global contextual relationships. With the rapid advancement of deep learning, architectures with stronger feature mining capabilities have been continuously proposed, among which the Transformer architecture has attracted particular attention [30]. Transformers are capable of effectively modeling long-range dependencies and capturing global contextual information, thereby substantially enhancing the recognition of complex patterns in imagery. In recent years, the YOLO series and the Segment Anything Model (SAM) series have emerged as representative advanced frameworks for object detection and semantic segmentation, owing to their efficient network designs and superior performance compared with early CNN-based models [31]. The YOLO series adopts a CNN-based architecture and achieves efficient real-time detection by jointly optimizing feature extraction and object localization in an end-to-end framework, with YOLOv8 and the latest YOLO12 being among the most representative models. The SAM framework is a large-scale, prompt-driven image segmentation model based on the Transformer architecture, exhibiting strong zero-shot segmentation capabilities that enable automatic or interactive segmentation of arbitrary objects without retraining. Given the high computational cost and practical difficulty of end-to-end retraining, SAM is typically adapted using lightweight fine-tuning strategies [32]. In this context, extracting deep features from SAM has emerged as a low-cost and efficient approach for exploiting the representational power of large-scale segmentation models. Following the release of SAM1, the more advanced SAM2 further enhanced segmentation performance and generalization capability. These advanced architectures provide new opportunities for extracting more discriminative and robust deep features for crop lodging monitoring. However, their potential, particularly that of YOLO and SAM, remains insufficiently explored in UAV multi-spectral imagery-based winter wheat lodging monitoring.

In this study, we propose a winter wheat lodging monitoring method that integrates deep features from UAV multi-spectral imagery with machine learning algorithms to improve monitoring accuracy. The main contributions of this study are summarized as follows: (1) the Spearman rank correlation coefficient (Spearman’s ρ) was employed to select the optimal vegetation indices from 14 candidate indices for constructing the spectral features used to monitor winter wheat lodging; (2) deep feature extractors based on YOLOv8, YOLO12, SAM1, and SAM2 were designed to extract deep features from the spectral feature set; (3) winter wheat lodging monitoring models based on Random Forest (RF), Extreme Gradient Boosting (XGBoost), Backpropagation Neural Network (BP), and Linear Discriminant Analysis (LDA) were constructed using the spectral feature set, deep features, and fused features as inputs, respectively, with SHAP employed for feature contribution analysis and model interpretability.

2. Materials and Methods

2.1. Study Area

Multi-spectral imagery of lodged winter wheat was acquired at the Nanpi Eco-Agricultural Experimental Station (NEES), located in Nanpi County, Hebei Province, China (116°40′E, 38°02′N) (Figure 1a). The winter wheat at the experimental station was planted in November of the previous year and harvested in early July of the following year. The wheat experiments at the station covered several projects, such as determining optimal fertilization management practices and testing different wheat varieties. Shortly before harvest, strong winds caused severe lodging over large areas of mature winter wheat. As illustrated in Figure 1b, field photographs of the lodged winter wheat were collected by on-site personnel to document lodging conditions.

2.2. Data Collection

2.2.1. UAV Multi-Spectral Image Acquisition

Multi-spectral imagery of lodged winter wheat was acquired using a DJI Phantom 4 Multi-spectral UAV (DJI, Shenzhen, China, https://ag.dji.com/p4-multispectral (accessed on 2 June 2026)). The onboard imaging system comprises one RGB sensor for visible-light imaging and 5 monochrome sensors for multi-spectral data acquisition. The multi-spectral bands include blue (B: 450 nm ± 16 nm), green (G: 560 nm ± 16 nm), red (R: 650 nm ± 16 nm), red-edge (RE: 730 nm ± 16 nm), and near-infrared (NIR: 840 nm ± 26 nm). The inclusion of the RE and NIR bands provides enhanced sensitivity to variations in the canopy structure associated with winter wheat lodging. The UAV followed pre-planned flight routes at a flight height of 60 m above ground level, with forward and side overlaps of 85% and 80%, respectively. All multi-spectral images were collected between 11:00 and 12:00 to minimize illumination variability, reduce shadow effects, and ensure relatively stable solar radiation conditions around noon.

2.2.2. Image Preprocessing

Agisoft Metashape (version 1.0.0.1; https://www.agisoft.com/) was used to generate 5-band multi-spectral orthomosaic images of lodged winter wheat, with a spatial resolution of 3.13 cm/pixel. To distinguish lodged winter wheat from healthy wheat, manual pixel-level annotation was performed. The region of interest (ROI) tool in ENVI 5.6 (https://www.nv5geospatialsoftware.com/Products/ENVI (accessed on 2 June 2026)) was employed to delineate lodging areas and generate corresponding label maps for model development (Figure 1c). The orthomosaic images and their corresponding label maps were subsequently cropped into 256 × 256 image patches. Patches that were located at image boundaries, contained no lodged winter wheat, or exhibited severely imbalanced ratios between lodged and non-lodged winter wheat were excluded. After filtering, a total of 163 image patches were retained for subsequent analysis (Figure 1d).
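The cropping and filtering step can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names and the `min_frac` and `max_frac` thresholds are hypothetical, since the text does not state the imbalance criterion that was applied.

```python
def patch_origins(height, width, size=256):
    """Top-left corners of non-overlapping size x size patches that fit
    entirely inside the image (partial patches at the border are skipped)."""
    return [(r, c)
            for r in range(0, height - size + 1, size)
            for c in range(0, width - size + 1, size)]


def lodged_fraction(label, r, c, size=256):
    """Share of pixels labelled 1 (lodged) inside one patch of a 0/1 label map."""
    total = sum(sum(row[c:c + size]) for row in label[r:r + size])
    return total / (size * size)


def keep_patch(frac, min_frac=0.1, max_frac=0.9):
    """Hypothetical balance rule: drop patches with no lodging or with a
    strongly one-sided class ratio. Both thresholds are assumptions."""
    return min_frac <= frac <= max_frac


# Tiny 4 x 4 label map cut into 2 x 2 patches, for illustration only.
label = [[1, 1, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
origins = patch_origins(4, 4, size=2)
fractions = [lodged_fraction(label, r, c, size=2) for r, c in origins]
kept = [o for o, f in zip(origins, fractions) if keep_patch(f)]
```

With the toy thresholds above, only the mixed patch at the top-left corner is kept; the fully lodged and fully non-lodged patches are discarded.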

2.3. Methods

The overall data processing and analysis workflow comprised four stages (Figure 2). The first stage was UAV multi-spectral image acquisition and data processing (Section 2.2). The third and fourth stages focused on feature extraction and on model training and validation, respectively.

2.3.1. Vegetation Indices Extraction

Lodged and non-lodged winter wheat differ in canopy structure, leaf health, and spectral reflectance, and vegetation indices can effectively enhance these differences [33]. The 14 vegetation indices listed in Table 1 are widely used to characterize vegetation health and therefore have the potential to discriminate between lodged and non-lodged winter wheat. To quantify the relationship between vegetation indices and lodging area, Spearman’s ρ was used to rank the correlations between each vegetation index and lodging. The top 3 vegetation indices were selected from the 14 candidate vegetation indices and combined with the 5 original multi-spectral bands, including blue, green, red, red-edge, and NIR, to construct the spectral feature set. Spearman’s ρ is a nonparametric statistical measure used to assess the strength and direction of a monotonic relationship between two variables. It is computed based on the ranks of the original data rather than their absolute values, with correlation coefficients ranging from −1 to 1. Values closer to −1 or 1 indicate stronger negative or positive correlations, respectively, whereas values close to 0 indicate a weak or no correlation. The formulation of Spearman’s ρ is given as follows:

\rho = 1 - \frac{6 \sum_{i = 1}^{n} d_{i}^{2}}{n (n^{2} - 1)} \qquad (1)

where n is the number of samples, and d_i represents the difference between the ranks of the i-th observation for the two variables.
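As an illustration of Equation (1) and of the index selection, the sketch below computes Spearman’s ρ for untied data and ranks indices by the magnitude of their correlation. Two points are assumptions rather than statements from the text: Equation (1) is exact only when there are no tied ranks (library implementations typically use average ranks for ties, which matter for a binary lodging label), and ranking by absolute value is one plausible reading of "most strongly associated". The ρ values in the example are the ones reported in Section 3.1.1.

```python
def ranks(values):
    """1-based ranks of the values; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho from Equation (1), valid when there are no ties."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def top_k(scores, k=3):
    """Names of the k indices with the largest |rho| (assumed selection rule)."""
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:k]


# Small worked example: rank differences are -1, 1, -1, 1, 0.
rho_example = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])

# Correlations reported in Section 3.1.1 for ten of the candidate indices.
reported = {"VARI": 0.5981, "EXG": 0.5977, "MCARI": 0.5967, "TGI": 0.5641,
            "GMR": 0.5570, "GB": 0.4800, "NDIg": -0.5954, "NRI": -0.4681,
            "NDRE": -0.3088, "LCI": -0.2759}
selected = top_k(reported)  # -> ['VARI', 'EXG', 'MCARI']
```

In the worked example the squared rank differences sum to 4, so ρ = 1 − 24/120 = 0.8.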

2.3.2. Deep Feature Extraction

To comprehensively evaluate the effectiveness of different deep feature representations for winter wheat lodging monitoring, 4 deep feature extractors were designed based on the YOLOv8 and YOLO12 architectures and the SAM1 and SAM2 architectures. Overall, we employed 2 convolution-based and 2 ViT-based deep feature extractors to extract deep features from two types of 3-channel inputs in the spectral feature set: the RGB bands and the 3 selected vegetation indices.

The YOLOv8 and YOLO12 deep feature extractors both consisted of a backbone and a neck, while the segmentation head commonly used in YOLO architectures was removed. Only the earliest high-level semantic features in the neck were retained as deep feature representations. Although these features had the lowest spatial resolution, they encode high-level semantic information, making them well-suited for representing large-scale lodging patterns. The YOLOv8 deep feature extractor (Figure 3a) consisted of two key modules: C2f (CSP Bottleneck with two convolutions) and SPPF (Spatial Pyramid Pooling Fast). The C2f module fused residual block features with the original features through convolution operations, enhancing feature integration while reducing computational cost. SPPF expanded the receptive field through multiple max-pooling operations, capturing multi-scale features while maintaining computational efficiency. The YOLO12 deep feature extractor (Figure 3b) removed SPPF and replaced part of the C2f modules with C3k2 and A2C2f. A2C2f built upon C2f by incorporating Area Attention, retaining cross-stage feature fusion and residual connections, and enhancing the detection of large objects. C3k2 used smaller convolutional kernels and residual structures, while retaining cross-stage branches and concatenation from C2f. In this study, the nano-scale variants, namely YOLOv8n and YOLO12n, were adopted to initialize the YOLO-based deep feature extractors using their official pretrained weights and default configurations. The model architecture, initialization settings, and pretrained parameters were kept unchanged before feature extraction, including the default backbone, neck, prediction head, convolutional layers, activation functions, and normalization settings. Given a three-channel input image of size 256 × 256 pixels, the deep feature extractor produced a feature map with 64 channels at a spatial resolution of 32 × 32. 
The extracted deep feature maps were then upsampled to 256 × 256 using bilinear interpolation, corresponding to an upsampling factor of 8.
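The upsampling step for a single feature channel can be sketched as below. This is a plain-Python illustration under an assumed half-pixel sampling convention; the text does not say which bilinear convention or library routine was used, and the function name is hypothetical.

```python
def bilinear_upsample(grid, factor):
    """Upsample a 2-D list of floats by an integer factor using bilinear
    interpolation with half-pixel centres (an assumed convention)."""
    h, w = len(grid), len(grid[0])
    out = []
    for oy in range(h * factor):
        # Map the output pixel centre back into input coordinates and clamp.
        sy = min(max((oy + 0.5) / factor - 0.5, 0.0), h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        wy = sy - y0
        row = []
        for ox in range(w * factor):
            sx = min(max((ox + 0.5) / factor - 0.5, 0.0), w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            wx = sx - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# One 32 x 32 channel upsampled by a factor of 8 gives a 256 x 256 map.
channel = [[float(r + c) for c in range(32)] for r in range(32)]
upsampled = bilinear_upsample(channel, 8)
```

Applied to every channel, this brings the 32 × 32 YOLO feature maps to the 256 × 256 size of the input patch so that each pixel has a deep feature vector.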

SAM1 and SAM2 were both based on Vision Transformer and were designed to achieve a zero-shot, general-purpose segmentation model. SAM1 generated over 1 billion labeled masks from 11 million images to achieve zero-shot transfer of the model. Furthermore, SAM2 employed a more advanced Hiera image encoder, achieving notable improvements in both accuracy and inference speed over SAM1. SAM1 deep feature extractors (Figure 3c) used a Vision Transformer (ViT) structure pretrained with MAE. Multi-head self-attention and feedforward networks were used to capture long-range dependencies of the patch embeddings. SAM2 deep feature extractors (Figure 3d) adopted a stacked ViT module architecture and replaced part of the Global Attention with Mask Unit Attention to reduce the computational cost associated with extensively stacked Global Attention. This design was based on the observation that Global Attention in the first two stages of the image encoder has a limited impact. Therefore, local attention was applied in the first 2 stages, while Global Attention was used in the remaining stages. In this study, the official SAM1 ViT-B and SAM2 base-plus pretrained models were adopted to initialize the corresponding SAM1- and SAM2-based deep feature extractors using their default configurations. The model architecture, initialization settings, and pretrained parameters were kept unchanged before feature extraction, including the default image encoder, prompt encoder, and mask decoder configurations. The pretrained parameters were not fine-tuned on the winter wheat lodging dataset, and the models were used only as deep feature extractors. Given a 3-channel input image with a spatial resolution of 256 × 256, the SAM-based feature extractors generated deep feature maps with 256 channels at a spatial resolution of 64 × 64, which were subsequently upsampled to 256 × 256 using bilinear interpolation with a factor of 4. 
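Compared with the YOLO-based feature extractors, the SAM-based extractors produced feature maps with four times as many channels (256 versus 64) and twice the spatial resolution along each axis (64 × 64 versus 32 × 32), that is, four times as many spatial positions.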

2.3.3. Model Establishment

RF, XGBoost, BP Neural Network, and LDA Algorithm were employed to construct winter wheat lodging monitoring models. RF is an ensemble method that integrates multiple decision trees through bootstrap sampling and random feature selection. XGBoost is a gradient boosting framework that builds additive decision trees using gradient information and regularization to improve decision accuracy. The BP Neural Network, a feedforward neural model trained via backpropagation, is capable of modeling complex nonlinear relationships between deep features and lodging labels. LDA served as a classical linear classifier that identifies discriminative projection directions by maximizing between-class variance while minimizing within-class variance. Model hyperparameters were optimized using a grid search strategy, which exhaustively evaluated parameter combinations within the predefined search space shown in Table 2 and selected the optimal configuration based on performance.
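The exhaustive grid search described above can be sketched generically as follows. The search space and the scoring function here are hypothetical stand-ins: the actual parameter ranges are those listed in Table 2, and in practice the score would be a validation metric obtained by training a model with each configuration.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Score every parameter combination and return the best one.

    param_grid: dict mapping a parameter name to a list of candidate values.
    evaluate:   callable that takes a dict of parameters and returns a score
                where higher is better (for example, validation accuracy).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space for a random forest; not the values in Table 2.
rf_grid = {"n_estimators": [100, 200, 500], "max_depth": [5, 10, 20]}


def toy_score(params):
    """Stand-in for model training and validation so the sketch runs alone."""
    return -abs(params["n_estimators"] - 200) - abs(params["max_depth"] - 10)


best_params, best_score = grid_search(rf_grid, toy_score)
```

The cost grows with the product of the candidate list lengths (nine evaluations in this toy grid), which is why the search space is kept to a predefined, limited set of values.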

2.3.4. SHapley Additive exPlanations

SHAP is a recently developed interpretable artificial intelligence approach designed to elucidate model decision-making processes. It quantifies the contribution of individual input variables to model outputs, thereby explicitly evaluating the importance of each feature in the prediction results. SHAP analysis was applied to investigate the contributions of spectral features and deep features to the decision-making of winter wheat lodging monitoring models under different machine learning algorithms. The aim is to provide an interpretable analytical basis for this two-step winter wheat lodging monitoring strategy. The SHAP value is calculated as follows:

\Phi_{i}(f) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (|N| - |S| - 1)!}{|N|!} \left[ f(S \cup \{i\}) - f(S) \right] \qquad (2)

where N denotes the complete set of input features, and S represents a subset of features excluding feature i. Φ_i(f) corresponds to the SHAP value associated with feature i, and f(S) indicates the expected model output conditioned on the feature subset S. All SHAP analyses in this study were implemented using Python 3.10 with SHAP version 0.48.0.
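To make Equation (2) concrete, the sketch below evaluates it exactly by enumerating all feature subsets for a toy value function. It is an illustration only: the SHAP package relies on model-specific or sampling-based estimators rather than this exponential enumeration, and the feature weights and interaction bonus are invented for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values from Equation (2).

    features: list of feature names (the set N).
    value:    callable mapping a frozenset S of features to f(S).
    """
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):  # |S| runs from 0 to |N| - 1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


# Hypothetical value function: additive contributions plus an interaction
# bonus that only appears when VARI and EXG are both present.
WEIGHTS = {"VARI": 0.5, "EXG": 0.3, "MCARI": 0.2}


def toy_value(subset):
    bonus = 0.4 if {"VARI", "EXG"} <= subset else 0.0
    return sum(WEIGHTS[f] for f in subset) + bonus


phi = shapley_values(list(WEIGHTS), toy_value)
```

In this toy case the interaction bonus is split equally between VARI and EXG, and the values sum to f(N) − f(∅), which is the additivity property that makes Shapley values usable as feature contributions.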

2.4. Experimental Design and Model Evaluation

To comprehensively evaluate the contribution of different feature types, three experimental configurations were designed: (1) winter wheat lodging monitoring based on a spectral feature set; (2) winter wheat lodging monitoring based on deep features; and (3) winter wheat lodging monitoring based on fused features, where the spectral feature set in (1) and the deep features in (2) were concatenated along the channel dimension to construct the model inputs. The dataset was divided into training, validation, and test sets in a ratio of 7:2:1. All feature maps were standardized to a spatial size of 256 × 256 × n, where n denotes the feature dimension. Each feature map was then reshaped into a vector of size 65536 × 1 × n to ensure spatial consistency and enable subsequent machine learning model construction. The corresponding lodging label maps were reshaped in the same manner; therefore, each pixel was treated as an individual sample with its associated feature vector and class label. The machine learning models were, therefore, trained to classify each pixel as lodged or non-lodged based on the spectral, deep, or fused feature representations.
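The reshaping and channel-wise fusion can be sketched as below, using nested lists in place of arrays. The helper names are hypothetical, and the 2 × 2 map with 3 deep channels is a toy size chosen for readability; in the study each 256 × 256 patch yields 65536 pixel samples.

```python
def flatten_features(feature_map):
    """Turn an H x W x n feature map into H*W rows of length n."""
    return [list(pixel) for row in feature_map for pixel in row]


def flatten_labels(label_map):
    """Turn an H x W label map into a list of H*W class labels."""
    return [value for row in label_map for value in row]


def fuse(spectral_rows, deep_rows):
    """Concatenate per-pixel spectral and deep features along the channel axis."""
    if len(spectral_rows) != len(deep_rows):
        raise ValueError("both feature maps must cover the same pixels")
    return [s + d for s, d in zip(spectral_rows, deep_rows)]


# Toy 2 x 2 patch: 8 spectral features (5 bands + 3 indices) and 3 deep channels.
spectral = [[[float(r * 2 + c)] * 8 for c in range(2)] for r in range(2)]
deep = [[[0.5] * 3 for _ in range(2)] for _ in range(2)]
labels = [[0, 1], [1, 0]]

X = fuse(flatten_features(spectral), flatten_features(deep))
y = flatten_labels(labels)
```

Each row of `X` is one pixel sample and each entry of `y` is its lodged or non-lodged label, which is the form a pixel-wise classifier expects.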

For each training run, all machine learning algorithms were trained using the same set of hyperparameters. Model performance was systematically evaluated using accuracy, F1-score, recall, and precision, which are defined as follows:

\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \qquad (3)

\mathrm{F1\text{-}score} = \frac{2 \times TP}{2 \times TP + FN + FP} \qquad (4)

\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (5)

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (6)

where TP (True Positive) represents winter wheat that is actually lodged and correctly predicted as lodging, FP (False Positive) refers to background pixels incorrectly predicted as lodging, FN (False Negative) denotes winter wheat that is actually lodged but incorrectly predicted as background, and TN (True Negative) represents background pixels correctly predicted as background.
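Equations (3) to (6) can be computed from pixel-level predictions as in the following sketch. The function names are illustrative, labels are assumed to be coded 1 for lodged and 0 for background, and the sketch assumes at least one predicted and one actual lodged pixel so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, and TN for binary labels (1 = lodged, 0 = background)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, F1-score, precision, and recall as in Equations (3) to (6)."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "f1": 2 * tp / (2 * tp + fn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Eight toy pixels: 3 TP, 1 FP, 1 FN, 3 TN.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
```

Equation (4) is algebraically the harmonic mean of precision and recall, so the F1-score always lies between those two values.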

3. Results

3.1. Analysis of Extracted Image Features

3.1.1. Optimized Vegetation Index

To fully exploit the spectral information contained in multi-spectral data, a total of 14 vegetation indices were calculated. Spearman’s ρ was then applied to identify three indices that were most strongly associated with winter wheat lodging. Figure 4a presents the Spearman correlation coefficients between the 14 vegetation indices and winter wheat lodging. VARI (ρ = 0.5981), EXG (ρ = 0.5977), and MCARI (ρ = 0.5967) exhibited the highest correlation values, all showing positive relationships with lodged areas. Moreover, TGI (ρ = 0.5641), GMR (ρ = 0.5570), and GB (ρ = 0.4800) were positively correlated, whereas NDIg (ρ = −0.5954), NRI (ρ = −0.4681), NDRE (ρ = −0.3088), and LCI (ρ = −0.2759) showed negative correlations.

Figure 4b shows the RGB image used as a visual reference. Figure 4c–e illustrates pronounced differences among the three selected vegetation index maps. VARI exhibited relatively low overall brightness while effectively suppressing non-lodged areas, resulting in strong regional segmentation performance. EXG could also reliably delineate lodged regions. However, its high sensitivity to the green spectral component caused saturation under varying lodging severities, which reduced its discriminative capability compared with VARI. In contrast, MCARI displayed substantially higher brightness and an extremely high sensitivity to winter wheat lodging. Any heterogeneous disturbance originating from the background or canopy structure was significantly amplified, leading to misclassification of non-lodged areas as lodged in some regions and thus reducing the robustness of its discrimination performance. Overall, all three vegetation indices effectively distinguished lodged from non-lodged winter wheat areas and were suitable as input features for subsequent deep feature extraction.
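For reference, the three selected indices can be computed per pixel as sketched below. The formulas are the commonly cited definitions and are an assumption here: Table 1 of the article may use different variants (for example, EXG on normalized chromatic coordinates, or a different red-edge band in MCARI). The reflectance values are made up for the example.

```python
def vari(blue, green, red):
    """Visible Atmospherically Resistant Index: (G - R) / (G + R - B)."""
    return (green - red) / (green + red - blue)


def exg(blue, green, red):
    """Excess Green index: 2G - R - B."""
    return 2 * green - red - blue


def mcari(green, red, red_edge):
    """Modified Chlorophyll Absorption in Reflectance Index:
    ((RE - R) - 0.2 * (RE - G)) * (RE / R)."""
    return ((red_edge - red) - 0.2 * (red_edge - green)) * (red_edge / red)


def spectral_feature_vector(blue, green, red, red_edge, nir):
    """Five bands plus the three selected indices: 8 features per pixel."""
    return [blue, green, red, red_edge, nir,
            vari(blue, green, red),
            exg(blue, green, red),
            mcari(green, red, red_edge)]


# Made-up reflectances for a single pixel.
features = spectral_feature_vector(0.04, 0.08, 0.06, 0.30, 0.45)
```

VARI and EXG use only the visible bands, whereas MCARI also draws on the red-edge band, which may partly explain the different visual behaviour of the three maps.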

3.1.2. Deep Features Analysis

Deep feature extraction was performed using YOLOv8, YOLO12, SAM1, and SAM2 as feature extractors, with two types of three-channel inputs from the spectral feature set: the RGB channels and the selected vegetation indices, namely VARI, ExG, and MCARI. The extracted deep features were subsequently integrated by upsampling and channel-wise stacking to construct unified deep feature representations. Figure 5 illustrates the results of eight representative single-channel deep feature maps. Most feature channels exhibited clear responses to winter wheat lodging. Figure 5a–d shows the deep feature maps extracted by YOLOv8 and YOLO12. Due to their lower spatial resolution and higher level of semantic abstraction, these features captured lodging-related characteristics at more abstract semantic levels while requiring less storage space, making them suitable for efficient computation. Figure 5e–h presents the deep feature maps extracted by SAM1 and SAM2. Compared with the YOLO-based features, the SAM-based feature maps had larger spatial dimensions and a lower degree of semantic abstraction. They preserved more fine-grained structural information and enabled clearer discrimination between lodged and non-lodged winter wheat. However, the SAM-based features contained four times as many channels as those derived from YOLOv8 and YOLO12, resulting in substantially higher storage requirements and computational costs.

3.2. Results of Winter Wheat Lodging Monitoring

3.2.1. Winter Wheat Lodging Monitoring Based on Spectral Feature Set

In this study, the spectral feature set was used to build the models, serving as the baseline for winter wheat lodging monitoring. Table 3 summarizes the performance of the winter wheat lodging monitoring models developed on RF, BP, XGBoost, and LDA. Among the four models, the BP model achieved the best overall performance, with an accuracy of 84.93%, a recall of 88.86%, a precision of 86.06%, and the highest F1-score of 87.44%. The RF and XGBoost models showed similar performance to the BP model, with accuracies of 84.83% and 84.2%, respectively. In contrast, the LDA model yielded the lowest performance, with an accuracy of 83.12%, an F1-score of 85.77%, a precision of 85.32%, and a recall of 86.22%. Figure 6 presents the winter wheat lodging monitoring results produced by the four models. All models successfully identified the lodged winter wheat areas. Overall, while differences in performance among the four spectral feature-based models were relatively small, the results indicated substantial potential for further improvement in monitoring accuracy.

3.2.2. Winter Wheat Lodging Monitoring Based on Deep Features

Winter wheat lodging monitoring models were constructed using RF, BP, XGBoost, and LDA based on deep features extracted by YOLOv8, YOLO12, SAM1, and SAM2. Figure 7 displays the performance of 16 models across accuracy, precision, recall, and F1-score using a heatmap representation. Overall, the models based on RF and BP consistently outperformed those based on XGBoost and LDA across different deep feature types. YOLOv8-BP and YOLO12-BP achieved the highest accuracies of 98.19% and 98.20%, respectively, with precision, recall, and F1-score also exceeding 98%. RF showed slightly lower but still robust performance, with all evaluation metrics exceeding 97%. In contrast, XGBoost and LDA exhibited markedly lower performance. In particular, YOLOv8-LDA achieved only 77.45% accuracy, lower than the spectral feature-based baseline, indicating that LDA was less effective in using complex deep features. For SAM-based deep features, performance differences among models were relatively smaller. The RF model achieved the best performance, with SAM2-RF outperforming SAM1-RF, yielding an accuracy of 97.82% and 96.64%, respectively. XGBoost and LDA demonstrated improved performance on SAM-based deep features compared with YOLO-based features. The accuracies of SAM1-XGBoost and SAM2-XGBoost reached 91.77% and 93.98%, respectively, while those of SAM1-LDA and SAM2-LDA were 88.63% and 89.63%, all exceeding the baseline performance based on the spectral feature set. Overall, the YOLO12-BP model achieved the best overall performance, attaining the highest values across all evaluation metrics except precision, with an accuracy of 98.20%, an F1-score of 98.56%, a precision of 98.38%, and a recall of 98.56%.

Figure 8 shows the lodging prediction maps of the 16 models. The predictions of the RF and BP models were highly consistent with the labels for all deep feature types, showing excellent overall performance. In contrast, XGBoost produced fragmented predictions with YOLO-based features but improved prediction quality with SAM-based features. The LDA model behaved similarly to XGBoost, showing severe fragmentation and misclassification on YOLO-based features.

3.2.3. Winter Wheat Lodging Monitoring Based on Fused Features

To comprehensively evaluate the contribution of fused features to winter wheat lodging monitoring, the spectral feature set was concatenated along the channel dimension with deep features extracted by YOLOv8, YOLO12, SAM1, and SAM2 to construct four fused feature inputs. Lodging monitoring models were then developed using RF, BP, XGBoost, and LDA algorithms. Figure 9 summarizes the performance of the 16 model configurations using a heatmap of evaluation metrics. Among them, the SAM2-RF model performed the best, with an accuracy of 97.84%, an F1-score reaching 98.34%, a precision of 98.34%, and a recall of 98.30%. The performance of YOLOv8-RF, YOLO12-RF, and SAM1-RF was comparable, with accuracies of 97.11%, 97.50%, and 97.84%, respectively. In contrast, models based on BP, XGBoost, and LDA exhibited clear performance gaps relative to RF. Notably, BP-based models showed substantial degradation: YOLOv8-BP, YOLO12-BP, SAM1-BP, and SAM2-BP performed close to or worse than the baseline results, and SAM2-BP achieved the highest BP accuracy of only 86.07%.
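The channel-wise concatenation described above can be sketched as follows. This is a minimal sketch under stated assumptions: both feature maps are stored as (height, width, channels) arrays already resampled to the same spatial size, each pixel becomes one sample for the classifier, and the channel counts (8 and 64) and function names are hypothetical.

```python
import numpy as np


def fuse_features(spectral: np.ndarray, deep: np.ndarray) -> np.ndarray:
    """Concatenate two (H, W, C) feature maps along the channel axis."""
    if spectral.shape[:2] != deep.shape[:2]:
        raise ValueError("feature maps must share the same height and width")
    return np.concatenate([spectral, deep], axis=-1)


def to_pixel_samples(features: np.ndarray) -> np.ndarray:
    """Flatten an (H, W, C) feature map into (H*W, C) per-pixel samples."""
    return features.reshape(-1, features.shape[-1])


# Hypothetical sizes: 8 spectral channels and 64 deep channels on a 4 x 4 tile.
rng = np.random.default_rng(0)
spectral = rng.random((4, 4, 8))
deep = rng.random((4, 4, 64))

fused = fuse_features(spectral, deep)   # shape (4, 4, 72)
samples = to_pixel_samples(fused)       # shape (16, 72)
print(fused.shape, samples.shape)
```

Plain concatenation leaves the low-dimensional spectral channels and the high-dimensional deep channels on their original scales, so a classifier sees them as one undifferentiated vector.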

The winter wheat lodging monitoring results based on fused features are presented in Figure 10. The lodging extent predicted by the RF model was almost fully consistent with the labels, with clear boundaries and strong regional integrity. XGBoost ranked second: although it correctly identified the overall lodging extent, minor fragmentation and small holes were observed in patch-level regions. By comparison, BP produced severely fragmented predictions, resulting in markedly inferior performance to RF and XGBoost. In addition, LDA performed better when fused features were constructed with SAM-based deep features than with YOLO-based deep features.

3.3. SHAP-Based Feature Importance Analysis

To quantify the contributions of the multi-spectral bands and the selected vegetation indices within the spectral feature set to winter wheat lodging monitoring, SHAP analysis was performed on models constructed using the spectral feature set and deep features. For the spectral feature set-based models, the RF, BP, XGBoost, and LDA results showed consistent feature importance rankings (Figure 11). Across all four models, the NIR band exhibited the highest mean absolute SHAP value, indicating that it was the most influential spectral predictor for lodging monitoring. The R band ranked second in the XGBoost and LDA models, consistent with its sensitivity to lodging-induced changes in canopy structure. In contrast, the three vegetation indices, EXG, VARI, and MCARI, consistently showed low mean absolute SHAP values across all models, suggesting that they contributed less predictive information than the raw spectral bands.
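The ranking criterion used here, the mean absolute SHAP value per feature, can be illustrated with a short sketch. The SHAP values below are invented for illustration and are not taken from the study; the function name is hypothetical, and the step that produces the SHAP values themselves is assumed to have been run already.

```python
from statistics import fmean


def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean |SHAP| over samples.

    shap_values: list of per-sample rows, one SHAP value per feature.
    """
    importance = {
        name: fmean(abs(row[j]) for row in shap_values)
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SHAP values for three samples (illustrative only).
features = ["NIR", "R", "EXG"]
shap_values = [
    [0.30, -0.10, 0.02],
    [-0.25, 0.15, -0.01],
    [0.35, -0.05, 0.03],
]
ranking = rank_by_mean_abs_shap(shap_values, features)
print(ranking)
```

Taking the absolute value before averaging means that a feature pushing predictions strongly in both directions still ranks as important, which is why the sign pattern in the beeswarm dots and the bar heights convey different information.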

To further examine the response patterns of deep feature channels, SHAP analysis was also conducted for models developed using deep features (Figure 12). For YOLOv8-based models (YOLOv8-RF, YOLOv8-BP, YOLOv8-XGBoost, and YOLOv8-LDA), channels 54, 31, and 60 consistently exhibited high mean absolute SHAP values. Similarly, for YOLO12-based models (YOLO12-RF, YOLO12-BP, YOLO12-XGBoost, and YOLO12-LDA), channels 49, 51, and 91 ranked among the most important across all four algorithms. These results suggested that key deep features extracted by the YOLO architecture were robustly identified and exploited by different machine learning classifiers. By contrast, feature importance rankings for SAM-based deep features varied substantially across models. For SAM1-based models (SAM1-RF, SAM1-BP, SAM1-XGBoost, and SAM1-LDA), the overlap among channels in the top five mean absolute SHAP values was limited, indicating that different classifiers prioritized different channels. For SAM2 features, channels 241, 40, and 254 ranked highly in the RF and XGBoost models, whereas their importance was markedly reduced in the BP and LDA models.
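The cross-classifier consistency discussed above can be checked by intersecting the top-ranked channels of each model. The sketch below assumes that the mean absolute SHAP value of every channel is already available per classifier; the scores are hypothetical, and only the channel indices 54, 31, and 60 echo the YOLOv8 example in the text.

```python
def top_k_channels(mean_abs_shap: dict[int, float], k: int = 5) -> set[int]:
    """Return the indices of the k channels with the largest mean |SHAP|."""
    ranked = sorted(mean_abs_shap, key=mean_abs_shap.get, reverse=True)
    return set(ranked[:k])


def shared_top_channels(per_model: dict[str, dict[int, float]], k: int = 5) -> set[int]:
    """Channels that appear in the top-k set of every classifier."""
    top_sets = [top_k_channels(scores, k) for scores in per_model.values()]
    return set.intersection(*top_sets)


# Hypothetical mean |SHAP| scores per channel for three classifiers.
per_model = {
    "RF": {54: 0.90, 31: 0.70, 60: 0.60, 12: 0.20, 7: 0.10},
    "BP": {54: 0.80, 60: 0.75, 31: 0.50, 7: 0.30, 12: 0.05},
    "LDA": {31: 0.60, 54: 0.55, 60: 0.40, 3: 0.35, 12: 0.10},
}
shared = shared_top_channels(per_model, k=3)
print(sorted(shared))
```

A large intersection corresponds to the YOLO-like case, where classifiers agree on the key channels, and a small or empty intersection corresponds to the SAM1-like case, where each classifier prioritizes different channels.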

Overall, the SHAP results provided an interpretable basis for the proposed two-step strategy by showing that model decisions were mainly governed by key spectral bands and a limited number of deep feature channels, although channel-level importance varied across extractor architectures and classifiers.

4. Discussion

4.1. Feature Importance Evaluation of Deep Features

A comprehensive comparison was conducted across models constructed using the spectral feature set, deep features, and fused features. In the baseline experiments using the spectral feature set, the four winter wheat lodging monitoring models (RF, BP, XGBoost, and LDA) achieved broadly comparable performance, suggesting that the overall accuracy was primarily constrained by the limited representational capacity of the low-dimensional input features rather than by the classifiers. Conversely, models constructed using deep features achieved substantial performance improvements. The best configuration, YOLO12-BP, reached an accuracy of 98.20% and an F1-score of 98.56%, representing increases of 13.27 and 11.12 percentage points, respectively, relative to the baseline BP model. The RF and XGBoost models also improved markedly, with overall gains of approximately 5%~11%. Although LDA benefited from certain deep feature inputs, its performance remained less stable than that of the nonlinear models. These results indicated that deep features provided richer spatial structural representations and higher-level semantics, which enhanced discrimination between lodged and non-lodged areas and mitigated the limitations of conventional spectral bands and vegetation indices in capturing complex lodging patterns.
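The gains quoted in this comparison are differences between two accuracies, that is, percentage points. The short sketch below recomputes them from the accuracies reported in Sections 3.2.1 and 3.2.2; only configurations whose accuracy is stated in the text are included, and the variable and function names are hypothetical.

```python
# Accuracies (%) as reported in the text; unreported configurations are omitted.
baseline = {"RF": 84.83, "BP": 84.93, "XGBoost": 84.2, "LDA": 83.12}
deep = {
    ("YOLOv8", "BP"): 98.19,
    ("YOLO12", "BP"): 98.20,
    ("YOLOv8", "LDA"): 77.45,
    ("SAM1", "RF"): 96.64,
    ("SAM2", "RF"): 97.82,
    ("SAM1", "XGBoost"): 91.77,
    ("SAM2", "XGBoost"): 93.98,
    ("SAM1", "LDA"): 88.63,
    ("SAM2", "LDA"): 89.63,
}


def gain_in_points(extractor: str, classifier: str) -> float:
    """Accuracy gain over the same classifier's spectral baseline, in percentage points."""
    return round(deep[(extractor, classifier)] - baseline[classifier], 2)


gains = {key: gain_in_points(*key) for key in deep}
below_baseline = [key for key, gain in gains.items() if gain < 0]
print(gains[("YOLO12", "BP")], gains[("SAM2", "LDA")], below_baseline)
```

Among the listed configurations, YOLO12-BP gains 13.27 points and SAM2-LDA gains 6.51 points, while YOLOv8-LDA is the only one that falls below its baseline.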

After fusing the spectral feature set with deep features, model performance did not continue to increase. Compared with the deep feature stage, RF-based models maintained stable performance, with accuracy and the F1-score remaining within the 97%~98% range. XGBoost showed modest improvements (approximately 2%~4%) when fused features were used. By contrast, BP performance decreased substantially: its results were comparable to, or even worse than, the baseline models, and far below its best performance achieved with deep features alone. This decline may be attributable to the limited adaptability of BP to the direct concatenation of a low-dimensional spectral feature set with high-dimensional deep features. LDA consistently produced the lowest performance among the four algorithms. In the baseline experiment, the accuracy of LDA was 83.12%. With deep features, however, the SAM1-LDA and SAM2-LDA models improved on this baseline, with a maximum increase of 6.51 percentage points (SAM2-LDA, 89.63%). Overall, the YOLO12-BP model based on deep features achieved the best performance, with an accuracy of 98.20%, an F1-score of 98.56%, a precision of 98.38%, and a recall of 98.56%. The overall performance based on fused features exceeded that of the baseline experiment but remained inferior to the performance achieved using deep features alone.

4.2. The Potential of Combining Deep Features and ML for Winter Wheat Lodging Monitoring

The results indicate that combining deep features with machine learning algorithms is a feasible strategy for winter wheat lodging monitoring. Lodging changes the canopy texture, structural arrangement, and visual appearance of wheat plants. Deep features extracted from general-purpose segmentation models can provide more discriminative representations of lodged and non-lodged regions. This suggests that the rich semantic information learned by foundation models can be transferred to crop lodging monitoring tasks.

Typically, deep learning models are trained or fine-tuned for a specific downstream task to optimize performance in a given domain and scenario. Long et al. trained a MobileNetV3-M model to identify wheat lodging at different flight heights, achieving an average accuracy of 86.8% [47]. Liu et al. developed an Enhanced Wheat Lodging Index and combined it with a two-peak search dynamic thresholding algorithm to extract lodging regions from UAV images, achieving an overall accuracy of 96% [48]. These studies demonstrate that UAV imagery can provide effective information for wheat lodging monitoring, but task-specific deep learning models and index-based thresholding methods usually rely on dedicated training datasets, predefined lodging features, or stable image conditions. In addition, Liu et al., Pan et al., and Zhu et al. proposed Lodge-Unet, an improved Mask-RT-DETR, and WLUSNet, respectively, to address different requirements in wheat lodging monitoring, including boundary refinement, real-time detection, and lightweight segmentation [49,50,51]. These studies achieved strong performance by designing specialized networks for specific tasks and application scenarios. In contrast, our framework does not require the development of separate task-specific deep networks. Instead, it extracts deep features from general-purpose segmentation models and combines them with machine learning classifiers to distinguish lodged and non-lodged areas. This strategy provides a more unified and flexible solution for wheat lodging monitoring while reducing dependence on dedicated network design and task-specific deep model training.

In recent years, general-purpose segmentation foundation models have become an increasingly influential paradigm. Meta has released the Segment Anything Model family, including SAM1 and SAM2. These models are trained on large-scale and diverse datasets, which substantially improves generalization and zero-shot segmentation capability. However, effectively deploying general-purpose segmentation models in agricultural remote sensing and crop monitoring remains challenging. Madadum et al. combined SAM-based mask extraction with photometric and geometric augmentation to improve YOLO-based watermelon leaf disease detection [52]. Wang et al. used SAM and its variants for automatic pear fruit annotation to reduce dataset construction costs [53]. These studies demonstrate the potential of SAM-based models in intelligent agricultural applications, but they also indicate that practical deployment often still requires task-specific data processing, downstream model training, or manual correction. To address these limitations, an alternative is to extract deep features from the backbones of foundation models and use them as high-level semantic inputs to machine learning classifiers or lightweight models. This feature-based strategy leverages the rich representations learned from massive datasets while avoiding the computational overhead and instability associated with end-to-end fine-tuning. In this study, we adopted this two-step strategy to develop a winter wheat lodging segmentation framework that integrates deep features with traditional machine learning algorithms. Specifically, deep features extracted from general segmentation models were used as input representations, and multiple machine learning classifiers were trained to discriminate lodged and non-lodged regions. The experimental results indicate that this framework effectively exploits the representational strength of foundation model features while maintaining practical computational efficiency, providing a useful reference for applying large foundation models to agricultural remote sensing and crop monitoring tasks.

4.3. Limitations and Prospects

This study employed a two-step strategy that combines deep features with machine learning algorithms to achieve accurate winter wheat lodging monitoring using UAV multi-spectral imagery. Nevertheless, several limitations should be addressed in future work. First, fine-tuning general vision foundation models remains challenging due to high computational demands, training instability, and sensitivity to hyperparameter settings. Consequently, this study did not conduct a systematic comparison with fine-tuning-based approaches in terms of predictive performance, computational cost, and inference efficiency, which limits a comprehensive evaluation of the relative advantages of the proposed strategy. Second, deep features extracted from foundation models typically have substantially higher dimensionality than conventional multi-spectral bands and vegetation index features. In this study, no feature selection or dimensionality reduction was applied. This design increases the complexity of feature utilization and may partially explain the suboptimal performance observed for some model configurations. Third, the present experiment was conducted in a specific winter wheat production region. The applicability of the proposed method to other regions with different climatic conditions, soil backgrounds, wheat varieties, planting densities, and field management practices remains to be further evaluated.

Future research will focus on the following aspects. We will conduct a systematic comparison between the proposed feature-fusion strategy and model fine-tuning methods. This evaluation will assess the trade-offs between the two approaches across multiple dimensions, including classification accuracy, training cost, computational consumption, inference efficiency, model flexibility, and interpretability. Additionally, the dataset labeling system will be further refined. Categories such as field ridges will be introduced to verify the discriminative capability of this strategy in complex agricultural scenarios. Simultaneously, this methodology will be extended to lodging monitoring for other crops, such as maize, rice, and cotton. It will also be applied to related tasks like pest identification and Leaf Area Index estimation to comprehensively evaluate its adaptability.

5. Conclusions

This study proposed a novel method for monitoring winter wheat lodging by integrating deep features derived from UAV multi-spectral imagery with machine learning algorithms. Four deep feature extractors were employed to extract high-level semantic features from the spectral feature set. Subsequently, multiple lodging monitoring models based on RF, BP, XGBoost, and LDA were developed using the spectral feature set, deep features, and fused features as respective inputs. The experimental results demonstrate that the inclusion of deep features significantly enhanced the performance of the lodging monitoring models. Specifically, the YOLO12-BP model, which integrates YOLO12 deep features with the BP algorithm, achieved the highest accuracy of 98.20%. Moreover, SHAP analysis was applied to quantify feature contributions and to interpret how deep feature channels influenced model decisions, providing insights into channel-level response patterns. In conclusion, the strategy of combining deep features with machine learning algorithms demonstrated superior performance and feasibility for winter wheat lodging monitoring using UAV multi-spectral imagery. This research provides a valuable reference for the application of deep features from general segmentation models in agricultural remote sensing.

Author Contributions

Conceptualization, Y.G.; methodology, W.Z.; validation, W.Z.; formal analysis, W.Z.; investigation, X.Z.; resources, L.X.; data curation, Y.G.; writing—original draft preparation, W.Z.; writing—review and editing, Y.H.; visualization, W.Z.; supervision, F.H.; project administration, Y.H.F.; funding acquisition, Y.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (42501495), the Hubei Provincial Natural Science Foundation of China (2024AFB321), the Funding for the Opening Project of the National Key Laboratory of Smart Farm Technology and Systems (202401) and the Fundamental Research Funds for the Central Universities (XJ2026000501, CCNU24JCPT020).

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Yang, G.; Li, X.; Xiong, Y.; He, M.; Zhang, L.; Jiang, C.; Yao, X.; Zhu, Y.; Cao, W.; Cheng, T. Annual winter wheat mapping for unveiling spatiotemporal patterns in China with a knowledge-guided approach and multi-source datasets. ISPRS J. Photogramm. Remote Sens. 2025, 225, 163–179.
2. Song, X.B.; Zhang, W.T.; Pan, W.T.; Liu, P.; Wang, C.Y. Real-time monitor heading dates of wheat accessions for breeding in-field based on DDEW-YOLOv7 model and BotSort algorithm. Expert Syst. Appl. 2025, 267, 126140.
3. Kong, X.; Zhao, G.; Sun, X.; Fu, Y. Coordinating lodging incidence and grain yield through wheat genetic diversity. Field Crops Res. 2024, 315, 109468.
4. Yang, X.; Zhang, J.-H.; Yang, S.-S.; Wang, J.-W.; Bai, Y.; Zhang, S. Modelling the crop yield gap with a remote sensing-based process model: A case study of winter wheat in the North China Plain. J. Integr. Agric. 2023, 22, 2993–3005.
5. Zhang, J.; Wu, Q.; Duan, F.H.; Feng, M.Z.; Liu, C.P.; Dai, L.; Wang, X.C.; Xiong, S.P.; Yang, H.; Yang, G.J.; et al. UssNet: A spatial self-awareness algorithm for wheat lodging area detection. Expert Syst. Appl. 2026, 297, 129433.
6. Wang, D.; Zhao, M.; Li, Z.; Wu, X.; Li, N.; Li, D.; Xu, S.; Liu, X. Classification of maize lodging types using UAV-SAR remote sensing data and machine learning methods. Comput. Electron. Agric. 2024, 227, 109637.
7. Li, Z.; Shah, F.; Xiong, L.; Zhang, J.; Wu, W. Unmanned aerial vehicles (UAVs)-based crop lodging susceptibility and seed yield assessment during different growth stages of rapeseed (Brassica napus). Comput. Electron. Agric. 2024, 221, 108980.
8. Zhang, P.; Gu, S.; Wang, Y.; Yang, R.; Yan, Y.; Zhang, S.; Sheng, D.; Cui, T.; Huang, S.; Wang, P. Morphological and mechanical variables associated with lodging in maize. Field Crops Res. 2021, 269, 108178.
9. Cao, W.; Qiao, Z.; Gao, Z.; Lu, S.; Tian, F. Use of unmanned aerial vehicle imagery and a hybrid algorithm combining a watershed algorithm and adaptive threshold segmentation to extract wheat lodging. Phys. Chem. Earth 2021, 123, 103016.
10. Chauhan, S.; Darvishzadeh, R.; van Delden, S.H.; Boschetti, M.; Nelson, A. Mapping of wheat lodging susceptibility with synthetic aperture radar data. Remote Sens. Environ. 2021, 259, 112427.
11. Guo, Y.; Hao, F.; Zhang, X.; He, Y.; Fu, Y.H. Improving maize yield estimation by assimilating UAV-based LAI into WOFOST model. Field Crops Res. 2024, 315, 109477.
12. Dong, S.K.; Feng, J.F. LMFENet: A hybrid local-global and multi-scale feature extraction network for oil spill type classification using sentinel-1 imagery. Expert Syst. Appl. 2026, 300, 130335.
13. Chauhan, S.; Darvishzadeh, R.; Lu, Y.; Boschetti, M.; Nelson, A. Understanding wheat lodging using multi-temporal Sentinel-1 and Sentinel-2 data. Remote Sens. Environ. 2020, 243, 111804.
14. Guan, H.X.; Huang, J.X.; Li, X.C.; Zeng, Y.L.; Su, W.; Ma, Y.Y.; Dong, J.W.; Niu, Q.D.; Wang, W. An improved approach to estimating crop lodging percentage with Sentinel-2 imagery using machine learning. Int. J. Appl. Earth Obs. Geoinf. 2022, 113, 102992.
15. Tang, Z.Q.; Sun, Y.Q.; Wan, G.T.; Zhang, K.F.; Shi, H.T.; Zhao, Y.D.; Chen, S.; Zhang, X.W. Winter Wheat Lodging Area Extraction Using Deep Learning with GaoFen-2 Satellite Imagery. Remote Sens. 2022, 14, 4887.
16. Chen, J.; Fu, Y.; Guo, Y.; Xu, Y.; Zhang, X.; Hao, F. An improved deep learning approach for detection of maize tassels using UAV-based RGB images. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103922.
17. Guo, Y.; Zhou, W.; Fu, Y.H.; Hao, F.; Zhang, X.; Xu, L.; Liu, J.; He, Y. SegNeXt-RCMSCA: An improved SegNeXt network for detecting winter wheat lodging from UAS RGB images. Smart Agric. Technol. 2025, 12, 101230.
18. Shen, Z.H.; Zhang, H.C.; Bian, L.M.; Zhou, L.; Tian, Q.F.; Ge, Y.F. AI-powered UAV remote sensing for drought stress phenotyping: Automated chlorophyll estimation in individual plants using deep learning and instance segmentation. Expert Syst. Appl. 2026, 299, 130141.
19. Liu, Y.; Nie, C.W.; Zhang, Z.; Wang, Z.X.; Ming, B.; Xue, J.; Yang, H.Y.; Xu, H.G.; Meng, L.; Cui, N.B.; et al. Evaluating how lodging affects maize yield estimation based on UAV observations. Front. Plant Sci. 2023, 13, 979103.
20. Bouaziz, M.C.; Ben Abbes, A.; El Koundi, M.; Farah, I.R. ConvLSTM-GCN-transformer: Spatiotemporal graph-attention model for vegetation index map forecasting. Expert Syst. Appl. 2026, 314, 131596.
21. Yang, X.; Gao, S.C.; Sun, Q.; Gu, X.H.; Chen, T.E.; Zhou, J.P.; Pan, Y.C. Classification of Maize Lodging Extents Using Deep Learning Algorithms by UAV-Based RGB and Multispectral Images. Agriculture 2022, 12, 970.
22. Zhao, M.H.; Wang, D.S.; Yan, Q.; Li, Z.L.; Liu, X.G. UAV-Multispectral Based Maize Lodging Stress Assessment with Machine and Deep Learning Methods. Agriculture 2025, 15, 36.
23. Qiao, L.; Zhao, R.; Tang, W.; An, L.; Sun, H.; Li, M.; Wang, N.; Liu, Y.; Liu, G. Estimating maize LAI by exploring deep features of vegetation index map from UAV multispectral images. Field Crops Res. 2022, 289, 108739.
24. Donmez, E. Enhancing classification capacity of CNN models with deep feature selection and fusion: A case study on maize seed classification. Data Knowl. Eng. 2022, 141, 102075.
25. Sethy, P.K.; Barpanda, N.K.; Rath, A.K.; Behera, S.K. Deep feature based rice leaf disease identification using support vector machine. Comput. Electron. Agric. 2020, 175, 105527.
26. Dash, A.; Sethy, P.K.; Behera, S.K. Maize disease identification based on optimized support vector machine using deep feature of DenseNet201. J. Agric. Food Res. 2023, 14, 100824.
27. Wang, H.; Ruan, C.; Zhao, J.L.; Wang, Y.R.; Li, Y.; Dong, Y.Y.; Huang, L.S. Utilizing interpretable machine learning algorithms and multiple features from multi-temporal Sentinel-2 imagery for predicting wheat fusarium head blight. Artif. Intell. Agric. 2026, 16, 224–239.
28. Han, L.; Yang, G.J.; Yang, X.D.; Song, X.Y.; Xu, B.; Li, Z.H.; Wu, J.T.; Yang, H.; Wu, J.W. An explainable XGBoost model improved by SMOTE-ENN technique for maize lodging detection based on multi-source unmanned aerial vehicle images. Comput. Electron. Agric. 2022, 194, 106804.
29. Li, X.; Cai, C.; Zheng, H.; Zhu, H. Recognizing strawberry appearance quality using different combinations of deep feature and classifiers. J. Food Process Eng. 2022, 45, e13982.
30. Balasundaram, S.; Mohan, S.G.; Arthi, K. Image-based analysis for automatic detection and grading of wheat stripe rust disease using effective segmentation and adaptive recurrent MobilenetV2 with LSTM. Expert Syst. Appl. 2026, 299, 130257.
31. Ramos, L.T.; Sappa, A.D. A Decade of You Only Look Once (YOLO) for Object Detection: A Review. IEEE Access 2025, 13, 192747–192794.
32. Jie, L.P.; Zhang, H. ShadowAdapter: Adapting Segment Anything Model with Auto-Prompt for shadow detection. Expert Syst. Appl. 2025, 273, 126809.
33. Kumar, M.; Bhattacharya, B.K.; Pandya, M.R.; Handique, B.K. Machine learning based plot level rice lodging assessment using multi-spectral UAV remote sensing. Comput. Electron. Agric. 2024, 219, 108754.
34. Gitelson, A.A.; Stark, R.; Grits, U.; Rundquist, D.; Kaufman, Y.; Derry, D. Vegetation and soil lines in visible spectral space: A concept and technique for remote estimation of vegetation fraction. Int. J. Remote Sens. 2002, 23, 2537–2562.
35. Pocas, I.; Calera, A.; Campos, I.; Cunha, M. Remote sensing for estimating and mapping single and basal crop coefficientes: A review on spectral vegetation indices approaches. Agric. Water Manag. 2020, 233, 106081.
36. Haboudane, D.; Tremblay, N.; Miller, J.R.; Vigneault, P. Remote estimation of crop chlorophyll content using spectral indices derived from hyperspectral data. IEEE Trans. Geosci. Remote Sens. 2008, 46, 423–437.
37. Qiao, L.; Tang, W.; Gao, D.; Zhao, R.; An, L.; Li, M.; Sun, H.; Song, D. UAV-based chlorophyll content estimation by evaluating vegetation index responses under different crop coverages. Comput. Electron. Agric. 2022, 196, 106775.
38. Hunt, E.R., Jr.; Doraiswamy, P.C.; McMurtrey, J.E.; Daughtry, C.S.T.; Perry, E.M.; Akhmedov, B. A visible band index for remote sensing leaf chlorophyll content at the canopy scale. Int. J. Appl. Earth Obs. Geoinf. 2013, 21, 103–112.
39. Tucker, C.J. Red and photographic infrared linear combinations for monitoring vegetation. Remote Sens. Environ. 1979, 8, 127–150.
40. Xue, J.; Su, B. Significant Remote Sensing Vegetation Indices: A Review of Developments and Applications. J. Sens. 2017, 2017, 1353691.
41. Lu, J.; Cheng, D.; Geng, C.; Zhang, Z.; Xiang, Y.; Hu, T. Combining plant height, canopy coverage and vegetation index from UAV-based RGB images to estimate leaf nitrogen concentration of summer maize. Biosyst. Eng. 2021, 202, 42–54.
42. Jordan, C.F. Derivation of Leaf-Area Index from Quality of Light on the Forest Floor. Ecology 1969, 50, 663–666.
43. Li, F.; Miao, Y.; Feng, G.; Yuan, F.; Yue, S.; Gao, X.; Liu, Y.; Liu, B.; Ustine, S.L.; Chen, X. Improving estimation of summer maize nitrogen status with red edge-based spectral vegetation indices. Field Crops Res. 2014, 157, 111–123.
44. Roujean, J.-L.; Breon, F.-M. Estimating PAR absorbed by vegetation from bidirectional reflectance measurements. Remote Sens. Environ. 1995, 51, 375–384.
45. Sun, Y.; Qin, Q.; Zhang, Y.; Ren, H.; Han, G.; Zhang, Z.; Zhang, T.; Wang, B. A leaf chlorophyll vegetation index with reduced LAI effect based on Sentinel-2 multispectral red-edge information. Comput. Electron. Agric. 2025, 236, 110500.
46. Peng, Y.; Zhu, T.E.; Li, Y.; Dai, C.; Fang, S.; Gong, Y.; Wu, X.; Zhu, R.; Liu, K. Remote prediction of yield based on LAI estimation in oilseed rape under different planting methods and nitrogen fertilizer applications. Agric. For. Meteorol. 2019, 271, 116–125.
47. Long, J.N.; Zhang, Z.; Zhang, Q.; Zhao, X.H.; Igathinathane, C.; Xing, J.F.; Saha, C.K.; Sheng, W.Y.; Li, H.; Zhang, M. Comprehensive wheat lodging detection under different UAV heights using machine/deep learning models. Comput. Electron. Agric. 2025, 231, 109972.
48. Liu, X.Y.; Zhang, J.S.; Li, X.H.; Shen, K.J.; Zhu, S.; Liang, Z.H. Highly efficient wheat lodging extraction algorithm based on two-peak search algorithm. Precis. Agric. 2025, 26, 27.
49. Liu, P.; Cui, Z.H.; Hu, J.P.; Zhang, Q.; Sun, J.J.; Chai, X.Y.; Xu, L.Z. Lodge-Unet: A dual-frequency feature fusion network with boundary-aware optimization for wheat lodging detection via autonomous harvesters. Comput. Electron. Agric. 2025, 238, 110769.
50. Pan, C.C.; Xie, J.Y.; Zhang, G.; Cheng, T.; Han, D.; Fang, Q.Y.; Ju, S.C.; Zhang, D.Y. Real-time UAV-based wheat lodging detection via edge-accelerated improved Mask-RT-DETR. Smart Agric. Technol. 2025, 12, 101509.
51. Zhu, Q.L.; Wang, K.; Liang, D.; Tang, J. WLUSNet: A lightweight wheat lodging segmentation network based on UAV image. Comput. Electron. Agric. 2025, 237, 110587.
52. Madadum, H.; Nasir, F.E.; Haruehansapong, K. Optimizing watermelon leaf disease detection using Sam-based augmentation with YOLO for practical agricultural solutions. Smart Agric. Technol. 2025, 12, 101326.
53. Wang, X.L.; Lv, X.; Xing, X.; Qiao, X.C.; Yang, Y.; Zhang, K.Y.; Liu, Y.J.; Yang, Y.; Yang, Y.C.; Luo, B.W.; et al. Automatic annotation and performance evaluation of orchard pear fruit using SAM-based zero-shot segmentation. Smart Agric. Technol. 2026, 13, 101826.

Figure 1. Study area and dataset overview: (a) location of the Nanpi Eco-Agricultural Experimental Station; (b) field photographs of lodged winter wheat; (c) overall orthomosaic image with corresponding labels; and (d) examples of sample annotations.

Figure 2. Flowchart of the experimental design.

Figure 3. Deep feature extractors of YOLOv8, YOLO12, SAM1, and SAM2: (a) YOLOv8 deep feature extractor, where Conv denotes convolution, C2f stands for CSP Bottleneck with 2 Convolutions, and SPPF represents Spatial Pyramid Pooling Fast; (b) YOLO12 deep feature extractor, where C3k2 denotes the improved cross-stage feature extraction module and A2C2f represents the C2f module enhanced with Area Attention; (c) SAM1 deep feature extractor, where ViT denotes Vision Transformer, the detailed structure is shown in the purple box, and Norm represents normalization; (d) SAM2 deep feature extractor, where Mask Unit Attention and Global Attention are ViT modules that employ two distinct attention mechanisms.

Figure 4. Spearman correlation analysis of the 14 vegetation indices and example plots of vegetation index maps: (a) Spearman rank correlation plot; (b) RGB image; (c–e) grayscale images of VARI, EXG, and MCARI, respectively.

Figure 5. Deep features extracted by different deep feature extractors: (a,b) YOLOv8, (c,d) YOLO12, (e,f) SAM1, and (g,h) SAM2. In each pair, the left panel shows features extracted from the RGB channels, and the right panel shows features extracted from the VI (VARI, EXG, and MCARI) channels. Note: the color gradient represents activation intensity, with yellow indicating higher activation intensity and blue indicating lower activation intensity.

Figure 6. The results of winter wheat lodging monitoring based on the spectral feature set. Note: red pixels represent lodging areas, and black pixels represent the background.

Figure 7. Heatmap of performance metrics for different machine learning models based on deep features.

Figure 8. The results of winter wheat lodging monitoring based on deep features. Note: red pixels represent lodging areas, and black pixels represent the background.

Figure 9. Heatmap of performance metrics for different machine learning models based on fused features.

Figure 10. The results of winter wheat lodging monitoring based on fused features. Note: red pixels represent lodging areas, and black pixels represent the background.

Figure 11. Combined bar beeswarm SHAP summary plot based on multi-spectral images and selected vegetation indices: (a) RF, (b) BP neural network, (c) XGBoost, and (d) LDA. Note: Blue bars represent the mean absolute SHAP values, and the dots show the distribution of the SHAP values for each sample across features.

Figure 12. SHAP analysis based on deep features: (a–d), (e–h), (i–l), and (m–p) correspond to the four types of deep features: YOLOv8, YOLO12, SAM1, and SAM2, respectively. Within each group, the four subplots present, in order, the SHAP analysis results of the RF, BP, XGBoost, and LDA models. Note: Blue bars represent the mean absolute SHAP values, and the dots show the distribution of the SHAP values for each sample across features.

Table 1. Vegetation indices derived from multi-spectral images.

Vegetation Index	Formula	Reference
Visible Atmospherically Resistant Index (VARI)	(G − R)/(G + R − B)	[34]
Excess Green Index (EXG)	2G − R − B	[35]
Modified Chlorophyll Absorption Ratio Index (MCARI)	((REG − R) − 0.2(REG − G)) × (REG/R)	[36]
Normalized Red-Green Difference Vegetation Index (NDIg)	(R − G)/(R + G + 0.01)	[37]
Triangular Greenness Index (TGI)	G − 0.39R − 0.61B	[38]
Normalized Green-Red Difference Index (NGRDI)	(G − R)/(G + R)	[39]
Green-Blue Vegetation Index (GB)	G/B	[40]
Normalized Red Index (NRI)	NIR/R	[41]
Difference Vegetation Index (DVI)	NIR − R	[42]
Normalized Difference Red Edge Index (NDRE)	(NIR − REG)/(NIR + REG)	[43]
Renormalized Difference Vegetation Index (RDVI)	(NIR − R)/(NIR + R)^0.5	[44]
Leaf Chlorophyll Index (LCI)	(NIR − REG)/(NIR + R)	[45]
Atmospherically Resistant Vegetation Index (ARVI)	(NIR − R + γ(R − B))/(NIR + R − γ(R − B))	[41]
Normalized Difference Vegetation Index (NDVI)	(NIR − R)/(NIR + R)	[46]

Note: B, G, R, REG, and NIR represent the blue band, green band, red band, red-edge band, and near-infrared band, respectively.
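The three indices highlighted in this study (VARI, EXG, and MCARI) can be computed per pixel directly from the formulas in Table 1. The following is a minimal sketch, assuming the bands are available as co-registered NumPy reflectance arrays; the function names, the NaN handling for zero denominators, and the sample pixel values are illustrative assumptions and not part of the original workflow.

```python
import numpy as np


def safe_divide(num, den):
    """Element-wise num / den, returning NaN where den is zero (an assumption)."""
    num = np.asarray(num, dtype=np.float64)
    den = np.asarray(den, dtype=np.float64)
    out = np.full(np.broadcast(num, den).shape, np.nan)
    np.divide(num, den, out=out, where=den != 0)
    return out


def vegetation_indices(b, g, r, reg):
    """Compute VARI, EXG, and MCARI following the formulas in Table 1."""
    b, g, r, reg = (np.asarray(x, dtype=np.float64) for x in (b, g, r, reg))
    vari = safe_divide(g - r, g + r - b)                          # (G - R)/(G + R - B)
    exg = 2.0 * g - r - b                                         # 2G - R - B
    mcari = ((reg - r) - 0.2 * (reg - g)) * safe_divide(reg, r)   # MCARI
    return {"VARI": vari, "EXG": exg, "MCARI": mcari}


# Hypothetical reflectance values for a single pixel (not from the paper).
blue = np.array([0.05])
green = np.array([0.10])
red = np.array([0.08])
red_edge = np.array([0.30])

indices = vegetation_indices(blue, green, red, red_edge)
for name, values in indices.items():
    print(name, values)
```

Stacking the three resulting arrays along a channel axis would give a three-channel VI image analogous to the VI inputs described for Figure 5, although the exact stacking and scaling used by the authors is not specified here.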

Table 2. Optimal hyperparameters of the machine learning models.

Model	Hyperparameter	Value Range	Optimal Value	Explanation
RF	n_estimators	50~500	50	The number of decision trees
RF	max_depth	None, 10~50	20	None means trees grow until all leaves are pure
XGBoost	n_estimators	50~500	200	Number of boosting rounds
XGBoost	max_depth	3~10	6	Depth of each tree
XGBoost	learning_rate	0.01~0.2	0.1	Step size shrinkage
XGBoost	subsample	0.5~1.0	0.8	Fraction of samples used in each tree
BP	hidden_layer_sizes	(64,1), (128,64), (256,128)	(128,64)	Controls network depth and width
BP	activation	relu, tanh, logistic	relu	ReLU improves training speed
BP	solver	Adam, sgd	Adam	Adaptive optimizer suitable for small datasets
BP	max_iter	20~200	50	Maximum training iterations
LDA	solver	svd, lsqr, eigen	svd	SVD solver does not require a covariance matrix
LDA	shrinkage	None, 0~1	None	Shrinkage stabilizes covariance estimation
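The optimal values in Table 2 map naturally onto estimator constructors. The sketch below assumes scikit-learn estimators (the hyperparameter names in Table 2 match that library's conventions, but the source does not confirm which implementation was used) and the optional xgboost package. The `build_models` helper and the `random_state` argument are hypothetical additions for reproducibility.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

try:  # xgboost is optional; skip the model if the package is not installed
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None


def build_models(random_state=0):
    """Instantiate the four classifiers with the optimal values from Table 2."""
    models = {
        "RF": RandomForestClassifier(
            n_estimators=50, max_depth=20, random_state=random_state
        ),
        # BP neural network, assumed here to be a multilayer perceptron
        "BP": MLPClassifier(
            hidden_layer_sizes=(128, 64),
            activation="relu",
            solver="adam",
            max_iter=50,
            random_state=random_state,
        ),
        "LDA": LinearDiscriminantAnalysis(solver="svd", shrinkage=None),
    }
    if XGBClassifier is not None:
        models["XGBoost"] = XGBClassifier(
            n_estimators=200,
            max_depth=6,
            learning_rate=0.1,
            subsample=0.8,
            random_state=random_state,
        )
    return models


models = build_models()
for name, model in models.items():
    print(name, type(model).__name__)
```

Each estimator exposes the same `fit` and `predict` interface, so the spectral, deep, and fused feature sets could be passed through an identical training loop.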

Table 3. Lodging monitoring accuracy based on the spectral feature set.

Model	Accuracy	F1-Score	Precision	Recall
RF	84.83%	87.11%	87.35%	86.87%
BP	84.93%	87.44%	86.06%	88.86%
XGBoost	84.29%	86.61%	87.12%	86.10%
LDA	83.12%	85.77%	85.32%	86.22%
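The four metrics in Table 3 follow the standard confusion-matrix definitions, and the F1-score is the harmonic mean of precision and recall. The sketch below assumes lodging is treated as the positive class (an assumption, not stated in the table) and checks that the reported F1-scores are consistent with the reported precision and recall. The confusion counts in the first example are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts, for illustration only.
example = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(example)

# (precision %, recall %, reported F1 %) taken from Table 3.
table3 = {
    "RF": (87.35, 86.87, 87.11),
    "BP": (86.06, 88.86, 87.44),
    "XGBoost": (87.12, 86.10, 86.61),
    "LDA": (85.32, 86.22, 85.77),
}
for model, (precision, recall, reported_f1) in table3.items():
    recomputed = f1_from(precision, recall)
    print(f"{model}: recomputed F1 = {recomputed:.2f}%, reported = {reported_f1:.2f}%")
```

The recomputed values agree with Table 3 to within rounding, which is a quick sanity check when transcribing or comparing such tables.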
